Compare commits
16 Commits
7fcc043a4d
...
pipe-to-lo
| Author | SHA1 | Date | |
|---|---|---|---|
| | 78eb04205f | | |
| | 19cb61ebbc | | |
| | 9ed09c9a9c | | |
| | b31c64f1b9 | | |
| | 54b6e37420 | | |
| | b845a8bb8b | | |
| | bfbf0cea68 | | |
| | 3abe5e83a7 | | |
| | 67c27555f3 | | |
| | 1674b6a844 | | |
| | 311be282b6 | | |
| | 11cbb64097 | | |
| | e2dd21c994 | | |
| | 463342133e | | |
| | de36b9d016 | | |
| | 3f1d966919 | | |
180
.claude/agents/auditor.md
Normal file
@@ -0,0 +1,180 @@
---
name: auditor
description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
tools: Read, Grep, Glob
mcpServers:
  - lab-monitoring
---

You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.

## Input

You may receive:
- A host or list of hosts to investigate
- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
- Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")

## Audit Log Structure

Logs are shipped to Loki via promtail. Audit events use these labels:
- `host` - hostname
- `systemd_unit` - typically `auditd.service` for audit logs
- `job` - typically `systemd-journal`

Audit log entries contain structured data:
- `EXECVE` - command execution with full arguments
- `USER_LOGIN` / `USER_LOGOUT` - session start/end
- `USER_CMD` - sudo command execution
- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
- `SERVICE_START` / `SERVICE_STOP` - systemd service events

## Investigation Techniques

### 1. SSH Session Activity

Find SSH logins and session activity:
```logql
{host="<hostname>", systemd_unit="sshd.service"}
```

Look for:
- Accepted/Failed authentication
- Session opened/closed
- Unusual source IPs or users

### 2. Command Execution

Query executed commands (filter out noise):
```logql
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
```

Further filtering:
- Exclude systemd noise: `!= "systemd" != "/nix/store"`
- Focus on specific commands: `|= "rm" |= "-rf"`
- Focus on specific user: `|= "uid=1000"`
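
The filters above are plain string concatenation, so they can be composed programmatically. The following is a minimal sketch, not part of the agent definition: `build_execve_query` is a hypothetical helper name, and it assumes filter terms contain no double quotes.

```python
def build_execve_query(host, include=(), exclude=()):
    """Compose a LogQL query for EXECVE audit records on one host."""
    # Noise filters copied from the base query in this section.
    base_excludes = ["PATH item", "PROCTITLE", "SYSCALL", "BPF"]
    parts = ['{host="%s"}' % host, '|= "EXECVE"']
    # Exclusions first, then the positive matches that narrow the result.
    parts += ['!= "%s"' % term for term in base_excludes + list(exclude)]
    parts += ['|= "%s"' % term for term in include]
    return " ".join(parts)


query = build_execve_query(
    "testvm01",
    include=["uid=1000"],
    exclude=["systemd", "/nix/store"],
)
print(query)
```

Adding one `include` or `exclude` term per iteration matches the "add filters progressively" approach: each extra term only narrows the previous result set.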

### 3. Sudo Activity

Check for privilege escalation:
```logql
{host="<hostname>"} |= "sudo" |= "COMMAND"
```

Or via audit:
```logql
{host="<hostname>"} |= "USER_CMD"
```

### 4. Service Manipulation

Check if services were manually stopped/started:
```logql
{host="<hostname>"} |= "EXECVE" |= "systemctl"
```

### 5. File Operations

Look for file modifications (if auditd rules are configured):
```logql
{host="<hostname>"} |= "EXECVE" |= "vim"
{host="<hostname>"} |= "EXECVE" |= "nano"
{host="<hostname>"} |= "EXECVE" |= "rm"
```

## Query Guidelines

**Start narrow, expand if needed:**
- Begin with `limit: 20-30`
- Use tight time windows: `start: "15m"` or `start: "30m"`
- Add filters progressively

**Avoid:**
- Querying all audit logs without EXECVE filter (extremely verbose)
- Large time ranges without specific filters
- Limits over 50 without tight filters

**Time-bounded queries:**
When investigating around a specific event:
```logql
{host="<hostname>"} |= "EXECVE" != "systemd"
```
With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
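
A short sketch of how such a window could be derived from an event timestamp. The helper name `window_around` and the default offsets (2 minutes before, 3 minutes after) are assumptions chosen to reproduce the example values above for an event at 14:32.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def window_around(event_time, before_minutes=2, after_minutes=3):
    """Return (start, end) UTC timestamps bracketing an event."""
    event = datetime.strptime(event_time, FMT).replace(tzinfo=timezone.utc)
    start = event - timedelta(minutes=before_minutes)
    end = event + timedelta(minutes=after_minutes)
    return start.strftime(FMT), end.strftime(FMT)


start, end = window_around("2026-02-08T14:32:00Z")
print(start, end)
```

Keeping the window to a few minutes on either side of the event keeps the EXECVE volume small enough to stay under the suggested limits.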

## Suspicious Patterns to Watch For

1. **Unusual login times** - Activity outside normal hours
2. **Failed authentication** - Brute force attempts
3. **Privilege escalation** - Unexpected sudo usage
4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
6. **Persistence mechanisms** - Cron modifications, systemd service creation
7. **Log tampering** - Commands targeting log files
8. **Lateral movement** - SSH to other internal hosts
9. **Service manipulation** - Stopping security services, disabling firewalls
10. **Cleanup activity** - Deleting bash history, clearing logs

## Output Format

### For Standalone Security Reviews

```
## Activity Summary

**Host:** <hostname>
**Time Period:** <start> to <end>
**Sessions Found:** <count>

## User Sessions

### Session 1: <user> from <source_ip>
- **Login:** HH:MM:SSZ
- **Logout:** HH:MM:SSZ (or ongoing)
- **Commands executed:**
  - HH:MM:SSZ - <command>
  - HH:MM:SSZ - <command>

## Suspicious Activity

[If any patterns from the watch list were detected]
- **Finding:** <description>
- **Evidence:** <log entries>
- **Risk Level:** Low / Medium / High

## Summary

[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
```

### When Called by Another Agent

Provide a focused response addressing the specific question:

```
## Audit Findings

**Query:** <what was asked>
**Time Window:** <investigated period>

## Relevant Activity

[Chronological list of relevant events]
- HH:MM:SSZ - <event>
- HH:MM:SSZ - <event>

## Assessment

[Direct answer to the question with supporting evidence]
```

## Guidelines

- Reconstruct timelines chronologically
- Correlate events (login → commands → logout)
- Note gaps or missing data
- Distinguish between automated (systemd, cron) and interactive activity
- Consider the host's role and tier when assessing severity
- When called by another agent, focus on answering their specific question
- Don't speculate without evidence - state what the logs show and don't show
@@ -4,6 +4,7 @@ description: Investigates a single system alarm by querying Prometheus metrics a
tools: Read, Grep, Glob
mcpServers:
  - lab-monitoring
  - git-explorer
---

You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
@@ -33,18 +34,52 @@ Gather evidence about the current system state:
- Use `list_targets` to verify the host/service is being scraped successfully
- Look for correlated metrics that might explain the issue

### 3. Check Logs
### 3. Check Service Logs

Search for relevant log entries:
- Use `query_logs` to search Loki for the affected host/service
- Common patterns:
  - `{host="<hostname>", systemd_unit="<service>.service"}`
  - `{host="<hostname>"} |= "error"`
  - `{systemd_unit="<service>.service"}` across all hosts
- Look for errors, warnings, or unusual patterns around the alert time
- Use `start: "1h"` or longer for context
Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.

### 4. Check Configuration (if relevant)
**Query strategies (start narrow, expand if needed):**
- Start with `limit: 20-30`, increase only if needed
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`

**Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- All errors on host: `{host="<hostname>"} |= "error"`
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`

**Avoid:**
- Using `start: "1h"` with no filters on busy hosts
- Limits over 50 without specific filters

### 4. Investigate User Activity

For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.

**Always call the auditor when:**
- A service stopped unexpectedly (may have been manually stopped)
- A process was killed or a config was changed
- You need to know who was logged in around the time of an incident
- You need to understand what commands led to the current state
- The cause isn't obvious from service logs alone

**Do NOT try to query audit logs yourself.** The auditor is specialized for:
- Parsing EXECVE records and reconstructing command lines
- Correlating SSH sessions with commands executed
- Identifying suspicious patterns
- Filtering out systemd/nix-store noise

**Example prompt for auditor:**
```
Investigate user activity on <hostname> between <start_time> and <end_time>.
Context: The prometheus-node-exporter service stopped at 14:32.
Determine if it was manually stopped and by whom.
```

Incorporate the auditor's findings into your timeline and root cause analysis.

### 5. Check Configuration (if relevant)

If the alert relates to a NixOS-managed service:
- Check host configuration in `/hosts/<hostname>/`
@@ -52,9 +87,61 @@ If the alert relates to a NixOS-managed service:
- Look for thresholds, resource limits, or misconfigurations
- Check `homelab.host` options for tier/priority/role metadata

### 5. Consider Common Causes
### 6. Check for Configuration Drift

Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
- Hosts running outdated configurations
- Recent changes that might have caused the issue
- Whether a fix has already been committed but not deployed

**Step 1: Get the deployed revision from Prometheus**
```promql
nixos_flake_info{hostname="<hostname>"}
```
The `current_rev` label contains the deployed git commit hash.

**Step 2: Check if the host is behind master**
```
resolve_ref("master")           # Get current master commit
is_ancestor(deployed, master)   # Check if host is behind
```

**Step 3: See what commits are missing**
```
commits_between(deployed, master)  # List commits not yet deployed
```

**Step 4: Check which files changed**
```
get_diff_files(deployed, master)  # Files modified since deployment
```
Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.

**Step 5: View configuration at the deployed revision**
```
get_file_at_commit(deployed, "services/<service>/default.nix")
```
Compare against the current file to understand differences.

**Step 6: Find when something changed**
```
search_commits("<service-name>")  # Find commits mentioning the service
get_commit_info(<hash>)           # Get full details of a specific change
```

**Example workflow for a service-related alert:**
1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
2. `resolve_ref("master")` → `4633421`
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
4. `commits_between("8959829", "4633421")` → 7 commits missing
5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
6. If a fix was committed after the deployed rev, recommend deployment
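
The ancestor check in steps 2 and 3 can be illustrated with a small sketch. This is not the git-explorer implementation: the `parents` map is a hypothetical in-memory stand-in for the commit graph, and the intermediate commit `aaaaaaa` is invented for the example.

```python
def is_ancestor(parents, ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def drift_status(parents, deployed, master):
    """Classify a host's deployed revision relative to master."""
    if deployed == master:
        return "up to date"
    if is_ancestor(parents, deployed, master):
        return "behind master"
    return "diverged or unknown revision"


# Toy history: 8959829 <- aaaaaaa <- 4633421 (master)
parents = {"4633421": ["aaaaaaa"], "aaaaaaa": ["8959829"], "8959829": []}
print(drift_status(parents, "8959829", "4633421"))
```

A "diverged or unknown revision" result would suggest the host was built from a feature branch or from a commit that is not in the local clone, which is worth noting in the report rather than recommending a deployment.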

### 7. Consider Common Causes

For infrastructure alerts, common causes include:
- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
- **Configuration drift**: Host running outdated config, fix already in master
- **Disk space**: Nix store growth, logs, temp files
- **Memory pressure**: Service memory leaks, insufficient limits
- **CPU**: Runaway processes, build jobs
@@ -62,6 +149,8 @@ For infrastructure alerts, common causes include:
- **Service restarts**: Failed upgrades, configuration errors
- **Scrape failures**: Service down, firewall issues, port changes

**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.

## Output Format

Provide a concise report with one of two outcomes:
@@ -118,4 +207,5 @@ Provide a concise report with one of two outcomes:
- If the alert is a false positive or expected behavior, explain why
- Consider the host's tier (test vs prod) when assessing severity
- Build a timeline from log timestamps and metrics to show the sequence of events
- Include precursor events (logins, config changes, restarts) that led to the issue
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
@@ -33,6 +33,13 @@
        "--nats-url", "nats://nats1.home.2rjus.net:4222",
        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
      ]
    },
    "git-explorer": {
      "command": "nix",
      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
      "env": {
        "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
      }
    }
  }
}
@@ -35,6 +35,10 @@ nix build .#create-host

Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.

### SSH Commands

Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.

### Testing Feature Branches on Hosts

All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
@@ -66,9 +66,9 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
   - Vault integration for idm_admin password
   - LDAPS on port 636

2. **Configure declarative provisioning** ✅
   - Groups: `admins`, `users`, `ssh-users`
   - User: `torjus` (member of all groups)
2. **Configure provisioning** ✅
   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
   - POSIX attributes enabled (UID/GID range 65,536-69,999)

3. **Test NAS integration** (in progress)
@@ -80,14 +80,16 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
   - Grafana
   - Other services as needed

5. **Create client module** in `system/` for PAM/NSS
   - Enable on all hosts that need central auth
   - Configure trusted CA
5. **Create client module** in `system/` for PAM/NSS ✅
   - Module: `system/kanidm-client.nix`
   - `homelab.kanidm.enable = true` enables PAM/NSS
   - Short usernames (not SPN format)
   - Home directory symlinks via `home_alias`
   - Enabled on test tier: testvm01, testvm02, testvm03

6. **Documentation**
   - User management procedures
   - Adding new OAuth2 clients
   - Troubleshooting PAM/NSS issues
6. **Documentation** ✅
   - `docs/user-management.md` - CLI workflows, troubleshooting
   - User/group creation procedures verified working

## Progress

@@ -106,14 +108,37 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
- Prometheus monitoring scrape target configured

**Provisioned entities:**
- Groups: `admins`, `users`, `ssh-users`
- User: `torjus` (member of all groups, POSIX enabled with GID 65536)
- Groups: `admins`, `users`, `ssh-users` (declarative)
- Users managed via CLI (imperative)

**Verified working:**
- WebUI login with idm_admin
- LDAP bind and search with POSIX-enabled user
- LDAPS with valid internal CA certificate

### Completed (2026-02-08) - PAM/NSS Client

**Client module deployed (`system/kanidm-client.nix`):**
- `homelab.kanidm.enable = true` enables PAM/NSS integration
- Connects to auth.home.2rjus.net
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based dir)
- Login restricted to `ssh-users` group

**Enabled on test tier:**
- testvm01, testvm02, testvm03

**Verified working:**
- User/group resolution via `getent`
- SSH login with Kanidm unix passwords
- Home directory creation with symlinks
- Imperative user/group creation via CLI

**Documentation:**
- `docs/user-management.md` with full CLI workflows
- Password requirements (min 10 chars)
- Troubleshooting guide (nscd, cache invalidation)

### UID/GID Range (Resolved)

**Range: 65,536 - 69,999** (manually allocated)
@@ -128,10 +153,9 @@ Rationale:

### Next Steps

1. Deploy to monitoring01 to enable Prometheus scraping
1. Enable PAM/NSS on production hosts (after test tier validation)
2. Configure TrueNAS LDAP client for NAS integration testing
3. Add OAuth2 clients (Grafana first)
4. Create PAM/NSS client module for other hosts

## References

116
docs/plans/memory-issues-follow-up.md
Normal file
@@ -0,0 +1,116 @@
# Memory Issues Follow-up

Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.

## Background

On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.

Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.

## Fix Applied

**Commit:** `1674b6a` - system: enable zram swap for all hosts

**Merged:** 2026-02-08 ~12:15 UTC

**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.

## Timeline

| Time (UTC) | Event |
|------------|-------|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
| 05:01:47 | `nixos_upgrade_failed` alert fired |
| 12:15 | zram commit merged to master |
| 12:19 | ns2 rebooted with zram enabled |
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |

## Hosts Affected

All 2GB VMs that run nixos-upgrade:
- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01

## Metrics to Monitor

Check these in Grafana or via PromQL to verify the fix:

### Swap availability (should be ~2GB after upgrade)
```promql
node_memory_SwapTotal_bytes / 1024 / 1024
```

### Swap usage during upgrades
```promql
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
```
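
Both swap queries divide by 1024 twice, so their results are in MiB rather than GB. A minimal sketch of the same arithmetic follows; the byte values are hypothetical and only illustrate the unit conversion.

```python
MIB = 1024 * 1024  # dividing bytes by 1024 twice yields MiB


def swap_used_mib(swap_total_bytes, swap_free_bytes):
    """Mirror of the swap-usage expression: (total - free) in MiB."""
    return (swap_total_bytes - swap_free_bytes) / MIB


total = 2 * 1024**3   # hypothetical host reporting 2 GiB of swap
free = 1536 * MIB     # hypothetical: 1.5 GiB of that still free

print(total / MIB)                 # swap availability in MiB
print(swap_used_mib(total, free))  # swap in use in MiB
```

A host with the expected ~2GB of swap should therefore show a value near 2048 for the availability query.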

### Zswap compressed bytes (tracks zswap only; does not reflect zram usage)
```promql
node_memory_Zswap_bytes / 1024 / 1024
```

### Upgrade failures (should be 0)
```promql
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
```

### Memory available during upgrades
```promql
node_memory_MemAvailable_bytes / 1024 / 1024
```

## Verification Steps

After a few days (allow auto-upgrades to run on all hosts):

1. Check all hosts have swap enabled:
   ```promql
   node_memory_SwapTotal_bytes > 0
   ```

2. Check for any upgrade failures since the fix:
   ```promql
   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
   ```

3. Review if any hosts used swap during upgrades (check historical graphs)

## Success Criteria

- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs

## Fallback Options

If zram is insufficient:

1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
3. **Use remote builds** - Configure `nix.buildMachines` to offload builds (evaluation still runs locally, so this alone may not prevent an evaluation OOM)
4. **Reduce flake size** - Split configurations to reduce evaluation memory

### Memory Ballooning

Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.

Configuration in `terraform/vms.tf`:
```hcl
memory  = 4096  # maximum memory
balloon = 2048  # minimum memory (shrinks to this when idle)
```

Pros:
- VMs get memory on-demand without reboots
- Better host memory utilization
- Solves upgrade OOM without permanently allocating 4GB

Cons:
- Requires the virtio balloon driver in the guest
- Guest can experience memory pressure if host is overcommitted

Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
224
docs/plans/security-hardening.md
Normal file
@@ -0,0 +1,224 @@
# Security Hardening Plan

## Overview

Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.

## Current State

- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
- Promtail ships logs over HTTP to Loki
- Loki has no authentication (`auth_enabled = false`)
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
- Audit logging exists (`common/ssh-audit.nix`) but not applied globally
- Alert rules focus on availability, no security event detection

## Priority Matrix

| Issue | Severity | Effort | Priority |
|-------|----------|--------|----------|
| SSH password auth | High | Low | **P1** |
| Firewall disabled | High | Medium | **P1** |
| Promtail HTTP (no TLS) | High | Medium | **P2** |
| No security alerting | Medium | Low | **P2** |
| Audit logging not global | Low | Low | **P2** |
| Loki no auth | Medium | Medium | **P3** |
| Secret-ID TTL | Medium | Medium | **P3** |
| Vault skipTlsVerify | Medium | Low | **P3** |

## Phase 1: Quick Wins (P1)

### 1.1 SSH Hardening

Edit `system/sshd.nix`:

```nix
services.openssh = {
  enable = true;
  settings = {
    PermitRootLogin = "prohibit-password";  # Key-only root login
    PasswordAuthentication = false;
    KbdInteractiveAuthentication = false;
  };
};
```

**Prerequisite:** Verify all hosts have SSH keys deployed for root.

### 1.2 Enable Firewall

Create `system/firewall.nix` with default deny policy:

```nix
{ ... }: {
  networking.firewall.enable = true;

  # Use openssh's built-in firewall integration
  services.openssh.openFirewall = true;
}
```

**Useful firewall options:**

| Option | Description |
|--------|-------------|
| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
| `networking.firewall.extraInputRules` | Custom nftables rules (for complex filtering) |

**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.

#### Per-Interface Rules (http-proxy WireGuard)

The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:

```nix
# Example: http-proxy with different rules per interface
networking.firewall = {
  enable = true;

  # Default: only SSH (via openFirewall)
  allowedTCPPorts = [ ];

  # LAN interface: allow HTTP/HTTPS
  interfaces.ens18 = {
    allowedTCPPorts = [ 80 443 ];
  };

  # WireGuard interface: restrict to specific services or trust fully
  interfaces.wg0 = {
    allowedTCPPorts = [ 80 443 ];
    # Or use trustedInterfaces = [ "wg0" ] if fully trusted
  };
};
```

**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.

Then per-host, open required ports:

| Host | Additional Ports |
|------|------------------|
| ns1/ns2 | 53 (TCP/UDP) |
| vault01 | 8200 |
| monitoring01 | 3100, 9090, 3000, 9093 |
| http-proxy | 80, 443 |
| nats1 | 4222 |
| ha1 | 1883, 8123 |
| jelly01 | 8096 |
| nix-cache01 | 5000 |

## Phase 2: Logging & Detection (P2)

### 2.1 Enable TLS for Promtail → Loki

Update `system/monitoring/logs.nix`:

```nix
clients = [{
  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
  tls_config = {
    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
  };
}];
```

Requires:
- Configure Loki with TLS certificate (use internal ACME)
- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)

### 2.2 Security Alert Rules

Add to `services/monitoring/rules.yml`:

```yaml
- name: security_rules
  rules:
    - alert: ssh_auth_failures
      expr: increase(node_logind_sessions_total[5m]) > 20
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Unusual login activity on {{ $labels.instance }}"

    - alert: vault_secret_fetch_failure
      expr: increase(vault_secret_failures[5m]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Vault secret fetch failures on {{ $labels.instance }}"
```

Also add Loki-based alerts for:
- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
- sudo usage: `{job="systemd-journal"} |= "sudo"`

### 2.3 Global Audit Logging

Add `./common/ssh-audit.nix` import to `system/default.nix`:

```nix
imports = [
  # ... existing imports
  ../common/ssh-audit.nix
];
```

## Phase 3: Defense in Depth (P3)

### 3.1 Loki Authentication

Options:
1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy

Recommendation: Option 1 (reverse proxy) is simplest for homelab.

### 3.2 AppRole Secret Rotation

Update `terraform/vault/approle.tf`:

```hcl
secret_id_ttl = 2592000  # 30 days
```
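
The TTL is expressed in seconds. As a quick check of the arithmetic behind the 30-day value, here is a minimal sketch; `ttl_seconds` is a hypothetical helper, not part of the Terraform configuration.

```python
from datetime import timedelta


def ttl_seconds(days):
    """Convert a TTL in days to the integer number of seconds."""
    return int(timedelta(days=days).total_seconds())


# 30 days * 24 h * 3600 s
print(ttl_seconds(30))
```

The same helper gives the value for other rotation intervals, for example `ttl_seconds(7)` for a weekly rotation.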

Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.

### 3.3 Enable Vault TLS Verification

Change default in `system/vault-secrets.nix`:

```nix
skipTlsVerify = mkOption {
  type = types.bool;
  default = false;  # Changed from true
};
```

**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.

## Implementation Order

1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
2. **Validate SSH access** - Ensure key-based login works before disabling passwords
3. **Document firewall ports** - Create reference of ports per host before enabling
4. **Phase prod rollout** - Deploy to prod hosts one at a time, verify each

## Open Questions

- [ ] Do all hosts have SSH keys configured for root access?
- [ ] Should firewall rules be per-host or use a central definition with roles?
- [ ] Should Loki authentication use the existing Kanidm setup?

**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.

## Notes

- Firewall changes are the highest risk - test thoroughly on test-tier
- SSH hardening must not lock out access - verify keys first
- Consider creating a "break glass" procedure for emergency access if keys fail
267
docs/user-management.md
Normal file
@@ -0,0 +1,267 @@
# User Management with Kanidm

Central authentication for the homelab using Kanidm.

## Overview

- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
- **WebUI**: https://auth.home.2rjus.net
- **LDAPS**: port 636

## CLI Setup

The `kanidm` CLI is available in the devshell:

```bash
nix develop

# Login as idm_admin
kanidm login --name idm_admin --url https://auth.home.2rjus.net
```

## User Management

POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
all attributes (including UNIX password) in one workflow.

### Creating a POSIX User

```bash
# Create the person
kanidm person create <username> "<Display Name>"

# Add to groups
kanidm group add-members ssh-users <username>

# Enable POSIX (UID is auto-assigned)
kanidm person posix set <username>

# Set UNIX password (required for SSH login, min 10 characters)
kanidm person posix set-password <username>

# Optionally set login shell
kanidm person posix set <username> --shell /bin/zsh
```

### Example: Full User Creation

```bash
kanidm person create testuser "Test User"
kanidm group add-members ssh-users testuser
kanidm person posix set testuser
kanidm person posix set-password testuser
kanidm person get testuser
```

After creation, verify on a client host:
```bash
getent passwd testuser
ssh testuser@testvm01.home.2rjus.net
```
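
The creation steps could be wrapped in a small helper so they always run in the same order and stop at the first failure. This is only a sketch: `create_posix_user` and the `KANIDM_BIN` override are hypothetical names, and the subcommands are exactly the ones listed above. Setting `KANIDM_BIN=echo` prints the commands instead of running them, which works as a dry run.

```bash
# Hypothetical wrapper around the documented kanidm subcommands.
KANIDM_BIN="${KANIDM_BIN:-kanidm}"

create_posix_user() {
  local username="$1" display_name="$2" group="${3:-ssh-users}"
  # Each step runs only if the previous one succeeded.
  # set-password prompts interactively for the new UNIX password.
  "$KANIDM_BIN" person create "$username" "$display_name" &&
    "$KANIDM_BIN" group add-members "$group" "$username" &&
    "$KANIDM_BIN" person posix set "$username" &&
    "$KANIDM_BIN" person posix set-password "$username" &&
    "$KANIDM_BIN" person get "$username"
}
```

Usage would look like `create_posix_user testuser "Test User"`, followed by the `getent` and `ssh` checks on a client host.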

### Viewing User Details

```bash
kanidm person get <username>
```

### Removing a User

```bash
kanidm person delete <username>
```

## Group Management

Groups for POSIX access are also managed via CLI.

### Creating a POSIX Group

```bash
# Create the group
kanidm group create <group-name>

# Enable POSIX with a specific GID
kanidm group posix set <group-name> --gidnumber <gid>
```

### Adding Members

```bash
kanidm group add-members <group-name> <username>
```

### Viewing Group Details

```bash
kanidm group get <group-name>
kanidm group list-members <group-name>
```

### Example: Full Group Creation

```bash
kanidm group create testgroup
kanidm group posix set testgroup --gidnumber 68010
kanidm group add-members testgroup testuser
kanidm group get testgroup
```

After creation, verify on a client host:
```bash
getent group testgroup
```

### Current Groups

| Group | GID | Purpose |
|-------|-----|---------|
| ssh-users | 68000 | SSH login access |
| admins | 68001 | Administrative access |
| users | 68002 | General users |

### UID/GID Allocation

Kanidm auto-assigns UIDs/GIDs from its configured range. Manually assigned GIDs use a separate, reserved range:

| Range | Purpose |
|-------|---------|
| 65,536+ | Users (auto-assigned) |
| 68,000 - 68,999 | Groups (manually assigned) |
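
When adding a group, the next GID can be picked from the manual range mechanically. A minimal sketch, assuming the 68000-68999 range from the table above; the function names are illustrative and nothing here talks to Kanidm.

```bash
# Manual group GID range from the allocation table above.
GID_MIN=68000
GID_MAX=68999

# Return 0 if the given GID is numeric and lies inside the manual group range.
gid_in_group_range() {
  local gid="$1"
  [[ "$gid" =~ ^[0-9]+$ ]] && (( gid >= GID_MIN && gid <= GID_MAX ))
}

# Print the lowest GID in the range that is not among the used GIDs
# passed as arguments; return 1 if the range is exhausted.
next_free_gid() {
  local candidate used taken
  for (( candidate = GID_MIN; candidate <= GID_MAX; candidate++ )); do
    taken=0
    for used in "$@"; do
      if [[ "$used" == "$candidate" ]]; then
        taken=1
        break
      fi
    done
    if (( taken == 0 )); then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}
```

With the groups currently listed, `next_free_gid 68000 68001 68002` prints `68003`.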

## PAM/NSS Client Configuration

Enable central authentication on a host:

```nix
homelab.kanidm.enable = true;
```

This configures:
- `services.kanidm.enablePam = true`
- Client connection to auth.home.2rjus.net
- Login authorization for `ssh-users` group
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based directory)

### Enabled Hosts

- testvm01, testvm02, testvm03 (test tier)

### Options

```nix
homelab.kanidm = {
  enable = true;
  server = "https://auth.home.2rjus.net"; # default
  allowedLoginGroups = [ "ssh-users" ]; # default
};
```

### Home Directories

Home directories use UUID-based paths for stability (so renaming a user doesn't
require moving their home directory). Symlinks provide convenient access:

```
/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
```

The symlinks are created by `kanidm-unixd-tasks` on first login.
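
To confirm that the alias exists after a first login, the symlink can be inspected directly. A small sketch; `home_alias_target` is a hypothetical helper name and it relies only on coreutils `readlink`.

```bash
# Print the directory a home alias points to.
# Fails if the path is not a symlink or its target is not a directory.
home_alias_target() {
  local alias_path="$1" target
  [[ -L "$alias_path" ]] || return 1
  target="$(readlink -f "$alias_path")" || return 1
  [[ -d "$target" ]] || return 1
  echo "$target"
}
```

For example, `home_alias_target /home/torjus` should print the UUID-based directory once the user has logged in at least once.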

## Testing

### Verify NSS Resolution

```bash
# Check user resolution
getent passwd <username>

# Check group resolution
getent group <group-name>
```

### Test SSH Login

```bash
ssh <username>@<hostname>.home.2rjus.net
```

## Troubleshooting

### "PAM user mismatch" error

SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).

**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).

Check current format:
```bash
getent passwd torjus
# Should show: torjus:x:65536:...
# NOT: torjus@home.2rjus.net:x:65536:...
```
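
The same check can be scripted by looking at the first field of the `getent` output. A sketch with a hypothetical function name; it only parses a passwd-style line and makes no calls to Kanidm.

```bash
# Return 0 if a passwd-style line uses a short name,
# 1 if the name is in SPN format (contains "@"),
# 2 if the line is empty (the user did not resolve at all).
passwd_name_is_short() {
  local line="$1" name
  [[ -n "$line" ]] || return 2
  name="${line%%:*}"   # everything before the first colon
  [[ "$name" != *@* ]]
}
```

A typical call would be `passwd_name_is_short "$(getent passwd torjus)"`, where a non-zero status points at either the SPN format problem or a user that does not resolve.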

### User resolves but SSH fails immediately

The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:

```bash
# Check if group has POSIX
getent group ssh-users

# If empty, enable POSIX on the server
kanidm group posix set ssh-users --gidnumber 68000
```

### User doesn't resolve via getent

1. Check kanidm-unixd service is running:
   ```bash
   systemctl status kanidm-unixd
   ```

2. Check unixd can reach server:
   ```bash
   kanidm-unix status
   # Should show: system: online, Kanidm: online
   ```

3. Check client can reach server:
   ```bash
   curl -s https://auth.home.2rjus.net/status
   ```

4. Check user has POSIX enabled on server:
   ```bash
   kanidm person get <username>
   ```

5. Restart nscd to clear stale cache:
   ```bash
   systemctl restart nscd
   ```

6. Invalidate kanidm cache:
   ```bash
   kanidm-unix cache-invalidate
   ```
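
The read-only checks (steps 1-4) can be run in one pass with a small wrapper that reports each result and keeps going, so one failing step does not hide the others. This is a sketch under assumptions: `run_step` and `diagnose_user` are hypothetical names, the wrapper judges each step by exit status only (the `kanidm-unix status` output should still be read by eye), and `curl --fail` is added so an HTTP error status counts as a failure.

```bash
# Run a command quietly and print "OK" or "FAIL" followed by a label.
run_step() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Steps 1-4 from the list above; restarts and cache invalidation stay manual.
diagnose_user() {
  local username="$1" rc=0
  run_step "kanidm-unixd running" systemctl is-active --quiet kanidm-unixd || rc=1
  run_step "unixd status" kanidm-unix status || rc=1
  run_step "server reachable" curl --silent --fail https://auth.home.2rjus.net/status || rc=1
  run_step "person lookup" kanidm person get "$username" || rc=1
  return "$rc"
}
```

`diagnose_user <username>` needs to run where both `kanidm-unix` and a logged-in `kanidm` CLI are available; otherwise the individual steps can be run on the host they apply to.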

### Changes not taking effect after deployment

NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying
kanidm-unixd config changes, you may need to restart both services:

```bash
systemctl restart kanidm-unixd
systemctl restart nscd
```

### Test PAM authentication directly

Use the kanidm-unix CLI to test PAM auth without SSH:

```bash
kanidm-unix auth-test --name <username>
```

@@ -207,6 +207,7 @@
pkgs.ansible
pkgs.opentofu
pkgs.openbao
pkgs.kanidm_1_8
(pkgs.callPackage ./scripts/create-host { })
homelab-deploy.packages.${pkgs.system}.default
];

@@ -64,9 +64,5 @@
vault.enable = true;
homelab.deploy.enable = true;

zramSwap = {
  enable = true;
};

system.stateVersion = "23.11"; # Did you read the comment?
}

@@ -4,6 +4,5 @@
./configuration.nix
../../services/nix-cache
../../services/actions-runner
./zram.nix
];
}

@@ -1,6 +0,0 @@
{ ... }:
{
  zramSwap = {
    enable = true;
  };
}

@@ -79,5 +79,8 @@
# Or disable the firewall altogether.
networking.firewall.enable = false;

# Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
zramSwap.enable = true;

system.stateVersion = "25.11";
}

@@ -25,6 +25,9 @@
# Enable remote deployment via NATS
homelab.deploy.enable = true;

# Enable Kanidm PAM/NSS for central authentication
homelab.kanidm.enable = true;

nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

@@ -25,6 +25,9 @@
# Enable remote deployment via NATS
homelab.deploy.enable = true;

# Enable Kanidm PAM/NSS for central authentication
homelab.kanidm.enable = true;

nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

@@ -25,6 +25,9 @@
# Enable remote deployment via NATS
homelab.deploy.enable = true;

# Enable Kanidm PAM/NSS for central authentication
homelab.kanidm.enable = true;

nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

@@ -17,7 +17,8 @@
};
};

# Provisioning - initial users/groups
# Provision base groups only - users are managed via CLI
# See docs/user-management.md for details
provision = {
  enable = true;
  idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
@@ -28,10 +29,7 @@
ssh-users = { };
};

persons.torjus = {
  displayName = "Torjus";
  groups = [ "admins" "users" "ssh-users" ];
};
# Regular users (persons) are managed imperatively via kanidm CLI
};
};

@@ -39,12 +37,14 @@
users.users.kanidm.extraGroups = [ "acme" ];

# ACME certificate from internal CA
# Include both the CNAME (auth) and A record (kanidm01) for Prometheus scraping
security.acme.certs."auth.home.2rjus.net" = {
  listenHTTP = ":80";
  reloadServices = [ "kanidm" ];
  extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
};

# Vault secret for idm_admin password
# Vault secret for idm_admin password (used for provisioning)
vault.secrets.kanidm-idm-admin = {
  secretPath = "kanidm/idm-admin-password";
  extractKey = "password";
@@ -53,12 +53,13 @@
group = "kanidm";
};

# Monitoring scrape target
homelab.monitoring.scrapeTargets = [
  {
    job_name = "kanidm";
    port = 443;
    scheme = "https";
  }
];
# Note: Kanidm does not expose Prometheus metrics
# If metrics support is added in the future, uncomment:
# homelab.monitoring.scrapeTargets = [
# {
# job_name = "kanidm";
# port = 443;
# scheme = "https";
# }
# ];
}

@@ -4,13 +4,16 @@
./acme.nix
./autoupgrade.nix
./homelab-deploy.nix
./kanidm-client.nix
./monitoring
./motd.nix
./packages.nix
./nix.nix
./pipe-to-loki.nix
./root-user.nix
./pki/root-ca.nix
./sshd.nix
./vault-secrets.nix
./zram.nix
];
}

42
system/kanidm-client.nix
Normal file
@@ -0,0 +1,42 @@
{ lib, config, pkgs, ... }:
let
  cfg = config.homelab.kanidm;
in
{
  options.homelab.kanidm = {
    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";

    server = lib.mkOption {
      type = lib.types.str;
      default = "https://auth.home.2rjus.net";
      description = "URI of the Kanidm server";
    };

    allowedLoginGroups = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [ "ssh-users" ];
      description = "Groups allowed to log in via PAM";
    };
  };

  config = lib.mkIf cfg.enable {
    services.kanidm = {
      package = pkgs.kanidm_1_8;
      enablePam = true;

      clientSettings = {
        uri = cfg.server;
      };

      unixSettings = {
        pam_allowed_login_groups = cfg.allowedLoginGroups;
        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
        # This prevents "PAM user mismatch" errors with SSH
        uid_attr_map = "name";
        gid_attr_map = "name";
        # Create symlink /home/torjus -> UUID-based home directory
        home_alias = "name";
      };
    };
  };
}

140
system/pipe-to-loki.nix
Normal file
@@ -0,0 +1,140 @@
{
  config,
  pkgs,
  lib,
  ...
}:
let
  pipe-to-loki = pkgs.writeShellApplication {
    name = "pipe-to-loki";
    runtimeInputs = with pkgs; [
      curl
      jq
      util-linux
      coreutils
    ];
    text = ''
      set -euo pipefail

      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
      HOSTNAME=$(hostname)
      SESSION_ID=""
      RECORD_MODE=false

      usage() {
        echo "Usage: pipe-to-loki [--id ID] [--record]"
        echo ""
        echo "Send command output or interactive sessions to Loki."
        echo ""
        echo "Options:"
        echo " --id ID Set custom session ID (default: auto-generated)"
        echo " --record Start interactive recording session"
        echo ""
        echo "Examples:"
        echo " command | pipe-to-loki # Pipe command output"
        echo " command | pipe-to-loki --id foo # Pipe with custom ID"
        echo " pipe-to-loki --record # Start recording session"
        exit 1
      }

      generate_id() {
        local random_chars
        random_chars=$(head -c 2 /dev/urandom | od -An -tx1 | tr -d ' \n')
        echo "''${HOSTNAME}-$(date +%s)-''${random_chars}"
      }

      send_to_loki() {
        local content="$1"
        local type="$2"
        local timestamp_ns
        timestamp_ns=$(date +%s%N)

        local payload
        payload=$(jq -n \
          --arg job "pipe-to-loki" \
          --arg host "$HOSTNAME" \
          --arg type "$type" \
          --arg id "$SESSION_ID" \
          --arg ts "$timestamp_ns" \
          --arg content "$content" \
          '{
            streams: [{
              stream: {
                job: $job,
                host: $host,
                type: $type,
                id: $id
              },
              values: [[$ts, $content]]
            }]
          }')

        # -f makes curl exit non-zero on HTTP 4xx/5xx, so a rejected push is reported
        if curl -sf -X POST "$LOKI_URL" \
          -H "Content-Type: application/json" \
          -d "$payload" > /dev/null; then
          return 0
        else
          echo "Error: Failed to send to Loki" >&2
          return 1
        fi
      }

      # Parse arguments
      while [[ $# -gt 0 ]]; do
        case $1 in
          --id)
            SESSION_ID="$2"
            shift 2
            ;;
          --record)
            RECORD_MODE=true
            shift
            ;;
          --help|-h)
            usage
            ;;
          *)
            echo "Unknown option: $1" >&2
            usage
            ;;
        esac
      done

      # Generate ID if not provided
      if [[ -z "$SESSION_ID" ]]; then
        SESSION_ID=$(generate_id)
      fi

      if $RECORD_MODE; then
        # Session recording mode
        SCRIPT_FILE=$(mktemp)
        trap 'rm -f "$SCRIPT_FILE"' EXIT

        echo "Recording session $SESSION_ID... (exit to send)"

        # Use script to record the session
        script -q "$SCRIPT_FILE"

        # Read the transcript and send to Loki
        content=$(cat "$SCRIPT_FILE")
        if send_to_loki "$content" "session"; then
          echo "Session $SESSION_ID sent to Loki"
        fi
      else
        # Pipe mode - read from stdin
        if [[ -t 0 ]]; then
          echo "Error: No input provided. Pipe a command or use --record for interactive mode." >&2
          exit 1
        fi

        content=$(cat)
        if send_to_loki "$content" "command"; then
          echo "Sent to Loki with id: $SESSION_ID"
        fi
      fi
    '';
  };
in
{
  environment.systemPackages = [ pipe-to-loki ];
}

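
Entries pushed by `pipe-to-loki` carry the labels `job`, `host`, `type` and `id`, so a session can be retrieved later with a label selector. The following is a hedged sketch and not part of this change: the function names are made up for illustration, and the base URL is simply the push URL from the script with the path replaced by Loki's `query_range` HTTP endpoint.

```bash
# Base URL taken from LOKI_URL in pipe-to-loki; override if Loki lives elsewhere.
LOKI_BASE="${LOKI_BASE:-http://monitoring01.home.2rjus.net:3100}"

# Build a LogQL stream selector for one pipe-to-loki session ID.
logql_for_session() {
  local id="$1"
  printf '{job="pipe-to-loki", id="%s"}\n' "$id"
}

# Fetch up to 100 entries for a session, using the endpoint's default time window.
fetch_session() {
  curl --silent --fail --get "$LOKI_BASE/loki/api/v1/query_range" \
    --data-urlencode "query=$(logql_for_session "$1")" \
    --data-urlencode "limit=100"
}
```

For example, after `command | pipe-to-loki --id foo`, `fetch_session foo` would return the JSON response for that ID; older sessions would need explicit `start` and `end` parameters.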
8
system/zram.nix
Normal file
@@ -0,0 +1,8 @@
# Compressed swap in RAM
#
# Provides overflow memory during Nix builds and upgrades.
# Prevents OOM kills on low-memory hosts (2GB VMs).
{ ... }:
{
  zramSwap.enable = true;
}