docs: add user-management documentation

- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
kanidm: remove declarative user provisioning
2026-02-08 15:14:21 +01:00 · 2026-02-08 15:14:03 +01:00 · 2026-02-08 15:12:19 +01:00 · 2026-02-08 15:12:19 +01:00 · 2026-02-08 13:34:08 +01:00 · 2026-02-08 13:29:42 +01:00
19 changed files with 988 additions and 45 deletions
--- a/.claude/agents/auditor.md
+++ b/.claude/agents/auditor.md
@@ -0,0 +1,180 @@
+---
+name: auditor
+description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
+tools: Read, Grep, Glob
+mcpServers:
+  - lab-monitoring
+---
+
+You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.
+
+## Input
+
+You may receive:
+- A host or list of hosts to investigate
+- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
+- Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
+- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")
+
+## Audit Log Structure
+
+Logs are shipped to Loki via promtail. Audit events use these labels:
+- `host` - hostname
+- `systemd_unit` - typically `auditd.service` for audit logs
+- `job` - typically `systemd-journal`
+
+Audit log entries contain structured data:
+- `EXECVE` - command execution with full arguments
+- `USER_LOGIN` / `USER_LOGOUT` - session start/end
+- `USER_CMD` - sudo command execution
+- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
+- `SERVICE_START` / `SERVICE_STOP` - systemd service events
+
+## Investigation Techniques
+
+### 1. SSH Session Activity
+
+Find SSH logins and session activity:
+```logql
+{host="<hostname>", systemd_unit="sshd.service"}
+```
+
+Look for:
+- Accepted/Failed authentication
+- Session opened/closed
+- Unusual source IPs or users
+
+### 2. Command Execution
+
+Query executed commands (filter out noise):
+```logql
+{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
+```
+
+Further filtering:
+- Exclude systemd noise: `!= "systemd" != "/nix/store"`
+- Focus on specific commands: `|= "rm" |= "-rf"`
+- Focus on specific user: `|= "uid=1000"`
+
+### 3. Sudo Activity
+
+Check for privilege escalation:
+```logql
+{host="<hostname>"} |= "sudo" |= "COMMAND"
+```
+
+Or via audit:
+```logql
+{host="<hostname>"} |= "USER_CMD"
+```
+
+### 4. Service Manipulation
+
+Check if services were manually stopped/started:
+```logql
+{host="<hostname>"} |= "EXECVE" |= "systemctl"
+```
+
+### 5. File Operations
+
+Look for file modifications (if auditd rules are configured):
+```logql
+{host="<hostname>"} |= "EXECVE" |= "vim"
+{host="<hostname>"} |= "EXECVE" |= "nano"
+{host="<hostname>"} |= "EXECVE" |= "rm"
+```
+
+## Query Guidelines
+
+**Start narrow, expand if needed:**
+- Begin with `limit: 20-30`
+- Use tight time windows: `start: "15m"` or `start: "30m"`
+- Add filters progressively
+
+**Avoid:**
+- Querying all audit logs without EXECVE filter (extremely verbose)
+- Large time ranges without specific filters
+- Limits over 50 without tight filters
+
+**Time-bounded queries:**
+When investigating around a specific event:
+```logql
+{host="<hostname>"} |= "EXECVE" != "systemd"
+```
+With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
+
+## Suspicious Patterns to Watch For
+
+1. **Unusual login times** - Activity outside normal hours
+2. **Failed authentication** - Brute force attempts
+3. **Privilege escalation** - Unexpected sudo usage
+4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
+5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
+6. **Persistence mechanisms** - Cron modifications, systemd service creation
+7. **Log tampering** - Commands targeting log files
+8. **Lateral movement** - SSH to other internal hosts
+9. **Service manipulation** - Stopping security services, disabling firewalls
+10. **Cleanup activity** - Deleting bash history, clearing logs
+
+## Output Format
+
+### For Standalone Security Reviews
+
+```
+## Activity Summary
+
+**Host:** <hostname>
+**Time Period:** <start> to <end>
+**Sessions Found:** <count>
+
+## User Sessions
+
+### Session 1: <user> from <source_ip>
+- **Login:** HH:MM:SSZ
+- **Logout:** HH:MM:SSZ (or ongoing)
+- **Commands executed:**
+  - HH:MM:SSZ - <command>
+  - HH:MM:SSZ - <command>
+
+## Suspicious Activity
+
+[If any patterns from the watch list were detected]
+- **Finding:** <description>
+- **Evidence:** <log entries>
+- **Risk Level:** Low / Medium / High
+
+## Summary
+
+[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
+```
+
+### When Called by Another Agent
+
+Provide a focused response addressing the specific question:
+
+```
+## Audit Findings
+
+**Query:** <what was asked>
+**Time Window:** <investigated period>
+
+## Relevant Activity
+
+[Chronological list of relevant events]
+- HH:MM:SSZ - <event>
+- HH:MM:SSZ - <event>
+
+## Assessment
+
+[Direct answer to the question with supporting evidence]
+```
+
+## Guidelines
+
+- Reconstruct timelines chronologically
+- Correlate events (login → commands → logout)
+- Note gaps or missing data
+- Distinguish between automated (systemd, cron) and interactive activity
+- Consider the host's role and tier when assessing severity
+- When called by another agent, focus on answering their specific question
+- Don't speculate without evidence - state what the logs show and don't show
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -4,6 +4,7 @@ description: Investigates a single system alarm by querying Prometheus metrics a
 tools: Read, Grep, Glob
 mcpServers:
  - lab-monitoring
+  - git-explorer
 ---

 You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
@@ -33,9 +34,9 @@ Gather evidence about the current system state:
 - Use `list_targets` to verify the host/service is being scraped successfully
 - Look for correlated metrics that might explain the issue

-### 3. Check Logs
+### 3. Check Service Logs

-Search for relevant log entries using `query_logs`. **Be careful to avoid overly broad queries that return too much data.**
+Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.

 **Query strategies (start narrow, expand if needed):**
 - Start with `limit: 20-30`, increase only if needed
@@ -43,23 +44,42 @@ Search for relevant log entries using `query_logs`. **Be careful to avoid overly
 - Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
 - Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`

-**For audit logs (SSH sessions, command execution):**
-- Filter to just commands: `{host="<hostname>"} |= "EXECVE"`
-- Exclude verbose noise: `!= "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`
-- Example: `{host="testvm01"} |= "EXECVE" != "systemd"` (user commands only)
-
 **Common patterns:**
 - Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
-- SSH activity: `{host="<hostname>", systemd_unit="sshd.service"}`
 - All errors on host: `{host="<hostname>"} |= "error"`
-- Specific command: `{host="<hostname>"} |= "EXECVE" |= "stress"`
+- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`

 **Avoid:**
-- Querying all audit logs without filtering (very verbose)
 - Using `start: "1h"` with no filters on busy hosts
 - Limits over 50 without specific filters

-### 4. Check Configuration (if relevant)
+### 4. Investigate User Activity
+
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
+
+**Always call the auditor when:**
+- A service stopped unexpectedly (may have been manually stopped)
+- A process was killed or a config was changed
+- You need to know who was logged in around the time of an incident
+- You need to understand what commands led to the current state
+- The cause isn't obvious from service logs alone
+
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
+
+**Example prompt for auditor:**
+```
+Investigate user activity on <hostname> between <start_time> and <end_time>.
+Context: The prometheus-node-exporter service stopped at 14:32.
+Determine if it was manually stopped and by whom.
+```
+
+Incorporate the auditor's findings into your timeline and root cause analysis.
+
+### 5. Check Configuration (if relevant)

 If the alert relates to a NixOS-managed service:
 - Check host configuration in `/hosts/<hostname>/`
@@ -67,9 +87,61 @@ If the alert relates to a NixOS-managed service:
 - Look for thresholds, resource limits, or misconfigurations
 - Check `homelab.host` options for tier/priority/role metadata

-### 5. Consider Common Causes
+### 6. Check for Configuration Drift
+
+Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
+- Hosts running outdated configurations
+- Recent changes that might have caused the issue
+- Whether a fix has already been committed but not deployed
+
+**Step 1: Get the deployed revision from Prometheus**
+```promql
+nixos_flake_info{hostname="<hostname>"}
+```
+The `current_rev` label contains the deployed git commit hash.
+
+**Step 2: Check if the host is behind master**
+```
+resolve_ref("master")           # Get current master commit
+is_ancestor(deployed, master)   # Check if host is behind
+```
+
+**Step 3: See what commits are missing**
+```
+commits_between(deployed, master)  # List commits not yet deployed
+```
+
+**Step 4: Check which files changed**
+```
+get_diff_files(deployed, master)   # Files modified since deployment
+```
+Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.
+
+**Step 5: View configuration at the deployed revision**
+```
+get_file_at_commit(deployed, "services/<service>/default.nix")
+```
+Compare against the current file to understand differences.
+
+**Step 6: Find when something changed**
+```
+search_commits("<service-name>")   # Find commits mentioning the service
+get_commit_info(<hash>)            # Get full details of a specific change
+```
+
+**Example workflow for a service-related alert:**
+1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
+2. `resolve_ref("master")` → `4633421`
+3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
+4. `commits_between("8959829", "4633421")` → 7 commits missing
+5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
+6. If a fix was committed after the deployed rev, recommend deployment
+
+### 7. Consider Common Causes

 For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
+- **Configuration drift**: Host running outdated config, fix already in master
 - **Disk space**: Nix store growth, logs, temp files
 - **Memory pressure**: Service memory leaks, insufficient limits
 - **CPU**: Runaway processes, build jobs
@@ -77,6 +149,8 @@ For infrastructure alerts, common causes include:
 - **Service restarts**: Failed upgrades, configuration errors
 - **Scrape failures**: Service down, firewall issues, port changes

+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
+
 ## Output Format

 Provide a concise report with one of two outcomes:
@@ -133,6 +207,5 @@ Provide a concise report with one of two outcomes:
 - If the alert is a false positive or expected behavior, explain why
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
-- Include precursor events (logins, config changes, restarts) that led to the issue
 - **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
-- **Avoid broad audit log queries**: always filter to EXECVE and exclude noise (PATH, SYSCALL, BPF)
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
--- a/.mcp.json
+++ b/.mcp.json
@@ -33,6 +33,13 @@
        "--nats-url", "nats://nats1.home.2rjus.net:4222",
        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
      ]
+    },
+    "git-explorer": {
+      "command": "nix",
+      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
+      "env": {
+        "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
+      }
    }
  }
 }
--- a/docs/plans/auth-system-replacement.md
+++ b/docs/plans/auth-system-replacement.md
@@ -66,9 +66,9 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
   - Vault integration for idm_admin password
   - LDAPS on port 636

-2. **Configure declarative provisioning** ✅
-   - Groups: `admins`, `users`, `ssh-users`
-   - User: `torjus` (member of all groups)
+2. **Configure provisioning** ✅
+   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
+   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
   - POSIX attributes enabled (UID/GID range 65,536-69,999)

 3. **Test NAS integration** (in progress)
@@ -80,14 +80,16 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
   - Grafana
   - Other services as needed

-5. **Create client module** in `system/` for PAM/NSS
-   - Enable on all hosts that need central auth
-   - Configure trusted CA
+5. **Create client module** in `system/` for PAM/NSS ✅
+   - Module: `system/kanidm-client.nix`
+   - `homelab.kanidm.enable = true` enables PAM/NSS
+   - Short usernames (not SPN format)
+   - Home directory symlinks via `home_alias`
+   - Enabled on test tier: testvm01, testvm02, testvm03

-6. **Documentation**
-   - User management procedures
-   - Adding new OAuth2 clients
-   - Troubleshooting PAM/NSS issues
+6. **Documentation** ✅
+   - `docs/user-management.md` - CLI workflows, troubleshooting
+   - User/group creation procedures verified working

 ## Progress

@@ -106,14 +108,37 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
 - Prometheus monitoring scrape target configured

 **Provisioned entities:**
-- Groups: `admins`, `users`, `ssh-users`
-- User: `torjus` (member of all groups, POSIX enabled with GID 65536)
+- Groups: `admins`, `users`, `ssh-users` (declarative)
+- Users managed via CLI (imperative)

 **Verified working:**
 - WebUI login with idm_admin
 - LDAP bind and search with POSIX-enabled user
 - LDAPS with valid internal CA certificate

+### Completed (2026-02-08) - PAM/NSS Client
+
+**Client module deployed (`system/kanidm-client.nix`):**
+- `homelab.kanidm.enable = true` enables PAM/NSS integration
+- Connects to auth.home.2rjus.net
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based dir)
+- Login restricted to `ssh-users` group
+
+**Enabled on test tier:**
+- testvm01, testvm02, testvm03
+
+**Verified working:**
+- User/group resolution via `getent`
+- SSH login with Kanidm unix passwords
+- Home directory creation with symlinks
+- Imperative user/group creation via CLI
+
+**Documentation:**
+- `docs/user-management.md` with full CLI workflows
+- Password requirements (min 10 chars)
+- Troubleshooting guide (nscd, cache invalidation)
+
 ### UID/GID Range (Resolved)

 **Range: 65,536 - 69,999** (manually allocated)
@@ -128,10 +153,9 @@ Rationale:

 ### Next Steps

-1. Deploy to monitoring01 to enable Prometheus scraping
+1. Enable PAM/NSS on production hosts (after test tier validation)
 2. Configure TrueNAS LDAP client for NAS integration testing
 3. Add OAuth2 clients (Grafana first)
-4. Create PAM/NSS client module for other hosts

 ## References

--- a/docs/plans/memory-issues-follow-up.md
+++ b/docs/plans/memory-issues-follow-up.md
@@ -0,0 +1,116 @@
+# Memory Issues Follow-up
+
+Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
+
+## Background
+
+On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
+
+Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
+
+## Fix Applied
+
+**Commit:** `1674b6a` - system: enable zram swap for all hosts
+
+**Merged:** 2026-02-08 ~12:15 UTC
+
+**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
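+
+A minimal sketch of what `system/zram.nix` might contain, assuming the stock NixOS `zramSwap` module. The file's actual contents are not shown in this document, and `memoryPercent = 100` is an assumption chosen to match the ~2GB figure: the module default is 50, which would give roughly 1GB on a 2GB host.
+
+```nix
+{ ... }: {
+  zramSwap = {
+    enable = true;
+    # Assumption: size the zram device at 100% of RAM (~2GB on a 2GB VM).
+    # The NixOS default is 50.
+    memoryPercent = 100;
+  };
+}
+```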
+
+## Timeline
+
+| Time (UTC) | Event |
+|------------|-------|
+| 05:00:46 | ns2 nixos-upgrade OOM killed |
+| 05:01:47 | `nixos_upgrade_failed` alert fired |
+| 12:15 | zram commit merged to master |
+| 12:19 | ns2 rebooted with zram enabled |
+| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
+
+## Hosts Affected
+
+All 2GB VMs that run nixos-upgrade:
+- ns1, ns2 (DNS)
+- vault01
+- testvm01, testvm02, testvm03
+- kanidm01
+
+## Metrics to Monitor
+
+Check these in Grafana or via PromQL to verify the fix:
+
+### Swap availability (should be ~2GB after upgrade)
+```promql
+node_memory_SwapTotal_bytes / 1024 / 1024
+```
+
+### Swap usage during upgrades
+```promql
+(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
+```
+
+### Zswap compressed bytes (zswap only, not zram; expect 0 unless zswap is also enabled)
+```promql
+node_memory_Zswap_bytes / 1024 / 1024
+```
+
+### Upgrade failures (should be 0)
+```promql
+node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
+```
+
+### Memory available during upgrades
+```promql
+node_memory_MemAvailable_bytes / 1024 / 1024
+```
+
+## Verification Steps
+
+After a few days (allow auto-upgrades to run on all hosts):
+
+1. Check all hosts have swap enabled:
+   ```promql
+   node_memory_SwapTotal_bytes > 0
+   ```
+
+2. Check for any upgrade failures since the fix:
+   ```promql
+   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
+   ```
+
+3. Review if any hosts used swap during upgrades (check historical graphs)
+
+## Success Criteria
+
+- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
+- All hosts show ~2GB swap available
+- Upgrades complete successfully on 2GB VMs
+
+## Fallback Options
+
+If zram is insufficient:
+
+1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
+2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
+3. **Use remote builds** - Configure `nix.buildMachines` to offload builds (flake evaluation still runs locally, so this may not relieve evaluation-time OOM on its own)
+4. **Reduce flake size** - Split configurations to reduce evaluation memory
+
+### Memory Ballooning
+
+Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
+
+Configuration in `terraform/vms.tf`:
+```hcl
+memory  = 4096  # maximum memory
+balloon = 2048  # minimum memory (shrinks to this when idle)
+```
+
+Pros:
+- VMs get memory on-demand without reboots
+- Better host memory utilization
+- Solves upgrade OOM without permanently allocating 4GB
+
+Cons:
+- Requires the virtio balloon driver in the guest
+- Guest can experience memory pressure if host is overcommitted
+
+Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
--- a/docs/plans/security-hardening.md
+++ b/docs/plans/security-hardening.md
@@ -0,0 +1,224 @@
+# Security Hardening Plan
+
+## Overview
+
+Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
+
+## Current State
+
+- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
+- Firewall disabled on all hosts (`networking.firewall.enable = false`)
+- Promtail ships logs over HTTP to Loki
+- Loki has no authentication (`auth_enabled = false`)
+- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
+- Vault TLS verification disabled by default (`skipTlsVerify = true`)
+- Audit logging exists (`common/ssh-audit.nix`) but not applied globally
+- Alert rules focus on availability, no security event detection
+
+## Priority Matrix
+
+| Issue | Severity | Effort | Priority |
+|-------|----------|--------|----------|
+| SSH password auth | High | Low | **P1** |
+| Firewall disabled | High | Medium | **P1** |
+| Promtail HTTP (no TLS) | High | Medium | **P2** |
+| No security alerting | Medium | Low | **P2** |
+| Audit logging not global | Low | Low | **P2** |
+| Loki no auth | Medium | Medium | **P3** |
+| Secret-ID TTL | Medium | Medium | **P3** |
+| Vault skipTlsVerify | Medium | Low | **P3** |
+
+## Phase 1: Quick Wins (P1)
+
+### 1.1 SSH Hardening
+
+Edit `system/sshd.nix`:
+
+```nix
+services.openssh = {
+  enable = true;
+  settings = {
+    PermitRootLogin = "prohibit-password";  # Key-only root login
+    PasswordAuthentication = false;
+    KbdInteractiveAuthentication = false;
+  };
+};
+```
+
+**Prerequisite:** Verify all hosts have SSH keys deployed for root.
+
+### 1.2 Enable Firewall
+
+Create `system/firewall.nix` with default deny policy:
+
+```nix
+{ ... }: {
+  networking.firewall.enable = true;
+
+  # Use openssh's built-in firewall integration
+  services.openssh.openFirewall = true;
+}
+```
+
+**Useful firewall options:**
+
+| Option | Description |
+|--------|-------------|
+| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
+| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
+| `networking.firewall.extraInputRules` | Custom nftables rules for complex filtering (requires `networking.nftables.enable = true`) |
+
+**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
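+
+A hedged sketch of such a restriction, assuming the nftables-based firewall (`extraInputRules` only takes effect with `networking.nftables.enable = true`). Turning off `openFirewall` here is an illustrative choice so that port 22 is not also opened globally:
+
+```nix
+{ ... }: {
+  networking.nftables.enable = true;
+  networking.firewall = {
+    enable = true;
+    # Accept SSH only from the infrastructure subnet
+    extraInputRules = ''
+      ip saddr 10.69.13.0/24 tcp dport 22 accept
+    '';
+  };
+
+  # Otherwise openssh opens port 22 for all sources
+  services.openssh.openFirewall = false;
+}
+```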
+
+#### Per-Interface Rules (http-proxy WireGuard)
+
+The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
+
+```nix
+# Example: http-proxy with different rules per interface
+networking.firewall = {
+  enable = true;
+
+  # Default: only SSH (via openFirewall)
+  allowedTCPPorts = [ ];
+
+  # LAN interface: allow HTTP/HTTPS
+  interfaces.ens18 = {
+    allowedTCPPorts = [ 80 443 ];
+  };
+
+  # WireGuard interface: restrict to specific services or trust fully
+  interfaces.wg0 = {
+    allowedTCPPorts = [ 80 443 ];
+    # Or drop this block and set networking.firewall.trustedInterfaces = [ "wg0" ] if fully trusted
+  };
+};
+```
+
+**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
+
+Then per-host, open required ports:
+
+| Host | Additional Ports |
+|------|------------------|
+| ns1/ns2 | 53 (TCP/UDP) |
+| vault01 | 8200 |
+| monitoring01 | 3100, 9090, 3000, 9093 |
+| http-proxy | 80, 443 |
+| nats1 | 4222 |
+| ha1 | 1883, 8123 |
+| jelly01 | 8096 |
+| nix-cache01 | 5000 |
+
+## Phase 2: Logging & Detection (P2)
+
+### 2.1 Enable TLS for Promtail → Loki
+
+Update `system/monitoring/logs.nix`:
+
+```nix
+clients = [{
+  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
+  tls_config = {
+    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
+  };
+}];
+```
+
+Requires:
+- Configure Loki with TLS certificate (use internal ACME)
+- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)
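+
+On the Loki side, the first requirement could look roughly like this. It is a sketch only: the certificate paths are hypothetical and assume a certificate issued through `security.acme` for this hostname, and the `loki` user would need read access to the key.
+
+```nix
+services.loki.configuration.server = {
+  http_listen_port = 3100;
+  # Serve the HTTP API (including /loki/api/v1/push) over TLS
+  http_tls_config = {
+    cert_file = "/var/lib/acme/monitoring01.home.2rjus.net/fullchain.pem";  # hypothetical path
+    key_file = "/var/lib/acme/monitoring01.home.2rjus.net/key.pem";         # hypothetical path
+  };
+};
+```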
+
+### 2.2 Security Alert Rules
+
+Add to `services/monitoring/rules.yml`:
+
+```yaml
+- name: security_rules
+  rules:
+    - alert: ssh_auth_failures
+      expr: increase(node_logind_sessions_total[5m]) > 20
+      for: 0m
+      labels:
+        severity: warning
+      annotations:
+        summary: "Unusual login activity on {{ $labels.instance }}"
+
+    - alert: vault_secret_fetch_failure
+      expr: increase(vault_secret_failures[5m]) > 5
+      for: 0m
+      labels:
+        severity: warning
+      annotations:
+        summary: "Vault secret fetch failures on {{ $labels.instance }}"
+```
+
+Also add Loki-based alerts for:
+- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
+- sudo usage: `{job="systemd-journal"} |= "sudo"`
+
+### 2.3 Global Audit Logging
+
+Import `common/ssh-audit.nix` in `system/default.nix` (the import path is relative to `system/`):
+
+```nix
+imports = [
+  # ... existing imports
+  ../common/ssh-audit.nix
+];
+```
+
+## Phase 3: Defense in Depth (P3)
+
+### 3.1 Loki Authentication
+
+Options:
+1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
+2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
+3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy
+
+Recommendation: Option 1 (reverse proxy) is simplest for homelab.
+
+### 3.2 AppRole Secret Rotation
+
+Update `terraform/vault/approle.tf`:
+
+```hcl
+secret_id_ttl  = 2592000  # 30 days
+```
+
+Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.
+
+### 3.3 Enable Vault TLS Verification
+
+Change default in `system/vault-secrets.nix`:
+
+```nix
+skipTlsVerify = mkOption {
+  type = types.bool;
+  default = false;  # Changed from true
+};
+```
+
+**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.
+
+## Implementation Order
+
+1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
+2. **Validate SSH access** - Ensure key-based login works before disabling passwords
+3. **Document firewall ports** - Create reference of ports per host before enabling
+4. **Phase prod rollout** - Deploy to prod hosts one at a time, verify each
+
+## Open Questions
+
+- [ ] Do all hosts have SSH keys configured for root access?
+- [ ] Should firewall rules be per-host or use a central definition with roles?
+- [ ] Should Loki authentication use the existing Kanidm setup?
+
+**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.
+
+## Notes
+
+- Firewall changes are the highest risk - test thoroughly on test-tier
+- SSH hardening must not lock out access - verify keys first
+- Consider creating a "break glass" procedure for emergency access if keys fail
--- a/docs/user-management.md
+++ b/docs/user-management.md
@@ -0,0 +1,267 @@
+# User Management with Kanidm
+
+Central authentication for the homelab using Kanidm.
+
+## Overview
+
+- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
+- **WebUI**: https://auth.home.2rjus.net
+- **LDAPS**: port 636
+
+## CLI Setup
+
+The `kanidm` CLI is available in the devshell:
+
+```bash
+nix develop
+
+# Login as idm_admin
+kanidm login --name idm_admin --url https://auth.home.2rjus.net
+```
+
+## User Management
+
+POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
+all attributes (including UNIX password) in one workflow.
+
+### Creating a POSIX User
+
+```bash
+# Create the person
+kanidm person create <username> "<Display Name>"
+
+# Add to groups
+kanidm group add-members ssh-users <username>
+
+# Enable POSIX (UID is auto-assigned)
+kanidm person posix set <username>
+
+# Set UNIX password (required for SSH login, min 10 characters)
+kanidm person posix set-password <username>
+
+# Optionally set login shell
+kanidm person posix set <username> --shell /bin/zsh
+```
+
+### Example: Full User Creation
+
+```bash
+kanidm person create testuser "Test User"
+kanidm group add-members ssh-users testuser
+kanidm person posix set testuser
+kanidm person posix set-password testuser
+kanidm person get testuser
+```
+
+After creation, verify on a client host:
+```bash
+getent passwd testuser
+ssh testuser@testvm01.home.2rjus.net
+```
+
+### Viewing User Details
+
+```bash
+kanidm person get <username>
+```
+
+### Removing a User
+
+```bash
+kanidm person delete <username>
+```
+
+## Group Management
+
+Groups for POSIX access are also managed via CLI.
+
+### Creating a POSIX Group
+
+```bash
+# Create the group
+kanidm group create <group-name>
+
+# Enable POSIX with a specific GID
+kanidm group posix set <group-name> --gidnumber <gid>
+```
+
+### Adding Members
+
+```bash
+kanidm group add-members <group-name> <username>
+```
+
+### Viewing Group Details
+
+```bash
+kanidm group get <group-name>
+kanidm group list-members <group-name>
+```
+
+### Example: Full Group Creation
+
+```bash
+kanidm group create testgroup
+kanidm group posix set testgroup --gidnumber 68010
+kanidm group add-members testgroup testuser
+kanidm group get testgroup
+```
+
+After creation, verify on a client host:
+```bash
+getent group testgroup
+```
+
+### Current Groups
+
+| Group | GID | Purpose |
+|-------|-----|---------|
+| ssh-users | 68000 | SSH login access |
+| admins | 68001 | Administrative access |
+| users | 68002 | General users |
+
+### UID/GID Allocation
+
+Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:
+
+| Range | Purpose |
+|-------|---------|
+| 65,536+ | Users (auto-assigned) |
+| 68,000 - 68,999 | Groups (manually assigned) |
+
+## PAM/NSS Client Configuration
+
+Enable central authentication on a host:
+
+```nix
+homelab.kanidm.enable = true;
+```
+
+This configures:
+- `services.kanidm.enablePam = true`
+- Client connection to auth.home.2rjus.net
+- Login authorization for `ssh-users` group
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based directory)
+
+### Enabled Hosts
+
+- testvm01, testvm02, testvm03 (test tier)
+
+### Options
+
+```nix
+homelab.kanidm = {
+  enable = true;
+  server = "https://auth.home.2rjus.net";  # default
+  allowedLoginGroups = [ "ssh-users" ];     # default
+};
+```
+
+### Home Directories
+
+Home directories use UUID-based paths for stability (so renaming a user doesn't
+require moving their home directory). Symlinks provide convenient access:
+
+```
+/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
+```
+
+The symlinks are created by `kanidm-unixd-tasks` on first login.
+
+## Testing
+
+### Verify NSS Resolution
+
+```bash
+# Check user resolution
+getent passwd <username>
+
+# Check group resolution
+getent group <group-name>
+```
+
+### Test SSH Login
+
+```bash
+ssh <username>@<hostname>.home.2rjus.net
+```
+
+## Troubleshooting
+
+### "PAM user mismatch" error
+
+SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
+usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).
+
+**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).
+
+Check current format:
+```bash
+getent passwd torjus
+# Should show: torjus:x:65536:...
+# NOT: torjus@home.2rjus.net:x:65536:...
+```
+
+### User resolves but SSH fails immediately
+
+The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:
+
+```bash
+# Check if group has POSIX
+getent group ssh-users
+
+# If empty, enable POSIX on the server
+kanidm group posix set ssh-users --gidnumber 68000
+```
+
+### User doesn't resolve via getent
+
+1. Check kanidm-unixd service is running:
+   ```bash
+   systemctl status kanidm-unixd
+   ```
+
+2. Check unixd can reach server:
+   ```bash
+   kanidm-unix status
+   # Should show: system: online, Kanidm: online
+   ```
+
+3. Check client can reach server:
+   ```bash
+   curl -s https://auth.home.2rjus.net/status
+   ```
+
+4. Check user has POSIX enabled on server:
+   ```bash
+   kanidm person get <username>
+   ```
+
+5. Restart nscd to reset the NSS lookup daemon:
+   ```bash
+   systemctl restart nscd
+   ```
+
+6. Invalidate kanidm cache:
+   ```bash
+   kanidm-unix cache-invalidate
+   ```
+
+### Changes not taking effect after deployment
+
+NixOS uses nsncd (a non-caching Rust reimplementation of nscd) to serve NSS lookups;
+its systemd unit is still named `nscd`. After deploying kanidm-unixd config changes,
+you may need to restart both services:
+
+```bash
+systemctl restart kanidm-unixd
+systemctl restart nscd
+```
+
+### Test PAM authentication directly
+
+Use the kanidm-unix CLI to test PAM auth without SSH:
+
+```bash
+kanidm-unix auth-test --name <username>
+```
--- a/flake.nix
+++ b/flake.nix
@@ -207,6 +207,7 @@
              pkgs.ansible
              pkgs.opentofu
              pkgs.openbao
+              pkgs.kanidm_1_8
              (pkgs.callPackage ./scripts/create-host { })
              homelab-deploy.packages.${pkgs.system}.default
            ];
--- a/hosts/jelly01/configuration.nix
+++ b/hosts/jelly01/configuration.nix
@@ -64,9 +64,5 @@
  vault.enable = true;
  homelab.deploy.enable = true;

-  zramSwap = {
-    enable = true;
-  };
-
  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/nix-cache01/default.nix
+++ b/hosts/nix-cache01/default.nix
@@ -4,6 +4,5 @@
    ./configuration.nix
    ../../services/nix-cache
    ../../services/actions-runner
-    ./zram.nix
  ];
 }
--- a/hosts/nix-cache01/zram.nix
+++ b/hosts/nix-cache01/zram.nix
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  zramSwap = {
-    enable = true;
-  };
-}
--- a/hosts/template2/configuration.nix
+++ b/hosts/template2/configuration.nix
@@ -79,5 +79,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

+  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
+  zramSwap.enable = true;
+
  system.stateVersion = "25.11";
 }
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -25,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/hosts/testvm02/configuration.nix
+++ b/hosts/testvm02/configuration.nix
@@ -25,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/hosts/testvm03/configuration.nix
+++ b/hosts/testvm03/configuration.nix
@@ -25,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/services/kanidm/default.nix
+++ b/services/kanidm/default.nix
@@ -17,7 +17,8 @@
      };
    };

-    # Provisioning - initial users/groups
+    # Provision base groups only - users are managed via CLI
+    # See docs/user-management.md for details
    provision = {
      enable = true;
      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
@@ -28,10 +29,7 @@
        ssh-users = { };
      };

-      persons.torjus = {
-        displayName = "Torjus";
-        groups = [ "admins" "users" "ssh-users" ];
-      };
+      # Regular users (persons) are managed imperatively via kanidm CLI
    };
  };

@@ -46,7 +44,7 @@
    extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
  };

-  # Vault secret for idm_admin password
+  # Vault secret for idm_admin password (used for provisioning)
  vault.secrets.kanidm-idm-admin = {
    secretPath = "kanidm/idm-admin-password";
    extractKey = "password";
--- a/system/default.nix
+++ b/system/default.nix
@@ -4,6 +4,7 @@
    ./acme.nix
    ./autoupgrade.nix
    ./homelab-deploy.nix
+    ./kanidm-client.nix
    ./monitoring
    ./motd.nix
    ./packages.nix
@@ -12,5 +13,6 @@
    ./pki/root-ca.nix
    ./sshd.nix
    ./vault-secrets.nix
+    ./zram.nix
  ];
 }
--- a/system/kanidm-client.nix
+++ b/system/kanidm-client.nix
@@ -0,0 +1,42 @@
+{ lib, config, pkgs, ... }:
+let
+  cfg = config.homelab.kanidm;
+in
+{
+  options.homelab.kanidm = {
+    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";
+
+    server = lib.mkOption {
+      type = lib.types.str;
+      default = "https://auth.home.2rjus.net";
+      description = "URI of the Kanidm server";
+    };
+
+    allowedLoginGroups = lib.mkOption {
+      type = lib.types.listOf lib.types.str;
+      default = [ "ssh-users" ];
+      description = "Groups allowed to log in via PAM";
+    };
+  };
+
+  config = lib.mkIf cfg.enable {
+    services.kanidm = {
+      package = pkgs.kanidm_1_8;
+      enablePam = true;
+
+      clientSettings = {
+        uri = cfg.server;
+      };
+
+      unixSettings = {
+        pam_allowed_login_groups = cfg.allowedLoginGroups;
+        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
+        # This prevents "PAM user mismatch" errors with SSH
+        uid_attr_map = "name";
+        gid_attr_map = "name";
+        # Create symlink /home/torjus -> UUID-based home directory
+        home_alias = "name";
+      };
+    };
+  };
+}
--- a/system/zram.nix
+++ b/system/zram.nix
@@ -0,0 +1,8 @@
+# Compressed swap in RAM
+#
+# Provides overflow memory during Nix builds and upgrades.
+# Prevents OOM kills on low-memory hosts (2GB VMs).
+{ ... }:
+{
+  zramSwap.enable = true;
+}
Author	SHA1	Message	Date
Torjus Håkestad	9ed09c9a9c	docs: add user-management documentation - CLI workflows for creating users and groups - Troubleshooting guide (nscd, cache invalidation) - Home directory behavior (UUID-based with symlinks) - Update auth-system-replacement plan with progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:21 +01:00
Torjus Håkestad	b31c64f1b9	kanidm: remove declarative user provisioning Keep base groups (admins, users, ssh-users) provisioned declaratively but manage regular users via the kanidm CLI. This allows setting POSIX attributes and passwords in a single workflow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:03 +01:00
Torjus Håkestad	54b6e37420	flake: add kanidm to devshell Add kanidm_1_8 CLI for administering the Kanidm server. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	b845a8bb8b	system: add kanidm PAM/NSS client module Add homelab.kanidm.enable option for central authentication via Kanidm. The module configures: - PAM/NSS integration with kanidm-unixd - Client connection to auth.home.2rjus.net - Login authorization for ssh-users group Enable on testvm01-03 for testing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	bfbf0cea68	template2: enable zram for bootstrap Prevents OOM during initial nixos-rebuild on 2GB VMs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:34:08 +01:00
Torjus Håkestad	3abe5e83a7	docs: add memory ballooning as fallback option Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:29:42 +01:00
Torjus Håkestad	67c27555f3	docs: add memory issues follow-up plan Track zram change effectiveness for OOM prevention during upgrades. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:26:31 +01:00
Torjus Håkestad	1674b6a844	system: enable zram swap for all hosts Provides compressed swap in RAM to prevent OOM kills during nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram configs from jelly01 and nix-cache01. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:02:58 +01:00
Torjus Håkestad	311be282b6	docs: add security hardening plan Based on security review findings, covering SSH hardening, firewall enablement, log transport TLS, security alerting, and secrets management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:26:15 +01:00
Torjus Håkestad	11cbb64097	claude: make auditor delegation explicit in investigate-alarm - Changed section 4 from "if needed" to always spawn auditor - Added explicit "Do NOT query audit logs yourself" guidance - Listed specific scenarios requiring auditor (service stopped, etc.) - Added manual intervention as first common cause - Updated guidelines to emphasize mandatory delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:11:09 +01:00
Torjus Håkestad	e2dd21c994	claude: add auditor agent and git-explorer MCP Add new auditor agent for security-focused audit log analysis: - SSH session tracking, command execution, sudo usage - Suspicious activity detection patterns - Can be used standalone or as sub-agent by investigate-alarm Update investigate-alarm to delegate audit analysis to auditor and add git-explorer MCP for configuration drift detection. Add git-explorer to .mcp.json for repository inspection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 04:48:55 +01:00