docs: add user-management documentation

- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
kanidm: remove declarative user provisioning
2026-02-08 15:14:21 +01:00 · 2026-02-08 15:14:03 +01:00 · 2026-02-08 15:12:19 +01:00 · 2026-02-08 15:12:19 +01:00 · 2026-02-08 13:34:08 +01:00 · 2026-02-08 13:29:42 +01:00
20 changed files with 1018 additions and 51 deletions
--- a/.claude/agents/auditor.md
+++ b/.claude/agents/auditor.md
@@ -0,0 +1,180 @@
+---
+name: auditor
+description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
+tools: Read, Grep, Glob
+mcpServers:
+  - lab-monitoring
+---
+
+You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.
+
+## Input
+
+You may receive:
+- A host or list of hosts to investigate
+- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
+- Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
+- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")
+
+## Audit Log Structure
+
+Logs are shipped to Loki via promtail. Audit events use these labels:
+- `host` - hostname
+- `systemd_unit` - typically `auditd.service` for audit logs
+- `job` - typically `systemd-journal`
+
+Audit log entries contain structured data:
+- `EXECVE` - command execution with full arguments
+- `USER_LOGIN` / `USER_LOGOUT` - session start/end
+- `USER_CMD` - sudo command execution
+- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
+- `SERVICE_START` / `SERVICE_STOP` - systemd service events
+
+## Investigation Techniques
+
+### 1. SSH Session Activity
+
+Find SSH logins and session activity:
+```logql
+{host="<hostname>", systemd_unit="sshd.service"}
+```
+
+Look for:
+- Accepted/Failed authentication
+- Session opened/closed
+- Unusual source IPs or users
+
+### 2. Command Execution
+
+Query executed commands (filter out noise):
+```logql
+{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
+```
+
+Further filtering:
+- Exclude systemd noise: `!= "systemd" != "/nix/store"`
+- Focus on specific commands: `|= "rm" |= "-rf"`
+- Focus on specific user: `|= "uid=1000"`
+
+### 3. Sudo Activity
+
+Check for privilege escalation:
+```logql
+{host="<hostname>"} |= "sudo" |= "COMMAND"
+```
+
+Or via audit:
+```logql
+{host="<hostname>"} |= "USER_CMD"
+```
+
+### 4. Service Manipulation
+
+Check if services were manually stopped/started:
+```logql
+{host="<hostname>"} |= "EXECVE" |= "systemctl"
+```
+
+### 5. File Operations
+
+Look for file modifications (if auditd rules are configured):
+```logql
+{host="<hostname>"} |= "EXECVE" |= "vim"
+{host="<hostname>"} |= "EXECVE" |= "nano"
+{host="<hostname>"} |= "EXECVE" |= "rm"
+```
+
+## Query Guidelines
+
+**Start narrow, expand if needed:**
+- Begin with `limit: 20-30`
+- Use tight time windows: `start: "15m"` or `start: "30m"`
+- Add filters progressively
+
+**Avoid:**
+- Querying all audit logs without EXECVE filter (extremely verbose)
+- Large time ranges without specific filters
+- Limits over 50 without tight filters
+
+**Time-bounded queries:**
+When investigating around a specific event:
+```logql
+{host="<hostname>"} |= "EXECVE" != "systemd"
+```
+With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
+
+## Suspicious Patterns to Watch For
+
+1. **Unusual login times** - Activity outside normal hours
+2. **Failed authentication** - Brute force attempts
+3. **Privilege escalation** - Unexpected sudo usage
+4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
+5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
+6. **Persistence mechanisms** - Cron modifications, systemd service creation
+7. **Log tampering** - Commands targeting log files
+8. **Lateral movement** - SSH to other internal hosts
+9. **Service manipulation** - Stopping security services, disabling firewalls
+10. **Cleanup activity** - Deleting bash history, clearing logs
+
+## Output Format
+
+### For Standalone Security Reviews
+
+```
+## Activity Summary
+
+**Host:** <hostname>
+**Time Period:** <start> to <end>
+**Sessions Found:** <count>
+
+## User Sessions
+
+### Session 1: <user> from <source_ip>
+- **Login:** HH:MM:SSZ
+- **Logout:** HH:MM:SSZ (or ongoing)
+- **Commands executed:**
+  - HH:MM:SSZ - <command>
+  - HH:MM:SSZ - <command>
+
+## Suspicious Activity
+
+[If any patterns from the watch list were detected]
+- **Finding:** <description>
+- **Evidence:** <log entries>
+- **Risk Level:** Low / Medium / High
+
+## Summary
+
+[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
+```
+
+### When Called by Another Agent
+
+Provide a focused response addressing the specific question:
+
+```
+## Audit Findings
+
+**Query:** <what was asked>
+**Time Window:** <investigated period>
+
+## Relevant Activity
+
+[Chronological list of relevant events]
+- HH:MM:SSZ - <event>
+- HH:MM:SSZ - <event>
+
+## Assessment
+
+[Direct answer to the question with supporting evidence]
+```
+
+## Guidelines
+
+- Reconstruct timelines chronologically
+- Correlate events (login → commands → logout)
+- Note gaps or missing data
+- Distinguish between automated (systemd, cron) and interactive activity
+- Consider the host's role and tier when assessing severity
+- When called by another agent, focus on answering their specific question
+- Don't speculate without evidence - state what the logs show and don't show
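
The noise filters from the auditor's "Command Execution" section can be composed mechanically. The following is a minimal shell sketch, assuming a hypothetical helper named `build_execve_query` that is not part of this repository. It only prints the LogQL string; running the query is still left to `query_logs`.

```bash
# Hypothetical helper: print the EXECVE query for a host.
# Usage: build_execve_query <hostname> [extra strings to exclude...]
build_execve_query() {
  host="$1"
  shift
  query="{host=\"${host}\"} |= \"EXECVE\""
  # Record types excluded by the base query, followed by caller-supplied exclusions.
  for noise in "PATH item" "PROCTITLE" "SYSCALL" "BPF" "$@"; do
    query="${query} != \"${noise}\""
  done
  printf '%s\n' "$query"
}
```

For example, `build_execve_query testvm01 systemd /nix/store` appends the two "systemd noise" exclusions listed under "Further filtering" to the base query.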
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -4,6 +4,7 @@ description: Investigates a single system alarm by querying Prometheus metrics a
 tools: Read, Grep, Glob
 mcpServers:
  - lab-monitoring
+  - git-explorer
 ---

 You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
@@ -33,18 +34,52 @@ Gather evidence about the current system state:
 - Use `list_targets` to verify the host/service is being scraped successfully
 - Look for correlated metrics that might explain the issue

-### 3. Check Logs
+### 3. Check Service Logs

-Search for relevant log entries:
- Use `query_logs` to search Loki for the affected host/service
- Common patterns:
-  - `{host="<hostname>", systemd_unit="<service>.service"}`
-  - `{host="<hostname>"} |= "error"`
-  - `{systemd_unit="<service>.service"}` across all hosts
- Look for errors, warnings, or unusual patterns around the alert time
- Use `start: "1h"` or longer for context
+Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.

-### 4. Check Configuration (if relevant)
+**Query strategies (start narrow, expand if needed):**
+- Start with `limit: 20-30`, increase only if needed
+- Use tight time windows: `start: "15m"` or `start: "30m"` initially
+- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
+- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
+
+**Common patterns:**
+- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
+- All errors on host: `{host="<hostname>"} |= "error"`
+- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
+
+**Avoid:**
+- Using `start: "1h"` with no filters on busy hosts
+- Limits over 50 without specific filters
+
+### 4. Investigate User Activity
+
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
+
+**Always call the auditor when:**
+- A service stopped unexpectedly (may have been manually stopped)
+- A process was killed or a config was changed
+- You need to know who was logged in around the time of an incident
+- You need to understand what commands led to the current state
+- The cause isn't obvious from service logs alone
+
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
+
+**Example prompt for auditor:**
+```
+Investigate user activity on <hostname> between <start_time> and <end_time>.
+Context: The prometheus-node-exporter service stopped at 14:32.
+Determine if it was manually stopped and by whom.
+```
+
+Incorporate the auditor's findings into your timeline and root cause analysis.
+
+### 5. Check Configuration (if relevant)

 If the alert relates to a NixOS-managed service:
 - Check host configuration in `/hosts/<hostname>/`
@@ -52,9 +87,61 @@ If the alert relates to a NixOS-managed service:
 - Look for thresholds, resource limits, or misconfigurations
 - Check `homelab.host` options for tier/priority/role metadata

-### 5. Consider Common Causes
+### 6. Check for Configuration Drift
+
+Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
+- Hosts running outdated configurations
+- Recent changes that might have caused the issue
+- Whether a fix has already been committed but not deployed
+
+**Step 1: Get the deployed revision from Prometheus**
+```promql
+nixos_flake_info{hostname="<hostname>"}
+```
+The `current_rev` label contains the deployed git commit hash.
+
+**Step 2: Check if the host is behind master**
+```
+resolve_ref("master")           # Get current master commit
+is_ancestor(deployed, master)   # Check if host is behind
+```
+
+**Step 3: See what commits are missing**
+```
+commits_between(deployed, master)  # List commits not yet deployed
+```
+
+**Step 4: Check which files changed**
+```
+get_diff_files(deployed, master)   # Files modified since deployment
+```
+Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.
+
+**Step 5: View configuration at the deployed revision**
+```
+get_file_at_commit(deployed, "services/<service>/default.nix")
+```
+Compare against the current file to understand differences.
+
+**Step 6: Find when something changed**
+```
+search_commits("<service-name>")   # Find commits mentioning the service
+get_commit_info(<hash>)            # Get full details of a specific change
+```
+
+**Example workflow for a service-related alert:**
+1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
+2. `resolve_ref("master")` → `4633421`
+3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
+4. `commits_between("8959829", "4633421")` → 7 commits missing
+5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
+6. If a fix was committed after the deployed rev, recommend deployment
+
+### 7. Consider Common Causes

 For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
+- **Configuration drift**: Host running outdated config, fix already in master
 - **Disk space**: Nix store growth, logs, temp files
 - **Memory pressure**: Service memory leaks, insufficient limits
 - **CPU**: Runaway processes, build jobs
@@ -62,6 +149,8 @@ For infrastructure alerts, common causes include:
 - **Service restarts**: Failed upgrades, configuration errors
 - **Scrape failures**: Service down, firewall issues, port changes

+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
+
 ## Output Format

 Provide a concise report with one of two outcomes:
@@ -118,4 +207,5 @@ Provide a concise report with one of two outcomes:
 - If the alert is a false positive or expected behavior, explain why
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
- Include precursor events (logins, config changes, restarts) that led to the issue
+- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
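
The drift check from "Check for Configuration Drift" has a rough equivalent in plain git, which may be useful when the git-explorer MCP server is unavailable. This is a sketch under two assumptions: it runs inside a checkout of the configuration repository, and `resolve_ref`, `is_ancestor`, `commits_between` and `get_diff_files` behave like the git commands used here. The name `drift_report` is hypothetical.

```bash
# Hypothetical helper: compare a deployed revision against a target ref.
# Usage: drift_report <deployed-rev> [target-ref]   (target defaults to master)
drift_report() {
  deployed="$1"
  target="${2:-master}"
  # Resolve both refs to commit hashes; fail early on an unknown revision.
  deployed_sha=$(git rev-parse --verify --quiet "${deployed}^{commit}") || {
    echo "unknown revision: ${deployed}" >&2
    return 2
  }
  target_sha=$(git rev-parse --verify --quiet "${target}^{commit}") || {
    echo "unknown revision: ${target}" >&2
    return 2
  }
  if [ "$deployed_sha" = "$target_sha" ]; then
    echo "up to date"
  elif git merge-base --is-ancestor "$deployed_sha" "$target_sha"; then
    # Host is behind: count the missing commits, then list the files that changed.
    echo "behind by $(git rev-list --count "${deployed_sha}..${target_sha}") commit(s)"
    git diff --name-only "$deployed_sha" "$target_sha"
  else
    echo "diverged: deployed revision is not an ancestor of ${target}"
  fi
}
```

The deployed revision would come from the `current_rev` label of `nixos_flake_info`, as in Step 1 of that section.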
--- a/.mcp.json
+++ b/.mcp.json
@@ -33,6 +33,13 @@
        "--nats-url", "nats://nats1.home.2rjus.net:4222",
        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
      ]
+    },
+    "git-explorer": {
+      "command": "nix",
+      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
+      "env": {
+        "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
+      }
    }
  }
 }
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -35,6 +35,10 @@ nix build .#create-host

 Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.

+### SSH Commands
+
+Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.
+
 ### Testing Feature Branches on Hosts

 All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
--- a/docs/plans/auth-system-replacement.md
+++ b/docs/plans/auth-system-replacement.md
@@ -66,9 +66,9 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
   - Vault integration for idm_admin password
   - LDAPS on port 636

-2. **Configure declarative provisioning** ✅
-   - Groups: `admins`, `users`, `ssh-users`
-   - User: `torjus` (member of all groups)
+2. **Configure provisioning** ✅
+   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
+   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
   - POSIX attributes enabled (UID/GID range 65,536-69,999)

 3. **Test NAS integration** (in progress)
@@ -80,14 +80,16 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
   - Grafana
   - Other services as needed

-5. **Create client module** in `system/` for PAM/NSS
-   - Enable on all hosts that need central auth
-   - Configure trusted CA
+5. **Create client module** in `system/` for PAM/NSS ✅
+   - Module: `system/kanidm-client.nix`
+   - `homelab.kanidm.enable = true` enables PAM/NSS
+   - Short usernames (not SPN format)
+   - Home directory symlinks via `home_alias`
+   - Enabled on test tier: testvm01, testvm02, testvm03

-6. **Documentation**
-   - User management procedures
-   - Adding new OAuth2 clients
-   - Troubleshooting PAM/NSS issues
+6. **Documentation** ✅
+   - `docs/user-management.md` - CLI workflows, troubleshooting
+   - User/group creation procedures verified working

 ## Progress

@@ -106,14 +108,37 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
 - Prometheus monitoring scrape target configured

 **Provisioned entities:**
- Groups: `admins`, `users`, `ssh-users`
- User: `torjus` (member of all groups, POSIX enabled with GID 65536)
+- Groups: `admins`, `users`, `ssh-users` (declarative)
+- Users managed via CLI (imperative)

 **Verified working:**
 - WebUI login with idm_admin
 - LDAP bind and search with POSIX-enabled user
 - LDAPS with valid internal CA certificate

+### Completed (2026-02-08) - PAM/NSS Client
+
+**Client module deployed (`system/kanidm-client.nix`):**
+- `homelab.kanidm.enable = true` enables PAM/NSS integration
+- Connects to auth.home.2rjus.net
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based dir)
+- Login restricted to `ssh-users` group
+
+**Enabled on test tier:**
+- testvm01, testvm02, testvm03
+
+**Verified working:**
+- User/group resolution via `getent`
+- SSH login with Kanidm unix passwords
+- Home directory creation with symlinks
+- Imperative user/group creation via CLI
+
+**Documentation:**
+- `docs/user-management.md` with full CLI workflows
+- Password requirements (min 10 chars)
+- Troubleshooting guide (nscd, cache invalidation)
+
 ### UID/GID Range (Resolved)

 **Range: 65,536 - 69,999** (manually allocated)
@@ -128,10 +153,9 @@ Rationale:

 ### Next Steps

-1. Deploy to monitoring01 to enable Prometheus scraping
+1. Enable PAM/NSS on production hosts (after test tier validation)
 2. Configure TrueNAS LDAP client for NAS integration testing
 3. Add OAuth2 clients (Grafana first)
-4. Create PAM/NSS client module for other hosts

 ## References

--- a/docs/plans/memory-issues-follow-up.md
+++ b/docs/plans/memory-issues-follow-up.md
@@ -0,0 +1,116 @@
+# Memory Issues Follow-up
+
+Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
+
+## Background
+
+On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
+
+Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
+
+## Fix Applied
+
+**Commit:** `1674b6a` - system: enable zram swap for all hosts
+
+**Merged:** 2026-02-08 ~12:15 UTC
+
+**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
+
+## Timeline
+
+| Time (UTC) | Event |
+|------------|-------|
+| 05:00:46 | ns2 nixos-upgrade OOM killed |
+| 05:01:47 | `nixos_upgrade_failed` alert fired |
+| 12:15 | zram commit merged to master |
+| 12:19 | ns2 rebooted with zram enabled |
+| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
+
+## Hosts Affected
+
+All 2GB VMs that run nixos-upgrade:
+- ns1, ns2 (DNS)
+- vault01
+- testvm01, testvm02, testvm03
+- kanidm01
+
+## Metrics to Monitor
+
+Check these in Grafana or via PromQL to verify the fix:
+
+### Swap availability (should be ~2GB after upgrade)
+```promql
+node_memory_SwapTotal_bytes / 1024 / 1024
+```
+
+### Swap usage during upgrades
+```promql
+(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
+```
+
+### Zswap compressed bytes (zswap only; zram activity appears as swap usage above)
+```promql
+node_memory_Zswap_bytes / 1024 / 1024
+```
+
+### Upgrade failures (should be 0)
+```promql
+node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
+```
+
+### Memory available during upgrades
+```promql
+node_memory_MemAvailable_bytes / 1024 / 1024
+```
+
+## Verification Steps
+
+After a few days (allow auto-upgrades to run on all hosts):
+
+1. Check all hosts have swap enabled:
+   ```promql
+   node_memory_SwapTotal_bytes > 0
+   ```
+
+2. Check for any upgrade failures since the fix:
+   ```promql
+   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
+   ```
+
+3. Review if any hosts used swap during upgrades (check historical graphs)
+
+## Success Criteria
+
+- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
+- All hosts show ~2GB swap available
+- Upgrades complete successfully on 2GB VMs
+
+## Fallback Options
+
+If zram is insufficient:
+
+1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
+2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
+3. **Use remote builds** - Configure `nix.buildMachines` to offload builds (evaluation still runs locally, so this may not prevent evaluation OOMs)
+4. **Reduce flake size** - Split configurations to reduce evaluation memory
+
+### Memory Ballooning
+
+Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
+
+Configuration in `terraform/vms.tf`:
+```hcl
+memory  = 4096  # maximum memory
+balloon = 2048  # minimum memory (shrinks to this when idle)
+```
+
+Pros:
+- VMs get memory on-demand without reboots
+- Better host memory utilization
+- Solves upgrade OOM without permanently allocating 4GB
+
+Cons:
+- Requires the virtio balloon driver in the guest
+- Guest can experience memory pressure if host is overcommitted
+
+Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
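
The swap expressions under "Metrics to Monitor" can be spot-checked on a host without Prometheus, because node_exporter derives those metrics from `/proc/meminfo`. The following is a small sketch. The function names are hypothetical, and the optional file argument exists only so the arithmetic can be tried against a sample file.

```bash
# Hypothetical helpers mirroring the PromQL above; /proc/meminfo reports kB, output is MiB.

# Equivalent of: node_memory_SwapTotal_bytes / 1024 / 1024
swap_total_mib() {
  awk '/^SwapTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Equivalent of: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
swap_used_mib() {
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END { printf "%d\n", (total - free) / 1024 }' "${1:-/proc/meminfo}"
}
```

Running `swap_total_mib` with no argument on a host after the zram change should report a nonzero value if the swap device is active.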
--- a/docs/plans/security-hardening.md
+++ b/docs/plans/security-hardening.md
@@ -0,0 +1,224 @@
+# Security Hardening Plan
+
+## Overview
+
+Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
+
+## Current State
+
+- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
+- Firewall disabled on all hosts (`networking.firewall.enable = false`)
+- Promtail ships logs over HTTP to Loki
+- Loki has no authentication (`auth_enabled = false`)
+- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
+- Vault TLS verification disabled by default (`skipTlsVerify = true`)
+- Audit logging exists (`common/ssh-audit.nix`) but not applied globally
+- Alert rules focus on availability, no security event detection
+
+## Priority Matrix
+
+| Issue | Severity | Effort | Priority |
+|-------|----------|--------|----------|
+| SSH password auth | High | Low | **P1** |
+| Firewall disabled | High | Medium | **P1** |
+| Promtail HTTP (no TLS) | High | Medium | **P2** |
+| No security alerting | Medium | Low | **P2** |
+| Audit logging not global | Low | Low | **P2** |
+| Loki no auth | Medium | Medium | **P3** |
+| Secret-ID TTL | Medium | Medium | **P3** |
+| Vault skipTlsVerify | Medium | Low | **P3** |
+
+## Phase 1: Quick Wins (P1)
+
+### 1.1 SSH Hardening
+
+Edit `system/sshd.nix`:
+
+```nix
+services.openssh = {
+  enable = true;
+  settings = {
+    PermitRootLogin = "prohibit-password";  # Key-only root login
+    PasswordAuthentication = false;
+    KbdInteractiveAuthentication = false;
+  };
+};
+```
+
+**Prerequisite:** Verify all hosts have SSH keys deployed for root.
+
+### 1.2 Enable Firewall
+
+Create `system/firewall.nix` with default deny policy:
+
+```nix
+{ ... }: {
+  networking.firewall.enable = true;
+
+  # Use openssh's built-in firewall integration
+  services.openssh.openFirewall = true;
+}
+```
+
+**Useful firewall options:**
+
+| Option | Description |
+|--------|-------------|
+| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
+| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
+| `networking.firewall.extraInputRules` | Custom nftables rules for complex filtering (requires `networking.nftables.enable = true`) |
+
+**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
+
+#### Per-Interface Rules (http-proxy WireGuard)
+
+The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
+
+```nix
+# Example: http-proxy with different rules per interface
+networking.firewall = {
+  enable = true;
+
+  # Default: only SSH (via openFirewall)
+  allowedTCPPorts = [ ];
+
+  # LAN interface: allow HTTP/HTTPS
+  interfaces.ens18 = {
+    allowedTCPPorts = [ 80 443 ];
+  };
+
+  # WireGuard interface: restrict to specific services or trust fully
+  interfaces.wg0 = {
+    allowedTCPPorts = [ 80 443 ];
+    # Or set networking.firewall.trustedInterfaces = [ "wg0" ] instead if fully trusted
+  };
+};
+```
+
+**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
+
+Then per-host, open required ports:
+
+| Host | Additional Ports |
+|------|------------------|
+| ns1/ns2 | 53 (TCP/UDP) |
+| vault01 | 8200 |
+| monitoring01 | 3100, 9090, 3000, 9093 |
+| http-proxy | 80, 443 |
+| nats1 | 4222 |
+| ha1 | 1883, 8123 |
+| jelly01 | 8096 |
+| nix-cache01 | 5000 |
+
+## Phase 2: Logging & Detection (P2)
+
+### 2.1 Enable TLS for Promtail → Loki
+
+Update `system/monitoring/logs.nix`:
+
+```nix
+clients = [{
+  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
+  tls_config = {
+    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
+  };
+}];
+```
+
+Requires:
+- Configure Loki with TLS certificate (use internal ACME)
+- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)
+
+### 2.2 Security Alert Rules
+
+Add to `services/monitoring/rules.yml`:
+
+```yaml
+- name: security_rules
+  rules:
+    - alert: ssh_auth_failures
+      expr: increase(node_logind_sessions_total[5m]) > 20
+      for: 0m
+      labels:
+        severity: warning
+      annotations:
+        summary: "Unusual login activity on {{ $labels.instance }}"
+
+    - alert: vault_secret_fetch_failure
+      expr: increase(vault_secret_failures[5m]) > 5
+      for: 0m
+      labels:
+        severity: warning
+      annotations:
+        summary: "Vault secret fetch failures on {{ $labels.instance }}"
+```
+
+Also add Loki-based alerts for:
+- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
+- sudo usage: `{job="systemd-journal"} |= "sudo"`
+
+### 2.3 Global Audit Logging
+
+Import `common/ssh-audit.nix` from `system/default.nix`:
+
+```nix
+imports = [
+  # ... existing imports
+  ../common/ssh-audit.nix
+];
+```
+
+## Phase 3: Defense in Depth (P3)
+
+### 3.1 Loki Authentication
+
+Options:
+1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
+2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
+3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy
+
+Recommendation: Option 1 (reverse proxy) is simplest for homelab.
+
+### 3.2 AppRole Secret Rotation
+
+Update `terraform/vault/approle.tf`:
+
+```hcl
+secret_id_ttl  = 2592000  # 30 days
+```
+
+Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.
+
+### 3.3 Enable Vault TLS Verification
+
+Change default in `system/vault-secrets.nix`:
+
+```nix
+skipTlsVerify = mkOption {
+  type = types.bool;
+  default = false;  # Changed from true
+};
+```
+
+**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.
+
+## Implementation Order
+
+1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
+2. **Validate SSH access** - Ensure key-based login works before disabling passwords
+3. **Document firewall ports** - Create reference of ports per host before enabling
+4. **Phased prod rollout** - Deploy to prod hosts one at a time, verify each
+
+## Open Questions
+
+- [ ] Do all hosts have SSH keys configured for root access?
+- [ ] Should firewall rules be per-host or use a central definition with roles?
+- [ ] Should Loki authentication use the existing Kanidm setup?
+
+**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.
+
+## Notes
+
+- Firewall changes are the highest risk - test thoroughly on test-tier
+- SSH hardening must not lock out access - verify keys first
+- Consider creating a "break glass" procedure for emergency access if keys fail
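
The "Validate SSH access" step can be paired with a check that the hardened settings from section 1.1 are actually in effect. This sketch parses `sshd -T` output, which prints the effective configuration as lowercase keyword/value pairs. `check_sshd_hardening` is a hypothetical helper. OpenSSH may report `prohibit-password` under its older alias `without-password`, so both are accepted.

```bash
# Hypothetical helper: read `sshd -T` output on stdin and verify the Phase 1 SSH settings.
# Prints "OK" and exits 0 when all settings match; otherwise prints one FAIL line per mismatch.
check_sshd_hardening() {
  awk '
    $1 == "passwordauthentication"       { pw = $2 }
    $1 == "kbdinteractiveauthentication" { kbd = $2 }
    $1 == "permitrootlogin"              { root = $2 }
    END {
      ok = 1
      if (pw != "no")  { print "FAIL passwordauthentication=" pw; ok = 0 }
      if (kbd != "no") { print "FAIL kbdinteractiveauthentication=" kbd; ok = 0 }
      if (root != "prohibit-password" && root != "without-password") {
        print "FAIL permitrootlogin=" root; ok = 0
      }
      if (ok) print "OK"
      exit !ok
    }'
}
```

On a host this would be run as root, for example `sshd -T | check_sshd_hardening`, before and after deploying the change.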
--- a/docs/user-management.md
+++ b/docs/user-management.md
@@ -0,0 +1,267 @@
+# User Management with Kanidm
+
+Central authentication for the homelab using Kanidm.
+
+## Overview
+
+- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
+- **WebUI**: https://auth.home.2rjus.net
+- **LDAPS**: port 636
+
+## CLI Setup
+
+The `kanidm` CLI is available in the devshell:
+
+```bash
+nix develop
+
+# Login as idm_admin
+kanidm login --name idm_admin --url https://auth.home.2rjus.net
+```
+
+## User Management
+
+POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
+all attributes (including UNIX password) in one workflow.
+
+### Creating a POSIX User
+
+```bash
+# Create the person
+kanidm person create <username> "<Display Name>"
+
+# Add to groups
+kanidm group add-members ssh-users <username>
+
+# Enable POSIX (UID is auto-assigned)
+kanidm person posix set <username>
+
+# Set UNIX password (required for SSH login, min 10 characters)
+kanidm person posix set-password <username>
+
+# Optionally set login shell
+kanidm person posix set <username> --shell /bin/zsh
+```
+
+### Example: Full User Creation
+
+```bash
+kanidm person create testuser "Test User"
+kanidm group add-members ssh-users testuser
+kanidm person posix set testuser
+kanidm person posix set-password testuser
+kanidm person get testuser
+```
+
+After creation, verify on a client host:
+```bash
+getent passwd testuser
+ssh testuser@testvm01.home.2rjus.net
+```
+
+### Viewing User Details
+
+```bash
+kanidm person get <username>
+```
+
+### Removing a User
+
+```bash
+kanidm person delete <username>
+```
+
+## Group Management
+
+Groups for POSIX access are also managed via CLI.
+
+### Creating a POSIX Group
+
+```bash
+# Create the group
+kanidm group create <group-name>
+
+# Enable POSIX with a specific GID
+kanidm group posix set <group-name> --gidnumber <gid>
+```
+
+### Adding Members
+
+```bash
+kanidm group add-members <group-name> <username>
+```
+
+### Viewing Group Details
+
+```bash
+kanidm group get <group-name>
+kanidm group list-members <group-name>
+```
+
+### Example: Full Group Creation
+
+```bash
+kanidm group create testgroup
+kanidm group posix set testgroup --gidnumber 68010
+kanidm group add-members testgroup testuser
+kanidm group get testgroup
+```
+
+After creation, verify on a client host:
+```bash
+getent group testgroup
+```
+
+### Current Groups
+
+| Group | GID | Purpose |
+|-------|-----|---------|
+| ssh-users | 68000 | SSH login access |
+| admins | 68001 | Administrative access |
+| users | 68002 | General users |
+
+### UID/GID Allocation
+
+Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:
+
+| Range | Purpose |
+|-------|---------|
+| 65,536+ | Users (auto-assigned) |
+| 68,000 - 68,999 | Groups (manually assigned) |
+
+## PAM/NSS Client Configuration
+
+Enable central authentication on a host:
+
+```nix
+homelab.kanidm.enable = true;
+```
+
+This configures:
+- `services.kanidm.enablePam = true`
+- Client connection to auth.home.2rjus.net
+- Login authorization for `ssh-users` group
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based directory)
+
+### Enabled Hosts
+
+- testvm01, testvm02, testvm03 (test tier)
+
+### Options
+
+```nix
+homelab.kanidm = {
+  enable = true;
+  server = "https://auth.home.2rjus.net";  # default
+  allowedLoginGroups = [ "ssh-users" ];     # default
+};
+```
+
+### Home Directories
+
+Home directories use UUID-based paths for stability (so renaming a user doesn't
+require moving their home directory). Symlinks provide convenient access:
+
+```
+/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
+```
+
+The symlinks are created by `kanidm-unixd-tasks` on first login.
+
+## Testing
+
+### Verify NSS Resolution
+
+```bash
+# Check user resolution
+getent passwd <username>
+
+# Check group resolution
+getent group <group-name>
+```
+
+### Test SSH Login
+
+```bash
+ssh <username>@<hostname>.home.2rjus.net
+```
+
+## Troubleshooting
+
+### "PAM user mismatch" error
+
+SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
+usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).
+
+**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).
+
+Check current format:
+```bash
+getent passwd torjus
+# Should show: torjus:x:65536:...
+# NOT: torjus@home.2rjus.net:x:65536:...
+```
+
+### User resolves but SSH fails immediately
+
+The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:
+
+```bash
+# Check if group has POSIX
+getent group ssh-users
+
+# If empty, enable POSIX on the server
+kanidm group posix set ssh-users --gidnumber 68000
+```
+
+### User doesn't resolve via getent
+
+1. Check kanidm-unixd service is running:
+   ```bash
+   systemctl status kanidm-unixd
+   ```
+
+2. Check unixd can reach server:
+   ```bash
+   kanidm-unix status
+   # Should show: system: online, Kanidm: online
+   ```
+
+3. Check client can reach server:
+   ```bash
+   curl -s https://auth.home.2rjus.net/status
+   ```
+
+4. Check user has POSIX enabled on server:
+   ```bash
+   kanidm person get <username>
+   ```
+
+5. Restart nscd to clear stale cache:
+   ```bash
+   systemctl restart nscd
+   ```
+
+6. Invalidate kanidm cache:
+   ```bash
+   kanidm-unix cache-invalidate
+   ```
+
+### Changes not taking effect after deployment
+
+NixOS runs nsncd (a Rust reimplementation of nscd) under the `nscd` unit to proxy NSS
+lookups. After deploying kanidm-unixd config changes, you may need to restart both services:
+
+```bash
+systemctl restart kanidm-unixd
+systemctl restart nscd
+```
+
+### Test PAM authentication directly
+
+Use the kanidm-unix CLI to test PAM auth without SSH:
+
+```bash
+kanidm-unix auth-test --name <username>
+```
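The short-name check from the "PAM user mismatch" section of the document above can be scripted. This is a minimal sketch, assuming bash and a single `getent passwd` line as input; the function name `is_spn_format` is hypothetical and not part of Kanidm.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a Kanidm command): succeeds when the name field
# of a passwd entry is in SPN format (contains '@'), fails for a short name.
is_spn_format() {
  local entry="$1"
  local name="${entry%%:*}"   # keep everything before the first ':'
  [[ "$name" == *@* ]]
}
```

On a host, it could be called as `is_spn_format "$(getent passwd torjus)" && echo "SPN names returned, check uid_attr_map"`. Only the name field is inspected, so an `@` elsewhere in the entry does not trigger a match.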
--- a/flake.nix
+++ b/flake.nix
@@ -207,6 +207,7 @@
              pkgs.ansible
              pkgs.opentofu
              pkgs.openbao
+              pkgs.kanidm_1_8
              (pkgs.callPackage ./scripts/create-host { })
              homelab-deploy.packages.${pkgs.system}.default
            ];
--- a/hosts/jelly01/configuration.nix
+++ b/hosts/jelly01/configuration.nix
@@ -64,9 +64,5 @@
  vault.enable = true;
  homelab.deploy.enable = true;

-  zramSwap = {
-    enable = true;
-  };
-
  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/nix-cache01/default.nix
+++ b/hosts/nix-cache01/default.nix
@@ -4,6 +4,5 @@
    ./configuration.nix
    ../../services/nix-cache
    ../../services/actions-runner
-    ./zram.nix
  ];
 }
--- a/hosts/nix-cache01/zram.nix
+++ b/hosts/nix-cache01/zram.nix
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  zramSwap = {
-    enable = true;
-  };
-}
--- a/hosts/template2/configuration.nix
+++ b/hosts/template2/configuration.nix
@@ -79,5 +79,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

+  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
+  zramSwap.enable = true;
+
  system.stateVersion = "25.11";
 }
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -25,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/hosts/testvm02/configuration.nix
+++ b/hosts/testvm02/configuration.nix
@@ -25,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/hosts/testvm03/configuration.nix
+++ b/hosts/testvm03/configuration.nix
@@ -25,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/services/kanidm/default.nix
+++ b/services/kanidm/default.nix
@@ -17,7 +17,8 @@
      };
    };

-    # Provisioning - initial users/groups
+    # Provision base groups only - users are managed via CLI
+    # See docs/user-management.md for details
    provision = {
      enable = true;
      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
@@ -28,10 +29,7 @@
        ssh-users = { };
      };

-      persons.torjus = {
-        displayName = "Torjus";
-        groups = [ "admins" "users" "ssh-users" ];
-      };
+      # Regular users (persons) are managed imperatively via kanidm CLI
    };
  };

@@ -39,12 +37,14 @@
  users.users.kanidm.extraGroups = [ "acme" ];

  # ACME certificate from internal CA
+  # Include both the CNAME (auth) and A record (kanidm01) for Prometheus scraping
  security.acme.certs."auth.home.2rjus.net" = {
    listenHTTP = ":80";
    reloadServices = [ "kanidm" ];
+    extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
  };

-  # Vault secret for idm_admin password
+  # Vault secret for idm_admin password (used for provisioning)
  vault.secrets.kanidm-idm-admin = {
    secretPath = "kanidm/idm-admin-password";
    extractKey = "password";
@@ -53,12 +53,13 @@
    group = "kanidm";
  };

-  # Monitoring scrape target
-  homelab.monitoring.scrapeTargets = [
-    {
-      job_name = "kanidm";
-      port = 443;
-      scheme = "https";
-    }
-  ];
+  # Note: Kanidm does not expose Prometheus metrics
+  # If metrics support is added in the future, uncomment:
+  # homelab.monitoring.scrapeTargets = [
+  #   {
+  #     job_name = "kanidm";
+  #     port = 443;
+  #     scheme = "https";
+  #   }
+  # ];
 }
--- a/system/default.nix
+++ b/system/default.nix
@@ -4,6 +4,7 @@
    ./acme.nix
    ./autoupgrade.nix
    ./homelab-deploy.nix
+    ./kanidm-client.nix
    ./monitoring
    ./motd.nix
    ./packages.nix
@@ -12,5 +13,6 @@
    ./pki/root-ca.nix
    ./sshd.nix
    ./vault-secrets.nix
+    ./zram.nix
  ];
 }
--- a/system/kanidm-client.nix
+++ b/system/kanidm-client.nix
@@ -0,0 +1,42 @@
+{ lib, config, pkgs, ... }:
+let
+  cfg = config.homelab.kanidm;
+in
+{
+  options.homelab.kanidm = {
+    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";
+
+    server = lib.mkOption {
+      type = lib.types.str;
+      default = "https://auth.home.2rjus.net";
+      description = "URI of the Kanidm server";
+    };
+
+    allowedLoginGroups = lib.mkOption {
+      type = lib.types.listOf lib.types.str;
+      default = [ "ssh-users" ];
+      description = "Groups allowed to log in via PAM";
+    };
+  };
+
+  config = lib.mkIf cfg.enable {
+    services.kanidm = {
+      package = pkgs.kanidm_1_8;
+      enablePam = true;
+
+      clientSettings = {
+        uri = cfg.server;
+      };
+
+      unixSettings = {
+        pam_allowed_login_groups = cfg.allowedLoginGroups;
+        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
+        # This prevents "PAM user mismatch" errors with SSH
+        uid_attr_map = "name";
+        gid_attr_map = "name";
+        # Create symlink /home/torjus -> /home/<uuid> (home_attr defaults to the UUID)
+        home_alias = "name";
+      };
+    };
+  };
+}
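With `home_alias = "name"`, the documentation in this PR expects `/home/<name>` to be a symlink to a UUID-named directory. A minimal sketch of how that could be checked after a first login, assuming bash; the helper name `is_uuid_home` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the last component of a path looks
# like a lowercase UUID, e.g. /home/e4f4c56c-4aee-4c20-846f-90cb69807733.
is_uuid_home() {
  local target="${1##*/}"   # strip everything up to the last '/'
  local re='^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
  [[ "$target" =~ $re ]]
}
```

A possible invocation is `is_uuid_home "$(readlink -f /home/torjus)"`, which resolves the symlink first and then tests the real directory name.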
--- a/system/zram.nix
+++ b/system/zram.nix
@@ -0,0 +1,8 @@
+# Compressed swap in RAM
+#
+# Provides overflow memory during Nix builds and upgrades.
+# Prevents OOM kills on low-memory hosts (2GB VMs).
+{ ... }:
+{
+  zramSwap.enable = true;
+}
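To confirm that zram swap is active on a host after this change, one option is to look for a zram device in `/proc/swaps`. A minimal sketch, assuming bash and the usual `/dev/zramN` device naming; the helper name `has_zram_swap` is hypothetical, and the optional file argument exists only to make it testable.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the given swaps table (default
# /proc/swaps) lists at least one /dev/zram* device.
has_zram_swap() {
  local swaps_file="${1:-/proc/swaps}"
  grep -q '^/dev/zram' "$swaps_file"
}
```

Running `has_zram_swap && echo "zram swap active"` on a deployed host gives a quick yes/no answer without parsing sizes or priorities.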
Author	SHA1	Message	Date
Torjus Håkestad	9ed09c9a9c	docs: add user-management documentation - CLI workflows for creating users and groups - Troubleshooting guide (nscd, cache invalidation) - Home directory behavior (UUID-based with symlinks) - Update auth-system-replacement plan with progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:21 +01:00
Torjus Håkestad	b31c64f1b9	kanidm: remove declarative user provisioning Keep base groups (admins, users, ssh-users) provisioned declaratively but manage regular users via the kanidm CLI. This allows setting POSIX attributes and passwords in a single workflow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:03 +01:00
Torjus Håkestad	54b6e37420	flake: add kanidm to devshell Add kanidm_1_8 CLI for administering the Kanidm server. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	b845a8bb8b	system: add kanidm PAM/NSS client module Add homelab.kanidm.enable option for central authentication via Kanidm. The module configures: - PAM/NSS integration with kanidm-unixd - Client connection to auth.home.2rjus.net - Login authorization for ssh-users group Enable on testvm01-03 for testing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	bfbf0cea68	template2: enable zram for bootstrap Prevents OOM during initial nixos-rebuild on 2GB VMs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:34:08 +01:00
Torjus Håkestad	3abe5e83a7	docs: add memory ballooning as fallback option Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:29:42 +01:00
Torjus Håkestad	67c27555f3	docs: add memory issues follow-up plan Track zram change effectiveness for OOM prevention during upgrades. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:26:31 +01:00
Torjus Håkestad	1674b6a844	system: enable zram swap for all hosts Provides compressed swap in RAM to prevent OOM kills during nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram configs from jelly01 and nix-cache01. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:02:58 +01:00
Torjus Håkestad	311be282b6	docs: add security hardening plan Based on security review findings, covering SSH hardening, firewall enablement, log transport TLS, security alerting, and secrets management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:26:15 +01:00
Torjus Håkestad	11cbb64097	claude: make auditor delegation explicit in investigate-alarm - Changed section 4 from "if needed" to always spawn auditor - Added explicit "Do NOT query audit logs yourself" guidance - Listed specific scenarios requiring auditor (service stopped, etc.) - Added manual intervention as first common cause - Updated guidelines to emphasize mandatory delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:11:09 +01:00
Torjus Håkestad	e2dd21c994	claude: add auditor agent and git-explorer MCP Add new auditor agent for security-focused audit log analysis: - SSH session tracking, command execution, sudo usage - Suspicious activity detection patterns - Can be used standalone or as sub-agent by investigate-alarm Update investigate-alarm to delegate audit analysis to auditor and add git-explorer MCP for configuration drift detection. Add git-explorer to .mcp.json for repository inspection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 04:48:55 +01:00
Torjus Håkestad	463342133e	kanidm: remove non-functional metrics scrape target Kanidm does not expose a Prometheus /metrics endpoint. The scrape target was causing 404 errors after the TLS certificate issue was fixed. Also add SSH command restriction to CLAUDE.md. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:34:12 +01:00
Torjus Håkestad	de36b9d016	kanidm: add hostname SAN to ACME certificate Include both auth.home.2rjus.net (CNAME) and kanidm01.home.2rjus.net (A record) as SANs in the TLS certificate. This fixes Prometheus scraping which connects via the hostname, not the CNAME. Fixes: x509: certificate is valid for auth.home.2rjus.net, not kanidm01.home.2rjus.net Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:29:54 +01:00
Torjus Håkestad	3f1d966919	claude: improve investigate-alarm log query guidelines Add best practices for querying Loki to avoid overwhelming responses: - Start with narrow filters and small limits - Filter audit logs to EXECVE only - Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF) - Expand queries incrementally if needed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:14:54 +01:00