# Compare commits

41 commits in range `4e8ecb8a99...pipe-to-lo`

- 78eb04205f
- 19cb61ebbc
- 9ed09c9a9c
- b31c64f1b9
- 54b6e37420
- b845a8bb8b
- bfbf0cea68
- 3abe5e83a7
- 67c27555f3
- 1674b6a844
- 311be282b6
- 11cbb64097
- e2dd21c994
- 463342133e
- de36b9d016
- 3f1d966919
- 7fcc043a4d
- 70ec5f8109
- c2ec34cab9
- 8fbf1224fa
- 8959829f77
- 93dbb45802
- 538c2ad097
- d99c82c74c
- ca0e3fd629
- 732e9b8c22
- 3a14ffd6b5
- f9a3961457
- 003d4ccf03
- 735b8a9ee3
- 94feae82a0
- 3f94f7ee95
- b7e398c9a7
- 8ec2a083bd
- ec4ac1477e
- e937c68965
- 98e808cd6c
- ba9f47f914
- 1066e81ba8
- f0950b33de
- bf199bd7c6
## .claude/agents/auditor.md (new file, 180 lines)
---
name: auditor
description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
tools: Read, Grep, Glob
mcpServers:
- lab-monitoring
---

You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.

## Input

You may receive:
- A host or list of hosts to investigate
- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
- Optional context: specific events to look for, a user to focus on, or suspicious activity to investigate
- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")

## Audit Log Structure

Logs are shipped to Loki via Promtail. Audit events use these labels:
- `host` - hostname
- `systemd_unit` - typically `auditd.service` for audit logs
- `job` - typically `systemd-journal`

Audit log entries contain structured data:
- `EXECVE` - command execution with full arguments
- `USER_LOGIN` / `USER_LOGOUT` - session start/end
- `USER_CMD` - sudo command execution
- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
- `SERVICE_START` / `SERVICE_STOP` - systemd service events

## Investigation Techniques

### 1. SSH Session Activity

Find SSH logins and session activity:
```logql
{host="<hostname>", systemd_unit="sshd.service"}
```

Look for:
- Accepted/failed authentication
- Sessions opened/closed
- Unusual source IPs or users
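
For instance, failed and successful authentication can be pulled apart with substring filters (a sketch; `Failed password` and `Accepted` are the standard OpenSSH log messages, so adjust them if the sshd configuration differs):

```logql
{host="<hostname>", systemd_unit="sshd.service"} |= "Failed password"   # brute-force / guessing attempts
{host="<hostname>", systemd_unit="sshd.service"} |= "Accepted"          # successful logins with source IPs
```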

### 2. Command Execution

Query executed commands (filter out noise):
```logql
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
```

Further filtering:
- Exclude systemd noise: `!= "systemd" != "/nix/store"`
- Focus on specific commands: `|= "rm" |= "-rf"`
- Focus on a specific user: `|= "uid=1000"`
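
Stacked together, a typical narrow starting point looks like this (a sketch; `uid=1000` is a placeholder for the user under investigation):

```logql
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF" != "systemd" != "/nix/store" |= "uid=1000"
```

Drop the trailing `|= "uid=1000"` filter to see commands from all users.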

### 3. Sudo Activity

Check for privilege escalation:
```logql
{host="<hostname>"} |= "sudo" |= "COMMAND"
```

Or via audit:
```logql
{host="<hostname>"} |= "USER_CMD"
```

### 4. Service Manipulation

Check if services were manually stopped or started:
```logql
{host="<hostname>"} |= "EXECVE" |= "systemctl"
```

### 5. File Operations

Look for file modifications (if auditd rules are configured):
```logql
{host="<hostname>"} |= "EXECVE" |= "vim"
{host="<hostname>"} |= "EXECVE" |= "nano"
{host="<hostname>"} |= "EXECVE" |= "rm"
```

## Query Guidelines

**Start narrow, expand if needed:**
- Begin with `limit: 20-30`
- Use tight time windows: `start: "15m"` or `start: "30m"`
- Add filters progressively

**Avoid:**
- Querying all audit logs without an EXECVE filter (extremely verbose)
- Large time ranges without specific filters
- Limits over 50 without tight filters

**Time-bounded queries:**
When investigating around a specific event:
```logql
{host="<hostname>"} |= "EXECVE" != "systemd"
```
With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`.

## Suspicious Patterns to Watch For

1. **Unusual login times** - Activity outside normal hours
2. **Failed authentication** - Brute-force attempts
3. **Privilege escalation** - Unexpected sudo usage
4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
6. **Persistence mechanisms** - Cron modifications, systemd service creation
7. **Log tampering** - Commands targeting log files
8. **Lateral movement** - SSH to other internal hosts
9. **Service manipulation** - Stopping security services, disabling firewalls
10. **Cleanup activity** - Deleting bash history, clearing logs
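
Several of these patterns can be swept for in one pass with a regex line filter over EXECVE records (a sketch; the alternation list is illustrative, not exhaustive - extend it with the patterns above as needed):

```logql
{host="<hostname>"} |= "EXECVE" |~ "whoami|uname|/etc/passwd|history -c|shred"
```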

## Output Format

### For Standalone Security Reviews

```
## Activity Summary

**Host:** <hostname>
**Time Period:** <start> to <end>
**Sessions Found:** <count>

## User Sessions

### Session 1: <user> from <source_ip>
- **Login:** HH:MM:SSZ
- **Logout:** HH:MM:SSZ (or ongoing)
- **Commands executed:**
  - HH:MM:SSZ - <command>
  - HH:MM:SSZ - <command>

## Suspicious Activity

[If any patterns from the watch list were detected]
- **Finding:** <description>
- **Evidence:** <log entries>
- **Risk Level:** Low / Medium / High

## Summary

[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
```

### When Called by Another Agent

Provide a focused response addressing the specific question:

```
## Audit Findings

**Query:** <what was asked>
**Time Window:** <investigated period>

## Relevant Activity

[Chronological list of relevant events]
- HH:MM:SSZ - <event>
- HH:MM:SSZ - <event>

## Assessment

[Direct answer to the question with supporting evidence]
```

## Guidelines

- Reconstruct timelines chronologically
- Correlate events (login → commands → logout)
- Note gaps or missing data
- Distinguish between automated (systemd, cron) and interactive activity
- Consider the host's role and tier when assessing severity
- When called by another agent, focus on answering their specific question
- Don't speculate without evidence - state what the logs do and do not show

## .claude/agents/investigate-alarm.md (new file, 211 lines)
---
name: investigate-alarm
description: Investigates a single system alarm by querying Prometheus metrics and Loki logs, analyzing configuration files for affected hosts/services, and providing root cause analysis.
tools: Read, Grep, Glob
mcpServers:
- lab-monitoring
- git-explorer
---

You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.

## Input

You will receive information about an alarm, which may include:
- Alert name and severity
- Affected host or service
- Alert expression/threshold
- Current value or status
- When it started firing

## Investigation Process

### 1. Understand the Alert Context

Start by understanding what the alert is measuring:
- Use `get_alert` if you have a fingerprint, or `list_alerts` to find matching alerts
- Use `get_metric_metadata` to understand the metric being monitored
- Use `search_metrics` to find related metrics

### 2. Query Current State

Gather evidence about the current system state:
- Use `query` to check the current metric values and related metrics
- Use `list_targets` to verify the host/service is being scraped successfully
- Look for correlated metrics that might explain the issue
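
For host-level alerts, a quick first check is whether the target is being scraped at all (the `hostname` label is the one this setup attaches to scrape targets):

```promql
up{hostname="<hostname>"}   # 0 = scrape failing, 1 = target healthy
```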

### 3. Check Service Logs

Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.

**Query strategies (start narrow, expand if needed):**
- Start with `limit: 20-30`, increase only if needed
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`

**Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- All errors on a host: `{host="<hostname>"} |= "error"`
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`

**Avoid:**
- Using `start: "1h"` with no filters on busy hosts
- Limits over 50 without specific filters

### 4. Investigate User Activity

For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.

**Always call the auditor when:**
- A service stopped unexpectedly (it may have been manually stopped)
- A process was killed or a config was changed
- You need to know who was logged in around the time of an incident
- You need to understand what commands led to the current state
- The cause isn't obvious from service logs alone

**Do NOT try to query audit logs yourself.** The auditor is specialized for:
- Parsing EXECVE records and reconstructing command lines
- Correlating SSH sessions with commands executed
- Identifying suspicious patterns
- Filtering out systemd/nix-store noise

**Example prompt for the auditor:**
```
Investigate user activity on <hostname> between <start_time> and <end_time>.
Context: The prometheus-node-exporter service stopped at 14:32.
Determine if it was manually stopped and by whom.
```

Incorporate the auditor's findings into your timeline and root cause analysis.

### 5. Check Configuration (if relevant)

If the alert relates to a NixOS-managed service:
- Check the host configuration in `/hosts/<hostname>/`
- Check service modules in `/services/<service>/`
- Look for thresholds, resource limits, or misconfigurations
- Check `homelab.host` options for tier/priority/role metadata

### 6. Check for Configuration Drift

Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
- Hosts running outdated configurations
- Recent changes that might have caused the issue
- Whether a fix has already been committed but not deployed

**Step 1: Get the deployed revision from Prometheus**
```promql
nixos_flake_info{hostname="<hostname>"}
```
The `current_rev` label contains the deployed git commit hash.

**Step 2: Check if the host is behind master**
```
resolve_ref("master")              # Get current master commit
is_ancestor(deployed, master)      # Check if host is behind
```

**Step 3: See what commits are missing**
```
commits_between(deployed, master)  # List commits not yet deployed
```

**Step 4: Check which files changed**
```
get_diff_files(deployed, master)   # Files modified since deployment
```
Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.

**Step 5: View configuration at the deployed revision**
```
get_file_at_commit(deployed, "services/<service>/default.nix")
```
Compare against the current file to understand the differences.

**Step 6: Find when something changed**
```
search_commits("<service-name>")   # Find commits mentioning the service
get_commit_info(<hash>)            # Get full details of a specific change
```

**Example workflow for a service-related alert:**
1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
2. `resolve_ref("master")` → `4633421`
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
4. `commits_between("8959829", "4633421")` → 7 commits missing
5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
6. If a fix was committed after the deployed rev, recommend deployment

### 7. Consider Common Causes

For infrastructure alerts, common causes include:
- **Manual intervention**: Service manually stopped/restarted (call the auditor to confirm)
- **Configuration drift**: Host running an outdated config, fix already in master
- **Disk space**: Nix store growth, logs, temp files
- **Memory pressure**: Service memory leaks, insufficient limits
- **CPU**: Runaway processes, build jobs
- **Network**: DNS issues, connectivity problems
- **Service restarts**: Failed upgrades, configuration errors
- **Scrape failures**: Service down, firewall issues, port changes

**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
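
When the systemd-exporter job covers the affected host, its unit-state metric can confirm when the service left the active state (a sketch; verify the exact metric name with `search_metrics`, since it varies between exporter versions):

```promql
systemd_unit_state{name="<service>.service", state="active"}   # 1 while active, 0 after the stop
```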

## Output Format

Provide a concise report with one of two outcomes:

### If Root Cause Identified:

```
## Root Cause
[1-2 sentence summary of the root cause]

## Timeline
[Chronological sequence of relevant events leading to the alert]
- HH:MM:SSZ - [Event description]
- HH:MM:SSZ - [Event description]
- HH:MM:SSZ - [Alert fired]

### Timeline sources
- HH:MM:SSZ - [Source for this event: which metric or log stream]
- HH:MM:SSZ - [Source for this event: which metric or log stream]
- HH:MM:SSZ - [Alert fired]

## Evidence
- [Specific metric values or log entries that support the conclusion]
- [Configuration details if relevant]

## Recommended Actions
1. [Specific remediation step]
2. [Follow-up actions if any]
```

### If Root Cause Unclear:

```
## Investigation Summary
[What was checked and what was found]

## Possible Causes
- [Hypothesis 1 with supporting/contradicting evidence]
- [Hypothesis 2 with supporting/contradicting evidence]

## Additional Information Needed
- [Specific data, logs, or access that would help]
- [Suggested queries or checks for the operator]
```

## Guidelines

- Be concise and actionable
- Reference specific metric names and values as evidence
- Include log snippets when they're informative
- Don't speculate without evidence
- If the alert is a false positive or expected behavior, explain why
- Consider the host's tier (test vs prod) when assessing severity
- Build a timeline from log timestamps and metrics to show the sequence of events
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:

Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams

@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:

- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before Promtail is available). These logs use `job="bootstrap"` with additional labels:

- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)

**Bootstrap stages:**

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

**Bootstrap queries:**

```logql
{job="bootstrap"}                              # All bootstrap logs
{job="bootstrap", host="myhost"}               # Specific host
{job="bootstrap", stage="failed"}              # All failures
{job="bootstrap", stage=~"building|success"}   # Track build progress
```

### Extracting JSON Fields

Parse JSON and filter on fields:
@@ -175,15 +205,39 @@ Disk space (root filesystem):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```

### Service-Specific Metrics
### Prometheus Jobs

Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA

All available Prometheus job names:

**System exporters (on all/most hosts):**
- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics

**Service-specific exporters:**
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)

**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway

**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)

### Target Labels

@@ -237,6 +291,7 @@ Current host labels:

| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| kanidm01 | `role=auth`, `tier=test` |
| testvm01/02/03 | `tier=test` |

---

@@ -265,6 +320,17 @@ Current host labels:

3. Check service logs for startup issues
4. Check service metrics are being scraped

### Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
3. After success, verify the host appears in metrics: `up{hostname="<hostname>"}`
4. Check logs are flowing: `{host="<hostname>"}`

See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.

### Debug SSH/Access Issues
@@ -33,6 +33,13 @@

```json
      "--nats-url", "nats://nats1.home.2rjus.net:4222",
      "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
    ]
  },
  "git-explorer": {
    "command": "nix",
    "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
    "env": {
      "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
    }
  }
  }
}
```

## CLAUDE.md (103 lines changed)
@@ -35,6 +35,10 @@ nix build .#create-host

Do not automatically deploy changes. Deployments are usually done by updating the master branch and then triggering the auto-update on the specific host.

### SSH Commands

Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.

### Testing Feature Branches on Hosts

All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:

@@ -152,82 +156,16 @@ Two MCP servers are available for searching NixOS options and packages:

This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.

### Lab Monitoring Log Queries
### Lab Monitoring

The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:

**Loki Label Reference:**

- Available Prometheus jobs and exporters
- Loki labels and LogQL query syntax
- Bootstrap log monitoring for new VMs
- Common troubleshooting workflows

- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)

Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.

**Bootstrap Logs:**

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before Promtail is available). These logs use `job="bootstrap"` with additional labels:

- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`

Query bootstrap status:
```
{job="bootstrap"}                              # All bootstrap logs
{job="bootstrap", host="testvm01"}             # Specific host
{job="bootstrap", stage="failed"}              # All failures
{job="bootstrap", stage=~"building|success"}   # Track build progress
```

**Example LogQL queries:**
```
# Logs from a specific service on a host
{host="ns2", systemd_unit="nsd.service"}

# Substring match on log content
{host="ns1", systemd_unit="nsd.service"} |= "error"

# File-based logs (e.g., caddy access logs)
{job="varlog", hostname="nix-cache01"}
```

Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.

### Lab Monitoring Prometheus Queries

The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.

**Prometheus Job Names:**

- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics
- `jellyfin` - Media server metrics
- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `ghettoptt` / `alertmanager` - Other service metrics

**Example PromQL queries:**
```
# Check all targets are up
up

# CPU usage for a specific host
rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])

# Memory usage across all hosts
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Disk space
node_filesystem_avail_bytes{mountpoint="/"}
```
The skill contains up-to-date information about all scrape targets, host labels, and example queries.

### Deploying to Test Hosts

@@ -496,20 +434,11 @@ This means:

### Adding a New Host

1. Create a `/hosts/<hostname>/` directory
2. Copy the structure from `template1` or a similar host
3. Add a host entry to `flake.nix` nixosConfigurations
4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
6. Add `vault.enable = true;` to the host configuration
7. Add an AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
8. Run `tofu apply` in `terraform/vault/`
9. User clones the template host
10. User runs `prepare-host.sh` on the new host
11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
12. Commit changes and merge to master
13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host
14. Run auto-upgrade on the DNS servers (ns1, ns2) to pick up the new host's DNS entry

See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:
- Using the `create-host` script to generate host configurations
- Deploying VMs and secrets with OpenTofu
- Monitoring the bootstrap process via Loki
- Verification and troubleshooting steps

**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.

@@ -13,7 +13,6 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos

| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `jelly01` | Jellyfin media server |
| `nix-cache01` | Nix binary cache |
| `pgdb1` | PostgreSQL |
| `nats1` | NATS messaging |
| `vault01` | OpenBao (Vault) secrets management |
| `template1`, `template2` | VM templates for cloning new hosts |

## common/ssh-audit.nix (new file, 21 lines)
```nix
# SSH session command auditing
#
# Logs all commands executed by users who logged in interactively (SSH).
# System services and nix builds are excluded via auid filter.
#
# Logs are sent to journald and forwarded to Loki via promtail.
# Query with: {host="<hostname>"} |= "EXECVE"
{
  # Enable Linux audit subsystem
  security.audit.enable = true;
  security.auditd.enable = true;

  # Log execve syscalls only from interactive login sessions
  # auid!=4294967295 means "audit login uid is set" (excludes system services, nix builds)
  security.audit.rules = [
    "-a exit,always -F arch=b64 -S execve -F auid!=4294967295"
  ];

  # Forward audit logs to journald (so promtail ships them to Loki)
  services.journald.audit = true;
}
```

## docs/host-creation.md (new file, 217 lines)
|
||||
# Host Creation Pipeline
|
||||
|
||||
This document describes the process for creating new hosts in the homelab infrastructure.
|
||||
|
||||
## Overview
|
||||
|
||||
We use the `create-host` script to create new hosts, which generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot using a template image (built from `hosts/template2`), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.
|
||||
|
||||
```bash
|
||||
nix develop
|
||||
```
|
||||
|
||||
## Steps
|
||||
|
||||
Steps marked with **USER** must be performed by the user due to credential requirements.
|
||||
|
||||
1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
3. Add any secrets needed to `terraform/vault/`
4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
5. Push configuration to master (or the branch specified by `flake_branch`)
6. **USER**: Apply terraform:

   ```bash
   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply
   ```

7. Once terraform completes, a VM boots in Proxmox using the template image
8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
9. After reboot, the host should be operational
10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets

## Tier Specification

New hosts should set `homelab.host.tier` in their configuration:

```nix
homelab.host.tier = "test"; # or "prod"
```

- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
## Observability

During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:

```
{job="bootstrap", host="<hostname>"}
```

### Bootstrap Stages

The bootstrap process reports these stages via the `stage` label:

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

### Useful Queries

```
# All bootstrap activity for a host
{job="bootstrap", host="myhost"}

# Track all failures
{job="bootstrap", stage="failed"}

# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}
```

Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
## Verification

1. Check bootstrap completed successfully:

   ```
   {job="bootstrap", host="<hostname>", stage="success"}
   ```

2. Verify the host is up and reporting metrics:

   ```promql
   up{instance=~"<hostname>.*"}
   ```

3. Verify the correct flake revision is deployed:

   ```promql
   nixos_flake_info{instance=~"<hostname>.*"}
   ```

4. Check logs are flowing:

   ```
   {host="<hostname>"}
   ```

5. Confirm expected services are running and producing logs
## Troubleshooting

### Bootstrap Failed

#### Common Issues
* The VM fails to complete the initial nixos-rebuild. This usually happens when packages are missing from our local nix-cache and must be compiled from scratch, which can exhaust the VM's resources.

#### Troubleshooting

1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:

   ```
   {job="bootstrap", host="<hostname>"}
   ```

2. **USER**: SSH into the host and check the bootstrap service:

   ```bash
   ssh root@<hostname>
   journalctl -u nixos-bootstrap.service
   ```

3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:

   ```bash
   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
   ```

4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
### Vault Credentials Not Working
Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or already been used.

#### Troubleshooting

1. Check if credentials exist on the host:

   ```bash
   ssh root@<hostname>
   ls -la /var/lib/vault/approle/
   ```

2. Check bootstrap logs for vault-related stages:

   ```
   {job="bootstrap", host="<hostname>", stage=~"vault.*"}
   ```

3. **USER**: Regenerate and provision credentials manually:

   ```bash
   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
   ```
### Host Not Appearing in DNS
Usually caused by the commit adding the new host not yet being deployed to ns1/ns2.

#### Troubleshooting

1. Verify the host config has a static IP configured in `systemd.network.networks`

2. Check that `homelab.dns.enable` is not set to `false`

3. **USER**: Trigger auto-upgrade on DNS servers:

   ```bash
   ssh root@ns1 systemctl start nixos-upgrade.service
   ssh root@ns2 systemctl start nixos-upgrade.service
   ```

4. Verify DNS resolution after upgrade completes:

   ```bash
   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
   ```
### Host Not Being Scraped by Prometheus
Usually caused by the commit adding the new host not yet being deployed to the monitoring host.

#### Troubleshooting

1. Check that `homelab.monitoring.enable` is not set to `false`

2. **USER**: Trigger auto-upgrade on monitoring01:

   ```bash
   ssh root@monitoring01 systemctl start nixos-upgrade.service
   ```

3. Verify the target appears in Prometheus:

   ```promql
   up{instance=~"<hostname>.*"}
   ```

4. If the target is down, check that node-exporter is running on the host:

   ```bash
   ssh root@<hostname> systemctl status prometheus-node-exporter.service
   ```
## Related Files

| Path | Description |
|------|-------------|
| `scripts/create-host/` | The `create-host` script that generates host configurations |
| `hosts/template2/` | Template VM configuration (base image for new VMs) |
| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
| `terraform/vault/approle.tf` | AppRole policies for each host |
| `terraform/vault/secrets.tf` | Secret definitions in Vault |
| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
| `flake.nix` | Flake with all host configurations (add new hosts here) |
@@ -2,7 +2,7 @@

## Overview

Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
Deploy a modern, unified authentication solution for the homelab. Provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.

## Goals

@@ -11,66 +11,9 @@ Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authe
3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)

## Options Evaluated
## Solution: Kanidm

### OpenLDAP (raw)

- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
- **Pros:** Most widely supported, very flexible
- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
- **Verdict:** Doesn't address LDAP complexity concerns

### LLDAP + Authelia (current)

- **NixOS Support:** Both have good modules
- **Pros:** Already configured, lightweight, nice web UIs
- **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
- **Verdict:** Workable but has friction for NAS/UID goals

### FreeIPA

- **NixOS Support:** None
- **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
- **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
- **Verdict:** Overkill, no NixOS support

### Keycloak

- **NixOS Support:** None
- **Pros:** Good OIDC/SAML, nice UI
- **Cons:** Primarily an identity broker not a user directory, poor POSIX support, heavy (Java)
- **Verdict:** Wrong tool for Linux user management

### Authentik

- **NixOS Support:** None (would need Docker)
- **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
- **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
- **Verdict:** Would work but requires Docker and is heavy

### Kanidm

- **NixOS Support:** Excellent - first-class module with PAM/NSS integration
- **Pros:**
  - Native PAM/NSS module (no SSSD needed)
  - Built-in OIDC provider
  - Optional LDAP interface for legacy services
  - Declarative provisioning via NixOS (users, groups, OAuth2 clients)
  - Modern, written in Rust
  - Single service handles everything
- **Cons:** Newer project, smaller community than LDAP
- **Verdict:** Best fit for requirements

### Pocket-ID

- **NixOS Support:** Unknown
- **Pros:** Very lightweight, passkey-first
- **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
- **Verdict:** Doesn't solve Linux user management goal

## Recommendation: Kanidm

Kanidm is the recommended solution for the following reasons:
Kanidm was chosen for the following reasons:

| Requirement | Kanidm Support |
|-------------|----------------|
@@ -82,42 +25,10 @@ Kanidm is the recommended solution for the following reasons:
| Simplicity | Modern API, LDAP optional |
| NixOS integration | First-class |

### Key NixOS Features
### Configuration Files

**Server configuration:**
```nix
services.kanidm.enableServer = true;
services.kanidm.serverSettings = {
  domain = "home.2rjus.net";
  origin = "https://auth.home.2rjus.net";
  ldapbindaddress = "0.0.0.0:636"; # Optional LDAP interface
};
```

**Declarative user provisioning:**
```nix
services.kanidm.provision.enable = true;
services.kanidm.provision.persons.torjus = {
  displayName = "Torjus";
  groups = [ "admins" "nas-users" ];
};
```

**Declarative OAuth2 clients:**
```nix
services.kanidm.provision.systems.oauth2.grafana = {
  displayName = "Grafana";
  originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
  originLanding = "https://grafana.home.2rjus.net";
};
```

**Client host configuration (add to system/):**
```nix
services.kanidm.enableClient = true;
services.kanidm.enablePam = true;
services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
```
- **Host configuration:** `hosts/kanidm01/`
- **Service module:** `services/kanidm/default.nix`
## NAS Integration

@@ -148,42 +59,103 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti

## Implementation Steps

1. **Create Kanidm service module** in `services/kanidm/`
   - Server configuration
   - TLS via internal ACME
   - Vault secrets for admin passwords
1. **Create kanidm01 host and service module** ✅
   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
   - Service module: `services/kanidm/`
   - TLS via internal ACME (`auth.home.2rjus.net`)
   - Vault integration for idm_admin password
   - LDAPS on port 636

2. **Configure declarative provisioning**
   - Define initial users and groups
   - Set up POSIX attributes (UID/GID ranges)
2. **Configure provisioning** ✅
   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
   - POSIX attributes enabled (UID/GID range 65,536-69,999)

3. **Add OIDC clients** for homelab services
   - Grafana
   - Other services as needed

4. **Create client module** in `system/` for PAM/NSS
   - Enable on all hosts that need central auth
   - Configure trusted CA

5. **Test NAS integration**
3. **Test NAS integration** (in progress)
   - ✅ LDAP interface verified working
   - Configure TrueNAS LDAP client to connect to Kanidm
   - Verify UID/GID mapping works with NFS shares

6. **Migrate auth01**
   - Remove LLDAP and Authelia services
   - Deploy Kanidm
   - Update DNS CNAMEs if needed
4. **Add OIDC clients** for homelab services
   - Grafana
   - Other services as needed

7. **Documentation**
   - User management procedures
   - Adding new OAuth2 clients
   - Troubleshooting PAM/NSS issues
5. **Create client module** in `system/` for PAM/NSS ✅
   - Module: `system/kanidm-client.nix`
   - `homelab.kanidm.enable = true` enables PAM/NSS
   - Short usernames (not SPN format)
   - Home directory symlinks via `home_alias`
   - Enabled on test tier: testvm01, testvm02, testvm03

## Open Questions
6. **Documentation** ✅
   - `docs/user-management.md` - CLI workflows, troubleshooting
   - User/group creation procedures verified working

- What UID/GID range should be reserved for Kanidm-managed users?
- Which hosts should have PAM/NSS enabled initially?
- What OAuth2 clients are needed at launch?
## Progress
### Completed (2026-02-08)

**Kanidm server deployed on kanidm01 (test tier):**
- Host: `kanidm01.home.2rjus.net` (10.69.13.23)
- WebUI: `https://auth.home.2rjus.net`
- LDAPS: port 636
- Valid certificate from internal CA

**Configuration:**
- Kanidm 1.8 with secret provisioning support
- Daily backups at 22:00 (7 versions retained)
- Vault integration for idm_admin password
- Prometheus monitoring scrape target configured

**Provisioned entities:**
- Groups: `admins`, `users`, `ssh-users` (declarative)
- Users managed via CLI (imperative)

**Verified working:**
- WebUI login with idm_admin
- LDAP bind and search with POSIX-enabled user
- LDAPS with valid internal CA certificate

### Completed (2026-02-08) - PAM/NSS Client

**Client module deployed (`system/kanidm-client.nix`):**
- `homelab.kanidm.enable = true` enables PAM/NSS integration
- Connects to auth.home.2rjus.net
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based dir)
- Login restricted to `ssh-users` group

**Enabled on test tier:**
- testvm01, testvm02, testvm03

**Verified working:**
- User/group resolution via `getent`
- SSH login with Kanidm unix passwords
- Home directory creation with symlinks
- Imperative user/group creation via CLI

**Documentation:**
- `docs/user-management.md` with full CLI workflows
- Password requirements (min 10 chars)
- Troubleshooting guide (nscd, cache invalidation)
### UID/GID Range (Resolved)

**Range: 65,536 - 69,999** (manually allocated)

- Users: 65,536 - 67,999 (up to ~2500 users)
- Groups: 68,000 - 69,999 (up to ~2000 groups)

Rationale:
- Starts at Kanidm's recommended minimum (65,536)
- Well above NixOS system users (typically <1000)
- Avoids Podman/container issues with very high GIDs
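
Since users are managed imperatively, creating one inside this range could look roughly like the following. This is a sketch only: the account name, UID, and token flow are illustrative, and the exact subcommands and flags should be checked against `kanidm --help` for the deployed version.

```bash
# Hypothetical example user "alice"; run as an admin session (--name idm_admin)
kanidm person create alice "Alice Example" --name idm_admin
kanidm person posix set alice --gidnumber 65537 --name idm_admin
kanidm group add-members ssh-users alice --name idm_admin
kanidm person credential create-reset-token alice --name idm_admin
```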

### Next Steps

1. Enable PAM/NSS on production hosts (after test tier validation)
2. Configure TrueNAS LDAP client for NAS integration testing
3. Add OAuth2 clients (Grafana first)

## References

107 docs/plans/completed/ns1-recreation.md Normal file
@@ -0,0 +1,107 @@
# ns1 Recreation Plan

## Overview

Recreate ns1 using the OpenTofu workflow after the existing VM entered emergency mode due to incorrect hardware-configuration.nix (hardcoded UUIDs that don't match actual disk layout).

## Current ns1 Configuration to Preserve

- **IP:** 10.69.13.5/24
- **Gateway:** 10.69.13.1
- **Role:** Primary DNS (authoritative + resolver)
- **Services:**
  - `../../services/ns/master-authorative.nix`
  - `../../services/ns/resolver.nix`
- **Metadata:**
  - `homelab.host.role = "dns"`
  - `homelab.host.labels.dns_role = "primary"`
- **Vault:** enabled
- **Deploy:** enabled

## Execution Steps

### Phase 1: Remove Old Configuration

```bash
nix develop -c create-host --remove --hostname ns1 --force
```

This removes:
- `hosts/ns1/` directory
- Entry from `flake.nix`
- Any terraform entries (none exist currently)

### Phase 2: Create New Configuration

```bash
nix develop -c create-host --hostname ns1 --ip 10.69.13.5/24
```

This creates:
- `hosts/ns1/` with template2-based configuration
- Entry in `flake.nix`
- Entry in `terraform/vms.tf`
- Vault wrapped token for bootstrap

### Phase 3: Customize Configuration

After create-host, manually update `hosts/ns1/configuration.nix` to add:

1. DNS service imports:

   ```nix
   ../../services/ns/master-authorative.nix
   ../../services/ns/resolver.nix
   ```

2. Host metadata:

   ```nix
   homelab.host = {
     tier = "prod";
     role = "dns";
     labels.dns_role = "primary";
   };
   ```

3. Disable resolved (conflicts with Unbound):

   ```nix
   services.resolved.enable = false;
   ```

### Phase 4: Commit Changes

```bash
git add -A
git commit -m "ns1: recreate with OpenTofu workflow

Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
that didn't match actual disk layout, causing boot failure.

Recreated using template2-based configuration for OpenTofu provisioning."
```

### Phase 5: Infrastructure

1. Delete old ns1 VM in Proxmox (it's broken anyway)
2. Run `nix develop -c tofu -chdir=terraform apply`
3. Wait for bootstrap to complete
4. Verify ns1 is functional:
   - DNS resolution working
   - Zone transfer to ns2 working
   - All exporters responding

### Phase 6: Finalize

- Push to master
- Move this plan to `docs/plans/completed/`

## Rollback

If the new VM fails:
1. ns2 is still operational as secondary DNS
2. Can recreate with different settings if needed

## Notes

- ns2 will continue serving DNS during the migration
- Zone data is generated from flake, so no data loss
- The old VM's disk can be kept briefly in Proxmox as backup if desired
@@ -9,24 +9,23 @@ hosts are decommissioned or deferred.

## Current State

Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`

Hosts to migrate:

| Host | Category | Notes |
|------|----------|-------|
| ns1 | Stateless | Primary DNS, recreate |
| ns2 | Stateless | Secondary DNS, recreate |
| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| auth01 | Decommission | No longer in use |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Stateful | PostgreSQL databases |
| jump | Decommission | No longer needed |
| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
| ~~ca~~ | ~~Deferred~~ | ✓ Complete |
## Phase 1: Backup Preparation

@@ -46,39 +45,19 @@ No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` whi
Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.
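
A minimal sketch of such a job with the NixOS restic module (the job name, repository URL, password file path, and schedule are assumptions, not taken from the repo):

```nix
# Hypothetical job definition for jelly01; adjust names and paths to the
# repo's conventions before use.
services.restic.backups.jellyfin = {
  paths = [ "/var/lib/jellyfin" ];
  repository = "rest:https://restic.home.2rjus.net/jelly01"; # placeholder
  passwordFile = "/run/secrets/restic-password";             # placeholder
  timerConfig.OnCalendar = "22:00";
  pruneOpts = [ "--keep-last 7" ];
};
```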

### 1c. Add PostgreSQL Backup to pgdb1

No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
all databases and roles. The dump should be piped through restic's stdin backup (similar to
the Grafana DB dump pattern on monitoring01).
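
The stdin pattern could look roughly like this (the repository URL and password file path are placeholders; in practice this would run from a systemd unit or a restic pre-hook):

```bash
# Dump every database and role, streaming straight into restic (no temp file).
# Repository and password-file paths are illustrative.
pg_dumpall -U postgres \
  | restic -r rest:https://restic.home.2rjus.net/pgdb1 \
      --password-file /run/secrets/restic-password \
      backup --stdin --stdin-filename pgdb1-dumpall.sql
```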

### 1d. Verify Existing ha1 Backup
### 1c. Verify Existing ha1 Backup

ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.

### 1e. Verify All Backups
### 1d. Verify All Backups

After adding/expanding backup jobs:
1. Trigger a manual backup run on each host
2. Verify backup integrity with `restic check`
3. Test a restore to a temporary location to confirm data is recoverable

## Phase 2: Declare pgdb1 Databases in Nix

Before migrating pgdb1, audit the manually-created databases and users on the running
instance, then declare them in the Nix configuration using `ensureDatabases` and
`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.

Steps:
1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
3. Document any non-default PostgreSQL settings or extensions per database

After reprovisioning, the databases will be created by NixOS, and data restored from the
`pg_dumpall` backup.
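
The declaration would look something like this (the database and user names are illustrative, not the result of the audit):

```nix
# Illustrative names only - the real list comes from auditing the live instance
services.postgresql = {
  enable = true;
  ensureDatabases = [ "openwebui" ];
  ensureUsers = [
    {
      name = "openwebui";
      ensureDBOwnership = true; # grants ownership of the same-named database
    }
  ];
};
```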

## Phase 3: Stateless Host Migration
## Phase 2: Stateless Host Migration

These hosts have no meaningful state and can be recreated fresh. For each host:

@@ -95,13 +74,14 @@ Migrate stateless hosts in an order that minimizes disruption:

1. **nix-cache01** — low risk, no downstream dependencies during migration
2. **nats1** — low risk, verify no persistent JetStream streams first
4. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
5. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
4. ~~**ns1** — ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete

For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ Both ns1
and ns2 migration complete. Zone transfer (AXFR) verified working between ns1 (primary) and
ns2 (secondary).

## Phase 4: Stateful Host Migration
## Phase 3: Stateful Host Migration

For each stateful host, the procedure is:

@@ -114,17 +94,7 @@ For each stateful host, the procedure is:
7. Start services and verify functionality
8. Decommission the old VM

### 4a. pgdb1

1. Run final `pg_dumpall` backup via restic
2. Stop PostgreSQL on the old host
3. Provision new pgdb1 via OpenTofu
4. After bootstrap, NixOS creates the declared databases/users
5. Restore data with `pg_restore` or `psql < dumpall.sql`
6. Verify database connectivity from gunter (`10.69.30.105`)
7. Decommission old VM

### 4b. monitoring01
### 3a. monitoring01

1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
@@ -134,7 +104,7 @@ For each stateful host, the procedure is:
6. Verify all scrape targets are being collected
7. Decommission old VM

### 4c. jelly01
### 3b. jelly01

1. Run final Jellyfin backup
2. Provision new jelly01 via OpenTofu
@@ -143,7 +113,7 @@ For each stateful host, the procedure is:
5. Start Jellyfin, verify watch history and library metadata are present
6. Decommission old VM

### 4d. ha1
### 3c. ha1

1. Verify latest restic backup is current
2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
@@ -167,47 +137,69 @@ OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropr
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.
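
Assuming a Telmate-style `proxmox_vm_qemu` resource, that `usb` block could look roughly like this. The resource name, device ID, and attribute names are placeholders; check the provider actually used in `terraform/vms.tf` and `lsusb` output on the hypervisor:

```hcl
# Hypothetical fragment - 1a86:7523 is a common Zigbee USB serial chip ID,
# shown only as a placeholder for the real vendor:product pair.
resource "proxmox_vm_qemu" "ha1" {
  # ... existing VM settings ...

  usb {
    host = "1a86:7523" # vendor:product ID from lsusb on the hypervisor
    usb3 = false
  }
}
```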

## Phase 5: Decommission jump and auth01 Hosts
## Phase 4: Decommission Hosts

### jump
1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
2. Remove host configuration from `hosts/jump/`
3. Remove from `flake.nix`
4. Remove any secrets in `secrets/jump/`
5. Remove from `.sops.yaml`
### jump ✓ COMPLETE

~~1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)~~
~~2. Remove host configuration from `hosts/jump/`~~
~~3. Remove from `flake.nix`~~
~~4. Remove any secrets in `secrets/jump/`~~
~~5. Remove from `.sops.yaml`~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~

Host was already removed from flake.nix and VM destroyed. Configuration cleaned up in ba9f47f.

### auth01 ✓ COMPLETE

~~1. Remove host configuration from `hosts/auth01/`~~
~~2. Remove from `flake.nix`~~
~~3. Remove any secrets in `secrets/auth01/`~~
~~4. Remove from `.sops.yaml`~~
~~5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~

Host configuration, services, and VM already removed.

### pgdb1 (in progress)

Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.

1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
4. ~~Remove from `flake.nix`~~ ✓
5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
6. Destroy the VM in Proxmox
7. Commit cleanup
7. ~~Commit cleanup~~ ✓

### auth01
1. Remove host configuration from `hosts/auth01/`
2. Remove from `flake.nix`
3. Remove any secrets in `secrets/auth01/`
4. Remove from `.sops.yaml`
5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
6. Destroy the VM in Proxmox
7. Commit cleanup
See `docs/plans/pgdb1-decommission.md` for detailed plan.

## Phase 6: Decommission ca Host (Deferred)
## Phase 5: Decommission ca Host ✓ COMPLETE

Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
~~Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
the same cleanup steps as the jump host.
the same cleanup steps as the jump host.~~

## Phase 7: Remove sops-nix
PKI migration to OpenBao complete. Host configuration, `services/ca/`, and VM removed.

Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
all remnants:
- `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
  `hosts/template2/scripts.nix`)
## Phase 6: Remove sops-nix ✓ COMPLETE

See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
~~Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
all remnants:~~
~~- `sops-nix` input from `flake.nix` and `flake.lock`~~
~~- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`~~
~~- `inherit sops-nix` from all specialArgs in `flake.nix`~~
~~- `system/sops.nix` and its import in `system/default.nix`~~
~~- `.sops.yaml`~~
~~- `secrets/` directory~~
~~- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`~~
~~- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
  `hosts/template2/scripts.nix`)~~

All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migration.md` for context.

## Notes

@@ -216,7 +208,7 @@ See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
- The old VMs use IPs that the new VMs need, so the old VM must be shut down before
  the new one is provisioned (or use a temporary IP and swap after verification)
- Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only hosts not in OpenTofu will be ca (deferred)
- After all migrations are complete, all decommissioned hosts (jump, auth01, ca) have been removed
- Since many hosts are being recreated, this is a good opportunity to establish consistent
  hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
  (e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
116 docs/plans/memory-issues-follow-up.md Normal file
@@ -0,0 +1,116 @@
# Memory Issues Follow-up

Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.

## Background

On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.

Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.

## Fix Applied

**Commit:** `1674b6a` - system: enable zram swap for all hosts

**Merged:** 2026-02-08 ~12:15 UTC

**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
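The change itself is small; a sketch of what `system/zram.nix` plausibly contains, using the standard NixOS `zramSwap` module (the sizing option is an assumption inferred from the ~2GB figure above, not taken from the commit):

```nix
{ ... }:
{
  zramSwap = {
    enable = true;
    # Hypothetical sizing: the NixOS default is memoryPercent = 50; getting
    # ~2GB of swap on a 2GB host implies something closer to 100.
    memoryPercent = 100;
  };
}
```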
## Timeline

| Time (UTC) | Event |
|------------|-------|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
| 05:01:47 | `nixos_upgrade_failed` alert fired |
| 12:15 | zram commit merged to master |
| 12:19 | ns2 rebooted with zram enabled |
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |

## Hosts Affected

All 2GB VMs that run nixos-upgrade:
- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01

## Metrics to Monitor

Check these in Grafana or via PromQL to verify the fix:

### Swap availability (should be ~2GB after upgrade)
```promql
node_memory_SwapTotal_bytes / 1024 / 1024
```

### Swap usage during upgrades
```promql
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
```
### Zswap compressed bytes (active compression)
```promql
node_memory_Zswap_bytes / 1024 / 1024
```

Note: `Zswap` in `/proc/meminfo` tracks the kernel's zswap feature, which is separate from zram; with zram swap this metric may stay at 0. Per-device zram compression stats live in `/sys/block/zram0/mm_stat` and are not exported by node-exporter by default.
### Upgrade failures (should be 0)
```promql
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
```

### Memory available during upgrades
```promql
node_memory_MemAvailable_bytes / 1024 / 1024
```

## Verification Steps

After a few days (allow auto-upgrades to run on all hosts):

1. Check all hosts have swap enabled:
```promql
node_memory_SwapTotal_bytes > 0
```

2. Check for any upgrade failures since the fix:
```promql
count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
```

3. Review if any hosts used swap during upgrades (check historical graphs)
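Outside Prometheus, the same numbers can be spot-checked directly on a host from `/proc/meminfo` (a minimal sketch; run it over ssh per host):

```shell
# Print swap totals in MiB straight from the kernel; on a zram host this
# should show roughly the ~2GB the plan expects after the fix.
awk '/^SwapTotal|^SwapFree/ {printf "%s %d MiB\n", $1, $2/1024}' /proc/meminfo
```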
## Success Criteria

- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs

## Fallback Options

If zram is insufficient:

1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
3. **Use remote builds** - Configure `nix.buildMachines` to offload evaluation
4. **Reduce flake size** - Split configurations to reduce evaluation memory

### Memory Ballooning

Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.

Configuration in `terraform/vms.tf`:
```hcl
memory  = 4096 # maximum memory
balloon = 2048 # minimum memory (shrinks to this when idle)
```

Pros:
- VMs get memory on-demand without reboots
- Better host memory utilization
- Solves upgrade OOM without permanently allocating 4GB

Cons:
- Requires QEMU guest agent running in guest
- Guest can experience memory pressure if host is overcommitted

Ballooning and zram are complementary: ballooning provides headroom from the host, zram provides overflow within the guest.
219 docs/plans/monitoring-migration-victoriametrics.md Normal file
@@ -0,0 +1,219 @@
# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring01** (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)

**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture
```
                ┌─────────────────┐
                │  monitoring02   │
                │ VictoriaMetrics │
                │ + Grafana       │
monitoring      │ + Loki          │
CNAME ──────────│ + Tempo         │
                │ + Pyroscope     │
                │ + Alertmanager  │
                │   (vmalert)     │
                └─────────────────┘
                         ▲
                         │ scrapes
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────┴────┐     ┌────┴─────┐    ┌────┴─────┐
    │   ns1   │     │   ha1    │    │   ...    │
    │  :9100  │     │  :9100   │    │  :9100   │
    └─────────┘     └──────────┘    └──────────┘
```
## Implementation Plan

### Phase 1: Create monitoring02 Host

Use the `create-host` script, which handles flake.nix and terraform/vms.tf automatically.

1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
2. **Update VM resources** in `terraform/vms.tf`:
   - 4 cores (same as monitoring01)
   - 8GB RAM (double, for VictoriaMetrics headroom)
   - 100GB disk (for 3+ months retention with compression)
3. **Update host configuration**: Import monitoring services
4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`

### Phase 2: Set Up VictoriaMetrics Stack

Create a new service module at `services/monitoring/victoriametrics/` for testing alongside the existing
Prometheus config. Once validated, this can replace the Prometheus module.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
   - Migrate scrape configs via `prometheusConfig`
   - Use native push support (replaces Pushgateway)

2. **vmalert** for alerting rules:
   - `services.vmalert.enable = true`
   - Point to VictoriaMetrics for metrics evaluation
   - Keep rules in a separate `rules.yml` file (same format as Prometheus)
   - No receiver configured during parallel operation (prevents duplicate alerts)

3. **Alertmanager** (port 9093):
   - Keep existing configuration (alerttonotify webhook routing)
   - Only enable receiver after cutover from monitoring01

4. **Loki** (port 3100):
   - Same configuration as current

5. **Grafana** (port 3000):
   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
   - Reference existing dashboards on monitoring01 for content inspiration
   - Configure VictoriaMetrics datasource (port 8428)
   - Configure Loki datasource

6. **Tempo** (ports 3200, 3201):
   - Same configuration

7. **Pyroscope** (port 4040):
   - Same Docker-based deployment

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: Both hosts scrape the same targets
   - Validates VictoriaMetrics is collecting data correctly

2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add a second client in `system/monitoring/logs.nix` pointing to monitoring02

3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work

4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)

5. **Compare resource usage**: Monitor disk/memory consumption between hosts
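The dual log shipping in step 2 can be expressed as a second entry in the Promtail client list (a sketch against the NixOS `services.promtail` module; the monitoring02 URL is an assumption following the plan's naming scheme):

```nix
services.promtail.configuration.clients = [
  # Existing Loki on monitoring01
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  # Second client during the parallel phase; removed again in Phase 5
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
];
```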
### Phase 4: Add monitoring CNAME

Add a CNAME to monitoring02 once validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`

2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.

### Phase 6: Enable Alerting

Once ready to cut over:
1. Enable the Alertmanager receiver on monitoring02
2. Verify test alerts route correctly

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: Prevent duplicate alerts during transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state

## Open Questions

- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for the current set)

## VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

```nix
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3m"; # 3 months, increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
  rule = [ ./rules.yml ];
};
```

## Rollback Plan

If issues arise after cutover:
1. Move the `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert Promtail config to point only to monitoring01
4. Revert http-proxy backends

## Notes

- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
212 docs/plans/nix-cache-reprovision.md Normal file
@@ -0,0 +1,212 @@
# Nix Cache Host Reprovision

## Overview

Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master

## Current State

### Host Configuration
- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
- Runs Gitea Actions runner for CI workflows
- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
- Uses a dedicated XFS volume at `/nix` for cache storage

### Current Build System (`services/nix-cache/build-flakes.sh`)
- Runs every 30 minutes via systemd timer
- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
- Pushes success/failure metrics to pushgateway
- Simple, but has no filtering, no parallelism, and no remote triggering

### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
- Runs daily at midnight via cron
- Runs `nix flake update --commit-lock-file`
- Pushes directly to master
- No build validation — can push broken inputs

## Improvement 1: NATS-Based Remote Build Triggering

### Design

Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.

| Approach | Pros | Cons |
|----------|------|------|
| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |

**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.

### Implementation

1. Add a new message type to homelab-deploy: `build.<host>` subject
2. Listener on nix-cache01 subscribes to the `build.>` wildcard
3. On message receipt, builds the specified host and returns success/failure
4. CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`

### Benefits
- Trigger a rebuild for a specific host to ensure it's cached
- Could be called from CI after merging PRs
- Reuses existing NATS infrastructure and auth
- Progress/status could stream back via NATS reply

## Improvement 2: Smarter Flake Update Workflow

### Current Problems
1. Updates can push breaking changes to master
2. No visibility into what broke when it does
3. Hosts that auto-update can pull broken configs

### Proposed Workflow
```
┌─────────────────────────────────────────────────────────────────┐
│                      Flake Update Workflow                      │
├─────────────────────────────────────────────────────────────────┤
│  1. nix flake update (on feature branch)                        │
│  2. Build ALL hosts locally                                     │
│  3. If all pass → fast-forward merge to master                  │
│  4. If any fail → create PR with failure logs attached          │
└─────────────────────────────────────────────────────────────────┘
```
### Implementation Options

| Option | Description | Pros | Cons |
|--------|-------------|------|------|
| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |

**Recommendation:** Option A with nix-cache01 as the runner. The host is already running the Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.

### Workflow Steps

1. Workflow runs on schedule (daily or weekly)
2. Creates branch `flake-update/YYYY-MM-DD`
3. Runs `nix flake update --commit-lock-file`
4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
5. If all succeed:
   - Fast-forward merge to master
   - Delete feature branch
6. If any fail:
   - Create PR from the update branch
   - Attach build logs as a PR comment
   - Label PR with `needs-review` or `build-failure`
   - Do NOT merge automatically

### Workflow File Changes
```yaml
# New: .github/workflows/flake-update-safe.yaml
name: Safe flake update
on:
  schedule:
    - cron: "0 2 * * 0" # Weekly on Sunday at 2 AM
  workflow_dispatch: # Manual trigger

jobs:
  update-and-validate:
    runs-on: homelab # Use self-hosted runner on nix-cache01
    steps:
      - uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 0 # Need full history for merge

      - name: Create update branch
        run: |
          BRANCH="flake-update/$(date +%Y-%m-%d)"
          git checkout -b "$BRANCH"
          # Persist for later steps; shell variables don't survive across steps
          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"

      - name: Update flake
        run: nix flake update --commit-lock-file

      - name: Build all hosts
        id: build
        run: |
          set -o pipefail # otherwise `tee` masks nix build failures
          FAILED=""
          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
            echo "Building $host..."
            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
              FAILED="$FAILED $host"
            fi
          done
          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"

      - name: Merge to master (if all pass)
        if: steps.build.outputs.failed == ''
        run: |
          git checkout master
          git merge --ff-only "$BRANCH"
          git push origin master
          git push origin --delete "$BRANCH"

      - name: Create PR (if any fail)
        if: steps.build.outputs.failed != ''
        run: |
          git push origin "$BRANCH"
          # Create PR via Gitea API with build logs
          # ... (PR creation with log attachment)
```
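One subtlety in the build step: `nix build ... | tee` only reports the build's failure if `pipefail` is set; otherwise the pipeline returns `tee`'s exit status and every build looks successful. A self-contained demonstration (`fake_build` is a stand-in for a failing `nix build`):

```shell
#!/usr/bin/env bash
# Without pipefail, the pipeline's status is tee's (0), masking the failure.
fake_build() { return 1; }

set +o pipefail
fake_build 2>&1 | tee /dev/null
echo "without pipefail: exit=$?"

set -o pipefail
fake_build 2>&1 | tee /dev/null
echo "with pipefail: exit=$?"
```

Running it prints `without pipefail: exit=0` followed by `with pipefail: exit=1`, which is why the build step should start with `set -o pipefail`.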
## Migration Steps

### Phase 1: Reprovision Host via OpenTofu

1. Add `nix-cache01` to `terraform/vms.tf`:
```hcl
"nix-cache01" = {
  ip        = "10.69.13.15/24"
  cpu_cores = 4
  memory    = 8192
  disk_size = "100G" # Larger for nix store
}
```

2. Shut down the existing nix-cache01 VM
3. Run `tofu apply` to provision the new VM
4. Verify bootstrap completes and the cache is serving

**Note:** The cache will be cold after reprovision. Run initial builds to populate.

### Phase 2: Add Build Triggering to homelab-deploy

1. Add a `build` command to the homelab-deploy CLI
2. Add a listener handler in the NixOS module for `build.*` subjects
3. Update nix-cache01 config to enable the build listener
4. Test with `homelab-deploy build testvm01`

### Phase 3: Implement Safe Flake Update Workflow

1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove the old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor the first automated run

### Phase 4: Remove Old Build Script

1. After the new workflow is stable, remove:
   - `services/nix-cache/build-flakes.nix`
   - `services/nix-cache/build-flakes.sh`
2. The new workflow handles scheduled builds

## Open Questions

- [ ] What runner labels should the self-hosted runner use for the update workflow?
- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
- [ ] What to do about `gunter` (external nixos repo) - include in validation?
- [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?

## Notes

- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
- The Harmonia service and cache signing key will work the same after reprovision
- Actions runner token is in Vault, will be provisioned automatically
- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering
113 docs/plans/pgdb1-decommission.md Normal file
@@ -0,0 +1,113 @@
# pgdb1 Decommissioning Plan

## Overview

Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.

## Pre-flight Verification

Before proceeding, verify that gunter is no longer using pgdb1:

1. Check Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
2. Optionally: Check pgdb1 for recent connection activity:
```bash
ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;"'
```
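The check above always shows at least one row: the `psql` session running the query itself. A slightly narrower variant (a sketch; `pg_backend_pid()` excludes the checking session):

```sql
SELECT datname, usename, client_addr, state
FROM pg_stat_activity
WHERE datname IS NOT NULL
  AND pid <> pg_backend_pid();
```

Zero rows here means nothing else is connected.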
## Files to Remove

### Host Configuration
- `hosts/pgdb1/default.nix`
- `hosts/pgdb1/configuration.nix`
- `hosts/pgdb1/hardware-configuration.nix`
- `hosts/pgdb1/` (directory)

### Service Module
- `services/postgres/postgres.nix`
- `services/postgres/default.nix`
- `services/postgres/` (directory)

Note: This service module is only used by pgdb1, so it can be removed entirely.

### Flake Entry
Remove from `flake.nix` (lines 131-138):
```nix
pgdb1 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self;
  };
  modules = commonModules ++ [
    ./hosts/pgdb1
  ];
};
```

### Vault AppRole
Remove from `terraform/vault/approle.tf` (lines 69-73):
```hcl
"pgdb1" = {
  paths = [
    "secret/data/hosts/pgdb1/*",
  ]
}
```

### Monitoring Rules
Remove the `postgres_down` alert from `services/monitoring/rules.yml` (lines 359-365):
```yaml
- name: postgres_rules
  rules:
    - alert: postgres_down
      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
      for: 5m
      labels:
        severity: critical
```
### Utility Scripts
Delete `rebuild-all.sh` entirely (obsolete script).

## Execution Steps

### Phase 1: Verification
- [ ] Confirm Open WebUI on gunter uses local PostgreSQL
- [ ] Verify no active connections to pgdb1

### Phase 2: Code Cleanup
- [ ] Create feature branch: `git checkout -b decommission-pgdb1`
- [ ] Remove `hosts/pgdb1/` directory
- [ ] Remove `services/postgres/` directory
- [ ] Remove pgdb1 entry from `flake.nix`
- [ ] Remove postgres alert from `services/monitoring/rules.yml`
- [ ] Delete `rebuild-all.sh` (obsolete)
- [ ] Run `nix flake check` to verify no broken references
- [ ] Commit changes

### Phase 3: Terraform Cleanup
- [ ] Remove pgdb1 from `terraform/vault/approle.tf`
- [ ] Run `tofu plan` in `terraform/vault/` to preview changes
- [ ] Run `tofu apply` to remove the AppRole
- [ ] Commit terraform changes

### Phase 4: Infrastructure Cleanup
- [ ] Shut down pgdb1 VM in Proxmox
- [ ] Delete the VM from Proxmox
- [ ] (Optional) Remove any DNS entries if not auto-generated

### Phase 5: Finalize
- [ ] Merge feature branch to master
- [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove the DNS entry
- [ ] Move this plan to `docs/plans/completed/`

## Rollback

If issues arise after decommissioning:
1. The VM can be recreated from template using the git history
2. Database data would need to be restored from backup (if any exists)

## Notes

- pgdb1 IP: 10.69.13.16
- The postgres service allowed connections from gunter (10.69.30.105)
- No restic backup was configured for this host
224 docs/plans/security-hardening.md Normal file
@@ -0,0 +1,224 @@
# Security Hardening Plan

## Overview

Address security gaps identified in an infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.

## Current State

- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
- Promtail ships logs over HTTP to Loki
- Loki has no authentication (`auth_enabled = false`)
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
- Audit logging exists (`common/ssh-audit.nix`) but is not applied globally
- Alert rules focus on availability, no security event detection

## Priority Matrix

| Issue | Severity | Effort | Priority |
|-------|----------|--------|----------|
| SSH password auth | High | Low | **P1** |
| Firewall disabled | High | Medium | **P1** |
| Promtail HTTP (no TLS) | High | Medium | **P2** |
| No security alerting | Medium | Low | **P2** |
| Audit logging not global | Low | Low | **P2** |
| Loki no auth | Medium | Medium | **P3** |
| Secret-ID TTL | Medium | Medium | **P3** |
| Vault skipTlsVerify | Medium | Low | **P3** |

## Phase 1: Quick Wins (P1)

### 1.1 SSH Hardening

Edit `system/sshd.nix`:

```nix
services.openssh = {
  enable = true;
  settings = {
    PermitRootLogin = "prohibit-password"; # Key-only root login
    PasswordAuthentication = false;
    KbdInteractiveAuthentication = false;
  };
};
```
**Prerequisite:** Verify all hosts have SSH keys deployed for root.

### 1.2 Enable Firewall

Create `system/firewall.nix` with a default-deny policy:

```nix
{ ... }: {
  networking.firewall.enable = true;

  # Use openssh's built-in firewall integration
  services.openssh.openFirewall = true;
}
```

**Useful firewall options:**

| Option | Description |
|--------|-------------|
| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
| `networking.firewall.extraInputRules` | Custom nftables rules (for complex filtering) |

**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.

#### Per-Interface Rules (http-proxy WireGuard)

The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:

```nix
# Example: http-proxy with different rules per interface
networking.firewall = {
  enable = true;

  # Default: only SSH (via openFirewall)
  allowedTCPPorts = [ ];

  # LAN interface: allow HTTP/HTTPS
  interfaces.ens18 = {
    allowedTCPPorts = [ 80 443 ];
  };

  # WireGuard interface: restrict to specific services or trust fully
  interfaces.wg0 = {
    allowedTCPPorts = [ 80 443 ];
    # Or use trustedInterfaces = [ "wg0" ] if fully trusted
  };
};
```

**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.

Then per-host, open the required ports:

| Host | Additional Ports |
|------|------------------|
| ns1/ns2 | 53 (TCP/UDP) |
| vault01 | 8200 |
| monitoring01 | 3100, 9090, 3000, 9093 |
| http-proxy | 80, 443 |
| nats1 | 4222 |
| ha1 | 1883, 8123 |
| jelly01 | 8096 |
| nix-cache01 | 5000 |
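Each row of the table translates directly into firewall options in that host's configuration; for example for ns1/ns2 (a sketch):

```nix
# hosts/ns1/configuration.nix (and ns2): DNS needs both TCP and UDP 53
networking.firewall = {
  allowedTCPPorts = [ 53 ];
  allowedUDPPorts = [ 53 ];
};
```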
|
||||
|
||||
## Phase 2: Logging & Detection (P2)
|
||||
|
||||
### 2.1 Enable TLS for Promtail → Loki
|
||||
|
||||
Update `system/monitoring/logs.nix`:
|
||||
|
||||
```nix
|
||||
clients = [{
|
||||
url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
|
||||
tls_config = {
|
||||
ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
|
||||
};
|
||||
}];
|
||||
```
|
||||
|
||||
Requires:
|
||||
- Configure Loki with TLS certificate (use internal ACME)
|
||||
- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)

### 2.2 Security Alert Rules

Add to `services/monitoring/rules.yml`:

```yaml
- name: security_rules
  rules:
    - alert: ssh_auth_failures
      expr: increase(node_logind_sessions_total[5m]) > 20
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Unusual login activity on {{ $labels.instance }}"

    - alert: vault_secret_fetch_failure
      expr: increase(vault_secret_failures[5m]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Vault secret fetch failures on {{ $labels.instance }}"
```

Also add Loki-based alerts for:
- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
- sudo usage: `{job="systemd-journal"} |= "sudo"`
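
The two LogQL queries above can be promoted to Loki ruler alerts rather than ad-hoc queries; a sketch, assuming the Loki ruler is enabled on monitoring01 and that journal logs carry a `host` label (the group name and thresholds are placeholders to tune):

```yaml
groups:
  - name: security_logs
    rules:
      - alert: ssh_failed_passwords
        # More than 10 failed passwords on one host within 5 minutes
        expr: sum by (host) (count_over_time({job="systemd-journal"} |= "Failed password" [5m])) > 10
        labels:
          severity: warning
        annotations:
          summary: "SSH password failures on {{ $labels.host }}"
      - alert: sudo_usage
        # Any sudo invocation - noisy, may warrant a higher threshold
        expr: sum by (host) (count_over_time({job="systemd-journal"} |= "sudo" [5m])) > 0
        labels:
          severity: info
        annotations:
          summary: "sudo invoked on {{ $labels.host }}"
```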

### 2.3 Global Audit Logging

Add the `./common/ssh-audit.nix` import to `system/default.nix`:

```nix
imports = [
  # ... existing imports
  ../common/ssh-audit.nix
];
```

## Phase 3: Defense in Depth (P3)

### 3.1 Loki Authentication

Options:
1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy

Recommendation: Option 1 (reverse proxy) is simplest for a homelab.
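
On monitoring01, option 1 could look roughly like this; a sketch, assuming Caddy fronts Loki on a hypothetical `loki.home.2rjus.net` vhost and Loki is rebound to localhost (the user name and bcrypt hash are placeholders - generate a real hash with `caddy hash-password`):

```nix
# Sketch only: authenticated reverse proxy in front of Loki.
services.caddy = {
  enable = true;
  virtualHosts."loki.home.2rjus.net".extraConfig = ''
    basic_auth {
      # placeholder bcrypt hash - replace with `caddy hash-password` output
      promtail $2a$14$REPLACE_ME
    }
    reverse_proxy 127.0.0.1:3100
  '';
};
```

Promtail clients would then need matching `basic_auth` credentials in their Loki client config.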

### 3.2 AppRole Secret Rotation

Update `terraform/vault/approle.tf`:

```hcl
secret_id_ttl = 2592000 # 30 days
```

Document the manual rotation procedure, or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.

### 3.3 Enable Vault TLS Verification

Change the default in `system/vault-secrets.nix`:

```nix
skipTlsVerify = mkOption {
  type = types.bool;
  default = false; # Changed from true
};
```

**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.

## Implementation Order

1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
2. **Validate SSH access** - Ensure key-based login works before disabling passwords
3. **Document firewall ports** - Create a per-host port reference before enabling the firewall
4. **Phase prod rollout** - Deploy to prod hosts one at a time, verifying each

## Open Questions

- [ ] Do all hosts have SSH keys configured for root access?
- [ ] Should firewall rules be per-host, or use a central definition with roles?
- [ ] Should Loki authentication use the existing Kanidm setup?

**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.

## Notes

- Firewall changes are the highest risk - test thoroughly on the test tier
- SSH hardening must not lock out access - verify keys first
- Consider creating a "break glass" procedure for emergency access if keys fail
docs/user-management.md (new file, 267 lines)
@@ -0,0 +1,267 @@
# User Management with Kanidm

Central authentication for the homelab using Kanidm.

## Overview

- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
- **WebUI**: https://auth.home.2rjus.net
- **LDAPS**: port 636

## CLI Setup

The `kanidm` CLI is available in the devshell:

```bash
nix develop

# Login as idm_admin
kanidm login --name idm_admin --url https://auth.home.2rjus.net
```

## User Management

POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
all attributes (including the UNIX password) in one workflow.

### Creating a POSIX User

```bash
# Create the person
kanidm person create <username> "<Display Name>"

# Add to groups
kanidm group add-members ssh-users <username>

# Enable POSIX (UID is auto-assigned)
kanidm person posix set <username>

# Set UNIX password (required for SSH login, min 10 characters)
kanidm person posix set-password <username>

# Optionally set login shell
kanidm person posix set <username> --shell /bin/zsh
```

### Example: Full User Creation

```bash
kanidm person create testuser "Test User"
kanidm group add-members ssh-users testuser
kanidm person posix set testuser
kanidm person posix set-password testuser
kanidm person get testuser
```

After creation, verify on a client host:

```bash
getent passwd testuser
ssh testuser@testvm01.home.2rjus.net
```

### Viewing User Details

```bash
kanidm person get <username>
```

### Removing a User

```bash
kanidm person delete <username>
```

## Group Management

Groups for POSIX access are also managed via the CLI.

### Creating a POSIX Group

```bash
# Create the group
kanidm group create <group-name>

# Enable POSIX with a specific GID
kanidm group posix set <group-name> --gidnumber <gid>
```

### Adding Members

```bash
kanidm group add-members <group-name> <username>
```

### Viewing Group Details

```bash
kanidm group get <group-name>
kanidm group list-members <group-name>
```

### Example: Full Group Creation

```bash
kanidm group create testgroup
kanidm group posix set testgroup --gidnumber 68010
kanidm group add-members testgroup testuser
kanidm group get testgroup
```

After creation, verify on a client host:

```bash
getent group testgroup
```

### Current Groups

| Group | GID | Purpose |
|-------|-----|---------|
| ssh-users | 68000 | SSH login access |
| admins | 68001 | Administrative access |
| users | 68002 | General users |

### UID/GID Allocation

Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:

| Range | Purpose |
|-------|---------|
| 65,536+ | Users (auto-assigned) |
| 68,000 - 68,999 | Groups (manually assigned) |
## PAM/NSS Client Configuration

Enable central authentication on a host:

```nix
homelab.kanidm.enable = true;
```

This configures:
- `services.kanidm.enablePam = true`
- Client connection to auth.home.2rjus.net
- Login authorization for the `ssh-users` group
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based directory)

### Enabled Hosts

- testvm01, testvm02, testvm03 (test tier)

### Options

```nix
homelab.kanidm = {
  enable = true;
  server = "https://auth.home.2rjus.net"; # default
  allowedLoginGroups = [ "ssh-users" ]; # default
};
```

### Home Directories

Home directories use UUID-based paths for stability (so renaming a user doesn't
require moving their home directory). Symlinks provide convenient access:

```
/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
```

The symlinks are created by `kanidm-unixd-tasks` on first login.

## Testing

### Verify NSS Resolution

```bash
# Check user resolution
getent passwd <username>

# Check group resolution
getent group <group-name>
```

### Test SSH Login

```bash
ssh <username>@<hostname>.home.2rjus.net
```

## Troubleshooting

### "PAM user mismatch" error

SSH fails with "fatal: PAM user mismatch" in the logs. This happens when Kanidm returns
usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).

**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).

Check the current format:

```bash
getent passwd torjus
# Should show: torjus:x:65536:...
# NOT: torjus@home.2rjus.net:x:65536:...
```

### User resolves but SSH fails immediately

The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:

```bash
# Check if group has POSIX
getent group ssh-users

# If empty, enable POSIX on the server
kanidm group posix set ssh-users --gidnumber 68000
```

### User doesn't resolve via getent

1. Check the kanidm-unixd service is running:
```bash
systemctl status kanidm-unixd
```

2. Check unixd can reach the server:
```bash
kanidm-unix status
# Should show: system: online, Kanidm: online
```

3. Check the client can reach the server:
```bash
curl -s https://auth.home.2rjus.net/status
```

4. Check the user has POSIX enabled on the server:
```bash
kanidm person get <username>
```

5. Restart nscd to clear a stale cache:
```bash
systemctl restart nscd
```

6. Invalidate the kanidm cache:
```bash
kanidm-unix cache-invalidate
```

### Changes not taking effect after deployment

NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying
kanidm-unixd config changes, you may need to restart both services:

```bash
systemctl restart kanidm-unixd
systemctl restart nscd
```

### Test PAM authentication directly

Use the kanidm-unix CLI to test PAM auth without SSH:

```bash
kanidm-unix auth-test --name <username>
```
flake.nix (37 lines changed)
@@ -65,15 +65,6 @@
    in
    {
      nixosConfigurations = {
        ns1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ns1
          ];
        };
        ha1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
@@ -128,15 +119,6 @@
            ./hosts/nix-cache01
          ];
        };
        pgdb1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/pgdb1
          ];
        };
        nats1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
@@ -191,6 +173,24 @@
            ./hosts/ns2
          ];
        };
        ns1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ns1
          ];
        };
        kanidm01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/kanidm01
          ];
        };
      };
      packages = forAllSystems (
        { pkgs }:
@@ -207,6 +207,7 @@
          pkgs.ansible
          pkgs.opentofu
          pkgs.openbao
          pkgs.kanidm_1_8
          (pkgs.callPackage ./scripts/create-host { })
          homelab-deploy.packages.${pkgs.system}.default
        ];
@@ -64,9 +64,5 @@
  vault.enable = true;
  homelab.deploy.enable = true;

  zramSwap = {
    enable = true;
  };

  system.stateVersion = "23.11"; # Did you read the comment?
}

@@ -1,56 +0,0 @@
{ config, lib, pkgs, ... }:

{
  imports =
    [
      ./hardware-configuration.nix
      ../../system
    ];

  nixpkgs.config.allowUnfree = true;

  homelab.host.role = "bastion";

  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/sda";

  networking.hostName = "jump";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
  ];

  systemd.network.enable = true;
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.10/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";

  nix.settings.experimental-features = [ "nix-command" "flakes" ];
  environment.systemPackages = with pkgs; [
    vim
    wget
    git
  ];

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  system.stateVersion = "23.11"; # Did you read the comment?
}
@@ -1,36 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:

{
  imports =
    [
      (modulesPath + "/profiles/qemu-guest.nix")
    ];

  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
  boot.initrd.kernelModules = [ ];
  # boot.kernelModules = [ ];
  # boot.extraModulePackages = [ ];

  fileSystems."/" =
    {
      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
      fsType = "xfs";
    };

  fileSystems."/boot" =
    {
      device = "/dev/disk/by-uuid/BC07-3B7A";
      fsType = "vfat";
    };

  swapDevices =
    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];

  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}
@@ -1,25 +1,39 @@
{
  config,
  lib,
  pkgs,
  ...
}:

{
  imports = [
    ./hardware-configuration.nix
    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
    ../../services/kanidm
  ];

  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
    enable = true;
    device = "/dev/sda";
    configurationLimit = 3;
  # Host metadata
  homelab.host = {
    tier = "test";
    role = "auth";
  };

  networking.hostName = "pgdb1";
  # DNS CNAME for auth.home.2rjus.net
  homelab.dns.cnames = [ "auth" ];

  # Enable Vault integration
  vault.enable = true;

  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "kanidm01";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,7 +47,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.16/24"
      "10.69.13.23/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -59,8 +73,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  vault.enable = true;
  homelab.deploy.enable = true;

  system.stateVersion = "23.11"; # Did you read the comment?
}
  system.stateVersion = "25.11"; # Did you read the comment?
}
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
}
}
@@ -4,6 +4,5 @@
    ./configuration.nix
    ../../services/nix-cache
    ../../services/actions-runner
    ./zram.nix
  ];
}

@@ -1,6 +0,0 @@
{ ... }:
{
  zramSwap = {
    enable = true;
  };
}
@@ -7,23 +7,38 @@

{
  imports = [
    ./hardware-configuration.nix
    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm

    # DNS services
    ../../services/ns/master-authorative.nix
    ../../services/ns/resolver.nix
    ../../common/vm
  ];

  # Host metadata
  homelab.host = {
    tier = "prod";
    role = "dns";
    labels.dns_role = "primary";
  };

  # Enable Vault integration
  vault.enable = true;

  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/sda";
  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "ns1";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -47,14 +62,6 @@
    "nix-command"
    "flakes"
  ];
  vault.enable = true;
  homelab.deploy.enable = true;

  homelab.host = {
    role = "dns";
    labels.dns_role = "primary";
  };

  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
@@ -68,5 +75,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  system.stateVersion = "23.11"; # Did you read the comment?
}
  system.stateVersion = "25.11"; # Did you read the comment?
}
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
}
}
@@ -1,36 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:

{
  imports =
    [
      (modulesPath + "/profiles/qemu-guest.nix")
    ];

  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
  boot.initrd.kernelModules = [ ];
  # boot.kernelModules = [ ];
  # boot.extraModulePackages = [ ];

  fileSystems."/" =
    {
      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
      fsType = "xfs";
    };

  fileSystems."/boot" =
    {
      device = "/dev/disk/by-uuid/BC07-3B7A";
      fsType = "vfat";
    };

  swapDevices =
    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];

  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}
@@ -1,7 +0,0 @@
{ ... }:
{
  imports = [
    ./configuration.nix
    ../../services/postgres
  ];
}
@@ -1,42 +0,0 @@
{
  config,
  lib,
  pkgs,
  modulesPath,
  ...
}:

{
  imports = [
    (modulesPath + "/profiles/qemu-guest.nix")
  ];
  boot.initrd.availableKernelModules = [
    "ata_piix"
    "uhci_hcd"
    "virtio_pci"
    "virtio_scsi"
    "sd_mod"
    "sr_mod"
  ];
  boot.initrd.kernelModules = [ "dm-snapshot" ];
  boot.kernelModules = [
    "ptp_kvm"
  ];
  boot.extraModulePackages = [ ];

  fileSystems."/" = {
    device = "/dev/disk/by-label/root";
    fsType = "xfs";
  };

  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];

  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}
@@ -58,6 +58,14 @@
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
  nix.settings.substituters = [
    "https://nix-cache.home.2rjus.net"
    "https://cache.nixos.org"
  ];
  nix.settings.trusted-public-keys = [
    "nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
  ];
  environment.systemPackages = with pkgs; [
    age
    vim
@@ -71,5 +79,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
  zramSwap.enable = true;

  system.stateVersion = "25.11";
}
@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

@@ -1,19 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

# array of hosts
HOSTS=(
  "ns1"
  "ns2"
  "ha1"
  "http-proxy"
  "jelly01"
  "monitoring01"
  "nix-cache01"
  "pgdb1"
)

for host in "${HOSTS[@]}"; do
  echo "Rebuilding $host"
  nixos-rebuild boot --flake .#${host} --target-host root@${host}
done
services/kanidm/default.nix (new file, 65 lines)
@@ -0,0 +1,65 @@
{ config, lib, pkgs, ... }:
{
  services.kanidm = {
    package = pkgs.kanidmWithSecretProvisioning_1_8;
    enableServer = true;
    serverSettings = {
      domain = "home.2rjus.net";
      origin = "https://auth.home.2rjus.net";
      bindaddress = "0.0.0.0:443";
      ldapbindaddress = "0.0.0.0:636";
      tls_chain = "/var/lib/acme/auth.home.2rjus.net/fullchain.pem";
      tls_key = "/var/lib/acme/auth.home.2rjus.net/key.pem";
      online_backup = {
        path = "/var/lib/kanidm/backups";
        schedule = "00 22 * * *";
        versions = 7;
      };
    };

    # Provision base groups only - users are managed via CLI
    # See docs/user-management.md for details
    provision = {
      enable = true;
      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;

      groups = {
        admins = { };
        users = { };
        ssh-users = { };
      };

      # Regular users (persons) are managed imperatively via kanidm CLI
    };
  };

  # Grant kanidm access to ACME certificates
  users.users.kanidm.extraGroups = [ "acme" ];

  # ACME certificate from internal CA
  # Include both the CNAME (auth) and A record (kanidm01) for Prometheus scraping
  security.acme.certs."auth.home.2rjus.net" = {
    listenHTTP = ":80";
    reloadServices = [ "kanidm" ];
    extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
  };

  # Vault secret for idm_admin password (used for provisioning)
  vault.secrets.kanidm-idm-admin = {
    secretPath = "kanidm/idm-admin-password";
    extractKey = "password";
    services = [ "kanidm" ];
    owner = "kanidm";
    group = "kanidm";
  };

  # Note: Kanidm does not expose Prometheus metrics
  # If metrics support is added in the future, uncomment:
  # homelab.monitoring.scrapeTargets = [
  #   {
  #     job_name = "kanidm";
  #     port = 443;
  #     scheme = "https";
  #   }
  # ];
}
@@ -356,32 +356,6 @@ groups:
      annotations:
        summary: "Proxmox VM {{ $labels.id }} is stopped"
        description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."
  - name: postgres_rules
    rules:
      - alert: postgres_down
        expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL not running on {{ $labels.instance }}"
          description: "PostgreSQL has been down on {{ $labels.instance }} more than 5 minutes."
      - alert: postgres_exporter_down
        expr: up{job="postgres"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL exporter down on {{ $labels.instance }}"
          description: "Cannot scrape PostgreSQL metrics from {{ $labels.instance }}."
      - alert: postgres_high_connections
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connection pool near exhaustion on {{ $labels.instance }}"
          description: "PostgreSQL is using over 80% of max_connections on {{ $labels.instance }}."
  - name: jellyfin_rules
    rules:
      - alert: jellyfin_down

@@ -45,7 +45,11 @@
  };
  stub-zone = {
    name = "home.2rjus.net";
    stub-addr = "127.0.0.1@8053";
    stub-addr = [
      "127.0.0.1@8053" # Local NSD
      "10.69.13.5@8053" # ns1
      "10.69.13.6@8053" # ns2
    ];
  };
  forward-zone = {
    name = ".";

@@ -1,6 +0,0 @@
{ ... }:
{
  imports = [
    ./postgres.nix
  ];
}
@@ -1,23 +0,0 @@
{ pkgs, ... }:
{
  homelab.monitoring.scrapeTargets = [{
    job_name = "postgres";
    port = 9187;
  }];

  services.prometheus.exporters.postgres = {
    enable = true;
    runAsLocalSuperUser = true; # Use peer auth as postgres user
  };

  services.postgresql = {
    enable = true;
    enableJIT = true;
    enableTCPIP = true;
    extensions = ps: with ps; [ pgvector ];
    authentication = ''
      # Allow access to everything from gunter
      host all all 10.69.30.105/32 scram-sha-256
    '';
  };
}
@@ -4,13 +4,16 @@
    ./acme.nix
    ./autoupgrade.nix
    ./homelab-deploy.nix
    ./kanidm-client.nix
    ./monitoring
    ./motd.nix
    ./packages.nix
    ./nix.nix
    ./pipe-to-loki.nix
    ./root-user.nix
    ./pki/root-ca.nix
    ./sshd.nix
    ./vault-secrets.nix
    ./zram.nix
  ];
}
system/kanidm-client.nix (new file, 42 lines)
@@ -0,0 +1,42 @@
{ lib, config, pkgs, ... }:
let
  cfg = config.homelab.kanidm;
in
{
  options.homelab.kanidm = {
    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";

    server = lib.mkOption {
      type = lib.types.str;
      default = "https://auth.home.2rjus.net";
      description = "URI of the Kanidm server";
    };

    allowedLoginGroups = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [ "ssh-users" ];
      description = "Groups allowed to log in via PAM";
    };
  };

  config = lib.mkIf cfg.enable {
    services.kanidm = {
      package = pkgs.kanidm_1_8;
      enablePam = true;

      clientSettings = {
        uri = cfg.server;
      };

      unixSettings = {
        pam_allowed_login_groups = cfg.allowedLoginGroups;
        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
        # This prevents "PAM user mismatch" errors with SSH
        uid_attr_map = "name";
        gid_attr_map = "name";
        # Create symlink /home/torjus -> /home/torjus@home.2rjus.net
        home_alias = "name";
      };
    };
  };
}
140
system/pipe-to-loki.nix
Normal file
140
system/pipe-to-loki.nix
Normal file
@@ -0,0 +1,140 @@
{
  config,
  pkgs,
  lib,
  ...
}:
let
  pipe-to-loki = pkgs.writeShellApplication {
    name = "pipe-to-loki";
    runtimeInputs = with pkgs; [
      curl
      jq
      util-linux
      coreutils
    ];
    text = ''
      set -euo pipefail

      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
      HOSTNAME=$(hostname)
      SESSION_ID=""
      RECORD_MODE=false

      usage() {
        echo "Usage: pipe-to-loki [--id ID] [--record]"
        echo ""
        echo "Send command output or interactive sessions to Loki."
        echo ""
        echo "Options:"
        echo "  --id ID    Set custom session ID (default: auto-generated)"
        echo "  --record   Start interactive recording session"
        echo ""
        echo "Examples:"
        echo "  command | pipe-to-loki           # Pipe command output"
        echo "  command | pipe-to-loki --id foo  # Pipe with custom ID"
        echo "  pipe-to-loki --record            # Start recording session"
        exit 1
      }

      generate_id() {
        local random_chars
        random_chars=$(head -c 2 /dev/urandom | od -An -tx1 | tr -d ' \n')
        echo "''${HOSTNAME}-$(date +%s)-''${random_chars}"
      }

      send_to_loki() {
        local content="$1"
        local type="$2"
        local timestamp_ns
        timestamp_ns=$(date +%s%N)

        local payload
        payload=$(jq -n \
          --arg job "pipe-to-loki" \
          --arg host "$HOSTNAME" \
          --arg type "$type" \
          --arg id "$SESSION_ID" \
          --arg ts "$timestamp_ns" \
          --arg content "$content" \
          '{
            streams: [{
              stream: {
                job: $job,
                host: $host,
                type: $type,
                id: $id
              },
              values: [[$ts, $content]]
            }]
          }')

        if curl -s -X POST "$LOKI_URL" \
          -H "Content-Type: application/json" \
          -d "$payload" > /dev/null; then
          return 0
        else
          echo "Error: Failed to send to Loki" >&2
          return 1
        fi
      }

      # Parse arguments
      while [[ $# -gt 0 ]]; do
        case $1 in
          --id)
            SESSION_ID="$2"
            shift 2
            ;;
          --record)
            RECORD_MODE=true
            shift
            ;;
          --help|-h)
            usage
            ;;
          *)
            echo "Unknown option: $1" >&2
            usage
            ;;
        esac
      done

      # Generate ID if not provided
      if [[ -z "$SESSION_ID" ]]; then
        SESSION_ID=$(generate_id)
      fi

      if $RECORD_MODE; then
        # Session recording mode
        SCRIPT_FILE=$(mktemp)
        trap 'rm -f "$SCRIPT_FILE"' EXIT

        echo "Recording session $SESSION_ID... (exit to send)"

        # Use script to record the session
        script -q "$SCRIPT_FILE"

        # Read the transcript and send to Loki
        content=$(cat "$SCRIPT_FILE")
        if send_to_loki "$content" "session"; then
          echo "Session $SESSION_ID sent to Loki"
        fi
      else
        # Pipe mode - read from stdin
        if [[ -t 0 ]]; then
          echo "Error: No input provided. Pipe a command or use --record for interactive mode." >&2
          exit 1
        fi

        content=$(cat)
        if send_to_loki "$content" "command"; then
          echo "Sent to Loki with id: $SESSION_ID"
        fi
      fi
    '';
  };
in
{
  environment.systemPackages = [ pipe-to-loki ];
}
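The payload that `send_to_loki` builds follows the Loki push API shape: one stream with label set `{job, host, type, id}` and a single `[timestamp_ns, line]` value. A minimal sketch of that construction, runnable locally without a Loki instance (the host name, session ID, and content here are made-up placeholders, not values from the repo):

```shell
# Build the same JSON shape send_to_loki posts to /loki/api/v1/push,
# using fixed placeholder labels instead of the real host/session values.
payload=$(jq -n \
  --arg job "pipe-to-loki" \
  --arg host "demo-host" \
  --arg type "command" \
  --arg id "demo-host-1700000000-ab12" \
  --arg ts "$(date +%s%N)" \
  --arg content "hello from uptime" \
  '{
    streams: [{
      stream: { job: $job, host: $host, type: $type, id: $id },
      values: [[$ts, $content]]
    }]
  }')

# Sanity-check the shape the way Loki expects it: labels in .stream,
# the log line as the second element of the values pair.
echo "$payload" | jq -e '.streams[0].stream.job == "pipe-to-loki"' > /dev/null
echo "$payload" | jq -r '.streams[0].values[0][1]'
```

Because the timestamp must be nanoseconds as a string, passing it through `--arg` (which always produces a JSON string) is what keeps Loki's parser happy.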
system/zram.nix (new file, 8 lines)
@@ -0,0 +1,8 @@
# Compressed swap in RAM
#
# Provides overflow memory during Nix builds and upgrades.
# Prevents OOM kills on low-memory hosts (2GB VMs).
{ ... }:
{
  zramSwap.enable = true;
}
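If the defaults ever prove too tight on the 2GB VMs, the NixOS `zramSwap` module exposes sizing knobs; a sketch of the relevant options (the values shown are the module defaults, not settings from this repo):

```nix
# Equivalent to zramSwap.enable = true with defaults spelled out.
zramSwap = {
  enable = true;
  memoryPercent = 50; # zram device sized at 50% of RAM (default)
  algorithm = "zstd"; # compression algorithm (default)
};
```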
@@ -66,19 +66,7 @@ locals {
      ]
    }

    "pgdb1" = {
      paths = [
        "secret/data/hosts/pgdb1/*",
      ]
    }

    # Wave 3: DNS servers
    "ns1" = {
      paths = [
        "secret/data/hosts/ns1/*",
        "secret/data/shared/dns/*",
      ]
    }

    # Wave 4: http-proxy
    "http-proxy" = {
@@ -26,6 +26,19 @@ locals {
        "secret/data/shared/dns/*",
      ]
    }
    "ns1" = {
      paths = [
        "secret/data/hosts/ns1/*",
        "secret/data/shared/dns/*",
        "secret/data/shared/homelab-deploy/*",
      ]
    }
    "kanidm01" = {
      paths = [
        "secret/data/hosts/kanidm01/*",
        "secret/data/kanidm/*",
      ]
    }
  }
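Path lists like the ones above are typically rendered into per-host Vault policies. A hypothetical rendering for `kanidm01` might look like this (the `read` capability is an assumption; the actual capabilities depend on the policy template elsewhere in the repo):

```hcl
# Sketch: what a generated Vault policy for kanidm01 could expand to.
path "secret/data/hosts/kanidm01/*" {
  capabilities = ["read"]
}
path "secret/data/kanidm/*" {
  capabilities = ["read"]
}
```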
@@ -102,6 +102,12 @@ locals {
      auto_generate = false
      data          = { nkey = var.homelab_deploy_admin_deployer_nkey }
    }

    # Kanidm idm_admin password
    "kanidm/idm-admin-password" = {
      auto_generate   = true
      password_length = 32
    }
  }
}
@@ -65,6 +65,20 @@ locals {
      disk_size           = "20G"
      vault_wrapped_token = "s.3nran1e1Uim4B1OomIWCoS4T"
    }
    "ns1" = {
      ip                  = "10.69.13.5/24"
      cpu_cores           = 2
      memory              = 2048
      disk_size           = "20G"
      vault_wrapped_token = "s.b6ge0KMtNQctdKkvm0RNxGdt"
    }
    "kanidm01" = {
      ip                  = "10.69.13.23/24"
      cpu_cores           = 2
      memory              = 2048
      disk_size           = "20G"
      vault_wrapped_token = "s.OOqjEECeIV7dNgCS6jNmyY3K"
    }
  }

  # Compute VM configurations with defaults applied