docs: update auth-system-replacement plan with PAM/NSS progress

- Mark PAM/NSS client module as complete
- Mark documentation as complete
- Update provisioning approach (declarative groups, imperative users)
- Add details on client module and verified functionality
- Update next steps

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs: add verified group creation example
2026-02-08 15:09:05 +01:00 · 2026-02-08 15:05:20 +01:00 · 2026-02-08 15:02:12 +01:00 · 2026-02-08 14:55:19 +01:00 · 2026-02-08 14:51:08 +01:00 · 2026-02-08 14:45:37 +01:00
105 changed files with 3249 additions and 1751 deletions
--- a/.claude/agents/auditor.md
+++ b/.claude/agents/auditor.md
@@ -0,0 +1,180 @@
+---
+name: auditor
+description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
+tools: Read, Grep, Glob
+mcpServers:
+  - lab-monitoring
+---
+
+You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.
+
+## Input
+
+You may receive:
+- A host or list of hosts to investigate
+- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
+- Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
+- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")
+
+## Audit Log Structure
+
+Logs are shipped to Loki via promtail. Audit events use these labels:
+- `host` - hostname
+- `systemd_unit` - typically `auditd.service` for audit logs
+- `job` - typically `systemd-journal`
+
+Audit log entries contain structured data:
+- `EXECVE` - command execution with full arguments
+- `USER_LOGIN` / `USER_LOGOUT` - session start/end
+- `USER_CMD` - sudo command execution
+- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
+- `SERVICE_START` / `SERVICE_STOP` - systemd service events
+
+## Investigation Techniques
+
+### 1. SSH Session Activity
+
+Find SSH logins and session activity:
+```logql
+{host="<hostname>", systemd_unit="sshd.service"}
+```
+
+Look for:
+- Accepted/Failed authentication
+- Session opened/closed
+- Unusual source IPs or users
+
+### 2. Command Execution
+
+Query executed commands (filter out noise):
+```logql
+{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
+```
+
+Further filtering:
+- Exclude systemd noise: `!= "systemd" != "/nix/store"`
+- Focus on specific commands: `|= "rm" |= "-rf"`
+- Focus on specific user: `|= "uid=1000"`
+
+### 3. Sudo Activity
+
+Check for privilege escalation:
+```logql
+{host="<hostname>"} |= "sudo" |= "COMMAND"
+```
+
+Or via audit:
+```logql
+{host="<hostname>"} |= "USER_CMD"
+```
+
+### 4. Service Manipulation
+
+Check if services were manually stopped/started:
+```logql
+{host="<hostname>"} |= "EXECVE" |= "systemctl"
+```
+
+### 5. File Operations
+
+Look for file modifications (if auditd rules are configured):
+```logql
+{host="<hostname>"} |= "EXECVE" |= "vim"
+{host="<hostname>"} |= "EXECVE" |= "nano"
+{host="<hostname>"} |= "EXECVE" |= "rm"
+```
+
+## Query Guidelines
+
+**Start narrow, expand if needed:**
+- Begin with `limit: 20-30`
+- Use tight time windows: `start: "15m"` or `start: "30m"`
+- Add filters progressively
+
+**Avoid:**
+- Querying all audit logs without an `EXECVE` filter (extremely verbose)
+- Large time ranges without specific filters
+- Limits over 50 without tight filters
+
+**Time-bounded queries:**
+When investigating around a specific event:
+```logql
+{host="<hostname>"} |= "EXECVE" != "systemd"
+```
+With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
+
+## Suspicious Patterns to Watch For
+
+1. **Unusual login times** - Activity outside normal hours
+2. **Failed authentication** - Brute force attempts
+3. **Privilege escalation** - Unexpected sudo usage
+4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
+5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
+6. **Persistence mechanisms** - Cron modifications, systemd service creation
+7. **Log tampering** - Commands targeting log files
+8. **Lateral movement** - SSH to other internal hosts
+9. **Service manipulation** - Stopping security services, disabling firewalls
+10. **Cleanup activity** - Deleting bash history, clearing logs
+
+## Output Format
+
+### For Standalone Security Reviews
+
+```
+## Activity Summary
+
+**Host:** <hostname>
+**Time Period:** <start> to <end>
+**Sessions Found:** <count>
+
+## User Sessions
+
+### Session 1: <user> from <source_ip>
+- **Login:** HH:MM:SSZ
+- **Logout:** HH:MM:SSZ (or ongoing)
+- **Commands executed:**
+  - HH:MM:SSZ - <command>
+  - HH:MM:SSZ - <command>
+
+## Suspicious Activity
+
+[If any patterns from the watch list were detected]
+- **Finding:** <description>
+- **Evidence:** <log entries>
+- **Risk Level:** Low / Medium / High
+
+## Summary
+
+[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
+```
+
+### When Called by Another Agent
+
+Provide a focused response addressing the specific question:
+
+```
+## Audit Findings
+
+**Query:** <what was asked>
+**Time Window:** <investigated period>
+
+## Relevant Activity
+
+[Chronological list of relevant events]
+- HH:MM:SSZ - <event>
+- HH:MM:SSZ - <event>
+
+## Assessment
+
+[Direct answer to the question with supporting evidence]
+```
+
+## Guidelines
+
+- Reconstruct timelines chronologically
+- Correlate events (login → commands → logout)
+- Note gaps or missing data
+- Distinguish between automated (systemd, cron) and interactive activity
+- Consider the host's role and tier when assessing severity
+- When called by another agent, focus on answering their specific question
+- Don't speculate without evidence - state what the logs show and don't show
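
Review note, not part of the commit: the command-execution query from section 2 of `auditor.md` can be assembled from its filter lists. This is a minimal sketch. The helper name `build_execve_query` and its parameters are hypothetical, and only the `host` label and the filter strings come from the agent file. It assumes hostnames and filter terms contain no double quotes.

```python
# Hypothetical helper: composes the LogQL line filters listed in auditor.md.
# Default exclusions mirror the "filter out noise" query in that file.
DEFAULT_NOISE = ("PATH item", "PROCTITLE", "SYSCALL", "BPF")


def build_execve_query(host, include=(), exclude=DEFAULT_NOISE):
    """Return a LogQL query for EXECVE audit records on a single host."""
    parts = ['{host="%s"}' % host, '|= "EXECVE"']
    # Positive filters first (e.g. "systemctl"), then the noise exclusions.
    parts += ['|= "%s"' % term for term in include]
    parts += ['!= "%s"' % term for term in exclude]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_execve_query("ns1"))
    print(build_execve_query("ns1", include=("systemctl",), exclude=()))
```

Keeping the exclusions in one tuple makes it easy to start with the narrow default and add filters progressively, as the query guidelines recommend.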
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -0,0 +1,211 @@
+---
+name: investigate-alarm
+description: Investigates a single system alarm by querying Prometheus metrics and Loki logs, analyzing configuration files for affected hosts/services, and providing root cause analysis.
+tools: Read, Grep, Glob
+mcpServers:
+  - lab-monitoring
+  - git-explorer
+---
+
+You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
+
+## Input
+
+You will receive information about an alarm, which may include:
+- Alert name and severity
+- Affected host or service
+- Alert expression/threshold
+- Current value or status
+- When it started firing
+
+## Investigation Process
+
+### 1. Understand the Alert Context
+
+Start by understanding what the alert is measuring:
+- Use `get_alert` if you have a fingerprint, or `list_alerts` to find matching alerts
+- Use `get_metric_metadata` to understand the metric being monitored
+- Use `search_metrics` to find related metrics
+
+### 2. Query Current State
+
+Gather evidence about the current system state:
+- Use `query` to check the current metric values and related metrics
+- Use `list_targets` to verify the host/service is being scraped successfully
+- Look for correlated metrics that might explain the issue
+
+### 3. Check Service Logs
+
+Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.
+
+**Query strategies (start narrow, expand if needed):**
+- Start with `limit: 20-30`, increase only if needed
+- Use tight time windows: `start: "15m"` or `start: "30m"` initially
+- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
+- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
+
+**Common patterns:**
+- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
+- All errors on host: `{host="<hostname>"} |= "error"`
+- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
+
+**Avoid:**
+- Using `start: "1h"` with no filters on busy hosts
+- Limits over 50 without specific filters
+
+### 4. Investigate User Activity
+
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
+
+**Always call the auditor when:**
+- A service stopped unexpectedly (may have been manually stopped)
+- A process was killed or a config was changed
+- You need to know who was logged in around the time of an incident
+- You need to understand what commands led to the current state
+- The cause isn't obvious from service logs alone
+
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
+
+**Example prompt for auditor:**
+```
+Investigate user activity on <hostname> between <start_time> and <end_time>.
+Context: The prometheus-node-exporter service stopped at 14:32.
+Determine if it was manually stopped and by whom.
+```
+
+Incorporate the auditor's findings into your timeline and root cause analysis.
+
+### 5. Check Configuration (if relevant)
+
+If the alert relates to a NixOS-managed service:
+- Check host configuration in `/hosts/<hostname>/`
+- Check service modules in `/services/<service>/`
+- Look for thresholds, resource limits, or misconfigurations
+- Check `homelab.host` options for tier/priority/role metadata
+
+### 6. Check for Configuration Drift
+
+Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
+- Hosts running outdated configurations
+- Recent changes that might have caused the issue
+- Whether a fix has already been committed but not deployed
+
+**Step 1: Get the deployed revision from Prometheus**
+```promql
+nixos_flake_info{hostname="<hostname>"}
+```
+The `current_rev` label contains the deployed git commit hash.
+
+**Step 2: Check if the host is behind master**
+```
+resolve_ref("master")           # Get current master commit
+is_ancestor(deployed, master)   # Check if host is behind
+```
+
+**Step 3: See what commits are missing**
+```
+commits_between(deployed, master)  # List commits not yet deployed
+```
+
+**Step 4: Check which files changed**
+```
+get_diff_files(deployed, master)   # Files modified since deployment
+```
+Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.
+
+**Step 5: View configuration at the deployed revision**
+```
+get_file_at_commit(deployed, "services/<service>/default.nix")
+```
+Compare against the current file to understand differences.
+
+**Step 6: Find when something changed**
+```
+search_commits("<service-name>")   # Find commits mentioning the service
+get_commit_info(<hash>)            # Get full details of a specific change
+```
+
+**Example workflow for a service-related alert:**
+1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
+2. `resolve_ref("master")` → `4633421`
+3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
+4. `commits_between("8959829", "4633421")` → 7 commits missing
+5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
+6. If a fix was committed after the deployed rev, recommend deployment
+
+### 7. Consider Common Causes
+
+For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
+- **Configuration drift**: Host running outdated config, fix already in master
+- **Disk space**: Nix store growth, logs, temp files
+- **Memory pressure**: Service memory leaks, insufficient limits
+- **CPU**: Runaway processes, build jobs
+- **Network**: DNS issues, connectivity problems
+- **Service restarts**: Failed upgrades, configuration errors
+- **Scrape failures**: Service down, firewall issues, port changes
+
+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
+
+## Output Format
+
+Provide a concise report with one of two outcomes:
+
+### If Root Cause Identified:
+
+```
+## Root Cause
+[1-2 sentence summary of the root cause]
+
+## Timeline
+[Chronological sequence of relevant events leading to the alert]
+- HH:MM:SSZ - [Event description]
+- HH:MM:SSZ - [Event description]
+- HH:MM:SSZ - [Alert fired]
+
+### Timeline sources
+- HH:MM:SSZ - [Source for this event: which metric or log stream]
+- HH:MM:SSZ - [Source for this event: which metric or log stream]
+- HH:MM:SSZ - [Source for the alert firing time]
+
+
+## Evidence
+- [Specific metric values or log entries that support the conclusion]
+- [Configuration details if relevant]
+
+
+## Recommended Actions
+1. [Specific remediation step]
+2. [Follow-up actions if any]
+```
+
+### If Root Cause Unclear:
+
+```
+## Investigation Summary
+[What was checked and what was found]
+
+## Possible Causes
+- [Hypothesis 1 with supporting/contradicting evidence]
+- [Hypothesis 2 with supporting/contradicting evidence]
+
+## Additional Information Needed
+- [Specific data, logs, or access that would help]
+- [Suggested queries or checks for the operator]
+```
+
+## Guidelines
+
+- Be concise and actionable
+- Reference specific metric names and values as evidence
+- Include log snippets when they're informative
+- Don't speculate without evidence
+- If the alert is a false positive or expected behavior, explain why
+- Consider the host's tier (test vs prod) when assessing severity
+- Build a timeline from log timestamps and metrics to show the sequence of events
+- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
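
Review note, not part of the commit: step 4 of the drift check in `investigate-alarm.md` amounts to filtering the changed-file list by path prefix. This is a minimal sketch. The function `relevant_changes` is hypothetical, and it assumes the file list has already been obtained (for example from `get_diff_files`) as repository-relative paths.

```python
# Hypothetical helper: keeps only changed files that can affect a given host.
# The prefixes follow the directories named in step 4 of the drift check.
def relevant_changes(changed_files, hostname, services=()):
    """Filter repository-relative paths down to those relevant to one host."""
    prefixes = ["hosts/%s/" % hostname, "system/"]
    prefixes += ["services/%s/" % name for name in services]
    return [path for path in changed_files if path.startswith(tuple(prefixes))]


if __name__ == "__main__":
    changed = [
        "hosts/monitoring01/configuration.nix",
        "hosts/ns1/configuration.nix",
        "services/monitoring/default.nix",
        "system/sshd.nix",
        "docs/plans/example.md",
    ]
    for path in relevant_changes(changed, "monitoring01", services=("monitoring",)):
        print(path)
```

An empty result suggests the missing commits are unlikely to explain the alert, although changes under other shared paths such as `modules/` or `lib/` would still need a manual look.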
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:
 Available labels for log queries:
 - `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
+- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
 - `hostname` - Alternative to `host` for some streams

@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:
 - `sshd.service` - SSH daemon
 - `nix-gc.service` - Nix garbage collection

+### Bootstrap Logs
+
+VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
+
+- `host` - Target hostname
+- `branch` - Git branch being deployed
+- `stage` - Bootstrap stage (see table below)
+
+**Bootstrap stages:**
+
+| Stage | Message | Meaning |
+|-------|---------|---------|
+| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
+| `network_ok` | Network connectivity confirmed | Can reach git server |
+| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
+| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
+| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
+| `building` | Starting nixos-rebuild boot | NixOS build starting |
+| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
+| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
+
+**Bootstrap queries:**
+
+```logql
+{job="bootstrap"}                              # All bootstrap logs
+{job="bootstrap", host="myhost"}               # Specific host
+{job="bootstrap", stage="failed"}              # All failures
+{job="bootstrap", stage=~"building|success"}   # Track build progress
+```
+
 ### Extracting JSON Fields

 Parse JSON and filter on fields:
@@ -175,31 +205,95 @@ Disk space (root filesystem):
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```

-### Service-Specific Metrics
+### Prometheus Jobs

-Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA
+All available Prometheus job names:

-### Instance Label Format
+**System exporters (on all/most hosts):**
+- `node-exporter` - System metrics (CPU, memory, disk, network)
+- `nixos-exporter` - NixOS flake revision and generation info
+- `systemd-exporter` - Systemd unit status metrics
+- `homelab-deploy` - Deployment listener metrics

-The `instance` label uses FQDN format:
+**Service-specific exporters:**
+- `caddy` - Reverse proxy metrics (http-proxy)
+- `nix-cache_caddy` - Nix binary cache metrics
+- `home-assistant` - Home automation metrics (ha1)
+- `jellyfin` - Media server metrics (jelly01)
+- `kanidm` - Authentication server metrics (kanidm01)
+- `nats` - NATS messaging metrics (nats1)
+- `openbao` - Secrets management metrics (vault01)
+- `unbound` - DNS resolver metrics (ns1, ns2)
+- `wireguard` - VPN tunnel metrics (http-proxy)

-```
-<hostname>.home.2rjus.net:<port>
-```
+**Monitoring stack (localhost on monitoring01):**
+- `prometheus` - Prometheus self-metrics
+- `loki` - Loki self-metrics
+- `grafana` - Grafana self-metrics
+- `alertmanager` - Alertmanager metrics
+- `pushgateway` - Push-based metrics gateway

-Example queries filtering by host:
+**External/infrastructure:**
+- `pve-exporter` - Proxmox hypervisor metrics
+- `smartctl` - Disk SMART health (gunter)
+- `restic_rest` - Backup server metrics
+- `ghettoptt` - PTT service metrics (gunter)
+
+### Target Labels
+
+All scrape targets have these labels:
+
+**Standard labels:**
+- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
+- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
+
+**Host metadata labels** (when configured in `homelab.host`):
+- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
+- `tier` - Deployment tier (`test` for test VMs, absent for prod)
+- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
+
+### Filtering by Host
+
+Use the `hostname` label for easy host filtering across all jobs:

 ```promql
-up{instance=~"monitoring01.*"}
-node_load1{instance=~"ns1.*"}
+{hostname="ns1"}                    # All metrics from ns1
+node_load1{hostname="monitoring01"} # Specific metric by hostname
+up{hostname="ha1"}                  # Check if ha1 is up
 ```

+This is simpler than wildcarding the `instance` label:
+
+```promql
+# Old way (still works but verbose)
+up{instance=~"monitoring01.*"}
+
+# New way (preferred)
+up{hostname="monitoring01"}
+```
+
+### Filtering by Role/Tier
+
+Filter hosts by their role or tier:
+
+```promql
+up{role="dns"}                      # All DNS servers (ns1, ns2)
+node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
+up{tier="test"}                     # All test-tier VMs
+up{dns_role="primary"}              # Primary DNS only (ns1)
+```
+
+Current host labels:
+| Host | Labels |
+|------|--------|
+| ns1 | `role=dns`, `dns_role=primary` |
+| ns2 | `role=dns`, `dns_role=secondary` |
+| nix-cache01 | `role=build-host` |
+| vault01 | `role=vault` |
+| kanidm01 | `role=auth`, `tier=test` |
+| testvm01/02/03 | `tier=test` |
+
 ---

 ## Troubleshooting Workflows
@@ -212,11 +306,12 @@ node_load1{instance=~"ns1.*"}

 ### Investigate Service Issues

-1. Check `up{job="<service>"}` for scrape failures
+1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
 4. Search for errors: `{host="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
+6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers

 ### After Deploying Changes

@@ -225,6 +320,17 @@ node_load1{instance=~"ns1.*"}
 3. Check service logs for startup issues
 4. Check service metrics are being scraped

+### Monitor VM Bootstrap
+
+When provisioning new VMs, track bootstrap progress:
+
+1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
+2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
+3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
+4. Check logs are flowing: `{host="<hostname>"}`
+
+See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
+
 ### Debug SSH/Access Issues

 ```logql
@@ -246,5 +352,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
+- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
 - Log `MESSAGE` field contains the actual log content in JSON format
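
Review note, not part of the commit: the bootstrap stage table added to `SKILL.md` can be read as a small state summary. This is a minimal sketch. The function names are hypothetical, and it assumes the `stage` label values for one host have already been collected in chronological order.

```python
# Stage names are taken from the bootstrap table in SKILL.md.
# The helpers themselves are hypothetical illustrations.
TERMINAL_STAGES = {"success": "succeeded", "failed": "failed"}
NO_SECRETS_STAGES = {"vault_skip", "vault_warn"}


def bootstrap_status(stages):
    """Summarise progress from the ordered stage values seen for one host."""
    if not stages:
        return "not started"
    # Only the most recent stage decides the outcome, so a re-run after a
    # failure is reported as in progress again.
    return TERMINAL_STAGES.get(stages[-1], "in progress")


def secrets_missing(stages):
    """True if the bootstrap continued without Vault credentials."""
    return any(stage in NO_SECRETS_STAGES for stage in stages)


if __name__ == "__main__":
    seen = ["starting", "network_ok", "vault_warn", "building"]
    print(bootstrap_status(seen), secrets_missing(seen))
```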
--- a/.mcp.json
+++ b/.mcp.json
@@ -33,6 +33,13 @@
        "--nats-url", "nats://nats1.home.2rjus.net:4222",
        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
      ]
+    },
+    "git-explorer": {
+      "command": "nix",
+      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
+      "env": {
+        "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
+      }
    }
  }
 }
--- a/.sops.yaml
+++ b/.sops.yaml
@@ -1,52 +0,0 @@
-keys:
-  - &admin_torjus age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-  - &server_ns1 age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
-  - &server_ns2 age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
-  - &server_ha1 age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
-  - &server_http-proxy age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
-  - &server_ca age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
-  - &server_monitoring01 age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
-  - &server_jelly01 age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
-  - &server_nix-cache01 age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
-  - &server_pgdb1 age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
-  - &server_nats1 age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
-creation_rules:
-  - path_regex: secrets/[^/]+\.(yaml|json|env|ini)
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_ns1
-        - *server_ns2
-        - *server_ha1
-        - *server_http-proxy
-        - *server_ca
-        - *server_monitoring01
-        - *server_jelly01
-        - *server_nix-cache01
-        - *server_pgdb1
-        - *server_nats1
-  - path_regex: secrets/ca/[^/]+\.(yaml|json|env|ini|)
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_ca
-  - path_regex: secrets/monitoring01/[^/]+\.(yaml|json|env|ini)
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_monitoring01
-  - path_regex: secrets/ca/keys/.+
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_ca
-  - path_regex: secrets/nix-cache01/.+
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_nix-cache01
-  - path_regex: secrets/http-proxy/.+
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_http-proxy
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -35,6 +35,10 @@ nix build .#create-host

 Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.

+### SSH Commands
+
+Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.
+
 ### Testing Feature Branches on Hosts

 All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
@@ -61,25 +65,45 @@ Do not run `nix flake update`. Should only be done manually by user.
 ### Development Environment

 ```bash
-# Enter development shell (provides ansible, python3)
+# Enter development shell
 nix develop
 ```

+The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.
+
+**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
+```bash
+# Good - works regardless of current shell
+nix develop -c tofu plan
+
+# Avoid - requires user to be in devshell
+tofu plan
+```
+
+**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
+```bash
+# Good - uses -chdir option
+nix develop -c tofu -chdir=terraform plan
+nix develop -c tofu -chdir=terraform/vault apply
+
+# Avoid - changing directories
+cd terraform && tofu plan
+```
+
 ### Secrets Management

 Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
 `vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
 Terraform manages the secrets and AppRole policies in `terraform/vault/`.

-Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
-`.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
-
 ### Git Workflow

 **Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.

 **Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.

+**Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.
+
 When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).

 ### Plan Management
@@ -132,67 +156,16 @@ Two MCP servers are available for searching NixOS options and packages:

 This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.

-### Lab Monitoring Log Queries
+### Lab Monitoring

-The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
+The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:

-**Loki Label Reference:**
+- Available Prometheus jobs and exporters
+- Loki labels and LogQL query syntax
+- Bootstrap log monitoring for new VMs
+- Common troubleshooting workflows

- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
-
-Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
-
-**Example LogQL queries:**
-```
-# Logs from a specific service on a host
-{host="ns2", systemd_unit="nsd.service"}
-
-# Substring match on log content
-{host="ns1", systemd_unit="nsd.service"} |= "error"
-
-# File-based logs (e.g., caddy access logs)
-{job="varlog", hostname="nix-cache01"}
-```
-
-Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
-
-### Lab Monitoring Prometheus Queries
-
-The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.
-
-**Prometheus Job Names:**
-
- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics
- `jellyfin` - Media server metrics
- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
- `step-ca` - Internal CA metrics
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
-
-**Example PromQL queries:**
-```
-# Check all targets are up
-up
-
-# CPU usage for a specific host
-rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
-
-# Memory usage across all hosts
-node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
-
-# Disk space
-node_filesystem_avail_bytes{mountpoint="/"}
-```
+The skill contains up-to-date information about all scrape targets, host labels, and example queries.

 ### Deploying to Test Hosts

@@ -229,6 +202,21 @@ deploy(role="vault", action="switch")

 **Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.

+**Deploying to Prod Hosts:**
+
+The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:
+
+```bash
+nix develop -c homelab-deploy -- deploy \
+  --nats-url nats://nats1.home.2rjus.net:4222 \
+  --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
+  --branch <branch-name> \
+  --action switch \
+  deploy.prod.<hostname>
+```
+
+Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
+
 **Verifying Deployments:**

 After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
@@ -248,10 +236,11 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
 - `/system/` - Shared system-level configurations applied to ALL hosts
-  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
+  - Core modules: nix.nix, sshd.nix, vault-secrets.nix, acme.nix, autoupgrade.nix
+  - Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
  - Monitoring: node-exporter and promtail on every host
 - `/modules/` - Custom NixOS modules
-  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
+  - `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
 - `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,14 +248,14 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
-  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
+  - `vault/` - OpenBao (Vault) secrets server
+  - `actions-runner/` - GitHub Actions runner
+  - `http-proxy/`, `postgres/`, `nats/`, `jellyfin/`, etc.
 - `/common/` - Shared configurations (e.g., VM guest agent)
 - `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
 - `/playbooks/` - Ansible playbooks for fleet management
- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)

 ### Configuration Inheritance

@@ -283,7 +272,7 @@ All hosts automatically get:
 - Nix binary cache (nix-cache.home.2rjus.net)
 - SSH with root login enabled
 - OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (ca.home.2rjus.net)
+- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
 - Daily auto-upgrades with auto-reboot
 - Prometheus node-exporter + Promtail (logs to monitoring01)
 - Monitoring scrape target auto-registration via `homelab.monitoring` options
@@ -292,28 +281,31 @@ All hosts automatically get:

 ### Active Hosts

-Production servers managed by `rebuild-all.sh`:
+Production servers:
 - `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `ca` - Internal Certificate Authority
+- `vault01` - OpenBao (Vault) secrets server + PKI CA
 - `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
 - `http-proxy` - Reverse proxy
 - `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
 - `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server
+- `nix-cache01` - Binary cache server + GitHub Actions runner
 - `pgdb1` - PostgreSQL database
 - `nats1` - NATS messaging server

-Template/test hosts:
- `template1` - Base template for cloning new hosts
+Test/staging hosts:
+- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
+
+Template hosts:
+- `template1`, `template2` - Base templates for cloning new hosts

 ### Flake Inputs

 - `nixpkgs` - NixOS 25.11 stable (primary)
 - `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management (legacy, only used by ca)
+- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
+- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
 - Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
-  - `labmon` - Lab monitoring

 ### Network Architecture

@@ -337,11 +329,6 @@ Most hosts use OpenBao (Vault) for secrets:
 - Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
 - Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`

-Legacy SOPS (only used by `ca` host):
- SOPS with age encryption, keys in `.sops.yaml`
- Shared secrets: `/secrets/secrets.yaml`
- Per-host secrets: `/secrets/<hostname>/`
-
 ### Auto-Upgrade System

 All hosts pull updates daily from:
@@ -402,9 +389,21 @@ Example VM deployment includes:
 - Custom CPU/memory/disk sizing
 - VLAN tagging
 - QEMU guest agent
+- Automatic Vault credential provisioning via `vault_wrapped_token`

 OpenTofu outputs the VM's IP address after deployment for easy SSH access.

+**Automatic Vault Credential Provisioning:**
+
+VMs can receive Vault (OpenBao) credentials automatically during bootstrap:
+
+1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
+2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
+3. The bootstrap script unwraps the token to obtain AppRole credentials
+4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild
+
+This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.
+
 #### Template Rebuilding and Terraform State

 When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
@@ -435,20 +434,11 @@ This means:

 ### Adding a New Host

-1. Create `/hosts/<hostname>/` directory
-2. Copy structure from `template1` or similar host
-3. Add host entry to `flake.nix` nixosConfigurations
-4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
-5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
-6. Add `vault.enable = true;` to the host configuration
-7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
-8. Run `tofu apply` in `terraform/vault/`
-9. User clones template host
-10. User runs `prepare-host.sh` on new host
-11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
-12. Commit changes, and merge to master.
-13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
-14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
+See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:
+- Using the `create-host` script to generate host configurations
+- Deploying VMs and secrets with OpenTofu
+- Monitoring the bootstrap process via Loki
+- Verification and troubleshooting steps

 **Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.

@@ -484,11 +474,7 @@ Prometheus scrape targets are automatically generated from host configurations,
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

-Host monitoring options (`homelab.monitoring.*`):
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
-
-Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
+Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.

 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.

@@ -507,13 +493,30 @@ DNS zone entries are automatically generated from host configurations:
 - **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
 - **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)

-Host DNS options (`homelab.dns.*`):
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
-
 Hosts are automatically excluded from DNS if:
 - `homelab.dns.enable = false` (e.g., template hosts)
 - No static IP configured (e.g., DHCP-only hosts)
 - Network interface is a VPN/tunnel (wg*, tun*, tap*)

 To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
+
+### Homelab Module Options
+
+The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.
+
+**Host options (`homelab.host.*`):**
+- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
+- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
+- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
+- `labels` - Free-form key-value metadata for host categorization
+
+**DNS options (`homelab.dns.*`):**
+- `enable` (default: `true`) - Include host in DNS zone generation
+- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
+
+**Monitoring options (`homelab.monitoring.*`):**
+- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
+- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host
+
+**Deploy options (`homelab.deploy.*`):**
+- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
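+
+As a rough sketch, a host configuration might combine these options as shown here. The option names come from the descriptions above; every value (role, labels, CNAME, scrape job) is a hypothetical placeholder, and the `job_name`/`port` field names are assumed rather than confirmed by this section:
+
+```nix
+{
+  # Host metadata used for deployment tiering and alerting
+  homelab.host = {
+    tier = "test";          # or "prod"
+    priority = "low";       # or "high"
+    role = "dns";           # example role
+    labels.dns_role = "primary";
+  };
+
+  # Extra CNAME pointing at this host (hypothetical alias)
+  homelab.dns.cnames = [ "example-alias" ];
+
+  # Additional scrape target exposed by this host (hypothetical job and port)
+  homelab.monitoring.scrapeTargets = [
+    { job_name = "example-service"; port = 9100; }
+  ];
+
+  # Opt in to the NATS deployment listener
+  homelab.deploy.enable = true;
+}
+```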
--- a/README.md
+++ b/README.md
@@ -13,7 +13,6 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
 | `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
 | `jelly01` | Jellyfin media server |
 | `nix-cache01` | Nix binary cache |
-| `pgdb1` | PostgreSQL |
 | `nats1` | NATS messaging |
 | `vault01` | OpenBao (Vault) secrets management |
 | `template1`, `template2` | VM templates for cloning new hosts |
--- a/common/ssh-audit.nix
+++ b/common/ssh-audit.nix
@@ -0,0 +1,21 @@
+# SSH session command auditing
+#
+# Logs all commands executed by users who logged in interactively (SSH).
+# System services and nix builds are excluded via auid filter.
+#
+# Logs are sent to journald and forwarded to Loki via promtail.
+# Query with: {host="<hostname>"} |= "EXECVE"
+{
+  # Enable Linux audit subsystem
+  security.audit.enable = true;
+  security.auditd.enable = true;
+
+  # Log execve syscalls only from interactive login sessions
+  # auid!=4294967295 means "audit login uid is set" (excludes system services, nix builds)
+  security.audit.rules = [
+    "-a exit,always -F arch=b64 -S execve -F auid!=4294967295"
+  ];
+
+  # Forward audit logs to journald (so promtail ships them to Loki)
+  services.journald.audit = true;
+}
--- a/docs/host-creation.md
+++ b/docs/host-creation.md
@@ -0,0 +1,217 @@
+# Host Creation Pipeline
+
+This document describes the process for creating new hosts in the homelab infrastructure.
+
+## Overview
+
+We use the `create-host` script to create new hosts; it generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot from a template image (built from `hosts/template2`), which starts a bootstrap process. This process applies the host's NixOS configuration and then reboots into the new config.
+
+## Prerequisites
+
+All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.
+
+```bash
+nix develop
+```
+
+## Steps
+
+Steps marked with **USER** must be performed by the user due to credential requirements.
+
+1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
+2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
+3. Add any secrets needed to `terraform/vault/`
+4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
+5. Push configuration to master (or the branch specified by `flake_branch`)
+6. **USER**: Apply terraform:
+   ```bash
+   nix develop -c tofu -chdir=terraform/vault apply
+   nix develop -c tofu -chdir=terraform apply
+   ```
+7. Once terraform completes, a VM boots in Proxmox using the template image
+8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
+9. After reboot, the host should be operational
+10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
+11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets
+
+## Tier Specification
+
+New hosts should set `homelab.host.tier` in their configuration:
+
+```nix
+homelab.host.tier = "test";  # or "prod"
+```
+
+- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
+- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
+
+## Observability
+
+During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
+
+```
+{job="bootstrap", host="<hostname>"}
+```
+
+### Bootstrap Stages
+
+The bootstrap process reports these stages via the `stage` label:
+
+| Stage | Message | Meaning |
+|-------|---------|---------|
+| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
+| `network_ok` | Network connectivity confirmed | Can reach git server |
+| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
+| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
+| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
+| `building` | Starting nixos-rebuild boot | NixOS build starting |
+| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
+| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
+
+### Useful Queries
+
+```
+# All bootstrap activity for a host
+{job="bootstrap", host="myhost"}
+
+# Track all failures
+{job="bootstrap", stage="failed"}
+
+# Monitor builds in progress
+{job="bootstrap", stage=~"building|success"}
+```
+
+Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
+
+## Verification
+
+1. Check bootstrap completed successfully:
+   ```
+   {job="bootstrap", host="<hostname>", stage="success"}
+   ```
+
+2. Verify the host is up and reporting metrics:
+   ```promql
+   up{instance=~"<hostname>.*"}
+   ```
+
+3. Verify the correct flake revision is deployed:
+   ```promql
+   nixos_flake_info{instance=~"<hostname>.*"}
+   ```
+
+4. Check logs are flowing:
+   ```
+   {host="<hostname>"}
+   ```
+
+5. Confirm expected services are running and producing logs
+
+## Troubleshooting
+
+### Bootstrap Failed
+
+#### Common Issues
+
+* The VM has trouble running the initial `nixos-rebuild`. This usually happens when it has to compile packages from scratch because they are not available in our local nix-cache.
+
+#### Troubleshooting
+
+1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
+   ```
+   {job="bootstrap", host="<hostname>"}
+   ```
+
+2. **USER**: SSH into the host and check the bootstrap service:
+   ```bash
+   ssh root@<hostname>
+   journalctl -u nixos-bootstrap.service
+   ```
+
+3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
+   ```bash
+   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
+   ```
+
+4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
+
+### Vault Credentials Not Working
+
+Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or already been used.
+
+#### Troubleshooting
+
+1. Check if credentials exist on the host:
+   ```bash
+   ssh root@<hostname>
+   ls -la /var/lib/vault/approle/
+   ```
+
+2. Check bootstrap logs for vault-related stages:
+   ```
+   {job="bootstrap", host="<hostname>", stage=~"vault.*"}
+   ```
+
+3. **USER**: Regenerate and provision credentials manually:
+   ```bash
+   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
+   ```
+
+### Host Not Appearing in DNS
+
+Usually caused by ns1/ns2 not yet running the commit that adds the new host.
+
+#### Troubleshooting
+
+1. Verify the host config has a static IP configured in `systemd.network.networks`
+
+2. Check that `homelab.dns.enable` is not set to `false`
+
+3. **USER**: Trigger auto-upgrade on DNS servers:
+   ```bash
+   ssh root@ns1 systemctl start nixos-upgrade.service
+   ssh root@ns2 systemctl start nixos-upgrade.service
+   ```
+
+4. Verify DNS resolution after upgrade completes:
+   ```bash
+   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
+   ```
+
+### Host Not Being Scraped by Prometheus
+
+Usually caused by monitoring01 not yet running the commit that adds the new host.
+
+#### Troubleshooting
+
+1. Check that `homelab.monitoring.enable` is not set to `false`
+
+2. **USER**: Trigger auto-upgrade on monitoring01:
+   ```bash
+   ssh root@monitoring01 systemctl start nixos-upgrade.service
+   ```
+
+3. Verify the target appears in Prometheus:
+   ```promql
+   up{instance=~"<hostname>.*"}
+   ```
+
+4. If the target is down, check that node-exporter is running on the host:
+   ```bash
+   ssh root@<hostname> systemctl status prometheus-node-exporter.service
+   ```
+
+## Related Files
+
+| Path | Description |
+|------|-------------|
+| `scripts/create-host/` | The `create-host` script that generates host configurations |
+| `hosts/template2/` | Template VM configuration (base image for new VMs) |
+| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
+| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
+| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
+| `terraform/vault/approle.tf` | AppRole policies for each host |
+| `terraform/vault/secrets.tf` | Secret definitions in Vault |
+| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
+| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
+| `flake.nix` | Flake with all host configurations (add new hosts here) |
--- a/docs/plans/auth-system-replacement.md
+++ b/docs/plans/auth-system-replacement.md
@@ -2,7 +2,7 @@

 ## Overview

-Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
+Deploy a modern, unified authentication solution for the homelab. Provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.

 ## Goals

@@ -11,66 +11,9 @@ Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authe
 3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
 4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)

-## Options Evaluated
+## Solution: Kanidm

-### OpenLDAP (raw)
-
- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
- **Pros:** Most widely supported, very flexible
- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
- **Verdict:** Doesn't address LDAP complexity concerns
-
-### LLDAP + Authelia (current)
-
- **NixOS Support:** Both have good modules
- **Pros:** Already configured, lightweight, nice web UIs
- **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
- **Verdict:** Workable but has friction for NAS/UID goals
-
-### FreeIPA
-
- **NixOS Support:** None
- **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
- **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
- **Verdict:** Overkill, no NixOS support
-
-### Keycloak
-
- **NixOS Support:** None
- **Pros:** Good OIDC/SAML, nice UI
- **Cons:** Primarily an identity broker not a user directory, poor POSIX support, heavy (Java)
- **Verdict:** Wrong tool for Linux user management
-
-### Authentik
-
- **NixOS Support:** None (would need Docker)
- **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
- **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
- **Verdict:** Would work but requires Docker and is heavy
-
-### Kanidm
-
- **NixOS Support:** Excellent - first-class module with PAM/NSS integration
- **Pros:**
-  - Native PAM/NSS module (no SSSD needed)
-  - Built-in OIDC provider
-  - Optional LDAP interface for legacy services
-  - Declarative provisioning via NixOS (users, groups, OAuth2 clients)
-  - Modern, written in Rust
-  - Single service handles everything
- **Cons:** Newer project, smaller community than LDAP
- **Verdict:** Best fit for requirements
-
-### Pocket-ID
-
- **NixOS Support:** Unknown
- **Pros:** Very lightweight, passkey-first
- **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
- **Verdict:** Doesn't solve Linux user management goal
-
-## Recommendation: Kanidm
-
-Kanidm is the recommended solution for the following reasons:
+Kanidm was chosen for the following reasons:

 | Requirement | Kanidm Support |
 |-------------|----------------|
@@ -82,42 +25,10 @@ Kanidm is the recommended solution for the following reasons:
 | Simplicity | Modern API, LDAP optional |
 | NixOS integration | First-class |

-### Key NixOS Features
+### Configuration Files

-**Server configuration:**
-```nix
-services.kanidm.enableServer = true;
-services.kanidm.serverSettings = {
-  domain = "home.2rjus.net";
-  origin = "https://auth.home.2rjus.net";
-  ldapbindaddress = "0.0.0.0:636";  # Optional LDAP interface
-};
-```
-
-**Declarative user provisioning:**
-```nix
-services.kanidm.provision.enable = true;
-services.kanidm.provision.persons.torjus = {
-  displayName = "Torjus";
-  groups = [ "admins" "nas-users" ];
-};
-```
-
-**Declarative OAuth2 clients:**
-```nix
-services.kanidm.provision.systems.oauth2.grafana = {
-  displayName = "Grafana";
-  originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
-  originLanding = "https://grafana.home.2rjus.net";
-};
-```
-
-**Client host configuration (add to system/):**
-```nix
-services.kanidm.enableClient = true;
-services.kanidm.enablePam = true;
-services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
-```
+- **Host configuration:** `hosts/kanidm01/`
+- **Service module:** `services/kanidm/default.nix`

 ## NAS Integration

@@ -148,42 +59,103 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti

 ## Implementation Steps

-1. **Create Kanidm service module** in `services/kanidm/`
-   - Server configuration
-   - TLS via internal ACME
-   - Vault secrets for admin passwords
+1. **Create kanidm01 host and service module** ✅
+   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
+   - Service module: `services/kanidm/`
+   - TLS via internal ACME (`auth.home.2rjus.net`)
+   - Vault integration for idm_admin password
+   - LDAPS on port 636

-2. **Configure declarative provisioning**
-   - Define initial users and groups
-   - Set up POSIX attributes (UID/GID ranges)
+2. **Configure provisioning** ✅
+   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
+   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
+   - POSIX attributes enabled (UID/GID range 65,536-69,999)

-3. **Add OIDC clients** for homelab services
-   - Grafana
-   - Other services as needed
-
-4. **Create client module** in `system/` for PAM/NSS
-   - Enable on all hosts that need central auth
-   - Configure trusted CA
-
-5. **Test NAS integration**
+3. **Test NAS integration** (in progress)
+   - ✅ LDAP interface verified working
   - Configure TrueNAS LDAP client to connect to Kanidm
   - Verify UID/GID mapping works with NFS shares

-6. **Migrate auth01**
-   - Remove LLDAP and Authelia services
-   - Deploy Kanidm
-   - Update DNS CNAMEs if needed
+4. **Add OIDC clients** for homelab services
+   - Grafana
+   - Other services as needed

-7. **Documentation**
-   - User management procedures
-   - Adding new OAuth2 clients
-   - Troubleshooting PAM/NSS issues
+5. **Create client module** in `system/` for PAM/NSS ✅
+   - Module: `system/kanidm-client.nix`
+   - `homelab.kanidm.enable = true` enables PAM/NSS
+   - Short usernames (not SPN format)
+   - Home directory symlinks via `home_alias`
+   - Enabled on test tier: testvm01, testvm02, testvm03

-## Open Questions
+6. **Documentation** ✅
+   - `docs/user-management.md` - CLI workflows, troubleshooting
+   - User/group creation procedures verified working

- What UID/GID range should be reserved for Kanidm-managed users?
- Which hosts should have PAM/NSS enabled initially?
- What OAuth2 clients are needed at launch?
+## Progress
+
+### Completed (2026-02-08)
+
+**Kanidm server deployed on kanidm01 (test tier):**
+- Host: `kanidm01.home.2rjus.net` (10.69.13.23)
+- WebUI: `https://auth.home.2rjus.net`
+- LDAPS: port 636
+- Valid certificate from internal CA
+
+**Configuration:**
+- Kanidm 1.8 with secret provisioning support
+- Daily backups at 22:00 (7 versions retained)
+- Vault integration for idm_admin password
+- Prometheus monitoring scrape target configured
+
+**Provisioned entities:**
+- Groups: `admins`, `users`, `ssh-users` (declarative)
+- Users managed via CLI (imperative)
+
+**Verified working:**
+- WebUI login with idm_admin
+- LDAP bind and search with POSIX-enabled user
+- LDAPS with valid internal CA certificate
+
+### Completed (2026-02-08) - PAM/NSS Client
+
+**Client module deployed (`system/kanidm-client.nix`):**
+- `homelab.kanidm.enable = true` enables PAM/NSS integration
+- Connects to auth.home.2rjus.net
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based dir)
+- Login restricted to `ssh-users` group
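+
+A minimal sketch of opting a host into the client module, assuming the option is set in the host's `configuration.nix`. Only `homelab.kanidm.enable` and `homelab.host.tier` are taken from this plan; the placement is an assumption:
+
+```nix
+{
+  # Test-tier host, matching the current rollout (testvm01-03)
+  homelab.host.tier = "test";
+
+  # Enables Kanidm PAM/NSS integration via system/kanidm-client.nix
+  homelab.kanidm.enable = true;
+}
+```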
+
+**Enabled on test tier:**
+- testvm01, testvm02, testvm03
+
+**Verified working:**
+- User/group resolution via `getent`
+- SSH login with Kanidm unix passwords
+- Home directory creation with symlinks
+- Imperative user/group creation via CLI
+
+**Documentation:**
+- `docs/user-management.md` with full CLI workflows
+- Password requirements (min 10 chars)
+- Troubleshooting guide (nscd, cache invalidation)
+
+### UID/GID Range (Resolved)
+
+**Range: 65,536 - 69,999** (manually allocated)
+
+- Users: 65,536 - 67,999 (up to ~2500 users)
+- Groups: 68,000 - 69,999 (up to ~2000 groups)
+
+Rationale:
+- Starts at Kanidm's recommended minimum (65,536)
+- Well above NixOS system users (typically <1000)
+- Avoids Podman/container issues with very high GIDs
+
+### Next Steps
+
+1. Enable PAM/NSS on production hosts (after test tier validation)
+2. Configure TrueNAS LDAP client for NAS integration testing
+3. Add OAuth2 clients (Grafana first)

 ## References

--- a/docs/plans/cert-monitoring.md
+++ b/docs/plans/cert-monitoring.md
@@ -0,0 +1,72 @@
+# Certificate Monitoring Plan
+
+## Summary
+
+This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.
+
+## What Was Removed
+
+### labmon Service
+
+The `labmon` service was a custom Go application that provided:
+
+1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
+2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration
+
+The service exposed Prometheus metrics at `:9969` including:
+- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
+- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
+- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration
+
+### Affected Files
+
+- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
+- `services/monitoring/prometheus.nix` - Removed labmon scrape target
+- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
+- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
+- `services/monitoring/default.nix` - Removed alloy.nix import
+
+### Removed Alerts
+
+- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
+- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
+- `certificate_check_error` - Warned when TLS connection check failed
+- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates
+
+## Why It Was Removed
+
+1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
+2. **Outdated codebase**: labmon was a custom tool that required maintenance
+3. **Limited value**: With ACME auto-renewal, certificates should renew automatically
+
+## Current State
+
+ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.
+
+## Future Needs
+
+While ACME handles renewal automatically, we should consider monitoring for:
+
+1. **ACME renewal failures**: Alert when a certificate fails to renew
+   - Could monitor ACME client logs (via Loki queries)
+   - Could check certificate file modification times
+
+2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures
+
+3. **Certificate transparency**: Monitor for unexpected certificate issuance
+
+### Potential Solutions
+
+1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
+   - `probe_ssl_earliest_cert_expiry` metric
+   - Already a standard tool, well-maintained
+
+2. **Custom Loki alerting**: Query ACME service logs for renewal failures
+   - Works with existing infrastructure
+   - No additional services needed
+
+3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics
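+
+A hedged sketch of what the blackbox_exporter option could look like as an alert rule. It assumes blackbox_exporter is already probing the TLS endpoints and uses the NixOS `services.prometheus.rules` option; the repository keeps its rules in `services/monitoring/rules.yml`, so the placement, the `for` duration, and the severity label are illustrative only. The 24-hour threshold mirrors the removed `certificate_expiring_soon` alert:
+
+```nix
+{
+  services.prometheus.rules = [
+    ''
+      groups:
+        - name: certificate_rules
+          rules:
+            - alert: certificate_expiring_soon
+              # Fires when the earliest certificate in the probed chain
+              # expires in less than 24 hours (86400 seconds)
+              expr: probe_ssl_earliest_cert_expiry - time() < 86400
+              for: 10m
+              labels:
+                severity: warning
+    ''
+  ];
+}
+```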
+
+## Status
+
+**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.
--- a/docs/plans/completed/automated-host-deployment-pipeline.md
+++ b/docs/plans/completed/automated-host-deployment-pipeline.md
--- a/docs/plans/completed/bootstrap-cache.md
+++ b/docs/plans/completed/bootstrap-cache.md
@@ -0,0 +1,35 @@
+# Plan: Configure Template2 to Use Nix Cache
+
+## Problem
+
+New VMs bootstrapped from template2 don't use our local nix cache (nix-cache.home.2rjus.net) during the initial `nixos-rebuild boot`. This means the first build downloads everything from cache.nixos.org, which is slower and uses more bandwidth.
+
+## Solution
+
+Update the template2 base image to include the nix cache configuration, so new VMs immediately benefit from cached builds during bootstrap.
+
+## Implementation
+
+1. Add nix cache configuration to `hosts/template2/configuration.nix`:
+   ```nix
+   nix.settings = {
+     substituters = [ "https://nix-cache.home.2rjus.net" "https://cache.nixos.org" ];
+     trusted-public-keys = [
+       "nix-cache.home.2rjus.net:..."  # Add the cache's public key
+       "cache.nixos.org-1:..."
+     ];
+   };
+   ```
+
+2. Rebuild and redeploy the Proxmox template:
+   ```bash
+   nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
+   ```
+
+3. Update `default_template_name` in `terraform/variables.tf` if the template name changed
+
+## Benefits
+
+- Faster VM bootstrap times
+- Reduced bandwidth to external cache
+- Most derivations will already be cached from other hosts
--- a/docs/plans/completed/nats-deploy-service.md
+++ b/docs/plans/completed/nats-deploy-service.md
--- a/docs/plans/completed/ns1-recreation.md
+++ b/docs/plans/completed/ns1-recreation.md
@@ -0,0 +1,107 @@
+# ns1 Recreation Plan
+
+## Overview
+
+Recreate ns1 using the OpenTofu workflow after the existing VM entered emergency mode due to an incorrect `hardware-configuration.nix` (hardcoded UUIDs that don't match the actual disk layout).
+
+## Current ns1 Configuration to Preserve
+
+- **IP:** 10.69.13.5/24
+- **Gateway:** 10.69.13.1
+- **Role:** Primary DNS (authoritative + resolver)
+- **Services:**
+  - `../../services/ns/master-authorative.nix`
+  - `../../services/ns/resolver.nix`
+- **Metadata:**
+  - `homelab.host.role = "dns"`
+  - `homelab.host.labels.dns_role = "primary"`
+- **Vault:** enabled
+- **Deploy:** enabled
+
+## Execution Steps
+
+### Phase 1: Remove Old Configuration
+
+```bash
+nix develop -c create-host --remove --hostname ns1 --force
+```
+
+This removes:
+- `hosts/ns1/` directory
+- Entry from `flake.nix`
+- Any terraform entries (none exist currently)
+
+### Phase 2: Create New Configuration
+
+```bash
+nix develop -c create-host --hostname ns1 --ip 10.69.13.5/24
+```
+
+This creates:
+- `hosts/ns1/` with template2-based configuration
+- Entry in `flake.nix`
+- Entry in `terraform/vms.tf`
+- Vault wrapped token for bootstrap
+
+### Phase 3: Customize Configuration
+
+After create-host, manually update `hosts/ns1/configuration.nix` to add:
+
+1. DNS service imports:
+   ```nix
+   ../../services/ns/master-authorative.nix
+   ../../services/ns/resolver.nix
+   ```
+
+2. Host metadata:
+   ```nix
+   homelab.host = {
+     tier = "prod";
+     role = "dns";
+     labels.dns_role = "primary";
+   };
+   ```
+
+3. Disable resolved (conflicts with Unbound):
+   ```nix
+   services.resolved.enable = false;
+   ```
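+
+Taken together, the three additions above might look like this in `hosts/ns1/configuration.nix`. This is a sketch only: it assumes the generated file uses a standard NixOS `imports` list, and it omits whatever `create-host` already generated from template2.
+
+```nix
+{ ... }:
+{
+  imports = [
+    # Assumed to sit alongside the imports generated by create-host.
+    ../../services/ns/master-authorative.nix
+    ../../services/ns/resolver.nix
+  ];
+
+  homelab.host = {
+    tier = "prod";
+    role = "dns";
+    labels.dns_role = "primary";
+  };
+
+  # systemd-resolved conflicts with Unbound.
+  services.resolved.enable = false;
+}
+```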
+
+### Phase 4: Commit Changes
+
+```bash
+git add -A
+git commit -m "ns1: recreate with OpenTofu workflow
+
+Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
+that didn't match actual disk layout, causing boot failure.
+
+Recreated using template2-based configuration for OpenTofu provisioning."
+```
+
+### Phase 5: Infrastructure
+
+1. Delete old ns1 VM in Proxmox (it's broken anyway)
+2. Run `nix develop -c tofu -chdir=terraform apply`
+3. Wait for bootstrap to complete
+4. Verify ns1 is functional:
+   - DNS resolution working
+   - Zone transfer to ns2 working
+   - All exporters responding
+
+### Phase 6: Finalize
+
+- Push to master
+- Move this plan to `docs/plans/completed/`
+
+## Rollback
+
+If the new VM fails:
+1. ns2 is still operational as secondary DNS
+2. Can recreate with different settings if needed
+
+## Notes
+
+- ns2 will continue serving DNS during the migration
+- Zone data is generated from flake, so no data loss
+- The old VM's disk can be kept briefly in Proxmox as backup if desired
--- a/docs/plans/completed/prometheus-scrape-target-labels.md
+++ b/docs/plans/completed/prometheus-scrape-target-labels.md
@@ -1,10 +1,38 @@
 # Prometheus Scrape Target Labels

+## Implementation Status
+
+| Step | Status | Notes |
+|------|--------|-------|
+| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
+| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
+| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
+| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
+| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
+| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
+| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |
+
+**Hosts with metadata configured:**
+- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
+- `nix-cache01`: `role = "build-host"`
+- `vault01`: `role = "vault"`
+- `testvm01/02/03`: `tier = "test"`
+
+**Implementation complete.** Branch: `prometheus-scrape-target-labels`
+
+**Query examples:**
+- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
+- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
+- `up{role="dns"}` - all DNS servers
+- `up{tier="test"}` - all test-tier hosts
+
+---
+
 ## Goal

 Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

-**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
+**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.

 ## Motivation

@@ -54,12 +82,11 @@ or

 ## Implementation

-This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
+This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.

 ### 1. Create `homelab.host` module

-**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
-`modules/homelab/host.nix` with tier, priority, role, and labels options.
+✅ **Complete.** The module is in `modules/homelab/host.nix`.

 Create `modules/homelab/host.nix` with shared host metadata options:

@@ -98,6 +125,8 @@ Import this module in `modules/homelab/default.nix`.

 ### 2. Update `lib/monitoring.nix`

+✅ **Complete.** Labels are now extracted and propagated.
+
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:

@@ -126,6 +155,8 @@ This requires grouping hosts by their label attrset and producing one `static_co

 ### 3. Update `services/monitoring/prometheus.nix`

+✅ **Complete.** Now uses structured static_configs output.
+
 Change the node-exporter scrape config to use the new structured output:

 ```nix
@@ -138,36 +169,37 @@ static_configs = nodeExporterTargets;

 ### 4. Set metadata on hosts

+✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
+
 Example in `hosts/nix-cache01/configuration.nix`:

 ```nix
 homelab.host = {
-  tier = "test";       # can be deployed by MCP (used by homelab-deploy)
  priority = "low";    # relaxed alerting thresholds
  role = "build-host";
 };
 ```

+**Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.
+
 Example in `hosts/ns1/configuration.nix`:

 ```nix
 homelab.host = {
-  tier = "prod";
-  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
 };
 ```

+**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
+
 ### 5. Update alert rules

-After implementing labels, review and update `services/monitoring/rules.yml`:
+✅ **Complete.** Updated `services/monitoring/rules.yml`:

- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.
+- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
+- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).

-Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
+### 6. Labels for `generateScrapeConfigs` (service targets)

-### 6. Consider labels for `generateScrapeConfigs` (service targets)
-
-The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
+✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
--- a/docs/plans/host-migration-to-opentofu.md
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -9,24 +9,23 @@ hosts are decommissioned or deferred.

 ## Current State

-Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
+Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`

 Hosts to migrate:

 | Host | Category | Notes |
 |------|----------|-------|
-| ns1 | Stateless | Primary DNS, recreate |
-| ns2 | Stateless | Secondary DNS, recreate |
+| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
 | nix-cache01 | Stateless | Binary cache, recreate |
 | http-proxy | Stateless | Reverse proxy, recreate |
 | nats1 | Stateless | Messaging, recreate |
-| auth01 | Decommission | No longer in use |
 | ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
 | monitoring01 | Stateful | Prometheus, Grafana, Loki |
 | jelly01 | Stateful | Jellyfin metadata, watch history, config |
-| pgdb1 | Stateful | PostgreSQL databases |
-| jump | Decommission | No longer needed |
-| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
+| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
+| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
+| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
+| ~~ca~~ | ~~Deferred~~ | ✓ Complete |

 ## Phase 1: Backup Preparation

@@ -46,39 +45,19 @@ No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` whi
 Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
 The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.

-### 1c. Add PostgreSQL Backup to pgdb1
-
-No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
-all databases and roles. The dump should be piped through restic's stdin backup (similar to
-the Grafana DB dump pattern on monitoring01).
-
-### 1d. Verify Existing ha1 Backup
+### 1c. Verify Existing ha1 Backup

 ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
 these backups are current and restorable before proceeding with migration.

-### 1e. Verify All Backups
+### 1d. Verify All Backups

 After adding/expanding backup jobs:
 1. Trigger a manual backup run on each host
 2. Verify backup integrity with `restic check`
 3. Test a restore to a temporary location to confirm data is recoverable

-## Phase 2: Declare pgdb1 Databases in Nix
-
-Before migrating pgdb1, audit the manually-created databases and users on the running
-instance, then declare them in the Nix configuration using `ensureDatabases` and
-`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
-
-Steps:
-1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
-2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
-3. Document any non-default PostgreSQL settings or extensions per database
-
-After reprovisioning, the databases will be created by NixOS, and data restored from the
-`pg_dumpall` backup.
-
-## Phase 3: Stateless Host Migration
+## Phase 2: Stateless Host Migration

 These hosts have no meaningful state and can be recreated fresh. For each host:

@@ -95,13 +74,14 @@ Migrate stateless hosts in an order that minimizes disruption:

 1. **nix-cache01** — low risk, no downstream dependencies during migration
 2. **nats1** — low risk, verify no persistent JetStream streams first
-4. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
-5. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
+3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
+4. ~~**ns1** — ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete

-For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
-use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
+~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ Both ns1
+and ns2 migration complete. Zone transfer (AXFR) verified working between ns1 (primary) and
+ns2 (secondary).

-## Phase 4: Stateful Host Migration
+## Phase 3: Stateful Host Migration

 For each stateful host, the procedure is:

@@ -114,17 +94,7 @@ For each stateful host, the procedure is:
 7. Start services and verify functionality
 8. Decommission the old VM

-### 4a. pgdb1
-
-1. Run final `pg_dumpall` backup via restic
-2. Stop PostgreSQL on the old host
-3. Provision new pgdb1 via OpenTofu
-4. After bootstrap, NixOS creates the declared databases/users
-5. Restore data with `pg_restore` or `psql < dumpall.sql`
-6. Verify database connectivity from gunter (`10.69.30.105`)
-7. Decommission old VM
-
-### 4b. monitoring01
+### 3a. monitoring01

 1. Run final Grafana backup
 2. Provision new monitoring01 via OpenTofu
@@ -134,7 +104,7 @@ For each stateful host, the procedure is:
 6. Verify all scrape targets are being collected
 7. Decommission old VM

-### 4c. jelly01
+### 3b. jelly01

 1. Run final Jellyfin backup
 2. Provision new jelly01 via OpenTofu
@@ -143,7 +113,7 @@ For each stateful host, the procedure is:
 5. Start Jellyfin, verify watch history and library metadata are present
 6. Decommission old VM

-### 4d. ha1
+### 3c. ha1

 1. Verify latest restic backup is current
 2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
@@ -167,47 +137,69 @@ OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropr
 `usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
 through before starting Zigbee2MQTT on the new host.

-## Phase 5: Decommission jump and auth01 Hosts
+## Phase 4: Decommission Hosts

-### jump
-1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
-2. Remove host configuration from `hosts/jump/`
-3. Remove from `flake.nix`
-4. Remove any secrets in `secrets/jump/`
-5. Remove from `.sops.yaml`
+### jump ✓ COMPLETE
+
+~~1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)~~
+~~2. Remove host configuration from `hosts/jump/`~~
+~~3. Remove from `flake.nix`~~
+~~4. Remove any secrets in `secrets/jump/`~~
+~~5. Remove from `.sops.yaml`~~
+~~6. Destroy the VM in Proxmox~~
+~~7. Commit cleanup~~
+
+Host was already removed from flake.nix and VM destroyed. Configuration cleaned up in ba9f47f.
+
+### auth01 ✓ COMPLETE
+
+~~1. Remove host configuration from `hosts/auth01/`~~
+~~2. Remove from `flake.nix`~~
+~~3. Remove any secrets in `secrets/auth01/`~~
+~~4. Remove from `.sops.yaml`~~
+~~5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)~~
+~~6. Destroy the VM in Proxmox~~
+~~7. Commit cleanup~~
+
+Host configuration, services, and VM already removed.
+
+### pgdb1 (in progress)
+
+Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
+
+1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
+2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
+3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
+4. ~~Remove from `flake.nix`~~ ✓
+5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
 6. Destroy the VM in Proxmox
-7. Commit cleanup
+7. ~~Commit cleanup~~ ✓

-### auth01
-1. Remove host configuration from `hosts/auth01/`
-2. Remove from `flake.nix`
-3. Remove any secrets in `secrets/auth01/`
-4. Remove from `.sops.yaml`
-5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
-6. Destroy the VM in Proxmox
-7. Commit cleanup
+See `docs/plans/pgdb1-decommission.md` for detailed plan.

-## Phase 6: Decommission ca Host (Deferred)
+## Phase 5: Decommission ca Host ✓ COMPLETE

-Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
+~~Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
 OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
-the same cleanup steps as the jump host.
+the same cleanup steps as the jump host.~~

-## Phase 7: Remove sops-nix
+PKI migration to OpenBao complete. Host configuration, `services/ca/`, and VM removed.

-Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
-all remnants:
- `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
-  `hosts/template2/scripts.nix`)
+## Phase 6: Remove sops-nix ✓ COMPLETE

-See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
+~~Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
+all remnants:~~
+~~- `sops-nix` input from `flake.nix` and `flake.lock`~~
+~~- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`~~
+~~- `inherit sops-nix` from all specialArgs in `flake.nix`~~
+~~- `system/sops.nix` and its import in `system/default.nix`~~
+~~- `.sops.yaml`~~
+~~- `secrets/` directory~~
+~~- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`~~
+~~- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
+  `hosts/template2/scripts.nix`)~~
+
+All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migration.md` for context.

 ## Notes

@@ -216,7 +208,7 @@ See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
 - The old VMs use IPs that the new VMs need, so the old VM must be shut down before
  the new one is provisioned (or use a temporary IP and swap after verification)
 - Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only hosts not in OpenTofu will be ca (deferred)
+- All decommissioned hosts (jump, auth01, ca) have already been removed
 - Since many hosts are being recreated, this is a good opportunity to establish consistent
  hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
  (e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
--- a/docs/plans/memory-issues-follow-up.md
+++ b/docs/plans/memory-issues-follow-up.md
@@ -0,0 +1,116 @@
+# Memory Issues Follow-up
+
+Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
+
+## Background
+
+On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
+
+Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
+
+## Fix Applied
+
+**Commit:** `1674b6a` - system: enable zram swap for all hosts
+
+**Merged:** 2026-02-08 ~12:15 UTC
+
+**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
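+
+A minimal sketch of what `system/zram.nix` could contain. Only `zramSwap.enable = true` is confirmed above; the `memoryPercent` value is an assumption chosen to match the "~2GB" figure on a 2GB VM, since the NixOS default of 50 would size the device at roughly half of RAM.
+
+```nix
+{ ... }:
+{
+  zramSwap = {
+    enable = true;
+    # Assumption: size the zram device at 100% of RAM (~2GB on a 2GB VM).
+    memoryPercent = 100;
+  };
+}
+```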
+
+## Timeline
+
+| Time (UTC) | Event |
+|------------|-------|
+| 05:00:46 | ns2 nixos-upgrade OOM killed |
+| 05:01:47 | `nixos_upgrade_failed` alert fired |
+| 12:15 | zram commit merged to master |
+| 12:19 | ns2 rebooted with zram enabled |
+| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
+
+## Hosts Affected
+
+All 2GB VMs that run nixos-upgrade:
+- ns1, ns2 (DNS)
+- vault01
+- testvm01, testvm02, testvm03
+- kanidm01
+
+## Metrics to Monitor
+
+Check these in Grafana or via PromQL to verify the fix:
+
+### Swap availability (should be ~2GB after upgrade)
+```promql
+node_memory_SwapTotal_bytes / 1024 / 1024
+```
+
+### Swap usage during upgrades
+```promql
+(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
+```
+
+### Zswap compressed bytes (only non-zero when zswap is enabled; zram usage shows up as swap usage)
+```promql
+node_memory_Zswap_bytes / 1024 / 1024
+```
+
+### Upgrade failures (should be 0)
+```promql
+node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
+```
+
+### Memory available during upgrades
+```promql
+node_memory_MemAvailable_bytes / 1024 / 1024
+```
+
+## Verification Steps
+
+After a few days (allow auto-upgrades to run on all hosts):
+
+1. Check all hosts have swap enabled:
+   ```promql
+   node_memory_SwapTotal_bytes > 0
+   ```
+
+2. Check for any upgrade failures since the fix:
+   ```promql
+   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
+   ```
+
+3. Review if any hosts used swap during upgrades (check historical graphs)
+
+## Success Criteria
+
+- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
+- All hosts show ~2GB swap available
+- Upgrades complete successfully on 2GB VMs
+
+## Fallback Options
+
+If zram is insufficient:
+
+1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
+2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
+3. **Use remote builds** - Configure `nix.buildMachines` to offload builds; evaluation still runs locally, so this alone may not prevent an evaluation OOM
+4. **Reduce flake size** - Split configurations to reduce evaluation memory
+
+### Memory Ballooning
+
+Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
+
+Configuration in `terraform/vms.tf`:
+```hcl
+memory  = 4096  # maximum memory
+balloon = 2048  # minimum memory (shrinks to this when idle)
+```
+
+Pros:
+- VMs get memory on-demand without reboots
+- Better host memory utilization
+- Solves upgrade OOM without permanently allocating 4GB
+
+Cons:
+- Requires the virtio balloon driver in the guest (the QEMU guest agent alone does not provide ballooning)
+- Guest can experience memory pressure if host is overcommitted
+
+Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -0,0 +1,219 @@
+# Monitoring Stack Migration to VictoriaMetrics
+
+## Overview
+
+Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
+and longer retention. Run in parallel with monitoring01 until validated, then switch over using
+a `monitoring` CNAME for seamless transition.
+
+## Current State
+
+**monitoring01** (10.69.13.13):
+- 4 CPU cores, 4GB RAM, 33GB disk
+- Prometheus with 30-day retention (15s scrape interval)
+- Alertmanager (routes to alerttonotify webhook)
+- Grafana (dashboards, datasources)
+- Loki (log aggregation from all hosts via Promtail)
+- Tempo (distributed tracing)
+- Pyroscope (continuous profiling)
+
+**Hardcoded References to monitoring01:**
+- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
+- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
+- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
+
+**Auto-generated:**
+- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
+- Node-exporter targets (from all hosts with static IPs)
+
+## Decision: VictoriaMetrics
+
+Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
+- Single binary replacement for Prometheus
+- 5-10x better compression (30 days could become 180+ days in same space)
+- Same PromQL query language (Grafana dashboards work unchanged)
+- Same scrape config format (existing auto-generated configs work)
+
+If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
+
+## Architecture
+
+```
+                     ┌─────────────────┐
+                     │  monitoring02   │
+                     │  VictoriaMetrics│
+                     │  + Grafana      │
+     monitoring      │  + Loki         │
+     CNAME ──────────│  + Tempo        │
+                     │  + Pyroscope    │
+                     │  + Alertmanager │
+                     │  (vmalert)      │
+                     └─────────────────┘
+                            ▲
+                            │ scrapes
+            ┌───────────────┼───────────────┐
+            │               │               │
+       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
+       │  ns1    │    │  ha1     │    │  ...     │
+       │ :9100   │    │ :9100    │    │ :9100    │
+       └─────────┘    └──────────┘    └──────────┘
+```
+
+## Implementation Plan
+
+### Phase 1: Create monitoring02 Host
+
+Use the `create-host` script, which handles `flake.nix` and `terraform/vms.tf` automatically.
+
+1. **Run create-host**: `nix develop -c create-host --hostname monitoring02 --ip 10.69.13.24/24`
+2. **Update VM resources** in `terraform/vms.tf`:
+   - 4 cores (same as monitoring01)
+   - 8GB RAM (double, for VictoriaMetrics headroom)
+   - 100GB disk (for 3+ months retention with compression)
+3. **Update host configuration**: Import monitoring services
+4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
+
+### Phase 2: Set Up VictoriaMetrics Stack
+
+Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
+Prometheus config. Once validated, this can replace the Prometheus module.
+
+1. **VictoriaMetrics** (port 8428):
+   - `services.victoriametrics.enable = true`
+   - `services.victoriametrics.retentionPeriod = "90d"` (about 3 months; VictoriaMetrics rejects an `m` suffix for months, so use days or a bare number; increase later based on disk usage)
+   - Migrate scrape configs via `prometheusConfig`
+   - Use native push support (replaces Pushgateway)
+
+2. **vmalert** for alerting rules:
+   - `services.vmalert.enable = true`
+   - Point to VictoriaMetrics for metrics evaluation
+   - Keep rules in separate `rules.yml` file (same format as Prometheus)
+   - No receiver configured during parallel operation (prevents duplicate alerts)
+
+3. **Alertmanager** (port 9093):
+   - Keep existing configuration (alerttonotify webhook routing)
+   - Only enable receiver after cutover from monitoring01
+
+4. **Loki** (port 3100):
+   - Same configuration as current
+
+5. **Grafana** (port 3000):
+   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
+   - Reference existing dashboards on monitoring01 for content inspiration
+   - Configure VictoriaMetrics datasource (port 8428)
+   - Configure Loki datasource
+
+6. **Tempo** (ports 3200, 3201):
+   - Same configuration
+
+7. **Pyroscope** (port 4040):
+   - Same Docker-based deployment
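+
+The datasource items in step 5 could be provisioned declaratively along these lines. This is a sketch, not the final module: the datasource names are placeholders, and it assumes VictoriaMetrics is queried through Grafana's stock Prometheus datasource type, with both backends on localhost.
+
+```nix
+{ ... }:
+{
+  services.grafana.provision = {
+    enable = true;
+    datasources.settings.datasources = [
+      {
+        name = "VictoriaMetrics"; # placeholder name
+        type = "prometheus"; # VictoriaMetrics serves the Prometheus query API
+        url = "http://localhost:8428";
+      }
+      {
+        name = "Loki"; # placeholder name
+        type = "loki";
+        url = "http://localhost:3100";
+      }
+    ];
+  };
+}
+```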
+
+### Phase 3: Parallel Operation
+
+Run both monitoring01 and monitoring02 simultaneously:
+
+1. **Dual scraping**: Both hosts scrape the same targets
+   - Validates VictoriaMetrics is collecting data correctly
+
+2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
+   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
+
+3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
+
+4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
+
+5. **Compare resource usage**: Monitor disk/memory consumption between hosts
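+
+For the dual log shipping in step 2, a second Promtail client might be added roughly as follows. This sketch assumes `system/monitoring/logs.nix` configures Promtail through `services.promtail.configuration`, that the new host resolves as `monitoring02.home.2rjus.net`, and that both Loki instances use the standard push path. The list is meant to replace the existing single-entry client list.
+
+```nix
+{ ... }:
+{
+  services.promtail.configuration.clients = [
+    # Existing target
+    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
+    # Assumed hostname for the new host
+    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
+  ];
+}
+```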
+
+### Phase 4: Add monitoring CNAME
+
+Add CNAME to monitoring02 once validated:
+
+```nix
+# hosts/monitoring02/configuration.nix
+homelab.dns.cnames = [ "monitoring" ];
+```
+
+This creates `monitoring.home.2rjus.net` pointing to monitoring02.
+
+### Phase 5: Update References
+
+Update hardcoded references to use the CNAME:
+
+1. **system/monitoring/logs.nix**:
+   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
+
+2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
+   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
+   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
+   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
+   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
+
+Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
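+
+A hedged sketch of one updated backend in `services/http-proxy/proxy.nix`, assuming the proxy is declared via `services.caddy.virtualHosts`. The real file may be structured differently, and the other three hosts would follow the same pattern with their own ports.
+
+```nix
+{ ... }:
+{
+  # Backend now targets the CNAME instead of monitoring01.
+  services.caddy.virtualHosts."grafana.home.2rjus.net".extraConfig = ''
+    reverse_proxy http://monitoring.home.2rjus.net:3000
+  '';
+}
+```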
+
+### Phase 6: Enable Alerting
+
+Once ready to cut over:
+1. Enable Alertmanager receiver on monitoring02
+2. Verify test alerts route correctly
+
+### Phase 7: Cutover and Decommission
+
+1. **Stop monitoring01**: Prevent duplicate alerts during transition
+2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
+3. **Verify all targets scraped**: Check VictoriaMetrics UI
+4. **Verify logs flowing**: Check Loki on monitoring02
+5. **Decommission monitoring01**:
+   - Remove from flake.nix
+   - Remove host configuration
+   - Destroy VM in Proxmox
+   - Remove from terraform state
+
+## Open Questions
+
+- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
+- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
+
+## VictoriaMetrics Service Configuration
+
+Example NixOS configuration for monitoring02:
+
+```nix
+# VictoriaMetrics replaces Prometheus
+services.victoriametrics = {
+  enable = true;
+  retentionPeriod = "90d";  # ~3 months (an `m` suffix is not accepted for months); increase based on disk usage
+  prometheusConfig = {
+    global.scrape_interval = "15s";
+    scrape_configs = [
+      # Auto-generated node-exporter targets
+      # Service-specific scrape targets
+      # External targets
+    ];
+  };
+};
+
+# vmalert for alerting rules (no receiver during parallel operation)
+services.vmalert = {
+  enable = true;
+  datasource.url = "http://localhost:8428";
+  # notifier.alertmanager.url = "http://localhost:9093";  # Enable after cutover
+  rule = [ ./rules.yml ];
+};
+```
+
+## Rollback Plan
+
+If issues arise after cutover:
+1. Move `monitoring` CNAME back to monitoring01
+2. Restart monitoring01 services
+3. Revert Promtail config to point only to monitoring01
+4. Revert http-proxy backends
+
+## Notes
+
+- VictoriaMetrics uses port 8428 vs Prometheus 9090
+- PromQL compatibility is excellent
+- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
+- monitoring02 deployed via OpenTofu using `create-host` script
+- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
--- a/docs/plans/nix-cache-reprovision.md
+++ b/docs/plans/nix-cache-reprovision.md
@@ -0,0 +1,212 @@
+# Nix Cache Host Reprovision
+
+## Overview
+
+Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
+1. NATS-based remote build triggering (replacing the current bash script)
+2. Safer flake update workflow that validates builds before pushing to master
+
+## Current State
+
+### Host Configuration
+- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
+- Runs Gitea Actions runner for CI workflows
+- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
+- Uses a dedicated XFS volume at `/nix` for cache storage
+
+### Current Build System (`services/nix-cache/build-flakes.sh`)
+- Runs every 30 minutes via systemd timer
+- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
+- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
+- Pushes success/failure metrics to pushgateway
+- Simple but has no filtering, no parallelism, no remote triggering
+
+### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
+- Runs daily at midnight via cron
+- Runs `nix flake update --commit-lock-file`
+- Pushes directly to master
+- No build validation — can push broken inputs
+
+## Improvement 1: NATS-Based Remote Build Triggering
+
+### Design
+
+Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.
+
+| Approach | Pros | Cons |
+|----------|------|------|
+| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
+| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
+| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |
+
+**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.
+
+### Implementation
+
+1. Add new message type to homelab-deploy: `build.<host>` subject
+2. Listener on nix-cache01 subscribes to `build.>` wildcard
+3. On message receipt, builds the specified host and returns success/failure
+4. CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`
+
+### Benefits
+- Trigger rebuild for specific host to ensure it's cached
+- Could be called from CI after merging PRs
+- Reuses existing NATS infrastructure and auth
+- Progress/status could stream back via NATS reply
+
+## Improvement 2: Smarter Flake Update Workflow
+
+### Current Problems
+1. Updates can push breaking changes to master
+2. No visibility into what broke when it does
+3. Hosts that auto-update can pull broken configs
+
+### Proposed Workflow
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    Flake Update Workflow                         │
+├─────────────────────────────────────────────────────────────────┤
+│  1. nix flake update (on feature branch)                        │
+│  2. Build ALL hosts locally                                      │
+│  3. If all pass → fast-forward merge to master                  │
+│  4. If any fail → create PR with failure logs attached          │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Implementation Options
+
+| Option | Description | Pros | Cons |
+|--------|-------------|------|------|
+| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
+| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
+| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |
+
+**Recommendation:** Option A with nix-cache01 as the runner. The host is already running Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.
+
+### Workflow Steps
+
+1. Workflow runs on schedule (daily or weekly)
+2. Creates branch `flake-update/YYYY-MM-DD`
+3. Runs `nix flake update --commit-lock-file`
+4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
+5. If all succeed:
+   - Fast-forward merge to master
+   - Delete feature branch
+6. If any fail:
+   - Create PR from the update branch
+   - Attach build logs as PR comment
+   - Label PR with `needs-review` or `build-failure`
+   - Do NOT merge automatically
+
+### Workflow File Changes
+
+```yaml
+# New: .github/workflows/flake-update-safe.yaml
+name: Safe flake update
+on:
+  schedule:
+    - cron: "0 2 * * 0"  # Weekly on Sunday at 2 AM
+  workflow_dispatch:  # Manual trigger
+
+jobs:
+  update-and-validate:
+    runs-on: homelab  # Use self-hosted runner on nix-cache01
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: master
+          fetch-depth: 0  # Need full history for merge
+
+      - name: Create update branch
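+        run: |
+          BRANCH="flake-update/$(date +%Y-%m-%d)"
+          git checkout -b "$BRANCH"
+          # Export for later steps; shell variables do not persist between steps
+          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"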
+
+      - name: Update flake
+        run: nix flake update --commit-lock-file
+
+      - name: Build all hosts
+        id: build
+        run: |
+          # Ensure a failed build is not masked by tee's exit status
+          set -o pipefail
+          FAILED=""
+          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
+            echo "Building $host..."
+            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
+              FAILED="$FAILED $host"
+            fi
+          done
+          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"
+
+      - name: Merge to master (if all pass)
+        if: steps.build.outputs.failed == ''
+        run: |
+          git checkout master
+          git merge --ff-only "$BRANCH"
+          git push origin master
+          # The update branch was never pushed, so only the local branch needs deleting
+          git branch -d "$BRANCH"
+
+      - name: Create PR (if any fail)
+        if: steps.build.outputs.failed != ''
+        run: |
+          git push origin "$BRANCH"
+          # Create PR via Gitea API with build logs
+          # ... (PR creation with log attachment)
+```
+
+## Migration Steps
+
+### Phase 1: Reprovision Host via OpenTofu
+
+1. Add `nix-cache01` to `terraform/vms.tf`:
+   ```hcl
+   "nix-cache01" = {
+     ip        = "10.69.13.15/24"
+     cpu_cores = 4
+     memory    = 8192
+     disk_size = "100G"  # Larger for nix store
+   }
+   ```
+
+2. Shut down existing nix-cache01 VM
+3. Run `tofu apply` to provision new VM
+4. Verify bootstrap completes and cache is serving
+
+**Note:** The cache will be cold after reprovision. Run initial builds to populate.
+
+### Phase 2: Add Build Triggering to homelab-deploy
+
+1. Add `build` command to homelab-deploy CLI
+2. Add listener handler in NixOS module for `build.<host>` subjects (subscribed via the `build.>` wildcard)
+3. Update nix-cache01 config to enable build listener
+4. Test with `homelab-deploy build testvm01`
+
+### Phase 3: Implement Safe Flake Update Workflow
+
+1. Create `.github/workflows/flake-update-safe.yaml`
+2. Disable or remove old `flake-update.yaml`
+3. Test manually with `workflow_dispatch`
+4. Monitor first automated run
+
+### Phase 4: Remove Old Build Script
+
+1. After new workflow is stable, remove:
+   - `services/nix-cache/build-flakes.nix`
+   - `services/nix-cache/build-flakes.sh`
+2. The new workflow handles scheduled builds
+
+## Open Questions
+
+- [ ] What runner labels should the self-hosted runner use for the update workflow?
+- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
+- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
+- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
+- [ ] What to do about `gunter` (external nixos repo) - include in validation?
+- [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?
+
+## Notes
+
+- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
+- The Harmonia service and cache signing key will work the same after reprovision
+- Actions runner token is in Vault, will be provisioned automatically
+- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering
--- a/docs/plans/pgdb1-decommission.md
+++ b/docs/plans/pgdb1-decommission.md
@@ -0,0 +1,113 @@
+# pgdb1 Decommissioning Plan
+
+## Overview
+
+Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.
+
+## Pre-flight Verification
+
+Before proceeding, verify that gunter is no longer using pgdb1:
+
+1. Check Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
+2. Optionally, check pgdb1 for currently open client connections (`pg_stat_activity` lists only live sessions, not history):
+   ```bash
+   ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;"'
+   ```
+
+## Files to Remove
+
+### Host Configuration
+- `hosts/pgdb1/default.nix`
+- `hosts/pgdb1/configuration.nix`
+- `hosts/pgdb1/hardware-configuration.nix`
+- `hosts/pgdb1/` (directory)
+
+### Service Module
+- `services/postgres/postgres.nix`
+- `services/postgres/default.nix`
+- `services/postgres/` (directory)
+
+Note: This service module is only used by pgdb1, so it can be removed entirely.
+
+### Flake Entry
+Remove from `flake.nix` (lines 131-138):
+```nix
+pgdb1 = nixpkgs.lib.nixosSystem {
+  inherit system;
+  specialArgs = {
+    inherit inputs self;
+  };
+  modules = commonModules ++ [
+    ./hosts/pgdb1
+  ];
+};
+```
+
+### Vault AppRole
+Remove from `terraform/vault/approle.tf` (lines 69-73):
+```hcl
+"pgdb1" = {
+  paths = [
+    "secret/data/hosts/pgdb1/*",
+  ]
+}
+```
+
+### Monitoring Rules
+Remove from `services/monitoring/rules.yml` the `postgres_down` alert (lines 359-365):
+```yaml
+- name: postgres_rules
+  rules:
+    - alert: postgres_down
+      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
+      for: 5m
+      labels:
+        severity: critical
+```
+
+### Utility Scripts
+Delete `rebuild-all.sh` entirely (obsolete script).
+
+## Execution Steps
+
+### Phase 1: Verification
+- [ ] Confirm Open WebUI on gunter uses local PostgreSQL
+- [ ] Verify no active connections to pgdb1
+
+### Phase 2: Code Cleanup
+- [ ] Create feature branch: `git checkout -b decommission-pgdb1`
+- [ ] Remove `hosts/pgdb1/` directory
+- [ ] Remove `services/postgres/` directory
+- [ ] Remove pgdb1 entry from `flake.nix`
+- [ ] Remove postgres alert from `services/monitoring/rules.yml`
+- [ ] Delete `rebuild-all.sh` (obsolete)
+- [ ] Run `nix flake check` to verify no broken references
+- [ ] Commit changes
+
+### Phase 3: Terraform Cleanup
+- [ ] Remove pgdb1 from `terraform/vault/approle.tf`
+- [ ] Run `tofu plan` in `terraform/vault/` to preview changes
+- [ ] Run `tofu apply` to remove the AppRole
+- [ ] Commit terraform changes
+
+### Phase 4: Infrastructure Cleanup
+- [ ] Shut down pgdb1 VM in Proxmox
+- [ ] Delete the VM from Proxmox
+- [ ] (Optional) Remove any DNS entries if not auto-generated
+
+### Phase 5: Finalize
+- [ ] Merge feature branch to master
+- [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove DNS entry
+- [ ] Move this plan to `docs/plans/completed/`
+
+## Rollback
+
+If issues arise after decommissioning:
+1. The host configuration can be restored from git history and the VM recreated from the template
+2. Database data would need to be restored from backup (if any exists)
+
+## Notes
+
+- pgdb1 IP: 10.69.13.16
+- The postgres service allowed connections from gunter (10.69.30.105)
+- No restic backup was configured for this host
--- a/docs/plans/security-hardening.md
+++ b/docs/plans/security-hardening.md
@@ -0,0 +1,224 @@
+# Security Hardening Plan
+
+## Overview
+
+Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
+
+## Current State
+
+- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
+- Firewall disabled on all hosts (`networking.firewall.enable = false`)
+- Promtail ships logs over HTTP to Loki
+- Loki has no authentication (`auth_enabled = false`)
+- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
+- Vault TLS verification disabled by default (`skipTlsVerify = true`)
+- Audit logging exists (`common/ssh-audit.nix`) but not applied globally
+- Alert rules focus on availability, no security event detection
+
+## Priority Matrix
+
+| Issue | Severity | Effort | Priority |
+|-------|----------|--------|----------|
+| SSH password auth | High | Low | **P1** |
+| Firewall disabled | High | Medium | **P1** |
+| Promtail HTTP (no TLS) | High | Medium | **P2** |
+| No security alerting | Medium | Low | **P2** |
+| Audit logging not global | Low | Low | **P2** |
+| Loki no auth | Medium | Medium | **P3** |
+| Secret-ID TTL | Medium | Medium | **P3** |
+| Vault skipTlsVerify | Medium | Low | **P3** |
+
+## Phase 1: Quick Wins (P1)
+
+### 1.1 SSH Hardening
+
+Edit `system/sshd.nix`:
+
+```nix
+services.openssh = {
+  enable = true;
+  settings = {
+    PermitRootLogin = "prohibit-password";  # Key-only root login
+    PasswordAuthentication = false;
+    KbdInteractiveAuthentication = false;
+  };
+};
+```
+
+**Prerequisite:** Verify all hosts have SSH keys deployed for root.
+
+### 1.2 Enable Firewall
+
+Create `system/firewall.nix` with default deny policy:
+
+```nix
+{ ... }: {
+  networking.firewall.enable = true;
+
+  # Use openssh's built-in firewall integration
+  services.openssh.openFirewall = true;
+}
+```
+
+**Useful firewall options:**
+
+| Option | Description |
+|--------|-------------|
+| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
+| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
+| `networking.firewall.extraInputRules` | Custom nftables rules for complex filtering (requires `networking.nftables.enable = true`) |
+
+**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
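+
+A minimal sketch of such a restriction, assuming the nftables-based firewall is acceptable on the host and that SSH listens on the default port 22 (neither is confirmed by the current configuration):
+
+```nix
+{ ... }: {
+  # extraInputRules is only honored by the nftables-based firewall
+  networking.nftables.enable = true;
+  networking.firewall.enable = true;
+
+  # Stop openssh from opening port 22 to every source address
+  services.openssh.openFirewall = false;
+
+  # Accept SSH only from the infrastructure subnet
+  networking.firewall.extraInputRules = ''
+    ip saddr 10.69.13.0/24 tcp dport 22 accept
+  '';
+}
+```
+
+Hosts that must be reachable over SSH from other networks (for example via WireGuard) would need additional rules, so this should be validated on the test tier first.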
+
+#### Per-Interface Rules (http-proxy WireGuard)
+
+The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
+
+```nix
+# Example: http-proxy with different rules per interface
+networking.firewall = {
+  enable = true;
+
+  # Default: only SSH (via openFirewall)
+  allowedTCPPorts = [ ];
+
+  # LAN interface: allow HTTP/HTTPS
+  interfaces.ens18 = {
+    allowedTCPPorts = [ 80 443 ];
+  };
+
+  # WireGuard interface: restrict to specific services or trust fully
+  interfaces.wg0 = {
+    allowedTCPPorts = [ 80 443 ];
+    # Or drop this block and set networking.firewall.trustedInterfaces = [ "wg0" ] if fully trusted
+  };
+};
+```
+
+**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
+
+Then per-host, open required ports:
+
+| Host | Additional Ports |
+|------|------------------|
+| ns1/ns2 | 53 (TCP/UDP) |
+| vault01 | 8200 |
+| monitoring01 | 3100, 9090, 3000, 9093 |
+| http-proxy | 80, 443 |
+| nats1 | 4222 |
+| ha1 | 1883, 8123 |
+| jelly01 | 8096 |
+| nix-cache01 | 5000 |
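+
+As a sketch, the DNS servers could open their ports in their own host configuration. The file location (for example `hosts/ns1/configuration.nix`) is an assumption; a shared service module would work equally well:
+
+```nix
+{ ... }: {
+  # DNS needs both TCP and UDP on port 53
+  networking.firewall = {
+    allowedTCPPorts = [ 53 ];
+    allowedUDPPorts = [ 53 ];
+  };
+}
+```
+
+Port lists defined in several modules are merged, so this composes with the SSH rule opened by `services.openssh.openFirewall`.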
+
+## Phase 2: Logging & Detection (P2)
+
+### 2.1 Enable TLS for Promtail → Loki
+
+Update `system/monitoring/logs.nix`:
+
+```nix
+clients = [{
+  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
+  tls_config = {
+    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
+  };
+}];
+```
+
+Requires:
+- Configure Loki with TLS certificate (use internal ACME)
+- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)
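+
+On the Loki side, a rough sketch could look like the following. It assumes Loki is configured through `services.loki.configuration` and that an ACME certificate named `monitoring01.home.2rjus.net` exists under the default NixOS ACME directory; both the certificate name and the paths are illustrative:
+
+```nix
+{ ... }: {
+  services.loki.configuration.server = {
+    # Serve the HTTP API (including /loki/api/v1/push) over TLS
+    http_tls_config = {
+      # Assumed paths: default NixOS ACME layout for a hypothetical cert name
+      cert_file = "/var/lib/acme/monitoring01.home.2rjus.net/fullchain.pem";
+      key_file = "/var/lib/acme/monitoring01.home.2rjus.net/key.pem";
+    };
+  };
+}
+```
+
+The Loki service user would also need read access to the certificate files, which is not shown here.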
+
+### 2.2 Security Alert Rules
+
+Add to `services/monitoring/rules.yml`:
+
+```yaml
+- name: security_rules
+  rules:
+    - alert: ssh_auth_failures
+      expr: increase(node_logind_sessions_total[5m]) > 20
+      for: 0m
+      labels:
+        severity: warning
+      annotations:
+        summary: "Unusual login activity on {{ $labels.instance }}"
+
+    - alert: vault_secret_fetch_failure
+      expr: increase(vault_secret_failures[5m]) > 5
+      for: 0m
+      labels:
+        severity: warning
+      annotations:
+        summary: "Vault secret fetch failures on {{ $labels.instance }}"
+```
+
+Also add Loki-based alerts for:
+- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
+- sudo usage: `{job="systemd-journal"} |= "sudo"`
+
+### 2.3 Global Audit Logging
+
+Import `common/ssh-audit.nix` from `system/default.nix` (the path below is relative to the `system/` directory):
+
+```nix
+imports = [
+  # ... existing imports
+  ../common/ssh-audit.nix
+];
+```
+
+## Phase 3: Defense in Depth (P3)
+
+### 3.1 Loki Authentication
+
+Options:
+1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
+2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
+3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy
+
+Recommendation: Option 1 (reverse proxy) is simplest for homelab.
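+
+A sketch of Option 1, under several assumptions: the virtual host name `loki.home.2rjus.net` and the `promtail` user are hypothetical, the hash is a placeholder, and the `basic_auth` directive name applies to recent Caddy releases (older ones spell it `basicauth`):
+
+```nix
+{ ... }: {
+  # Loki only listens locally; clients go through the proxy
+  services.loki.configuration.server.http_listen_address = "127.0.0.1";
+
+  services.caddy = {
+    enable = true;
+    # Hypothetical host name for the authenticated endpoint
+    virtualHosts."loki.home.2rjus.net".extraConfig = ''
+      basic_auth {
+        # Placeholder: generate the hash with `caddy hash-password`
+        promtail <bcrypt-hash>
+      }
+      reverse_proxy 127.0.0.1:3100
+    '';
+  };
+}
+```
+
+Promtail clients would then need matching `basic_auth` credentials in their client configuration, ideally sourced from Vault rather than the Nix store.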
+
+### 3.2 AppRole Secret Rotation
+
+Update `terraform/vault/approle.tf`:
+
+```hcl
+secret_id_ttl  = 2592000  # 30 days
+```
+
+Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.
+
+### 3.3 Enable Vault TLS Verification
+
+Change default in `system/vault-secrets.nix`:
+
+```nix
+skipTlsVerify = mkOption {
+  type = types.bool;
+  default = false;  # Changed from true
+};
+```
+
+**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.
+
+## Implementation Order
+
+1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
+2. **Validate SSH access** - Ensure key-based login works before disabling passwords
+3. **Document firewall ports** - Create reference of ports per host before enabling
+4. **Phase prod rollout** - Deploy to prod hosts one at a time, verify each
+
+## Open Questions
+
+- [ ] Do all hosts have SSH keys configured for root access?
+- [ ] Should firewall rules be per-host or use a central definition with roles?
+- [ ] Should Loki authentication use the existing Kanidm setup?
+
+**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.
+
+## Notes
+
+- Firewall changes are the highest risk - test thoroughly on test-tier
+- SSH hardening must not lock out access - verify keys first
+- Consider creating a "break glass" procedure for emergency access if keys fail
--- a/docs/user-management.md
+++ b/docs/user-management.md
@@ -0,0 +1,267 @@
+# User Management with Kanidm
+
+Central authentication for the homelab using Kanidm.
+
+## Overview
+
+- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
+- **WebUI**: https://auth.home.2rjus.net
+- **LDAPS**: port 636
+
+## CLI Setup
+
+The `kanidm` CLI is available in the devshell:
+
+```bash
+nix develop
+
+# Login as idm_admin
+kanidm login --name idm_admin --url https://auth.home.2rjus.net
+```
+
+## User Management
+
+POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
+all attributes (including UNIX password) in one workflow.
+
+### Creating a POSIX User
+
+```bash
+# Create the person
+kanidm person create <username> "<Display Name>"
+
+# Add to groups
+kanidm group add-members ssh-users <username>
+
+# Enable POSIX (UID is auto-assigned)
+kanidm person posix set <username>
+
+# Set UNIX password (required for SSH login, min 10 characters)
+kanidm person posix set-password <username>
+
+# Optionally set login shell
+kanidm person posix set <username> --shell /bin/zsh
+```
+
+### Example: Full User Creation
+
+```bash
+kanidm person create testuser "Test User"
+kanidm group add-members ssh-users testuser
+kanidm person posix set testuser
+kanidm person posix set-password testuser
+kanidm person get testuser
+```
+
+After creation, verify on a client host:
+```bash
+getent passwd testuser
+ssh testuser@testvm01.home.2rjus.net
+```
+
+### Viewing User Details
+
+```bash
+kanidm person get <username>
+```
+
+### Removing a User
+
+```bash
+kanidm person delete <username>
+```
+
+## Group Management
+
+Groups for POSIX access are also managed via CLI.
+
+### Creating a POSIX Group
+
+```bash
+# Create the group
+kanidm group create <group-name>
+
+# Enable POSIX with a specific GID
+kanidm group posix set <group-name> --gidnumber <gid>
+```
+
+### Adding Members
+
+```bash
+kanidm group add-members <group-name> <username>
+```
+
+### Viewing Group Details
+
+```bash
+kanidm group get <group-name>
+kanidm group list-members <group-name>
+```
+
+### Example: Full Group Creation
+
+```bash
+kanidm group create testgroup
+kanidm group posix set testgroup --gidnumber 68010
+kanidm group add-members testgroup testuser
+kanidm group get testgroup
+```
+
+After creation, verify on a client host:
+```bash
+getent group testgroup
+```
+
+### Current Groups
+
+| Group | GID | Purpose |
+|-------|-----|---------|
+| ssh-users | 68000 | SSH login access |
+| admins | 68001 | Administrative access |
+| users | 68002 | General users |
+
+### UID/GID Allocation
+
+Kanidm auto-assigns UIDs/GIDs from its configured range. Manually assigned GIDs use a separate reserved range:
+
+| Range | Purpose |
+|-------|---------|
+| 65,536+ | Users (auto-assigned) |
+| 68,000 - 68,999 | Groups (manually assigned) |
+
+## PAM/NSS Client Configuration
+
+Enable central authentication on a host:
+
+```nix
+homelab.kanidm.enable = true;
+```
+
+This configures:
+- `services.kanidm.enablePam = true`
+- Client connection to auth.home.2rjus.net
+- Login authorization for `ssh-users` group
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based directory)
+
+### Enabled Hosts
+
+- testvm01, testvm02, testvm03 (test tier)
+
+### Options
+
+```nix
+homelab.kanidm = {
+  enable = true;
+  server = "https://auth.home.2rjus.net";  # default
+  allowedLoginGroups = [ "ssh-users" ];     # default
+};
+```
+
+### Home Directories
+
+Home directories use UUID-based paths for stability (so renaming a user doesn't
+require moving their home directory). Symlinks provide convenient access:
+
+```
+/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
+```
+
+The symlinks are created by `kanidm-unixd-tasks` on first login.
+
+## Testing
+
+### Verify NSS Resolution
+
+```bash
+# Check user resolution
+getent passwd <username>
+
+# Check group resolution
+getent group <group-name>
+```
+
+### Test SSH Login
+
+```bash
+ssh <username>@<hostname>.home.2rjus.net
+```
+
+## Troubleshooting
+
+### "PAM user mismatch" error
+
+SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
+usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).
+
+**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).
+
+Check current format:
+```bash
+getent passwd torjus
+# Should show: torjus:x:65536:...
+# NOT: torjus@home.2rjus.net:x:65536:...
+```
+
+### User resolves but SSH fails immediately
+
+The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:
+
+```bash
+# Check if group has POSIX
+getent group ssh-users
+
+# If empty, enable POSIX on the server
+kanidm group posix set ssh-users --gidnumber 68000
+```
+
+### User doesn't resolve via getent
+
+1. Check kanidm-unixd service is running:
+   ```bash
+   systemctl status kanidm-unixd
+   ```
+
+2. Check unixd can reach server:
+   ```bash
+   kanidm-unix status
+   # Should show: system: online, Kanidm: online
+   ```
+
+3. Check client can reach server:
+   ```bash
+   curl -s https://auth.home.2rjus.net/status
+   ```
+
+4. Check user has POSIX enabled on server:
+   ```bash
+   kanidm person get <username>
+   ```
+
+5. Restart nscd to clear stale cache:
+   ```bash
+   systemctl restart nscd
+   ```
+
+6. Invalidate kanidm cache:
+   ```bash
+   kanidm-unix cache-invalidate
+   ```
+
+### Changes not taking effect after deployment
+
+NixOS runs nsncd (a non-caching Rust reimplementation of nscd) under the `nscd`
+unit to serve NSS lookups. After deploying kanidm-unixd config changes, you may
+need to restart both services:
+
+```bash
+systemctl restart kanidm-unixd
+systemctl restart nscd
+```
+
+### Test PAM authentication directly
+
+Use the kanidm-unix CLI to test PAM auth without SSH:
+
+```bash
+kanidm-unix auth-test --name <username>
+```
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770447502,
-        "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
+        "lastModified": 1770481834,
+        "narHash": "sha256-Xx9BYnI0C/qgPbwr9nj6NoAdQTbYLunrdbNSaUww9oY=",
        "ref": "master",
-        "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
-        "revCount": 22,
+        "rev": "fd0d63b103dfaf21d1c27363266590e723021c67",
+        "revCount": 24,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
@@ -42,27 +42,6 @@
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      }
    },
-    "labmon": {
-      "inputs": {
-        "nixpkgs": [
-          "nixpkgs-unstable"
-        ]
-      },
-      "locked": {
-        "lastModified": 1748983975,
-        "narHash": "sha256-DA5mOqxwLMj/XLb4hvBU1WtE6cuVej7PjUr8N0EZsCE=",
-        "ref": "master",
-        "rev": "040a73e891a70ff06ec7ab31d7167914129dbf7d",
-        "revCount": 17,
-        "type": "git",
-        "url": "https://git.t-juice.club/torjus/labmon"
-      },
-      "original": {
-        "ref": "master",
-        "type": "git",
-        "url": "https://git.t-juice.club/torjus/labmon"
-      }
-    },
    "nixos-exporter": {
      "inputs": {
        "nixpkgs": [
@@ -119,31 +98,9 @@
      "inputs": {
        "alerttonotify": "alerttonotify",
        "homelab-deploy": "homelab-deploy",
-        "labmon": "labmon",
        "nixos-exporter": "nixos-exporter",
        "nixpkgs": "nixpkgs",
-        "nixpkgs-unstable": "nixpkgs-unstable",
-        "sops-nix": "sops-nix"
-      }
-    },
-    "sops-nix": {
-      "inputs": {
-        "nixpkgs": [
-          "nixpkgs-unstable"
-        ]
-      },
-      "locked": {
-        "lastModified": 1770145881,
-        "narHash": "sha256-ktjWTq+D5MTXQcL9N6cDZXUf9kX8JBLLBLT0ZyOTSYY=",
-        "owner": "Mic92",
-        "repo": "sops-nix",
-        "rev": "17eea6f3816ba6568b8c81db8a4e6ca438b30b7c",
-        "type": "github"
-      },
-      "original": {
-        "owner": "Mic92",
-        "repo": "sops-nix",
-        "type": "github"
+        "nixpkgs-unstable": "nixpkgs-unstable"
      }
    }
  },
--- a/flake.nix
+++ b/flake.nix
@@ -5,18 +5,10 @@
    nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-25.11";
    nixpkgs-unstable.url = "github:nixos/nixpkgs?ref=nixos-unstable";

-    sops-nix = {
-      url = "github:Mic92/sops-nix";
-      inputs.nixpkgs.follows = "nixpkgs-unstable";
-    };
    alerttonotify = {
      url = "git+https://git.t-juice.club/torjus/alerttonotify?ref=master";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
    };
-    labmon = {
-      url = "git+https://git.t-juice.club/torjus/labmon?ref=master";
-      inputs.nixpkgs.follows = "nixpkgs-unstable";
-    };
    nixos-exporter = {
      url = "git+https://git.t-juice.club/torjus/nixos-exporter";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
@@ -32,9 +24,7 @@
      self,
      nixpkgs,
      nixpkgs-unstable,
-      sops-nix,
      alerttonotify,
-      labmon,
      nixos-exporter,
      homelab-deploy,
      ...
@@ -50,7 +40,6 @@
      commonOverlays = [
        overlay-unstable
        alerttonotify.overlays.default
-        labmon.overlays.default
      ];
      # Common modules applied to all hosts
      commonModules = [
@@ -61,7 +50,6 @@
            system.configurationRevision = self.rev or self.dirtyRev or "dirty";
          }
        )
-        sops-nix.nixosModules.sops
        nixos-exporter.nixosModules.default
        homelab-deploy.nixosModules.default
        ./modules/homelab
@@ -77,46 +65,19 @@
    in
    {
      nixosConfigurations = {
-        ns1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/ns1
-          ];
-        };
-        ns2 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/ns2
-          ];
-        };
        ha1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ha1
          ];
        };
-        template1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/template
-          ];
-        };
        template2 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/template2
@@ -125,35 +86,25 @@
        http-proxy = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/http-proxy
          ];
        };
-        ca = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/ca
-          ];
-        };
        monitoring01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/monitoring01
-            labmon.nixosModules.labmon
          ];
        };
        jelly01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/jelly01
@@ -162,25 +113,16 @@
        nix-cache01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/nix-cache01
          ];
        };
-        pgdb1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/pgdb1
-          ];
-        };
        nats1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/nats1
@@ -189,7 +131,7 @@
        vault01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/vault01
@@ -198,7 +140,7 @@
        testvm01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/testvm01
@@ -207,7 +149,7 @@
        testvm02 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/testvm02
@@ -216,12 +158,39 @@
        testvm03 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/testvm03
          ];
        };
+        ns2 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/ns2
+          ];
+        };
+        ns1 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/ns1
+          ];
+        };
+        kanidm01 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/kanidm01
+          ];
+        };
      };
      packages = forAllSystems (
        { pkgs }:
@@ -238,6 +207,7 @@
              pkgs.ansible
              pkgs.opentofu
              pkgs.openbao
+              pkgs.kanidm_1_8
              (pkgs.callPackage ./scripts/create-host { })
              homelab-deploy.packages.${pkgs.system}.default
            ];
--- a/hosts/ca/configuration.nix
+++ b/hosts/ca/configuration.nix
@@ -1,63 +0,0 @@
-{
-  pkgs,
-  ...
-}:
-
-{
-  imports = [
-    ../template/hardware-configuration.nix
-
-    ../../system
-    ../../common/vm
-  ];
-
-  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub = {
-    enable = true;
-    device = "/dev/sda";
-    configurationLimit = 3;
-  };
-
-  networking.hostName = "ca";
-  networking.domain = "home.2rjus.net";
-  networking.useNetworkd = true;
-  networking.useDHCP = false;
-  services.resolved.enable = true;
-  networking.nameservers = [
-    "10.69.13.5"
-    "10.69.13.6"
-  ];
-
-  systemd.network.enable = true;
-  systemd.network.networks."ens18" = {
-    matchConfig.Name = "ens18";
-    address = [
-      "10.69.13.12/24"
-    ];
-    routes = [
-      { Gateway = "10.69.13.1"; }
-    ];
-    linkConfig.RequiredForOnline = "routable";
-  };
-  time.timeZone = "Europe/Oslo";
-
-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
-  nix.settings.tarball-ttl = 0;
-  environment.systemPackages = with pkgs; [
-    vim
-    wget
-    git
-  ];
-
-  # Open ports in the firewall.
-  # networking.firewall.allowedTCPPorts = [ ... ];
-  # networking.firewall.allowedUDPPorts = [ ... ];
-  # Or disable the firewall altogether.
-  networking.firewall.enable = false;
-
-  system.stateVersion = "23.11"; # Did you read the comment?
-}
--- a/hosts/ca/default.nix
+++ b/hosts/ca/default.nix
@@ -1,7 +0,0 @@
-{ ... }:
-{
-  imports = [
-    ./configuration.nix
-    ../../services/ca
-  ];
-}
--- a/hosts/ha1/configuration.nix
+++ b/hosts/ha1/configuration.nix
@@ -7,7 +7,7 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
--- a/hosts/template/hardware-configuration.nix
+++ b/hosts/template/hardware-configuration.nix
--- a/hosts/http-proxy/configuration.nix
+++ b/hosts/http-proxy/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
--- a/hosts/http-proxy/hardware-configuration.nix
+++ b/hosts/http-proxy/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/jelly01/configuration.nix
+++ b/hosts/jelly01/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
@@ -61,9 +61,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  zramSwap = {
-    enable = true;
-  };
+  vault.enable = true;
+  homelab.deploy.enable = true;

  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/jelly01/hardware-configuration.nix
+++ b/hosts/jelly01/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/jump/configuration.nix
+++ b/hosts/jump/configuration.nix
@@ -1,56 +0,0 @@
-{ config, lib, pkgs, ... }:
-
-{
-  imports =
-    [
-      ../template/hardware-configuration.nix
-      ../../system
-    ];
-
-  nixpkgs.config.allowUnfree = true;
-
-  homelab.host.role = "bastion";
-
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
-
-  networking.hostName = "jump";
-  networking.domain = "home.2rjus.net";
-  networking.useNetworkd = true;
-  networking.useDHCP = false;
-  services.resolved.enable = false;
-  networking.nameservers = [
-    "10.69.13.5"
-    "10.69.13.6"
-  ];
-
-  systemd.network.enable = true;
-  systemd.network.networks."ens18" = {
-    matchConfig.Name = "ens18";
-    address = [
-      "10.69.13.10/24"
-    ];
-    routes = [
-      { Gateway = "10.69.13.1"; }
-    ];
-    linkConfig.RequiredForOnline = "routable";
-  };
-  time.timeZone = "Europe/Oslo";
-
-  nix.settings.experimental-features = [ "nix-command" "flakes" ];
-  environment.systemPackages = with pkgs; [
-    vim
-    wget
-    git
-  ];
-
-  # Open ports in the firewall.
-  # networking.firewall.allowedTCPPorts = [ ... ];
-  # networking.firewall.allowedUDPPorts = [ ... ];
-  # Or disable the firewall altogether.
-  networking.firewall.enable = false;
-
-  system.stateVersion = "23.11"; # Did you read the comment?
-}
-
--- a/hosts/jump/hardware-configuration.nix
+++ b/hosts/jump/hardware-configuration.nix
@@ -1,36 +0,0 @@
-{ config, lib, pkgs, modulesPath, ... }:
-
-{
-  imports =
-    [
-      (modulesPath + "/profiles/qemu-guest.nix")
-    ];
-
-  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
-  boot.initrd.kernelModules = [ ];
-  # boot.kernelModules = [ ];
-  # boot.extraModulePackages = [ ];
-
-  fileSystems."/" =
-    {
-      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
-      fsType = "xfs";
-    };
-
-  fileSystems."/boot" =
-    {
-      device = "/dev/disk/by-uuid/BC07-3B7A";
-      fsType = "vfat";
-    };
-
-  swapDevices =
-    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/kanidm01/configuration.nix
+++ b/hosts/kanidm01/configuration.nix
@@ -1,25 +1,39 @@
 {
+  config,
+  lib,
  pkgs,
  ...
 }:

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
+    ../../services/kanidm
  ];

-  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub = {
-    enable = true;
-    device = "/dev/sda";
-    configurationLimit = 3;
+  # Host metadata
+  homelab.host = {
+    tier = "test";
+    role = "auth";
  };

-  networking.hostName = "pgdb1";
+  # DNS CNAME for auth.home.2rjus.net
+  homelab.dns.cnames = [ "auth" ];
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+
+  networking.hostName = "kanidm01";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,7 +47,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.16/24"
+      "10.69.13.23/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -59,5 +73,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/kanidm01/default.nix
+++ b/hosts/kanidm01/default.nix
--- a/hosts/monitoring01/configuration.nix
+++ b/hosts/monitoring01/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
@@ -100,61 +100,6 @@
    ];
  };

-  labmon = {
-    enable = true;
-
-    settings = {
-      ListenAddr = ":9969";
-      Profiling = true;
-      StepMonitors = [
-        {
-          Enabled = true;
-          BaseURL = "https://ca.home.2rjus.net";
-          RootID = "3381bda8015a86b9a3cd1851439d1091890a79005e0f1f7c4301fe4bccc29d80";
-        }
-      ];
-
-      TLSConnectionMonitors = [
-        {
-          Enabled = true;
-          Address = "ca.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "jelly.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "grafana.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "prometheus.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "alertmanager.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "pyroscope.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-      ];
-    };
-  };
-
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
--- a/hosts/monitoring01/hardware-configuration.nix
+++ b/hosts/monitoring01/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/nats1/configuration.nix
+++ b/hosts/nats1/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
@@ -59,5 +59,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

+  vault.enable = true;
+  homelab.deploy.enable = true;
+
  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/nats1/hardware-configuration.nix
+++ b/hosts/nats1/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/nix-cache01/configuration.nix
+++ b/hosts/nix-cache01/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
--- a/hosts/nix-cache01/default.nix
+++ b/hosts/nix-cache01/default.nix
@@ -4,6 +4,5 @@
    ./configuration.nix
    ../../services/nix-cache
    ../../services/actions-runner
-    ./zram.nix
  ];
 }
--- a/hosts/nix-cache01/hardware-configuration.nix
+++ b/hosts/nix-cache01/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/nix-cache01/zram.nix
+++ b/hosts/nix-cache01/zram.nix
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  zramSwap = {
-    enable = true;
-  };
-}
--- a/hosts/ns1/configuration.nix
+++ b/hosts/ns1/configuration.nix
@@ -7,23 +7,38 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix

    ../../system
+    ../../common/vm
+
+    # DNS services
    ../../services/ns/master-authorative.nix
    ../../services/ns/resolver.nix
-    ../../common/vm
  ];

+  # Host metadata
+  homelab.host = {
+    tier = "prod";
+    role = "dns";
+    labels.dns_role = "primary";
+  };
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
+  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "ns1";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
+  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -47,14 +62,6 @@
    "nix-command"
    "flakes"
  ];
-  vault.enable = true;
-  homelab.deploy.enable = true;
-
-  homelab.host = {
-    role = "dns";
-    labels.dns_role = "primary";
-  };
-
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
@@ -68,5 +75,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/ns1/hardware-configuration.nix
+++ b/hosts/ns1/hardware-configuration.nix
@@ -1,36 +0,0 @@
-{ config, lib, pkgs, modulesPath, ... }:
-
-{
-  imports =
-    [
-      (modulesPath + "/profiles/qemu-guest.nix")
-    ];
-
-  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
-  boot.initrd.kernelModules = [ ];
-  # boot.kernelModules = [ ];
-  # boot.extraModulePackages = [ ];
-
-  fileSystems."/" =
-    {
-      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
-      fsType = "xfs";
-    };
-
-  fileSystems."/boot" =
-    {
-      device = "/dev/disk/by-uuid/BC07-3B7A";
-      fsType = "vfat";
-    };
-
-  swapDevices =
-    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/ns2/configuration.nix
+++ b/hosts/ns2/configuration.nix
@@ -7,23 +7,38 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix

    ../../system
+    ../../common/vm
+
+    # DNS services
    ../../services/ns/secondary-authorative.nix
    ../../services/ns/resolver.nix
-    ../../common/vm
  ];

+  # Host metadata
+  homelab.host = {
+    tier = "prod";
+    role = "dns";
+    labels.dns_role = "secondary";
+  };
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
+  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "ns2";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
+  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -47,14 +62,7 @@
    "nix-command"
    "flakes"
  ];
-  vault.enable = true;
-  homelab.deploy.enable = true;
-
-  homelab.host = {
-    role = "dns";
-    labels.dns_role = "secondary";
-  };
-
+  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
    wget
@@ -67,5 +75,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/ns2/hardware-configuration.nix
+++ b/hosts/ns2/hardware-configuration.nix
@@ -1,36 +0,0 @@
-{ config, lib, pkgs, modulesPath, ... }:
-
-{
-  imports =
-    [
-      (modulesPath + "/profiles/qemu-guest.nix")
-    ];
-
-  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
-  boot.initrd.kernelModules = [ ];
-  # boot.kernelModules = [ ];
-  # boot.extraModulePackages = [ ];
-
-  fileSystems."/" =
-    {
-      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
-      fsType = "xfs";
-    };
-
-  fileSystems."/boot" =
-    {
-      device = "/dev/disk/by-uuid/BC07-3B7A";
-      fsType = "vfat";
-    };
-
-  swapDevices =
-    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/pgdb1/default.nix
+++ b/hosts/pgdb1/default.nix
@@ -1,7 +0,0 @@
-{ ... }:
-{
-  imports = [
-    ./configuration.nix
-    ../../services/postgres
-  ];
-}
--- a/hosts/template/configuration.nix
+++ b/hosts/template/configuration.nix
@@ -1,62 +0,0 @@
-{ config, lib, pkgs, ... }:
-
-{
-  imports =
-    [
-      ./hardware-configuration.nix
-
-      ../../system
-    ];
-
-  # Template host - exclude from DNS zone generation
-  homelab.dns.enable = false;
-
-  homelab.host = {
-    tier = "test";
-    priority = "low";
-  };
-
-
-  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
-  networking.hostName = "nixos-template";
-  networking.domain = "home.2rjus.net";
-  networking.useNetworkd = true;
-  networking.useDHCP = false;
-  services.resolved.enable = true;
-  networking.nameservers = [
-    "10.69.13.5"
-    "10.69.13.6"
-  ];
-
-  systemd.network.enable = true;
-  systemd.network.networks."ens18" = {
-    matchConfig.Name = "ens18";
-    address = [
-      "10.69.8.250/24"
-    ];
-    routes = [
-      { Gateway = "10.69.8.1"; }
-    ];
-    linkConfig.RequiredForOnline = "routable";
-  };
-  time.timeZone = "Europe/Oslo";
-
-  nix.settings.experimental-features = [ "nix-command" "flakes" ];
-  nix.settings.tarball-ttl = 0;
-  environment.systemPackages = with pkgs; [
-    age
-    vim
-    wget
-    git
-  ];
-
-  # Open ports in the firewall.
-  # networking.firewall.allowedTCPPorts = [ ... ];
-  # networking.firewall.allowedUDPPorts = [ ... ];
-  # Or disable the firewall altogether.
-  networking.firewall.enable = false;
-
-  system.stateVersion = "23.11"; # Did you read the comment?
-}
-
--- a/hosts/template/default.nix
+++ b/hosts/template/default.nix
@@ -1,7 +0,0 @@
-{ ... }: {
-  imports = [
-    ./hardware-configuration.nix
-    ./configuration.nix
-    ./scripts.nix
-  ];
-}
--- a/hosts/template/scripts.nix
+++ b/hosts/template/scripts.nix
@@ -1,36 +0,0 @@
-{ pkgs, ... }:
-let
-  prepare-host-script = pkgs.writeShellApplication {
-    name = "prepare-host.sh";
-    runtimeInputs = [ pkgs.age ];
-    text = ''
-      echo "Removing machine-id"
-      rm -f /etc/machine-id || true
-
-      echo "Removing SSH host keys"
-      rm -f /etc/ssh/ssh_host_* || true
-
-      echo "Restarting SSH"
-      systemctl restart sshd
-
-      echo "Removing temporary files"
-      rm -rf /tmp/* || true
-
-      echo "Removing logs"
-      journalctl --rotate || true
-      journalctl --vacuum-time=1s || true
-
-      echo "Removing cache"
-      rm -rf /var/cache/* || true
-
-      echo "Generate age key"
-      rm -rf /var/lib/sops-nix || true
-      mkdir -p /var/lib/sops-nix
-      age-keygen -o /var/lib/sops-nix/key.txt
-    '';
-  };
-in
-{
-  environment.systemPackages = [ prepare-host-script ];
-  users.motd = "Prepare host by running 'prepare-host.sh'.";
-}
--- a/hosts/template2/bootstrap.nix
+++ b/hosts/template2/bootstrap.nix
@@ -6,22 +6,72 @@ let
    text = ''
      set -euo pipefail

+      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+
+      # Send a log entry to Loki with bootstrap status
+      # Usage: log_to_loki <stage> <message>
+      # Fails silently if Loki is unreachable
+      log_to_loki() {
+        local stage="$1"
+        local message="$2"
+        local timestamp_ns
+        timestamp_ns="$(date +%s)000000000"
+
+        local payload
+        payload=$(jq -n \
+          --arg host "$HOSTNAME" \
+          --arg stage "$stage" \
+          --arg branch "''${BRANCH:-master}" \
+          --arg ts "$timestamp_ns" \
+          --arg msg "$message" \
+          '{
+            streams: [{
+              stream: {
+                job: "bootstrap",
+                host: $host,
+                stage: $stage,
+                branch: $branch
+              },
+              values: [[$ts, $msg]]
+            }]
+          }')
+
+        curl -s --connect-timeout 2 --max-time 5 \
+          -X POST \
+          -H "Content-Type: application/json" \
+          -d "$payload" \
+          "$LOKI_URL" >/dev/null 2>&1 || true
+      }
+
+      echo "================================================================================"
+      echo "                     NIXOS BOOTSTRAP IN PROGRESS"
+      echo "================================================================================"
+      echo ""
+
      # Read hostname set by cloud-init (from Terraform VM name via user-data)
      # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
      HOSTNAME=$(hostnamectl hostname)
-      echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"
+      # Read git branch from environment, default to master
+      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"

+      echo "Hostname: $HOSTNAME"
+      echo ""
      echo "Starting NixOS bootstrap for host: $HOSTNAME"
+
+      log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"
+
      echo "Waiting for network connectivity..."

      # Verify we can reach the git server via HTTPS (doesn't respond to ping)
      if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
        echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
        echo "Check network configuration and DNS settings"
+        log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
        exit 1
      fi

      echo "Network connectivity confirmed"
+      log_to_loki "network_ok" "Network connectivity confirmed"

      # Unwrap Vault token and store AppRole credentials (if provided)
      if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
          chmod 600 /var/lib/vault/approle/secret-id

          echo "Vault credentials unwrapped and stored successfully"
+          log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
        else
          echo "WARNING: Failed to unwrap Vault token"
          if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -63,17 +114,17 @@ let
          echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
          echo ""
          echo "Vault secrets will not be available, but continuing bootstrap..."
+          log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
        fi
      else
        echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
        echo "Skipping Vault credential setup"
+        log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
      fi

      echo "Fetching and building NixOS configuration from flake..."
-
-      # Read git branch from environment, default to master
-      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
      echo "Using git branch: $BRANCH"
+      log_to_loki "building" "Starting nixos-rebuild boot"

      # Build and activate the host-specific configuration
      FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
@@ -81,18 +132,30 @@ let
      if nixos-rebuild boot --flake "$FLAKE_URL"; then
        echo "Successfully built configuration for $HOSTNAME"
        echo "Rebooting into new configuration..."
+        log_to_loki "success" "Build successful - rebooting into new configuration"
        sleep 2
        systemctl reboot
      else
        echo "ERROR: nixos-rebuild failed for $HOSTNAME"
        echo "Check that flake has configuration for this hostname"
        echo "Manual intervention required - system will not reboot"
+        log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
        exit 1
      fi
    '';
  };
 in
 {
+  # Custom greeting line to indicate this is a bootstrap image
+  services.getty.greetingLine = lib.mkForce ''
+    ================================================================================
+                          BOOTSTRAP IMAGE - NixOS \V (\l)
+    ================================================================================
+
+    Bootstrap service is running. Logs are displayed on tty1.
+    Check status: journalctl -fu nixos-bootstrap
+  '';
+
  systemd.services."nixos-bootstrap" = {
    description = "Bootstrap NixOS configuration from flake on first boot";

@@ -107,12 +170,12 @@ in
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
-      ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
+      ExecStart = lib.getExe bootstrap-script;

      # Read environment variables from cloud-init (set by cloud-init write_files)
      EnvironmentFile = "-/run/cloud-init-env";

-      # Logging to journald
+      # Log to journal and console
      StandardOutput = "journal+console";
      StandardError = "journal+console";
    };
--- a/hosts/template2/configuration.nix
+++ b/hosts/template2/configuration.nix
@@ -58,6 +58,14 @@
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
+  nix.settings.substituters = [
+    "https://nix-cache.home.2rjus.net"
+    "https://cache.nixos.org"
+  ];
+  nix.settings.trusted-public-keys = [
+    "nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
+    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+  ];
  environment.systemPackages = with pkgs; [
    age
    vim
@@ -71,5 +79,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

+  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
+  zramSwap.enable = true;
+
  system.stateVersion = "25.11";
 }
--- a/hosts/template2/scripts.nix
+++ b/hosts/template2/scripts.nix
@@ -2,7 +2,6 @@
 let
  prepare-host-script = pkgs.writeShellApplication {
    name = "prepare-host.sh";
-    runtimeInputs = [ pkgs.age ];
    text = ''
      echo "Removing machine-id"
      rm -f /etc/machine-id || true
@@ -22,11 +21,6 @@ let

      echo "Removing cache"
      rm -rf /var/cache/* || true
-
-      echo "Generate age key"
-      rm -rf /var/lib/sops-nix || true
-      mkdir -p /var/lib/sops-nix
-      age-keygen -o /var/lib/sops-nix/key.txt
    '';
  };
 in
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
+    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
@@ -62,6 +66,39 @@
    git
  ];

+  # Test nginx with ACME certificate from OpenBao PKI
+  services.nginx = {
+    enable = true;
+    virtualHosts."testvm01.home.2rjus.net" = {
+      forceSSL = true;
+      enableACME = true;
+      locations."/" = {
+        root = pkgs.writeTextDir "index.html" ''
+          <!DOCTYPE html>
+          <html>
+          <head>
+            <title>testvm01 - ACME Test</title>
+            <style>
+              body { font-family: monospace; max-width: 600px; margin: 50px auto; padding: 20px; }
+              .joke { background: #f0f0f0; padding: 20px; border-radius: 8px; margin: 20px 0; }
+              .punchline { margin-top: 15px; font-weight: bold; }
+            </style>
+          </head>
+          <body>
+            <h1>OpenBao PKI ACME Test</h1>
+            <p>If you're seeing this over HTTPS, the migration worked!</p>
+            <div class="joke">
+              <p>Why do programmers prefer dark mode?</p>
+              <p class="punchline">Because light attracts bugs.</p>
+            </div>
+            <p><small>Certificate issued by: vault.home.2rjus.net</small></p>
+          </body>
+          </html>
+        '';
+      };
+    };
+  };
+
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
--- a/hosts/testvm02/configuration.nix
+++ b/hosts/testvm02/configuration.nix
@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
+    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/hosts/testvm03/configuration.nix
+++ b/hosts/testvm03/configuration.nix
@@ -11,6 +11,7 @@

    ../../system
    ../../common/vm
+    ../../common/ssh-audit.nix
  ];

  # Host metadata (adjust as needed)
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/hosts/vault01/configuration.nix
+++ b/hosts/vault01/configuration.nix
@@ -62,6 +62,16 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

+  # Vault fetches secrets from itself (after unseal)
+  vault.enable = true;
+  homelab.deploy.enable = true;
+
+  # Ensure vault-secret services wait for openbao to be unsealed
+  systemd.services.vault-secret-homelab-deploy-nkey = {
+    after = [ "openbao.service" ];
+    wants = [ "openbao.service" ];
+  };
+
  system.stateVersion = "25.11"; # Did you read the comment?
 }

--- a/lib/monitoring.nix
+++ b/lib/monitoring.nix
@@ -21,6 +21,7 @@ let
      cfg = hostConfig.config;
      monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
      dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
+      hostConfig' = (cfg.homelab or { }).host or { };
      hostname = cfg.networking.hostName;
      networks = cfg.systemd.network.networks or { };

@@ -49,20 +50,73 @@ let
        inherit hostname;
        ip = extractIP firstAddress;
        scrapeTargets = monConfig.scrapeTargets or [ ];
+        # Host metadata for label propagation
+        tier = hostConfig'.tier or "prod";
+        priority = hostConfig'.priority or "high";
+        role = hostConfig'.role or null;
+        labels = hostConfig'.labels or { };
      };

+  # Build effective labels for a host
+  # Always includes hostname; only includes tier/priority/role if non-default
+  buildEffectiveLabels = host:
+    { inherit (host) hostname; }
+    // (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
+    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
+    // (lib.optionalAttrs (host.role != null) { role = host.role; })
+    // host.labels;
+
  # Generate node-exporter targets from all flake hosts
+  # Returns a list of static_configs entries with labels
  generateNodeExporterTargets = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
      hostList = lib.filter (x: x != null) (
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );
-      flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
+
+      # Extract hostname from a target string like "gunter.home.2rjus.net:9100"
+      extractHostnameFromTarget = target:
+        builtins.head (lib.splitString "." target);
+
+      # Build target entries with labels for each host
+      flakeEntries = map
+        (host: {
+          target = "${host.hostname}.home.2rjus.net:9100";
+          labels = buildEffectiveLabels host;
+        })
+        hostList;
+
+      # External targets get hostname extracted from the target string
+      externalEntries = map
+        (target: {
+          inherit target;
+          labels = { hostname = extractHostnameFromTarget target; };
+        })
+        (externalTargets.nodeExporter or [ ]);
+
+      allEntries = flakeEntries ++ externalEntries;
+
+      # Group entries by their label set for efficient static_configs
+      # Convert labels attrset to a string key for grouping
+      labelKey = entry: builtins.toJSON entry.labels;
+      grouped = lib.groupBy labelKey allEntries;
+
+      # Convert groups to static_configs format
+      # Every flake host now has at least a hostname label
+      staticConfigs = lib.mapAttrsToList
+        (key: entries:
+          let
+            labels = (builtins.head entries).labels;
+          in
+          { targets = map (e: e.target) entries; inherit labels; }
+        )
+        grouped;
    in
-    flakeTargets ++ (externalTargets.nodeExporter or [ ]);
+    staticConfigs;

  # Generate scrape configs from all flake hosts and external targets
+  # Host labels are propagated to service targets for semantic alert filtering
  generateScrapeConfigs = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
@@ -70,13 +124,14 @@ let
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );

-      # Collect all scrapeTargets from all hosts, grouped by job_name
+      # Collect all scrapeTargets from all hosts, including host labels
      allTargets = lib.flatten (map
        (host:
          map
            (target: {
              inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
              hostname = host.hostname;
+              hostLabels = buildEffectiveLabels host;
            })
            host.scrapeTargets
        )
@@ -87,22 +142,32 @@ let
      grouped = lib.groupBy (t: t.job_name) allTargets;

      # Generate a scrape config for each job
+      # Within each job, group targets by their host labels for efficient static_configs
      flakeScrapeConfigs = lib.mapAttrsToList
        (jobName: targets:
          let
            first = builtins.head targets;
-            targetAddrs = map
-              (t:
+
+            # Group targets within this job by their host labels
+            labelKey = t: builtins.toJSON t.hostLabels;
+            groupedByLabels = lib.groupBy labelKey targets;
+
+            # Every flake host now has at least a hostname label
+            staticConfigs = lib.mapAttrsToList
+              (key: labelTargets:
                let
-                  portStr = toString t.port;
+                  labels = (builtins.head labelTargets).hostLabels;
+                  targetAddrs = map
+                    (t: "${t.hostname}.home.2rjus.net:${toString t.port}")
+                    labelTargets;
                in
-                "${t.hostname}.home.2rjus.net:${portStr}")
-              targets;
+                { targets = targetAddrs; inherit labels; }
+              )
+              groupedByLabels;
+
            config = {
              job_name = jobName;
-              static_configs = [{
-                targets = targetAddrs;
-              }];
+              static_configs = staticConfigs;
            }
            // (lib.optionalAttrs (first.metrics_path != "/metrics") {
              metrics_path = first.metrics_path;
--- a/playbooks/build-and-deploy-template.yml
+++ b/playbooks/build-and-deploy-template.yml
@@ -99,3 +99,48 @@
    - name: Display success message
      ansible.builtin.debug:
        msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
+
+- name: Update Terraform template name
+  hosts: localhost
+  gather_facts: false
+
+  vars:
+    terraform_dir: "{{ playbook_dir }}/../terraform"
+
+  tasks:
+    - name: Get image filename from earlier play
+      ansible.builtin.set_fact:
+        image_filename: "{{ hostvars['localhost']['image_filename'] }}"
+
+    - name: Extract template name from image filename
+      ansible.builtin.set_fact:
+        new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"
+
+    - name: Read current Terraform variables file
+      ansible.builtin.slurp:
+        src: "{{ terraform_dir }}/variables.tf"
+      register: variables_tf_content
+
+    - name: Extract current template name from variables.tf
+      ansible.builtin.set_fact:
+        current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"
+
+    - name: Check if template name has changed
+      ansible.builtin.set_fact:
+        template_name_changed: "{{ current_template_name != new_template_name }}"
+
+    - name: Display template name status
+      ansible.builtin.debug:
+        msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"
+
+    - name: Update default_template_name in variables.tf
+      ansible.builtin.replace:
+        path: "{{ terraform_dir }}/variables.tf"
+        regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
+        replace: '\1"{{ new_template_name }}"'
+      when: template_name_changed
+
+    - name: Display update result
+      ansible.builtin.debug:
+        msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
+      when: template_name_changed
--- a/rebuild-all.sh
+++ b/rebuild-all.sh
@@ -1,20 +0,0 @@
-#!/usr/bin/env bash
-set -euo pipefail
-
-# array of hosts
-HOSTS=(
-    "ns1"
-    "ns2"
-    "ca"
-    "ha1"
-    "http-proxy"
-    "jelly01"
-    "monitoring01"
-    "nix-cache01"
-    "pgdb1"
-)
-
-for host in "${HOSTS[@]}"; do
-    echo "Rebuilding $host"
-    nixos-rebuild boot --flake .#${host} --target-host root@${host}
-done
--- a/scripts/create-host/create_host.py
+++ b/scripts/create-host/create_host.py
@@ -314,11 +314,10 @@ def handle_remove(
        for secret_path in host_secrets:
            console.print(f"   [white]vault kv delete secret/{secret_path}[/white]")

-    # Warn about secrets directory
+    # Warn about legacy secrets directory
    if secrets_exist:
-        console.print(f"\n[yellow]⚠️  Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
+        console.print(f"\n[yellow]⚠️  Warning: secrets/{hostname}/ directory exists (legacy SOPS)[/yellow]")
        console.print(f"   Manually remove if no longer needed: [white]rm -rf secrets/{hostname}/[/white]")
-        console.print(f"   Also update .sops.yaml to remove the host's age key")

    # Exit if dry run
    if dry_run:
--- a/scripts/create-host/manipulators.py
+++ b/scripts/create-host/manipulators.py
@@ -219,7 +219,7 @@ def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -
    new_entry = f"""        {config.hostname} = nixpkgs.lib.nixosSystem {{
          inherit system;
          specialArgs = {{
-            inherit inputs self sops-nix;
+            inherit inputs self;
          }};
          modules = commonModules ++ [
            ./hosts/{config.hostname}
--- a/scripts/create-host/validators.py
+++ b/scripts/create-host/validators.py
@@ -140,20 +140,22 @@ def validate_ip_unique(ip: Optional[str], repo_root: Path) -> None:
    ip_part = ip.split("/")[0]

    # Check all hosts/*/configuration.nix files
+    # Search for IP with CIDR notation to match static IP assignments
+    # (e.g., "10.69.13.5/24") but not DNS resolver entries (e.g., "10.69.13.5")
    hosts_dir = repo_root / "hosts"
    if hosts_dir.exists():
        for config_file in hosts_dir.glob("*/configuration.nix"):
            content = config_file.read_text()
-            if ip_part in content:
+            if ip in content:
                raise ValueError(
                    f"IP address {ip_part} already in use in {config_file}"
                )

-    # Check terraform/vms.tf
+    # Check terraform/vms.tf - search for full IP with CIDR
    terraform_file = repo_root / "terraform" / "vms.tf"
    if terraform_file.exists():
        content = terraform_file.read_text()
-        if ip_part in content:
+        if ip in content:
            raise ValueError(
                f"IP address {ip_part} already in use in {terraform_file}"
            )
--- a/secrets/ca/keys/intermediate_ca_key
+++ b/secrets/ca/keys/intermediate_ca_key
@@ -1,24 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:TgGIuklFPUSCBosD86NFnkAtRvYijQNQP4vvTkKu3dRAOjdDa2li5djZDUS4NEEPEihpOcMXqHBb+ABk3LmoU5nLmsKCeylUp7+DhcGi9f3xw2h1zbHV37mt40OVLTF3cYufRdydIkCGQA3td3q1ue/wCna2ewe73xwGg5j6ZVJCZAtW4VCNZM+rcG+YxPUC0gmBH59+O0VSrZrkvSnifbr+K0dGwg4i17KwAukI4Ac7YMkQoeuAPXq38+ZftlRx4tq9xBUko6wpPY9zOaFzeagWYMF0n1UYqDt+/3XZI/mukPhJc9tzbWneqgkQBOx3OiDwrNglCHvEpnb+bZePIRLOnNHd1ShETgBqhsHGp9OAwwbAt4tO+HFpCQtVz7s2LWQFLbWiN0SCGzYUkFGCgoXae5H58lxFav8=,iv:UzaWlJ+M+VQx3CcPSGbFZh5/rGbKpS2Rq2XVZAIDFiQ=,tag:F3waoAMuEKTvN2xANReSww==,type:str]",
-	"sops": {
-		"kms": null,
-		"gcp_kms": null,
-		"azure_kv": null,
-		"hc_vault": null,
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBpRGZSVHRSMGlyazAwQU5j\nd1o1L0Y1ckhQMkh4MVZiRmZlR2ozcmdsUW1vCk4xZ1ZibDBrUWZhYmxVVjBUczRn\nYlJtUWF3Y1lHWG56NkhmK2JOUHVGajQKLS0tIDN2S2doQURpTis2U3lWV0NxdWEz\ncjNZaEl1dEQwOXhsNE9xbHhYUzNTV3cKVmVIe05JwgXKSku7AJmrujYXrbBSbpBJ\nnqCuDIhok1w/fiff+XXn8udbgPVq5bC2SOhHbtVxImgBCFzrj5hQ0A==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA4V3NaUEdvMmJvakQ0L1F0\nUnkvQ2F5dEVlZ2pMdlBZcjJac0tERnF5ZWljCmFrdU1NZ29jMkJ1a1ZLdURmVWI0\ncm1vNytFVzZjbVY2aVd2N3laMWNRNFEKLS0tIGgzOTFZY0lxc0JyVmd5cFBlNkRr\nVDBWc0t4c3pVV3RhSTB1UUVpNHd6NUkKNn6Sxb5oxP7iWqTF1+X9nOiYum3U+Rzk\nkryxVnf9EvQIVIFKDaTb+yAEO8otjqj+C4mHA9fannnNEJduOiPWOg==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2024-11-30T13:18:08Z",
-		"mac": "ENC[AES256_GCM,data:9R9RJzPMr9Bv8aeCDxhExTfbr+R2hjap6FGSk5QxBdbNpOcNS78ica0CLEmkAYVAfjmx/X2jC5ZnsAueSPUK7nAgNX2gJXbUTpY0F+oKt35GJziLrFLl3u/ahpF9lQ50EL9OqqgS+igDqtodJhKme5DXH5/GXQHhz++O3VZkR78=,iv:XgN3PiowiEosi2DmrjP82HhJMvnwaV530tsBE8GQfjs=,tag:U243BrtH7H/DU9LcjN/MMg==,type:str]",
-		"pgp": null,
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.1"
-	}
-}
--- a/secrets/ca/keys/root_ca_key
+++ b/secrets/ca/keys/root_ca_key
@@ -1,24 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:5AePh5uXcUseYBGWvlztgmg8mGBGy3ngKRa6+QxOaT0/fzSB1pKkaMtZJo76tV9wwjdL6/b6VVUI7GIaCBD5kgdZuA8RdBTXguHyjjdxAlI9xcrQaWWdATd8JJt+eQp/m2Y+0dioyXKaDV2ukI3GtHYjp/ixMoHHWEocnEEb40wG6c3CZcvsLWJvKTkFc2OvcjcU2RTfuNlYtEETidiD9iC/dtCakNQHmLP1UFYgcn0ebXBKmlqD6+x2o7BVT1SLwVCyGNvH3eKA2AWvddZChnhaNCUIXcRwBFCgS8lPs4iXhAhly+nwuj7ssFpuu3sjm5pq196tRS8WQl2iNUEJ2tzoOpceg1kZZ7KHX3wCbdBlCRqhy9Q4JMvWPDssO+zz2aU21+BDEySDTCnTYX9Hu2/iFvZejt++mKY=,iv:u/Ukye0BAj2ka++AA72W8WfXJAZZ/YJ3RC/aydxdoUc=,tag:ihTP5bCCigWEPcLFaYOhMA==,type:str]",
-	"sops": {
-		"kms": null,
-		"gcp_kms": null,
-		"azure_kv": null,
-		"hc_vault": null,
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB0VElDNHArZXlXa2JRQjd0\nQmVIbGpPWk43NDdiTkFtcEd1bDhRdXJWOUY0CndITHdKTFNJQXFOVFdyUGNtQ09k\nN2hnQmFYR0ZORWtxcUN0ZFhsM0U3N2cKLS0tIFh1TTBpMjFIZ2NYM1QxeDRjYlJx\nYkdrUDZmMUpGbjk3REJCVVRpeFk5Z28KJcia0Bk+3ZoifZnRLwqAko526ODPnkSS\nzymtOj/QYTA0++NP3B1aScIyhWITMEZX1iSoWDmgHj8ZQoNMdkM7AQ==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBZNlNHRWNEcUZGNXNBMDFR\nTzE5RnNMQUMvU1k2OS9XMlpvUktMRzQ5RmxvCnlCS3lzRVpGUHJLRGZ6SWZ2ZktR\na3l0TVN2NUlRVEQwRHByYkNEMDQyWUkKLS0tIEh3RjBWT3c5K2RWeDRjWFpsU1lP\ncStqY2xta3RSNkR6Vkt5YXhYUTZmbDgKvVKmZc8S/RwurJGsGiJ5LhM4waLO9B9k\n2cawxHmcYM3KfXDFwp9UZWhIwF7SRkG56ZE4OjGI3sOL+74ixnePxA==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2024-11-30T13:18:16Z",
-		"mac": "ENC[AES256_GCM,data:JwjbQ129cYCBNA5Fb8lN9rW7/y4wuVOqLeajIMcYyCzlBcjzCZAV1DKN5n75xMamb/hb1AUkmtp/K82PKM0Vg5X4/lpWTUZXZOzn/TrwHx+yqlJjL9mUdGuHnSY5DwME38Dde3UxdtUa0CVgQOxvMIycW27w8+8NNfO2zxGxkzc=,iv:ZMZASOsqXZOb0NkBqG3GGaqqKgQdjZLiku2yU5QonB8=,tag:/lb/HMxsYOV5XX/5kWnFHA==,type:str]",
-		"pgp": null,
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.1"
-	}
-}
--- a/secrets/ca/keys/ssh_host_ca_key
+++ b/secrets/ca/keys/ssh_host_ca_key
@@ -1,24 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:vqQ3HwSmuDlI4UwraLWvwkBSj9zTFeNEWI1xzhVrO/gpx8+WBZOt2F0J7/LSTGAWsWW/9Gov+XXXAOtfnKfjYVzizyT/jE8EQwMuItWiFEVA6hohgwtsk7YKJjXdJIxmiv+WKs73gWb0uFVGh1ArMzsVkGPj1W1AKMFAneDPgsfSCy9aVOMuF8zQwypFC8eaxqOQhLpiN2ncRm8e7khwGurSgYfHDgFghaDr8torgUrZTOPNFk+LEdxB3WcC17+4a8ZyuBapmYdRTrP73czTAuxOF8lMwddJhO99SF7nWuOYVF1FOKLGtK04oKci5/xRIzvWo3I0pGajkxtuF5CyWbd1KblcPfBALIU/J5hU/puGJ7M2sE/qsg/4kaTFxnhq32rPZj291jFb4evDdOhVodfC1axOQUbzAC0=,iv:yOeQ384ikqgDqfthl7GIVSIMNA/n0BYTSIqFN3T9MAY=,tag:Y6nhOCrkWx7MnVpEeKN0Jg==,type:str]",
-	"sops": {
-		"kms": null,
-		"gcp_kms": null,
-		"azure_kv": null,
-		"hc_vault": null,
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBFTjRMWlNtYVQ2WnJEaGFN\nVFU2TXRTK2FHREpqREhOWHBKemxNc2U4WW44CnV4OWlBdXlFUWhJYi9jTTRuUWJV\nOWFPV2I4UytDRFo3blN3bUtFQ1NGU0kKLS0tIGp2VHlDc1JMMUdDUjlNNDFwUUxj\nVnhHbCtrNVNpZXo0K2dDVU5YTVJJUEkKk9mVTbzQVGZo3RKDLPDwtENknh+in1Q5\njf4DA1cGDDNzcEIWOOYyS+1mzT9WY8gU0hWqihX/bAx7CVsNUallZw==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBrVFNwUGpkOUhkUXFWWERq\nMVdueC9VSE9KbGZkenBVK3NRMjRNVXVmcVRRCjNLa0QzbWVCQks3ZmV3eFVjcEp0\nRmxDSlZIZU1IbEdnbE83WlkxV3VZV1EKLS0tICtsRXArajQ4Um9mNEV5OWZBdS85\nVGFSU2wwODZ3Zm44M3pWcTdDV1dxejQKM2BK5Axb1cF344ea89gkzCLzEX6j4amK\nzxf+boBK7JUX7F6QaPB0sRU8J4Cei9mALz96C8xNHjX00KcD3O2QOA==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2024-11-30T13:18:20Z",
-		"mac": "ENC[AES256_GCM,data:AllgcWxHnr3igPi/JbfJCbEa6hKtmILnAjiaMojRZNO4p6zYSoF0s8lo9XX05/vIrFUo+YaCtsuacv+kfz9f6vQafPn7Vulbh6PeH1VlAmzyVfJOTmHP3YX8ic3uM56A4+III1jOERCFOIcc/CKsnRLFhLCRQRMgtgT0hTl5aPw=,iv:60dOYhoUTu1HIHzY36eJeRZ66/v6JmRRpIW99W2D+CI=,tag:F7nLSFm933K5M+JE4IvNYw==,type:str]",
-		"pgp": null,
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.1"
-	}
-}
--- a/secrets/ca/keys/ssh_user_ca_key
+++ b/secrets/ca/keys/ssh_user_ca_key
@@ -1,24 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:YRdPrTLQH0xdWiIzOyjfEGpvfmuj6me6GzZZcauh9bUUywyA1ranDnWqbJYgawQQxIXsq9dhXD0uco+7mmXq2598kF1NI9jh6uLf3k0H494zZOalRBv/k8u9oJDLIiVAkg9eNNLbGX0PMZr/Yue/qdkuXx2Hg9E7bQJwpU/NXF+jKKs+3NmKT5NBlegwAzUs530D4DUoaq5AhvVvdC6a1UcE+KJzQ8pRiz1GjFIxAB7qX+GVwa3yNdLgo2tlAbOzjGtaDfJnhZIHSNEq+4TEhjlF9lCmFCGFDUVupvMOWs0kBywJEzIrDmxmvGHlPj3FfyytPb7qhlsOXDDDS67IoiwluKOnw+sALAG0Iv9LMrDZ3z8MXeEGvRWu0VDMuGXN905/9kGx/A40mPjcfnZvI+qSRIKjER5R8aU=,iv:qiP2Ml59AnK24MBbs7N/HqJIylf+fXGqJAo2N8iFNB0=,tag:0Dj5fVs6OB07kvV4qzuvfw==,type:str]",
-	"sops": {
-		"kms": null,
-		"gcp_kms": null,
-		"azure_kv": null,
-		"hc_vault": null,
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBUFlvNmRNYUlJSHZYUkpJ\nMEloQXFSdENIWGJVVDNIOVY5MS9SYWRoL0FrCnRJc05wZUZBSDRvMHNUUEhNRXQ4\nTWhYOUp6YUNGZFNWUFRrSmlJM1c4aWcKLS0tIFc1b3NlSEo2eFJhdDgwejRqcHlT\nZE5wN01uaE04cTlIbVJMVWQvQ1pXajgKQ1n6UmP7LEBsnIBXVc0BceOqvwCqQzBP\ncI8C5Io4ILgMjY4dr6sd0SeJG6mfDdiMA+k7c6jqoyZCW/Pkd3LANQ==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBtM2lyeXVzdE9nL1k5L3dC\nTkl2MjhMb1FKMFdCeXFPSmNST0pvOTRUaEVvCmdwMnhjSFFHVFhidmIySS9jMEJu\nNTJpRjdFOWpZZ3ZuZFJwZUUrRFU5NnMKLS0tIDJ1UjdVQkpMNm5Pd01JRnZNOEtr\nb1lpMlBkVHpiT2lYdWtZaUQrRW1HUDgKq/JVMf5gdu6lNEmqY6zU2SymbT+jklem\nnUQ9yieJGF+PanutNW6BCJH8jb/fH+Y6AeJ9S+kKCB4Yi75i4d+oHg==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2024-11-30T13:18:24Z",
-		"mac": "ENC[AES256_GCM,data:6FJTKEdIpCm+Dz7Ua8dZOMZQFaGU0oU/HRP6ly5mWbXCv81LRbZXRBd+5RDY3z9g9nb0PXZrOMNps63F6SKxK52VfzLIOap3UGeMNQn5P4/yyFj7JQHQ5Gjcf2l2z2VZ7NhUdNoSCV/6lwjValbKtids48Q5c3sFX997ZiqIUnY=,iv:nUeyJd/v8d9v7QsLLckziD9K5qjOZKK4vOQJw/ymi18=,tag:6n5EE3oklWdVcedvB2J/zA==,type:str]",
-		"pgp": null,
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.1"
-	}
-}
--- a/secrets/ca/secrets.yaml
+++ b/secrets/ca/secrets.yaml
@@ -1,30 +0,0 @@
-ca_root_pw: ENC[AES256_GCM,data:jS5BHS9i/pOykus5aGsW+w==,iv:aQIU7uXnNKaeNXv1UjRpBoSYcRpHo8RjnvCaIw4yCqc=,tag:lkjGm5/Ve93nizqGDQ0ByA==,type:str]
-sops:
-    kms: []
-    gcp_kms: []
-    azure_kv: []
-    hc_vault: []
-    age:
-        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA5anlORWxJalhRWkJPeGIy
-            OStyVG8vMFRTTEZOWHR3Q3N1UWJQbFlxV3pBCmVKQVM1SlJ2L0JOb3U3cTh3YkZ4
-            WHAxSUpTT1dyRHJHYVd1Qkh1ZWxwYW8KLS0tIEhXeklsSmlGaFlaaWF5L0Nodk5a
-            clZ4M3hFSlFqaEZ0UWREdHpTQ29GVUEKAxj5P05Ilpwis2oKFe54mJX+1LfTwfUv
-            2XRFOrEQbFNcK5WFu46p1mc/AAjKTeHWuvb2Yq43CO+sh1+kqKz0XA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBaS0dqQ1p4MEE2d2JaeFRx
-            UnB4ejhrS3hLekpqeWJhcEJGdnpzMTZDelVRCmFjVGswd3VtRUloWG1WbWY5N0s3
-            cG9aV2hGU3lFZkkvcUJNWE1rWUIwMmMKLS0tIG1KdlhoQzREWDhPbXVSZVBUQkdE
-            N1hmcEwxWXBIWkQ3a3BrdGhvUFoxbzgKX6hLoz7o/Du6ymrYwmGDkXp2XT+0+7QE
-            YhD5qQzGLVQSh3XM/wWExj2Ue5/gw/NqNziHezOh2r9gQljbHjG2/g==
-            -----END AGE ENCRYPTED FILE-----
-    lastmodified: "2024-10-21T09:12:26Z"
-    mac: ENC[AES256_GCM,data:hfPRIXt/kZJa6lsj7rz+5xGlrWhR/LX895S2d8auP/4t3V//80YE/ofIsHeAY9M7eSFsW9ce2Vp0C/WiCQefVWNaNN7nVAwskCfQ6vTWzs23oYz4NYIeCtZggBG3uGgJxb7ZnAFUJWmLwCxkKTQyoVVnn8i/rUDIBrkilbeLWNI=,iv:lm1HVbWtAifHjqKP0D3sxRadsE9+82ugbA2x54yRBTo=,tag:averxmPLa131lJtFrNxcEA==,type:str]
-    pgp: []
-    unencrypted_suffix: _unencrypted
-    version: 3.9.1
--- a/secrets/http-proxy/wireguard.yaml
+++ b/secrets/http-proxy/wireguard.yaml
@@ -1,25 +0,0 @@
-wg_private_key: ENC[AES256_GCM,data:DlC9txcLkTnb7FoEd249oJV/Ehcp50P8uulbE4rY/xU16fkTlnKvPmYZ7u8=,iv:IsiTzdrh+BNSVgx1mfjpMGNV2J0c88q6AoP0kHX2aGY=,tag:OqFsOIyE71SBD1mcNS/PeQ==,type:str]
-sops:
-    age:
-        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAzdm9HTTN1amwxQ2Z6MUQv
-            dGJ0cEgyaHNOZWtWSWlXNXc5bGhUdSsvVlVzCkJkc3ZQdzlBNDNxb3Avdi96bXFt
-            TExZY29nUDI3RE5vanh6TVBRME1Fa1UKLS0tIG8vSHdCYzkvWmJpd0hNbnRtUmtk
-            aVcwaFJJclZ3YUlUTTNwR2VESmVyZWMKHvKUJBDuNCqacEcRlapetCXHKRb0Js09
-            sqxLfEDwiN2LQQjYHZOmnMfCOt/b2rwXVKEHdTcIsXbdIdKOJwuAIQ==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBEeU01UTc2V1UyZXRadE5I
-            VE1aakVZUEZUNnJxbzJ1K3J1R3ZQdFdMbUhBCjZBMDM3ZkYvQWlyNHBtaDZRWkd4
-            VzY0L3l4N2RNZjJRTDJWZTZyZVhHbW8KLS0tIGVNZ0N0emVmaVRCV09jNmVKRlla
-            cWVSNkJqWHh5c21KcWFac2FlZTVaMTAK1UvfPgZAZYtwiONKIAo5HlaDpN+UT/S/
-            JfPUfjxgRQid8P20Eh/jUepxrDY8iXRZdsUMON+OoQ8mpwoAh5eN1A==
-            -----END AGE ENCRYPTED FILE-----
-    lastmodified: "2025-05-15T18:56:55Z"
-    mac: ENC[AES256_GCM,data:J2kHY7pXBJZ0UuNCZOhkU11M8rDqCYNzY71NyuDRmzzRCC9ZiNIbavyQAWj2Dpk1pjGsYjXsVoZvP7ti1wTFqahpaR/YWI5gmphrzAe32b9qFVEWTC3YTnmItnY0YxQZYehYghspBjnJtfUK0BvZxSb17egpoFnvHmAq+u5dyxg=,iv:/aLg02RLuJZ1bRzZfOD74pJuE7gppCBztQvUEt557mU=,tag:toxHHBuv3WRblyc9Sth6Iw==,type:str]
-    unencrypted_suffix: _unencrypted
-    version: 3.10.2
--- a/secrets/monitoring01/pve-exporter.yaml
+++ b/secrets/monitoring01/pve-exporter.yaml
@@ -1,33 +0,0 @@
-default:
-    user: ENC[AES256_GCM,data:4Zzjm6/e8GCKSPNivnY=,iv:Y3gR+JSH/GLYvkVu3CN4T/chM5mjGjwVPI0iMB4p1t4=,tag:auyG8iWsd/YGjDnnTC21Ew==,type:str]
-    password: ENC[AES256_GCM,data:9cyM9U8VnzXBBA==,iv:YMHNNUoQ9Az5+81Df07tjC+LaEWPHV6frUjd4PZrQOs=,tag:3hKR+BhLJODJp19nn4ppkA==,type:str]
-    verify_ssl: ENC[AES256_GCM,data:Cu5Ucf0=,iv:QFfdV7gDBQ+L2kSZZqlVqCrn9CRg5RNG5DNTFWtVf5Y=,tag:u24ZbpWA65wj3WOwqU1v+g==,type:bool]
-sops:
-    kms: []
-    gcp_kms: []
-    azure_kv: []
-    hc_vault: []
-    age:
-        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuUXdMMG5YaHRJbThQZW9u
-            RHVBbXFiSHNiUWdLTDdPajIyQjN3OGR0dGpzCm9ZVkdNWjhBakU3dVdhRU9kbU81
-            aDlCNzJBQ1hvQ3FnTUk2N2RWQkZpUUEKLS0tIEZacTNqa3FWc2p1NXVtRWhwVExj
-            cUJtYXNjb2Z4QkF4MjlidEZxSUFNa3MKAGHGksPc9oJheSlUQ3ARK5MuR5NFbPmD
-            kmSDSgRmzbarxT8eJnK8/K4ii3hX5E9vGOohUkyc03w4ENsh/dw43g==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBOVGhvdGE5Mzl0ckhBM21D
-            RXJwb09OS25PMGViblViM21wTVZiZWhtWmhFCnAzL1NqeUVyOGZFVDFvdXFPbklQ
-            ZkJPWDVIdUdCdjZGUjcrcmtvak5CWG8KLS0tIDhLUHJNN2VqNy9CdVh0K0N0b0k1
-            RUE4U0E0aGxiRkF0NWdwSEIrQTU4MjgKeOU6bIWO6ke9YcG+1E3brnC21sSQxZ9b
-            SiG2QEnFnTeJ5P50XQoYHqUY3B0qx7nDLvyzatYEi6sDkfLXhmHGbw==
-            -----END AGE ENCRYPTED FILE-----
-    lastmodified: "2024-12-03T16:25:12Z"
-    mac: ENC[AES256_GCM,data:gemq8YpMZQC+gY7lmMM3tfZh9XxL40qdGlLiB2CD4SIG49w0V6E/vY7xygt0WW0zHbhMI9yUIqlRc/PaXn+QfyxJEr3IjaT05rrWUqQAeRP9Zss74Y3NtQehh8fM8SgeyU4j2CQ9f9B/lW9IgdOW/TNgQZVXGg1vXZPEzl7AZ4A=,iv:LG5ojv3hAqk+EvFa/xEn43MBqL457uKFDE3dG5lSgZo=,tag:AxzcUzmdhO411Sw7Vg1itA==,type:str]
-    pgp: []
-    unencrypted_suffix: _unencrypted
-    version: 3.9.1
--- a/secrets/nix-cache01/actions_token_1
+++ b/secrets/nix-cache01/actions_token_1
@@ -1,19 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:P84qHFU+xQjwQGK8I1gIdcBsHrskuUg0M1nGMMaA+hFjAdFYUhdhmAN/+y0CO28=,iv:zJtk01zNMTBDQdVtZBTM34CHRaNYDkabolxh7PWGKUI=,tag:8AS80AbZJbh9B3Av3zuI1w==,type:str]",
-	"sops": {
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBkRFB6QTIyWWdwVkV4ZXNB\nWkdSdEhMc0s4cnByWVZXTGhnSWZ0MTdEUWhJCnFlOFQ5TU1hcE91azVyZXVXRCtu\nZjIxalRLYlEreGZ6ZDNoeXNPaFN4b28KLS0tIHY5WVFXN1k4NFVmUjh6VURkcEpv\ncklGcWVhdTdBRnlOdm1qM2h5SS9UUkEKq2RyxSVymDqcsZ+yiNRujDCwk1WOWYRW\nDa4TRKg3FCe7TcCEPkIaev1aBqjLg9J9c/70SYpUm6Zgeps7v5yl3A==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSArTGVuckp2NlhMZXRNMVhO\naUV3K0h3cmZ5ZGx4Q3dJWHNqZXFJeE1kM0dFCmF4TUFUMm9mTHJlYzlYWVhNa1RH\nR29VNDIrL1IvYUpQYm5SZEYzbWhhbkkKLS0tIEJsK1dwZVdaaHpWQkpOOS90dkhx\nbGhvRXhqdFdqQmhZZmhCdmw4NUtSVG8K3z2do+/cIjAqg6EMJnubOWid1sMeTxvo\nrq6eGJ7YzdgZr2JBVtJdDRtk/KeHXu9In4efbBXwLAPIfn1pU0gm1w==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2025-08-21T19:08:48Z",
-		"mac": "ENC[AES256_GCM,data:5CkO09NIqttb4UZPB9iGym8avhTsMeUkTFTKZJlNGjgB1qWyGQNeKCa50A1+SbBCCWE5EwxoynB1so7bi8vnq7k8CPUHbiWG8rLOJSYHQcZ9Tu7ZGtpeWPcCw1zPWJ/PTBsFVeaT5/ufdx/6ut+sTtRoKHOZZtO9oStHmu/Rlfg=,iv:z9iJJlbvhgxJaART5QoCrqvrqlgoVlGj8jlndCALmKU=,tag:ldjmND4NVVQrHUldLrB4Jg==,type:str]",
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.10.2"
-	}
-}
--- a/secrets/nix-cache01/cache-secret
+++ b/secrets/nix-cache01/cache-secret
@@ -1,19 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:MQkR6FQGHK2AuhOmy2was49RY2XlLO5NwaXnUFzFo5Ata/2ufVoAj4Jvotw/dSrKL7f62A6s+2BPAyWrvACJ+pwYFlfyj3T9bNwhxwZPkEmiHEubJjWSiD6jkSW0gOxbY8ib6g/GbyF8I1cPeYr/hJD5qQ==,iv:eBL2Y3MOt9gYTETUZqsHo1D5hPOHxb4JR6Z/DFlzzqI=,tag:Qqbt39xZvQz/QhsggsArsw==,type:str]",
-	"sops": {
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAwZzFXaEsyUkZGNFV0bVlW\nRkpPRHpUK2VwUHpOQXZCUUpoVzFGa3hycnhvCndTN0toVFdoU2E5N3V3UFhTTjU0\nNDByWTkrV0o3T295dE0zS08rVGpyQjAKLS0tIC96M0VEcWpjRk5DMjJnMFB4ZHI3\nM2Jod2x4ZzMyZm1pbDhZNTFuWGNRUlEKHs5jBSfjml09JOeKiT9vFR0Fykg6OxKG\njhFU/J2+fWB22G7dBc4PI60SNqhxIheUbGTdcz4Yp4BPL6vW3eArIw==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBJT3lxamcrQUpFdjZteFlF\nYUQ3aGdadGpuNXd2Z3RtZ3dQU0cvMlFUMUNRClBDR3U0OXZJU0NDamVMSlR5NitN\nYlhvNVlvUE0wRjErYzkwVHFOdGVCVjgKLS0tIEttR1BLTGpDYTRSQ0lUZmVEcnNi\nWkNaMEViUHVBcExVOEpjNE5CZHpjVkEKuX/Rf8kaB3apr1UhAnq3swS6fXiVmwm8\n7Key+SUAPNstbWbz0u6B9m1ev5QcXB2lx2/+Cm7cjW+6VE2gLHjTsQ==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2025-01-24T12:19:16Z",
-		"mac": "ENC[AES256_GCM,data:X8X91LVP1MMJ8ZYeSNPRO6XHN+NuswLZcHpAkbvoY+E9aTteO8UqS+fsStbNDlpF5jz/mhdMsKElnU8Z/CIWImwolI4GGE6blKy6gyqRkn4VeZotUoXcJadYV/5COud3XP2uSTb694JyQEZnBXFNeYeiHpN0y38zLxoX8kXHFbc=,iv:fFCRfv+Y1Nt2zgJNKsxElrYcuKkATJ3A/jvheUY2IK4=,tag:hYojbMGUAQvx7I4qkO7o9w==,type:str]",
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.3"
-	}
-}
--- a/secrets/secrets.yaml
+++ b/secrets/secrets.yaml
@@ -1,109 +0,0 @@
-root_password_hash: ENC[AES256_GCM,data:wk/xEuf+qU3ezmondq9y3OIotXPI/L+TOErTjgJz58wEvQkApYkjc3bHaUTzOrmWjQBgDUENObzPmvQ8WKawUSJRVlpfOEr5TQ==,iv:I8Z3xJz3qoXBD7igx087A1fMwf8d29hQ4JEI3imRXdY=,tag:M80osQeWGG9AAA8BrMfhHA==,type:str]
-ns_xfer_key: ENC[AES256_GCM,data:VFpK7GChgFeUgQm31tTvVC888bN0yt6BAnHQa6KUTg4iZGP1WL5Bx6Zp8dY=,iv:9RF1eEc7JBxBebDOKfcDjGS2U7XsHkOW/l52yIP+1LA=,tag:L6DR2QlHOfo02kzfWWCrvg==,type:str]
-backup_helper_secret: ENC[AES256_GCM,data:EvXEJnDilbfALQ==,iv:Q3dkZ8Ee3qbcjcoi5GxfbaVB4uRIvkIB6ioKVV/dL2Y=,tag:T/UgZvQgYGa740Wh7D0b7Q==,type:str]
-nats_nkey: ENC[AES256_GCM,data:N2CVXjdwiE7eSPUtXe+NeKSTzA9eFwK2igxaCdYsXd4Ps0/DjYb/ggnQziQzSy8viESZYjXhJ2VtNw==,iv:Xhcf5wPB01Wu0A+oMw0wzTEHATp+uN+wsaYshxIzy1w=,tag:IauTIOHqfiM75Ufml/JXbg==,type:str]
-sops:
-    age:
-        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuWXhzQWFmeCt1R05jREcz
-            Ui9HZFN5dkxHNVE0RVJGZUJUa3hKK2sxdkhBCktYcGpLeGZIQzZIV3ZZWGs3YzF1
-            T09sUEhPWkRkOWZFWkltQXBlM1lQV1UKLS0tIERRSlRUYW5QeW9TVjJFSmorOWNI
-            ZytmaEhzMjVhRXI1S0hielF0NlBrMmcK4I1PtSf7tSvSIJxWBjTnfBCO8GEFHbuZ
-            BkZskr5fRnWUIs72ZOGoTAVSO5ZNiBglOZ8YChl4Vz1U7bvdOCt0bw==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQcXM0RHlGcmZrYW4yNGZs
-            S1ZqQzVaYmQ4MGhGaTFMUVIwOTk5K0tZZjB3ClN0QkhVeHRrNXZHdmZWMzFBRnJ6
-            WTFtaWZyRmx2TitkOXkrVkFiYVd3RncKLS0tIExpeGUvY1VpODNDL2NCaUhtZkp0
-            cGNVZTI3UGxlNWdFWVZMd3FlS3pDR3cKBulaMeonV++pArXOg3ilgKnW/51IyT6Z
-            vH9HOJUix+ryEwDIcjv4aWx9pYDHthPFZUDC25kLYG91WrJFQOo2oA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBabTdsZWxZQjV2TGx2YjNM
-            ZTgzWktqTjY0S0M3bFpNZXlDRDk5TSt3V2k0CjdWWTN0TlRlK1RpUm9xYW03MFFG
-            aWN4a3o4VUVnYzBDd2FrelUraWtrMTAKLS0tIE1vTGpKYkhzcWErWDRreml2QmE2
-            ZkNIWERKb1drdVR6MTBSTnVmdm51VEkKVNDYdyBSrUT7dUn6a4eF7ELQ2B2Pk6V9
-            Z5fbT75ibuyX1JO315/gl2P/FhxmlRW1K6e+04gQe2R/t/3H11Q7YQ==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBVSFhDOFRVbnZWbVlQaG5G
-            U0NWekU0NzI1SlpRN0NVS1hPN210MXY3Z244CmtFemR5OUpzdlBzMHBUV3g0SFFo
-            eUtqNThXZDJ2b01yVVVuOFdwQVo2Qm8KLS0tIHpXRWd3OEpPRkpaVDNDTEJLMWEv
-            ZlZtaFpBdzF0YXFmdjNkNUR3YkxBZU0KAub+HF/OBZQR9bx/SVadZcL6Ms+NQ7yq
-            21HCcDTWyWHbN4ymUrIYXci1A/0tTOrQL9Mkvaz7IJh4VdHLPZrwwA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBWkhBL1NTdjFDeEhQcEgv
-            Z3c3Z213L2ZhWGo0Qm5Zd1A1RTBDY3plUkh3CkNWV2ZtNWkrUjB0eWFzUlVtbHlk
-            WTdTQjN4eDIzY0c0dyt6ajVXZ0krd1UKLS0tIHB4aEJqTTRMenV3UkFkTGEySjQ2
-            YVM1a3ZPdUU4T244UU0rc3hVQ3NYczQK10wug4kTjsvv/iOPWi5WrVZMOYUq4/Mf
-            oXS4sikXeUsqH1T2LUBjVnUieSneQVn7puYZlN+cpDQ0XdK/RZ+91A==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBYcEtHbjNWRkdodUxYdHRn
-            MDBMU08zWDlKa0Z4cHJvc28rZk5pUjhnMjE0CmdzRmVGWDlYQ052Wm1zWnlYSFV6
-            dURQK3JSbThxQlg3M2ZaL1hGRzVuL0UKLS0tIEI3UGZvbEpvRS9aR2J2Tnc1YmxZ
-            aUY5Q2MrdHNQWDJNaGt5MWx6MVRrRVEKRPxyAekGHFMKs0Z6spVDayBA4EtPk18e
-            jiFc97BGVtC5IoSu4icq3ZpKOdxymnkqKEt0YP/p/JTC+8MKvTJFQw==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQL3ZMUkI1dUV1T2tTSHhn
-            SjhyQ3dKTytoaDBNcit1VHpwVGUzWVNpdjBnCklYZWtBYzBpcGxZSDBvM2tIZm9H
-            bTFjb1ZCaDkrOU1JODVBVTBTbmxFbmcKLS0tIGtGcS9kejZPZlhHRXI5QnI5Wm9Q
-            VjMxTDdWZEltWThKVDl0S24yWHJxZHcKgzH79zT2I7ZgyTbbbvIhLN/rEcfiomJH
-            oSZDFvPiXlhPgy8bRyyq3l47CVpWbUI2Y7DFXRuODpLUirt3K3TmCA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBPcm9zUm1XUkpLWm1Jb3Uw
-            RncveGozOW5SRThEM1Y4SFF5RDdxUEhZTUE4CjVESHE5R3JZK0krOXZDL0RHR0oy
-            Z3JKaEpydjRjeFFHck1ic2JTRU5yZTQKLS0tIGY2ck56eG95YnpDYlNqUDh5RVp1
-            U3dRYkNleUtsQU1LMWpDbitJbnRIem8K+27HRtZihG8+k7ZC33XVfuXDFjC1e8lA
-            kffmxp9kOEShZF3IKmAjVHFBiPXRyGk3fGPyQLmSMK2UOOfCy/a/qA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBTZHlldDdSOEhjTklCSXQr
-            U2pXajFwZnNqQzZOTzY5b3lkMzlyREhXRWo4CmxId2F6NkNqeHNCSWNrcUJIY0Nw
-            cGF6NXJaQnovK1FYSXQ2TkJSTFloTUEKLS0tIHRhWk5aZ0lDVkZaZEJobm9FTDNw
-            a29sZE1GL2ZQSk0vUEc1ZGhkUlpNRkEK9tfe7cNOznSKgxshd5Z6TQiNKp+XW6XH
-            VvPgMqMitgiDYnUPj10bYo3kqhd0xZH2IhLXMnZnqqQ0I23zfPiNaw==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB5bk9NVjJNWmMxUGd3cXRx
-            amZ5SWJ3dHpHcnM4UHJxdmh6NnhFVmJQdldzCm95dHN3R21qSkE4Vm9VTnVPREp3
-            dUQyS1B4MWhhdmd3dk5LQ0htZEtpTWMKLS0tIGFaa3MxVExFYk1MY2loOFBvWm1o
-            L0NoRStkeW9VZVdpWlhteC8yTnRmMUkKMYjUdE1rGgVR29FnhJ5OEVjTB1Rh5Mtu
-            M/DvlhW3a7tZU8nDF3IgG2GE5xOXZMDO9QWGdB8zO2RJZAr3Q+YIlA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBU0xYMnhqOE0wdXdleStF
-            THcrY2NBQzNoRHdYTXY3ZmM5YXRZZkQ4aUZnCm9ad0IxSWxYT1JBd2RseUdVT1pi
-            UXBuNzFxVlN0OWNTQU5BV2NiVEV0RUUKLS0tIGJHY0dzSDczUzcrV0RpTjE0czEy
-            cWZMNUNlTzBRcEV5MjlRV1BsWGhoaUUKGhYaH8I0oPCfrbs7HbQKVOF/99rg3HXv
-            RRTXUI71/ejKIuxehOvifClQc3nUW73bWkASFQ0guUvO4R+c0xOgUg==
-            -----END AGE ENCRYPTED FILE-----
-    lastmodified: "2025-02-11T21:18:22Z"
-    mac: ENC[AES256_GCM,data:5//boMp1awc/2XAkSASSCuobpkxa0E6IKf3GR8xHpMoCD30FJsCwV7PgX3fR8OuLEhOJ7UguqMNQdNqG37RMacreuDmI1J8oCFKp+3M2j4kCbXaEo8bw7WAtyjUez+SAXKzZWYmBibH0KOy6jdt+v0fdgy5hMBT4IFDofYRsyD0=,iv:6pD+SLwncpmal/FR4U8It2njvaQfUzzpALBCxa0NyME=,tag:4QN8ZFjdqck5ZgulF+FtbA==,type:str]
-    unencrypted_suffix: _unencrypted
-    version: 3.9.4
--- a/services/ca/default.nix
+++ b/services/ca/default.nix
@@ -1,169 +0,0 @@
-{ pkgs, unstable, ... }:
-{
-  homelab.monitoring.scrapeTargets = [{
-    job_name = "step-ca";
-    port = 9000;
-  }];
-  sops.secrets."ca_root_pw" = {
-    sopsFile = ../../secrets/ca/secrets.yaml;
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/ca_root_pw";
-  };
-  sops.secrets."intermediate_ca_key" = {
-    sopsFile = ../../secrets/ca/keys/intermediate_ca_key;
-    format = "binary";
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/intermediate_ca_key";
-  };
-  sops.secrets."root_ca_key" = {
-    sopsFile = ../../secrets/ca/keys/root_ca_key;
-    format = "binary";
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/root_ca_key";
-  };
-  sops.secrets."ssh_host_ca_key" = {
-    sopsFile = ../../secrets/ca/keys/ssh_host_ca_key;
-    format = "binary";
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/ssh_host_ca_key";
-  };
-  sops.secrets."ssh_user_ca_key" = {
-    sopsFile = ../../secrets/ca/keys/ssh_user_ca_key;
-    format = "binary";
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/ssh_user_ca_key";
-  };
-
-  services.step-ca = {
-    enable = true;
-    package = pkgs.step-ca;
-    intermediatePasswordFile = "/var/lib/step-ca/secrets/ca_root_pw";
-    address = "0.0.0.0";
-    port = 443;
-    settings = {
-      metricsAddress = ":9000";
-      authority = {
-        provisioners = [
-          {
-            claims = {
-              enableSSHCA = true;
-              maxTLSCertDuration = "3600h";
-              defaultTLSCertDuration = "48h";
-            };
-            encryptedKey = "eyJhbGciOiJQQkVTMi1IUzI1NitBMTI4S1ciLCJjdHkiOiJqd2sranNvbiIsImVuYyI6IkEyNTZHQ00iLCJwMmMiOjYwMDAwMCwicDJzIjoiY1lWOFJPb3lteXFLMWpzcS1WM1ZXQSJ9.WS8tPK-Q4gtnSsw7MhpTzYT_oi-SQx-CsRLh7KwdZnpACtd4YbcOYg.zeyDkmKRx8BIp-eB.OQ8c-KDW07gqJFtEMqHacRBkttrbJRRz0sYR47vQWDCoWhodaXsxM_Bj2pGvUrR26ij1t7irDeypnJoh6WXvUg3n_JaIUL4HgTwKSBrXZKTscXmY7YVmRMionhAb6oS9Jgus9K4QcFDHacC9_WgtGI7dnu3m0G7c-9Ur9dcDfROfyrnAByJp1rSZMzvriQr4t9bNYjDa8E8yu9zq6aAQqF0Xg_AxwiqYqesT-sdcfrxKS61appApRgPlAhW-uuzyY0wlWtsiyLaGlWM7WMfKdHsq-VqcVrI7Gi2i77vi7OqPEberqSt8D04tIri9S_sArKqWEDnBJsL07CC41IY.CqtYfbSa_wlmIsKgNj5u7g";
-            key = {
-              alg = "ES256";
-              crv = "P-256";
-              kid = "CIjtIe7FNhsNQe1qKGD9Rpj-lrf2ExyTYCXAOd3YDjE";
-              kty = "EC";
-              use = "sig";
-              x = "XRMX-BeobZ-R5-xb-E9YlaRjJUfd7JQxpscaF1NMgFo";
-              y = "bF9xLp5-jywRD-MugMaOGbpbniPituWSLMlXRJnUUl0";
-            };
-            name = "ca@home.2rjus.net";
-            type = "JWK";
-          }
-          {
-            name = "acme";
-            type = "ACME";
-            claims = {
-              maxTLSCertDuration = "3600h";
-              defaultTLSCertDuration = "1800h";
-            };
-          }
-          {
-            claims = {
-              enableSSHCA = true;
-            };
-            name = "sshpop";
-            type = "SSHPOP";
-          }
-        ];
-      };
-      crt = "/var/lib/step-ca/certs/intermediate_ca.crt";
-      db = {
-        badgerFileLoadingMode = "";
-        dataSource = "/var/lib/step-ca/db";
-        type = "badgerv2";
-      };
-      dnsNames = [
-        "ca.home.2rjus.net"
-        "10.69.13.12"
-      ];
-      federatedRoots = null;
-      insecureAddress = "";
-      key = "/var/lib/step-ca/secrets/intermediate_ca_key";
-      logger = {
-        format = "text";
-      };
-      root = "/var/lib/step-ca/certs/root_ca.crt";
-      ssh = {
-        hostKey = "/var/lib/step-ca/secrets/ssh_host_ca_key";
-        userKey = "/var/lib/step-ca/secrets/ssh_user_ca_key";
-      };
-      templates = {
-        ssh = {
-          host = [
-            {
-              comment = "#";
-              name = "sshd_config.tpl";
-              path = "/etc/ssh/sshd_config";
-              requires = [
-                "Certificate"
-                "Key"
-              ];
-              template = ./templates/ssh/sshd_config.tpl;
-              type = "snippet";
-            }
-            {
-              comment = "#";
-              name = "ca.tpl";
-              path = "/etc/ssh/ca.pub";
-              template = ./templates/ssh/ca.tpl;
-              type = "snippet";
-            }
-          ];
-          user = [
-            {
-              comment = "#";
-              name = "config.tpl";
-              path = "~/.ssh/config";
-              template = ./templates/ssh/config.tpl;
-              type = "snippet";
-            }
-            {
-              comment = "#";
-              name = "step_includes.tpl";
-              path = "\${STEPPATH}/ssh/includes";
-              template = ./templates/ssh/step_includes.tpl;
-              type = "prepend-line";
-            }
-            {
-              comment = "#";
-              name = "step_config.tpl";
-              path = "ssh/config";
-              template = ./templates/ssh/step_config.tpl;
-              type = "file";
-            }
-            {
-              comment = "#";
-              name = "known_hosts.tpl";
-              path = "ssh/known_hosts";
-              template = ./templates/ssh/known_hosts.tpl;
-              type = "file";
-            }
-          ];
-        };
-      };
-      tls = {
-        cipherSuites = [
-          "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256"
-          "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"
-        ];
-        maxVersion = 1.3;
-        minVersion = 1.2;
-        renegotiation = false;
-      };
-    };
-  };
-}
--- a/services/ca/templates/ssh/ca.tpl
+++ b/services/ca/templates/ssh/ca.tpl
--- a/services/ca/templates/ssh/config.tpl
+++ b/services/ca/templates/ssh/config.tpl
@@ -1,14 +0,0 @@
-Host *
-{{- if or .User.GOOS "none" | eq "windows" }}
-{{- if .User.StepBasePath }}
-	Include "{{ .User.StepBasePath | replace "\\" "/" | trimPrefix "C:" }}/ssh/includes"
-{{- else }}
-	Include "{{ .User.StepPath | replace "\\" "/" | trimPrefix "C:" }}/ssh/includes"
-{{- end }}
-{{- else }}
-{{- if .User.StepBasePath }}
-	Include "{{.User.StepBasePath}}/ssh/includes"
-{{- else }}
-	Include "{{.User.StepPath}}/ssh/includes"
-{{- end }}
-{{- end }}
--- a/services/ca/templates/ssh/known_hosts.tpl
+++ b/services/ca/templates/ssh/known_hosts.tpl
@@ -1,4 +0,0 @@
-@cert-authority * {{.Step.SSH.HostKey.Type}} {{.Step.SSH.HostKey.Marshal | toString | b64enc}}
-{{- range .Step.SSH.HostFederatedKeys}}
-@cert-authority * {{.Type}} {{.Marshal | toString | b64enc}}
-{{- end }}
--- a/services/ca/templates/ssh/sshd_config.tpl
+++ b/services/ca/templates/ssh/sshd_config.tpl
@@ -1,4 +0,0 @@
-Match all
-	TrustedUserCAKeys /etc/ssh/ca.pub
-	HostCertificate /etc/ssh/{{.User.Certificate}}
-	HostKey /etc/ssh/{{.User.Key}}
--- a/services/ca/templates/ssh/step_config.tpl
+++ b/services/ca/templates/ssh/step_config.tpl
@@ -1,11 +0,0 @@
-Match exec "step ssh check-host{{- if .User.Context }} --context {{ .User.Context }}{{- end }} %h"
-{{- if .User.User }}
-	User {{.User.User}}
-{{- end }}
-{{- if or .User.GOOS "none" | eq "windows" }}
-	UserKnownHostsFile "{{.User.StepPath}}\ssh\known_hosts"
-	ProxyCommand C:\Windows\System32\cmd.exe /c step ssh proxycommand{{- if .User.Context }} --context {{ .User.Context }}{{- end }}{{- if .User.Provisioner }} --provisioner {{ .User.Provisioner }}{{- end }} %r %h %p
-{{- else }}
-	UserKnownHostsFile "{{.User.StepPath}}/ssh/known_hosts"
-	ProxyCommand step ssh proxycommand{{- if .User.Context }} --context {{ .User.Context }}{{- end }}{{- if .User.Provisioner }} --provisioner {{ .User.Provisioner }}{{- end }} %r %h %p
-{{- end }}
--- a/services/ca/templates/ssh/step_includes.tpl
+++ b/services/ca/templates/ssh/step_includes.tpl
@@ -1 +0,0 @@
-{{- if or .User.GOOS "none" | eq "windows" }}Include "{{ .User.StepPath | replace "\\" "/" | trimPrefix "C:" }}/ssh/config"{{- else }}Include "{{.User.StepPath}}/ssh/config"{{- end }}
--- a/services/http-proxy/proxy.nix
+++ b/services/http-proxy/proxy.nix
@@ -5,7 +5,7 @@
    package = pkgs.unstable.caddy;
    configFile = pkgs.writeText "Caddyfile" ''
      {
-        acme_ca https://ca.home.2rjus.net/acme/acme/directory
+        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory

        metrics {
          per_host
--- a/services/kanidm/default.nix
+++ b/services/kanidm/default.nix
@@ -0,0 +1,65 @@
+{ config, lib, pkgs, ... }:
+{
+  services.kanidm = {
+    package = pkgs.kanidmWithSecretProvisioning_1_8;
+    enableServer = true;
+    serverSettings = {
+      domain = "home.2rjus.net";
+      origin = "https://auth.home.2rjus.net";
+      bindaddress = "0.0.0.0:443";
+      ldapbindaddress = "0.0.0.0:636";
+      tls_chain = "/var/lib/acme/auth.home.2rjus.net/fullchain.pem";
+      tls_key = "/var/lib/acme/auth.home.2rjus.net/key.pem";
+      online_backup = {
+        path = "/var/lib/kanidm/backups";
+        schedule = "00 22 * * *";
+        versions = 7;
+      };
+    };
+
+    # Provision base groups only - users are managed via CLI
+    # See docs/user-management.md for details
+    provision = {
+      enable = true;
+      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
+
+      groups = {
+        admins = { };
+        users = { };
+        ssh-users = { };
+      };
+
+      # Regular users (persons) are managed imperatively via kanidm CLI
+    };
+  };
+
+  # Grant kanidm access to ACME certificates
+  users.users.kanidm.extraGroups = [ "acme" ];
+
+  # ACME certificate from internal CA
+  # Include both the CNAME (auth) and A record (kanidm01) for Prometheus scraping
+  security.acme.certs."auth.home.2rjus.net" = {
+    listenHTTP = ":80";
+    reloadServices = [ "kanidm" ];
+    extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
+  };
+
+  # Vault secret for idm_admin password (used for provisioning)
+  vault.secrets.kanidm-idm-admin = {
+    secretPath = "kanidm/idm-admin-password";
+    extractKey = "password";
+    services = [ "kanidm" ];
+    owner = "kanidm";
+    group = "kanidm";
+  };
+
+  # Note: Kanidm does not expose Prometheus metrics
+  # If metrics support is added in the future, uncomment:
+  # homelab.monitoring.scrapeTargets = [
+  #   {
+  #     job_name = "kanidm";
+  #     port = 443;
+  #     scheme = "https";
+  #   }
+  # ];
+}
--- a/services/monitoring/alloy.nix
+++ b/services/monitoring/alloy.nix
@@ -1,41 +0,0 @@
-{ ... }:
-{
-  services.alloy = {
-    enable = true;
-  };
-
-  environment.etc."alloy/config.alloy" = {
-    enable = true;
-    mode = "0644";
-    text = ''
-      pyroscope.write "local_pyroscope" {
-        endpoint {
-          url = "http://localhost:4040"
-        }
-      }
-
-      pyroscope.scrape "labmon" {
-        targets    = [{"__address__" = "localhost:9969", "service_name" = "labmon"}]
-        forward_to = [pyroscope.write.local_pyroscope.receiver]
-
-        profiling_config {
-          profile.process_cpu {
-            enabled = true
-          }
-          profile.memory {
-            enabled = true
-          }
-          profile.mutex {
-            enabled = true
-          }
-          profile.block {
-            enabled = true
-          }
-          profile.goroutine {
-            enabled = true
-          }
-        }
-      }
-    '';
-  };
-}
--- a/services/monitoring/default.nix
+++ b/services/monitoring/default.nix
@@ -7,7 +7,6 @@
    ./pve.nix
    ./alerttonotify.nix
    ./pyroscope.nix
-    ./alloy.nix
    ./tempo.nix
  ];
 }
--- a/services/monitoring/prometheus.nix
+++ b/services/monitoring/prometheus.nix
@@ -121,22 +121,20 @@ in

    scrapeConfigs = [
      # Auto-generated node-exporter targets from flake hosts + external
+      # Each static_config entry may have labels from homelab.host metadata
      {
        job_name = "node-exporter";
-        static_configs = [
-          {
-            targets = nodeExporterTargets;
-          }
-        ];
+        static_configs = nodeExporterTargets;
      }
      # Systemd exporter on all hosts (same targets, different port)
+      # Preserves the same label grouping as node-exporter
      {
        job_name = "systemd-exporter";
-        static_configs = [
-          {
-            targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
-          }
-        ];
+        static_configs = map
+          (cfg: cfg // {
+            targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+          })
+          nodeExporterTargets;
      }
      # Local monitoring services (not auto-generated)
      {
@@ -180,14 +178,6 @@ in
          }
        ];
      }
-      {
-        job_name = "labmon";
-        static_configs = [
-          {
-            targets = [ "monitoring01.home.2rjus.net:9969" ];
-          }
-        ];
-      }
      # TODO: nix-cache_caddy can't be auto-generated because the cert is issued
      # for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
      # Consider adding a target override to homelab.monitoring.scrapeTargets.
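
Note on the `static_configs` change above: `nodeExporterTargets` is now a list of static_config entries rather than a flat list of target strings, so the systemd-exporter job rewrites ports per entry and keeps each entry's labels. A minimal sketch of that transformation follows, assuming a hypothetical sample value for `nodeExporterTargets` (the real list is generated elsewhere in the repo and is not shown in this diff; which host carries which label is an assumption made for illustration):

```nix
# Standalone expression; evaluate with e.g. `nix-instantiate --eval --strict <file>`.
let
  # Hypothetical sample data. The label names (role, dns_role) are the ones
  # used in rules.yml; the host-to-label assignment here is assumed.
  nodeExporterTargets = [
    {
      targets = [ "ns1.home.2rjus.net:9100" ];
      labels = { dns_role = "primary"; };
    }
    {
      targets = [ "nix-cache01.home.2rjus.net:9100" ];
      labels = { role = "build-host"; };
    }
    {
      # Entry without labels: `//` only replaces `targets`, so nothing else is added.
      targets = [ "monitoring01.home.2rjus.net:9100" ];
    }
  ];
in
# Same mapping as in the diff: swap the node-exporter port for the
# systemd-exporter port inside each entry, leaving other attributes untouched.
map
  (cfg: cfg // {
    targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
  })
  nodeExporterTargets
```

Because `cfg // { targets = ...; }` overrides only the `targets` attribute, label-based selectors such as `role="build-host"` should match the same hosts in both jobs.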
--- a/services/monitoring/rules.yml
+++ b/services/monitoring/rules.yml
@@ -17,8 +17,9 @@ groups:
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk space is low on {{ $labels.instance }}. Please check."
+      # Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
      - alert: high_cpu_load
-        expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+        expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
        for: 15m
        labels:
          severity: warning
@@ -26,7 +27,7 @@ groups:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is high on {{ $labels.instance }}. Please check."
      - alert: high_cpu_load
-        expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+        expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
        for: 2h
        labels:
          severity: warning
@@ -115,8 +116,9 @@ groups:
        annotations:
          summary: "NSD not running on {{ $labels.instance }}"
          description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
+      # Only alert on primary DNS (secondary has cold cache after failover)
      - alert: unbound_low_cache_hit_ratio
-        expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
+        expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
        for: 15m
        labels:
          severity: warning
@@ -336,40 +338,6 @@ groups:
        annotations:
          summary: "Pyroscope service not running on {{ $labels.instance }}"
          description: "Pyroscope service not running on {{ $labels.instance }}"
-  - name: certificate_rules
-    rules:
-      - alert: certificate_expiring_soon
-        expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "TLS certificate expiring soon for {{ $labels.instance }}"
-          description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
-      - alert: step_ca_serving_cert_expiring
-        expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Step-CA serving certificate expiring"
-          description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
-      - alert: certificate_check_error
-        expr: labmon_tlsconmon_certificate_check_error == 1
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Error checking certificate for {{ $labels.address }}"
-          description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
-      - alert: step_ca_certificate_expiring
-        expr: labmon_stepmon_certificate_seconds_left < 3600
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Step-CA certificate expiring for {{ $labels.instance }}"
-          description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
  - name: proxmox_rules
    rules:
      - alert: pve_node_down
@@ -388,32 +356,6 @@ groups:
        annotations:
          summary: "Proxmox VM {{ $labels.id }} is stopped"
          description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."
-  - name: postgres_rules
-    rules:
-      - alert: postgres_down
-        expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "PostgreSQL not running on {{ $labels.instance }}"
-          description: "PostgreSQL has been down on {{ $labels.instance }} more than 5 minutes."
-      - alert: postgres_exporter_down
-        expr: up{job="postgres"} == 0
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "PostgreSQL exporter down on {{ $labels.instance }}"
-          description: "Cannot scrape PostgreSQL metrics from {{ $labels.instance }}."
-      - alert: postgres_high_connections
-        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "PostgreSQL connection pool near exhaustion on {{ $labels.instance }}"
-          description: "PostgreSQL is using over 80% of max_connections on {{ $labels.instance }}."
  - name: jellyfin_rules
    rules:
      - alert: jellyfin_down
--- a/services/nix-cache/proxy.nix
+++ b/services/nix-cache/proxy.nix
@@ -5,7 +5,7 @@
    package = pkgs.unstable.caddy;
    configFile = pkgs.writeText "Caddyfile" ''
      {
-        acme_ca https://ca.home.2rjus.net/acme/acme/directory
+        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
        metrics
      }

--- a/services/ns/resolver.nix
+++ b/services/ns/resolver.nix
@@ -45,7 +45,11 @@
      };
      stub-zone = {
        name = "home.2rjus.net";
-        stub-addr = "127.0.0.1@8053";
+        stub-addr = [
+          "127.0.0.1@8053"   # Local NSD
+          "10.69.13.5@8053"  # ns1
+          "10.69.13.6@8053"  # ns2
+        ];
      };
      forward-zone = {
        name = ".";
--- a/services/postgres/default.nix
+++ b/services/postgres/default.nix
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  imports = [
-    ./postgres.nix
-  ];
-}
--- a/services/postgres/postgres.nix
+++ b/services/postgres/postgres.nix
@@ -1,23 +0,0 @@
-{ pkgs, ... }:
-{
-  homelab.monitoring.scrapeTargets = [{
-    job_name = "postgres";
-    port = 9187;
-  }];
-
-  services.prometheus.exporters.postgres = {
-    enable = true;
-    runAsLocalSuperUser = true; # Use peer auth as postgres user
-  };
-
-  services.postgresql = {
-    enable = true;
-    enableJIT = true;
-    enableTCPIP = true;
-    extensions = ps: with ps; [ pgvector ];
-    authentication = ''
-      # Allow access to everything from gunter
-      host    all             all             10.69.30.105/32         scram-sha-256
-    '';
-  };
-}
--- a/system/acme.nix
+++ b/system/acme.nix
@@ -3,7 +3,7 @@
  security.acme = {
    acceptTerms = true;
    defaults = {
-      server = "https://ca.home.2rjus.net/acme/acme/directory";
+      server = "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";
      email = "root@home.2rjus.net";
      dnsPropagationCheck = false;
    };
--- a/system/default.nix
+++ b/system/default.nix
@@ -4,14 +4,15 @@
    ./acme.nix
    ./autoupgrade.nix
    ./homelab-deploy.nix
+    ./kanidm-client.nix
    ./monitoring
    ./motd.nix
    ./packages.nix
    ./nix.nix
    ./root-user.nix
    ./pki/root-ca.nix
-    ./sops.nix
    ./sshd.nix
    ./vault-secrets.nix
+    ./zram.nix
  ];
 }
--- a/system/kanidm-client.nix
+++ b/system/kanidm-client.nix
@@ -0,0 +1,42 @@
+{ lib, config, pkgs, ... }:
+let
+  cfg = config.homelab.kanidm;
+in
+{
+  options.homelab.kanidm = {
+    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";
+
+    server = lib.mkOption {
+      type = lib.types.str;
+      default = "https://auth.home.2rjus.net";
+      description = "URI of the Kanidm server";
+    };
+
+    allowedLoginGroups = lib.mkOption {
+      type = lib.types.listOf lib.types.str;
+      default = [ "ssh-users" ];
+      description = "Groups allowed to log in via PAM";
+    };
+  };
+
+  config = lib.mkIf cfg.enable {
+    services.kanidm = {
+      package = pkgs.kanidm_1_8;
+      enablePam = true;
+
+      clientSettings = {
+        uri = cfg.server;
+      };
+
+      unixSettings = {
+        pam_allowed_login_groups = cfg.allowedLoginGroups;
+        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
+        # This prevents "PAM user mismatch" errors with SSH
+        uid_attr_map = "name";
+        gid_attr_map = "name";
+        # Create symlink /home/torjus -> /home/torjus@home.2rjus.net
+        home_alias = "name";
+      };
+    };
+  };
+}
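
Usage note for the module above: it is imported for every host via `system/default.nix` but stays inactive until `homelab.kanidm.enable` is set. A minimal sketch of a host configuration that opts in, assuming a hypothetical per-host file (the file name and the extra `admins` entry are illustrative and not part of this commit; `admins` is one of the groups provisioned in `services/kanidm/default.nix`):

```nix
# Hypothetical host-level module, e.g. hosts/<hostname>/configuration.nix
{ ... }:
{
  homelab.kanidm = {
    enable = true;

    # Optional: `server` defaults to https://auth.home.2rjus.net and
    # `allowedLoginGroups` defaults to [ "ssh-users" ]. Overriding the
    # group list here is an assumption made for illustration.
    allowedLoginGroups = [ "ssh-users" "admins" ];
  };
}
```

With the defaults left in place, only members of `ssh-users` would pass the PAM login-group check.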
--- a/system/sops.nix
+++ b/system/sops.nix
@@ -1,7 +0,0 @@
-{ ... }: {
-  sops = {
-    defaultSopsFile = ../secrets/secrets.yaml;
-    age.keyFile = "/var/lib/sops-nix/key.txt";
-    age.generateKey = true;
-  };
-}
--- a/system/zram.nix
+++ b/system/zram.nix
@@ -0,0 +1,8 @@
+# Compressed swap in RAM
+#
+# Provides overflow memory during Nix builds and upgrades.
+# Prevents OOM kills on low-memory hosts (2GB VMs).
+{ ... }:
+{
+  zramSwap.enable = true;
+}
--- a/terraform/variables.tf
+++ b/terraform/variables.tf
@@ -33,7 +33,7 @@ variable "default_target_node" {
 variable "default_template_name" {
  description = "Default template VM name to clone from"
  type        = string
-  default     = "nixos-25.11.20260131.41e216c"
+  default     = "nixos-25.11.20260203.e576e3c"
 }

 variable "default_ssh_public_key" {
--- a/terraform/vault/approle.tf
+++ b/terraform/vault/approle.tf
@@ -66,26 +66,7 @@ locals {
      ]
    }

-    "pgdb1" = {
-      paths = [
-        "secret/data/hosts/pgdb1/*",
-      ]
-    }
-
    # Wave 3: DNS servers
-    "ns1" = {
-      paths = [
-        "secret/data/hosts/ns1/*",
-        "secret/data/shared/dns/*",
-      ]
-    }
-
-    "ns2" = {
-      paths = [
-        "secret/data/hosts/ns2/*",
-        "secret/data/shared/dns/*",
-      ]
-    }

    # Wave 4: http-proxy
    "http-proxy" = {
@@ -101,6 +82,13 @@ locals {
      ]
    }

+    # vault01: Vault server itself (fetches secrets from itself)
+    "vault01" = {
+      paths = [
+        "secret/data/hosts/vault01/*",
+      ]
+    }
+
  }
 }
