system: add pipe-to-loki helper script

Adds a system-wide script for sending command output or interactive sessions to Loki for easy sharing with Claude.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge pull request 'kanidm-pam-client' (#34) from kanidm-pam-client into master
2026-02-08 15:30:53 +01:00 · 2026-02-08 14:14:53 +00:00 · 2026-02-08 15:14:21 +01:00 · 2026-02-08 15:14:03 +01:00 · 2026-02-08 15:12:19 +01:00 · 2026-02-08 15:12:19 +01:00
108 changed files with 3627 additions and 1844 deletions
--- a/.claude/agents/auditor.md
+++ b/.claude/agents/auditor.md
@@ -0,0 +1,180 @@
 ---
 name: auditor
 description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
 tools: Read, Grep, Glob
 mcpServers:
  - lab-monitoring
 ---
 You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.
 ## Input
 You may receive:
 - A host or list of hosts to investigate
 - A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
 - Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
 - Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")
 ## Audit Log Structure
 Logs are shipped to Loki via promtail. Audit events use these labels:
 - `host` - hostname
 - `systemd_unit` - typically `auditd.service` for audit logs
 - `job` - typically `systemd-journal`
 Audit log entries contain structured data:
 - `EXECVE` - command execution with full arguments
 - `USER_LOGIN` / `USER_LOGOUT` - session start/end
 - `USER_CMD` - sudo command execution
 - `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
 - `SERVICE_START` / `SERVICE_STOP` - systemd service events
 ## Investigation Techniques
 ### 1. SSH Session Activity
 Find SSH logins and session activity:
 ```logql
 {host="<hostname>", systemd_unit="sshd.service"}
 ```
 Look for:
 - Accepted/Failed authentication
 - Session opened/closed
 - Unusual source IPs or users
 ### 2. Command Execution
 Query executed commands (filter out noise):
 ```logql
 {host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
 ```
 Further filtering:
 - Exclude systemd noise: `!= "systemd" != "/nix/store"`
 - Focus on specific commands: `|= "rm" |= "-rf"`
 - Focus on specific user: `|= "uid=1000"`
 ### 3. Sudo Activity
 Check for privilege escalation:
 ```logql
 {host="<hostname>"} |= "sudo" |= "COMMAND"
 ```
 Or via audit:
 ```logql
 {host="<hostname>"} |= "USER_CMD"
 ```
 ### 4. Service Manipulation
 Check if services were manually stopped/started:
 ```logql
 {host="<hostname>"} |= "EXECVE" |= "systemctl"
 ```
 ### 5. File Operations
 Look for file modifications (if auditd rules are configured):
 ```logql
 {host="<hostname>"} |= "EXECVE" |= "vim"
 {host="<hostname>"} |= "EXECVE" |= "nano"
 {host="<hostname>"} |= "EXECVE" |= "rm"
 ```
 ## Query Guidelines
 **Start narrow, expand if needed:**
 - Begin with `limit: 20-30`
 - Use tight time windows: `start: "15m"` or `start: "30m"`
 - Add filters progressively
 **Avoid:**
 - Querying all audit logs without an `EXECVE` filter (extremely verbose)
 - Large time ranges without specific filters
 - Limits over 50 without tight filters
 **Time-bounded queries:**
 When investigating around a specific event:
 ```logql
 {host="<hostname>"} |= "EXECVE" != "systemd"
 ```
 With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
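The `start` and `end` values bracket the event of interest. A minimal sketch of deriving such a window from an event timestamp follows; the helper name `window_around` and the default margins of 2 and 3 minutes are assumptions chosen for illustration:

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def window_around(event_iso, before_min=2, after_min=3):
    """Return (start, end) UTC timestamps that bracket an event."""
    event = datetime.strptime(event_iso, FMT).replace(tzinfo=timezone.utc)
    start = event - timedelta(minutes=before_min)
    end = event + timedelta(minutes=after_min)
    return start.strftime(FMT), end.strftime(FMT)


# An event at 14:32 yields the 14:30 to 14:35 window used above.
print(window_around("2026-02-08T14:32:00Z"))
```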
 ## Suspicious Patterns to Watch For
 1. **Unusual login times** - Activity outside normal hours
 2. **Failed authentication** - Brute force attempts
 3. **Privilege escalation** - Unexpected sudo usage
 4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
 5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
 6. **Persistence mechanisms** - Cron modifications, systemd service creation
 7. **Log tampering** - Commands targeting log files
 8. **Lateral movement** - SSH to other internal hosts
 9. **Service manipulation** - Stopping security services, disabling firewalls
 10. **Cleanup activity** - Deleting bash history, clearing logs
 ## Output Format
 ### For Standalone Security Reviews
 ```
 ## Activity Summary
 **Host:** <hostname>
 **Time Period:** <start> to <end>
 **Sessions Found:** <count>
 ## User Sessions
 ### Session 1: <user> from <source_ip>
 - **Login:** HH:MM:SSZ
 - **Logout:** HH:MM:SSZ (or ongoing)
 - **Commands executed:**
  - HH:MM:SSZ - <command>
  - HH:MM:SSZ - <command>
 ## Suspicious Activity
 [If any patterns from the watch list were detected]
 - **Finding:** <description>
 - **Evidence:** <log entries>
 - **Risk Level:** Low / Medium / High
 ## Summary
 [Overall assessment: normal activity, concerning patterns, or clear malicious activity]
 ```
 ### When Called by Another Agent
 Provide a focused response addressing the specific question:
 ```
 ## Audit Findings
 **Query:** <what was asked>
 **Time Window:** <investigated period>
 ## Relevant Activity
 [Chronological list of relevant events]
 - HH:MM:SSZ - <event>
 - HH:MM:SSZ - <event>
 ## Assessment
 [Direct answer to the question with supporting evidence]
 ```
 ## Guidelines
 - Reconstruct timelines chronologically
 - Correlate events (login → commands → logout)
 - Note gaps or missing data
 - Distinguish between automated (systemd, cron) and interactive activity
 - Consider the host's role and tier when assessing severity
 - When called by another agent, focus on answering their specific question
 - Don't speculate without evidence - state what the logs show and don't show
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -0,0 +1,211 @@
 ---
 name: investigate-alarm
 description: Investigates a single system alarm by querying Prometheus metrics and Loki logs, analyzing configuration files for affected hosts/services, and providing root cause analysis.
 tools: Read, Grep, Glob
 mcpServers:
  - lab-monitoring
  - git-explorer
 ---
 You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
 ## Input
 You will receive information about an alarm, which may include:
 - Alert name and severity
 - Affected host or service
 - Alert expression/threshold
 - Current value or status
 - When it started firing
 ## Investigation Process
 ### 1. Understand the Alert Context
 Start by understanding what the alert is measuring:
 - Use `get_alert` if you have a fingerprint, or `list_alerts` to find matching alerts
 - Use `get_metric_metadata` to understand the metric being monitored
 - Use `search_metrics` to find related metrics
 ### 2. Query Current State
 Gather evidence about the current system state:
 - Use `query` to check the current metric values and related metrics
 - Use `list_targets` to verify the host/service is being scraped successfully
 - Look for correlated metrics that might explain the issue
 ### 3. Check Service Logs
 Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.
 **Query strategies (start narrow, expand if needed):**
 - Start with `limit: 20-30`, increase only if needed
 - Use tight time windows: `start: "15m"` or `start: "30m"` initially
 - Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
 - Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
 **Common patterns:**
 - Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
 - All errors on host: `{host="<hostname>"} |= "error"`
 - Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
 **Avoid:**
 - Using `start: "1h"` with no filters on busy hosts
 - Limits over 50 without specific filters
 ### 4. Investigate User Activity
 For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
 **Always call the auditor when:**
 - A service stopped unexpectedly (may have been manually stopped)
 - A process was killed or a config was changed
 - You need to know who was logged in around the time of an incident
 - You need to understand what commands led to the current state
 - The cause isn't obvious from service logs alone
 **Do NOT try to query audit logs yourself.** The auditor is specialized for:
 - Parsing EXECVE records and reconstructing command lines
 - Correlating SSH sessions with commands executed
 - Identifying suspicious patterns
 - Filtering out systemd/nix-store noise
 **Example prompt for auditor:**
 ```
 Investigate user activity on <hostname> between <start_time> and <end_time>.
 Context: The prometheus-node-exporter service stopped at 14:32.
 Determine if it was manually stopped and by whom.
 ```
 Incorporate the auditor's findings into your timeline and root cause analysis.
 ### 5. Check Configuration (if relevant)
 If the alert relates to a NixOS-managed service:
 - Check host configuration in `/hosts/<hostname>/`
 - Check service modules in `/services/<service>/`
 - Look for thresholds, resource limits, or misconfigurations
 - Check `homelab.host` options for tier/priority/role metadata
 ### 6. Check for Configuration Drift
 Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
 - Hosts running outdated configurations
 - Recent changes that might have caused the issue
 - Whether a fix has already been committed but not deployed
 **Step 1: Get the deployed revision from Prometheus**
 ```promql
 nixos_flake_info{hostname="<hostname>"}
 ```
 The `current_rev` label contains the deployed git commit hash.
 **Step 2: Check if the host is behind master**
 ```
 resolve_ref("master")           # Get current master commit
 is_ancestor(deployed, master)   # Check if host is behind
 ```
 **Step 3: See what commits are missing**
 ```
 commits_between(deployed, master)  # List commits not yet deployed
 ```
 **Step 4: Check which files changed**
 ```
 get_diff_files(deployed, master)   # Files modified since deployment
 ```
 Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.
 **Step 5: View configuration at the deployed revision**
 ```
 get_file_at_commit(deployed, "services/<service>/default.nix")
 ```
 Compare against the current file to understand differences.
 **Step 6: Find when something changed**
 ```
 search_commits("<service-name>")   # Find commits mentioning the service
 get_commit_info(<hash>)            # Get full details of a specific change
 ```
 **Example workflow for a service-related alert:**
 1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
 2. `resolve_ref("master")` → `4633421`
 3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
 4. `commits_between("8959829", "4633421")` → 7 commits missing
 5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
 6. If a fix was committed after the deployed rev, recommend deployment
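The file check in Step 4 can be sketched as a simple prefix filter. This is an illustration only: `relevant_changes` is a hypothetical helper, and it assumes the changed-file list is a list of repository-relative paths.

```python
def relevant_changes(changed_files, hostname, service):
    """Keep only changed paths that can affect the given host."""
    # Prefixes taken from Step 4: host config, the service module, shared system config.
    prefixes = (f"hosts/{hostname}/", f"services/{service}/", "system/")
    return [path for path in changed_files if path.startswith(prefixes)]


changed = [
    "hosts/monitoring01/configuration.nix",
    "hosts/ns1/default.nix",
    "services/monitoring/default.nix",
    "system/sshd.nix",
    "README.md",
]
print(relevant_changes(changed, "monitoring01", "monitoring"))
```

An empty result suggests the missing commits are unlikely to explain the alert, although indirect changes (for example in `modules/` or `lib/`) are not covered by this filter.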
 ### 7. Consider Common Causes
 For infrastructure alerts, common causes include:
 - **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
 - **Configuration drift**: Host running outdated config, fix already in master
 - **Disk space**: Nix store growth, logs, temp files
 - **Memory pressure**: Service memory leaks, insufficient limits
 - **CPU**: Runaway processes, build jobs
 - **Network**: DNS issues, connectivity problems
 - **Service restarts**: Failed upgrades, configuration errors
 - **Scrape failures**: Service down, firewall issues, port changes
 **Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
 ## Output Format
 Provide a concise report with one of two outcomes:
 ### If Root Cause Identified:
 ```
 ## Root Cause
 [1-2 sentence summary of the root cause]
 ## Timeline
 [Chronological sequence of relevant events leading to the alert]
 - HH:MM:SSZ - [Event description]
 - HH:MM:SSZ - [Event description]
 - HH:MM:SSZ - [Alert fired]
 ### Timeline sources
 - HH:MM:SSZ - [Source for information about this event. Which metric or log file]
 - HH:MM:SSZ - [Source for information about this event. Which metric or log file]
 - HH:MM:SSZ - [Alert fired]
 ## Evidence
 - [Specific metric values or log entries that support the conclusion]
 - [Configuration details if relevant]
 ## Recommended Actions
 1. [Specific remediation step]
 2. [Follow-up actions if any]
 ```
 ### If Root Cause Unclear:
 ```
 ## Investigation Summary
 [What was checked and what was found]
 ## Possible Causes
 - [Hypothesis 1 with supporting/contradicting evidence]
 - [Hypothesis 2 with supporting/contradicting evidence]
 ## Additional Information Needed
 - [Specific data, logs, or access that would help]
 - [Suggested queries or checks for the operator]
 ```
 ## Guidelines
 - Be concise and actionable
 - Reference specific metric names and values as evidence
 - Include log snippets when they're informative
 - Don't speculate without evidence
 - If the alert is a false positive or expected behavior, explain why
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
 - **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:
 Available labels for log queries:
 - `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
+- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
 - `hostname` - Alternative to `host` for some streams
@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:
 - `sshd.service` - SSH daemon
 - `nix-gc.service` - Nix garbage collection
 ### Bootstrap Logs
 VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
 - `host` - Target hostname
 - `branch` - Git branch being deployed
 - `stage` - Bootstrap stage (see table below)
 **Bootstrap stages:**
 | Stage | Message | Meaning |
 |-------|---------|---------|
 | `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
 | `network_ok` | Network connectivity confirmed | Can reach git server |
 | `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
 | `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
 | `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
 | `building` | Starting nixos-rebuild boot | NixOS build starting |
 | `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
 | `failed` | nixos-rebuild failed - manual intervention required | Build failed |
 **Bootstrap queries:**
 ```logql
 {job="bootstrap"}                              # All bootstrap logs
 {job="bootstrap", host="myhost"}               # Specific host
 {job="bootstrap", stage="failed"}              # All failures
 {job="bootstrap", stage=~"building|success"}   # Track build progress
 ```
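For orientation, the sketch below shows the general shape of a Loki push request body carrying these labels. It is not the actual bootstrap script (which is not shown here); the function name `bootstrap_payload` is hypothetical, and the body layout follows Loki's JSON push format for `/loki/api/v1/push`.

```python
import json
import time


def bootstrap_payload(host, branch, stage, message, ts_ns=None):
    """Build a Loki push body for one bootstrap log line."""
    if ts_ns is None:
        ts_ns = time.time_ns()  # Loki expects nanosecond Unix timestamps
    return {
        "streams": [
            {
                "stream": {
                    "job": "bootstrap",
                    "host": host,
                    "branch": branch,
                    "stage": stage,
                },
                # Each value is a [timestamp-as-string, log line] pair.
                "values": [[str(ts_ns), message]],
            }
        ]
    }


body = bootstrap_payload("myhost", "master", "network_ok", "Network connectivity confirmed")
print(json.dumps(body, indent=2))
```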
 ### Extracting JSON Fields
 Parse JSON and filter on fields:
@@ -175,31 +205,95 @@ Disk space (root filesystem):
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```
-### Service-Specific Metrics
+### Prometheus Jobs
-Common job names:
+All available Prometheus job names:
 - `node-exporter` - System metrics (all hosts)
 - `nixos-exporter` - NixOS version/generation metrics
 - `caddy` - Reverse proxy metrics
 - `prometheus` / `loki` / `grafana` - Monitoring stack
 - `home-assistant` - Home automation
 - `step-ca` - Internal CA
-### Instance Label Format
+**System exporters (on all/most hosts):**
 - `node-exporter` - System metrics (CPU, memory, disk, network)
 - `nixos-exporter` - NixOS flake revision and generation info
 - `systemd-exporter` - Systemd unit status metrics
 - `homelab-deploy` - Deployment listener metrics
-The `instance` label uses FQDN format:
+**Service-specific exporters:**
 - `caddy` - Reverse proxy metrics (http-proxy)
 - `nix-cache_caddy` - Nix binary cache metrics
 - `home-assistant` - Home automation metrics (ha1)
 - `jellyfin` - Media server metrics (jelly01)
 - `kanidm` - Authentication server metrics (kanidm01)
 - `nats` - NATS messaging metrics (nats1)
 - `openbao` - Secrets management metrics (vault01)
 - `unbound` - DNS resolver metrics (ns1, ns2)
 - `wireguard` - VPN tunnel metrics (http-proxy)
-```
+**Monitoring stack (localhost on monitoring01):**
-<hostname>.home.2rjus.net:<port>
+- `prometheus` - Prometheus self-metrics
-```
+- `loki` - Loki self-metrics
 - `grafana` - Grafana self-metrics
 - `alertmanager` - Alertmanager metrics
 - `pushgateway` - Push-based metrics gateway
-Example queries filtering by host:
+**External/infrastructure:**
 - `pve-exporter` - Proxmox hypervisor metrics
 - `smartctl` - Disk SMART health (gunter)
 - `restic_rest` - Backup server metrics
 - `ghettoptt` - PTT service metrics (gunter)
 ### Target Labels
 All scrape targets have these labels:
 **Standard labels:**
 - `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
 - `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
 - `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
 **Host metadata labels** (when configured in `homelab.host`):
 - `role` - Host role (e.g., `dns`, `build-host`, `vault`)
 - `tier` - Deployment tier (`test` for test VMs, absent for prod)
 - `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
 ### Filtering by Host
 Use the `hostname` label for easy host filtering across all jobs:
 ```promql
-up{instance=~"monitoring01.*"}
+{hostname="ns1"}                    # All metrics from ns1
-node_load1{instance=~"ns1.*"}
+node_load1{hostname="monitoring01"} # Specific metric by hostname
 up{hostname="ha1"}                  # Check if ha1 is up
 ```
 This is simpler than wildcarding the `instance` label:
 ```promql
 # Old way (still works but verbose)
 up{instance=~"monitoring01.*"}
 # New way (preferred)
 up{hostname="monitoring01"}
 ```
 ### Filtering by Role/Tier
 Filter hosts by their role or tier:
 ```promql
 up{role="dns"}                      # All DNS servers (ns1, ns2)
 node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
 up{tier="test"}                     # All test-tier VMs
 up{dns_role="primary"}              # Primary DNS only (ns1)
 ```
 Current host labels:
 | Host | Labels |
 |------|--------|
 | ns1 | `role=dns`, `dns_role=primary` |
 | ns2 | `role=dns`, `dns_role=secondary` |
 | nix-cache01 | `role=build-host` |
 | vault01 | `role=vault` |
 | kanidm01 | `role=auth`, `tier=test` |
 | testvm01/02/03 | `tier=test` |
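As a small illustration of how these labels combine, the sketch below builds exact-match selectors. The `selector` helper is hypothetical, and it assumes label values contain no double quotes or backslashes.

```python
def selector(metric="", **labels):
    """Build a PromQL selector with exact-match label filters."""
    # Sort label names so the output is stable.
    matchers = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric}{{{matchers}}}" if matchers else metric


print(selector("up", role="dns"))
print(selector("up", tier="test", role="auth"))
print(selector(hostname="ns1"))
```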
 ---
 ## Troubleshooting Workflows
@@ -212,11 +306,12 @@ node_load1{instance=~"ns1.*"}
 ### Investigate Service Issues
-1. Check `up{job="<service>"}` for scrape failures
+1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
 4. Search for errors: `{host="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
 6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
 ### After Deploying Changes
@@ -225,6 +320,17 @@ node_load1{instance=~"ns1.*"}
 3. Check service logs for startup issues
 4. Check service metrics are being scraped
 ### Monitor VM Bootstrap
 When provisioning new VMs, track bootstrap progress:
 1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
 2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
 3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
 4. Check logs are flowing: `{host="<hostname>"}`
 See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
 ### Debug SSH/Access Issues
 ```logql
@@ -246,5 +352,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
 - Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
 - Log `MESSAGE` field contains the actual log content in JSON format
--- a/.mcp.json
+++ b/.mcp.json
@@ -33,6 +33,13 @@
        "--nats-url", "nats://nats1.home.2rjus.net:4222",
        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
      ]
    },
    "git-explorer": {
      "command": "nix",
      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
      "env": {
        "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
      }
    }
  }
 }
--- a/.sops.yaml
+++ b/.sops.yaml
@@ -1,52 +0,0 @@
 keys:
  - &admin_torjus age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
  - &server_ns1 age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
  - &server_ns2 age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
  - &server_ha1 age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
  - &server_http-proxy age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
  - &server_ca age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
  - &server_monitoring01 age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
  - &server_jelly01 age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
  - &server_nix-cache01 age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
  - &server_pgdb1 age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
  - &server_nats1 age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
 creation_rules:
  - path_regex: secrets/[^/]+\.(yaml|json|env|ini)
    key_groups:
      - age:
        - *admin_torjus
        - *server_ns1
        - *server_ns2
        - *server_ha1
        - *server_http-proxy
        - *server_ca
        - *server_monitoring01
        - *server_jelly01
        - *server_nix-cache01
        - *server_pgdb1
        - *server_nats1
  - path_regex: secrets/ca/[^/]+\.(yaml|json|env|ini|)
    key_groups:
      - age:
        - *admin_torjus
        - *server_ca
  - path_regex: secrets/monitoring01/[^/]+\.(yaml|json|env|ini)
    key_groups:
      - age:
        - *admin_torjus
        - *server_monitoring01
  - path_regex: secrets/ca/keys/.+
    key_groups:
      - age:
        - *admin_torjus
        - *server_ca
  - path_regex: secrets/nix-cache01/.+
    key_groups:
      - age:
        - *admin_torjus
        - *server_nix-cache01
  - path_regex: secrets/http-proxy/.+
    key_groups:
      - age:
        - *admin_torjus
        - *server_http-proxy
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -35,6 +35,10 @@ nix build .#create-host
 Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.
 ### SSH Commands
 Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.
 ### Testing Feature Branches on Hosts
 All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
@@ -61,25 +65,45 @@ Do not run `nix flake update`. Should only be done manually by user.
 ### Development Environment
 ```bash
-# Enter development shell (provides ansible, python3)
+# Enter development shell
 nix develop
 ```
 The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.
 **Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
 ```bash
 # Good - works regardless of current shell
 nix develop -c tofu plan
 # Avoid - requires user to be in devshell
 tofu plan
 ```
 **OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
 ```bash
 # Good - uses -chdir option
 nix develop -c tofu -chdir=terraform plan
 nix develop -c tofu -chdir=terraform/vault apply
 # Avoid - changing directories
 cd terraform && tofu plan
 ```
 ### Secrets Management
 Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
 `vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
 Terraform manages the secrets and AppRole policies in `terraform/vault/`.
 Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
 `.sops.yaml` or any file within `secrets/`. Ask the user to make such changes if necessary.
 ### Git Workflow
 **Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.
 **Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.
 **Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.
 When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
 ### Plan Management
@@ -132,67 +156,16 @@ Two MCP servers are available for searching NixOS options and packages:
 This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
-### Lab Monitoring Log Queries
+### Lab Monitoring
-The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
+The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:
-**Loki Label Reference:**
+- Available Prometheus jobs and exporters
 - Loki labels and LogQL query syntax
 - Bootstrap log monitoring for new VMs
 - Common troubleshooting workflows
- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
+The skill contains up-to-date information about all scrape targets, host labels, and example queries.
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
 - `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
 - `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
 Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
 **Example LogQL queries:**
 ```
 # Logs from a specific service on a host
 {host="ns2", systemd_unit="nsd.service"}
 # Substring match on log content
 {host="ns1", systemd_unit="nsd.service"} |= "error"
 # File-based logs (e.g., caddy access logs)
 {job="varlog", hostname="nix-cache01"}
 ```
 Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
 ### Lab Monitoring Prometheus Queries
 The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.
 **Prometheus Job Names:**
 - `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
 - `caddy` - Reverse proxy metrics (http-proxy)
 - `nix-cache_caddy` - Nix binary cache metrics
 - `home-assistant` - Home automation metrics
 - `jellyfin` - Media server metrics
 - `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
 - `step-ca` - Internal CA metrics
 - `pve-exporter` - Proxmox hypervisor metrics
 - `smartctl` - Disk SMART health (gunter)
 - `wireguard` - VPN metrics (http-proxy)
 - `pushgateway` - Push-based metrics (e.g., backup results)
 - `restic_rest` - Backup server metrics
 - `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
 **Example PromQL queries:**
 ```
 # Check all targets are up
 up
 # CPU usage for a specific host
 rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
 # Memory usage across all hosts
 node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
 # Disk space
 node_filesystem_avail_bytes{mountpoint="/"}
 ```
 ### Deploying to Test Hosts
@@ -229,6 +202,21 @@ deploy(role="vault", action="switch")
 **Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
 **Deploying to Prod Hosts:**
 The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:
 ```bash
 nix develop -c homelab-deploy -- deploy \
  --nats-url nats://nats1.home.2rjus.net:4222 \
  --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
  --branch <branch-name> \
  --action switch \
  deploy.prod.<hostname>
 ```
 Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
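A minimal sketch of building a subject in this format is shown below. The `deploy_subject` helper is hypothetical, and restricting the tier to `prod` and `test` is an assumption based on the two examples above.

```python
def deploy_subject(tier, hostname):
    """Return the NATS subject for deploying to a single host."""
    if tier not in ("prod", "test"):  # assumed set of tiers
        raise ValueError(f"unknown tier: {tier!r}")
    # "." separates subject tokens, so it must not appear inside the hostname.
    if not hostname or "." in hostname or " " in hostname:
        raise ValueError(f"invalid hostname: {hostname!r}")
    return f"deploy.{tier}.{hostname}"


print(deploy_subject("prod", "monitoring01"))
print(deploy_subject("test", "testvm01"))
```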
 **Verifying Deployments:**
 After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
@@ -248,10 +236,11 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
 - `/system/` - Shared system-level configurations applied to ALL hosts
-  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
+  - Core modules: nix.nix, sshd.nix, vault-secrets.nix, acme.nix, autoupgrade.nix
  - Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
  - Monitoring: node-exporter and promtail on every host
 - `/modules/` - Custom NixOS modules
-  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
+  - `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
 - `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,14 +248,14 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
-  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
+  - `vault/` - OpenBao (Vault) secrets server
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
+  - `actions-runner/` - GitHub Actions runner
  - `http-proxy/`, `postgres/`, `nats/`, `jellyfin/`, etc.
 - `/common/` - Shared configurations (e.g., VM guest agent)
 - `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
 - `/playbooks/` - Ansible playbooks for fleet management
 - `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)
 ### Configuration Inheritance
@@ -283,7 +272,7 @@ All hosts automatically get:
 - Nix binary cache (nix-cache.home.2rjus.net)
 - SSH with root login enabled
 - OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (ca.home.2rjus.net)
+- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
 - Daily auto-upgrades with auto-reboot
 - Prometheus node-exporter + Promtail (logs to monitoring01)
 - Monitoring scrape target auto-registration via `homelab.monitoring` options
@@ -292,28 +281,31 @@ All hosts automatically get:
 ### Active Hosts
-Production servers managed by `rebuild-all.sh`:
+Production servers:
 - `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `ca` - Internal Certificate Authority
+- `vault01` - OpenBao (Vault) secrets server + PKI CA
 - `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
 - `http-proxy` - Reverse proxy
 - `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
 - `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server
+- `nix-cache01` - Binary cache server + GitHub Actions runner
 - `pgdb1` - PostgreSQL database
 - `nats1` - NATS messaging server
-Template/test hosts:
+Test/staging hosts:
- `template1` - Base template for cloning new hosts
+- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
 Template hosts:
 - `template1`, `template2` - Base templates for cloning new hosts
 ### Flake Inputs
 - `nixpkgs` - NixOS 25.11 stable (primary)
 - `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management (legacy, only used by ca)
+- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
 - `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
 - Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
  - `labmon` - Lab monitoring
 ### Network Architecture
@@ -337,11 +329,6 @@ Most hosts use OpenBao (Vault) for secrets:
 - Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
 - Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
 Legacy SOPS (only used by `ca` host):
 - SOPS with age encryption, keys in `.sops.yaml`
 - Shared secrets: `/secrets/secrets.yaml`
 - Per-host secrets: `/secrets/<hostname>/`
 ### Auto-Upgrade System
 All hosts pull updates daily from:
@@ -402,9 +389,21 @@ Example VM deployment includes:
 - Custom CPU/memory/disk sizing
 - VLAN tagging
 - QEMU guest agent
 - Automatic Vault credential provisioning via `vault_wrapped_token`
 OpenTofu outputs the VM's IP address after deployment for easy SSH access.
 **Automatic Vault Credential Provisioning:**
 VMs can receive Vault (OpenBao) credentials automatically during bootstrap:
 1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
 2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
 3. The bootstrap script unwraps the token to obtain AppRole credentials
 4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild
 This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.
 #### Template Rebuilding and Terraform State
 When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
@@ -435,20 +434,11 @@ This means:
 ### Adding a New Host
-1. Create `/hosts/<hostname>/` directory
+See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:
-2. Copy structure from `template1` or similar host
+- Using the `create-host` script to generate host configurations
-3. Add host entry to `flake.nix` nixosConfigurations
+- Deploying VMs and secrets with OpenTofu
-4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
+- Monitoring the bootstrap process via Loki
-5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
+- Verification and troubleshooting steps
 6. Add `vault.enable = true;` to the host configuration
 7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
 8. Run `tofu apply` in `terraform/vault/`
 9. User clones template host
 10. User runs `prepare-host.sh` on new host
 11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
 12. Commit changes, and merge to master.
 13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
 14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
 **Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
@@ -484,11 +474,7 @@ Prometheus scrape targets are automatically generated from host configurations,
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
-Host monitoring options (`homelab.monitoring.*`):
+Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
 - `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
 - `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
 Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
@@ -507,13 +493,30 @@ DNS zone entries are automatically generated from host configurations:
 - **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
 - **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)
 Host DNS options (`homelab.dns.*`):
 - `enable` (default: `true`) - Include host in DNS zone generation
 - `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
 Hosts are automatically excluded from DNS if:
 - `homelab.dns.enable = false` (e.g., template hosts)
 - No static IP configured (e.g., DHCP-only hosts)
 - Network interface is a VPN/tunnel (wg*, tun*, tap*)
 To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
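The exclusion rules above can be sketched as a small predicate. This is only an illustration in Python, not the logic in `lib/dns-zone.nix`; the function name and its parameters are hypothetical.

```python
from fnmatch import fnmatchcase

# Interface name patterns treated as VPN/tunnel interfaces.
TUNNEL_PATTERNS = ("wg*", "tun*", "tap*")


def included_in_dns(dns_enabled: bool, static_ip, interface: str) -> bool:
    """Return True if a host interface should get a DNS record."""
    if not dns_enabled:       # homelab.dns.enable = false
        return False
    if static_ip is None:     # DHCP-only host, no static IP configured
        return False
    if any(fnmatchcase(interface, p) for p in TUNNEL_PATTERNS):
        return False          # VPN/tunnel interface
    return True


print(included_in_dns(True, "10.69.13.5", "ens18"))
print(included_in_dns(True, "10.69.13.5", "wg0"))
```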
 ### Homelab Module Options
 The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.
 **Host options (`homelab.host.*`):**
 - `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
 - `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
 - `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
 - `labels` - Free-form key-value metadata for host categorization
 **DNS options (`homelab.dns.*`):**
 - `enable` (default: `true`) - Include host in DNS zone generation
 - `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
 **Monitoring options (`homelab.monitoring.*`):**
 - `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
 - `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host
 **Deploy options (`homelab.deploy.*`):**
 - `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
--- a/README.md
+++ b/README.md
@@ -13,7 +13,6 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
 | `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
 | `jelly01` | Jellyfin media server |
 | `nix-cache01` | Nix binary cache |
 | `pgdb1` | PostgreSQL |
 | `nats1` | NATS messaging |
 | `vault01` | OpenBao (Vault) secrets management |
 | `template1`, `template2` | VM templates for cloning new hosts |
--- a/common/ssh-audit.nix
+++ b/common/ssh-audit.nix
@@ -0,0 +1,21 @@
 # SSH session command auditing
 #
 # Logs all commands executed by users who logged in interactively (SSH).
 # System services and nix builds are excluded via auid filter.
 #
 # Logs are sent to journald and forwarded to Loki via promtail.
 # Query with: {host="<hostname>"} |= "EXECVE"
 {
  # Enable Linux audit subsystem
  security.audit.enable = true;
  security.auditd.enable = true;
  # Log execve syscalls only from interactive login sessions
  # auid!=4294967295 means "audit login uid is set" (excludes system services, nix builds)
  security.audit.rules = [
    "-a exit,always -F arch=b64 -S execve -F auid!=4294967295"
  ];
  # Forward audit logs to journald (so promtail ships them to Loki)
  services.journald.audit = true;
 }
--- a/docs/host-creation.md
+++ b/docs/host-creation.md
@@ -0,0 +1,217 @@
 # Host Creation Pipeline
 This document describes the process for creating new hosts in the homelab infrastructure.
 ## Overview
 We use the `create-host` script, which generates default configurations from a template, to create new hosts. We then use OpenTofu to deploy both secrets and VMs. The VMs boot from a template image (built from `hosts/template2`), which starts a bootstrap process. This process applies the host's NixOS configuration and then reboots into the new config.
 ## Prerequisites
 All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.
 ```bash
 nix develop
 ```
 ## Steps
 Steps marked with **USER** must be performed by the user due to credential requirements.
 1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
 2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
 3. Add any secrets needed to `terraform/vault/`
 4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
 5. Push configuration to master (or the branch specified by `flake_branch`)
 6. **USER**: Apply terraform:
   ```bash
   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply
   ```
 7. Once terraform completes, a VM boots in Proxmox using the template image
 8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
 9. After reboot, the host should be operational
 10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
 11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets
 ## Tier Specification
 New hosts should set `homelab.host.tier` in their configuration:
 ```nix
 homelab.host.tier = "test";  # or "prod"
 ```
 - **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
 - **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
 ## Observability
 During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
 ```
 {job="bootstrap", host="<hostname>"}
 ```
 ### Bootstrap Stages
 The bootstrap process reports these stages via the `stage` label:
 | Stage | Message | Meaning |
 |-------|---------|---------|
 | `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
 | `network_ok` | Network connectivity confirmed | Can reach git server |
 | `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
 | `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
 | `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
 | `building` | Starting nixos-rebuild boot | NixOS build starting |
 | `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
 | `failed` | nixos-rebuild failed - manual intervention required | Build failed |
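One way to read the table is as a small state summary: `failed` and `success` are terminal, and the `vault_*` stages say whether credentials were provisioned. The Python sketch below encodes that reading; the function name and the returned status strings are assumptions for illustration, not something the bootstrap script emits.

```python
def bootstrap_outcome(stages):
    """Summarize a sequence of observed stage labels.

    Returns (status, has_vault_credentials).
    """
    seen = set(stages)
    if "failed" in seen:
        status = "failed"          # manual intervention required
    elif "success" in seen:
        status = "done"            # build complete, host is rebooting
    else:
        status = "in_progress"
    # Only vault_ok means AppRole credentials were stored;
    # vault_skip and vault_warn both continue without them.
    return status, "vault_ok" in seen


print(bootstrap_outcome(["starting", "network_ok", "vault_ok", "building"]))
print(bootstrap_outcome(["starting", "network_ok", "vault_warn", "building", "success"]))
```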
 ### Useful Queries
 ```
 # All bootstrap activity for a host
 {job="bootstrap", host="myhost"}
 # Track all failures
 {job="bootstrap", stage="failed"}
 # Monitor builds in progress
 {job="bootstrap", stage=~"building|success"}
 ```
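The queries above can also be sent to Loki's HTTP API (`/loki/api/v1/query_range`). The following Python sketch only builds the selector and request URL; the base URL `http://monitoring01:3100` is an assumption (3100 is Loki's default port), and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def bootstrap_selector(host=None, stage=None):
    """Build a LogQL stream selector for bootstrap logs."""
    matchers = ['job="bootstrap"']
    if host:
        matchers.append(f'host="{host}"')
    if stage:
        matchers.append(f'stage="{stage}"')
    return "{" + ", ".join(matchers) + "}"


def query_range_url(base_url, selector, limit=100):
    """Build a query_range URL; start/end are left to Loki's defaults."""
    params = urlencode({"query": selector, "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


# Assumed Loki address; adjust to the real endpoint.
print(query_range_url("http://monitoring01:3100", bootstrap_selector(stage="failed")))
```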
 Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
 ## Verification
 1. Check bootstrap completed successfully:
   ```
   {job="bootstrap", host="<hostname>", stage="success"}
   ```
 2. Verify the host is up and reporting metrics:
   ```promql
   up{instance=~"<hostname>.*"}
   ```
 3. Verify the correct flake revision is deployed:
   ```promql
   nixos_flake_info{instance=~"<hostname>.*"}
   ```
 4. Check logs are flowing:
   ```
   {host="<hostname>"}
   ```
 5. Confirm expected services are running and producing logs
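For step 3, the revision check can be automated against the JSON returned by Prometheus' `/api/v1/query` endpoint. The sketch below works on an already-decoded response and assumes the revision is exposed in the `current_rev` label; the function names and the sample data are made up for illustration.

```python
def deployed_revisions(response: dict) -> dict:
    """Map instance -> current_rev from a nixos_flake_info query result."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    revisions = {}
    for sample in response["data"]["result"]:
        metric = sample["metric"]
        revisions[metric.get("instance", "")] = metric.get("current_rev")
    return revisions


def runs_expected_revision(response: dict, expected_rev: str) -> bool:
    """True if at least one series was returned and all match expected_rev."""
    revisions = deployed_revisions(response)
    return bool(revisions) and all(r == expected_rev for r in revisions.values())


# Hypothetical response for illustration only.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"instance": "testvm01:9100", "current_rev": "abc123"},
                "value": [0, "1"],
            }
        ],
    },
}
print(runs_expected_revision(sample, "abc123"))
```

An empty result is treated as a failure, since a host that reports no `nixos_flake_info` series has not been verified.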
 ## Troubleshooting
 ### Bootstrap Failed
 #### Common Issues
 * The VM has trouble running the initial `nixos-rebuild`. This usually happens when packages are not available in our local nix-cache and have to be compiled from scratch.
 #### Troubleshooting
 1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
   ```
   {job="bootstrap", host="<hostname>"}
   ```
 2. **USER**: SSH into the host and check the bootstrap service:
   ```bash
   ssh root@<hostname>
   journalctl -u nixos-bootstrap.service
   ```
 3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
   ```bash
   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
   ```
 4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
 ### Vault Credentials Not Working
 Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or already been used.
 #### Troubleshooting
 1. Check if credentials exist on the host:
   ```bash
   ssh root@<hostname>
   ls -la /var/lib/vault/approle/
   ```
 2. Check bootstrap logs for vault-related stages:
   ```
   {job="bootstrap", host="<hostname>", stage=~"vault.*"}
   ```
 3. **USER**: Regenerate and provision credentials manually:
   ```bash
   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
   ```
 ### Host Not Appearing in DNS
 Usually caused by ns1/ns2 not yet running the commit that adds the new host.
 #### Troubleshooting
 1. Verify the host config has a static IP configured in `systemd.network.networks`
 2. Check that `homelab.dns.enable` is not set to `false`
 3. **USER**: Trigger auto-upgrade on DNS servers:
   ```bash
   ssh root@ns1 systemctl start nixos-upgrade.service
   ssh root@ns2 systemctl start nixos-upgrade.service
   ```
 4. Verify DNS resolution after upgrade completes:
   ```bash
   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
   ```
 ### Host Not Being Scraped by Prometheus
 Usually caused by monitoring01 not yet running the commit that adds the new host.
 #### Troubleshooting
 1. Check that `homelab.monitoring.enable` is not set to `false`
 2. **USER**: Trigger auto-upgrade on monitoring01:
   ```bash
   ssh root@monitoring01 systemctl start nixos-upgrade.service
   ```
 3. Verify the target appears in Prometheus:
   ```promql
   up{instance=~"<hostname>.*"}
   ```
 4. If the target is down, check that node-exporter is running on the host:
   ```bash
   ssh root@<hostname> systemctl status prometheus-node-exporter.service
   ```
 ## Related Files
 | Path | Description |
 |------|-------------|
 | `scripts/create-host/` | The `create-host` script that generates host configurations |
 | `hosts/template2/` | Template VM configuration (base image for new VMs) |
 | `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
 | `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
 | `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
 | `terraform/vault/approle.tf` | AppRole policies for each host |
 | `terraform/vault/secrets.tf` | Secret definitions in Vault |
 | `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
 | `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
 | `flake.nix` | Flake with all host configurations (add new hosts here) |
--- a/docs/plans/auth-system-replacement.md
+++ b/docs/plans/auth-system-replacement.md
@@ -2,7 +2,7 @@
 ## Overview
-Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
+Deploy a modern, unified authentication solution for the homelab. Provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.
 ## Goals
@@ -11,66 +11,9 @@ Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authe
 3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
 4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)
-## Options Evaluated
+## Solution: Kanidm
-### OpenLDAP (raw)
+Kanidm was chosen for the following reasons:
 - **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
 - **Pros:** Most widely supported, very flexible
 - **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
 - **Verdict:** Doesn't address LDAP complexity concerns
 ### LLDAP + Authelia (current)
 - **NixOS Support:** Both have good modules
 - **Pros:** Already configured, lightweight, nice web UIs
 - **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
 - **Verdict:** Workable but has friction for NAS/UID goals
 ### FreeIPA
 - **NixOS Support:** None
 - **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
 - **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
 - **Verdict:** Overkill, no NixOS support
 ### Keycloak
 - **NixOS Support:** None
 - **Pros:** Good OIDC/SAML, nice UI
 - **Cons:** Primarily an identity broker not a user directory, poor POSIX support, heavy (Java)
 - **Verdict:** Wrong tool for Linux user management
 ### Authentik
 - **NixOS Support:** None (would need Docker)
 - **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
 - **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
 - **Verdict:** Would work but requires Docker and is heavy
 ### Kanidm
 - **NixOS Support:** Excellent - first-class module with PAM/NSS integration
 - **Pros:**
  - Native PAM/NSS module (no SSSD needed)
  - Built-in OIDC provider
  - Optional LDAP interface for legacy services
  - Declarative provisioning via NixOS (users, groups, OAuth2 clients)
  - Modern, written in Rust
  - Single service handles everything
 - **Cons:** Newer project, smaller community than LDAP
 - **Verdict:** Best fit for requirements
 ### Pocket-ID
 - **NixOS Support:** Unknown
 - **Pros:** Very lightweight, passkey-first
 - **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
 - **Verdict:** Doesn't solve Linux user management goal
 ## Recommendation: Kanidm
 Kanidm is the recommended solution for the following reasons:
 | Requirement | Kanidm Support |
 |-------------|----------------|
@@ -82,42 +25,10 @@ Kanidm is the recommended solution for the following reasons:
 | Simplicity | Modern API, LDAP optional |
 | NixOS integration | First-class |
-### Key NixOS Features
+### Configuration Files
-**Server configuration:**
+- **Host configuration:** `hosts/kanidm01/`
-```nix
+- **Service module:** `services/kanidm/default.nix`
 services.kanidm.enableServer = true;
 services.kanidm.serverSettings = {
  domain = "home.2rjus.net";
  origin = "https://auth.home.2rjus.net";
  ldapbindaddress = "0.0.0.0:636";  # Optional LDAP interface
 };
 ```
 **Declarative user provisioning:**
 ```nix
 services.kanidm.provision.enable = true;
 services.kanidm.provision.persons.torjus = {
  displayName = "Torjus";
  groups = [ "admins" "nas-users" ];
 };
 ```
 **Declarative OAuth2 clients:**
 ```nix
 services.kanidm.provision.systems.oauth2.grafana = {
  displayName = "Grafana";
  originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
  originLanding = "https://grafana.home.2rjus.net";
 };
 ```
 **Client host configuration (add to system/):**
 ```nix
 services.kanidm.enableClient = true;
 services.kanidm.enablePam = true;
 services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
 ```
 ## NAS Integration
@@ -148,42 +59,103 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti
 ## Implementation Steps
-1. **Create Kanidm service module** in `services/kanidm/`
+1. **Create kanidm01 host and service module** ✅
-   - Server configuration
+   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
-   - TLS via internal ACME
+   - Service module: `services/kanidm/`
-   - Vault secrets for admin passwords
+   - TLS via internal ACME (`auth.home.2rjus.net`)
   - Vault integration for idm_admin password
   - LDAPS on port 636
-2. **Configure declarative provisioning**
+2. **Configure provisioning** ✅
-   - Define initial users and groups
+   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
-   - Set up POSIX attributes (UID/GID ranges)
+   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
   - POSIX attributes enabled (UID/GID range 65,536-69,999)
-3. **Add OIDC clients** for homelab services
+3. **Test NAS integration** (in progress)
-   - Grafana
+   - ✅ LDAP interface verified working
   - Other services as needed
 4. **Create client module** in `system/` for PAM/NSS
   - Enable on all hosts that need central auth
   - Configure trusted CA
 5. **Test NAS integration**
   - Configure TrueNAS LDAP client to connect to Kanidm
   - Verify UID/GID mapping works with NFS shares
-6. **Migrate auth01**
+4. **Add OIDC clients** for homelab services
-   - Remove LLDAP and Authelia services
+   - Grafana
-   - Deploy Kanidm
+   - Other services as needed
   - Update DNS CNAMEs if needed
-7. **Documentation**
+5. **Create client module** in `system/` for PAM/NSS ✅
-   - User management procedures
+   - Module: `system/kanidm-client.nix`
-   - Adding new OAuth2 clients
+   - `homelab.kanidm.enable = true` enables PAM/NSS
-   - Troubleshooting PAM/NSS issues
+   - Short usernames (not SPN format)
   - Home directory symlinks via `home_alias`
   - Enabled on test tier: testvm01, testvm02, testvm03
-## Open Questions
+6. **Documentation** ✅
   - `docs/user-management.md` - CLI workflows, troubleshooting
   - User/group creation procedures verified working
- What UID/GID range should be reserved for Kanidm-managed users?
+## Progress
- Which hosts should have PAM/NSS enabled initially?
+
- What OAuth2 clients are needed at launch?
+### Completed (2026-02-08)
 **Kanidm server deployed on kanidm01 (test tier):**
 - Host: `kanidm01.home.2rjus.net` (10.69.13.23)
 - WebUI: `https://auth.home.2rjus.net`
 - LDAPS: port 636
 - Valid certificate from internal CA
 **Configuration:**
 - Kanidm 1.8 with secret provisioning support
 - Daily backups at 22:00 (7 versions retained)
 - Vault integration for idm_admin password
 - Prometheus monitoring scrape target configured
 **Provisioned entities:**
 - Groups: `admins`, `users`, `ssh-users` (declarative)
 - Users managed via CLI (imperative)
 **Verified working:**
 - WebUI login with idm_admin
 - LDAP bind and search with POSIX-enabled user
 - LDAPS with valid internal CA certificate
 ### Completed (2026-02-08) - PAM/NSS Client
 **Client module deployed (`system/kanidm-client.nix`):**
 - `homelab.kanidm.enable = true` enables PAM/NSS integration
 - Connects to auth.home.2rjus.net
 - Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
 - Home directory symlinks (`/home/torjus` → UUID-based dir)
 - Login restricted to `ssh-users` group
 **Enabled on test tier:**
 - testvm01, testvm02, testvm03
 **Verified working:**
 - User/group resolution via `getent`
 - SSH login with Kanidm unix passwords
 - Home directory creation with symlinks
 - Imperative user/group creation via CLI
 **Documentation:**
 - `docs/user-management.md` with full CLI workflows
 - Password requirements (min 10 chars)
 - Troubleshooting guide (nscd, cache invalidation)
 ### UID/GID Range (Resolved)
 **Range: 65,536 - 69,999** (manually allocated)
 - Users: 65,536 - 67,999 (up to ~2500 users)
 - Groups: 68,000 - 69,999 (up to ~2000 groups)
 Rationale:
 - Starts at Kanidm's recommended minimum (65,536)
 - Well above NixOS system users (typically <1000)
 - Avoids Podman/container issues with very high GIDs
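The capacity figures follow directly from the range bounds: 67,999 − 65,536 + 1 = 2,464 user IDs and 69,999 − 68,000 + 1 = 2,000 group IDs. A short Python sketch of the allocation, with a hypothetical `classify_id` helper, makes the split explicit:

```python
# Ranges are inclusive in the plan, so the upper bounds are +1 here.
USER_IDS = range(65_536, 68_000)   # 65,536 - 67,999
GROUP_IDS = range(68_000, 70_000)  # 68,000 - 69,999


def classify_id(n: int) -> str:
    """Classify an ID against the manually allocated Kanidm ranges."""
    if n in USER_IDS:
        return "user"
    if n in GROUP_IDS:
        return "group"
    return "outside"


print(len(USER_IDS), len(GROUP_IDS))
print(classify_id(65_536), classify_id(68_000), classify_id(1000))
```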
 ### Next Steps
 1. Enable PAM/NSS on production hosts (after test tier validation)
 2. Configure TrueNAS LDAP client for NAS integration testing
 3. Add OAuth2 clients (Grafana first)
 ## References
--- a/docs/plans/cert-monitoring.md
+++ b/docs/plans/cert-monitoring.md
@@ -0,0 +1,72 @@
 # Certificate Monitoring Plan
 ## Summary
 This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.
 ## What Was Removed
 ### labmon Service
 The `labmon` service was a custom Go application that provided:
 1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
 2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration
 The service exposed Prometheus metrics at `:9969` including:
 - `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
 - `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
 - `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration
 ### Affected Files
 - `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
 - `services/monitoring/prometheus.nix` - Removed labmon scrape target
 - `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
 - `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
 - `services/monitoring/default.nix` - Removed alloy.nix import
 ### Removed Alerts
 - `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
 - `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
 - `certificate_check_error` - Warned when TLS connection check failed
 - `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates
 ## Why It Was Removed
 1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
 2. **Outdated codebase**: labmon was a custom tool that required maintenance
 3. **Limited value**: With ACME auto-renewal, certificates should renew automatically
 ## Current State
 ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.
 ## Future Needs
 While ACME handles renewal automatically, we should consider monitoring for:
 1. **ACME renewal failures**: Alert when a certificate fails to renew
   - Could monitor ACME client logs (via Loki queries)
   - Could check certificate file modification times
 2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures
 3. **Certificate transparency**: Monitor for unexpected certificate issuance
 ### Potential Solutions
 1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
   - `probe_ssl_earliest_cert_expiry` metric
   - Already a standard tool, well-maintained
 2. **Custom Loki alerting**: Query ACME service logs for renewal failures
   - Works with existing infrastructure
   - No additional services needed
 3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics
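As a minimal sketch of the textfile-collector idea, the Python snippet below turns a certificate's `notAfter` timestamp (for example as printed by `openssl x509 -enddate -noout`) into a metric line. The metric name `homelab_cert_expiry_seconds_left` and the helper names are hypothetical, and reading the certificate and writing the `.prom` file are left out.

```python
import ssl
import time

METRIC = "homelab_cert_expiry_seconds_left"  # hypothetical metric name


def seconds_left(not_after: str, now=None) -> int:
    """Seconds until expiry for a notAfter string like 'Jan  1 00:00:00 2030 GMT'."""
    if now is None:
        now = time.time()
    return int(ssl.cert_time_to_seconds(not_after) - now)


def textfile_line(path: str, not_after: str, now=None) -> str:
    """Render one line in the Prometheus text exposition format."""
    return f'{METRIC}{{path="{path}"}} {seconds_left(not_after, now)}'


print(textfile_line("/var/lib/acme/example/cert.pem", "Jan  1 00:00:00 2030 GMT"))
```

A script like this would complement rather than replace endpoint probing, since it only sees certificate files on the local host.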
 ## Status
 **Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.
--- a/docs/plans/completed/automated-host-deployment-pipeline.md
+++ b/docs/plans/completed/automated-host-deployment-pipeline.md
--- a/docs/plans/completed/bootstrap-cache.md
+++ b/docs/plans/completed/bootstrap-cache.md
@@ -0,0 +1,35 @@
 # Plan: Configure Template2 to Use Nix Cache
 ## Problem
 New VMs bootstrapped from template2 don't use our local nix cache (nix-cache.home.2rjus.net) during the initial `nixos-rebuild boot`. This means the first build downloads everything from cache.nixos.org, which is slower and uses more bandwidth.
 ## Solution
 Update the template2 base image to include the nix cache configuration, so new VMs immediately benefit from cached builds during bootstrap.
 ## Implementation
 1. Add nix cache configuration to `hosts/template2/configuration.nix`:
   ```nix
   nix.settings = {
     substituters = [ "https://nix-cache.home.2rjus.net" "https://cache.nixos.org" ];
     trusted-public-keys = [
       "nix-cache.home.2rjus.net:..."  # Add the cache's public key
       "cache.nixos.org-1:..."
     ];
   };
   ```
 2. Rebuild and redeploy the Proxmox template:
   ```bash
   nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
   ```
 3. Update `default_template_name` in `terraform/variables.tf` if the template name changed
 ## Benefits
 - Faster VM bootstrap times
 - Reduced bandwidth to external cache
 - Most derivations will already be cached from other hosts
--- a/docs/plans/completed/nats-deploy-service.md
+++ b/docs/plans/completed/nats-deploy-service.md
--- a/docs/plans/completed/ns1-recreation.md
+++ b/docs/plans/completed/ns1-recreation.md
@@ -0,0 +1,107 @@
 # ns1 Recreation Plan
 ## Overview
 Recreate ns1 using the OpenTofu workflow after the existing VM entered emergency mode due to an incorrect `hardware-configuration.nix` (hardcoded UUIDs that don't match the actual disk layout).
 ## Current ns1 Configuration to Preserve
 - **IP:** 10.69.13.5/24
 - **Gateway:** 10.69.13.1
 - **Role:** Primary DNS (authoritative + resolver)
 - **Services:**
  - `../../services/ns/master-authorative.nix`
  - `../../services/ns/resolver.nix`
 - **Metadata:**
  - `homelab.host.role = "dns"`
  - `homelab.host.labels.dns_role = "primary"`
 - **Vault:** enabled
 - **Deploy:** enabled
 ## Execution Steps
 ### Phase 1: Remove Old Configuration
 ```bash
 nix develop -c create-host --remove --hostname ns1 --force
 ```
 This removes:
 - `hosts/ns1/` directory
 - Entry from `flake.nix`
 - Any terraform entries (none exist currently)
 ### Phase 2: Create New Configuration
 ```bash
 nix develop -c create-host --hostname ns1 --ip 10.69.13.5/24
 ```
 This creates:
 - `hosts/ns1/` with template2-based configuration
 - Entry in `flake.nix`
 - Entry in `terraform/vms.tf`
 - Vault wrapped token for bootstrap
 ### Phase 3: Customize Configuration
 After create-host, manually update `hosts/ns1/configuration.nix` to add:
 1. DNS service imports:
   ```nix
   ../../services/ns/master-authorative.nix
   ../../services/ns/resolver.nix
   ```
 2. Host metadata:
   ```nix
   homelab.host = {
     tier = "prod";
     role = "dns";
     labels.dns_role = "primary";
   };
   ```
 3. Disable resolved (conflicts with Unbound):
   ```nix
   services.resolved.enable = false;
   ```
 ### Phase 4: Commit Changes
 ```bash
 git add -A
 git commit -m "ns1: recreate with OpenTofu workflow
 Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
 that didn't match actual disk layout, causing boot failure.
 Recreated using template2-based configuration for OpenTofu provisioning."
 ```
 ### Phase 5: Infrastructure
 1. Delete old ns1 VM in Proxmox (it's broken anyway)
 2. Run `nix develop -c tofu -chdir=terraform apply`
 3. Wait for bootstrap to complete
 4. Verify ns1 is functional:
   - DNS resolution working
   - Zone transfer to ns2 working
   - All exporters responding
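
 The DNS checks above could be scripted roughly as follows. This is a minimal sketch, not part of the existing tooling: the zone name `home.2rjus.net` and the helper names `looks_like_ipv4` and `check_a_record` are assumptions made for illustration, and only the ns1 address comes from this plan.

 ```bash
 NS1_IP="10.69.13.5"        # from this plan
 ZONE="home.2rjus.net"      # assumed zone name, adjust as needed

 # Succeeds if the argument looks like a dotted-quad IPv4 address.
 looks_like_ipv4() {
   printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
 }

 # Ask one specific server for an A record and check that an address came back.
 check_a_record() {
   answer="$(dig +short +time=2 +tries=1 "@$1" "$2" A | head -n 1)"
   looks_like_ipv4 "$answer"
 }

 # Manual usage after bootstrap:
 #   check_a_record "$NS1_IP" "ns1.$ZONE" && echo "resolution ok"
 #   dig "@$NS1_IP" "$ZONE" AXFR    # run from ns2, assuming ns2 is allowed to transfer
 ```

 Querying the server by IP avoids depending on the resolver that is being verified.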
 ### Phase 6: Finalize
 - Push to master
 - Move this plan to `docs/plans/completed/`
 ## Rollback
 If the new VM fails:
 1. ns2 is still operational as secondary DNS
 2. Can recreate with different settings if needed
 ## Notes
 - ns2 will continue serving DNS during the migration
 - Zone data is generated from flake, so no data loss
 - The old VM's disk can be kept briefly in Proxmox as backup if desired
--- a/docs/plans/completed/prometheus-scrape-target-labels.md
+++ b/docs/plans/completed/prometheus-scrape-target-labels.md
@@ -1,10 +1,38 @@
 # Prometheus Scrape Target Labels
 ## Implementation Status
 | Step | Status | Notes |
 |------|--------|-------|
 | 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
 | 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
 | 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
 | 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
 | 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
 | 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
 | 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |
 **Hosts with metadata configured:**
 - `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
 - `nix-cache01`: `role = "build-host"`
 - `vault01`: `role = "vault"`
 - `testvm01/02/03`: `tier = "test"`
 **Implementation complete.** Branch: `prometheus-scrape-target-labels`
 **Query examples:**
 - `{hostname="ns1"}` - all metrics from ns1 (any job/port)
 - `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
 - `up{role="dns"}` - all DNS servers
 - `up{tier="test"}` - all test-tier hosts
 ---
 ## Goal
 Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
-**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
+**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
 ## Motivation
@@ -54,12 +82,11 @@ or
 ## Implementation
-This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
+This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.
 ### 1. Create `homelab.host` module
-**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
+✅ **Complete.** The module is in
 `modules/homelab/host.nix` with tier, priority, role, and labels options.
 Create `modules/homelab/host.nix` with shared host metadata options:
@@ -98,6 +125,8 @@ Import this module in `modules/homelab/default.nix`.
 ### 2. Update `lib/monitoring.nix`
 ✅ **Complete.** Labels are now extracted and propagated.
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:
@@ -126,6 +155,8 @@ This requires grouping hosts by their label attrset and producing one `static_co
 ### 3. Update `services/monitoring/prometheus.nix`
 ✅ **Complete.** Now uses structured static_configs output.
 Change the node-exporter scrape config to use the new structured output:
 ```nix
@@ -138,36 +169,37 @@ static_configs = nodeExporterTargets;
 ### 4. Set metadata on hosts
 ✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
 Example in `hosts/nix-cache01/configuration.nix`:
 ```nix
 homelab.host = {
  tier = "test";       # can be deployed by MCP (used by homelab-deploy)
  priority = "low";    # relaxed alerting thresholds
  role = "build-host";
 };
 ```
 **Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.
 Example in `hosts/ns1/configuration.nix`:
 ```nix
 homelab.host = {
  tier = "prod";
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
 };
 ```
 **Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
 ### 5. Update alert rules
-After implementing labels, review and update `services/monitoring/rules.yml`:
+✅ **Complete.** Updated `services/monitoring/rules.yml`:
- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
+- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
- Consider whether any other rules should differentiate by priority or role.
+- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).
-Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
+### 6. Labels for `generateScrapeConfigs` (service targets)
-### 6. Consider labels for `generateScrapeConfigs` (service targets)
+✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
 The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
--- a/docs/plans/host-migration-to-opentofu.md
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -9,24 +9,23 @@ hosts are decommissioned or deferred.
 ## Current State
-Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
+Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`
 Hosts to migrate:
 | Host | Category | Notes |
 |------|----------|-------|
-| ns1 | Stateless | Primary DNS, recreate |
+| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
 | ns2 | Stateless | Secondary DNS, recreate |
 | nix-cache01 | Stateless | Binary cache, recreate |
 | http-proxy | Stateless | Reverse proxy, recreate |
 | nats1 | Stateless | Messaging, recreate |
 | auth01 | Decommission | No longer in use |
 | ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
 | monitoring01 | Stateful | Prometheus, Grafana, Loki |
 | jelly01 | Stateful | Jellyfin metadata, watch history, config |
-| pgdb1 | Stateful | PostgreSQL databases |
+| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
-| jump | Decommission | No longer needed |
+| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
-| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
+| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
 | ~~ca~~ | ~~Deferred~~ | ✓ Complete |
 ## Phase 1: Backup Preparation
@@ -46,39 +45,19 @@ No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` whi
 Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
 The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.
-### 1c. Add PostgreSQL Backup to pgdb1
+### 1c. Verify Existing ha1 Backup
 No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
 all databases and roles. The dump should be piped through restic's stdin backup (similar to
 the Grafana DB dump pattern on monitoring01).
 ### 1d. Verify Existing ha1 Backup
 ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
 these backups are current and restorable before proceeding with migration.
-### 1e. Verify All Backups
+### 1d. Verify All Backups
 After adding/expanding backup jobs:
 1. Trigger a manual backup run on each host
 2. Verify backup integrity with `restic check`
 3. Test a restore to a temporary location to confirm data is recoverable
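
 The check-and-restore steps could look roughly like this on each host. It is a sketch under assumptions: the function name `verify_backup` is hypothetical, and the repository and password are expected to come from the standard `RESTIC_REPOSITORY` and `RESTIC_PASSWORD_FILE` environment variables, however the backup jobs on these hosts provide them.

 ```bash
 # Verify repository integrity, then restore the latest snapshot into a scratch directory.
 # RESTIC can be overridden (for example with a wrapper script); it defaults to restic.
 verify_backup() {
   restore_dir="$(mktemp -d)" || return 1
   ${RESTIC:-restic} check &&
     ${RESTIC:-restic} restore latest --target "$restore_dir"
   status=$?
   echo "restore target: $restore_dir"
   return "$status"
 }

 # Manual usage:
 #   verify_backup && echo "backup is restorable"
 ```

 The restore directory is left in place so its contents can be inspected before being removed by hand.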
-## Phase 2: Declare pgdb1 Databases in Nix
+## Phase 2: Stateless Host Migration
 Before migrating pgdb1, audit the manually-created databases and users on the running
 instance, then declare them in the Nix configuration using `ensureDatabases` and
 `ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
 Steps:
 1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
 2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
 3. Document any non-default PostgreSQL settings or extensions per database
 After reprovisioning, the databases will be created by NixOS, and data restored from the
 `pg_dumpall` backup.
 ## Phase 3: Stateless Host Migration
 These hosts have no meaningful state and can be recreated fresh. For each host:
@@ -95,13 +74,14 @@ Migrate stateless hosts in an order that minimizes disruption:
 1. **nix-cache01** — low risk, no downstream dependencies during migration
 2. **nats1** — low risk, verify no persistent JetStream streams first
-4. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
+3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
-5. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
+4. ~~**ns1** — ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete
-For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
+~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ Both ns1
-use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
+and ns2 migration complete. Zone transfer (AXFR) verified working between ns1 (primary) and
 ns2 (secondary).
-## Phase 4: Stateful Host Migration
+## Phase 3: Stateful Host Migration
 For each stateful host, the procedure is:
@@ -114,17 +94,7 @@ For each stateful host, the procedure is:
 7. Start services and verify functionality
 8. Decommission the old VM
-### 4a. pgdb1
+### 3a. monitoring01
 1. Run final `pg_dumpall` backup via restic
 2. Stop PostgreSQL on the old host
 3. Provision new pgdb1 via OpenTofu
 4. After bootstrap, NixOS creates the declared databases/users
 5. Restore data with `pg_restore` or `psql < dumpall.sql`
 6. Verify database connectivity from gunter (`10.69.30.105`)
 7. Decommission old VM
 ### 4b. monitoring01
 1. Run final Grafana backup
 2. Provision new monitoring01 via OpenTofu
@@ -134,7 +104,7 @@ For each stateful host, the procedure is:
 6. Verify all scrape targets are being collected
 7. Decommission old VM
-### 4c. jelly01
+### 3b. jelly01
 1. Run final Jellyfin backup
 2. Provision new jelly01 via OpenTofu
@@ -143,7 +113,7 @@ For each stateful host, the procedure is:
 5. Start Jellyfin, verify watch history and library metadata are present
 6. Decommission old VM
-### 4d. ha1
+### 3c. ha1
 1. Verify latest restic backup is current
 2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
@@ -167,47 +137,69 @@ OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropr
 `usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
 through before starting Zigbee2MQTT on the new host.
-## Phase 5: Decommission jump and auth01 Hosts
+## Phase 4: Decommission Hosts
-### jump
+### jump ✓ COMPLETE
-1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
+
-2. Remove host configuration from `hosts/jump/`
+~~1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)~~
-3. Remove from `flake.nix`
+~~2. Remove host configuration from `hosts/jump/`~~
-4. Remove any secrets in `secrets/jump/`
+~~3. Remove from `flake.nix`~~
-5. Remove from `.sops.yaml`
+~~4. Remove any secrets in `secrets/jump/`~~
 ~~5. Remove from `.sops.yaml`~~
 ~~6. Destroy the VM in Proxmox~~
 ~~7. Commit cleanup~~
 Host was already removed from flake.nix and VM destroyed. Configuration cleaned up in ba9f47f.
 ### auth01 ✓ COMPLETE
 ~~1. Remove host configuration from `hosts/auth01/`~~
 ~~2. Remove from `flake.nix`~~
 ~~3. Remove any secrets in `secrets/auth01/`~~
 ~~4. Remove from `.sops.yaml`~~
 ~~5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)~~
 ~~6. Destroy the VM in Proxmox~~
 ~~7. Commit cleanup~~
 Host configuration, services, and VM already removed.
 ### pgdb1 (in progress)
 Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
 1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
 2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
 3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
 4. ~~Remove from `flake.nix`~~ ✓
 5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
 6. Destroy the VM in Proxmox
-7. Commit cleanup
+7. ~~Commit cleanup~~ ✓
-### auth01
+See `docs/plans/pgdb1-decommission.md` for detailed plan.
 1. Remove host configuration from `hosts/auth01/`
 2. Remove from `flake.nix`
 3. Remove any secrets in `secrets/auth01/`
 4. Remove from `.sops.yaml`
 5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
 6. Destroy the VM in Proxmox
 7. Commit cleanup
-## Phase 6: Decommission ca Host (Deferred)
+## Phase 5: Decommission ca Host ✓ COMPLETE
-Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
+~~Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
 OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
-the same cleanup steps as the jump host.
+the same cleanup steps as the jump host.~~
-## Phase 7: Remove sops-nix
+PKI migration to OpenBao complete. Host configuration, `services/ca/`, and VM removed.
-Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
+## Phase 6: Remove sops-nix ✓ COMPLETE
 all remnants:
 - `sops-nix` input from `flake.nix` and `flake.lock`
 - `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
 - `inherit sops-nix` from all specialArgs in `flake.nix`
 - `system/sops.nix` and its import in `system/default.nix`
 - `.sops.yaml`
 - `secrets/` directory
 - All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
 - Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
  `hosts/template2/scripts.nix`)
-See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
+~~Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
 all remnants:~~
 ~~- `sops-nix` input from `flake.nix` and `flake.lock`~~
 ~~- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`~~
 ~~- `inherit sops-nix` from all specialArgs in `flake.nix`~~
 ~~- `system/sops.nix` and its import in `system/default.nix`~~
 ~~- `.sops.yaml`~~
 ~~- `secrets/` directory~~
 ~~- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`~~
 ~~- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
  `hosts/template2/scripts.nix`)~~
 All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migration.md` for context.
 ## Notes
@@ -216,7 +208,7 @@ See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
 - The old VMs use IPs that the new VMs need, so the old VM must be shut down before
  the new one is provisioned (or use a temporary IP and swap after verification)
 - Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only hosts not in OpenTofu will be ca (deferred)
+- All decommissioned hosts (jump, auth01, ca) have been removed
 - Since many hosts are being recreated, this is a good opportunity to establish consistent
  hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
  (e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
--- a/docs/plans/memory-issues-follow-up.md
+++ b/docs/plans/memory-issues-follow-up.md
@@ -0,0 +1,116 @@
 # Memory Issues Follow-up
 Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
 ## Background
 On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
 Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
 ## Fix Applied
 **Commit:** `1674b6a` - system: enable zram swap for all hosts
 **Merged:** 2026-02-08 ~12:15 UTC
 **Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
 ## Timeline
 | Time (UTC) | Event |
 |------------|-------|
 | 05:00:46 | ns2 nixos-upgrade OOM killed |
 | 05:01:47 | `nixos_upgrade_failed` alert fired |
 | 12:15 | zram commit merged to master |
 | 12:19 | ns2 rebooted with zram enabled |
 | 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
 ## Hosts Affected
 All 2GB VMs that run nixos-upgrade:
 - ns1, ns2 (DNS)
 - vault01
 - testvm01, testvm02, testvm03
 - kanidm01
 ## Metrics to Monitor
 Check these in Grafana or via PromQL to verify the fix:
 ### Swap availability (should be ~2GB after upgrade)
 ```promql
 node_memory_SwapTotal_bytes / 1024 / 1024
 ```
 ### Swap usage during upgrades
 ```promql
 (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
 ```
 ### Zswap compressed bytes (zswap only; zram swap is covered by the swap queries above)
 ```promql
 node_memory_Zswap_bytes / 1024 / 1024
 ```
 ### Upgrade failures (should be 0)
 ```promql
 node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
 ```
 ### Memory available during upgrades
 ```promql
 node_memory_MemAvailable_bytes / 1024 / 1024
 ```
 ## Verification Steps
 After a few days (allow auto-upgrades to run on all hosts):
 1. Check all hosts have swap enabled:
   ```promql
   node_memory_SwapTotal_bytes > 0
   ```
 2. Check for any upgrade failures since the fix:
   ```promql
   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
   ```
 3. Review if any hosts used swap during upgrades (check historical graphs)
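
 The queries above can also be run from a shell through the Prometheus HTTP API. This is only a sketch: the endpoint (monitoring01 on port 9090) is an assumption, and `prom_query` and `bytes_to_mib` are hypothetical helper names.

 ```bash
 PROM_URL="${PROM_URL:-http://monitoring01.home.2rjus.net:9090}"   # assumed endpoint

 # Run an instant query against the Prometheus HTTP API and print the raw JSON.
 prom_query() {
   curl -fsS --get --data-urlencode "query=$1" "$PROM_URL/api/v1/query"
 }

 # Convert a byte count to whole MiB for easier reading.
 bytes_to_mib() {
   echo $(( $1 / 1024 / 1024 ))
 }

 # Manual usage:
 #   prom_query 'node_memory_SwapTotal_bytes > 0'
 #   prom_query 'count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])'
 #   bytes_to_mib 2147483648
 ```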
 ## Success Criteria
 - No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
 - All hosts show ~2GB swap available
 - Upgrades complete successfully on 2GB VMs
 ## Fallback Options
 If zram is insufficient:
 1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
 2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
 3. **Use remote builds** - Configure `nix.buildMachines` to offload evaluation
 4. **Reduce flake size** - Split configurations to reduce evaluation memory
 ### Memory Ballooning
 Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
 Configuration in `terraform/vms.tf`:
 ```hcl
 memory  = 4096  # maximum memory
 balloon = 2048  # minimum memory (shrinks to this when idle)
 ```
 Pros:
 - VMs get memory on-demand without reboots
 - Better host memory utilization
 - Solves upgrade OOM without permanently allocating 4GB
 Cons:
 - Requires the virtio balloon driver in the guest
 - Guest can experience memory pressure if host is overcommitted
 Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -0,0 +1,219 @@
 # Monitoring Stack Migration to VictoriaMetrics
 ## Overview
 Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
 and longer retention. Run in parallel with monitoring01 until validated, then switch over using
 a `monitoring` CNAME for seamless transition.
 ## Current State
 **monitoring01** (10.69.13.13):
 - 4 CPU cores, 4GB RAM, 33GB disk
 - Prometheus with 30-day retention (15s scrape interval)
 - Alertmanager (routes to alerttonotify webhook)
 - Grafana (dashboards, datasources)
 - Loki (log aggregation from all hosts via Promtail)
 - Tempo (distributed tracing)
 - Pyroscope (continuous profiling)
 **Hardcoded References to monitoring01:**
 - `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
 - `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
 - `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
 **Auto-generated:**
 - Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
 - Node-exporter targets (from all hosts with static IPs)
 ## Decision: VictoriaMetrics
 Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
 - Single binary replacement for Prometheus
 - 5-10x better compression (30 days could become 180+ days in same space)
 - Same PromQL query language (Grafana dashboards work unchanged)
 - Same scrape config format (existing auto-generated configs work)
 If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
 ## Architecture
 ```
                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Tempo        │
                     │  + Pyroscope    │
                     │  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘
 ```
 ## Implementation Plan
 ### Phase 1: Create monitoring02 Host
 Use `create-host` script which handles flake.nix and terraform/vms.tf automatically.
 1. **Run create-host**: `nix develop -c create-host --hostname monitoring02 --ip 10.69.13.24/24`
 2. **Update VM resources** in `terraform/vms.tf`:
   - 4 cores (same as monitoring01)
   - 8GB RAM (double, for VictoriaMetrics headroom)
   - 100GB disk (for 3+ months retention with compression)
 3. **Update host configuration**: Import monitoring services
 4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
 ### Phase 2: Set Up VictoriaMetrics Stack
 Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
 Prometheus config. Once validated, this can replace the Prometheus module.
 1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `services.victoriametrics.retentionPeriod = "3"` (3 months; a value without a suffix is counted in months, increase later based on disk usage)
   - Migrate scrape configs via `prometheusConfig`
   - Use native push support (replaces Pushgateway)
 2. **vmalert** for alerting rules:
   - `services.vmalert.enable = true`
   - Point to VictoriaMetrics for metrics evaluation
   - Keep rules in separate `rules.yml` file (same format as Prometheus)
   - No receiver configured during parallel operation (prevents duplicate alerts)
 3. **Alertmanager** (port 9093):
   - Keep existing configuration (alerttonotify webhook routing)
   - Only enable receiver after cutover from monitoring01
 4. **Loki** (port 3100):
   - Same configuration as current
 5. **Grafana** (port 3000):
   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
   - Reference existing dashboards on monitoring01 for content inspiration
   - Configure VictoriaMetrics datasource (port 8428)
   - Configure Loki datasource
 6. **Tempo** (ports 3200, 3201):
   - Same configuration
 7. **Pyroscope** (port 4040):
   - Same Docker-based deployment
 ### Phase 3: Parallel Operation
 Run both monitoring01 and monitoring02 simultaneously:
 1. **Dual scraping**: Both hosts scrape the same targets
   - Validates VictoriaMetrics is collecting data correctly
 2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
 3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
 4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
 5. **Compare resource usage**: Monitor disk/memory consumption between hosts
 ### Phase 4: Add monitoring CNAME
 Add CNAME to monitoring02 once validated:
 ```nix
 # hosts/monitoring02/configuration.nix
 homelab.dns.cnames = [ "monitoring" ];
 ```
 This creates `monitoring.home.2rjus.net` pointing to monitoring02.
 ### Phase 5: Update References
 Update hardcoded references to use the CNAME:
 1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
 2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
 Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
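
 To confirm that no other hardcoded references remain, a search along these lines could be run from the repository root. It is a minimal sketch: the function name `find_hardcoded_refs` is hypothetical, and it only looks at `.nix` files.

 ```bash
 # List every .nix line that still mentions monitoring01 (file:line:content).
 find_hardcoded_refs() {
   grep -rn --include='*.nix' 'monitoring01' "${1:-.}"
 }

 # Manual usage:
 #   find_hardcoded_refs .
 ```

 After this phase, `hosts/template2/bootstrap.nix` and the monitoring01 host configuration itself are the matches that would be expected to remain.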
 ### Phase 6: Enable Alerting
 Once ready to cut over:
 1. Enable Alertmanager receiver on monitoring02
 2. Verify test alerts route correctly
 ### Phase 7: Cutover and Decommission
 1. **Stop monitoring01**: Prevent duplicate alerts during transition
 2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
 3. **Verify all targets scraped**: Check VictoriaMetrics UI
 4. **Verify logs flowing**: Check Loki on monitoring02
 5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state
 ## Open Questions
 - [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
 - [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
 ## VictoriaMetrics Service Configuration
 Example NixOS configuration for monitoring02:
 ```nix
 # VictoriaMetrics replaces Prometheus
 services.victoriametrics = {
  enable = true;
  retentionPeriod = "3";  # 3 months (no suffix = months), increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
 };
 # vmalert for alerting rules (no receiver during parallel operation)
 services.vmalert = {
  enable = true;
  settings = {
    "datasource.url" = "http://localhost:8428";
    # "notifier.url" = [ "http://localhost:9093" ];  # Enable after cutover
    rule = [ ./rules.yml ];
  };
 };
 ```
 ## Rollback Plan
 If issues arise after cutover:
 1. Move `monitoring` CNAME back to monitoring01
 2. Restart monitoring01 services
 3. Revert Promtail config to point only to monitoring01
 4. Revert http-proxy backends
 ## Notes
 - VictoriaMetrics uses port 8428 vs Prometheus 9090
 - PromQL compatibility is excellent
 - VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
 - monitoring02 deployed via OpenTofu using `create-host` script
 - Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
--- a/docs/plans/nix-cache-reprovision.md
+++ b/docs/plans/nix-cache-reprovision.md
@@ -0,0 +1,212 @@
 # Nix Cache Host Reprovision
 ## Overview
 Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
 1. NATS-based remote build triggering (replacing the current bash script)
 2. Safer flake update workflow that validates builds before pushing to master
 ## Current State
 ### Host Configuration
 - `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
 - Runs Gitea Actions runner for CI workflows
 - Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
 - Uses a dedicated XFS volume at `/nix` for cache storage
 ### Current Build System (`services/nix-cache/build-flakes.sh`)
 - Runs every 30 minutes via systemd timer
 - Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
 - Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
 - Pushes success/failure metrics to pushgateway
 - Simple but has no filtering, no parallelism, no remote triggering
 ### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
 - Runs daily at midnight via cron
 - Runs `nix flake update --commit-lock-file`
 - Pushes directly to master
 - No build validation — can push broken inputs
 ## Improvement 1: NATS-Based Remote Build Triggering
 ### Design
 Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.
 | Approach | Pros | Cons |
 |----------|------|------|
 | Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
 | New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
 | Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |
 **Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.
 ### Implementation
 1. Add new message type to homelab-deploy: `build.<host>` subject
 2. Listener on nix-cache01 subscribes to `build.>` wildcard
 3. On message receipt, builds the specified host and returns success/failure
 4. CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`
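
 The subject scheme can be tried by hand with the `nats` CLI before any code is written. This is a sketch only: the empty payload and the reply behaviour are assumptions, since the `homelab-deploy` message format is not described here, and `host_from_subject` is a hypothetical helper.

 ```bash
 # Extract the host name from a subject of the form build.<host>.
 host_from_subject() {
   printf '%s\n' "${1#build.}"
 }

 # Manual usage with the nats CLI:
 #   nats sub 'build.>'            # listener side: receives build.ns1, build.ns2, ...
 #   nats request build.ns1 ''     # requester side: waits for a reply
 ```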
 ### Benefits
 - Trigger rebuild for specific host to ensure it's cached
 - Could be called from CI after merging PRs
 - Reuses existing NATS infrastructure and auth
 - Progress/status could stream back via NATS reply
 ## Improvement 2: Smarter Flake Update Workflow
 ### Current Problems
 1. Updates can push breaking changes to master
 2. No visibility into what broke when it does
 3. Hosts that auto-update can pull broken configs
 ### Proposed Workflow
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │                    Flake Update Workflow                         │
 ├─────────────────────────────────────────────────────────────────┤
 │  1. nix flake update (on feature branch)                        │
 │  2. Build ALL hosts locally                                      │
 │  3. If all pass → fast-forward merge to master                  │
 │  4. If any fail → create PR with failure logs attached          │
 └─────────────────────────────────────────────────────────────────┘
 ```
 ### Implementation Options
 | Option | Description | Pros | Cons |
 |--------|-------------|------|------|
 | **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
 | **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
 | **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |
 **Recommendation:** Option A with nix-cache01 as the runner. The host already runs a Gitea Actions runner and holds the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.
 ### Workflow Steps
 1. Workflow runs on schedule (daily or weekly)
 2. Creates branch `flake-update/YYYY-MM-DD`
 3. Runs `nix flake update --commit-lock-file`
 4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
 5. If all succeed:
   - Fast-forward merge to master
   - Delete feature branch
 6. If any fail:
   - Create PR from the update branch
   - Attach build logs as PR comment
   - Label PR with `needs-review` or `build-failure`
   - Do NOT merge automatically
 ### Workflow File Changes
 ```yaml
 # New: .github/workflows/flake-update-safe.yaml
 name: Safe flake update
 on:
  schedule:
    - cron: "0 2 * * 0"  # Weekly on Sunday at 2 AM
  workflow_dispatch:  # Manual trigger
 jobs:
  update-and-validate:
    runs-on: homelab  # Use self-hosted runner on nix-cache01
    steps:
      - uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 0  # Need full history for merge
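       - name: Create update branch
         run: |
           BRANCH="flake-update/$(date +%Y-%m-%d)"
           git checkout -b "$BRANCH"
           # Each step runs in its own shell; export BRANCH for later steps
           echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"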
      - name: Update flake
        run: nix flake update --commit-lock-file
      - name: Build all hosts
        id: build
         run: |
           # Without pipefail the `if` would test tee's exit status, not nix build's
           set -o pipefail
           FAILED=""
           for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
             echo "Building $host..."
             if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
               FAILED="$FAILED $host"
             fi
           done
           echo "failed=$FAILED" >> "$GITHUB_OUTPUT"
      - name: Merge to master (if all pass)
        if: steps.build.outputs.failed == ''
         run: |
           git checkout master
           git merge --ff-only "$BRANCH"
           git push origin master
           # The branch was never pushed on this path, so delete the local copy
           git branch -d "$BRANCH"
      - name: Create PR (if any fail)
        if: steps.build.outputs.failed != ''
        run: |
          git push origin "$BRANCH"
          # Create PR via Gitea API with build logs
          # ... (PR creation with log attachment)
 ```
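 The PR-creation step is left open in the workflow above. One possible shape, using Gitea's pull-request endpoint (`POST /api/v1/repos/{owner}/{repo}/pulls`), is sketched here. The environment variables `GITEA_URL`, `GITEA_REPO` and `GITEA_TOKEN`, and both function names, are assumptions for illustration; attaching the build logs is not covered.
 ```bash
 # Hypothetical helpers for the "Create PR" step; names and env vars are assumptions.
 # Build a PR title from the branch name and the list of failed hosts.
 pr_title() {
   local branch="$1"; shift
   printf 'flake update %s: %d host(s) failed to build' "${branch#flake-update/}" "$#"
 }
 # Open a PR from the given head branch into master via the Gitea API.
 # Expects GITEA_URL (base URL of the Gitea instance), GITEA_REPO (owner/name)
 # and GITEA_TOKEN to be set in the environment.
 create_pr() {
   local branch="$1"; shift
   local payload
   payload=$(jq -n --arg head "$branch" \
     --arg title "$(pr_title "$branch" "$@")" \
     --arg body "Failed hosts: $*" \
     '{base: "master", head: $head, title: $title, body: $body}')
   curl --fail --silent --show-error \
     -H "Authorization: token ${GITEA_TOKEN}" \
     -H "Content-Type: application/json" \
     -d "$payload" \
     "${GITEA_URL}/api/v1/repos/${GITEA_REPO}/pulls"
 }
 # Example (inside the failure step, after pushing the branch):
 # create_pr "$BRANCH" $FAILED
 ```
 Building the JSON with `jq -n --arg` rather than string concatenation keeps host names and branch names correctly escaped.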
 ## Migration Steps
 ### Phase 1: Reprovision Host via OpenTofu
 1. Add `nix-cache01` to `terraform/vms.tf`:
   ```hcl
   "nix-cache01" = {
     ip        = "10.69.13.15/24"
     cpu_cores = 4
     memory    = 8192
     disk_size = "100G"  # Larger for nix store
   }
   ```
 2. Shut down existing nix-cache01 VM
 3. Run `tofu apply` to provision new VM
 4. Verify bootstrap completes and cache is serving
 **Note:** The cache will be cold after reprovision. Run initial builds to populate.
 ### Phase 2: Add Build Triggering to homelab-deploy
 1. Add `build` command to homelab-deploy CLI
 2. Add listener handler in NixOS module for `build.*` subjects
 3. Update nix-cache01 config to enable build listener
 4. Test with `homelab-deploy build testvm01`
 ### Phase 3: Implement Safe Flake Update Workflow
 1. Create `.github/workflows/flake-update-safe.yaml`
 2. Disable or remove old `flake-update.yaml`
 3. Test manually with `workflow_dispatch`
 4. Monitor first automated run
 ### Phase 4: Remove Old Build Script
 1. After new workflow is stable, remove:
   - `services/nix-cache/build-flakes.nix`
   - `services/nix-cache/build-flakes.sh`
 2. The new workflow handles scheduled builds
 ## Open Questions
 - [ ] What runner labels should the self-hosted runner use for the update workflow?
 - [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
 - [ ] How long to keep flake-update PRs open before auto-closing stale ones?
 - [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
 - [ ] What to do about `gunter` (external nixos repo) - include in validation?
 - [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?
 ## Notes
 - The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
 - The Harmonia service and cache signing key will work the same after reprovision
 - Actions runner token is in Vault, will be provisioned automatically
 - Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering
--- a/docs/plans/pgdb1-decommission.md
+++ b/docs/plans/pgdb1-decommission.md
@@ -0,0 +1,113 @@
 # pgdb1 Decommissioning Plan
 ## Overview
 Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.
 ## Pre-flight Verification
 Before proceeding, verify that gunter is no longer using pgdb1:
 1. Check Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
 2. Optionally: Check pgdb1 for recent connection activity:
   ```bash
   ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;"'
   ```
 ## Files to Remove
 ### Host Configuration
 - `hosts/pgdb1/default.nix`
 - `hosts/pgdb1/configuration.nix`
 - `hosts/pgdb1/hardware-configuration.nix`
 - `hosts/pgdb1/` (directory)
 ### Service Module
 - `services/postgres/postgres.nix`
 - `services/postgres/default.nix`
 - `services/postgres/` (directory)
 Note: This service module is only used by pgdb1, so it can be removed entirely.
 ### Flake Entry
 Remove from `flake.nix` (lines 131-138):
 ```nix
 pgdb1 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self;
  };
  modules = commonModules ++ [
    ./hosts/pgdb1
  ];
 };
 ```
 ### Vault AppRole
 Remove from `terraform/vault/approle.tf` (lines 69-73):
 ```hcl
 "pgdb1" = {
  paths = [
    "secret/data/hosts/pgdb1/*",
  ]
 }
 ```
 ### Monitoring Rules
 Remove from `services/monitoring/rules.yml` the `postgres_down` alert (lines 359-365):
 ```yaml
 - name: postgres_rules
  rules:
    - alert: postgres_down
      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
      for: 5m
      labels:
        severity: critical
 ```
 ### Utility Scripts
 Delete `rebuild-all.sh` entirely (obsolete script).
 ## Execution Steps
 ### Phase 1: Verification
 - [ ] Confirm Open WebUI on gunter uses local PostgreSQL
 - [ ] Verify no active connections to pgdb1
 ### Phase 2: Code Cleanup
 - [ ] Create feature branch: `git checkout -b decommission-pgdb1`
 - [ ] Remove `hosts/pgdb1/` directory
 - [ ] Remove `services/postgres/` directory
 - [ ] Remove pgdb1 entry from `flake.nix`
 - [ ] Remove postgres alert from `services/monitoring/rules.yml`
 - [ ] Delete `rebuild-all.sh` (obsolete)
 - [ ] Run `nix flake check` to verify no broken references
 - [ ] Commit changes
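 `nix flake check` only evaluates Nix code, so mentions of the host in YAML rules or scripts can slip through. A plain-text search before committing is a cheap extra check; the sketch below assumes GNU or BSD `grep`, and the function name `find_host_refs` is hypothetical. It skips `.git` and any directory named `plans`, so this plan itself does not count as a hit. The AppRole entry in `terraform/vault/approle.tf` will still show up until Phase 3 is done.
 ```bash
 # List remaining mentions of a host in a checkout, skipping .git and any
 # directory named "plans". Prints matches and returns 0 if any are found,
 # 1 if the tree is clean (grep's own exit status).
 find_host_refs() {
   local host="$1" root="${2:-.}"
   grep -rn --exclude-dir=.git --exclude-dir=plans -- "$host" "$root"
 }
 # Example: warn if pgdb1 is still referenced anywhere.
 # if find_host_refs pgdb1 .; then echo "stale references remain" >&2; fi
 ```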
 ### Phase 3: Terraform Cleanup
 - [ ] Remove pgdb1 from `terraform/vault/approle.tf`
 - [ ] Run `tofu plan` in `terraform/vault/` to preview changes
 - [ ] Run `tofu apply` to remove the AppRole
 - [ ] Commit terraform changes
 ### Phase 4: Infrastructure Cleanup
 - [ ] Shut down pgdb1 VM in Proxmox
 - [ ] Delete the VM from Proxmox
 - [ ] (Optional) Remove any DNS entries if not auto-generated
 ### Phase 5: Finalize
 - [ ] Merge feature branch to master
 - [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove DNS entry
 - [ ] Move this plan to `docs/plans/completed/`
 ## Rollback
 If issues arise after decommissioning:
 1. The host configuration can be restored from git history and the VM recreated from the template
 2. Database data would need to be restored from backup (if any exists)
 ## Notes
 - pgdb1 IP: 10.69.13.16
 - The postgres service allowed connections from gunter (10.69.30.105)
 - No restic backup was configured for this host
--- a/docs/plans/security-hardening.md
+++ b/docs/plans/security-hardening.md
@@ -0,0 +1,224 @@
 # Security Hardening Plan
 ## Overview
 Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
 ## Current State
 - SSH allows password auth and unrestricted root login (`system/sshd.nix`)
 - Firewall disabled on all hosts (`networking.firewall.enable = false`)
 - Promtail ships logs over HTTP to Loki
 - Loki has no authentication (`auth_enabled = false`)
 - AppRole secret-IDs never expire (`secret_id_ttl = 0`)
 - Vault TLS verification disabled by default (`skipTlsVerify = true`)
 - Audit logging exists (`common/ssh-audit.nix`) but not applied globally
 - Alert rules focus on availability, no security event detection
 ## Priority Matrix
 | Issue | Severity | Effort | Priority |
 |-------|----------|--------|----------|
 | SSH password auth | High | Low | **P1** |
 | Firewall disabled | High | Medium | **P1** |
 | Promtail HTTP (no TLS) | High | Medium | **P2** |
 | No security alerting | Medium | Low | **P2** |
 | Audit logging not global | Low | Low | **P2** |
 | Loki no auth | Medium | Medium | **P3** |
 | Secret-ID TTL | Medium | Medium | **P3** |
 | Vault skipTlsVerify | Medium | Low | **P3** |
 ## Phase 1: Quick Wins (P1)
 ### 1.1 SSH Hardening
 Edit `system/sshd.nix`:
 ```nix
 services.openssh = {
  enable = true;
  settings = {
    PermitRootLogin = "prohibit-password";  # Key-only root login
    PasswordAuthentication = false;
    KbdInteractiveAuthentication = false;
  };
 };
 ```
 **Prerequisite:** Verify all hosts have SSH keys deployed for root.
 ### 1.2 Enable Firewall
 Create `system/firewall.nix` with default deny policy:
 ```nix
 { ... }: {
  networking.firewall.enable = true;
  # Use openssh's built-in firewall integration
  services.openssh.openFirewall = true;
 }
 ```
 **Useful firewall options:**
 | Option | Description |
 |--------|-------------|
 | `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
 | `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
 | `networking.firewall.extraInputRules` | Custom nftables rules for complex filtering (only applies when `networking.nftables.enable = true`) |
 **Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
 #### Per-Interface Rules (http-proxy WireGuard)
 The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
 ```nix
 # Example: http-proxy with different rules per interface
 networking.firewall = {
  enable = true;
  # Default: only SSH (via openFirewall)
  allowedTCPPorts = [ ];
  # LAN interface: allow HTTP/HTTPS
  interfaces.ens18 = {
    allowedTCPPorts = [ 80 443 ];
  };
  # WireGuard interface: restrict to specific services or trust fully
  interfaces.wg0 = {
    allowedTCPPorts = [ 80 443 ];
     # Or, if fully trusted, set networking.firewall.trustedInterfaces = [ "wg0" ] instead of this block
  };
 };
 ```
 **TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
 Then per-host, open required ports:
 | Host | Additional Ports |
 |------|------------------|
 | ns1/ns2 | 53 (TCP/UDP) |
 | vault01 | 8200 |
 | monitoring01 | 3100, 9090, 3000, 9093 |
 | http-proxy | 80, 443 |
 | nats1 | 4222 |
 | ha1 | 1883, 8123 |
 | jelly01 | 8096 |
 | nix-cache01 | 5000 |
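 After enabling the firewall on a host, the ports in this table can be probed from another machine to confirm they are still reachable. The following is a minimal sketch that relies on bash's `/dev/tcp` redirection and coreutils' `timeout`; the function names are hypothetical, and it only covers TCP, so DNS over UDP on ns1/ns2 needs a separate check (for example with `dig`).
 ```bash
 # Return 0 if a TCP connection to host:port succeeds within 2 seconds.
 port_open() {
   local host="$1" port="$2"
   timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" </dev/null 2>/dev/null
 }
 # Read "host port" pairs from stdin, print the unreachable ones, and
 # return non-zero if any of them could not be reached.
 check_ports() {
   local host port rc=0
   while read -r host port; do
     [ -n "$host" ] || continue
     if ! port_open "$host" "$port"; then
       echo "CLOSED ${host}:${port}"
       rc=1
     fi
   done
   return "$rc"
 }
 # Example, with ports taken from the table above (assumes the names resolve):
 # printf '%s\n' 'vault01 8200' 'nats1 4222' 'nix-cache01 5000' | check_ports
 ```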
 ## Phase 2: Logging & Detection (P2)
 ### 2.1 Enable TLS for Promtail → Loki
 Update `system/monitoring/logs.nix`:
 ```nix
 clients = [{
  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
  tls_config = {
    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
  };
 }];
 ```
 Requires:
 - Configure Loki with TLS certificate (use internal ACME)
 - Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)
 ### 2.2 Security Alert Rules
 Add to `services/monitoring/rules.yml`:
 ```yaml
 - name: security_rules
  rules:
    - alert: ssh_auth_failures
      expr: increase(node_logind_sessions_total[5m]) > 20
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Unusual login activity on {{ $labels.instance }}"
    - alert: vault_secret_fetch_failure
      expr: increase(vault_secret_failures[5m]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Vault secret fetch failures on {{ $labels.instance }}"
 ```
 Also add Loki-based alerts for:
 - Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
 - sudo usage: `{job="systemd-journal"} |= "sudo"`
 ### 2.3 Global Audit Logging
 Import `common/ssh-audit.nix` from `system/default.nix`:
 ```nix
 imports = [
  # ... existing imports
  ../common/ssh-audit.nix
 ];
 ```
 ## Phase 3: Defense in Depth (P3)
 ### 3.1 Loki Authentication
 Options:
 1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
 2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
 3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy
 Recommendation: Option 1 (reverse proxy) is simplest for homelab.
 ### 3.2 AppRole Secret Rotation
 Update `terraform/vault/approle.tf`:
 ```hcl
 secret_id_ttl  = 2592000  # 30 days
 ```
 Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.
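 The TTL value is plain seconds, so other rotation periods are easy to derive. A small sketch for the conversion (the helper name `days_to_seconds` is hypothetical):
 ```bash
 # Convert a TTL in days to the number of seconds used for secret_id_ttl.
 days_to_seconds() {
   echo $(( $1 * 24 * 60 * 60 ))
 }
 days_to_seconds 30   # prints 2592000
 ```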
 ### 3.3 Enable Vault TLS Verification
 Change default in `system/vault-secrets.nix`:
 ```nix
 skipTlsVerify = mkOption {
  type = types.bool;
  default = false;  # Changed from true
 };
 ```
 **Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.
 ## Implementation Order
 1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
 2. **Validate SSH access** - Ensure key-based login works before disabling passwords
 3. **Document firewall ports** - Create reference of ports per host before enabling
 4. **Phase prod rollout** - Deploy to prod hosts one at a time, verify each
 ## Open Questions
 - [ ] Do all hosts have SSH keys configured for root access?
 - [ ] Should firewall rules be per-host or use a central definition with roles?
 - [ ] Should Loki authentication use the existing Kanidm setup?
 **Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.
 ## Notes
 - Firewall changes are the highest risk - test thoroughly on test-tier
 - SSH hardening must not lock out access - verify keys first
 - Consider creating a "break glass" procedure for emergency access if keys fail
--- a/docs/user-management.md
+++ b/docs/user-management.md
@@ -0,0 +1,267 @@
 # User Management with Kanidm
 Central authentication for the homelab using Kanidm.
 ## Overview
 - **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
 - **WebUI**: https://auth.home.2rjus.net
 - **LDAPS**: port 636
 ## CLI Setup
 The `kanidm` CLI is available in the devshell:
 ```bash
 nix develop
 # Login as idm_admin
 kanidm login --name idm_admin --url https://auth.home.2rjus.net
 ```
 ## User Management
 POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
 all attributes (including UNIX password) in one workflow.
 ### Creating a POSIX User
 ```bash
 # Create the person
 kanidm person create <username> "<Display Name>"
 # Add to groups
 kanidm group add-members ssh-users <username>
 # Enable POSIX (UID is auto-assigned)
 kanidm person posix set <username>
 # Set UNIX password (required for SSH login, min 10 characters)
 kanidm person posix set-password <username>
 # Optionally set login shell
 kanidm person posix set <username> --shell /bin/zsh
 ```
 ### Example: Full User Creation
 ```bash
 kanidm person create testuser "Test User"
 kanidm group add-members ssh-users testuser
 kanidm person posix set testuser
 kanidm person posix set-password testuser
 kanidm person get testuser
 ```
 After creation, verify on a client host:
 ```bash
 getent passwd testuser
 ssh testuser@testvm01.home.2rjus.net
 ```
 ### Viewing User Details
 ```bash
 kanidm person get <username>
 ```
 ### Removing a User
 ```bash
 kanidm person delete <username>
 ```
 ## Group Management
 Groups for POSIX access are also managed via CLI.
 ### Creating a POSIX Group
 ```bash
 # Create the group
 kanidm group create <group-name>
 # Enable POSIX with a specific GID
 kanidm group posix set <group-name> --gidnumber <gid>
 ```
 ### Adding Members
 ```bash
 kanidm group add-members <group-name> <username>
 ```
 ### Viewing Group Details
 ```bash
 kanidm group get <group-name>
 kanidm group list-members <group-name>
 ```
 ### Example: Full Group Creation
 ```bash
 kanidm group create testgroup
 kanidm group posix set testgroup --gidnumber 68010
 kanidm group add-members testgroup testuser
 kanidm group get testgroup
 ```
 After creation, verify on a client host:
 ```bash
 getent group testgroup
 ```
 ### Current Groups
 | Group | GID | Purpose |
 |-------|-----|---------|
 | ssh-users | 68000 | SSH login access |
 | admins | 68001 | Administrative access |
 | users | 68002 | General users |
 ### UID/GID Allocation
 Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:
 | Range | Purpose |
 |-------|---------|
 | 65,536+ | Users (auto-assigned) |
 | 68,000 - 68,999 | Groups (manually assigned) |
 ## PAM/NSS Client Configuration
 Enable central authentication on a host:
 ```nix
 homelab.kanidm.enable = true;
 ```
 This configures:
 - `services.kanidm.enablePam = true`
 - Client connection to auth.home.2rjus.net
 - Login authorization for `ssh-users` group
 - Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
 - Home directory symlinks (`/home/torjus` → UUID-based directory)
 ### Enabled Hosts
 - testvm01, testvm02, testvm03 (test tier)
 ### Options
 ```nix
 homelab.kanidm = {
  enable = true;
  server = "https://auth.home.2rjus.net";  # default
  allowedLoginGroups = [ "ssh-users" ];     # default
 };
 ```
 ### Home Directories
 Home directories use UUID-based paths for stability (so renaming a user doesn't
 require moving their home directory). Symlinks provide convenient access:
 ```
 /home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
 ```
 The symlinks are created by `kanidm-unixd-tasks` on first login.
 ## Testing
 ### Verify NSS Resolution
 ```bash
 # Check user resolution
 getent passwd <username>
 # Check group resolution
 getent group <group-name>
 ```
 ### Test SSH Login
 ```bash
 ssh <username>@<hostname>.home.2rjus.net
 ```
 ## Troubleshooting
 ### "PAM user mismatch" error
 SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
 usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).
 **Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).
 Check current format:
 ```bash
 getent passwd torjus
 # Should show: torjus:x:65536:...
 # NOT: torjus@home.2rjus.net:x:65536:...
 ```
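 The same check can be scripted, for example as part of a post-deploy smoke test. This is a sketch only; the function name `passwd_name_is_short` is hypothetical and it just inspects the first field of a passwd entry.
 ```bash
 # Return 0 if the login name in a passwd entry is a short name (no "@domain"),
 # 1 if it is empty or in SPN format.
 passwd_name_is_short() {
   local entry="$1"
   local name="${entry%%:*}"
   case "$name" in
     ""|*@*) return 1 ;;
     *) return 0 ;;
   esac
 }
 # Example on a client host:
 # passwd_name_is_short "$(getent passwd torjus)" || echo "SPN format still in use"
 ```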
 ### User resolves but SSH fails immediately
 The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:
 ```bash
 # Check if group has POSIX
 getent group ssh-users
 # If empty, enable POSIX on the server
 kanidm group posix set ssh-users --gidnumber 68000
 ```
 ### User doesn't resolve via getent
 1. Check kanidm-unixd service is running:
   ```bash
   systemctl status kanidm-unixd
   ```
 2. Check unixd can reach server:
   ```bash
   kanidm-unix status
   # Should show: system: online, Kanidm: online
   ```
 3. Check client can reach server:
   ```bash
   curl -s https://auth.home.2rjus.net/status
   ```
 4. Check user has POSIX enabled on server:
   ```bash
   kanidm person get <username>
   ```
 5. Restart nscd to clear stale cache:
   ```bash
   systemctl restart nscd
   ```
 6. Invalidate kanidm cache:
   ```bash
   kanidm-unix cache-invalidate
   ```
 ### Changes not taking effect after deployment
 NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying
 kanidm-unixd config changes, you may need to restart both services:
 ```bash
 systemctl restart kanidm-unixd
 systemctl restart nscd
 ```
 ### Test PAM authentication directly
 Use the kanidm-unix CLI to test PAM auth without SSH:
 ```bash
 kanidm-unix auth-test --name <username>
 ```
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770447502,
+        "lastModified": 1770481834,
-        "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
+        "narHash": "sha256-Xx9BYnI0C/qgPbwr9nj6NoAdQTbYLunrdbNSaUww9oY=",
        "ref": "master",
-        "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
+        "rev": "fd0d63b103dfaf21d1c27363266590e723021c67",
-        "revCount": 22,
+        "revCount": 24,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
@@ -42,27 +42,6 @@
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      }
    },
    "labmon": {
      "inputs": {
        "nixpkgs": [
          "nixpkgs-unstable"
        ]
      },
      "locked": {
        "lastModified": 1748983975,
        "narHash": "sha256-DA5mOqxwLMj/XLb4hvBU1WtE6cuVej7PjUr8N0EZsCE=",
        "ref": "master",
        "rev": "040a73e891a70ff06ec7ab31d7167914129dbf7d",
        "revCount": 17,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/labmon"
      },
      "original": {
        "ref": "master",
        "type": "git",
        "url": "https://git.t-juice.club/torjus/labmon"
      }
    },
    "nixos-exporter": {
      "inputs": {
        "nixpkgs": [
@@ -119,31 +98,9 @@
      "inputs": {
        "alerttonotify": "alerttonotify",
        "homelab-deploy": "homelab-deploy",
        "labmon": "labmon",
        "nixos-exporter": "nixos-exporter",
        "nixpkgs": "nixpkgs",
-        "nixpkgs-unstable": "nixpkgs-unstable",
+        "nixpkgs-unstable": "nixpkgs-unstable"
        "sops-nix": "sops-nix"
      }
    },
    "sops-nix": {
      "inputs": {
        "nixpkgs": [
          "nixpkgs-unstable"
        ]
      },
      "locked": {
        "lastModified": 1770145881,
        "narHash": "sha256-ktjWTq+D5MTXQcL9N6cDZXUf9kX8JBLLBLT0ZyOTSYY=",
        "owner": "Mic92",
        "repo": "sops-nix",
        "rev": "17eea6f3816ba6568b8c81db8a4e6ca438b30b7c",
        "type": "github"
      },
      "original": {
        "owner": "Mic92",
        "repo": "sops-nix",
        "type": "github"
      }
    }
  },
--- a/flake.nix
+++ b/flake.nix
@@ -5,18 +5,10 @@
    nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-25.11";
    nixpkgs-unstable.url = "github:nixos/nixpkgs?ref=nixos-unstable";
    sops-nix = {
      url = "github:Mic92/sops-nix";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
    };
    alerttonotify = {
      url = "git+https://git.t-juice.club/torjus/alerttonotify?ref=master";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
    };
    labmon = {
      url = "git+https://git.t-juice.club/torjus/labmon?ref=master";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
    };
    nixos-exporter = {
      url = "git+https://git.t-juice.club/torjus/nixos-exporter";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
@@ -32,9 +24,7 @@
      self,
      nixpkgs,
      nixpkgs-unstable,
      sops-nix,
      alerttonotify,
      labmon,
      nixos-exporter,
      homelab-deploy,
      ...
@@ -50,7 +40,6 @@
      commonOverlays = [
        overlay-unstable
        alerttonotify.overlays.default
        labmon.overlays.default
      ];
      # Common modules applied to all hosts
      commonModules = [
@@ -61,7 +50,6 @@
            system.configurationRevision = self.rev or self.dirtyRev or "dirty";
          }
        )
        sops-nix.nixosModules.sops
        nixos-exporter.nixosModules.default
        homelab-deploy.nixosModules.default
        ./modules/homelab
@@ -77,46 +65,19 @@
    in
    {
      nixosConfigurations = {
        ns1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/ns1
          ];
        };
        ns2 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/ns2
          ];
        };
        ha1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ha1
          ];
        };
        template1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/template
          ];
        };
        template2 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/template2
@@ -125,35 +86,25 @@
        http-proxy = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/http-proxy
          ];
        };
        ca = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/ca
          ];
        };
        monitoring01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/monitoring01
            labmon.nixosModules.labmon
          ];
        };
        jelly01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/jelly01
@@ -162,55 +113,82 @@
        nix-cache01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/nix-cache01
          ];
        };
        pgdb1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/pgdb1
          ];
        };
        nats1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/nats1
          ];
        };
        testvm01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/testvm01
          ];
        };
        vault01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/vault01
          ];
        };
-        vaulttest01 = nixpkgs.lib.nixosSystem {
+        testvm01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
-            ./hosts/vaulttest01
+            ./hosts/testvm01
          ];
        };
        testvm02 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/testvm02
          ];
        };
        testvm03 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/testvm03
          ];
        };
        ns2 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ns2
          ];
        };
        ns1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ns1
          ];
        };
        kanidm01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/kanidm01
          ];
        };
      };
@@ -229,6 +207,7 @@
              pkgs.ansible
              pkgs.opentofu
              pkgs.openbao
              pkgs.kanidm_1_8
              (pkgs.callPackage ./scripts/create-host { })
              homelab-deploy.packages.${pkgs.system}.default
            ];
--- a/hosts/ha1/configuration.nix
+++ b/hosts/ha1/configuration.nix
@@ -7,7 +7,7 @@
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix
    ../../system
    ../../common/vm
--- a/hosts/template/hardware-configuration.nix
+++ b/hosts/template/hardware-configuration.nix
--- a/hosts/http-proxy/configuration.nix
+++ b/hosts/http-proxy/configuration.nix
@@ -5,7 +5,7 @@
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix
    ../../system
    ../../common/vm
--- a/hosts/http-proxy/hardware-configuration.nix
+++ b/hosts/http-proxy/hardware-configuration.nix
@@ -0,0 +1,42 @@
 {
  config,
  lib,
  pkgs,
  modulesPath,
  ...
 }:
 {
  imports = [
    (modulesPath + "/profiles/qemu-guest.nix")
  ];
  boot.initrd.availableKernelModules = [
    "ata_piix"
    "uhci_hcd"
    "virtio_pci"
    "virtio_scsi"
    "sd_mod"
    "sr_mod"
  ];
  boot.initrd.kernelModules = [ "dm-snapshot" ];
  boot.kernelModules = [
    "ptp_kvm"
  ];
  boot.extraModulePackages = [ ];
  fileSystems."/" = {
    device = "/dev/disk/by-label/root";
    fsType = "xfs";
  };
  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
 }
--- a/hosts/jelly01/configuration.nix
+++ b/hosts/jelly01/configuration.nix
@@ -5,7 +5,7 @@
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix
    ../../system
    ../../common/vm
@@ -61,9 +61,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
-  zramSwap = {
-    enable = true;
-  };
+  vault.enable = true;
+  homelab.deploy.enable = true;
  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/jelly01/hardware-configuration.nix
+++ b/hosts/jelly01/hardware-configuration.nix
@@ -0,0 +1,42 @@
 {
  config,
  lib,
  pkgs,
  modulesPath,
  ...
 }:
 {
  imports = [
    (modulesPath + "/profiles/qemu-guest.nix")
  ];
  boot.initrd.availableKernelModules = [
    "ata_piix"
    "uhci_hcd"
    "virtio_pci"
    "virtio_scsi"
    "sd_mod"
    "sr_mod"
  ];
  boot.initrd.kernelModules = [ "dm-snapshot" ];
  boot.kernelModules = [
    "ptp_kvm"
  ];
  boot.extraModulePackages = [ ];
  fileSystems."/" = {
    device = "/dev/disk/by-label/root";
    fsType = "xfs";
  };
  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
 }
--- a/hosts/jump/configuration.nix
+++ b/hosts/jump/configuration.nix
@@ -1,56 +0,0 @@
 { config, lib, pkgs, ... }:
 {
  imports =
    [
      ../template/hardware-configuration.nix
      ../../system
    ];
  nixpkgs.config.allowUnfree = true;
  homelab.host.role = "bastion";
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/sda";
  networking.hostName = "jump";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
  ];
  systemd.network.enable = true;
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.10/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";
  nix.settings.experimental-features = [ "nix-command" "flakes" ];
  environment.systemPackages = with pkgs; [
    vim
    wget
    git
  ];
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/jump/hardware-configuration.nix
+++ b/hosts/jump/hardware-configuration.nix
@@ -1,36 +0,0 @@
 { config, lib, pkgs, modulesPath, ... }:
 {
  imports =
    [
      (modulesPath + "/profiles/qemu-guest.nix")
    ];
  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
  boot.initrd.kernelModules = [ ];
  # boot.kernelModules = [ ];
  # boot.extraModulePackages = [ ];
  fileSystems."/" =
    {
      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
      fsType = "xfs";
    };
  fileSystems."/boot" =
    {
      device = "/dev/disk/by-uuid/BC07-3B7A";
      fsType = "vfat";
    };
  swapDevices =
    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
 }
--- a/hosts/kanidm01/configuration.nix
+++ b/hosts/kanidm01/configuration.nix
@@ -1,25 +1,39 @@
 {
  config,
  lib,
  pkgs,
  ...
 }:
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix
    ../../system
    ../../common/vm
    ../../services/kanidm
  ];
-  nixpkgs.config.allowUnfree = true;
+  # Host metadata
-  # Use the systemd-boot EFI boot loader.
+  homelab.host = {
-  boot.loader.grub = {
+    tier = "test";
-    enable = true;
+    role = "auth";
    device = "/dev/sda";
    configurationLimit = 3;
  };
-  networking.hostName = "pgdb1";
+  # DNS CNAME for auth.home.2rjus.net
  homelab.dns.cnames = [ "auth" ];
  # Enable Vault integration
  vault.enable = true;
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
  networking.hostName = "kanidm01";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,7 +47,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.16/24"
+      "10.69.13.23/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -59,5 +73,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
-}
+}
--- a/hosts/vaulttest01/default.nix
+++ b/hosts/vaulttest01/default.nix
--- a/hosts/monitoring01/configuration.nix
+++ b/hosts/monitoring01/configuration.nix
@@ -5,7 +5,7 @@
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix
    ../../system
    ../../common/vm
@@ -100,61 +100,6 @@
    ];
  };
-  labmon = {
-    enable = true;
-    settings = {
-      ListenAddr = ":9969";
-      Profiling = true;
-      StepMonitors = [
-        {
-          Enabled = true;
-          BaseURL = "https://ca.home.2rjus.net";
-          RootID = "3381bda8015a86b9a3cd1851439d1091890a79005e0f1f7c4301fe4bccc29d80";
-        }
-      ];
-      TLSConnectionMonitors = [
-        {
-          Enabled = true;
-          Address = "ca.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "jelly.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "grafana.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "prometheus.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "alertmanager.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "pyroscope.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-      ];
-    };
-  };
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
--- a/hosts/monitoring01/hardware-configuration.nix
+++ b/hosts/monitoring01/hardware-configuration.nix
@@ -0,0 +1,42 @@
 {
  config,
  lib,
  pkgs,
  modulesPath,
  ...
 }:
 {
  imports = [
    (modulesPath + "/profiles/qemu-guest.nix")
  ];
  boot.initrd.availableKernelModules = [
    "ata_piix"
    "uhci_hcd"
    "virtio_pci"
    "virtio_scsi"
    "sd_mod"
    "sr_mod"
  ];
  boot.initrd.kernelModules = [ "dm-snapshot" ];
  boot.kernelModules = [
    "ptp_kvm"
  ];
  boot.extraModulePackages = [ ];
  fileSystems."/" = {
    device = "/dev/disk/by-label/root";
    fsType = "xfs";
  };
  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
 }
--- a/hosts/nats1/configuration.nix
+++ b/hosts/nats1/configuration.nix
@@ -5,7 +5,7 @@
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix
    ../../system
    ../../common/vm
@@ -59,5 +59,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
+  vault.enable = true;
+  homelab.deploy.enable = true;
  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/nats1/hardware-configuration.nix
+++ b/hosts/nats1/hardware-configuration.nix
@@ -0,0 +1,42 @@
 {
  config,
  lib,
  pkgs,
  modulesPath,
  ...
 }:
 {
  imports = [
    (modulesPath + "/profiles/qemu-guest.nix")
  ];
  boot.initrd.availableKernelModules = [
    "ata_piix"
    "uhci_hcd"
    "virtio_pci"
    "virtio_scsi"
    "sd_mod"
    "sr_mod"
  ];
  boot.initrd.kernelModules = [ "dm-snapshot" ];
  boot.kernelModules = [
    "ptp_kvm"
  ];
  boot.extraModulePackages = [ ];
  fileSystems."/" = {
    device = "/dev/disk/by-label/root";
    fsType = "xfs";
  };
  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
 }
--- a/hosts/nix-cache01/configuration.nix
+++ b/hosts/nix-cache01/configuration.nix
@@ -5,7 +5,7 @@
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix
    ../../system
    ../../common/vm
--- a/hosts/nix-cache01/default.nix
+++ b/hosts/nix-cache01/default.nix
@@ -4,6 +4,5 @@
    ./configuration.nix
    ../../services/nix-cache
    ../../services/actions-runner
-    ./zram.nix
  ];
 }
--- a/hosts/nix-cache01/hardware-configuration.nix
+++ b/hosts/nix-cache01/hardware-configuration.nix
@@ -0,0 +1,42 @@
 {
  config,
  lib,
  pkgs,
  modulesPath,
  ...
 }:
 {
  imports = [
    (modulesPath + "/profiles/qemu-guest.nix")
  ];
  boot.initrd.availableKernelModules = [
    "ata_piix"
    "uhci_hcd"
    "virtio_pci"
    "virtio_scsi"
    "sd_mod"
    "sr_mod"
  ];
  boot.initrd.kernelModules = [ "dm-snapshot" ];
  boot.kernelModules = [
    "ptp_kvm"
  ];
  boot.extraModulePackages = [ ];
  fileSystems."/" = {
    device = "/dev/disk/by-label/root";
    fsType = "xfs";
  };
  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
 }
--- a/hosts/nix-cache01/zram.nix
+++ b/hosts/nix-cache01/zram.nix
@@ -1,6 +0,0 @@
 { ... }:
 {
  zramSwap = {
    enable = true;
  };
 }
--- a/hosts/ns1/configuration.nix
+++ b/hosts/ns1/configuration.nix
@@ -7,23 +7,38 @@
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix
    ../../system
    ../../common/vm
    # DNS services
    ../../services/ns/master-authorative.nix
    ../../services/ns/resolver.nix
    ../../common/vm
  ];
  # Host metadata
  homelab.host = {
    tier = "prod";
    role = "dns";
    labels.dns_role = "primary";
  };
  # Enable Vault integration
  vault.enable = true;
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
+  boot.loader.grub.device = "/dev/vda";
  networking.hostName = "ns1";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -47,14 +62,6 @@
    "nix-command"
    "flakes"
  ];
-  vault.enable = true;
-  homelab.deploy.enable = true;
-  homelab.host = {
-    role = "dns";
-    labels.dns_role = "primary";
-  };
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
@@ -68,5 +75,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
-}
+}
--- a/hosts/ns1/default.nix
+++ b/hosts/ns1/default.nix
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
-}
+}
--- a/hosts/ns1/hardware-configuration.nix
+++ b/hosts/ns1/hardware-configuration.nix
@@ -1,36 +0,0 @@
 { config, lib, pkgs, modulesPath, ... }:
 {
  imports =
    [
      (modulesPath + "/profiles/qemu-guest.nix")
    ];
  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
  boot.initrd.kernelModules = [ ];
  # boot.kernelModules = [ ];
  # boot.extraModulePackages = [ ];
  fileSystems."/" =
    {
      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
      fsType = "xfs";
    };
  fileSystems."/boot" =
    {
      device = "/dev/disk/by-uuid/BC07-3B7A";
      fsType = "vfat";
    };
  swapDevices =
    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
 }
--- a/hosts/ns2/configuration.nix
+++ b/hosts/ns2/configuration.nix
@@ -7,23 +7,38 @@
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix
    ../../system
    ../../common/vm
    # DNS services
    ../../services/ns/secondary-authorative.nix
    ../../services/ns/resolver.nix
    ../../common/vm
  ];
  # Host metadata
  homelab.host = {
    tier = "prod";
    role = "dns";
    labels.dns_role = "secondary";
  };
  # Enable Vault integration
  vault.enable = true;
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
+  boot.loader.grub.device = "/dev/vda";
  networking.hostName = "ns2";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -47,14 +62,7 @@
    "nix-command"
    "flakes"
  ];
-  vault.enable = true;
-  homelab.deploy.enable = true;
-  homelab.host = {
-    role = "dns";
-    labels.dns_role = "secondary";
-  };
+  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
    wget
@@ -67,5 +75,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
-}
+}
--- a/hosts/ns2/default.nix
+++ b/hosts/ns2/default.nix
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
-}
+}
--- a/hosts/ns2/hardware-configuration.nix
+++ b/hosts/ns2/hardware-configuration.nix
@@ -1,36 +0,0 @@
 { config, lib, pkgs, modulesPath, ... }:
 {
  imports =
    [
      (modulesPath + "/profiles/qemu-guest.nix")
    ];
  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
  boot.initrd.kernelModules = [ ];
  # boot.kernelModules = [ ];
  # boot.extraModulePackages = [ ];
  fileSystems."/" =
    {
      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
      fsType = "xfs";
    };
  fileSystems."/boot" =
    {
      device = "/dev/disk/by-uuid/BC07-3B7A";
      fsType = "vfat";
    };
  swapDevices =
    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
 }
--- a/hosts/pgdb1/default.nix
+++ b/hosts/pgdb1/default.nix
@@ -1,7 +0,0 @@
 { ... }:
 {
  imports = [
    ./configuration.nix
    ../../services/postgres
  ];
 }
--- a/hosts/template/default.nix
+++ b/hosts/template/default.nix
@@ -1,7 +0,0 @@
 { ... }: {
  imports = [
    ./hardware-configuration.nix
    ./configuration.nix
    ./scripts.nix
  ];
 }
--- a/hosts/template/scripts.nix
+++ b/hosts/template/scripts.nix
@@ -1,36 +0,0 @@
 { pkgs, ... }:
 let
  prepare-host-script = pkgs.writeShellApplication {
    name = "prepare-host.sh";
    runtimeInputs = [ pkgs.age ];
    text = ''
      echo "Removing machine-id"
      rm -f /etc/machine-id || true
      echo "Removing SSH host keys"
      rm -f /etc/ssh/ssh_host_* || true
      echo "Restarting SSH"
      systemctl restart sshd
      echo "Removing temporary files"
      rm -rf /tmp/* || true
      echo "Removing logs"
      journalctl --rotate || true
      journalctl --vacuum-time=1s || true
      echo "Removing cache"
      rm -rf /var/cache/* || true
      echo "Generate age key"
      rm -rf /var/lib/sops-nix || true
      mkdir -p /var/lib/sops-nix
      age-keygen -o /var/lib/sops-nix/key.txt
    '';
  };
 in
 {
  environment.systemPackages = [ prepare-host-script ];
  users.motd = "Prepare host by running 'prepare-host.sh'.";
 }
--- a/hosts/template2/bootstrap.nix
+++ b/hosts/template2/bootstrap.nix
@@ -6,22 +6,72 @@ let
    text = ''
      set -euo pipefail
+      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+      # Send a log entry to Loki with bootstrap status
+      # Usage: log_to_loki <stage> <message>
+      # Fails silently if Loki is unreachable
+      log_to_loki() {
+        local stage="$1"
+        local message="$2"
+        local timestamp_ns
+        timestamp_ns="$(date +%s)000000000"
+        local payload
+        payload=$(jq -n \
+          --arg host "$HOSTNAME" \
+          --arg stage "$stage" \
+          --arg branch "''${BRANCH:-master}" \
+          --arg ts "$timestamp_ns" \
+          --arg msg "$message" \
+          '{
+            streams: [{
+              stream: {
+                job: "bootstrap",
+                host: $host,
+                stage: $stage,
+                branch: $branch
+              },
+              values: [[$ts, $msg]]
+            }]
+          }')
+        curl -s --connect-timeout 2 --max-time 5 \
+          -X POST \
+          -H "Content-Type: application/json" \
+          -d "$payload" \
+          "$LOKI_URL" >/dev/null 2>&1 || true
+      }
      echo "================================================================================"
      echo "                     NIXOS BOOTSTRAP IN PROGRESS"
      echo "================================================================================"
      echo ""
      # Read hostname set by cloud-init (from Terraform VM name via user-data)
      # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
      HOSTNAME=$(hostnamectl hostname)
-      echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"
+      # Read git branch from environment, default to master
+      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
      echo "Hostname: $HOSTNAME"
      echo ""
      echo "Starting NixOS bootstrap for host: $HOSTNAME"
+      log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"
      echo "Waiting for network connectivity..."
      # Verify we can reach the git server via HTTPS (doesn't respond to ping)
      if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
        echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
        echo "Check network configuration and DNS settings"
+        log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
        exit 1
      fi
      echo "Network connectivity confirmed"
+      log_to_loki "network_ok" "Network connectivity confirmed"
      # Unwrap Vault token and store AppRole credentials (if provided)
      if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
          chmod 600 /var/lib/vault/approle/secret-id
          echo "Vault credentials unwrapped and stored successfully"
+          log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
        else
          echo "WARNING: Failed to unwrap Vault token"
          if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -63,17 +114,17 @@ let
          echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
          echo ""
          echo "Vault secrets will not be available, but continuing bootstrap..."
+          log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
        fi
      else
        echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
        echo "Skipping Vault credential setup"
+        log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
      fi
      echo "Fetching and building NixOS configuration from flake..."
-      # Read git branch from environment, default to master
-      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
      echo "Using git branch: $BRANCH"
+      log_to_loki "building" "Starting nixos-rebuild boot"
      # Build and activate the host-specific configuration
      FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
@@ -81,18 +132,30 @@ let
      if nixos-rebuild boot --flake "$FLAKE_URL"; then
        echo "Successfully built configuration for $HOSTNAME"
        echo "Rebooting into new configuration..."
+        log_to_loki "success" "Build successful - rebooting into new configuration"
        sleep 2
        systemctl reboot
      else
        echo "ERROR: nixos-rebuild failed for $HOSTNAME"
        echo "Check that flake has configuration for this hostname"
        echo "Manual intervention required - system will not reboot"
+        log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
        exit 1
      fi
    '';
  };
 in
 {
+  # Custom greeting line to indicate this is a bootstrap image
+  services.getty.greetingLine = lib.mkForce ''
+    ================================================================================
+                          BOOTSTRAP IMAGE - NixOS \V (\l)
+    ================================================================================
+    Bootstrap service is running. Logs are displayed on tty1.
+    Check status: journalctl -fu nixos-bootstrap
+  '';
  systemd.services."nixos-bootstrap" = {
    description = "Bootstrap NixOS configuration from flake on first boot";
@@ -107,12 +170,12 @@ in
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
-      ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
+      ExecStart = lib.getExe bootstrap-script;
      # Read environment variables from cloud-init (set by cloud-init write_files)
      EnvironmentFile = "-/run/cloud-init-env";
-      # Logging to journald
+      # Log to journal and console
      StandardOutput = "journal+console";
      StandardError = "journal+console";
    };
--- a/hosts/template2/configuration.nix
+++ b/hosts/template2/configuration.nix
@@ -58,6 +58,14 @@
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
+  nix.settings.substituters = [
+    "https://nix-cache.home.2rjus.net"
+    "https://cache.nixos.org"
+  ];
+  nix.settings.trusted-public-keys = [
+    "nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
+    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+  ];
  environment.systemPackages = with pkgs; [
    age
    vim
@@ -71,5 +79,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
+  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
+  zramSwap.enable = true;
  system.stateVersion = "25.11";
 }
--- a/hosts/template2/scripts.nix
+++ b/hosts/template2/scripts.nix
@@ -2,7 +2,6 @@
 let
  prepare-host-script = pkgs.writeShellApplication {
    name = "prepare-host.sh";
-    runtimeInputs = [ pkgs.age ];
    text = ''
      echo "Removing machine-id"
      rm -f /etc/machine-id || true
@@ -22,11 +21,6 @@ let
      echo "Removing cache"
      rm -rf /var/cache/* || true
-      echo "Generate age key"
-      rm -rf /var/lib/sops-nix || true
-      mkdir -p /var/lib/sops-nix
-      age-keygen -o /var/lib/sops-nix/key.txt
    '';
  };
 in
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -11,16 +11,23 @@
    ../../system
    ../../common/vm
    ../../common/ssh-audit.nix
  ];
-  # Test VM - exclude from DNS zone generation
+  # Host metadata (adjust as needed)
  homelab.dns.enable = false;
  homelab.host = {
-    tier = "test";
+    tier = "test";  # Start in test tier, move to prod after validation
    priority = "low";
  };
  # Enable Vault integration
  vault.enable = true;
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;
  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
@@ -29,7 +36,7 @@
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
-  services.resolved.enable = false;
+  services.resolved.enable = true;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
@@ -39,7 +46,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.101/24"
+      "10.69.13.20/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -59,6 +66,39 @@
    git
  ];
+  # Test nginx with ACME certificate from OpenBao PKI
+  services.nginx = {
+    enable = true;
+    virtualHosts."testvm01.home.2rjus.net" = {
+      forceSSL = true;
+      enableACME = true;
+      locations."/" = {
+        root = pkgs.writeTextDir "index.html" ''
+          <!DOCTYPE html>
+          <html>
+          <head>
+            <title>testvm01 - ACME Test</title>
+            <style>
+              body { font-family: monospace; max-width: 600px; margin: 50px auto; padding: 20px; }
+              .joke { background: #f0f0f0; padding: 20px; border-radius: 8px; margin: 20px 0; }
+              .punchline { margin-top: 15px; font-weight: bold; }
+            </style>
+          </head>
+          <body>
+            <h1>OpenBao PKI ACME Test</h1>
+            <p>If you're seeing this over HTTPS, the migration worked!</p>
+            <div class="joke">
+              <p>Why do programmers prefer dark mode?</p>
+              <p class="punchline">Because light attracts bugs.</p>
+            </div>
+            <p><small>Certificate issued by: vault.home.2rjus.net</small></p>
+          </body>
+          </html>
+        '';
+      };
+    };
+  };
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
--- a/hosts/testvm02/configuration.nix
+++ b/hosts/testvm02/configuration.nix
@@ -1,25 +1,38 @@
 {
  config,
  lib,
  pkgs,
  ...
 }:
 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix
    ../../system
    ../../common/vm
    ../../common/ssh-audit.nix
  ];
-  nixpkgs.config.allowUnfree = true;
+  # Host metadata (adjust as needed)
-  # Use the systemd-boot EFI boot loader.
+  homelab.host = {
-  boot.loader.grub = {
+    tier = "test";  # Start in test tier, move to prod after validation
    enable = true;
    device = "/dev/sda";
    configurationLimit = 3;
  };
-  networking.hostName = "ca";
+  # Enable Vault integration
  vault.enable = true;
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;
  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
  networking.hostName = "testvm02";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,7 +46,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.12/24"
+      "10.69.13.21/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -59,5 +72,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
-}
+}
--- a/hosts/testvm02/default.nix
+++ b/hosts/testvm02/default.nix
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
-}
+}
--- a/hosts/testvm03/configuration.nix
+++ b/hosts/testvm03/configuration.nix
@@ -1,25 +1,38 @@
-{ config, lib, pkgs, ... }:
+{
  config,
  lib,
  pkgs,
  ...
 }:
 {
-  imports =
+  imports = [
-    [
+    ../template2/hardware-configuration.nix
      ./hardware-configuration.nix
-      ../../system
+    ../../system
-    ];
+    ../../common/vm
-
+    ../../common/ssh-audit.nix
-  # Template host - exclude from DNS zone generation
+  ];
  homelab.dns.enable = false;
  # Host metadata (adjust as needed)
  homelab.host = {
-    tier = "test";
+    tier = "test";  # Start in test tier, move to prod after validation
    priority = "low";
  };
  # Enable Vault integration
  vault.enable = true;
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;
  # Enable Kanidm PAM/NSS for central authentication
  homelab.kanidm.enable = true;
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
+  boot.loader.grub.device = "/dev/vda";
-  networking.hostName = "nixos-template";
+
  networking.hostName = "testvm03";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,19 +46,21 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.8.250/24"
+      "10.69.13.22/24"
    ];
    routes = [
-      { Gateway = "10.69.8.1"; }
+      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";
-  nix.settings.experimental-features = [ "nix-command" "flakes" ];
+  nix.settings.experimental-features = [
    "nix-command"
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    age
    vim
    wget
    git
@@ -57,6 +72,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
-}
+}
--- a/hosts/testvm03/default.nix
+++ b/hosts/testvm03/default.nix
@@ -1,7 +1,5 @@
-{ ... }:
+{ ... }: {
 {
  imports = [
    ./configuration.nix
    ../../services/ca
  ];
-}
+}
--- a/hosts/vault01/configuration.nix
+++ b/hosts/vault01/configuration.nix
@@ -62,6 +62,16 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
  # Vault fetches secrets from itself (after unseal)
  vault.enable = true;
  homelab.deploy.enable = true;
  # Ensure vault-secret services wait for openbao to be unsealed
  systemd.services.vault-secret-homelab-deploy-nkey = {
    after = [ "openbao.service" ];
    wants = [ "openbao.service" ];
  };
  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/vaulttest01/configuration.nix
+++ b/hosts/vaulttest01/configuration.nix
@@ -1,135 +0,0 @@
 {
  config,
  lib,
  pkgs,
  ...
 }:
 let
  vault-test-script = pkgs.writeShellApplication {
    name = "vault-test";
    text = ''
      echo "=== Vault Secret Test ==="
      echo "Secret path: hosts/vaulttest01/test-service"
      if [ -f /run/secrets/test-service/password ]; then
        echo "✓ Password file exists"
        echo "Password length: $(wc -c < /run/secrets/test-service/password)"
      else
        echo "✗ Password file missing!"
        exit 1
      fi
      if [ -d /var/lib/vault/cache/test-service ]; then
        echo "✓ Cache directory exists"
      else
        echo "✗ Cache directory missing!"
        exit 1
      fi
      echo "Test successful!"
    '';
  };
 in
 {
  imports = [
    ../template2/hardware-configuration.nix
    ../../system
    ../../common/vm
  ];
  homelab.host = {
    tier = "test";
    priority = "low";
    role = "vault";
  };
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
  networking.hostName = "vaulttest01";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  services.resolved.enable = true;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
  ];
  systemd.network.enable = true;
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.150/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";
  nix.settings.experimental-features = [
    "nix-command"
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
    wget
    git
    htop # test deploy verification
  ];
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  networking.firewall.enable = false;
  # Testing config
  # Enable Vault secrets management
  vault.enable = true;
  homelab.deploy.enable = true;
  # Define a test secret
  vault.secrets.test-service = {
    secretPath = "hosts/vaulttest01/test-service";
    restartTrigger = true;
    restartInterval = "daily";
    services = [ "vault-test" ];
  };
  # Create a test service that uses the secret
  systemd.services.vault-test = {
    description = "Test Vault secret fetching";
    wantedBy = [ "multi-user.target" ];
    after = [ "vault-secret-test-service.service" ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
      ExecStart = lib.getExe vault-test-script;
      StandardOutput = "journal+console";
    };
  };
  # Test ACME certificate issuance from OpenBao PKI
  # Override the global ACME server (from system/acme.nix) to use OpenBao instead of step-ca
  security.acme.defaults.server = lib.mkForce "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
  # Request a certificate for this host
  # Using HTTP-01 challenge with standalone listener on port 80
  security.acme.certs."vaulttest01.home.2rjus.net" = {
    listenHTTP = ":80";
    enableDebugLogs = true;
  };
  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/lib/monitoring.nix
+++ b/lib/monitoring.nix
@@ -21,6 +21,7 @@ let
      cfg = hostConfig.config;
      monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
      dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
      hostConfig' = (cfg.homelab or { }).host or { };
      hostname = cfg.networking.hostName;
      networks = cfg.systemd.network.networks or { };
@@ -49,20 +50,73 @@ let
        inherit hostname;
        ip = extractIP firstAddress;
        scrapeTargets = monConfig.scrapeTargets or [ ];
        # Host metadata for label propagation
        tier = hostConfig'.tier or "prod";
        priority = hostConfig'.priority or "high";
        role = hostConfig'.role or null;
        labels = hostConfig'.labels or { };
      };
  # Build effective labels for a host
  # Always includes hostname; only includes tier/priority/role if non-default
  buildEffectiveLabels = host:
    { hostname = host.hostname; }
    // (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
    // (lib.optionalAttrs (host.role != null) { role = host.role; })
    // host.labels;
  # Generate node-exporter targets from all flake hosts
  # Returns a list of static_configs entries with labels
  generateNodeExporterTargets = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
      hostList = lib.filter (x: x != null) (
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );
-      flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
+
      # Extract hostname from a target string like "gunter.home.2rjus.net:9100"
      extractHostnameFromTarget = target:
        builtins.head (lib.splitString "." target);
      # Build target entries with labels for each host
      flakeEntries = map
        (host: {
          target = "${host.hostname}.home.2rjus.net:9100";
          labels = buildEffectiveLabels host;
        })
        hostList;
      # External targets get hostname extracted from the target string
      externalEntries = map
        (target: {
          inherit target;
          labels = { hostname = extractHostnameFromTarget target; };
        })
        (externalTargets.nodeExporter or [ ]);
      allEntries = flakeEntries ++ externalEntries;
      # Group entries by their label set for efficient static_configs
      # Convert labels attrset to a string key for grouping
      labelKey = entry: builtins.toJSON entry.labels;
      grouped = lib.groupBy labelKey allEntries;
      # Convert groups to static_configs format
      # Every flake host now has at least a hostname label
      staticConfigs = lib.mapAttrsToList
        (key: entries:
          let
            labels = (builtins.head entries).labels;
          in
          { targets = map (e: e.target) entries; labels = labels; }
        )
        grouped;
    in
-    flakeTargets ++ (externalTargets.nodeExporter or [ ]);
+    staticConfigs;
  # Generate scrape configs from all flake hosts and external targets
  # Host labels are propagated to service targets for semantic alert filtering
  generateScrapeConfigs = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
@@ -70,13 +124,14 @@ let
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );
-      # Collect all scrapeTargets from all hosts, grouped by job_name
+      # Collect all scrapeTargets from all hosts, including host labels
      allTargets = lib.flatten (map
        (host:
          map
            (target: {
              inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
              hostname = host.hostname;
              hostLabels = buildEffectiveLabels host;
            })
            host.scrapeTargets
        )
@@ -87,22 +142,32 @@ let
      grouped = lib.groupBy (t: t.job_name) allTargets;
      # Generate a scrape config for each job
      # Within each job, group targets by their host labels for efficient static_configs
      flakeScrapeConfigs = lib.mapAttrsToList
        (jobName: targets:
          let
            first = builtins.head targets;
-            targetAddrs = map
+
-              (t:
+            # Group targets within this job by their host labels
            labelKey = t: builtins.toJSON t.hostLabels;
            groupedByLabels = lib.groupBy labelKey targets;
            # Every flake host now has at least a hostname label
            staticConfigs = lib.mapAttrsToList
              (key: labelTargets:
                let
-                  portStr = toString t.port;
+                  labels = (builtins.head labelTargets).hostLabels;
                  targetAddrs = map
                    (t: "${t.hostname}.home.2rjus.net:${toString t.port}")
                    labelTargets;
                in
-                "${t.hostname}.home.2rjus.net:${portStr}")
+                { targets = targetAddrs; labels = labels; }
-              targets;
+              )
              groupedByLabels;
            config = {
              job_name = jobName;
-              static_configs = [{
+              static_configs = staticConfigs;
                targets = targetAddrs;
              }];
            }
            // (lib.optionalAttrs (first.metrics_path != "/metrics") {
              metrics_path = first.metrics_path;
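Note on the lib/monitoring.nix change above: for readers less familiar with Nix, the label grouping can be sketched in Python. This is an illustrative re-expression under stated assumptions, not code from the repository. The names `build_effective_labels` and `group_static_configs` are hypothetical, and hosts are modelled as plain dicts.

```python
import json
from collections import defaultdict


def build_effective_labels(host):
    """Mirror buildEffectiveLabels: hostname always, other labels only if non-default."""
    labels = {"hostname": host["hostname"]}
    if host.get("tier", "prod") != "prod":
        labels["tier"] = host["tier"]
    if host.get("priority", "high") != "high":
        labels["priority"] = host["priority"]
    if host.get("role") is not None:
        labels["role"] = host["role"]
    # Free-form labels are merged last, so they can override the ones above.
    labels.update(host.get("labels", {}))
    return labels


def group_static_configs(entries):
    """Group {"target", "labels"} entries by their serialized label set."""
    grouped = defaultdict(list)
    for entry in entries:
        key = json.dumps(entry["labels"], sort_keys=True)
        grouped[key].append(entry)
    return [
        {"targets": [e["target"] for e in group], "labels": group[0]["labels"]}
        for group in grouped.values()
    ]
```

Because the hostname label is always present, entries from different hosts never share a key, so in this sketch each group holds targets from a single host.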
--- a/playbooks/build-and-deploy-template.yml
+++ b/playbooks/build-and-deploy-template.yml
@@ -99,3 +99,48 @@
    - name: Display success message
      ansible.builtin.debug:
        msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
 - name: Update Terraform template name
  hosts: localhost
  gather_facts: false
  vars:
    terraform_dir: "{{ playbook_dir }}/../terraform"
  tasks:
    - name: Get image filename from earlier play
      ansible.builtin.set_fact:
        image_filename: "{{ hostvars['localhost']['image_filename'] }}"
    - name: Extract template name from image filename
      ansible.builtin.set_fact:
        new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"
    - name: Read current Terraform variables file
      ansible.builtin.slurp:
        src: "{{ terraform_dir }}/variables.tf"
      register: variables_tf_content
    - name: Extract current template name from variables.tf
      ansible.builtin.set_fact:
        current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"
    - name: Check if template name has changed
      ansible.builtin.set_fact:
        template_name_changed: "{{ current_template_name != new_template_name }}"
    - name: Display template name status
      ansible.builtin.debug:
        msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"
    - name: Update default_template_name in variables.tf
      ansible.builtin.replace:
        path: "{{ terraform_dir }}/variables.tf"
        regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
        replace: '\1"{{ new_template_name }}"'
      when: template_name_changed
    - name: Display update result
      ansible.builtin.debug:
        msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
      when: template_name_changed
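Note on the playbook change above: the two chained `regex_replace` filters strip the `.vma.zst` suffix and the `vzdump-qemu-` prefix from the image filename. A minimal Python sketch of the same transformation follows. The function name `template_name_from_image` and the sample filename are hypothetical, chosen only for illustration.

```python
import re


def template_name_from_image(image_filename):
    """Strip the '.vma.zst' suffix, then the 'vzdump-qemu-' prefix."""
    name = re.sub(r"\.vma\.zst$", "", image_filename)
    return re.sub(r"^vzdump-qemu-", "", name)


if __name__ == "__main__":
    # Hypothetical filename; real names come from the earlier play.
    print(template_name_from_image("vzdump-qemu-example-template.vma.zst"))
```

A filename that lacks either the prefix or the suffix passes through with that part unchanged, since `re.sub` returns its input when nothing matches.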
--- a/rebuild-all.sh
+++ b/rebuild-all.sh
@@ -1,20 +0,0 @@
 #!/usr/bin/env bash
 set -euo pipefail
 # array of hosts
 HOSTS=(
    "ns1"
    "ns2"
    "ca"
    "ha1"
    "http-proxy"
    "jelly01"
    "monitoring01"
    "nix-cache01"
    "pgdb1"
 )
 for host in "${HOSTS[@]}"; do
    echo "Rebuilding $host"
    nixos-rebuild boot --flake .#${host} --target-host root@${host}
 done
--- a/scripts/create-host/create_host.py
+++ b/scripts/create-host/create_host.py
@@ -18,6 +18,8 @@ from manipulators import (
    remove_from_flake_nix,
    remove_from_terraform_vms,
    remove_from_vault_terraform,
    remove_from_approle_tf,
    find_host_secrets,
    check_entries_exist,
 )
 from models import HostConfig
@@ -255,7 +257,10 @@ def handle_remove(
        sys.exit(1)
    # Check what entries exist
-    flake_exists, terraform_exists, vault_exists = check_entries_exist(hostname, repo_root)
+    flake_exists, terraform_exists, vault_exists, approle_exists = check_entries_exist(hostname, repo_root)
    # Check for secrets in secrets.tf
    host_secrets = find_host_secrets(hostname, repo_root)
    # Collect all files in the host directory recursively
    files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()])
@@ -294,11 +299,25 @@ def handle_remove(
    else:
        console.print(f"  • terraform/vault/hosts-generated.tf [dim](not found)[/dim]")
-    # Warn about secrets directory
+    if approle_exists:
        console.print(f'  • terraform/vault/approle.tf (host_policies["{hostname}"])')
    else:
        console.print(f"  • terraform/vault/approle.tf [dim](not found)[/dim]")
    # Warn about secrets in secrets.tf
    if host_secrets:
        console.print(f"\n[yellow]⚠️  Warning: Found {len(host_secrets)} secret(s) in terraform/vault/secrets.tf:[/yellow]")
        for secret_path in host_secrets:
            console.print(f'   • "{secret_path}"')
        console.print(f"\n   [yellow]These will NOT be removed automatically.[/yellow]")
        console.print(f"   After removal, manually edit secrets.tf and run:")
        for secret_path in host_secrets:
            console.print(f"   [white]vault kv delete secret/{secret_path}[/white]")
    # Warn about legacy secrets directory
    if secrets_exist:
-        console.print(f"\n[yellow]⚠️  Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
+        console.print(f"\n[yellow]⚠️  Warning: secrets/{hostname}/ directory exists (legacy SOPS)[/yellow]")
        console.print(f"   Manually remove if no longer needed: [white]rm -rf secrets/{hostname}/[/white]")
        console.print(f"   Also update .sops.yaml to remove the host's age key")
    # Exit if dry run
    if dry_run:
@@ -323,6 +342,13 @@ def handle_remove(
        else:
            console.print("[yellow]⚠[/yellow]  Could not remove from terraform/vault/hosts-generated.tf")
    # Remove from terraform/vault/approle.tf
    if approle_exists:
        if remove_from_approle_tf(hostname, repo_root):
            console.print("[green]✓[/green] Removed from terraform/vault/approle.tf")
        else:
            console.print("[yellow]⚠[/yellow]  Could not remove from terraform/vault/approle.tf")
    # Remove from terraform/vms.tf
    if terraform_exists:
        if remove_from_terraform_vms(hostname, repo_root):
@@ -345,19 +371,34 @@ def handle_remove(
    console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n")
    # Display next steps
-    display_removal_next_steps(hostname, vault_exists)
+    display_removal_next_steps(hostname, vault_exists, approle_exists, host_secrets)
-def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
+def display_removal_next_steps(hostname: str, had_vault: bool, had_approle: bool, host_secrets: list) -> None:
    """Display next steps after successful removal."""
-    vault_file = " terraform/vault/hosts-generated.tf" if had_vault else ""
+    vault_files = ""
    vault_apply = ""
    if had_vault:
        vault_files += " terraform/vault/hosts-generated.tf"
    if had_approle:
        vault_files += " terraform/vault/approle.tf"
    vault_apply = ""
    if had_vault or had_approle:
        vault_apply = f"""
 3. Apply Vault changes:
   [white]cd terraform/vault && tofu apply[/white]
 """
    secrets_cleanup = ""
    if host_secrets:
        secrets_cleanup = f"""
 5. Clean up secrets (manual):
   Edit terraform/vault/secrets.tf to remove entries for {hostname}
   Then delete from Vault:"""
        for secret_path in host_secrets:
            secrets_cleanup += f"\n   [white]vault kv delete secret/{secret_path}[/white]"
        secrets_cleanup += "\n"
    next_steps = f"""[bold cyan]Next Steps:[/bold cyan]
 1. Review changes:
@@ -367,9 +408,9 @@ def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
   [white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white]
 {vault_apply}
 4. Commit changes:
-   [white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_file}
+   [white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_files}
   git commit -m "hosts: remove {hostname}"[/white]
-"""
+{secrets_cleanup}"""
    console.print(Panel(next_steps, border_style="cyan"))
--- a/scripts/create-host/generators.py
+++ b/scripts/create-host/generators.py
@@ -144,7 +144,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
  backend            = vault_auth_backend.approle.path
  role_name          = each.key
-  token_policies     = ["host-\${each.key}"]
+  token_policies     = ["host-\${each.key}", "homelab-deploy"]
  secret_id_ttl      = 0  # Never expire (wrapped tokens provide time limit)
  token_ttl          = 3600
  token_max_ttl      = 3600
--- a/scripts/create-host/manipulators.py
+++ b/scripts/create-host/manipulators.py
@@ -22,12 +22,12 @@ def remove_from_flake_nix(hostname: str, repo_root: Path) -> bool:
    content = flake_path.read_text()
    # Check if hostname exists
-    hostname_pattern = rf"^      {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
+    hostname_pattern = rf"^        {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
    if not re.search(hostname_pattern, content, re.MULTILINE):
        return False
    # Match the entire block from "hostname = " to "};"
-    replace_pattern = rf"^      {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^      \}};\n"
+    replace_pattern = rf"^        {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^        \}};\n"
    new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)
    if count == 0:
@@ -101,7 +101,68 @@ def remove_from_vault_terraform(hostname: str, repo_root: Path) -> bool:
    return True
-def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool]:
+def remove_from_approle_tf(hostname: str, repo_root: Path) -> bool:
    """
    Remove host entry from terraform/vault/approle.tf locals.host_policies.
    Args:
        hostname: Hostname to remove
        repo_root: Path to repository root
    Returns:
        True if found and removed, False if not found
    """
    approle_path = repo_root / "terraform" / "vault" / "approle.tf"
    if not approle_path.exists():
        return False
    content = approle_path.read_text()
    # Check if hostname exists in host_policies
    hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
    if not re.search(hostname_pattern, content, re.MULTILINE):
        return False
    # Match the entire block from "hostname" = { to closing }
    # The block contains paths = [ ... ] and possibly extra_policies = [...]
    replace_pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
    new_content, count = re.subn(replace_pattern, "\n", content, flags=re.DOTALL)
    if count == 0:
        return False
    approle_path.write_text(new_content)
    return True
 def find_host_secrets(hostname: str, repo_root: Path) -> list:
    """
    Find secrets in terraform/vault/secrets.tf that belong to a host.
    Args:
        hostname: Hostname to search for
        repo_root: Path to repository root
    Returns:
        List of secret paths found (e.g., ["hosts/hostname/test-service"])
    """
    secrets_path = repo_root / "terraform" / "vault" / "secrets.tf"
    if not secrets_path.exists():
        return []
    content = secrets_path.read_text()
    # Find all secret paths matching hosts/{hostname}/
    pattern = rf'"(hosts/{re.escape(hostname)}/[^"]+)"'
    matches = re.findall(pattern, content)
    # Return unique paths, preserving order
    return list(dict.fromkeys(matches))
 def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool, bool]:
    """
    Check which entries exist for a hostname.
@@ -110,12 +171,12 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
        repo_root: Path to repository root
    Returns:
-        Tuple of (flake_exists, terraform_vms_exists, vault_exists)
+        Tuple of (flake_exists, terraform_vms_exists, vault_generated_exists, approle_exists)
    """
    # Check flake.nix
    flake_path = repo_root / "flake.nix"
    flake_content = flake_path.read_text()
-    flake_pattern = rf"^      {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
+    flake_pattern = rf"^        {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
    flake_exists = bool(re.search(flake_pattern, flake_content, re.MULTILINE))
    # Check terraform/vms.tf
@@ -131,7 +192,15 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
        vault_content = vault_tf_path.read_text()
        vault_exists = f'"{hostname}"' in vault_content
-    return (flake_exists, terraform_exists, vault_exists)
+    # Check terraform/vault/approle.tf
    approle_path = repo_root / "terraform" / "vault" / "approle.tf"
    approle_exists = False
    if approle_path.exists():
        approle_content = approle_path.read_text()
        approle_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
        approle_exists = bool(re.search(approle_pattern, approle_content, re.MULTILINE))
    return (flake_exists, terraform_exists, vault_exists, approle_exists)
 def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None:
@@ -147,32 +216,25 @@ def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -
    content = flake_path.read_text()
    # Create new entry
-    new_entry = f"""      {config.hostname} = nixpkgs.lib.nixosSystem {{
+    new_entry = f"""        {config.hostname} = nixpkgs.lib.nixosSystem {{
-        inherit system;
+          inherit system;
-        specialArgs = {{
+          specialArgs = {{
-          inherit inputs self sops-nix;
+            inherit inputs self;
          }};
          modules = commonModules ++ [
            ./hosts/{config.hostname}
          ];
        }};
        modules = [
          (
            {{ config, pkgs, ... }}:
            {{
              nixpkgs.overlays = commonOverlays;
            }}
          )
          ./hosts/{config.hostname}
          sops-nix.nixosModules.sops
        ];
      }};
 """
    # Check if hostname already exists
-    hostname_pattern = rf"^      {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem"
+    hostname_pattern = rf"^        {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem"
    existing_match = re.search(hostname_pattern, content, re.MULTILINE)
    if existing_match and force:
        # Replace existing entry
        # Match the entire block from "hostname = " to "};"
-        replace_pattern = rf"^      {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^      \}};\n"
+        replace_pattern = rf"^        {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^        \}};\n"
        new_content, count = re.subn(replace_pattern, new_entry, content, flags=re.MULTILINE | re.DOTALL)
        if count == 0:
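Note on `remove_from_approle_tf` above: the block-removal regex can be exercised in isolation. The sketch below reuses the pattern from the diff against a made-up `host_policies` fragment. The sample HCL and the helper name `remove_host_block` are assumptions for illustration, not the contents of the real approle.tf.

```python
import re

# Hypothetical fragment shaped like locals.host_policies.
SAMPLE = '''locals {
  host_policies = {
    "ha1" = {
      paths = [
        "secret/data/hosts/ha1/*",
      ]
    }
    "testvm01" = {
      paths = [
        "secret/data/hosts/testvm01/*",
      ]
    }
  }
}
'''


def remove_host_block(hostname, content):
    """Remove the '"hostname" = { ... }' block; return (new_content, count)."""
    pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
    return re.subn(pattern, "\n", content, flags=re.DOTALL)


if __name__ == "__main__":
    new_content, count = remove_host_block("testvm01", SAMPLE)
    print(count)
    print(new_content)
```

One property worth knowing: `[^}]*` stops at the first closing brace, so this pattern only removes a whole block when the block contains no nested `{ ... }`. That holds for entries made up of list attributes such as `paths = [ ... ]`.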
--- a/scripts/create-host/templates/configuration.nix.j2
+++ b/scripts/create-host/templates/configuration.nix.j2
@@ -18,6 +18,12 @@
    tier = "test";  # Start in test tier, move to prod after validation
  };
  # Enable Vault integration
  vault.enable = true;
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/scripts/create-host/validators.py
+++ b/scripts/create-host/validators.py
@@ -140,20 +140,22 @@ def validate_ip_unique(ip: Optional[str], repo_root: Path) -> None:
    ip_part = ip.split("/")[0]
    # Check all hosts/*/configuration.nix files
    # Search for IP with CIDR notation to match static IP assignments
    # (e.g., "10.69.13.5/24") but not DNS resolver entries (e.g., "10.69.13.5")
    hosts_dir = repo_root / "hosts"
    if hosts_dir.exists():
        for config_file in hosts_dir.glob("*/configuration.nix"):
            content = config_file.read_text()
-            if ip_part in content:
+            if ip in content:
                raise ValueError(
                    f"IP address {ip_part} already in use in {config_file}"
                )
-    # Check terraform/vms.tf
+    # Check terraform/vms.tf - search for full IP with CIDR
    terraform_file = repo_root / "terraform" / "vms.tf"
    if terraform_file.exists():
        content = terraform_file.read_text()
-        if ip_part in content:
+        if ip in content:
            raise ValueError(
                f"IP address {ip_part} already in use in {terraform_file}"
            )
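Note on the `validate_ip_unique` change above: matching the full `address/prefix` string, rather than the bare address, avoids flagging DNS resolver entries. A minimal sketch of the difference follows, using a made-up configuration snippet. The helper names and the sample text are hypothetical.

```python
# Hypothetical snippet: one resolver entry (no prefix) and one static address.
SAMPLE_CONFIG = '''
networking.nameservers = [ "10.69.13.5" ];
address = [ "10.69.13.21/24" ];
'''


def bare_ip_in_use(ip, content):
    """Old behaviour: match on the address without its prefix length."""
    return ip.split("/")[0] in content


def cidr_ip_in_use(ip, content):
    """New behaviour: match on the full 'address/prefix' string."""
    return ip in content


if __name__ == "__main__":
    for candidate in ("10.69.13.5/24", "10.69.13.2/24", "10.69.13.21/24"):
        print(candidate,
              bare_ip_in_use(candidate, SAMPLE_CONFIG),
              cidr_ip_in_use(candidate, SAMPLE_CONFIG))
```

The bare substring check also matches when the new address is a prefix of an existing one (for example `10.69.13.2` inside `10.69.13.21`). Including the `/24` suffix avoids that case as well.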
--- a/secrets/ca/keys/intermediate_ca_key
+++ b/secrets/ca/keys/intermediate_ca_key
@@ -1,24 +0,0 @@
 {
 	"data": "ENC[AES256_GCM,data:TgGIuklFPUSCBosD86NFnkAtRvYijQNQP4vvTkKu3dRAOjdDa2li5djZDUS4NEEPEihpOcMXqHBb+ABk3LmoU5nLmsKCeylUp7+DhcGi9f3xw2h1zbHV37mt40OVLTF3cYufRdydIkCGQA3td3q1ue/wCna2ewe73xwGg5j6ZVJCZAtW4VCNZM+rcG+YxPUC0gmBH59+O0VSrZrkvSnifbr+K0dGwg4i17KwAukI4Ac7YMkQoeuAPXq38+ZftlRx4tq9xBUko6wpPY9zOaFzeagWYMF0n1UYqDt+/3XZI/mukPhJc9tzbWneqgkQBOx3OiDwrNglCHvEpnb+bZePIRLOnNHd1ShETgBqhsHGp9OAwwbAt4tO+HFpCQtVz7s2LWQFLbWiN0SCGzYUkFGCgoXae5H58lxFav8=,iv:UzaWlJ+M+VQx3CcPSGbFZh5/rGbKpS2Rq2XVZAIDFiQ=,tag:F3waoAMuEKTvN2xANReSww==,type:str]",
 	"sops": {
 		"kms": null,
 		"gcp_kms": null,
 		"azure_kv": null,
 		"hc_vault": null,
 		"age": [
 			{
 				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBpRGZSVHRSMGlyazAwQU5j\nd1o1L0Y1ckhQMkh4MVZiRmZlR2ozcmdsUW1vCk4xZ1ZibDBrUWZhYmxVVjBUczRn\nYlJtUWF3Y1lHWG56NkhmK2JOUHVGajQKLS0tIDN2S2doQURpTis2U3lWV0NxdWEz\ncjNZaEl1dEQwOXhsNE9xbHhYUzNTV3cKVmVIe05JwgXKSku7AJmrujYXrbBSbpBJ\nnqCuDIhok1w/fiff+XXn8udbgPVq5bC2SOhHbtVxImgBCFzrj5hQ0A==\n-----END AGE ENCRYPTED FILE-----\n"
 			},
 			{
 				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA4V3NaUEdvMmJvakQ0L1F0\nUnkvQ2F5dEVlZ2pMdlBZcjJac0tERnF5ZWljCmFrdU1NZ29jMkJ1a1ZLdURmVWI0\ncm1vNytFVzZjbVY2aVd2N3laMWNRNFEKLS0tIGgzOTFZY0lxc0JyVmd5cFBlNkRr\nVDBWc0t4c3pVV3RhSTB1UUVpNHd6NUkKNn6Sxb5oxP7iWqTF1+X9nOiYum3U+Rzk\nkryxVnf9EvQIVIFKDaTb+yAEO8otjqj+C4mHA9fannnNEJduOiPWOg==\n-----END AGE ENCRYPTED FILE-----\n"
 			}
 		],
 		"lastmodified": "2024-11-30T13:18:08Z",
 		"mac": "ENC[AES256_GCM,data:9R9RJzPMr9Bv8aeCDxhExTfbr+R2hjap6FGSk5QxBdbNpOcNS78ica0CLEmkAYVAfjmx/X2jC5ZnsAueSPUK7nAgNX2gJXbUTpY0F+oKt35GJziLrFLl3u/ahpF9lQ50EL9OqqgS+igDqtodJhKme5DXH5/GXQHhz++O3VZkR78=,iv:XgN3PiowiEosi2DmrjP82HhJMvnwaV530tsBE8GQfjs=,tag:U243BrtH7H/DU9LcjN/MMg==,type:str]",
 		"pgp": null,
 		"unencrypted_suffix": "_unencrypted",
 		"version": "3.9.1"
 	}
 }
--- a/secrets/ca/keys/root_ca_key
+++ b/secrets/ca/keys/root_ca_key
@@ -1,24 +0,0 @@
 {
 	"data": "ENC[AES256_GCM,data:5AePh5uXcUseYBGWvlztgmg8mGBGy3ngKRa6+QxOaT0/fzSB1pKkaMtZJo76tV9wwjdL6/b6VVUI7GIaCBD5kgdZuA8RdBTXguHyjjdxAlI9xcrQaWWdATd8JJt+eQp/m2Y+0dioyXKaDV2ukI3GtHYjp/ixMoHHWEocnEEb40wG6c3CZcvsLWJvKTkFc2OvcjcU2RTfuNlYtEETidiD9iC/dtCakNQHmLP1UFYgcn0ebXBKmlqD6+x2o7BVT1SLwVCyGNvH3eKA2AWvddZChnhaNCUIXcRwBFCgS8lPs4iXhAhly+nwuj7ssFpuu3sjm5pq196tRS8WQl2iNUEJ2tzoOpceg1kZZ7KHX3wCbdBlCRqhy9Q4JMvWPDssO+zz2aU21+BDEySDTCnTYX9Hu2/iFvZejt++mKY=,iv:u/Ukye0BAj2ka++AA72W8WfXJAZZ/YJ3RC/aydxdoUc=,tag:ihTP5bCCigWEPcLFaYOhMA==,type:str]",
 	"sops": {
 		"kms": null,
 		"gcp_kms": null,
 		"azure_kv": null,
 		"hc_vault": null,
 		"age": [
 			{
 				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB0VElDNHArZXlXa2JRQjd0\nQmVIbGpPWk43NDdiTkFtcEd1bDhRdXJWOUY0CndITHdKTFNJQXFOVFdyUGNtQ09k\nN2hnQmFYR0ZORWtxcUN0ZFhsM0U3N2cKLS0tIFh1TTBpMjFIZ2NYM1QxeDRjYlJx\nYkdrUDZmMUpGbjk3REJCVVRpeFk5Z28KJcia0Bk+3ZoifZnRLwqAko526ODPnkSS\nzymtOj/QYTA0++NP3B1aScIyhWITMEZX1iSoWDmgHj8ZQoNMdkM7AQ==\n-----END AGE ENCRYPTED FILE-----\n"
 			},
 			{
 				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBZNlNHRWNEcUZGNXNBMDFR\nTzE5RnNMQUMvU1k2OS9XMlpvUktMRzQ5RmxvCnlCS3lzRVpGUHJLRGZ6SWZ2ZktR\na3l0TVN2NUlRVEQwRHByYkNEMDQyWUkKLS0tIEh3RjBWT3c5K2RWeDRjWFpsU1lP\ncStqY2xta3RSNkR6Vkt5YXhYUTZmbDgKvVKmZc8S/RwurJGsGiJ5LhM4waLO9B9k\n2cawxHmcYM3KfXDFwp9UZWhIwF7SRkG56ZE4OjGI3sOL+74ixnePxA==\n-----END AGE ENCRYPTED FILE-----\n"
 			}
 		],
 		"lastmodified": "2024-11-30T13:18:16Z",
 		"mac": "ENC[AES256_GCM,data:JwjbQ129cYCBNA5Fb8lN9rW7/y4wuVOqLeajIMcYyCzlBcjzCZAV1DKN5n75xMamb/hb1AUkmtp/K82PKM0Vg5X4/lpWTUZXZOzn/TrwHx+yqlJjL9mUdGuHnSY5DwME38Dde3UxdtUa0CVgQOxvMIycW27w8+8NNfO2zxGxkzc=,iv:ZMZASOsqXZOb0NkBqG3GGaqqKgQdjZLiku2yU5QonB8=,tag:/lb/HMxsYOV5XX/5kWnFHA==,type:str]",
 		"pgp": null,
 		"unencrypted_suffix": "_unencrypted",
 		"version": "3.9.1"
 	}
 }
--- a/secrets/ca/keys/ssh_host_ca_key
+++ b/secrets/ca/keys/ssh_host_ca_key
@@ -1,24 +0,0 @@
 {
 	"data": "ENC[AES256_GCM,data:vqQ3HwSmuDlI4UwraLWvwkBSj9zTFeNEWI1xzhVrO/gpx8+WBZOt2F0J7/LSTGAWsWW/9Gov+XXXAOtfnKfjYVzizyT/jE8EQwMuItWiFEVA6hohgwtsk7YKJjXdJIxmiv+WKs73gWb0uFVGh1ArMzsVkGPj1W1AKMFAneDPgsfSCy9aVOMuF8zQwypFC8eaxqOQhLpiN2ncRm8e7khwGurSgYfHDgFghaDr8torgUrZTOPNFk+LEdxB3WcC17+4a8ZyuBapmYdRTrP73czTAuxOF8lMwddJhO99SF7nWuOYVF1FOKLGtK04oKci5/xRIzvWo3I0pGajkxtuF5CyWbd1KblcPfBALIU/J5hU/puGJ7M2sE/qsg/4kaTFxnhq32rPZj291jFb4evDdOhVodfC1axOQUbzAC0=,iv:yOeQ384ikqgDqfthl7GIVSIMNA/n0BYTSIqFN3T9MAY=,tag:Y6nhOCrkWx7MnVpEeKN0Jg==,type:str]",
 	"sops": {
 		"kms": null,
 		"gcp_kms": null,
 		"azure_kv": null,
 		"hc_vault": null,
 		"age": [
 			{
 				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBFTjRMWlNtYVQ2WnJEaGFN\nVFU2TXRTK2FHREpqREhOWHBKemxNc2U4WW44CnV4OWlBdXlFUWhJYi9jTTRuUWJV\nOWFPV2I4UytDRFo3blN3bUtFQ1NGU0kKLS0tIGp2VHlDc1JMMUdDUjlNNDFwUUxj\nVnhHbCtrNVNpZXo0K2dDVU5YTVJJUEkKk9mVTbzQVGZo3RKDLPDwtENknh+in1Q5\njf4DA1cGDDNzcEIWOOYyS+1mzT9WY8gU0hWqihX/bAx7CVsNUallZw==\n-----END AGE ENCRYPTED FILE-----\n"
 			},
 			{
 				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBrVFNwUGpkOUhkUXFWWERq\nMVdueC9VSE9KbGZkenBVK3NRMjRNVXVmcVRRCjNLa0QzbWVCQks3ZmV3eFVjcEp0\nRmxDSlZIZU1IbEdnbE83WlkxV3VZV1EKLS0tICtsRXArajQ4Um9mNEV5OWZBdS85\nVGFSU2wwODZ3Zm44M3pWcTdDV1dxejQKM2BK5Axb1cF344ea89gkzCLzEX6j4amK\nzxf+boBK7JUX7F6QaPB0sRU8J4Cei9mALz96C8xNHjX00KcD3O2QOA==\n-----END AGE ENCRYPTED FILE-----\n"
 			}
 		],
 		"lastmodified": "2024-11-30T13:18:20Z",
 		"mac": "ENC[AES256_GCM,data:AllgcWxHnr3igPi/JbfJCbEa6hKtmILnAjiaMojRZNO4p6zYSoF0s8lo9XX05/vIrFUo+YaCtsuacv+kfz9f6vQafPn7Vulbh6PeH1VlAmzyVfJOTmHP3YX8ic3uM56A4+III1jOERCFOIcc/CKsnRLFhLCRQRMgtgT0hTl5aPw=,iv:60dOYhoUTu1HIHzY36eJeRZ66/v6JmRRpIW99W2D+CI=,tag:F7nLSFm933K5M+JE4IvNYw==,type:str]",
 		"pgp": null,
 		"unencrypted_suffix": "_unencrypted",
 		"version": "3.9.1"
 	}
 }
--- a/secrets/ca/keys/ssh_user_ca_key
+++ b/secrets/ca/keys/ssh_user_ca_key
@@ -1,24 +0,0 @@
 {
 	"data": "ENC[AES256_GCM,data:YRdPrTLQH0xdWiIzOyjfEGpvfmuj6me6GzZZcauh9bUUywyA1ranDnWqbJYgawQQxIXsq9dhXD0uco+7mmXq2598kF1NI9jh6uLf3k0H494zZOalRBv/k8u9oJDLIiVAkg9eNNLbGX0PMZr/Yue/qdkuXx2Hg9E7bQJwpU/NXF+jKKs+3NmKT5NBlegwAzUs530D4DUoaq5AhvVvdC6a1UcE+KJzQ8pRiz1GjFIxAB7qX+GVwa3yNdLgo2tlAbOzjGtaDfJnhZIHSNEq+4TEhjlF9lCmFCGFDUVupvMOWs0kBywJEzIrDmxmvGHlPj3FfyytPb7qhlsOXDDDS67IoiwluKOnw+sALAG0Iv9LMrDZ3z8MXeEGvRWu0VDMuGXN905/9kGx/A40mPjcfnZvI+qSRIKjER5R8aU=,iv:qiP2Ml59AnK24MBbs7N/HqJIylf+fXGqJAo2N8iFNB0=,tag:0Dj5fVs6OB07kvV4qzuvfw==,type:str]",
 	"sops": {
 		"kms": null,
 		"gcp_kms": null,
 		"azure_kv": null,
 		"hc_vault": null,
 		"age": [
 			{
 				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBUFlvNmRNYUlJSHZYUkpJ\nMEloQXFSdENIWGJVVDNIOVY5MS9SYWRoL0FrCnRJc05wZUZBSDRvMHNUUEhNRXQ4\nTWhYOUp6YUNGZFNWUFRrSmlJM1c4aWcKLS0tIFc1b3NlSEo2eFJhdDgwejRqcHlT\nZE5wN01uaE04cTlIbVJMVWQvQ1pXajgKQ1n6UmP7LEBsnIBXVc0BceOqvwCqQzBP\ncI8C5Io4ILgMjY4dr6sd0SeJG6mfDdiMA+k7c6jqoyZCW/Pkd3LANQ==\n-----END AGE ENCRYPTED FILE-----\n"
 			},
 			{
 				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBtM2lyeXVzdE9nL1k5L3dC\nTkl2MjhMb1FKMFdCeXFPSmNST0pvOTRUaEVvCmdwMnhjSFFHVFhidmIySS9jMEJu\nNTJpRjdFOWpZZ3ZuZFJwZUUrRFU5NnMKLS0tIDJ1UjdVQkpMNm5Pd01JRnZNOEtr\nb1lpMlBkVHpiT2lYdWtZaUQrRW1HUDgKq/JVMf5gdu6lNEmqY6zU2SymbT+jklem\nnUQ9yieJGF+PanutNW6BCJH8jb/fH+Y6AeJ9S+kKCB4Yi75i4d+oHg==\n-----END AGE ENCRYPTED FILE-----\n"
 			}
 		],
 		"lastmodified": "2024-11-30T13:18:24Z",
 		"mac": "ENC[AES256_GCM,data:6FJTKEdIpCm+Dz7Ua8dZOMZQFaGU0oU/HRP6ly5mWbXCv81LRbZXRBd+5RDY3z9g9nb0PXZrOMNps63F6SKxK52VfzLIOap3UGeMNQn5P4/yyFj7JQHQ5Gjcf2l2z2VZ7NhUdNoSCV/6lwjValbKtids48Q5c3sFX997ZiqIUnY=,iv:nUeyJd/v8d9v7QsLLckziD9K5qjOZKK4vOQJw/ymi18=,tag:6n5EE3oklWdVcedvB2J/zA==,type:str]",
 		"pgp": null,
 		"unencrypted_suffix": "_unencrypted",
 		"version": "3.9.1"
 	}
 }
--- a/secrets/ca/secrets.yaml
+++ b/secrets/ca/secrets.yaml
@@ -1,30 +0,0 @@
 ca_root_pw: ENC[AES256_GCM,data:jS5BHS9i/pOykus5aGsW+w==,iv:aQIU7uXnNKaeNXv1UjRpBoSYcRpHo8RjnvCaIw4yCqc=,tag:lkjGm5/Ve93nizqGDQ0ByA==,type:str]
 sops:
    kms: []
    gcp_kms: []
    azure_kv: []
    hc_vault: []
    age:
        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA5anlORWxJalhRWkJPeGIy
            OStyVG8vMFRTTEZOWHR3Q3N1UWJQbFlxV3pBCmVKQVM1SlJ2L0JOb3U3cTh3YkZ4
            WHAxSUpTT1dyRHJHYVd1Qkh1ZWxwYW8KLS0tIEhXeklsSmlGaFlaaWF5L0Nodk5a
            clZ4M3hFSlFqaEZ0UWREdHpTQ29GVUEKAxj5P05Ilpwis2oKFe54mJX+1LfTwfUv
            2XRFOrEQbFNcK5WFu46p1mc/AAjKTeHWuvb2Yq43CO+sh1+kqKz0XA==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBaS0dqQ1p4MEE2d2JaeFRx
            UnB4ejhrS3hLekpqeWJhcEJGdnpzMTZDelVRCmFjVGswd3VtRUloWG1WbWY5N0s3
            cG9aV2hGU3lFZkkvcUJNWE1rWUIwMmMKLS0tIG1KdlhoQzREWDhPbXVSZVBUQkdE
            N1hmcEwxWXBIWkQ3a3BrdGhvUFoxbzgKX6hLoz7o/Du6ymrYwmGDkXp2XT+0+7QE
            YhD5qQzGLVQSh3XM/wWExj2Ue5/gw/NqNziHezOh2r9gQljbHjG2/g==
            -----END AGE ENCRYPTED FILE-----
    lastmodified: "2024-10-21T09:12:26Z"
    mac: ENC[AES256_GCM,data:hfPRIXt/kZJa6lsj7rz+5xGlrWhR/LX895S2d8auP/4t3V//80YE/ofIsHeAY9M7eSFsW9ce2Vp0C/WiCQefVWNaNN7nVAwskCfQ6vTWzs23oYz4NYIeCtZggBG3uGgJxb7ZnAFUJWmLwCxkKTQyoVVnn8i/rUDIBrkilbeLWNI=,iv:lm1HVbWtAifHjqKP0D3sxRadsE9+82ugbA2x54yRBTo=,tag:averxmPLa131lJtFrNxcEA==,type:str]
    pgp: []
    unencrypted_suffix: _unencrypted
    version: 3.9.1
--- a/secrets/http-proxy/wireguard.yaml
+++ b/secrets/http-proxy/wireguard.yaml
@@ -1,25 +0,0 @@
 wg_private_key: ENC[AES256_GCM,data:DlC9txcLkTnb7FoEd249oJV/Ehcp50P8uulbE4rY/xU16fkTlnKvPmYZ7u8=,iv:IsiTzdrh+BNSVgx1mfjpMGNV2J0c88q6AoP0kHX2aGY=,tag:OqFsOIyE71SBD1mcNS/PeQ==,type:str]
 sops:
    age:
        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAzdm9HTTN1amwxQ2Z6MUQv
            dGJ0cEgyaHNOZWtWSWlXNXc5bGhUdSsvVlVzCkJkc3ZQdzlBNDNxb3Avdi96bXFt
            TExZY29nUDI3RE5vanh6TVBRME1Fa1UKLS0tIG8vSHdCYzkvWmJpd0hNbnRtUmtk
            aVcwaFJJclZ3YUlUTTNwR2VESmVyZWMKHvKUJBDuNCqacEcRlapetCXHKRb0Js09
            sqxLfEDwiN2LQQjYHZOmnMfCOt/b2rwXVKEHdTcIsXbdIdKOJwuAIQ==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBEeU01UTc2V1UyZXRadE5I
            VE1aakVZUEZUNnJxbzJ1K3J1R3ZQdFdMbUhBCjZBMDM3ZkYvQWlyNHBtaDZRWkd4
            VzY0L3l4N2RNZjJRTDJWZTZyZVhHbW8KLS0tIGVNZ0N0emVmaVRCV09jNmVKRlla
            cWVSNkJqWHh5c21KcWFac2FlZTVaMTAK1UvfPgZAZYtwiONKIAo5HlaDpN+UT/S/
            JfPUfjxgRQid8P20Eh/jUepxrDY8iXRZdsUMON+OoQ8mpwoAh5eN1A==
            -----END AGE ENCRYPTED FILE-----
    lastmodified: "2025-05-15T18:56:55Z"
    mac: ENC[AES256_GCM,data:J2kHY7pXBJZ0UuNCZOhkU11M8rDqCYNzY71NyuDRmzzRCC9ZiNIbavyQAWj2Dpk1pjGsYjXsVoZvP7ti1wTFqahpaR/YWI5gmphrzAe32b9qFVEWTC3YTnmItnY0YxQZYehYghspBjnJtfUK0BvZxSb17egpoFnvHmAq+u5dyxg=,iv:/aLg02RLuJZ1bRzZfOD74pJuE7gppCBztQvUEt557mU=,tag:toxHHBuv3WRblyc9Sth6Iw==,type:str]
    unencrypted_suffix: _unencrypted
    version: 3.10.2
--- a/secrets/monitoring01/pve-exporter.yaml
+++ b/secrets/monitoring01/pve-exporter.yaml
@@ -1,33 +0,0 @@
 default:
    user: ENC[AES256_GCM,data:4Zzjm6/e8GCKSPNivnY=,iv:Y3gR+JSH/GLYvkVu3CN4T/chM5mjGjwVPI0iMB4p1t4=,tag:auyG8iWsd/YGjDnnTC21Ew==,type:str]
    password: ENC[AES256_GCM,data:9cyM9U8VnzXBBA==,iv:YMHNNUoQ9Az5+81Df07tjC+LaEWPHV6frUjd4PZrQOs=,tag:3hKR+BhLJODJp19nn4ppkA==,type:str]
    verify_ssl: ENC[AES256_GCM,data:Cu5Ucf0=,iv:QFfdV7gDBQ+L2kSZZqlVqCrn9CRg5RNG5DNTFWtVf5Y=,tag:u24ZbpWA65wj3WOwqU1v+g==,type:bool]
 sops:
    kms: []
    gcp_kms: []
    azure_kv: []
    hc_vault: []
    age:
        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuUXdMMG5YaHRJbThQZW9u
            RHVBbXFiSHNiUWdLTDdPajIyQjN3OGR0dGpzCm9ZVkdNWjhBakU3dVdhRU9kbU81
            aDlCNzJBQ1hvQ3FnTUk2N2RWQkZpUUEKLS0tIEZacTNqa3FWc2p1NXVtRWhwVExj
            cUJtYXNjb2Z4QkF4MjlidEZxSUFNa3MKAGHGksPc9oJheSlUQ3ARK5MuR5NFbPmD
            kmSDSgRmzbarxT8eJnK8/K4ii3hX5E9vGOohUkyc03w4ENsh/dw43g==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBOVGhvdGE5Mzl0ckhBM21D
            RXJwb09OS25PMGViblViM21wTVZiZWhtWmhFCnAzL1NqeUVyOGZFVDFvdXFPbklQ
            ZkJPWDVIdUdCdjZGUjcrcmtvak5CWG8KLS0tIDhLUHJNN2VqNy9CdVh0K0N0b0k1
            RUE4U0E0aGxiRkF0NWdwSEIrQTU4MjgKeOU6bIWO6ke9YcG+1E3brnC21sSQxZ9b
            SiG2QEnFnTeJ5P50XQoYHqUY3B0qx7nDLvyzatYEi6sDkfLXhmHGbw==
            -----END AGE ENCRYPTED FILE-----
    lastmodified: "2024-12-03T16:25:12Z"
    mac: ENC[AES256_GCM,data:gemq8YpMZQC+gY7lmMM3tfZh9XxL40qdGlLiB2CD4SIG49w0V6E/vY7xygt0WW0zHbhMI9yUIqlRc/PaXn+QfyxJEr3IjaT05rrWUqQAeRP9Zss74Y3NtQehh8fM8SgeyU4j2CQ9f9B/lW9IgdOW/TNgQZVXGg1vXZPEzl7AZ4A=,iv:LG5ojv3hAqk+EvFa/xEn43MBqL457uKFDE3dG5lSgZo=,tag:AxzcUzmdhO411Sw7Vg1itA==,type:str]
    pgp: []
    unencrypted_suffix: _unencrypted
    version: 3.9.1
--- a/secrets/nix-cache01/actions_token_1
+++ b/secrets/nix-cache01/actions_token_1
@@ -1,19 +0,0 @@
 {
 	"data": "ENC[AES256_GCM,data:P84qHFU+xQjwQGK8I1gIdcBsHrskuUg0M1nGMMaA+hFjAdFYUhdhmAN/+y0CO28=,iv:zJtk01zNMTBDQdVtZBTM34CHRaNYDkabolxh7PWGKUI=,tag:8AS80AbZJbh9B3Av3zuI1w==,type:str]",
 	"sops": {
 		"age": [
 			{
 				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBkRFB6QTIyWWdwVkV4ZXNB\nWkdSdEhMc0s4cnByWVZXTGhnSWZ0MTdEUWhJCnFlOFQ5TU1hcE91azVyZXVXRCtu\nZjIxalRLYlEreGZ6ZDNoeXNPaFN4b28KLS0tIHY5WVFXN1k4NFVmUjh6VURkcEpv\ncklGcWVhdTdBRnlOdm1qM2h5SS9UUkEKq2RyxSVymDqcsZ+yiNRujDCwk1WOWYRW\nDa4TRKg3FCe7TcCEPkIaev1aBqjLg9J9c/70SYpUm6Zgeps7v5yl3A==\n-----END AGE ENCRYPTED FILE-----\n"
 			},
 			{
 				"recipient": "age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSArTGVuckp2NlhMZXRNMVhO\naUV3K0h3cmZ5ZGx4Q3dJWHNqZXFJeE1kM0dFCmF4TUFUMm9mTHJlYzlYWVhNa1RH\nR29VNDIrL1IvYUpQYm5SZEYzbWhhbkkKLS0tIEJsK1dwZVdaaHpWQkpOOS90dkhx\nbGhvRXhqdFdqQmhZZmhCdmw4NUtSVG8K3z2do+/cIjAqg6EMJnubOWid1sMeTxvo\nrq6eGJ7YzdgZr2JBVtJdDRtk/KeHXu9In4efbBXwLAPIfn1pU0gm1w==\n-----END AGE ENCRYPTED FILE-----\n"
 			}
 		],
 		"lastmodified": "2025-08-21T19:08:48Z",
 		"mac": "ENC[AES256_GCM,data:5CkO09NIqttb4UZPB9iGym8avhTsMeUkTFTKZJlNGjgB1qWyGQNeKCa50A1+SbBCCWE5EwxoynB1so7bi8vnq7k8CPUHbiWG8rLOJSYHQcZ9Tu7ZGtpeWPcCw1zPWJ/PTBsFVeaT5/ufdx/6ut+sTtRoKHOZZtO9oStHmu/Rlfg=,iv:z9iJJlbvhgxJaART5QoCrqvrqlgoVlGj8jlndCALmKU=,tag:ldjmND4NVVQrHUldLrB4Jg==,type:str]",
 		"unencrypted_suffix": "_unencrypted",
 		"version": "3.10.2"
 	}
 }
--- a/secrets/nix-cache01/cache-secret
+++ b/secrets/nix-cache01/cache-secret
@@ -1,19 +0,0 @@
 {
 	"data": "ENC[AES256_GCM,data:MQkR6FQGHK2AuhOmy2was49RY2XlLO5NwaXnUFzFo5Ata/2ufVoAj4Jvotw/dSrKL7f62A6s+2BPAyWrvACJ+pwYFlfyj3T9bNwhxwZPkEmiHEubJjWSiD6jkSW0gOxbY8ib6g/GbyF8I1cPeYr/hJD5qQ==,iv:eBL2Y3MOt9gYTETUZqsHo1D5hPOHxb4JR6Z/DFlzzqI=,tag:Qqbt39xZvQz/QhsggsArsw==,type:str]",
 	"sops": {
 		"age": [
 			{
 				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAwZzFXaEsyUkZGNFV0bVlW\nRkpPRHpUK2VwUHpOQXZCUUpoVzFGa3hycnhvCndTN0toVFdoU2E5N3V3UFhTTjU0\nNDByWTkrV0o3T295dE0zS08rVGpyQjAKLS0tIC96M0VEcWpjRk5DMjJnMFB4ZHI3\nM2Jod2x4ZzMyZm1pbDhZNTFuWGNRUlEKHs5jBSfjml09JOeKiT9vFR0Fykg6OxKG\njhFU/J2+fWB22G7dBc4PI60SNqhxIheUbGTdcz4Yp4BPL6vW3eArIw==\n-----END AGE ENCRYPTED FILE-----\n"
 			},
 			{
 				"recipient": "age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq",
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBJT3lxamcrQUpFdjZteFlF\nYUQ3aGdadGpuNXd2Z3RtZ3dQU0cvMlFUMUNRClBDR3U0OXZJU0NDamVMSlR5NitN\nYlhvNVlvUE0wRjErYzkwVHFOdGVCVjgKLS0tIEttR1BLTGpDYTRSQ0lUZmVEcnNi\nWkNaMEViUHVBcExVOEpjNE5CZHpjVkEKuX/Rf8kaB3apr1UhAnq3swS6fXiVmwm8\n7Key+SUAPNstbWbz0u6B9m1ev5QcXB2lx2/+Cm7cjW+6VE2gLHjTsQ==\n-----END AGE ENCRYPTED FILE-----\n"
 			}
 		],
 		"lastmodified": "2025-01-24T12:19:16Z",
 		"mac": "ENC[AES256_GCM,data:X8X91LVP1MMJ8ZYeSNPRO6XHN+NuswLZcHpAkbvoY+E9aTteO8UqS+fsStbNDlpF5jz/mhdMsKElnU8Z/CIWImwolI4GGE6blKy6gyqRkn4VeZotUoXcJadYV/5COud3XP2uSTb694JyQEZnBXFNeYeiHpN0y38zLxoX8kXHFbc=,iv:fFCRfv+Y1Nt2zgJNKsxElrYcuKkATJ3A/jvheUY2IK4=,tag:hYojbMGUAQvx7I4qkO7o9w==,type:str]",
 		"unencrypted_suffix": "_unencrypted",
 		"version": "3.9.3"
 	}
 }
--- a/secrets/secrets.yaml
+++ b/secrets/secrets.yaml
@@ -1,109 +0,0 @@
 root_password_hash: ENC[AES256_GCM,data:wk/xEuf+qU3ezmondq9y3OIotXPI/L+TOErTjgJz58wEvQkApYkjc3bHaUTzOrmWjQBgDUENObzPmvQ8WKawUSJRVlpfOEr5TQ==,iv:I8Z3xJz3qoXBD7igx087A1fMwf8d29hQ4JEI3imRXdY=,tag:M80osQeWGG9AAA8BrMfhHA==,type:str]
 ns_xfer_key: ENC[AES256_GCM,data:VFpK7GChgFeUgQm31tTvVC888bN0yt6BAnHQa6KUTg4iZGP1WL5Bx6Zp8dY=,iv:9RF1eEc7JBxBebDOKfcDjGS2U7XsHkOW/l52yIP+1LA=,tag:L6DR2QlHOfo02kzfWWCrvg==,type:str]
 backup_helper_secret: ENC[AES256_GCM,data:EvXEJnDilbfALQ==,iv:Q3dkZ8Ee3qbcjcoi5GxfbaVB4uRIvkIB6ioKVV/dL2Y=,tag:T/UgZvQgYGa740Wh7D0b7Q==,type:str]
 nats_nkey: ENC[AES256_GCM,data:N2CVXjdwiE7eSPUtXe+NeKSTzA9eFwK2igxaCdYsXd4Ps0/DjYb/ggnQziQzSy8viESZYjXhJ2VtNw==,iv:Xhcf5wPB01Wu0A+oMw0wzTEHATp+uN+wsaYshxIzy1w=,tag:IauTIOHqfiM75Ufml/JXbg==,type:str]
 sops:
    age:
        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuWXhzQWFmeCt1R05jREcz
            Ui9HZFN5dkxHNVE0RVJGZUJUa3hKK2sxdkhBCktYcGpLeGZIQzZIV3ZZWGs3YzF1
            T09sUEhPWkRkOWZFWkltQXBlM1lQV1UKLS0tIERRSlRUYW5QeW9TVjJFSmorOWNI
            ZytmaEhzMjVhRXI1S0hielF0NlBrMmcK4I1PtSf7tSvSIJxWBjTnfBCO8GEFHbuZ
            BkZskr5fRnWUIs72ZOGoTAVSO5ZNiBglOZ8YChl4Vz1U7bvdOCt0bw==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQcXM0RHlGcmZrYW4yNGZs
            S1ZqQzVaYmQ4MGhGaTFMUVIwOTk5K0tZZjB3ClN0QkhVeHRrNXZHdmZWMzFBRnJ6
            WTFtaWZyRmx2TitkOXkrVkFiYVd3RncKLS0tIExpeGUvY1VpODNDL2NCaUhtZkp0
            cGNVZTI3UGxlNWdFWVZMd3FlS3pDR3cKBulaMeonV++pArXOg3ilgKnW/51IyT6Z
            vH9HOJUix+ryEwDIcjv4aWx9pYDHthPFZUDC25kLYG91WrJFQOo2oA==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBabTdsZWxZQjV2TGx2YjNM
            ZTgzWktqTjY0S0M3bFpNZXlDRDk5TSt3V2k0CjdWWTN0TlRlK1RpUm9xYW03MFFG
            aWN4a3o4VUVnYzBDd2FrelUraWtrMTAKLS0tIE1vTGpKYkhzcWErWDRreml2QmE2
            ZkNIWERKb1drdVR6MTBSTnVmdm51VEkKVNDYdyBSrUT7dUn6a4eF7ELQ2B2Pk6V9
            Z5fbT75ibuyX1JO315/gl2P/FhxmlRW1K6e+04gQe2R/t/3H11Q7YQ==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBVSFhDOFRVbnZWbVlQaG5G
            U0NWekU0NzI1SlpRN0NVS1hPN210MXY3Z244CmtFemR5OUpzdlBzMHBUV3g0SFFo
            eUtqNThXZDJ2b01yVVVuOFdwQVo2Qm8KLS0tIHpXRWd3OEpPRkpaVDNDTEJLMWEv
            ZlZtaFpBdzF0YXFmdjNkNUR3YkxBZU0KAub+HF/OBZQR9bx/SVadZcL6Ms+NQ7yq
            21HCcDTWyWHbN4ymUrIYXci1A/0tTOrQL9Mkvaz7IJh4VdHLPZrwwA==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBWkhBL1NTdjFDeEhQcEgv
            Z3c3Z213L2ZhWGo0Qm5Zd1A1RTBDY3plUkh3CkNWV2ZtNWkrUjB0eWFzUlVtbHlk
            WTdTQjN4eDIzY0c0dyt6ajVXZ0krd1UKLS0tIHB4aEJqTTRMenV3UkFkTGEySjQ2
            YVM1a3ZPdUU4T244UU0rc3hVQ3NYczQK10wug4kTjsvv/iOPWi5WrVZMOYUq4/Mf
            oXS4sikXeUsqH1T2LUBjVnUieSneQVn7puYZlN+cpDQ0XdK/RZ+91A==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBYcEtHbjNWRkdodUxYdHRn
            MDBMU08zWDlKa0Z4cHJvc28rZk5pUjhnMjE0CmdzRmVGWDlYQ052Wm1zWnlYSFV6
            dURQK3JSbThxQlg3M2ZaL1hGRzVuL0UKLS0tIEI3UGZvbEpvRS9aR2J2Tnc1YmxZ
            aUY5Q2MrdHNQWDJNaGt5MWx6MVRrRVEKRPxyAekGHFMKs0Z6spVDayBA4EtPk18e
            jiFc97BGVtC5IoSu4icq3ZpKOdxymnkqKEt0YP/p/JTC+8MKvTJFQw==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQL3ZMUkI1dUV1T2tTSHhn
            SjhyQ3dKTytoaDBNcit1VHpwVGUzWVNpdjBnCklYZWtBYzBpcGxZSDBvM2tIZm9H
            bTFjb1ZCaDkrOU1JODVBVTBTbmxFbmcKLS0tIGtGcS9kejZPZlhHRXI5QnI5Wm9Q
            VjMxTDdWZEltWThKVDl0S24yWHJxZHcKgzH79zT2I7ZgyTbbbvIhLN/rEcfiomJH
            oSZDFvPiXlhPgy8bRyyq3l47CVpWbUI2Y7DFXRuODpLUirt3K3TmCA==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBPcm9zUm1XUkpLWm1Jb3Uw
            RncveGozOW5SRThEM1Y4SFF5RDdxUEhZTUE4CjVESHE5R3JZK0krOXZDL0RHR0oy
            Z3JKaEpydjRjeFFHck1ic2JTRU5yZTQKLS0tIGY2ck56eG95YnpDYlNqUDh5RVp1
            U3dRYkNleUtsQU1LMWpDbitJbnRIem8K+27HRtZihG8+k7ZC33XVfuXDFjC1e8lA
            kffmxp9kOEShZF3IKmAjVHFBiPXRyGk3fGPyQLmSMK2UOOfCy/a/qA==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBTZHlldDdSOEhjTklCSXQr
            U2pXajFwZnNqQzZOTzY5b3lkMzlyREhXRWo4CmxId2F6NkNqeHNCSWNrcUJIY0Nw
            cGF6NXJaQnovK1FYSXQ2TkJSTFloTUEKLS0tIHRhWk5aZ0lDVkZaZEJobm9FTDNw
            a29sZE1GL2ZQSk0vUEc1ZGhkUlpNRkEK9tfe7cNOznSKgxshd5Z6TQiNKp+XW6XH
            VvPgMqMitgiDYnUPj10bYo3kqhd0xZH2IhLXMnZnqqQ0I23zfPiNaw==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB5bk9NVjJNWmMxUGd3cXRx
            amZ5SWJ3dHpHcnM4UHJxdmh6NnhFVmJQdldzCm95dHN3R21qSkE4Vm9VTnVPREp3
            dUQyS1B4MWhhdmd3dk5LQ0htZEtpTWMKLS0tIGFaa3MxVExFYk1MY2loOFBvWm1o
            L0NoRStkeW9VZVdpWlhteC8yTnRmMUkKMYjUdE1rGgVR29FnhJ5OEVjTB1Rh5Mtu
            M/DvlhW3a7tZU8nDF3IgG2GE5xOXZMDO9QWGdB8zO2RJZAr3Q+YIlA==
            -----END AGE ENCRYPTED FILE-----
        - recipient: age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
          enc: |
            -----BEGIN AGE ENCRYPTED FILE-----
            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBU0xYMnhqOE0wdXdleStF
            THcrY2NBQzNoRHdYTXY3ZmM5YXRZZkQ4aUZnCm9ad0IxSWxYT1JBd2RseUdVT1pi
            UXBuNzFxVlN0OWNTQU5BV2NiVEV0RUUKLS0tIGJHY0dzSDczUzcrV0RpTjE0czEy
            cWZMNUNlTzBRcEV5MjlRV1BsWGhoaUUKGhYaH8I0oPCfrbs7HbQKVOF/99rg3HXv
            RRTXUI71/ejKIuxehOvifClQc3nUW73bWkASFQ0guUvO4R+c0xOgUg==
            -----END AGE ENCRYPTED FILE-----
    lastmodified: "2025-02-11T21:18:22Z"
    mac: ENC[AES256_GCM,data:5//boMp1awc/2XAkSASSCuobpkxa0E6IKf3GR8xHpMoCD30FJsCwV7PgX3fR8OuLEhOJ7UguqMNQdNqG37RMacreuDmI1J8oCFKp+3M2j4kCbXaEo8bw7WAtyjUez+SAXKzZWYmBibH0KOy6jdt+v0fdgy5hMBT4IFDofYRsyD0=,iv:6pD+SLwncpmal/FR4U8It2njvaQfUzzpALBCxa0NyME=,tag:4QN8ZFjdqck5ZgulF+FtbA==,type:str]
    unencrypted_suffix: _unencrypted
    version: 3.9.4
--- a/services/ca/default.nix
+++ b/services/ca/default.nix
@@ -1,169 +0,0 @@
 { pkgs, unstable, ... }:
 {
  homelab.monitoring.scrapeTargets = [{
    job_name = "step-ca";
    port = 9000;
  }];
  sops.secrets."ca_root_pw" = {
    sopsFile = ../../secrets/ca/secrets.yaml;
    owner = "step-ca";
    path = "/var/lib/step-ca/secrets/ca_root_pw";
  };
  sops.secrets."intermediate_ca_key" = {
    sopsFile = ../../secrets/ca/keys/intermediate_ca_key;
    format = "binary";
    owner = "step-ca";
    path = "/var/lib/step-ca/secrets/intermediate_ca_key";
  };
  sops.secrets."root_ca_key" = {
    sopsFile = ../../secrets/ca/keys/root_ca_key;
    format = "binary";
    owner = "step-ca";
    path = "/var/lib/step-ca/secrets/root_ca_key";
  };
  sops.secrets."ssh_host_ca_key" = {
    sopsFile = ../../secrets/ca/keys/ssh_host_ca_key;
    format = "binary";
    owner = "step-ca";
    path = "/var/lib/step-ca/secrets/ssh_host_ca_key";
  };
  sops.secrets."ssh_user_ca_key" = {
    sopsFile = ../../secrets/ca/keys/ssh_user_ca_key;
    format = "binary";
    owner = "step-ca";
    path = "/var/lib/step-ca/secrets/ssh_user_ca_key";
  };
  services.step-ca = {
    enable = true;
    package = pkgs.step-ca;
    intermediatePasswordFile = "/var/lib/step-ca/secrets/ca_root_pw";
    address = "0.0.0.0";
    port = 443;
    settings = {
      metricsAddress = ":9000";
      authority = {
        provisioners = [
          {
            claims = {
              enableSSHCA = true;
              maxTLSCertDuration = "3600h";
              defaultTLSCertDuration = "48h";
            };
            encryptedKey = "eyJhbGciOiJQQkVTMi1IUzI1NitBMTI4S1ciLCJjdHkiOiJqd2sranNvbiIsImVuYyI6IkEyNTZHQ00iLCJwMmMiOjYwMDAwMCwicDJzIjoiY1lWOFJPb3lteXFLMWpzcS1WM1ZXQSJ9.WS8tPK-Q4gtnSsw7MhpTzYT_oi-SQx-CsRLh7KwdZnpACtd4YbcOYg.zeyDkmKRx8BIp-eB.OQ8c-KDW07gqJFtEMqHacRBkttrbJRRz0sYR47vQWDCoWhodaXsxM_Bj2pGvUrR26ij1t7irDeypnJoh6WXvUg3n_JaIUL4HgTwKSBrXZKTscXmY7YVmRMionhAb6oS9Jgus9K4QcFDHacC9_WgtGI7dnu3m0G7c-9Ur9dcDfROfyrnAByJp1rSZMzvriQr4t9bNYjDa8E8yu9zq6aAQqF0Xg_AxwiqYqesT-sdcfrxKS61appApRgPlAhW-uuzyY0wlWtsiyLaGlWM7WMfKdHsq-VqcVrI7Gi2i77vi7OqPEberqSt8D04tIri9S_sArKqWEDnBJsL07CC41IY.CqtYfbSa_wlmIsKgNj5u7g";
            key = {
              alg = "ES256";
              crv = "P-256";
              kid = "CIjtIe7FNhsNQe1qKGD9Rpj-lrf2ExyTYCXAOd3YDjE";
              kty = "EC";
              use = "sig";
              x = "XRMX-BeobZ-R5-xb-E9YlaRjJUfd7JQxpscaF1NMgFo";
              y = "bF9xLp5-jywRD-MugMaOGbpbniPituWSLMlXRJnUUl0";
            };
            name = "ca@home.2rjus.net";
            type = "JWK";
          }
          {
            name = "acme";
            type = "ACME";
            claims = {
              maxTLSCertDuration = "3600h";
              defaultTLSCertDuration = "1800h";
            };
          }
          {
            claims = {
              enableSSHCA = true;
            };
            name = "sshpop";
            type = "SSHPOP";
          }
        ];
      };
      crt = "/var/lib/step-ca/certs/intermediate_ca.crt";
      db = {
        badgerFileLoadingMode = "";
        dataSource = "/var/lib/step-ca/db";
        type = "badgerv2";
      };
      dnsNames = [
        "ca.home.2rjus.net"
        "10.69.13.12"
      ];
      federatedRoots = null;
      insecureAddress = "";
      key = "/var/lib/step-ca/secrets/intermediate_ca_key";
      logger = {
        format = "text";
      };
      root = "/var/lib/step-ca/certs/root_ca.crt";
      ssh = {
        hostKey = "/var/lib/step-ca/secrets/ssh_host_ca_key";
        userKey = "/var/lib/step-ca/secrets/ssh_user_ca_key";
      };
      templates = {
        ssh = {
          host = [
            {
              comment = "#";
              name = "sshd_config.tpl";
              path = "/etc/ssh/sshd_config";
              requires = [
                "Certificate"
                "Key"
              ];
              template = ./templates/ssh/sshd_config.tpl;
              type = "snippet";
            }
            {
              comment = "#";
              name = "ca.tpl";
              path = "/etc/ssh/ca.pub";
              template = ./templates/ssh/ca.tpl;
              type = "snippet";
            }
          ];
          user = [
            {
              comment = "#";
              name = "config.tpl";
              path = "~/.ssh/config";
              template = ./templates/ssh/config.tpl;
              type = "snippet";
            }
            {
              comment = "#";
              name = "step_includes.tpl";
              path = "\${STEPPATH}/ssh/includes";
              template = ./templates/ssh/step_includes.tpl;
              type = "prepend-line";
            }
            {
              comment = "#";
              name = "step_config.tpl";
              path = "ssh/config";
              template = ./templates/ssh/step_config.tpl;
              type = "file";
            }
            {
              comment = "#";
              name = "known_hosts.tpl";
              path = "ssh/known_hosts";
              template = ./templates/ssh/known_hosts.tpl;
              type = "file";
            }
          ];
        };
      };
      tls = {
        cipherSuites = [
          "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256"
          "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"
        ];
        maxVersion = 1.3;
        minVersion = 1.2;
        renegotiation = false;
      };
    };
  };
 }
--- a/services/ca/templates/ssh/ca.tpl
+++ b/services/ca/templates/ssh/ca.tpl
--- a/services/ca/templates/ssh/config.tpl
+++ b/services/ca/templates/ssh/config.tpl
@@ -1,14 +0,0 @@
 Host *
 {{- if or .User.GOOS "none" | eq "windows" }}
 {{- if .User.StepBasePath }}
 	Include "{{ .User.StepBasePath | replace "\\" "/" | trimPrefix "C:" }}/ssh/includes"
 {{- else }}
 	Include "{{ .User.StepPath | replace "\\" "/" | trimPrefix "C:" }}/ssh/includes"
 {{- end }}
 {{- else }}
 {{- if .User.StepBasePath }}
 	Include "{{.User.StepBasePath}}/ssh/includes"
 {{- else }}
 	Include "{{.User.StepPath}}/ssh/includes"
 {{- end }}
 {{- end }}
--- a/services/ca/templates/ssh/known_hosts.tpl
+++ b/services/ca/templates/ssh/known_hosts.tpl
@@ -1,4 +0,0 @@
@cert-authority * {{.Step.SSH.HostKey.Type}} {{.Step.SSH.HostKey.Marshal | toString | b64enc}}
 {{- range .Step.SSH.HostFederatedKeys}}
@cert-authority * {{.Type}} {{.Marshal | toString | b64enc}}
 {{- end }}
--- a/services/ca/templates/ssh/sshd_config.tpl
+++ b/services/ca/templates/ssh/sshd_config.tpl
@@ -1,4 +0,0 @@
 Match all
 	TrustedUserCAKeys /etc/ssh/ca.pub
 	HostCertificate /etc/ssh/{{.User.Certificate}}
 	HostKey /etc/ssh/{{.User.Key}}
--- a/services/ca/templates/ssh/step_config.tpl
+++ b/services/ca/templates/ssh/step_config.tpl
@@ -1,11 +0,0 @@
 Match exec "step ssh check-host{{- if .User.Context }} --context {{ .User.Context }}{{- end }} %h"
 {{- if .User.User }}
 	User {{.User.User}}
 {{- end }}
 {{- if or .User.GOOS "none" | eq "windows" }}
 	UserKnownHostsFile "{{.User.StepPath}}\ssh\known_hosts"
 	ProxyCommand C:\Windows\System32\cmd.exe /c step ssh proxycommand{{- if .User.Context }} --context {{ .User.Context }}{{- end }}{{- if .User.Provisioner }} --provisioner {{ .User.Provisioner }}{{- end }} %r %h %p
 {{- else }}
 	UserKnownHostsFile "{{.User.StepPath}}/ssh/known_hosts"
 	ProxyCommand step ssh proxycommand{{- if .User.Context }} --context {{ .User.Context }}{{- end }}{{- if .User.Provisioner }} --provisioner {{ .User.Provisioner }}{{- end }} %r %h %p
 {{- end }}
--- a/services/ca/templates/ssh/step_includes.tpl
+++ b/services/ca/templates/ssh/step_includes.tpl
@@ -1 +0,0 @@
 {{- if or .User.GOOS "none" | eq "windows" }}Include "{{ .User.StepPath | replace "\\" "/" | trimPrefix "C:" }}/ssh/config"{{- else }}Include "{{.User.StepPath}}/ssh/config"{{- end }}
--- a/services/http-proxy/proxy.nix
+++ b/services/http-proxy/proxy.nix
@@ -5,7 +5,7 @@
    package = pkgs.unstable.caddy;
    configFile = pkgs.writeText "Caddyfile" ''
      {
-        acme_ca https://ca.home.2rjus.net/acme/acme/directory
+        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
        metrics {
          per_host
--- a/services/kanidm/default.nix
+++ b/services/kanidm/default.nix
@@ -0,0 +1,65 @@
 { config, lib, pkgs, ... }:
 {
  services.kanidm = {
    package = pkgs.kanidmWithSecretProvisioning_1_8;
    enableServer = true;
    serverSettings = {
      domain = "home.2rjus.net";
      origin = "https://auth.home.2rjus.net";
      bindaddress = "0.0.0.0:443";
      ldapbindaddress = "0.0.0.0:636";
      tls_chain = "/var/lib/acme/auth.home.2rjus.net/fullchain.pem";
      tls_key = "/var/lib/acme/auth.home.2rjus.net/key.pem";
      online_backup = {
        path = "/var/lib/kanidm/backups";
        schedule = "00 22 * * *";
        versions = 7;
      };
    };
    # Provision base groups only - users are managed via CLI
    # See docs/user-management.md for details
    provision = {
      enable = true;
      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
      groups = {
        admins = { };
        users = { };
        ssh-users = { };
      };
      # Regular users (persons) are managed imperatively via kanidm CLI
    };
  };
  # Grant kanidm access to ACME certificates
  users.users.kanidm.extraGroups = [ "acme" ];
  # ACME certificate from internal CA
  # Include both the CNAME (auth) and A record (kanidm01) for Prometheus scraping
  security.acme.certs."auth.home.2rjus.net" = {
    listenHTTP = ":80";
    reloadServices = [ "kanidm" ];
    extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
  };
  # Vault secret for idm_admin password (used for provisioning)
  vault.secrets.kanidm-idm-admin = {
    secretPath = "kanidm/idm-admin-password";
    extractKey = "password";
    services = [ "kanidm" ];
    owner = "kanidm";
    group = "kanidm";
  };
  # Note: Kanidm does not expose Prometheus metrics
  # If metrics support is added in the future, uncomment:
  # homelab.monitoring.scrapeTargets = [
  #   {
  #     job_name = "kanidm";
  #     port = 443;
  #     scheme = "https";
  #   }
  # ];
 }
--- a/services/monitoring/alloy.nix
+++ b/services/monitoring/alloy.nix
@@ -1,41 +0,0 @@
 { ... }:
 {
  services.alloy = {
    enable = true;
  };
  environment.etc."alloy/config.alloy" = {
    enable = true;
    mode = "0644";
    text = ''
      pyroscope.write "local_pyroscope" {
        endpoint {
          url = "http://localhost:4040"
        }
      }
      pyroscope.scrape "labmon" {
        targets    = [{"__address__" = "localhost:9969", "service_name" = "labmon"}]
        forward_to = [pyroscope.write.local_pyroscope.receiver]
        profiling_config {
          profile.process_cpu {
            enabled = true
          }
          profile.memory {
            enabled = true
          }
          profile.mutex {
            enabled = true
          }
          profile.block {
            enabled = true
          }
          profile.goroutine {
            enabled = true
          }
        }
      }
    '';
  };
 }
--- a/services/monitoring/default.nix
+++ b/services/monitoring/default.nix
@@ -7,7 +7,6 @@
    ./pve.nix
    ./alerttonotify.nix
    ./pyroscope.nix
-    ./alloy.nix
    ./tempo.nix
  ];
 }
--- a/services/monitoring/prometheus.nix
+++ b/services/monitoring/prometheus.nix
@@ -121,22 +121,20 @@ in
    scrapeConfigs = [
      # Auto-generated node-exporter targets from flake hosts + external
+      # Each static_config entry may have labels from homelab.host metadata
      {
        job_name = "node-exporter";
-        static_configs = [
-          {
-            targets = nodeExporterTargets;
-          }
-        ];
+        static_configs = nodeExporterTargets;
      }
      # Systemd exporter on all hosts (same targets, different port)
+      # Preserves the same label grouping as node-exporter
      {
        job_name = "systemd-exporter";
-        static_configs = [
-          {
-            targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
-          }
-        ];
+        static_configs = map
+          (cfg: cfg // {
+            targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+          })
+          nodeExporterTargets;
      }
      # Local monitoring services (not auto-generated)
      {
@@ -180,14 +178,6 @@ in
          }
        ];
      }
-      {
-        job_name = "labmon";
-        static_configs = [
-          {
-            targets = [ "monitoring01.home.2rjus.net:9969" ];
-          }
-        ];
-      }
      # TODO: nix-cache_caddy can't be auto-generated because the cert is issued
      # for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
      # Consider adding a target override to homelab.monitoring.scrapeTargets.
--- a/services/monitoring/rules.yml
+++ b/services/monitoring/rules.yml
@@ -17,8 +17,9 @@ groups:
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk space is low on {{ $labels.instance }}. Please check."
+      # Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
      - alert: high_cpu_load
-        expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+        expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
        for: 15m
        labels:
          severity: warning
@@ -26,7 +27,7 @@ groups:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is high on {{ $labels.instance }}. Please check."
      - alert: high_cpu_load
-        expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+        expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
        for: 2h
        labels:
          severity: warning
@@ -115,8 +116,9 @@ groups:
        annotations:
          summary: "NSD not running on {{ $labels.instance }}"
          description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
+      # Only alert on primary DNS (secondary has cold cache after failover)
      - alert: unbound_low_cache_hit_ratio
-        expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
+        expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
        for: 15m
        labels:
          severity: warning
@@ -336,40 +338,6 @@ groups:
        annotations:
          summary: "Pyroscope service not running on {{ $labels.instance }}"
          description: "Pyroscope service not running on {{ $labels.instance }}"
-  - name: certificate_rules
-    rules:
-      - alert: certificate_expiring_soon
-        expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "TLS certificate expiring soon for {{ $labels.instance }}"
-          description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
-      - alert: step_ca_serving_cert_expiring
-        expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Step-CA serving certificate expiring"
-          description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
-      - alert: certificate_check_error
-        expr: labmon_tlsconmon_certificate_check_error == 1
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Error checking certificate for {{ $labels.address }}"
-          description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
-      - alert: step_ca_certificate_expiring
-        expr: labmon_stepmon_certificate_seconds_left < 3600
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Step-CA certificate expiring for {{ $labels.instance }}"
-          description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
  - name: proxmox_rules
    rules:
      - alert: pve_node_down
@@ -388,32 +356,6 @@ groups:
        annotations:
          summary: "Proxmox VM {{ $labels.id }} is stopped"
          description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."
-  - name: postgres_rules
-    rules:
-      - alert: postgres_down
-        expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "PostgreSQL not running on {{ $labels.instance }}"
-          description: "PostgreSQL has been down on {{ $labels.instance }} more than 5 minutes."
-      - alert: postgres_exporter_down
-        expr: up{job="postgres"} == 0
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "PostgreSQL exporter down on {{ $labels.instance }}"
-          description: "Cannot scrape PostgreSQL metrics from {{ $labels.instance }}."
-      - alert: postgres_high_connections
-        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "PostgreSQL connection pool near exhaustion on {{ $labels.instance }}"
-          description: "PostgreSQL is using over 80% of max_connections on {{ $labels.instance }}."
  - name: jellyfin_rules
    rules:
      - alert: jellyfin_down
--- a/services/nix-cache/proxy.nix
+++ b/services/nix-cache/proxy.nix
@@ -5,7 +5,7 @@
    package = pkgs.unstable.caddy;
    configFile = pkgs.writeText "Caddyfile" ''
      {
-        acme_ca https://ca.home.2rjus.net/acme/acme/directory
+        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
        metrics
      }
--- a/services/ns/resolver.nix
+++ b/services/ns/resolver.nix
@@ -45,7 +45,11 @@
      };
      stub-zone = {
        name = "home.2rjus.net";
-        stub-addr = "127.0.0.1@8053";
+        stub-addr = [
+          "127.0.0.1@8053"   # Local NSD
+          "10.69.13.5@8053"  # ns1
+          "10.69.13.6@8053"  # ns2
+        ];
      };
      forward-zone = {
        name = ".";
--- a/services/postgres/default.nix
+++ b/services/postgres/default.nix
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  imports = [
-    ./postgres.nix
-  ];
-}
--- a/services/postgres/postgres.nix
+++ b/services/postgres/postgres.nix
@@ -1,23 +0,0 @@
-{ pkgs, ... }:
-{
-  homelab.monitoring.scrapeTargets = [{
-    job_name = "postgres";
-    port = 9187;
-  }];
-  services.prometheus.exporters.postgres = {
-    enable = true;
-    runAsLocalSuperUser = true; # Use peer auth as postgres user
-  };
-  services.postgresql = {
-    enable = true;
-    enableJIT = true;
-    enableTCPIP = true;
-    extensions = ps: with ps; [ pgvector ];
-    authentication = ''
-      # Allow access to everything from gunter
-      host    all             all             10.69.30.105/32         scram-sha-256
-    '';
-  };
-}
--- a/system/acme.nix
+++ b/system/acme.nix
@@ -3,7 +3,7 @@
  security.acme = {
    acceptTerms = true;
    defaults = {
-      server = "https://ca.home.2rjus.net/acme/acme/directory";
+      server = "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";
      email = "root@home.2rjus.net";
      dnsPropagationCheck = false;
    };
--- a/system/default.nix
+++ b/system/default.nix
@@ -4,14 +4,16 @@
    ./acme.nix
    ./autoupgrade.nix
    ./homelab-deploy.nix
+    ./kanidm-client.nix
    ./monitoring
    ./motd.nix
    ./packages.nix
    ./nix.nix
+    ./pipe-to-loki.nix
    ./root-user.nix
    ./pki/root-ca.nix
    ./sops.nix
    ./sshd.nix
    ./vault-secrets.nix
    ./zram.nix
  ];
 }
--- a/system/kanidm-client.nix
+++ b/system/kanidm-client.nix
@@ -0,0 +1,42 @@
+{ lib, config, pkgs, ... }:
+let
+  cfg = config.homelab.kanidm;
+in
+{
+  options.homelab.kanidm = {
+    enable = lib.mkEnableOption "Kanidm PAM/NSS client for central authentication";
+    server = lib.mkOption {
+      type = lib.types.str;
+      default = "https://auth.home.2rjus.net";
+      description = "URI of the Kanidm server";
+    };
+    allowedLoginGroups = lib.mkOption {
+      type = lib.types.listOf lib.types.str;
+      default = [ "ssh-users" ];
+      description = "Groups allowed to log in via PAM";
+    };
+  };
+  config = lib.mkIf cfg.enable {
+    services.kanidm = {
+      package = pkgs.kanidm_1_8;
+      enablePam = true;
+      clientSettings = {
+        uri = cfg.server;
+      };
+      unixSettings = {
+        pam_allowed_login_groups = cfg.allowedLoginGroups;
+        # Use short names (torjus) instead of SPN format (torjus@home.2rjus.net)
+        # This prevents "PAM user mismatch" errors with SSH
+        uid_attr_map = "name";
+        gid_attr_map = "name";
+        # Create symlink /home/torjus -> /home/torjus@home.2rjus.net
+        home_alias = "name";
+      };
+    };
+  };
+}
--- a/Show More
+++ b/Show More
@@ -1 +0,0 @@
-{{- if or .User.GOOS "none" | eq "windows" }}Include "{{ .User.StepPath | replace "\\" "/" | trimPrefix "C:" }}/ssh/config"{{- else }}Include "{{.User.StepPath}}/ssh/config"{{- end }}