mcp: move config to .mcp.json.example, gitignore real config

The real .mcp.json now contains Loki credentials for basic auth, so it should not be committed. The example file has placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
promtail: fix vault secret ownership for loki auth
2026-02-17 20:35:14 +01:00 · 2026-02-17 20:17:02 +01:00 · 2026-02-17 20:13:22 +01:00 · 2026-02-17 20:10:37 +01:00 · 2026-02-17 20:00:08 +01:00 · 2026-02-17 19:48:06 +01:00
156 changed files with 10085 additions and 2103 deletions
--- a/.claude/agents/auditor.md
+++ b/.claude/agents/auditor.md
@@ -0,0 +1,180 @@
+---
+name: auditor
+description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
+tools: Read, Grep, Glob
+mcpServers:
+  - lab-monitoring
+---
+
+You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.
+
+## Input
+
+You may receive:
+- A host or list of hosts to investigate
+- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
+- Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
+- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")
+
+## Audit Log Structure
+
+Logs are shipped to Loki via promtail. Audit events use these labels:
+- `hostname` - hostname
+- `systemd_unit` - typically `auditd.service` for audit logs
+- `job` - typically `systemd-journal`
+
+Audit log entries contain structured data:
+- `EXECVE` - command execution with full arguments
+- `USER_LOGIN` / `USER_LOGOUT` - session start/end
+- `USER_CMD` - sudo command execution
+- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
+- `SERVICE_START` / `SERVICE_STOP` - systemd service events
+
+## Investigation Techniques
+
+### 1. SSH Session Activity
+
+Find SSH logins and session activity:
+```logql
+{hostname="<hostname>", systemd_unit="sshd.service"}
+```
+
+Look for:
+- Accepted/Failed authentication
+- Session opened/closed
+- Unusual source IPs or users
+
+### 2. Command Execution
+
+Query executed commands (filter out noise):
+```logql
+{hostname="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
+```
+
+Further filtering:
+- Exclude systemd noise: `!= "systemd" != "/nix/store"`
+- Focus on specific commands: `|= "rm" |= "-rf"`
+- Focus on specific user: `|= "uid=1000"`
+
+### 3. Sudo Activity
+
+Check for privilege escalation:
+```logql
+{hostname="<hostname>"} |= "sudo" |= "COMMAND"
+```
+
+Or via audit:
+```logql
+{hostname="<hostname>"} |= "USER_CMD"
+```
+
+### 4. Service Manipulation
+
+Check if services were manually stopped/started:
+```logql
+{hostname="<hostname>"} |= "EXECVE" |= "systemctl"
+```
+
+### 5. File Operations
+
+Look for file modifications (if auditd rules are configured):
+```logql
+{hostname="<hostname>"} |= "EXECVE" |= "vim"
+{hostname="<hostname>"} |= "EXECVE" |= "nano"
+{hostname="<hostname>"} |= "EXECVE" |= "rm"
+```
+
+## Query Guidelines
+
+**Start narrow, expand if needed:**
+- Begin with `limit: 20-30`
+- Use tight time windows: `start: "15m"` or `start: "30m"`
+- Add filters progressively
+
+**Avoid:**
+- Querying all audit logs without EXECVE filter (extremely verbose)
+- Large time ranges without specific filters
+- Limits over 50 without tight filters
+
+**Time-bounded queries:**
+When investigating around a specific event:
+```logql
+{hostname="<hostname>"} |= "EXECVE" != "systemd"
+```
+With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
+
+## Suspicious Patterns to Watch For
+
+1. **Unusual login times** - Activity outside normal hours
+2. **Failed authentication** - Brute force attempts
+3. **Privilege escalation** - Unexpected sudo usage
+4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
+5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
+6. **Persistence mechanisms** - Cron modifications, systemd service creation
+7. **Log tampering** - Commands targeting log files
+8. **Lateral movement** - SSH to other internal hosts
+9. **Service manipulation** - Stopping security services, disabling firewalls
+10. **Cleanup activity** - Deleting bash history, clearing logs
+
+## Output Format
+
+### For Standalone Security Reviews
+
+```
+## Activity Summary
+
+**Host:** <hostname>
+**Time Period:** <start> to <end>
+**Sessions Found:** <count>
+
+## User Sessions
+
+### Session 1: <user> from <source_ip>
+- **Login:** HH:MM:SSZ
+- **Logout:** HH:MM:SSZ (or ongoing)
+- **Commands executed:**
+  - HH:MM:SSZ - <command>
+  - HH:MM:SSZ - <command>
+
+## Suspicious Activity
+
+[If any patterns from the watch list were detected]
+- **Finding:** <description>
+- **Evidence:** <log entries>
+- **Risk Level:** Low / Medium / High
+
+## Summary
+
+[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
+```
+
+### When Called by Another Agent
+
+Provide a focused response addressing the specific question:
+
+```
+## Audit Findings
+
+**Query:** <what was asked>
+**Time Window:** <investigated period>
+
+## Relevant Activity
+
+[Chronological list of relevant events]
+- HH:MM:SSZ - <event>
+- HH:MM:SSZ - <event>
+
+## Assessment
+
+[Direct answer to the question with supporting evidence]
+```
+
+## Guidelines
+
+- Reconstruct timelines chronologically
+- Correlate events (login → commands → logout)
+- Note gaps or missing data
+- Distinguish between automated (systemd, cron) and interactive activity
+- Consider the host's role and tier when assessing severity
+- When called by another agent, focus on answering their specific question
+- Don't speculate without evidence - state what the logs show and don't show
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -0,0 +1,211 @@
+---
+name: investigate-alarm
+description: Investigates a single system alarm by querying Prometheus metrics and Loki logs, analyzing configuration files for affected hosts/services, and providing root cause analysis.
+tools: Read, Grep, Glob
+mcpServers:
+  - lab-monitoring
+  - git-explorer
+---
+
+You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
+
+## Input
+
+You will receive information about an alarm, which may include:
+- Alert name and severity
+- Affected host or service
+- Alert expression/threshold
+- Current value or status
+- When it started firing
+
+## Investigation Process
+
+### 1. Understand the Alert Context
+
+Start by understanding what the alert is measuring:
+- Use `get_alert` if you have a fingerprint, or `list_alerts` to find matching alerts
+- Use `get_metric_metadata` to understand the metric being monitored
+- Use `search_metrics` to find related metrics
+
+### 2. Query Current State
+
+Gather evidence about the current system state:
+- Use `query` to check the current metric values and related metrics
+- Use `list_targets` to verify the host/service is being scraped successfully
+- Look for correlated metrics that might explain the issue
+
+### 3. Check Service Logs
+
+Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.
+
+**Query strategies (start narrow, expand if needed):**
+- Start with `limit: 20-30`, increase only if needed
+- Use tight time windows: `start: "15m"` or `start: "30m"` initially
+- Filter to specific services: `{hostname="<hostname>", systemd_unit="<service>.service"}`
+- Search for errors: `{hostname="<hostname>"} |= "error"` or `|= "failed"`
+
+**Common patterns:**
+- Service logs: `{hostname="<hostname>", systemd_unit="<service>.service"}`
+- All errors on host: `{hostname="<hostname>"} |= "error"`
+- Journal for a unit: `{hostname="<hostname>", systemd_unit="nginx.service"} |= "failed"`
+
+**Avoid:**
+- Using `start: "1h"` with no filters on busy hosts
+- Limits over 50 without specific filters
+
+### 4. Investigate User Activity
+
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
+
+**Always call the auditor when:**
+- A service stopped unexpectedly (may have been manually stopped)
+- A process was killed or a config was changed
+- You need to know who was logged in around the time of an incident
+- You need to understand what commands led to the current state
+- The cause isn't obvious from service logs alone
+
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
+
+**Example prompt for auditor:**
+```
+Investigate user activity on <hostname> between <start_time> and <end_time>.
+Context: The prometheus-node-exporter service stopped at 14:32.
+Determine if it was manually stopped and by whom.
+```
+
+Incorporate the auditor's findings into your timeline and root cause analysis.
+
+### 5. Check Configuration (if relevant)
+
+If the alert relates to a NixOS-managed service:
+- Check host configuration in `/hosts/<hostname>/`
+- Check service modules in `/services/<service>/`
+- Look for thresholds, resource limits, or misconfigurations
+- Check `homelab.host` options for tier/priority/role metadata
+
+### 6. Check for Configuration Drift
+
+Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
+- Hosts running outdated configurations
+- Recent changes that might have caused the issue
+- Whether a fix has already been committed but not deployed
+
+**Step 1: Get the deployed revision from Prometheus**
+```promql
+nixos_flake_info{hostname="<hostname>"}
+```
+The `current_rev` label contains the deployed git commit hash.
+
+**Step 2: Check if the host is behind master**
+```
+resolve_ref("master")           # Get current master commit
+is_ancestor(deployed, master)   # Check if host is behind
+```
+
+**Step 3: See what commits are missing**
+```
+commits_between(deployed, master)  # List commits not yet deployed
+```
+
+**Step 4: Check which files changed**
+```
+get_diff_files(deployed, master)   # Files modified since deployment
+```
+Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.
+
+**Step 5: View configuration at the deployed revision**
+```
+get_file_at_commit(deployed, "services/<service>/default.nix")
+```
+Compare against the current file to understand differences.
+
+**Step 6: Find when something changed**
+```
+search_commits("<service-name>")   # Find commits mentioning the service
+get_commit_info(<hash>)            # Get full details of a specific change
+```
+
+**Example workflow for a service-related alert:**
+1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
+2. `resolve_ref("master")` → `4633421`
+3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
+4. `commits_between("8959829", "4633421")` → 7 commits missing
+5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
+6. If a fix was committed after the deployed rev, recommend deployment
+
+### 7. Consider Common Causes
+
+For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
+- **Configuration drift**: Host running outdated config, fix already in master
+- **Disk space**: Nix store growth, logs, temp files
+- **Memory pressure**: Service memory leaks, insufficient limits
+- **CPU**: Runaway processes, build jobs
+- **Network**: DNS issues, connectivity problems
+- **Service restarts**: Failed upgrades, configuration errors
+- **Scrape failures**: Service down, firewall issues, port changes
+
+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
+
+## Output Format
+
+Provide a concise report with one of two outcomes:
+
+### If Root Cause Identified:
+
+```
+## Root Cause
+[1-2 sentence summary of the root cause]
+
+## Timeline
+[Chronological sequence of relevant events leading to the alert]
+- HH:MM:SSZ - [Event description]
+- HH:MM:SSZ - [Event description]
+- HH:MM:SSZ - [Alert fired]
+
+### Timeline sources
+- HH:MM:SSZ - [Source for information about this event. Which metric or log file]
+- HH:MM:SSZ - [Source for information about this event. Which metric or log file]
+- HH:MM:SSZ - [Alert fired]
+
+
+## Evidence
+- [Specific metric values or log entries that support the conclusion]
+- [Configuration details if relevant]
+
+
+## Recommended Actions
+1. [Specific remediation step]
+2. [Follow-up actions if any]
+```
+
+### If Root Cause Unclear:
+
+```
+## Investigation Summary
+[What was checked and what was found]
+
+## Possible Causes
+- [Hypothesis 1 with supporting/contradicting evidence]
+- [Hypothesis 2 with supporting/contradicting evidence]
+
+## Additional Information Needed
+- [Specific data, logs, or access that would help]
+- [Suggested queries or checks for the operator]
+```
+
+## Guidelines
+
+- Be concise and actionable
+- Reference specific metric names and values as evidence
+- Include log snippets when they're informative
+- Don't speculate without evidence
+- If the alert is a false positive or expected behavior, explain why
+- Consider the host's tier (test vs prod) when assessing severity
+- Build a timeline from log timestamps and metrics to show the sequence of events
+- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
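
The configuration-drift steps in the agent definition above reduce to two graph questions: is the deployed commit an ancestor of master, and which commits are reachable from master but not from the deployed commit. The sketch below illustrates that logic in Python on an in-memory parent map. It is an assumption-laden illustration only: the `parents` dictionary, the function names, and the status strings are hypothetical and say nothing about how the git-explorer server is implemented.

```python
# Hypothetical in-memory model: commit hash -> list of parent hashes.
def _reachable(parents: dict, start: str) -> set:
    """All commits reachable from `start` via parent links, including `start`."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def is_ancestor(parents: dict, ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is reachable from `descendant` (a commit counts as its own ancestor)."""
    return ancestor in _reachable(parents, descendant)


def commits_between(parents: dict, deployed: str, master: str) -> set:
    """Commits on master that the deployed revision does not contain."""
    return _reachable(parents, master) - _reachable(parents, deployed)


def drift_status(parents: dict, deployed: str, master: str) -> str:
    """Classify a host as 'up-to-date', 'behind', or 'diverged' relative to master."""
    if deployed == master:
        return "up-to-date"
    if is_ancestor(parents, deployed, master):
        return "behind"
    return "diverged"


if __name__ == "__main__":
    history = {"c3": ["c2"], "c2": ["c1"], "c1": []}
    print(drift_status(history, "c1", "c3"), sorted(commits_between(history, "c1", "c3")))
```

A "diverged" result (deployed commit not contained in master) is worth flagging separately, since it usually means the host is running a feature branch rather than an outdated master.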
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -30,11 +30,13 @@ Use the `lab-monitoring` MCP server tools:
 ### Label Reference

 Available labels for log queries:
-- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
+- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
+- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
-- `hostname` - Alternative to `host` for some streams
+- `tier` - Deployment tier (`test` or `prod`)
+- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
+- `level` - Log level mapped from journal PRIORITY (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only

 ### Log Format

@@ -47,12 +49,12 @@ Journal logs are JSON-formatted. Key fields:

 **Logs from a specific service on a host:**
 ```logql
-{host="ns1", systemd_unit="nsd.service"}
+{hostname="ns1", systemd_unit="nsd.service"}
 ```

 **All logs from a host:**
 ```logql
-{host="monitoring01"}
+{hostname="monitoring01"}
 ```

 **Logs from a service across all hosts:**
@@ -62,12 +64,12 @@ Journal logs are JSON-formatted. Key fields:

 **Substring matching (case-sensitive):**
 ```logql
-{host="ha1"} |= "error"
+{hostname="ha1"} |= "error"
 ```

 **Exclude pattern:**
 ```logql
-{host="ns1"} != "routine"
+{hostname="ns1"} != "routine"
 ```

 **Regex matching:**
@@ -75,6 +77,20 @@ Journal logs are JSON-formatted. Key fields:
 {systemd_unit="prometheus.service"} |~ "scrape.*failed"
 ```

+**Filter by level (journal scrape only):**
+```logql
+{level="error"}                                  # All errors across the fleet
+{level=~"critical|error", tier="prod"}           # Prod errors and criticals
+{hostname="ns1", level="warning"}                # Warnings from a specific host
+```
+
+**Filter by tier/role:**
+```logql
+{tier="prod"} |= "error"                        # All errors on prod hosts
+{role="dns"}                                     # All DNS server logs
+{tier="test", job="systemd-journal"}             # Journal logs from test hosts
+```
+
 **File-based logs (caddy access logs, etc):**
 ```logql
 {job="varlog", hostname="nix-cache01"}
@@ -102,6 +118,36 @@ Useful systemd units for troubleshooting:
 - `sshd.service` - SSH daemon
 - `nix-gc.service` - Nix garbage collection

+### Bootstrap Logs
+
+VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
+
+- `hostname` - Target hostname
+- `branch` - Git branch being deployed
+- `stage` - Bootstrap stage (see table below)
+
+**Bootstrap stages:**
+
+| Stage | Message | Meaning |
+|-------|---------|---------|
+| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
+| `network_ok` | Network connectivity confirmed | Can reach git server |
+| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
+| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
+| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
+| `building` | Starting nixos-rebuild boot | NixOS build starting |
+| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
+| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
+
+**Bootstrap queries:**
+
+```logql
+{job="bootstrap"}                              # All bootstrap logs
+{job="bootstrap", hostname="myhost"}            # Specific host
+{job="bootstrap", stage="failed"}              # All failures
+{job="bootstrap", stage=~"building|success"}   # Track build progress
+```
+
 ### Extracting JSON Fields

 Parse JSON and filter on fields:
@@ -175,31 +221,95 @@ Disk space (root filesystem):
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```

-### Service-Specific Metrics
+### Prometheus Jobs

-Common job names:
-- `node-exporter` - System metrics (all hosts)
-- `nixos-exporter` - NixOS version/generation metrics
-- `caddy` - Reverse proxy metrics
-- `prometheus` / `loki` / `grafana` - Monitoring stack
-- `home-assistant` - Home automation
-- `step-ca` - Internal CA
+All available Prometheus job names:

-### Instance Label Format
+**System exporters (on all/most hosts):**
+- `node-exporter` - System metrics (CPU, memory, disk, network)
+- `nixos-exporter` - NixOS flake revision and generation info
+- `systemd-exporter` - Systemd unit status metrics
+- `homelab-deploy` - Deployment listener metrics

-The `instance` label uses FQDN format:
+**Service-specific exporters:**
+- `caddy` - Reverse proxy metrics (http-proxy)
+- `nix-cache_caddy` - Nix binary cache metrics
+- `home-assistant` - Home automation metrics (ha1)
+- `jellyfin` - Media server metrics (jelly01)
+- `kanidm` - Authentication server metrics (kanidm01)
+- `nats` - NATS messaging metrics (nats1)
+- `openbao` - Secrets management metrics (vault01)
+- `unbound` - DNS resolver metrics (ns1, ns2)
+- `wireguard` - VPN tunnel metrics (http-proxy)

-```
-<hostname>.home.2rjus.net:<port>
-```
+**Monitoring stack (localhost on monitoring01):**
+- `prometheus` - Prometheus self-metrics
+- `loki` - Loki self-metrics
+- `grafana` - Grafana self-metrics
+- `alertmanager` - Alertmanager metrics
+- `pushgateway` - Push-based metrics gateway

-Example queries filtering by host:
+**External/infrastructure:**
+- `pve-exporter` - Proxmox hypervisor metrics
+- `smartctl` - Disk SMART health (gunter)
+- `restic_rest` - Backup server metrics
+- `ghettoptt` - PTT service metrics (gunter)
+
+### Target Labels
+
+All scrape targets have these labels:
+
+**Standard labels:**
+- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
+- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
+
+**Host metadata labels** (when configured in `homelab.host`):
+- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
+- `tier` - Deployment tier (`test` for test VMs, absent for prod)
+- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
+
+### Filtering by Host
+
+Use the `hostname` label for easy host filtering across all jobs:

 ```promql
-up{instance=~"monitoring01.*"}
-node_load1{instance=~"ns1.*"}
+{hostname="ns1"}                    # All metrics from ns1
+node_load1{hostname="monitoring01"} # Specific metric by hostname
+up{hostname="ha1"}                  # Check if ha1 is up
 ```

+This is simpler than wildcarding the `instance` label:
+
+```promql
+# Old way (still works but verbose)
+up{instance=~"monitoring01.*"}
+
+# New way (preferred)
+up{hostname="monitoring01"}
+```
+
+### Filtering by Role/Tier
+
+Filter hosts by their role or tier:
+
+```promql
+up{role="dns"}                      # All DNS servers (ns1, ns2)
+node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
+up{tier="test"}                     # All test-tier VMs
+up{dns_role="primary"}              # Primary DNS only (ns1)
+```
+
+Current host labels:
+| Host | Labels |
+|------|--------|
+| ns1 | `role=dns`, `dns_role=primary` |
+| ns2 | `role=dns`, `dns_role=secondary` |
+| nix-cache01 | `role=build-host` |
+| vault01 | `role=vault` |
+| kanidm01 | `role=auth`, `tier=test` |
+| testvm01/02/03 | `tier=test` |
+
 ---

 ## Troubleshooting Workflows
@@ -212,11 +322,12 @@ node_load1{instance=~"ns1.*"}

 ### Investigate Service Issues

-1. Check `up{job="<service>"}` for scrape failures
+1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
 2. Use `list_targets` to see target health details
-3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
-4. Search for errors: `{host="<host>"} |= "error"`
+3. Query service logs: `{hostname="<host>", systemd_unit="<service>.service"}`
+4. Search for errors: `{hostname="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
+6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers

 ### After Deploying Changes

@@ -225,10 +336,21 @@ node_load1{instance=~"ns1.*"}
 3. Check service logs for startup issues
 4. Check service metrics are being scraped

+### Monitor VM Bootstrap
+
+When provisioning new VMs, track bootstrap progress:
+
+1. Watch bootstrap logs: `{job="bootstrap", hostname="<hostname>"}`
+2. Check for failures: `{job="bootstrap", hostname="<hostname>", stage="failed"}`
+3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
+4. Check logs are flowing: `{hostname="<hostname>"}`
+
+See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
+
 ### Debug SSH/Access Issues

 ```logql
-{host="<host>", systemd_unit="sshd.service"}
+{hostname="<host>", systemd_unit="sshd.service"}
 ```

 ### Check Recent Upgrades
@@ -246,5 +368,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
-- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
+- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
 - Log `MESSAGE` field contains the actual log content in JSON format
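
Since this change aligns the Loki and Prometheus label names (`hostname`, `role`, `tier`), the same equality selector syntax now serves both query languages. A minimal Python sketch of building such selectors follows; the `selector` function is hypothetical and handles only `=` matchers, not `=~` or `!=`.

```python
# Hypothetical helper: builds `{label="value", ...}` selectors with equality matchers only.
def selector(metric: str = "", /, **labels: str) -> str:
    """Return a PromQL/LogQL selector; pass a metric name for PromQL, omit it for LogQL."""

    def quote(value: str) -> str:
        # Escape backslashes and double quotes inside label values.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    body = ", ".join(name + "=" + quote(value) for name, value in labels.items())
    return metric + "{" + body + "}"


if __name__ == "__main__":
    print(selector(hostname="ns1", systemd_unit="nsd.service"))  # LogQL stream selector
    print(selector("up", role="dns"))                            # PromQL instant vector selector
```

Keyword arguments keep their insertion order, so the labels appear in the order they are passed.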
--- a/.gitignore
+++ b/.gitignore
@@ -2,6 +2,9 @@
 result
 result-*

+# MCP config (contains secrets)
+.mcp.json
+
 # Terraform/OpenTofu
 terraform/.terraform/
 terraform/.terraform.lock.hcl
--- a/.mcp.json.example
+++ b/.mcp.json.example
@@ -20,7 +20,9 @@
      "env": {
        "PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
        "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
-        "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
+        "LOKI_URL": "https://loki.home.2rjus.net",
+        "LOKI_USERNAME": "promtail",
+        "LOKI_PASSWORD": "<password from: bao kv get -field=password secret/shared/loki/push-auth>"
      }
    },
    "homelab-deploy": {
@@ -31,9 +33,16 @@
        "--",
        "mcp",
        "--nats-url", "nats://nats1.home.2rjus.net:4222",
-        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
+        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey",
+        "--enable-builds"
      ]
+    },
+    "git-explorer": {
+      "command": "nix",
+      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
+      "env": {
+        "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
+      }
    }
  }
 }
-
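
Because the real `.mcp.json` is now gitignored, each checkout has to produce it from `.mcp.json.example` by replacing the `LOKI_PASSWORD` placeholder. The Python sketch below shows one way to do that substitution without assuming the file's overall layout. The function name `fill_loki_password` and the sample JSON are hypothetical; only the `LOKI_PASSWORD` and `LOKI_USERNAME` key names come from the diff, and the password itself would come from the `bao kv get` command quoted in the placeholder.

```python
import json


def fill_loki_password(template_text: str, password: str) -> str:
    """Return the JSON config with every LOKI_PASSWORD value replaced by `password`."""

    def walk(node):
        # Recurse through dicts and lists so the key is found at any nesting depth.
        if isinstance(node, dict):
            return {
                key: password if key == "LOKI_PASSWORD" else walk(value)
                for key, value in node.items()
            }
        if isinstance(node, list):
            return [walk(item) for item in node]
        return node

    return json.dumps(walk(json.loads(template_text)), indent=2) + "\n"


if __name__ == "__main__":
    # Illustrative input only; the real example file has more servers and settings.
    sample = '{"example-server": {"env": {"LOKI_USERNAME": "promtail", "LOKI_PASSWORD": "<placeholder>"}}}'
    print(fill_loki_password(sample, "not-a-real-password"))
```

The output should be written to `.mcp.json` only, which the `.gitignore` entry added in this change keeps out of version control.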
--- a/.sops.yaml
+++ b/.sops.yaml
@@ -1,52 +0,0 @@
-keys:
-  - &admin_torjus age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-  - &server_ns1 age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
-  - &server_ns2 age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
-  - &server_ha1 age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
-  - &server_http-proxy age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
-  - &server_ca age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
-  - &server_monitoring01 age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
-  - &server_jelly01 age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
-  - &server_nix-cache01 age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
-  - &server_pgdb1 age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
-  - &server_nats1 age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
-creation_rules:
-  - path_regex: secrets/[^/]+\.(yaml|json|env|ini)
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_ns1
-        - *server_ns2
-        - *server_ha1
-        - *server_http-proxy
-        - *server_ca
-        - *server_monitoring01
-        - *server_jelly01
-        - *server_nix-cache01
-        - *server_pgdb1
-        - *server_nats1
-  - path_regex: secrets/ca/[^/]+\.(yaml|json|env|ini|)
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_ca
-  - path_regex: secrets/monitoring01/[^/]+\.(yaml|json|env|ini)
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_monitoring01
-  - path_regex: secrets/ca/keys/.+
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_ca
-  - path_regex: secrets/nix-cache01/.+
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_nix-cache01
-  - path_regex: secrets/http-proxy/.+
-    key_groups:
-      - age:
-        - *admin_torjus
-        - *server_http-proxy
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -35,6 +35,34 @@ nix build .#create-host

 Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.

+### SSH Commands
+
+Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.
+
+### Sharing Command Output via Loki
+
+All hosts have the `pipe-to-loki` script for sending command output or terminal sessions to Loki, allowing users to share output with Claude without copy-pasting.
+
+**Pipe mode** - send command output:
+```bash
+command | pipe-to-loki                  # Auto-generated ID
+command | pipe-to-loki --id my-test     # Custom ID
+```
+
+**Session mode** - record interactive terminal session:
+```bash
+pipe-to-loki --record                   # Start recording, exit to send
+pipe-to-loki --record --id my-session   # With custom ID
+```
+
+The script prints the session ID which the user can share. Query results with:
+```logql
+{job="pipe-to-loki"}                           # All entries
+{job="pipe-to-loki", id="my-test"}             # Specific ID
+{job="pipe-to-loki", hostname="testvm01"}       # From specific host
+{job="pipe-to-loki", type="session"}           # Only sessions
+```
+
 ### Testing Feature Branches on Hosts

 All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
@@ -61,25 +89,53 @@ Do not run `nix flake update`. Should only be done manually by user.
 ### Development Environment

 ```bash
-# Enter development shell (provides ansible, python3)
+# Enter development shell
 nix develop
 ```

+The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.
+
+**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
+```bash
+# Good - works regardless of current shell
+nix develop -c tofu plan
+
+# Avoid - requires user to be in devshell
+tofu plan
+```
+
+**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
+```bash
+# Good - uses -chdir option
+nix develop -c tofu -chdir=terraform plan
+nix develop -c tofu -chdir=terraform/vault apply
+
+# Avoid - changing directories
+cd terraform && tofu plan
+```
+
+### Ansible
+
+Ansible configuration and playbooks are in `/ansible/`. See [ansible/README.md](ansible/README.md) for inventory groups, available playbooks, and usage examples.
+
+The devshell sets `ANSIBLE_CONFIG` automatically, so no `-i` flag is needed.
+
 ### Secrets Management

 Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
 `vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
 Terraform manages the secrets and AppRole policies in `terraform/vault/`.

-Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
-`.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
-
 ### Git Workflow

 **Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.

 **Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.

+**Important:** Never force push to `master`. If a commit on master has an error, fix it with a new commit rather than rewriting history.
+
+**Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.
+
 When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).

 ### Plan Management
@@ -132,67 +188,16 @@ Two MCP servers are available for searching NixOS options and packages:

 This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.

-### Lab Monitoring Log Queries
+### Lab Monitoring

-The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
+The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:

-**Loki Label Reference:**
+- Available Prometheus jobs and exporters
+- Loki labels and LogQL query syntax
+- Bootstrap log monitoring for new VMs
+- Common troubleshooting workflows

-- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
-- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
-- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
-
-Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
-
-**Example LogQL queries:**
-```
-# Logs from a specific service on a host
-{host="ns2", systemd_unit="nsd.service"}
-
-# Substring match on log content
-{host="ns1", systemd_unit="nsd.service"} |= "error"
-
-# File-based logs (e.g., caddy access logs)
-{job="varlog", hostname="nix-cache01"}
-```
-
-Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
-
-### Lab Monitoring Prometheus Queries
-
-The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.
-
-**Prometheus Job Names:**
-
-- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
-- `caddy` - Reverse proxy metrics (http-proxy)
-- `nix-cache_caddy` - Nix binary cache metrics
-- `home-assistant` - Home automation metrics
-- `jellyfin` - Media server metrics
-- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
-- `step-ca` - Internal CA metrics
-- `pve-exporter` - Proxmox hypervisor metrics
-- `smartctl` - Disk SMART health (gunter)
-- `wireguard` - VPN metrics (http-proxy)
-- `pushgateway` - Push-based metrics (e.g., backup results)
-- `restic_rest` - Backup server metrics
-- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
-
-**Example PromQL queries:**
-```
-# Check all targets are up
-up
-
-# CPU usage for a specific host
-rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
-
-# Memory usage across all hosts
-node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
-
-# Disk space
-node_filesystem_avail_bytes{mountpoint="/"}
-```
+The skill contains up-to-date information about all scrape targets, host labels, and example queries.

 ### Deploying to Test Hosts

@@ -229,6 +234,21 @@ deploy(role="vault", action="switch")

 **Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.

+**Deploying to Prod Hosts:**
+
+The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:
+
+```bash
+nix develop -c homelab-deploy -- deploy \
+  --nats-url nats://nats1.home.2rjus.net:4222 \
+  --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
+  --branch <branch-name> \
+  --action switch \
+  deploy.prod.<hostname>
+```
+
+Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
+
 **Verifying Deployments:**

 After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
@@ -248,10 +268,11 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
 - `/system/` - Shared system-level configurations applied to ALL hosts
-  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
+  - Core modules: nix.nix, sshd.nix, vault-secrets.nix, acme.nix, autoupgrade.nix
+  - Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
  - Monitoring: node-exporter and promtail on every host
 - `/modules/` - Custom NixOS modules
-  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
+  - `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
 - `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,14 +280,17 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
-  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
-- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
+  - `vault/` - OpenBao (Vault) secrets server
+  - `actions-runner/` - GitHub Actions runner
+  - `http-proxy/`, `postgres/`, `nats/`, `jellyfin/`, etc.
 - `/common/` - Shared configurations (e.g., VM guest agent)
 - `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
-- `/playbooks/` - Ansible playbooks for fleet management
-- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)
+- `/ansible/` - Ansible configuration and playbooks
+  - `ansible.cfg` - Ansible configuration (inventory path, defaults)
+  - `inventory/` - Dynamic and static inventory sources
+  - `playbooks/` - Ansible playbooks for fleet management

 ### Configuration Inheritance

@@ -283,37 +307,27 @@ All hosts automatically get:
 - Nix binary cache (nix-cache.home.2rjus.net)
 - SSH with root login enabled
 - OpenBao (Vault) secrets management via AppRole
-- Internal ACME CA integration (ca.home.2rjus.net)
+- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
 - Daily auto-upgrades with auto-reboot
 - Prometheus node-exporter + Promtail (logs to monitoring01)
 - Monitoring scrape target auto-registration via `homelab.monitoring` options
 - Custom root CA trust
 - DNS zone auto-registration via `homelab.dns` options

-### Active Hosts
+### Hosts

-Production servers managed by `rebuild-all.sh`:
-- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
-- `ca` - Internal Certificate Authority
-- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
-- `http-proxy` - Reverse proxy
-- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
-- `jelly01` - Jellyfin media server
-- `nix-cache01` - Binary cache server
-- `pgdb1` - PostgreSQL database
-- `nats1` - NATS messaging server
+Host configurations are in `/hosts/<hostname>/`. See `flake.nix` for the complete list of `nixosConfigurations`.

-Template/test hosts:
-- `template1` - Base template for cloning new hosts
+Use `nix flake show` or `nix develop -c ansible-inventory --graph` to list all hosts.

 ### Flake Inputs

 - `nixpkgs` - NixOS 25.11 stable (primary)
 - `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
-- `sops-nix` - Secrets management (legacy, only used by ca)
+- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
+- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
 - Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
-  - `labmon` - Lab monitoring

 ### Network Architecture

@@ -335,12 +349,7 @@ Most hosts use OpenBao (Vault) for secrets:
 - `extractKey` option extracts a single key from vault JSON as a plain file
 - Secrets fetched at boot by `vault-secret-<name>.service` systemd units
 - Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
-- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
-
-Legacy SOPS (only used by `ca` host):
-- SOPS with age encryption, keys in `.sops.yaml`
-- Shared secrets: `/secrets/secrets.yaml`
-- Per-host secrets: `/secrets/<hostname>/`
+- Provision AppRole credentials: `nix develop -c ansible-playbook ansible/playbooks/provision-approle.yml -l <hostname>`

 ### Auto-Upgrade System

@@ -364,7 +373,7 @@ Template VMs are built from `hosts/template2` and deployed to Proxmox using Ansi

 ```bash
 # Build NixOS image and deploy to Proxmox as template
-nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
+nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml
 ```

 This playbook:
@@ -402,9 +411,21 @@ Example VM deployment includes:
 - Custom CPU/memory/disk sizing
 - VLAN tagging
 - QEMU guest agent
+- Automatic Vault credential provisioning via `vault_wrapped_token`

 OpenTofu outputs the VM's IP address after deployment for easy SSH access.

+**Automatic Vault Credential Provisioning:**
+
+VMs can receive Vault (OpenBao) credentials automatically during bootstrap:
+
+1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
+2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
+3. The bootstrap script unwraps the token to obtain AppRole credentials
+4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild
+
+This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.
+
 #### Template Rebuilding and Terraform State

 When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
@@ -427,7 +448,7 @@ This means:
 - `tofu plan` won't show spurious changes for Proxmox-managed defaults

 **When rebuilding the template:**
-1. Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
+1. Run `nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml`
 2. Update `default_template_name` in `terraform/variables.tf` if the name changed
 3. Run `tofu plan` - should show no VM recreations (only template name in state)
 4. Run `tofu apply` - updates state without touching existing VMs
@@ -435,20 +456,11 @@ This means:

 ### Adding a New Host

-1. Create `/hosts/<hostname>/` directory
-2. Copy structure from `template1` or similar host
-3. Add host entry to `flake.nix` nixosConfigurations
-4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
-5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
-6. Add `vault.enable = true;` to the host configuration
-7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
-8. Run `tofu apply` in `terraform/vault/`
-9. User clones template host
-10. User runs `prepare-host.sh` on new host
-11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
-12. Commit changes, and merge to master.
-13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
-14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
+See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:
+- Using the `create-host` script to generate host configurations
+- Deploying VMs and secrets with OpenTofu
+- Monitoring the bootstrap process via Loki
+- Verification and troubleshooting steps

 **Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.

@@ -484,11 +496,7 @@ Prometheus scrape targets are automatically generated from host configurations,
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

-Host monitoring options (`homelab.monitoring.*`):
-- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
-- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
-
-Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
+Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.

 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.

@@ -507,13 +515,31 @@ DNS zone entries are automatically generated from host configurations:
 - **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
 - **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)

-Host DNS options (`homelab.dns.*`):
-- `enable` (default: `true`) - Include host in DNS zone generation
-- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
-
 Hosts are automatically excluded from DNS if:
 - `homelab.dns.enable = false` (e.g., template hosts)
 - No static IP configured (e.g., DHCP-only hosts)
 - Network interface is a VPN/tunnel (wg*, tun*, tap*)

 To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
+
+### Homelab Module Options
+
+The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.
+
+**Host options (`homelab.host.*`):**
+- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
+- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
+- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
+- `labels` - Free-form key-value metadata for host categorization
+  - `ansible = "false"` - Exclude host from Ansible dynamic inventory
+
+**DNS options (`homelab.dns.*`):**
+- `enable` (default: `true`) - Include host in DNS zone generation
+- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
+
+**Monitoring options (`homelab.monitoring.*`):**
+- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
+- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host
+
+**Deploy options (`homelab.deploy.*`):**
+- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
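The subject format `deploy.<tier>.<hostname>` described in this file can be sketched as a small helper. This is only an illustration: `deploy_subject` and `VALID_TIERS` are hypothetical names that do not exist in the repository, and the check against dots in the hostname is an assumption based on NATS using `.` as the subject token separator.

```python
# Tiers documented for homelab.host.tier
VALID_TIERS = {"test", "prod"}


def deploy_subject(tier: str, hostname: str) -> str:
    """Build a NATS subject of the form deploy.<tier>.<hostname>."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    # Assumption: hosts are addressed by short hostname, so a dot would
    # split the subject into extra tokens.
    if not hostname or "." in hostname:
        raise ValueError("hostname must be a short name without dots")
    return f"deploy.{tier}.{hostname}"


print(deploy_subject("prod", "monitoring01"))
print(deploy_subject("test", "testvm01"))
```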
--- a/README.md
+++ b/README.md
@@ -12,8 +12,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
 | `http-proxy` | Reverse proxy |
 | `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
 | `jelly01` | Jellyfin media server |
-| `nix-cache01` | Nix binary cache |
-| `pgdb1` | PostgreSQL |
+| `nix-cache02` | Nix binary cache + NATS-based build service |
 | `nats1` | NATS messaging |
 | `vault01` | OpenBao (Vault) secrets management |
 | `template1`, `template2` | VM templates for cloning new hosts |
--- a/ansible/README.md
+++ b/ansible/README.md
@@ -0,0 +1,120 @@
+# Ansible Configuration
+
+This directory contains Ansible configuration for fleet management tasks.
+
+## Structure
+
+```
+ansible/
+├── ansible.cfg              # Ansible configuration
+├── inventory/
+│   ├── dynamic_flake.py     # Dynamic inventory from NixOS flake
+│   ├── static.yml           # Non-flake hosts (Proxmox, etc.)
+│   └── group_vars/
+│       └── all.yml          # Common variables
+└── playbooks/
+    ├── build-and-deploy-template.yml
+    ├── provision-approle.yml
+    ├── restart-service.yml
+    └── run-upgrade.yml
+```
+
+## Usage
+
+The devshell automatically configures `ANSIBLE_CONFIG`, so commands work without extra flags:
+
+```bash
+# List inventory groups
+nix develop -c ansible-inventory --graph
+
+# List hosts in a specific group
+nix develop -c ansible-inventory --list | jq '.role_dns'
+
+# Run a playbook
+nix develop -c ansible-playbook ansible/playbooks/run-upgrade.yml -l tier_test
+```
+
+## Inventory
+
+The inventory combines dynamic and static sources automatically.
+
+### Dynamic Inventory (from flake)
+
+The `dynamic_flake.py` script extracts hosts from the NixOS flake using `homelab.host.*` options:
+
+**Groups generated:**
+- `flake_hosts` - All NixOS hosts from the flake
+- `tier_test`, `tier_prod` - By `homelab.host.tier`
+- `role_dns`, `role_vault`, `role_monitoring`, etc. - By `homelab.host.role`
+
+**Host variables set:**
+- `tier` - Deployment tier (test/prod)
+- `role` - Host role
+- `ansible_host`, `fqdn` - Fully qualified hostname (`<hostname>.<domain>`) used for the connection
+
+### Static Inventory
+
+Non-flake hosts are defined in `inventory/static.yml`:
+
+- `proxmox` - Proxmox hypervisors
+
+## Playbooks
+
+| Playbook | Description | Example |
+|----------|-------------|---------|
+| `run-upgrade.yml` | Trigger nixos-upgrade on hosts | `-l tier_prod` |
+| `restart-service.yml` | Restart a systemd service | `-l role_dns -e service=unbound` |
+| `reboot.yml` | Rolling reboot (one host at a time) | `-l tier_test` |
+| `provision-approle.yml` | Deploy Vault credentials (single host only) | `-l testvm01` |
+| `build-and-deploy-template.yml` | Build and deploy Proxmox template | (no limit needed) |
+
+### Examples
+
+```bash
+# Restart unbound on all DNS servers
+nix develop -c ansible-playbook ansible/playbooks/restart-service.yml \
+  -l role_dns -e service=unbound
+
+# Trigger upgrade on all test hosts
+nix develop -c ansible-playbook ansible/playbooks/run-upgrade.yml -l tier_test
+
+# Provision Vault credentials for a specific host
+nix develop -c ansible-playbook ansible/playbooks/provision-approle.yml -l testvm01
+
+# Build and deploy Proxmox template
+nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml
+
+# Rolling reboot of test hosts (one at a time, waits for each to come back)
+nix develop -c ansible-playbook ansible/playbooks/reboot.yml -l tier_test
+```
+
+## Excluding Flake Hosts
+
+To exclude a flake host from the dynamic inventory, add the `ansible = "false"` label in the host's configuration:
+
+```nix
+homelab.host.labels.ansible = "false";
+```
+
+Hosts with `homelab.dns.enable = false` are also excluded automatically.
+
+## Adding Non-Flake Hosts
+
+Edit `inventory/static.yml` to add hosts not managed by the NixOS flake:
+
+```yaml
+all:
+  children:
+    my_group:
+      hosts:
+        host1.example.com:
+          ansible_user: admin
+```
+
+## Common Variables
+
+Variables in `inventory/group_vars/all.yml` apply to all hosts:
+
+- `ansible_user` - Default SSH user (root)
+- `domain` - Domain name (home.2rjus.net)
+- `vault_addr` - Vault server URL
--- a/ansible/ansible.cfg
+++ b/ansible/ansible.cfg
@@ -0,0 +1,17 @@
+[defaults]
+inventory = inventory/
+remote_user = root
+host_key_checking = False
+
+# Reduce SSH connection overhead
+forks = 10
+pipelining = True
+
+# Output formatting (YAML output via builtin default callback)
+stdout_callback = default
+callbacks_enabled = profile_tasks
+result_format = yaml
+
+[ssh_connection]
+# Reuse SSH connections
+ssh_args = -o ControlMaster=auto -o ControlPersist=60s
--- a/ansible/inventory/dynamic_flake.py
+++ b/ansible/inventory/dynamic_flake.py
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+"""
+Dynamic Ansible inventory script that extracts host information from the NixOS flake.
+
+Generates groups:
+  - flake_hosts: All hosts defined in the flake
+  - tier_test, tier_prod: Hosts by deployment tier
+  - role_<name>: Hosts by role (dns, vault, monitoring, etc.)
+
+Usage:
+  ./dynamic_flake.py --list    # Return full inventory
+  ./dynamic_flake.py --host X  # Return host vars (not used, but required by Ansible)
+"""
+
+import json
+import subprocess
+import sys
+from pathlib import Path
+
+
+def get_flake_dir() -> Path:
+    """Find the flake root directory."""
+    script_dir = Path(__file__).resolve().parent
+    # ansible/inventory/dynamic_flake.py -> repo root
+    return script_dir.parent.parent
+
+
+def evaluate_flake() -> dict:
+    """Evaluate the flake and extract host metadata."""
+    flake_dir = get_flake_dir()
+
+    # Nix expression to extract relevant config from each host
+    nix_expr = """
+    configs: builtins.mapAttrs (name: cfg: {
+      hostname = cfg.config.networking.hostName;
+      domain = cfg.config.networking.domain or "home.2rjus.net";
+      tier = cfg.config.homelab.host.tier;
+      role = cfg.config.homelab.host.role;
+      labels = cfg.config.homelab.host.labels;
+      dns_enabled = cfg.config.homelab.dns.enable;
+    }) configs
+    """
+
+    try:
+        result = subprocess.run(
+            [
+                "nix",
+                "eval",
+                "--json",
+                f"{flake_dir}#nixosConfigurations",
+                "--apply",
+                nix_expr,
+            ],
+            capture_output=True,
+            text=True,
+            check=True,
+            cwd=flake_dir,
+        )
+        return json.loads(result.stdout)
+    except subprocess.CalledProcessError as e:
+        print(f"Error evaluating flake: {e.stderr}", file=sys.stderr)
+        sys.exit(1)
+    except json.JSONDecodeError as e:
+        print(f"Error parsing nix output: {e}", file=sys.stderr)
+        sys.exit(1)
+
+
+def sanitize_group_name(name: str) -> str:
+    """Sanitize a string for use as an Ansible group name.
+
+    Ansible group names should contain only alphanumeric characters and underscores.
+    """
+    return "".join(c if c.isalnum() else "_" for c in name)
+
+
+def build_inventory(hosts_data: dict) -> dict:
+    """Build Ansible inventory structure from host data."""
+    inventory = {
+        "_meta": {"hostvars": {}},
+        "flake_hosts": {"hosts": []},
+    }
+
+    # Track groups we need to create
+    tier_groups: dict[str, list[str]] = {}
+    role_groups: dict[str, list[str]] = {}
+
+    for _config_name, host_info in hosts_data.items():
+        hostname = host_info["hostname"]
+        domain = host_info["domain"]
+        tier = host_info["tier"]
+        role = host_info["role"]
+        labels = host_info["labels"]
+        dns_enabled = host_info["dns_enabled"]
+
+        # Skip hosts that have DNS disabled (like templates)
+        if not dns_enabled:
+            continue
+
+        # Skip hosts with ansible = "false" label
+        if labels.get("ansible") == "false":
+            continue
+
+        fqdn = f"{hostname}.{domain}"
+
+        # Use short hostname as inventory name, FQDN for connection
+        inventory_name = hostname
+
+        # Add to flake_hosts group
+        inventory["flake_hosts"]["hosts"].append(inventory_name)
+
+        # Add host variables
+        inventory["_meta"]["hostvars"][inventory_name] = {
+            "ansible_host": fqdn,  # Connect using FQDN
+            "fqdn": fqdn,
+            "tier": tier,
+            "role": role,
+        }
+
+        # Group by tier
+        tier_group = f"tier_{sanitize_group_name(tier)}"
+        if tier_group not in tier_groups:
+            tier_groups[tier_group] = []
+        tier_groups[tier_group].append(inventory_name)
+
+        # Group by role (if set)
+        if role:
+            role_group = f"role_{sanitize_group_name(role)}"
+            if role_group not in role_groups:
+                role_groups[role_group] = []
+            role_groups[role_group].append(inventory_name)
+
+    # Add tier groups to inventory
+    for group_name, hosts in tier_groups.items():
+        inventory[group_name] = {"hosts": hosts}
+
+    # Add role groups to inventory
+    for group_name, hosts in role_groups.items():
+        inventory[group_name] = {"hosts": hosts}
+
+    return inventory
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: dynamic_flake.py --list | --host <hostname>", file=sys.stderr)
+        sys.exit(1)
+
+    if sys.argv[1] == "--list":
+        hosts_data = evaluate_flake()
+        inventory = build_inventory(hosts_data)
+        print(json.dumps(inventory, indent=2))
+    elif sys.argv[1] == "--host":
+        # Ansible calls this to get vars for a specific host
+        # We provide all vars in _meta.hostvars, so just return empty
+        print(json.dumps({}))
+    else:
+        print(f"Unknown option: {sys.argv[1]}", file=sys.stderr)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
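A minimal sketch of the filtering and grouping rules in `build_inventory`, run against sample data so the resulting inventory shape is visible without evaluating the flake. The function name `group_hosts` and the three sample hosts are hypothetical. The input dictionary only imitates the shape produced by the `nix eval --json` call above.

```python
import json


def group_hosts(hosts_data: dict) -> dict:
    """Group hosts by tier and role, applying the same exclusions as the script."""
    inventory = {"_meta": {"hostvars": {}}, "flake_hosts": {"hosts": []}}
    for info in hosts_data.values():
        # Skip hosts with DNS disabled (e.g. templates) or the ansible = "false" label
        if not info["dns_enabled"] or info["labels"].get("ansible") == "false":
            continue
        name = info["hostname"]
        inventory["flake_hosts"]["hosts"].append(name)
        inventory["_meta"]["hostvars"][name] = {
            "ansible_host": f"{name}.{info['domain']}",
            "tier": info["tier"],
            "role": info["role"],
        }
        groups = [f"tier_{info['tier']}"]
        if info["role"]:  # role may be unset (null in the JSON output)
            groups.append(f"role_{info['role'].replace('-', '_')}")
        for group in groups:
            inventory.setdefault(group, {"hosts": []})["hosts"].append(name)
    return inventory


# Hypothetical sample hosts, not taken from the real flake
sample = {
    "ns1": {"hostname": "ns1", "domain": "home.2rjus.net", "tier": "prod",
            "role": "dns", "labels": {}, "dns_enabled": True},
    "testvm01": {"hostname": "testvm01", "domain": "home.2rjus.net", "tier": "test",
                 "role": None, "labels": {}, "dns_enabled": True},
    "template2": {"hostname": "template2", "domain": "home.2rjus.net", "tier": "test",
                  "role": None, "labels": {}, "dns_enabled": False},
}

inventory = group_hosts(sample)
print(json.dumps(inventory, indent=2))
```

Here `template2` is dropped because DNS is disabled, and `testvm01` lands only in `tier_test` because it has no role.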
--- a/ansible/inventory/group_vars/all.yml
+++ b/ansible/inventory/group_vars/all.yml
@@ -0,0 +1,5 @@
+# Common variables for all hosts
+
+ansible_user: root
+domain: home.2rjus.net
+vault_addr: https://vault01.home.2rjus.net:8200
--- a/ansible/inventory/static.yml
+++ b/ansible/inventory/static.yml
@@ -0,0 +1,13 @@
+# Static inventory for non-flake hosts
+#
+# Hosts defined here are merged with the dynamic flake inventory.
+# Use this for infrastructure that isn't managed by NixOS.
+#
+# Use short hostnames as inventory names with ansible_host for FQDN.
+
+all:
+  children:
+    proxmox:
+      hosts:
+        pve1:
+          ansible_host: pve1.home.2rjus.net
--- a/ansible/playbooks/build-and-deploy-template.yml
+++ b/ansible/playbooks/build-and-deploy-template.yml
@@ -15,13 +15,13 @@
    - name: Build NixOS image
      ansible.builtin.command:
        cmd: "nixos-rebuild build-image --image-variant proxmox --flake .#template2"
-        chdir: "{{ playbook_dir }}/.."
+        chdir: "{{ playbook_dir }}/../.."
      register: build_result
      changed_when: true

    - name: Find built image file
      ansible.builtin.find:
-        paths: "{{ playbook_dir}}/../result"
+        paths: "{{ playbook_dir }}/../../result"
        patterns: "*.vma.zst"
        recurse: true
      register: image_files
@@ -99,3 +99,48 @@
    - name: Display success message
      ansible.builtin.debug:
        msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
+
+- name: Update Terraform template name
+  hosts: localhost
+  gather_facts: false
+
+  vars:
+    terraform_dir: "{{ playbook_dir }}/../../terraform"
+
+  tasks:
+    - name: Get image filename from earlier play
+      ansible.builtin.set_fact:
+        image_filename: "{{ hostvars['localhost']['image_filename'] }}"
+
+    - name: Extract template name from image filename
+      ansible.builtin.set_fact:
+        new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"
+
+    - name: Read current Terraform variables file
+      ansible.builtin.slurp:
+        src: "{{ terraform_dir }}/variables.tf"
+      register: variables_tf_content
+
+    - name: Extract current template name from variables.tf
+      ansible.builtin.set_fact:
+        current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"
+
+    - name: Check if template name has changed
+      ansible.builtin.set_fact:
+        template_name_changed: "{{ current_template_name != new_template_name }}"
+
+    - name: Display template name status
+      ansible.builtin.debug:
+        msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"
+
+    - name: Update default_template_name in variables.tf
+      ansible.builtin.replace:
+        path: "{{ terraform_dir }}/variables.tf"
+        regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
+        replace: '\1"{{ new_template_name }}"'
+      when: template_name_changed
+
+    - name: Display update result
+      ansible.builtin.debug:
+        msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
+      when: template_name_changed
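The regular expressions in the "Update Terraform template name" play can be tried outside Ansible, since the `regex_replace` and `regex_search` filters are built on Python's `re` module. The sketch below reuses the same patterns. The image filename and the `variables.tf` excerpt are hypothetical stand-ins, not real build output.

```python
import re

# Hypothetical image filename; the real one comes from the build result
image_filename = "vzdump-qemu-nixos-25.11.20260101.vma.zst"

# Same two substitutions as the regex_replace filters in the play
template_name = re.sub(r"\.vma\.zst$", "", image_filename)
template_name = re.sub(r"^vzdump-qemu-", "", template_name)

# Hypothetical excerpt of terraform/variables.tf
variables_tf = '''variable "default_template_name" {
  type    = string
  default = "nixos-25.11.20250101"
}
'''

# [^}]+ also matches newlines, so the search stays inside the variable block
search = r'variable "default_template_name"[^}]+default\s*=\s*"([^"]+)"'
match = re.search(search, variables_tf)
current_name = match.group(1) if match else None

if current_name != template_name:
    replace = r'(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
    # A function replacement avoids escaping issues in the new template name
    variables_tf = re.sub(
        replace, lambda m: f'{m.group(1)}"{template_name}"', variables_tf
    )

print(f"{current_name} -> {template_name}")
print(variables_tf)
```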
--- a/ansible/playbooks/provision-approle.yml
+++ b/ansible/playbooks/provision-approle.yml
@@ -1,7 +1,27 @@
 ---
-# Provision OpenBao AppRole credentials to an existing host
-# Usage: nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=ha1
+# Provision OpenBao AppRole credentials to a host
+#
+# Usage: ansible-playbook ansible/playbooks/provision-approle.yml -l <hostname>
 # Requires: BAO_ADDR and BAO_TOKEN environment variables set
+#
+# IMPORTANT: This playbook must target exactly one host to prevent
+# accidentally regenerating credentials for multiple hosts.
+
+- name: Validate single host target
+  hosts: all
+  gather_facts: false
+
+  tasks:
+    - name: Fail if targeting multiple hosts
+      ansible.builtin.fail:
+        msg: |
+          This playbook must target exactly one host.
+          Use: ansible-playbook provision-approle.yml -l <hostname>
+
+          Targeting multiple hosts would regenerate credentials for all of them,
+          potentially breaking existing services.
+      when: ansible_play_hosts | length != 1
+      run_once: true

 - name: Fetch AppRole credentials from OpenBao
  hosts: localhost
@@ -9,18 +29,17 @@
  gather_facts: false

  vars:
-    vault_addr: "{{ lookup('env', 'BAO_ADDR') | default('https://vault01.home.2rjus.net:8200', true) }}"
-    domain: "home.2rjus.net"
+    target_host: "{{ groups['all'] | first }}"
+    target_hostname: "{{ hostvars[target_host]['short_hostname'] | default(target_host.split('.')[0]) }}"

  tasks:
-    - name: Validate hostname is provided
-      ansible.builtin.fail:
-        msg: "hostname variable is required. Use: -e hostname=<name>"
-      when: hostname is not defined
+    - name: Display target host
+      ansible.builtin.debug:
+        msg: "Provisioning AppRole credentials for: {{ target_hostname }}"

    - name: Get role-id for host
      ansible.builtin.command:
-        cmd: "bao read -field=role_id auth/approle/role/{{ hostname }}/role-id"
+        cmd: "bao read -field=role_id auth/approle/role/{{ target_hostname }}/role-id"
      environment:
        BAO_ADDR: "{{ vault_addr }}"
        BAO_SKIP_VERIFY: "1"
@@ -29,25 +48,26 @@

    - name: Generate secret-id for host
      ansible.builtin.command:
-        cmd: "bao write -field=secret_id -f auth/approle/role/{{ hostname }}/secret-id"
+        cmd: "bao write -field=secret_id -f auth/approle/role/{{ target_hostname }}/secret-id"
      environment:
        BAO_ADDR: "{{ vault_addr }}"
        BAO_SKIP_VERIFY: "1"
      register: secret_id_result
      changed_when: true

-    - name: Add target host to inventory
-      ansible.builtin.add_host:
-        name: "{{ hostname }}.{{ domain }}"
-        groups: vault_target
-        ansible_user: root
+    - name: Store credentials for next play
+      ansible.builtin.set_fact:
        vault_role_id: "{{ role_id_result.stdout }}"
        vault_secret_id: "{{ secret_id_result.stdout }}"

 - name: Deploy AppRole credentials to host
-  hosts: vault_target
+  hosts: all
  gather_facts: false

+  vars:
+    vault_role_id: "{{ hostvars['localhost']['vault_role_id'] }}"
+    vault_secret_id: "{{ hostvars['localhost']['vault_secret_id'] }}"
+
  tasks:
    - name: Create AppRole directory
      ansible.builtin.file:
--- a/ansible/playbooks/reboot.yml
+++ b/ansible/playbooks/reboot.yml
@@ -0,0 +1,48 @@
+---
+# Reboot hosts with rolling strategy to avoid taking down redundant services
+#
+# Usage examples:
+#   # Reboot a single host
+#   ansible-playbook reboot.yml -l testvm01
+#
+#   # Reboot all test hosts (one at a time)
+#   ansible-playbook reboot.yml -l tier_test
+#
+#   # Reboot all DNS servers safely (one at a time)
+#   ansible-playbook reboot.yml -l role_dns
+#
+# Safety features:
+#   - serial: 1 ensures only one host reboots at a time
+#   - Waits for host to come back online before proceeding
+#   - order: shuffle randomizes host order, making consecutive same-role reboots less likely
+
+- name: Reboot hosts (rolling)
+  hosts: all
+  serial: 1
+  order: shuffle  # Randomize to spread out same-role hosts
+  gather_facts: false
+
+  vars:
+    reboot_timeout: 300  # 5 minutes to wait for host to come back
+
+  tasks:
+    - name: Display reboot target
+      ansible.builtin.debug:
+        msg: "Rebooting {{ inventory_hostname }} (role: {{ role | default('none') }})"
+
+    - name: Reboot the host
+      ansible.builtin.systemd:
+        name: reboot.target
+        state: started
+      async: 1
+      poll: 0
+      ignore_errors: true
+
+    - name: Wait for host to come back online
+      ansible.builtin.wait_for_connection:
+        delay: 5
+        timeout: "{{ reboot_timeout }}"
+
+    - name: Display reboot result
+      ansible.builtin.debug:
+        msg: "{{ inventory_hostname }} rebooted successfully"
--- a/ansible/playbooks/restart-service.yml
+++ b/ansible/playbooks/restart-service.yml
@@ -0,0 +1,40 @@
+---
+# Restart a systemd service on target hosts
+#
+# Usage examples:
+#   # Restart unbound on all DNS servers
+#   ansible-playbook restart-service.yml -l role_dns -e service=unbound
+#
+#   # Restart nginx on a specific host
+#   ansible-playbook restart-service.yml -l http-proxy.home.2rjus.net -e service=nginx
+#
+#   # Restart promtail on all prod hosts
+#   ansible-playbook restart-service.yml -l tier_prod -e service=promtail
+
+- name: Restart systemd service
+  hosts: all
+  gather_facts: false
+
+  tasks:
+    - name: Validate service name provided
+      ansible.builtin.fail:
+        msg: |
+          The 'service' variable is required.
+          Usage: ansible-playbook restart-service.yml -l <target> -e service=<name>
+
+          Examples:
+            -e service=nginx
+            -e service=unbound
+            -e service=promtail
+      when: service is not defined
+      run_once: true
+
+    - name: Restart {{ service }}
+      ansible.builtin.systemd:
+        name: "{{ service }}"
+        state: restarted
+      register: restart_result
+
+    - name: Display result
+      ansible.builtin.debug:
+        msg: "Service {{ service }} restarted on {{ inventory_hostname }}"
--- a/ansible/playbooks/run-upgrade.yml
+++ b/ansible/playbooks/run-upgrade.yml
--- a/common/ssh-audit.nix
+++ b/common/ssh-audit.nix
@@ -0,0 +1,21 @@
+# SSH session command auditing
+#
+# Logs all commands executed by users who logged in interactively (SSH).
+# System services and nix builds are excluded via auid filter.
+#
+# Logs are sent to journald and forwarded to Loki via promtail.
+# Query with: {hostname="<hostname>"} |= "EXECVE"
+{
+  # Enable Linux audit subsystem
+  security.audit.enable = true;
+  security.auditd.enable = true;
+
+  # Log execve syscalls only from interactive login sessions
+  # auid!=4294967295 means "audit login uid is set" (excludes system services, nix builds)
+  security.audit.rules = [
+    "-a exit,always -F arch=b64 -S execve -F auid!=4294967295"
+  ];
+
+  # Forward audit logs to journald (so promtail ships them to Loki)
+  services.journald.audit = true;
+}
--- a/docs/host-creation.md
+++ b/docs/host-creation.md
@@ -0,0 +1,217 @@
+# Host Creation Pipeline
+
+This document describes the process for creating new hosts in the homelab infrastructure.
+
+## Overview
+
+The `create-host` script creates a new host by generating its default configuration from a template. OpenTofu then deploys both the secrets and the VM. The VM boots from a template image (built from `hosts/template2`), which starts a bootstrap process. The bootstrap process applies the host's NixOS configuration and then reboots into the new config.
+
+## Prerequisites
+
+All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.
+
+```bash
+nix develop
+```
+
+## Steps
+
+Steps marked with **USER** must be performed by the user due to credential requirements.
+
+1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
+2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
+3. Add any secrets needed to `terraform/vault/`
+4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
+5. Push configuration to master (or the branch specified by `flake_branch`)
+6. **USER**: Apply terraform:
+   ```bash
+   nix develop -c tofu -chdir=terraform/vault apply
+   nix develop -c tofu -chdir=terraform apply
+   ```
+7. Once terraform completes, a VM boots in Proxmox using the template image
+8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
+9. After reboot, the host should be operational
+10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
+11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets
+
+## Tier Specification
+
+New hosts should set `homelab.host.tier` in their configuration:
+
+```nix
+homelab.host.tier = "test";  # or "prod"
+```
+
+- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
+- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
+
+## Observability
+
+During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
+
+```
+{job="bootstrap", hostname="<hostname>"}
+```
+
+### Bootstrap Stages
+
+The bootstrap process reports these stages via the `stage` label:
+
+| Stage | Message | Meaning |
+|-------|---------|---------|
+| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
+| `network_ok` | Network connectivity confirmed | Can reach git server |
+| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
+| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
+| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
+| `building` | Starting nixos-rebuild boot | NixOS build starting |
+| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
+| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
+
+### Useful Queries
+
+```
+# All bootstrap activity for a host
+{job="bootstrap", hostname="myhost"}
+
+# Track all failures
+{job="bootstrap", stage="failed"}
+
+# Monitor builds in progress
+{job="bootstrap", stage=~"building|success"}
+```
+
+Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
+
+## Verification
+
+1. Check bootstrap completed successfully:
+   ```
+   {job="bootstrap", hostname="<hostname>", stage="success"}
+   ```
+
+2. Verify the host is up and reporting metrics:
+   ```promql
+   up{instance=~"<hostname>.*"}
+   ```
+
+3. Verify the correct flake revision is deployed:
+   ```promql
+   nixos_flake_info{instance=~"<hostname>.*"}
+   ```
+
+4. Check logs are flowing:
+   ```
+   {hostname="<hostname>"}
+   ```
+
+5. Confirm expected services are running and producing logs
+
+## Troubleshooting
+
+### Bootstrap Failed
+
+#### Common Issues
+
+* The VM has trouble running the initial nixos-rebuild. This usually happens when it has to compile packages from scratch because they are not available in our local nix-cache.
+
+#### Troubleshooting
+
+1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
+   ```
+   {job="bootstrap", hostname="<hostname>"}
+   ```
+
+2. **USER**: SSH into the host and check the bootstrap service:
+   ```bash
+   ssh root@<hostname>
+   journalctl -u nixos-bootstrap.service
+   ```
+
+3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
+   ```bash
+   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
+   ```
+
+4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
+
+### Vault Credentials Not Working
+
+Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or has already been used.
+
+#### Troubleshooting
+
+1. Check if credentials exist on the host:
+   ```bash
+   ssh root@<hostname>
+   ls -la /var/lib/vault/approle/
+   ```
+
+2. Check bootstrap logs for vault-related stages:
+   ```
+   {job="bootstrap", hostname="<hostname>", stage=~"vault.*"}
+   ```
+
+3. **USER**: Regenerate and provision credentials manually:
+   ```bash
+   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
+   ```
+
+### Host Not Appearing in DNS
+
+Usually caused by not having deployed the commit with the new host to ns1/ns2.
+
+#### Troubleshooting
+
+1. Verify the host config has a static IP configured in `systemd.network.networks`
+
+2. Check that `homelab.dns.enable` is not set to `false`
+
+3. **USER**: Trigger auto-upgrade on DNS servers:
+   ```bash
+   ssh root@ns1 systemctl start nixos-upgrade.service
+   ssh root@ns2 systemctl start nixos-upgrade.service
+   ```
+
+4. Verify DNS resolution after upgrade completes:
+   ```bash
+   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
+   ```
+
+### Host Not Being Scraped by Prometheus
+
+Usually caused by not having deployed the commit with the new host to the monitoring host.
+
+#### Troubleshooting
+
+1. Check that `homelab.monitoring.enable` is not set to `false`
+
+2. **USER**: Trigger auto-upgrade on monitoring01:
+   ```bash
+   ssh root@monitoring01 systemctl start nixos-upgrade.service
+   ```
+
+3. Verify the target appears in Prometheus:
+   ```promql
+   up{instance=~"<hostname>.*"}
+   ```
+
+4. If the target is down, check that node-exporter is running on the host:
+   ```bash
+   ssh root@<hostname> systemctl status prometheus-node-exporter.service
+   ```
+
+## Related Files
+
+| Path | Description |
+|------|-------------|
+| `scripts/create-host/` | The `create-host` script that generates host configurations |
+| `hosts/template2/` | Template VM configuration (base image for new VMs) |
+| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
+| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
+| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
+| `terraform/vault/approle.tf` | AppRole policies for each host |
+| `terraform/vault/secrets.tf` | Secret definitions in Vault |
+| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
+| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
+| `flake.nix` | Flake with all host configurations (add new hosts here) |
--- a/docs/plans/auth-system-replacement.md
+++ b/docs/plans/auth-system-replacement.md
@@ -1,192 +0,0 @@
-# Authentication System Replacement Plan
-
-## Overview
-
-Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
-
-## Goals
-
-1. **Central user database** - Manage users across all homelab hosts from a single source
-2. **Linux PAM/NSS integration** - Users can SSH into hosts using central credentials
-3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
-4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)
-
-## Options Evaluated
-
-### OpenLDAP (raw)
-
-- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
-- **Pros:** Most widely supported, very flexible
-- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
-- **Verdict:** Doesn't address LDAP complexity concerns
-
-### LLDAP + Authelia (current)
-
-- **NixOS Support:** Both have good modules
-- **Pros:** Already configured, lightweight, nice web UIs
-- **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
-- **Verdict:** Workable but has friction for NAS/UID goals
-
-### FreeIPA
-
-- **NixOS Support:** None
-- **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
-- **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
-- **Verdict:** Overkill, no NixOS support
-
-### Keycloak
-
-- **NixOS Support:** None
-- **Pros:** Good OIDC/SAML, nice UI
-- **Cons:** Primarily an identity broker not a user directory, poor POSIX support, heavy (Java)
-- **Verdict:** Wrong tool for Linux user management
-
-### Authentik
-
-- **NixOS Support:** None (would need Docker)
-- **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
-- **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
-- **Verdict:** Would work but requires Docker and is heavy
-
-### Kanidm
-
-- **NixOS Support:** Excellent - first-class module with PAM/NSS integration
-- **Pros:**
-  - Native PAM/NSS module (no SSSD needed)
-  - Built-in OIDC provider
-  - Optional LDAP interface for legacy services
-  - Declarative provisioning via NixOS (users, groups, OAuth2 clients)
-  - Modern, written in Rust
-  - Single service handles everything
-- **Cons:** Newer project, smaller community than LDAP
-- **Verdict:** Best fit for requirements
-
-### Pocket-ID
-
-- **NixOS Support:** Unknown
-- **Pros:** Very lightweight, passkey-first
-- **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
-- **Verdict:** Doesn't solve Linux user management goal
-
-## Recommendation: Kanidm
-
-Kanidm is the recommended solution for the following reasons:
-
-| Requirement | Kanidm Support |
-|-------------|----------------|
-| Central user database | Native |
-| Linux PAM/NSS (host login) | Native NixOS module |
-| UID/GID for NAS | POSIX attributes supported |
-| OIDC for services | Built-in |
-| Declarative config | Excellent NixOS provisioning |
-| Simplicity | Modern API, LDAP optional |
-| NixOS integration | First-class |
-
-### Key NixOS Features
-
-**Server configuration:**
-```nix
-services.kanidm.enableServer = true;
-services.kanidm.serverSettings = {
-  domain = "home.2rjus.net";
-  origin = "https://auth.home.2rjus.net";
-  ldapbindaddress = "0.0.0.0:636";  # Optional LDAP interface
-};
-```
-
-**Declarative user provisioning:**
-```nix
-services.kanidm.provision.enable = true;
-services.kanidm.provision.persons.torjus = {
-  displayName = "Torjus";
-  groups = [ "admins" "nas-users" ];
-};
-```
-
-**Declarative OAuth2 clients:**
-```nix
-services.kanidm.provision.systems.oauth2.grafana = {
-  displayName = "Grafana";
-  originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
-  originLanding = "https://grafana.home.2rjus.net";
-};
-```
-
-**Client host configuration (add to system/):**
-```nix
-services.kanidm.enableClient = true;
-services.kanidm.enablePam = true;
-services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
-```
-
-## NAS Integration
-
-### Current: TrueNAS CORE (FreeBSD)
-
-TrueNAS CORE has a built-in LDAP client. Kanidm's read-only LDAP interface will work for NFS share permissions:
-
-- **NFS shares**: Only need consistent UID/GID mapping - Kanidm's LDAP provides this
-- **No SMB requirement**: SMB would need Samba schema attributes (deprecated in TrueNAS 13.0+), but we're NFS-only
-
-Configuration approach:
-1. Enable Kanidm's LDAP interface (`ldapbindaddress = "0.0.0.0:636"`)
-2. Import internal CA certificate into TrueNAS
-3. Configure TrueNAS LDAP client with Kanidm's Base DN and bind credentials
-4. Users/groups appear in TrueNAS permission dropdowns
-
-Note: Kanidm's LDAP is read-only and uses LDAPS only (no StartTLS). This is fine for our use case.
-
-### Future: NixOS NAS
-
-When the NAS is migrated to NixOS, it becomes a first-class citizen:
-
-- Native Kanidm PAM/NSS integration (same as other hosts)
-- No LDAP compatibility layer needed
-- Full integration with the rest of the homelab
-
-This future migration path is a strong argument for Kanidm over LDAP-only solutions.
-
-## Implementation Steps
-
-1. **Create Kanidm service module** in `services/kanidm/`
-   - Server configuration
-   - TLS via internal ACME
-   - Vault secrets for admin passwords
-
-2. **Configure declarative provisioning**
-   - Define initial users and groups
-   - Set up POSIX attributes (UID/GID ranges)
-
-3. **Add OIDC clients** for homelab services
-   - Grafana
-   - Other services as needed
-
-4. **Create client module** in `system/` for PAM/NSS
-   - Enable on all hosts that need central auth
-   - Configure trusted CA
-
-5. **Test NAS integration**
-   - Configure TrueNAS LDAP client to connect to Kanidm
-   - Verify UID/GID mapping works with NFS shares
-
-6. **Migrate auth01**
-   - Remove LLDAP and Authelia services
-   - Deploy Kanidm
-   - Update DNS CNAMEs if needed
-
-7. **Documentation**
-   - User management procedures
-   - Adding new OAuth2 clients
-   - Troubleshooting PAM/NSS issues
-
-## Open Questions
-
-- What UID/GID range should be reserved for Kanidm-managed users?
-- Which hosts should have PAM/NSS enabled initially?
-- What OAuth2 clients are needed at launch?
-
-## References
-
-- [Kanidm Documentation](https://kanidm.github.io/kanidm/stable/)
-- [NixOS Kanidm Module](https://search.nixos.org/options?query=services.kanidm)
-- [Kanidm PAM/NSS Integration](https://kanidm.github.io/kanidm/stable/pam_and_nsswitch.html)
--- a/docs/plans/completed/auth-system-replacement.md
+++ b/docs/plans/completed/auth-system-replacement.md
@@ -0,0 +1,183 @@
+# Authentication System Replacement Plan
+
+## Overview
+
+Deploy a modern, unified authentication solution for the homelab. Provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.
+
+## Goals
+
+1. **Central user database** - Manage users across all homelab hosts from a single source
+2. **Linux PAM/NSS integration** - Users can SSH into hosts using central credentials
+3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
+4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)
+
+## Solution: Kanidm
+
+Kanidm was chosen for the following reasons:
+
+| Requirement | Kanidm Support |
+|-------------|----------------|
+| Central user database | Native |
+| Linux PAM/NSS (host login) | Native NixOS module |
+| UID/GID for NAS | POSIX attributes supported |
+| OIDC for services | Built-in |
+| Declarative config | Excellent NixOS provisioning |
+| Simplicity | Modern API, LDAP optional |
+| NixOS integration | First-class |
+
+### Configuration Files
+
+- **Host configuration:** `hosts/kanidm01/`
+- **Service module:** `services/kanidm/default.nix`
+
+## NAS Integration
+
+### Current: TrueNAS CORE (FreeBSD)
+
+TrueNAS CORE has a built-in LDAP client. Kanidm's read-only LDAP interface will work for NFS share permissions:
+
+- **NFS shares**: Only need consistent UID/GID mapping - Kanidm's LDAP provides this
+- **No SMB requirement**: SMB would need Samba schema attributes (deprecated in TrueNAS 13.0+), but we're NFS-only
+
+Configuration approach:
+1. Enable Kanidm's LDAP interface (`ldapbindaddress = "0.0.0.0:636"`)
+2. Import internal CA certificate into TrueNAS
+3. Configure TrueNAS LDAP client with Kanidm's Base DN and bind credentials
+4. Users/groups appear in TrueNAS permission dropdowns
+
+Note: Kanidm's LDAP is read-only and uses LDAPS only (no StartTLS). This is fine for our use case.
+
+### Future: NixOS NAS
+
+When the NAS is migrated to NixOS, it becomes a first-class citizen:
+
+- Native Kanidm PAM/NSS integration (same as other hosts)
+- No LDAP compatibility layer needed
+- Full integration with the rest of the homelab
+
+This future migration path is a strong argument for Kanidm over LDAP-only solutions.
+
+## Implementation Steps
+
+1. **Create kanidm01 host and service module** ✅
+   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
+   - Service module: `services/kanidm/`
+   - TLS via internal ACME (`auth.home.2rjus.net`)
+   - Vault integration for idm_admin password
+   - LDAPS on port 636
+
+2. **Configure provisioning** ✅
+   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
+   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
+   - POSIX attributes enabled (UID/GID range 65,536-69,999)
+
+3. **Test NAS integration** (in progress)
+   - ✅ LDAP interface verified working
+   - Configure TrueNAS LDAP client to connect to Kanidm
+   - Verify UID/GID mapping works with NFS shares
+
+4. **Add OIDC clients** for homelab services
+   - Grafana
+   - Other services as needed
+
+5. **Create client module** in `system/` for PAM/NSS ✅
+   - Module: `system/kanidm-client.nix`
+   - `homelab.kanidm.enable = true` enables PAM/NSS
+   - Short usernames (not SPN format)
+   - Home directory symlinks via `home_alias`
+   - Enabled on test tier: testvm01, testvm02, testvm03
+
+6. **Documentation** ✅
+   - `docs/user-management.md` - CLI workflows, troubleshooting
+   - User/group creation procedures verified working
+
+## Progress
+
+### Completed (2026-02-08)
+
+**Kanidm server deployed on kanidm01 (test tier):**
+- Host: `kanidm01.home.2rjus.net` (10.69.13.23)
+- WebUI: `https://auth.home.2rjus.net`
+- LDAPS: port 636
+- Valid certificate from internal CA
+
+**Configuration:**
+- Kanidm 1.8 with secret provisioning support
+- Daily backups at 22:00 (7 versions retained)
+- Vault integration for idm_admin password
+- Prometheus monitoring scrape target configured
+
+**Provisioned entities:**
+- Groups: `admins`, `users`, `ssh-users` (declarative)
+- Users managed via CLI (imperative)
+
+**Verified working:**
+- WebUI login with idm_admin
+- LDAP bind and search with POSIX-enabled user
+- LDAPS with valid internal CA certificate
+
+### Completed (2026-02-08) - PAM/NSS Client
+
+**Client module deployed (`system/kanidm-client.nix`):**
+- `homelab.kanidm.enable = true` enables PAM/NSS integration
+- Connects to auth.home.2rjus.net
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based dir)
+- Login restricted to `ssh-users` group
+
+**Enabled on test tier:**
+- testvm01, testvm02, testvm03
+
+**Verified working:**
+- User/group resolution via `getent`
+- SSH login with Kanidm unix passwords
+- Home directory creation with symlinks
+- Imperative user/group creation via CLI
+
+**Documentation:**
+- `docs/user-management.md` with full CLI workflows
+- Password requirements (min 10 chars)
+- Troubleshooting guide (nscd, cache invalidation)
+
+### UID/GID Range (Resolved)
+
+**Range: 65,536 - 69,999** (manually allocated)
+
+- Users: 65,536 - 67,999 (up to ~2500 users)
+- Groups: 68,000 - 69,999 (up to ~2000 groups)
+
+Rationale:
+- Starts at Kanidm's recommended minimum (65,536)
+- Well above NixOS system users (typically <1000)
+- Avoids Podman/container issues with very high GIDs
+
+### Completed (2026-02-08) - OAuth2/OIDC for Grafana
+
+**OAuth2 client deployed for Grafana on monitoring02:**
+- Client ID: `grafana`
+- Redirect URL: `https://grafana-test.home.2rjus.net/login/generic_oauth`
+- Scope maps: `openid`, `profile`, `email`, `groups` for `users` group
+- Role mapping: `admins` group → Grafana Admin, others → Viewer
+
+**Configuration locations:**
+- Kanidm OAuth2 client: `services/kanidm/default.nix`
+- Grafana OIDC config: `services/grafana/default.nix`
+- Vault secret: `services/grafana/oauth2-client-secret`
+
+**Key findings:**
+- PKCE is required by Kanidm - enable `use_pkce = true` in Grafana
+- Must set `email_attribute_path`, `login_attribute_path`, `name_attribute_path` to extract from userinfo
+- Users need: primary credential (password + TOTP for MFA), membership in `users` group, email address set
+- Unix password is separate from primary credential (web login requires primary credential)
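+
+The findings above could translate into a Grafana configuration along these lines. This is a minimal sketch under stated assumptions, not the contents of `services/grafana/default.nix`: the secret file path is hypothetical, the endpoint URLs follow Kanidm's documented OAuth2 URL layout, and the attribute paths and the group value used in `role_attribute_path` should be checked against what the userinfo endpoint actually returns.
+
+```nix
+{ ... }:
+{
+  services.grafana.settings."auth.generic_oauth" = {
+    enabled = true;
+    name = "Kanidm";
+    client_id = "grafana";
+    # Hypothetical path; Grafana reads the secret from the file at startup.
+    client_secret = "$__file{/run/secrets/grafana-oauth2-client-secret}";
+    scopes = "openid profile email groups";
+
+    # Assumed endpoints, based on Kanidm's standard OAuth2 URL layout.
+    auth_url = "https://auth.home.2rjus.net/ui/oauth2";
+    token_url = "https://auth.home.2rjus.net/oauth2/token";
+    api_url = "https://auth.home.2rjus.net/oauth2/openid/grafana/userinfo";
+
+    # Kanidm rejects authorization requests without PKCE.
+    use_pkce = true;
+
+    # Tell Grafana which userinfo fields to use (claim names are assumptions).
+    login_attribute_path = "preferred_username";
+    name_attribute_path = "name";
+    email_attribute_path = "email";
+
+    # Members of `admins` become Grafana Admin, everyone else Viewer.
+    # The exact group string in the claim is an assumption.
+    role_attribute_path = "contains(groups[*], 'admins@home.2rjus.net') && 'Admin' || 'Viewer'";
+  };
+}
+```
+
+If login succeeds but the user shows up without a name or email, the attribute paths are the first thing to compare against the userinfo response.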
+
+### Next Steps
+
+1. Enable PAM/NSS on production hosts (after test tier validation)
+2. Configure TrueNAS LDAP client for NAS integration testing
+3. Add OAuth2 clients for other services as needed
+
+## References
+
+- [Kanidm Documentation](https://kanidm.github.io/kanidm/stable/)
+- [NixOS Kanidm Module](https://search.nixos.org/options?query=services.kanidm)
+- [Kanidm PAM/NSS Integration](https://kanidm.github.io/kanidm/stable/pam_and_nsswitch.html)
--- a/docs/plans/completed/automated-host-deployment-pipeline.md
+++ b/docs/plans/completed/automated-host-deployment-pipeline.md
--- a/docs/plans/completed/bootstrap-cache.md
+++ b/docs/plans/completed/bootstrap-cache.md
@@ -0,0 +1,35 @@
+# Plan: Configure Template2 to Use Nix Cache
+
+## Problem
+
+New VMs bootstrapped from template2 don't use our local nix cache (nix-cache.home.2rjus.net) during the initial `nixos-rebuild boot`. This means the first build downloads everything from cache.nixos.org, which is slower and uses more bandwidth.
+
+## Solution
+
+Update the template2 base image to include the nix cache configuration, so new VMs immediately benefit from cached builds during bootstrap.
+
+## Implementation
+
+1. Add nix cache configuration to `hosts/template2/configuration.nix`:
+   ```nix
+   nix.settings = {
+     substituters = [ "https://nix-cache.home.2rjus.net" "https://cache.nixos.org" ];
+     trusted-public-keys = [
+       "nix-cache.home.2rjus.net:..."  # Add the cache's public key
+       "cache.nixos.org-1:..."
+     ];
+   };
+   ```
+
+2. Rebuild and redeploy the Proxmox template:
+   ```bash
+   nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
+   ```
+
+3. Update `default_template_name` in `terraform/variables.tf` if the template name changed
+
+## Benefits
+
+- Faster VM bootstrap times
+- Reduced bandwidth to external cache
+- Most derivations will already be cached from other hosts
--- a/docs/plans/completed/cert-monitoring.md
+++ b/docs/plans/completed/cert-monitoring.md
@@ -0,0 +1,72 @@
+# Certificate Monitoring Plan
+
+## Summary
+
+This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.
+
+## What Was Removed
+
+### labmon Service
+
+The `labmon` service was a custom Go application that provided:
+
+1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
+2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration
+
+The service exposed Prometheus metrics at `:9969` including:
+- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
+- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
+- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration
+
+### Affected Files
+
+- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
+- `services/monitoring/prometheus.nix` - Removed labmon scrape target
+- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
+- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
+- `services/monitoring/default.nix` - Removed alloy.nix import
+
+### Removed Alerts
+
+- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
+- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
+- `certificate_check_error` - Warned when TLS connection check failed
+- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates
+
+## Why It Was Removed
+
+1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
+2. **Outdated codebase**: labmon was a custom tool that required maintenance
+3. **Limited value**: With ACME auto-renewal, certificates should renew automatically
+
+## Current State
+
+ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.
+
+## Future Needs
+
+While ACME handles renewal automatically, we should consider monitoring for:
+
+1. **ACME renewal failures**: Alert when a certificate fails to renew
+   - Could monitor ACME client logs (via Loki queries)
+   - Could check certificate file modification times
+
+2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures
+
+3. **Certificate transparency**: Monitor for unexpected certificate issuance
+
+### Potential Solutions
+
+1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
+   - `probe_ssl_earliest_cert_expiry` metric
+   - Already a standard tool, well-maintained
+
+2. **Custom Loki alerting**: Query ACME service logs for renewal failures
+   - Works with existing infrastructure
+   - No additional services needed
+
+3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics
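+
+As a rough illustration of the blackbox_exporter option, the sketch below probes one TLS endpoint and alerts on the `probe_ssl_earliest_cert_expiry` metric. It is not an implemented configuration: the module name `tls_connect`, the job name, the probed target, and the assumption that the exporter runs on the Prometheus host (default port 9115) are all illustrative. The 24h threshold mirrors the removed `certificate_expiring_soon` alert.
+
+```nix
+{ pkgs, ... }:
+{
+  # Exporter with a single TCP+TLS probe module (JSON is valid YAML).
+  services.prometheus.exporters.blackbox = {
+    enable = true;
+    configFile = pkgs.writeText "blackbox.yml" (builtins.toJSON {
+      modules.tls_connect = {
+        prober = "tcp";
+        tcp.tls = true;
+      };
+    });
+  };
+
+  # Scrape job that routes each target through the exporter's /probe endpoint.
+  services.prometheus.scrapeConfigs = [
+    {
+      job_name = "tls_certs";
+      metrics_path = "/probe";
+      params.module = [ "tls_connect" ];
+      static_configs = [ { targets = [ "vault.home.2rjus.net:8200" ]; } ];
+      relabel_configs = [
+        { source_labels = [ "__address__" ]; target_label = "__param_target"; }
+        { source_labels = [ "__param_target" ]; target_label = "instance"; }
+        { target_label = "__address__"; replacement = "localhost:9115"; }
+      ];
+    }
+  ];
+
+  # Last-resort alert if renewal has silently failed.
+  services.prometheus.rules = [
+    ''
+      groups:
+        - name: certificate_rules
+          rules:
+            - alert: certificate_expiring_soon
+              expr: probe_ssl_earliest_cert_expiry - time() < 86400
+              labels:
+                severity: warning
+    ''
+  ];
+}
+```
+
+With short-lived ACME certificates the threshold needs to sit below the normal renewal point, otherwise the alert fires on every healthy certificate.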
+
+## Status
+
+**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.
--- a/docs/plans/completed/garage-s3-storage.md
+++ b/docs/plans/completed/garage-s3-storage.md
@@ -0,0 +1,46 @@
+# Garage S3 Storage Server
+
+## Overview
+
+Deploy a Garage instance for self-hosted S3-compatible object storage.
+
+## Garage Basics
+
+- S3-compatible distributed object storage designed for self-hosting
+- Supports per-key, per-bucket permissions (read/write/owner)
+- Keys without explicit grants have no access
+
+## NixOS Module
+
+Available as `services.garage` with these key options:
+
+- `services.garage.enable` - Enable the service
+- `services.garage.package` - Must be set explicitly
+- `services.garage.settings` - Freeform TOML config (replication mode, ports, RPC, etc.)
+- `services.garage.settings.metadata_dir` - Metadata storage (SSD recommended)
+- `services.garage.settings.data_dir` - Data block storage (supports multiple dirs since v0.9)
+- `services.garage.environmentFile` - For secrets like `GARAGE_RPC_SECRET`
+- `services.garage.logLevel` - error/warn/info/debug/trace
+
+The NixOS module only manages the server daemon. Buckets and keys are managed externally.
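+
+A single-node setup using the options listed above might look like the following. Treat it as a sketch rather than a tested configuration: the environment file path is hypothetical, the ports are Garage's conventional defaults, and `replication_factor` assumes a Garage release that supports that key (older releases use `replication_mode` instead).
+
+```nix
+{ pkgs, ... }:
+{
+  services.garage = {
+    enable = true;
+    # Must be set explicitly; pick the package attribute for the desired version.
+    package = pkgs.garage;
+    logLevel = "info";
+
+    # Hypothetical path; the file would contain GARAGE_RPC_SECRET=<secret>.
+    environmentFile = "/run/secrets/garage-env";
+
+    settings = {
+      metadata_dir = "/var/lib/garage/meta"; # ideally on SSD
+      data_dir = "/var/lib/garage/data";
+      replication_factor = 1; # single node, no redundancy
+
+      rpc_bind_addr = "[::]:3901";
+      s3_api = {
+        api_bind_addr = "[::]:3900";
+        s3_region = "garage";
+      };
+    };
+  };
+}
+```
+
+Keeping the RPC secret in `environmentFile` keeps it out of the world-readable Nix store, which fits the Vault-based secret handling used elsewhere in the repo.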
+
+## Bucket/Key Management
+
+No declarative NixOS options for buckets or keys. Two options:
+
+1. **Terraform provider** - `jkossis/terraform-provider-garage` manages buckets, keys, and permissions via the Garage Admin API v2. Could live in `terraform/garage/` similar to `terraform/vault/`.
+2. **CLI** - `garage key create`, `garage bucket create`, `garage bucket allow`
+
+## Integration Ideas
+
+- Store Garage API keys in Vault, fetch via `vault.secrets` on consuming hosts
+- Terraform manages both Vault secrets and Garage buckets/keys
+- Enable admin API with token for Terraform provider access
+- Add Prometheus metrics scraping (Garage exposes metrics endpoint)
+
+## Open Questions
+
+- Single-node or multi-node replication?
+- Which host to deploy on?
+- What to store? (backups, media, app data)
+- Expose via HTTP proxy or direct S3 API only?
--- a/docs/plans/completed/monitoring02-reboot-alert-investigation.md
+++ b/docs/plans/completed/monitoring02-reboot-alert-investigation.md
@@ -0,0 +1,135 @@
+# monitoring02 Reboot Alert Investigation
+
+**Date:** 2026-02-10
+**Status:** Completed - False positive identified
+
+## Summary
+
+A `host_reboot` alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a **false positive** caused by NTP clock adjustments, not an actual reboot.
+
+## Alert Details
+
+- **Alert:** `host_reboot`
+- **Rule:** `changes(node_boot_time_seconds[10m]) > 0`
+- **Host:** monitoring02
+- **Time:** 2026-02-10T16:27:36Z
+
+## Investigation Findings
+
+### Evidence Against Actual Reboot
+
+1. **Uptime:** System had been up for ~40 hours (143,751 seconds) at time of alert
+2. **Consistent BOOT_ID:** All logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time
+3. **No log gaps:** Logs were continuous - no shutdown/restart cycle visible
+4. **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal
+
+### Root Cause: NTP Clock Adjustment
+
+The `node_boot_time_seconds` metric fluctuated by 1 second due to how Linux calculates boot time:
+
+```
+btime = current_wall_clock_time - monotonic_uptime
+```
+
+When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:
+
+| Metric | Value |
+|--------|-------|
+| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
+| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
+| `node_timex_sync_status` | 1 (synced) |
+| Current `node_timex_offset_seconds` | ~9ms (normal) |
+
+The kernel's estimated maximum clock error spiked to over 1 second, which is consistent with a wall-clock adjustment large enough to shift the computed boot time by 1 second momentarily.
+
+Additionally, `systemd-resolved` logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.
+
+## Current Time Sync Configuration
+
+### NixOS Guests
+- **NTP client:** systemd-timesyncd (NixOS default)
+- **No explicit configuration** in the codebase
+- Uses default NixOS NTP server pool
+
+### Proxmox VMs
+- **Clocksource:** `kvm-clock` (optimal for KVM VMs)
+- **QEMU guest agent:** Enabled
+- **No additional QEMU timing args** configured
+
+## Potential Improvements
+
+### 1. Improve Alert Rule (Recommended)
+
+Add tolerance to filter out small NTP adjustments:
+
+```yaml
+# Current rule (triggers on any change)
+expr: changes(node_boot_time_seconds[10m]) > 0
+
+# Improved rule (requires >60 second shift)
+expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
+```
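+
+As a sketch, the improved expression could be wrapped in a complete rule as shown here. Only the `expr` comes from this investigation; the severity and annotation text are assumptions for illustration and are not taken from `services/monitoring/rules.yml`:
+
+```yaml
+# Hypothetical full rule; only the expr is from the investigation.
+- alert: host_reboot
+  # Fire only when the boot time changed AND moved by more than 60 seconds,
+  # so 1-2 second NTP corrections no longer trigger the alert.
+  expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
+  labels:
+    severity: info  # assumed severity
+  annotations:
+    summary: "Host {{ $labels.hostname }} rebooted"
+    description: "Boot time of {{ $labels.hostname }} shifted by more than 60 seconds."
+```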
+
+### 2. Switch to Chrony (Optional)
+
+Chrony handles time adjustments more gracefully than systemd-timesyncd:
+
+```nix
+# In common/vm/qemu-guest.nix
+{
+  services.qemuGuest.enable = true;
+
+  services.timesyncd.enable = false;
+  services.chrony = {
+    enable = true;
+    extraConfig = ''
+      makestep 1 3
+      rtcsync
+    '';
+  };
+}
+```
+
+### 3. Add QEMU Timing Args (Optional)
+
+In `terraform/vms.tf`:
+
+```hcl
+args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
+```
+
+### 4. Local NTP Server (Optional)
+
+Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.
+
+## Monitoring NTP Health
+
+The `node_timex_*` metrics from node_exporter provide visibility into NTP health:
+
+```promql
+# Clock offset from reference
+node_timex_offset_seconds
+
+# Sync status (1 = synced)
+node_timex_sync_status
+
+# Maximum estimated error - useful for alerting
+node_timex_maxerror_seconds
+```
+
+A potential alert for NTP issues:
+
+```yaml
+- alert: ntp_clock_drift
+  expr: node_timex_maxerror_seconds > 1
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High clock drift on {{ $labels.hostname }}"
+    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
+```
+
+## Conclusion
+
+No action required for the alert itself - the system was healthy. Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.
--- a/docs/plans/completed/nats-deploy-service.md
+++ b/docs/plans/completed/nats-deploy-service.md
--- a/docs/plans/completed/nix-cache-reprovision.md
+++ b/docs/plans/completed/nix-cache-reprovision.md
@@ -0,0 +1,156 @@
+# Nix Cache Host Reprovision
+
+## Overview
+
+Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
+1. NATS-based remote build triggering (replacing the current bash script)
+2. Safer flake update workflow that validates builds before pushing to master
+
+## Status
+
+- **Phase 1: New Build Host** - COMPLETE
+- **Phase 2: NATS Build Triggering** - COMPLETE
+- **Phase 3: Safe Flake Update Workflow** - NOT STARTED
+- **Phase 4: Complete Migration** - COMPLETE
+- **Phase 5: Scheduled Builds** - COMPLETE
+
+## Completed Work
+
+### New Build Host (nix-cache02)
+
+Instead of reprovisioning nix-cache01 in-place, we created a new host `nix-cache02` at 10.69.13.25:
+
+- **Specs**: 8 CPU cores, 16GB RAM (temporary; to be increased to 24GB after nix-cache01 is decommissioned), 200GB disk
+- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
+- **Builder service** configured with two repos:
+  - `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
+  - `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`
+
+### NATS-Based Build Triggering
+
+The `homelab-deploy` tool was extended with a builder mode:
+
+**NATS Subjects:**
+- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`
+
+**NATS Permissions (in DEPLOY account):**
+
+| User | Publish | Subscribe |
+|------|---------|-----------|
+| Builder | `build.responses.>` | `build.>` |
+| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
+| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |
+
+**Vault Secrets:**
+- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication
+
+**NixOS Configuration:**
+- `hosts/nix-cache02/builder.nix` - Builder service configuration
+- `services/nats/default.nix` - Updated with builder NATS user
+
+**MCP Integration:**
+- `.mcp.json` updated with `--enable-builds` flag
+- Build tool available via MCP for Claude Code
+
+**Tested:**
+- Single host build: `build nixos-servers testvm01` (~30s)
+- All hosts build: `build nixos-servers all` (16 hosts in ~226s)
+
+### Harmonia Binary Cache
+
+- Parameterized `services/nix-cache/harmonia.nix` to use hostname-based Vault paths
+- Parameterized `services/nix-cache/proxy.nix` for hostname-based domain
+- New signing key: `nix-cache02.home.2rjus.net-1`
+- Vault secret: `hosts/nix-cache02/cache-secret`
+- Removed unused Gitea Actions runner from nix-cache01
+
+## Current State
+
+### nix-cache02 (Active)
+- Running at 10.69.13.25
+- Serving `https://nix-cache.home.2rjus.net` (canonical URL)
+- Builder service active, responding to NATS build requests
+- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
+- Harmonia binary cache server running
+- Signing key: `nix-cache02.home.2rjus.net-1`
+- Prod tier with `build-host` role
+
+### nix-cache01 (Decommissioned)
+- VM deleted from Proxmox
+- Host configuration removed from repo
+- Vault AppRole and secrets removed
+- Old signing key removed from trusted-public-keys
+
+## Remaining Work
+
+### Phase 3: Safe Flake Update Workflow
+
+1. Create `.github/workflows/flake-update-safe.yaml`
+2. Disable or remove old `flake-update.yaml`
+3. Test manually with `workflow_dispatch`
+4. Monitor first automated run
+
+### Phase 4: Complete Migration ✅
+
+1. ~~**Add Harmonia to nix-cache02**~~ ✅ Done - new signing key, parameterized service
+2. ~~**Add trusted public key to all hosts**~~ ✅ Done - `system/nix.nix` updated
+3. ~~**Test cache from other hosts**~~ ✅ Done - verified from testvm01
+4. ~~**Update proxy and DNS**~~ ✅ Done - `nix-cache.home.2rjus.net` CNAME now points to nix-cache02
+5. ~~**Deploy to all hosts**~~ ✅ Done - all hosts have new trusted key
+6. ~~**Decommission nix-cache01**~~ ✅ Done - 2026-02-10:
+   - Removed `hosts/nix-cache01/` directory
+   - Removed `services/nix-cache/build-flakes.{nix,sh}`
+   - Removed Vault AppRole and secrets
+   - Removed old signing key from `system/nix.nix`
+   - Removed from `flake.nix`
+   - Deleted VM from Proxmox
+
+### Phase 5: Scheduled Builds ✅
+
+Implemented a systemd timer on nix-cache02 that triggers builds every 2 hours:
+
+- **Timer**: `scheduled-build.timer` runs every 2 hours with 5m random jitter
+- **Service**: `scheduled-build.service` calls `homelab-deploy build` for both repos
+- **Authentication**: Dedicated scheduler NKey stored in Vault
+- **NATS user**: Added to DEPLOY account with publish `build.>` and subscribe `build.responses.>`
+
+Files:
+- `hosts/nix-cache02/scheduler.nix` - Timer and service configuration
+- `services/nats/default.nix` - Scheduler NATS user
+- `terraform/vault/secrets.tf` - Scheduler NKey secret
+- `terraform/vault/variables.tf` - Variable for scheduler NKey
+
+## Resolved Questions
+
+- **Parallel vs sequential builds?** Sequential - hosts share packages, so subsequent builds are fast after the first
+- **What about gunter?** Configured as `nixos` repo in builder settings
+- **Disk size?** 200GB for new host
+- **Build host specs?** 8 cores, 16-24GB RAM (matches nix-cache01)
+
+### Phase 6: Observability
+
+1. **Alerting rules** for build failures:
+   ```promql
+   # Alert if any build fails
+   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
+
+   # Alert if no successful builds in 24h (scheduled builds stopped)
+   time() - homelab_deploy_build_last_success_timestamp > 86400
+   ```
+
+2. **Grafana dashboard** for build metrics:
+   - Build success/failure rate over time
+   - Average build duration per host (histogram)
+   - Build frequency (builds per hour/day)
+   - Last successful build timestamp per repo
+
+Available metrics:
+- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
+- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
+- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
+- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
+- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
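+
+A possible shape for these alerts as Prometheus rules, sketched under the assumption that they would live in `services/monitoring/rules.yml` alongside the existing groups. The group name, alert names, `for` duration, and severities are hypothetical; the expressions and label names are the ones listed in this section:
+
+```yaml
+- name: homelab_deploy_build_rules  # hypothetical group name
+  rules:
+    # Any per-host build failure within the last hour.
+    - alert: build_failed
+      expr: increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
+      labels:
+        severity: warning  # assumed severity
+      annotations:
+        summary: "Build of {{ $labels.host }} failed in repo {{ $labels.repo }}"
+    # No successful build for 24h, e.g. because scheduled builds stopped.
+    - alert: build_stale
+      expr: time() - homelab_deploy_build_last_success_timestamp > 86400
+      for: 1h  # assumed grace period
+      labels:
+        severity: warning  # assumed severity
+      annotations:
+        summary: "No successful build for repo {{ $labels.repo }} in 24h"
+```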
+
+## Open Questions
+
+- [x] ~~When to cut over DNS from nix-cache01 to nix-cache02?~~ Done - 2026-02-10
+- [ ] Implement safe flake update workflow before or after full migration?
--- a/docs/plans/completed/ns1-recreation.md
+++ b/docs/plans/completed/ns1-recreation.md
@@ -0,0 +1,107 @@
+# ns1 Recreation Plan
+
+## Overview
+
+Recreate ns1 using the OpenTofu workflow after the existing VM entered emergency mode due to an incorrect `hardware-configuration.nix` (hardcoded UUIDs that do not match the actual disk layout).
+
+## Current ns1 Configuration to Preserve
+
+- **IP:** 10.69.13.5/24
+- **Gateway:** 10.69.13.1
+- **Role:** Primary DNS (authoritative + resolver)
+- **Services:**
+  - `../../services/ns/master-authorative.nix`
+  - `../../services/ns/resolver.nix`
+- **Metadata:**
+  - `homelab.host.role = "dns"`
+  - `homelab.host.labels.dns_role = "primary"`
+- **Vault:** enabled
+- **Deploy:** enabled
+
+## Execution Steps
+
+### Phase 1: Remove Old Configuration
+
+```bash
+nix develop -c create-host --remove --hostname ns1 --force
+```
+
+This removes:
+- `hosts/ns1/` directory
+- Entry from `flake.nix`
+- Any terraform entries (none exist currently)
+
+### Phase 2: Create New Configuration
+
+```bash
+nix develop -c create-host --hostname ns1 --ip 10.69.13.5/24
+```
+
+This creates:
+- `hosts/ns1/` with template2-based configuration
+- Entry in `flake.nix`
+- Entry in `terraform/vms.tf`
+- Vault wrapped token for bootstrap
+
+### Phase 3: Customize Configuration
+
+After create-host, manually update `hosts/ns1/configuration.nix` to add:
+
+1. DNS service imports:
+   ```nix
+   ../../services/ns/master-authorative.nix
+   ../../services/ns/resolver.nix
+   ```
+
+2. Host metadata:
+   ```nix
+   homelab.host = {
+     tier = "prod";
+     role = "dns";
+     labels.dns_role = "primary";
+   };
+   ```
+
+3. Disable systemd-resolved (conflicts with Unbound):
+   ```nix
+   services.resolved.enable = false;
+   ```
+
+### Phase 4: Commit Changes
+
+```bash
+git add -A
+git commit -m "ns1: recreate with OpenTofu workflow
+
+Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
+that didn't match actual disk layout, causing boot failure.
+
+Recreated using template2-based configuration for OpenTofu provisioning."
+```
+
+### Phase 5: Infrastructure
+
+1. Delete old ns1 VM in Proxmox (it's broken anyway)
+2. Run `nix develop -c tofu -chdir=terraform apply`
+3. Wait for bootstrap to complete
+4. Verify ns1 is functional:
+   - DNS resolution working
+   - Zone transfer to ns2 working
+   - All exporters responding
+
+### Phase 6: Finalize
+
+- Push to master
+- Move this plan to `docs/plans/completed/`
+
+## Rollback
+
+If the new VM fails:
+1. ns2 is still operational as secondary DNS
+2. Can recreate with different settings if needed
+
+## Notes
+
+- ns2 will continue serving DNS during the migration
+- Zone data is generated from flake, so no data loss
+- The old VM's disk can be kept briefly in Proxmox as backup if desired
--- a/docs/plans/completed/openbao-kanidm-oidc.md
+++ b/docs/plans/completed/openbao-kanidm-oidc.md
@@ -0,0 +1,87 @@
+# OpenBao + Kanidm OIDC Integration
+
+## Status: Completed
+
+Implemented 2026-02-09.
+
+## Overview
+
+Enable Kanidm users to authenticate to OpenBao (Vault) using OIDC for Web UI access. Members of the `admins` group get full read/write access to secrets.
+
+## Implementation
+
+### Files Modified
+
+| File | Changes |
+|------|---------|
+| `terraform/vault/oidc.tf` | New - OIDC auth backend and roles |
+| `terraform/vault/policies.tf` | Added oidc-admin and oidc-default policies |
+| `terraform/vault/secrets.tf` | Added OAuth2 client secret |
+| `terraform/vault/approle.tf` | Granted kanidm01 access to openbao secrets |
+| `services/kanidm/default.nix` | Added openbao OAuth2 client, enabled imperative group membership |
+
+### Kanidm Configuration
+
+OAuth2 client `openbao` with:
+- Confidential client (uses client secret)
+- Web UI callback only: `https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback`
+- Legacy crypto enabled (RS256 for OpenBao compatibility)
+- Scope maps for `admins` and `users` groups
+
+Group membership is now managed imperatively (`overwriteMembers = false`) to prevent provisioning from resetting group memberships on service restart.
+
+### OpenBao Configuration
+
+OIDC auth backend at `/oidc` with two roles:
+
+| Role | Bound Claims | Policy | Access |
+|------|--------------|--------|--------|
+| `admin` | `groups = admins@home.2rjus.net` | `oidc-admin` | Full read/write to secrets, system health/metrics |
+| `default` | (none) | `oidc-default` | Token lookup-self, system health |
+
+Both roles request scopes: `openid`, `profile`, `email`, `groups`
+
+### Policies
+
+**oidc-admin:**
+- `secret/*` - create, read, update, delete, list
+- `sys/health` - read
+- `sys/metrics` - read
+- `sys/auth` - read
+- `sys/mounts` - read
+
+**oidc-default:**
+- `auth/token/lookup-self` - read
+- `sys/health` - read
+
+## Usage
+
+### Web UI Login
+1. Navigate to https://vault.home.2rjus.net:8200
+2. Select "OIDC" authentication method
+3. Enter role: `admin` (for admins) or `default` (for any user)
+4. Click "Sign in with OIDC"
+5. Authenticate with Kanidm
+
+### Group Management
+Add users to the `admins` group for full access:
+```bash
+kanidm group add-members admins <username>
+```
+
+## Limitations
+
+**CLI login not supported:** Kanidm requires HTTPS for all redirect URIs on confidential (non-public) OAuth2 clients. The OpenBao CLI uses `http://localhost:8250/oidc/callback`, which Kanidm rejects. A public client would allow localhost redirects, but OpenBao requires a client secret for OIDC auth.
+
+## Lessons Learned
+
+1. **Kanidm group names:** Groups are returned as `groupname@domain` (e.g., `admins@home.2rjus.net`), not just the short name
+2. **RS256 required:** OpenBao only supports RS256 for JWT signing; Kanidm defaults to ES256, requiring `enableLegacyCrypto = true`
+3. **Scope request:** OIDC roles must explicitly request the `groups` scope via `oidc_scopes`
+4. **Provisioning resets:** Kanidm provisioning with default `overwriteMembers = true` resets group memberships on restart
+5. **Two-phase Terraform:** The OAuth2 client secret must exist before the OIDC backend can validate the discovery URL
+
+## References
+
+- [OpenBao JWT/OIDC Auth Method](https://openbao.org/docs/auth/jwt/)
+- [Kanidm OAuth2 Documentation](https://kanidm.github.io/kanidm/stable/integrations/oauth2.html)
--- a/docs/plans/completed/pgdb1-decommission.md
+++ b/docs/plans/completed/pgdb1-decommission.md
@@ -0,0 +1,113 @@
+# pgdb1 Decommissioning Plan
+
+## Overview
+
+Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.
+
+## Pre-flight Verification
+
+Before proceeding, verify that gunter is no longer using pgdb1:
+
+1. Check Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
+2. Optionally: Check pgdb1 for recent connection activity:
+   ```bash
+   ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;"'
+   ```
+
+## Files to Remove
+
+### Host Configuration
+- `hosts/pgdb1/default.nix`
+- `hosts/pgdb1/configuration.nix`
+- `hosts/pgdb1/hardware-configuration.nix`
+- `hosts/pgdb1/` (directory)
+
+### Service Module
+- `services/postgres/postgres.nix`
+- `services/postgres/default.nix`
+- `services/postgres/` (directory)
+
+Note: This service module is only used by pgdb1, so it can be removed entirely.
+
+### Flake Entry
+Remove from `flake.nix` (lines 131-138):
+```nix
+pgdb1 = nixpkgs.lib.nixosSystem {
+  inherit system;
+  specialArgs = {
+    inherit inputs self;
+  };
+  modules = commonModules ++ [
+    ./hosts/pgdb1
+  ];
+};
+```
+
+### Vault AppRole
+Remove from `terraform/vault/approle.tf` (lines 69-73):
+```hcl
+"pgdb1" = {
+  paths = [
+    "secret/data/hosts/pgdb1/*",
+  ]
+}
+```
+
+### Monitoring Rules
+Remove the `postgres_down` alert from `services/monitoring/rules.yml` (lines 359-365):
+```yaml
+- name: postgres_rules
+  rules:
+    - alert: postgres_down
+      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
+      for: 5m
+      labels:
+        severity: critical
+```
+
+### Utility Scripts
+Delete `rebuild-all.sh` entirely (obsolete script).
+
+## Execution Steps
+
+### Phase 1: Verification
+- [ ] Confirm Open WebUI on gunter uses local PostgreSQL
+- [ ] Verify no active connections to pgdb1
+
+### Phase 2: Code Cleanup
+- [ ] Create feature branch: `git checkout -b decommission-pgdb1`
+- [ ] Remove `hosts/pgdb1/` directory
+- [ ] Remove `services/postgres/` directory
+- [ ] Remove pgdb1 entry from `flake.nix`
+- [ ] Remove postgres alert from `services/monitoring/rules.yml`
+- [ ] Delete `rebuild-all.sh` (obsolete)
+- [ ] Run `nix flake check` to verify no broken references
+- [ ] Commit changes
+
+### Phase 3: Terraform Cleanup
+- [ ] Remove pgdb1 from `terraform/vault/approle.tf`
+- [ ] Run `tofu plan` in `terraform/vault/` to preview changes
+- [ ] Run `tofu apply` to remove the AppRole
+- [ ] Commit terraform changes
+
+### Phase 4: Infrastructure Cleanup
+- [ ] Shut down pgdb1 VM in Proxmox
+- [ ] Delete the VM from Proxmox
+- [ ] (Optional) Remove any DNS entries if not auto-generated
+
+### Phase 5: Finalize
+- [ ] Merge feature branch to master
+- [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove DNS entry
+- [ ] Move this plan to `docs/plans/completed/`
+
+## Rollback
+
+If issues arise after decommissioning:
+1. The VM can be recreated from template using the git history
+2. Database data would need to be restored from backup (if any exists)
+
+## Notes
+
+- pgdb1 IP: 10.69.13.16
+- The postgres service allowed connections from gunter (10.69.30.105)
+- No restic backup was configured for this host
--- a/docs/plans/completed/prometheus-scrape-target-labels.md
+++ b/docs/plans/completed/prometheus-scrape-target-labels.md
@@ -1,10 +1,38 @@
 # Prometheus Scrape Target Labels

+## Implementation Status
+
+| Step | Status | Notes |
+|------|--------|-------|
+| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
+| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
+| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
+| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
+| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
+| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
+| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |
+
+**Hosts with metadata configured:**
+- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
+- `nix-cache01`: `role = "build-host"`
+- `vault01`: `role = "vault"`
+- `testvm01/02/03`: `tier = "test"`
+
+**Implementation complete.** Branch: `prometheus-scrape-target-labels`
+
+**Query examples:**
+- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
+- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
+- `up{role="dns"}` - all DNS servers
+- `up{tier="test"}` - all test-tier hosts
+
+---
+
 ## Goal

 Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

-**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
+**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.

 ## Motivation

@@ -54,12 +82,11 @@ or

 ## Implementation

-This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
+This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.

 ### 1. Create `homelab.host` module

-**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
-`modules/homelab/host.nix` with tier, priority, role, and labels options.
+✅ **Complete.** The module is in `modules/homelab/host.nix`.

 Create `modules/homelab/host.nix` with shared host metadata options:

@@ -98,6 +125,8 @@ Import this module in `modules/homelab/default.nix`.

 ### 2. Update `lib/monitoring.nix`

+✅ **Complete.** Labels are now extracted and propagated.
+
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:

@@ -126,6 +155,8 @@ This requires grouping hosts by their label attrset and producing one `static_co

 ### 3. Update `services/monitoring/prometheus.nix`

+✅ **Complete.** Now uses structured static_configs output.
+
 Change the node-exporter scrape config to use the new structured output:

 ```nix
@@ -138,36 +169,37 @@ static_configs = nodeExporterTargets;

 ### 4. Set metadata on hosts

+✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
+
 Example in `hosts/nix-cache01/configuration.nix`:

 ```nix
 homelab.host = {
-  tier = "test";       # can be deployed by MCP (used by homelab-deploy)
  priority = "low";    # relaxed alerting thresholds
  role = "build-host";
 };
 ```

+**Note:** The current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` now that label propagation is implemented.
+
 Example in `hosts/ns1/configuration.nix`:

 ```nix
 homelab.host = {
-  tier = "prod";
-  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
 };
 ```

+**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
+
 ### 5. Update alert rules

-After implementing labels, review and update `services/monitoring/rules.yml`:
+✅ **Complete.** Updated `services/monitoring/rules.yml`:

- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.
+- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
+- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).
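+
+The role-based split of `high_cpu_load` might look roughly like the following. This is a sketch only: the load expression, the 0.7 threshold, the second alert name, and the severities are assumptions for illustration; only the `role` selectors and the 15m/2h durations come from the description above.
+
+```yaml
+# Standard hosts: alert after 15 minutes of sustained load.
+- alert: high_cpu_load
+  # Assumed expression: 5-minute load average above 70% of the CPU count.
+  expr: node_load5{role!="build-host"} > on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) * 0.7
+  for: 15m
+  labels:
+    severity: warning  # assumed severity
+# Build hosts: long builds are expected, so wait 2 hours before alerting.
+- alert: high_cpu_load_build_host  # hypothetical name
+  expr: node_load5{role="build-host"} > on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) * 0.7
+  for: 2h
+  labels:
+    severity: warning  # assumed severity
+```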

-Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
+### 6. Labels for `generateScrapeConfigs` (service targets)

-### 6. Consider labels for `generateScrapeConfigs` (service targets)
-
-The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
+✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
--- a/docs/plans/host-migration-to-opentofu.md
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -9,24 +9,23 @@ hosts are decommissioned or deferred.

 ## Current State

-Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
+Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`

 Hosts to migrate:

 | Host | Category | Notes |
 |------|----------|-------|
-| ns1 | Stateless | Primary DNS, recreate |
-| ns2 | Stateless | Secondary DNS, recreate |
+| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
 | nix-cache01 | Stateless | Binary cache, recreate |
 | http-proxy | Stateless | Reverse proxy, recreate |
 | nats1 | Stateless | Messaging, recreate |
-| auth01 | Decommission | No longer in use |
 | ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
 | monitoring01 | Stateful | Prometheus, Grafana, Loki |
 | jelly01 | Stateful | Jellyfin metadata, watch history, config |
-| pgdb1 | Stateful | PostgreSQL databases |
-| jump | Decommission | No longer needed |
-| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
+| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
+| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
+| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
+| ~~ca~~ | ~~Deferred~~ | ✓ Complete |

 ## Phase 1: Backup Preparation

@@ -46,39 +45,19 @@ No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` whi
 Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
 The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.

-### 1c. Add PostgreSQL Backup to pgdb1
-
-No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
-all databases and roles. The dump should be piped through restic's stdin backup (similar to
-the Grafana DB dump pattern on monitoring01).
-
-### 1d. Verify Existing ha1 Backup
+### 1c. Verify Existing ha1 Backup

 ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
 these backups are current and restorable before proceeding with migration.

-### 1e. Verify All Backups
+### 1d. Verify All Backups

 After adding/expanding backup jobs:
 1. Trigger a manual backup run on each host
 2. Verify backup integrity with `restic check`
 3. Test a restore to a temporary location to confirm data is recoverable

-## Phase 2: Declare pgdb1 Databases in Nix
-
-Before migrating pgdb1, audit the manually-created databases and users on the running
-instance, then declare them in the Nix configuration using `ensureDatabases` and
-`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
-
-Steps:
-1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
-2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
-3. Document any non-default PostgreSQL settings or extensions per database
-
-After reprovisioning, the databases will be created by NixOS, and data restored from the
-`pg_dumpall` backup.
-
-## Phase 3: Stateless Host Migration
+## Phase 2: Stateless Host Migration

 These hosts have no meaningful state and can be recreated fresh. For each host:

@@ -95,13 +74,14 @@ Migrate stateless hosts in an order that minimizes disruption:

 1. **nix-cache01** — low risk, no downstream dependencies during migration
 2. **nats1** — low risk, verify no persistent JetStream streams first
-4. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
-5. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
+3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
+4. ~~**ns1** — ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete

-For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
-use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
+~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ Both ns1
+and ns2 migration complete. Zone transfer (AXFR) verified working between ns1 (primary) and
+ns2 (secondary).

-## Phase 4: Stateful Host Migration
+## Phase 3: Stateful Host Migration

 For each stateful host, the procedure is:

@@ -114,17 +94,7 @@ For each stateful host, the procedure is:
 7. Start services and verify functionality
 8. Decommission the old VM

-### 4a. pgdb1
-
-1. Run final `pg_dumpall` backup via restic
-2. Stop PostgreSQL on the old host
-3. Provision new pgdb1 via OpenTofu
-4. After bootstrap, NixOS creates the declared databases/users
-5. Restore data with `pg_restore` or `psql < dumpall.sql`
-6. Verify database connectivity from gunter (`10.69.30.105`)
-7. Decommission old VM
-
-### 4b. monitoring01
+### 3a. monitoring01

 1. Run final Grafana backup
 2. Provision new monitoring01 via OpenTofu
@@ -134,7 +104,7 @@ For each stateful host, the procedure is:
 6. Verify all scrape targets are being collected
 7. Decommission old VM

-### 4c. jelly01
+### 3b. jelly01

 1. Run final Jellyfin backup
 2. Provision new jelly01 via OpenTofu
@@ -143,7 +113,7 @@ For each stateful host, the procedure is:
 5. Start Jellyfin, verify watch history and library metadata are present
 6. Decommission old VM

-### 4d. ha1
+### 3c. ha1

 1. Verify latest restic backup is current
 2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
@@ -167,47 +137,69 @@ OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropr
 `usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
 through before starting Zigbee2MQTT on the new host.

-## Phase 5: Decommission jump and auth01 Hosts
+## Phase 4: Decommission Hosts

-### jump
-1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
-2. Remove host configuration from `hosts/jump/`
-3. Remove from `flake.nix`
-4. Remove any secrets in `secrets/jump/`
-5. Remove from `.sops.yaml`
+### jump ✓ COMPLETE
+
+~~1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)~~
+~~2. Remove host configuration from `hosts/jump/`~~
+~~3. Remove from `flake.nix`~~
+~~4. Remove any secrets in `secrets/jump/`~~
+~~5. Remove from `.sops.yaml`~~
+~~6. Destroy the VM in Proxmox~~
+~~7. Commit cleanup~~
+
+Host was already removed from flake.nix and VM destroyed. Configuration cleaned up in ba9f47f.
+
+### auth01 ✓ COMPLETE
+
+~~1. Remove host configuration from `hosts/auth01/`~~
+~~2. Remove from `flake.nix`~~
+~~3. Remove any secrets in `secrets/auth01/`~~
+~~4. Remove from `.sops.yaml`~~
+~~5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)~~
+~~6. Destroy the VM in Proxmox~~
+~~7. Commit cleanup~~
+
+Host configuration, services, and VM already removed.
+
+### pgdb1 (in progress)
+
+Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
+
+1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
+2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
+3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
+4. ~~Remove from `flake.nix`~~ ✓
+5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
 6. Destroy the VM in Proxmox
-7. Commit cleanup
+7. ~~Commit cleanup~~ ✓

-### auth01
-1. Remove host configuration from `hosts/auth01/`
-2. Remove from `flake.nix`
-3. Remove any secrets in `secrets/auth01/`
-4. Remove from `.sops.yaml`
-5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
-6. Destroy the VM in Proxmox
-7. Commit cleanup
+See `docs/plans/pgdb1-decommission.md` for detailed plan.

-## Phase 6: Decommission ca Host (Deferred)
+## Phase 5: Decommission ca Host ✓ COMPLETE

-Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
+~~Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
 OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
-the same cleanup steps as the jump host.
+the same cleanup steps as the jump host.~~

-## Phase 7: Remove sops-nix
+PKI migration to OpenBao complete. Host configuration, `services/ca/`, and VM removed.

-Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
-all remnants:
- `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
-  `hosts/template2/scripts.nix`)
+## Phase 6: Remove sops-nix ✓ COMPLETE

-See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
+~~Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
+all remnants:~~
+~~- `sops-nix` input from `flake.nix` and `flake.lock`~~
+~~- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`~~
+~~- `inherit sops-nix` from all specialArgs in `flake.nix`~~
+~~- `system/sops.nix` and its import in `system/default.nix`~~
+~~- `.sops.yaml`~~
+~~- `secrets/` directory~~
+~~- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`~~
+~~- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
+  `hosts/template2/scripts.nix`)~~
+
+All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migration.md` for context.

 ## Notes

@@ -216,7 +208,7 @@ See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
 - The old VMs use IPs that the new VMs need, so the old VM must be shut down before
  the new one is provisioned (or use a temporary IP and swap after verification)
 - Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only hosts not in OpenTofu will be ca (deferred)
+- The decommissioned hosts (jump, auth01, ca) have all been removed
 - Since many hosts are being recreated, this is a good opportunity to establish consistent
  hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
  (e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
--- a/docs/plans/loki-improvements.md
+++ b/docs/plans/loki-improvements.md
@@ -0,0 +1,196 @@
+# Loki Setup Improvements
+
+## Overview
+
+The Loki deployment on monitoring01 is functional but minimal. It originally lacked retention policies and rate limiting (both since implemented, see below) and still uses local filesystem storage. This plan evaluates improvement options across several dimensions: retention management, storage backend, resource limits, and operational improvements.
+
+## Current State
+
+**Loki** on monitoring01 (`services/monitoring/loki.nix`):
+- Single-node deployment, no HA
+- Filesystem storage at `/var/lib/loki/chunks` (~6.8 GB as of 2026-02-13)
+- TSDB index (v13 schema, 24h period)
+- 30-day compactor-based retention with basic rate limits
+- No caching layer
+- Auth disabled (trusted network)
+
+**Promtail** on all 16 hosts (`system/monitoring/logs.nix`):
+- Ships systemd journal (JSON) + `/var/log/**/*.log`
+- Labels: `hostname`, `tier`, `role`, `level`, `job` (systemd-journal/varlog), `systemd_unit`
+- `level` label mapped from journal PRIORITY (critical/error/warning/notice/info/debug)
+- Hardcoded to `http://monitoring01.home.2rjus.net:3100`
+
+**Additional log sources:**
+- `pipe-to-loki` script (manual log submission, `job=pipe-to-loki`)
+- Bootstrap logs from template2 (`job=bootstrap`)
+
+**Context:** The VictoriaMetrics migration plan (`docs/plans/monitoring-migration-victoriametrics.md`) includes moving Loki to monitoring02 with "same configuration as current". These improvements could be applied either before or after that migration.
+
+## Improvement Areas
+
+### 1. Retention Policy
+
+**Implemented.** Compactor-based retention with 30-day period. Note: Loki 3.6.3 requires `delete_request_store = "filesystem"` when retention is enabled (not documented in older guides).
+
+```nix
+compactor = {
+  working_directory = "/var/lib/loki/compactor";
+  compaction_interval = "10m";
+  retention_enabled = true;
+  retention_delete_delay = "2h";
+  retention_delete_worker_count = 150;
+  delete_request_store = "filesystem";
+};
+
+limits_config = {
+  retention_period = "30d";
+};
+```
+
+### 2. Storage Backend
+
+**Decision:** Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.
+
+### 3. Limits Configuration
+
+**Implemented.** Basic guardrails added alongside retention in `limits_config`:
+
+```nix
+limits_config = {
+  retention_period = "30d";
+  ingestion_rate_mb = 10;           # MB/s per tenant
+  ingestion_burst_size_mb = 20;     # Burst allowance
+  max_streams_per_user = 10000;     # Prevent label explosion
+  max_query_series = 500;           # Limit query resource usage
+  max_query_parallelism = 8;
+};
+```
+
+### 4. Promtail Label Improvements
+
+**Problem:** Label inconsistencies and missing useful metadata:
+- The `varlog` scrape config uses `hostname` while journal uses `host` (different label name)
+- No `tier` or `role` labels, making it hard to filter logs by deployment tier or host function
+
+**Implemented:** Standardized on `hostname` to match Prometheus labels. The journal scrape previously used a relabel from `__journal__hostname` to `host`; now both scrape configs use a static `hostname` label from `config.networking.hostName`. Also updated `pipe-to-loki` and bootstrap scripts to use `hostname` instead of `host`.
+
+1. **Standardized label:** Both scrape configs use `hostname` (matching Prometheus) via shared `hostLabels`
+2. **Added `tier` label:** Static label from `config.homelab.host.tier` (`test`/`prod`) on both scrape configs
+3. **Added `role` label:** Static label from `config.homelab.host.role` on both scrape configs (conditionally, only when non-null)
+
+No cardinality impact - `tier` and `role` are 1:1 with `hostname`, so they add metadata to existing streams without creating new ones.
+
+This enables queries like:
+- `{tier="prod"} |= "error"` - all errors on prod hosts
+- `{role="dns"}` - all DNS server logs
+- `{tier="test", job="systemd-journal"}` - journal logs from test hosts
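+
+The shared label set described above could be sketched roughly as follows. This is a minimal illustration, not the actual contents of `system/monitoring/logs.nix`: the `job_name` values and overall layout are assumptions, and the relabeling that produces `systemd_unit` is omitted.
+
+```nix
+{ config, lib, ... }:
+let
+  # Labels shared by the journal and varlog scrape configs.
+  hostLabels = {
+    hostname = config.networking.hostName;
+    tier = config.homelab.host.tier;
+  }
+  # Only attach `role` when the host actually defines one.
+  // lib.optionalAttrs (config.homelab.host.role != null) {
+    role = config.homelab.host.role;
+  };
+in
+{
+  services.promtail.configuration.scrape_configs = [
+    {
+      job_name = "journal"; # assumed name
+      journal = {
+        json = true;
+        labels = hostLabels // { job = "systemd-journal"; };
+      };
+    }
+    {
+      job_name = "varlog"; # assumed name
+      static_configs = [
+        {
+          targets = [ "localhost" ];
+          labels = hostLabels // {
+            job = "varlog";
+            __path__ = "/var/log/**/*.log";
+          };
+        }
+      ];
+    }
+  ];
+}
+```
+
+Because both scrape configs merge the same `hostLabels` attrset, a label added there shows up on journal and file logs alike.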
+
+### 5. Journal Priority → Level Label
+
+**Implemented.** Promtail pipeline stages map journal `PRIORITY` to a `level` label:
+
+| PRIORITY | level |
+|----------|-------|
+| 0-2 | critical |
+| 3 | error |
+| 4 | warning |
+| 5 | notice |
+| 6 | info |
+| 7 | debug |
+
+Uses a `json` stage to extract PRIORITY, `template` to map to level name, and `labels` to attach it. This gives reliable level filtering for all journal logs, unlike Loki's `detected_level` which only works for apps that embed level keywords in message text.
+
+Example queries:
+- `{level="error"}` - all errors across the fleet
+- `{level=~"critical|error", tier="prod"}` - prod errors and criticals
+- `{level="warning", role="dns"}` - warnings from DNS servers
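+
+A `json` / `template` / `labels` pipeline of the kind described above might look like the sketch below. The exact stages in the repository may differ; the template text is an assumption written to match the mapping table.
+
+```nix
+{
+  # Would sit inside the journal scrape config (placement is an assumption).
+  pipeline_stages = [
+    # Copy PRIORITY from the JSON journal entry into the extracted map as `level`.
+    { json.expressions.level = "PRIORITY"; }
+    # Rewrite the numeric priority into a level name; unknown values become empty.
+    {
+      template = {
+        source = "level";
+        template = ''{{ if eq .Value "0" "1" "2" }}critical{{ else if eq .Value "3" }}error{{ else if eq .Value "4" }}warning{{ else if eq .Value "5" }}notice{{ else if eq .Value "6" }}info{{ else if eq .Value "7" }}debug{{ end }}'';
+      };
+    }
+    # Promote the extracted value to a stream label.
+    { labels.level = "level"; }
+  ];
+}
+```
+
+Go's `eq` accepts several comparison values, which keeps the 0-2 case on one branch.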
+
+### 6. Enable JSON Logging on Services
+
+**Problem:** Many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki - `| json` cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching.
+
+**Audit results (2026-02-13):**
+
+**Already logging JSON:**
+- Caddy (all instances) - JSON by default for access logs
+- homelab-deploy (listener/builder) - Go app, logs structured JSON
+
+**Supports JSON, not configured (high value):**
+
+| Service | How to enable | Config file |
+|---------|--------------|-------------|
+| Prometheus | `--log.format=json` | `services/monitoring/prometheus.nix` |
+| Alertmanager | `--log.format=json` | `services/monitoring/prometheus.nix` |
+| Loki | `--log.format=json` | `services/monitoring/loki.nix` |
+| Grafana | `log.console.format = "json"` | `services/monitoring/grafana.nix` |
+| Tempo | `log_format: json` in config | `services/monitoring/tempo.nix` |
+| OpenBao | `log_format = "json"` | `services/vault/default.nix` |
+
+**Supports JSON, not configured (lower value - minimal log output):**
+
+| Service | How to enable |
+|---------|--------------|
+| Pyroscope | `--log.format=json` (OCI container) |
+| Blackbox Exporter | `--log.format=json` |
+| Node Exporter | `--log.format=json` (all 16 hosts) |
+| Systemd Exporter | `--log.format=json` (all 16 hosts) |
+
+**No JSON support (syslog/text only):**
+- NSD, Unbound, OpenSSH, Mosquitto
+
+**Needs verification:**
+- Kanidm, Jellyfin, Home Assistant, Harmonia, Zigbee2MQTT, NATS
+
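+**Recommendation:** Start with the monitoring stack (Prometheus, Alertmanager, Loki, Grafana, Tempo) since they're all Go apps with built-in JSON logging. Prometheus, Alertmanager, and Loki share the `--log.format=json` flag; Grafana and Tempo are switched through their config files (see the table above). Then OpenBao. The exporters are lower priority since they produce minimal log output.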
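+
+Based on the audit table, enabling JSON output for the high-value services could look roughly like this. It is a sketch only: the settings are shown in one attrset for brevity but would live in the respective service files, and the option paths assume the standard nixpkgs modules rather than anything confirmed in this repository.
+
+```nix
+{
+  # Prometheus, Alertmanager, and Loki take a command-line flag.
+  services.prometheus.extraFlags = [ "--log.format=json" ];
+  services.prometheus.alertmanager.extraFlags = [ "--log.format=json" ];
+  services.loki.extraFlags = [ "--log.format=json" ];
+
+  # Grafana reads the format from the [log.console] section of grafana.ini,
+  # hence the quoted attribute name.
+  services.grafana.settings."log.console".format = "json";
+
+  # Tempo and OpenBao take the setting from their config files.
+  services.tempo.settings.server.log_format = "json";
+  services.openbao.settings.log_format = "json";
+}
+```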
+
+### 7. Monitoring CNAME for Promtail Target
+
+**Problem:** Promtail hardcodes `monitoring01.home.2rjus.net:3100`. The VictoriaMetrics migration plan already addresses this by switching to a `monitoring` CNAME.
+
+**Recommendation:** This should happen as part of the monitoring02 migration, not independently. If we do Loki improvements before that migration, keep pointing to monitoring01.
+
+## Priority Ranking
+
+| # | Improvement | Effort | Impact | Status |
+|---|-------------|--------|--------|--------|
+| 1 | **Retention policy** | Low | High | Done (30d compactor retention) |
+| 2 | **Limits config** | Low | Medium | Done (rate limits + stream guards) |
+| 3 | **Promtail labels** | Trivial | Low | Done (hostname/tier/role/level) |
+| 4 | **Journal priority → level** | Low-medium | Medium | Done (pipeline stages) |
+| 5 | **JSON logging audit** | Low-medium | Medium | Audited, not yet enabled |
+| 6 | **Monitoring CNAME** | Low | Medium | Part of monitoring02 migration |
+
+## Implementation Steps
+
+### Phase 1: Retention + Labels (done 2026-02-13)
+
+1. ~~Add `compactor` section to `services/monitoring/loki.nix`~~ Done
+2. ~~Add `limits_config` with 30-day retention and basic rate limits~~ Done
+3. ~~Update `system/monitoring/logs.nix`~~ Done:
+   - Standardized on `hostname` label (matching Prometheus) for both scrape configs
+   - Added `tier` and `role` static labels from `homelab.host` options
+   - Added pipeline stages for journal PRIORITY → `level` label mapping
+4. ~~Update `pipe-to-loki` and bootstrap scripts to use `hostname`~~ Done
+5. ~~Deploy and verify labels~~ Done - all 15 hosts reporting with correct labels
+
+### Phase 2: JSON Logging (not started)
+
+Enable JSON logging on services that support it, starting with the monitoring stack:
+1. Prometheus, Alertmanager, Loki (`--log.format=json`); Grafana (`log.console.format = "json"`); Tempo (`log_format: json`)
+2. OpenBao (`log_format = "json"`)
+3. Lower priority: exporters (node-exporter, systemd-exporter, blackbox)
+
+### Phase 3 (future): S3 Storage Migration
+
+Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with `object_store = "s3"` - the old filesystem period will continue serving historical data until it ages out past retention.
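+
+Adding a schema period could look like the sketch below. Both `from` dates and the index prefix are hypothetical placeholders; the first entry must mirror whatever the current config already contains, and the matching S3 `storage_config` is omitted.
+
+```nix
+{
+  services.loki.configuration.schema_config.configs = [
+    # Existing period: left untouched so historical chunks stay readable.
+    {
+      from = "2024-01-01"; # hypothetical; keep the value from the current config
+      store = "tsdb";
+      object_store = "filesystem";
+      schema = "v13";
+      index = { prefix = "index_"; period = "24h"; };
+    }
+    # New period: chunks written from this date onward go to S3.
+    {
+      from = "2026-09-01"; # hypothetical cutover date, must lie in the future
+      store = "tsdb";
+      object_store = "s3";
+      schema = "v13";
+      index = { prefix = "index_"; period = "24h"; };
+    }
+  ];
+}
+```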
+
+## Open Questions
+
+- [ ] Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?
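+
+If per-stream retention is wanted, Loki's `retention_stream` setting is one way to express it. The sketch below is illustrative only; the 90-day period is a made-up value, not a decision.
+
+```nix
+{
+  services.loki.configuration.limits_config = {
+    retention_period = "30d"; # default for every stream
+    retention_stream = [
+      {
+        # Keep bootstrap and manually submitted logs longer than the default.
+        selector = ''{job=~"bootstrap|pipe-to-loki"}'';
+        priority = 1;
+        period = "90d"; # hypothetical value
+      }
+    ];
+  };
+}
+```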
+
+## Notes
+
+- Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
+- Loki 3.6.3 requires `delete_request_store = "filesystem"` in the compactor config when retention is enabled.
+- S3 storage deferred until post-NAS migration when a proper solution is available.
+- As of 2026-02-13, Loki uses ~6.8 GB for ~30 days of logs from 16 hosts. Prometheus uses ~7.6 GB on the same disk (33 GB total, ~8 GB free).
--- a/docs/plans/memory-issues-follow-up.md
+++ b/docs/plans/memory-issues-follow-up.md
@@ -0,0 +1,116 @@
+# Memory Issues Follow-up
+
+Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
+
+## Background
+
+On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
+
+Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
+
+## Fix Applied
+
+**Commit:** `1674b6a` - system: enable zram swap for all hosts
+
+**Merged:** 2026-02-08 ~12:15 UTC
+
+**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
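+
+For reference, a zram module that yields roughly 2GB of swap on a 2GB VM might look like this. The `memoryPercent` value is an assumption and may not match `system/zram.nix`; the NixOS default is 50, which would size the device at about half of RAM.
+
+```nix
+{
+  zramSwap = {
+    enable = true;
+    # Assumption: size the zram device to 100% of RAM (about 2GB on a 2GB VM).
+    memoryPercent = 100;
+  };
+}
+```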
+
+## Timeline
+
+| Time (UTC) | Event |
+|------------|-------|
+| 05:00:46 | ns2 nixos-upgrade OOM killed |
+| 05:01:47 | `nixos_upgrade_failed` alert fired |
+| 12:15 | zram commit merged to master |
+| 12:19 | ns2 rebooted with zram enabled |
+| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
+
+## Hosts Affected
+
+All 2GB VMs that run nixos-upgrade:
+- ns1, ns2 (DNS)
+- vault01
+- testvm01, testvm02, testvm03
+- kanidm01
+
+## Metrics to Monitor
+
+Check these in Grafana or via PromQL to verify the fix:
+
+### Swap availability (should be ~2GB after upgrade)
+```promql
+node_memory_SwapTotal_bytes / 1024 / 1024
+```
+
+### Swap usage during upgrades
+```promql
+(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
+```
+
+### Zswap compressed bytes (reports zswap only; expected to stay at 0 if only zram is in use)
+```promql
+node_memory_Zswap_bytes / 1024 / 1024
+```
+
+### Upgrade failures (should be 0)
+```promql
+node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
+```
+
+### Memory available during upgrades
+```promql
+node_memory_MemAvailable_bytes / 1024 / 1024
+```
+
+## Verification Steps
+
+After a few days (allow auto-upgrades to run on all hosts):
+
+1. Check all hosts have swap enabled:
+   ```promql
+   node_memory_SwapTotal_bytes > 0
+   ```
+
+2. Check for any upgrade failures since the fix:
+   ```promql
+   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
+   ```
+
+3. Review if any hosts used swap during upgrades (check historical graphs)
+
+## Success Criteria
+
+- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
+- All hosts show ~2GB swap available
+- Upgrades complete successfully on 2GB VMs
+
+## Fallback Options
+
+If zram is insufficient:
+
+1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
+2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
+3. **Use remote builds** - Configure `nix.buildMachines` to offload builds (evaluation itself still runs locally, so this helps only if the build step is what exhausts memory)
+4. **Reduce flake size** - Split configurations to reduce evaluation memory
+
+### Memory Ballooning
+
+Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
+
+Configuration in `terraform/vms.tf`:
+```hcl
+memory  = 4096  # maximum memory
+balloon = 2048  # minimum memory (shrinks to this when idle)
+```
+
+Pros:
+- VMs get memory on-demand without reboots
+- Better host memory utilization
+- Solves upgrade OOM without permanently allocating 4GB
+
+Cons:
+- Requires the virtio balloon driver to be loaded in the guest
+- Guest can experience memory pressure if host is overcommitted
+
+Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -0,0 +1,201 @@
+# Monitoring Stack Migration to VictoriaMetrics
+
+## Overview
+
+Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
+and longer retention. Run in parallel with monitoring01 until validated, then switch over using
+a `monitoring` CNAME for seamless transition.
+
+## Current State
+
+**monitoring01** (10.69.13.13):
+- 4 CPU cores, 4GB RAM, 33GB disk
+- Prometheus with 30-day retention (15s scrape interval)
+- Alertmanager (routes to alerttonotify webhook)
+- Grafana (dashboards, datasources)
+- Loki (log aggregation from all hosts via Promtail)
+- Tempo (distributed tracing) - not actively used
+- Pyroscope (continuous profiling) - not actively used
+
+**Hardcoded References to monitoring01:**
+- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
+- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
+- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
+
+**Auto-generated:**
+- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
+- Node-exporter targets (from all hosts with static IPs)
+
+## Decision: VictoriaMetrics
+
+Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
+- Single binary replacement for Prometheus
+- 5-10x better compression (30 days could become 180+ days in same space)
+- Same PromQL query language (Grafana dashboards work unchanged)
+- Same scrape config format (existing auto-generated configs work)
+
+If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
+
+## Architecture
+
+```
+                     ┌─────────────────┐
+                     │  monitoring02   │
+                     │  VictoriaMetrics│
+                     │  + Grafana      │
+     monitoring      │  + Loki         │
+     CNAME ──────────│  + Alertmanager │
+                     │  (vmalert)      │
+                     └─────────────────┘
+                            ▲
+                            │ scrapes
+            ┌───────────────┼───────────────┐
+            │               │               │
+       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
+       │  ns1    │    │  ha1     │    │  ...     │
+       │ :9100   │    │ :9100    │    │ :9100    │
+       └─────────┘    └──────────┘    └──────────┘
+```
+
+## Implementation Plan
+
+### Phase 1: Create monitoring02 Host [COMPLETE]
+
+Host created and deployed at 10.69.13.24 (prod tier) with:
+- 4 CPU cores, 8GB RAM, 60GB disk
+- Vault integration enabled
+- NATS-based remote deployment enabled
+- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
+
+### Phase 2: Set Up VictoriaMetrics Stack
+
+New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
+Imported by monitoring02 alongside the existing Grafana service.
+
+1. **VictoriaMetrics** (port 8428): [DONE]
+   - `services.victoriametrics.enable = true`
+   - `retentionPeriod = "3"` (3 months)
+   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
+   - Static user override (DynamicUser disabled) for credential file access
+   - OpenBao token fetch service + 30min refresh timer
+   - Apiary bearer token via vault.secrets
+
+2. **vmalert** for alerting rules: [DONE]
+   - Points to VictoriaMetrics datasource at localhost:8428
+   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
+   - No notifier configured during parallel operation (prevents duplicate alerts)
+
+3. **Alertmanager** (port 9093): [DONE]
+   - Same configuration as monitoring01 (alerttonotify webhook routing)
+   - Will only receive alerts after cutover (vmalert notifier disabled)
+
+4. **Grafana** (port 3000): [DONE]
+   - VictoriaMetrics datasource (localhost:8428) as default
+   - monitoring01 Prometheus datasource kept for comparison during parallel operation
+   - Loki datasource pointing to localhost (after Loki migrated to monitoring02)
+
+5. **Loki** (port 3100): [DONE]
+   - Same configuration as monitoring01 in standalone `services/loki/` module
+   - Grafana datasource updated to localhost:3100
+
+**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
+pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
+native push support.
+
+### Phase 3: Parallel Operation
+
+Run both monitoring01 and monitoring02 simultaneously:
+
+1. **Dual scraping**: Both hosts scrape the same targets
+   - Validates VictoriaMetrics is collecting data correctly
+
+2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
+   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
+
+3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
+
+4. **Validate alerts**: Verify vmalert evaluates rules correctly (no notifier configured = no notifications)
+
+5. **Compare resource usage**: Monitor disk/memory consumption between hosts
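+
+Dual log shipping amounts to listing two Promtail clients. A minimal sketch, assuming monitoring02 is reachable as `monitoring02.home.2rjus.net` and leaving out any authentication settings:
+
+```nix
+{
+  services.promtail.configuration.clients = [
+    # Existing target.
+    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
+    # Second target for parallel operation (hostname is an assumption).
+    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
+  ];
+}
+```
+
+Promtail sends every entry to each listed client, so the two Loki instances should receive the same streams while both are up.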
+
+### Phase 4: Add monitoring CNAME
+
+Add CNAME to monitoring02 once validated:
+
+```nix
+# hosts/monitoring02/configuration.nix
+homelab.dns.cnames = [ "monitoring" ];
+```
+
+This creates `monitoring.home.2rjus.net` pointing to monitoring02.
+
+### Phase 5: Update References
+
+Update hardcoded references to use the CNAME:
+
+1. **system/monitoring/logs.nix**:
+   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
+
+2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
+   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
+   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
+   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
+
+Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
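+
+The proxy change could be sketched as below, assuming `services/http-proxy/proxy.nix` defines its backends through `services.caddy.virtualHosts`; the real file may be structured differently.
+
+```nix
+{
+  services.caddy.virtualHosts = {
+    "prometheus.home.2rjus.net".extraConfig = ''
+      reverse_proxy http://monitoring.home.2rjus.net:8428
+    '';
+    "alertmanager.home.2rjus.net".extraConfig = ''
+      reverse_proxy http://monitoring.home.2rjus.net:9093
+    '';
+    "grafana.home.2rjus.net".extraConfig = ''
+      reverse_proxy http://monitoring.home.2rjus.net:3000
+    '';
+  };
+}
+```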
+
+### Phase 6: Enable Alerting
+
+Once ready to cut over:
+1. Enable the vmalert notifier on monitoring02 so alerts reach the local Alertmanager
+2. Verify test alerts route correctly
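+
+Since vmalert runs without a notifier during parallel operation, one plausible form of the cutover change is pointing it at the local Alertmanager. This assumes the `services.vmalert.settings` interface referenced elsewhere in this plan; newer nixpkgs revisions may structure the module differently.
+
+```nix
+{
+  # Send firing alerts to Alertmanager on the same host.
+  services.vmalert.settings."notifier.url" = [ "http://localhost:9093" ];
+}
+```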
+
+### Phase 7: Cutover and Decommission
+
+1. **Stop monitoring01**: Prevent duplicate alerts during transition
+2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
+3. **Verify all targets scraped**: Check VictoriaMetrics UI
+4. **Verify logs flowing**: Check Loki on monitoring02
+5. **Decommission monitoring01**:
+   - Remove from flake.nix
+   - Remove host configuration
+   - Destroy VM in Proxmox
+   - Remove from terraform state
+
+## Current Progress
+
+- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
+- **Phase 2** complete (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana datasources configured
+  - Tempo and Pyroscope deferred (not actively used; can be added later if needed)
+
+## Open Questions
+
+- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
+- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
+- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.
+
+## VictoriaMetrics Service Configuration
+
+Implemented in `services/victoriametrics/default.nix`. Key design decisions:
+
+- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
+  `victoriametrics` user so vault.secrets and credential files work correctly
+- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
+  reference (no YAML-to-Nix conversion needed)
+- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
+  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
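+
+The static-user override might look roughly like this. It is a sketch of the general technique, not a copy of `services/victoriametrics/default.nix`.
+
+```nix
+{ lib, ... }:
+{
+  # Fixed user and group so file ownership is stable across restarts.
+  users.users.victoriametrics = {
+    isSystemUser = true;
+    group = "victoriametrics";
+  };
+  users.groups.victoriametrics = { };
+
+  # Turn off the module's DynamicUser and run as the static account instead.
+  systemd.services.victoriametrics.serviceConfig = {
+    DynamicUser = lib.mkForce false;
+    User = "victoriametrics";
+    Group = "victoriametrics";
+  };
+}
+```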
+
+## Rollback Plan
+
+If issues arise after cutover:
+1. Move `monitoring` CNAME back to monitoring01
+2. Restart monitoring01 services
+3. Revert Promtail config to point only to monitoring01
+4. Revert http-proxy backends
+
+## Notes
+
+- VictoriaMetrics uses port 8428 vs Prometheus 9090
+- PromQL compatibility is excellent
+- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
+- monitoring02 deployed via OpenTofu using `create-host` script
+- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
--- a/docs/plans/new-services.md
+++ b/docs/plans/new-services.md
@@ -0,0 +1,145 @@
+# New Service Candidates
+
+Ideas for additional services to deploy in the homelab. These lean more enterprise/obscure
+than the typical self-hosted fare.
+
+## Litestream
+
+Continuous SQLite replication to S3-compatible storage. Streams WAL changes in near-real-time,
+providing point-in-time recovery without scheduled backup jobs.
+
+**Why:** Several services use SQLite (Home Assistant, potentially others). Litestream would
+give continuous backup to Garage S3 with minimal resource overhead and near-zero configuration.
+Replaces cron-based backup scripts with a small daemon per database.
+
+**Integration points:**
+- Garage S3 as replication target (already deployed)
+- Home Assistant SQLite database is the primary candidate
+- Could also cover any future SQLite-backed services
+
+**Complexity:** Low. Single Go binary, minimal config (source DB path + S3 endpoint).
+
+**NixOS packaging:** Available in nixpkgs as `litestream`.
+
+---
+
+## ntopng
+
+Deep network traffic analysis and flow monitoring. Provides real-time visibility into bandwidth
+usage, protocol distribution, top talkers, and anomaly detection via a web UI.
+
+**Why:** We have host-level metrics (node-exporter) and logs (Loki) but no network-level
+visibility. ntopng would show traffic patterns across the infrastructure — NFS throughput to
+the NAS, DNS query volume, inter-host traffic, and bandwidth anomalies. Useful for capacity
+planning and debugging network issues.
+
+**Integration points:**
+- Could export metrics to Prometheus via its built-in exporter
+- Web UI behind http-proxy with Kanidm OIDC (if supported) or Pomerium
+- NetFlow/sFlow from managed switches (if available)
+- Passive traffic capture on a mirror port or the monitoring host itself
+
+**Complexity:** Medium. Needs network tap or mirror port for full visibility, or can run
+in host-local mode. May need a dedicated interface or VLAN mirror.
+
+**NixOS packaging:** Available in nixpkgs as `ntopng`.
+
+---
+
+## Renovate
+
+Automated dependency update bot that understands Nix flakes natively. Creates branches/PRs
+to bump flake inputs on a configurable schedule.
+
+**Why:** Currently `nix flake update` is manual. Renovate can automatically propose updates
+to individual flake inputs (nixpkgs, homelab-deploy, nixos-exporter, etc.), group related
+updates, and respect schedules. More granular than updating everything at once — can bump
+nixpkgs weekly but hold back other inputs, auto-merge patch-level changes, etc.
+
+**Integration points:**
+- Runs against git.t-juice.club repositories
+- Understands `flake.lock` format natively
+- Could target both `nixos-servers` and `nixos` repos
+- Update branches would be validated by homelab-deploy builder
+
+**Complexity:** Medium. Needs git forge integration (Gitea/Forgejo API). Self-hosted runner
+mode available. Configuration via `renovate.json` in each repo.
+
+**NixOS packaging:** Available in nixpkgs as `renovate`.
+
+---
+
+## Pomerium
+
+Identity-aware reverse proxy implementing zero-trust access. Every request is authenticated
+and authorized based on identity, device, and context — not just network location.
+
+**Why:** Currently Caddy terminates TLS but doesn't enforce authentication on most services.
+Pomerium would put Kanidm OIDC authentication in front of every internal service, with
+per-route authorization policies (e.g., "only admins can access Prometheus," "require re-auth
+for Vault UI"). Directly addresses the security hardening plan's goals.
+
+**Integration points:**
+- Kanidm as OIDC identity provider (already deployed)
+- Could replace or sit in front of Caddy for internal services
+- Per-route policies based on Kanidm groups (admins, users, ssh-users)
+- Centralizes access logging and audit trail
+
+**Complexity:** Medium-high. Needs careful integration with existing Caddy reverse proxy.
+Decision needed on whether Pomerium replaces Caddy or works alongside it (Pomerium for
+auth, Caddy for TLS termination and routing, or Pomerium handles everything).
+
+**NixOS packaging:** Available in nixpkgs as `pomerium`.
+
+---
+
+## Apache Guacamole
+
+Clientless remote desktop and SSH gateway. Provides browser-based access to hosts via
+RDP, VNC, SSH, and Telnet with no client software required. Supports session recording
+and playback.
+
+**Why:** Provides an alternative remote access path that doesn't require VPN software or
+SSH keys on the client device. Useful for accessing hosts from untrusted machines (phone,
+borrowed laptop) or providing temporary access to others. Session recording gives an audit
+trail. Could complement the WireGuard remote access plan rather than replace it.
+
+**Integration points:**
+- Kanidm for authentication (OIDC or LDAP)
+- Behind http-proxy or Pomerium for TLS
+- SSH access to all hosts in the fleet
+- Session recordings could be stored on Garage S3
+- Could serve as the "emergency access" path when VPN is unavailable
+
+**Complexity:** Medium. Java web app plus the native `guacd` proxy daemon, typically needs PostgreSQL for
+connection/user storage (already available). Docker is the common deployment method but
+native packaging exists.
+
+**NixOS packaging:** Available in nixpkgs as `guacamole-server` and `guacamole-client`.
+
+---
+
+## CrowdSec
+
+Collaborative intrusion prevention system with crowd-sourced threat intelligence.
+Parses logs to detect attack patterns, applies remediation (firewall bans, CAPTCHA),
+and shares/receives threat signals from a global community network.
+
+**Why:** Goes beyond fail2ban with behavioral detection, crowd-sourced IP reputation,
+and a scenario-based engine. Fits the security hardening plan. The community blocklist
+means we benefit from threat intelligence gathered across thousands of deployments.
+Could parse SSH logs, HTTP access logs, and other service logs to detect and block
+malicious activity.
+
+**Integration points:**
+- Could consume logs from Loki or directly from journald/log files
+- Firewall bouncer for iptables/nftables remediation
+- Caddy bouncer for HTTP-level blocking
+- Prometheus metrics exporter for alert integration
+- Scenarios available for SSH brute force, HTTP scanning, and more
+- Feeds into existing alerting pipeline (Alertmanager -> alerttonotify)
+
+**Complexity:** Medium. Agent (log parser + decision engine) on each host or centralized.
+Bouncers (enforcement) on edge hosts. Free community tier includes threat intel access.
+
+**NixOS packaging:** Available in nixpkgs as `crowdsec`.
--- a/docs/plans/nixos-router.md
+++ b/docs/plans/nixos-router.md
@@ -0,0 +1,162 @@
+# NixOS Router — Replace EdgeRouter
+
+Replace the aging Ubiquiti EdgeRouter (gw, 10.69.10.1) with a NixOS-based router.
+The EdgeRouter is suspected to be a throughput bottleneck. A NixOS router integrates
+naturally with the existing fleet: same config management, same monitoring pipeline,
+same deployment workflow.
+
+## Goals
+
+- Eliminate the EdgeRouter throughput bottleneck
+- Full integration with existing monitoring (node-exporter, promtail, Prometheus, Loki)
+- Declarative firewall and routing config managed in the flake
+- Inter-VLAN routing for all existing subnets
+- DHCP server for client subnets
+- NetFlow/traffic accounting for future ntopng integration
+- Foundation for WireGuard remote access (see remote-access.md)
+
+## Current Network Topology
+
+**Subnets (known VLANs):**
+| VLAN/Subnet    | Purpose          | Notable hosts                          |
+|----------------|------------------|----------------------------------------|
+| 10.69.10.0/24  | Gateway          | gw (10.69.10.1)                        |
+| 10.69.12.0/24  | Core services    | nas, pve1, arr jails, restic           |
+| 10.69.13.0/24  | Infrastructure   | All NixOS servers (static IPs)         |
+| 10.69.22.0/24  | WLAN             | unifi-ctrl                             |
+| 10.69.30.0/24  | Workstations     | gunter                                 |
+| 10.69.31.0/24  | Media            | media                                  |
+| 10.69.99.0/24  | Management       | sw1 (MikroTik CRS326-24G-2S+)         |
+
+**DNS:** ns1 (10.69.13.5) and ns2 (10.69.13.6) handle all resolution. Upstream is
+Cloudflare/Google over DoT via Unbound.
+
+**Switch:** MikroTik CRS326-24G-2S+ — L2 switching with VLAN trunking. Capable of
+L3 routing via RouterOS but not ideal for sustained routing throughput.
+
+## Hardware
+
+Needs a small x86 box with:
+- At least 2 NICs (WAN + LAN trunk). Dual 2.5GbE preferred.
+- Enough CPU for nftables NAT at line rate (any modern x86 is fine)
+- 4-8 GB RAM (plenty for routing + DHCP + NetFlow accounting)
+- Low power consumption, fanless preferred for always-on use
+
+Candidates:
+- Topton / CWWK mini PC with dual/quad Intel 2.5GbE (~100-150 EUR)
+- Protectli Vault (more expensive, ~200-300 EUR, proven in pfSense/OPNsense community)
+- Any mini PC with one onboard NIC + one USB 2.5GbE adapter (cheapest, less ideal)
+
+The LAN port would carry a VLAN trunk to the MikroTik switch, with sub-interfaces
+for each VLAN. WAN port connects to the ISP uplink.
+
+## NixOS Configuration
+
+### Stability Policy
+
+The router is treated differently from the rest of the fleet:
+- **No auto-upgrade** — `system.autoUpgrade.enable = false`
+- **No homelab-deploy listener** — `homelab.deploy.enable = false`
+- **Manual updates only** — update every few months, test-build first
+- **Use `nixos-rebuild boot`** — changes take effect on next deliberate reboot
+- **Tier: prod, priority: high** — alerts treated with highest priority
+
+### Core Services
+
+**Routing & NAT:**
+- `systemd-networkd` for all interface config (consistent with rest of fleet)
+- VLAN sub-interfaces on the LAN trunk (one per subnet)
+- `networking.nftables` for stateful firewall and NAT
+- IP forwarding enabled (`net.ipv4.ip_forward = 1`)
+- Masquerade outbound traffic on WAN interface
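+
+A minimal sketch of the routing pieces above. The interface names (`wan0`, `lan0`),
+the VLAN ID 13 and the gateway address 10.69.13.1 are assumptions for illustration,
+not values taken from the current EdgeRouter config. WAN addressing is left out
+because the uplink type is still an open question:
+
+```nix
+{ ... }: {
+  networking.useNetworkd = true;
+  systemd.network.enable = true;
+
+  # VLAN sub-interface on the LAN trunk (ID 13 is an assumption)
+  systemd.network.netdevs."20-vlan13" = {
+    netdevConfig = { Kind = "vlan"; Name = "vlan13"; };
+    vlanConfig.Id = 13;
+  };
+
+  # Trunk port: carries tagged VLANs only, no address of its own
+  systemd.network.networks."10-lan0" = {
+    matchConfig.Name = "lan0";
+    vlan = [ "vlan13" ];
+    networkConfig.LinkLocalAddressing = "no";
+  };
+
+  # Gateway address for the subnet (placeholder)
+  systemd.network.networks."20-vlan13" = {
+    matchConfig.Name = "vlan13";
+    address = [ "10.69.13.1/24" ];
+  };
+
+  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;
+
+  # Masquerade traffic from the internal VLANs out of the WAN interface
+  networking.nftables.enable = true;
+  networking.nat = {
+    enable = true;
+    externalInterface = "wan0";
+    internalInterfaces = [ "vlan13" ];
+  };
+}
+```
+
+Each additional subnet would repeat the netdev/network pair and be added to both
+`vlan` and `internalInterfaces`.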
+
+**DHCP:**
+- Kea or dnsmasq for DHCP on client subnets (WLAN, workstations, media)
+- Infrastructure subnet (10.69.13.0/24) stays static — no DHCP needed
+- Static leases for known devices
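+
+If Kea is chosen, the workstation subnet could look roughly like this. The interface
+name `vlan30`, the pool range, the gateway address and the reservation are
+placeholders; only the subnet and the ns1/ns2 addresses come from this document:
+
+```nix
+{ ... }: {
+  services.kea.dhcp4 = {
+    enable = true;
+    settings = {
+      interfaces-config.interfaces = [ "vlan30" ];
+      lease-database = {
+        type = "memfile";
+        persist = true;
+        name = "/var/lib/kea/dhcp4.leases";
+      };
+      valid-lifetime = 86400;
+      subnet4 = [{
+        id = 30;
+        subnet = "10.69.30.0/24";
+        pools = [{ pool = "10.69.30.100 - 10.69.30.200"; }];
+        option-data = [
+          { name = "routers"; data = "10.69.30.1"; }
+          { name = "domain-name-servers"; data = "10.69.13.5, 10.69.13.6"; }
+        ];
+        # Static lease for a known device (MAC and IP are placeholders)
+        reservations = [{
+          hw-address = "00:00:00:00:00:01";
+          ip-address = "10.69.30.10";
+        }];
+      }];
+    };
+  };
+}
+```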
+
+**Firewall (nftables):**
+- Default deny between VLANs
+- Explicit allow rules for known cross-VLAN traffic:
+  - All subnets → ns1/ns2 (DNS)
+  - All subnets → monitoring01 (metrics/logs)
+  - Infrastructure → all (management access)
+  - Workstations → media, core services
+- NAT masquerade on WAN
+- Rate limiting on WAN-facing services
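+
+One possible way to express the inter-VLAN policy is the NixOS firewall's forward
+filtering, which needs the nftables backend. This is a sketch under assumptions:
+`wan0` is a hypothetical interface name, and the monitoring01 rule is omitted
+because its address is not listed here:
+
+```nix
+{ ... }: {
+  networking.nftables.enable = true;
+  networking.firewall = {
+    enable = true;
+    # Drop forwarded traffic unless a rule below accepts it
+    filterForward = true;
+    extraForwardRules = ''
+      # All subnets -> ns1/ns2 (DNS)
+      ip daddr { 10.69.13.5, 10.69.13.6 } udp dport 53 accept
+      ip daddr { 10.69.13.5, 10.69.13.6 } tcp dport 53 accept
+
+      # Infrastructure -> all (management access)
+      ip saddr 10.69.13.0/24 accept
+
+      # Workstations -> media, core services
+      ip saddr 10.69.30.0/24 ip daddr { 10.69.31.0/24, 10.69.12.0/24 } accept
+
+      # Any VLAN -> internet via the WAN interface
+      oifname "wan0" accept
+    '';
+  };
+}
+```
+
+Replies to accepted connections are handled by the firewall's connection tracking,
+so only the initiating direction needs a rule.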
+
+**Traffic Accounting:**
+- nftables flow accounting or softflowd for NetFlow export
+- Export to future ntopng instance (see new-services.md)
+
+### Monitoring Integration
+
+Since this is a NixOS host in the flake, it gets the standard monitoring stack for free:
+- node-exporter for system metrics (CPU, memory, NIC throughput per interface)
+- promtail shipping logs to Loki
+- Prometheus scrape target auto-registration
+- Alertmanager alerts for host-down, high CPU, etc.
+
+Additional router-specific monitoring:
+- Per-VLAN interface traffic metrics via node-exporter (automatic for all interfaces)
+- NAT connection tracking table size
+- WAN uplink status and throughput
+- DHCP lease metrics (Kea, if chosen, has a Prometheus exporter)
+
+This is a significant advantage over the EdgeRouter — full observability through
+the existing Grafana dashboards and Loki log search, debuggable via the monitoring
+MCP tools.
+
+### WireGuard Integration
+
+The remote access plan (remote-access.md) currently proposes a separate `extgw01`
+gateway host. With a NixOS router, there's a decision to make:
+
+**Option A:** WireGuard terminates on the router itself. Simplest topology — the
+router is already the gateway, so VPN traffic doesn't need extra hops or firewall
+rules. But adds complexity to the router, which should stay simple.
+
+**Option B:** Keep extgw01 as a separate host (original plan). Router just routes
+traffic to it. Better separation of concerns, router stays minimal.
+
+Recommendation: Start with Option B (keep it separate). The router should do routing
+and nothing else. WireGuard can move to the router later if extgw01 feels redundant.
+
+## Migration Plan
+
+### Phase 1: Build and lab test
+- Acquire hardware
+- Create host config in the flake (routing, NAT, DHCP, firewall)
+- Test-build on workstation: `nix build .#nixosConfigurations.router01.config.system.build.toplevel`
+- Lab test with a temporary setup if possible (two NICs, isolated VLAN)
+
+### Phase 2: Prepare cutover
+- Pre-configure the MikroTik switch trunk port for the new router
+- Document current EdgeRouter config (port forwarding, NAT rules, DHCP leases)
+- Replicate all rules in the NixOS config
+- Verify DNS, DHCP, and inter-VLAN routing work in test
+
+### Phase 3: Cutover
+- Schedule a maintenance window (brief downtime expected)
+- Swap WAN cable from EdgeRouter to new router
+- Swap LAN trunk from EdgeRouter to new router
+- Verify connectivity from each VLAN
+- Verify internet access, DNS resolution, inter-VLAN routing
+- Monitor via Prometheus/Loki (immediately available since it's a fleet host)
+
+### Phase 4: Decommission EdgeRouter
+- Keep EdgeRouter available as fallback for a few weeks
+- Remove `gw` entry from external-hosts.nix, replace with flake-managed host
+- Update any references to 10.69.10.1 if the router IP changes
+
+## Open Questions
+
+- **Router IP:** Keep 10.69.10.1 or move to a different address? Each VLAN
+  sub-interface needs an IP (the gateway address for that subnet).
+- **ISP uplink:** What type of WAN connection? PPPoE, DHCP, static IP?
+- **Port forwarding:** What ports are currently forwarded on the EdgeRouter?
+  These need to be replicated in nftables.
+- **DHCP scope:** Which subnets currently get DHCP from the EdgeRouter vs
+  other sources (UniFi controller for WLAN?)?
+- **UPnP/NAT-PMP:** Needed for any devices? (gaming consoles, etc.)
+- **Hardware preference:** Fanless mini PC budget and preferred vendor?
--- a/docs/plans/remote-access.md
+++ b/docs/plans/remote-access.md
@@ -4,119 +4,127 @@

 ## Goal

-Enable remote access to some or all homelab services from outside the internal network, without exposing anything directly to the internet.
+Enable personal remote access to selected homelab services from outside the internal network, without exposing anything directly to the internet.

 ## Current State

 - All services are only accessible from the internal 10.69.13.x network
- Exception: jelly01 has a WireGuard link to an external VPS
- No services are directly exposed to the public internet
+- http-proxy has a WireGuard tunnel (`wg0`, `10.69.222.0/24`) to a VPS (`docker2.t-juice.club`) on an OpenStack cluster
+- VPS runs Traefik which proxies selected services (including Jellyfin) back through the tunnel to http-proxy's Caddy
+- No other services are directly exposed to the public internet

-## Constraints
+## Decision: WireGuard Gateway

- Nothing should be directly accessible from the outside
- Must use VPN or overlay network (no port forwarding of services)
- Self-hosted solutions preferred over managed services
+After evaluating WireGuard gateway vs Headscale (self-hosted Tailscale), the **WireGuard gateway** approach was chosen:

-## Options
+- Only 2 client devices (laptop + phone), so Headscale's device management UX isn't needed
+- Split DNS works fine on Linux laptop via systemd-resolved; all-or-nothing DNS on phone is acceptable for occasional use
+- Simpler infrastructure - no control server to maintain
+- Builds on existing WireGuard experience and setup

-### 1. WireGuard Gateway (Internal Router)
+## Architecture

-A dedicated NixOS host on the internal network with a WireGuard tunnel out to the VPS. The VPS becomes the public entry point, and the gateway routes traffic to internal services. Firewall rules on the gateway control which services are reachable.
+```
+                    ┌─────────────────────────────────┐
+                    │  VPS (OpenStack)                │
+  Laptop/Phone ──→ │  WireGuard endpoint             │
+  (WireGuard)      │  Client peers: laptop, phone    │
+                    │  Routes 10.69.13.0/24 via tunnel│
+                    └──────────┬──────────────────────┘
+                               │ WireGuard tunnel
+                               ▼
+                    ┌─────────────────────────────────┐
+                    │  extgw01 (gateway + bastion)    │
+                    │  - WireGuard tunnel to VPS      │
+                    │  - Firewall (allowlist only)    │
+                    │  - SSH + 2FA (full access)      │
+                    └──────────┬──────────────────────┘
+                               │ allowed traffic only
+                               ▼
+                    ┌─────────────────────────────────┐
+                    │  Internal network 10.69.13.0/24 │
+                    │  - monitoring01:3000 (Grafana)  │
+                    │  - jelly01:8096 (Jellyfin)      │
+                    │  - *-jail hosts (arr stack)     │
+                    └─────────────────────────────────┘
+```

-**Pros:**
- Simple, well-understood technology
- Already running WireGuard for jelly01
- Full control over routing and firewall rules
- Excellent NixOS module support
- No extra dependencies
+### Existing path (unchanged)

-**Cons:**
- Hub-and-spoke topology (all traffic goes through VPS)
- Manual peer management
- Adding a new client device means editing configs on both VPS and gateway
+The current public access path stays as-is:

-### 2. WireGuard Mesh (No Relay)
+```
+Internet → VPS (Traefik) → WireGuard → http-proxy (Caddy) → internal services
+```

-Each client device connects directly to a WireGuard endpoint. Could be on the VPS which forwards to the homelab, or if there is a routable IP at home, directly to an internal host.
+This handles public Jellyfin access and any other publicly-exposed services.

-**Pros:**
- Simple and fast
- No extra software
+### New path (personal VPN)

-**Cons:**
- Manual key and endpoint management for every peer
- Doesn't scale well
- If behind CGNAT, still needs the VPS as intermediary
+A separate WireGuard tunnel for personal remote access with restricted firewall rules:

-### 3. Headscale (Self-Hosted Tailscale)
+```
+Laptop/Phone → VPS (WireGuard peers) → tunnel → extgw01 (firewall) → allowed services
+```

-Run a Headscale control server (on the VPS or internally) and install the Tailscale client on homelab hosts and personal devices. Gets the Tailscale mesh networking UX without depending on Tailscale's infrastructure.
+### Access tiers

-**Pros:**
- Mesh topology - devices communicate directly via NAT traversal (DERP relay as fallback)
- Easy to add/remove devices
- ACL support for granular access control
- MagicDNS for service discovery
- Good NixOS support for both headscale server and tailscale client
- Subnet routing lets you expose the entire 10.69.13.x network or specific hosts without installing tailscale on every host
+1. **VPN (default)**: Laptop/phone connect to VPS WireGuard endpoint, traffic routed through extgw01 firewall. Only allowlisted services are reachable.
+2. **SSH + 2FA (escalated)**: SSH into extgw01 for full network access when needed.

-**Cons:**
- More moving parts than plain WireGuard
- Headscale is a third-party reimplementation, can lag behind Tailscale features
- Need to run and maintain the control server
+## New Host: extgw01

-### 4. Tailscale (Managed)
+A NixOS host on the internal network acting as both WireGuard gateway and SSH bastion.

-Same as Headscale but using Tailscale's hosted control plane.
+### Responsibilities

-**Pros:**
- Zero infrastructure to manage on the control plane side
- Polished UX, well-maintained clients
- Free tier covers personal use
+- **WireGuard tunnel** to the VPS for client traffic
+- **Firewall** with allowlist controlling which internal services are reachable through the VPN
+- **SSH bastion** with 2FA for full network access when needed
+- **DNS**: Clients get split DNS config (laptop via systemd-resolved routing domain, phone uses internal DNS for all queries)
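+
+A rough sketch of the tunnel side of extgw01, assuming the NixOS WireGuard module.
+The interface name `wg-clients`, the tunnel subnet, the key file path and the
+endpoint are all placeholders:
+
+```nix
+{ ... }: {
+  networking.wireguard.interfaces.wg-clients = {
+    ips = [ "10.69.223.2/24" ];                      # placeholder tunnel address
+    privateKeyFile = "/run/secrets/wg-clients.key";  # hypothetical secret path
+    peers = [{
+      publicKey = "<VPS public key>";
+      endpoint = "vps.example.org:51820";            # placeholder endpoint
+      # Laptop and phone reach extgw01 through the VPS, so allow the tunnel subnet
+      allowedIPs = [ "10.69.223.0/24" ];
+      # extgw01 initiates the tunnel from inside; keep the NAT mapping alive
+      persistentKeepalive = 25;
+    }];
+  };
+
+  # Required to route client traffic on to internal services
+  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;
+}
+```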

-**Cons:**
- Dependency on Tailscale's service
- Less aligned with self-hosting preference
- Coordination metadata goes through their servers (data plane is still peer-to-peer)
+### Firewall allowlist (initial)

-### 5. Netbird (Self-Hosted)
+| Service    | Destination                  | Port  |
+|------------|------------------------------|-------|
+| Grafana    | monitoring01.home.2rjus.net  | 3000  |
+| Jellyfin   | jelly01.home.2rjus.net       | 8096  |
+| Sonarr     | sonarr-jail.home.2rjus.net   | 8989  |
+| Radarr     | radarr-jail.home.2rjus.net   | 7878  |
+| NZBget     | nzbget-jail.home.2rjus.net   | 6789  |
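+
+The allowlist could be kept as data and turned into forward rules, for example as
+below. This is only a sketch: the interface name `wg-clients` and every address are
+placeholders, and literal IPs are used because the ruleset is normally checked at
+build time, where hostnames cannot be resolved:
+
+```nix
+{ lib, ... }:
+let
+  # Ports come from the table above; addresses are placeholders
+  allowlist = [
+    { name = "grafana";  addr = "10.69.13.20"; port = 3000; }
+    { name = "jellyfin"; addr = "10.69.13.21"; port = 8096; }
+    { name = "sonarr";   addr = "10.69.13.22"; port = 8989; }
+    { name = "radarr";   addr = "10.69.13.23"; port = 7878; }
+    { name = "nzbget";   addr = "10.69.13.24"; port = 6789; }
+  ];
+  # One accept rule per service, limited to traffic arriving from the VPN
+  toRule = s: ''iifname "wg-clients" ip daddr ${s.addr} tcp dport ${toString s.port} accept'';
+in {
+  networking.nftables.enable = true;
+  networking.firewall = {
+    enable = true;
+    filterForward = true;  # everything not listed is dropped
+    extraForwardRules = lib.concatMapStringsSep "\n" toRule allowlist;
+  };
+}
+```
+
+Adding a service then means adding one entry to `allowlist`.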

-Open-source alternative to Tailscale with a self-hostable management server. WireGuard-based, supports ACLs and NAT traversal.
+### SSH 2FA options (to be decided)

-**Pros:**
- Fully self-hostable
- Web UI for management
- ACL and peer grouping support
+- **Kanidm**: Already deployed on kanidm01, supports RADIUS/OAuth2 for PAM integration
+- **SSH certificates via OpenBao**: Fits existing Vault infrastructure, short-lived certs
+- **TOTP via PAM**: Simplest fallback, Google Authenticator / similar

-**Cons:**
- Heavier to self-host (needs multiple components: management server, signal server, TURN relay)
- Less mature NixOS module support compared to Tailscale/Headscale
+## VPS Configuration

-### 6. Nebula (by Defined Networking)
+The VPS needs a new WireGuard interface (separate from the existing http-proxy tunnel):

-Certificate-based mesh VPN. Each node gets a certificate from a CA you control. No central coordination server needed at runtime.
+- WireGuard endpoint listening on a public UDP port
+- 2 peers: laptop, phone
+- Routes client traffic through tunnel to extgw01
+- Minimal config - just routing, no firewall policy (that lives on extgw01)

-**Pros:**
- No always-on control plane
- Certificate-based identity
- Lightweight
+## Implementation Steps

-**Cons:**
- Less convenient for ad-hoc device addition (need to issue certs)
- NAT traversal less mature than Tailscale's
- Smaller community/ecosystem
-
-## Key Decision Points
-
- **Static public IP vs CGNAT?** Determines whether clients can connect directly to home network or need VPS relay.
- **Number of client devices?** If just phone and laptop, plain WireGuard via VPS is fine. More devices favors Headscale.
- **Per-service vs per-network access?** Gateway with firewall rules gives per-service control. Headscale ACLs can also do this. Plain WireGuard gives network-level access with gateway firewall for finer control.
- **Subnet routing vs per-host agents?** With Headscale/Tailscale, can either install client on every host, or use a single subnet router that advertises the 10.69.13.x range. The latter is closer to the gateway approach and avoids touching every host.
-
-## Leading Candidates
-
-Based on existing WireGuard experience, self-hosting preference, and NixOS stack:
-
-1. **Headscale with a subnet router** - Best balance of convenience and self-hosting
-2. **WireGuard gateway via VPS** - Simplest, most transparent, builds on existing setup
+1. **Create extgw01 host configuration** in this repo
+   - VM provisioned via OpenTofu (same as other hosts)
+   - WireGuard interface for VPS tunnel
+   - nftables/iptables firewall with service allowlist
+   - IP forwarding enabled
+2. **Configure VPS WireGuard** for client peers
+   - New WireGuard interface with laptop + phone peers
+   - Routing for 10.69.13.0/24 through extgw01 tunnel
+3. **Set up client configs**
+   - Laptop: WireGuard config + systemd-resolved split DNS for `home.2rjus.net`
+   - Phone: WireGuard app config with DNS pointing at internal nameservers
+4. **Set up SSH 2FA** on extgw01
+   - Evaluate Kanidm integration vs OpenBao SSH certs vs TOTP
+5. **Test and verify**
+   - VPN access to allowed services only
+   - Firewall blocks everything else
+   - SSH + 2FA grants full access
+   - Existing public access path unaffected
--- a/docs/plans/security-hardening.md
+++ b/docs/plans/security-hardening.md
@@ -0,0 +1,224 @@
+# Security Hardening Plan
+
+## Overview
+
+Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
+
+## Current State
+
+- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
+- Firewall disabled on all hosts (`networking.firewall.enable = false`)
+- Promtail ships logs over HTTP to Loki
+- Loki has no authentication (`auth_enabled = false`)
+- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
+- Vault TLS verification disabled by default (`skipTlsVerify = true`)
+- Audit logging exists (`common/ssh-audit.nix`) but is not applied globally
+- Alert rules focus on availability, no security event detection
+
+## Priority Matrix
+
+| Issue | Severity | Effort | Priority |
+|-------|----------|--------|----------|
+| SSH password auth | High | Low | **P1** |
+| Firewall disabled | High | Medium | **P1** |
+| Promtail HTTP (no TLS) | High | Medium | **P2** |
+| No security alerting | Medium | Low | **P2** |
+| Audit logging not global | Low | Low | **P2** |
+| Loki no auth | Medium | Medium | **P3** |
+| Secret-ID TTL | Medium | Medium | **P3** |
+| Vault skipTlsVerify | Medium | Low | **P3** |
+
+## Phase 1: Quick Wins (P1)
+
+### 1.1 SSH Hardening
+
+Edit `system/sshd.nix`:
+
+```nix
+services.openssh = {
+  enable = true;
+  settings = {
+    PermitRootLogin = "prohibit-password";  # Key-only root login
+    PasswordAuthentication = false;
+    KbdInteractiveAuthentication = false;
+  };
+};
+```
+
+**Prerequisite:** Verify all hosts have SSH keys deployed for root.
+
+### 1.2 Enable Firewall
+
+Create `system/firewall.nix` with default deny policy:
+
+```nix
+{ ... }: {
+  networking.firewall.enable = true;
+
+  # Use openssh's built-in firewall integration
+  services.openssh.openFirewall = true;
+}
+```
+
+**Useful firewall options:**
+
+| Option | Description |
+|--------|-------------|
+| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
+| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
+| `networking.firewall.extraInputRules` | Custom nftables rules for complex filtering (requires `networking.nftables.enable = true`) |
+
+**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
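+
+If the restriction is wanted, it might look like the following sketch. It assumes
+the nftables backend and replaces the blanket `openFirewall` rule with a
+source-restricted one:
+
+```nix
+{ ... }: {
+  # extraInputRules only works with the nftables-based firewall
+  networking.nftables.enable = true;
+  networking.firewall.enable = true;
+
+  # Do not open port 22 for every source; the rule below handles it
+  services.openssh.openFirewall = false;
+
+  networking.firewall.extraInputRules = ''
+    ip saddr 10.69.13.0/24 tcp dport 22 accept
+  '';
+}
+```
+
+Note that this would also block SSH from the workstation subnet (`10.69.30.0/24`)
+unless that range is added to the rule.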
+
+#### Per-Interface Rules (http-proxy WireGuard)
+
+The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
+
+```nix
+# Example: http-proxy with different rules per interface
+networking.firewall = {
+  enable = true;
+
+  # Default: only SSH (via openFirewall)
+  allowedTCPPorts = [ ];
+
+  # LAN interface: allow HTTP/HTTPS
+  interfaces.ens18 = {
+    allowedTCPPorts = [ 80 443 ];
+  };
+
+  # WireGuard interface: restrict to specific services or trust fully
+  interfaces.wg0 = {
+    allowedTCPPorts = [ 80 443 ];
+    # Or set networking.firewall.trustedInterfaces = [ "wg0" ] if fully trusted
+  };
+};
+```
+
+**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
+
+Then per-host, open required ports:
+
+| Host | Additional Ports |
+|------|------------------|
+| ns1/ns2 | 53 (TCP/UDP) |
+| vault01 | 8200 |
+| monitoring01 | 3100, 9090, 3000, 9093 |
+| http-proxy | 80, 443 |
+| nats1 | 4222 |
+| ha1 | 1883, 8123 |
+| jelly01 | 8096 |
+| nix-cache01 | 5000 |
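+
+For example, the ns1/ns2 row could translate to the following in the host's own
+configuration (where exactly this lives in the repo is an assumption):
+
+```nix
+{ ... }: {
+  # DNS needs both TCP and UDP on port 53
+  networking.firewall.allowedTCPPorts = [ 53 ];
+  networking.firewall.allowedUDPPorts = [ 53 ];
+}
+```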
+
+## Phase 2: Logging & Detection (P2)
+
+### 2.1 Enable TLS for Promtail → Loki
+
+Update `system/monitoring/logs.nix`:
+
+```nix
+clients = [{
+  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
+  tls_config = {
+    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
+  };
+}];
+```
+
+Requires:
+- Configure Loki with TLS certificate (use internal ACME)
+- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)
+
+### 2.2 Security Alert Rules
+
+Add to `services/monitoring/rules.yml`:
+
+```yaml
+- name: security_rules
+  rules:
+    - alert: ssh_auth_failures
+      expr: increase(node_logind_sessions_total[5m]) > 20
+      for: 0m
+      labels:
+        severity: warning
+      annotations:
+        summary: "Unusual login activity on {{ $labels.instance }}"
+
+    - alert: vault_secret_fetch_failure
+      expr: increase(vault_secret_failures[5m]) > 5
+      for: 0m
+      labels:
+        severity: warning
+      annotations:
+        summary: "Vault secret fetch failures on {{ $labels.instance }}"
+```
+
+Also add Loki-based alerts for:
+- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
+- sudo usage: `{job="systemd-journal"} |= "sudo"`
+
+### 2.3 Global Audit Logging
+
+Add the `common/ssh-audit.nix` import to `system/default.nix` (the path is relative to `system/`):
+
+```nix
+imports = [
+  # ... existing imports
+  ../common/ssh-audit.nix
+];
+```
+
+## Phase 3: Defense in Depth (P3)
+
+### 3.1 Loki Authentication
+
+Options:
+1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
+2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
+3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy
+
+Recommendation: Option 1 (reverse proxy) is simplest for homelab.
+
+### 3.2 AppRole Secret Rotation
+
+Update `terraform/vault/approle.tf`:
+
+```hcl
+secret_id_ttl  = 2592000  # 30 days
+```
+
+Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.
+
+### 3.3 Enable Vault TLS Verification
+
+Change default in `system/vault-secrets.nix`:
+
+```nix
+skipTlsVerify = mkOption {
+  type = types.bool;
+  default = false;  # Changed from true
+};
+```
+
+**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.
+
+## Implementation Order
+
+1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
+2. **Validate SSH access** - Ensure key-based login works before disabling passwords
+3. **Document firewall ports** - Create reference of ports per host before enabling
+4. **Phased prod rollout** - Deploy to prod hosts one at a time, verify each
+
+## Open Questions
+
+- [ ] Do all hosts have SSH keys configured for root access?
+- [ ] Should firewall rules be per-host or use a central definition with roles?
+- [ ] Should Loki authentication use the existing Kanidm setup?
+
+**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.
+
+## Notes
+
+- Firewall changes are the highest risk - test thoroughly on test-tier
+- SSH hardening must not lock out access - verify keys first
+- Consider creating a "break glass" procedure for emergency access if keys fail
--- a/docs/user-management.md
+++ b/docs/user-management.md
@@ -0,0 +1,311 @@
+# User Management with Kanidm
+
+Central authentication for the homelab using Kanidm.
+
+## Overview
+
+- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
+- **WebUI**: https://auth.home.2rjus.net
+- **LDAPS**: port 636
+
+## CLI Setup
+
+The `kanidm` CLI is available in the devshell:
+
+```bash
+nix develop
+
+# Login as idm_admin
+kanidm login --name idm_admin --url https://auth.home.2rjus.net
+```
+
+## User Management
+
+POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
+all attributes (including UNIX password) in one workflow.
+
+### Creating a POSIX User
+
+```bash
+# Create the person
+kanidm person create <username> "<Display Name>"
+
+# Add to groups
+kanidm group add-members ssh-users <username>
+
+# Enable POSIX (UID is auto-assigned)
+kanidm person posix set <username>
+
+# Set UNIX password (required for SSH login, min 10 characters)
+kanidm person posix set-password <username>
+
+# Optionally set login shell
+kanidm person posix set <username> --shell /bin/zsh
+```
+
+### Setting Email Address
+
+Email is required for OAuth2/OIDC login (e.g., Grafana):
+
+```bash
+kanidm person update <username> --mail <email>
+```
+
+### Example: Full User Creation
+
+```bash
+kanidm person create testuser "Test User"
+kanidm person update testuser --mail testuser@home.2rjus.net
+kanidm group add-members ssh-users testuser
+kanidm group add-members users testuser  # Required for OAuth2 scopes
+kanidm person posix set testuser
+kanidm person posix set-password testuser
+kanidm person get testuser
+```
+
+After creation, verify on a client host:
+```bash
+getent passwd testuser
+ssh testuser@testvm01.home.2rjus.net
+```
+
+### Viewing User Details
+
+```bash
+kanidm person get <username>
+```
+
+### Removing a User
+
+```bash
+kanidm person delete <username>
+```
+
+## Group Management
+
+Groups for POSIX access are also managed via CLI.
+
+### Creating a POSIX Group
+
+```bash
+# Create the group
+kanidm group create <group-name>
+
+# Enable POSIX with a specific GID
+kanidm group posix set <group-name> --gidnumber <gid>
+```
+
+### Adding Members
+
+```bash
+kanidm group add-members <group-name> <username>
+```
+
+### Viewing Group Details
+
+```bash
+kanidm group get <group-name>
+kanidm group list-members <group-name>
+```
+
+### Example: Full Group Creation
+
+```bash
+kanidm group create testgroup
+kanidm group posix set testgroup --gidnumber 68010
+kanidm group add-members testgroup testuser
+kanidm group get testgroup
+```
+
+After creation, verify on a client host:
+```bash
+getent group testgroup
+```
+
+### Current Groups
+
+| Group | GID | Purpose |
+|-------|-----|---------|
+| ssh-users | 68000 | SSH login access |
+| admins | 68001 | Administrative access |
+| users | 68002 | General users |
+
+### UID/GID Allocation
+
+Kanidm auto-assigns UIDs/GIDs from its configured range. Group GIDs are assigned manually from a reserved range:
+
+| Range | Purpose |
+|-------|---------|
+| 65,536+ | Users (auto-assigned) |
+| 68,000 - 68,999 | Groups (manually assigned) |
+
+## OAuth2/OIDC Login (Web Services)
+
+For OAuth2/OIDC login to web services like Grafana, users need:
+
+1. **Primary credential** - Password set via `credential update` (separate from the UNIX password)
+2. **MFA** - TOTP or passkey (Kanidm requires MFA for primary credentials)
+3. **Group membership** - Member of `users` group (for OAuth2 scope mapping)
+4. **Email address** - Set via `person update --mail`
+
+### Setting Up Primary Credential (Web Login)
+
+The primary credential is different from the UNIX/POSIX password:
+
+```bash
+# Interactive credential setup
+kanidm person credential update <username>
+
+# In the interactive prompt:
+# 1. Type 'password' to set a password
+# 2. Type 'totp' to add TOTP (scan QR with authenticator app)
+# 3. Type 'commit' to save
+```
+
+### Verifying OAuth2 Readiness
+
+```bash
+kanidm person get <username>
+```
+
+Check for:
+- `mail:` - Email address set
+- `memberof:` - Includes `users@home.2rjus.net`
+- Primary credential status (check via `credential update` → `status`)
+
+## PAM/NSS Client Configuration
+
+Enable central authentication on a host:
+
+```nix
+homelab.kanidm.enable = true;
+```
+
+This configures:
+- `services.kanidm.enablePam = true`
+- Client connection to auth.home.2rjus.net
+- Login authorization for `ssh-users` group
+- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
+- Home directory symlinks (`/home/torjus` → UUID-based directory)
+
+### Enabled Hosts
+
+- testvm01, testvm02, testvm03 (test tier)
+
+### Options
+
+```nix
+homelab.kanidm = {
+  enable = true;
+  server = "https://auth.home.2rjus.net";  # default
+  allowedLoginGroups = [ "ssh-users" ];     # default
+};
+```
+
+### Home Directories
+
+Home directories use UUID-based paths for stability (so renaming a user doesn't
+require moving their home directory). Symlinks provide convenient access:
+
+```
+/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
+```
+
+The symlinks are created by `kanidm-unixd-tasks` on first login.
+
+## Testing
+
+### Verify NSS Resolution
+
+```bash
+# Check user resolution
+getent passwd <username>
+
+# Check group resolution
+getent group <group-name>
+```
+
+### Test SSH Login
+
+```bash
+ssh <username>@<hostname>.home.2rjus.net
+```
+
+## Troubleshooting
+
+### "PAM user mismatch" error
+
+SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
+usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).
+
+**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).
+
+Check current format:
+```bash
+getent passwd torjus
+# Should show: torjus:x:65536:...
+# NOT: torjus@home.2rjus.net:x:65536:...
+```
+
+### User resolves but SSH fails immediately
+
+The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:
+
+```bash
+# Check if group has POSIX
+getent group ssh-users
+
+# If empty, enable POSIX on the server
+kanidm group posix set ssh-users --gidnumber 68000
+```
+
+### User doesn't resolve via getent
+
+1. Check kanidm-unixd service is running:
+   ```bash
+   systemctl status kanidm-unixd
+   ```
+
+2. Check unixd can reach server:
+   ```bash
+   kanidm-unix status
+   # Should show: system: online, Kanidm: online
+   ```
+
+3. Check client can reach server:
+   ```bash
+   curl -s https://auth.home.2rjus.net/status
+   ```
+
+4. Check user has POSIX enabled on server:
+   ```bash
+   kanidm person get <username>
+   ```
+
+5. Restart nscd to clear stale cache:
+   ```bash
+   systemctl restart nscd
+   ```
+
+6. Invalidate kanidm cache:
+   ```bash
+   kanidm-unix cache-invalidate
+   ```
+
+### Changes not taking effect after deployment
+
+NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying
+kanidm-unixd config changes, you may need to restart both services:
+
+```bash
+systemctl restart kanidm-unixd
+systemctl restart nscd
+```
+
+### Test PAM authentication directly
+
+Use the kanidm-unix CLI to test PAM auth without SSH:
+
+```bash
+kanidm-unix auth-test --name <username>
+```
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770447502,
-        "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
+        "lastModified": 1771004123,
+        "narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
        "ref": "master",
-        "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
-        "revCount": 22,
+        "rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
+        "revCount": 36,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
@@ -42,27 +42,6 @@
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      }
    },
-    "labmon": {
-      "inputs": {
-        "nixpkgs": [
-          "nixpkgs-unstable"
-        ]
-      },
-      "locked": {
-        "lastModified": 1748983975,
-        "narHash": "sha256-DA5mOqxwLMj/XLb4hvBU1WtE6cuVej7PjUr8N0EZsCE=",
-        "ref": "master",
-        "rev": "040a73e891a70ff06ec7ab31d7167914129dbf7d",
-        "revCount": 17,
-        "type": "git",
-        "url": "https://git.t-juice.club/torjus/labmon"
-      },
-      "original": {
-        "ref": "master",
-        "type": "git",
-        "url": "https://git.t-juice.club/torjus/labmon"
-      }
-    },
    "nixos-exporter": {
      "inputs": {
        "nixpkgs": [
@@ -70,11 +49,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770422522,
-        "narHash": "sha256-WmIFnquu4u58v8S2bOVWmknRwHn4x88CRfBFTzJ1inQ=",
+        "lastModified": 1770593543,
+        "narHash": "sha256-hT8Rj6JAwGDFvcxWEcUzTCrWSiupCfBa57pBDnM2C5g=",
        "ref": "refs/heads/master",
-        "rev": "cf0ce858997af4d8dcc2ce10393ff393e17fc911",
-        "revCount": 11,
+        "rev": "5aa5f7275b7a08015816171ba06d2cbdc2e02d3e",
+        "revCount": 15,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/nixos-exporter"
      },
@@ -85,11 +64,11 @@
    },
    "nixpkgs": {
      "locked": {
-        "lastModified": 1770136044,
-        "narHash": "sha256-tlFqNG/uzz2++aAmn4v8J0vAkV3z7XngeIIB3rM3650=",
+        "lastModified": 1771043024,
+        "narHash": "sha256-O1XDr7EWbRp+kHrNNgLWgIrB0/US5wvw9K6RERWAj6I=",
        "owner": "nixos",
        "repo": "nixpkgs",
-        "rev": "e576e3c9cf9bad747afcddd9e34f51d18c855b4e",
+        "rev": "3aadb7ca9eac2891d52a9dec199d9580a6e2bf44",
        "type": "github"
      },
      "original": {
@@ -101,11 +80,11 @@
    },
    "nixpkgs-unstable": {
      "locked": {
-        "lastModified": 1770197578,
-        "narHash": "sha256-AYqlWrX09+HvGs8zM6ebZ1pwUqjkfpnv8mewYwAo+iM=",
+        "lastModified": 1771008912,
+        "narHash": "sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb+ZnAo5RzSxJg=",
        "owner": "nixos",
        "repo": "nixpkgs",
-        "rev": "00c21e4c93d963c50d4c0c89bfa84ed6e0694df2",
+        "rev": "a82ccc39b39b621151d6732718e3e250109076fa",
        "type": "github"
      },
      "original": {
@@ -119,31 +98,9 @@
      "inputs": {
        "alerttonotify": "alerttonotify",
        "homelab-deploy": "homelab-deploy",
-        "labmon": "labmon",
        "nixos-exporter": "nixos-exporter",
        "nixpkgs": "nixpkgs",
-        "nixpkgs-unstable": "nixpkgs-unstable",
-        "sops-nix": "sops-nix"
-      }
-    },
-    "sops-nix": {
-      "inputs": {
-        "nixpkgs": [
-          "nixpkgs-unstable"
-        ]
-      },
-      "locked": {
-        "lastModified": 1770145881,
-        "narHash": "sha256-ktjWTq+D5MTXQcL9N6cDZXUf9kX8JBLLBLT0ZyOTSYY=",
-        "owner": "Mic92",
-        "repo": "sops-nix",
-        "rev": "17eea6f3816ba6568b8c81db8a4e6ca438b30b7c",
-        "type": "github"
-      },
-      "original": {
-        "owner": "Mic92",
-        "repo": "sops-nix",
-        "type": "github"
+        "nixpkgs-unstable": "nixpkgs-unstable"
      }
    }
  },
--- a/flake.nix
+++ b/flake.nix
@@ -5,18 +5,10 @@
    nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-25.11";
    nixpkgs-unstable.url = "github:nixos/nixpkgs?ref=nixos-unstable";

-    sops-nix = {
-      url = "github:Mic92/sops-nix";
-      inputs.nixpkgs.follows = "nixpkgs-unstable";
-    };
    alerttonotify = {
      url = "git+https://git.t-juice.club/torjus/alerttonotify?ref=master";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
    };
-    labmon = {
-      url = "git+https://git.t-juice.club/torjus/labmon?ref=master";
-      inputs.nixpkgs.follows = "nixpkgs-unstable";
-    };
    nixos-exporter = {
      url = "git+https://git.t-juice.club/torjus/nixos-exporter";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
@@ -32,9 +24,7 @@
      self,
      nixpkgs,
      nixpkgs-unstable,
-      sops-nix,
      alerttonotify,
-      labmon,
      nixos-exporter,
      homelab-deploy,
      ...
@@ -50,7 +40,6 @@
      commonOverlays = [
        overlay-unstable
        alerttonotify.overlays.default
-        labmon.overlays.default
      ];
      # Common modules applied to all hosts
      commonModules = [
@@ -61,7 +50,6 @@
            system.configurationRevision = self.rev or self.dirtyRev or "dirty";
          }
        )
-        sops-nix.nixosModules.sops
        nixos-exporter.nixosModules.default
        homelab-deploy.nixosModules.default
        ./modules/homelab
@@ -77,46 +65,19 @@
    in
    {
      nixosConfigurations = {
-        ns1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/ns1
-          ];
-        };
-        ns2 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/ns2
-          ];
-        };
        ha1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/ha1
          ];
        };
-        template1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/template
-          ];
-        };
        template2 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/template2
@@ -125,62 +86,34 @@
        http-proxy = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/http-proxy
          ];
        };
-        ca = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/ca
-          ];
-        };
        monitoring01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/monitoring01
-            labmon.nixosModules.labmon
          ];
        };
        jelly01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/jelly01
          ];
        };
-        nix-cache01 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/nix-cache01
-          ];
-        };
-        pgdb1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self sops-nix;
-          };
-          modules = commonModules ++ [
-            ./hosts/pgdb1
-          ];
-        };
        nats1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/nats1
@@ -189,7 +122,7 @@
        vault01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/vault01
@@ -198,7 +131,7 @@
        testvm01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/testvm01
@@ -207,7 +140,7 @@
        testvm02 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/testvm02
@@ -216,12 +149,66 @@
        testvm03 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self sops-nix;
+            inherit inputs self;
          };
          modules = commonModules ++ [
            ./hosts/testvm03
          ];
        };
+        ns2 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/ns2
+          ];
+        };
+        ns1 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/ns1
+          ];
+        };
+        kanidm01 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/kanidm01
+          ];
+        };
+        monitoring02 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/monitoring02
+          ];
+        };
+        nix-cache02 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/nix-cache02
+          ];
+        };
+        garage01 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/garage01
+          ];
+        };
      };
      packages = forAllSystems (
        { pkgs }:
@@ -238,9 +225,12 @@
              pkgs.ansible
              pkgs.opentofu
              pkgs.openbao
+              pkgs.kanidm_1_8
+              pkgs.nkeys
              (pkgs.callPackage ./scripts/create-host { })
              homelab-deploy.packages.${pkgs.system}.default
            ];
+            ANSIBLE_CONFIG = "./ansible/ansible.cfg";
          };
        }
      );
--- a/hosts/nix-cache01/configuration.nix
+++ b/hosts/nix-cache01/configuration.nix
@@ -1,33 +1,37 @@
 {
+  config,
+  lib,
  pkgs,
  ...
 }:

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

-  homelab.dns.cnames = [ "nix-cache" "actions1" ];
-
-  homelab.host.role = "build-host";
-
-  fileSystems."/nix" = {
-    device = "/dev/disk/by-label/nixcache";
-    fsType = "xfs";
+  # Host metadata (adjust as needed)
+  homelab.host = {
+    tier = "test";  # Start in test tier, move to prod after validation
+    role = "storage";
  };
+
+  homelab.dns.cnames = [ "s3" ];
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub = {
-    enable = true;
-    device = "/dev/sda";
-    configurationLimit = 3;
-  };
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";

-  networking.hostName = "nix-cache01";
+  networking.hostName = "garage01";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -41,7 +45,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.15/24"
+      "10.69.13.26/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -54,9 +58,6 @@
    "nix-command"
    "flakes"
  ];
-  vault.enable = true;
-  homelab.deploy.enable = true;
-
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
@@ -64,13 +65,11 @@
    git
  ];

-  services.qemuGuest.enable = true;
-
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "24.05"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/garage01/default.nix
+++ b/hosts/garage01/default.nix
@@ -1,7 +1,6 @@
-{ ... }:
-{
+{ ... }: {
  imports = [
    ./configuration.nix
-    ../../services/ca
+    ../../services/garage
  ];
 }
--- a/hosts/ha1/configuration.nix
+++ b/hosts/ha1/configuration.nix
@@ -7,12 +7,14 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

+  homelab.host.role = "home-automation";
+
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
@@ -85,6 +87,7 @@
      "--keep-monthly 6"
      "--keep-within 1d"
    ];
+    extraOptions = [ "--retry-lock=5m" ];
  };

  # Open ports in the firewall.
--- a/hosts/template/hardware-configuration.nix
+++ b/hosts/template/hardware-configuration.nix
--- a/hosts/http-proxy/configuration.nix
+++ b/hosts/http-proxy/configuration.nix
@@ -5,12 +5,13 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

+  homelab.host.role = "proxy";
  homelab.dns.cnames = [
    "nzbget"
    "radarr"
--- a/hosts/http-proxy/hardware-configuration.nix
+++ b/hosts/http-proxy/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/jelly01/configuration.nix
+++ b/hosts/jelly01/configuration.nix
@@ -5,12 +5,14 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

+  homelab.host.role = "media";
+
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
@@ -61,9 +63,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  zramSwap = {
-    enable = true;
-  };
+  vault.enable = true;
+  homelab.deploy.enable = true;

  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/jelly01/hardware-configuration.nix
+++ b/hosts/jelly01/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/jump/configuration.nix
+++ b/hosts/jump/configuration.nix
@@ -1,56 +0,0 @@
-{ config, lib, pkgs, ... }:
-
-{
-  imports =
-    [
-      ../template/hardware-configuration.nix
-      ../../system
-    ];
-
-  nixpkgs.config.allowUnfree = true;
-
-  homelab.host.role = "bastion";
-
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
-
-  networking.hostName = "jump";
-  networking.domain = "home.2rjus.net";
-  networking.useNetworkd = true;
-  networking.useDHCP = false;
-  services.resolved.enable = false;
-  networking.nameservers = [
-    "10.69.13.5"
-    "10.69.13.6"
-  ];
-
-  systemd.network.enable = true;
-  systemd.network.networks."ens18" = {
-    matchConfig.Name = "ens18";
-    address = [
-      "10.69.13.10/24"
-    ];
-    routes = [
-      { Gateway = "10.69.13.1"; }
-    ];
-    linkConfig.RequiredForOnline = "routable";
-  };
-  time.timeZone = "Europe/Oslo";
-
-  nix.settings.experimental-features = [ "nix-command" "flakes" ];
-  environment.systemPackages = with pkgs; [
-    vim
-    wget
-    git
-  ];
-
-  # Open ports in the firewall.
-  # networking.firewall.allowedTCPPorts = [ ... ];
-  # networking.firewall.allowedUDPPorts = [ ... ];
-  # Or disable the firewall altogether.
-  networking.firewall.enable = false;
-
-  system.stateVersion = "23.11"; # Did you read the comment?
-}
-
--- a/hosts/jump/hardware-configuration.nix
+++ b/hosts/jump/hardware-configuration.nix
@@ -1,36 +0,0 @@
-{ config, lib, pkgs, modulesPath, ... }:
-
-{
-  imports =
-    [
-      (modulesPath + "/profiles/qemu-guest.nix")
-    ];
-
-  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
-  boot.initrd.kernelModules = [ ];
-  # boot.kernelModules = [ ];
-  # boot.extraModulePackages = [ ];
-
-  fileSystems."/" =
-    {
-      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
-      fsType = "xfs";
-    };
-
-  fileSystems."/boot" =
-    {
-      device = "/dev/disk/by-uuid/BC07-3B7A";
-      fsType = "vfat";
-    };
-
-  swapDevices =
-    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/kanidm01/configuration.nix
+++ b/hosts/kanidm01/configuration.nix
@@ -1,25 +1,38 @@
-{ config, lib, pkgs, ... }:
+{
+  config,
+  lib,
+  pkgs,
+  ...
+}:

 {
-  imports =
-    [
-      ./hardware-configuration.nix
+  imports = [
+    ../template2/hardware-configuration.nix

-      ../../system
-    ];
-
-  # Template host - exclude from DNS zone generation
-  homelab.dns.enable = false;
+    ../../system
+    ../../common/vm
+    ../../services/kanidm
+  ];

  homelab.host = {
-    tier = "test";
-    priority = "low";
+    tier = "prod";
+    role = "auth";
  };

+  # DNS CNAME for auth.home.2rjus.net
+  homelab.dns.cnames = [ "auth" ];

+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
-  networking.hostName = "nixos-template";
+  boot.loader.grub.device = "/dev/vda";
+
+  networking.hostName = "kanidm01";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,19 +46,21 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.8.250/24"
+      "10.69.13.23/24"
    ];
    routes = [
-      { Gateway = "10.69.8.1"; }
+      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [ "nix-command" "flakes" ];
+  nix.settings.experimental-features = [
+    "nix-command"
+    "flakes"
+  ];
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
-    age
    vim
    wget
    git
@@ -57,6 +72,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
-
--- a/hosts/kanidm01/default.nix
+++ b/hosts/kanidm01/default.nix
--- a/hosts/monitoring01/configuration.nix
+++ b/hosts/monitoring01/configuration.nix
@@ -5,12 +5,14 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

+  homelab.host.role = "monitoring";
+
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
@@ -81,6 +83,7 @@
      "--keep-monthly 6"
      "--keep-within 1d"
    ];
+    extraOptions = [ "--retry-lock=5m" ];
  };

  services.restic.backups.grafana-db = {
@@ -98,61 +101,7 @@
      "--keep-monthly 6"
      "--keep-within 1d"
    ];
-  };
-
-  labmon = {
-    enable = true;
-
-    settings = {
-      ListenAddr = ":9969";
-      Profiling = true;
-      StepMonitors = [
-        {
-          Enabled = true;
-          BaseURL = "https://ca.home.2rjus.net";
-          RootID = "3381bda8015a86b9a3cd1851439d1091890a79005e0f1f7c4301fe4bccc29d80";
-        }
-      ];
-
-      TLSConnectionMonitors = [
-        {
-          Enabled = true;
-          Address = "ca.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "jelly.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "grafana.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "prometheus.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "alertmanager.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "pyroscope.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-      ];
-    };
+    extraOptions = [ "--retry-lock=5m" ];
  };

  # Open ports in the firewall.
--- a/hosts/monitoring01/hardware-configuration.nix
+++ b/hosts/monitoring01/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/monitoring02/configuration.nix
+++ b/hosts/monitoring02/configuration.nix
@@ -1,25 +1,36 @@
 {
+  config,
+  lib,
  pkgs,
  ...
 }:

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

-  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub = {
-    enable = true;
-    device = "/dev/sda";
-    configurationLimit = 3;
+  homelab.host = {
+    tier = "prod";
+    role = "monitoring";
  };

-  networking.hostName = "pgdb1";
+  homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" "loki" ];
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+
+  networking.hostName = "monitoring02";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,7 +44,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.16/24"
+      "10.69.13.24/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -59,5 +70,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/monitoring02/default.nix
+++ b/hosts/monitoring02/default.nix
@@ -0,0 +1,8 @@
+{ ... }: {
+  imports = [
+    ./configuration.nix
+    ../../services/grafana
+    ../../services/victoriametrics
+    ../../services/loki
+  ];
+}
--- a/hosts/nats1/configuration.nix
+++ b/hosts/nats1/configuration.nix
@@ -5,12 +5,14 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ./hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

+  homelab.host.role = "messaging";
+
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
@@ -59,5 +61,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

+  vault.enable = true;
+  homelab.deploy.enable = true;
+
  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/nats1/hardware-configuration.nix
+++ b/hosts/nats1/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/nix-cache01/zram.nix
+++ b/hosts/nix-cache01/zram.nix
@@ -1,6 +0,0 @@
-{ ... }:
-{
-  zramSwap = {
-    enable = true;
-  };
-}
--- a/hosts/nix-cache02/builder.nix
+++ b/hosts/nix-cache02/builder.nix
@@ -0,0 +1,45 @@
+{ config, ... }:
+{
+  # Fetch builder NKey from Vault
+  vault.secrets.builder-nkey = {
+    secretPath = "shared/homelab-deploy/builder-nkey";
+    extractKey = "nkey";
+    outputDir = "/run/secrets/builder-nkey";
+    services = [ "homelab-deploy-builder" ];
+  };
+
+  # Configure the builder service
+  services.homelab-deploy.builder = {
+    enable = true;
+    natsUrl = "nats://nats1.home.2rjus.net:4222";
+    nkeyFile = "/run/secrets/builder-nkey";
+
+    settings.repos = {
+      nixos-servers = {
+        url = "git+https://git.t-juice.club/torjus/nixos-servers.git";
+        defaultBranch = "master";
+      };
+      nixos = {
+        url = "git+https://git.t-juice.club/torjus/nixos.git";
+        defaultBranch = "master";
+      };
+    };
+
+    timeout = 7200;
+    metrics.enable = true;
+  };
+
+  # Expose builder metrics for Prometheus scraping
+  homelab.monitoring.scrapeTargets = [
+    {
+      job_name = "homelab-deploy-builder";
+      port = 9973;
+    }
+  ];
+
+  # Ensure builder starts after vault secret is available
+  systemd.services.homelab-deploy-builder = {
+    after = [ "vault-secret-builder-nkey.service" ];
+    requires = [ "vault-secret-builder-nkey.service" ];
+  };
+}
--- a/hosts/nix-cache02/configuration.nix
+++ b/hosts/nix-cache02/configuration.nix
@@ -1,25 +1,36 @@
 {
+  config,
+  lib,
  pkgs,
  ...
 }:

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

-  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub = {
-    enable = true;
-    device = "/dev/sda";
-    configurationLimit = 3;
+  homelab.host = {
+    tier = "prod";
+    role = "build-host";
  };

-  networking.hostName = "ca";
+  homelab.dns.cnames = [ "nix-cache" ];
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+
+  networking.hostName = "nix-cache02";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -33,7 +44,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.12/24"
+      "10.69.13.25/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -59,5 +70,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/nix-cache02/default.nix
+++ b/hosts/nix-cache02/default.nix
@@ -1,9 +1,8 @@
-{ ... }:
-{
+{ ... }: {
  imports = [
    ./configuration.nix
+    ./builder.nix
+    ./scheduler.nix
    ../../services/nix-cache
-    ../../services/actions-runner
-    ./zram.nix
  ];
 }
--- a/hosts/nix-cache02/scheduler.nix
+++ b/hosts/nix-cache02/scheduler.nix
@@ -0,0 +1,61 @@
+{ config, pkgs, lib, inputs, ... }:
+let
+  homelab-deploy = inputs.homelab-deploy.packages.${pkgs.stdenv.hostPlatform.system}.default;
+
+  scheduledBuildScript = pkgs.writeShellApplication {
+    name = "scheduled-build";
+    runtimeInputs = [ homelab-deploy ];
+    text = ''
+      NATS_URL="nats://nats1.home.2rjus.net:4222"
+      NKEY_FILE="/run/secrets/scheduler-nkey"
+
+      echo "Starting scheduled builds at $(date)"
+
+      # Build all nixos-servers hosts
+      homelab-deploy build \
+        --nats-url "$NATS_URL" \
+        --nkey-file "$NKEY_FILE" \
+        nixos-servers --all
+
+      # Build all nixos (gunter) hosts
+      homelab-deploy build \
+        --nats-url "$NATS_URL" \
+        --nkey-file "$NKEY_FILE" \
+        nixos --all
+
+      echo "Scheduled builds completed at $(date)"
+    '';
+  };
+in
+{
+  # Fetch scheduler NKey from Vault
+  vault.secrets.scheduler-nkey = {
+    secretPath = "shared/homelab-deploy/scheduler-nkey";
+    extractKey = "nkey";
+    outputDir = "/run/secrets/scheduler-nkey";
+    services = [ "scheduled-build" ];
+  };
+
+  # Timer: every 2 hours
+  systemd.timers.scheduled-build = {
+    description = "Trigger scheduled Nix builds";
+    wantedBy = [ "timers.target" ];
+    timerConfig = {
+      OnCalendar = "*-*-* 00/2:00:00"; # Every 2 hours at :00
+      Persistent = true; # Run missed builds on boot
+      RandomizedDelaySec = "5m"; # Slight jitter
+    };
+  };
+
+  # Service: oneshot that triggers builds
+  systemd.services.scheduled-build = {
+    description = "Trigger builds for all hosts via NATS";
+    after = [ "network-online.target" "vault-secret-scheduler-nkey.service" ];
+    requires = [ "vault-secret-scheduler-nkey.service" ];
+    wants = [ "network-online.target" ];
+    serviceConfig = {
+      Type = "oneshot";
+      ExecStart = lib.getExe scheduledBuildScript;
+    };
+  };
+}
--- a/hosts/ns1/configuration.nix
+++ b/hosts/ns1/configuration.nix
@@ -7,23 +7,38 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix

    ../../system
+    ../../common/vm
+
+    # DNS services
    ../../services/ns/master-authorative.nix
    ../../services/ns/resolver.nix
-    ../../common/vm
  ];

+  # Host metadata
+  homelab.host = {
+    tier = "prod";
+    role = "dns";
+    labels.dns_role = "primary";
+  };
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
+  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "ns1";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
+  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -47,14 +62,6 @@
    "nix-command"
    "flakes"
  ];
-  vault.enable = true;
-  homelab.deploy.enable = true;
-
-  homelab.host = {
-    role = "dns";
-    labels.dns_role = "primary";
-  };
-
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
@@ -68,5 +75,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/ns1/hardware-configuration.nix
+++ b/hosts/ns1/hardware-configuration.nix
@@ -1,36 +0,0 @@
-{ config, lib, pkgs, modulesPath, ... }:
-
-{
-  imports =
-    [
-      (modulesPath + "/profiles/qemu-guest.nix")
-    ];
-
-  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
-  boot.initrd.kernelModules = [ ];
-  # boot.kernelModules = [ ];
-  # boot.extraModulePackages = [ ];
-
-  fileSystems."/" =
-    {
-      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
-      fsType = "xfs";
-    };
-
-  fileSystems."/boot" =
-    {
-      device = "/dev/disk/by-uuid/BC07-3B7A";
-      fsType = "vfat";
-    };
-
-  swapDevices =
-    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/ns2/configuration.nix
+++ b/hosts/ns2/configuration.nix
@@ -7,23 +7,38 @@

 {
  imports = [
-    ../template/hardware-configuration.nix
+    ../template2/hardware-configuration.nix

    ../../system
+    ../../common/vm
+
+    # DNS services
    ../../services/ns/secondary-authorative.nix
    ../../services/ns/resolver.nix
-    ../../common/vm
  ];

+  # Host metadata
+  homelab.host = {
+    tier = "prod";
+    role = "dns";
+    labels.dns_role = "secondary";
+  };
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/sda";
+  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "ns2";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
+  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -47,14 +62,7 @@
    "nix-command"
    "flakes"
  ];
-  vault.enable = true;
-  homelab.deploy.enable = true;
-
-  homelab.host = {
-    role = "dns";
-    labels.dns_role = "secondary";
-  };
-
+  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
    wget
@@ -67,5 +75,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "23.11"; # Did you read the comment?
+  system.stateVersion = "25.11"; # Did you read the comment?
 }
--- a/hosts/ns2/hardware-configuration.nix
+++ b/hosts/ns2/hardware-configuration.nix
@@ -1,36 +0,0 @@
-{ config, lib, pkgs, modulesPath, ... }:
-
-{
-  imports =
-    [
-      (modulesPath + "/profiles/qemu-guest.nix")
-    ];
-
-  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
-  boot.initrd.kernelModules = [ ];
-  # boot.kernelModules = [ ];
-  # boot.extraModulePackages = [ ];
-
-  fileSystems."/" =
-    {
-      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
-      fsType = "xfs";
-    };
-
-  fileSystems."/boot" =
-    {
-      device = "/dev/disk/by-uuid/BC07-3B7A";
-      fsType = "vfat";
-    };
-
-  swapDevices =
-    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/pgdb1/default.nix
+++ b/hosts/pgdb1/default.nix
@@ -1,7 +0,0 @@
-{ ... }:
-{
-  imports = [
-    ./configuration.nix
-    ../../services/postgres
-  ];
-}
--- a/hosts/template/default.nix
+++ b/hosts/template/default.nix
@@ -1,7 +0,0 @@
-{ ... }: {
-  imports = [
-    ./hardware-configuration.nix
-    ./configuration.nix
-    ./scripts.nix
-  ];
-}
--- a/hosts/template/scripts.nix
+++ b/hosts/template/scripts.nix
@@ -1,36 +0,0 @@
-{ pkgs, ... }:
-let
-  prepare-host-script = pkgs.writeShellApplication {
-    name = "prepare-host.sh";
-    runtimeInputs = [ pkgs.age ];
-    text = ''
-      echo "Removing machine-id"
-      rm -f /etc/machine-id || true
-
-      echo "Removing SSH host keys"
-      rm -f /etc/ssh/ssh_host_* || true
-
-      echo "Restarting SSH"
-      systemctl restart sshd
-
-      echo "Removing temporary files"
-      rm -rf /tmp/* || true
-
-      echo "Removing logs"
-      journalctl --rotate || true
-      journalctl --vacuum-time=1s || true
-
-      echo "Removing cache"
-      rm -rf /var/cache/* || true
-
-      echo "Generate age key"
-      rm -rf /var/lib/sops-nix || true
-      mkdir -p /var/lib/sops-nix
-      age-keygen -o /var/lib/sops-nix/key.txt
-    '';
-  };
-in
-{
-  environment.systemPackages = [ prepare-host-script ];
-  users.motd = "Prepare host by running 'prepare-host.sh'.";
-}
--- a/hosts/template2/bootstrap.nix
+++ b/hosts/template2/bootstrap.nix
@@ -6,22 +6,72 @@ let
    text = ''
      set -euo pipefail

+      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+
+      # Send a log entry to Loki with bootstrap status
+      # Usage: log_to_loki <stage> <message>
+      # Fails silently if Loki is unreachable
+      log_to_loki() {
+        local stage="$1"
+        local message="$2"
+        local timestamp_ns
+        timestamp_ns="$(date +%s)000000000"
+
+        local payload
+        payload=$(jq -n \
+          --arg host "$HOSTNAME" \
+          --arg stage "$stage" \
+          --arg branch "''${BRANCH:-master}" \
+          --arg ts "$timestamp_ns" \
+          --arg msg "$message" \
+          '{
+            streams: [{
+              stream: {
+                job: "bootstrap",
+                hostname: $host,
+                stage: $stage,
+                branch: $branch
+              },
+              values: [[$ts, $msg]]
+            }]
+          }')
+
+        curl -s --connect-timeout 2 --max-time 5 \
+          -X POST \
+          -H "Content-Type: application/json" \
+          -d "$payload" \
+          "$LOKI_URL" >/dev/null 2>&1 || true
+      }
+
+      echo "================================================================================"
+      echo "                     NIXOS BOOTSTRAP IN PROGRESS"
+      echo "================================================================================"
+      echo ""
+
      # Read hostname set by cloud-init (from Terraform VM name via user-data)
      # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
      HOSTNAME=$(hostnamectl hostname)
-      echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"
+      # Read git branch from environment, default to master
+      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"

+      echo "Hostname: $HOSTNAME"
+      echo ""
      echo "Starting NixOS bootstrap for host: $HOSTNAME"
+
+      log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"
+
      echo "Waiting for network connectivity..."

      # Verify we can reach the git server via HTTPS (doesn't respond to ping)
      if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
        echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
        echo "Check network configuration and DNS settings"
+        log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
        exit 1
      fi

      echo "Network connectivity confirmed"
+      log_to_loki "network_ok" "Network connectivity confirmed"

      # Unwrap Vault token and store AppRole credentials (if provided)
      if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
          chmod 600 /var/lib/vault/approle/secret-id

          echo "Vault credentials unwrapped and stored successfully"
+          log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
        else
          echo "WARNING: Failed to unwrap Vault token"
          if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -63,17 +114,17 @@ let
          echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
          echo ""
          echo "Vault secrets will not be available, but continuing bootstrap..."
+          log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
        fi
      else
        echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
        echo "Skipping Vault credential setup"
+        log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
      fi

      echo "Fetching and building NixOS configuration from flake..."
-
-      # Read git branch from environment, default to master
-      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
      echo "Using git branch: $BRANCH"
+      log_to_loki "building" "Starting nixos-rebuild boot"

      # Build and activate the host-specific configuration
      FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
@@ -81,18 +132,30 @@ let
      if nixos-rebuild boot --flake "$FLAKE_URL"; then
        echo "Successfully built configuration for $HOSTNAME"
        echo "Rebooting into new configuration..."
+        log_to_loki "success" "Build successful - rebooting into new configuration"
        sleep 2
        systemctl reboot
      else
        echo "ERROR: nixos-rebuild failed for $HOSTNAME"
        echo "Check that flake has configuration for this hostname"
        echo "Manual intervention required - system will not reboot"
+        log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
        exit 1
      fi
    '';
  };
 in
 {
+  # Custom greeting line to indicate this is a bootstrap image
+  services.getty.greetingLine = lib.mkForce ''
+    ================================================================================
+                          BOOTSTRAP IMAGE - NixOS \V (\l)
+    ================================================================================
+
+    Bootstrap service is running. Logs are displayed on tty1.
+    Check status: journalctl -fu nixos-bootstrap
+  '';
+
  systemd.services."nixos-bootstrap" = {
    description = "Bootstrap NixOS configuration from flake on first boot";

@@ -107,12 +170,12 @@ in
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
-      ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
+      ExecStart = lib.getExe bootstrap-script;

      # Read environment variables from cloud-init (set by cloud-init write_files)
      EnvironmentFile = "-/run/cloud-init-env";

-      # Logging to journald
+      # Log to journal and console
      StandardOutput = "journal+console";
      StandardError = "journal+console";
    };
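For readers who want to check the shape of the body that `log_to_loki` assembles with `jq`, the following is a minimal Python sketch of the same structure. The helper name `bootstrap_log_payload` and its parameters are hypothetical and not part of the repository, which uses `jq` and `curl`. Only the label names and the seconds-padded-to-nanoseconds timestamp are taken from the script above.

```python
import json
import time


def bootstrap_log_payload(host, stage, message, branch="master", now=None):
    """Build a Loki push body mirroring the jq template in bootstrap.nix.

    Hypothetical helper for illustration only; not code from the repository.
    """
    seconds = int(time.time() if now is None else now)
    # Same approach as the shell script: whole seconds padded with nine zeros
    # so the value is a nanosecond timestamp encoded as a string.
    timestamp_ns = f"{seconds}000000000"
    return {
        "streams": [
            {
                "stream": {
                    "job": "bootstrap",
                    "hostname": host,
                    "stage": stage,
                    "branch": branch,
                },
                "values": [[timestamp_ns, message]],
            }
        ]
    }


if __name__ == "__main__":
    body = bootstrap_log_payload("testvm01", "starting", "Bootstrap starting", now=1700000000)
    print(json.dumps(body, indent=2))
```

Each bootstrap stage becomes its own stream because `stage` is a label, so a query can filter on `job="bootstrap"` together with a specific `stage` value.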
--- a/hosts/template2/configuration.nix
+++ b/hosts/template2/configuration.nix
@@ -35,6 +35,7 @@
  homelab.host = {
    tier = "test";
    priority = "low";
+    labels.ansible = "false";  # Exclude from Ansible inventory
  };

  boot.loader.grub.enable = true;
@@ -58,6 +59,14 @@
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
+  nix.settings.substituters = [
+    "https://nix-cache.home.2rjus.net"
+    "https://cache.nixos.org"
+  ];
+  nix.settings.trusted-public-keys = [
+    "nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
+    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+  ];
  environment.systemPackages = with pkgs; [
    age
    vim
@@ -71,5 +80,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

+  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
+  zramSwap.enable = true;
+
  system.stateVersion = "25.11";
 }
--- a/hosts/template2/scripts.nix
+++ b/hosts/template2/scripts.nix
@@ -2,7 +2,6 @@
 let
  prepare-host-script = pkgs.writeShellApplication {
    name = "prepare-host.sh";
-    runtimeInputs = [ pkgs.age ];
    text = ''
      echo "Removing machine-id"
      rm -f /etc/machine-id || true
@@ -22,11 +21,6 @@ let

      echo "Removing cache"
      rm -rf /var/cache/* || true
-
-      echo "Generate age key"
-      rm -rf /var/lib/sops-nix || true
-      mkdir -p /var/lib/sops-nix
-      age-keygen -o /var/lib/sops-nix/key.txt
    '';
  };
 in
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -11,11 +11,12 @@

    ../../system
    ../../common/vm
+    ../../common/ssh-audit.nix
  ];

-  # Host metadata (adjust as needed)
  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+    tier = "test";
+    role = "test";
  };

  # Enable Vault integration
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
@@ -62,6 +66,39 @@
    git
  ];

+  # Test nginx with ACME certificate from OpenBao PKI
+  services.nginx = {
+    enable = true;
+    virtualHosts."testvm01.home.2rjus.net" = {
+      forceSSL = true;
+      enableACME = true;
+      locations."/" = {
+        root = pkgs.writeTextDir "index.html" ''
+          <!DOCTYPE html>
+          <html>
+          <head>
+            <title>testvm01 - ACME Test</title>
+            <style>
+              body { font-family: monospace; max-width: 600px; margin: 50px auto; padding: 20px; }
+              .joke { background: #f0f0f0; padding: 20px; border-radius: 8px; margin: 20px 0; }
+              .punchline { margin-top: 15px; font-weight: bold; }
+            </style>
+          </head>
+          <body>
+            <h1>OpenBao PKI ACME Test</h1>
+            <p>If you're seeing this over HTTPS, the migration worked!</p>
+            <div class="joke">
+              <p>Why do programmers prefer dark mode?</p>
+              <p class="punchline">Because light attracts bugs.</p>
+            </div>
+            <p><small>Certificate issued by: vault.home.2rjus.net</small></p>
+          </body>
+          </html>
+        '';
+      };
+    };
+  };
+
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
--- a/hosts/testvm02/configuration.nix
+++ b/hosts/testvm02/configuration.nix
@@ -11,11 +11,12 @@

    ../../system
    ../../common/vm
+    ../../common/ssh-audit.nix
  ];

-  # Host metadata (adjust as needed)
  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+    tier = "test";
+    role = "test";
  };

  # Enable Vault integration
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/hosts/testvm03/configuration.nix
+++ b/hosts/testvm03/configuration.nix
@@ -11,11 +11,12 @@

    ../../system
    ../../common/vm
+    ../../common/ssh-audit.nix
  ];

-  # Host metadata (adjust as needed)
  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+    tier = "test";
+    role = "test";
  };

  # Enable Vault integration
@@ -24,6 +25,9 @@
  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

+  # Enable Kanidm PAM/NSS for central authentication
+  homelab.kanidm.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/hosts/vault01/configuration.nix
+++ b/hosts/vault01/configuration.nix
@@ -62,6 +62,16 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

+  # Vault fetches secrets from itself (after unseal)
+  vault.enable = true;
+  homelab.deploy.enable = true;
+
+  # Ensure vault-secret services wait for openbao to be unsealed
+  systemd.services.vault-secret-homelab-deploy-nkey = {
+    after = [ "openbao.service" ];
+    wants = [ "openbao.service" ];
+  };
+
  system.stateVersion = "25.11"; # Did you read the comment?
 }

--- a/lib/monitoring.nix
+++ b/lib/monitoring.nix
@@ -21,6 +21,7 @@ let
      cfg = hostConfig.config;
      monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
      dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
+      hostConfig' = (cfg.homelab or { }).host or { };
      hostname = cfg.networking.hostName;
      networks = cfg.systemd.network.networks or { };

@@ -49,20 +50,72 @@ let
        inherit hostname;
        ip = extractIP firstAddress;
        scrapeTargets = monConfig.scrapeTargets or [ ];
+        # Host metadata for label propagation
+        tier = hostConfig'.tier or "prod";
+        priority = hostConfig'.priority or "high";
+        role = hostConfig'.role or null;
+        labels = hostConfig'.labels or { };
      };

+  # Build effective labels for a host
+  # Always includes hostname and tier; only includes priority/role if non-default
+  buildEffectiveLabels = host:
+    { hostname = host.hostname; tier = host.tier; }
+    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
+    // (lib.optionalAttrs (host.role != null) { role = host.role; })
+    // host.labels;
+
  # Generate node-exporter targets from all flake hosts
+  # Returns a list of static_configs entries with labels
  generateNodeExporterTargets = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
      hostList = lib.filter (x: x != null) (
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );
-      flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
+
+      # Extract hostname from a target string like "gunter.home.2rjus.net:9100"
+      extractHostnameFromTarget = target:
+        builtins.head (lib.splitString "." target);
+
+      # Build target entries with labels for each host
+      flakeEntries = map
+        (host: {
+          target = "${host.hostname}.home.2rjus.net:9100";
+          labels = buildEffectiveLabels host;
+        })
+        hostList;
+
+      # External targets get hostname extracted from the target string
+      externalEntries = map
+        (target: {
+          inherit target;
+          labels = { hostname = extractHostnameFromTarget target; };
+        })
+        (externalTargets.nodeExporter or [ ]);
+
+      allEntries = flakeEntries ++ externalEntries;
+
+      # Group entries by their label set for efficient static_configs
+      # Convert labels attrset to a string key for grouping
+      labelKey = entry: builtins.toJSON entry.labels;
+      grouped = lib.groupBy labelKey allEntries;
+
+      # Convert groups to static_configs format
+      # Every flake host now has at least a hostname label
+      staticConfigs = lib.mapAttrsToList
+        (key: entries:
+          let
+            labels = (builtins.head entries).labels;
+          in
+          { targets = map (e: e.target) entries; labels = labels; }
+        )
+        grouped;
    in
-    flakeTargets ++ (externalTargets.nodeExporter or [ ]);
+    staticConfigs;

  # Generate scrape configs from all flake hosts and external targets
+  # Host labels are propagated to service targets for semantic alert filtering
  generateScrapeConfigs = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
@@ -70,13 +123,14 @@ let
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );

-      # Collect all scrapeTargets from all hosts, grouped by job_name
+      # Collect all scrapeTargets from all hosts, including host labels
      allTargets = lib.flatten (map
        (host:
          map
            (target: {
              inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
              hostname = host.hostname;
+              hostLabels = buildEffectiveLabels host;
            })
            host.scrapeTargets
        )
@@ -87,22 +141,32 @@ let
      grouped = lib.groupBy (t: t.job_name) allTargets;

      # Generate a scrape config for each job
+      # Within each job, group targets by their host labels for efficient static_configs
      flakeScrapeConfigs = lib.mapAttrsToList
        (jobName: targets:
          let
            first = builtins.head targets;
-            targetAddrs = map
-              (t:
+
+            # Group targets within this job by their host labels
+            labelKey = t: builtins.toJSON t.hostLabels;
+            groupedByLabels = lib.groupBy labelKey targets;
+
+            # Every flake host now has at least a hostname label
+            staticConfigs = lib.mapAttrsToList
+              (key: labelTargets:
                let
-                  portStr = toString t.port;
+                  labels = (builtins.head labelTargets).hostLabels;
+                  targetAddrs = map
+                    (t: "${t.hostname}.home.2rjus.net:${toString t.port}")
+                    labelTargets;
                in
-                "${t.hostname}.home.2rjus.net:${portStr}")
-              targets;
+                { targets = targetAddrs; labels = labels; }
+              )
+              groupedByLabels;
+
            config = {
              job_name = jobName;
-              static_configs = [{
-                targets = targetAddrs;
-              }];
+              static_configs = staticConfigs;
            }
            // (lib.optionalAttrs (first.metrics_path != "/metrics") {
              metrics_path = first.metrics_path;
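The label handling in `lib/monitoring.nix` can be hard to follow in diff form, so here is a rough Python sketch of the same idea: build an effective label set per host, serialize it as a grouping key, and emit one `static_configs` entry per distinct key. The function names, the dict-based host records, and the `DOMAIN` constant are assumptions made for illustration. The defaults (`tier = "prod"`, `priority = "high"`, `role = null`) follow the Nix code above.

```python
import json
from collections import defaultdict

DOMAIN = "home.2rjus.net"  # domain suffix used by the targets in the diff


def build_effective_labels(host):
    """Mirror buildEffectiveLabels: hostname and tier always, the rest if set."""
    labels = {"hostname": host["hostname"], "tier": host.get("tier", "prod")}
    if host.get("priority", "high") != "high":
        labels["priority"] = host["priority"]
    if host.get("role") is not None:
        labels["role"] = host["role"]
    # Free-form labels are merged last, so they win on key collisions.
    labels.update(host.get("labels", {}))
    return labels


def node_exporter_static_configs(hosts, port=9100):
    """Group targets by their serialized label set, one entry per group."""
    grouped = defaultdict(list)
    for host in hosts:
        # sort_keys gives a stable key, similar to Nix's sorted attribute sets.
        key = json.dumps(build_effective_labels(host), sort_keys=True)
        grouped[key].append(f"{host['hostname']}.{DOMAIN}:{port}")
    return [
        {"targets": targets, "labels": json.loads(key)}
        for key, targets in sorted(grouped.items())
    ]
```

One consequence visible in the sketch: because `hostname` is always part of the label set, two different hosts never share a key, so in the node-exporter case each group holds a single target. Grouping still matters for service scrape jobs, where one host may expose several targets under the same job.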
--- a/playbooks/inventory.ini
+++ b/playbooks/inventory.ini
@@ -1,5 +0,0 @@
-[proxmox]
-pve1.home.2rjus.net
-
-[proxmox:vars]
-ansible_user=root
--- a/rebuild-all.sh
+++ b/rebuild-all.sh
@@ -1,20 +0,0 @@
-#!/usr/bin/env bash
-set -euo pipefail
-
-# array of hosts
-HOSTS=(
-    "ns1"
-    "ns2"
-    "ca"
-    "ha1"
-    "http-proxy"
-    "jelly01"
-    "monitoring01"
-    "nix-cache01"
-    "pgdb1"
-)
-
-for host in "${HOSTS[@]}"; do
-    echo "Rebuilding $host"
-    nixos-rebuild boot --flake .#${host} --target-host root@${host}
-done
--- a/scripts/create-host/create_host.py
+++ b/scripts/create-host/create_host.py
@@ -314,11 +314,10 @@ def handle_remove(
        for secret_path in host_secrets:
            console.print(f"   [white]vault kv delete secret/{secret_path}[/white]")

-    # Warn about secrets directory
+    # Warn about legacy secrets directory
    if secrets_exist:
-        console.print(f"\n[yellow]⚠️  Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
+        console.print(f"\n[yellow]⚠️  Warning: secrets/{hostname}/ directory exists (legacy SOPS)[/yellow]")
        console.print(f"   Manually remove if no longer needed: [white]rm -rf secrets/{hostname}/[/white]")
-        console.print(f"   Also update .sops.yaml to remove the host's age key")

    # Exit if dry run
    if dry_run:
--- a/scripts/create-host/manipulators.py
+++ b/scripts/create-host/manipulators.py
@@ -219,7 +219,7 @@ def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -
    new_entry = f"""        {config.hostname} = nixpkgs.lib.nixosSystem {{
          inherit system;
          specialArgs = {{
-            inherit inputs self sops-nix;
+            inherit inputs self;
          }};
          modules = commonModules ++ [
            ./hosts/{config.hostname}
--- a/scripts/create-host/validators.py
+++ b/scripts/create-host/validators.py
@@ -140,20 +140,22 @@ def validate_ip_unique(ip: Optional[str], repo_root: Path) -> None:
    ip_part = ip.split("/")[0]

    # Check all hosts/*/configuration.nix files
+    # Search for IP with CIDR notation to match static IP assignments
+    # (e.g., "10.69.13.5/24") but not DNS resolver entries (e.g., "10.69.13.5")
    hosts_dir = repo_root / "hosts"
    if hosts_dir.exists():
        for config_file in hosts_dir.glob("*/configuration.nix"):
            content = config_file.read_text()
-            if ip_part in content:
+            if ip in content:
                raise ValueError(
                    f"IP address {ip_part} already in use in {config_file}"
                )

-    # Check terraform/vms.tf
+    # Check terraform/vms.tf - search for full IP with CIDR
    terraform_file = repo_root / "terraform" / "vms.tf"
    if terraform_file.exists():
        content = terraform_file.read_text()
-        if ip_part in content:
+        if ip in content:
            raise ValueError(
                f"IP address {ip_part} already in use in {terraform_file}"
            )
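To illustrate why the check moved from the bare address to the address with its CIDR suffix, here is a small self-contained sketch. `substring_match` reproduces the plain `in` test from `validate_ip_unique`. `exact_cidr_match` is a hypothetical stricter variant that is not in the repository, shown only to point out a remaining edge case of substring matching. The sample config strings are made up for the example.

```python
import re


def substring_match(needle, content):
    """The plain `in` check, as used by validate_ip_unique."""
    return needle in content


def exact_cidr_match(ip_with_cidr, content):
    """Hypothetical stricter variant: ignore matches embedded in a longer number."""
    pattern = r"(?<![\d.])" + re.escape(ip_with_cidr) + r"(?!\d)"
    return re.search(pattern, content) is not None


# Made-up snippets resembling a resolver list and a static address assignment.
RESOLVER_CONFIG = 'networking.nameservers = [ "10.69.13.5" "10.69.13.6" ];'
STATIC_CONFIG = 'address = [ "10.69.13.50/24" ];'

if __name__ == "__main__":
    # Old behaviour: the bare IP matches resolver entries and longer addresses.
    print(substring_match("10.69.13.5", RESOLVER_CONFIG))
    print(substring_match("10.69.13.5", STATIC_CONFIG))
    # New behaviour: the CIDR form matches neither.
    print(substring_match("10.69.13.5/24", RESOLVER_CONFIG))
    print(substring_match("10.69.13.5/24", STATIC_CONFIG))
```

The CIDR form removes both false positives shown here. A substring test can still match when the address is the tail of a longer one (for example `110.69.13.5/24`), which is the case the regex variant guards against.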
--- a/secrets/ca/keys/intermediate_ca_key
+++ b/secrets/ca/keys/intermediate_ca_key
@@ -1,24 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:TgGIuklFPUSCBosD86NFnkAtRvYijQNQP4vvTkKu3dRAOjdDa2li5djZDUS4NEEPEihpOcMXqHBb+ABk3LmoU5nLmsKCeylUp7+DhcGi9f3xw2h1zbHV37mt40OVLTF3cYufRdydIkCGQA3td3q1ue/wCna2ewe73xwGg5j6ZVJCZAtW4VCNZM+rcG+YxPUC0gmBH59+O0VSrZrkvSnifbr+K0dGwg4i17KwAukI4Ac7YMkQoeuAPXq38+ZftlRx4tq9xBUko6wpPY9zOaFzeagWYMF0n1UYqDt+/3XZI/mukPhJc9tzbWneqgkQBOx3OiDwrNglCHvEpnb+bZePIRLOnNHd1ShETgBqhsHGp9OAwwbAt4tO+HFpCQtVz7s2LWQFLbWiN0SCGzYUkFGCgoXae5H58lxFav8=,iv:UzaWlJ+M+VQx3CcPSGbFZh5/rGbKpS2Rq2XVZAIDFiQ=,tag:F3waoAMuEKTvN2xANReSww==,type:str]",
-	"sops": {
-		"kms": null,
-		"gcp_kms": null,
-		"azure_kv": null,
-		"hc_vault": null,
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBpRGZSVHRSMGlyazAwQU5j\nd1o1L0Y1ckhQMkh4MVZiRmZlR2ozcmdsUW1vCk4xZ1ZibDBrUWZhYmxVVjBUczRn\nYlJtUWF3Y1lHWG56NkhmK2JOUHVGajQKLS0tIDN2S2doQURpTis2U3lWV0NxdWEz\ncjNZaEl1dEQwOXhsNE9xbHhYUzNTV3cKVmVIe05JwgXKSku7AJmrujYXrbBSbpBJ\nnqCuDIhok1w/fiff+XXn8udbgPVq5bC2SOhHbtVxImgBCFzrj5hQ0A==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA4V3NaUEdvMmJvakQ0L1F0\nUnkvQ2F5dEVlZ2pMdlBZcjJac0tERnF5ZWljCmFrdU1NZ29jMkJ1a1ZLdURmVWI0\ncm1vNytFVzZjbVY2aVd2N3laMWNRNFEKLS0tIGgzOTFZY0lxc0JyVmd5cFBlNkRr\nVDBWc0t4c3pVV3RhSTB1UUVpNHd6NUkKNn6Sxb5oxP7iWqTF1+X9nOiYum3U+Rzk\nkryxVnf9EvQIVIFKDaTb+yAEO8otjqj+C4mHA9fannnNEJduOiPWOg==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2024-11-30T13:18:08Z",
-		"mac": "ENC[AES256_GCM,data:9R9RJzPMr9Bv8aeCDxhExTfbr+R2hjap6FGSk5QxBdbNpOcNS78ica0CLEmkAYVAfjmx/X2jC5ZnsAueSPUK7nAgNX2gJXbUTpY0F+oKt35GJziLrFLl3u/ahpF9lQ50EL9OqqgS+igDqtodJhKme5DXH5/GXQHhz++O3VZkR78=,iv:XgN3PiowiEosi2DmrjP82HhJMvnwaV530tsBE8GQfjs=,tag:U243BrtH7H/DU9LcjN/MMg==,type:str]",
-		"pgp": null,
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.1"
-	}
-}
--- a/secrets/ca/keys/root_ca_key
+++ b/secrets/ca/keys/root_ca_key
@@ -1,24 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:5AePh5uXcUseYBGWvlztgmg8mGBGy3ngKRa6+QxOaT0/fzSB1pKkaMtZJo76tV9wwjdL6/b6VVUI7GIaCBD5kgdZuA8RdBTXguHyjjdxAlI9xcrQaWWdATd8JJt+eQp/m2Y+0dioyXKaDV2ukI3GtHYjp/ixMoHHWEocnEEb40wG6c3CZcvsLWJvKTkFc2OvcjcU2RTfuNlYtEETidiD9iC/dtCakNQHmLP1UFYgcn0ebXBKmlqD6+x2o7BVT1SLwVCyGNvH3eKA2AWvddZChnhaNCUIXcRwBFCgS8lPs4iXhAhly+nwuj7ssFpuu3sjm5pq196tRS8WQl2iNUEJ2tzoOpceg1kZZ7KHX3wCbdBlCRqhy9Q4JMvWPDssO+zz2aU21+BDEySDTCnTYX9Hu2/iFvZejt++mKY=,iv:u/Ukye0BAj2ka++AA72W8WfXJAZZ/YJ3RC/aydxdoUc=,tag:ihTP5bCCigWEPcLFaYOhMA==,type:str]",
-	"sops": {
-		"kms": null,
-		"gcp_kms": null,
-		"azure_kv": null,
-		"hc_vault": null,
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB0VElDNHArZXlXa2JRQjd0\nQmVIbGpPWk43NDdiTkFtcEd1bDhRdXJWOUY0CndITHdKTFNJQXFOVFdyUGNtQ09k\nN2hnQmFYR0ZORWtxcUN0ZFhsM0U3N2cKLS0tIFh1TTBpMjFIZ2NYM1QxeDRjYlJx\nYkdrUDZmMUpGbjk3REJCVVRpeFk5Z28KJcia0Bk+3ZoifZnRLwqAko526ODPnkSS\nzymtOj/QYTA0++NP3B1aScIyhWITMEZX1iSoWDmgHj8ZQoNMdkM7AQ==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBZNlNHRWNEcUZGNXNBMDFR\nTzE5RnNMQUMvU1k2OS9XMlpvUktMRzQ5RmxvCnlCS3lzRVpGUHJLRGZ6SWZ2ZktR\na3l0TVN2NUlRVEQwRHByYkNEMDQyWUkKLS0tIEh3RjBWT3c5K2RWeDRjWFpsU1lP\ncStqY2xta3RSNkR6Vkt5YXhYUTZmbDgKvVKmZc8S/RwurJGsGiJ5LhM4waLO9B9k\n2cawxHmcYM3KfXDFwp9UZWhIwF7SRkG56ZE4OjGI3sOL+74ixnePxA==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2024-11-30T13:18:16Z",
-		"mac": "ENC[AES256_GCM,data:JwjbQ129cYCBNA5Fb8lN9rW7/y4wuVOqLeajIMcYyCzlBcjzCZAV1DKN5n75xMamb/hb1AUkmtp/K82PKM0Vg5X4/lpWTUZXZOzn/TrwHx+yqlJjL9mUdGuHnSY5DwME38Dde3UxdtUa0CVgQOxvMIycW27w8+8NNfO2zxGxkzc=,iv:ZMZASOsqXZOb0NkBqG3GGaqqKgQdjZLiku2yU5QonB8=,tag:/lb/HMxsYOV5XX/5kWnFHA==,type:str]",
-		"pgp": null,
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.1"
-	}
-}
--- a/secrets/ca/keys/ssh_host_ca_key
+++ b/secrets/ca/keys/ssh_host_ca_key
@@ -1,24 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:vqQ3HwSmuDlI4UwraLWvwkBSj9zTFeNEWI1xzhVrO/gpx8+WBZOt2F0J7/LSTGAWsWW/9Gov+XXXAOtfnKfjYVzizyT/jE8EQwMuItWiFEVA6hohgwtsk7YKJjXdJIxmiv+WKs73gWb0uFVGh1ArMzsVkGPj1W1AKMFAneDPgsfSCy9aVOMuF8zQwypFC8eaxqOQhLpiN2ncRm8e7khwGurSgYfHDgFghaDr8torgUrZTOPNFk+LEdxB3WcC17+4a8ZyuBapmYdRTrP73czTAuxOF8lMwddJhO99SF7nWuOYVF1FOKLGtK04oKci5/xRIzvWo3I0pGajkxtuF5CyWbd1KblcPfBALIU/J5hU/puGJ7M2sE/qsg/4kaTFxnhq32rPZj291jFb4evDdOhVodfC1axOQUbzAC0=,iv:yOeQ384ikqgDqfthl7GIVSIMNA/n0BYTSIqFN3T9MAY=,tag:Y6nhOCrkWx7MnVpEeKN0Jg==,type:str]",
-	"sops": {
-		"kms": null,
-		"gcp_kms": null,
-		"azure_kv": null,
-		"hc_vault": null,
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBFTjRMWlNtYVQ2WnJEaGFN\nVFU2TXRTK2FHREpqREhOWHBKemxNc2U4WW44CnV4OWlBdXlFUWhJYi9jTTRuUWJV\nOWFPV2I4UytDRFo3blN3bUtFQ1NGU0kKLS0tIGp2VHlDc1JMMUdDUjlNNDFwUUxj\nVnhHbCtrNVNpZXo0K2dDVU5YTVJJUEkKk9mVTbzQVGZo3RKDLPDwtENknh+in1Q5\njf4DA1cGDDNzcEIWOOYyS+1mzT9WY8gU0hWqihX/bAx7CVsNUallZw==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBrVFNwUGpkOUhkUXFWWERq\nMVdueC9VSE9KbGZkenBVK3NRMjRNVXVmcVRRCjNLa0QzbWVCQks3ZmV3eFVjcEp0\nRmxDSlZIZU1IbEdnbE83WlkxV3VZV1EKLS0tICtsRXArajQ4Um9mNEV5OWZBdS85\nVGFSU2wwODZ3Zm44M3pWcTdDV1dxejQKM2BK5Axb1cF344ea89gkzCLzEX6j4amK\nzxf+boBK7JUX7F6QaPB0sRU8J4Cei9mALz96C8xNHjX00KcD3O2QOA==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2024-11-30T13:18:20Z",
-		"mac": "ENC[AES256_GCM,data:AllgcWxHnr3igPi/JbfJCbEa6hKtmILnAjiaMojRZNO4p6zYSoF0s8lo9XX05/vIrFUo+YaCtsuacv+kfz9f6vQafPn7Vulbh6PeH1VlAmzyVfJOTmHP3YX8ic3uM56A4+III1jOERCFOIcc/CKsnRLFhLCRQRMgtgT0hTl5aPw=,iv:60dOYhoUTu1HIHzY36eJeRZ66/v6JmRRpIW99W2D+CI=,tag:F7nLSFm933K5M+JE4IvNYw==,type:str]",
-		"pgp": null,
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.1"
-	}
-}
--- a/secrets/ca/keys/ssh_user_ca_key
+++ b/secrets/ca/keys/ssh_user_ca_key
@@ -1,24 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:YRdPrTLQH0xdWiIzOyjfEGpvfmuj6me6GzZZcauh9bUUywyA1ranDnWqbJYgawQQxIXsq9dhXD0uco+7mmXq2598kF1NI9jh6uLf3k0H494zZOalRBv/k8u9oJDLIiVAkg9eNNLbGX0PMZr/Yue/qdkuXx2Hg9E7bQJwpU/NXF+jKKs+3NmKT5NBlegwAzUs530D4DUoaq5AhvVvdC6a1UcE+KJzQ8pRiz1GjFIxAB7qX+GVwa3yNdLgo2tlAbOzjGtaDfJnhZIHSNEq+4TEhjlF9lCmFCGFDUVupvMOWs0kBywJEzIrDmxmvGHlPj3FfyytPb7qhlsOXDDDS67IoiwluKOnw+sALAG0Iv9LMrDZ3z8MXeEGvRWu0VDMuGXN905/9kGx/A40mPjcfnZvI+qSRIKjER5R8aU=,iv:qiP2Ml59AnK24MBbs7N/HqJIylf+fXGqJAo2N8iFNB0=,tag:0Dj5fVs6OB07kvV4qzuvfw==,type:str]",
-	"sops": {
-		"kms": null,
-		"gcp_kms": null,
-		"azure_kv": null,
-		"hc_vault": null,
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBUFlvNmRNYUlJSHZYUkpJ\nMEloQXFSdENIWGJVVDNIOVY5MS9SYWRoL0FrCnRJc05wZUZBSDRvMHNUUEhNRXQ4\nTWhYOUp6YUNGZFNWUFRrSmlJM1c4aWcKLS0tIFc1b3NlSEo2eFJhdDgwejRqcHlT\nZE5wN01uaE04cTlIbVJMVWQvQ1pXajgKQ1n6UmP7LEBsnIBXVc0BceOqvwCqQzBP\ncI8C5Io4ILgMjY4dr6sd0SeJG6mfDdiMA+k7c6jqoyZCW/Pkd3LANQ==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBtM2lyeXVzdE9nL1k5L3dC\nTkl2MjhMb1FKMFdCeXFPSmNST0pvOTRUaEVvCmdwMnhjSFFHVFhidmIySS9jMEJu\nNTJpRjdFOWpZZ3ZuZFJwZUUrRFU5NnMKLS0tIDJ1UjdVQkpMNm5Pd01JRnZNOEtr\nb1lpMlBkVHpiT2lYdWtZaUQrRW1HUDgKq/JVMf5gdu6lNEmqY6zU2SymbT+jklem\nnUQ9yieJGF+PanutNW6BCJH8jb/fH+Y6AeJ9S+kKCB4Yi75i4d+oHg==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2024-11-30T13:18:24Z",
-		"mac": "ENC[AES256_GCM,data:6FJTKEdIpCm+Dz7Ua8dZOMZQFaGU0oU/HRP6ly5mWbXCv81LRbZXRBd+5RDY3z9g9nb0PXZrOMNps63F6SKxK52VfzLIOap3UGeMNQn5P4/yyFj7JQHQ5Gjcf2l2z2VZ7NhUdNoSCV/6lwjValbKtids48Q5c3sFX997ZiqIUnY=,iv:nUeyJd/v8d9v7QsLLckziD9K5qjOZKK4vOQJw/ymi18=,tag:6n5EE3oklWdVcedvB2J/zA==,type:str]",
-		"pgp": null,
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.1"
-	}
-}
--- a/secrets/ca/secrets.yaml
+++ b/secrets/ca/secrets.yaml
@@ -1,30 +0,0 @@
-ca_root_pw: ENC[AES256_GCM,data:jS5BHS9i/pOykus5aGsW+w==,iv:aQIU7uXnNKaeNXv1UjRpBoSYcRpHo8RjnvCaIw4yCqc=,tag:lkjGm5/Ve93nizqGDQ0ByA==,type:str]
-sops:
-    kms: []
-    gcp_kms: []
-    azure_kv: []
-    hc_vault: []
-    age:
-        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA5anlORWxJalhRWkJPeGIy
-            OStyVG8vMFRTTEZOWHR3Q3N1UWJQbFlxV3pBCmVKQVM1SlJ2L0JOb3U3cTh3YkZ4
-            WHAxSUpTT1dyRHJHYVd1Qkh1ZWxwYW8KLS0tIEhXeklsSmlGaFlaaWF5L0Nodk5a
-            clZ4M3hFSlFqaEZ0UWREdHpTQ29GVUEKAxj5P05Ilpwis2oKFe54mJX+1LfTwfUv
-            2XRFOrEQbFNcK5WFu46p1mc/AAjKTeHWuvb2Yq43CO+sh1+kqKz0XA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBaS0dqQ1p4MEE2d2JaeFRx
-            UnB4ejhrS3hLekpqeWJhcEJGdnpzMTZDelVRCmFjVGswd3VtRUloWG1WbWY5N0s3
-            cG9aV2hGU3lFZkkvcUJNWE1rWUIwMmMKLS0tIG1KdlhoQzREWDhPbXVSZVBUQkdE
-            N1hmcEwxWXBIWkQ3a3BrdGhvUFoxbzgKX6hLoz7o/Du6ymrYwmGDkXp2XT+0+7QE
-            YhD5qQzGLVQSh3XM/wWExj2Ue5/gw/NqNziHezOh2r9gQljbHjG2/g==
-            -----END AGE ENCRYPTED FILE-----
-    lastmodified: "2024-10-21T09:12:26Z"
-    mac: ENC[AES256_GCM,data:hfPRIXt/kZJa6lsj7rz+5xGlrWhR/LX895S2d8auP/4t3V//80YE/ofIsHeAY9M7eSFsW9ce2Vp0C/WiCQefVWNaNN7nVAwskCfQ6vTWzs23oYz4NYIeCtZggBG3uGgJxb7ZnAFUJWmLwCxkKTQyoVVnn8i/rUDIBrkilbeLWNI=,iv:lm1HVbWtAifHjqKP0D3sxRadsE9+82ugbA2x54yRBTo=,tag:averxmPLa131lJtFrNxcEA==,type:str]
-    pgp: []
-    unencrypted_suffix: _unencrypted
-    version: 3.9.1
--- a/secrets/http-proxy/wireguard.yaml
+++ b/secrets/http-proxy/wireguard.yaml
@@ -1,25 +0,0 @@
-wg_private_key: ENC[AES256_GCM,data:DlC9txcLkTnb7FoEd249oJV/Ehcp50P8uulbE4rY/xU16fkTlnKvPmYZ7u8=,iv:IsiTzdrh+BNSVgx1mfjpMGNV2J0c88q6AoP0kHX2aGY=,tag:OqFsOIyE71SBD1mcNS/PeQ==,type:str]
-sops:
-    age:
-        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAzdm9HTTN1amwxQ2Z6MUQv
-            dGJ0cEgyaHNOZWtWSWlXNXc5bGhUdSsvVlVzCkJkc3ZQdzlBNDNxb3Avdi96bXFt
-            TExZY29nUDI3RE5vanh6TVBRME1Fa1UKLS0tIG8vSHdCYzkvWmJpd0hNbnRtUmtk
-            aVcwaFJJclZ3YUlUTTNwR2VESmVyZWMKHvKUJBDuNCqacEcRlapetCXHKRb0Js09
-            sqxLfEDwiN2LQQjYHZOmnMfCOt/b2rwXVKEHdTcIsXbdIdKOJwuAIQ==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBEeU01UTc2V1UyZXRadE5I
-            VE1aakVZUEZUNnJxbzJ1K3J1R3ZQdFdMbUhBCjZBMDM3ZkYvQWlyNHBtaDZRWkd4
-            VzY0L3l4N2RNZjJRTDJWZTZyZVhHbW8KLS0tIGVNZ0N0emVmaVRCV09jNmVKRlla
-            cWVSNkJqWHh5c21KcWFac2FlZTVaMTAK1UvfPgZAZYtwiONKIAo5HlaDpN+UT/S/
-            JfPUfjxgRQid8P20Eh/jUepxrDY8iXRZdsUMON+OoQ8mpwoAh5eN1A==
-            -----END AGE ENCRYPTED FILE-----
-    lastmodified: "2025-05-15T18:56:55Z"
-    mac: ENC[AES256_GCM,data:J2kHY7pXBJZ0UuNCZOhkU11M8rDqCYNzY71NyuDRmzzRCC9ZiNIbavyQAWj2Dpk1pjGsYjXsVoZvP7ti1wTFqahpaR/YWI5gmphrzAe32b9qFVEWTC3YTnmItnY0YxQZYehYghspBjnJtfUK0BvZxSb17egpoFnvHmAq+u5dyxg=,iv:/aLg02RLuJZ1bRzZfOD74pJuE7gppCBztQvUEt557mU=,tag:toxHHBuv3WRblyc9Sth6Iw==,type:str]
-    unencrypted_suffix: _unencrypted
-    version: 3.10.2
--- a/secrets/monitoring01/pve-exporter.yaml
+++ b/secrets/monitoring01/pve-exporter.yaml
@@ -1,33 +0,0 @@
-default:
-    user: ENC[AES256_GCM,data:4Zzjm6/e8GCKSPNivnY=,iv:Y3gR+JSH/GLYvkVu3CN4T/chM5mjGjwVPI0iMB4p1t4=,tag:auyG8iWsd/YGjDnnTC21Ew==,type:str]
-    password: ENC[AES256_GCM,data:9cyM9U8VnzXBBA==,iv:YMHNNUoQ9Az5+81Df07tjC+LaEWPHV6frUjd4PZrQOs=,tag:3hKR+BhLJODJp19nn4ppkA==,type:str]
-    verify_ssl: ENC[AES256_GCM,data:Cu5Ucf0=,iv:QFfdV7gDBQ+L2kSZZqlVqCrn9CRg5RNG5DNTFWtVf5Y=,tag:u24ZbpWA65wj3WOwqU1v+g==,type:bool]
-sops:
-    kms: []
-    gcp_kms: []
-    azure_kv: []
-    hc_vault: []
-    age:
-        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuUXdMMG5YaHRJbThQZW9u
-            RHVBbXFiSHNiUWdLTDdPajIyQjN3OGR0dGpzCm9ZVkdNWjhBakU3dVdhRU9kbU81
-            aDlCNzJBQ1hvQ3FnTUk2N2RWQkZpUUEKLS0tIEZacTNqa3FWc2p1NXVtRWhwVExj
-            cUJtYXNjb2Z4QkF4MjlidEZxSUFNa3MKAGHGksPc9oJheSlUQ3ARK5MuR5NFbPmD
-            kmSDSgRmzbarxT8eJnK8/K4ii3hX5E9vGOohUkyc03w4ENsh/dw43g==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBOVGhvdGE5Mzl0ckhBM21D
-            RXJwb09OS25PMGViblViM21wTVZiZWhtWmhFCnAzL1NqeUVyOGZFVDFvdXFPbklQ
-            ZkJPWDVIdUdCdjZGUjcrcmtvak5CWG8KLS0tIDhLUHJNN2VqNy9CdVh0K0N0b0k1
-            RUE4U0E0aGxiRkF0NWdwSEIrQTU4MjgKeOU6bIWO6ke9YcG+1E3brnC21sSQxZ9b
-            SiG2QEnFnTeJ5P50XQoYHqUY3B0qx7nDLvyzatYEi6sDkfLXhmHGbw==
-            -----END AGE ENCRYPTED FILE-----
-    lastmodified: "2024-12-03T16:25:12Z"
-    mac: ENC[AES256_GCM,data:gemq8YpMZQC+gY7lmMM3tfZh9XxL40qdGlLiB2CD4SIG49w0V6E/vY7xygt0WW0zHbhMI9yUIqlRc/PaXn+QfyxJEr3IjaT05rrWUqQAeRP9Zss74Y3NtQehh8fM8SgeyU4j2CQ9f9B/lW9IgdOW/TNgQZVXGg1vXZPEzl7AZ4A=,iv:LG5ojv3hAqk+EvFa/xEn43MBqL457uKFDE3dG5lSgZo=,tag:AxzcUzmdhO411Sw7Vg1itA==,type:str]
-    pgp: []
-    unencrypted_suffix: _unencrypted
-    version: 3.9.1
--- a/secrets/nix-cache01/actions_token_1
+++ b/secrets/nix-cache01/actions_token_1
@@ -1,19 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:P84qHFU+xQjwQGK8I1gIdcBsHrskuUg0M1nGMMaA+hFjAdFYUhdhmAN/+y0CO28=,iv:zJtk01zNMTBDQdVtZBTM34CHRaNYDkabolxh7PWGKUI=,tag:8AS80AbZJbh9B3Av3zuI1w==,type:str]",
-	"sops": {
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBkRFB6QTIyWWdwVkV4ZXNB\nWkdSdEhMc0s4cnByWVZXTGhnSWZ0MTdEUWhJCnFlOFQ5TU1hcE91azVyZXVXRCtu\nZjIxalRLYlEreGZ6ZDNoeXNPaFN4b28KLS0tIHY5WVFXN1k4NFVmUjh6VURkcEpv\ncklGcWVhdTdBRnlOdm1qM2h5SS9UUkEKq2RyxSVymDqcsZ+yiNRujDCwk1WOWYRW\nDa4TRKg3FCe7TcCEPkIaev1aBqjLg9J9c/70SYpUm6Zgeps7v5yl3A==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSArTGVuckp2NlhMZXRNMVhO\naUV3K0h3cmZ5ZGx4Q3dJWHNqZXFJeE1kM0dFCmF4TUFUMm9mTHJlYzlYWVhNa1RH\nR29VNDIrL1IvYUpQYm5SZEYzbWhhbkkKLS0tIEJsK1dwZVdaaHpWQkpOOS90dkhx\nbGhvRXhqdFdqQmhZZmhCdmw4NUtSVG8K3z2do+/cIjAqg6EMJnubOWid1sMeTxvo\nrq6eGJ7YzdgZr2JBVtJdDRtk/KeHXu9In4efbBXwLAPIfn1pU0gm1w==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2025-08-21T19:08:48Z",
-		"mac": "ENC[AES256_GCM,data:5CkO09NIqttb4UZPB9iGym8avhTsMeUkTFTKZJlNGjgB1qWyGQNeKCa50A1+SbBCCWE5EwxoynB1so7bi8vnq7k8CPUHbiWG8rLOJSYHQcZ9Tu7ZGtpeWPcCw1zPWJ/PTBsFVeaT5/ufdx/6ut+sTtRoKHOZZtO9oStHmu/Rlfg=,iv:z9iJJlbvhgxJaART5QoCrqvrqlgoVlGj8jlndCALmKU=,tag:ldjmND4NVVQrHUldLrB4Jg==,type:str]",
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.10.2"
-	}
-}
--- a/secrets/nix-cache01/cache-secret
+++ b/secrets/nix-cache01/cache-secret
@@ -1,19 +0,0 @@
-{
-	"data": "ENC[AES256_GCM,data:MQkR6FQGHK2AuhOmy2was49RY2XlLO5NwaXnUFzFo5Ata/2ufVoAj4Jvotw/dSrKL7f62A6s+2BPAyWrvACJ+pwYFlfyj3T9bNwhxwZPkEmiHEubJjWSiD6jkSW0gOxbY8ib6g/GbyF8I1cPeYr/hJD5qQ==,iv:eBL2Y3MOt9gYTETUZqsHo1D5hPOHxb4JR6Z/DFlzzqI=,tag:Qqbt39xZvQz/QhsggsArsw==,type:str]",
-	"sops": {
-		"age": [
-			{
-				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAwZzFXaEsyUkZGNFV0bVlW\nRkpPRHpUK2VwUHpOQXZCUUpoVzFGa3hycnhvCndTN0toVFdoU2E5N3V3UFhTTjU0\nNDByWTkrV0o3T295dE0zS08rVGpyQjAKLS0tIC96M0VEcWpjRk5DMjJnMFB4ZHI3\nM2Jod2x4ZzMyZm1pbDhZNTFuWGNRUlEKHs5jBSfjml09JOeKiT9vFR0Fykg6OxKG\njhFU/J2+fWB22G7dBc4PI60SNqhxIheUbGTdcz4Yp4BPL6vW3eArIw==\n-----END AGE ENCRYPTED FILE-----\n"
-			},
-			{
-				"recipient": "age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq",
-				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBJT3lxamcrQUpFdjZteFlF\nYUQ3aGdadGpuNXd2Z3RtZ3dQU0cvMlFUMUNRClBDR3U0OXZJU0NDamVMSlR5NitN\nYlhvNVlvUE0wRjErYzkwVHFOdGVCVjgKLS0tIEttR1BLTGpDYTRSQ0lUZmVEcnNi\nWkNaMEViUHVBcExVOEpjNE5CZHpjVkEKuX/Rf8kaB3apr1UhAnq3swS6fXiVmwm8\n7Key+SUAPNstbWbz0u6B9m1ev5QcXB2lx2/+Cm7cjW+6VE2gLHjTsQ==\n-----END AGE ENCRYPTED FILE-----\n"
-			}
-		],
-		"lastmodified": "2025-01-24T12:19:16Z",
-		"mac": "ENC[AES256_GCM,data:X8X91LVP1MMJ8ZYeSNPRO6XHN+NuswLZcHpAkbvoY+E9aTteO8UqS+fsStbNDlpF5jz/mhdMsKElnU8Z/CIWImwolI4GGE6blKy6gyqRkn4VeZotUoXcJadYV/5COud3XP2uSTb694JyQEZnBXFNeYeiHpN0y38zLxoX8kXHFbc=,iv:fFCRfv+Y1Nt2zgJNKsxElrYcuKkATJ3A/jvheUY2IK4=,tag:hYojbMGUAQvx7I4qkO7o9w==,type:str]",
-		"unencrypted_suffix": "_unencrypted",
-		"version": "3.9.3"
-	}
-}
--- a/secrets/secrets.yaml
+++ b/secrets/secrets.yaml
@@ -1,109 +0,0 @@
-root_password_hash: ENC[AES256_GCM,data:wk/xEuf+qU3ezmondq9y3OIotXPI/L+TOErTjgJz58wEvQkApYkjc3bHaUTzOrmWjQBgDUENObzPmvQ8WKawUSJRVlpfOEr5TQ==,iv:I8Z3xJz3qoXBD7igx087A1fMwf8d29hQ4JEI3imRXdY=,tag:M80osQeWGG9AAA8BrMfhHA==,type:str]
-ns_xfer_key: ENC[AES256_GCM,data:VFpK7GChgFeUgQm31tTvVC888bN0yt6BAnHQa6KUTg4iZGP1WL5Bx6Zp8dY=,iv:9RF1eEc7JBxBebDOKfcDjGS2U7XsHkOW/l52yIP+1LA=,tag:L6DR2QlHOfo02kzfWWCrvg==,type:str]
-backup_helper_secret: ENC[AES256_GCM,data:EvXEJnDilbfALQ==,iv:Q3dkZ8Ee3qbcjcoi5GxfbaVB4uRIvkIB6ioKVV/dL2Y=,tag:T/UgZvQgYGa740Wh7D0b7Q==,type:str]
-nats_nkey: ENC[AES256_GCM,data:N2CVXjdwiE7eSPUtXe+NeKSTzA9eFwK2igxaCdYsXd4Ps0/DjYb/ggnQziQzSy8viESZYjXhJ2VtNw==,iv:Xhcf5wPB01Wu0A+oMw0wzTEHATp+uN+wsaYshxIzy1w=,tag:IauTIOHqfiM75Ufml/JXbg==,type:str]
-sops:
-    age:
-        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuWXhzQWFmeCt1R05jREcz
-            Ui9HZFN5dkxHNVE0RVJGZUJUa3hKK2sxdkhBCktYcGpLeGZIQzZIV3ZZWGs3YzF1
-            T09sUEhPWkRkOWZFWkltQXBlM1lQV1UKLS0tIERRSlRUYW5QeW9TVjJFSmorOWNI
-            ZytmaEhzMjVhRXI1S0hielF0NlBrMmcK4I1PtSf7tSvSIJxWBjTnfBCO8GEFHbuZ
-            BkZskr5fRnWUIs72ZOGoTAVSO5ZNiBglOZ8YChl4Vz1U7bvdOCt0bw==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQcXM0RHlGcmZrYW4yNGZs
-            S1ZqQzVaYmQ4MGhGaTFMUVIwOTk5K0tZZjB3ClN0QkhVeHRrNXZHdmZWMzFBRnJ6
-            WTFtaWZyRmx2TitkOXkrVkFiYVd3RncKLS0tIExpeGUvY1VpODNDL2NCaUhtZkp0
-            cGNVZTI3UGxlNWdFWVZMd3FlS3pDR3cKBulaMeonV++pArXOg3ilgKnW/51IyT6Z
-            vH9HOJUix+ryEwDIcjv4aWx9pYDHthPFZUDC25kLYG91WrJFQOo2oA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBabTdsZWxZQjV2TGx2YjNM
-            ZTgzWktqTjY0S0M3bFpNZXlDRDk5TSt3V2k0CjdWWTN0TlRlK1RpUm9xYW03MFFG
-            aWN4a3o4VUVnYzBDd2FrelUraWtrMTAKLS0tIE1vTGpKYkhzcWErWDRreml2QmE2
-            ZkNIWERKb1drdVR6MTBSTnVmdm51VEkKVNDYdyBSrUT7dUn6a4eF7ELQ2B2Pk6V9
-            Z5fbT75ibuyX1JO315/gl2P/FhxmlRW1K6e+04gQe2R/t/3H11Q7YQ==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBVSFhDOFRVbnZWbVlQaG5G
-            U0NWekU0NzI1SlpRN0NVS1hPN210MXY3Z244CmtFemR5OUpzdlBzMHBUV3g0SFFo
-            eUtqNThXZDJ2b01yVVVuOFdwQVo2Qm8KLS0tIHpXRWd3OEpPRkpaVDNDTEJLMWEv
-            ZlZtaFpBdzF0YXFmdjNkNUR3YkxBZU0KAub+HF/OBZQR9bx/SVadZcL6Ms+NQ7yq
-            21HCcDTWyWHbN4ymUrIYXci1A/0tTOrQL9Mkvaz7IJh4VdHLPZrwwA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBWkhBL1NTdjFDeEhQcEgv
-            Z3c3Z213L2ZhWGo0Qm5Zd1A1RTBDY3plUkh3CkNWV2ZtNWkrUjB0eWFzUlVtbHlk
-            WTdTQjN4eDIzY0c0dyt6ajVXZ0krd1UKLS0tIHB4aEJqTTRMenV3UkFkTGEySjQ2
-            YVM1a3ZPdUU4T244UU0rc3hVQ3NYczQK10wug4kTjsvv/iOPWi5WrVZMOYUq4/Mf
-            oXS4sikXeUsqH1T2LUBjVnUieSneQVn7puYZlN+cpDQ0XdK/RZ+91A==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBYcEtHbjNWRkdodUxYdHRn
-            MDBMU08zWDlKa0Z4cHJvc28rZk5pUjhnMjE0CmdzRmVGWDlYQ052Wm1zWnlYSFV6
-            dURQK3JSbThxQlg3M2ZaL1hGRzVuL0UKLS0tIEI3UGZvbEpvRS9aR2J2Tnc1YmxZ
-            aUY5Q2MrdHNQWDJNaGt5MWx6MVRrRVEKRPxyAekGHFMKs0Z6spVDayBA4EtPk18e
-            jiFc97BGVtC5IoSu4icq3ZpKOdxymnkqKEt0YP/p/JTC+8MKvTJFQw==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQL3ZMUkI1dUV1T2tTSHhn
-            SjhyQ3dKTytoaDBNcit1VHpwVGUzWVNpdjBnCklYZWtBYzBpcGxZSDBvM2tIZm9H
-            bTFjb1ZCaDkrOU1JODVBVTBTbmxFbmcKLS0tIGtGcS9kejZPZlhHRXI5QnI5Wm9Q
-            VjMxTDdWZEltWThKVDl0S24yWHJxZHcKgzH79zT2I7ZgyTbbbvIhLN/rEcfiomJH
-            oSZDFvPiXlhPgy8bRyyq3l47CVpWbUI2Y7DFXRuODpLUirt3K3TmCA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBPcm9zUm1XUkpLWm1Jb3Uw
-            RncveGozOW5SRThEM1Y4SFF5RDdxUEhZTUE4CjVESHE5R3JZK0krOXZDL0RHR0oy
-            Z3JKaEpydjRjeFFHck1ic2JTRU5yZTQKLS0tIGY2ck56eG95YnpDYlNqUDh5RVp1
-            U3dRYkNleUtsQU1LMWpDbitJbnRIem8K+27HRtZihG8+k7ZC33XVfuXDFjC1e8lA
-            kffmxp9kOEShZF3IKmAjVHFBiPXRyGk3fGPyQLmSMK2UOOfCy/a/qA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBTZHlldDdSOEhjTklCSXQr
-            U2pXajFwZnNqQzZOTzY5b3lkMzlyREhXRWo4CmxId2F6NkNqeHNCSWNrcUJIY0Nw
-            cGF6NXJaQnovK1FYSXQ2TkJSTFloTUEKLS0tIHRhWk5aZ0lDVkZaZEJobm9FTDNw
-            a29sZE1GL2ZQSk0vUEc1ZGhkUlpNRkEK9tfe7cNOznSKgxshd5Z6TQiNKp+XW6XH
-            VvPgMqMitgiDYnUPj10bYo3kqhd0xZH2IhLXMnZnqqQ0I23zfPiNaw==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB5bk9NVjJNWmMxUGd3cXRx
-            amZ5SWJ3dHpHcnM4UHJxdmh6NnhFVmJQdldzCm95dHN3R21qSkE4Vm9VTnVPREp3
-            dUQyS1B4MWhhdmd3dk5LQ0htZEtpTWMKLS0tIGFaa3MxVExFYk1MY2loOFBvWm1o
-            L0NoRStkeW9VZVdpWlhteC8yTnRmMUkKMYjUdE1rGgVR29FnhJ5OEVjTB1Rh5Mtu
-            M/DvlhW3a7tZU8nDF3IgG2GE5xOXZMDO9QWGdB8zO2RJZAr3Q+YIlA==
-            -----END AGE ENCRYPTED FILE-----
-        - recipient: age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
-          enc: |
-            -----BEGIN AGE ENCRYPTED FILE-----
-            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBU0xYMnhqOE0wdXdleStF
-            THcrY2NBQzNoRHdYTXY3ZmM5YXRZZkQ4aUZnCm9ad0IxSWxYT1JBd2RseUdVT1pi
-            UXBuNzFxVlN0OWNTQU5BV2NiVEV0RUUKLS0tIGJHY0dzSDczUzcrV0RpTjE0czEy
-            cWZMNUNlTzBRcEV5MjlRV1BsWGhoaUUKGhYaH8I0oPCfrbs7HbQKVOF/99rg3HXv
-            RRTXUI71/ejKIuxehOvifClQc3nUW73bWkASFQ0guUvO4R+c0xOgUg==
-            -----END AGE ENCRYPTED FILE-----
-    lastmodified: "2025-02-11T21:18:22Z"
-    mac: ENC[AES256_GCM,data:5//boMp1awc/2XAkSASSCuobpkxa0E6IKf3GR8xHpMoCD30FJsCwV7PgX3fR8OuLEhOJ7UguqMNQdNqG37RMacreuDmI1J8oCFKp+3M2j4kCbXaEo8bw7WAtyjUez+SAXKzZWYmBibH0KOy6jdt+v0fdgy5hMBT4IFDofYRsyD0=,iv:6pD+SLwncpmal/FR4U8It2njvaQfUzzpALBCxa0NyME=,tag:4QN8ZFjdqck5ZgulF+FtbA==,type:str]
-    unencrypted_suffix: _unencrypted
-    version: 3.9.4
--- a/services/actions-runner/default.nix
+++ b/services/actions-runner/default.nix
@@ -1,57 +0,0 @@
-{ pkgs, config, ... }:
-{
-  vault.secrets.actions-token = {
-    secretPath = "hosts/nix-cache01/actions-token";
-    extractKey = "token";
-    outputDir = "/run/secrets/actions-token-1";
-    services = [ "gitea-runner-actions1" ];
-  };
-
-  virtualisation.podman = {
-    enable = true;
-    dockerCompat = true;
-  };
-
-  services.gitea-actions-runner.instances = {
-    actions1 = {
-      enable = true;
-      tokenFile = "/run/secrets/actions-token-1";
-      name = "actions1.home.2rjus.net";
-      settings = {
-        log = {
-          level = "debug";
-        };
-
-        runner = {
-          file = ".runner";
-          capacity = 4;
-          timeout = "2h";
-          shutdown_timeout = "10m";
-          insecure = false;
-          fetch_timeout = "10s";
-          fetch_interval = "30s";
-        };
-
-        cache = {
-          enabled = true;
-          dir = "/var/cache/gitea-actions1";
-        };
-
-        container = {
-          privileged = false;
-        };
-      };
-      labels =
-        builtins.map (n: "${n}:docker://gitea/runner-images:${n}") [
-          "ubuntu-latest"
-          "ubuntu-latest-slim"
-          "ubuntu-latest-full"
-        ]
-        ++ [
-          "homelab"
-        ];
-
-      url = "https://git.t-juice.club";
-    };
-  };
-}
--- a/services/ca/default.nix
+++ b/services/ca/default.nix
@@ -1,169 +0,0 @@
-{ pkgs, unstable, ... }:
-{
-  homelab.monitoring.scrapeTargets = [{
-    job_name = "step-ca";
-    port = 9000;
-  }];
-  sops.secrets."ca_root_pw" = {
-    sopsFile = ../../secrets/ca/secrets.yaml;
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/ca_root_pw";
-  };
-  sops.secrets."intermediate_ca_key" = {
-    sopsFile = ../../secrets/ca/keys/intermediate_ca_key;
-    format = "binary";
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/intermediate_ca_key";
-  };
-  sops.secrets."root_ca_key" = {
-    sopsFile = ../../secrets/ca/keys/root_ca_key;
-    format = "binary";
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/root_ca_key";
-  };
-  sops.secrets."ssh_host_ca_key" = {
-    sopsFile = ../../secrets/ca/keys/ssh_host_ca_key;
-    format = "binary";
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/ssh_host_ca_key";
-  };
-  sops.secrets."ssh_user_ca_key" = {
-    sopsFile = ../../secrets/ca/keys/ssh_user_ca_key;
-    format = "binary";
-    owner = "step-ca";
-    path = "/var/lib/step-ca/secrets/ssh_user_ca_key";
-  };
-
-  services.step-ca = {
-    enable = true;
-    package = pkgs.step-ca;
-    intermediatePasswordFile = "/var/lib/step-ca/secrets/ca_root_pw";
-    address = "0.0.0.0";
-    port = 443;
-    settings = {
-      metricsAddress = ":9000";
-      authority = {
-        provisioners = [
-          {
-            claims = {
-              enableSSHCA = true;
-              maxTLSCertDuration = "3600h";
-              defaultTLSCertDuration = "48h";
-            };
-            encryptedKey = "eyJhbGciOiJQQkVTMi1IUzI1NitBMTI4S1ciLCJjdHkiOiJqd2sranNvbiIsImVuYyI6IkEyNTZHQ00iLCJwMmMiOjYwMDAwMCwicDJzIjoiY1lWOFJPb3lteXFLMWpzcS1WM1ZXQSJ9.WS8tPK-Q4gtnSsw7MhpTzYT_oi-SQx-CsRLh7KwdZnpACtd4YbcOYg.zeyDkmKRx8BIp-eB.OQ8c-KDW07gqJFtEMqHacRBkttrbJRRz0sYR47vQWDCoWhodaXsxM_Bj2pGvUrR26ij1t7irDeypnJoh6WXvUg3n_JaIUL4HgTwKSBrXZKTscXmY7YVmRMionhAb6oS9Jgus9K4QcFDHacC9_WgtGI7dnu3m0G7c-9Ur9dcDfROfyrnAByJp1rSZMzvriQr4t9bNYjDa8E8yu9zq6aAQqF0Xg_AxwiqYqesT-sdcfrxKS61appApRgPlAhW-uuzyY0wlWtsiyLaGlWM7WMfKdHsq-VqcVrI7Gi2i77vi7OqPEberqSt8D04tIri9S_sArKqWEDnBJsL07CC41IY.CqtYfbSa_wlmIsKgNj5u7g";
-            key = {
-              alg = "ES256";
-              crv = "P-256";
-              kid = "CIjtIe7FNhsNQe1qKGD9Rpj-lrf2ExyTYCXAOd3YDjE";
-              kty = "EC";
-              use = "sig";
-              x = "XRMX-BeobZ-R5-xb-E9YlaRjJUfd7JQxpscaF1NMgFo";
-              y = "bF9xLp5-jywRD-MugMaOGbpbniPituWSLMlXRJnUUl0";
-            };
-            name = "ca@home.2rjus.net";
-            type = "JWK";
-          }
-          {
-            name = "acme";
-            type = "ACME";
-            claims = {
-              maxTLSCertDuration = "3600h";
-              defaultTLSCertDuration = "1800h";
-            };
-          }
-          {
-            claims = {
-              enableSSHCA = true;
-            };
-            name = "sshpop";
-            type = "SSHPOP";
-          }
-        ];
-      };
-      crt = "/var/lib/step-ca/certs/intermediate_ca.crt";
-      db = {
-        badgerFileLoadingMode = "";
-        dataSource = "/var/lib/step-ca/db";
-        type = "badgerv2";
-      };
-      dnsNames = [
-        "ca.home.2rjus.net"
-        "10.69.13.12"
-      ];
-      federatedRoots = null;
-      insecureAddress = "";
-      key = "/var/lib/step-ca/secrets/intermediate_ca_key";
-      logger = {
-        format = "text";
-      };
-      root = "/var/lib/step-ca/certs/root_ca.crt";
-      ssh = {
-        hostKey = "/var/lib/step-ca/secrets/ssh_host_ca_key";
-        userKey = "/var/lib/step-ca/secrets/ssh_user_ca_key";
-      };
-      templates = {
-        ssh = {
-          host = [
-            {
-              comment = "#";
-              name = "sshd_config.tpl";
-              path = "/etc/ssh/sshd_config";
-              requires = [
-                "Certificate"
-                "Key"
-              ];
-              template = ./templates/ssh/sshd_config.tpl;
-              type = "snippet";
-            }
-            {
-              comment = "#";
-              name = "ca.tpl";
-              path = "/etc/ssh/ca.pub";
-              template = ./templates/ssh/ca.tpl;
-              type = "snippet";
-            }
-          ];
-          user = [
-            {
-              comment = "#";
-              name = "config.tpl";
-              path = "~/.ssh/config";
-              template = ./templates/ssh/config.tpl;
-              type = "snippet";
-            }
-            {
-              comment = "#";
-              name = "step_includes.tpl";
-              path = "\${STEPPATH}/ssh/includes";
-              template = ./templates/ssh/step_includes.tpl;
-              type = "prepend-line";
-            }
-            {
-              comment = "#";
-              name = "step_config.tpl";
-              path = "ssh/config";
-              template = ./templates/ssh/step_config.tpl;
-              type = "file";
-            }
-            {
-              comment = "#";
-              name = "known_hosts.tpl";
-              path = "ssh/known_hosts";
-              template = ./templates/ssh/known_hosts.tpl;
-              type = "file";
-            }
-          ];
-        };
-      };
-      tls = {
-        cipherSuites = [
-          "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256"
-          "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"
-        ];
-        maxVersion = 1.3;
-        minVersion = 1.2;
-        renegotiation = false;
-      };
-    };
-  };
-}
--- a/services/ca/templates/ssh/ca.tpl
+++ b/services/ca/templates/ssh/ca.tpl