flake: update homelab-deploy

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
homelab: add deploy.enable option with assertion
2026-02-07 06:53:13 +01:00 · 2026-02-07 06:47:12 +01:00 · 2026-02-07 06:41:03 +01:00 · 2026-02-07 06:27:21 +01:00 · 2026-02-07 06:20:14 +01:00 · 2026-02-07 06:11:37 +01:00
109 changed files with 1818 additions and 3646 deletions
--- a/.claude/agents/auditor.md
+++ b/.claude/agents/auditor.md
@@ -1,180 +0,0 @@
---
-name: auditor
-description: Analyzes audit logs to investigate user activity, command execution, and suspicious behavior on hosts. Can be used standalone for security reviews or called by other agents for behavioral context.
-tools: Read, Grep, Glob
-mcpServers:
-  - lab-monitoring
---
-
-You are a security auditor for a NixOS homelab infrastructure. Your task is to analyze audit logs and reconstruct user activity on hosts.
-
-## Input
-
-You may receive:
- A host or list of hosts to investigate
- A time window (e.g., "last hour", "today", "between 14:00 and 15:00")
- Optional context: specific events to look for, user to focus on, or suspicious activity to investigate
- Optional context from a parent investigation (e.g., "a service stopped at 14:32, what happened around that time?")
-
-## Audit Log Structure
-
-Logs are shipped to Loki via promtail. Audit events use these labels:
- `host` - hostname
- `systemd_unit` - typically `auditd.service` for audit logs
- `job` - typically `systemd-journal`
-
-Audit log entries contain structured data:
- `EXECVE` - command execution with full arguments
- `USER_LOGIN` / `USER_LOGOUT` - session start/end
- `USER_CMD` - sudo command execution
- `CRED_ACQ` / `CRED_DISP` - credential acquisition/disposal
- `SERVICE_START` / `SERVICE_STOP` - systemd service events
-
-## Investigation Techniques
-
-### 1. SSH Session Activity
-
-Find SSH logins and session activity:
-```logql
-{host="<hostname>", systemd_unit="sshd.service"}
-```
-
-Look for:
- Accepted/Failed authentication
- Session opened/closed
- Unusual source IPs or users
-
-### 2. Command Execution
-
-Query executed commands (filter out noise):
-```logql
-{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
-```
-
-Further filtering:
- Exclude systemd noise: `!= "systemd" != "/nix/store"`
- Focus on specific commands: `|= "rm" |= "-rf"`
- Focus on specific user: `|= "uid=1000"`
-
-### 3. Sudo Activity
-
-Check for privilege escalation:
-```logql
-{host="<hostname>"} |= "sudo" |= "COMMAND"
-```
-
-Or via audit:
-```logql
-{host="<hostname>"} |= "USER_CMD"
-```
-
-### 4. Service Manipulation
-
-Check if services were manually stopped/started:
-```logql
-{host="<hostname>"} |= "EXECVE" |= "systemctl"
-```
-
-### 5. File Operations
-
-Look for file modifications (if auditd rules are configured):
-```logql
-{host="<hostname>"} |= "EXECVE" |= "vim"
-{host="<hostname>"} |= "EXECVE" |= "nano"
-{host="<hostname>"} |= "EXECVE" |= "rm"
-```
-
-## Query Guidelines
-
-**Start narrow, expand if needed:**
- Begin with `limit: 20-30`
- Use tight time windows: `start: "15m"` or `start: "30m"`
- Add filters progressively
-
-**Avoid:**
- Querying all audit logs without EXECVE filter (extremely verbose)
- Large time ranges without specific filters
- Limits over 50 without tight filters
-
-**Time-bounded queries:**
-When investigating around a specific event:
-```logql
-{host="<hostname>"} |= "EXECVE" != "systemd"
-```
-With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
-
-## Suspicious Patterns to Watch For
-
-1. **Unusual login times** - Activity outside normal hours
-2. **Failed authentication** - Brute force attempts
-3. **Privilege escalation** - Unexpected sudo usage
-4. **Reconnaissance commands** - `whoami`, `id`, `uname`, `cat /etc/passwd`
-5. **Data exfiltration indicators** - `curl`, `wget`, `scp`, `rsync` to external destinations
-6. **Persistence mechanisms** - Cron modifications, systemd service creation
-7. **Log tampering** - Commands targeting log files
-8. **Lateral movement** - SSH to other internal hosts
-9. **Service manipulation** - Stopping security services, disabling firewalls
-10. **Cleanup activity** - Deleting bash history, clearing logs
-
-## Output Format
-
-### For Standalone Security Reviews
-
-```
-## Activity Summary
-
-**Host:** <hostname>
-**Time Period:** <start> to <end>
-**Sessions Found:** <count>
-
-## User Sessions
-
-### Session 1: <user> from <source_ip>
- **Login:** HH:MM:SSZ
- **Logout:** HH:MM:SSZ (or ongoing)
- **Commands executed:**
-  - HH:MM:SSZ - <command>
-  - HH:MM:SSZ - <command>
-
-## Suspicious Activity
-
-[If any patterns from the watch list were detected]
- **Finding:** <description>
- **Evidence:** <log entries>
- **Risk Level:** Low / Medium / High
-
-## Summary
-
-[Overall assessment: normal activity, concerning patterns, or clear malicious activity]
-```
-
-### When Called by Another Agent
-
-Provide a focused response addressing the specific question:
-
-```
-## Audit Findings
-
-**Query:** <what was asked>
-**Time Window:** <investigated period>
-
-## Relevant Activity
-
-[Chronological list of relevant events]
- HH:MM:SSZ - <event>
- HH:MM:SSZ - <event>
-
-## Assessment
-
-[Direct answer to the question with supporting evidence]
-```
-
-## Guidelines
-
- Reconstruct timelines chronologically
- Correlate events (login → commands → logout)
- Note gaps or missing data
- Distinguish between automated (systemd, cron) and interactive activity
- Consider the host's role and tier when assessing severity
- When called by another agent, focus on answering their specific question
- Don't speculate without evidence - state what the logs show and don't show
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -1,211 +0,0 @@
---
-name: investigate-alarm
-description: Investigates a single system alarm by querying Prometheus metrics and Loki logs, analyzing configuration files for affected hosts/services, and providing root cause analysis.
-tools: Read, Grep, Glob
-mcpServers:
-  - lab-monitoring
-  - git-explorer
---
-
-You are an alarm investigation specialist for a NixOS homelab infrastructure. Your task is to analyze a single alarm and determine its root cause.
-
-## Input
-
-You will receive information about an alarm, which may include:
- Alert name and severity
- Affected host or service
- Alert expression/threshold
- Current value or status
- When it started firing
-
-## Investigation Process
-
-### 1. Understand the Alert Context
-
-Start by understanding what the alert is measuring:
- Use `get_alert` if you have a fingerprint, or `list_alerts` to find matching alerts
- Use `get_metric_metadata` to understand the metric being monitored
- Use `search_metrics` to find related metrics
-
-### 2. Query Current State
-
-Gather evidence about the current system state:
- Use `query` to check the current metric values and related metrics
- Use `list_targets` to verify the host/service is being scraped successfully
- Look for correlated metrics that might explain the issue
-
-### 3. Check Service Logs
-
-Search for relevant log entries using `query_logs`. Focus on service-specific logs and errors.
-
-**Query strategies (start narrow, expand if needed):**
- Start with `limit: 20-30`, increase only if needed
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
-
-**Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- All errors on host: `{host="<hostname>"} |= "error"`
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
-
-**Avoid:**
- Using `start: "1h"` with no filters on busy hosts
- Limits over 50 without specific filters
-
-### 4. Investigate User Activity
-
-For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
-
-**Always call the auditor when:**
- A service stopped unexpectedly (may have been manually stopped)
- A process was killed or a config was changed
- You need to know who was logged in around the time of an incident
- You need to understand what commands led to the current state
- The cause isn't obvious from service logs alone
-
-**Do NOT try to query audit logs yourself.** The auditor is specialized for:
- Parsing EXECVE records and reconstructing command lines
- Correlating SSH sessions with commands executed
- Identifying suspicious patterns
- Filtering out systemd/nix-store noise
-
-**Example prompt for auditor:**
-```
-Investigate user activity on <hostname> between <start_time> and <end_time>.
-Context: The prometheus-node-exporter service stopped at 14:32.
-Determine if it was manually stopped and by whom.
-```
-
-Incorporate the auditor's findings into your timeline and root cause analysis.
-
-### 5. Check Configuration (if relevant)
-
-If the alert relates to a NixOS-managed service:
- Check host configuration in `/hosts/<hostname>/`
- Check service modules in `/services/<service>/`
- Look for thresholds, resource limits, or misconfigurations
- Check `homelab.host` options for tier/priority/role metadata
-
-### 6. Check for Configuration Drift
-
-Use the git-explorer MCP server to compare the host's deployed configuration against the current master branch. This helps identify:
- Hosts running outdated configurations
- Recent changes that might have caused the issue
- Whether a fix has already been committed but not deployed
-
-**Step 1: Get the deployed revision from Prometheus**
-```promql
-nixos_flake_info{hostname="<hostname>"}
-```
-The `current_rev` label contains the deployed git commit hash.
-
-**Step 2: Check if the host is behind master**
-```
-resolve_ref("master")           # Get current master commit
-is_ancestor(deployed, master)   # Check if host is behind
-```
-
-**Step 3: See what commits are missing**
-```
-commits_between(deployed, master)  # List commits not yet deployed
-```
-
-**Step 4: Check which files changed**
-```
-get_diff_files(deployed, master)   # Files modified since deployment
-```
-Look for files in `hosts/<hostname>/`, `services/<relevant-service>/`, or `system/` that affect this host.
-
-**Step 5: View configuration at the deployed revision**
-```
-get_file_at_commit(deployed, "services/<service>/default.nix")
-```
-Compare against the current file to understand differences.
-
-**Step 6: Find when something changed**
-```
-search_commits("<service-name>")   # Find commits mentioning the service
-get_commit_info(<hash>)            # Get full details of a specific change
-```
-
-**Example workflow for a service-related alert:**
-1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
-2. `resolve_ref("master")` → `4633421`
-3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
-4. `commits_between("8959829", "4633421")` → 7 commits missing
-5. `get_diff_files("8959829", "4633421")` → Check if relevant service files changed
-6. If a fix was committed after the deployed rev, recommend deployment
-
-### 7. Consider Common Causes
-
-For infrastructure alerts, common causes include:
- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
- **Configuration drift**: Host running outdated config, fix already in master
- **Disk space**: Nix store growth, logs, temp files
- **Memory pressure**: Service memory leaks, insufficient limits
- **CPU**: Runaway processes, build jobs
- **Network**: DNS issues, connectivity problems
- **Service restarts**: Failed upgrades, configuration errors
- **Scrape failures**: Service down, firewall issues, port changes
-
-**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
-
-## Output Format
-
-Provide a concise report with one of two outcomes:
-
-### If Root Cause Identified:
-
-```
-## Root Cause
-[1-2 sentence summary of the root cause]
-
-## Timeline
-[Chronological sequence of relevant events leading to the alert]
- HH:MM:SSZ - [Event description]
- HH:MM:SSZ - [Event description]
- HH:MM:SSZ - [Alert fired]
-
-### Timeline sources
- HH:MM:SSZ - [Source for information about this event. Which metric or log file]
- HH:MM:SSZ - [Source for information about this event. Which metric or log file]
- HH:MM:SSZ - [Alert fired]
-
-
-## Evidence
- [Specific metric values or log entries that support the conclusion]
- [Configuration details if relevant]
-
-
-## Recommended Actions
-1. [Specific remediation step]
-2. [Follow-up actions if any]
-```
-
-### If Root Cause Unclear:
-
-```
-## Investigation Summary
-[What was checked and what was found]
-
-## Possible Causes
- [Hypothesis 1 with supporting/contradicting evidence]
- [Hypothesis 2 with supporting/contradicting evidence]
-
-## Additional Information Needed
- [Specific data, logs, or access that would help]
- [Suggested queries or checks for the operator]
-```
-
-## Guidelines
-
- Be concise and actionable
- Reference specific metric names and values as evidence
- Include log snippets when they're informative
- Don't speculate without evidence
- If the alert is a false positive or expected behavior, explain why
- Consider the host's tier (test vs prod) when assessing severity
- Build a timeline from log timestamps and metrics to show the sequence of events
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:
 Available labels for log queries:
 - `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
+- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
 - `filename` - For `varlog` job, the log file path
 - `hostname` - Alternative to `host` for some streams

@@ -102,36 +102,6 @@ Useful systemd units for troubleshooting:
 - `sshd.service` - SSH daemon
 - `nix-gc.service` - Nix garbage collection

-### Bootstrap Logs
-
-VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
-
- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)
-
-**Bootstrap stages:**
-
-| Stage | Message | Meaning |
-|-------|---------|---------|
-| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
-| `network_ok` | Network connectivity confirmed | Can reach git server |
-| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
-| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
-| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
-| `building` | Starting nixos-rebuild boot | NixOS build starting |
-| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
-| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
-
-**Bootstrap queries:**
-
-```logql
-{job="bootstrap"}                              # All bootstrap logs
-{job="bootstrap", host="myhost"}               # Specific host
-{job="bootstrap", stage="failed"}              # All failures
-{job="bootstrap", stage=~"building|success"}   # Track build progress
-```
-
 ### Extracting JSON Fields

 Parse JSON and filter on fields:
@@ -205,95 +175,31 @@ Disk space (root filesystem):
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```

-### Prometheus Jobs
+### Service-Specific Metrics

-All available Prometheus job names:
+Common job names:
+- `node-exporter` - System metrics (all hosts)
+- `nixos-exporter` - NixOS version/generation metrics
+- `caddy` - Reverse proxy metrics
+- `prometheus` / `loki` / `grafana` - Monitoring stack
+- `home-assistant` - Home automation
+- `step-ca` - Internal CA

-**System exporters (on all/most hosts):**
- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics
+### Instance Label Format

-**Service-specific exporters:**
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)
+The `instance` label uses FQDN format:

-**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway
-
-**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)
-
-### Target Labels
-
-All scrape targets have these labels:
-
-**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
-
-**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
-
-### Filtering by Host
-
-Use the `hostname` label for easy host filtering across all jobs:
-
-```promql
-{hostname="ns1"}                    # All metrics from ns1
-node_load1{hostname="monitoring01"} # Specific metric by hostname
-up{hostname="ha1"}                  # Check if ha1 is up
+```
+<hostname>.home.2rjus.net:<port>
 ```

-This is simpler than wildcarding the `instance` label:
+Example queries filtering by host:

 ```promql
-# Old way (still works but verbose)
 up{instance=~"monitoring01.*"}
-
-# New way (preferred)
-up{hostname="monitoring01"}
+node_load1{instance=~"ns1.*"}
 ```

-### Filtering by Role/Tier
-
-Filter hosts by their role or tier:
-
-```promql
-up{role="dns"}                      # All DNS servers (ns1, ns2)
-node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
-up{tier="test"}                     # All test-tier VMs
-up{dns_role="primary"}              # Primary DNS only (ns1)
-```
-
-Current host labels:
-| Host | Labels |
-|------|--------|
-| ns1 | `role=dns`, `dns_role=primary` |
-| ns2 | `role=dns`, `dns_role=secondary` |
-| nix-cache01 | `role=build-host` |
-| vault01 | `role=vault` |
-| kanidm01 | `role=auth`, `tier=test` |
-| testvm01/02/03 | `tier=test` |
-
 ---

 ## Troubleshooting Workflows
@@ -306,12 +212,11 @@ Current host labels:

 ### Investigate Service Issues

-1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
+1. Check `up{job="<service>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
 4. Search for errors: `{host="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
-6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers

 ### After Deploying Changes

@@ -320,17 +225,6 @@ Current host labels:
 3. Check service logs for startup issues
 4. Check service metrics are being scraped

-### Monitor VM Bootstrap
-
-When provisioning new VMs, track bootstrap progress:
-
-1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
-2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
-3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
-4. Check logs are flowing: `{host="<hostname>"}`
-
-See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
-
 ### Debug SSH/Access Issues

 ```logql
@@ -352,6 +246,5 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
+- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
 - Log `MESSAGE` field contains the actual log content in JSON format
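
Review note on the `instance` regex guidance in the hunks above: PromQL `=~` matchers are fully anchored, so `ns1.*` must match the whole label value, but the unescaped `.` still lets it match a longer hostname that starts with `ns1`. The sketch below emulates that with Python's `re.fullmatch`. It is only an approximation of the matcher, the `ns10` host and port `9100` are hypothetical values for illustration, and `host_pattern` is a made-up helper, not part of any tool in this repo.

```python
import re


def matches_instance(pattern: str, instance: str) -> bool:
    """Emulate a PromQL `=~` matcher, which is anchored at both ends."""
    return re.fullmatch(pattern, instance) is not None


def host_pattern(hostname: str) -> str:
    """Hypothetical helper: match one short hostname exactly, any domain and port."""
    return re.escape(hostname) + r"\..*"


# Sample label values in the <hostname>.home.2rjus.net:<port> format.
# Port 9100 and the ns10 host are assumptions for this example.
instances = [
    "ns1.home.2rjus.net:9100",
    "ns10.home.2rjus.net:9100",
    "monitoring01.home.2rjus.net:9100",
]

# The pattern style used in the diff: also picks up ns10.
loose = [i for i in instances if matches_instance("ns1.*", i)]

# Requiring a literal dot after the hostname keeps the filter to ns1 only.
strict = [i for i in instances if matches_instance(host_pattern("ns1"), i)]

print(loose)
print(strict)
```

In PromQL terms the stricter form would be written as `up{instance=~"ns1\\..*"}`. Whether this matters depends on the hostnames actually present in the lab.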
--- a/.mcp.json
+++ b/.mcp.json
@@ -22,24 +22,6 @@
        "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
        "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
      }
-    },
-    "homelab-deploy": {
-      "command": "nix",
-      "args": [
-        "run",
-        "git+https://git.t-juice.club/torjus/homelab-deploy",
-        "--",
-        "mcp",
-        "--nats-url", "nats://nats1.home.2rjus.net:4222",
-        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
-      ]
-    },
-    "git-explorer": {
-      "command": "nix",
-      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#git-explorer", "--", "serve"],
-      "env": {
-        "GIT_REPO_PATH": "/home/torjus/git/nixos-servers"
-      }
    }
  }
 }
--- a/.sops.yaml
+++ b/.sops.yaml
@@ -0,0 +1,52 @@
+keys:
+  - &admin_torjus age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
+  - &server_ns1 age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
+  - &server_ns2 age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
+  - &server_ha1 age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
+  - &server_http-proxy age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
+  - &server_ca age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
+  - &server_monitoring01 age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
+  - &server_jelly01 age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
+  - &server_nix-cache01 age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
+  - &server_pgdb1 age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
+  - &server_nats1 age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
+creation_rules:
+  - path_regex: secrets/[^/]+\.(yaml|json|env|ini)
+    key_groups:
+      - age:
+        - *admin_torjus
+        - *server_ns1
+        - *server_ns2
+        - *server_ha1
+        - *server_http-proxy
+        - *server_ca
+        - *server_monitoring01
+        - *server_jelly01
+        - *server_nix-cache01
+        - *server_pgdb1
+        - *server_nats1
+  - path_regex: secrets/ca/[^/]+\.(yaml|json|env|ini|)
+    key_groups:
+      - age:
+        - *admin_torjus
+        - *server_ca
+  - path_regex: secrets/monitoring01/[^/]+\.(yaml|json|env|ini)
+    key_groups:
+      - age:
+        - *admin_torjus
+        - *server_monitoring01
+  - path_regex: secrets/ca/keys/.+
+    key_groups:
+      - age:
+        - *admin_torjus
+        - *server_ca
+  - path_regex: secrets/nix-cache01/.+
+    key_groups:
+      - age:
+        - *admin_torjus
+        - *server_nix-cache01
+  - path_regex: secrets/http-proxy/.+
+    key_groups:
+      - age:
+        - *admin_torjus
+        - *server_http-proxy
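
Review note on the `creation_rules` added above: the sketch below shows which rule a given path would select, assuming sops takes the first rule whose `path_regex` matches anywhere in the relative path (unanchored search). Python's `re.search` stands in for the Go regexp engine here, and the file names are hypothetical. It also suggests that the trailing `|` in the `secrets/ca/` rule adds an empty alternative, so any file directly under `secrets/ca/` with a dot in its name would match that rule.

```python
import re

# Rules copied in order from the .sops.yaml in this diff.
# The second element is a short label for the non-admin recipients.
RULES = [
    (r"secrets/[^/]+\.(yaml|json|env|ini)", "all hosts"),
    (r"secrets/ca/[^/]+\.(yaml|json|env|ini|)", "ca"),
    (r"secrets/monitoring01/[^/]+\.(yaml|json|env|ini)", "monitoring01"),
    (r"secrets/ca/keys/.+", "ca"),
    (r"secrets/nix-cache01/.+", "nix-cache01"),
    (r"secrets/http-proxy/.+", "http-proxy"),
]


def first_match(path):
    """Return (pattern, recipients) for the first rule that matches, else None."""
    for pattern, recipients in RULES:
        if re.search(pattern, path):
            return pattern, recipients
    return None


# Hypothetical paths, for illustration only.
for path in [
    "secrets/secrets.yaml",
    "secrets/ca/ca.yaml",
    "secrets/ca/keys/root.key",
    "secrets/ca/foo.txt",
    "secrets/monitoring01/foo.txt",
]:
    print(path, "->", first_match(path))
```

If the empty alternative is unintended, dropping the trailing `|` would bring the `ca` rule in line with the `monitoring01` rule.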
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -35,10 +35,6 @@ nix build .#create-host

 Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.

-### SSH Commands
-
-Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.
-
 ### Testing Feature Branches on Hosts

 All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
@@ -65,45 +61,25 @@ Do not run `nix flake update`. Should only be done manually by user.
 ### Development Environment

 ```bash
-# Enter development shell
+# Enter development shell (provides ansible, python3)
 nix develop
 ```

-The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.
-
-**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
-```bash
-# Good - works regardless of current shell
-nix develop -c tofu plan
-
-# Avoid - requires user to be in devshell
-tofu plan
-```
-
-**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
-```bash
-# Good - uses -chdir option
-nix develop -c tofu -chdir=terraform plan
-nix develop -c tofu -chdir=terraform/vault apply
-
-# Avoid - changing directories
-cd terraform && tofu plan
-```
-
 ### Secrets Management

 Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
 `vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
 Terraform manages the secrets and AppRole policies in `terraform/vault/`.

+Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
+`.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
+
 ### Git Workflow

 **Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.

 **Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.

-**Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.
-
 When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).

 ### Plan Management
@@ -156,77 +132,68 @@ Two MCP servers are available for searching NixOS options and packages:

 This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.

-### Lab Monitoring
+### Lab Monitoring Log Queries

-The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:
+The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.

- Available Prometheus jobs and exporters
- Loki labels and LogQL query syntax
- Bootstrap log monitoring for new VMs
- Common troubleshooting workflows
+**Loki Label Reference:**

-The skill contains up-to-date information about all scrape targets, host labels, and example queries.
+- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
+- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
+- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
+- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)

-### Deploying to Test Hosts
-
-The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.
-
-**Available Tools:**
-
- `deploy` - Deploy NixOS configuration to test-tier hosts
- `list_hosts` - List available deployment targets
-
-**Deploy Parameters:**
-
- `hostname` - Target a specific host (e.g., `vaulttest01`)
- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
- `all` - Deploy to all test-tier hosts
- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
- `branch` - Git branch or commit to deploy (default: `master`)
-
-**Examples:**
+Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.

+**Example LogQL queries:**
 ```
-# List available hosts
-list_hosts()
+# Logs from a specific service on a host
+{host="ns2", systemd_unit="nsd.service"}

-# Deploy to a specific host
-deploy(hostname="vaulttest01", action="switch")
+# Substring match on log content
+{host="ns1", systemd_unit="nsd.service"} |= "error"

-# Dry-run deployment
-deploy(hostname="vaulttest01", action="dry-activate")
-
-# Deploy to all hosts with a role
-deploy(role="vault", action="switch")
+# File-based logs (e.g., caddy access logs)
+{job="varlog", hostname="nix-cache01"}
 ```

-**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
+Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.

-**Deploying to Prod Hosts:**
+### Lab Monitoring Prometheus Queries

-The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:
+The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.

-```bash
-nix develop -c homelab-deploy -- deploy \
-  --nats-url nats://nats1.home.2rjus.net:4222 \
-  --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
-  --branch <branch-name> \
-  --action switch \
-  deploy.prod.<hostname>
+**Prometheus Job Names:**
+
+- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
+- `caddy` - Reverse proxy metrics (http-proxy)
+- `nix-cache_caddy` - Nix binary cache metrics
+- `home-assistant` - Home automation metrics
+- `jellyfin` - Media server metrics
+- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
+- `step-ca` - Internal CA metrics
+- `pve-exporter` - Proxmox hypervisor metrics
+- `smartctl` - Disk SMART health (gunter)
+- `wireguard` - VPN metrics (http-proxy)
+- `pushgateway` - Push-based metrics (e.g., backup results)
+- `restic_rest` - Backup server metrics
+- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
+
+**Example PromQL queries:**
 ```
+# Check all targets are up
+up

-Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
+# CPU usage for a specific host
+rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])

-**Verifying Deployments:**
+# Memory usage across all hosts
+node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

-After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
-
-```promql
-nixos_flake_info{instance=~"vaulttest01.*"}
+# Disk space
+node_filesystem_avail_bytes{mountpoint="/"}
 ```

-The `current_rev` label contains the git commit hash of the deployed flake configuration.
-
 ## Architecture

 ### Directory Structure
@@ -236,11 +203,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
 - `/system/` - Shared system-level configurations applied to ALL hosts
-  - Core modules: nix.nix, sshd.nix, vault-secrets.nix, acme.nix, autoupgrade.nix
-  - Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
+  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
  - Monitoring: node-exporter and promtail on every host
 - `/modules/` - Custom NixOS modules
-  - `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
+  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
 - `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
@@ -248,14 +214,14 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
-  - `vault/` - OpenBao (Vault) secrets server
-  - `actions-runner/` - GitHub Actions runner
-  - `http-proxy/`, `postgres/`, `nats/`, `jellyfin/`, etc.
+  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
+- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
 - `/common/` - Shared configurations (e.g., VM guest agent)
 - `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
 - `/playbooks/` - Ansible playbooks for fleet management
+- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)

 ### Configuration Inheritance

@@ -272,7 +238,7 @@ All hosts automatically get:
 - Nix binary cache (nix-cache.home.2rjus.net)
 - SSH with root login enabled
 - OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
+- Internal ACME CA integration (ca.home.2rjus.net)
 - Daily auto-upgrades with auto-reboot
 - Prometheus node-exporter + Promtail (logs to monitoring01)
 - Monitoring scrape target auto-registration via `homelab.monitoring` options
@@ -281,31 +247,28 @@ All hosts automatically get:

 ### Active Hosts

-Production servers:
+Production servers managed by `rebuild-all.sh`:
 - `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `vault01` - OpenBao (Vault) secrets server + PKI CA
+- `ca` - Internal Certificate Authority
 - `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
 - `http-proxy` - Reverse proxy
 - `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
 - `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server + GitHub Actions runner
+- `nix-cache01` - Binary cache server
 - `pgdb1` - PostgreSQL database
 - `nats1` - NATS messaging server

-Test/staging hosts:
- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
-
-Template hosts:
- `template1`, `template2` - Base templates for cloning new hosts
+Template/test hosts:
+- `template1` - Base template for cloning new hosts

 ### Flake Inputs

 - `nixpkgs` - NixOS 25.11 stable (primary)
 - `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
+- `sops-nix` - Secrets management (legacy, only used by ca)
 - Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
+  - `labmon` - Lab monitoring

 ### Network Architecture

@@ -329,6 +292,11 @@ Most hosts use OpenBao (Vault) for secrets:
 - Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
 - Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`

+Legacy SOPS (only used by `ca` host):
+- SOPS with age encryption, keys in `.sops.yaml`
+- Shared secrets: `/secrets/secrets.yaml`
+- Per-host secrets: `/secrets/<hostname>/`
+
 ### Auto-Upgrade System

 All hosts pull updates daily from:
@@ -389,21 +357,9 @@ Example VM deployment includes:
 - Custom CPU/memory/disk sizing
 - VLAN tagging
 - QEMU guest agent
- Automatic Vault credential provisioning via `vault_wrapped_token`

 OpenTofu outputs the VM's IP address after deployment for easy SSH access.

-**Automatic Vault Credential Provisioning:**
-
-VMs can receive Vault (OpenBao) credentials automatically during bootstrap:
-
-1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
-2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
-3. The bootstrap script unwraps the token to obtain AppRole credentials
-4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild
-
-This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.
-
 #### Template Rebuilding and Terraform State

 When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
@@ -434,11 +390,20 @@ This means:

 ### Adding a New Host

-See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:
- Using the `create-host` script to generate host configurations
- Deploying VMs and secrets with OpenTofu
- Monitoring the bootstrap process via Loki
- Verification and troubleshooting steps
+1. Create `/hosts/<hostname>/` directory
+2. Copy structure from `template1` or similar host
+3. Add host entry to `flake.nix` nixosConfigurations
+4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
+5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
+6. Add `vault.enable = true;` to the host configuration
+7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
+8. Run `tofu apply` in `terraform/vault/`
+9. User clones template host
+10. User runs `prepare-host.sh` on new host
+11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
+12. Commit the changes and merge to master.
+13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
+14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry

 **Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.

@@ -474,7 +439,11 @@ Prometheus scrape targets are automatically generated from host configurations,
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

-Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
+Host monitoring options (`homelab.monitoring.*`):
+- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
+- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
+
+Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.

 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.

@@ -493,30 +462,13 @@ DNS zone entries are automatically generated from host configurations:
 - **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
 - **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)

+Host DNS options (`homelab.dns.*`):
+- `enable` (default: `true`) - Include host in DNS zone generation
+- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
+
 Hosts are automatically excluded from DNS if:
 - `homelab.dns.enable = false` (e.g., template hosts)
 - No static IP configured (e.g., DHCP-only hosts)
 - Network interface is a VPN/tunnel (wg*, tun*, tap*)

 To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
-
-### Homelab Module Options
-
-The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.
-
-**Host options (`homelab.host.*`):**
- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
- `labels` - Free-form key-value metadata for host categorization
-
-**DNS options (`homelab.dns.*`):**
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
-
-**Monitoring options (`homelab.monitoring.*`):**
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host
-
-**Deploy options (`homelab.deploy.*`):**
- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
--- a/README.md
+++ b/README.md
@@ -13,6 +13,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
 | `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
 | `jelly01` | Jellyfin media server |
 | `nix-cache01` | Nix binary cache |
+| `pgdb1` | PostgreSQL |
 | `nats1` | NATS messaging |
 | `vault01` | OpenBao (Vault) secrets management |
 | `template1`, `template2` | VM templates for cloning new hosts |
--- a/docs/plans/completed/automated-host-deployment-pipeline.md
+++ b/docs/plans/completed/automated-host-deployment-pipeline.md
--- a/common/ssh-audit.nix
+++ b/common/ssh-audit.nix
@@ -1,21 +0,0 @@
-# SSH session command auditing
-#
-# Logs all commands executed by users who logged in interactively (SSH).
-# System services and nix builds are excluded via auid filter.
-#
-# Logs are sent to journald and forwarded to Loki via promtail.
-# Query with: {host="<hostname>"} |= "EXECVE"
-{
-  # Enable Linux audit subsystem
-  security.audit.enable = true;
-  security.auditd.enable = true;
-
-  # Log execve syscalls only from interactive login sessions
-  # auid!=4294967295 means "audit login uid is set" (excludes system services, nix builds)
-  security.audit.rules = [
-    "-a exit,always -F arch=b64 -S execve -F auid!=4294967295"
-  ];
-
-  # Forward audit logs to journald (so promtail ships them to Loki)
-  services.journald.audit = true;
-}
--- a/docs/host-creation.md
+++ b/docs/host-creation.md
@@ -1,217 +0,0 @@
-# Host Creation Pipeline
-
-This document describes the process for creating new hosts in the homelab infrastructure.
-
-## Overview
-
-We use the `create-host` script to create new hosts, which generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot using a template image (built from `hosts/template2`), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.
-
-## Prerequisites
-
-All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.
-
-```bash
-nix develop
-```
-
-## Steps
-
-Steps marked with **USER** must be performed by the user due to credential requirements.
-
-1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
-2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
-3. Add any secrets needed to `terraform/vault/`
-4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
-5. Push configuration to master (or the branch specified by `flake_branch`)
-6. **USER**: Apply terraform:
-   ```bash
-   nix develop -c tofu -chdir=terraform/vault apply
-   nix develop -c tofu -chdir=terraform apply
-   ```
-7. Once terraform completes, a VM boots in Proxmox using the template image
-8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
-9. After reboot, the host should be operational
-10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
-11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets
-
-## Tier Specification
-
-New hosts should set `homelab.host.tier` in their configuration:
-
-```nix
-homelab.host.tier = "test";  # or "prod"
-```
-
- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
-
-## Observability
-
-During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
-
-```
-{job="bootstrap", host="<hostname>"}
-```
-
-### Bootstrap Stages
-
-The bootstrap process reports these stages via the `stage` label:
-
-| Stage | Message | Meaning |
-|-------|---------|---------|
-| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
-| `network_ok` | Network connectivity confirmed | Can reach git server |
-| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
-| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
-| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
-| `building` | Starting nixos-rebuild boot | NixOS build starting |
-| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
-| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
-
-### Useful Queries
-
-```
-# All bootstrap activity for a host
-{job="bootstrap", host="myhost"}
-
-# Track all failures
-{job="bootstrap", stage="failed"}
-
-# Monitor builds in progress
-{job="bootstrap", stage=~"building|success"}
-```
-
-Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
-
-## Verification
-
-1. Check bootstrap completed successfully:
-   ```
-   {job="bootstrap", host="<hostname>", stage="success"}
-   ```
-
-2. Verify the host is up and reporting metrics:
-   ```promql
-   up{instance=~"<hostname>.*"}
-   ```
-
-3. Verify the correct flake revision is deployed:
-   ```promql
-   nixos_flake_info{instance=~"<hostname>.*"}
-   ```
-
-4. Check logs are flowing:
-   ```
-   {host="<hostname>"}
-   ```
-
-5. Confirm expected services are running and producing logs
-
-## Troubleshooting
-
-### Bootstrap Failed
-
-#### Common Issues
-
-* The VM has trouble running the initial nixos-rebuild. This usually happens when it has to compile packages from scratch because they are not available in our local nix-cache.
-
-#### Troubleshooting
-
-1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
-   ```
-   {job="bootstrap", host="<hostname>"}
-   ```
-
-2. **USER**: SSH into the host and check the bootstrap service:
-   ```bash
-   ssh root@<hostname>
-   journalctl -u nixos-bootstrap.service
-   ```
-
-3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
-   ```bash
-   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
-   ```
-
-4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
-
-### Vault Credentials Not Working
-
-Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or already been used.
-
-#### Troubleshooting
-
-1. Check if credentials exist on the host:
-   ```bash
-   ssh root@<hostname>
-   ls -la /var/lib/vault/approle/
-   ```
-
-2. Check bootstrap logs for vault-related stages:
-   ```
-   {job="bootstrap", host="<hostname>", stage=~"vault.*"}
-   ```
-
-3. **USER**: Regenerate and provision credentials manually:
-   ```bash
-   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
-   ```
-
-### Host Not Appearing in DNS
-
-Usually caused by not having deployed the commit with the new host to ns1/ns2.
-
-#### Troubleshooting
-
-1. Verify the host config has a static IP configured in `systemd.network.networks`
-
-2. Check that `homelab.dns.enable` is not set to `false`
-
-3. **USER**: Trigger auto-upgrade on DNS servers:
-   ```bash
-   ssh root@ns1 systemctl start nixos-upgrade.service
-   ssh root@ns2 systemctl start nixos-upgrade.service
-   ```
-
-4. Verify DNS resolution after upgrade completes:
-   ```bash
-   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
-   ```
-
-### Host Not Being Scraped by Prometheus
-
-Usually caused by not having deployed the commit with the new host to the monitoring host.
-
-#### Troubleshooting
-
-1. Check that `homelab.monitoring.enable` is not set to `false`
-
-2. **USER**: Trigger auto-upgrade on monitoring01:
-   ```bash
-   ssh root@monitoring01 systemctl start nixos-upgrade.service
-   ```
-
-3. Verify the target appears in Prometheus:
-   ```promql
-   up{instance=~"<hostname>.*"}
-   ```
-
-4. If the target is down, check that node-exporter is running on the host:
-   ```bash
-   ssh root@<hostname> systemctl status prometheus-node-exporter.service
-   ```
-
-## Related Files
-
-| Path | Description |
-|------|-------------|
-| `scripts/create-host/` | The `create-host` script that generates host configurations |
-| `hosts/template2/` | Template VM configuration (base image for new VMs) |
-| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
-| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
-| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
-| `terraform/vault/approle.tf` | AppRole policies for each host |
-| `terraform/vault/secrets.tf` | Secret definitions in Vault |
-| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
-| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
-| `flake.nix` | Flake with all host configurations (add new hosts here) |
--- a/docs/plans/auth-system-replacement.md
+++ b/docs/plans/auth-system-replacement.md
@@ -2,7 +2,7 @@

 ## Overview

-Deploy a modern, unified authentication solution for the homelab. Provides central user management, SSO for web services, and consistent UID/GID mapping for NAS permissions.
+Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.

 ## Goals

@@ -11,9 +11,66 @@ Deploy a modern, unified authentication solution for the homelab. Provides centr
 3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
 4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)

-## Solution: Kanidm
+## Options Evaluated

-Kanidm was chosen for the following reasons:
+### OpenLDAP (raw)
+
+- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
+- **Pros:** Most widely supported, very flexible
+- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
+- **Verdict:** Doesn't address LDAP complexity concerns
+
+### LLDAP + Authelia (current)
+
+- **NixOS Support:** Both have good modules
+- **Pros:** Already configured, lightweight, nice web UIs
+- **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
+- **Verdict:** Workable but has friction for NAS/UID goals
+
+### FreeIPA
+
+- **NixOS Support:** None
+- **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
+- **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
+- **Verdict:** Overkill, no NixOS support
+
+### Keycloak
+
+- **NixOS Support:** Good (`services.keycloak` module)
+- **Pros:** Good OIDC/SAML, nice UI
+- **Cons:** Primarily an identity broker rather than a user directory, poor POSIX support, heavy (Java)
+- **Verdict:** Wrong tool for Linux user management
+
+### Authentik
+
+- **NixOS Support:** None (would need Docker)
+- **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
+- **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
+- **Verdict:** Would work but requires Docker and is heavy
+
+### Kanidm
+
+- **NixOS Support:** Excellent - first-class module with PAM/NSS integration
+- **Pros:**
+  - Native PAM/NSS module (no SSSD needed)
+  - Built-in OIDC provider
+  - Optional LDAP interface for legacy services
+  - Declarative provisioning via NixOS (users, groups, OAuth2 clients)
+  - Modern, written in Rust
+  - Single service handles everything
+- **Cons:** Newer project, smaller community than LDAP
+- **Verdict:** Best fit for requirements
+
+### Pocket-ID
+
+- **NixOS Support:** Unknown
+- **Pros:** Very lightweight, passkey-first
+- **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
+- **Verdict:** Doesn't solve Linux user management goal
+
+## Recommendation: Kanidm
+
+Kanidm is the recommended solution for the following reasons:

 | Requirement | Kanidm Support |
 |-------------|----------------|
@@ -25,10 +82,42 @@ Kanidm was chosen for the following reasons:
 | Simplicity | Modern API, LDAP optional |
 | NixOS integration | First-class |

-### Configuration Files
+### Key NixOS Features

- **Host configuration:** `hosts/kanidm01/`
- **Service module:** `services/kanidm/default.nix`
+**Server configuration:**
+```nix
+services.kanidm.enableServer = true;
+services.kanidm.serverSettings = {
+  domain = "home.2rjus.net";
+  origin = "https://auth.home.2rjus.net";
+  ldapbindaddress = "0.0.0.0:636";  # Optional LDAP interface
+};
+```
+
+**Declarative user provisioning:**
+```nix
+services.kanidm.provision.enable = true;
+services.kanidm.provision.persons.torjus = {
+  displayName = "Torjus";
+  groups = [ "admins" "nas-users" ];
+};
+```
+
+**Declarative OAuth2 clients:**
+```nix
+services.kanidm.provision.systems.oauth2.grafana = {
+  displayName = "Grafana";
+  originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
+  originLanding = "https://grafana.home.2rjus.net";
+};
+```
+
+**Client host configuration (add to system/):**
+```nix
+services.kanidm.enableClient = true;
+services.kanidm.enablePam = true;
+services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
+```

 ## NAS Integration

@@ -59,103 +148,42 @@ This future migration path is a strong argument for Kanidm over LDAP-only soluti

 ## Implementation Steps

-1. **Create kanidm01 host and service module** ✅
-   - Host: `kanidm01.home.2rjus.net` (10.69.13.23, test tier)
-   - Service module: `services/kanidm/`
-   - TLS via internal ACME (`auth.home.2rjus.net`)
-   - Vault integration for idm_admin password
-   - LDAPS on port 636
+1. **Create Kanidm service module** in `services/kanidm/`
+   - Server configuration
+   - TLS via internal ACME
+   - Vault secrets for admin passwords

-2. **Configure provisioning** ✅
-   - Groups provisioned declaratively: `admins`, `users`, `ssh-users`
-   - Users managed imperatively via CLI (allows setting POSIX passwords in one step)
-   - POSIX attributes enabled (UID/GID range 65,536-69,999)
+2. **Configure declarative provisioning**
+   - Define initial users and groups
+   - Set up POSIX attributes (UID/GID ranges)

-3. **Test NAS integration** (in progress)
-   - ✅ LDAP interface verified working
-   - Configure TrueNAS LDAP client to connect to Kanidm
-   - Verify UID/GID mapping works with NFS shares
-
-4. **Add OIDC clients** for homelab services
+3. **Add OIDC clients** for homelab services
   - Grafana
   - Other services as needed

-5. **Create client module** in `system/` for PAM/NSS ✅
-   - Module: `system/kanidm-client.nix`
-   - `homelab.kanidm.enable = true` enables PAM/NSS
-   - Short usernames (not SPN format)
-   - Home directory symlinks via `home_alias`
-   - Enabled on test tier: testvm01, testvm02, testvm03
+4. **Create client module** in `system/` for PAM/NSS
+   - Enable on all hosts that need central auth
+   - Configure trusted CA

-6. **Documentation** ✅
-   - `docs/user-management.md` - CLI workflows, troubleshooting
-   - User/group creation procedures verified working
+5. **Test NAS integration**
+   - Configure TrueNAS LDAP client to connect to Kanidm
+   - Verify UID/GID mapping works with NFS shares

-## Progress
+6. **Migrate auth01**
+   - Remove LLDAP and Authelia services
+   - Deploy Kanidm
+   - Update DNS CNAMEs if needed

-### Completed (2026-02-08)
+7. **Documentation**
+   - User management procedures
+   - Adding new OAuth2 clients
+   - Troubleshooting PAM/NSS issues

-**Kanidm server deployed on kanidm01 (test tier):**
- Host: `kanidm01.home.2rjus.net` (10.69.13.23)
- WebUI: `https://auth.home.2rjus.net`
- LDAPS: port 636
- Valid certificate from internal CA
+## Open Questions

-**Configuration:**
- Kanidm 1.8 with secret provisioning support
- Daily backups at 22:00 (7 versions retained)
- Vault integration for idm_admin password
- Prometheus monitoring scrape target configured
-
-**Provisioned entities:**
- Groups: `admins`, `users`, `ssh-users` (declarative)
- Users managed via CLI (imperative)
-
-**Verified working:**
- WebUI login with idm_admin
- LDAP bind and search with POSIX-enabled user
- LDAPS with valid internal CA certificate
-
-### Completed (2026-02-08) - PAM/NSS Client
-
-**Client module deployed (`system/kanidm-client.nix`):**
- `homelab.kanidm.enable = true` enables PAM/NSS integration
- Connects to auth.home.2rjus.net
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based dir)
- Login restricted to `ssh-users` group
-
-**Enabled on test tier:**
- testvm01, testvm02, testvm03
-
-**Verified working:**
- User/group resolution via `getent`
- SSH login with Kanidm unix passwords
- Home directory creation with symlinks
- Imperative user/group creation via CLI
-
-**Documentation:**
- `docs/user-management.md` with full CLI workflows
- Password requirements (min 10 chars)
- Troubleshooting guide (nscd, cache invalidation)
-
-### UID/GID Range (Resolved)
-
-**Range: 65,536 - 69,999** (manually allocated)
-
- Users: 65,536 - 67,999 (up to ~2500 users)
- Groups: 68,000 - 69,999 (up to ~2000 groups)
-
-Rationale:
- Starts at Kanidm's recommended minimum (65,536)
- Well above NixOS system users (typically <1000)
- Avoids Podman/container issues with very high GIDs
-
-### Next Steps
-
-1. Enable PAM/NSS on production hosts (after test tier validation)
-2. Configure TrueNAS LDAP client for NAS integration testing
-3. Add OAuth2 clients (Grafana first)
+- What UID/GID range should be reserved for Kanidm-managed users?
+- Which hosts should have PAM/NSS enabled initially?
+- What OAuth2 clients are needed at launch?

 ## References

--- a/docs/plans/cert-monitoring.md
+++ b/docs/plans/cert-monitoring.md
@@ -1,72 +0,0 @@
-# Certificate Monitoring Plan
-
-## Summary
-
-This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.
-
-## What Was Removed
-
-### labmon Service
-
-The `labmon` service was a custom Go application that provided:
-
-1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
-2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration
-
-The service exposed Prometheus metrics at `:9969` including:
- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration
-
-### Affected Files
-
- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
- `services/monitoring/prometheus.nix` - Removed labmon scrape target
- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
- `services/monitoring/default.nix` - Removed alloy.nix import
-
-### Removed Alerts
-
- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
- `certificate_check_error` - Warned when TLS connection check failed
- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates
-
-## Why It Was Removed
-
-1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
-2. **Outdated codebase**: labmon was a custom tool that required maintenance
-3. **Limited value**: With ACME auto-renewal, certificates should renew automatically
-
-## Current State
-
-ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.
-
-## Future Needs
-
-While ACME handles renewal automatically, we should consider monitoring for:
-
-1. **ACME renewal failures**: Alert when a certificate fails to renew
-   - Could monitor ACME client logs (via Loki queries)
-   - Could check certificate file modification times
-
-2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures
-
-3. **Certificate transparency**: Monitor for unexpected certificate issuance
-
-### Potential Solutions
-
-1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
-   - `probe_ssl_earliest_cert_expiry` metric
-   - Already a standard tool, well-maintained
-
-2. **Custom Loki alerting**: Query ACME service logs for renewal failures
-   - Works with existing infrastructure
-   - No additional services needed
-
-3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics
-
-## Status
-
-**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.
--- a/docs/plans/completed/bootstrap-cache.md
+++ b/docs/plans/completed/bootstrap-cache.md
@@ -1,35 +0,0 @@
-# Plan: Configure Template2 to Use Nix Cache
-
-## Problem
-
-New VMs bootstrapped from template2 don't use our local nix cache (nix-cache.home.2rjus.net) during the initial `nixos-rebuild boot`. This means the first build downloads everything from cache.nixos.org, which is slower and uses more bandwidth.
-
-## Solution
-
-Update the template2 base image to include the nix cache configuration, so new VMs immediately benefit from cached builds during bootstrap.
-
-## Implementation
-
-1. Add nix cache configuration to `hosts/template2/configuration.nix`:
-   ```nix
-   nix.settings = {
-     substituters = [ "https://nix-cache.home.2rjus.net" "https://cache.nixos.org" ];
-     trusted-public-keys = [
-       "nix-cache.home.2rjus.net:..."  # Add the cache's public key
-       "cache.nixos.org-1:..."
-     ];
-   };
-   ```
-
-2. Rebuild and redeploy the Proxmox template:
-   ```bash
-   nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
-   ```
-
-3. Update `default_template_name` in `terraform/variables.tf` if the template name changed
-
-## Benefits
-
- Faster VM bootstrap times
- Reduced bandwidth to external cache
- Most derivations will already be cached from other hosts
--- a/docs/plans/completed/ns1-recreation.md
+++ b/docs/plans/completed/ns1-recreation.md
@@ -1,107 +0,0 @@
-# ns1 Recreation Plan
-
-## Overview
-
-Recreate ns1 using the OpenTofu workflow after the existing VM entered emergency mode due to an incorrect `hardware-configuration.nix` (hardcoded UUIDs that don't match the actual disk layout).
-
-## Current ns1 Configuration to Preserve
-
- **IP:** 10.69.13.5/24
- **Gateway:** 10.69.13.1
- **Role:** Primary DNS (authoritative + resolver)
- **Services:**
-  - `../../services/ns/master-authorative.nix`
-  - `../../services/ns/resolver.nix`
- **Metadata:**
-  - `homelab.host.role = "dns"`
-  - `homelab.host.labels.dns_role = "primary"`
- **Vault:** enabled
- **Deploy:** enabled
-
-## Execution Steps
-
-### Phase 1: Remove Old Configuration
-
-```bash
-nix develop -c create-host --remove --hostname ns1 --force
-```
-
-This removes:
- `hosts/ns1/` directory
- Entry from `flake.nix`
- Any terraform entries (none exist currently)
-
-### Phase 2: Create New Configuration
-
-```bash
-nix develop -c create-host --hostname ns1 --ip 10.69.13.5/24
-```
-
-This creates:
- `hosts/ns1/` with template2-based configuration
- Entry in `flake.nix`
- Entry in `terraform/vms.tf`
- Vault wrapped token for bootstrap
-
-### Phase 3: Customize Configuration
-
-After create-host, manually update `hosts/ns1/configuration.nix` to add:
-
-1. DNS service imports:
-   ```nix
-   ../../services/ns/master-authorative.nix
-   ../../services/ns/resolver.nix
-   ```
-
-2. Host metadata:
-   ```nix
-   homelab.host = {
-     tier = "prod";
-     role = "dns";
-     labels.dns_role = "primary";
-   };
-   ```
-
-3. Disable resolved (conflicts with Unbound):
-   ```nix
-   services.resolved.enable = false;
-   ```
-
-### Phase 4: Commit Changes
-
-```bash
-git add -A
-git commit -m "ns1: recreate with OpenTofu workflow
-
-Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs
-that didn't match actual disk layout, causing boot failure.
-
-Recreated using template2-based configuration for OpenTofu provisioning."
-```
-
-### Phase 5: Infrastructure
-
-1. Delete old ns1 VM in Proxmox (it's broken anyway)
-2. Run `nix develop -c tofu -chdir=terraform apply`
-3. Wait for bootstrap to complete
-4. Verify ns1 is functional:
-   - DNS resolution working
-   - Zone transfer to ns2 working
-   - All exporters responding
-
-### Phase 6: Finalize
-
- Push to master
- Move this plan to `docs/plans/completed/`
-
-## Rollback
-
-If the new VM fails:
-1. ns2 is still operational as secondary DNS
-2. Can recreate with different settings if needed
-
-## Notes
-
- ns2 will continue serving DNS during the migration
- Zone data is generated from flake, so no data loss
- The old VM's disk can be kept briefly in Proxmox as backup if desired
--- a/docs/plans/host-migration-to-opentofu.md
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -9,23 +9,24 @@ hosts are decommissioned or deferred.

 ## Current State

-Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`
+Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`

 Hosts to migrate:

 | Host | Category | Notes |
 |------|----------|-------|
-| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
+| ns1 | Stateless | Primary DNS, recreate |
+| ns2 | Stateless | Secondary DNS, recreate |
 | nix-cache01 | Stateless | Binary cache, recreate |
 | http-proxy | Stateless | Reverse proxy, recreate |
 | nats1 | Stateless | Messaging, recreate |
+| auth01 | Decommission | No longer in use |
 | ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
 | monitoring01 | Stateful | Prometheus, Grafana, Loki |
 | jelly01 | Stateful | Jellyfin metadata, watch history, config |
-| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
-| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
-| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
-| ~~ca~~ | ~~Deferred~~ | ✓ Complete |
+| pgdb1 | Stateful | PostgreSQL databases |
+| jump | Decommission | No longer needed |
+| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |

 ## Phase 1: Backup Preparation

@@ -45,19 +46,39 @@ No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` whi
 Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
 The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.

-### 1c. Verify Existing ha1 Backup
+### 1c. Add PostgreSQL Backup to pgdb1
+
+No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
+all databases and roles. The dump should be piped through restic's stdin backup (similar to
+the Grafana DB dump pattern on monitoring01).
+
+### 1d. Verify Existing ha1 Backup

 ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
 these backups are current and restorable before proceeding with migration.

-### 1d. Verify All Backups
+### 1e. Verify All Backups

 After adding/expanding backup jobs:
 1. Trigger a manual backup run on each host
 2. Verify backup integrity with `restic check`
 3. Test a restore to a temporary location to confirm data is recoverable

-## Phase 2: Stateless Host Migration
+## Phase 2: Declare pgdb1 Databases in Nix
+
+Before migrating pgdb1, audit the manually-created databases and users on the running
+instance, then declare them in the Nix configuration using `ensureDatabases` and
+`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
+
+Steps:
+1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
+2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
+3. Document any non-default PostgreSQL settings or extensions per database
+
+After reprovisioning, the databases will be created by NixOS, and data restored from the
+`pg_dumpall` backup.
+
+## Phase 3: Stateless Host Migration

 These hosts have no meaningful state and can be recreated fresh. For each host:

@@ -74,14 +95,13 @@ Migrate stateless hosts in an order that minimizes disruption:

 1. **nix-cache01** — low risk, no downstream dependencies during migration
 2. **nats1** — low risk, verify no persistent JetStream streams first
-3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
-4. ~~**ns1** — ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete
+3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
+4. **ns1, ns2** — migrate one at a time, verify DNS resolution between each

-~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ Both ns1
-and ns2 migration complete. Zone transfer (AXFR) verified working between ns1 (primary) and
-ns2 (secondary).
+For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
+use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.

-## Phase 3: Stateful Host Migration
+## Phase 4: Stateful Host Migration

 For each stateful host, the procedure is:

@@ -94,7 +114,17 @@ For each stateful host, the procedure is:
 7. Start services and verify functionality
 8. Decommission the old VM

-### 3a. monitoring01
+### 4a. pgdb1
+
+1. Run final `pg_dumpall` backup via restic
+2. Stop PostgreSQL on the old host
+3. Provision new pgdb1 via OpenTofu
+4. After bootstrap, NixOS creates the declared databases/users
+5. Restore data with `psql < dumpall.sql` (`pg_dumpall` emits a plain SQL script, which `pg_restore` cannot read)
+6. Verify database connectivity from gunter (`10.69.30.105`)
+7. Decommission old VM
+
+### 4b. monitoring01

 1. Run final Grafana backup
 2. Provision new monitoring01 via OpenTofu
@@ -104,7 +134,7 @@ For each stateful host, the procedure is:
 6. Verify all scrape targets are being collected
 7. Decommission old VM

-### 3b. jelly01
+### 4c. jelly01

 1. Run final Jellyfin backup
 2. Provision new jelly01 via OpenTofu
@@ -113,7 +143,7 @@ For each stateful host, the procedure is:
 5. Start Jellyfin, verify watch history and library metadata are present
 6. Decommission old VM

-### 3c. ha1
+### 4d. ha1

 1. Verify latest restic backup is current
 2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
@@ -137,69 +167,47 @@ OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropr
 `usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
 through before starting Zigbee2MQTT on the new host.

-## Phase 4: Decommission Hosts
+## Phase 5: Decommission jump and auth01 Hosts

-### jump ✓ COMPLETE
-
-~~1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)~~
-~~2. Remove host configuration from `hosts/jump/`~~
-~~3. Remove from `flake.nix`~~
-~~4. Remove any secrets in `secrets/jump/`~~
-~~5. Remove from `.sops.yaml`~~
-~~6. Destroy the VM in Proxmox~~
-~~7. Commit cleanup~~
-
-Host was already removed from flake.nix and VM destroyed. Configuration cleaned up in ba9f47f.
-
-### auth01 ✓ COMPLETE
-
-~~1. Remove host configuration from `hosts/auth01/`~~
-~~2. Remove from `flake.nix`~~
-~~3. Remove any secrets in `secrets/auth01/`~~
-~~4. Remove from `.sops.yaml`~~
-~~5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)~~
-~~6. Destroy the VM in Proxmox~~
-~~7. Commit cleanup~~
-
-Host configuration, services, and VM already removed.
-
-### pgdb1 (in progress)
-
-Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
-
-1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
-2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
-3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
-4. ~~Remove from `flake.nix`~~ ✓
-5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
+### jump
+1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
+2. Remove host configuration from `hosts/jump/`
+3. Remove from `flake.nix`
+4. Remove any secrets in `secrets/jump/`
+5. Remove from `.sops.yaml`
 6. Destroy the VM in Proxmox
-7. ~~Commit cleanup~~ ✓
+7. Commit cleanup

-See `docs/plans/pgdb1-decommission.md` for detailed plan.
+### auth01
+1. Remove host configuration from `hosts/auth01/`
+2. Remove from `flake.nix`
+3. Remove any secrets in `secrets/auth01/`
+4. Remove from `.sops.yaml`
+5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
+6. Destroy the VM in Proxmox
+7. Commit cleanup

-## Phase 5: Decommission ca Host ✓ COMPLETE
+## Phase 6: Decommission ca Host (Deferred)

-~~Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
+Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
 OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
-the same cleanup steps as the jump host.~~
+the same cleanup steps as the jump host.

-PKI migration to OpenBao complete. Host configuration, `services/ca/`, and VM removed.
+## Phase 7: Remove sops-nix

-## Phase 6: Remove sops-nix ✓ COMPLETE
+Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
+all remnants:
+- `sops-nix` input from `flake.nix` and `flake.lock`
+- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
+- `inherit sops-nix` from all specialArgs in `flake.nix`
+- `system/sops.nix` and its import in `system/default.nix`
+- `.sops.yaml`
+- `secrets/` directory
+- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
+- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
+  `hosts/template2/scripts.nix`)

-~~Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
-all remnants:~~
-~~- `sops-nix` input from `flake.nix` and `flake.lock`~~
-~~- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`~~
-~~- `inherit sops-nix` from all specialArgs in `flake.nix`~~
-~~- `system/sops.nix` and its import in `system/default.nix`~~
-~~- `.sops.yaml`~~
-~~- `secrets/` directory~~
-~~- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`~~
-~~- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
-  `hosts/template2/scripts.nix`)~~
-
-All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migration.md` for context.
+See `docs/plans/completed/sops-to-openbao-migration.md` for full context.

 ## Notes

@@ -208,7 +216,7 @@ All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migrati
 - The old VMs use IPs that the new VMs need, so the old VM must be shut down before
  the new one is provisioned (or use a temporary IP and swap after verification)
 - Stateful migrations should be done during low-usage windows
-- After all migrations are complete, all decommissioned hosts (jump, auth01, ca) have been removed
+- After all migrations are complete, the only hosts not in OpenTofu will be ca (deferred)
 - Since many hosts are being recreated, this is a good opportunity to establish consistent
  hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
  (e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
--- a/docs/plans/long-term-metrics-storage.md
+++ b/docs/plans/long-term-metrics-storage.md
@@ -1,122 +0,0 @@
-# Long-Term Metrics Storage Options
-
-## Problem Statement
-
-Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
-
-Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
-
-## Current Configuration
-
-Location: `services/monitoring/prometheus.nix`
-
- **Retention**: 30 days
- **Scrape interval**: 15s
- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
- **Storage**: Local disk on monitoring01
-
-## Options Evaluated
-
-### Option 1: VictoriaMetrics
-
-VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
-
-**NixOS Options Available:**
- `services.victoriametrics.enable`
- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
- `services.vmagent` - dedicated scraping agent
- `services.vmalert` - alerting rules evaluation
-
-**Pros:**
- Simple migration - single service replacement
- Same PromQL query language - Grafana dashboards work unchanged
- Same scrape config format - existing auto-generated configs work as-is
- 5-10x better compression means 30 days of Prometheus data could become 180+ days
- Lightweight, single binary
-
-**Cons:**
- No automatic downsampling (relies on compression alone)
- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
- Would need to migrate existing data or start fresh
-
-**Migration Steps:**
-1. Replace `services.prometheus` with `services.victoriametrics`
-2. Move scrape configs to `prometheusConfig`
-3. Set up `services.vmalert` for alerting rules
-4. Update Grafana datasource to VictoriaMetrics port (8428)
-5. Keep Alertmanager for notification routing
-
-### Option 2: Thanos
-
-Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
-
-**NixOS Options Available:**
- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
- `services.thanos.compact` - compacts and downsamples data
- `services.thanos.query` - unified query gateway
- `services.thanos.query-frontend` - query caching and parallelization
- `services.thanos.downsample` - dedicated downsampling service
-
-**Downsampling Behavior:**
- Raw resolution kept for configurable period (default: indefinite)
- 5-minute resolution created after 40 hours
- 1-hour resolution created after 10 days
-
-**Retention Configuration (in compactor):**
-```nix
-services.thanos.compact = {
-  retention.resolution-raw = "30d";   # Keep raw for 30 days
-  retention.resolution-5m = "180d";   # Keep 5m samples for 6 months
-  retention.resolution-1h = "2y";     # Keep 1h samples for 2 years
-};
-```
-
-**Pros:**
- True downsampling - older data uses progressively less storage
- Keep metrics for years with minimal storage impact
- Prometheus continues running unchanged
- Existing Alertmanager integration preserved
-
-**Cons:**
- Requires object storage (MinIO, S3, or local filesystem)
- Multiple services to manage (sidecar, compactor, query)
- More complex architecture
- Additional infrastructure (MinIO) may be needed
-
-**Required Components:**
-1. Thanos Sidecar (runs alongside Prometheus)
-2. Object storage (MinIO or local filesystem)
-3. Thanos Compactor (handles downsampling)
-4. Thanos Query (provides unified query endpoint)
-
-**Migration Steps:**
-1. Deploy object storage (MinIO or configure filesystem backend)
-2. Add Thanos sidecar pointing to Prometheus data directory
-3. Add Thanos compactor with retention policies
-4. Add Thanos query gateway
-5. Update Grafana datasource to Thanos Query port (10902)
-
-## Comparison
-
-| Aspect | VictoriaMetrics | Thanos |
-|--------|-----------------|--------|
-| Complexity | Low (1 service) | Higher (3-4 services) |
-| Downsampling | No | Yes (automatic) |
-| Storage savings | 5-10x compression | Compression + downsampling |
-| Object storage required | No | Yes |
-| Migration effort | Minimal | Moderate |
-| Grafana changes | Change port only | Change port only |
-| Alerting changes | Need vmalert | Keep existing |
-
-## Recommendation
-
-**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
-
-If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
-
-## References
-
- VictoriaMetrics docs: https://docs.victoriametrics.com/
- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)
--- a/docs/plans/memory-issues-follow-up.md
+++ b/docs/plans/memory-issues-follow-up.md
@@ -1,116 +0,0 @@
-# Memory Issues Follow-up
-
-Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
-
-## Background
-
-On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
-
-Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
-
-## Fix Applied
-
-**Commit:** `1674b6a` - system: enable zram swap for all hosts
-
-**Merged:** 2026-02-08 ~12:15 UTC
-
-**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
-
-## Timeline
-
-| Time (UTC) | Event |
-|------------|-------|
-| 05:00:46 | ns2 nixos-upgrade OOM killed |
-| 05:01:47 | `nixos_upgrade_failed` alert fired |
-| 12:15 | zram commit merged to master |
-| 12:19 | ns2 rebooted with zram enabled |
-| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
-
-## Hosts Affected
-
-All 2GB VMs that run nixos-upgrade:
- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01
-
-## Metrics to Monitor
-
-Check these in Grafana or via PromQL to verify the fix:
-
-### Swap availability (should be ~2GB after upgrade)
-```promql
-node_memory_SwapTotal_bytes / 1024 / 1024
-```
-
-### Swap usage during upgrades
-```promql
-(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
-```
-
-### Zswap compressed bytes (reports zswap, not zram; expect 0 unless zswap is also enabled)
-```promql
-node_memory_Zswap_bytes / 1024 / 1024
-```
-
-### Upgrade failures (should be 0)
-```promql
-node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
-```
-
-### Memory available during upgrades
-```promql
-node_memory_MemAvailable_bytes / 1024 / 1024
-```
-
-## Verification Steps
-
-After a few days (allow auto-upgrades to run on all hosts):
-
-1. Check all hosts have swap enabled:
-   ```promql
-   node_memory_SwapTotal_bytes > 0
-   ```
-
-2. Check for any upgrade failures since the fix:
-   ```promql
-   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
-   ```
-
-3. Review if any hosts used swap during upgrades (check historical graphs)
-
-## Success Criteria
-
- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs
-
-## Fallback Options
-
-If zram is insufficient:
-
-1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
-2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
-3. **Use remote builds** - Configure `nix.buildMachines` to offload builds (evaluation still runs on the local host)
-4. **Reduce flake size** - Split configurations to reduce evaluation memory
-
-### Memory Ballooning
-
-Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
-
-Configuration in `terraform/vms.tf`:
-```hcl
-memory  = 4096  # maximum memory
-balloon = 2048  # minimum memory (shrinks to this when idle)
-```
-
-Pros:
- VMs get memory on-demand without reboots
- Better host memory utilization
- Solves upgrade OOM without permanently allocating 4GB
-
-Cons:
 - Requires the virtio balloon driver in the guest
- Guest can experience memory pressure if host is overcommitted
-
-Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -1,219 +0,0 @@
-# Monitoring Stack Migration to VictoriaMetrics
-
-## Overview
-
-Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
-and longer retention. Run in parallel with monitoring01 until validated, then switch over using
-a `monitoring` CNAME for seamless transition.
-
-## Current State
-
-**monitoring01** (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)
-
-**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
-
-**Auto-generated:**
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)
-
-## Decision: VictoriaMetrics
-
-Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
-
-If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
-
-## Architecture
-
-```
-                     ┌─────────────────┐
-                     │  monitoring02   │
-                     │  VictoriaMetrics│
-                     │  + Grafana      │
-     monitoring      │  + Loki         │
-     CNAME ──────────│  + Tempo        │
-                     │  + Pyroscope    │
-                     │  + Alertmanager │
-                     │  (vmalert)      │
-                     └─────────────────┘
-                            ▲
-                            │ scrapes
-            ┌───────────────┼───────────────┐
-            │               │               │
-       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
-       │  ns1    │    │  ha1     │    │  ...     │
-       │ :9100   │    │ :9100    │    │ :9100    │
-       └─────────┘    └──────────┘    └──────────┘
-```
-
-## Implementation Plan
-
-### Phase 1: Create monitoring02 Host
-
-Use `create-host` script which handles flake.nix and terraform/vms.tf automatically.
-
-1. **Run create-host**: `nix develop -c create-host --hostname monitoring02 --ip 10.69.13.24/24`
-2. **Update VM resources** in `terraform/vms.tf`:
-   - 4 cores (same as monitoring01)
-   - 8GB RAM (double, for VictoriaMetrics headroom)
-   - 100GB disk (for 3+ months retention with compression)
-3. **Update host configuration**: Import monitoring services
-4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
-
-### Phase 2: Set Up VictoriaMetrics Stack
-
-Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
-Prometheus config. Once validated, this can replace the Prometheus module.
-
-1. **VictoriaMetrics** (port 8428):
-   - `services.victoriametrics.enable = true`
-   - `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
-   - Migrate scrape configs via `prometheusConfig`
-   - Use native push support (replaces Pushgateway)
-
-2. **vmalert** for alerting rules:
-   - `services.vmalert.enable = true`
-   - Point to VictoriaMetrics for metrics evaluation
-   - Keep rules in separate `rules.yml` file (same format as Prometheus)
-   - No receiver configured during parallel operation (prevents duplicate alerts)
-
-3. **Alertmanager** (port 9093):
-   - Keep existing configuration (alerttonotify webhook routing)
-   - Only enable receiver after cutover from monitoring01
-
-4. **Loki** (port 3100):
-   - Same configuration as current
-
-5. **Grafana** (port 3000):
-   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
-   - Reference existing dashboards on monitoring01 for content inspiration
-   - Configure VictoriaMetrics datasource (port 8428)
-   - Configure Loki datasource
-
-6. **Tempo** (ports 3200, 3201):
-   - Same configuration
-
-7. **Pyroscope** (port 4040):
-   - Same Docker-based deployment
-
-### Phase 3: Parallel Operation
-
-Run both monitoring01 and monitoring02 simultaneously:
-
-1. **Dual scraping**: Both hosts scrape the same targets
-   - Validates VictoriaMetrics is collecting data correctly
-
-2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
-   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
-
-3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
-
-4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
-
-5. **Compare resource usage**: Monitor disk/memory consumption between hosts
-
-### Phase 4: Add monitoring CNAME
-
-Add CNAME to monitoring02 once validated:
-
-```nix
-# hosts/monitoring02/configuration.nix
-homelab.dns.cnames = [ "monitoring" ];
-```
-
-This creates `monitoring.home.2rjus.net` pointing to monitoring02.
-
-### Phase 5: Update References
-
-Update hardcoded references to use the CNAME:
-
-1. **system/monitoring/logs.nix**:
-   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
-
-2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
-   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
-   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
-   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
-   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
-
-Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
-
-### Phase 6: Enable Alerting
-
-Once ready to cut over:
-1. Enable Alertmanager receiver on monitoring02
-2. Verify test alerts route correctly
-
-### Phase 7: Cutover and Decommission
-
-1. **Stop monitoring01**: Prevent duplicate alerts during transition
-2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
-3. **Verify all targets scraped**: Check VictoriaMetrics UI
-4. **Verify logs flowing**: Check Loki on monitoring02
-5. **Decommission monitoring01**:
-   - Remove from flake.nix
-   - Remove host configuration
-   - Destroy VM in Proxmox
-   - Remove from terraform state
-
-## Open Questions
-
- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
-
-## VictoriaMetrics Service Configuration
-
-Example NixOS configuration for monitoring02:
-
-```nix
-# VictoriaMetrics replaces Prometheus
-services.victoriametrics = {
-  enable = true;
-  retentionPeriod = "3m";  # 3 months, increase based on disk usage
-  prometheusConfig = {
-    global.scrape_interval = "15s";
-    scrape_configs = [
-      # Auto-generated node-exporter targets
-      # Service-specific scrape targets
-      # External targets
-    ];
-  };
-};
-
-# vmalert for alerting rules (no receiver during parallel operation)
-services.vmalert = {
-  enable = true;
-  datasource.url = "http://localhost:8428";
-  # notifier.alertmanager.url = "http://localhost:9093";  # Enable after cutover
-  rule = [ ./rules.yml ];
-};
-```
-
-## Rollback Plan
-
-If issues arise after cutover:
-1. Move `monitoring` CNAME back to monitoring01
-2. Restart monitoring01 services
-3. Revert Promtail config to point only to monitoring01
-4. Revert http-proxy backends
-
-## Notes
-
- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
--- a/docs/plans/completed/nats-deploy-service.md
+++ b/docs/plans/completed/nats-deploy-service.md
--- a/docs/plans/nix-cache-reprovision.md
+++ b/docs/plans/nix-cache-reprovision.md
@@ -1,212 +0,0 @@
-# Nix Cache Host Reprovision
-
-## Overview
-
-Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
-1. NATS-based remote build triggering (replacing the current bash script)
-2. Safer flake update workflow that validates builds before pushing to master
-
-## Current State
-
-### Host Configuration
- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
- Runs Gitea Actions runner for CI workflows
- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
- Uses a dedicated XFS volume at `/nix` for cache storage
-
-### Current Build System (`services/nix-cache/build-flakes.sh`)
- Runs every 30 minutes via systemd timer
- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
- Pushes success/failure metrics to pushgateway
- Simple but has no filtering, no parallelism, no remote triggering
-
-### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
- Runs daily at midnight via cron
- Runs `nix flake update --commit-lock-file`
- Pushes directly to master
- No build validation — can push broken inputs
-
-## Improvement 1: NATS-Based Remote Build Triggering
-
-### Design
-
-Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.
-
-| Approach | Pros | Cons |
-|----------|------|------|
-| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
-| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
-| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |
-
-**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.
-
-### Implementation
-
-1. Add new message type to homelab-deploy: `build.<host>` subject
-2. Listener on nix-cache01 subscribes to `build.>` wildcard
-3. On message receipt, builds the specified host and returns success/failure
-4. CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`
-
-### Benefits
- Trigger rebuild for specific host to ensure it's cached
- Could be called from CI after merging PRs
- Reuses existing NATS infrastructure and auth
- Progress/status could stream back via NATS reply
-
-## Improvement 2: Smarter Flake Update Workflow
-
-### Current Problems
-1. Updates can push breaking changes to master
-2. No visibility into what broke when it does
-3. Hosts that auto-update can pull broken configs
-
-### Proposed Workflow
-
-```
-┌─────────────────────────────────────────────────────────────────┐
-│                    Flake Update Workflow                         │
-├─────────────────────────────────────────────────────────────────┤
-│  1. nix flake update (on feature branch)                        │
-│  2. Build ALL hosts locally                                      │
-│  3. If all pass → fast-forward merge to master                  │
-│  4. If any fail → create PR with failure logs attached          │
-└─────────────────────────────────────────────────────────────────┘
-```
-
-### Implementation Options
-
-| Option | Description | Pros | Cons |
-|--------|-------------|------|------|
-| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
-| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
-| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |
-
-**Recommendation:** Option A with nix-cache01 as the runner. The host is already running Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.
-
-### Workflow Steps
-
-1. Workflow runs on schedule (daily or weekly)
-2. Creates branch `flake-update/YYYY-MM-DD`
-3. Runs `nix flake update --commit-lock-file`
-4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
-5. If all succeed:
-   - Fast-forward merge to master
-   - Delete feature branch
-6. If any fail:
-   - Create PR from the update branch
-   - Attach build logs as PR comment
-   - Label PR with `needs-review` or `build-failure`
-   - Do NOT merge automatically
-
-### Workflow File Changes
-
-```yaml
-# New: .github/workflows/flake-update-safe.yaml
-name: Safe flake update
-on:
-  schedule:
-    - cron: "0 2 * * 0"  # Weekly on Sunday at 2 AM
-  workflow_dispatch:  # Manual trigger
-
-jobs:
-  update-and-validate:
-    runs-on: homelab  # Use self-hosted runner on nix-cache01
-    steps:
-      - uses: actions/checkout@v4
-        with:
-          ref: master
-          fetch-depth: 0  # Need full history for merge
-
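-      - name: Create update branch
-        run: |
-          BRANCH="flake-update/$(date +%Y-%m-%d)"
-          git checkout -b "$BRANCH"
-          # Each step runs in its own shell; export BRANCH for the later steps
-          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"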
-
-      - name: Update flake
-        run: nix flake update --commit-lock-file
-
-      - name: Build all hosts
-        id: build
-        run: |
-          # Without pipefail the pipeline reports tee's exit status, hiding build failures
-          set -o pipefail
-          FAILED=""
-          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
-            echo "Building $host..."
-            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
-              FAILED="$FAILED $host"
-            fi
-          done
-          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"
-
-      - name: Merge to master (if all pass)
-        if: steps.build.outputs.failed == ''
-        run: |
-          git checkout master
-          git merge --ff-only "$BRANCH"
-          git push origin master
-          git branch -d "$BRANCH"  # local only; the branch is never pushed on the success path
-
-      - name: Create PR (if any fail)
-        if: steps.build.outputs.failed != ''
-        run: |
-          git push origin "$BRANCH"
-          # Create PR via Gitea API with build logs
-          # ... (PR creation with log attachment)
-```
-
-## Migration Steps
-
-### Phase 1: Reprovision Host via OpenTofu
-
-1. Add `nix-cache01` to `terraform/vms.tf`:
-   ```hcl
-   "nix-cache01" = {
-     ip        = "10.69.13.15/24"
-     cpu_cores = 4
-     memory    = 8192
-     disk_size = "100G"  # Larger for nix store
-   }
-   ```
-
-2. Shut down existing nix-cache01 VM
-3. Run `tofu apply` to provision new VM
-4. Verify bootstrap completes and cache is serving
-
-**Note:** The cache will be cold after reprovision. Run initial builds to populate.
-
-### Phase 2: Add Build Triggering to homelab-deploy
-
-1. Add `build` command to homelab-deploy CLI
-2. Add listener handler in NixOS module for `build.>` subjects
-3. Update nix-cache01 config to enable build listener
-4. Test with `homelab-deploy build testvm01`
-
-### Phase 3: Implement Safe Flake Update Workflow
-
-1. Create `.github/workflows/flake-update-safe.yaml`
-2. Disable or remove old `flake-update.yaml`
-3. Test manually with `workflow_dispatch`
-4. Monitor first automated run
-
-### Phase 4: Remove Old Build Script
-
-1. After new workflow is stable, remove:
-   - `services/nix-cache/build-flakes.nix`
-   - `services/nix-cache/build-flakes.sh`
-2. The new workflow handles scheduled builds
-
-## Open Questions
-
- [ ] What runner labels should the self-hosted runner use for the update workflow?
- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
- [ ] What to do about `gunter` (external nixos repo) - include in validation?
- [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?
-
-## Notes
-
- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
- The Harmonia service and cache signing key will work the same after reprovision
- Actions runner token is in Vault, will be provisioned automatically
- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering
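
A minimal bash sketch of the failure-collection loop from the removed `flake-update-safe.yaml` plan. `fake_build` is a hypothetical stand-in for `nix build`, and the host names are invented. It illustrates one detail the loop depends on: when build output is piped through `tee`, the `if !` test only sees the build's exit status if `pipefail` is set.

```bash
#!/usr/bin/env bash
# Sketch: collect failing hosts while still writing per-host logs.
set -u -o pipefail

# Hypothetical stand-in for `nix build`; fails for hosts named "bad*".
fake_build() {
  case "$1" in
    bad*) echo "error: build of $1 failed"; return 1 ;;
    *)    echo "built $1" ;;
  esac
}

# build_all HOST...: prints the space-separated list of hosts that failed.
build_all() {
  local failed="" host logdir
  logdir="$(mktemp -d)"
  for host in "$@"; do
    # With pipefail, a failing build makes the whole pipeline fail.
    if ! fake_build "$host" 2>&1 | tee "$logdir/build-$host.log" >/dev/null; then
      failed="$failed $host"
    fi
  done
  rm -rf "$logdir"
  echo "${failed# }"   # drop the leading space
}

build_all ns1 bad01 ha1 bad02
```

Removing `set -o pipefail` from the sketch makes `build_all` report no failures at all, which is the behaviour a real workflow would want to avoid.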
--- a/docs/plans/pgdb1-decommission.md
+++ b/docs/plans/pgdb1-decommission.md
@@ -1,113 +0,0 @@
-# pgdb1 Decommissioning Plan
-
-## Overview
-
-Decommission the pgdb1 PostgreSQL server. The only consumer was Open WebUI on gunter, which has been migrated to use a local PostgreSQL instance.
-
-## Pre-flight Verification
-
-Before proceeding, verify that gunter is no longer using pgdb1:
-
-1. Check Open WebUI on gunter is configured for local PostgreSQL (not 10.69.13.16)
-2. Optionally: Check pgdb1 for recent connection activity:
-   ```bash
-   ssh pgdb1 'sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;"'
-   ```
-
-## Files to Remove
-
-### Host Configuration
- `hosts/pgdb1/default.nix`
- `hosts/pgdb1/configuration.nix`
- `hosts/pgdb1/hardware-configuration.nix`
- `hosts/pgdb1/` (directory)
-
-### Service Module
- `services/postgres/postgres.nix`
- `services/postgres/default.nix`
- `services/postgres/` (directory)
-
-Note: This service module is only used by pgdb1, so it can be removed entirely.
-
-### Flake Entry
-Remove from `flake.nix` (lines 131-138):
-```nix
-pgdb1 = nixpkgs.lib.nixosSystem {
-  inherit system;
-  specialArgs = {
-    inherit inputs self;
-  };
-  modules = commonModules ++ [
-    ./hosts/pgdb1
-  ];
-};
-```
-
-### Vault AppRole
-Remove from `terraform/vault/approle.tf` (lines 69-73):
-```hcl
-"pgdb1" = {
-  paths = [
-    "secret/data/hosts/pgdb1/*",
-  ]
-}
-```
-
-### Monitoring Rules
-Remove from `services/monitoring/rules.yml` the `postgres_down` alert (lines 359-365):
-```yaml
- name: postgres_rules
-  rules:
-    - alert: postgres_down
-      expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
-      for: 5m
-      labels:
-        severity: critical
-```
-
-### Utility Scripts
-Delete `rebuild-all.sh` entirely (obsolete script).
-
-## Execution Steps
-
-### Phase 1: Verification
- [ ] Confirm Open WebUI on gunter uses local PostgreSQL
- [ ] Verify no active connections to pgdb1
-
-### Phase 2: Code Cleanup
- [ ] Create feature branch: `git checkout -b decommission-pgdb1`
- [ ] Remove `hosts/pgdb1/` directory
- [ ] Remove `services/postgres/` directory
- [ ] Remove pgdb1 entry from `flake.nix`
- [ ] Remove postgres alert from `services/monitoring/rules.yml`
- [ ] Delete `rebuild-all.sh` (obsolete)
- [ ] Run `nix flake check` to verify no broken references
- [ ] Commit changes
-
-### Phase 3: Terraform Cleanup
- [ ] Remove pgdb1 from `terraform/vault/approle.tf`
- [ ] Run `tofu plan` in `terraform/vault/` to preview changes
- [ ] Run `tofu apply` to remove the AppRole
- [ ] Commit terraform changes
-
-### Phase 4: Infrastructure Cleanup
- [ ] Shut down pgdb1 VM in Proxmox
- [ ] Delete the VM from Proxmox
- [ ] (Optional) Remove any DNS entries if not auto-generated
-
-### Phase 5: Finalize
- [ ] Merge feature branch to master
- [ ] Trigger auto-upgrade on DNS servers (ns1, ns2) to remove DNS entry
- [ ] Move this plan to `docs/plans/completed/`
-
-## Rollback
-
-If issues arise after decommissioning:
-1. The host configuration can be restored from git history and the VM recreated from the template
-2. Database data would need to be restored from backup (if any exists)
-
-## Notes
-
- pgdb1 IP: 10.69.13.16
- The postgres service allowed connections from gunter (10.69.30.105)
- No restic backup was configured for this host
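
Alongside the `nix flake check` step in the removed pgdb1 plan, a plain text search can catch leftover mentions in files Nix never evaluates (docs, YAML rules, scripts). This is a hedged sketch only; `find_refs` and the demo tree are hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch: list files that still mention a decommissioned host.
# Usage: find_refs <directory> <hostname>
find_refs() {
  local dir="$1" name="$2"
  # -r recurse, -l print file names only, -F treat the name as a fixed string
  grep -rlF --exclude-dir=.git -- "$name" "$dir" | sort
}

# Example run against a throwaway tree (paths are illustrative only).
demo="$(mktemp -d)"
mkdir -p "$demo/hosts/ns1" "$demo/services/monitoring"
echo 'networking.hostName = "ns1";' > "$demo/hosts/ns1/configuration.nix"
echo 'instance="pgdb1.example:9100"' > "$demo/services/monitoring/rules.yml"
find_refs "$demo" pgdb1
```

An empty result means no file under the directory mentions the name; it does not replace evaluating the flake.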
--- a/docs/plans/completed/prometheus-scrape-target-labels.md
+++ b/docs/plans/completed/prometheus-scrape-target-labels.md
@@ -1,38 +1,10 @@
 # Prometheus Scrape Target Labels

-## Implementation Status
-
-| Step | Status | Notes |
-|------|--------|-------|
-| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
-| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
-| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
-| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
-| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
-| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
-| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |
-
-**Hosts with metadata configured:**
- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
- `nix-cache01`: `role = "build-host"`
- `vault01`: `role = "vault"`
- `testvm01/02/03`: `tier = "test"`
-
-**Implementation complete.** Branch: `prometheus-scrape-target-labels`
-
-**Query examples:**
- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
- `up{role="dns"}` - all DNS servers
- `up{tier="test"}` - all test-tier hosts
-
---
-
 ## Goal

 Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

-**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
+**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.

 ## Motivation

@@ -82,11 +54,12 @@ or

 ## Implementation

-This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.
+This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.

 ### 1. Create `homelab.host` module

-✅ **Complete.** The module is in `modules/homelab/host.nix`.
+**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
+`modules/homelab/host.nix` with tier, priority, role, and labels options.

 Create `modules/homelab/host.nix` with shared host metadata options:

@@ -125,8 +98,6 @@ Import this module in `modules/homelab/default.nix`.

 ### 2. Update `lib/monitoring.nix`

-✅ **Complete.** Labels are now extracted and propagated.
-
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:

@@ -155,8 +126,6 @@ This requires grouping hosts by their label attrset and producing one `static_co

 ### 3. Update `services/monitoring/prometheus.nix`

-✅ **Complete.** Now uses structured static_configs output.
-
 Change the node-exporter scrape config to use the new structured output:

 ```nix
@@ -169,37 +138,36 @@ static_configs = nodeExporterTargets;

 ### 4. Set metadata on hosts

-✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
-
 Example in `hosts/nix-cache01/configuration.nix`:

 ```nix
 homelab.host = {
+  tier = "test";       # can be deployed by MCP (used by homelab-deploy)
  priority = "low";    # relaxed alerting thresholds
  role = "build-host";
 };
 ```

-**Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.
-
 Example in `hosts/ns1/configuration.nix`:

 ```nix
 homelab.host = {
+  tier = "prod";
+  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
 };
 ```

-**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
-
 ### 5. Update alert rules

-✅ **Complete.** Updated `services/monitoring/rules.yml`:
+After implementing labels, review and update `services/monitoring/rules.yml`:

- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).
+- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
+- Consider whether any other rules should differentiate by priority or role.

-### 6. Labels for `generateScrapeConfigs` (service targets)
+Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.

-✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
+### 6. Consider labels for `generateScrapeConfigs` (service targets)
+
+The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
--- a/docs/plans/security-hardening.md
+++ b/docs/plans/security-hardening.md
@@ -1,224 +0,0 @@
-# Security Hardening Plan
-
-## Overview
-
-Address security gaps identified in infrastructure review. Focus areas: SSH hardening, network security, logging improvements, and secrets management.
-
-## Current State
-
- SSH allows password auth and unrestricted root login (`system/sshd.nix`)
- Firewall disabled on all hosts (`networking.firewall.enable = false`)
- Promtail ships logs over HTTP to Loki
- Loki has no authentication (`auth_enabled = false`)
- AppRole secret-IDs never expire (`secret_id_ttl = 0`)
- Vault TLS verification disabled by default (`skipTlsVerify = true`)
- Audit logging exists (`common/ssh-audit.nix`) but not applied globally
- Alert rules focus on availability, no security event detection
-
-## Priority Matrix
-
-| Issue | Severity | Effort | Priority |
-|-------|----------|--------|----------|
-| SSH password auth | High | Low | **P1** |
-| Firewall disabled | High | Medium | **P1** |
-| Promtail HTTP (no TLS) | High | Medium | **P2** |
-| No security alerting | Medium | Low | **P2** |
-| Audit logging not global | Low | Low | **P2** |
-| Loki no auth | Medium | Medium | **P3** |
-| Secret-ID TTL | Medium | Medium | **P3** |
-| Vault skipTlsVerify | Medium | Low | **P3** |
-
-## Phase 1: Quick Wins (P1)
-
-### 1.1 SSH Hardening
-
-Edit `system/sshd.nix`:
-
-```nix
-services.openssh = {
-  enable = true;
-  settings = {
-    PermitRootLogin = "prohibit-password";  # Key-only root login
-    PasswordAuthentication = false;
-    KbdInteractiveAuthentication = false;
-  };
-};
-```
-
-**Prerequisite:** Verify all hosts have SSH keys deployed for root.
-
-### 1.2 Enable Firewall
-
-Create `system/firewall.nix` with default deny policy:
-
-```nix
-{ ... }: {
-  networking.firewall.enable = true;
-
-  # Use openssh's built-in firewall integration
-  services.openssh.openFirewall = true;
-}
-```
-
-**Useful firewall options:**
-
-| Option | Description |
-|--------|-------------|
-| `networking.firewall.trustedInterfaces` | Accept all traffic from these interfaces (e.g., `[ "lo" ]`) |
-| `networking.firewall.interfaces.<name>.allowedTCPPorts` | Per-interface port rules |
-| `networking.firewall.extraInputRules` | Custom nftables rules for complex filtering (only applies with the nftables-based firewall, i.e. `networking.nftables.enable = true`) |
-
-**Network range restrictions:** Consider restricting SSH to the infrastructure subnet (`10.69.13.0/24`) using `extraInputRules` for defense in depth. However, this adds complexity and may not be necessary given the trusted network model.
-
-#### Per-Interface Rules (http-proxy WireGuard)
-
-The `http-proxy` host has a WireGuard interface (`wg0`) that may need different rules than the LAN interface. Use `networking.firewall.interfaces` to apply per-interface policies:
-
-```nix
-# Example: http-proxy with different rules per interface
-networking.firewall = {
-  enable = true;
-
-  # Default: only SSH (via openFirewall)
-  allowedTCPPorts = [ ];
-
-  # LAN interface: allow HTTP/HTTPS
-  interfaces.ens18 = {
-    allowedTCPPorts = [ 80 443 ];
-  };
-
-  # WireGuard interface: restrict to specific services or trust fully
-  interfaces.wg0 = {
-    allowedTCPPorts = [ 80 443 ];
-    # Or use trustedInterfaces = [ "wg0" ] if fully trusted
-  };
-};
-```
-
-**TODO:** Investigate current WireGuard usage on http-proxy to determine appropriate rules.
-
-Then per-host, open required ports:
-
-| Host | Additional Ports |
-|------|------------------|
-| ns1/ns2 | 53 (TCP/UDP) |
-| vault01 | 8200 |
-| monitoring01 | 3100, 9090, 3000, 9093 |
-| http-proxy | 80, 443 |
-| nats1 | 4222 |
-| ha1 | 1883, 8123 |
-| jelly01 | 8096 |
-| nix-cache01 | 5000 |
-
-## Phase 2: Logging & Detection (P2)
-
-### 2.1 Enable TLS for Promtail → Loki
-
-Update `system/monitoring/logs.nix`:
-
-```nix
-clients = [{
-  url = "https://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
-  tls_config = {
-    ca_file = "/etc/ssl/certs/homelab-root-ca.pem";
-  };
-}];
-```
-
-Requires:
- Configure Loki with TLS certificate (use internal ACME)
- Ensure all hosts trust root CA (already done via `system/pki/root-ca.nix`)
-
-### 2.2 Security Alert Rules
-
-Add to `services/monitoring/rules.yml`:
-
-```yaml
- name: security_rules
-  rules:
-    - alert: ssh_auth_failures
-      expr: increase(node_logind_sessions_total[5m]) > 20
-      for: 0m
-      labels:
-        severity: warning
-      annotations:
-        summary: "Unusual login activity on {{ $labels.instance }}"
-
-    - alert: vault_secret_fetch_failure
-      expr: increase(vault_secret_failures[5m]) > 5
-      for: 0m
-      labels:
-        severity: warning
-      annotations:
-        summary: "Vault secret fetch failures on {{ $labels.instance }}"
-```
-
-Also add Loki-based alerts for:
- Failed SSH attempts: `{job="systemd-journal"} |= "Failed password"`
- sudo usage: `{job="systemd-journal"} |= "sudo"`
-
-### 2.3 Global Audit Logging
-
-Import `common/ssh-audit.nix` from `system/default.nix` (the path is relative to `system/`):
-
-```nix
-imports = [
-  # ... existing imports
-  ../common/ssh-audit.nix
-];
-```
-
-## Phase 3: Defense in Depth (P3)
-
-### 3.1 Loki Authentication
-
-Options:
-1. **Basic auth via reverse proxy** - Put Loki behind Caddy with auth
-2. **Loki multi-tenancy** - Enable `auth_enabled = true` and use tenant IDs
-3. **Network isolation** - Bind Loki only to localhost, expose via authenticated proxy
-
-Recommendation: Option 1 (reverse proxy) is simplest for homelab.
-
-### 3.2 AppRole Secret Rotation
-
-Update `terraform/vault/approle.tf`:
-
-```hcl
-secret_id_ttl  = 2592000  # 30 days
-```
-
-Add documentation for manual rotation procedure or implement automated rotation via the existing `restartTrigger` mechanism in `vault-secrets.nix`.
-
-### 3.3 Enable Vault TLS Verification
-
-Change default in `system/vault-secrets.nix`:
-
-```nix
-skipTlsVerify = mkOption {
-  type = types.bool;
-  default = false;  # Changed from true
-};
-```
-
-**Prerequisite:** Verify all hosts trust the internal CA that signed the Vault certificate.
-
-## Implementation Order
-
-1. **Test on test-tier first** - Deploy phases 1-2 to testvm01/02/03
-2. **Validate SSH access** - Ensure key-based login works before disabling passwords
-3. **Document firewall ports** - Create reference of ports per host before enabling
-4. **Phased prod rollout** - Deploy to prod hosts one at a time, verify each
-
-## Open Questions
-
- [ ] Do all hosts have SSH keys configured for root access?
- [ ] Should firewall rules be per-host or use a central definition with roles?
- [ ] Should Loki authentication use the existing Kanidm setup?
-
-**Resolved:** Password-based SSH access for recovery is not required - most hosts have console access through Proxmox or physical access, which provides an out-of-band recovery path if SSH keys fail.
-
-## Notes
-
- Firewall changes are the highest risk - test thoroughly on test-tier
- SSH hardening must not lock out access - verify keys first
- Consider creating a "break glass" procedure for emergency access if keys fail
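
For the "document firewall ports" step in the removed hardening plan, one possible starting point is to record what each host actually listens on before the firewall is enabled. The sketch below assumes input in the format produced by `ss -tlnH`; `listening_ports` is a hypothetical helper, and the sample lines are invented rather than taken from any host.

```bash
#!/usr/bin/env bash
# Sketch: extract unique listening TCP ports from `ss -tlnH` style output on stdin.
listening_ports() {
  # Field 4 is the local address ("0.0.0.0:22", "[::]:22", "*:5000");
  # the port is whatever follows the last colon.
  awk '{ n = split($4, parts, ":"); print parts[n] }' | sort -nu
}

# Sample input standing in for `ss -tlnH` on a real host.
sample='LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 4096 127.0.0.1:9100 0.0.0.0:*
LISTEN 0 4096 *:5000 *:*'
printf '%s\n' "$sample" | listening_ports
```

On a host this would be run as `ss -tlnH | listening_ports`. Ports bound only to `127.0.0.1` show up too and do not need a firewall rule.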
--- a/docs/user-management.md
+++ b/docs/user-management.md
@@ -1,267 +0,0 @@
-# User Management with Kanidm
-
-Central authentication for the homelab using Kanidm.
-
-## Overview
-
- **Server**: kanidm01.home.2rjus.net (auth.home.2rjus.net)
- **WebUI**: https://auth.home.2rjus.net
- **LDAPS**: port 636
-
-## CLI Setup
-
-The `kanidm` CLI is available in the devshell:
-
-```bash
-nix develop
-
-# Login as idm_admin
-kanidm login --name idm_admin --url https://auth.home.2rjus.net
-```
-
-## User Management
-
-POSIX users are managed imperatively via the `kanidm` CLI. This allows setting
-all attributes (including UNIX password) in one workflow.
-
-### Creating a POSIX User
-
-```bash
-# Create the person
-kanidm person create <username> "<Display Name>"
-
-# Add to groups
-kanidm group add-members ssh-users <username>
-
-# Enable POSIX (UID is auto-assigned)
-kanidm person posix set <username>
-
-# Set UNIX password (required for SSH login, min 10 characters)
-kanidm person posix set-password <username>
-
-# Optionally set login shell
-kanidm person posix set <username> --shell /bin/zsh
-```
-
-### Example: Full User Creation
-
-```bash
-kanidm person create testuser "Test User"
-kanidm group add-members ssh-users testuser
-kanidm person posix set testuser
-kanidm person posix set-password testuser
-kanidm person get testuser
-```
-
-After creation, verify on a client host:
-```bash
-getent passwd testuser
-ssh testuser@testvm01.home.2rjus.net
-```
-
-### Viewing User Details
-
-```bash
-kanidm person get <username>
-```
-
-### Removing a User
-
-```bash
-kanidm person delete <username>
-```
-
-## Group Management
-
-Groups for POSIX access are also managed via CLI.
-
-### Creating a POSIX Group
-
-```bash
-# Create the group
-kanidm group create <group-name>
-
-# Enable POSIX with a specific GID
-kanidm group posix set <group-name> --gidnumber <gid>
-```
-
-### Adding Members
-
-```bash
-kanidm group add-members <group-name> <username>
-```
-
-### Viewing Group Details
-
-```bash
-kanidm group get <group-name>
-kanidm group list-members <group-name>
-```
-
-### Example: Full Group Creation
-
-```bash
-kanidm group create testgroup
-kanidm group posix set testgroup --gidnumber 68010
-kanidm group add-members testgroup testuser
-kanidm group get testgroup
-```
-
-After creation, verify on a client host:
-```bash
-getent group testgroup
-```
-
-### Current Groups
-
-| Group | GID | Purpose |
-|-------|-----|---------|
-| ssh-users | 68000 | SSH login access |
-| admins | 68001 | Administrative access |
-| users | 68002 | General users |
-
-### UID/GID Allocation
-
-Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned GIDs:
-
-| Range | Purpose |
-|-------|---------|
-| 65,536+ | Users (auto-assigned) |
-| 68,000 - 68,999 | Groups (manually assigned) |
-
-## PAM/NSS Client Configuration
-
-Enable central authentication on a host:
-
-```nix
-homelab.kanidm.enable = true;
-```
-
-This configures:
- `services.kanidm.enablePam = true`
- Client connection to auth.home.2rjus.net
- Login authorization for `ssh-users` group
- Short usernames (`torjus` instead of `torjus@home.2rjus.net`)
- Home directory symlinks (`/home/torjus` → UUID-based directory)
-
-### Enabled Hosts
-
- testvm01, testvm02, testvm03 (test tier)
-
-### Options
-
-```nix
-homelab.kanidm = {
-  enable = true;
-  server = "https://auth.home.2rjus.net";  # default
-  allowedLoginGroups = [ "ssh-users" ];     # default
-};
-```
-
-### Home Directories
-
-Home directories use UUID-based paths for stability (so renaming a user doesn't
-require moving their home directory). Symlinks provide convenient access:
-
-```
-/home/torjus -> /home/e4f4c56c-4aee-4c20-846f-90cb69807733
-```
-
-The symlinks are created by `kanidm-unixd-tasks` on first login.
-
-## Testing
-
-### Verify NSS Resolution
-
-```bash
-# Check user resolution
-getent passwd <username>
-
-# Check group resolution
-getent group <group-name>
-```
-
-### Test SSH Login
-
-```bash
-ssh <username>@<hostname>.home.2rjus.net
-```
-
-## Troubleshooting
-
-### "PAM user mismatch" error
-
-SSH fails with "fatal: PAM user mismatch" in logs. This happens when Kanidm returns
-usernames in SPN format (`torjus@home.2rjus.net`) but SSH expects short names (`torjus`).
-
-**Solution**: Configure `uid_attr_map = "name"` in unixSettings (already set in our module).
-
-Check current format:
-```bash
-getent passwd torjus
-# Should show: torjus:x:65536:...
-# NOT: torjus@home.2rjus.net:x:65536:...
-```
-
-### User resolves but SSH fails immediately
-
-The user's login group (e.g., `ssh-users`) likely doesn't have POSIX enabled:
-
-```bash
-# Check if group has POSIX
-getent group ssh-users
-
-# If empty, enable POSIX on the server
-kanidm group posix set ssh-users --gidnumber 68000
-```
-
-### User doesn't resolve via getent
-
-1. Check kanidm-unixd service is running:
-   ```bash
-   systemctl status kanidm-unixd
-   ```
-
-2. Check unixd can reach server:
-   ```bash
-   kanidm-unix status
-   # Should show: system: online, Kanidm: online
-   ```
-
-3. Check client can reach server:
-   ```bash
-   curl -s https://auth.home.2rjus.net/status
-   ```
-
-4. Check user has POSIX enabled on server:
-   ```bash
-   kanidm person get <username>
-   ```
-
-5. Restart nscd to clear stale cache:
-   ```bash
-   systemctl restart nscd
-   ```
-
-6. Invalidate kanidm cache:
-   ```bash
-   kanidm-unix cache-invalidate
-   ```
-
-### Changes not taking effect after deployment
-
-NixOS uses nsncd (a Rust reimplementation of nscd) for NSS caching. After deploying
-kanidm-unixd config changes, you may need to restart both services:
-
-```bash
-systemctl restart kanidm-unixd
-systemctl restart nscd
-```
-
-### Test PAM authentication directly
-
-Use the kanidm-unix CLI to test PAM auth without SSH:
-
-```bash
-kanidm-unix auth-test --name <username>
-```
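
The "PAM user mismatch" check in the removed user-management doc comes down to whether the login field of the passwd entry carries an `@domain` suffix. A small sketch of that test follows; `uses_short_name` is a hypothetical helper, and the passwd fields beyond the login name and UID are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: check that a passwd entry uses a short login name rather than SPN format.
uses_short_name() {
  local login="${1%%:*}"   # everything before the first colon
  [[ -n "$login" && "$login" != *@* ]]
}

# On a client host the entry would come from: getent passwd <username>
for entry in 'torjus:x:65536:65536::/home/torjus:/bin/zsh' \
             'torjus@home.2rjus.net:x:65536:65536::/home/torjus:/bin/zsh'; do
  if uses_short_name "$entry"; then
    echo "short: ${entry%%:*}"
  else
    echo "SPN:   ${entry%%:*}"
  fi
done
```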
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770481834,
-        "narHash": "sha256-Xx9BYnI0C/qgPbwr9nj6NoAdQTbYLunrdbNSaUww9oY=",
+        "lastModified": 1770443536,
+        "narHash": "sha256-UufZIVggiioMFDSjKx+ifgkDOk9alNSiRmkvc4/+HIA=",
        "ref": "master",
-        "rev": "fd0d63b103dfaf21d1c27363266590e723021c67",
-        "revCount": 24,
+        "rev": "95b795dcfd86b7b36045bba67e536b3a1c61dd33",
+        "revCount": 20,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
@@ -42,6 +42,27 @@
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      }
    },
+    "labmon": {
+      "inputs": {
+        "nixpkgs": [
+          "nixpkgs-unstable"
+        ]
+      },
+      "locked": {
+        "lastModified": 1748983975,
+        "narHash": "sha256-DA5mOqxwLMj/XLb4hvBU1WtE6cuVej7PjUr8N0EZsCE=",
+        "ref": "master",
+        "rev": "040a73e891a70ff06ec7ab31d7167914129dbf7d",
+        "revCount": 17,
+        "type": "git",
+        "url": "https://git.t-juice.club/torjus/labmon"
+      },
+      "original": {
+        "ref": "master",
+        "type": "git",
+        "url": "https://git.t-juice.club/torjus/labmon"
+      }
+    },
    "nixos-exporter": {
      "inputs": {
        "nixpkgs": [
@@ -98,9 +119,31 @@
      "inputs": {
        "alerttonotify": "alerttonotify",
        "homelab-deploy": "homelab-deploy",
+        "labmon": "labmon",
        "nixos-exporter": "nixos-exporter",
        "nixpkgs": "nixpkgs",
-        "nixpkgs-unstable": "nixpkgs-unstable"
+        "nixpkgs-unstable": "nixpkgs-unstable",
+        "sops-nix": "sops-nix"
+      }
+    },
+    "sops-nix": {
+      "inputs": {
+        "nixpkgs": [
+          "nixpkgs-unstable"
+        ]
+      },
+      "locked": {
+        "lastModified": 1770145881,
+        "narHash": "sha256-ktjWTq+D5MTXQcL9N6cDZXUf9kX8JBLLBLT0ZyOTSYY=",
+        "owner": "Mic92",
+        "repo": "sops-nix",
+        "rev": "17eea6f3816ba6568b8c81db8a4e6ca438b30b7c",
+        "type": "github"
+      },
+      "original": {
+        "owner": "Mic92",
+        "repo": "sops-nix",
+        "type": "github"
      }
    }
  },
--- a/flake.nix
+++ b/flake.nix
@@ -5,10 +5,18 @@
    nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-25.11";
    nixpkgs-unstable.url = "github:nixos/nixpkgs?ref=nixos-unstable";

+    sops-nix = {
+      url = "github:Mic92/sops-nix";
+      inputs.nixpkgs.follows = "nixpkgs-unstable";
+    };
    alerttonotify = {
      url = "git+https://git.t-juice.club/torjus/alerttonotify?ref=master";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
    };
+    labmon = {
+      url = "git+https://git.t-juice.club/torjus/labmon?ref=master";
+      inputs.nixpkgs.follows = "nixpkgs-unstable";
+    };
    nixos-exporter = {
      url = "git+https://git.t-juice.club/torjus/nixos-exporter";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
@@ -24,7 +32,9 @@
      self,
      nixpkgs,
      nixpkgs-unstable,
+      sops-nix,
      alerttonotify,
+      labmon,
      nixos-exporter,
      homelab-deploy,
      ...
@@ -40,6 +50,7 @@
      commonOverlays = [
        overlay-unstable
        alerttonotify.overlays.default
+        labmon.overlays.default
      ];
      # Common modules applied to all hosts
      commonModules = [
@@ -50,6 +61,7 @@
            system.configurationRevision = self.rev or self.dirtyRev or "dirty";
          }
        )
+        sops-nix.nixosModules.sops
        nixos-exporter.nixosModules.default
        homelab-deploy.nixosModules.default
        ./modules/homelab
@@ -65,19 +77,46 @@
    in
    {
      nixosConfigurations = {
+        ns1 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self sops-nix;
+          };
+          modules = commonModules ++ [
+            ./hosts/ns1
+          ];
+        };
+        ns2 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self sops-nix;
+          };
+          modules = commonModules ++ [
+            ./hosts/ns2
+          ];
+        };
        ha1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/ha1
          ];
        };
+        template1 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self sops-nix;
+          };
+          modules = commonModules ++ [
+            ./hosts/template
+          ];
+        };
        template2 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/template2
@@ -86,25 +125,35 @@
        http-proxy = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/http-proxy
          ];
        };
+        ca = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self sops-nix;
+          };
+          modules = commonModules ++ [
+            ./hosts/ca
+          ];
+        };
        monitoring01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/monitoring01
+            labmon.nixosModules.labmon
          ];
        };
        jelly01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/jelly01
@@ -113,82 +162,55 @@
        nix-cache01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/nix-cache01
          ];
        };
+        pgdb1 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self sops-nix;
+          };
+          modules = commonModules ++ [
+            ./hosts/pgdb1
+          ];
+        };
        nats1 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/nats1
          ];
        };
-        vault01 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self;
-          };
-          modules = commonModules ++ [
-            ./hosts/vault01
-          ];
-        };
        testvm01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
            ./hosts/testvm01
          ];
        };
-        testvm02 = nixpkgs.lib.nixosSystem {
+        vault01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
-            ./hosts/testvm02
+            ./hosts/vault01
          ];
        };
-        testvm03 = nixpkgs.lib.nixosSystem {
+        vaulttest01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
-            inherit inputs self;
+            inherit inputs self sops-nix;
          };
          modules = commonModules ++ [
-            ./hosts/testvm03
-          ];
-        };
-        ns2 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self;
-          };
-          modules = commonModules ++ [
-            ./hosts/ns2
-          ];
-        };
-        ns1 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self;
-          };
-          modules = commonModules ++ [
-            ./hosts/ns1
-          ];
-        };
-        kanidm01 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self;
-          };
-          modules = commonModules ++ [
-            ./hosts/kanidm01
+            ./hosts/vaulttest01
          ];
        };
      };
@@ -207,7 +229,6 @@
              pkgs.ansible
              pkgs.opentofu
              pkgs.openbao
-              pkgs.kanidm_1_8
              (pkgs.callPackage ./scripts/create-host { })
              homelab-deploy.packages.${pkgs.system}.default
            ];
--- a/hosts/kanidm01/configuration.nix
+++ b/hosts/kanidm01/configuration.nix
@@ -1,39 +1,25 @@
 {
-  config,
-  lib,
  pkgs,
  ...
 }:

 {
  imports = [
-    ../template2/hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
    ../../common/vm
-    ../../services/kanidm
  ];

-  # Host metadata
-  homelab.host = {
-    tier = "test";
-    role = "auth";
+  nixpkgs.config.allowUnfree = true;
+  # Use the GRUB boot loader.
+  boot.loader.grub = {
+    enable = true;
+    device = "/dev/sda";
+    configurationLimit = 3;
  };

-  # DNS CNAME for auth.home.2rjus.net
-  homelab.dns.cnames = [ "auth" ];
-
-  # Enable Vault integration
-  vault.enable = true;
-
-  # Enable remote deployment via NATS
-  homelab.deploy.enable = true;
-
-  nixpkgs.config.allowUnfree = true;
-  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/vda";
-
-  networking.hostName = "kanidm01";
+  networking.hostName = "ca";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -47,7 +33,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.23/24"
+      "10.69.13.12/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -73,5 +59,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "25.11"; # Did you read the comment?
-}
+  system.stateVersion = "23.11"; # Did you read the comment?
+}
--- a/hosts/testvm03/default.nix
+++ b/hosts/testvm03/default.nix
@@ -1,5 +1,7 @@
-{ ... }: {
+{ ... }:
+{
  imports = [
    ./configuration.nix
+    ../../services/ca
  ];
-}
+}
--- a/hosts/ha1/configuration.nix
+++ b/hosts/ha1/configuration.nix
@@ -7,7 +7,7 @@

 {
  imports = [
-    ./hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
    ../../common/vm
--- a/hosts/http-proxy/configuration.nix
+++ b/hosts/http-proxy/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ./hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
    ../../common/vm
--- a/hosts/http-proxy/hardware-configuration.nix
+++ b/hosts/http-proxy/hardware-configuration.nix
@@ -1,42 +0,0 @@
-{
-  config,
-  lib,
-  pkgs,
-  modulesPath,
-  ...
-}:
-
-{
-  imports = [
-    (modulesPath + "/profiles/qemu-guest.nix")
-  ];
-  boot.initrd.availableKernelModules = [
-    "ata_piix"
-    "uhci_hcd"
-    "virtio_pci"
-    "virtio_scsi"
-    "sd_mod"
-    "sr_mod"
-  ];
-  boot.initrd.kernelModules = [ "dm-snapshot" ];
-  boot.kernelModules = [
-    "ptp_kvm"
-  ];
-  boot.extraModulePackages = [ ];
-
-  fileSystems."/" = {
-    device = "/dev/disk/by-label/root";
-    fsType = "xfs";
-  };
-
-  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  networking.useDHCP = lib.mkDefault true;
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/jelly01/configuration.nix
+++ b/hosts/jelly01/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ./hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
    ../../common/vm
@@ -61,8 +61,9 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  vault.enable = true;
-  homelab.deploy.enable = true;
+  zramSwap = {
+    enable = true;
+  };

  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/jelly01/hardware-configuration.nix
+++ b/hosts/jelly01/hardware-configuration.nix
@@ -1,42 +0,0 @@
-{
-  config,
-  lib,
-  pkgs,
-  modulesPath,
-  ...
-}:
-
-{
-  imports = [
-    (modulesPath + "/profiles/qemu-guest.nix")
-  ];
-  boot.initrd.availableKernelModules = [
-    "ata_piix"
-    "uhci_hcd"
-    "virtio_pci"
-    "virtio_scsi"
-    "sd_mod"
-    "sr_mod"
-  ];
-  boot.initrd.kernelModules = [ "dm-snapshot" ];
-  boot.kernelModules = [
-    "ptp_kvm"
-  ];
-  boot.extraModulePackages = [ ];
-
-  fileSystems."/" = {
-    device = "/dev/disk/by-label/root";
-    fsType = "xfs";
-  };
-
-  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  networking.useDHCP = lib.mkDefault true;
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/jump/configuration.nix
+++ b/hosts/jump/configuration.nix
@@ -0,0 +1,56 @@
+{ config, lib, pkgs, ... }:
+
+{
+  imports =
+    [
+      ../template/hardware-configuration.nix
+      ../../system
+    ];
+
+  nixpkgs.config.allowUnfree = true;
+
+  homelab.host.role = "bastion";
+
+  # Use the GRUB boot loader.
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/sda";
+
+  networking.hostName = "jump";
+  networking.domain = "home.2rjus.net";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  services.resolved.enable = false;
+  networking.nameservers = [
+    "10.69.13.5"
+    "10.69.13.6"
+  ];
+
+  systemd.network.enable = true;
+  systemd.network.networks."ens18" = {
+    matchConfig.Name = "ens18";
+    address = [
+      "10.69.13.10/24"
+    ];
+    routes = [
+      { Gateway = "10.69.13.1"; }
+    ];
+    linkConfig.RequiredForOnline = "routable";
+  };
+  time.timeZone = "Europe/Oslo";
+
+  nix.settings.experimental-features = [ "nix-command" "flakes" ];
+  environment.systemPackages = with pkgs; [
+    vim
+    wget
+    git
+  ];
+
+  # Open ports in the firewall.
+  # networking.firewall.allowedTCPPorts = [ ... ];
+  # networking.firewall.allowedUDPPorts = [ ... ];
+  # Or disable the firewall altogether.
+  networking.firewall.enable = false;
+
+  system.stateVersion = "23.11"; # Did you read the comment?
+}
+
--- a/hosts/testvm02/default.nix
+++ b/hosts/testvm02/default.nix
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
-}
+}
--- a/hosts/jump/hardware-configuration.nix
+++ b/hosts/jump/hardware-configuration.nix
@@ -0,0 +1,36 @@
+{ config, lib, pkgs, modulesPath, ... }:
+
+{
+  imports =
+    [
+      (modulesPath + "/profiles/qemu-guest.nix")
+    ];
+
+  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
+  boot.initrd.kernelModules = [ ];
+  # boot.kernelModules = [ ];
+  # boot.extraModulePackages = [ ];
+
+  fileSystems."/" =
+    {
+      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
+      fsType = "xfs";
+    };
+
+  fileSystems."/boot" =
+    {
+      device = "/dev/disk/by-uuid/BC07-3B7A";
+      fsType = "vfat";
+    };
+
+  swapDevices =
+    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/monitoring01/configuration.nix
+++ b/hosts/monitoring01/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ./hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
    ../../common/vm
@@ -100,6 +100,61 @@
    ];
  };

+  labmon = {
+    enable = true;
+
+    settings = {
+      ListenAddr = ":9969";
+      Profiling = true;
+      StepMonitors = [
+        {
+          Enabled = true;
+          BaseURL = "https://ca.home.2rjus.net";
+          RootID = "3381bda8015a86b9a3cd1851439d1091890a79005e0f1f7c4301fe4bccc29d80";
+        }
+      ];
+
+      TLSConnectionMonitors = [
+        {
+          Enabled = true;
+          Address = "ca.home.2rjus.net:443";
+          Verify = true;
+          Duration = "12h";
+        }
+        {
+          Enabled = true;
+          Address = "jelly.home.2rjus.net:443";
+          Verify = true;
+          Duration = "12h";
+        }
+        {
+          Enabled = true;
+          Address = "grafana.home.2rjus.net:443";
+          Verify = true;
+          Duration = "12h";
+        }
+        {
+          Enabled = true;
+          Address = "prometheus.home.2rjus.net:443";
+          Verify = true;
+          Duration = "12h";
+        }
+        {
+          Enabled = true;
+          Address = "alertmanager.home.2rjus.net:443";
+          Verify = true;
+          Duration = "12h";
+        }
+        {
+          Enabled = true;
+          Address = "pyroscope.home.2rjus.net:443";
+          Verify = true;
+          Duration = "12h";
+        }
+      ];
+    };
+  };
+
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
--- a/hosts/monitoring01/hardware-configuration.nix
+++ b/hosts/monitoring01/hardware-configuration.nix
@@ -1,42 +0,0 @@
-{
-  config,
-  lib,
-  pkgs,
-  modulesPath,
-  ...
-}:
-
-{
-  imports = [
-    (modulesPath + "/profiles/qemu-guest.nix")
-  ];
-  boot.initrd.availableKernelModules = [
-    "ata_piix"
-    "uhci_hcd"
-    "virtio_pci"
-    "virtio_scsi"
-    "sd_mod"
-    "sr_mod"
-  ];
-  boot.initrd.kernelModules = [ "dm-snapshot" ];
-  boot.kernelModules = [
-    "ptp_kvm"
-  ];
-  boot.extraModulePackages = [ ];
-
-  fileSystems."/" = {
-    device = "/dev/disk/by-label/root";
-    fsType = "xfs";
-  };
-
-  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  networking.useDHCP = lib.mkDefault true;
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/nats1/configuration.nix
+++ b/hosts/nats1/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ./hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
    ../../common/vm
@@ -59,8 +59,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  vault.enable = true;
-  homelab.deploy.enable = true;
-
  system.stateVersion = "23.11"; # Did you read the comment?
 }
--- a/hosts/nats1/hardware-configuration.nix
+++ b/hosts/nats1/hardware-configuration.nix
@@ -1,42 +0,0 @@
-{
-  config,
-  lib,
-  pkgs,
-  modulesPath,
-  ...
-}:
-
-{
-  imports = [
-    (modulesPath + "/profiles/qemu-guest.nix")
-  ];
-  boot.initrd.availableKernelModules = [
-    "ata_piix"
-    "uhci_hcd"
-    "virtio_pci"
-    "virtio_scsi"
-    "sd_mod"
-    "sr_mod"
-  ];
-  boot.initrd.kernelModules = [ "dm-snapshot" ];
-  boot.kernelModules = [
-    "ptp_kvm"
-  ];
-  boot.extraModulePackages = [ ];
-
-  fileSystems."/" = {
-    device = "/dev/disk/by-label/root";
-    fsType = "xfs";
-  };
-
-  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  networking.useDHCP = lib.mkDefault true;
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/nix-cache01/configuration.nix
+++ b/hosts/nix-cache01/configuration.nix
@@ -5,7 +5,7 @@

 {
  imports = [
-    ./hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
    ../../common/vm
--- a/hosts/nix-cache01/default.nix
+++ b/hosts/nix-cache01/default.nix
@@ -4,5 +4,6 @@
    ./configuration.nix
    ../../services/nix-cache
    ../../services/actions-runner
+    ./zram.nix
  ];
 }
--- a/hosts/nix-cache01/hardware-configuration.nix
+++ b/hosts/nix-cache01/hardware-configuration.nix
@@ -1,42 +0,0 @@
-{
-  config,
-  lib,
-  pkgs,
-  modulesPath,
-  ...
-}:
-
-{
-  imports = [
-    (modulesPath + "/profiles/qemu-guest.nix")
-  ];
-  boot.initrd.availableKernelModules = [
-    "ata_piix"
-    "uhci_hcd"
-    "virtio_pci"
-    "virtio_scsi"
-    "sd_mod"
-    "sr_mod"
-  ];
-  boot.initrd.kernelModules = [ "dm-snapshot" ];
-  boot.kernelModules = [
-    "ptp_kvm"
-  ];
-  boot.extraModulePackages = [ ];
-
-  fileSystems."/" = {
-    device = "/dev/disk/by-label/root";
-    fsType = "xfs";
-  };
-
-  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  networking.useDHCP = lib.mkDefault true;
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/nix-cache01/zram.nix
+++ b/hosts/nix-cache01/zram.nix
@@ -0,0 +1,6 @@
+{ ... }:
+{
+  zramSwap = {
+    enable = true;
+  };
+}
--- a/hosts/ns1/configuration.nix
+++ b/hosts/ns1/configuration.nix
@@ -7,38 +7,23 @@

 {
  imports = [
-    ../template2/hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
-    ../../common/vm
-
-    # DNS services
    ../../services/ns/master-authorative.nix
    ../../services/ns/resolver.nix
+    ../../common/vm
  ];

-  # Host metadata
-  homelab.host = {
-    tier = "prod";
-    role = "dns";
-    labels.dns_role = "primary";
-  };
-
-  # Enable Vault integration
-  vault.enable = true;
-
-  # Enable remote deployment via NATS
-  homelab.deploy.enable = true;
-
  nixpkgs.config.allowUnfree = true;
+  # Use the GRUB boot loader.
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/vda";
+  boot.loader.grub.device = "/dev/sda";

  networking.hostName = "ns1";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
-  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -62,6 +47,14 @@
    "nix-command"
    "flakes"
  ];
+  vault.enable = true;
+  homelab.deploy.enable = true;
+
+  homelab.host = {
+    role = "dns";
+    labels.dns_role = "primary";
+  };
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
@@ -75,5 +68,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "25.11"; # Did you read the comment?
-}
+  system.stateVersion = "23.11"; # Did you read the comment?
+}
--- a/hosts/ns1/default.nix
+++ b/hosts/ns1/default.nix
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
-}
+}
--- a/hosts/ns1/hardware-configuration.nix
+++ b/hosts/ns1/hardware-configuration.nix
@@ -0,0 +1,36 @@
+{ config, lib, pkgs, modulesPath, ... }:
+
+{
+  imports =
+    [
+      (modulesPath + "/profiles/qemu-guest.nix")
+    ];
+
+  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
+  boot.initrd.kernelModules = [ ];
+  # boot.kernelModules = [ ];
+  # boot.extraModulePackages = [ ];
+
+  fileSystems."/" =
+    {
+      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
+      fsType = "xfs";
+    };
+
+  fileSystems."/boot" =
+    {
+      device = "/dev/disk/by-uuid/BC07-3B7A";
+      fsType = "vfat";
+    };
+
+  swapDevices =
+    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/ns2/configuration.nix
+++ b/hosts/ns2/configuration.nix
@@ -7,38 +7,23 @@

 {
  imports = [
-    ../template2/hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
-    ../../common/vm
-
-    # DNS services
    ../../services/ns/secondary-authorative.nix
    ../../services/ns/resolver.nix
+    ../../common/vm
  ];

-  # Host metadata
-  homelab.host = {
-    tier = "prod";
-    role = "dns";
-    labels.dns_role = "secondary";
-  };
-
-  # Enable Vault integration
-  vault.enable = true;
-
-  # Enable remote deployment via NATS
-  homelab.deploy.enable = true;
-
  nixpkgs.config.allowUnfree = true;
+  # Use the GRUB boot loader.
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/vda";
+  boot.loader.grub.device = "/dev/sda";

  networking.hostName = "ns2";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
-  # Disable resolved - conflicts with Unbound resolver
  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
@@ -62,7 +47,14 @@
    "nix-command"
    "flakes"
  ];
-  nix.settings.tarball-ttl = 0;
+  vault.enable = true;
+  homelab.deploy.enable = true;
+
+  homelab.host = {
+    role = "dns";
+    labels.dns_role = "secondary";
+  };
+
  environment.systemPackages = with pkgs; [
    vim
    wget
@@ -75,5 +67,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "25.11"; # Did you read the comment?
-}
+  system.stateVersion = "23.11"; # Did you read the comment?
+}
--- a/hosts/ns2/default.nix
+++ b/hosts/ns2/default.nix
@@ -2,4 +2,4 @@
  imports = [
    ./configuration.nix
  ];
-}
+}
--- a/hosts/ns2/hardware-configuration.nix
+++ b/hosts/ns2/hardware-configuration.nix
@@ -0,0 +1,36 @@
+{ config, lib, pkgs, modulesPath, ... }:
+
+{
+  imports =
+    [
+      (modulesPath + "/profiles/qemu-guest.nix")
+    ];
+
+  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
+  boot.initrd.kernelModules = [ ];
+  # boot.kernelModules = [ ];
+  # boot.extraModulePackages = [ ];
+
+  fileSystems."/" =
+    {
+      device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
+      fsType = "xfs";
+    };
+
+  fileSystems."/boot" =
+    {
+      device = "/dev/disk/by-uuid/BC07-3B7A";
+      fsType = "vfat";
+    };
+
+  swapDevices =
+    [{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/testvm02/configuration.nix
+++ b/hosts/testvm02/configuration.nix
@@ -1,38 +1,25 @@
 {
-  config,
-  lib,
  pkgs,
  ...
 }:

 {
  imports = [
-    ../template2/hardware-configuration.nix
+    ../template/hardware-configuration.nix

    ../../system
    ../../common/vm
-    ../../common/ssh-audit.nix
  ];

-  # Host metadata (adjust as needed)
-  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+  nixpkgs.config.allowUnfree = true;
+  # Use the GRUB boot loader.
+  boot.loader.grub = {
+    enable = true;
+    device = "/dev/sda";
+    configurationLimit = 3;
  };

-  # Enable Vault integration
-  vault.enable = true;
-
-  # Enable remote deployment via NATS
-  homelab.deploy.enable = true;
-
-  # Enable Kanidm PAM/NSS for central authentication
-  homelab.kanidm.enable = true;
-
-  nixpkgs.config.allowUnfree = true;
-  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/vda";
-
-  networking.hostName = "testvm02";
+  networking.hostName = "pgdb1";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -46,7 +33,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.21/24"
+      "10.69.13.16/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -72,5 +59,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "25.11"; # Did you read the comment?
-}
+  system.stateVersion = "23.11"; # Did you read the comment?
+}
--- a/hosts/pgdb1/default.nix
+++ b/hosts/pgdb1/default.nix
@@ -0,0 +1,7 @@
+{ ... }:
+{
+  imports = [
+    ./configuration.nix
+    ../../services/postgres
+  ];
+}
--- a/hosts/template/configuration.nix
+++ b/hosts/template/configuration.nix
@@ -1,38 +1,25 @@
-{
-  config,
-  lib,
-  pkgs,
-  ...
-}:
+{ config, lib, pkgs, ... }:

 {
-  imports = [
-    ../template2/hardware-configuration.nix
+  imports =
+    [
+      ./hardware-configuration.nix

-    ../../system
-    ../../common/vm
-    ../../common/ssh-audit.nix
-  ];
+      ../../system
+    ];
+
+  # Template host - exclude from DNS zone generation
+  homelab.dns.enable = false;

-  # Host metadata (adjust as needed)
  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+    tier = "test";
+    priority = "low";
  };

-  # Enable Vault integration
-  vault.enable = true;

-  # Enable remote deployment via NATS
-  homelab.deploy.enable = true;
-
-  # Enable Kanidm PAM/NSS for central authentication
-  homelab.kanidm.enable = true;
-
-  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
-  boot.loader.grub.device = "/dev/vda";
-
-  networking.hostName = "testvm03";
+  boot.loader.grub.device = "/dev/sda";
+  networking.hostName = "nixos-template";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
@@ -46,21 +33,19 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.22/24"
+      "10.69.8.250/24"
    ];
    routes = [
-      { Gateway = "10.69.13.1"; }
+      { Gateway = "10.69.8.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+  nix.settings.experimental-features = [ "nix-command" "flakes" ];
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
+    age
    vim
    wget
    git
@@ -72,5 +57,6 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  system.stateVersion = "25.11"; # Did you read the comment?
-}
+  system.stateVersion = "23.11"; # Did you read the comment?
+}
+
--- a/hosts/template/default.nix
+++ b/hosts/template/default.nix
@@ -0,0 +1,7 @@
+{ ... }: {
+  imports = [
+    ./hardware-configuration.nix
+    ./configuration.nix
+    ./scripts.nix
+  ];
+}
--- a/hosts/template/hardware-configuration.nix
+++ b/hosts/template/hardware-configuration.nix
--- a/hosts/template/scripts.nix
+++ b/hosts/template/scripts.nix
@@ -0,0 +1,36 @@
+{ pkgs, ... }:
+let
+  prepare-host-script = pkgs.writeShellApplication {
+    name = "prepare-host.sh";
+    runtimeInputs = [ pkgs.age ];
+    text = ''
+      echo "Removing machine-id"
+      rm -f /etc/machine-id || true
+
+      echo "Removing SSH host keys"
+      rm -f /etc/ssh/ssh_host_* || true
+
+      echo "Restarting SSH"
+      systemctl restart sshd
+
+      echo "Removing temporary files"
+      rm -rf /tmp/* || true
+
+      echo "Removing logs"
+      journalctl --rotate || true
+      journalctl --vacuum-time=1s || true
+
+      echo "Removing cache"
+      rm -rf /var/cache/* || true
+
+      echo "Generating age key"
+      rm -rf /var/lib/sops-nix || true
+      mkdir -p /var/lib/sops-nix
+      age-keygen -o /var/lib/sops-nix/key.txt
+    '';
+  };
+in
+{
+  environment.systemPackages = [ prepare-host-script ];
+  users.motd = "Prepare host by running 'prepare-host.sh'.";
+}
--- a/hosts/template2/bootstrap.nix
+++ b/hosts/template2/bootstrap.nix
@@ -6,72 +6,22 @@ let
    text = ''
      set -euo pipefail

-      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
-
-      # Send a log entry to Loki with bootstrap status
-      # Usage: log_to_loki <stage> <message>
-      # Fails silently if Loki is unreachable
-      log_to_loki() {
-        local stage="$1"
-        local message="$2"
-        local timestamp_ns
-        timestamp_ns="$(date +%s)000000000"
-
-        local payload
-        payload=$(jq -n \
-          --arg host "$HOSTNAME" \
-          --arg stage "$stage" \
-          --arg branch "''${BRANCH:-master}" \
-          --arg ts "$timestamp_ns" \
-          --arg msg "$message" \
-          '{
-            streams: [{
-              stream: {
-                job: "bootstrap",
-                host: $host,
-                stage: $stage,
-                branch: $branch
-              },
-              values: [[$ts, $msg]]
-            }]
-          }')
-
-        curl -s --connect-timeout 2 --max-time 5 \
-          -X POST \
-          -H "Content-Type: application/json" \
-          -d "$payload" \
-          "$LOKI_URL" >/dev/null 2>&1 || true
-      }
-
-      echo "================================================================================"
-      echo "                     NIXOS BOOTSTRAP IN PROGRESS"
-      echo "================================================================================"
-      echo ""
-
      # Read hostname set by cloud-init (from Terraform VM name via user-data)
      # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
      HOSTNAME=$(hostnamectl hostname)
-      # Read git branch from environment, default to master
-      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
+      echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"

-      echo "Hostname: $HOSTNAME"
-      echo ""
      echo "Starting NixOS bootstrap for host: $HOSTNAME"
-
-      log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"
-
      echo "Waiting for network connectivity..."

      # Verify we can reach the git server via HTTPS (doesn't respond to ping)
      if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
        echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
        echo "Check network configuration and DNS settings"
-        log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
        exit 1
      fi

      echo "Network connectivity confirmed"
-      log_to_loki "network_ok" "Network connectivity confirmed"

      # Unwrap Vault token and store AppRole credentials (if provided)
      if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -100,7 +50,6 @@ let
          chmod 600 /var/lib/vault/approle/secret-id

          echo "Vault credentials unwrapped and stored successfully"
-          log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
        else
          echo "WARNING: Failed to unwrap Vault token"
          if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -114,17 +63,17 @@ let
          echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
          echo ""
          echo "Vault secrets will not be available, but continuing bootstrap..."
-          log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
        fi
      else
        echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
        echo "Skipping Vault credential setup"
-        log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
      fi

      echo "Fetching and building NixOS configuration from flake..."
+
+      # Read git branch from environment, default to master
+      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
      echo "Using git branch: $BRANCH"
-      log_to_loki "building" "Starting nixos-rebuild boot"

      # Build and activate the host-specific configuration
      FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
@@ -132,30 +81,18 @@ let
      if nixos-rebuild boot --flake "$FLAKE_URL"; then
        echo "Successfully built configuration for $HOSTNAME"
        echo "Rebooting into new configuration..."
-        log_to_loki "success" "Build successful - rebooting into new configuration"
        sleep 2
        systemctl reboot
      else
        echo "ERROR: nixos-rebuild failed for $HOSTNAME"
        echo "Check that flake has configuration for this hostname"
        echo "Manual intervention required - system will not reboot"
-        log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
        exit 1
      fi
    '';
  };
 in
 {
-  # Custom greeting line to indicate this is a bootstrap image
-  services.getty.greetingLine = lib.mkForce ''
-    ================================================================================
-                          BOOTSTRAP IMAGE - NixOS \V (\l)
-    ================================================================================
-
-    Bootstrap service is running. Logs are displayed on tty1.
-    Check status: journalctl -fu nixos-bootstrap
-  '';
-
  systemd.services."nixos-bootstrap" = {
    description = "Bootstrap NixOS configuration from flake on first boot";

@@ -170,12 +107,12 @@ in
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
-      ExecStart = lib.getExe bootstrap-script;
+      ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";

      # Read environment variables from cloud-init (set by cloud-init write_files)
      EnvironmentFile = "-/run/cloud-init-env";

-      # Log to journal and console
+      # Log to journald and the console
      StandardOutput = "journal+console";
      StandardError = "journal+console";
    };
--- a/hosts/template2/configuration.nix
+++ b/hosts/template2/configuration.nix
@@ -58,14 +58,6 @@
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
-  nix.settings.substituters = [
-    "https://nix-cache.home.2rjus.net"
-    "https://cache.nixos.org"
-  ];
-  nix.settings.trusted-public-keys = [
-    "nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
-    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
-  ];
  environment.systemPackages = with pkgs; [
    age
    vim
@@ -79,8 +71,5 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  # Compressed swap in RAM - prevents OOM during bootstrap nixos-rebuild
-  zramSwap.enable = true;
-
  system.stateVersion = "25.11";
 }
--- a/hosts/template2/scripts.nix
+++ b/hosts/template2/scripts.nix
@@ -2,6 +2,7 @@
 let
  prepare-host-script = pkgs.writeShellApplication {
    name = "prepare-host.sh";
+    runtimeInputs = [ pkgs.age ];
    text = ''
      echo "Removing machine-id"
      rm -f /etc/machine-id || true
@@ -21,6 +22,11 @@ let

      echo "Removing cache"
      rm -rf /var/cache/* || true
+
+      echo "Generating age key"
+      rm -rf /var/lib/sops-nix || true
+      mkdir -p /var/lib/sops-nix
+      age-keygen -o /var/lib/sops-nix/key.txt
    '';
  };
 in
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -11,23 +11,16 @@

    ../../system
    ../../common/vm
-    ../../common/ssh-audit.nix
  ];

-  # Host metadata (adjust as needed)
+  # Test VM - exclude from DNS zone generation
+  homelab.dns.enable = false;
+
  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+    tier = "test";
+    priority = "low";
  };

-  # Enable Vault integration
-  vault.enable = true;
-
-  # Enable remote deployment via NATS
-  homelab.deploy.enable = true;
-
-  # Enable Kanidm PAM/NSS for central authentication
-  homelab.kanidm.enable = true;
-
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
@@ -36,7 +29,7 @@
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
-  services.resolved.enable = true;
+  services.resolved.enable = false;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
@@ -46,7 +39,7 @@
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
-      "10.69.13.20/24"
+      "10.69.13.101/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
@@ -66,39 +59,6 @@
    git
  ];

-  # Test nginx with ACME certificate from OpenBao PKI
-  services.nginx = {
-    enable = true;
-    virtualHosts."testvm01.home.2rjus.net" = {
-      forceSSL = true;
-      enableACME = true;
-      locations."/" = {
-        root = pkgs.writeTextDir "index.html" ''
-          <!DOCTYPE html>
-          <html>
-          <head>
-            <title>testvm01 - ACME Test</title>
-            <style>
-              body { font-family: monospace; max-width: 600px; margin: 50px auto; padding: 20px; }
-              .joke { background: #f0f0f0; padding: 20px; border-radius: 8px; margin: 20px 0; }
-              .punchline { margin-top: 15px; font-weight: bold; }
-            </style>
-          </head>
-          <body>
-            <h1>OpenBao PKI ACME Test</h1>
-            <p>If you're seeing this over HTTPS, the migration worked!</p>
-            <div class="joke">
-              <p>Why do programmers prefer dark mode?</p>
-              <p class="punchline">Because light attracts bugs.</p>
-            </div>
-            <p><small>Certificate issued by: vault.home.2rjus.net</small></p>
-          </body>
-          </html>
-        '';
-      };
-    };
-  };
-
  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
--- a/hosts/vault01/configuration.nix
+++ b/hosts/vault01/configuration.nix
@@ -62,16 +62,6 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

-  # Vault fetches secrets from itself (after unseal)
-  vault.enable = true;
-  homelab.deploy.enable = true;
-
-  # Ensure vault-secret services wait for openbao to be unsealed
-  systemd.services.vault-secret-homelab-deploy-nkey = {
-    after = [ "openbao.service" ];
-    wants = [ "openbao.service" ];
-  };
-
  system.stateVersion = "25.11"; # Did you read the comment?
 }

--- a/hosts/vaulttest01/configuration.nix
+++ b/hosts/vaulttest01/configuration.nix
@@ -0,0 +1,134 @@
+{
+  config,
+  lib,
+  pkgs,
+  ...
+}:
+
+let
+  vault-test-script = pkgs.writeShellApplication {
+    name = "vault-test";
+    text = ''
+      echo "=== Vault Secret Test ==="
+      echo "Secret path: hosts/vaulttest01/test-service"
+
+      if [ -f /run/secrets/test-service/password ]; then
+        echo "✓ Password file exists"
+        echo "Password length: $(wc -c < /run/secrets/test-service/password)"
+      else
+        echo "✗ Password file missing!"
+        exit 1
+      fi
+
+      if [ -d /var/lib/vault/cache/test-service ]; then
+        echo "✓ Cache directory exists"
+      else
+        echo "✗ Cache directory missing!"
+        exit 1
+      fi
+
+      echo "Test successful!"
+    '';
+  };
+in
+{
+  imports = [
+    ../template2/hardware-configuration.nix
+
+    ../../system
+    ../../common/vm
+  ];
+
+  homelab.host = {
+    tier = "test";
+    priority = "low";
+    role = "vault";
+  };
+
+  nixpkgs.config.allowUnfree = true;
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+
+  networking.hostName = "vaulttest01";
+  networking.domain = "home.2rjus.net";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  services.resolved.enable = true;
+  networking.nameservers = [
+    "10.69.13.5"
+    "10.69.13.6"
+  ];
+
+  systemd.network.enable = true;
+  systemd.network.networks."ens18" = {
+    matchConfig.Name = "ens18";
+    address = [
+      "10.69.13.150/24"
+    ];
+    routes = [
+      { Gateway = "10.69.13.1"; }
+    ];
+    linkConfig.RequiredForOnline = "routable";
+  };
+  time.timeZone = "Europe/Oslo";
+
+  nix.settings.experimental-features = [
+    "nix-command"
+    "flakes"
+  ];
+  nix.settings.tarball-ttl = 0;
+  environment.systemPackages = with pkgs; [
+    vim
+    wget
+    git
+  ];
+
+  # Open ports in the firewall.
+  # networking.firewall.allowedTCPPorts = [ ... ];
+  # networking.firewall.allowedUDPPorts = [ ... ];
+  # Or disable the firewall altogether.
+  networking.firewall.enable = false;
+
+  # Testing config
+  # Enable Vault secrets management
+  vault.enable = true;
+  homelab.deploy.enable = true;
+
+  # Define a test secret
+  vault.secrets.test-service = {
+    secretPath = "hosts/vaulttest01/test-service";
+    restartTrigger = true;
+    restartInterval = "daily";
+    services = [ "vault-test" ];
+  };
+
+  # Create a test service that uses the secret
+  systemd.services.vault-test = {
+    description = "Test Vault secret fetching";
+    wantedBy = [ "multi-user.target" ];
+    after = [ "vault-secret-test-service.service" ];
+
+    serviceConfig = {
+      Type = "oneshot";
+      RemainAfterExit = true;
+
+      ExecStart = lib.getExe vault-test-script;
+
+      StandardOutput = "journal+console";
+    };
+  };
+
+  # Test ACME certificate issuance from OpenBao PKI
+  # Override the global ACME server (from system/acme.nix) to use OpenBao instead of step-ca
+  security.acme.defaults.server = lib.mkForce "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
+
+  # Request a certificate for this host
+  # Using HTTP-01 challenge with standalone listener on port 80
+  security.acme.certs."vaulttest01.home.2rjus.net" = {
+    listenHTTP = ":80";
+    enableDebugLogs = true;
+  };
+
+  system.stateVersion = "25.11"; # Did you read the comment?
+}
+
--- a/hosts/vaulttest01/default.nix
+++ b/hosts/vaulttest01/default.nix
--- a/lib/monitoring.nix
+++ b/lib/monitoring.nix
@@ -21,7 +21,6 @@ let
      cfg = hostConfig.config;
      monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
      dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
-      hostConfig' = (cfg.homelab or { }).host or { };
      hostname = cfg.networking.hostName;
      networks = cfg.systemd.network.networks or { };

@@ -50,73 +49,20 @@ let
        inherit hostname;
        ip = extractIP firstAddress;
        scrapeTargets = monConfig.scrapeTargets or [ ];
-        # Host metadata for label propagation
-        tier = hostConfig'.tier or "prod";
-        priority = hostConfig'.priority or "high";
-        role = hostConfig'.role or null;
-        labels = hostConfig'.labels or { };
      };

-  # Build effective labels for a host
-  # Always includes hostname; only includes tier/priority/role if non-default
-  buildEffectiveLabels = host:
-    { hostname = host.hostname; }
-    // (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
-    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
-    // (lib.optionalAttrs (host.role != null) { role = host.role; })
-    // host.labels;
-
  # Generate node-exporter targets from all flake hosts
-  # Returns a list of static_configs entries with labels
  generateNodeExporterTargets = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
      hostList = lib.filter (x: x != null) (
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );
-
-      # Extract hostname from a target string like "gunter.home.2rjus.net:9100"
-      extractHostnameFromTarget = target:
-        builtins.head (lib.splitString "." target);
-
-      # Build target entries with labels for each host
-      flakeEntries = map
-        (host: {
-          target = "${host.hostname}.home.2rjus.net:9100";
-          labels = buildEffectiveLabels host;
-        })
-        hostList;
-
-      # External targets get hostname extracted from the target string
-      externalEntries = map
-        (target: {
-          inherit target;
-          labels = { hostname = extractHostnameFromTarget target; };
-        })
-        (externalTargets.nodeExporter or [ ]);
-
-      allEntries = flakeEntries ++ externalEntries;
-
-      # Group entries by their label set for efficient static_configs
-      # Convert labels attrset to a string key for grouping
-      labelKey = entry: builtins.toJSON entry.labels;
-      grouped = lib.groupBy labelKey allEntries;
-
-      # Convert groups to static_configs format
-      # Every flake host now has at least a hostname label
-      staticConfigs = lib.mapAttrsToList
-        (key: entries:
-          let
-            labels = (builtins.head entries).labels;
-          in
-          { targets = map (e: e.target) entries; labels = labels; }
-        )
-        grouped;
+      flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
    in
-    staticConfigs;
+    flakeTargets ++ (externalTargets.nodeExporter or [ ]);

  # Generate scrape configs from all flake hosts and external targets
-  # Host labels are propagated to service targets for semantic alert filtering
  generateScrapeConfigs = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
@@ -124,14 +70,13 @@ let
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );

-      # Collect all scrapeTargets from all hosts, including host labels
+      # Collect all scrapeTargets from all hosts, grouped by job_name
      allTargets = lib.flatten (map
        (host:
          map
            (target: {
              inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
              hostname = host.hostname;
-              hostLabels = buildEffectiveLabels host;
            })
            host.scrapeTargets
        )
@@ -142,32 +87,22 @@ let
      grouped = lib.groupBy (t: t.job_name) allTargets;

      # Generate a scrape config for each job
-      # Within each job, group targets by their host labels for efficient static_configs
      flakeScrapeConfigs = lib.mapAttrsToList
        (jobName: targets:
          let
            first = builtins.head targets;
-
-            # Group targets within this job by their host labels
-            labelKey = t: builtins.toJSON t.hostLabels;
-            groupedByLabels = lib.groupBy labelKey targets;
-
-            # Every flake host now has at least a hostname label
-            staticConfigs = lib.mapAttrsToList
-              (key: labelTargets:
+            targetAddrs = map
+              (t:
                let
-                  labels = (builtins.head labelTargets).hostLabels;
-                  targetAddrs = map
-                    (t: "${t.hostname}.home.2rjus.net:${toString t.port}")
-                    labelTargets;
+                  portStr = toString t.port;
                in
-                { targets = targetAddrs; labels = labels; }
-              )
-              groupedByLabels;
-
+                "${t.hostname}.home.2rjus.net:${portStr}")
+              targets;
            config = {
              job_name = jobName;
-              static_configs = staticConfigs;
+              static_configs = [{
+                targets = targetAddrs;
+              }];
            }
            // (lib.optionalAttrs (first.metrics_path != "/metrics") {
              metrics_path = first.metrics_path;
--- a/playbooks/build-and-deploy-template.yml
+++ b/playbooks/build-and-deploy-template.yml
@@ -99,48 +99,3 @@
    - name: Display success message
      ansible.builtin.debug:
        msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
-
- name: Update Terraform template name
-  hosts: localhost
-  gather_facts: false
-
-  vars:
-    terraform_dir: "{{ playbook_dir }}/../terraform"
-
-  tasks:
-    - name: Get image filename from earlier play
-      ansible.builtin.set_fact:
-        image_filename: "{{ hostvars['localhost']['image_filename'] }}"
-
-    - name: Extract template name from image filename
-      ansible.builtin.set_fact:
-        new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"
-
-    - name: Read current Terraform variables file
-      ansible.builtin.slurp:
-        src: "{{ terraform_dir }}/variables.tf"
-      register: variables_tf_content
-
-    - name: Extract current template name from variables.tf
-      ansible.builtin.set_fact:
-        current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"
-
-    - name: Check if template name has changed
-      ansible.builtin.set_fact:
-        template_name_changed: "{{ current_template_name != new_template_name }}"
-
-    - name: Display template name status
-      ansible.builtin.debug:
-        msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"
-
-    - name: Update default_template_name in variables.tf
-      ansible.builtin.replace:
-        path: "{{ terraform_dir }}/variables.tf"
-        regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
-        replace: '\1"{{ new_template_name }}"'
-      when: template_name_changed
-
-    - name: Display update result
-      ansible.builtin.debug:
-        msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
-      when: template_name_changed
--- a/rebuild-all.sh
+++ b/rebuild-all.sh
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Hosts to rebuild, in order
+HOSTS=(
+    "ns1"
+    "ns2"
+    "ca"
+    "ha1"
+    "http-proxy"
+    "jelly01"
+    "monitoring01"
+    "nix-cache01"
+    "pgdb1"
+)
+
+for host in "${HOSTS[@]}"; do
+    echo "Rebuilding $host"
+    nixos-rebuild boot --flake ".#${host}" --target-host "root@${host}"
+done
--- a/scripts/create-host/create_host.py
+++ b/scripts/create-host/create_host.py
@@ -18,8 +18,6 @@ from manipulators import (
    remove_from_flake_nix,
    remove_from_terraform_vms,
    remove_from_vault_terraform,
-    remove_from_approle_tf,
-    find_host_secrets,
    check_entries_exist,
 )
 from models import HostConfig
@@ -257,10 +255,7 @@ def handle_remove(
        sys.exit(1)

    # Check what entries exist
-    flake_exists, terraform_exists, vault_exists, approle_exists = check_entries_exist(hostname, repo_root)
-
-    # Check for secrets in secrets.tf
-    host_secrets = find_host_secrets(hostname, repo_root)
+    flake_exists, terraform_exists, vault_exists = check_entries_exist(hostname, repo_root)

    # Collect all files in the host directory recursively
    files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()])
@@ -299,25 +294,11 @@ def handle_remove(
    else:
        console.print(f"  • terraform/vault/hosts-generated.tf [dim](not found)[/dim]")

-    if approle_exists:
-        console.print(f'  • terraform/vault/approle.tf (host_policies["{hostname}"])')
-    else:
-        console.print(f"  • terraform/vault/approle.tf [dim](not found)[/dim]")
-
-    # Warn about secrets in secrets.tf
-    if host_secrets:
-        console.print(f"\n[yellow]⚠️  Warning: Found {len(host_secrets)} secret(s) in terraform/vault/secrets.tf:[/yellow]")
-        for secret_path in host_secrets:
-            console.print(f'   • "{secret_path}"')
-        console.print(f"\n   [yellow]These will NOT be removed automatically.[/yellow]")
-        console.print(f"   After removal, manually edit secrets.tf and run:")
-        for secret_path in host_secrets:
-            console.print(f"   [white]vault kv delete secret/{secret_path}[/white]")
-
-    # Warn about legacy secrets directory
+    # Warn about secrets directory
    if secrets_exist:
-        console.print(f"\n[yellow]⚠️  Warning: secrets/{hostname}/ directory exists (legacy SOPS)[/yellow]")
+        console.print(f"\n[yellow]⚠️  Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
        console.print(f"   Manually remove if no longer needed: [white]rm -rf secrets/{hostname}/[/white]")
+        console.print(f"   Also update .sops.yaml to remove the host's age key")

    # Exit if dry run
    if dry_run:
@@ -342,13 +323,6 @@ def handle_remove(
        else:
            console.print("[yellow]⚠[/yellow]  Could not remove from terraform/vault/hosts-generated.tf")

-    # Remove from terraform/vault/approle.tf
-    if approle_exists:
-        if remove_from_approle_tf(hostname, repo_root):
-            console.print("[green]✓[/green] Removed from terraform/vault/approle.tf")
-        else:
-            console.print("[yellow]⚠[/yellow]  Could not remove from terraform/vault/approle.tf")
-
    # Remove from terraform/vms.tf
    if terraform_exists:
        if remove_from_terraform_vms(hostname, repo_root):
@@ -371,34 +345,19 @@ def handle_remove(
    console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n")

    # Display next steps
-    display_removal_next_steps(hostname, vault_exists, approle_exists, host_secrets)
+    display_removal_next_steps(hostname, vault_exists)


-def display_removal_next_steps(hostname: str, had_vault: bool, had_approle: bool, host_secrets: list) -> None:
+def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
    """Display next steps after successful removal."""
-    vault_files = ""
-    if had_vault:
-        vault_files += " terraform/vault/hosts-generated.tf"
-    if had_approle:
-        vault_files += " terraform/vault/approle.tf"
-
+    vault_file = " terraform/vault/hosts-generated.tf" if had_vault else ""
    vault_apply = ""
-    if had_vault or had_approle:
+    if had_vault:
        vault_apply = f"""
 3. Apply Vault changes:
   [white]cd terraform/vault && tofu apply[/white]
 """

-    secrets_cleanup = ""
-    if host_secrets:
-        secrets_cleanup = f"""
-5. Clean up secrets (manual):
-   Edit terraform/vault/secrets.tf to remove entries for {hostname}
-   Then delete from Vault:"""
-        for secret_path in host_secrets:
-            secrets_cleanup += f"\n   [white]vault kv delete secret/{secret_path}[/white]"
-        secrets_cleanup += "\n"
-
    next_steps = f"""[bold cyan]Next Steps:[/bold cyan]

 1. Review changes:
@@ -408,9 +367,9 @@ def display_removal_next_steps(hostname: str, had_vault: bool, had_approle: bool
   [white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white]
 {vault_apply}
 4. Commit changes:
-   [white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_files}
+   [white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_file}
   git commit -m "hosts: remove {hostname}"[/white]
-{secrets_cleanup}"""
+"""
    console.print(Panel(next_steps, border_style="cyan"))


--- a/scripts/create-host/generators.py
+++ b/scripts/create-host/generators.py
@@ -144,7 +144,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {

  backend            = vault_auth_backend.approle.path
  role_name          = each.key
-  token_policies     = ["host-\${each.key}", "homelab-deploy"]
+  token_policies     = ["host-\${each.key}"]
  secret_id_ttl      = 0  # Never expire (wrapped tokens provide time limit)
  token_ttl          = 3600
  token_max_ttl      = 3600
--- a/scripts/create-host/manipulators.py
+++ b/scripts/create-host/manipulators.py
@@ -22,12 +22,12 @@ def remove_from_flake_nix(hostname: str, repo_root: Path) -> bool:
    content = flake_path.read_text()

    # Check if hostname exists
-    hostname_pattern = rf"^        {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
+    hostname_pattern = rf"^      {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
    if not re.search(hostname_pattern, content, re.MULTILINE):
        return False

    # Match the entire block from "hostname = " to "};"
-    replace_pattern = rf"^        {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^        \}};\n"
+    replace_pattern = rf"^      {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^      \}};\n"
    new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)

    if count == 0:
@@ -101,68 +101,7 @@ def remove_from_vault_terraform(hostname: str, repo_root: Path) -> bool:
    return True


-def remove_from_approle_tf(hostname: str, repo_root: Path) -> bool:
-    """
-    Remove host entry from terraform/vault/approle.tf locals.host_policies.
-
-    Args:
-        hostname: Hostname to remove
-        repo_root: Path to repository root
-
-    Returns:
-        True if found and removed, False if not found
-    """
-    approle_path = repo_root / "terraform" / "vault" / "approle.tf"
-
-    if not approle_path.exists():
-        return False
-
-    content = approle_path.read_text()
-
-    # Check if hostname exists in host_policies
-    hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
-    if not re.search(hostname_pattern, content, re.MULTILINE):
-        return False
-
-    # Match the entire block from "hostname" = { to closing }
-    # The block contains paths = [ ... ] and possibly extra_policies = [...]
-    replace_pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
-    new_content, count = re.subn(replace_pattern, "\n", content, flags=re.DOTALL)
-
-    if count == 0:
-        return False
-
-    approle_path.write_text(new_content)
-    return True
-
-
-def find_host_secrets(hostname: str, repo_root: Path) -> list:
-    """
-    Find secrets in terraform/vault/secrets.tf that belong to a host.
-
-    Args:
-        hostname: Hostname to search for
-        repo_root: Path to repository root
-
-    Returns:
-        List of secret paths found (e.g., ["hosts/hostname/test-service"])
-    """
-    secrets_path = repo_root / "terraform" / "vault" / "secrets.tf"
-
-    if not secrets_path.exists():
-        return []
-
-    content = secrets_path.read_text()
-
-    # Find all secret paths matching hosts/{hostname}/
-    pattern = rf'"(hosts/{re.escape(hostname)}/[^"]+)"'
-    matches = re.findall(pattern, content)
-
-    # Return unique paths, preserving order
-    return list(dict.fromkeys(matches))
-
-
-def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool, bool]:
+def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool]:
    """
    Check which entries exist for a hostname.

@@ -171,12 +110,12 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
        repo_root: Path to repository root

    Returns:
-        Tuple of (flake_exists, terraform_vms_exists, vault_generated_exists, approle_exists)
+        Tuple of (flake_exists, terraform_vms_exists, vault_exists)
    """
    # Check flake.nix
    flake_path = repo_root / "flake.nix"
    flake_content = flake_path.read_text()
-    flake_pattern = rf"^        {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
+    flake_pattern = rf"^      {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
    flake_exists = bool(re.search(flake_pattern, flake_content, re.MULTILINE))

    # Check terraform/vms.tf
@@ -192,15 +131,7 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
        vault_content = vault_tf_path.read_text()
        vault_exists = f'"{hostname}"' in vault_content

-    # Check terraform/vault/approle.tf
-    approle_path = repo_root / "terraform" / "vault" / "approle.tf"
-    approle_exists = False
-    if approle_path.exists():
-        approle_content = approle_path.read_text()
-        approle_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
-        approle_exists = bool(re.search(approle_pattern, approle_content, re.MULTILINE))
-
-    return (flake_exists, terraform_exists, vault_exists, approle_exists)
+    return (flake_exists, terraform_exists, vault_exists)


 def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None:
@@ -216,25 +147,32 @@ def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -
    content = flake_path.read_text()

    # Create new entry
-    new_entry = f"""        {config.hostname} = nixpkgs.lib.nixosSystem {{
-          inherit system;
-          specialArgs = {{
-            inherit inputs self;
-          }};
-          modules = commonModules ++ [
-            ./hosts/{config.hostname}
-          ];
+    new_entry = f"""      {config.hostname} = nixpkgs.lib.nixosSystem {{
+        inherit system;
+        specialArgs = {{
+          inherit inputs self sops-nix;
        }};
+        modules = [
+          (
+            {{ config, pkgs, ... }}:
+            {{
+              nixpkgs.overlays = commonOverlays;
+            }}
+          )
+          ./hosts/{config.hostname}
+          sops-nix.nixosModules.sops
+        ];
+      }};
 """

    # Check if hostname already exists
-    hostname_pattern = rf"^        {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem"
+    hostname_pattern = rf"^      {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem"
    existing_match = re.search(hostname_pattern, content, re.MULTILINE)

    if existing_match and force:
        # Replace existing entry
        # Match the entire block from "hostname = " to "};"
-        replace_pattern = rf"^        {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^        \}};\n"
+        replace_pattern = rf"^      {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^      \}};\n"
        new_content, count = re.subn(replace_pattern, new_entry, content, flags=re.MULTILINE | re.DOTALL)

        if count == 0:
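
Review note on the `replace_pattern` change above: the non-greedy `.*?` combined with `re.MULTILINE | re.DOTALL` stops at the first line that is exactly six spaces followed by `};`, so nested closers at deeper indentation (such as the `specialArgs` block) are skipped. The following is a minimal sketch of that behaviour, not the script's actual code. `replace_host_block`, `SAMPLE_FLAKE`, and `NEW_ALPHA` are hypothetical names introduced for illustration, and the sample flake text is invented. The sketch passes the replacement through a function so that any backslashes in the new entry are not interpreted as regex escapes, which is an assumption about what is desirable rather than something the diff does.

```python
import re
from typing import Tuple


def replace_host_block(content: str, hostname: str, new_entry: str) -> Tuple[str, int]:
    """Replace one six-space-indented `hostname = nixpkgs.lib.nixosSystem { ... };` block."""
    pattern = (
        rf"^      {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{"
        rf".*?^      \}};\n"
    )
    # A callable replacement inserts new_entry verbatim (no backslash processing).
    return re.subn(pattern, lambda _m: new_entry, content,
                   flags=re.MULTILINE | re.DOTALL)


# Hypothetical flake fragment: two hosts, the first with a nested attribute set.
SAMPLE_FLAKE = """\
    nixosConfigurations = {
      alpha = nixpkgs.lib.nixosSystem {
        inherit system;
        specialArgs = {
          inherit inputs self;
        };
        modules = [ ./hosts/alpha ];
      };
      beta = nixpkgs.lib.nixosSystem {
        inherit system;
        modules = [ ./hosts/beta ];
      };
    };
"""

NEW_ALPHA = """\
      alpha = nixpkgs.lib.nixosSystem {
        inherit system;
        modules = [ ./hosts/alpha ];
      };
"""

updated, count = replace_host_block(SAMPLE_FLAKE, "alpha", NEW_ALPHA)
print(count)
print(updated)
```

The match depends entirely on indentation, so it only holds while every host entry in `flake.nix` keeps the six-space layout that this commit switches to.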
--- a/scripts/create-host/templates/configuration.nix.j2
+++ b/scripts/create-host/templates/configuration.nix.j2
@@ -18,12 +18,6 @@
    tier = "test";  # Start in test tier, move to prod after validation
  };

-  # Enable Vault integration
-  vault.enable = true;
-
-  # Enable remote deployment via NATS
-  homelab.deploy.enable = true;
-
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
--- a/scripts/create-host/validators.py
+++ b/scripts/create-host/validators.py
@@ -140,22 +140,20 @@ def validate_ip_unique(ip: Optional[str], repo_root: Path) -> None:
    ip_part = ip.split("/")[0]

    # Check all hosts/*/configuration.nix files
-    # Search for IP with CIDR notation to match static IP assignments
-    # (e.g., "10.69.13.5/24") but not DNS resolver entries (e.g., "10.69.13.5")
    hosts_dir = repo_root / "hosts"
    if hosts_dir.exists():
        for config_file in hosts_dir.glob("*/configuration.nix"):
            content = config_file.read_text()
-            if ip in content:
+            if ip_part in content:
                raise ValueError(
                    f"IP address {ip_part} already in use in {config_file}"
                )

-    # Check terraform/vms.tf - search for full IP with CIDR
+    # Check terraform/vms.tf
    terraform_file = repo_root / "terraform" / "vms.tf"
    if terraform_file.exists():
        content = terraform_file.read_text()
-        if ip in content:
+        if ip_part in content:
            raise ValueError(
                f"IP address {ip_part} already in use in {terraform_file}"
            )
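
Review note on the `ip_part in content` check above: a plain substring test also matches when the address is a prefix of a longer one, for example `10.69.13.5` inside `10.69.13.50`. The following is a minimal sketch of a boundary-aware check, assuming that is not the intended behaviour. `ip_in_use` is a hypothetical helper, not part of `validators.py`, and it does not restore the CIDR-only matching that the removed comment described, so DNS resolver entries would still count as a match.

```python
import re


def ip_in_use(ip_part: str, content: str) -> bool:
    """Return True if ip_part appears in content as a whole IPv4 address."""
    pattern = (
        r"(?<![0-9.])"        # not preceded by a digit or dot
        + re.escape(ip_part)
        + r"(?![0-9])"        # not followed by another digit (e.g. .5 vs .50)
        + r"(?!\.[0-9])"      # not followed by a further dotted octet
    )
    return re.search(pattern, content) is not None


# Hypothetical configuration snippets for illustration.
print(ip_in_use("10.69.13.5", 'address = "10.69.13.5/24";'))
print(ip_in_use("10.69.13.5", 'address = "10.69.13.50/24";'))
```

A stricter variant could parse candidate strings with the standard-library `ipaddress` module and compare address objects instead of text.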
--- a/secrets/ca/keys/intermediate_ca_key
+++ b/secrets/ca/keys/intermediate_ca_key
@@ -0,0 +1,24 @@
+{
+	"data": "ENC[AES256_GCM,data:TgGIuklFPUSCBosD86NFnkAtRvYijQNQP4vvTkKu3dRAOjdDa2li5djZDUS4NEEPEihpOcMXqHBb+ABk3LmoU5nLmsKCeylUp7+DhcGi9f3xw2h1zbHV37mt40OVLTF3cYufRdydIkCGQA3td3q1ue/wCna2ewe73xwGg5j6ZVJCZAtW4VCNZM+rcG+YxPUC0gmBH59+O0VSrZrkvSnifbr+K0dGwg4i17KwAukI4Ac7YMkQoeuAPXq38+ZftlRx4tq9xBUko6wpPY9zOaFzeagWYMF0n1UYqDt+/3XZI/mukPhJc9tzbWneqgkQBOx3OiDwrNglCHvEpnb+bZePIRLOnNHd1ShETgBqhsHGp9OAwwbAt4tO+HFpCQtVz7s2LWQFLbWiN0SCGzYUkFGCgoXae5H58lxFav8=,iv:UzaWlJ+M+VQx3CcPSGbFZh5/rGbKpS2Rq2XVZAIDFiQ=,tag:F3waoAMuEKTvN2xANReSww==,type:str]",
+	"sops": {
+		"kms": null,
+		"gcp_kms": null,
+		"azure_kv": null,
+		"hc_vault": null,
+		"age": [
+			{
+				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBpRGZSVHRSMGlyazAwQU5j\nd1o1L0Y1ckhQMkh4MVZiRmZlR2ozcmdsUW1vCk4xZ1ZibDBrUWZhYmxVVjBUczRn\nYlJtUWF3Y1lHWG56NkhmK2JOUHVGajQKLS0tIDN2S2doQURpTis2U3lWV0NxdWEz\ncjNZaEl1dEQwOXhsNE9xbHhYUzNTV3cKVmVIe05JwgXKSku7AJmrujYXrbBSbpBJ\nnqCuDIhok1w/fiff+XXn8udbgPVq5bC2SOhHbtVxImgBCFzrj5hQ0A==\n-----END AGE ENCRYPTED FILE-----\n"
+			},
+			{
+				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA4V3NaUEdvMmJvakQ0L1F0\nUnkvQ2F5dEVlZ2pMdlBZcjJac0tERnF5ZWljCmFrdU1NZ29jMkJ1a1ZLdURmVWI0\ncm1vNytFVzZjbVY2aVd2N3laMWNRNFEKLS0tIGgzOTFZY0lxc0JyVmd5cFBlNkRr\nVDBWc0t4c3pVV3RhSTB1UUVpNHd6NUkKNn6Sxb5oxP7iWqTF1+X9nOiYum3U+Rzk\nkryxVnf9EvQIVIFKDaTb+yAEO8otjqj+C4mHA9fannnNEJduOiPWOg==\n-----END AGE ENCRYPTED FILE-----\n"
+			}
+		],
+		"lastmodified": "2024-11-30T13:18:08Z",
+		"mac": "ENC[AES256_GCM,data:9R9RJzPMr9Bv8aeCDxhExTfbr+R2hjap6FGSk5QxBdbNpOcNS78ica0CLEmkAYVAfjmx/X2jC5ZnsAueSPUK7nAgNX2gJXbUTpY0F+oKt35GJziLrFLl3u/ahpF9lQ50EL9OqqgS+igDqtodJhKme5DXH5/GXQHhz++O3VZkR78=,iv:XgN3PiowiEosi2DmrjP82HhJMvnwaV530tsBE8GQfjs=,tag:U243BrtH7H/DU9LcjN/MMg==,type:str]",
+		"pgp": null,
+		"unencrypted_suffix": "_unencrypted",
+		"version": "3.9.1"
+	}
+}
--- a/secrets/ca/keys/root_ca_key
+++ b/secrets/ca/keys/root_ca_key
@@ -0,0 +1,24 @@
+{
+	"data": "ENC[AES256_GCM,data:5AePh5uXcUseYBGWvlztgmg8mGBGy3ngKRa6+QxOaT0/fzSB1pKkaMtZJo76tV9wwjdL6/b6VVUI7GIaCBD5kgdZuA8RdBTXguHyjjdxAlI9xcrQaWWdATd8JJt+eQp/m2Y+0dioyXKaDV2ukI3GtHYjp/ixMoHHWEocnEEb40wG6c3CZcvsLWJvKTkFc2OvcjcU2RTfuNlYtEETidiD9iC/dtCakNQHmLP1UFYgcn0ebXBKmlqD6+x2o7BVT1SLwVCyGNvH3eKA2AWvddZChnhaNCUIXcRwBFCgS8lPs4iXhAhly+nwuj7ssFpuu3sjm5pq196tRS8WQl2iNUEJ2tzoOpceg1kZZ7KHX3wCbdBlCRqhy9Q4JMvWPDssO+zz2aU21+BDEySDTCnTYX9Hu2/iFvZejt++mKY=,iv:u/Ukye0BAj2ka++AA72W8WfXJAZZ/YJ3RC/aydxdoUc=,tag:ihTP5bCCigWEPcLFaYOhMA==,type:str]",
+	"sops": {
+		"kms": null,
+		"gcp_kms": null,
+		"azure_kv": null,
+		"hc_vault": null,
+		"age": [
+			{
+				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB0VElDNHArZXlXa2JRQjd0\nQmVIbGpPWk43NDdiTkFtcEd1bDhRdXJWOUY0CndITHdKTFNJQXFOVFdyUGNtQ09k\nN2hnQmFYR0ZORWtxcUN0ZFhsM0U3N2cKLS0tIFh1TTBpMjFIZ2NYM1QxeDRjYlJx\nYkdrUDZmMUpGbjk3REJCVVRpeFk5Z28KJcia0Bk+3ZoifZnRLwqAko526ODPnkSS\nzymtOj/QYTA0++NP3B1aScIyhWITMEZX1iSoWDmgHj8ZQoNMdkM7AQ==\n-----END AGE ENCRYPTED FILE-----\n"
+			},
+			{
+				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBZNlNHRWNEcUZGNXNBMDFR\nTzE5RnNMQUMvU1k2OS9XMlpvUktMRzQ5RmxvCnlCS3lzRVpGUHJLRGZ6SWZ2ZktR\na3l0TVN2NUlRVEQwRHByYkNEMDQyWUkKLS0tIEh3RjBWT3c5K2RWeDRjWFpsU1lP\ncStqY2xta3RSNkR6Vkt5YXhYUTZmbDgKvVKmZc8S/RwurJGsGiJ5LhM4waLO9B9k\n2cawxHmcYM3KfXDFwp9UZWhIwF7SRkG56ZE4OjGI3sOL+74ixnePxA==\n-----END AGE ENCRYPTED FILE-----\n"
+			}
+		],
+		"lastmodified": "2024-11-30T13:18:16Z",
+		"mac": "ENC[AES256_GCM,data:JwjbQ129cYCBNA5Fb8lN9rW7/y4wuVOqLeajIMcYyCzlBcjzCZAV1DKN5n75xMamb/hb1AUkmtp/K82PKM0Vg5X4/lpWTUZXZOzn/TrwHx+yqlJjL9mUdGuHnSY5DwME38Dde3UxdtUa0CVgQOxvMIycW27w8+8NNfO2zxGxkzc=,iv:ZMZASOsqXZOb0NkBqG3GGaqqKgQdjZLiku2yU5QonB8=,tag:/lb/HMxsYOV5XX/5kWnFHA==,type:str]",
+		"pgp": null,
+		"unencrypted_suffix": "_unencrypted",
+		"version": "3.9.1"
+	}
+}
--- a/secrets/ca/keys/ssh_host_ca_key
+++ b/secrets/ca/keys/ssh_host_ca_key
@@ -0,0 +1,24 @@
+{
+	"data": "ENC[AES256_GCM,data:vqQ3HwSmuDlI4UwraLWvwkBSj9zTFeNEWI1xzhVrO/gpx8+WBZOt2F0J7/LSTGAWsWW/9Gov+XXXAOtfnKfjYVzizyT/jE8EQwMuItWiFEVA6hohgwtsk7YKJjXdJIxmiv+WKs73gWb0uFVGh1ArMzsVkGPj1W1AKMFAneDPgsfSCy9aVOMuF8zQwypFC8eaxqOQhLpiN2ncRm8e7khwGurSgYfHDgFghaDr8torgUrZTOPNFk+LEdxB3WcC17+4a8ZyuBapmYdRTrP73czTAuxOF8lMwddJhO99SF7nWuOYVF1FOKLGtK04oKci5/xRIzvWo3I0pGajkxtuF5CyWbd1KblcPfBALIU/J5hU/puGJ7M2sE/qsg/4kaTFxnhq32rPZj291jFb4evDdOhVodfC1axOQUbzAC0=,iv:yOeQ384ikqgDqfthl7GIVSIMNA/n0BYTSIqFN3T9MAY=,tag:Y6nhOCrkWx7MnVpEeKN0Jg==,type:str]",
+	"sops": {
+		"kms": null,
+		"gcp_kms": null,
+		"azure_kv": null,
+		"hc_vault": null,
+		"age": [
+			{
+				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBFTjRMWlNtYVQ2WnJEaGFN\nVFU2TXRTK2FHREpqREhOWHBKemxNc2U4WW44CnV4OWlBdXlFUWhJYi9jTTRuUWJV\nOWFPV2I4UytDRFo3blN3bUtFQ1NGU0kKLS0tIGp2VHlDc1JMMUdDUjlNNDFwUUxj\nVnhHbCtrNVNpZXo0K2dDVU5YTVJJUEkKk9mVTbzQVGZo3RKDLPDwtENknh+in1Q5\njf4DA1cGDDNzcEIWOOYyS+1mzT9WY8gU0hWqihX/bAx7CVsNUallZw==\n-----END AGE ENCRYPTED FILE-----\n"
+			},
+			{
+				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBrVFNwUGpkOUhkUXFWWERq\nMVdueC9VSE9KbGZkenBVK3NRMjRNVXVmcVRRCjNLa0QzbWVCQks3ZmV3eFVjcEp0\nRmxDSlZIZU1IbEdnbE83WlkxV3VZV1EKLS0tICtsRXArajQ4Um9mNEV5OWZBdS85\nVGFSU2wwODZ3Zm44M3pWcTdDV1dxejQKM2BK5Axb1cF344ea89gkzCLzEX6j4amK\nzxf+boBK7JUX7F6QaPB0sRU8J4Cei9mALz96C8xNHjX00KcD3O2QOA==\n-----END AGE ENCRYPTED FILE-----\n"
+			}
+		],
+		"lastmodified": "2024-11-30T13:18:20Z",
+		"mac": "ENC[AES256_GCM,data:AllgcWxHnr3igPi/JbfJCbEa6hKtmILnAjiaMojRZNO4p6zYSoF0s8lo9XX05/vIrFUo+YaCtsuacv+kfz9f6vQafPn7Vulbh6PeH1VlAmzyVfJOTmHP3YX8ic3uM56A4+III1jOERCFOIcc/CKsnRLFhLCRQRMgtgT0hTl5aPw=,iv:60dOYhoUTu1HIHzY36eJeRZ66/v6JmRRpIW99W2D+CI=,tag:F7nLSFm933K5M+JE4IvNYw==,type:str]",
+		"pgp": null,
+		"unencrypted_suffix": "_unencrypted",
+		"version": "3.9.1"
+	}
+}
--- a/secrets/ca/keys/ssh_user_ca_key
+++ b/secrets/ca/keys/ssh_user_ca_key
@@ -0,0 +1,24 @@
+{
+	"data": "ENC[AES256_GCM,data:YRdPrTLQH0xdWiIzOyjfEGpvfmuj6me6GzZZcauh9bUUywyA1ranDnWqbJYgawQQxIXsq9dhXD0uco+7mmXq2598kF1NI9jh6uLf3k0H494zZOalRBv/k8u9oJDLIiVAkg9eNNLbGX0PMZr/Yue/qdkuXx2Hg9E7bQJwpU/NXF+jKKs+3NmKT5NBlegwAzUs530D4DUoaq5AhvVvdC6a1UcE+KJzQ8pRiz1GjFIxAB7qX+GVwa3yNdLgo2tlAbOzjGtaDfJnhZIHSNEq+4TEhjlF9lCmFCGFDUVupvMOWs0kBywJEzIrDmxmvGHlPj3FfyytPb7qhlsOXDDDS67IoiwluKOnw+sALAG0Iv9LMrDZ3z8MXeEGvRWu0VDMuGXN905/9kGx/A40mPjcfnZvI+qSRIKjER5R8aU=,iv:qiP2Ml59AnK24MBbs7N/HqJIylf+fXGqJAo2N8iFNB0=,tag:0Dj5fVs6OB07kvV4qzuvfw==,type:str]",
+	"sops": {
+		"kms": null,
+		"gcp_kms": null,
+		"azure_kv": null,
+		"hc_vault": null,
+		"age": [
+			{
+				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBUFlvNmRNYUlJSHZYUkpJ\nMEloQXFSdENIWGJVVDNIOVY5MS9SYWRoL0FrCnRJc05wZUZBSDRvMHNUUEhNRXQ4\nTWhYOUp6YUNGZFNWUFRrSmlJM1c4aWcKLS0tIFc1b3NlSEo2eFJhdDgwejRqcHlT\nZE5wN01uaE04cTlIbVJMVWQvQ1pXajgKQ1n6UmP7LEBsnIBXVc0BceOqvwCqQzBP\ncI8C5Io4ILgMjY4dr6sd0SeJG6mfDdiMA+k7c6jqoyZCW/Pkd3LANQ==\n-----END AGE ENCRYPTED FILE-----\n"
+			},
+			{
+				"recipient": "age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBtM2lyeXVzdE9nL1k5L3dC\nTkl2MjhMb1FKMFdCeXFPSmNST0pvOTRUaEVvCmdwMnhjSFFHVFhidmIySS9jMEJu\nNTJpRjdFOWpZZ3ZuZFJwZUUrRFU5NnMKLS0tIDJ1UjdVQkpMNm5Pd01JRnZNOEtr\nb1lpMlBkVHpiT2lYdWtZaUQrRW1HUDgKq/JVMf5gdu6lNEmqY6zU2SymbT+jklem\nnUQ9yieJGF+PanutNW6BCJH8jb/fH+Y6AeJ9S+kKCB4Yi75i4d+oHg==\n-----END AGE ENCRYPTED FILE-----\n"
+			}
+		],
+		"lastmodified": "2024-11-30T13:18:24Z",
+		"mac": "ENC[AES256_GCM,data:6FJTKEdIpCm+Dz7Ua8dZOMZQFaGU0oU/HRP6ly5mWbXCv81LRbZXRBd+5RDY3z9g9nb0PXZrOMNps63F6SKxK52VfzLIOap3UGeMNQn5P4/yyFj7JQHQ5Gjcf2l2z2VZ7NhUdNoSCV/6lwjValbKtids48Q5c3sFX997ZiqIUnY=,iv:nUeyJd/v8d9v7QsLLckziD9K5qjOZKK4vOQJw/ymi18=,tag:6n5EE3oklWdVcedvB2J/zA==,type:str]",
+		"pgp": null,
+		"unencrypted_suffix": "_unencrypted",
+		"version": "3.9.1"
+	}
+}
--- a/secrets/ca/secrets.yaml
+++ b/secrets/ca/secrets.yaml
@@ -0,0 +1,30 @@
+ca_root_pw: ENC[AES256_GCM,data:jS5BHS9i/pOykus5aGsW+w==,iv:aQIU7uXnNKaeNXv1UjRpBoSYcRpHo8RjnvCaIw4yCqc=,tag:lkjGm5/Ve93nizqGDQ0ByA==,type:str]
+sops:
+    kms: []
+    gcp_kms: []
+    azure_kv: []
+    hc_vault: []
+    age:
+        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA5anlORWxJalhRWkJPeGIy
+            OStyVG8vMFRTTEZOWHR3Q3N1UWJQbFlxV3pBCmVKQVM1SlJ2L0JOb3U3cTh3YkZ4
+            WHAxSUpTT1dyRHJHYVd1Qkh1ZWxwYW8KLS0tIEhXeklsSmlGaFlaaWF5L0Nodk5a
+            clZ4M3hFSlFqaEZ0UWREdHpTQ29GVUEKAxj5P05Ilpwis2oKFe54mJX+1LfTwfUv
+            2XRFOrEQbFNcK5WFu46p1mc/AAjKTeHWuvb2Yq43CO+sh1+kqKz0XA==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBaS0dqQ1p4MEE2d2JaeFRx
+            UnB4ejhrS3hLekpqeWJhcEJGdnpzMTZDelVRCmFjVGswd3VtRUloWG1WbWY5N0s3
+            cG9aV2hGU3lFZkkvcUJNWE1rWUIwMmMKLS0tIG1KdlhoQzREWDhPbXVSZVBUQkdE
+            N1hmcEwxWXBIWkQ3a3BrdGhvUFoxbzgKX6hLoz7o/Du6ymrYwmGDkXp2XT+0+7QE
+            YhD5qQzGLVQSh3XM/wWExj2Ue5/gw/NqNziHezOh2r9gQljbHjG2/g==
+            -----END AGE ENCRYPTED FILE-----
+    lastmodified: "2024-10-21T09:12:26Z"
+    mac: ENC[AES256_GCM,data:hfPRIXt/kZJa6lsj7rz+5xGlrWhR/LX895S2d8auP/4t3V//80YE/ofIsHeAY9M7eSFsW9ce2Vp0C/WiCQefVWNaNN7nVAwskCfQ6vTWzs23oYz4NYIeCtZggBG3uGgJxb7ZnAFUJWmLwCxkKTQyoVVnn8i/rUDIBrkilbeLWNI=,iv:lm1HVbWtAifHjqKP0D3sxRadsE9+82ugbA2x54yRBTo=,tag:averxmPLa131lJtFrNxcEA==,type:str]
+    pgp: []
+    unencrypted_suffix: _unencrypted
+    version: 3.9.1
--- a/secrets/http-proxy/wireguard.yaml
+++ b/secrets/http-proxy/wireguard.yaml
@@ -0,0 +1,25 @@
+wg_private_key: ENC[AES256_GCM,data:DlC9txcLkTnb7FoEd249oJV/Ehcp50P8uulbE4rY/xU16fkTlnKvPmYZ7u8=,iv:IsiTzdrh+BNSVgx1mfjpMGNV2J0c88q6AoP0kHX2aGY=,tag:OqFsOIyE71SBD1mcNS/PeQ==,type:str]
+sops:
+    age:
+        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAzdm9HTTN1amwxQ2Z6MUQv
+            dGJ0cEgyaHNOZWtWSWlXNXc5bGhUdSsvVlVzCkJkc3ZQdzlBNDNxb3Avdi96bXFt
+            TExZY29nUDI3RE5vanh6TVBRME1Fa1UKLS0tIG8vSHdCYzkvWmJpd0hNbnRtUmtk
+            aVcwaFJJclZ3YUlUTTNwR2VESmVyZWMKHvKUJBDuNCqacEcRlapetCXHKRb0Js09
+            sqxLfEDwiN2LQQjYHZOmnMfCOt/b2rwXVKEHdTcIsXbdIdKOJwuAIQ==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBEeU01UTc2V1UyZXRadE5I
+            VE1aakVZUEZUNnJxbzJ1K3J1R3ZQdFdMbUhBCjZBMDM3ZkYvQWlyNHBtaDZRWkd4
+            VzY0L3l4N2RNZjJRTDJWZTZyZVhHbW8KLS0tIGVNZ0N0emVmaVRCV09jNmVKRlla
+            cWVSNkJqWHh5c21KcWFac2FlZTVaMTAK1UvfPgZAZYtwiONKIAo5HlaDpN+UT/S/
+            JfPUfjxgRQid8P20Eh/jUepxrDY8iXRZdsUMON+OoQ8mpwoAh5eN1A==
+            -----END AGE ENCRYPTED FILE-----
+    lastmodified: "2025-05-15T18:56:55Z"
+    mac: ENC[AES256_GCM,data:J2kHY7pXBJZ0UuNCZOhkU11M8rDqCYNzY71NyuDRmzzRCC9ZiNIbavyQAWj2Dpk1pjGsYjXsVoZvP7ti1wTFqahpaR/YWI5gmphrzAe32b9qFVEWTC3YTnmItnY0YxQZYehYghspBjnJtfUK0BvZxSb17egpoFnvHmAq+u5dyxg=,iv:/aLg02RLuJZ1bRzZfOD74pJuE7gppCBztQvUEt557mU=,tag:toxHHBuv3WRblyc9Sth6Iw==,type:str]
+    unencrypted_suffix: _unencrypted
+    version: 3.10.2
--- a/secrets/monitoring01/pve-exporter.yaml
+++ b/secrets/monitoring01/pve-exporter.yaml
@@ -0,0 +1,33 @@
+default:
+    user: ENC[AES256_GCM,data:4Zzjm6/e8GCKSPNivnY=,iv:Y3gR+JSH/GLYvkVu3CN4T/chM5mjGjwVPI0iMB4p1t4=,tag:auyG8iWsd/YGjDnnTC21Ew==,type:str]
+    password: ENC[AES256_GCM,data:9cyM9U8VnzXBBA==,iv:YMHNNUoQ9Az5+81Df07tjC+LaEWPHV6frUjd4PZrQOs=,tag:3hKR+BhLJODJp19nn4ppkA==,type:str]
+    verify_ssl: ENC[AES256_GCM,data:Cu5Ucf0=,iv:QFfdV7gDBQ+L2kSZZqlVqCrn9CRg5RNG5DNTFWtVf5Y=,tag:u24ZbpWA65wj3WOwqU1v+g==,type:bool]
+sops:
+    kms: []
+    gcp_kms: []
+    azure_kv: []
+    hc_vault: []
+    age:
+        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuUXdMMG5YaHRJbThQZW9u
+            RHVBbXFiSHNiUWdLTDdPajIyQjN3OGR0dGpzCm9ZVkdNWjhBakU3dVdhRU9kbU81
+            aDlCNzJBQ1hvQ3FnTUk2N2RWQkZpUUEKLS0tIEZacTNqa3FWc2p1NXVtRWhwVExj
+            cUJtYXNjb2Z4QkF4MjlidEZxSUFNa3MKAGHGksPc9oJheSlUQ3ARK5MuR5NFbPmD
+            kmSDSgRmzbarxT8eJnK8/K4ii3hX5E9vGOohUkyc03w4ENsh/dw43g==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBOVGhvdGE5Mzl0ckhBM21D
+            RXJwb09OS25PMGViblViM21wTVZiZWhtWmhFCnAzL1NqeUVyOGZFVDFvdXFPbklQ
+            ZkJPWDVIdUdCdjZGUjcrcmtvak5CWG8KLS0tIDhLUHJNN2VqNy9CdVh0K0N0b0k1
+            RUE4U0E0aGxiRkF0NWdwSEIrQTU4MjgKeOU6bIWO6ke9YcG+1E3brnC21sSQxZ9b
+            SiG2QEnFnTeJ5P50XQoYHqUY3B0qx7nDLvyzatYEi6sDkfLXhmHGbw==
+            -----END AGE ENCRYPTED FILE-----
+    lastmodified: "2024-12-03T16:25:12Z"
+    mac: ENC[AES256_GCM,data:gemq8YpMZQC+gY7lmMM3tfZh9XxL40qdGlLiB2CD4SIG49w0V6E/vY7xygt0WW0zHbhMI9yUIqlRc/PaXn+QfyxJEr3IjaT05rrWUqQAeRP9Zss74Y3NtQehh8fM8SgeyU4j2CQ9f9B/lW9IgdOW/TNgQZVXGg1vXZPEzl7AZ4A=,iv:LG5ojv3hAqk+EvFa/xEn43MBqL457uKFDE3dG5lSgZo=,tag:AxzcUzmdhO411Sw7Vg1itA==,type:str]
+    pgp: []
+    unencrypted_suffix: _unencrypted
+    version: 3.9.1
--- a/secrets/nix-cache01/actions_token_1
+++ b/secrets/nix-cache01/actions_token_1
@@ -0,0 +1,19 @@
+{
+	"data": "ENC[AES256_GCM,data:P84qHFU+xQjwQGK8I1gIdcBsHrskuUg0M1nGMMaA+hFjAdFYUhdhmAN/+y0CO28=,iv:zJtk01zNMTBDQdVtZBTM34CHRaNYDkabolxh7PWGKUI=,tag:8AS80AbZJbh9B3Av3zuI1w==,type:str]",
+	"sops": {
+		"age": [
+			{
+				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBkRFB6QTIyWWdwVkV4ZXNB\nWkdSdEhMc0s4cnByWVZXTGhnSWZ0MTdEUWhJCnFlOFQ5TU1hcE91azVyZXVXRCtu\nZjIxalRLYlEreGZ6ZDNoeXNPaFN4b28KLS0tIHY5WVFXN1k4NFVmUjh6VURkcEpv\ncklGcWVhdTdBRnlOdm1qM2h5SS9UUkEKq2RyxSVymDqcsZ+yiNRujDCwk1WOWYRW\nDa4TRKg3FCe7TcCEPkIaev1aBqjLg9J9c/70SYpUm6Zgeps7v5yl3A==\n-----END AGE ENCRYPTED FILE-----\n"
+			},
+			{
+				"recipient": "age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSArTGVuckp2NlhMZXRNMVhO\naUV3K0h3cmZ5ZGx4Q3dJWHNqZXFJeE1kM0dFCmF4TUFUMm9mTHJlYzlYWVhNa1RH\nR29VNDIrL1IvYUpQYm5SZEYzbWhhbkkKLS0tIEJsK1dwZVdaaHpWQkpOOS90dkhx\nbGhvRXhqdFdqQmhZZmhCdmw4NUtSVG8K3z2do+/cIjAqg6EMJnubOWid1sMeTxvo\nrq6eGJ7YzdgZr2JBVtJdDRtk/KeHXu9In4efbBXwLAPIfn1pU0gm1w==\n-----END AGE ENCRYPTED FILE-----\n"
+			}
+		],
+		"lastmodified": "2025-08-21T19:08:48Z",
+		"mac": "ENC[AES256_GCM,data:5CkO09NIqttb4UZPB9iGym8avhTsMeUkTFTKZJlNGjgB1qWyGQNeKCa50A1+SbBCCWE5EwxoynB1so7bi8vnq7k8CPUHbiWG8rLOJSYHQcZ9Tu7ZGtpeWPcCw1zPWJ/PTBsFVeaT5/ufdx/6ut+sTtRoKHOZZtO9oStHmu/Rlfg=,iv:z9iJJlbvhgxJaART5QoCrqvrqlgoVlGj8jlndCALmKU=,tag:ldjmND4NVVQrHUldLrB4Jg==,type:str]",
+		"unencrypted_suffix": "_unencrypted",
+		"version": "3.10.2"
+	}
+}
--- a/secrets/nix-cache01/cache-secret
+++ b/secrets/nix-cache01/cache-secret
@@ -0,0 +1,19 @@
+{
+	"data": "ENC[AES256_GCM,data:MQkR6FQGHK2AuhOmy2was49RY2XlLO5NwaXnUFzFo5Ata/2ufVoAj4Jvotw/dSrKL7f62A6s+2BPAyWrvACJ+pwYFlfyj3T9bNwhxwZPkEmiHEubJjWSiD6jkSW0gOxbY8ib6g/GbyF8I1cPeYr/hJD5qQ==,iv:eBL2Y3MOt9gYTETUZqsHo1D5hPOHxb4JR6Z/DFlzzqI=,tag:Qqbt39xZvQz/QhsggsArsw==,type:str]",
+	"sops": {
+		"age": [
+			{
+				"recipient": "age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAwZzFXaEsyUkZGNFV0bVlW\nRkpPRHpUK2VwUHpOQXZCUUpoVzFGa3hycnhvCndTN0toVFdoU2E5N3V3UFhTTjU0\nNDByWTkrV0o3T295dE0zS08rVGpyQjAKLS0tIC96M0VEcWpjRk5DMjJnMFB4ZHI3\nM2Jod2x4ZzMyZm1pbDhZNTFuWGNRUlEKHs5jBSfjml09JOeKiT9vFR0Fykg6OxKG\njhFU/J2+fWB22G7dBc4PI60SNqhxIheUbGTdcz4Yp4BPL6vW3eArIw==\n-----END AGE ENCRYPTED FILE-----\n"
+			},
+			{
+				"recipient": "age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq",
+				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBJT3lxamcrQUpFdjZteFlF\nYUQ3aGdadGpuNXd2Z3RtZ3dQU0cvMlFUMUNRClBDR3U0OXZJU0NDamVMSlR5NitN\nYlhvNVlvUE0wRjErYzkwVHFOdGVCVjgKLS0tIEttR1BLTGpDYTRSQ0lUZmVEcnNi\nWkNaMEViUHVBcExVOEpjNE5CZHpjVkEKuX/Rf8kaB3apr1UhAnq3swS6fXiVmwm8\n7Key+SUAPNstbWbz0u6B9m1ev5QcXB2lx2/+Cm7cjW+6VE2gLHjTsQ==\n-----END AGE ENCRYPTED FILE-----\n"
+			}
+		],
+		"lastmodified": "2025-01-24T12:19:16Z",
+		"mac": "ENC[AES256_GCM,data:X8X91LVP1MMJ8ZYeSNPRO6XHN+NuswLZcHpAkbvoY+E9aTteO8UqS+fsStbNDlpF5jz/mhdMsKElnU8Z/CIWImwolI4GGE6blKy6gyqRkn4VeZotUoXcJadYV/5COud3XP2uSTb694JyQEZnBXFNeYeiHpN0y38zLxoX8kXHFbc=,iv:fFCRfv+Y1Nt2zgJNKsxElrYcuKkATJ3A/jvheUY2IK4=,tag:hYojbMGUAQvx7I4qkO7o9w==,type:str]",
+		"unencrypted_suffix": "_unencrypted",
+		"version": "3.9.3"
+	}
+}
--- a/secrets/secrets.yaml
+++ b/secrets/secrets.yaml
@@ -0,0 +1,109 @@
+root_password_hash: ENC[AES256_GCM,data:wk/xEuf+qU3ezmondq9y3OIotXPI/L+TOErTjgJz58wEvQkApYkjc3bHaUTzOrmWjQBgDUENObzPmvQ8WKawUSJRVlpfOEr5TQ==,iv:I8Z3xJz3qoXBD7igx087A1fMwf8d29hQ4JEI3imRXdY=,tag:M80osQeWGG9AAA8BrMfhHA==,type:str]
+ns_xfer_key: ENC[AES256_GCM,data:VFpK7GChgFeUgQm31tTvVC888bN0yt6BAnHQa6KUTg4iZGP1WL5Bx6Zp8dY=,iv:9RF1eEc7JBxBebDOKfcDjGS2U7XsHkOW/l52yIP+1LA=,tag:L6DR2QlHOfo02kzfWWCrvg==,type:str]
+backup_helper_secret: ENC[AES256_GCM,data:EvXEJnDilbfALQ==,iv:Q3dkZ8Ee3qbcjcoi5GxfbaVB4uRIvkIB6ioKVV/dL2Y=,tag:T/UgZvQgYGa740Wh7D0b7Q==,type:str]
+nats_nkey: ENC[AES256_GCM,data:N2CVXjdwiE7eSPUtXe+NeKSTzA9eFwK2igxaCdYsXd4Ps0/DjYb/ggnQziQzSy8viESZYjXhJ2VtNw==,iv:Xhcf5wPB01Wu0A+oMw0wzTEHATp+uN+wsaYshxIzy1w=,tag:IauTIOHqfiM75Ufml/JXbg==,type:str]
+sops:
+    age:
+        - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuWXhzQWFmeCt1R05jREcz
+            Ui9HZFN5dkxHNVE0RVJGZUJUa3hKK2sxdkhBCktYcGpLeGZIQzZIV3ZZWGs3YzF1
+            T09sUEhPWkRkOWZFWkltQXBlM1lQV1UKLS0tIERRSlRUYW5QeW9TVjJFSmorOWNI
+            ZytmaEhzMjVhRXI1S0hielF0NlBrMmcK4I1PtSf7tSvSIJxWBjTnfBCO8GEFHbuZ
+            BkZskr5fRnWUIs72ZOGoTAVSO5ZNiBglOZ8YChl4Vz1U7bvdOCt0bw==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQcXM0RHlGcmZrYW4yNGZs
+            S1ZqQzVaYmQ4MGhGaTFMUVIwOTk5K0tZZjB3ClN0QkhVeHRrNXZHdmZWMzFBRnJ6
+            WTFtaWZyRmx2TitkOXkrVkFiYVd3RncKLS0tIExpeGUvY1VpODNDL2NCaUhtZkp0
+            cGNVZTI3UGxlNWdFWVZMd3FlS3pDR3cKBulaMeonV++pArXOg3ilgKnW/51IyT6Z
+            vH9HOJUix+ryEwDIcjv4aWx9pYDHthPFZUDC25kLYG91WrJFQOo2oA==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBabTdsZWxZQjV2TGx2YjNM
+            ZTgzWktqTjY0S0M3bFpNZXlDRDk5TSt3V2k0CjdWWTN0TlRlK1RpUm9xYW03MFFG
+            aWN4a3o4VUVnYzBDd2FrelUraWtrMTAKLS0tIE1vTGpKYkhzcWErWDRreml2QmE2
+            ZkNIWERKb1drdVR6MTBSTnVmdm51VEkKVNDYdyBSrUT7dUn6a4eF7ELQ2B2Pk6V9
+            Z5fbT75ibuyX1JO315/gl2P/FhxmlRW1K6e+04gQe2R/t/3H11Q7YQ==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBVSFhDOFRVbnZWbVlQaG5G
+            U0NWekU0NzI1SlpRN0NVS1hPN210MXY3Z244CmtFemR5OUpzdlBzMHBUV3g0SFFo
+            eUtqNThXZDJ2b01yVVVuOFdwQVo2Qm8KLS0tIHpXRWd3OEpPRkpaVDNDTEJLMWEv
+            ZlZtaFpBdzF0YXFmdjNkNUR3YkxBZU0KAub+HF/OBZQR9bx/SVadZcL6Ms+NQ7yq
+            21HCcDTWyWHbN4ymUrIYXci1A/0tTOrQL9Mkvaz7IJh4VdHLPZrwwA==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBWkhBL1NTdjFDeEhQcEgv
+            Z3c3Z213L2ZhWGo0Qm5Zd1A1RTBDY3plUkh3CkNWV2ZtNWkrUjB0eWFzUlVtbHlk
+            WTdTQjN4eDIzY0c0dyt6ajVXZ0krd1UKLS0tIHB4aEJqTTRMenV3UkFkTGEySjQ2
+            YVM1a3ZPdUU4T244UU0rc3hVQ3NYczQK10wug4kTjsvv/iOPWi5WrVZMOYUq4/Mf
+            oXS4sikXeUsqH1T2LUBjVnUieSneQVn7puYZlN+cpDQ0XdK/RZ+91A==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBYcEtHbjNWRkdodUxYdHRn
+            MDBMU08zWDlKa0Z4cHJvc28rZk5pUjhnMjE0CmdzRmVGWDlYQ052Wm1zWnlYSFV6
+            dURQK3JSbThxQlg3M2ZaL1hGRzVuL0UKLS0tIEI3UGZvbEpvRS9aR2J2Tnc1YmxZ
+            aUY5Q2MrdHNQWDJNaGt5MWx6MVRrRVEKRPxyAekGHFMKs0Z6spVDayBA4EtPk18e
+            jiFc97BGVtC5IoSu4icq3ZpKOdxymnkqKEt0YP/p/JTC+8MKvTJFQw==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQL3ZMUkI1dUV1T2tTSHhn
+            SjhyQ3dKTytoaDBNcit1VHpwVGUzWVNpdjBnCklYZWtBYzBpcGxZSDBvM2tIZm9H
+            bTFjb1ZCaDkrOU1JODVBVTBTbmxFbmcKLS0tIGtGcS9kejZPZlhHRXI5QnI5Wm9Q
+            VjMxTDdWZEltWThKVDl0S24yWHJxZHcKgzH79zT2I7ZgyTbbbvIhLN/rEcfiomJH
+            oSZDFvPiXlhPgy8bRyyq3l47CVpWbUI2Y7DFXRuODpLUirt3K3TmCA==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBPcm9zUm1XUkpLWm1Jb3Uw
+            RncveGozOW5SRThEM1Y4SFF5RDdxUEhZTUE4CjVESHE5R3JZK0krOXZDL0RHR0oy
+            Z3JKaEpydjRjeFFHck1ic2JTRU5yZTQKLS0tIGY2ck56eG95YnpDYlNqUDh5RVp1
+            U3dRYkNleUtsQU1LMWpDbitJbnRIem8K+27HRtZihG8+k7ZC33XVfuXDFjC1e8lA
+            kffmxp9kOEShZF3IKmAjVHFBiPXRyGk3fGPyQLmSMK2UOOfCy/a/qA==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBTZHlldDdSOEhjTklCSXQr
+            U2pXajFwZnNqQzZOTzY5b3lkMzlyREhXRWo4CmxId2F6NkNqeHNCSWNrcUJIY0Nw
+            cGF6NXJaQnovK1FYSXQ2TkJSTFloTUEKLS0tIHRhWk5aZ0lDVkZaZEJobm9FTDNw
+            a29sZE1GL2ZQSk0vUEc1ZGhkUlpNRkEK9tfe7cNOznSKgxshd5Z6TQiNKp+XW6XH
+            VvPgMqMitgiDYnUPj10bYo3kqhd0xZH2IhLXMnZnqqQ0I23zfPiNaw==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB5bk9NVjJNWmMxUGd3cXRx
+            amZ5SWJ3dHpHcnM4UHJxdmh6NnhFVmJQdldzCm95dHN3R21qSkE4Vm9VTnVPREp3
+            dUQyS1B4MWhhdmd3dk5LQ0htZEtpTWMKLS0tIGFaa3MxVExFYk1MY2loOFBvWm1o
+            L0NoRStkeW9VZVdpWlhteC8yTnRmMUkKMYjUdE1rGgVR29FnhJ5OEVjTB1Rh5Mtu
+            M/DvlhW3a7tZU8nDF3IgG2GE5xOXZMDO9QWGdB8zO2RJZAr3Q+YIlA==
+            -----END AGE ENCRYPTED FILE-----
+        - recipient: age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
+          enc: |
+            -----BEGIN AGE ENCRYPTED FILE-----
+            YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBU0xYMnhqOE0wdXdleStF
+            THcrY2NBQzNoRHdYTXY3ZmM5YXRZZkQ4aUZnCm9ad0IxSWxYT1JBd2RseUdVT1pi
+            UXBuNzFxVlN0OWNTQU5BV2NiVEV0RUUKLS0tIGJHY0dzSDczUzcrV0RpTjE0czEy
+            cWZMNUNlTzBRcEV5MjlRV1BsWGhoaUUKGhYaH8I0oPCfrbs7HbQKVOF/99rg3HXv
+            RRTXUI71/ejKIuxehOvifClQc3nUW73bWkASFQ0guUvO4R+c0xOgUg==
+            -----END AGE ENCRYPTED FILE-----
+    lastmodified: "2025-02-11T21:18:22Z"
+    mac: ENC[AES256_GCM,data:5//boMp1awc/2XAkSASSCuobpkxa0E6IKf3GR8xHpMoCD30FJsCwV7PgX3fR8OuLEhOJ7UguqMNQdNqG37RMacreuDmI1J8oCFKp+3M2j4kCbXaEo8bw7WAtyjUez+SAXKzZWYmBibH0KOy6jdt+v0fdgy5hMBT4IFDofYRsyD0=,iv:6pD+SLwncpmal/FR4U8It2njvaQfUzzpALBCxa0NyME=,tag:4QN8ZFjdqck5ZgulF+FtbA==,type:str]
+    unencrypted_suffix: _unencrypted
+    version: 3.9.4
--- a/services/ca/default.nix
+++ b/services/ca/default.nix
@@ -0,0 +1,169 @@
+{ pkgs, unstable, ... }:
+{
+  homelab.monitoring.scrapeTargets = [{
+    job_name = "step-ca";
+    port = 9000;
+  }];
+  sops.secrets."ca_root_pw" = {
+    sopsFile = ../../secrets/ca/secrets.yaml;
+    owner = "step-ca";
+    path = "/var/lib/step-ca/secrets/ca_root_pw";
+  };
+  sops.secrets."intermediate_ca_key" = {
+    sopsFile = ../../secrets/ca/keys/intermediate_ca_key;
+    format = "binary";
+    owner = "step-ca";
+    path = "/var/lib/step-ca/secrets/intermediate_ca_key";
+  };
+  sops.secrets."root_ca_key" = {
+    sopsFile = ../../secrets/ca/keys/root_ca_key;
+    format = "binary";
+    owner = "step-ca";
+    path = "/var/lib/step-ca/secrets/root_ca_key";
+  };
+  sops.secrets."ssh_host_ca_key" = {
+    sopsFile = ../../secrets/ca/keys/ssh_host_ca_key;
+    format = "binary";
+    owner = "step-ca";
+    path = "/var/lib/step-ca/secrets/ssh_host_ca_key";
+  };
+  sops.secrets."ssh_user_ca_key" = {
+    sopsFile = ../../secrets/ca/keys/ssh_user_ca_key;
+    format = "binary";
+    owner = "step-ca";
+    path = "/var/lib/step-ca/secrets/ssh_user_ca_key";
+  };
+
+  services.step-ca = {
+    enable = true;
+    package = pkgs.step-ca;
+    intermediatePasswordFile = "/var/lib/step-ca/secrets/ca_root_pw";
+    address = "0.0.0.0";
+    port = 443;
+    settings = {
+      metricsAddress = ":9000";
+      authority = {
+        provisioners = [
+          {
+            claims = {
+              enableSSHCA = true;
+              maxTLSCertDuration = "3600h";
+              defaultTLSCertDuration = "48h";
+            };
+            encryptedKey = "eyJhbGciOiJQQkVTMi1IUzI1NitBMTI4S1ciLCJjdHkiOiJqd2sranNvbiIsImVuYyI6IkEyNTZHQ00iLCJwMmMiOjYwMDAwMCwicDJzIjoiY1lWOFJPb3lteXFLMWpzcS1WM1ZXQSJ9.WS8tPK-Q4gtnSsw7MhpTzYT_oi-SQx-CsRLh7KwdZnpACtd4YbcOYg.zeyDkmKRx8BIp-eB.OQ8c-KDW07gqJFtEMqHacRBkttrbJRRz0sYR47vQWDCoWhodaXsxM_Bj2pGvUrR26ij1t7irDeypnJoh6WXvUg3n_JaIUL4HgTwKSBrXZKTscXmY7YVmRMionhAb6oS9Jgus9K4QcFDHacC9_WgtGI7dnu3m0G7c-9Ur9dcDfROfyrnAByJp1rSZMzvriQr4t9bNYjDa8E8yu9zq6aAQqF0Xg_AxwiqYqesT-sdcfrxKS61appApRgPlAhW-uuzyY0wlWtsiyLaGlWM7WMfKdHsq-VqcVrI7Gi2i77vi7OqPEberqSt8D04tIri9S_sArKqWEDnBJsL07CC41IY.CqtYfbSa_wlmIsKgNj5u7g";
+            key = {
+              alg = "ES256";
+              crv = "P-256";
+              kid = "CIjtIe7FNhsNQe1qKGD9Rpj-lrf2ExyTYCXAOd3YDjE";
+              kty = "EC";
+              use = "sig";
+              x = "XRMX-BeobZ-R5-xb-E9YlaRjJUfd7JQxpscaF1NMgFo";
+              y = "bF9xLp5-jywRD-MugMaOGbpbniPituWSLMlXRJnUUl0";
+            };
+            name = "ca@home.2rjus.net";
+            type = "JWK";
+          }
+          {
+            name = "acme";
+            type = "ACME";
+            claims = {
+              maxTLSCertDuration = "3600h";
+              defaultTLSCertDuration = "1800h";
+            };
+          }
+          {
+            claims = {
+              enableSSHCA = true;
+            };
+            name = "sshpop";
+            type = "SSHPOP";
+          }
+        ];
+      };
+      crt = "/var/lib/step-ca/certs/intermediate_ca.crt";
+      db = {
+        badgerFileLoadingMode = "";
+        dataSource = "/var/lib/step-ca/db";
+        type = "badgerv2";
+      };
+      dnsNames = [
+        "ca.home.2rjus.net"
+        "10.69.13.12"
+      ];
+      federatedRoots = null;
+      insecureAddress = "";
+      key = "/var/lib/step-ca/secrets/intermediate_ca_key";
+      logger = {
+        format = "text";
+      };
+      root = "/var/lib/step-ca/certs/root_ca.crt";
+      ssh = {
+        hostKey = "/var/lib/step-ca/secrets/ssh_host_ca_key";
+        userKey = "/var/lib/step-ca/secrets/ssh_user_ca_key";
+      };
+      templates = {
+        ssh = {
+          host = [
+            {
+              comment = "#";
+              name = "sshd_config.tpl";
+              path = "/etc/ssh/sshd_config";
+              requires = [
+                "Certificate"
+                "Key"
+              ];
+              template = ./templates/ssh/sshd_config.tpl;
+              type = "snippet";
+            }
+            {
+              comment = "#";
+              name = "ca.tpl";
+              path = "/etc/ssh/ca.pub";
+              template = ./templates/ssh/ca.tpl;
+              type = "snippet";
+            }
+          ];
+          user = [
+            {
+              comment = "#";
+              name = "config.tpl";
+              path = "~/.ssh/config";
+              template = ./templates/ssh/config.tpl;
+              type = "snippet";
+            }
+            {
+              comment = "#";
+              name = "step_includes.tpl";
+              path = "\${STEPPATH}/ssh/includes";
+              template = ./templates/ssh/step_includes.tpl;
+              type = "prepend-line";
+            }
+            {
+              comment = "#";
+              name = "step_config.tpl";
+              path = "ssh/config";
+              template = ./templates/ssh/step_config.tpl;
+              type = "file";
+            }
+            {
+              comment = "#";
+              name = "known_hosts.tpl";
+              path = "ssh/known_hosts";
+              template = ./templates/ssh/known_hosts.tpl;
+              type = "file";
+            }
+          ];
+        };
+      };
+      tls = {
+        cipherSuites = [
+          "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256"
+          "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"
+        ];
+        maxVersion = 1.3;
+        minVersion = 1.2;
+        renegotiation = false;
+      };
+    };
+  };
+}
--- a/services/ca/templates/ssh/ca.tpl
+++ b/services/ca/templates/ssh/ca.tpl
--- a/services/ca/templates/ssh/config.tpl
+++ b/services/ca/templates/ssh/config.tpl
@@ -0,0 +1,14 @@
+Host *
+{{- if or .User.GOOS "none" | eq "windows" }}
+{{- if .User.StepBasePath }}
+	Include "{{ .User.StepBasePath | replace "\\" "/" | trimPrefix "C:" }}/ssh/includes"
+{{- else }}
+	Include "{{ .User.StepPath | replace "\\" "/" | trimPrefix "C:" }}/ssh/includes"
+{{- end }}
+{{- else }}
+{{- if .User.StepBasePath }}
+	Include "{{.User.StepBasePath}}/ssh/includes"
+{{- else }}
+	Include "{{.User.StepPath}}/ssh/includes"
+{{- end }}
+{{- end }}
--- a/services/ca/templates/ssh/known_hosts.tpl
+++ b/services/ca/templates/ssh/known_hosts.tpl
@@ -0,0 +1,4 @@
+@cert-authority * {{.Step.SSH.HostKey.Type}} {{.Step.SSH.HostKey.Marshal | toString | b64enc}}
+{{- range .Step.SSH.HostFederatedKeys}}
+@cert-authority * {{.Type}} {{.Marshal | toString | b64enc}}
+{{- end }}
--- a/services/ca/templates/ssh/sshd_config.tpl
+++ b/services/ca/templates/ssh/sshd_config.tpl
@@ -0,0 +1,4 @@
+Match all
+	TrustedUserCAKeys /etc/ssh/ca.pub
+	HostCertificate /etc/ssh/{{.User.Certificate}}
+	HostKey /etc/ssh/{{.User.Key}}
--- a/services/ca/templates/ssh/step_config.tpl
+++ b/services/ca/templates/ssh/step_config.tpl
@@ -0,0 +1,11 @@
+Match exec "step ssh check-host{{- if .User.Context }} --context {{ .User.Context }}{{- end }} %h"
+{{- if .User.User }}
+	User {{.User.User}}
+{{- end }}
+{{- if or .User.GOOS "none" | eq "windows" }}
+	UserKnownHostsFile "{{.User.StepPath}}\ssh\known_hosts"
+	ProxyCommand C:\Windows\System32\cmd.exe /c step ssh proxycommand{{- if .User.Context }} --context {{ .User.Context }}{{- end }}{{- if .User.Provisioner }} --provisioner {{ .User.Provisioner }}{{- end }} %r %h %p
+{{- else }}
+	UserKnownHostsFile "{{.User.StepPath}}/ssh/known_hosts"
+	ProxyCommand step ssh proxycommand{{- if .User.Context }} --context {{ .User.Context }}{{- end }}{{- if .User.Provisioner }} --provisioner {{ .User.Provisioner }}{{- end }} %r %h %p
+{{- end }}
--- a/services/ca/templates/ssh/step_includes.tpl
+++ b/services/ca/templates/ssh/step_includes.tpl
@@ -0,0 +1 @@
+{{- if or .User.GOOS "none" | eq "windows" }}Include "{{ .User.StepPath | replace "\\" "/" | trimPrefix "C:" }}/ssh/config"{{- else }}Include "{{.User.StepPath}}/ssh/config"{{- end }}
--- a/services/http-proxy/proxy.nix
+++ b/services/http-proxy/proxy.nix
@@ -5,7 +5,7 @@
    package = pkgs.unstable.caddy;
    configFile = pkgs.writeText "Caddyfile" ''
      {
-        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
+        acme_ca https://ca.home.2rjus.net/acme/acme/directory

        metrics {
          per_host
--- a/services/kanidm/default.nix
+++ b/services/kanidm/default.nix
@@ -1,65 +0,0 @@
-{ config, lib, pkgs, ... }:
-{
-  services.kanidm = {
-    package = pkgs.kanidmWithSecretProvisioning_1_8;
-    enableServer = true;
-    serverSettings = {
-      domain = "home.2rjus.net";
-      origin = "https://auth.home.2rjus.net";
-      bindaddress = "0.0.0.0:443";
-      ldapbindaddress = "0.0.0.0:636";
-      tls_chain = "/var/lib/acme/auth.home.2rjus.net/fullchain.pem";
-      tls_key = "/var/lib/acme/auth.home.2rjus.net/key.pem";
-      online_backup = {
-        path = "/var/lib/kanidm/backups";
-        schedule = "00 22 * * *";
-        versions = 7;
-      };
-    };
-
-    # Provision base groups only - users are managed via CLI
-    # See docs/user-management.md for details
-    provision = {
-      enable = true;
-      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
-
-      groups = {
-        admins = { };
-        users = { };
-        ssh-users = { };
-      };
-
-      # Regular users (persons) are managed imperatively via kanidm CLI
-    };
-  };
-
-  # Grant kanidm access to ACME certificates
-  users.users.kanidm.extraGroups = [ "acme" ];
-
-  # ACME certificate from internal CA
-  # Include both the CNAME (auth) and A record (kanidm01) for Prometheus scraping
-  security.acme.certs."auth.home.2rjus.net" = {
-    listenHTTP = ":80";
-    reloadServices = [ "kanidm" ];
-    extraDomainNames = [ "${config.networking.hostName}.home.2rjus.net" ];
-  };
-
-  # Vault secret for idm_admin password (used for provisioning)
-  vault.secrets.kanidm-idm-admin = {
-    secretPath = "kanidm/idm-admin-password";
-    extractKey = "password";
-    services = [ "kanidm" ];
-    owner = "kanidm";
-    group = "kanidm";
-  };
-
-  # Note: Kanidm does not expose Prometheus metrics
-  # If metrics support is added in the future, uncomment:
-  # homelab.monitoring.scrapeTargets = [
-  #   {
-  #     job_name = "kanidm";
-  #     port = 443;
-  #     scheme = "https";
-  #   }
-  # ];
-}
--- a/services/monitoring/alloy.nix
+++ b/services/monitoring/alloy.nix
@@ -0,0 +1,41 @@
+{ ... }:
+{
+  services.alloy = {
+    enable = true;
+  };
+
+  environment.etc."alloy/config.alloy" = {
+    enable = true;
+    mode = "0644";
+    text = ''
+      pyroscope.write "local_pyroscope" {
+        endpoint {
+          url = "http://localhost:4040"
+        }
+      }
+
+      pyroscope.scrape "labmon" {
+        targets    = [{"__address__" = "localhost:9969", "service_name" = "labmon"}]
+        forward_to = [pyroscope.write.local_pyroscope.receiver]
+
+        profiling_config {
+          profile.process_cpu {
+            enabled = true
+          }
+          profile.memory {
+            enabled = true
+          }
+          profile.mutex {
+            enabled = true
+          }
+          profile.block {
+            enabled = true
+          }
+          profile.goroutine {
+            enabled = true
+          }
+        }
+      }
+    '';
+  };
+}
--- a/services/monitoring/default.nix
+++ b/services/monitoring/default.nix
@@ -7,6 +7,7 @@
    ./pve.nix
    ./alerttonotify.nix
    ./pyroscope.nix
+    ./alloy.nix
    ./tempo.nix
  ];
 }
--- a/services/monitoring/prometheus.nix
+++ b/services/monitoring/prometheus.nix
@@ -121,20 +121,22 @@ in

    scrapeConfigs = [
      # Auto-generated node-exporter targets from flake hosts + external
-      # Each static_config entry may have labels from homelab.host metadata
      {
        job_name = "node-exporter";
-        static_configs = nodeExporterTargets;
+        static_configs = [
+          {
+            targets = nodeExporterTargets;
+          }
+        ];
      }
      # Systemd exporter on all hosts (same targets, different port)
-      # Preserves the same label grouping as node-exporter
      {
        job_name = "systemd-exporter";
-        static_configs = map
-          (cfg: cfg // {
-            targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
-          })
-          nodeExporterTargets;
+        static_configs = [
+          {
+            targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
+          }
+        ];
      }
      # Local monitoring services (not auto-generated)
      {
@@ -178,6 +180,14 @@ in
          }
        ];
      }
+      {
+        job_name = "labmon";
+        static_configs = [
+          {
+            targets = [ "monitoring01.home.2rjus.net:9969" ];
+          }
+        ];
+      }
      # TODO: nix-cache_caddy can't be auto-generated because the cert is issued
      # for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
      # Consider adding a target override to homelab.monitoring.scrapeTargets.
--- a/services/monitoring/rules.yml
+++ b/services/monitoring/rules.yml
@@ -17,9 +17,8 @@ groups:
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk space is low on {{ $labels.instance }}. Please check."
-      # Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
      - alert: high_cpu_load
-        expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
+        expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
        for: 15m
        labels:
          severity: warning
@@ -27,7 +26,7 @@ groups:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is high on {{ $labels.instance }}. Please check."
      - alert: high_cpu_load
-        expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
+        expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
        for: 2h
        labels:
          severity: warning
@@ -116,9 +115,8 @@ groups:
        annotations:
          summary: "NSD not running on {{ $labels.instance }}"
          description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
-      # Only alert on primary DNS (secondary has cold cache after failover)
      - alert: unbound_low_cache_hit_ratio
-        expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
+        expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
        for: 15m
        labels:
          severity: warning
@@ -338,6 +336,40 @@ groups:
        annotations:
          summary: "Pyroscope service not running on {{ $labels.instance }}"
          description: "Pyroscope service not running on {{ $labels.instance }}"
+  - name: certificate_rules
+    rules:
+      - alert: certificate_expiring_soon
+        expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "TLS certificate expiring soon for {{ $labels.instance }}"
+          description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
+      - alert: step_ca_serving_cert_expiring
+        expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Step-CA serving certificate expiring"
+          description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
+      - alert: certificate_check_error
+        expr: labmon_tlsconmon_certificate_check_error == 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Error checking certificate for {{ $labels.address }}"
+          description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
+      - alert: step_ca_certificate_expiring
+        expr: labmon_stepmon_certificate_seconds_left < 3600
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Step-CA certificate expiring for {{ $labels.instance }}"
+          description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
  - name: proxmox_rules
    rules:
      - alert: pve_node_down
@@ -356,6 +388,32 @@ groups:
        annotations:
          summary: "Proxmox VM {{ $labels.id }} is stopped"
          description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."
+  - name: postgres_rules
+    rules:
+      - alert: postgres_down
+        expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "PostgreSQL not running on {{ $labels.instance }}"
+          description: "PostgreSQL has been down on {{ $labels.instance }} more than 5 minutes."
+      - alert: postgres_exporter_down
+        expr: up{job="postgres"} == 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "PostgreSQL exporter down on {{ $labels.instance }}"
+          description: "Cannot scrape PostgreSQL metrics from {{ $labels.instance }}."
+      - alert: postgres_high_connections
+        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "PostgreSQL connection pool near exhaustion on {{ $labels.instance }}"
+          description: "PostgreSQL is using over 80% of max_connections on {{ $labels.instance }}."
  - name: jellyfin_rules
    rules:
      - alert: jellyfin_down
--- a/services/nix-cache/proxy.nix
+++ b/services/nix-cache/proxy.nix
@@ -5,7 +5,7 @@
    package = pkgs.unstable.caddy;
    configFile = pkgs.writeText "Caddyfile" ''
      {
-        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
+        acme_ca https://ca.home.2rjus.net/acme/acme/directory
        metrics
      }

--- a/services/ns/resolver.nix
+++ b/services/ns/resolver.nix
@@ -45,11 +45,7 @@
      };
      stub-zone = {
        name = "home.2rjus.net";
-        stub-addr = [
-          "127.0.0.1@8053"   # Local NSD
-          "10.69.13.5@8053"  # ns1
-          "10.69.13.6@8053"  # ns2
-        ];
+        stub-addr = "127.0.0.1@8053";
      };
      forward-zone = {
        name = ".";
--- a/services/postgres/default.nix
+++ b/services/postgres/default.nix
@@ -0,0 +1,6 @@
+{ ... }:
+{
+  imports = [
+    ./postgres.nix
+  ];
+}
--- a/services/postgres/postgres.nix
+++ b/services/postgres/postgres.nix
@@ -0,0 +1,23 @@
+{ pkgs, ... }:
+{
+  homelab.monitoring.scrapeTargets = [{
+    job_name = "postgres";
+    port = 9187;
+  }];
+
+  services.prometheus.exporters.postgres = {
+    enable = true;
+    runAsLocalSuperUser = true; # Use peer auth as postgres user
+  };
+
+  services.postgresql = {
+    enable = true;
+    enableJIT = true;
+    enableTCPIP = true;
+    extensions = ps: with ps; [ pgvector ];
+    authentication = ''
+      # Allow access to everything from gunter
+      host    all             all             10.69.30.105/32         scram-sha-256
+    '';
+  };
+}
--- a/system/acme.nix
+++ b/system/acme.nix
@@ -3,7 +3,7 @@
  security.acme = {
    acceptTerms = true;
    defaults = {
-      server = "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";
+      server = "https://ca.home.2rjus.net/acme/acme/directory";
      email = "root@home.2rjus.net";
      dnsPropagationCheck = false;
    };
--- a/system/default.nix
+++ b/system/default.nix
@@ -4,15 +4,14 @@
    ./acme.nix
    ./autoupgrade.nix
    ./homelab-deploy.nix
-    ./kanidm-client.nix
    ./monitoring
    ./motd.nix
    ./packages.nix
    ./nix.nix
    ./root-user.nix
    ./pki/root-ca.nix
+    ./sops.nix
    ./sshd.nix
    ./vault-secrets.nix
-    ./zram.nix
  ];
 }
Author	SHA1	Message	CI	Date
Torjus Håkestad	2669b10f0e	flake: update homelab-deploy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): cancelled	2026-02-07 06:53:13 +01:00
Torjus Håkestad	db6d610e16	homelab: add deploy.enable option with assertion. Add homelab.deploy.enable option (requires vault.enable); create shared homelab-deploy Vault policy for all hosts; enable homelab.deploy on all vault-enabled hosts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): successful in 2m3s	2026-02-07 06:47:12 +01:00
Torjus Håkestad	e4eb8afe5c	system: enable homelab-deploy listener for all vault hosts. Add system/homelab-deploy.nix module that automatically enables the listener on all hosts with vault.enable=true. Uses homelab.host.tier and homelab.host.role for NATS subject subscriptions. Add homelab-deploy access to all host AppRole policies; remove manual listener config from vaulttest01 (now handled by system module). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): successful in 2m4s	2026-02-07 06:41:03 +01:00
Torjus Håkestad	df9246a0f8	flake: update homelab-deploy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): failing after 12m46s	2026-02-07 06:27:21 +01:00
Torjus Håkestad	ec3b87f7fa	flake: update homelab-deploy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): successful in 2m5s	2026-02-07 06:20:14 +01:00
Torjus Håkestad	913fa11c64	flake: update homelab-deploy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): failing after 3m39s	2026-02-07 06:11:37 +01:00
Torjus Håkestad	3e85e2527f	flake: update homelab-deploy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): successful in 2m7s	2026-02-07 05:58:26 +01:00
Torjus Håkestad	543ca18b14	flake: update homelab-deploy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): successful in 2m6s	2026-02-07 05:54:05 +01:00
Torjus Håkestad	c83218b3bc	flake: update homelab-deploy, add to devShell. Update homelab-deploy to include bugfix. Add CLI to devShell for easier testing and deployment operations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	flake-check (push): successful in 2m6s	2026-02-07 05:45:54 +01:00