nrec-nixos01: enable Git LFS and hide explore page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nrec-nixos01: add Forgejo with Caddy reverse proxy
2026-03-08 15:10:59 +01:00 · 2026-03-08 14:49:48 +01:00 · 2026-03-08 14:31:09 +01:00 · 2026-03-08 14:22:24 +01:00 · 2026-03-08 13:12:24 +00:00 · 2026-03-08 14:10:05 +01:00
82 changed files with 2025 additions and 1331 deletions
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -130,7 +130,7 @@ get_commit_info(<hash>)            # Get full details of a specific change
 ```

 **Example workflow for a service-related alert:**
-1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
+1. Query `nixos_flake_info{hostname="monitoring02"}` → `current_rev: 8959829`
 2. `resolve_ref("master")` → `4633421`
 3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
 4. `commits_between("8959829", "4633421")` → 7 commits missing
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -30,7 +30,7 @@ Use the `lab-monitoring` MCP server tools:
 ### Label Reference

 Available labels for log queries:
-- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
+- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
 - `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
@@ -54,7 +54,7 @@ Journal logs are JSON-formatted. Key fields:

 **All logs from a host:**
 ```logql
-{hostname="monitoring01"}
+{hostname="monitoring02"}
 ```

 **Logs from a service across all hosts:**
@@ -74,7 +74,7 @@ Journal logs are JSON-formatted. Key fields:

 **Regex matching:**
 ```logql
-{systemd_unit="prometheus.service"} |~ "scrape.*failed"
+{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
 ```

 **Filter by level (journal scrape only):**
@@ -109,7 +109,7 @@ Default lookback is 1 hour. Use `start` parameter for older logs:
 Useful systemd units for troubleshooting:
 - `nixos-upgrade.service` - Daily auto-upgrade logs
 - `nsd.service` - DNS server (ns1/ns2)
-- `prometheus.service` - Metrics collection
+- `victoriametrics.service` - Metrics collection
 - `loki.service` - Log aggregation
 - `caddy.service` - Reverse proxy
 - `home-assistant.service` - Home automation
@@ -152,7 +152,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl

 Parse JSON and filter on fields:
 ```logql
-{systemd_unit="prometheus.service"} | json | PRIORITY="3"
+{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
 ```

 ---
@@ -242,12 +242,11 @@ All available Prometheus job names:
 - `unbound` - DNS resolver metrics (ns1, ns2)
 - `wireguard` - VPN tunnel metrics (http-proxy)

-**Monitoring stack (localhost on monitoring01):**
-- `prometheus` - Prometheus self-metrics
+**Monitoring stack (localhost on monitoring02):**
+- `victoriametrics` - VictoriaMetrics self-metrics
 - `loki` - Loki self-metrics
 - `grafana` - Grafana self-metrics
 - `alertmanager` - Alertmanager metrics
-- `pushgateway` - Push-based metrics gateway

 **External/infrastructure:**
 - `pve-exporter` - Proxmox hypervisor metrics
@@ -262,7 +261,7 @@ All scrape targets have these labels:
 **Standard labels:**
 - `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
 - `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
-- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering

 **Host metadata labels** (when configured in `homelab.host`):
 - `role` - Host role (e.g., `dns`, `build-host`, `vault`)
@@ -275,7 +274,7 @@ Use the `hostname` label for easy host filtering across all jobs:

 ```promql
 {hostname="ns1"}                    # All metrics from ns1
-node_load1{hostname="monitoring01"} # Specific metric by hostname
+node_load1{hostname="monitoring02"} # Specific metric by hostname
 up{hostname="ha1"}                  # Check if ha1 is up
 ```

@@ -283,10 +282,10 @@ This is simpler than wildcarding the `instance` label:

 ```promql
 # Old way (still works but verbose)
-up{instance=~"monitoring01.*"}
+up{instance=~"monitoring02.*"}

 # New way (preferred)
-up{hostname="monitoring01"}
+up{hostname="monitoring02"}
 ```

 ### Filtering by Role/Tier
--- a/.claude/skills/quick-plan/SKILL.md
+++ b/.claude/skills/quick-plan/SKILL.md
@@ -73,6 +73,7 @@ Additional context, caveats, or references.
 - **Reference existing patterns**: Mention how this fits with existing infrastructure
 - **Tables for comparisons**: Use markdown tables when comparing options
 - **Practical focus**: Emphasize what needs to happen, not theory
+- **Mermaid diagrams**: Use mermaid code blocks for architecture diagrams, flow charts, or other graphs when relevant to the plan. Keep node labels short and use `<br/>` for line breaks

 ## Examples of Good Plans

--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -247,7 +247,7 @@ nix develop -c homelab-deploy -- deploy \
  deploy.prod.<hostname>
 ```

-Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
+Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring02`, `deploy.test.testvm01`)

 **Verifying Deployments:**

@@ -309,7 +309,7 @@ All hosts automatically get:
 - OpenBao (Vault) secrets management via AppRole
 - Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
 - Daily auto-upgrades with auto-reboot
-- Prometheus node-exporter + Promtail (logs to monitoring01)
+- Prometheus node-exporter + Promtail (logs to monitoring02)
 - Monitoring scrape target auto-registration via `homelab.monitoring` options
 - Custom root CA trust
 - DNS zone auto-registration via `homelab.dns` options
@@ -335,7 +335,7 @@ Use `nix flake show` or `nix develop -c ansible-inventory --graph` to list all h
 - Infrastructure subnet: `10.69.13.x`
 - DNS: ns1/ns2 provide authoritative DNS with primary-secondary setup
 - Internal CA for ACME certificates (no Let's Encrypt)
-- Centralized monitoring at monitoring01
+- Centralized monitoring at monitoring02
 - Static networking via systemd-networkd

 ### Secrets Management
@@ -480,23 +480,21 @@ See [docs/host-creation.md](docs/host-creation.md) for the complete host creatio

 ### Monitoring Stack

-All hosts ship metrics and logs to `monitoring01`:
-- **Metrics**: Prometheus scrapes node-exporter from all hosts
-- **Logs**: Promtail ships logs to Loki on monitoring01
-- **Access**: Grafana at monitoring01 for visualization
-- **Tracing**: Tempo for distributed tracing
-- **Profiling**: Pyroscope for continuous profiling
+All hosts ship metrics and logs to `monitoring02`:
+- **Metrics**: VictoriaMetrics scrapes node-exporter from all hosts
+- **Logs**: Promtail ships logs to Loki on monitoring02
+- **Access**: Grafana at monitoring02 for visualization

 **Scrape Target Auto-Generation:**

-Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
+VictoriaMetrics scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:

 - **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
 - **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

-Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
+Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The VictoriaMetrics config on monitoring02 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.

 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.

--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
 | `ca` | Internal Certificate Authority |
 | `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
 | `http-proxy` | Reverse proxy |
-| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
+| `monitoring02` | VictoriaMetrics, Grafana, Loki, Alertmanager |
 | `jelly01` | Jellyfin media server |
 | `nix-cache02` | Nix binary cache + NATS-based build service |
 | `nats1` | NATS messaging |
@@ -121,4 +121,4 @@ No manual intervention is required after `tofu apply`.
 - Infrastructure subnet: `10.69.13.0/24`
 - DNS: ns1/ns2 authoritative with primary-secondary AXFR
 - Internal CA for TLS certificates (migrating from step-ca to OpenBao PKI)
-- Centralized monitoring at monitoring01
+- Centralized monitoring at monitoring02
--- a/ansible/playbooks/provision-approle.yml
+++ b/ansible/playbooks/provision-approle.yml
@@ -23,14 +23,12 @@
      when: ansible_play_hosts | length != 1
      run_once: true

-- name: Fetch AppRole credentials from OpenBao
-  hosts: localhost
-  connection: local
+- name: Provision AppRole credentials
+  hosts: all
  gather_facts: false

  vars:
-    target_host: "{{ groups['all'] | first }}"
-    target_hostname: "{{ hostvars[target_host]['short_hostname'] | default(target_host.split('.')[0]) }}"
+    target_hostname: "{{ inventory_hostname.split('.')[0] }}"

  tasks:
    - name: Display target host
@@ -45,6 +43,7 @@
        BAO_SKIP_VERIFY: "1"
      register: role_id_result
      changed_when: false
+      delegate_to: localhost

    - name: Generate secret-id for host
      ansible.builtin.command:
@@ -54,21 +53,8 @@
        BAO_SKIP_VERIFY: "1"
      register: secret_id_result
      changed_when: true
+      delegate_to: localhost

-    - name: Store credentials for next play
-      ansible.builtin.set_fact:
-        vault_role_id: "{{ role_id_result.stdout }}"
-        vault_secret_id: "{{ secret_id_result.stdout }}"
-
-- name: Deploy AppRole credentials to host
-  hosts: all
-  gather_facts: false
-
-  vars:
-    vault_role_id: "{{ hostvars['localhost']['vault_role_id'] }}"
-    vault_secret_id: "{{ hostvars['localhost']['vault_secret_id'] }}"
-
-  tasks:
    - name: Create AppRole directory
      ansible.builtin.file:
        path: /var/lib/vault/approle
@@ -79,7 +65,7 @@

    - name: Write role-id
      ansible.builtin.copy:
-        content: "{{ vault_role_id }}"
+        content: "{{ role_id_result.stdout }}"
        dest: /var/lib/vault/approle/role-id
        mode: "0600"
        owner: root
@@ -87,7 +73,7 @@

    - name: Write secret-id
      ansible.builtin.copy:
-        content: "{{ vault_secret_id }}"
+        content: "{{ secret_id_result.stdout }}"
        dest: /var/lib/vault/approle/secret-id
        mode: "0600"
        owner: root
--- a/docs/plans/completed/monitoring-migration-victoriametrics.md
+++ b/docs/plans/completed/monitoring-migration-victoriametrics.md
@@ -0,0 +1,156 @@
+# Monitoring Stack Migration to VictoriaMetrics
+
+## Overview
+
+Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
+and longer retention. Run in parallel with monitoring01 until validated, then switch over using
+a `monitoring` CNAME for a seamless transition.
+
+## Current State
+
+**monitoring02** (10.69.13.24) - **PRIMARY**:
+- 4 CPU cores, 8GB RAM, 60GB disk
+- VictoriaMetrics with 3-month retention
+- vmalert with alerting enabled (routes to local Alertmanager)
+- Alertmanager -> alerttonotify -> NATS notification pipeline
+- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
+- Loki (log aggregation)
+- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki
+
+**monitoring01** (10.69.13.13) - **SHUT DOWN**:
+- No longer running, pending decommission
+
+## Decision: VictoriaMetrics
+
+Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
+- Single binary replacement for Prometheus
+- 5-10x better compression (30 days could become 180+ days in same space)
+- Same PromQL query language (Grafana dashboards work unchanged)
+- Same scrape config format (existing auto-generated configs work)
+
+If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
+
+## Architecture
+
+```
+                     ┌─────────────────┐
+                     │  monitoring02   │
+                     │  VictoriaMetrics│
+                     │  + Grafana      │
+     monitoring      │  + Loki         │
+     CNAME ──────────│  + Alertmanager │
+                     │  (vmalert)      │
+                     └─────────────────┘
+                            ▲
+                            │ scrapes
+            ┌───────────────┼───────────────┐
+            │               │               │
+       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
+       │  ns1    │    │  ha1     │    │  ...     │
+       │ :9100   │    │ :9100    │    │ :9100    │
+       └─────────┘    └──────────┘    └──────────┘
+```
+
+## Implementation Plan
+
+### Phase 1: Create monitoring02 Host [COMPLETE]
+
+Host created and deployed at 10.69.13.24 (prod tier) with:
+- 4 CPU cores, 8GB RAM, 60GB disk
+- Vault integration enabled
+- NATS-based remote deployment enabled
+- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
+
+### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]
+
+New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
+Imported by monitoring02 alongside the existing Grafana service.
+
+1. **VictoriaMetrics** (port 8428):
+   - `services.victoriametrics.enable = true`
+   - `retentionPeriod = "3"` (3 months)
+   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
+   - Static user override (DynamicUser disabled) for credential file access
+   - OpenBao token fetch service + 30min refresh timer
+   - Apiary bearer token via vault.secrets
+
+2. **vmalert** for alerting rules:
+   - Points to VictoriaMetrics datasource at localhost:8428
+   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
+   - Notifier sends to local Alertmanager at localhost:9093
+
+3. **Alertmanager** (port 9093):
+   - Same configuration as monitoring01 (alerttonotify webhook routing)
+   - alerttonotify imported on monitoring02, routes alerts via NATS
+
+4. **Grafana** (port 3000):
+   - VictoriaMetrics datasource (localhost:8428) as default
+   - Loki datasource pointing to localhost:3100
+
+5. **Loki** (port 3100):
+   - Same configuration as monitoring01 in standalone `services/loki/` module
+   - Grafana datasource updated to localhost:3100
+
+**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
+pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
+native push support.
+
+### Phase 3: Parallel Operation [COMPLETE]
+
+Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.
+
+### Phase 4: Add monitoring CNAME [COMPLETE]
+
+Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.
+
+### Phase 5: Update References [COMPLETE]
+
+- Moved alertmanager, grafana, prometheus CNAMEs from http-proxy to monitoring02
+- Removed corresponding Caddy reverse proxy entries from http-proxy
+- monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly
+
+### Phase 6: Enable Alerting [COMPLETE]
+
+- Switched vmalert from blackhole mode to local Alertmanager
+- alerttonotify service running on monitoring02 (NATS nkey from Vault)
+- prometheus-metrics Vault policy added for OpenBao scraping
+- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS
+
+### Phase 7: Cutover and Decommission [IN PROGRESS]
+
+- monitoring01 shut down (2026-02-17)
+- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support
+
+**Remaining cleanup (separate branch):**
+- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
+- [ ] Update `hosts/template2/bootstrap.nix` - Bootstrap Loki URL still points to monitoring01
+- [ ] Remove monitoring01 from flake.nix and host configuration
+- [ ] Destroy monitoring01 VM in Proxmox
+- [ ] Remove monitoring01 from terraform state
+- [ ] Remove or archive `services/monitoring/` (Prometheus config)
+
+## Completed
+
+- 2026-02-08: Phase 1 - monitoring02 host created
+- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
+- 2026-02-17: Phase 6 - Alerting enabled, CNAMEs migrated, monitoring01 shut down
+
+## VictoriaMetrics Service Configuration
+
+Implemented in `services/victoriametrics/default.nix`. Key design decisions:
+
+- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
+  `victoriametrics` user so vault.secrets and credential files work correctly
+- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
+  reference (no YAML-to-Nix conversion needed)
+- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
+  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
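+
+A minimal sketch of the static-user override described above, not the contents of `services/victoriametrics/default.nix`. It assumes the upstream NixOS module names its unit `victoriametrics` and accepts `retentionPeriod` as a string; the explicit user and group definitions are illustrative.
+
+```nix
+{ lib, ... }:
+{
+  services.victoriametrics = {
+    enable = true;
+    retentionPeriod = "3"; # no unit suffix means months
+  };
+
+  # Assumption: a fixed system account replaces the module's DynamicUser,
+  # so secret and credential files can be owned by a stable user.
+  users.users.victoriametrics = {
+    isSystemUser = true;
+    group = "victoriametrics";
+  };
+  users.groups.victoriametrics = { };
+
+  systemd.services.victoriametrics.serviceConfig = {
+    DynamicUser = lib.mkForce false;
+    User = "victoriametrics";
+    Group = "victoriametrics";
+  };
+}
+```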
+
+## Notes
+
+- VictoriaMetrics uses port 8428 vs Prometheus 9090
+- PromQL compatibility is excellent
+- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
+- monitoring02 deployed via OpenTofu using `create-host` script
+- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
+- Tempo and Pyroscope deferred (not actively used; can be added later if needed)
--- a/docs/plans/host-migration-to-opentofu.md
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -20,9 +20,9 @@ Hosts to migrate:
 | http-proxy | Stateless | Reverse proxy, recreate |
 | nats1 | Stateless | Messaging, recreate |
 | ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
-| monitoring01 | Stateful | Prometheus, Grafana, Loki |
+| ~~monitoring01~~ | ~~Decommission~~ | ✓ Complete — replaced by monitoring02 (VictoriaMetrics) |
 | jelly01 | Stateful | Jellyfin metadata, watch history, config |
-| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
+| ~~pgdb1~~ | ~~Decommission~~ | ✓ Complete |
 | ~~jump~~ | ~~Decommission~~ | ✓ Complete |
 | ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
 | ~~ca~~ | ~~Deferred~~ | ✓ Complete |
@@ -31,10 +31,12 @@ Hosts to migrate:

 Before migrating any stateful host, ensure restic backups are in place and verified.

-### 1a. Expand monitoring01 Grafana Backup
+### ~~1a. Expand monitoring01 Grafana Backup~~ ✓ N/A

-The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
-Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.
+~~The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
+Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.~~
+
+No longer needed — monitoring01 decommissioned, replaced by monitoring02 with declarative Grafana dashboards.

 ### 1b. Add Jellyfin Backup to jelly01

@@ -94,15 +96,17 @@ For each stateful host, the procedure is:
 7. Start services and verify functionality
 8. Decommission the old VM

-### 3a. monitoring01
+### 3a. monitoring01 ✓ COMPLETE

-1. Run final Grafana backup
-2. Provision new monitoring01 via OpenTofu
-3. After bootstrap, restore `/var/lib/grafana/` from restic
-4. Restart Grafana, verify dashboards and datasources are intact
-5. Prometheus and Loki start fresh with empty data (acceptable)
-6. Verify all scrape targets are being collected
-7. Decommission old VM
+1. ~~Run final Grafana backup~~
+2. ~~Provision new monitoring01 via OpenTofu~~
+3. ~~After bootstrap, restore `/var/lib/grafana/` from restic~~
+4. ~~Restart Grafana, verify dashboards and datasources are intact~~
+5. ~~Prometheus and Loki start fresh with empty data (acceptable)~~
+6. ~~Verify all scrape targets are being collected~~
+7. ~~Decommission old VM~~
+
+Replaced by monitoring02 with VictoriaMetrics, standalone Loki and Grafana modules. Host configuration, old service modules, and terraform resources removed.

 ### 3b. jelly01

@@ -163,19 +167,19 @@ Host was already removed from flake.nix and VM destroyed. Configuration cleaned

 Host configuration, services, and VM already removed.

-### pgdb1 (in progress)
+### pgdb1 ✓ COMPLETE

-Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
+~~Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.~~

-1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
-2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
-3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
-4. ~~Remove from `flake.nix`~~ ✓
-5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
-6. Destroy the VM in Proxmox
-7. ~~Commit cleanup~~ ✓
+1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
+2. ~~Remove host configuration from `hosts/pgdb1/`~~
+3. ~~Remove `services/postgres/` (only used by pgdb1)~~
+4. ~~Remove from `flake.nix`~~
+5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~
+6. ~~Destroy the VM in Proxmox~~
+7. ~~Commit cleanup~~

-See `docs/plans/pgdb1-decommission.md` for detailed plan.
+Host configuration, services, terraform resources, and VM removed. See `docs/plans/pgdb1-decommission.md` for detailed plan.

 ## Phase 5: Decommission ca Host ✓ COMPLETE

--- a/docs/plans/local-ntp-chrony.md
+++ b/docs/plans/local-ntp-chrony.md
@@ -0,0 +1,79 @@
+# Local NTP with Chrony
+
+## Overview/Goal
+
+Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This eliminates clock drift issues that cause false `host_reboot` alerts.
+
+## Current State
+
+- All NixOS hosts use `systemd-timesyncd` with default NixOS pool servers (`0.nixos.pool.ntp.org` etc.)
+- No NTP/timesyncd configuration exists in the repo — all defaults
+- pve1 (Proxmox, bare metal) already runs chrony but only as a client
+- VMs drift noticeably — ns1 (~19ms) and jelly01 (~39ms) are worst offenders
+- Clock step corrections from timesyncd trigger false `host_reboot` alerts via `changes(node_boot_time_seconds[10m]) > 0`
+- pve1 itself stays at 0ms offset thanks to chrony
+
+## Why systemd-timesyncd is Insufficient
+
+- Minimal SNTP client, no proper clock discipline or frequency tracking
+- Backs off polling interval when it thinks clock is stable, missing drift
+- Corrects via step adjustments rather than gradual slewing, causing metric jumps
+- Each VM resolves to different pool servers with varying accuracy
+
+## Implementation Steps
+
+### 1. Configure pve1 as NTP Server
+
+Add to pve1's `/etc/chrony/chrony.conf`:
+
+```
+# Allow NTP clients from the infrastructure subnet
+allow 10.69.13.0/24
+```
+
+Restart chrony on pve1.
+
+### 2. Add Chrony to NixOS System Config
+
+Create `system/chrony.nix` (applied to all hosts via system imports):
+
+```nix
+{
+  # Disable systemd-timesyncd (chrony takes over)
+  services.timesyncd.enable = false;
+
+  # Enable chrony pointing at pve1
+  services.chrony = {
+    enable = true;
+    servers = [ "pve1.home.2rjus.net" ];
+    serverOption = "iburst";
+  };
+}
+```
+
+### 3. Optional: Add Chrony Exporter
+
+For better visibility into NTP sync quality:
+
+```nix
+services.prometheus.exporters.chrony.enable = true;
+```
+
+Add chrony exporter scrape targets via `homelab.monitoring.scrapeTargets` and create a Grafana dashboard for NTP offset across all hosts.
+
+### 4. Roll Out
+
+- Deploy to a test-tier host first to verify
+- Then deploy to all hosts via auto-upgrade
+
+## Open Questions
+
+- [ ] Does pve1's chrony config need `local stratum 10` as fallback if upstream is unreachable?
+- [ ] Should we also enable `enableRTCTrimming` for the VMs?
+- [ ] Worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?
+
+## Notes
+
+- No fallback NTP servers needed on VMs — if pve1 is down, all VMs are down too
+- The `host_reboot` alert rule (`changes(node_boot_time_seconds[10m]) > 0`) should stop false-firing once clock corrections are slewed instead of stepped
+- pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency
--- a/docs/plans/media-pc-replacement.md
+++ b/docs/plans/media-pc-replacement.md
@@ -0,0 +1,244 @@
+# Media PC Replacement
+
+## Overview
+
+Replace the aging Linux+Kodi media PC connected to the TV with a modern, compact solution. Primary use cases are Jellyfin/Kodi playback and watching Twitch/YouTube. The current machine (`media`, 10.69.31.50) is on VLAN 31.
+
+## Current State
+
+### Hardware
+- **CPU**: Intel Core i7-4770K @ 3.50GHz (Haswell, 4C/8T, 2013)
+- **GPU**: Nvidia GeForce GT 710 (Kepler, GK208B)
+- **OS**: Ubuntu 22.04.5 LTS (Jammy)
+- **Software**: Kodi
+- **Network**: `media.home.2rjus.net` at `10.69.31.50` (VLAN 31)
+
+### Control & Display
+- **Input**: Wireless keyboard (works well, useful for browser)
+- **TV**: 1080p (no 4K/HDR currently, but may upgrade TV later)
+- **Audio**: Surround system connected via HDMI ARC from TV (PC → HDMI → TV → ARC → surround)
+
+### Notes on Current Hardware
+- The i7-4770K is massively overpowered for media playback — it's a full desktop CPU from 2013
+- The GT 710 is a low-end passive GPU; its Kepler-generation video decoder handles H.264 in hardware but not H.265/HEVC, and its HDMI 1.4 output is limited to 4K@30Hz
+- Ubuntu 22.04 is approaching EOL (April 2027) and is not managed by this repo
+- The whole system is likely in a full-size or mid-tower case — not ideal for a TV setup
+
+### Integration
+- **Media source**: Jellyfin on `jelly01` (10.69.13.14) serves media from NAS via NFS
+- **DNS**: A record in `services/ns/external-hosts.nix`
+- **Not managed**: Not a NixOS host in this repo, no monitoring/auto-updates
+
+## Options
+
+### Option 1: Dedicated Streaming Device (Apple TV / Nvidia Shield)
+
+| Aspect | Apple TV 4K | Nvidia Shield Pro |
+|--------|-------------|-------------------|
+| **Price** | ~$130-180 | ~$200 |
+| **Jellyfin** | Swiftfin app (good) | Jellyfin Android TV (good) |
+| **Kodi** | Not available (tvOS) | Full Kodi support |
+| **Twitch** | Native app | Native app |
+| **YouTube** | Native app | Native app |
+| **HDR/DV** | Dolby Vision + HDR10 | Dolby Vision + HDR10 |
+| **4K** | Yes | Yes |
+| **Form factor** | Tiny, silent | Small, silent |
+| **Remote** | Excellent Siri remote | Decent, supports CEC |
+| **Homelab integration** | None | Minimal (Plex/Kodi only) |
+
+**Pros:**
+- Zero maintenance - appliance experience
+- Excellent app ecosystem (native Twitch, YouTube, streaming services)
+- Silent, tiny form factor
+- Great remote control / CEC support
+- Hardware-accelerated codec support out of the box
+
+**Cons:**
+- No NixOS management, monitoring, or auto-updates
+- Can't run arbitrary software
+- Jellyfin clients are decent but not as mature as Kodi
+- Vendor lock-in (Apple ecosystem / Google ecosystem)
+- No SSH access for troubleshooting
+
+### Option 2: NixOS Mini PC (Kodi Appliance)
+
+A small form factor PC (Intel NUC, Beelink, MinisForum, etc.) running NixOS with Kodi as the desktop environment.
+
+**NixOS has built-in support:**
+- `services.xserver.desktopManager.kodi.enable` - boots directly into Kodi
+- `kodi-gbm` package - Kodi with direct DRM/KMS rendering (no X11/Wayland needed)
+- `kodiPackages.jellycon` - Jellyfin integration for Kodi
+- `kodiPackages.sendtokodi` - plays streams via yt-dlp (Twitch, YouTube)
+- `kodiPackages.inputstream-adaptive` - adaptive streaming support
+
+**Example NixOS config sketch:**
+```nix
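+{ pkgs, ... }:
+{
+  # The Kodi desktop manager is an X11 session, so the X server must be enabled
+  services.xserver.enable = true;
+  services.xserver.desktopManager.kodi = {
+    enable = true;
+    package = pkgs.kodi.withPackages (p: [
+      p.jellycon
+      p.sendtokodi
+      p.inputstream-adaptive
+    ]);
+  };
+
+  # Auto-login to Kodi session
+  services.displayManager.autoLogin = {
+    enable = true;
+    user = "kodi";
+  };
+
+  # The auto-login user has to exist
+  users.users.kodi.isNormalUser = true;
+}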
+```
+
+**Pros:**
+- Full NixOS management (monitoring, auto-updates, vault, promtail)
+- Kodi is a proven TV interface with excellent remote/CEC support
+- JellyCon integrates Jellyfin library directly into Kodi
+- Twitch/YouTube via sendtokodi + yt-dlp or Kodi browser addons
+- Can run arbitrary services (e.g., Home Assistant dashboard)
+- Declarative, reproducible config in this repo
+
+**Cons:**
+- More maintenance than an appliance
+- NixOS + Kodi on bare metal needs GPU driver setup (Intel iGPU is usually fine)
+- Kodi YouTube/Twitch addons are less polished than native apps
+- Need to buy hardware (~$150-400 for a decent mini PC)
+- Power consumption higher than a streaming device
+
+### Option 3: NixOS Mini PC (Wayland Desktop)
+
+A mini PC running NixOS with a lightweight Wayland compositor, launching Kodi for media and a browser for Twitch/YouTube.
+
+**Pros:**
+- Best of both worlds: Kodi for media, Firefox/Chromium for Twitch/YouTube
+- Full NixOS management
+- Can switch between Kodi and browser easily
+- Native web experience for streaming sites
+
+**Cons:**
+- More complex setup (compositor + Kodi + browser)
+- Harder to get a good "10-foot UI" experience
+- Keyboard/mouse may be needed alongside remote
+- Significantly more maintenance
+
+## Comparison
+
+| Criteria | Dedicated Device | NixOS Kodi | NixOS Desktop |
+|----------|-----------------|------------|---------------|
+| **Maintenance** | None | Low | Medium |
+| **Media experience** | Excellent | Excellent | Good |
+| **Twitch/YouTube** | Excellent (native apps) | Good (addons/yt-dlp) | Excellent (browser) |
+| **Homelab integration** | None | Full | Full |
+| **Form factor** | Tiny | Small | Small |
+| **Cost** | $130-200 | $150-400 | $150-400 |
+| **Silent operation** | Yes | Likely (fanless options) | Likely |
+| **CEC remote** | Yes | Yes (Kodi) | Partial |
+
+## Decision: NixOS Mini PC with Kodi (Option 2)
+
+**Rationale:**
+- Already comfortable with Kodi + wireless keyboard workflow
+- Browser access for Twitch/YouTube is important — Kodi can launch a browser when needed
+- Homelab integration comes for free (monitoring, auto-updates, vault)
+- Natural fit alongside the other 16 NixOS hosts in this repo
+- Dedicated devices lose the browser/keyboard workflow
+
+### Display Server: Sway/Hyprland
+
+Options evaluated:
+
+| Approach | Pros | Cons |
+|----------|------|------|
+| Cage (kiosk) | Simplest, single-app | No browser without TTY switching |
+| kodi-gbm (no compositor) | Best HDR support | No browser at all, ALSA-only audio |
+| **Sway/Hyprland** | **Workspace switching, VA-API in browser** | **Slightly more config** |
+| Full DE (GNOME/KDE) | Everything works | Overkill, heavy |
+
+**Decision: Sway or Hyprland** (Hyprland preferred — same as desktop)
+
+- Kodi fullscreen on workspace 1, Firefox on workspace 2
+- Switch via keybinding on wireless keyboard
+- Auto-start both on login via greetd
+- Minimal config — no bar, no decorations, just workspaces
+- VA-API hardware decode works in Firefox on Wayland (important for YouTube/Twitch)
+- Can revisit kodi-gbm later if HDR becomes a priority (just a config change)
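+
+A rough sketch of the workspace layout using Sway (the plan prefers Hyprland, so this shows the alternative named above). The `kodi` user, the `media-sway.conf` file name, and the `Mod4` keybindings are assumptions for illustration; only `greetd`, two workspaces, and auto-started Kodi and Firefox come from the plan.
+
+```nix
+{ pkgs, ... }:
+let
+  # Hypothetical minimal Sway config: no bar block, no window borders
+  swayConfig = pkgs.writeText "media-sway.conf" ''
+    default_border none
+    # Firefox windows always open on workspace 2
+    assign [app_id="firefox"] workspace number 2
+    # Start on workspace 1, where Kodi opens
+    workspace number 1
+    exec ${pkgs.kodi-wayland}/bin/kodi
+    exec ${pkgs.firefox}/bin/firefox
+    bindsym Mod4+1 workspace number 1
+    bindsym Mod4+2 workspace number 2
+  '';
+in
+{
+  programs.sway.enable = true;
+
+  # greetd starts the Sway session directly, without a login prompt
+  services.greetd = {
+    enable = true;
+    settings.default_session = {
+      command = "${pkgs.sway}/bin/sway --config ${swayConfig}";
+      user = "kodi";
+    };
+  };
+
+  users.users.kodi.isNormalUser = true;
+}
+```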
+
+### Twitch/YouTube
+
+Firefox on workspace 2, switched to via keyboard. Kodi addons (sendtokodi, YouTube plugin) available as secondary options but a real browser is the primary approach.
+
+### Media Playback: Kodi + JellyCon + NFS Direct Path
+
+Three options were evaluated for media playback:
+
+| Approach | Transcoding | Library management | Watch state sync |
+|----------|-------------|-------------------|-----------------|
+| Jellyfin only (browser) | Yes — browsers lack codec support for DTS, PGS subs, etc. | Jellyfin | Jellyfin |
+| Kodi + NFS only | No — Kodi plays everything natively | Kodi local DB | None |
+| **Kodi + JellyCon + NFS** | **No — Kodi's native player, direct path via NFS** | **Jellyfin** | **Jellyfin** |
+
+**Decision: Kodi + JellyCon with NFS direct path**
+
+- JellyCon presents the Jellyfin library inside Kodi's UI (browse, search, metadata, artwork)
+- Playback uses Kodi's native player — direct play, no transcoding, full codec support including surround passthrough
+- JellyCon's "direct path" mode maps Jellyfin paths to local NFS mounts, so playback goes straight over NFS without streaming through Jellyfin's HTTP layer
+- Watch state, resume position, etc. sync back to Jellyfin — accessible from other devices too
+- NFS mount follows the same pattern as jelly01 (`nas.home.2rjus.net:/mnt/hdd-pool/media`)
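+
+The NFS mount could be declared roughly as follows. The mount point `/mnt/media` and the mount options are assumptions; jelly01's actual settings should be copied instead if they differ:
+
+```nix
+{
+  fileSystems."/mnt/media" = {
+    device = "nas.home.2rjus.net:/mnt/hdd-pool/media";
+    fsType = "nfs";
+    # Read-only, mounted on first access so boot does not block on the NAS
+    options = [ "ro" "noauto" "x-systemd.automount" "x-systemd.idle-timeout=600" ];
+  };
+}
+```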
+
+### Audio Passthrough
+
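+Kodi on NixOS supports HDMI audio passthrough for surround formats (AC3, DTS, etc.). The ARC chain (PC → HDMI → TV → ARC → surround) should work as long as Kodi is configured for passthrough rather than decoding audio locally, with two caveats: plain ARC only carries lossy formats (lossless TrueHD/DTS-HD MA needs eARC), and the TV must itself pass the bitstream through, which not every model does for DTS.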
+
+## Hardware
+
+### Leading Candidate: GMKtec G3
+
+- **CPU**: Intel N100 (Alder Lake-N, 4C/4T)
+- **RAM**: 16GB
+- **Storage**: 512GB NVMe
+- **Price**: ~NOK 2800 (~$250 USD)
+- **Source**: AliExpress
+
+The N100 supports hardware decode for all relevant 4K codecs:
+
+| Codec | Support | Used by |
+|-------|---------|---------|
+| H.264/AVC | Yes (Quick Sync) | Older media |
+| H.265/HEVC 10-bit | Yes (Quick Sync) | Most 4K media, HDR |
+| VP9 | Yes (Quick Sync) | YouTube 4K |
+| AV1 | Yes (Quick Sync) | YouTube, Twitch, newer encodes |
+
+16GB RAM is comfortable for Kodi + browser + NixOS system services (node-exporter, promtail, etc.) with plenty of headroom.
+
+### Key Requirements
+- HDMI 2.0+ for 4K future-proofing (current TV is 1080p)
+- Hardware video decode via VA-API / Intel Quick Sync
+- HDR support (for future TV upgrade)
+- Fanless or near-silent operation
+
+## Implementation Steps
+
+1. **Choose and order hardware**
+2. **Create host configuration** (`hosts/media1/`)
+   - Kodi (fullscreen under Hyprland/Sway, started via greetd) with JellyCon + streaming addons
+   - Intel/AMD iGPU driver and VA-API hardware decode
+   - HDMI audio passthrough for surround
+   - NFS mount for media (same pattern as jelly01)
+   - Browser package (Firefox) for Twitch/YouTube on workspace 2
+   - Standard system modules (monitoring, promtail, vault, auto-upgrade)
+3. **Install NixOS** on the mini PC
+4. **Configure Kodi** (Jellyfin server, addons, audio passthrough)
+5. **Update DNS** - point `media.home.2rjus.net` to new IP (or keep on VLAN 31)
+6. **Retire old media PC**
+
+## Open Questions
+
+- [x] What are the current media PC specs? — i7-4770K, GT 710, Ubuntu 22.04. Overkill CPU, weak GPU, large form factor. Not worth reusing if goal is compact/silent.
+- [x] VLAN? — Keep on VLAN 31 for now, same as current media PC. Can revisit later.
+- [x] Is CEC needed? — No, not using it currently. Can add later if desired.
+- [x] Is 4K HDR output needed? — TV is 1080p now, but want 4K/HDR capability for future TV upgrade
+- [x] Audio setup? — Surround system via HDMI ARC from TV. Media PC outputs HDMI to TV, TV passes audio to surround via ARC. Kodi/any player just needs HDMI audio output with surround passthrough.
+- [x] Are there streaming service apps needed? — No. Only Twitch/YouTube, which work fine in any browser.
+- [x] Budget? — ~NOK 2800 for GMKtec G3 (N100, 16GB, 512GB NVMe)
--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -1,201 +0,0 @@
-# Monitoring Stack Migration to VictoriaMetrics
-
-## Overview
-
-Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
-and longer retention. Run in parallel with monitoring01 until validated, then switch over using
-a `monitoring` CNAME for seamless transition.
-
-## Current State
-
-**monitoring01** (10.69.13.13):
-- 4 CPU cores, 4GB RAM, 33GB disk
-- Prometheus with 30-day retention (15s scrape interval)
-- Alertmanager (routes to alerttonotify webhook)
-- Grafana (dashboards, datasources)
-- Loki (log aggregation from all hosts via Promtail)
-- Tempo (distributed tracing) - not actively used
-- Pyroscope (continuous profiling) - not actively used
-
-**Hardcoded References to monitoring01:**
-- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
-- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
-- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
-
-**Auto-generated:**
-- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
-- Node-exporter targets (from all hosts with static IPs)
-
-## Decision: VictoriaMetrics
-
-Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
-- Single binary replacement for Prometheus
-- 5-10x better compression (30 days could become 180+ days in same space)
-- Same PromQL query language (Grafana dashboards work unchanged)
-- Same scrape config format (existing auto-generated configs work)
-
-If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
-
-## Architecture
-
-```
-                     ┌─────────────────┐
-                     │  monitoring02   │
-                     │  VictoriaMetrics│
-                     │  + Grafana      │
-     monitoring      │  + Loki         │
-     CNAME ──────────│  + Alertmanager │
-                     │  (vmalert)      │
-                     └─────────────────┘
-                            ▲
-                            │ scrapes
-            ┌───────────────┼───────────────┐
-            │               │               │
-       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
-       │  ns1    │    │  ha1     │    │  ...     │
-       │ :9100   │    │ :9100    │    │ :9100    │
-       └─────────┘    └──────────┘    └──────────┘
-```
-
-## Implementation Plan
-
-### Phase 1: Create monitoring02 Host [COMPLETE]
-
-Host created and deployed at 10.69.13.24 (prod tier) with:
-- 4 CPU cores, 8GB RAM, 60GB disk
-- Vault integration enabled
-- NATS-based remote deployment enabled
-- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
-
-### Phase 2: Set Up VictoriaMetrics Stack
-
-New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
-Imported by monitoring02 alongside the existing Grafana service.
-
-1. **VictoriaMetrics** (port 8428): [DONE]
-   - `services.victoriametrics.enable = true`
-   - `retentionPeriod = "3"` (3 months)
-   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
-   - Static user override (DynamicUser disabled) for credential file access
-   - OpenBao token fetch service + 30min refresh timer
-   - Apiary bearer token via vault.secrets
-
-2. **vmalert** for alerting rules: [DONE]
-   - Points to VictoriaMetrics datasource at localhost:8428
-   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
-   - No notifier configured during parallel operation (prevents duplicate alerts)
-
-3. **Alertmanager** (port 9093): [DONE]
-   - Same configuration as monitoring01 (alerttonotify webhook routing)
-   - Will only receive alerts after cutover (vmalert notifier disabled)
-
-4. **Grafana** (port 3000): [DONE]
-   - VictoriaMetrics datasource (localhost:8428) as default
-   - monitoring01 Prometheus datasource kept for comparison during parallel operation
-   - Loki datasource pointing to localhost (after Loki migrated to monitoring02)
-
-5. **Loki** (port 3100): [DONE]
-   - Same configuration as monitoring01 in standalone `services/loki/` module
-   - Grafana datasource updated to localhost:3100
-
-**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
-pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
-native push support.
-
-### Phase 3: Parallel Operation
-
-Run both monitoring01 and monitoring02 simultaneously:
-
-1. **Dual scraping**: Both hosts scrape the same targets
-   - Validates VictoriaMetrics is collecting data correctly
-
-2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
-   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
-
-3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
-
-4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
-
-5. **Compare resource usage**: Monitor disk/memory consumption between hosts
-
-### Phase 4: Add monitoring CNAME
-
-Add CNAME to monitoring02 once validated:
-
-```nix
-# hosts/monitoring02/configuration.nix
-homelab.dns.cnames = [ "monitoring" ];
-```
-
-This creates `monitoring.home.2rjus.net` pointing to monitoring02.
-
-### Phase 5: Update References
-
-Update hardcoded references to use the CNAME:
-
-1. **system/monitoring/logs.nix**:
-   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
-
-2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
-   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
-   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
-   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
-
-Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
-
-### Phase 6: Enable Alerting
-
-Once ready to cut over:
-1. Enable Alertmanager receiver on monitoring02
-2. Verify test alerts route correctly
-
-### Phase 7: Cutover and Decommission
-
-1. **Stop monitoring01**: Prevent duplicate alerts during transition
-2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
-3. **Verify all targets scraped**: Check VictoriaMetrics UI
-4. **Verify logs flowing**: Check Loki on monitoring02
-5. **Decommission monitoring01**:
-   - Remove from flake.nix
-   - Remove host configuration
-   - Destroy VM in Proxmox
-   - Remove from terraform state
-
-## Current Progress
-
-- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
-- **Phase 2** complete (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana datasources configured
-  - Tempo and Pyroscope deferred (not actively used; can be added later if needed)
-
-## Open Questions
-
-- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
-- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
-- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.
-
-## VictoriaMetrics Service Configuration
-
-Implemented in `services/victoriametrics/default.nix`. Key design decisions:
-
-- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
-  `victoriametrics` user so vault.secrets and credential files work correctly
-- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
-  reference (no YAML-to-Nix conversion needed)
-- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
-  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
-
-## Rollback Plan
-
-If issues arise after cutover:
-1. Move `monitoring` CNAME back to monitoring01
-2. Restart monitoring01 services
-3. Revert Promtail config to point only to monitoring01
-4. Revert http-proxy backends
-
-## Notes
-
-- VictoriaMetrics uses port 8428 vs Prometheus 9090
-- PromQL compatibility is excellent
-- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
-- monitoring02 deployed via OpenTofu using `create-host` script
-- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
--- a/docs/plans/nixos-hypervisor.md
+++ b/docs/plans/nixos-hypervisor.md
@@ -0,0 +1,232 @@
+# NixOS Hypervisor
+
+## Overview
+
+Experiment with running a NixOS-based hypervisor as an alternative/complement to the current Proxmox setup. Goal is better homelab integration — declarative config, monitoring, auto-updates — while retaining the ability to run VMs with a Terraform-like workflow.
+
+## Motivation
+
+- Proxmox works but doesn't integrate with the NixOS-managed homelab (no monitoring, no auto-updates, no vault, no declarative config)
+- The PN51 units (once stable) are good candidates for experimentation — test-tier, plenty of RAM (32-64GB), 8C/16T
+- Long-term: could reduce reliance on Proxmox or provide a secondary hypervisor pool
+- **VM migration**: Currently all VMs (including both nameservers) run on a single Proxmox host. Being able to migrate VMs between hypervisors would allow rebooting a host for kernel updates without downtime for critical services like DNS.
+
+## Hardware Candidates
+
+| | pn01 | pn02 |
+|---|---|---|
+| **CPU** | Ryzen 7 5700U (8C/16T) | Ryzen 7 5700U (8C/16T) |
+| **RAM** | 64GB (2x32GB) | 32GB (1x32GB, second slot available) |
+| **Storage** | 1TB NVMe | 1TB SATA SSD (NVMe planned) |
+| **Status** | Stability testing | Stability testing |
+
+## Options
+
+### Option 1: Incus
+
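+Community fork of LXD, created after Canonical took LXD out of the Linux Containers project and under its own control (LXD was later relicensed to AGPLv3 with a CLA, not made proprietary). Supports both containers (LXC) and VMs (QEMU/KVM).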
+
+**NixOS integration:**
+- `virtualisation.incus.enable` module in nixpkgs
+- Manages storage pools, networks, and instances
+- REST API for automation
+- CLI tool (`incus`) for management
+
+**Terraform integration:**
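+- Dedicated `incus` Terraform/OpenTofu provider exists (`lxc/incus`) and is the safer choice
+- The older `lxd` provider may still work, but Incus and LXD have been diverging since the fork, so API compatibility is not guaranteed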
+- Can define VMs/containers in OpenTofu, similar to current Proxmox workflow
+
+**Migration:**
+- Built-in live and offline migration via `incus move <instance> --target <host>`
+- Clustering makes hosts aware of each other — migration is a first-class operation
+- Shared storage (NFS, Ceph) or Incus can transfer storage during migration
+- Stateful stop-and-move also supported for offline migration
+
+**Pros:**
+- Supports both containers and VMs
+- REST API + CLI for automation
+- Built-in clustering and migration — closest to Proxmox experience
+- Good NixOS module support
+- Image-based workflow (can build NixOS images and import)
+- Active development and community
+
+**Cons:**
+- Another abstraction layer on top of QEMU/KVM
+- Less mature Terraform provider than libvirt
+- Container networking can be complex
+- NixOS guests in Incus VMs need some setup
+
+### Option 2: libvirt/QEMU
+
+Standard Linux virtualization stack. Thin wrapper around QEMU/KVM.
+
+**NixOS integration:**
+- `virtualisation.libvirtd.enable` module in nixpkgs
+- Mature and well-tested
+- virsh CLI for management
+
+**Terraform integration:**
+- `dmacvicar/libvirt` provider — mature, well-maintained
+- Supports cloud-init, volume management, network config
+- Very similar workflow to current Proxmox+OpenTofu setup
+- Can reuse cloud-init patterns from existing `terraform/` config
+
+**Migration:**
+- Supports live and offline migration via `virsh migrate`
+- Live migration is simplest with shared storage (NFS, Ceph, or similar); without it, disks must be copied during migration (`virsh migrate --copy-storage-all`)
+- Requires matching CPU models between hosts (or CPU model masking)
+- Works but is manual — no cluster awareness, must specify target URI
+- No built-in orchestration for multi-host scenarios
+
+**Pros:**
+- Closest to current Proxmox+Terraform workflow
+- Most mature Terraform provider
+- Minimal abstraction — direct QEMU/KVM management
+- Well-understood, massive community
+- Cloud-init works identically to Proxmox workflow
+- Can reuse existing template-building patterns
+
+**Cons:**
+- VMs only (no containers without adding LXC separately)
+- No built-in REST API (would need to expose libvirt socket)
+- No web UI without adding cockpit or virt-manager
+- Migration works but requires manual setup — no clustering, no orchestration
+- Less feature-rich than Incus for multi-host scenarios
+
+### Option 3: microvm.nix
+
+NixOS-native microVM framework. VMs defined as NixOS modules in the host's flake.
+
+**NixOS integration:**
+- VMs are NixOS configurations in the same flake
+- Supports multiple backends: cloud-hypervisor, QEMU, firecracker, kvmtool
+- Lightweight — shares host's nix store with guests via virtiofs
+- Declarative network, storage, and resource allocation
+
+**Terraform integration:**
+- None — everything is defined in Nix
+- Fundamentally different workflow from current Proxmox+Terraform approach
+
+**Pros:**
+- Most NixOS-native approach
+- VMs defined right alongside host configs in this repo
+- Very lightweight — fast boot, minimal overhead
+- Shares nix store with host (no duplicate packages)
+- No cloud-init needed — guest config is part of the flake
+
+**Migration:**
+- No migration support — VMs are tied to the host's NixOS config
+- Moving a VM means rebuilding it on another host
+
+**Cons:**
+- Very niche, smaller community
+- Different mental model from current workflow
+- Only NixOS guests (no Ubuntu, FreeBSD, etc.)
+- No Terraform integration
+- No migration support
+- Less isolation than full QEMU VMs
+- Would need to learn a new deployment pattern
+
+## Comparison
+
+| Criteria | Incus | libvirt | microvm.nix |
+|----------|-------|---------|-------------|
+| **Workflow similarity** | Medium | High | Low |
+| **Terraform support** | Yes (lxd/incus provider) | Yes (mature provider) | No |
+| **NixOS module** | Yes | Yes | Yes |
+| **Containers + VMs** | Both | VMs only | VMs only |
+| **Non-NixOS guests** | Yes | Yes | No |
+| **Live migration** | Built-in (first-class) | Yes (manual setup) | No |
+| **Offline migration** | Built-in | Yes (manual setup) | No (rebuild) |
+| **Clustering** | Built-in | Manual | No |
+| **Learning curve** | Medium | Low | Medium |
+| **Community/maturity** | Growing | Very mature | Niche |
+| **Overhead** | Low | Minimal | Minimal |
+
+## Recommendation
+
+Start with **Incus**. Migration and clustering are key requirements:
+- Built-in clustering makes two PN51s a proper hypervisor pool
+- Live and offline migration are first-class operations, similar to Proxmox
+- Can move VMs between hosts for maintenance (kernel updates, hardware work) without downtime
+- Supports both containers and VMs — flexibility for future use
+- Terraform provider exists (less mature than libvirt's, but functional)
+- REST API enables automation beyond what Terraform covers
+
+libvirt could achieve similar results but requires significantly more manual setup for migration and has no clustering awareness. For a two-node setup where migration is a priority, Incus provides much more out of the box.
+
+**microvm.nix** is off the table given the migration requirement.
+
+## Implementation Plan
+
+### Phase 1: Single-Node Setup (on one PN51)
+
+1. Set `virtualisation.incus.enable = true` on pn01 (or whichever unit is stable)
+2. Initialize Incus (`incus admin init`) — configure storage pool (local NVMe) and network bridge
+3. Configure bridge networking for VM traffic on VLAN 12
+4. Build a NixOS VM image and import it into Incus
+5. Create a test VM manually with `incus launch` to validate the setup
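+
+A minimal sketch of the host-side config for steps 1 and 2, assuming current nixpkgs option names. The nftables setting is included because the NixOS Incus module is understood to require it, and the preseed values (pool name, `dir` driver) are placeholders; the preseed is a declarative stand-in for the interactive `incus admin init`:
+
+```nix
+{
+  virtualisation.incus = {
+    enable = true;
+    # Declarative alternative to running `incus admin init` by hand
+    preseed = {
+      storage_pools = [
+        {
+          name = "default"; # placeholder pool name
+          driver = "dir";   # simplest driver; ZFS/btrfs/LVM are alternatives
+        }
+      ];
+    };
+  };
+
+  # Incus on NixOS expects the nftables firewall backend
+  networking.nftables.enable = true;
+}
+```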
+
+### Phase 2: Two-Node Cluster (PN51s only)
+
+1. Enable Incus on the second PN51
+2. Form a cluster between both nodes
+3. Configure shared storage (NFS from NAS, or Ceph if warranted)
+4. Test offline migration: `incus move <vm> --target <other-node>`
+5. Test live migration with shared storage
+6. CPU compatibility is not an issue here — both nodes have identical Ryzen 7 5700U CPUs
+
+### Phase 3: Terraform Integration
+
+1. Add Incus Terraform provider to `terraform/`
+2. Define a test VM in OpenTofu (cloud-init, static IP, vault provisioning)
+3. Verify the full pipeline: tofu apply -> VM boots -> cloud-init -> vault credentials -> NixOS rebuild
+4. Compare workflow with existing Proxmox pipeline
+
+### Phase 4: Evaluate and Expand
+
+- Is the workflow comparable to Proxmox?
+- Migration reliability — does live migration work cleanly?
+- Performance overhead acceptable on Ryzen 5700U?
+- Worth migrating some test-tier VMs from Proxmox?
+- Could ns1/ns2 run on separate Incus nodes instead of the single Proxmox host?
+
+### Phase 5: Proxmox Replacement (optional)
+
+If Incus works well on the PN51s, consider replacing Proxmox entirely for a three-node cluster.
+
+**CPU compatibility for mixed cluster:**
+
+| Node | CPU | Architecture | x86-64-v3 |
+|------|-----|-------------|-----------|
+| Proxmox host | AMD Ryzen 9 3900X (12C/24T) | Zen 2 | Yes |
+| pn01 | AMD Ryzen 7 5700U (8C/16T) | Zen 2 | Yes |
+| pn02 | AMD Ryzen 7 5700U (8C/16T) | Zen 2 | Yes |
+
+All three CPUs are AMD Zen 2 parts (the 5700U is a "Lucienne" Zen 2 refresh, not Zen 3) and support `x86-64-v3`. The desktop 3900X and the mobile 5700U may still differ in a few optional CPU features, so a common `x86-64-v3` baseline remains the safe choice. VMs configured with `x86-64-v3` can migrate freely between all three nodes.
+
+Being all-AMD also avoids the trickier Intel/AMD cross-vendor migration edge cases (different CPUID layouts, virtualization extensions).
+
+The 3900X (12C/24T) would be the most powerful node, making it the natural home for heavier workloads, with the PN51s (8C/16T each) handling lighter VMs or serving as migration targets during maintenance.
+
+Steps:
+1. Install NixOS + Incus on the Proxmox host (or a replacement machine)
+2. Join it to the existing Incus cluster with `x86-64-v3` CPU baseline
+3. Migrate VMs from Proxmox to the Incus cluster
+4. Decommission Proxmox
+
+## Prerequisites
+
+- [ ] PN51 units pass stability testing (see `pn51-stability.md`)
+- [ ] Decide which unit to use first (pn01 preferred — 64GB RAM, NVMe, currently more stable)
+
+## Open Questions
+
+- How to handle VM storage? Local NVMe, NFS from NAS, or Ceph between the two nodes?
+- Network topology: bridge on VLAN 12, or trunk multiple VLANs to the PN51?
+- Should VMs be on the same VLAN as the hypervisor host, or separate?
+- Incus clustering with only two nodes — any quorum issues? Three nodes (with Proxmox replacement) would solve this
+- How to handle NixOS guest images? Build with nixos-generators, or use Incus image builder?
+- ~~What CPU does the current Proxmox host have?~~ AMD Ryzen 9 3900X (Zen 2) — `x86-64-v3` confirmed, all-AMD cluster
+- If replacing Proxmox: migrate VMs first, or fresh start and rebuild?
--- a/docs/plans/nixos-router.md
+++ b/docs/plans/nixos-router.md
@@ -42,10 +42,24 @@ Needs a small x86 box with:
 - 4-8 GB RAM (plenty for routing + DHCP + NetFlow accounting)
 - Low power consumption, fanless preferred for always-on use

-Candidates:
-- Topton / CWWK mini PC with dual/quad Intel 2.5GbE (~100-150 EUR)
-- Protectli Vault (more expensive, ~200-300 EUR, proven in pfSense/OPNsense community)
-- Any mini PC with one onboard NIC + one USB 2.5GbE adapter (cheapest, less ideal)
+**Leading candidate:** [Topton Solid Mini PC](https://www.aliexpress.com/item/1005008981218625.html)
+with Intel i3-N300 (8 E-cores), 2x10GbE SFP+ + 3x2.5GbE (~NOK 3000 barebones). The N300
+gives headroom for ntopng DPI and potential Suricata IDS without being overkill.
+
+### Hardware Alternatives
+
+Domestic availability for firewall mini PCs is limited — likely ordering from AliExpress.
+
+Key things to verify:
+- NIC chipset: Intel i225-V/i226-V preferred over Realtek for Linux driver support
+- RAM/storage: some listings are barebones, check what's included
+- Import duties: factor in ~25% on top of listing price
+
+| Option | NICs | Notes | Price |
+|--------|------|-------|-------|
+| [Topton Solid Firewall Router](https://www.aliexpress.com/item/1005008059819023.html) | 2x10GbE SFP+, 4x2.5GbE | No RAM/SSD, only Intel N150 available currently | ~NOK 2500 |
+| [Topton Solid Mini PC](https://www.aliexpress.com/item/1005008981218625.html) | 2x10GbE SFP+, 3x2.5GbE | No RAM/SSD, only Intel i3-N300 available currently | ~NOK 3000 |
+| [MINISFORUM MS-01](https://www.aliexpress.com/item/1005007308262492.html) | 2x10GbE SFP+, 2x2.5GbE | No RAM/SSD, i5-12600H | ~NOK 4500 |

 The LAN port would carry a VLAN trunk to the MikroTik switch, with sub-interfaces
 for each VLAN. WAN port connects to the ISP uplink.
@@ -89,6 +103,12 @@ The router is treated differently from the rest of the fleet:
 - nftables flow accounting or softflowd for NetFlow export
 - Export to future ntopng instance (see new-services.md)

+**IDS/IPS (future consideration):**
+- Suricata for inline intrusion detection/prevention on the WAN interface
+- Signature-based threat detection, protocol anomaly detection
+- CPU-intensive, but should be feasible at typical home internet speeds (500 Mbps to 1 Gbps) on the N300
+- Not a day-one requirement, but the hardware should support it
+
 ### Monitoring Integration

 Since this is a NixOS host in the flake, it gets the standard monitoring stack for free:
--- a/docs/plans/openstack-nixos-image.md
+++ b/docs/plans/openstack-nixos-image.md
@@ -0,0 +1,104 @@
+# NixOS OpenStack Image
+
+## Overview
+
+Build and upload a NixOS base image to the OpenStack cluster at work, enabling NixOS-based VPS instances to replace the current Debian+Podman setup. This image will serve as the foundation for multiple external services:
+
+- **Forgejo** (replacing Gitea on docker2)
+- **WireGuard gateway** (replacing docker2's tunnel role, feeding into the remote-access plan)
+- Any future externally-hosted services
+
+## Current State
+
+- VPS hosting runs on an OpenStack cluster with a personal quota
+- Current VPS (`docker2.t-juice.club`) runs Debian with Podman containers
+- Homelab already has a working Proxmox image pipeline: `template2` builds via `nixos-rebuild build-image --image-variant proxmox`, deployed via Ansible
+- nixpkgs has a built-in `openstack` image variant in the same `image.modules` system used for Proxmox
+
+## Decisions
+
+- **No cloud-init dependency** - SSH key baked into the image, no need for metadata service
+- **No bootstrap script** - VPS deployments are infrequent; manual `nixos-rebuild` after first boot is fine
+- **No Vault access** - secrets handled manually until WireGuard access is set up (see remote-access plan)
+- **Separate from homelab services** - no logging/metrics integration initially; revisit after remote-access WireGuard is in place
+- **Repo placement TBD** - keep in this flake for now for convenience, but external hosts may move to a separate flake later since they can't use most shared `system/` modules (no Vault, no internal DNS, no Promtail)
+- **OpenStack CLI in devshell** - add `openstackclient` package; credentials (`clouds.yaml`) stay outside the repo
+- **Parallel deployment** - new Forgejo instance runs alongside docker2 initially, then CNAME moves over
+
+## Approach
+
+Follow the same pattern as the Proxmox template (`hosts/template2`), but targeting OpenStack's qcow2 format.
+
+### What nixpkgs provides
+
+The `image.modules.openstack` module produces a qcow2 image with:
+- `openstack-config.nix`: EC2 metadata fetcher, SSH enabled, GRUB bootloader, serial console, auto-growing root partition
+- `qemu-guest.nix` profile (virtio drivers)
+- ext4 root filesystem with `autoResize`
+
+### What we need to customize
+
+The stock OpenStack image pulls SSH keys and hostname from EC2-style metadata. Since we're baking the SSH key into the image, we need a simpler configuration:
+
+- SSH authorized keys baked into the image
+- Base packages (age, vim, wget, git)
+- Nix substituters (`cache.nixos.org` only - internal cache not reachable)
+- systemd-networkd with DHCP
+- GRUB bootloader
+- Firewall enabled (public-facing host)
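+
+The list above could translate into a `configuration.nix` along these lines. This is only a sketch: the SSH key is a placeholder, and the `en*` interface match and the `10-uplink` network name are assumptions about how the OpenStack NIC shows up in the guest:
+
+```nix
+{ pkgs, ... }:
+{
+  # SSH key baked into the image instead of fetched from metadata
+  services.openssh.enable = true;
+  users.users.root.openssh.authorizedKeys.keys = [
+    "ssh-ed25519 AAAA...placeholder user@host" # replace with the real public key
+  ];
+
+  environment.systemPackages = with pkgs; [ age vim wget git ];
+
+  # Public cache only; the internal cache is not reachable from the VPS
+  nix.settings.substituters = [ "https://cache.nixos.org" ];
+
+  # systemd-networkd with DHCP on the (assumed) primary interface
+  networking.useDHCP = false;
+  systemd.network = {
+    enable = true;
+    networks."10-uplink" = {
+      matchConfig.Name = "en*";
+      networkConfig.DHCP = "yes";
+    };
+  };
+
+  # Public-facing host: keep the firewall on (sshd opens port 22 itself)
+  networking.firewall.enable = true;
+}
+```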
+
+### Differences from template2
+
+| Aspect | template2 (Proxmox) | openstack-template (OpenStack) |
+|--------|---------------------|-------------------------------|
+| Image format | VMA (`.vma.zst`) | qcow2 (`.qcow2`) |
+| Image variant | `proxmox` | `openstack` |
+| Cloud-init | ConfigDrive + NoCloud | Not used (SSH key baked in) |
+| Nix cache | Internal + nixos.org | `cache.nixos.org` only |
+| Vault | AppRole via wrapped token | None |
+| Bootstrap | Automatic nixos-rebuild on first boot | Manual |
+| Network | Internal DHCP | OpenStack DHCP |
+| DNS | Internal ns1/ns2 | Public DNS |
+| Firewall | Disabled (trusted network) | Enabled |
+| System modules | Full `../../system` import | Minimal (sshd, packages only) |
+
+## Implementation Steps
+
+### Phase 1: Build the image
+
+1. Create `hosts/openstack-template/` with minimal configuration
+   - `default.nix` - imports (only sshd and packages from `system/`, not the full set)
+   - `configuration.nix` - base config: SSH key, DHCP, GRUB, base packages, firewall on
+   - `hardware-configuration.nix` - qemu-guest profile with virtio drivers
+   - Exclude from DNS and monitoring (`homelab.dns.enable = false`, `homelab.monitoring.enable = false`)
+   - May need to override parts of `image.modules.openstack` to disable the EC2 metadata fetcher if it causes boot delays
+2. Build with `nixos-rebuild build-image --image-variant openstack --flake .#openstack-template`
+3. Verify the qcow2 image is produced in `result/`
+
+### Phase 2: Upload and test
+
+1. Add `openstackclient` to the devshell
+2. Upload image: `openstack image create --disk-format qcow2 --file result/<image>.qcow2 nixos-template`
+3. Boot a test instance from the image
+4. Verify: SSH access works, DHCP networking, Nix builds work
+5. Test manual `nixos-rebuild switch --flake` against the instance
+
+### Phase 3: Automation (optional, later)
+
+Consider an Ansible playbook similar to `build-and-deploy-template.yml` for image builds + uploads. Low priority since this will be done rarely.
+
+## Open Questions
+
+- [ ] Should external VPS hosts eventually move to a separate flake? (Depends on how different they end up being from homelab hosts)
+- [ ] Will the stock `openstack-config.nix` metadata fetcher cause boot delays/errors if the metadata service isn't reachable? May need to disable it.
+- [ ] **Flavor selection** - investigate what flavors are available in the quota. The standard small flavors likely have insufficient root disk for a NixOS host (Nix store grows fast). Options:
+  - Use a larger flavor with adequate root disk
+  - Create a custom flavor (if permissions allow)
+  - Cinder block storage is an option in theory, but was very slow last time it was tested - avoid if possible
+- [ ] Consolidation opportunity - currently running multiple smaller VMs on OpenStack. Could a single larger NixOS VM replace several of them?
+
+## Notes
+
+- `nixos-rebuild build-image --image-variant openstack` uses the same `image.modules` system as Proxmox
+- nixpkgs also has an `openstack-zfs` variant if ZFS root is ever wanted
+- The stock OpenStack module imports `ec2-data.nix` and `amazon-init.nix` - these may need to be disabled or overridden if they cause issues without a metadata service
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -0,0 +1,231 @@
+# ASUS PN51 Stability Testing
+
+## Overview
+
+Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to stability issues. Revisiting them to potentially add to the homelab.
+
+## Hardware
+
+| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
+|---|---|---|
+| **CPU** | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
+| **RAM** | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
+| **Storage** | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
+| **BIOS** | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |
+
+## Original Issues
+
+- **pn01**: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
+- **pn02**: Had trouble booting — would start loading kernel from installer USB then instantly reboot. When it did boot, would also freeze randomly.
+
+## Debugging Steps
+
+### 2026-02-21: Initial Setup
+
+1. **Disabled fTPM** (labeled "Security Device" in ASUS BIOS) on both units
+   - AMD Ryzen 5000 series had a known fTPM bug causing random hard freezes with no console output
+   - Both units booted the NixOS installer successfully after this change
+2. Installed NixOS on both, added to repo as `pn01` and `pn02` on VLAN 12
+3. Configured monitoring (node-exporter, promtail, nixos-exporter)
+
+### 2026-02-21: pn02 First Freeze
+
+- pn02 froze approximately 1 hour after boot
+- All three Prometheus targets went down simultaneously — hard freeze, not graceful shutdown
+- Journal on next boot: `system.journal corrupted or uncleanly shut down`
+- Kernel warnings from boot log before freeze:
+  - **TSC clocksource unstable**: `Marking clocksource 'tsc' as unstable because the skew is too large` — TSC skewing ~3.8ms over 500ms relative to HPET watchdog
+  - **AMD PSP error**: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` — Platform Security Processor failing to load trusted application
+- pn01 did not show these warnings on this particular boot, but has shown them historically (see below)
+
+### 2026-02-21: pn02 BIOS Update
+
+- Updated pn02 BIOS to latest version from ASUS website
+- **TSC still unstable** after BIOS update — same ~3.8ms skew
+- **PSP LOAD_TA still failing** after BIOS update
+- Monitoring back up, letting it run to see if freeze recurs
+
+### 2026-02-22: TSC/PSP Confirmed on Both Units
+
+- Checked kernel logs after ~9 hours uptime — both units still running
+- **pn01 now shows TSC unstable and PSP LOAD_TA failure** on this boot (same ~3.8ms TSC skew, same PSP error)
+- pn01 had these same issues historically when tested years ago — the earlier clean boot was just lucky TSC calibration timing
+- **Conclusion**: TSC instability and PSP LOAD_TA are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
+- The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
+- Neither issue is likely the cause of the hard freezes — the fTPM bug remains the primary suspect
+
+### 2026-02-22: Stress Test (1 hour)
+
+- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
+- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
+- Both survived the full hour with no freezes, no MCE errors, no kernel issues
+- No concerning log entries during or after the test
+
+### 2026-02-22: TSC Runtime Switch Test
+
+- Attempted to switch clocksource back to TSC at runtime on pn01:
+  ```
+  echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
+  ```
+- Kernel watchdog immediately reverted to HPET — TSC skew is ongoing, not just a boot-time issue
+- **Conclusion**: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
+- For virtualization (Incus), this means guest VMs will use HPET-backed timing. Performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.
+
+### 2026-02-22: BIOS Tweaks (Both Units)
+
+- Disabled ErP Ready on both (EU energy-efficiency setting that cuts standby power in the S4/S5 soft-off states)
+- Disabled WiFi and Bluetooth in BIOS on both
+- **TSC still unstable** after these changes — same ~3.8ms skew on both units
+- ErP/power states are not the cause of the TSC issue
+
+### 2026-02-22: pn02 Second Freeze
+
+- pn02 froze again ~5.5 hours after boot (at idle, not under load)
+- All Prometheus targets down simultaneously — same hard freeze pattern
+- Last log entry was normal nix-daemon activity — zero warning/error logs before crash
+- Survived the 1h stress test earlier but froze at idle later — not thermal
+- pn01 remains stable throughout
+- **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. The unit now has no console output, but it is managed via SSH.
+- **Action**: Added diagnostic/recovery config to pn02:
+  - `panic=10` + `nmi_watchdog=1` kernel params: `panic=10` auto-reboots 10s after a panic; `nmi_watchdog=1` enables the NMI hard-lockup detector
+  - `softlockup_panic` + `hardlockup_panic` sysctls — convert lockups to panics with stack traces
+  - `hardware.rasdaemon` with recording — logs hardware errors (MCE, PCIe AER, memory) to sqlite database, survives reboots
+  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
+
+## Benign Kernel Errors (Both Units)
+
+These appear on both units and can be ignored:
+- `clocksource: Marking clocksource 'tsc' as unstable` — TSC skew vs HPET, kernel falls back gracefully. Platform-level quirk on PN51-E1, not always reproducible on every boot.
+- `psp gfx command LOAD_TA(0x1) failed` — AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
+- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95` — AMD Sensor Fusion Hub, no sensors connected
+- `Bluetooth: hci0: Reading supported features failed` — Bluetooth init quirk
+- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO` — unused serial bus device
+- `snd_hda_intel: no codecs found` — no audio device connected, headless server
+- `ata2.00: supports DRM functions and may not be fully accessible` — Samsung SSD DRM quirk (pn02 only)
+
+### 2026-02-23: processor.max_cstate=1 and Proxmox Forums
+
+- Found a thread on the Proxmox forums about PN51 units with similar freeze issues
+  - Many users reporting identical symptoms — random hard freezes, no log evidence
+  - No conclusive fix. Some have frequent freezes, others only a few times a month
+  - Some reported BIOS updates helped, but results inconsistent
+- Added `processor.max_cstate=1` kernel parameter to pn02 — limits CPU to C1 halt state, preventing deep C-state sleep transitions that may trigger freezes on AMD mobile chips
+- Mitigations already in place from 2026-02-22: amdgpu blacklist, panic=10, nmi_watchdog=1, softlockup/hardlockup panic, rasdaemon
+
+### 2026-02-23: logind D-Bus Deadlock (pn02)
+
+- node-exporter alert fired — but host was NOT frozen
+- logind was running (PID 871) but deadlocked on D-Bus — not responding to `org.freedesktop.login1` requests
+- Every node-exporter scrape blocked for 25s waiting for logind, causing scrape timeouts
+- Likely related to amdgpu blacklist — no DRM device means no graphical seat, logind may have deadlocked during seat enumeration at boot
+- Fix: `systemctl restart systemd-logind` + `systemctl restart prometheus-node-exporter`
+- After restart, logind responded normally and reported seat0
+
+### 2026-02-27: pn02 Third Freeze
+
+- pn02 crashed again after ~2 days 21 hours uptime (longest run so far)
+- Evidence of crash:
+  - Journal file corrupted: `system.journal corrupted or uncleanly shut down`
+  - Boot partition fsck: `Dirty bit is set. Fs was not properly unmounted`
+  - No orderly shutdown logs from previous boot
+  - No auto-upgrade triggered
+- **NMI watchdog did NOT fire** — no kernel panic logged. This is a true hard lockup below NMI level
+- **rasdaemon recorded nothing** — no MCE, AER, or memory errors in the sqlite database
+- **Positive**: The system auto-rebooted this time (likely hardware watchdog), unlike previous freezes that required manual power cycle
+- `processor.max_cstate=1` may have extended uptime (2d21h vs previous 1h and 5.5h) but did not prevent the freeze
+
+### 2026-02-27 to 2026-03-03: Relative Stability
+
+- pn02 ran without crashes for nearly six days after the third freeze (next crash on 2026-03-04 at ~5.8 days uptime)
+- pn01 continued to be completely stable throughout this period
+- Auto-upgrade reboots continued daily (~4am) on both units — these are planned and healthy
+
+### 2026-03-04: pn02 Fourth Crash — sched_ext Kernel Oops (pstore captured)
+
+- pn02 crashed after ~5.8 days uptime (504566s)
+- **First crash captured by pstore** — kernel oops and panic stack traces preserved across reboot
+- Journal corruption confirmed: `system.journal corrupted or uncleanly shut down`
+- **Crash location**: `RIP: 0010:set_next_task_scx+0x6e/0x210` — crash in the **sched_ext (SCX) scheduler** subsystem
+- **Call trace**: `sysvec_apic_timer_interrupt` → `cpuidle_enter_state` — crashed during CPU idle, triggered by APIC timer interrupt
+- **CR2**: `ffffffffffffff89` — dereferencing an obviously invalid kernel pointer
+- **Kernel**: 6.12.74 (NixOS 25.11)
+- **Significance**: This is the first crash with actual diagnostic output. Previous crashes were silent sub-NMI freezes. The sched_ext scheduler path is a new finding — earlier crashes were assumed to be hardware-level.
+
+### 2026-03-06: pn02 Fifth Crash
+
+- pn02 crashed again — journal corruption on next boot
+- No pstore data captured for this crash
+
+### 2026-03-07: pn02 Sixth and Seventh Crashes — Two in One Day
+
+**First crash (~11:06 UTC):**
+- ~26.6 hours uptime (95994s)
+- **pstore captured both Oops and Panic**
+- **Crash location**: Scheduler code path — `pick_next_task_fair` → `__pick_next_task`
+- **CR2**: `000000c000726000` — invalid pointer dereference
+- **Notable**: `dbus-daemon` segfaulted ~50 minutes before the kernel crash (`segfault at 0` in `libdbus-1.so.3.32.4` on CPU 0) — may indicate memory corruption preceding the kernel crash
+
+**Second crash (~21:15 UTC):**
+- Journal corruption confirmed on next boot
+- No pstore data captured
+
+### 2026-03-07: pn01 Status
+
+- pn01 has had **zero crashes** since initial setup on Feb 21
+- Zero journal corruptions, zero pstore dumps in 30 days
+- Same BOOT_ID maintained between daily auto-upgrade reboots — consistently clean shutdown/reboot cycles
+- All 8 reboots in 30 days are planned auto-upgrade reboots
+- **pn01 is fully stable**
+
+## Crash Summary
+
+| Date | Uptime Before Crash | Crash Type | Diagnostic Data |
+|------|---------------------|------------|-----------------|
+| Feb 21 | ~1h | Silent freeze | None — sub-NMI |
+| Feb 22 | ~5.5h | Silent freeze | None — sub-NMI |
+| Feb 27 | ~2d 21h | Silent freeze | None — sub-NMI, rasdaemon empty |
+| Mar 4 | ~5.8d | **Kernel oops** | pstore: `set_next_task_scx` (sched_ext) |
+| Mar 6 | Unknown | Crash | Journal corruption only |
+| Mar 7 | ~26.6h | **Kernel oops + panic** | pstore: `pick_next_task_fair` (scheduler) + dbus segfault |
+| Mar 7 | Unknown | Crash | Journal corruption only |
+
+## Conclusion
+
+**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).
+
+The pstore crash dumps from March reveal a new dimension: at least some crashes surface as **kernel oopses in scheduler code** (sched_ext and the fair scheduling class), not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. It's unclear whether this is:
+1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
+2. Hardware-induced memory corruption that happens to hit scheduler data structures
+3. A pure software bug in the 6.12.74 kernel's sched_ext implementation
+
+**pn01 is stable**: zero crashes since initial setup on Feb 21. Both units run the same kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to pn02 (board, DIMM, or SSD).
+
+## Next Steps
+
+- **pn02 memtest**: Run memtest86 for 24h+ (available in systemd-boot menu). The crash signatures (userspace segfaults before kernel panics, corrupted pointers in scheduler structures) are consistent with intermittent RAM errors that a quick pass wouldn't catch. If memtest finds errors, swap the DIMM.
+- **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working)
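+- **pn02 investigation**: Could try ruling out sched_ext (check `/sys/kernel/sched_ext/state` for a loaded BPF scheduler, or boot a kernel built without `CONFIG_SCHED_CLASS_EXT`; a `sched_ext.enabled=0` boot parameter is unverified and may not exist) to test whether the crashes stop. This would help distinguish a kernel bug from a hardware defect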
+- **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads
+- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
+- For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration
+
+## Diagnostics and Auto-Recovery (pn02)
+
+Currently deployed on pn02:
+
+```nix
+boot.blacklistedKernelModules = [ "amdgpu" ];
+boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
+boot.kernel.sysctl."kernel.softlockup_panic" = 1;
+boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+hardware.rasdaemon.enable = true;
+hardware.rasdaemon.record = true;
+```
+
+**Crash recovery is working**: pstore now captures kernel oops/panic data, and the system auto-reboots via `panic=10` or SP5100 TCO hardware watchdog.
+
+**After reboot, check:**
+- `ras-mc-ctl --summary` — overview of hardware errors
+- `ras-mc-ctl --errors` — detailed error list
+- `journalctl -b -1 -p err` - error-priority logs (all units, not only the kernel) from the crashed boot, if the panic was logged
+- pstore data is automatically archived by `systemd-pstore.service` and forwarded to Loki via promtail
--- a/docs/plans/remote-access.md
+++ b/docs/plans/remote-access.md
@@ -24,29 +24,20 @@ After evaluating WireGuard gateway vs Headscale (self-hosted Tailscale), the **W

 ## Architecture

-```
-                    ┌─────────────────────────────────┐
-                    │  VPS (OpenStack)                │
-  Laptop/Phone ──→ │  WireGuard endpoint             │
-  (WireGuard)      │  Client peers: laptop, phone    │
-                    │  Routes 10.69.13.0/24 via tunnel│
-                    └──────────┬──────────────────────┘
-                               │ WireGuard tunnel
-                               ▼
-                    ┌─────────────────────────────────┐
-                    │  extgw01 (gateway + bastion)    │
-                    │  - WireGuard tunnel to VPS      │
-                    │  - Firewall (allowlist only)    │
-                    │  - SSH + 2FA (full access)      │
-                    └──────────┬──────────────────────┘
-                               │ allowed traffic only
-                               ▼
-                    ┌─────────────────────────────────┐
-                    │  Internal network 10.69.13.0/24 │
-                    │  - monitoring01:3000 (Grafana)  │
-                    │  - jelly01:8096 (Jellyfin)      │
-                    │  - *-jail hosts (arr stack)     │
-                    └─────────────────────────────────┘
+```mermaid
+graph TD
+    clients["Laptop / Phone"]
+    vps["VPS<br/>(WireGuard endpoint)"]
+    extgw["extgw01<br/>(gateway + bastion)"]
+    grafana["Grafana<br/>monitoring01:3000"]
+    jellyfin["Jellyfin<br/>jelly01:8096"]
+    arr["arr stack<br/>*-jail hosts"]
+
+    clients -->|WireGuard| vps
+    vps -->|WireGuard tunnel| extgw
+    extgw -->|allowed traffic| grafana
+    extgw -->|allowed traffic| jellyfin
+    extgw -->|allowed traffic| arr
 ```

 ### Existing path (unchanged)
--- a/docs/plans/truenas-migration.md
+++ b/docs/plans/truenas-migration.md
@@ -39,23 +39,17 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
 - nzbget: NixOS service or OCI container
 - NFS exports: `services.nfs.server`

-### Filesystem: BTRFS RAID1
+### Filesystem: Keep ZFS

-**Decision**: Migrate from ZFS to BTRFS with RAID1
+**Decision**: Keep existing ZFS pool, import on NixOS

 **Rationale**:
- **In-kernel**: No out-of-tree module issues like ZFS
- **Flexible expansion**: Add individual disks, not required to buy pairs
- **Mixed disk sizes**: Better handling than ZFS multi-vdev approach
- **RAID level conversion**: Can convert between RAID levels in place
- Built-in checksumming, snapshots, compression (zstd)
- NixOS has good BTRFS support
-
-**BTRFS RAID1 notes**:
- "RAID1" means 2 copies of all data
- Distributes across all available devices
- With 6+ disks, provides redundancy + capacity scaling
- RAID5/6 avoided (known issues), RAID1/10 are stable
+- **No data migration needed**: Existing ZFS pool can be imported directly on NixOS
+- **Proven reliability**: Pool has been running reliably on TrueNAS
+- **NixOS ZFS support**: Well-supported, declarative configuration via `boot.zfs` and `services.zfs`
+- **BTRFS RAID5/6 unreliable**: Research showed BTRFS RAID5/6 write hole is still unresolved
+- **BTRFS RAID1 wasteful**: With mixed disk sizes, RAID1 wastes significant capacity vs ZFS mirrors
+- Checksumming, snapshots, compression (lz4/zstd) all available

 ### Hardware: Keep Existing + Add Disks

@@ -69,83 +63,94 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway

 **Storage architecture**:

-**Bulk storage** (BTRFS RAID1 on HDDs):
- Current: 6x HDDs (2x16TB + 2x8TB + 2x8TB)
- Add: 2x new HDDs (size TBD)
+**hdd-pool** (ZFS mirrors):
+- Current: 3 mirror vdevs (2x16TB + 2x8TB + 2x8TB) = 32TB usable
+- Add: mirror-3 with 2x 24TB = +24TB usable
+- Total after expansion: ~56TB usable
 - Use: Media, downloads, backups, non-critical data
- Risk tolerance: High (data mostly replaceable)
-
-**Critical data** (small volume):
- Use 2x 240GB SSDs in mirror (BTRFS or ZFS)
- Or use 2TB NVMe for critical data
- Risk tolerance: Low (data important but small)

 ### Disk Purchase Decision

-**Options under consideration**:
-
-**Option A: 2x 16TB drives**
- Matches largest current drives
- Enables potential future RAID5 if desired (6x 16TB array)
- More conservative capacity increase
-
-**Option B: 2x 20-24TB drives**
- Larger capacity headroom
- Better $/TB ratio typically
- Future-proofs better
-
-**Initial purchase**: 2 drives (chassis has space for 2 more without modifications)
+**Decision**: 2x 24TB drives (ordered, arriving 2026-02-21)

 ## Migration Strategy

 ### High-Level Plan

-1. **Preparation**:
-   - Purchase 2x new HDDs (16TB or 20-24TB)
-   - Create NixOS configuration for new storage host
-   - Set up bare metal NixOS installation
+1. **Expand ZFS pool** (on TrueNAS):
+   - Install 2x 24TB drives (may need new drive trays - order from abroad if needed)
+   - Burn-in: at minimum short + long SMART test before adding to pool
+   - If chassis space is limited, temporarily replace the two oldest 8TB drives (da0/ada4)
+   - Add as mirror-3 vdev to hdd-pool
+   - Verify pool health (adding a vdev does not resilver; wait for resilver only if drives were replaced)
+   - Check SMART data on old 8TB drives (all healthy as of 2026-02-20, no reallocated sectors)

-2. **Initial BTRFS pool**:
-   - Install 2 new disks
-   - Create BTRFS filesystem in RAID1
-   - Mount and test NFS exports
+2. **Prepare NixOS configuration**:
+   - Create host configuration (`hosts/nas1/` or similar)
+   - Configure ZFS pool import (`boot.zfs.extraPools`)
+   - Set up services: radarr, sonarr, nzbget, restic-rest, NFS
+   - Configure monitoring (node-exporter, promtail, smartctl-exporter)

-3. **Data migration**:
-   - Copy data from TrueNAS ZFS pool to new BTRFS pool over 10GbE
-   - Verify data integrity
+3. **Install NixOS**:
+   - `zfs export hdd-pool` on TrueNAS before shutdown (clean export)
+   - Wipe TrueNAS boot-pool SSDs, set up as mdadm RAID1 for NixOS root
+   - Install NixOS on mdadm mirror (keeps boot path ZFS-independent)
+   - Import hdd-pool via `boot.zfs.extraPools`
+   - Verify all datasets mount correctly

-4. **Expand pool**:
-   - As old ZFS pool is emptied, wipe drives and add to BTRFS pool
-   - Pool grows incrementally: 2 → 4 → 6 → 8 disks
-   - BTRFS rebalances data across new devices
+4. **Service migration**:
+   - Configure NixOS services to use ZFS dataset paths
+   - Update NFS exports
+   - Test from consuming hosts

-5. **Service migration**:
-   - Set up radarr/sonarr/nzbget/restic as NixOS services
-   - Update NFS client mounts on consuming hosts
-
-6. **Cutover**:
-   - Point consumers to new NAS host
+5. **Cutover**:
+   - Update DNS/client mounts if IP changes
+   - Verify monitoring integration
   - Decommission TrueNAS
-   - Repurpose hardware or keep as spare
+
+### Post-Expansion: Vdev Rebalancing
+
+ZFS has no built-in rebalance command. After adding the new 24TB vdev, ZFS will
+write new data preferentially to it (most free space), leaving old vdevs packed
+at ~97%. This is suboptimal but not urgent once overall pool usage drops to ~50%.
+
+To gradually rebalance, rewrite files in place so ZFS redistributes blocks across
+all vdevs proportional to free space:
+
+```bash
+# Rewrite files individually (spreads blocks across all vdevs). GNU cp assumed: -a keeps metadata, --reflink=never forces a real copy.
+find /pool/dataset -type f -exec sh -c '
+  for f; do cp -a --reflink=never "$f" "$f.rebal" && mv "$f.rebal" "$f"; done
+' _ {} +
+```
+
+Avoid `zfs send/recv` for large datasets (e.g. 20TB) as this would concentrate
+data on the emptiest vdev rather than spreading it evenly.
+
+**Recommendation**: Do this after NixOS migration is stable. Not urgent - the pool
+will function fine with uneven distribution, just slightly suboptimal for performance.

 ### Migration Advantages

- **Low risk**: New pool created independently, old data remains intact during migration
- **Incremental**: Can add old disks one at a time as space allows
- **Flexible**: BTRFS handles mixed disk sizes gracefully
- **Reversible**: Keep TrueNAS running until fully validated
+- **No data migration**: ZFS pool imported directly, no copying terabytes of data
+- **Low risk**: Pool expansion done on stable TrueNAS before OS swap
+- **Reversible**: Can boot back to TrueNAS if NixOS has issues (ZFS pool is OS-independent)
+- **Quick cutover**: Once NixOS config is ready, the OS swap is fast

 ## Next Steps

-1. Decide on disk size (16TB vs 20-24TB)
-2. Purchase disks
-3. Design NixOS host configuration (`hosts/nas1/`)
-4. Plan detailed migration timeline
-5. Document NFS export mapping (current → new)
+1. ~~Decide on disk size~~ - 2x 24TB ordered
+2. Install drives and add mirror vdev to ZFS pool
+3. Check SMART data on 8TB drives - decide whether to keep or retire
+4. Design NixOS host configuration (`hosts/nas1/`)
+5. Document NFS export mapping (current -> new)
+6. Plan NixOS installation and cutover

 ## Open Questions

- [ ] Final decision on disk size?
 - [ ] Hostname for new NAS host? (nas1? storage1?)
- [ ] IP address allocation (keep 10.69.12.50 or new IP?)
- [ ] Timeline/maintenance window for migration?
+- [ ] IP address/subnet: NAS and Proxmox are both on 10GbE to the same switch but different subnets, forcing traffic through the router (bottleneck). Move to same subnet during migration.
+- [x] Boot drive: Reuse TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root (no ZFS on boot path)
+- [ ] Retire old 8TB drives? (SMART looks healthy, keep unless chassis space is needed)
+- [x] Drive trays: ordered domestically (expected 2026-02-25 to 2026-03-03)
+- [ ] Timeline/maintenance window for NixOS swap?
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1771004123,
-        "narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
+        "lastModified": 1771488195,
+        "narHash": "sha256-2kMxqdDyPluRQRoES22Y0oSjp7pc5fj2nRterfmSIyc=",
        "ref": "master",
-        "rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
-        "revCount": 36,
+        "rev": "2d26de50559d8acb82ea803764e138325d95572c",
+        "revCount": 37,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
@@ -64,11 +64,11 @@
    },
    "nixpkgs": {
      "locked": {
-        "lastModified": 1771043024,
-        "narHash": "sha256-O1XDr7EWbRp+kHrNNgLWgIrB0/US5wvw9K6RERWAj6I=",
+        "lastModified": 1772822230,
+        "narHash": "sha256-yf3iYLGbGVlIthlQIk5/4/EQDZNNEmuqKZkQssMljuw=",
        "owner": "nixos",
        "repo": "nixpkgs",
-        "rev": "3aadb7ca9eac2891d52a9dec199d9580a6e2bf44",
+        "rev": "71caefce12ba78d84fe618cf61644dce01cf3a96",
        "type": "github"
      },
      "original": {
@@ -80,11 +80,11 @@
    },
    "nixpkgs-unstable": {
      "locked": {
-        "lastModified": 1771008912,
-        "narHash": "sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb+ZnAo5RzSxJg=",
+        "lastModified": 1772773019,
+        "narHash": "sha256-E1bxHxNKfDoQUuvriG71+f+s/NT0qWkImXsYZNFFfCs=",
        "owner": "nixos",
        "repo": "nixpkgs",
-        "rev": "a82ccc39b39b621151d6732718e3e250109076fa",
+        "rev": "aca4d95fce4914b3892661bcb80b8087293536c6",
        "type": "github"
      },
      "original": {
--- a/flake.nix
+++ b/flake.nix
@@ -92,15 +92,6 @@
            ./hosts/http-proxy
          ];
        };
-        monitoring01 = nixpkgs.lib.nixosSystem {
-          inherit system;
-          specialArgs = {
-            inherit inputs self;
-          };
-          modules = commonModules ++ [
-            ./hosts/monitoring01
-          ];
-        };
        jelly01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
@@ -209,6 +200,42 @@
            ./hosts/garage01
          ];
        };
+        pn01 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/pn01
+          ];
+        };
+        pn02 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/pn02
+          ];
+        };
+        nrec-nixos01 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/nrec-nixos01
+          ];
+        };
+        openstack-template = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/openstack-template
+          ];
+        };
      };
      packages = forAllSystems (
        { pkgs }:
@@ -227,6 +254,7 @@
              pkgs.openbao
              pkgs.kanidm_1_8
              pkgs.nkeys
+              pkgs.openstackclient
              (pkgs.callPackage ./scripts/create-host { })
              homelab-deploy.packages.${pkgs.system}.default
            ];
--- a/hosts/garage01/configuration.nix
+++ b/hosts/garage01/configuration.nix
@@ -54,10 +54,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/ha1/configuration.nix
+++ b/hosts/ha1/configuration.nix
@@ -46,10 +46,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/http-proxy/configuration.nix
+++ b/hosts/http-proxy/configuration.nix
@@ -18,12 +18,7 @@
    "sonarr"
    "ha"
    "z2m"
-    "grafana"
-    "prometheus"
-    "alertmanager"
    "jelly"
-    "pyroscope"
-    "pushgw"
  ];

  nixpkgs.config.allowUnfree = true;
@@ -57,10 +52,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  vault.enable = true;
  homelab.deploy.enable = true;

--- a/hosts/jelly01/configuration.nix
+++ b/hosts/jelly01/configuration.nix
@@ -44,10 +44,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/kanidm01/configuration.nix
+++ b/hosts/kanidm01/configuration.nix
@@ -55,10 +55,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/monitoring01/configuration.nix
+++ b/hosts/monitoring01/configuration.nix
@@ -1,114 +0,0 @@
-{
-  pkgs,
-  ...
-}:
-
-{
-  imports = [
-    ./hardware-configuration.nix
-
-    ../../system
-    ../../common/vm
-  ];
-
-  homelab.host.role = "monitoring";
-
-  nixpkgs.config.allowUnfree = true;
-  # Use the systemd-boot EFI boot loader.
-  boot.loader.grub = {
-    enable = true;
-    device = "/dev/sda";
-    configurationLimit = 3;
-  };
-
-  networking.hostName = "monitoring01";
-  networking.domain = "home.2rjus.net";
-  networking.useNetworkd = true;
-  networking.useDHCP = false;
-  services.resolved.enable = true;
-  networking.nameservers = [
-    "10.69.13.5"
-    "10.69.13.6"
-  ];
-
-  systemd.network.enable = true;
-  systemd.network.networks."ens18" = {
-    matchConfig.Name = "ens18";
-    address = [
-      "10.69.13.13/24"
-    ];
-    routes = [
-      { Gateway = "10.69.13.1"; }
-    ];
-    linkConfig.RequiredForOnline = "routable";
-  };
-  time.timeZone = "Europe/Oslo";
-
-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
-  nix.settings.tarball-ttl = 0;
-  environment.systemPackages = with pkgs; [
-    vim
-    wget
-    git
-    sqlite
-  ];
-
-  services.qemuGuest.enable = true;
-
-  # Vault secrets management
-  vault.enable = true;
-  homelab.deploy.enable = true;
-  vault.secrets.backup-helper = {
-    secretPath = "shared/backup/password";
-    extractKey = "password";
-    outputDir = "/run/secrets/backup_helper_secret";
-    services = [ "restic-backups-grafana" "restic-backups-grafana-db" ];
-  };
-
-  services.restic.backups.grafana = {
-    repository = "rest:http://10.69.12.52:8000/backup-nix";
-    passwordFile = "/run/secrets/backup_helper_secret";
-    paths = [ "/var/lib/grafana/plugins" ];
-    timerConfig = {
-      OnCalendar = "daily";
-      Persistent = true;
-      RandomizedDelaySec = "2h";
-    };
-    pruneOpts = [
-      "--keep-daily 7"
-      "--keep-weekly 4"
-      "--keep-monthly 6"
-      "--keep-within 1d"
-    ];
-    extraOptions = [ "--retry-lock=5m" ];
-  };
-
-  services.restic.backups.grafana-db = {
-    repository = "rest:http://10.69.12.52:8000/backup-nix";
-    passwordFile = "/run/secrets/backup_helper_secret";
-    command = [ "${pkgs.sqlite}/bin/sqlite3" "/var/lib/grafana/data/grafana.db" ".dump" ];
-    timerConfig = {
-      OnCalendar = "daily";
-      Persistent = true;
-      RandomizedDelaySec = "2h";
-    };
-    pruneOpts = [
-      "--keep-daily 7"
-      "--keep-weekly 4"
-      "--keep-monthly 6"
-      "--keep-within 1d"
-    ];
-    extraOptions = [ "--retry-lock=5m" ];
-  };
-
-  # Open ports in the firewall.
-  # networking.firewall.allowedTCPPorts = [ ... ];
-  # networking.firewall.allowedUDPPorts = [ ... ];
-  # Or disable the firewall altogether.
-  networking.firewall.enable = false;
-
-  system.stateVersion = "23.11"; # Did you read the comment?
-}
--- a/hosts/monitoring01/hardware-configuration.nix
+++ b/hosts/monitoring01/hardware-configuration.nix
@@ -1,42 +0,0 @@
-{
-  config,
-  lib,
-  pkgs,
-  modulesPath,
-  ...
-}:
-
-{
-  imports = [
-    (modulesPath + "/profiles/qemu-guest.nix")
-  ];
-  boot.initrd.availableKernelModules = [
-    "ata_piix"
-    "uhci_hcd"
-    "virtio_pci"
-    "virtio_scsi"
-    "sd_mod"
-    "sr_mod"
-  ];
-  boot.initrd.kernelModules = [ "dm-snapshot" ];
-  boot.kernelModules = [
-    "ptp_kvm"
-  ];
-  boot.extraModulePackages = [ ];
-
-  fileSystems."/" = {
-    device = "/dev/disk/by-label/root";
-    fsType = "xfs";
-  };
-
-  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
-
-  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
-  # (the default) this is the recommended approach. When using systemd-networkd it's
-  # still possible to use this option, but it's recommended to use it in conjunction
-  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
-  networking.useDHCP = lib.mkDefault true;
-  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
-
-  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
-}
--- a/hosts/monitoring02/configuration.nix
+++ b/hosts/monitoring02/configuration.nix
@@ -18,7 +18,7 @@
    role = "monitoring";
  };

-  homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" "loki" ];
+  homelab.dns.cnames = [ "monitoring" "alertmanager" "grafana" "grafana-test" "metrics" "vmalert" "loki" ];

  # Enable Vault integration
  vault.enable = true;
@@ -53,10 +53,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/monitoring02/default.nix
+++ b/hosts/monitoring02/default.nix
@@ -4,5 +4,9 @@
    ../../services/grafana
    ../../services/victoriametrics
    ../../services/loki
+    ../../services/monitoring/alerttonotify.nix
+    ../../services/monitoring/blackbox.nix
+    ../../services/monitoring/exportarr.nix
+    ../../services/monitoring/pve.nix
  ];
 }
--- a/hosts/nats1/configuration.nix
+++ b/hosts/nats1/configuration.nix
@@ -44,10 +44,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/nix-cache02/builder.nix
+++ b/hosts/nix-cache02/builder.nix
@@ -25,7 +25,7 @@
      };
    };

-    timeout = 7200;
+    timeout = 14400;
    metrics.enable = true;
  };

--- a/hosts/nix-cache02/configuration.nix
+++ b/hosts/nix-cache02/configuration.nix
@@ -53,10 +53,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/nrec-nixos01/configuration.nix
+++ b/hosts/nrec-nixos01/configuration.nix
@@ -0,0 +1,78 @@
+{
+  lib,
+  pkgs,
+  ...
+}:
+
+{
+  services.openssh = {
+    enable = true;
+    settings = {
+      PermitRootLogin = lib.mkForce "no";
+      PasswordAuthentication = false;
+    };
+  };
+
+  users.users.nixos = {
+    isNormalUser = true;
+    extraGroups = [ "wheel" ];
+    shell = pkgs.zsh;
+    openssh.authorizedKeys.keys = [
+      "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAwfb2jpKrBnCw28aevnH8HbE5YbcMXpdaVv2KmueDu6 torjus@gunter"
+    ];
+  };
+  security.sudo.wheelNeedsPassword = false;
+  programs.zsh.enable = true;
+
+  homelab.dns.enable = false;
+  homelab.monitoring.enable = false;
+  homelab.host.labels.ansible = "false";
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/nixos";
+    fsType = "ext4";
+    autoResize = true;
+  };
+
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+  networking.hostName = "nrec-nixos01";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  services.resolved.enable = true;
+
+  systemd.network.enable = true;
+  systemd.network.networks."ens3" = {
+    matchConfig.Name = "ens3";
+    networkConfig.DHCP = "ipv4";
+    linkConfig.RequiredForOnline = "routable";
+  };
+  time.timeZone = "Europe/Oslo";
+
+  networking.firewall.enable = true;
+  networking.firewall.allowedTCPPorts = [
+    22
+    80
+    443
+  ];
+
+  nix.settings.substituters = [
+    "https://cache.nixos.org"
+  ];
+  nix.settings.trusted-public-keys = [
+    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+  ];
+
+  services.caddy = {
+    enable = true;
+    virtualHosts."nrec-nixos01.t-juice.club" = {
+      extraConfig = ''
+        reverse_proxy 127.0.0.1:3000
+      '';
+    };
+  };
+
+  zramSwap.enable = true;
+
+  system.stateVersion = "25.11";
+}
--- a/hosts/nrec-nixos01/default.nix
+++ b/hosts/nrec-nixos01/default.nix
@@ -0,0 +1,9 @@
+{ modulesPath, ... }:
+{
+  imports = [
+    ./configuration.nix
+    ../../system/packages.nix
+    ../../services/forgejo
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+}
--- a/hosts/ns1/configuration.nix
+++ b/hosts/ns1/configuration.nix
@@ -58,10 +58,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/ns2/configuration.nix
+++ b/hosts/ns2/configuration.nix
@@ -58,10 +58,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/openstack-template/configuration.nix
+++ b/hosts/openstack-template/configuration.nix
@@ -0,0 +1,72 @@
+{
+  lib,
+  pkgs,
+  ...
+}:
+
+{
+  services.openssh = {
+    enable = true;
+    settings = {
+      PermitRootLogin = lib.mkForce "no";
+      PasswordAuthentication = false;
+    };
+  };
+
+  users.users.nixos = {
+    isNormalUser = true;
+    extraGroups = [ "wheel" ];
+    shell = pkgs.zsh;
+    openssh.authorizedKeys.keys = [
+      "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAwfb2jpKrBnCw28aevnH8HbE5YbcMXpdaVv2KmueDu6 torjus@gunter"
+    ];
+  };
+  security.sudo.wheelNeedsPassword = false;
+  programs.zsh.enable = true;
+
+  homelab.dns.enable = false;
+  homelab.monitoring.enable = false;
+  homelab.host.labels.ansible = "false";
+
+  # Minimal fileSystems for evaluation; openstack-config.nix overrides this at image build time
+  fileSystems."/" = {
+    device = lib.mkDefault "/dev/vda1";
+    fsType = lib.mkDefault "ext4";
+  };
+
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+  networking.hostName = "nixos-openstack-template";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  services.resolved.enable = true;
+
+  systemd.network.enable = true;
+  systemd.network.networks."ens3" = {
+    matchConfig.Name = "ens3";
+    networkConfig.DHCP = "ipv4";
+    linkConfig.RequiredForOnline = "routable";
+  };
+  time.timeZone = "Europe/Oslo";
+
+  networking.firewall.enable = true;
+  networking.firewall.allowedTCPPorts = [ 22 ];
+
+  nix.settings.substituters = [
+    "https://cache.nixos.org"
+  ];
+  nix.settings.trusted-public-keys = [
+    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+  ];
+
+  environment.systemPackages = with pkgs; [
+    age
+    vim
+    wget
+    git
+  ];
+
+  zramSwap.enable = true;
+
+  system.stateVersion = "25.11";
+}
--- a/hosts/openstack-template/default.nix
+++ b/hosts/openstack-template/default.nix
@@ -2,6 +2,6 @@
 {
  imports = [
    ./configuration.nix
-    ../../services/monitoring
+    ../../system/packages.nix
  ];
 }
--- a/hosts/pn01/configuration.nix
+++ b/hosts/pn01/configuration.nix
@@ -0,0 +1,54 @@
+{
+  config,
+  lib,
+  pkgs,
+  ...
+}:
+
+{
+  imports = [
+    ./hardware-configuration.nix
+    ../../system
+  ];
+
+  boot.loader.systemd-boot.enable = true;
+  boot.loader.systemd-boot.memtest86.enable = true;
+  boot.loader.efi.canTouchEfiVariables = true;
+
+  networking.hostName = "pn01";
+  networking.domain = "home.2rjus.net";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  networking.firewall.enable = false;
+  services.resolved.enable = true;
+  networking.nameservers = [
+    "10.69.13.5"
+    "10.69.13.6"
+  ];
+
+  systemd.network.enable = true;
+  systemd.network.networks."enp2s0" = {
+    matchConfig.Name = "enp2s0";
+    address = [
+      "10.69.12.60/24"
+    ];
+    routes = [
+      { Gateway = "10.69.12.1"; }
+    ];
+    linkConfig.RequiredForOnline = "routable";
+  };
+
+  time.timeZone = "Europe/Oslo";
+
+  homelab.host = {
+    tier = "test";
+    priority = "low";
+    role = "compute";
+  };
+
+  vault.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
+
+  system.stateVersion = "25.11";
+}
--- a/hosts/pn01/default.nix
+++ b/hosts/pn01/default.nix
@@ -0,0 +1,5 @@
+{ ... }: {
+  imports = [
+    ./configuration.nix
+  ];
+}
--- a/hosts/pn01/hardware-configuration.nix
+++ b/hosts/pn01/hardware-configuration.nix
@@ -0,0 +1,33 @@
+# Do not modify this file!  It was generated by ‘nixos-generate-config’
+# and may be overwritten by future invocations.  Please make changes
+# to /etc/nixos/configuration.nix instead.
+{ config, lib, pkgs, modulesPath, ... }:
+
+{
+  imports =
+    [ (modulesPath + "/installer/scan/not-detected.nix")
+    ];
+
+  boot.initrd.availableKernelModules = [ "xhci_pci" "nvme" "ahci" "usb_storage" "usbhid" "sd_mod" "rtsx_usb_sdmmc" ];
+  boot.initrd.kernelModules = [ ];
+  boot.kernelModules = [ "kvm-amd" ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" =
+    { device = "/dev/disk/by-uuid/9444cf54-80e0-4315-adca-8ddd5037217c";
+      fsType = "ext4";
+    };
+
+  fileSystems."/boot" =
+    { device = "/dev/disk/by-uuid/D897-146F";
+      fsType = "vfat";
+      options = [ "fmask=0022" "dmask=0022" ];
+    };
+
+  swapDevices =
+    [ { device = "/dev/disk/by-uuid/6c1e775f-342e-463a-a7f9-d7ce6593a482"; }
+    ];
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+  hardware.cpu.amd.updateMicrocode = lib.mkDefault config.hardware.enableRedistributableFirmware;
+}
--- a/hosts/pn02/configuration.nix
+++ b/hosts/pn02/configuration.nix
@@ -0,0 +1,61 @@
+{
+  config,
+  lib,
+  pkgs,
+  ...
+}:
+
+{
+  imports = [
+    ./hardware-configuration.nix
+    ../../system
+  ];
+
+  boot.loader.systemd-boot.enable = true;
+  boot.loader.systemd-boot.memtest86.enable = true;
+  boot.loader.efi.canTouchEfiVariables = true;
+  boot.blacklistedKernelModules = [ "amdgpu" ];
+  boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
+  boot.kernel.sysctl."kernel.softlockup_panic" = 1;
+  boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+
+  hardware.rasdaemon.enable = true;
+  hardware.rasdaemon.record = true;
+
+  networking.hostName = "pn02";
+  networking.domain = "home.2rjus.net";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  networking.firewall.enable = false;
+  services.resolved.enable = true;
+  networking.nameservers = [
+    "10.69.13.5"
+    "10.69.13.6"
+  ];
+
+  systemd.network.enable = true;
+  systemd.network.networks."enp2s0" = {
+    matchConfig.Name = "enp2s0";
+    address = [
+      "10.69.12.61/24"
+    ];
+    routes = [
+      { Gateway = "10.69.12.1"; }
+    ];
+    linkConfig.RequiredForOnline = "routable";
+  };
+
+  time.timeZone = "Europe/Oslo";
+
+  homelab.host = {
+    tier = "test";
+    priority = "low";
+    role = "compute";
+  };
+
+  vault.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
+
+  system.stateVersion = "25.11";
+}
--- a/hosts/pn02/default.nix
+++ b/hosts/pn02/default.nix
@@ -0,0 +1,5 @@
+{ ... }: {
+  imports = [
+    ./configuration.nix
+  ];
+}
--- a/hosts/pn02/hardware-configuration.nix
+++ b/hosts/pn02/hardware-configuration.nix
@@ -0,0 +1,33 @@
+# Do not modify this file!  It was generated by ‘nixos-generate-config’
+# and may be overwritten by future invocations.  Please make changes
+# to /etc/nixos/configuration.nix instead.
+{ config, lib, pkgs, modulesPath, ... }:
+
+{
+  imports =
+    [ (modulesPath + "/installer/scan/not-detected.nix")
+    ];
+
+  boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "usb_storage" "usbhid" "sd_mod" "rtsx_usb_sdmmc" ];
+  boot.initrd.kernelModules = [ ];
+  boot.kernelModules = [ "kvm-amd" ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" =
+    { device = "/dev/disk/by-uuid/1d28b629-51ae-4f0e-b440-9388c2e48413";
+      fsType = "ext4";
+    };
+
+  fileSystems."/boot" =
+    { device = "/dev/disk/by-uuid/A5A7-C7B2";
+      fsType = "vfat";
+      options = [ "fmask=0022" "dmask=0022" ];
+    };
+
+  swapDevices =
+    [ { device = "/dev/disk/by-uuid/f2570894-0922-4746-84c7-2b2fe7601ea1"; }
+    ];
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+  hardware.cpu.amd.updateMicrocode = lib.mkDefault config.hardware.enableRedistributableFirmware;
+}
--- a/hosts/template2/bootstrap.nix
+++ b/hosts/template2/bootstrap.nix
@@ -6,7 +6,8 @@ let
    text = ''
      set -euo pipefail

-      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+      LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
+      LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"

      # Send a log entry to Loki with bootstrap status
      # Usage: log_to_loki <stage> <message>
@@ -36,8 +37,14 @@ let
            }]
          }')

+        local auth_args=()
+        if [[ -f "$LOKI_AUTH_FILE" ]]; then
+          auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
+        fi
+
        curl -s --connect-timeout 2 --max-time 5 \
          -X POST \
+          "''${auth_args[@]}" \
          -H "Content-Type: application/json" \
          -d "$payload" \
          "$LOKI_URL" >/dev/null 2>&1 || true
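The optional-auth pattern added in this hunk can be tried in isolation. The following is a minimal sketch, not the bootstrap script itself: `build_auth_args` is a hypothetical helper, and a temporary file stands in for `/run/secrets/promtail-loki-auth`. It only prints the arguments that would be handed to curl and makes no network request. Note that inside the Nix string the expansion is written `''${auth_args[@]}` to escape Nix interpolation; plain bash uses `${auth_args[@]}`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the repository): prints one curl
# argument per line, or nothing when the auth file does not exist.
build_auth_args() {
  local auth_file="$1"
  local auth_args=()
  if [[ -f "$auth_file" ]]; then
    auth_args=(-u "promtail:$(cat "$auth_file")")
  fi
  # Under `set -u`, expanding an empty array is only safe on bash 4.4 or
  # newer, so the length is checked before printing.
  if (( ${#auth_args[@]} > 0 )); then
    printf '%s\n' "${auth_args[@]}"
  fi
}

# Stand-in for the real secret file; the password is an example value.
tmp="$(mktemp)"
printf 'example-password' > "$tmp"

build_auth_args "$tmp"                    # prints "-u" and "promtail:example-password"
build_auth_args "/nonexistent/auth-file"  # prints nothing

rm -f "$tmp"
```

Keeping the credentials in an array means a host without the secret file sends the same unauthenticated request as before, with no empty `-u ""` argument reaching curl.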
--- a/hosts/template2/configuration.nix
+++ b/hosts/template2/configuration.nix
@@ -54,10 +54,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  nix.settings.substituters = [
    "https://nix-cache.home.2rjus.net"
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -55,10 +55,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/testvm02/configuration.nix
+++ b/hosts/testvm02/configuration.nix
@@ -55,10 +55,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/testvm03/configuration.nix
+++ b/hosts/testvm03/configuration.nix
@@ -55,10 +55,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/hosts/vault01/configuration.nix
+++ b/hosts/vault01/configuration.nix
@@ -45,10 +45,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/lib/monitoring.nix
+++ b/lib/monitoring.nix
@@ -94,7 +94,15 @@ let
        })
        (externalTargets.nodeExporter or [ ]);

-      allEntries = flakeEntries ++ externalEntries;
+      # Node-exporter-only external targets (no systemd-exporter)
+      externalOnlyEntries = map
+        (target: {
+          inherit target;
+          labels = { hostname = extractHostnameFromTarget target; };
+        })
+        (externalTargets.nodeExporterOnly or [ ]);
+
+      allEntries = flakeEntries ++ externalEntries ++ externalOnlyEntries;

      # Group entries by their label set for efficient static_configs
      # Convert labels attrset to a string key for grouping
@@ -203,7 +211,18 @@ let
    in
    flakeScrapeConfigs ++ externalScrapeConfigs;

+  # Generate systemd-exporter targets (excludes nodeExporterOnly hosts)
+  generateSystemdExporterTargets = self: externalTargets:
+    let
+      nodeTargets = generateNodeExporterTargets self (externalTargets // { nodeExporterOnly = [ ]; });
+    in
+    map
+      (cfg: cfg // {
+        targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+      })
+      nodeTargets;
+
 in
 {
-  inherit extractHostMonitoring generateNodeExporterTargets generateScrapeConfigs;
+  inherit extractHostMonitoring generateNodeExporterTargets generateScrapeConfigs generateSystemdExporterTargets;
 }
--- a/scripts/create-host/templates/configuration.nix.j2
+++ b/scripts/create-host/templates/configuration.nix.j2
@@ -56,10 +56,7 @@
  };
  time.timeZone = "Europe/Oslo";

-  nix.settings.experimental-features = [
-    "nix-command"
-    "flakes"
-  ];
+
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
--- a/scripts/vault-fetch/README.md
+++ b/scripts/vault-fetch/README.md
@@ -20,10 +20,10 @@ vault-fetch <secret-path> <output-directory> [cache-directory]

 ```bash
-# Fetch Grafana admin secrets
-vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
+# Fetch the MQTT password for ha1
+vault-fetch hosts/ha1/mqtt-password /run/secrets/mqtt-password /var/lib/vault/cache/mqtt-password

 # Use default cache location
-vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana
+vault-fetch hosts/ha1/mqtt-password /run/secrets/mqtt-password
 ```

 ## How It Works
@@ -53,13 +53,13 @@ If Vault is unreachable or authentication fails:
 This tool is designed to be called from systemd service `ExecStartPre` hooks via the `vault.secrets` NixOS module:

 ```nix
-vault.secrets.grafana-admin = {
-  secretPath = "hosts/monitoring01/grafana-admin";
+vault.secrets.mqtt-password = {
+  secretPath = "hosts/ha1/mqtt-password";
 };

 # Service automatically gets secrets fetched before start
-systemd.services.grafana.serviceConfig = {
-  EnvironmentFile = "/run/secrets/grafana-admin/password";
+systemd.services.mosquitto.serviceConfig = {
+  EnvironmentFile = "/run/secrets/mqtt-password/password";
 };
 ```

--- a/scripts/vault-fetch/vault-fetch.sh
+++ b/scripts/vault-fetch/vault-fetch.sh
@@ -5,7 +5,7 @@ set -euo pipefail
 #
 # Usage: vault-fetch <secret-path> <output-directory> [cache-directory]
 #
-# Example: vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
+# Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/mqtt-password /var/lib/vault/cache/mqtt-password
 #
 # This script:
 # 1. Authenticates to Vault using AppRole credentials from /var/lib/vault/approle/
@@ -17,7 +17,7 @@ set -euo pipefail
 # Parse arguments
 if [ $# -lt 2 ]; then
    echo "Usage: vault-fetch <secret-path> <output-directory> [cache-directory]" >&2
-    echo "Example: vault-fetch hosts/monitoring01/grafana /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
+    echo "Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/mqtt-password /var/lib/vault/cache/mqtt-password" >&2
    exit 1
 fi

--- a/services/forgejo/default.nix
+++ b/services/forgejo/default.nix
@@ -0,0 +1,19 @@
+{ ... }:
+{
+  services.forgejo = {
+    enable = true;
+    database.type = "sqlite3";
+    settings = {
+      server = {
+        DOMAIN = "nrec-nixos01.t-juice.club";
+        ROOT_URL = "https://nrec-nixos01.t-juice.club/";
+        HTTP_ADDR = "127.0.0.1";
+        HTTP_PORT = 3000;
+        LFS_START_SERVER = true;
+      };
+      service.DISABLE_REGISTRATION = true;
+      "service.explore".REQUIRE_SIGNIN_VIEW = true;
+      session.COOKIE_SECURE = true;
+    };
+  };
+}
--- a/services/grafana/dashboards/apiary.json
+++ b/services/grafana/dashboards/apiary.json
@@ -19,7 +19,7 @@
      "title": "SSH Connections",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})",
@@ -51,7 +51,7 @@
      "title": "Active Sessions",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "oubliette_sessions_active{job=\"apiary\"}",
@@ -86,7 +86,7 @@
      "title": "Unique IPs",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "oubliette_storage_unique_ips{job=\"apiary\"}",
@@ -118,7 +118,7 @@
      "title": "Total Login Attempts",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "oubliette_storage_login_attempts_total{job=\"apiary\"}",
@@ -150,7 +150,7 @@
      "title": "SSH Connections Over Time",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "interval": "60s",
      "targets": [
        {
@@ -183,7 +183,7 @@
      "title": "Auth Attempts Over Time",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "interval": "60s",
      "targets": [
        {
@@ -216,7 +216,7 @@
      "title": "Sessions by Shell",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "interval": "60s",
      "targets": [
        {
@@ -249,7 +249,7 @@
      "title": "Attempts by Country",
      "type": "geomap",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 12},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "oubliette_auth_attempts_by_country_total{job=\"apiary\"}",
@@ -318,7 +318,7 @@
      "title": "Session Duration Distribution",
      "type": "heatmap",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "interval": "60s",
      "targets": [
        {
@@ -359,7 +359,7 @@
      "title": "Commands Executed by Shell",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "interval": "60s",
      "targets": [
        {
@@ -386,6 +386,107 @@
        "tooltip": {"mode": "multi", "sort": "desc"}
      },
      "description": "Rate of commands executed in honeypot shells"
+    },
+    {
+      "id": 11,
+      "title": "Storage Query Duration by Method",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 38},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "interval": "60s",
+      "targets": [
+        {
+          "expr": "rate(oubliette_storage_query_duration_seconds_sum{job=\"apiary\"}[$__rate_interval]) / rate(oubliette_storage_query_duration_seconds_count{job=\"apiary\"}[$__rate_interval])",
+          "legendFormat": "{{method}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "custom": {
+            "drawStyle": "line",
+            "lineInterpolation": "smooth",
+            "fillOpacity": 10,
+            "pointSize": 5,
+            "showPoints": "auto",
+            "stacking": {"mode": "none"}
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      },
+      "description": "Average query duration per storage method over time"
+    },
+    {
+      "id": 12,
+      "title": "Storage Query Rate by Method",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 38},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "interval": "60s",
+      "targets": [
+        {
+          "expr": "rate(oubliette_storage_query_duration_seconds_count{job=\"apiary\"}[$__rate_interval])",
+          "legendFormat": "{{method}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ops",
+          "custom": {
+            "drawStyle": "line",
+            "lineInterpolation": "smooth",
+            "fillOpacity": 10,
+            "pointSize": 5,
+            "showPoints": "auto",
+            "stacking": {"mode": "none"}
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      },
+      "description": "Query execution rate per storage method"
+    },
+    {
+      "id": 13,
+      "title": "Storage Query Errors",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 46},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "targets": [
+        {
+          "expr": "sum(oubliette_storage_query_errors_total{job=\"apiary\"})",
+          "legendFormat": "Errors",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 1},
+              {"color": "red", "value": 10}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none",
+        "textMode": "auto"
+      },
+      "description": "Total storage query errors"
    }
  ]
 }
--- a/services/grafana/dashboards/certificates.json
+++ b/services/grafana/dashboards/certificates.json
@@ -16,7 +16,7 @@
      "title": "Endpoints Monitored",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})",
@@ -48,7 +48,7 @@
      "title": "Probe Failures",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)",
@@ -82,7 +82,7 @@
      "title": "Expiring Soon (< 7d)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)",
@@ -116,7 +116,7 @@
      "title": "Expiring Critical (< 24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)",
@@ -150,7 +150,7 @@
      "title": "Minimum Days Remaining",
      "type": "gauge",
      "gridPos": {"h": 4, "w": 8, "x": 16, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)",
@@ -187,7 +187,7 @@
      "title": "Certificate Expiry by Endpoint",
      "type": "table",
      "gridPos": {"h": 12, "w": 12, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -253,7 +253,7 @@
      "title": "Probe Status",
      "type": "table",
      "gridPos": {"h": 12, "w": 12, "x": 12, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "probe_success{job=\"blackbox_tls\"}",
@@ -340,7 +340,7 @@
      "title": "Certificate Expiry Over Time",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -378,7 +378,7 @@
      "title": "Probe Success Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100",
@@ -418,7 +418,7 @@
      "title": "Probe Duration",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "probe_duration_seconds{job=\"blackbox_tls\"}",
--- a/services/grafana/dashboards/nixos-fleet.json
+++ b/services/grafana/dashboards/nixos-fleet.json
@@ -15,7 +15,7 @@
      {
        "name": "tier",
        "type": "query",
-        "datasource": {"type": "prometheus", "uid": "prometheus"},
+        "datasource": {"type": "prometheus", "uid": "victoriametrics"},
        "query": "label_values(nixos_flake_info, tier)",
        "refresh": 2,
        "includeAll": true,
@@ -30,7 +30,7 @@
      "title": "Hosts Behind Remote",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 1)",
@@ -65,7 +65,7 @@
      "title": "Hosts Needing Reboot",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(nixos_config_mismatch{tier=~\"$tier\"} == 1)",
@@ -100,7 +100,7 @@
      "title": "Total Hosts",
      "type": "stat",
      "gridPos": {"h": 4, "w": 3, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(nixos_flake_info{tier=~\"$tier\"})",
@@ -128,7 +128,7 @@
      "title": "Nixpkgs Age",
      "type": "stat",
      "gridPos": {"h": 4, "w": 3, "x": 11, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "max(nixos_flake_input_age_seconds{input=\"nixpkgs\", tier=~\"$tier\"})",
@@ -163,7 +163,7 @@
      "title": "Hosts Up-to-date",
      "type": "stat",
      "gridPos": {"h": 4, "w": 3, "x": 14, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 0)",
@@ -192,7 +192,7 @@
      "title": "Deployments (24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 3, "x": 17, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_deployments_total{status=\"completed\"}[24h]))",
@@ -222,7 +222,7 @@
      "title": "Avg Deploy Time",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_deployment_duration_seconds_sum{success=\"true\"}[24h])) / sum(increase(homelab_deploy_deployment_duration_seconds_count{success=\"true\"}[24h]))",
@@ -256,7 +256,7 @@
      "title": "Fleet Status",
      "type": "table",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "nixos_flake_info{tier=~\"$tier\"}",
@@ -430,7 +430,7 @@
      "title": "Generation Age by Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sort_desc(nixos_generation_age_seconds{tier=~\"$tier\"})",
@@ -467,7 +467,7 @@
      "title": "Generations per Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sort_desc(nixos_generation_count{tier=~\"$tier\"})",
@@ -501,7 +501,7 @@
      "title": "Deployment Activity (Generation Age Over Time)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
@@ -534,7 +534,7 @@
      "title": "Flake Input Ages",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "max by (input) (nixos_flake_input_age_seconds)",
@@ -577,7 +577,7 @@
      "title": "Hosts by Revision",
      "type": "piechart",
      "gridPos": {"h": 6, "w": 6, "x": 12, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count by (current_rev) (nixos_flake_info{tier=~\"$tier\"})",
@@ -601,7 +601,7 @@
      "title": "Hosts by Tier",
      "type": "piechart",
      "gridPos": {"h": 6, "w": 6, "x": 18, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count by (tier) (nixos_flake_info)",
@@ -641,7 +641,7 @@
      "title": "Builds (24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[24h]))",
@@ -671,7 +671,7 @@
      "title": "Failed Builds (24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[24h])) or vector(0)",
@@ -705,7 +705,7 @@
      "title": "Last Build",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "time() - max(homelab_deploy_build_last_timestamp)",
@@ -739,7 +739,7 @@
      "title": "Avg Build Time",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_build_duration_seconds_sum[24h])) / sum(increase(homelab_deploy_build_duration_seconds_count[24h]))",
@@ -773,7 +773,7 @@
      "title": "Total Hosts Built",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(homelab_deploy_build_duration_seconds_count)",
@@ -802,7 +802,7 @@
      "title": "Build Jobs (24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_builds_total[24h]))",
@@ -832,7 +832,7 @@
      "title": "Build Time by Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 41},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sort_desc(homelab_deploy_build_duration_seconds_sum / homelab_deploy_build_duration_seconds_count)",
@@ -869,7 +869,7 @@
      "title": "Build Count by Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 41},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sort_desc(sum by (host) (homelab_deploy_build_host_total))",
@@ -903,7 +903,7 @@
      "title": "Build Activity",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 49},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[1h]))",
--- a/services/grafana/dashboards/node-exporter.json
+++ b/services/grafana/dashboards/node-exporter.json
@@ -11,7 +11,7 @@
      {
        "name": "instance",
        "type": "query",
-        "datasource": {"type": "prometheus", "uid": "prometheus"},
+        "datasource": {"type": "prometheus", "uid": "victoriametrics"},
        "query": "label_values(node_uname_info, instance)",
        "refresh": 2,
        "includeAll": false,
@@ -26,7 +26,7 @@
      "title": "CPU Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
@@ -55,7 +55,7 @@
      "title": "Memory Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
@@ -84,7 +84,7 @@
      "title": "Disk Usage",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 8, "x": 0, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
@@ -113,7 +113,7 @@
      "title": "System Load",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "node_load1{instance=~\"$instance\"}",
@@ -142,7 +142,7 @@
      "title": "Uptime",
      "type": "stat",
      "gridPos": {"h": 8, "w": 8, "x": 16, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "time() - node_boot_time_seconds{instance=~\"$instance\"}",
@@ -161,7 +161,7 @@
      "title": "Network Traffic",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
@@ -185,7 +185,7 @@
      "title": "Disk I/O",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",
--- a/services/grafana/dashboards/proxmox.json
+++ b/services/grafana/dashboards/proxmox.json
@@ -15,7 +15,7 @@
      {
        "name": "vm",
        "type": "query",
-        "datasource": {"type": "prometheus", "uid": "prometheus"},
+        "datasource": {"type": "prometheus", "uid": "victoriametrics"},
        "query": "label_values(pve_guest_info{template=\"0\"}, name)",
        "refresh": 2,
        "includeAll": true,
@@ -30,7 +30,7 @@
      "title": "VMs Running",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 1)",
@@ -56,7 +56,7 @@
      "title": "VMs Stopped",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 0)",
@@ -87,7 +87,7 @@
      "title": "Node CPU",
      "type": "gauge",
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "pve_cpu_usage_ratio{id=~\"node/.*\"} * 100",
@@ -120,7 +120,7 @@
      "title": "Node Memory",
      "type": "gauge",
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
@@ -153,7 +153,7 @@
      "title": "Node Uptime",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "pve_uptime_seconds{id=~\"node/.*\"}",
@@ -180,7 +180,7 @@
      "title": "Templates",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(pve_guest_info{template=\"1\"})",
@@ -206,7 +206,7 @@
      "title": "VM Status",
      "type": "table",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -362,7 +362,7 @@
      "title": "VM CPU Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
@@ -391,7 +391,7 @@
      "title": "VM Memory Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -420,7 +420,7 @@
      "title": "VM Network Traffic",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -453,7 +453,7 @@
      "title": "VM Disk I/O",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -486,7 +486,7 @@
      "title": "Storage Usage",
      "type": "bargauge",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
@@ -531,7 +531,7 @@
      "title": "Storage Capacity",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",
--- a/services/grafana/dashboards/systemd.json
+++ b/services/grafana/dashboards/systemd.json
@@ -15,7 +15,7 @@
      {
        "name": "hostname",
        "type": "query",
-        "datasource": {"type": "prometheus", "uid": "prometheus"},
+        "datasource": {"type": "prometheus", "uid": "victoriametrics"},
        "query": "label_values(systemd_unit_state, hostname)",
        "refresh": 2,
        "includeAll": true,
@@ -30,7 +30,7 @@
      "title": "Failed Units",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
@@ -60,7 +60,7 @@
      "title": "Active Units",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
@@ -86,7 +86,7 @@
      "title": "Hosts Monitored",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
@@ -112,7 +112,7 @@
      "title": "Total Service Restarts",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
@@ -143,7 +143,7 @@
      "title": "Inactive Units",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
@@ -169,7 +169,7 @@
      "title": "Timers",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
@@ -195,7 +195,7 @@
      "title": "Failed Units",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
@@ -251,7 +251,7 @@
      "title": "Service Restarts (Top 15)",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
@@ -309,7 +309,7 @@
      "title": "Active Units per Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
@@ -339,7 +339,7 @@
      "title": "NixOS Upgrade Timers",
      "type": "table",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
@@ -429,7 +429,7 @@
      "title": "Backup Timers",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
@@ -524,7 +524,7 @@
      "title": "Service Restarts Over Time",
      "type": "timeseries",
      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",
--- a/services/grafana/dashboards/temperature.json
+++ b/services/grafana/dashboards/temperature.json
@@ -19,7 +19,7 @@
      "title": "Current Temperatures",
      "type": "stat",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -71,7 +71,7 @@
      "title": "Average Home Temperature",
      "type": "gauge",
      "gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
@@ -108,7 +108,7 @@
      "title": "Current Humidity",
      "type": "stat",
      "gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
@@ -154,7 +154,7 @@
      "title": "Temperature History (30 Days)",
      "type": "timeseries",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -207,7 +207,7 @@
      "title": "Temperature Trend (1h rate of change)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
@@ -268,7 +268,7 @@
      "title": "24h Min / Max / Avg",
      "type": "table",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
@@ -346,7 +346,7 @@
      "title": "Humidity History (30 Days)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
      "targets": [
        {
          "expr": "hass_sensor_humidity_percent",
--- a/services/grafana/default.nix
+++ b/services/grafana/default.nix
@@ -37,6 +37,10 @@
    # Declarative datasources
    provision.datasources.settings = {
      apiVersion = 1;
+      prune = true;
+      deleteDatasources = [
+        { name = "Prometheus (monitoring01)"; orgId = 1; }
+      ];
      datasources = [
        {
          name = "VictoriaMetrics";
@@ -45,13 +49,7 @@
          isDefault = true;
          uid = "victoriametrics";
        }
-        {
-          name = "Prometheus (monitoring01)";
-          type = "prometheus";
-          url = "http://monitoring01.home.2rjus.net:9090";
-          uid = "prometheus";
-        }
-        {
+{
          name = "Loki";
          type = "loki";
          url = "http://localhost:3100";
@@ -91,6 +89,14 @@
      acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
      metrics
    '';
+    virtualHosts."grafana.home.2rjus.net".extraConfig = ''
+      log {
+        output file /var/log/caddy/grafana.log {
+          mode 644
+        }
+      }
+      reverse_proxy http://127.0.0.1:3000
+    '';
    virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
      log {
        output file /var/log/caddy/grafana.log {
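
Review note: the provisioning change above deletes the `Prometheus (monitoring01)` datasource (uid `prometheus`), so any dashboard panel or template variable that still references that uid would presumably stop resolving. The dashboards in this diff keep `"type": "prometheus"` and change only the uid to `victoriametrics`. The following is a minimal sketch of a check for leftover references, not part of the repository. The function name `find_stale_datasources` and the sample dashboard are hypothetical, and the sketch assumes dashboards reference datasources only through `{"type": ..., "uid": ...}` objects under a `datasource` key, as they do in the hunks shown here.

```python
import json

STALE_UID = "prometheus"  # uid of the datasource removed in this diff


def find_stale_datasources(node, path="$"):
    """Return JSON paths of every datasource object still using STALE_UID."""
    hits = []
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and ds.get("uid") == STALE_UID:
            hits.append(f"{path}.datasource")
        # Recurse into every value so nested panels and template variables are covered.
        for key, value in node.items():
            hits.extend(find_stale_datasources(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_stale_datasources(value, f"{path}[{index}]"))
    return hits


# Hypothetical miniature dashboard, shaped like the panels in this diff.
sample = json.loads("""
{
  "panels": [
    {"title": "Uptime", "datasource": {"type": "prometheus", "uid": "victoriametrics"}},
    {"title": "Old panel", "datasource": {"type": "prometheus", "uid": "prometheus"}}
  ]
}
""")

stale = find_stale_datasources(sample)
print(stale)
```

To apply it to the real files, one would load each JSON file under `services/grafana/dashboards/` with `json.load` and fail the check if the returned list is non-empty.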
--- a/services/http-proxy/proxy.nix
+++ b/services/http-proxy/proxy.nix
@@ -54,30 +54,7 @@
        }
        reverse_proxy http://ha1.home.2rjus.net:8080
      }
-      prometheus.home.2rjus.net {
-        log {
-          output file /var/log/caddy/prometheus.log {
-            mode 644
-          }
-        }
-        reverse_proxy http://monitoring01.home.2rjus.net:9090
-      }
-      alertmanager.home.2rjus.net {
-        log {
-          output file /var/log/caddy/alertmanager.log {
-            mode 644
-          }
-        }
-        reverse_proxy http://monitoring01.home.2rjus.net:9093
-      }
-      grafana.home.2rjus.net {
-        log {
-          output file /var/log/caddy/grafana.log {
-            mode 644
-          }
-        }
-        reverse_proxy http://monitoring01.home.2rjus.net:3000
-      }
+
      jelly.home.2rjus.net {
        log {
          output file /var/log/caddy/jelly.log {
@@ -86,22 +63,6 @@
        }
        reverse_proxy http://jelly01.home.2rjus.net:8096
      }
-      pyroscope.home.2rjus.net {
-        log {
-          output file /var/log/caddy/pyroscope.log {
-            mode 644
-          }
-        }
-        reverse_proxy http://monitoring01.home.2rjus.net:4040
-      }
-      pushgw.home.2rjus.net {
-        log {
-          output file /var/log/caddy/pushgw.log {
-            mode 644
-          }
-        }
-        reverse_proxy http://monitoring01.home.2rjus.net:9091
-      }
      http://http-proxy.home.2rjus.net/metrics {
        log {
          output file /var/log/caddy/caddy-metrics.log {
--- a/services/monitoring/blackbox.nix
+++ b/services/monitoring/blackbox.nix
@@ -1,33 +1,4 @@
 { pkgs, ... }:
-let
-  # TLS endpoints to monitor for certificate expiration
-  # These are all services using ACME certificates from OpenBao PKI
-  tlsTargets = [
-    # Direct ACME certs (security.acme.certs)
-    "https://vault.home.2rjus.net:8200"
-    "https://auth.home.2rjus.net"
-    "https://testvm01.home.2rjus.net"
-
-    # Caddy auto-TLS on http-proxy
-    "https://nzbget.home.2rjus.net"
-    "https://radarr.home.2rjus.net"
-    "https://sonarr.home.2rjus.net"
-    "https://ha.home.2rjus.net"
-    "https://z2m.home.2rjus.net"
-    "https://prometheus.home.2rjus.net"
-    "https://alertmanager.home.2rjus.net"
-    "https://grafana.home.2rjus.net"
-    "https://jelly.home.2rjus.net"
-    "https://pyroscope.home.2rjus.net"
-    "https://pushgw.home.2rjus.net"
-
-    # Caddy auto-TLS on nix-cache02
-    "https://nix-cache.home.2rjus.net"
-
-    # Caddy auto-TLS on grafana01
-    "https://grafana-test.home.2rjus.net"
-  ];
-in
 {
  services.prometheus.exporters.blackbox = {
    enable = true;
@@ -57,36 +28,4 @@ in
              - 503
    '';
  };
-
-  # Add blackbox scrape config to Prometheus
-  # Alert rules are in rules.yml (certificate_rules group)
-  services.prometheus.scrapeConfigs = [
-    {
-      job_name = "blackbox_tls";
-      metrics_path = "/probe";
-      params = {
-        module = [ "https_cert" ];
-      };
-      static_configs = [{
-        targets = tlsTargets;
-      }];
-      relabel_configs = [
-        # Pass the target URL to blackbox as a parameter
-        {
-          source_labels = [ "__address__" ];
-          target_label = "__param_target";
-        }
-        # Use the target URL as the instance label
-        {
-          source_labels = [ "__param_target" ];
-          target_label = "instance";
-        }
-        # Point the actual scrape at the local blackbox exporter
-        {
-          target_label = "__address__";
-          replacement = "127.0.0.1:9115";
-        }
-      ];
-    }
-  ];
 }
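
Review note: the `relabel_configs` removed above follow the usual blackbox exporter pattern, and whichever scraper takes over the `blackbox_tls` job will likely need the same three steps. The sketch below only simulates their effect on one target's labels. It is a simplification that assumes the default `replace` action with a single source label, and the helper names `relabel_for_blackbox` and `probe_url` are hypothetical, not Prometheus APIs.

```python
from urllib.parse import urlencode


def relabel_for_blackbox(target_url, exporter_addr="127.0.0.1:9115"):
    """Apply the three relabel steps from the removed scrape config, in order."""
    labels = {"__address__": target_url}
    # 1. Pass the target URL to blackbox as the `target` query parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Use the target URL as the instance label.
    labels["instance"] = labels["__param_target"]
    # 3. Point the actual scrape at the local blackbox exporter.
    labels["__address__"] = exporter_addr
    return labels


def probe_url(labels, module="https_cert", metrics_path="/probe"):
    """Build the HTTP request the scraper would end up sending to the exporter."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


labels = relabel_for_blackbox("https://vault.home.2rjus.net:8200")
print(labels["instance"])
print(probe_url(labels))
```

The point of the ordering is that `instance` keeps the probed URL while `__address__` ends up as the exporter, so expressions such as `probe_duration_seconds{job="blackbox_tls"}` remain grouped per endpoint.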
--- a/services/monitoring/default.nix
+++ b/services/monitoring/default.nix
@@ -1,14 +0,0 @@
-{ ... }:
-{
-  imports = [
-    ./loki.nix
-    ./grafana.nix
-    ./prometheus.nix
-    ./blackbox.nix
-    ./exportarr.nix
-    ./pve.nix
-    ./alerttonotify.nix
-    ./pyroscope.nix
-    ./tempo.nix
-  ];
-}
--- a/services/monitoring/exportarr.nix
+++ b/services/monitoring/exportarr.nix
@@ -14,14 +14,4 @@
    apiKeyFile = config.vault.secrets.sonarr-api-key.outputDir;
    port = 9709;
  };
-
-  # Scrape config
-  services.prometheus.scrapeConfigs = [
-    {
-      job_name = "sonarr";
-      static_configs = [{
-        targets = [ "localhost:9709" ];
-      }];
-    }
-  ];
 }
--- a/services/monitoring/external-targets.nix
+++ b/services/monitoring/external-targets.nix
@@ -4,6 +4,10 @@
  nodeExporter = [
    "gunter.home.2rjus.net:9100"
  ];
+  # Hosts with node-exporter but no systemd-exporter
+  nodeExporterOnly = [
+    "pve1.home.2rjus.net:9100"
+  ];
  scrapeConfigs = [
    { job_name = "smartctl"; targets = [ "gunter.home.2rjus.net:9633" ]; }
    { job_name = "ghettoptt"; targets = [ "gunter.home.2rjus.net:8989" ]; }
--- a/services/monitoring/grafana.nix
+++ b/services/monitoring/grafana.nix
@@ -1,11 +0,0 @@
-{ pkgs, ... }:
-{
-  services.grafana = {
-    enable = true;
-    settings = {
-      server = {
-        http_addr = "";
-      };
-    };
-  };
-}
--- a/services/monitoring/loki.nix
+++ b/services/monitoring/loki.nix
@@ -1,58 +0,0 @@
-{ ... }:
-{
-  services.loki = {
-    enable = true;
-    configuration = {
-      auth_enabled = false;
-
-      server = {
-        http_listen_port = 3100;
-      };
-      common = {
-        ring = {
-          instance_addr = "127.0.0.1";
-          kvstore = {
-            store = "inmemory";
-          };
-        };
-        replication_factor = 1;
-        path_prefix = "/var/lib/loki";
-      };
-      schema_config = {
-        configs = [
-          {
-            from = "2024-01-01";
-            store = "tsdb";
-            object_store = "filesystem";
-            schema = "v13";
-            index = {
-              prefix = "loki_index_";
-              period = "24h";
-            };
-          }
-        ];
-      };
-      storage_config = {
-        filesystem = {
-          directory = "/var/lib/loki/chunks";
-        };
-      };
-      compactor = {
-        working_directory = "/var/lib/loki/compactor";
-        compaction_interval = "10m";
-        retention_enabled = true;
-        retention_delete_delay = "2h";
-        retention_delete_worker_count = 150;
-        delete_request_store = "filesystem";
-      };
-      limits_config = {
-        retention_period = "30d";
-        ingestion_rate_mb = 10;
-        ingestion_burst_size_mb = 20;
-        max_streams_per_user = 10000;
-        max_query_series = 500;
-        max_query_parallelism = 8;
-      };
-    };
-  };
-}
--- a/services/monitoring/prometheus.nix
+++ b/services/monitoring/prometheus.nix
@@ -1,267 +0,0 @@
-{ self, lib, pkgs, ... }:
-let
-  monLib = import ../../lib/monitoring.nix { inherit lib; };
-  externalTargets = import ./external-targets.nix;
-
-  nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
-  autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
-
-  # Script to fetch AppRole token for Prometheus to use when scraping OpenBao metrics
-  fetchOpenbaoToken = pkgs.writeShellApplication {
-    name = "fetch-openbao-token";
-    runtimeInputs = [ pkgs.curl pkgs.jq ];
-    text = ''
-      VAULT_ADDR="https://vault01.home.2rjus.net:8200"
-      APPROLE_DIR="/var/lib/vault/approle"
-      OUTPUT_FILE="/run/secrets/prometheus/openbao-token"
-
-      # Read AppRole credentials
-      if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
-        echo "AppRole credentials not found at $APPROLE_DIR" >&2
-        exit 1
-      fi
-
-      ROLE_ID=$(cat "$APPROLE_DIR/role-id")
-      SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
-
-      # Authenticate to Vault
-      AUTH_RESPONSE=$(curl -sf -k -X POST \
-        -d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
-        "$VAULT_ADDR/v1/auth/approle/login")
-
-      # Extract token
-      VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
-      if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
-        echo "Failed to extract Vault token from response" >&2
-        exit 1
-      fi
-
-      # Write token to file
-      mkdir -p "$(dirname "$OUTPUT_FILE")"
-      echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
-      chown prometheus:prometheus "$OUTPUT_FILE"
-      chmod 0400 "$OUTPUT_FILE"
-
-      echo "Successfully fetched OpenBao token"
-    '';
-  };
-in
-{
-  # Systemd service to fetch AppRole token for Prometheus OpenBao scraping
-  # The token is used to authenticate when scraping /v1/sys/metrics
-  systemd.services.prometheus-openbao-token = {
-    description = "Fetch OpenBao token for Prometheus metrics scraping";
-    after = [ "network-online.target" ];
-    wants = [ "network-online.target" ];
-    before = [ "prometheus.service" ];
-    requiredBy = [ "prometheus.service" ];
-
-    serviceConfig = {
-      Type = "oneshot";
-      ExecStart = lib.getExe fetchOpenbaoToken;
-    };
-  };
-
-  # Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
-  systemd.timers.prometheus-openbao-token = {
-    description = "Refresh OpenBao token for Prometheus";
-    wantedBy = [ "timers.target" ];
-    timerConfig = {
-      OnBootSec = "5min";
-      OnUnitActiveSec = "30min";
-      RandomizedDelaySec = "5min";
-    };
-  };
-
-  # Fetch apiary bearer token from Vault
-  vault.secrets.prometheus-apiary-token = {
-    secretPath = "hosts/monitoring01/apiary-token";
-    extractKey = "password";
-    owner = "prometheus";
-    group = "prometheus";
-    services = [ "prometheus" ];
-  };
-
-  services.prometheus = {
-    enable = true;
-    # syntax-only check because we use external credential files (e.g., openbao-token)
-    checkConfig = "syntax-only";
-    alertmanager = {
-      enable = true;
-      configuration = {
-        global = {
-        };
-        route = {
-          receiver = "webhook_natstonotify";
-          group_wait = "30s";
-          group_interval = "5m";
-          repeat_interval = "1h";
-          group_by = [ "alertname" ];
-        };
-        receivers = [
-          {
-            name = "webhook_natstonotify";
-            webhook_configs = [
-              {
-                url = "http://localhost:5001/alert";
-              }
-            ];
-          }
-        ];
-      };
-    };
-    alertmanagers = [
-      {
-        static_configs = [
-          {
-            targets = [ "localhost:9093" ];
-          }
-        ];
-      }
-    ];
-
-    retentionTime = "30d";
-    globalConfig = {
-      scrape_interval = "15s";
-    };
-    rules = [
-      (builtins.readFile ./rules.yml)
-    ];
-
-    scrapeConfigs = [
-      # Auto-generated node-exporter targets from flake hosts + external
-      # Each static_config entry may have labels from homelab.host metadata
-      {
-        job_name = "node-exporter";
-        static_configs = nodeExporterTargets;
-      }
-      # Systemd exporter on all hosts (same targets, different port)
-      # Preserves the same label grouping as node-exporter
-      {
-        job_name = "systemd-exporter";
-        static_configs = map
-          (cfg: cfg // {
-            targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
-          })
-          nodeExporterTargets;
-      }
-      # Local monitoring services (not auto-generated)
-      {
-        job_name = "prometheus";
-        static_configs = [
-          {
-            targets = [ "localhost:9090" ];
-          }
-        ];
-      }
-      {
-        job_name = "loki";
-        static_configs = [
-          {
-            targets = [ "localhost:3100" ];
-          }
-        ];
-      }
-      {
-        job_name = "grafana";
-        static_configs = [
-          {
-            targets = [ "localhost:3000" ];
-          }
-        ];
-      }
-      {
-        job_name = "alertmanager";
-        static_configs = [
-          {
-            targets = [ "localhost:9093" ];
-          }
-        ];
-      }
-      {
-        job_name = "pushgateway";
-        honor_labels = true;
-        static_configs = [
-          {
-            targets = [ "localhost:9091" ];
-          }
-        ];
-      }
-      # Caddy metrics from nix-cache02 (serves nix-cache.home.2rjus.net)
-      {
-        job_name = "nix-cache_caddy";
-        scheme = "https";
-        static_configs = [
-          {
-            targets = [ "nix-cache.home.2rjus.net" ];
-          }
-        ];
-      }
-      # pve-exporter with complex relabel config
-      {
-        job_name = "pve-exporter";
-        static_configs = [
-          {
-            targets = [ "10.69.12.75" ];
-          }
-        ];
-        metrics_path = "/pve";
-        params = {
-          module = [ "default" ];
-          cluster = [ "1" ];
-          node = [ "1" ];
-        };
-        relabel_configs = [
-          {
-            source_labels = [ "__address__" ];
-            target_label = "__param_target";
-          }
-          {
-            source_labels = [ "__param_target" ];
-            target_label = "instance";
-          }
-          {
-            target_label = "__address__";
-            replacement = "127.0.0.1:9221";
-          }
-        ];
-      }
-      # OpenBao metrics with bearer token auth
-      {
-        job_name = "openbao";
-        scheme = "https";
-        metrics_path = "/v1/sys/metrics";
-        params = {
-          format = [ "prometheus" ];
-        };
-        static_configs = [{
-          targets = [ "vault01.home.2rjus.net:8200" ];
-        }];
-        authorization = {
-          type = "Bearer";
-          credentials_file = "/run/secrets/prometheus/openbao-token";
-        };
-      }
-      # Apiary external service
-      {
-        job_name = "apiary";
-        scheme = "https";
-        scrape_interval = "60s";
-        static_configs = [{
-          targets = [ "apiary.t-juice.club" ];
-        }];
-        authorization = {
-          type = "Bearer";
-          credentials_file = "/run/secrets/prometheus-apiary-token";
-        };
-      }
-    ] ++ autoScrapeConfigs;
-
-    pushgateway = {
-      enable = true;
-      web = {
-        external-url = "https://pushgw.home.2rjus.net";
-      };
-    };
-  };
-}
--- a/services/monitoring/pve.nix
+++ b/services/monitoring/pve.nix
@@ -1,7 +1,7 @@
 { config, ... }:
 {
  vault.secrets.pve-exporter = {
-    secretPath = "hosts/monitoring01/pve-exporter";
+    secretPath = "hosts/monitoring02/pve-exporter";
    extractKey = "config";
    outputDir = "/run/secrets/pve_exporter";
    mode = "0444";
--- a/services/monitoring/pyroscope.nix
+++ b/services/monitoring/pyroscope.nix
@@ -1,8 +0,0 @@
-{ ... }:
-{
-  virtualisation.oci-containers.containers.pyroscope = {
-    pull = "missing";
-    image = "grafana/pyroscope:latest";
-    ports = [ "4040:4040" ];
-  };
-}
--- a/services/monitoring/rules.yml
+++ b/services/monitoring/rules.yml
@@ -67,13 +67,13 @@ groups:
          summary: "Promtail service not running on {{ $labels.instance }}"
          description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
      - alert: filesystem_filling_up
-        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
+        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[24h], 24*3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
-          description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
+          description: "Based on the last 24h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
      - alert: systemd_not_running
        expr: node_systemd_system_running == 0
        for: 10m
@@ -259,32 +259,32 @@ groups:
          description: "Wireguard handshake timeout on {{ $labels.instance }} for peer {{ $labels.public_key }}."
  - name: monitoring_rules
    rules:
-      - alert: prometheus_not_running
-        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="prometheus.service", state="active"} == 0
+      - alert: victoriametrics_not_running
+        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="victoriametrics.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
-          summary: "Prometheus service not running on {{ $labels.instance }}"
-          description: "Prometheus service not running on {{ $labels.instance }}"
+          summary: "VictoriaMetrics service not running on {{ $labels.instance }}"
+          description: "VictoriaMetrics service not running on {{ $labels.instance }}"
+      - alert: vmalert_not_running
+        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="vmalert.service", state="active"} == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "vmalert service not running on {{ $labels.instance }}"
+          description: "vmalert service not running on {{ $labels.instance }}"
      - alert: alertmanager_not_running
-        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
+        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager service not running on {{ $labels.instance }}"
          description: "Alertmanager service not running on {{ $labels.instance }}"
-      - alert: pushgateway_not_running
-        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Pushgateway service not running on {{ $labels.instance }}"
-          description: "Pushgateway service not running on {{ $labels.instance }}"
      - alert: loki_not_running
-        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="loki.service", state="active"} == 0
+        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="loki.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
@@ -292,29 +292,13 @@ groups:
          summary: "Loki service not running on {{ $labels.instance }}"
          description: "Loki service not running on {{ $labels.instance }}"
      - alert: grafana_not_running
-        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
+        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Grafana service not running on {{ $labels.instance }}"
          description: "Grafana service not running on {{ $labels.instance }}"
-      - alert: tempo_not_running
-        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="tempo.service", state="active"} == 0
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Tempo service not running on {{ $labels.instance }}"
-          description: "Tempo service not running on {{ $labels.instance }}"
-      - alert: pyroscope_not_running
-        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="podman-pyroscope.service", state="active"} == 0
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Pyroscope service not running on {{ $labels.instance }}"
-          description: "Pyroscope service not running on {{ $labels.instance }}"
  - name: proxmox_rules
    rules:
      - alert: pve_node_down
--- a/services/monitoring/tempo.nix
+++ b/services/monitoring/tempo.nix
@@ -1,37 +0,0 @@
-{ ... }:
-{
-  services.tempo = {
-    enable = true;
-    settings = {
-      server = {
-        http_listen_port = 3200;
-        grpc_listen_port = 3201;
-      };
-      distributor = {
-        receivers = {
-          otlp = {
-            protocols = {
-              http = {
-                endpoint = ":4318";
-                cors = {
-                  allowed_origins = [ "*.home.2rjus.net" ];
-                };
-              };
-            };
-          };
-        };
-      };
-      storage = {
-        trace = {
-          backend = "local";
-          local = {
-            path = "/var/lib/tempo";
-          };
-          wal = {
-            path = "/var/lib/tempo/wal";
-          };
-        };
-      };
-    };
-  };
-}
--- a/services/victoriametrics/default.nix
+++ b/services/victoriametrics/default.nix
@@ -4,8 +4,27 @@ let
  externalTargets = import ../monitoring/external-targets.nix;

  nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
+  systemdExporterTargets = monLib.generateSystemdExporterTargets self externalTargets;
  autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;

+  # TLS endpoints to monitor for certificate expiration via blackbox exporter
+  tlsTargets = [
+    "https://vault.home.2rjus.net:8200"
+    "https://auth.home.2rjus.net"
+    "https://testvm01.home.2rjus.net"
+    "https://nzbget.home.2rjus.net"
+    "https://radarr.home.2rjus.net"
+    "https://sonarr.home.2rjus.net"
+    "https://ha.home.2rjus.net"
+    "https://z2m.home.2rjus.net"
+    "https://metrics.home.2rjus.net"
+    "https://alertmanager.home.2rjus.net"
+    "https://grafana.home.2rjus.net"
+    "https://jelly.home.2rjus.net"
+    "https://nix-cache.home.2rjus.net"
+    "https://grafana-test.home.2rjus.net"
+  ];
+
  # Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
  fetchOpenbaoToken = pkgs.writeShellApplication {
    name = "fetch-openbao-token-vm";
@@ -52,14 +71,10 @@ let
      job_name = "node-exporter";
      static_configs = nodeExporterTargets;
    }
-    # Systemd exporter on all hosts (same targets, different port)
+    # Systemd exporter on hosts that have it (excludes nodeExporterOnly hosts)
    {
      job_name = "systemd-exporter";
-      static_configs = map
-        (cfg: cfg // {
-          targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
-        })
-        nodeExporterTargets;
+      static_configs = systemdExporterTargets;
    }
    # Local monitoring services
    {
@@ -107,6 +122,39 @@ let
        credentials_file = "/run/secrets/victoriametrics-apiary-token";
      };
    }
+    # Blackbox TLS certificate monitoring
+    {
+      job_name = "blackbox_tls";
+      metrics_path = "/probe";
+      params = {
+        module = [ "https_cert" ];
+      };
+      static_configs = [{ targets = tlsTargets; }];
+      relabel_configs = [
+        {
+          source_labels = [ "__address__" ];
+          target_label = "__param_target";
+        }
+        {
+          source_labels = [ "__param_target" ];
+          target_label = "instance";
+        }
+        {
+          target_label = "__address__";
+          replacement = "127.0.0.1:9115";
+        }
+      ];
+    }
+    # Sonarr exporter
+    {
+      job_name = "sonarr";
+      static_configs = [{ targets = [ "localhost:9709" ]; }];
+    }
+    # Proxmox VE exporter
+    {
+      job_name = "pve";
+      static_configs = [{ targets = [ "localhost:9221" ]; }];
+    }
  ] ++ autoScrapeConfigs;
 in
 {
@@ -152,7 +200,7 @@ in

  # Fetch apiary bearer token from Vault
  vault.secrets.victoriametrics-apiary-token = {
-    secretPath = "hosts/monitoring01/apiary-token";
+    secretPath = "hosts/monitoring02/apiary-token";
    extractKey = "password";
    owner = "victoriametrics";
    group = "victoriametrics";
@@ -170,15 +218,12 @@ in
    };
  };

-  # vmalert for alerting rules - no notifier during parallel operation
+  # vmalert for alerting rules
  services.vmalert.instances.default = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
-      # Blackhole notifications during parallel operation to prevent duplicate alerts.
-      # Replace with notifier.url after cutover from monitoring01:
-      # "notifier.url" = [ "http://localhost:9093" ];
-      "notifier.blackhole" = true;
+      "notifier.url" = [ "http://localhost:9093" ];
      "rule" = [ ../monitoring/rules.yml ];
    };
  };
@@ -191,8 +236,11 @@ in
    reverse_proxy http://127.0.0.1:8880
  '';

-  # Alertmanager - same config as monitoring01 but will only receive
-  # alerts after cutover (vmalert notifier is disabled above)
+  # Alertmanager
+  services.caddy.virtualHosts."alertmanager.home.2rjus.net".extraConfig = ''
+    reverse_proxy http://127.0.0.1:9093
+  '';
+
  services.prometheus.alertmanager = {
    enable = true;
    configuration = {
--- a/system/monitoring/logs.nix
+++ b/system/monitoring/logs.nix
@@ -38,10 +38,6 @@ in
      };

      clients = [
-        {
-          url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
-        }
-      ] ++ lib.optionals config.vault.enable [
        {
          url = "https://loki.home.2rjus.net/loki/api/v1/push";
          basic_auth = {
--- a/system/nix.nix
+++ b/system/nix.nix
@@ -31,6 +31,10 @@ in
    };

    settings = {
+      experimental-features = [
+        "nix-command"
+        "flakes"
+      ];
      trusted-substituters = [
        "https://nix-cache.home.2rjus.net"
        "https://cache.nixos.org"
--- a/system/pipe-to-loki.nix
+++ b/system/pipe-to-loki.nix
@@ -16,7 +16,8 @@ let
    text = ''
      set -euo pipefail

-      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+      LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
+      LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
      HOSTNAME=$(hostname)
      SESSION_ID=""
      RECORD_MODE=false
@@ -69,7 +70,13 @@ let
            }]
          }')

+        local auth_args=()
+        if [[ -f "$LOKI_AUTH_FILE" ]]; then
+          auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
+        fi
+
        if curl -s -X POST "$LOKI_URL" \
+          "''${auth_args[@]}" \
          -H "Content-Type: application/json" \
          -d "$payload" > /dev/null; then
          return 0
--- a/system/vault-secrets.nix
+++ b/system/vault-secrets.nix
@@ -57,7 +57,7 @@ let
        type = types.str;
        description = ''
          Path to the secret in Vault (without /v1/secret/data/ prefix).
-          Example: "hosts/monitoring01/grafana-admin"
+          Example: "hosts/ha1/mqtt-password"
        '';
      };

@@ -152,13 +152,11 @@ in
      '';
      example = literalExpression ''
        {
-          grafana-admin = {
-            secretPath = "hosts/monitoring01/grafana-admin";
-            owner = "grafana";
-            group = "grafana";
-            restartTrigger = true;
-            restartInterval = "daily";
-            services = [ "grafana" ];
+          mqtt-password = {
+            secretPath = "hosts/ha1/mqtt-password";
+            owner = "mosquitto";
+            group = "mosquitto";
+            services = [ "mosquitto" ];
          };
        }
      '';
--- a/terraform/vault/approle.tf
+++ b/terraform/vault/approle.tf
@@ -40,23 +40,13 @@ EOT
 # Define host access policies
 locals {
  host_policies = {
-    # Example: monitoring01 host
-    # "monitoring01" = {
-    #   paths = [
-    #     "secret/data/hosts/monitoring01/*",
-    #     "secret/data/services/prometheus/*",
-    #     "secret/data/services/grafana/*",
-    #     "secret/data/shared/smtp/*"
-    #   ]
-    #   extra_policies = ["some-other-policy"]  # Optional: additional policies
-    # }
-
-    # Example: ha1 host
+    # Example:
    # "ha1" = {
    #   paths = [
    #     "secret/data/hosts/ha1/*",
    #     "secret/data/shared/mqtt/*"
    #   ]
+    #   extra_policies = ["some-other-policy"]  # Optional: additional policies
    # }

    "ha1" = {
@@ -66,16 +56,6 @@ locals {
      ]
    }

-    "monitoring01" = {
-      paths = [
-        "secret/data/hosts/monitoring01/*",
-        "secret/data/shared/backup/*",
-        "secret/data/shared/nats/*",
-        "secret/data/services/exportarr/*",
-      ]
-      extra_policies = ["prometheus-metrics"]
-    }
-
    # Wave 1: hosts with no service secrets (only need vault.enable for future use)
    "nats1" = {
      paths = [
@@ -115,15 +95,6 @@ locals {
      ]
    }

-    # monitoring02: Grafana + VictoriaMetrics
-    "monitoring02" = {
-      paths = [
-        "secret/data/hosts/monitoring02/*",
-        "secret/data/hosts/monitoring01/apiary-token",
-        "secret/data/services/grafana/*",
-      ]
-    }
-
  }
 }

--- a/terraform/vault/hosts-generated.tf
+++ b/terraform/vault/hosts-generated.tf
@@ -44,7 +44,26 @@ locals {
        "secret/data/hosts/garage01/*",
      ]
    }
-  
+    "monitoring02" = {
+      paths = [
+        "secret/data/hosts/monitoring02/*",
+        "secret/data/services/grafana/*",
+        "secret/data/services/exportarr/*",
+        "secret/data/shared/nats/nkey",
+      ]
+      extra_policies = ["prometheus-metrics"]
+    }
+    "pn01" = {
+      paths = [
+        "secret/data/hosts/pn01/*",
+      ]
+    }
+    "pn02" = {
+      paths = [
+        "secret/data/hosts/pn02/*",
+      ]
+    }
+
  }

  # Placeholder secrets - user should add actual secrets manually or via tofu
@@ -74,7 +93,10 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {

  backend            = vault_auth_backend.approle.path
  role_name          = each.key
-  token_policies     = ["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"]
+  token_policies     = concat(
+    ["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"],
+    lookup(each.value, "extra_policies", [])
+  )
  secret_id_ttl      = 0 # Never expire (wrapped tokens provide time limit)
  token_ttl          = 3600
  token_max_ttl      = 3600
--- a/terraform/vault/secrets.tf
+++ b/terraform/vault/secrets.tf
@@ -10,10 +10,6 @@ resource "vault_mount" "kv" {
 locals {
  secrets = {
    # Example host-specific secrets
-    # "hosts/monitoring01/grafana-admin" = {
-    #   auto_generate   = true
-    #   password_length = 32
-    # }
    # "hosts/ha1/mqtt-password" = {
    #   auto_generate   = true
    #   password_length = 24
@@ -35,11 +31,6 @@ locals {
    #   }
    # }

-    "hosts/monitoring01/grafana-admin" = {
-      auto_generate   = true
-      password_length = 32
-    }
-
    "hosts/ha1/mqtt-password" = {
      auto_generate   = true
      password_length = 24
@@ -57,8 +48,8 @@ locals {
      data          = { nkey = var.nats_nkey }
    }

-    # PVE exporter config for monitoring01
-    "hosts/monitoring01/pve-exporter" = {
+    # PVE exporter config for monitoring02
+    "hosts/monitoring02/pve-exporter" = {
      auto_generate = false
      data          = { config = var.pve_exporter_config }
    }
@@ -149,7 +140,7 @@ locals {
    }

    # Bearer token for scraping apiary metrics
-    "hosts/monitoring01/apiary-token" = {
+    "hosts/monitoring02/apiary-token" = {
      auto_generate   = true
      password_length = 64
    }