terraform: grant monitoring02 access to apiary-token secret

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
grafana: remove one-time deleteDatasources cleanup
2026-02-17 00:52:07 +01:00 · 2026-02-17 00:49:27 +01:00 · 2026-02-17 00:44:35 +01:00 · 2026-02-17 00:36:11 +01:00 · 2026-02-17 00:31:53 +01:00 · 2026-02-17 00:29:34 +01:00
51 changed files with 1252 additions and 721 deletions
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -130,7 +130,7 @@ get_commit_info(<hash>)            # Get full details of a specific change
 ```

 **Example workflow for a service-related alert:**
-1. Query `nixos_flake_info{hostname="monitoring02"}` → `current_rev: 8959829`
+1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
 2. `resolve_ref("master")` → `4633421`
 3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
 4. `commits_between("8959829", "4633421")` → 7 commits missing
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -30,7 +30,7 @@ Use the `lab-monitoring` MCP server tools:
 ### Label Reference

 Available labels for log queries:
-- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
+- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
 - `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
@@ -54,7 +54,7 @@ Journal logs are JSON-formatted. Key fields:

 **All logs from a host:**
 ```logql
-{hostname="monitoring02"}
+{hostname="monitoring01"}
 ```

 **Logs from a service across all hosts:**
@@ -74,7 +74,7 @@ Journal logs are JSON-formatted. Key fields:

 **Regex matching:**
 ```logql
-{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
+{systemd_unit="prometheus.service"} |~ "scrape.*failed"
 ```

 **Filter by level (journal scrape only):**
@@ -109,7 +109,7 @@ Default lookback is 1 hour. Use `start` parameter for older logs:
 Useful systemd units for troubleshooting:
 - `nixos-upgrade.service` - Daily auto-upgrade logs
 - `nsd.service` - DNS server (ns1/ns2)
-- `victoriametrics.service` - Metrics collection
+- `prometheus.service` - Metrics collection
 - `loki.service` - Log aggregation
 - `caddy.service` - Reverse proxy
 - `home-assistant.service` - Home automation
@@ -152,7 +152,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl

 Parse JSON and filter on fields:
 ```logql
-{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
+{systemd_unit="prometheus.service"} | json | PRIORITY="3"
 ```

 ---
@@ -242,11 +242,12 @@ All available Prometheus job names:
 - `unbound` - DNS resolver metrics (ns1, ns2)
 - `wireguard` - VPN tunnel metrics (http-proxy)

-**Monitoring stack (localhost on monitoring02):**
-- `victoriametrics` - VictoriaMetrics self-metrics
+**Monitoring stack (localhost on monitoring01):**
+- `prometheus` - Prometheus self-metrics
 - `loki` - Loki self-metrics
 - `grafana` - Grafana self-metrics
 - `alertmanager` - Alertmanager metrics
+- `pushgateway` - Push-based metrics gateway

 **External/infrastructure:**
 - `pve-exporter` - Proxmox hypervisor metrics
@@ -261,7 +262,7 @@ All scrape targets have these labels:
 **Standard labels:**
 - `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
 - `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
-- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering

 **Host metadata labels** (when configured in `homelab.host`):
 - `role` - Host role (e.g., `dns`, `build-host`, `vault`)
@@ -274,7 +275,7 @@ Use the `hostname` label for easy host filtering across all jobs:

 ```promql
 {hostname="ns1"}                    # All metrics from ns1
-node_load1{hostname="monitoring02"} # Specific metric by hostname
+node_load1{hostname="monitoring01"} # Specific metric by hostname
 up{hostname="ha1"}                  # Check if ha1 is up
 ```

@@ -282,10 +283,10 @@ This is simpler than wildcarding the `instance` label:

 ```promql
 # Old way (still works but verbose)
-up{instance=~"monitoring02.*"}
+up{instance=~"monitoring01.*"}

 # New way (preferred)
-up{hostname="monitoring02"}
+up{hostname="monitoring01"}
 ```

 ### Filtering by Role/Tier
--- a/.claude/skills/quick-plan/SKILL.md
+++ b/.claude/skills/quick-plan/SKILL.md
@@ -73,7 +73,6 @@ Additional context, caveats, or references.
 - **Reference existing patterns**: Mention how this fits with existing infrastructure
 - **Tables for comparisons**: Use markdown tables when comparing options
 - **Practical focus**: Emphasize what needs to happen, not theory
-- **Mermaid diagrams**: Use mermaid code blocks for architecture diagrams, flow charts, or other graphs when relevant to the plan. Keep node labels short and use `<br/>` for line breaks

 ## Examples of Good Plans

--- a/.gitignore
+++ b/.gitignore
@@ -2,9 +2,6 @@
 result
 result-*

-# MCP config (contains secrets)
-.mcp.json
-
 # Terraform/OpenTofu
 terraform/.terraform/
 terraform/.terraform.lock.hcl
--- a/.mcp.json.example
+++ b/.mcp.json.example
@@ -20,9 +20,7 @@
      "env": {
        "PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
        "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
-        "LOKI_URL": "https://loki.home.2rjus.net",
-        "LOKI_USERNAME": "promtail",
-        "LOKI_PASSWORD": "<password from: bao kv get -field=password secret/shared/loki/push-auth>"
+        "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
      }
    },
    "homelab-deploy": {
@@ -46,3 +44,4 @@
    }
  }
 }
+
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -247,7 +247,7 @@ nix develop -c homelab-deploy -- deploy \
  deploy.prod.<hostname>
 ```

-Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring02`, `deploy.test.testvm01`)
+Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)

 **Verifying Deployments:**

@@ -309,7 +309,7 @@ All hosts automatically get:
 - OpenBao (Vault) secrets management via AppRole
 - Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
 - Daily auto-upgrades with auto-reboot
-- Prometheus node-exporter + Promtail (logs to monitoring02)
+- Prometheus node-exporter + Promtail (logs to monitoring01)
 - Monitoring scrape target auto-registration via `homelab.monitoring` options
 - Custom root CA trust
 - DNS zone auto-registration via `homelab.dns` options
@@ -335,7 +335,7 @@ Use `nix flake show` or `nix develop -c ansible-inventory --graph` to list all h
 - Infrastructure subnet: `10.69.13.x`
 - DNS: ns1/ns2 provide authoritative DNS with primary-secondary setup
 - Internal CA for ACME certificates (no Let's Encrypt)
-- Centralized monitoring at monitoring02
+- Centralized monitoring at monitoring01
 - Static networking via systemd-networkd

 ### Secrets Management
@@ -480,21 +480,23 @@ See [docs/host-creation.md](docs/host-creation.md) for the complete host creatio

 ### Monitoring Stack

-All hosts ship metrics and logs to `monitoring02`:
-- **Metrics**: VictoriaMetrics scrapes node-exporter from all hosts
-- **Logs**: Promtail ships logs to Loki on monitoring02
-- **Access**: Grafana at monitoring02 for visualization
+All hosts ship metrics and logs to `monitoring01`:
+- **Metrics**: Prometheus scrapes node-exporter from all hosts
+- **Logs**: Promtail ships logs to Loki on monitoring01
+- **Access**: Grafana at monitoring01 for visualization
+- **Tracing**: Tempo for distributed tracing
+- **Profiling**: Pyroscope for continuous profiling

 **Scrape Target Auto-Generation:**

-VictoriaMetrics scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
+Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:

 - **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
 - **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

-Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The VictoriaMetrics config on monitoring02 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
+Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.

 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.

--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
 | `ca` | Internal Certificate Authority |
 | `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
 | `http-proxy` | Reverse proxy |
-| `monitoring02` | VictoriaMetrics, Grafana, Loki, Alertmanager |
+| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
 | `jelly01` | Jellyfin media server |
 | `nix-cache02` | Nix binary cache + NATS-based build service |
 | `nats1` | NATS messaging |
@@ -121,4 +121,4 @@ No manual intervention is required after `tofu apply`.
 - Infrastructure subnet: `10.69.13.0/24`
 - DNS: ns1/ns2 authoritative with primary-secondary AXFR
 - Internal CA for TLS certificates (migrating from step-ca to OpenBao PKI)
-- Centralized monitoring at monitoring02
+- Centralized monitoring at monitoring01
--- a/docs/plans/completed/monitoring-migration-victoriametrics.md
+++ b/docs/plans/completed/monitoring-migration-victoriametrics.md
@@ -1,156 +0,0 @@
-# Monitoring Stack Migration to VictoriaMetrics
-
-## Overview
-
-Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
-and longer retention. Run in parallel with monitoring01 until validated, then switch over using
-a `monitoring` CNAME for seamless transition.
-
-## Current State
-
-**monitoring02** (10.69.13.24) - **PRIMARY**:
-- 4 CPU cores, 8GB RAM, 60GB disk
-- VictoriaMetrics with 3-month retention
-- vmalert with alerting enabled (routes to local Alertmanager)
-- Alertmanager -> alerttonotify -> NATS notification pipeline
-- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
-- Loki (log aggregation)
-- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki
-
-**monitoring01** (10.69.13.13) - **SHUT DOWN**:
-- No longer running, pending decommission
-
-## Decision: VictoriaMetrics
-
-Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
-- Single binary replacement for Prometheus
-- 5-10x better compression (30 days could become 180+ days in same space)
-- Same PromQL query language (Grafana dashboards work unchanged)
-- Same scrape config format (existing auto-generated configs work)
-
-If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
-
-## Architecture
-
-```
-                     ┌─────────────────┐
-                     │  monitoring02   │
-                     │  VictoriaMetrics│
-                     │  + Grafana      │
-     monitoring      │  + Loki         │
-     CNAME ──────────│  + Alertmanager │
-                     │  (vmalert)      │
-                     └─────────────────┘
-                            ▲
-                            │ scrapes
-            ┌───────────────┼───────────────┐
-            │               │               │
-       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
-       │  ns1    │    │  ha1     │    │  ...     │
-       │ :9100   │    │ :9100    │    │ :9100    │
-       └─────────┘    └──────────┘    └──────────┘
-```
-
-## Implementation Plan
-
-### Phase 1: Create monitoring02 Host [COMPLETE]
-
-Host created and deployed at 10.69.13.24 (prod tier) with:
-- 4 CPU cores, 8GB RAM, 60GB disk
-- Vault integration enabled
-- NATS-based remote deployment enabled
-- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
-
-### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]
-
-New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
-Imported by monitoring02 alongside the existing Grafana service.
-
-1. **VictoriaMetrics** (port 8428):
-   - `services.victoriametrics.enable = true`
-   - `retentionPeriod = "3"` (3 months)
-   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
-   - Static user override (DynamicUser disabled) for credential file access
-   - OpenBao token fetch service + 30min refresh timer
-   - Apiary bearer token via vault.secrets
-
-2. **vmalert** for alerting rules:
-   - Points to VictoriaMetrics datasource at localhost:8428
-   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
-   - Notifier sends to local Alertmanager at localhost:9093
-
-3. **Alertmanager** (port 9093):
-   - Same configuration as monitoring01 (alerttonotify webhook routing)
-   - alerttonotify imported on monitoring02, routes alerts via NATS
-
-4. **Grafana** (port 3000):
-   - VictoriaMetrics datasource (localhost:8428) as default
-   - Loki datasource pointing to localhost:3100
-
-5. **Loki** (port 3100):
-   - Same configuration as monitoring01 in standalone `services/loki/` module
-   - Grafana datasource updated to localhost:3100
-
-**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
-pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
-native push support.
-
-### Phase 3: Parallel Operation [COMPLETE]
-
-Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.
-
-### Phase 4: Add monitoring CNAME [COMPLETE]
-
-Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.
-
-### Phase 5: Update References [COMPLETE]
-
-- Moved alertmanager, grafana, prometheus CNAMEs from http-proxy to monitoring02
-- Removed corresponding Caddy reverse proxy entries from http-proxy
-- monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly
-
-### Phase 6: Enable Alerting [COMPLETE]
-
-- Switched vmalert from blackhole mode to local Alertmanager
-- alerttonotify service running on monitoring02 (NATS nkey from Vault)
-- prometheus-metrics Vault policy added for OpenBao scraping
-- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS
-
-### Phase 7: Cutover and Decommission [IN PROGRESS]
-
-- monitoring01 shut down (2026-02-17)
-- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support
-
-**Remaining cleanup (separate branch):**
-- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
-- [ ] Update `hosts/template2/bootstrap.nix` - Bootstrap Loki URL still points to monitoring01
-- [ ] Remove monitoring01 from flake.nix and host configuration
-- [ ] Destroy monitoring01 VM in Proxmox
-- [ ] Remove monitoring01 from terraform state
-- [ ] Remove or archive `services/monitoring/` (Prometheus config)
-
-## Completed
-
-- 2026-02-08: Phase 1 - monitoring02 host created
-- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
-- 2026-02-17: Phase 6 - Alerting enabled, CNAMEs migrated, monitoring01 shut down
-
-## VictoriaMetrics Service Configuration
-
-Implemented in `services/victoriametrics/default.nix`. Key design decisions:
-
-- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
-  `victoriametrics` user so vault.secrets and credential files work correctly
-- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
-  reference (no YAML-to-Nix conversion needed)
-- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
-  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
-
-## Notes
-
-- VictoriaMetrics uses port 8428 vs Prometheus 9090
-- PromQL compatibility is excellent
-- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
-- monitoring02 deployed via OpenTofu using `create-host` script
-- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
-- Tempo and Pyroscope deferred (not actively used; can be added later if needed)
--- a/docs/plans/host-migration-to-opentofu.md
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -20,9 +20,9 @@ Hosts to migrate:
 | http-proxy | Stateless | Reverse proxy, recreate |
 | nats1 | Stateless | Messaging, recreate |
 | ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
-| ~~monitoring01~~ | ~~Decommission~~ | ✓ Complete — replaced by monitoring02 (VictoriaMetrics) |
+| monitoring01 | Stateful | Prometheus, Grafana, Loki |
 | jelly01 | Stateful | Jellyfin metadata, watch history, config |
-| ~~pgdb1~~ | ~~Decommission~~ | ✓ Complete |
+| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
 | ~~jump~~ | ~~Decommission~~ | ✓ Complete |
 | ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
 | ~~ca~~ | ~~Deferred~~ | ✓ Complete |
@@ -31,12 +31,10 @@ Hosts to migrate:

 Before migrating any stateful host, ensure restic backups are in place and verified.

-### ~~1a. Expand monitoring01 Grafana Backup~~ ✓ N/A
+### 1a. Expand monitoring01 Grafana Backup

-~~The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
-Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.~~
-
-No longer needed — monitoring01 decommissioned, replaced by monitoring02 with declarative Grafana dashboards.
+The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
+Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.

 ### 1b. Add Jellyfin Backup to jelly01

@@ -96,17 +94,15 @@ For each stateful host, the procedure is:
 7. Start services and verify functionality
 8. Decommission the old VM

-### 3a. monitoring01 ✓ COMPLETE
+### 3a. monitoring01

-~~1. Run final Grafana backup~~
-~~2. Provision new monitoring01 via OpenTofu~~
-~~3. After bootstrap, restore `/var/lib/grafana/` from restic~~
-~~4. Restart Grafana, verify dashboards and datasources are intact~~
-~~5. Prometheus and Loki start fresh with empty data (acceptable)~~
-~~6. Verify all scrape targets are being collected~~
-~~7. Decommission old VM~~
-
-Replaced by monitoring02 with VictoriaMetrics, standalone Loki and Grafana modules. Host configuration, old service modules, and terraform resources removed.
+1. Run final Grafana backup
+2. Provision new monitoring01 via OpenTofu
+3. After bootstrap, restore `/var/lib/grafana/` from restic
+4. Restart Grafana, verify dashboards and datasources are intact
+5. Prometheus and Loki start fresh with empty data (acceptable)
+6. Verify all scrape targets are being collected
+7. Decommission old VM

 ### 3b. jelly01

@@ -167,19 +163,19 @@ Host was already removed from flake.nix and VM destroyed. Configuration cleaned

 Host configuration, services, and VM already removed.

-### pgdb1 ✓ COMPLETE
+### pgdb1 (in progress)

-~~Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.~~
+Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.

-~~1. Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
-~~2. Remove host configuration from `hosts/pgdb1/`~~
-~~3. Remove `services/postgres/` (only used by pgdb1)~~
-~~4. Remove from `flake.nix`~~
-~~5. Remove Vault AppRole from `terraform/vault/approle.tf`~~
-~~6. Destroy the VM in Proxmox~~
-~~7. Commit cleanup~~
+1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
+2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
+3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
+4. ~~Remove from `flake.nix`~~ ✓
+5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
+6. Destroy the VM in Proxmox
+7. ~~Commit cleanup~~ ✓

-Host configuration, services, terraform resources, and VM removed. See `docs/plans/pgdb1-decommission.md` for detailed plan.
+See `docs/plans/pgdb1-decommission.md` for detailed plan.

 ## Phase 5: Decommission ca Host ✓ COMPLETE

--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -0,0 +1,209 @@
+# Monitoring Stack Migration to VictoriaMetrics
+
+## Overview
+
+Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
+and longer retention. Run in parallel with monitoring01 until validated, then switch over using
+a `monitoring` CNAME for seamless transition.
+
+## Current State
+
+**monitoring01** (10.69.13.13):
+- 4 CPU cores, 4GB RAM, 33GB disk
+- Prometheus with 30-day retention (15s scrape interval)
+- Alertmanager (routes to alerttonotify webhook)
+- Grafana (dashboards, datasources)
+- Loki (log aggregation from all hosts via Promtail)
+- Tempo (distributed tracing)
+- Pyroscope (continuous profiling)
+
+**Hardcoded References to monitoring01:**
+- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
+- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
+- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
+
+**Auto-generated:**
+- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
+- Node-exporter targets (from all hosts with static IPs)
+
+## Decision: VictoriaMetrics
+
+Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
+- Single binary replacement for Prometheus
+- 5-10x better compression (30 days could become 180+ days in same space)
+- Same PromQL query language (Grafana dashboards work unchanged)
+- Same scrape config format (existing auto-generated configs work)
+
+If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
+
+## Architecture
+
+```
+                     ┌─────────────────┐
+                     │  monitoring02   │
+                     │  VictoriaMetrics│
+                     │  + Grafana      │
+     monitoring      │  + Loki         │
+     CNAME ──────────│  + Tempo        │
+                     │  + Pyroscope    │
+                     │  + Alertmanager │
+                     │  (vmalert)      │
+                     └─────────────────┘
+                            ▲
+                            │ scrapes
+            ┌───────────────┼───────────────┐
+            │               │               │
+       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
+       │  ns1    │    │  ha1     │    │  ...     │
+       │ :9100   │    │ :9100    │    │ :9100    │
+       └─────────┘    └──────────┘    └──────────┘
+```
+
+## Implementation Plan
+
+### Phase 1: Create monitoring02 Host [COMPLETE]
+
+Host created and deployed at 10.69.13.24 (prod tier) with:
+- 4 CPU cores, 8GB RAM, 60GB disk
+- Vault integration enabled
+- NATS-based remote deployment enabled
+- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
+
+### Phase 2: Set Up VictoriaMetrics Stack
+
+New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
+Imported by monitoring02 alongside the existing Grafana service.
+
+1. **VictoriaMetrics** (port 8428): [DONE]
+   - `services.victoriametrics.enable = true`
+   - `retentionPeriod = "3"` (3 months)
+   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
+   - Static user override (DynamicUser disabled) for credential file access
+   - OpenBao token fetch service + 30min refresh timer
+   - Apiary bearer token via vault.secrets
+
+2. **vmalert** for alerting rules: [DONE]
+   - Points to VictoriaMetrics datasource at localhost:8428
+   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
+   - No notifier configured during parallel operation (prevents duplicate alerts)
+
+3. **Alertmanager** (port 9093): [DONE]
+   - Same configuration as monitoring01 (alerttonotify webhook routing)
+   - Will only receive alerts after cutover (vmalert notifier disabled)
+
+4. **Grafana** (port 3000): [DONE]
+   - VictoriaMetrics datasource (localhost:8428) as default
+   - monitoring01 Prometheus datasource kept for comparison during parallel operation
+   - Loki datasource pointing to monitoring01 (until Loki migrated)
+
+5. **Loki** (port 3100):
+   - TODO: Same configuration as current
+
+6. **Tempo** (ports 3200, 3201):
+   - TODO: Same configuration
+
+7. **Pyroscope** (port 4040):
+   - TODO: Same Docker-based deployment
+
+**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
+pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
+native push support.
+
+### Phase 3: Parallel Operation
+
+Run both monitoring01 and monitoring02 simultaneously:
+
+1. **Dual scraping**: Both hosts scrape the same targets
+   - Validates VictoriaMetrics is collecting data correctly
+
+2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
+   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
+
+3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
+
+4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
+
+5. **Compare resource usage**: Monitor disk/memory consumption between hosts
+
+### Phase 4: Add monitoring CNAME
+
+Add CNAME to monitoring02 once validated:
+
+```nix
+# hosts/monitoring02/configuration.nix
+homelab.dns.cnames = [ "monitoring" ];
+```
+
+This creates `monitoring.home.2rjus.net` pointing to monitoring02.
+
+### Phase 5: Update References
+
+Update hardcoded references to use the CNAME:
+
+1. **system/monitoring/logs.nix**:
+   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
+
+2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
+   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
+   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
+   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
+   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
+
+Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
+
+### Phase 6: Enable Alerting
+
+Once ready to cut over:
+1. Enable Alertmanager receiver on monitoring02
+2. Verify test alerts route correctly
+
+### Phase 7: Cutover and Decommission
+
+1. **Stop monitoring01**: Prevent duplicate alerts during transition
+2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
+3. **Verify all targets scraped**: Check VictoriaMetrics UI
+4. **Verify logs flowing**: Check Loki on monitoring02
+5. **Decommission monitoring01**:
+   - Remove from flake.nix
+   - Remove host configuration
+   - Destroy VM in Proxmox
+   - Remove from terraform state
+
+## Current Progress
+
+- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
+- **Phase 2** in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Grafana datasources configured
+  - Remaining: Loki, Tempo, Pyroscope migration
+
+## Open Questions
+
+- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
+- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
+- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.
+
+## VictoriaMetrics Service Configuration
+
+Implemented in `services/victoriametrics/default.nix`. Key design decisions:
+
+- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
+  `victoriametrics` user so vault.secrets and credential files work correctly
+- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
+  reference (no YAML-to-Nix conversion needed)
+- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
+  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
+
+## Rollback Plan
+
+If issues arise after cutover:
+1. Move `monitoring` CNAME back to monitoring01
+2. Restart monitoring01 services
+3. Revert Promtail config to point only to monitoring01
+4. Revert http-proxy backends
+
+## Notes
+
+- VictoriaMetrics uses port 8428 vs Prometheus 9090
+- PromQL compatibility is excellent
+- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
+- monitoring02 deployed via OpenTofu using `create-host` script
+- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
--- a/docs/plans/remote-access.md
+++ b/docs/plans/remote-access.md
@@ -24,20 +24,29 @@ After evaluating WireGuard gateway vs Headscale (self-hosted Tailscale), the **W

 ## Architecture

-```mermaid
-graph TD
-    clients["Laptop / Phone"]
-    vps["VPS<br/>(WireGuard endpoint)"]
-    extgw["extgw01<br/>(gateway + bastion)"]
-    grafana["Grafana<br/>monitoring01:3000"]
-    jellyfin["Jellyfin<br/>jelly01:8096"]
-    arr["arr stack<br/>*-jail hosts"]
-
-    clients -->|WireGuard| vps
-    vps -->|WireGuard tunnel| extgw
-    extgw -->|allowed traffic| grafana
-    extgw -->|allowed traffic| jellyfin
-    extgw -->|allowed traffic| arr
+```
+                    ┌─────────────────────────────────┐
+                    │  VPS (OpenStack)                │
+  Laptop/Phone ──→ │  WireGuard endpoint             │
+  (WireGuard)      │  Client peers: laptop, phone    │
+                    │  Routes 10.69.13.0/24 via tunnel│
+                    └──────────┬──────────────────────┘
+                               │ WireGuard tunnel
+                               ▼
+                    ┌─────────────────────────────────┐
+                    │  extgw01 (gateway + bastion)    │
+                    │  - WireGuard tunnel to VPS      │
+                    │  - Firewall (allowlist only)    │
+                    │  - SSH + 2FA (full access)      │
+                    └──────────┬──────────────────────┘
+                               │ allowed traffic only
+                               ▼
+                    ┌─────────────────────────────────┐
+                    │  Internal network 10.69.13.0/24 │
+                    │  - monitoring01:3000 (Grafana)  │
+                    │  - jelly01:8096 (Jellyfin)      │
+                    │  - *-jail hosts (arr stack)     │
+                    └─────────────────────────────────┘
 ```

 ### Existing path (unchanged)
--- a/docs/plans/truenas-migration.md
+++ b/docs/plans/truenas-migration.md
@@ -39,17 +39,23 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
 - nzbget: NixOS service or OCI container
 - NFS exports: `services.nfs.server`

-### Filesystem: Keep ZFS
+### Filesystem: BTRFS RAID1

-**Decision**: Keep existing ZFS pool, import on NixOS
+**Decision**: Migrate from ZFS to BTRFS with RAID1

 **Rationale**:
-- **No data migration needed**: Existing ZFS pool can be imported directly on NixOS
-- **Proven reliability**: Pool has been running reliably on TrueNAS
-- **NixOS ZFS support**: Well-supported, declarative configuration via `boot.zfs` and `services.zfs`
-- **BTRFS RAID5/6 unreliable**: Research showed BTRFS RAID5/6 write hole is still unresolved
-- **BTRFS RAID1 wasteful**: With mixed disk sizes, RAID1 wastes significant capacity vs ZFS mirrors
-- Checksumming, snapshots, compression (lz4/zstd) all available
+- **In-kernel**: No out-of-tree module issues like ZFS
+- **Flexible expansion**: Add individual disks, not required to buy pairs
+- **Mixed disk sizes**: Better handling than ZFS multi-vdev approach
+- **RAID level conversion**: Can convert between RAID levels in place
+- Built-in checksumming, snapshots, compression (zstd)
+- NixOS has good BTRFS support
+
+**BTRFS RAID1 notes**:
+- "RAID1" means 2 copies of all data
+- Distributes across all available devices
+- With 6+ disks, provides redundancy + capacity scaling
+- RAID5/6 avoided (known issues), RAID1/10 are stable

 ### Hardware: Keep Existing + Add Disks

@@ -63,94 +69,83 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway

 **Storage architecture**:

-**hdd-pool** (ZFS mirrors):
- Current: 3 mirror vdevs (2x16TB + 2x8TB + 2x8TB) = 32TB usable
- Add: mirror-3 with 2x 24TB = +24TB usable
- Total after expansion: ~56TB usable
+**Bulk storage** (BTRFS RAID1 on HDDs):
+- Current: 6x HDDs (2x16TB + 2x8TB + 2x8TB)
+- Add: 2x new HDDs (size TBD)
 - Use: Media, downloads, backups, non-critical data
+- Risk tolerance: High (data mostly replaceable)
+
+**Critical data** (small volume):
+- Use 2x 240GB SSDs in mirror (BTRFS or ZFS)
+- Or use 2TB NVMe for critical data
+- Risk tolerance: Low (data important but small)

 ### Disk Purchase Decision

-**Decision**: 2x 24TB drives (ordered, arriving 2026-02-21)
+**Options under consideration**:
+
+**Option A: 2x 16TB drives**
+- Matches largest current drives
+- Enables potential future RAID5 if desired (6x 16TB array)
+- More conservative capacity increase
+
+**Option B: 2x 20-24TB drives**
+- Larger capacity headroom
+- Better $/TB ratio typically
+- Future-proofs better
+
+**Initial purchase**: 2 drives (chassis has space for 2 more without modifications)

 ## Migration Strategy

 ### High-Level Plan

-1. **Expand ZFS pool** (on TrueNAS):
-   - Install 2x 24TB drives (may need new drive trays - order from abroad if needed)
-   - If chassis space is limited, temporarily replace the two oldest 8TB drives (da0/ada4)
-   - Add as mirror-3 vdev to hdd-pool
-   - Verify pool health and resilver completes
-   - Check SMART data on old 8TB drives (all healthy as of 2026-02-20, no reallocated sectors)
-   - Burn-in: at minimum short + long SMART test before adding to pool
+1. **Preparation**:
+   - Purchase 2x new HDDs (16TB or 20-24TB)
+   - Create NixOS configuration for new storage host
+   - Set up bare metal NixOS installation

-2. **Prepare NixOS configuration**:
-   - Create host configuration (`hosts/nas1/` or similar)
-   - Configure ZFS pool import (`boot.zfs.extraPools`)
-   - Set up services: radarr, sonarr, nzbget, restic-rest, NFS
-   - Configure monitoring (node-exporter, promtail, smartctl-exporter)
+2. **Initial BTRFS pool**:
+   - Install 2 new disks
+   - Create BTRFS filesystem in RAID1
+   - Mount and test NFS exports

-3. **Install NixOS**:
-   - `zfs export hdd-pool` on TrueNAS before shutdown (clean export)
-   - Wipe TrueNAS boot-pool SSDs, set up as mdadm RAID1 for NixOS root
-   - Install NixOS on mdadm mirror (keeps boot path ZFS-independent)
-   - Import hdd-pool via `boot.zfs.extraPools`
-   - Verify all datasets mount correctly
+3. **Data migration**:
+   - Copy data from TrueNAS ZFS pool to new BTRFS pool over 10GbE
+   - Verify data integrity

-4. **Service migration**:
-   - Configure NixOS services to use ZFS dataset paths
-   - Update NFS exports
-   - Test from consuming hosts
+4. **Expand pool**:
+   - As old ZFS pool is emptied, wipe drives and add to BTRFS pool
+   - Pool grows incrementally: 2 → 4 → 6 → 8 disks
+   - BTRFS rebalances data across new devices

-5. **Cutover**:
-   - Update DNS/client mounts if IP changes
-   - Verify monitoring integration
+5. **Service migration**:
+   - Set up radarr/sonarr/nzbget/restic as NixOS services
+   - Update NFS client mounts on consuming hosts
+
+6. **Cutover**:
+   - Point consumers to new NAS host
   - Decommission TrueNAS
-
-### Post-Expansion: Vdev Rebalancing
-
-ZFS has no built-in rebalance command. After adding the new 24TB vdev, ZFS will
-write new data preferentially to it (most free space), leaving old vdevs packed
-at ~97%. This is suboptimal but not urgent once overall pool usage drops to ~50%.
-
-To gradually rebalance, rewrite files in place so ZFS redistributes blocks across
-all vdevs proportional to free space:
-
-```bash
-# Rewrite files individually (spreads blocks across all vdevs)
-find /pool/dataset -type f -exec sh -c '
-  for f; do cp "$f" "$f.rebal" && mv "$f.rebal" "$f"; done
-' _ {} +
-```
-
-Avoid `zfs send/recv` for large datasets (e.g. 20TB) as this would concentrate
-data on the emptiest vdev rather than spreading it evenly.
-
-**Recommendation**: Do this after NixOS migration is stable. Not urgent - the pool
-will function fine with uneven distribution, just slightly suboptimal for performance.
+   - Repurpose hardware or keep as spare

 ### Migration Advantages

- **No data migration**: ZFS pool imported directly, no copying terabytes of data
- **Low risk**: Pool expansion done on stable TrueNAS before OS swap
- **Reversible**: Can boot back to TrueNAS if NixOS has issues (ZFS pool is OS-independent)
- **Quick cutover**: Once NixOS config is ready, the OS swap is fast
+- **Low risk**: New pool created independently, old data remains intact during migration
+- **Incremental**: Can add old disks one at a time as space allows
+- **Flexible**: BTRFS handles mixed disk sizes gracefully
+- **Reversible**: Keep TrueNAS running until fully validated

 ## Next Steps

-1. ~~Decide on disk size~~ - 2x 24TB ordered
-2. Install drives and add mirror vdev to ZFS pool
-3. Check SMART data on 8TB drives - decide whether to keep or retire
-4. Design NixOS host configuration (`hosts/nas1/`)
-5. Document NFS export mapping (current -> new)
-6. Plan NixOS installation and cutover
+1. Decide on disk size (16TB vs 20-24TB)
+2. Purchase disks
+3. Design NixOS host configuration (`hosts/nas1/`)
+4. Plan detailed migration timeline
+5. Document NFS export mapping (current → new)

 ## Open Questions

+- [ ] Final decision on disk size?
 - [ ] Hostname for new NAS host? (nas1? storage1?)
- [ ] IP address/subnet: NAS and Proxmox are both on 10GbE to the same switch but different subnets, forcing traffic through the router (bottleneck). Move to same subnet during migration.
- [x] Boot drive: Reuse TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root (no ZFS on boot path)
- [ ] Retire old 8TB drives? (SMART looks healthy, keep unless chassis space is needed)
- [ ] Drive trays: do new 24TB drives fit, or order trays from abroad?
- [ ] Timeline/maintenance window for NixOS swap?
+- [ ] IP address allocation (keep 10.69.12.50 or new IP?)
+- [ ] Timeline/maintenance window for migration?
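
Note on the "BTRFS RAID1 notes" in the plan above: because RAID1 keeps two copies of every chunk on two different devices, usable space with mixed disk sizes can be estimated as the smaller of half the total and the total minus the largest disk. The sketch below is only an illustration of that arithmetic. The function name `btrfs_raid1_usable` is hypothetical, sizes are whole terabytes, and metadata overhead and TB/TiB differences are ignored.

```bash
#!/usr/bin/env bash
# Rough estimate of usable BTRFS RAID1 capacity for mixed disk sizes.
# Assumption: each chunk is stored twice, on two different devices, so
#   usable = min(total / 2, total - largest)
# Arguments are disk sizes in whole TB; the result is printed in TB.
btrfs_raid1_usable() {
  local total=0 largest=0 size
  for size in "$@"; do
    total=$((total + size))
    if [ "$size" -gt "$largest" ]; then
      largest=$size
    fi
  done
  local half=$((total / 2))
  local rest=$((total - largest))
  # The smaller value is the limit: either half the raw space, or
  # whatever the remaining disks can mirror against the largest one.
  if [ "$half" -lt "$rest" ]; then
    echo "$half"
  else
    echo "$rest"
  fi
}

# Current six disks from the plan: 2x16TB + 2x8TB + 2x8TB
btrfs_raid1_usable 16 16 8 8 8 8   # prints 32
```

With the six existing disks this gives 32 TB, the same usable figure the ZFS mirror layout reports. Passing the sizes for Option A or Option B to the same function gives a quick comparison before purchase.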
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1771488195,
-        "narHash": "sha256-2kMxqdDyPluRQRoES22Y0oSjp7pc5fj2nRterfmSIyc=",
+        "lastModified": 1771004123,
+        "narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
        "ref": "master",
-        "rev": "2d26de50559d8acb82ea803764e138325d95572c",
-        "revCount": 37,
+        "rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
+        "revCount": 36,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
@@ -64,11 +64,11 @@
    },
    "nixpkgs": {
      "locked": {
-        "lastModified": 1771419570,
-        "narHash": "sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU=",
+        "lastModified": 1771043024,
+        "narHash": "sha256-O1XDr7EWbRp+kHrNNgLWgIrB0/US5wvw9K6RERWAj6I=",
        "owner": "nixos",
        "repo": "nixpkgs",
-        "rev": "6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47",
+        "rev": "3aadb7ca9eac2891d52a9dec199d9580a6e2bf44",
        "type": "github"
      },
      "original": {
@@ -80,11 +80,11 @@
    },
    "nixpkgs-unstable": {
      "locked": {
-        "lastModified": 1771369470,
-        "narHash": "sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ=",
+        "lastModified": 1771008912,
+        "narHash": "sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb+ZnAo5RzSxJg=",
        "owner": "nixos",
        "repo": "nixpkgs",
-        "rev": "0182a361324364ae3f436a63005877674cf45efb",
+        "rev": "a82ccc39b39b621151d6732718e3e250109076fa",
        "type": "github"
      },
      "original": {
--- a/flake.nix
+++ b/flake.nix
@@ -92,6 +92,15 @@
            ./hosts/http-proxy
          ];
        };
+        monitoring01 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/monitoring01
+          ];
+        };
        jelly01 = nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = {
--- a/hosts/http-proxy/configuration.nix
+++ b/hosts/http-proxy/configuration.nix
@@ -18,7 +18,12 @@
    "sonarr"
    "ha"
    "z2m"
+    "grafana"
+    "prometheus"
+    "alertmanager"
    "jelly"
+    "pyroscope"
+    "pushgw"
  ];

  nixpkgs.config.allowUnfree = true;
--- a/hosts/monitoring01/configuration.nix
+++ b/hosts/monitoring01/configuration.nix
@@ -0,0 +1,114 @@
+{
+  pkgs,
+  ...
+}:
+
+{
+  imports = [
+    ./hardware-configuration.nix
+
+    ../../system
+    ../../common/vm
+  ];
+
+  homelab.host.role = "monitoring";
+
+  nixpkgs.config.allowUnfree = true;
+  # Use the GRUB boot loader.
+  boot.loader.grub = {
+    enable = true;
+    device = "/dev/sda";
+    configurationLimit = 3;
+  };
+
+  networking.hostName = "monitoring01";
+  networking.domain = "home.2rjus.net";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  services.resolved.enable = true;
+  networking.nameservers = [
+    "10.69.13.5"
+    "10.69.13.6"
+  ];
+
+  systemd.network.enable = true;
+  systemd.network.networks."ens18" = {
+    matchConfig.Name = "ens18";
+    address = [
+      "10.69.13.13/24"
+    ];
+    routes = [
+      { Gateway = "10.69.13.1"; }
+    ];
+    linkConfig.RequiredForOnline = "routable";
+  };
+  time.timeZone = "Europe/Oslo";
+
+  nix.settings.experimental-features = [
+    "nix-command"
+    "flakes"
+  ];
+  nix.settings.tarball-ttl = 0;
+  environment.systemPackages = with pkgs; [
+    vim
+    wget
+    git
+    sqlite
+  ];
+
+  services.qemuGuest.enable = true;
+
+  # Vault secrets management
+  vault.enable = true;
+  homelab.deploy.enable = true;
+  vault.secrets.backup-helper = {
+    secretPath = "shared/backup/password";
+    extractKey = "password";
+    outputDir = "/run/secrets/backup_helper_secret";
+    services = [ "restic-backups-grafana" "restic-backups-grafana-db" ];
+  };
+
+  services.restic.backups.grafana = {
+    repository = "rest:http://10.69.12.52:8000/backup-nix";
+    passwordFile = "/run/secrets/backup_helper_secret";
+    paths = [ "/var/lib/grafana/plugins" ];
+    timerConfig = {
+      OnCalendar = "daily";
+      Persistent = true;
+      RandomizedDelaySec = "2h";
+    };
+    pruneOpts = [
+      "--keep-daily 7"
+      "--keep-weekly 4"
+      "--keep-monthly 6"
+      "--keep-within 1d"
+    ];
+    extraOptions = [ "--retry-lock=5m" ];
+  };
+
+  services.restic.backups.grafana-db = {
+    repository = "rest:http://10.69.12.52:8000/backup-nix";
+    passwordFile = "/run/secrets/backup_helper_secret";
+    command = [ "${pkgs.sqlite}/bin/sqlite3" "/var/lib/grafana/data/grafana.db" ".dump" ];
+    timerConfig = {
+      OnCalendar = "daily";
+      Persistent = true;
+      RandomizedDelaySec = "2h";
+    };
+    pruneOpts = [
+      "--keep-daily 7"
+      "--keep-weekly 4"
+      "--keep-monthly 6"
+      "--keep-within 1d"
+    ];
+    extraOptions = [ "--retry-lock=5m" ];
+  };
+
+  # Open ports in the firewall.
+  # networking.firewall.allowedTCPPorts = [ ... ];
+  # networking.firewall.allowedUDPPorts = [ ... ];
+  # Or disable the firewall altogether.
+  networking.firewall.enable = false;
+
+  system.stateVersion = "23.11"; # Did you read the comment?
+}
--- a/hosts/monitoring01/default.nix
+++ b/hosts/monitoring01/default.nix
@@ -0,0 +1,7 @@
+{ ... }:
+{
+  imports = [
+    ./configuration.nix
+    ../../services/monitoring
+  ];
+}
--- a/hosts/monitoring01/hardware-configuration.nix
+++ b/hosts/monitoring01/hardware-configuration.nix
@@ -0,0 +1,42 @@
+{
+  config,
+  lib,
+  pkgs,
+  modulesPath,
+  ...
+}:
+
+{
+  imports = [
+    (modulesPath + "/profiles/qemu-guest.nix")
+  ];
+  boot.initrd.availableKernelModules = [
+    "ata_piix"
+    "uhci_hcd"
+    "virtio_pci"
+    "virtio_scsi"
+    "sd_mod"
+    "sr_mod"
+  ];
+  boot.initrd.kernelModules = [ "dm-snapshot" ];
+  boot.kernelModules = [
+    "ptp_kvm"
+  ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = {
+    device = "/dev/disk/by-label/root";
+    fsType = "xfs";
+  };
+
+  swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
+
+  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
+  # (the default) this is the recommended approach. When using systemd-networkd it's
+  # still possible to use this option, but it's recommended to use it in conjunction
+  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
+  networking.useDHCP = lib.mkDefault true;
+  # networking.interfaces.ens18.useDHCP = lib.mkDefault true;
+
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
--- a/hosts/monitoring02/configuration.nix
+++ b/hosts/monitoring02/configuration.nix
@@ -18,7 +18,7 @@
    role = "monitoring";
  };

-  homelab.dns.cnames = [ "monitoring" "alertmanager" "grafana" "grafana-test" "metrics" "vmalert" "loki" ];
+  homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" ];

  # Enable Vault integration
  vault.enable = true;
--- a/hosts/monitoring02/default.nix
+++ b/hosts/monitoring02/default.nix
@@ -3,10 +3,5 @@
    ./configuration.nix
    ../../services/grafana
    ../../services/victoriametrics
-    ../../services/loki
-    ../../services/monitoring/alerttonotify.nix
-    ../../services/monitoring/blackbox.nix
-    ../../services/monitoring/exportarr.nix
-    ../../services/monitoring/pve.nix
  ];
 }
--- a/hosts/nix-cache02/builder.nix
+++ b/hosts/nix-cache02/builder.nix
@@ -25,7 +25,7 @@
      };
    };

-    timeout = 14400;
+    timeout = 7200;
    metrics.enable = true;
  };

--- a/hosts/template2/bootstrap.nix
+++ b/hosts/template2/bootstrap.nix
@@ -6,8 +6,7 @@ let
    text = ''
      set -euo pipefail

-      LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
-      LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
+      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"

      # Send a log entry to Loki with bootstrap status
      # Usage: log_to_loki <stage> <message>
@@ -37,14 +36,8 @@ let
            }]
          }')

-        local auth_args=()
-        if [[ -f "$LOKI_AUTH_FILE" ]]; then
-          auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
-        fi
-
        curl -s --connect-timeout 2 --max-time 5 \
          -X POST \
-          "''${auth_args[@]}" \
          -H "Content-Type: application/json" \
          -d "$payload" \
          "$LOKI_URL" >/dev/null 2>&1 || true
--- a/scripts/vault-fetch/README.md
+++ b/scripts/vault-fetch/README.md
@@ -20,10 +20,10 @@ vault-fetch <secret-path> <output-directory> [cache-directory]

 ```bash
 # Fetch Grafana admin secrets
-vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana
+vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana

 # Use default cache location
-vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana
+vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana
 ```

 ## How It Works
@@ -53,13 +53,13 @@ If Vault is unreachable or authentication fails:
 This tool is designed to be called from systemd service `ExecStartPre` hooks via the `vault.secrets` NixOS module:

 ```nix
-vault.secrets.mqtt-password = {
-  secretPath = "hosts/ha1/mqtt-password";
+vault.secrets.grafana-admin = {
+  secretPath = "hosts/monitoring01/grafana-admin";
 };

 # Service automatically gets secrets fetched before start
-systemd.services.mosquitto.serviceConfig = {
-  EnvironmentFile = "/run/secrets/mqtt-password/password";
+systemd.services.grafana.serviceConfig = {
+  EnvironmentFile = "/run/secrets/grafana-admin/password";
 };
 ```

--- a/scripts/vault-fetch/vault-fetch.sh
+++ b/scripts/vault-fetch/vault-fetch.sh
@@ -5,7 +5,7 @@ set -euo pipefail
 #
 # Usage: vault-fetch <secret-path> <output-directory> [cache-directory]
 #
-# Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana
+# Example: vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
 #
 # This script:
 # 1. Authenticates to Vault using AppRole credentials from /var/lib/vault/approle/
@@ -17,7 +17,7 @@ set -euo pipefail
 # Parse arguments
 if [ $# -lt 2 ]; then
    echo "Usage: vault-fetch <secret-path> <output-directory> [cache-directory]" >&2
-    echo "Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
+    echo "Example: vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
    exit 1
 fi

--- a/services/grafana/dashboards/apiary.json
+++ b/services/grafana/dashboards/apiary.json
@@ -19,7 +19,7 @@
      "title": "SSH Connections",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})",
@@ -51,7 +51,7 @@
      "title": "Active Sessions",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "oubliette_sessions_active{job=\"apiary\"}",
@@ -86,7 +86,7 @@
      "title": "Unique IPs",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "oubliette_storage_unique_ips{job=\"apiary\"}",
@@ -118,7 +118,7 @@
      "title": "Total Login Attempts",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "oubliette_storage_login_attempts_total{job=\"apiary\"}",
@@ -150,7 +150,7 @@
      "title": "SSH Connections Over Time",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "interval": "60s",
      "targets": [
        {
@@ -183,7 +183,7 @@
      "title": "Auth Attempts Over Time",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "interval": "60s",
      "targets": [
        {
@@ -216,7 +216,7 @@
      "title": "Sessions by Shell",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "interval": "60s",
      "targets": [
        {
@@ -249,7 +249,7 @@
      "title": "Attempts by Country",
      "type": "geomap",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 12},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "oubliette_auth_attempts_by_country_total{job=\"apiary\"}",
@@ -318,7 +318,7 @@
      "title": "Session Duration Distribution",
      "type": "heatmap",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "interval": "60s",
      "targets": [
        {
@@ -359,7 +359,7 @@
      "title": "Commands Executed by Shell",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "interval": "60s",
      "targets": [
        {
--- a/services/grafana/dashboards/certificates.json
+++ b/services/grafana/dashboards/certificates.json
@@ -16,7 +16,7 @@
      "title": "Endpoints Monitored",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})",
@@ -48,7 +48,7 @@
      "title": "Probe Failures",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)",
@@ -82,7 +82,7 @@
      "title": "Expiring Soon (< 7d)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)",
@@ -116,7 +116,7 @@
      "title": "Expiring Critical (< 24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)",
@@ -150,7 +150,7 @@
      "title": "Minimum Days Remaining",
      "type": "gauge",
      "gridPos": {"h": 4, "w": 8, "x": 16, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)",
@@ -187,7 +187,7 @@
      "title": "Certificate Expiry by Endpoint",
      "type": "table",
      "gridPos": {"h": 12, "w": 12, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -253,7 +253,7 @@
      "title": "Probe Status",
      "type": "table",
      "gridPos": {"h": 12, "w": 12, "x": 12, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "probe_success{job=\"blackbox_tls\"}",
@@ -340,7 +340,7 @@
      "title": "Certificate Expiry Over Time",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -378,7 +378,7 @@
      "title": "Probe Success Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100",
@@ -418,7 +418,7 @@
      "title": "Probe Duration",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "probe_duration_seconds{job=\"blackbox_tls\"}",
--- a/services/grafana/dashboards/nixos-fleet.json
+++ b/services/grafana/dashboards/nixos-fleet.json
@@ -15,7 +15,7 @@
      {
        "name": "tier",
        "type": "query",
-        "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "query": "label_values(nixos_flake_info, tier)",
        "refresh": 2,
        "includeAll": true,
@@ -30,7 +30,7 @@
      "title": "Hosts Behind Remote",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 1)",
@@ -65,7 +65,7 @@
      "title": "Hosts Needing Reboot",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(nixos_config_mismatch{tier=~\"$tier\"} == 1)",
@@ -100,7 +100,7 @@
      "title": "Total Hosts",
      "type": "stat",
      "gridPos": {"h": 4, "w": 3, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(nixos_flake_info{tier=~\"$tier\"})",
@@ -128,7 +128,7 @@
      "title": "Nixpkgs Age",
      "type": "stat",
      "gridPos": {"h": 4, "w": 3, "x": 11, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "max(nixos_flake_input_age_seconds{input=\"nixpkgs\", tier=~\"$tier\"})",
@@ -163,7 +163,7 @@
      "title": "Hosts Up-to-date",
      "type": "stat",
      "gridPos": {"h": 4, "w": 3, "x": 14, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 0)",
@@ -192,7 +192,7 @@
      "title": "Deployments (24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 3, "x": 17, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_deployments_total{status=\"completed\"}[24h]))",
@@ -222,7 +222,7 @@
      "title": "Avg Deploy Time",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_deployment_duration_seconds_sum{success=\"true\"}[24h])) / sum(increase(homelab_deploy_deployment_duration_seconds_count{success=\"true\"}[24h]))",
@@ -256,7 +256,7 @@
      "title": "Fleet Status",
      "type": "table",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "nixos_flake_info{tier=~\"$tier\"}",
@@ -430,7 +430,7 @@
      "title": "Generation Age by Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sort_desc(nixos_generation_age_seconds{tier=~\"$tier\"})",
@@ -467,7 +467,7 @@
      "title": "Generations per Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sort_desc(nixos_generation_count{tier=~\"$tier\"})",
@@ -501,7 +501,7 @@
      "title": "Deployment Activity (Generation Age Over Time)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
@@ -534,7 +534,7 @@
      "title": "Flake Input Ages",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "max by (input) (nixos_flake_input_age_seconds)",
@@ -577,7 +577,7 @@
      "title": "Hosts by Revision",
      "type": "piechart",
      "gridPos": {"h": 6, "w": 6, "x": 12, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count by (current_rev) (nixos_flake_info{tier=~\"$tier\"})",
@@ -601,7 +601,7 @@
      "title": "Hosts by Tier",
      "type": "piechart",
      "gridPos": {"h": 6, "w": 6, "x": 18, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count by (tier) (nixos_flake_info)",
@@ -641,7 +641,7 @@
      "title": "Builds (24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[24h]))",
@@ -671,7 +671,7 @@
      "title": "Failed Builds (24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[24h])) or vector(0)",
@@ -705,7 +705,7 @@
      "title": "Last Build",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "time() - max(homelab_deploy_build_last_timestamp)",
@@ -739,7 +739,7 @@
      "title": "Avg Build Time",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_build_duration_seconds_sum[24h])) / sum(increase(homelab_deploy_build_duration_seconds_count[24h]))",
@@ -773,7 +773,7 @@
      "title": "Total Hosts Built",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(homelab_deploy_build_duration_seconds_count)",
@@ -802,7 +802,7 @@
      "title": "Build Jobs (24h)",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_builds_total[24h]))",
@@ -832,7 +832,7 @@
      "title": "Build Time by Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 41},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sort_desc(homelab_deploy_build_duration_seconds_sum / homelab_deploy_build_duration_seconds_count)",
@@ -869,7 +869,7 @@
      "title": "Build Count by Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 41},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sort_desc(sum by (host) (homelab_deploy_build_host_total))",
@@ -903,7 +903,7 @@
      "title": "Build Activity",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 49},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[1h]))",
--- a/services/grafana/dashboards/node-exporter.json
+++ b/services/grafana/dashboards/node-exporter.json
@@ -11,7 +11,7 @@
      {
        "name": "instance",
        "type": "query",
-        "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "query": "label_values(node_uname_info, instance)",
        "refresh": 2,
        "includeAll": false,
@@ -26,7 +26,7 @@
      "title": "CPU Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
@@ -55,7 +55,7 @@
      "title": "Memory Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
@@ -84,7 +84,7 @@
      "title": "Disk Usage",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 8, "x": 0, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
@@ -113,7 +113,7 @@
      "title": "System Load",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "node_load1{instance=~\"$instance\"}",
@@ -142,7 +142,7 @@
      "title": "Uptime",
      "type": "stat",
      "gridPos": {"h": 8, "w": 8, "x": 16, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "time() - node_boot_time_seconds{instance=~\"$instance\"}",
@@ -161,7 +161,7 @@
      "title": "Network Traffic",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
@@ -185,7 +185,7 @@
      "title": "Disk I/O",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",
--- a/services/grafana/dashboards/proxmox.json
+++ b/services/grafana/dashboards/proxmox.json
@@ -15,7 +15,7 @@
      {
        "name": "vm",
        "type": "query",
-        "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "query": "label_values(pve_guest_info{template=\"0\"}, name)",
        "refresh": 2,
        "includeAll": true,
@@ -30,7 +30,7 @@
      "title": "VMs Running",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 1)",
@@ -56,7 +56,7 @@
      "title": "VMs Stopped",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 0)",
@@ -87,7 +87,7 @@
      "title": "Node CPU",
      "type": "gauge",
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "pve_cpu_usage_ratio{id=~\"node/.*\"} * 100",
@@ -120,7 +120,7 @@
      "title": "Node Memory",
      "type": "gauge",
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
@@ -153,7 +153,7 @@
      "title": "Node Uptime",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "pve_uptime_seconds{id=~\"node/.*\"}",
@@ -180,7 +180,7 @@
      "title": "Templates",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(pve_guest_info{template=\"1\"})",
@@ -206,7 +206,7 @@
      "title": "VM Status",
      "type": "table",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -362,7 +362,7 @@
      "title": "VM CPU Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
@@ -391,7 +391,7 @@
      "title": "VM Memory Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -420,7 +420,7 @@
      "title": "VM Network Traffic",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -453,7 +453,7 @@
      "title": "VM Disk I/O",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -486,7 +486,7 @@
      "title": "Storage Usage",
      "type": "bargauge",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
@@ -531,7 +531,7 @@
      "title": "Storage Capacity",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",
--- a/services/grafana/dashboards/systemd.json
+++ b/services/grafana/dashboards/systemd.json
@@ -15,7 +15,7 @@
      {
        "name": "hostname",
        "type": "query",
-        "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "query": "label_values(systemd_unit_state, hostname)",
        "refresh": 2,
        "includeAll": true,
@@ -30,7 +30,7 @@
      "title": "Failed Units",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
@@ -60,7 +60,7 @@
      "title": "Active Units",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
@@ -86,7 +86,7 @@
      "title": "Hosts Monitored",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
@@ -112,7 +112,7 @@
      "title": "Total Service Restarts",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
@@ -143,7 +143,7 @@
      "title": "Inactive Units",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
@@ -169,7 +169,7 @@
      "title": "Timers",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
@@ -195,7 +195,7 @@
      "title": "Failed Units",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
@@ -251,7 +251,7 @@
      "title": "Service Restarts (Top 15)",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
@@ -309,7 +309,7 @@
      "title": "Active Units per Host",
      "type": "bargauge",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
@@ -339,7 +339,7 @@
      "title": "NixOS Upgrade Timers",
      "type": "table",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
@@ -429,7 +429,7 @@
      "title": "Backup Timers",
      "type": "table",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
@@ -524,7 +524,7 @@
      "title": "Service Restarts Over Time",
      "type": "timeseries",
      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",
--- a/services/grafana/dashboards/temperature.json
+++ b/services/grafana/dashboards/temperature.json
@@ -19,7 +19,7 @@
      "title": "Current Temperatures",
      "type": "stat",
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -71,7 +71,7 @@
      "title": "Average Home Temperature",
      "type": "gauge",
      "gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
@@ -108,7 +108,7 @@
      "title": "Current Humidity",
      "type": "stat",
      "gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
@@ -154,7 +154,7 @@
      "title": "Temperature History (30 Days)",
      "type": "timeseries",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -207,7 +207,7 @@
      "title": "Temperature Trend (1h rate of change)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
@@ -268,7 +268,7 @@
      "title": "24h Min / Max / Avg",
      "type": "table",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
@@ -346,7 +346,7 @@
      "title": "Humidity History (30 Days)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
-      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "targets": [
        {
          "expr": "hass_sensor_humidity_percent",
--- a/services/grafana/default.nix
+++ b/services/grafana/default.nix
@@ -37,10 +37,6 @@
    # Declarative datasources
    provision.datasources.settings = {
      apiVersion = 1;
-      prune = true;
-      deleteDatasources = [
-        { name = "Prometheus (monitoring01)"; orgId = 1; }
-      ];
      datasources = [
        {
          name = "VictoriaMetrics";
@@ -49,10 +45,16 @@
          isDefault = true;
          uid = "victoriametrics";
        }
+        {
+          name = "Prometheus (monitoring01)";
+          type = "prometheus";
+          url = "http://monitoring01.home.2rjus.net:9090";
+          uid = "prometheus";
+        }
        {
          name = "Loki";
          type = "loki";
-          url = "http://localhost:3100";
+          url = "http://monitoring01.home.2rjus.net:3100";
          uid = "loki";
        }
      ];
@@ -89,14 +91,6 @@
      acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
      metrics
    '';
-    virtualHosts."grafana.home.2rjus.net".extraConfig = ''
-      log {
-        output file /var/log/caddy/grafana.log {
-          mode 644
-        }
-      }
-      reverse_proxy http://127.0.0.1:3000
-    '';
    virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
      log {
        output file /var/log/caddy/grafana.log {
--- a/services/http-proxy/proxy.nix
+++ b/services/http-proxy/proxy.nix
@@ -54,49 +54,53 @@
        }
        reverse_proxy http://ha1.home.2rjus.net:8080
      }
-
+      prometheus.home.2rjus.net {
+        log {
+          output file /var/log/caddy/prometheus.log {
+            mode 644
+          }
+        }
+        reverse_proxy http://monitoring01.home.2rjus.net:9090
+      }
+      alertmanager.home.2rjus.net {
+        log {
+          output file /var/log/caddy/alertmanager.log {
+            mode 644
+          }
+        }
+        reverse_proxy http://monitoring01.home.2rjus.net:9093
+      }
+      grafana.home.2rjus.net {
+        log {
+          output file /var/log/caddy/grafana.log {
+            mode 644
+          }
+        }
+        reverse_proxy http://monitoring01.home.2rjus.net:3000
+      }
      jelly.home.2rjus.net {
        log {
          output file /var/log/caddy/jelly.log {
            mode 644
          }
        }
-        header Content-Type text/html
-        respond <<HTML
-          <!DOCTYPE html>
-          <html>
-          <head>
-            <title>Jellyfin - Maintenance</title>
-            <style>
-              body {
-                background: #101020;
-                color: #ddd;
-                font-family: sans-serif;
-                display: flex;
-                justify-content: center;
-                align-items: center;
-                min-height: 100vh;
-                margin: 0;
-                text-align: center;
+        reverse_proxy http://jelly01.home.2rjus.net:8096
      }
-              .container { max-width: 500px; }
-              .disk { font-size: 80px; animation: spin 3s linear infinite; display: inline-block; }
-              @keyframes spin { from { transform: rotate(0deg); } to { transform: rotate(360deg); } }
-              h1 { color: #00a4dc; }
-              p { font-size: 1.2em; line-height: 1.6; }
-            </style>
-          </head>
-          <body>
-            <div class="container">
-              <div class="disk">&#x1F4BF;</div>
-              <h1>Jellyfin is taking a nap</h1>
-              <p>The NAS is getting shiny new hard drives.<br>
-              Jellyfin will be back once the disks stop spinning up.</p>
-              <p style="color:#666;font-size:0.9em;">In the meantime, maybe go outside?</p>
-            </div>
-          </body>
-          </html>
-        HTML 200
+      pyroscope.home.2rjus.net {
+        log {
+          output file /var/log/caddy/pyroscope.log {
+            mode 644
+          }
+        }
+        reverse_proxy http://monitoring01.home.2rjus.net:4040
+      }
+      pushgw.home.2rjus.net {
+        log {
+          output file /var/log/caddy/pushgw.log {
+            mode 644
+          }
+        }
+        reverse_proxy http://monitoring01.home.2rjus.net:9091
      }
      http://http-proxy.home.2rjus.net/metrics {
        log {
--- a/services/loki/default.nix
+++ b/services/loki/default.nix
@@ -1,104 +0,0 @@
-{ config, lib, pkgs, ... }:
-let
-  # Script to generate bcrypt hash from Vault password for Caddy basic_auth
-  generateCaddyAuth = pkgs.writeShellApplication {
-    name = "generate-caddy-loki-auth";
-    runtimeInputs = [ config.services.caddy.package ];
-    text = ''
-      PASSWORD=$(cat /run/secrets/loki-push-auth)
-      HASH=$(caddy hash-password --plaintext "$PASSWORD")
-      echo "LOKI_PUSH_HASH=$HASH" > /run/secrets/caddy-loki-auth.env
-      chmod 0400 /run/secrets/caddy-loki-auth.env
-    '';
-  };
-in
-{
-  # Fetch Loki push password from Vault
-  vault.secrets.loki-push-auth = {
-    secretPath = "shared/loki/push-auth";
-    extractKey = "password";
-    services = [ "caddy" ];
-  };
-
-  # Generate bcrypt hash for Caddy before it starts
-  systemd.services.caddy-loki-auth = {
-    description = "Generate Caddy basic auth hash for Loki";
-    after = [ "vault-secret-loki-push-auth.service" ];
-    requires = [ "vault-secret-loki-push-auth.service" ];
-    before = [ "caddy.service" ];
-    requiredBy = [ "caddy.service" ];
-    serviceConfig = {
-      Type = "oneshot";
-      RemainAfterExit = true;
-      ExecStart = lib.getExe generateCaddyAuth;
-    };
-  };
-
-  # Load the bcrypt hash as environment variable for Caddy
-  services.caddy.environmentFile = "/run/secrets/caddy-loki-auth.env";
-
-  # Caddy reverse proxy for Loki with basic auth
-  services.caddy.virtualHosts."loki.home.2rjus.net".extraConfig = ''
-    basic_auth {
-      promtail {env.LOKI_PUSH_HASH}
-    }
-    reverse_proxy http://127.0.0.1:3100
-  '';
-
-  services.loki = {
-    enable = true;
-    configuration = {
-      auth_enabled = false;
-
-      server = {
-        http_listen_address = "127.0.0.1";
-        http_listen_port = 3100;
-      };
-      common = {
-        ring = {
-          instance_addr = "127.0.0.1";
-          kvstore = {
-            store = "inmemory";
-          };
-        };
-        replication_factor = 1;
-        path_prefix = "/var/lib/loki";
-      };
-      schema_config = {
-        configs = [
-          {
-            from = "2024-01-01";
-            store = "tsdb";
-            object_store = "filesystem";
-            schema = "v13";
-            index = {
-              prefix = "loki_index_";
-              period = "24h";
-            };
-          }
-        ];
-      };
-      storage_config = {
-        filesystem = {
-          directory = "/var/lib/loki/chunks";
-        };
-      };
-      compactor = {
-        working_directory = "/var/lib/loki/compactor";
-        compaction_interval = "10m";
-        retention_enabled = true;
-        retention_delete_delay = "2h";
-        retention_delete_worker_count = 150;
-        delete_request_store = "filesystem";
-      };
-      limits_config = {
-        retention_period = "30d";
-        ingestion_rate_mb = 10;
-        ingestion_burst_size_mb = 20;
-        max_streams_per_user = 10000;
-        max_query_series = 500;
-        max_query_parallelism = 8;
-      };
-    };
-  };
-}
--- a/services/monitoring/blackbox.nix
+++ b/services/monitoring/blackbox.nix
@@ -1,4 +1,33 @@
 { pkgs, ... }:
+let
+  # TLS endpoints to monitor for certificate expiration
+  # These are all services using ACME certificates from OpenBao PKI
+  tlsTargets = [
+    # Direct ACME certs (security.acme.certs)
+    "https://vault.home.2rjus.net:8200"
+    "https://auth.home.2rjus.net"
+    "https://testvm01.home.2rjus.net"
+
+    # Caddy auto-TLS on http-proxy
+    "https://nzbget.home.2rjus.net"
+    "https://radarr.home.2rjus.net"
+    "https://sonarr.home.2rjus.net"
+    "https://ha.home.2rjus.net"
+    "https://z2m.home.2rjus.net"
+    "https://prometheus.home.2rjus.net"
+    "https://alertmanager.home.2rjus.net"
+    "https://grafana.home.2rjus.net"
+    "https://jelly.home.2rjus.net"
+    "https://pyroscope.home.2rjus.net"
+    "https://pushgw.home.2rjus.net"
+
+    # Caddy auto-TLS on nix-cache02
+    "https://nix-cache.home.2rjus.net"
+
+    # Caddy auto-TLS on grafana01
+    "https://grafana-test.home.2rjus.net"
+  ];
+in
 {
  services.prometheus.exporters.blackbox = {
    enable = true;
@@ -28,4 +57,36 @@
              - 503
    '';
  };
+
+  # Add blackbox scrape config to Prometheus
+  # Alert rules are in rules.yml (certificate_rules group)
+  services.prometheus.scrapeConfigs = [
+    {
+      job_name = "blackbox_tls";
+      metrics_path = "/probe";
+      params = {
+        module = [ "https_cert" ];
+      };
+      static_configs = [{
+        targets = tlsTargets;
+      }];
+      relabel_configs = [
+        # Pass the target URL to blackbox as a parameter
+        {
+          source_labels = [ "__address__" ];
+          target_label = "__param_target";
+        }
+        # Use the target URL as the instance label
+        {
+          source_labels = [ "__param_target" ];
+          target_label = "instance";
+        }
+        # Point the actual scrape at the local blackbox exporter
+        {
+          target_label = "__address__";
+          replacement = "127.0.0.1:9115";
+        }
+      ];
+    }
+  ];
 }
--- a/services/monitoring/default.nix
+++ b/services/monitoring/default.nix
@@ -0,0 +1,14 @@
+{ ... }:
+{
+  imports = [
+    ./loki.nix
+    ./grafana.nix
+    ./prometheus.nix
+    ./blackbox.nix
+    ./exportarr.nix
+    ./pve.nix
+    ./alerttonotify.nix
+    ./pyroscope.nix
+    ./tempo.nix
+  ];
+}
--- a/services/monitoring/exportarr.nix
+++ b/services/monitoring/exportarr.nix
@@ -14,4 +14,14 @@
    apiKeyFile = config.vault.secrets.sonarr-api-key.outputDir;
    port = 9709;
  };
+
+  # Scrape config
+  services.prometheus.scrapeConfigs = [
+    {
+      job_name = "sonarr";
+      static_configs = [{
+        targets = [ "localhost:9709" ];
+      }];
+    }
+  ];
 }
--- a/services/monitoring/grafana.nix
+++ b/services/monitoring/grafana.nix
@@ -0,0 +1,11 @@
+{ pkgs, ... }:
+{
+  services.grafana = {
+    enable = true;
+    settings = {
+      server = {
+        http_addr = "";
+      };
+    };
+  };
+}
--- a/services/monitoring/loki.nix
+++ b/services/monitoring/loki.nix
@@ -0,0 +1,58 @@
+{ ... }:
+{
+  services.loki = {
+    enable = true;
+    configuration = {
+      auth_enabled = false;
+
+      server = {
+        http_listen_port = 3100;
+      };
+      common = {
+        ring = {
+          instance_addr = "127.0.0.1";
+          kvstore = {
+            store = "inmemory";
+          };
+        };
+        replication_factor = 1;
+        path_prefix = "/var/lib/loki";
+      };
+      schema_config = {
+        configs = [
+          {
+            from = "2024-01-01";
+            store = "tsdb";
+            object_store = "filesystem";
+            schema = "v13";
+            index = {
+              prefix = "loki_index_";
+              period = "24h";
+            };
+          }
+        ];
+      };
+      storage_config = {
+        filesystem = {
+          directory = "/var/lib/loki/chunks";
+        };
+      };
+      compactor = {
+        working_directory = "/var/lib/loki/compactor";
+        compaction_interval = "10m";
+        retention_enabled = true;
+        retention_delete_delay = "2h";
+        retention_delete_worker_count = 150;
+        delete_request_store = "filesystem";
+      };
+      limits_config = {
+        retention_period = "30d";
+        ingestion_rate_mb = 10;
+        ingestion_burst_size_mb = 20;
+        max_streams_per_user = 10000;
+        max_query_series = 500;
+        max_query_parallelism = 8;
+      };
+    };
+  };
+}
--- a/services/monitoring/prometheus.nix
+++ b/services/monitoring/prometheus.nix
@@ -0,0 +1,267 @@
+{ self, lib, pkgs, ... }:
+let
+  monLib = import ../../lib/monitoring.nix { inherit lib; };
+  externalTargets = import ./external-targets.nix;
+
+  nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
+  autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
+
+  # Script to fetch AppRole token for Prometheus to use when scraping OpenBao metrics
+  fetchOpenbaoToken = pkgs.writeShellApplication {
+    name = "fetch-openbao-token";
+    runtimeInputs = [ pkgs.curl pkgs.jq ];
+    text = ''
+      VAULT_ADDR="https://vault01.home.2rjus.net:8200"
+      APPROLE_DIR="/var/lib/vault/approle"
+      OUTPUT_FILE="/run/secrets/prometheus/openbao-token"
+
+      # Read AppRole credentials
+      if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
+        echo "AppRole credentials not found at $APPROLE_DIR" >&2
+        exit 1
+      fi
+
+      ROLE_ID=$(cat "$APPROLE_DIR/role-id")
+      SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
+
+      # Authenticate to Vault
+      AUTH_RESPONSE=$(curl -sf -k -X POST \
+        -d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
+        "$VAULT_ADDR/v1/auth/approle/login")
+
+      # Extract token
+      VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
+      if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
+        echo "Failed to extract Vault token from response" >&2
+        exit 1
+      fi
+
+      # Write token to file
+      mkdir -p "$(dirname "$OUTPUT_FILE")"
+      echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
+      chown prometheus:prometheus "$OUTPUT_FILE"
+      chmod 0400 "$OUTPUT_FILE"
+
+      echo "Successfully fetched OpenBao token"
+    '';
+  };
+in
+{
+  # Systemd service to fetch AppRole token for Prometheus OpenBao scraping
+  # The token is used to authenticate when scraping /v1/sys/metrics
+  systemd.services.prometheus-openbao-token = {
+    description = "Fetch OpenBao token for Prometheus metrics scraping";
+    after = [ "network-online.target" ];
+    wants = [ "network-online.target" ];
+    before = [ "prometheus.service" ];
+    requiredBy = [ "prometheus.service" ];
+
+    serviceConfig = {
+      Type = "oneshot";
+      ExecStart = lib.getExe fetchOpenbaoToken;
+    };
+  };
+
+  # Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
+  systemd.timers.prometheus-openbao-token = {
+    description = "Refresh OpenBao token for Prometheus";
+    wantedBy = [ "timers.target" ];
+    timerConfig = {
+      OnBootSec = "5min";
+      OnUnitActiveSec = "30min";
+      RandomizedDelaySec = "5min";
+    };
+  };
+
+  # Fetch apiary bearer token from Vault
+  vault.secrets.prometheus-apiary-token = {
+    secretPath = "hosts/monitoring01/apiary-token";
+    extractKey = "password";
+    owner = "prometheus";
+    group = "prometheus";
+    services = [ "prometheus" ];
+  };
+
+  services.prometheus = {
+    enable = true;
+    # syntax-only check because we use external credential files (e.g., openbao-token)
+    checkConfig = "syntax-only";
+    alertmanager = {
+      enable = true;
+      configuration = {
+        global = {
+        };
+        route = {
+          receiver = "webhook_natstonotify";
+          group_wait = "30s";
+          group_interval = "5m";
+          repeat_interval = "1h";
+          group_by = [ "alertname" ];
+        };
+        receivers = [
+          {
+            name = "webhook_natstonotify";
+            webhook_configs = [
+              {
+                url = "http://localhost:5001/alert";
+              }
+            ];
+          }
+        ];
+      };
+    };
+    alertmanagers = [
+      {
+        static_configs = [
+          {
+            targets = [ "localhost:9093" ];
+          }
+        ];
+      }
+    ];
+
+    retentionTime = "30d";
+    globalConfig = {
+      scrape_interval = "15s";
+    };
+    rules = [
+      (builtins.readFile ./rules.yml)
+    ];
+
+    scrapeConfigs = [
+      # Auto-generated node-exporter targets from flake hosts + external
+      # Each static_config entry may have labels from homelab.host metadata
+      {
+        job_name = "node-exporter";
+        static_configs = nodeExporterTargets;
+      }
+      # Systemd exporter on all hosts (same targets, different port)
+      # Preserves the same label grouping as node-exporter
+      {
+        job_name = "systemd-exporter";
+        static_configs = map
+          (cfg: cfg // {
+            targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+          })
+          nodeExporterTargets;
+      }
+      # Local monitoring services (not auto-generated)
+      {
+        job_name = "prometheus";
+        static_configs = [
+          {
+            targets = [ "localhost:9090" ];
+          }
+        ];
+      }
+      {
+        job_name = "loki";
+        static_configs = [
+          {
+            targets = [ "localhost:3100" ];
+          }
+        ];
+      }
+      {
+        job_name = "grafana";
+        static_configs = [
+          {
+            targets = [ "localhost:3000" ];
+          }
+        ];
+      }
+      {
+        job_name = "alertmanager";
+        static_configs = [
+          {
+            targets = [ "localhost:9093" ];
+          }
+        ];
+      }
+      {
+        job_name = "pushgateway";
+        honor_labels = true;
+        static_configs = [
+          {
+            targets = [ "localhost:9091" ];
+          }
+        ];
+      }
+      # Caddy metrics from nix-cache02 (serves nix-cache.home.2rjus.net)
+      {
+        job_name = "nix-cache_caddy";
+        scheme = "https";
+        static_configs = [
+          {
+            targets = [ "nix-cache.home.2rjus.net" ];
+          }
+        ];
+      }
+      # pve-exporter with complex relabel config
+      {
+        job_name = "pve-exporter";
+        static_configs = [
+          {
+            targets = [ "10.69.12.75" ];
+          }
+        ];
+        metrics_path = "/pve";
+        params = {
+          module = [ "default" ];
+          cluster = [ "1" ];
+          node = [ "1" ];
+        };
+        relabel_configs = [
+          {
+            source_labels = [ "__address__" ];
+            target_label = "__param_target";
+          }
+          {
+            source_labels = [ "__param_target" ];
+            target_label = "instance";
+          }
+          {
+            target_label = "__address__";
+            replacement = "127.0.0.1:9221";
+          }
+        ];
+      }
+      # OpenBao metrics with bearer token auth
+      {
+        job_name = "openbao";
+        scheme = "https";
+        metrics_path = "/v1/sys/metrics";
+        params = {
+          format = [ "prometheus" ];
+        };
+        static_configs = [{
+          targets = [ "vault01.home.2rjus.net:8200" ];
+        }];
+        authorization = {
+          type = "Bearer";
+          credentials_file = "/run/secrets/prometheus/openbao-token";
+        };
+      }
+      # Apiary external service
+      {
+        job_name = "apiary";
+        scheme = "https";
+        scrape_interval = "60s";
+        static_configs = [{
+          targets = [ "apiary.t-juice.club" ];
+        }];
+        authorization = {
+          type = "Bearer";
+          credentials_file = "/run/secrets/prometheus-apiary-token";
+        };
+      }
+    ] ++ autoScrapeConfigs;
+
+    pushgateway = {
+      enable = true;
+      web = {
+        external-url = "https://pushgw.home.2rjus.net";
+      };
+    };
+  };
+}
--- a/services/monitoring/pve.nix
+++ b/services/monitoring/pve.nix
@@ -1,7 +1,7 @@
 { config, ... }:
 {
  vault.secrets.pve-exporter = {
-    secretPath = "hosts/monitoring02/pve-exporter";
+    secretPath = "hosts/monitoring01/pve-exporter";
    extractKey = "config";
    outputDir = "/run/secrets/pve_exporter";
    mode = "0444";
--- a/services/monitoring/pyroscope.nix
+++ b/services/monitoring/pyroscope.nix
@@ -0,0 +1,8 @@
+{ ... }:
+{
+  virtualisation.oci-containers.containers.pyroscope = {
+    pull = "missing";
+    image = "grafana/pyroscope:latest";
+    ports = [ "4040:4040" ];
+  };
+}
--- a/services/monitoring/rules.yml
+++ b/services/monitoring/rules.yml
@@ -67,13 +67,13 @@ groups:
          summary: "Promtail service not running on {{ $labels.instance }}"
          description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
      - alert: filesystem_filling_up
-        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[24h], 24*3600) < 0
+        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
-          description: "Based on the last 24h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
+          description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
      - alert: systemd_not_running
        expr: node_systemd_system_running == 0
        for: 10m
@@ -259,32 +259,32 @@ groups:
          description: "Wireguard handshake timeout on {{ $labels.instance }} for peer {{ $labels.public_key }}."
  - name: monitoring_rules
    rules:
-      - alert: victoriametrics_not_running
-        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="victoriametrics.service", state="active"} == 0
+      - alert: prometheus_not_running
+        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="prometheus.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
-          summary: "VictoriaMetrics service not running on {{ $labels.instance }}"
-          description: "VictoriaMetrics service not running on {{ $labels.instance }}"
-      - alert: vmalert_not_running
-        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="vmalert.service", state="active"} == 0
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "vmalert service not running on {{ $labels.instance }}"
-          description: "vmalert service not running on {{ $labels.instance }}"
+          summary: "Prometheus service not running on {{ $labels.instance }}"
+          description: "Prometheus service not running on {{ $labels.instance }}"
      - alert: alertmanager_not_running
-        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
+        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager service not running on {{ $labels.instance }}"
          description: "Alertmanager service not running on {{ $labels.instance }}"
+      - alert: pushgateway_not_running
+        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Pushgateway service not running on {{ $labels.instance }}"
+          description: "Pushgateway service not running on {{ $labels.instance }}"
      - alert: loki_not_running
-        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="loki.service", state="active"} == 0
+        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="loki.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
@@ -292,13 +292,29 @@ groups:
          summary: "Loki service not running on {{ $labels.instance }}"
          description: "Loki service not running on {{ $labels.instance }}"
      - alert: grafana_not_running
-        expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
+        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Grafana service not running on {{ $labels.instance }}"
          description: "Grafana service not running on {{ $labels.instance }}"
+      - alert: tempo_not_running
+        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="tempo.service", state="active"} == 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Tempo service not running on {{ $labels.instance }}"
+          description: "Tempo service not running on {{ $labels.instance }}"
+      - alert: pyroscope_not_running
+        expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="podman-pyroscope.service", state="active"} == 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Pyroscope service not running on {{ $labels.instance }}"
+          description: "Pyroscope service not running on {{ $labels.instance }}"
  - name: proxmox_rules
    rules:
      - alert: pve_node_down
--- a/services/monitoring/tempo.nix
+++ b/services/monitoring/tempo.nix
@@ -0,0 +1,37 @@
+{ ... }:
+{
+  services.tempo = {
+    enable = true;
+    settings = {
+      server = {
+        http_listen_port = 3200;
+        grpc_listen_port = 3201;
+      };
+      distributor = {
+        receivers = {
+          otlp = {
+            protocols = {
+              http = {
+                endpoint = ":4318";
+                cors = {
+                  allowed_origins = [ "*.home.2rjus.net" ];
+                };
+              };
+            };
+          };
+        };
+      };
+      storage = {
+        trace = {
+          backend = "local";
+          local = {
+            path = "/var/lib/tempo";
+          };
+          wal = {
+            path = "/var/lib/tempo/wal";
+          };
+        };
+      };
+    };
+  };
+}
--- a/services/victoriametrics/default.nix
+++ b/services/victoriametrics/default.nix
@@ -6,24 +6,6 @@ let
  nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
  autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;

-  # TLS endpoints to monitor for certificate expiration via blackbox exporter
-  tlsTargets = [
-    "https://vault.home.2rjus.net:8200"
-    "https://auth.home.2rjus.net"
-    "https://testvm01.home.2rjus.net"
-    "https://nzbget.home.2rjus.net"
-    "https://radarr.home.2rjus.net"
-    "https://sonarr.home.2rjus.net"
-    "https://ha.home.2rjus.net"
-    "https://z2m.home.2rjus.net"
-    "https://metrics.home.2rjus.net"
-    "https://alertmanager.home.2rjus.net"
-    "https://grafana.home.2rjus.net"
-    "https://jelly.home.2rjus.net"
-    "https://nix-cache.home.2rjus.net"
-    "https://grafana-test.home.2rjus.net"
-  ];
-
  # Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
  fetchOpenbaoToken = pkgs.writeShellApplication {
    name = "fetch-openbao-token-vm";
@@ -125,39 +107,6 @@ let
        credentials_file = "/run/secrets/victoriametrics-apiary-token";
      };
    }
-    # Blackbox TLS certificate monitoring
-    {
-      job_name = "blackbox_tls";
-      metrics_path = "/probe";
-      params = {
-        module = [ "https_cert" ];
-      };
-      static_configs = [{ targets = tlsTargets; }];
-      relabel_configs = [
-        {
-          source_labels = [ "__address__" ];
-          target_label = "__param_target";
-        }
-        {
-          source_labels = [ "__param_target" ];
-          target_label = "instance";
-        }
-        {
-          target_label = "__address__";
-          replacement = "127.0.0.1:9115";
-        }
-      ];
-    }
-    # Sonarr exporter
-    {
-      job_name = "sonarr";
-      static_configs = [{ targets = [ "localhost:9709" ]; }];
-    }
-    # Proxmox VE exporter
-    {
-      job_name = "pve";
-      static_configs = [{ targets = [ "localhost:9221" ]; }];
-    }
  ] ++ autoScrapeConfigs;
 in
 {
@@ -203,7 +152,7 @@ in

  # Fetch apiary bearer token from Vault
  vault.secrets.victoriametrics-apiary-token = {
-    secretPath = "hosts/monitoring02/apiary-token";
+    secretPath = "hosts/monitoring01/apiary-token";
    extractKey = "password";
    owner = "victoriametrics";
    group = "victoriametrics";
@@ -221,12 +170,15 @@ in
    };
  };

-  # vmalert for alerting rules
+  # vmalert for alerting rules - no notifier during parallel operation
  services.vmalert.instances.default = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
-      "notifier.url" = [ "http://localhost:9093" ];
+      # Blackhole notifications during parallel operation to prevent duplicate alerts.
+      # Replace with notifier.url after cutover from monitoring01:
+      # "notifier.url" = [ "http://localhost:9093" ];
+      "notifier.blackhole" = true;
      "rule" = [ ../monitoring/rules.yml ];
    };
  };
@@ -239,11 +191,8 @@ in
    reverse_proxy http://127.0.0.1:8880
  '';

-  # Alertmanager
-  services.caddy.virtualHosts."alertmanager.home.2rjus.net".extraConfig = ''
-    reverse_proxy http://127.0.0.1:9093
-  '';
-
+  # Alertmanager - same config as monitoring01 but will only receive
+  # alerts after cutover (vmalert notifier is disabled above)
  services.prometheus.alertmanager = {
    enable = true;
    configuration = {
--- a/system/monitoring/logs.nix
+++ b/system/monitoring/logs.nix
@@ -16,16 +16,6 @@ in
      SystemKeepFree=1G
    '';
  };
-
-  # Fetch Loki push password from Vault (only on hosts with Vault enabled)
-  vault.secrets.promtail-loki-auth = lib.mkIf config.vault.enable {
-    secretPath = "shared/loki/push-auth";
-    extractKey = "password";
-    owner = "promtail";
-    group = "promtail";
-    services = [ "promtail" ];
-  };
-
  # Configure promtail
  services.promtail = {
    enable = true;
@@ -39,11 +29,7 @@ in

      clients = [
        {
-          url = "https://loki.home.2rjus.net/loki/api/v1/push";
-          basic_auth = {
-            username = "promtail";
-            password_file = "/run/secrets/promtail-loki-auth";
-          };
+          url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
        }
      ];

--- a/system/pipe-to-loki.nix
+++ b/system/pipe-to-loki.nix
@@ -16,8 +16,7 @@ let
    text = ''
      set -euo pipefail

-      LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
-      LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
+      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
      HOSTNAME=$(hostname)
      SESSION_ID=""
      RECORD_MODE=false
@@ -70,13 +69,7 @@ let
            }]
          }')

-        local auth_args=()
-        if [[ -f "$LOKI_AUTH_FILE" ]]; then
-          auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
-        fi
-
        if curl -s -X POST "$LOKI_URL" \
-          "''${auth_args[@]}" \
          -H "Content-Type: application/json" \
          -d "$payload" > /dev/null; then
          return 0
--- a/system/vault-secrets.nix
+++ b/system/vault-secrets.nix
@@ -57,7 +57,7 @@ let
        type = types.str;
        description = ''
          Path to the secret in Vault (without /v1/secret/data/ prefix).
-          Example: "hosts/ha1/mqtt-password"
+          Example: "hosts/monitoring01/grafana-admin"
        '';
      };

@@ -152,11 +152,13 @@ in
      '';
      example = literalExpression ''
        {
-          mqtt-password = {
-            secretPath = "hosts/ha1/mqtt-password";
-            owner = "mosquitto";
-            group = "mosquitto";
-            services = [ "mosquitto" ];
+          grafana-admin = {
+            secretPath = "hosts/monitoring01/grafana-admin";
+            owner = "grafana";
+            group = "grafana";
+            restartTrigger = true;
+            restartInterval = "daily";
+            services = [ "grafana" ];
          };
        }
      '';
--- a/terraform/vault/approle.tf
+++ b/terraform/vault/approle.tf
@@ -26,27 +26,26 @@ path "secret/data/shared/nixos-exporter/*" {
 EOT
 }

-# Shared policy for Loki push authentication (all hosts push logs)
-resource "vault_policy" "loki_push" {
-  name = "loki-push"
-
-  policy = <<EOT
-path "secret/data/shared/loki/*" {
-  capabilities = ["read", "list"]
-}
-EOT
-}
-
 # Define host access policies
 locals {
  host_policies = {
-    # Example:
+    # Example: monitoring01 host
+    # "monitoring01" = {
+    #   paths = [
+    #     "secret/data/hosts/monitoring01/*",
+    #     "secret/data/services/prometheus/*",
+    #     "secret/data/services/grafana/*",
+    #     "secret/data/shared/smtp/*"
+    #   ]
+    #   extra_policies = ["some-other-policy"]  # Optional: additional policies
+    # }
+
+    # Example: ha1 host
    # "ha1" = {
    #   paths = [
    #     "secret/data/hosts/ha1/*",
    #     "secret/data/shared/mqtt/*"
    #   ]
-    #   extra_policies = ["some-other-policy"]  # Optional: additional policies
    # }

    "ha1" = {
@@ -56,6 +55,16 @@ locals {
      ]
    }

+    "monitoring01" = {
+      paths = [
+        "secret/data/hosts/monitoring01/*",
+        "secret/data/shared/backup/*",
+        "secret/data/shared/nats/*",
+        "secret/data/services/exportarr/*",
+      ]
+      extra_policies = ["prometheus-metrics"]
+    }
+
    # Wave 1: hosts with no service secrets (only need vault.enable for future use)
    "nats1" = {
      paths = [
@@ -69,7 +78,7 @@ locals {
      ]
    }

-    # Wave 3: DNS servers (managed in hosts-generated.tf)
+    # Wave 3: DNS servers

    # Wave 4: http-proxy
    "http-proxy" = {
@@ -95,6 +104,15 @@ locals {
      ]
    }

+    # monitoring02: Grafana + VictoriaMetrics
+    "monitoring02" = {
+      paths = [
+        "secret/data/hosts/monitoring02/*",
+        "secret/data/hosts/monitoring01/apiary-token",
+        "secret/data/services/grafana/*",
+      ]
+    }
+
  }
 }

@@ -120,7 +138,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
  backend   = vault_auth_backend.approle.path
  role_name = each.key
  token_policies = concat(
-    ["${each.key}-policy", "homelab-deploy", "nixos-exporter", "loki-push"],
+    ["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
    lookup(each.value, "extra_policies", [])
  )

--- a/terraform/vault/hosts-generated.tf
+++ b/terraform/vault/hosts-generated.tf
@@ -44,15 +44,6 @@ locals {
        "secret/data/hosts/garage01/*",
      ]
    }
-    "monitoring02" = {
-      paths = [
-        "secret/data/hosts/monitoring02/*",
-        "secret/data/services/grafana/*",
-        "secret/data/services/exportarr/*",
-        "secret/data/shared/nats/nkey",
-      ]
-      extra_policies = ["prometheus-metrics"]
-    }
  
  }

@@ -83,10 +74,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {

  backend            = vault_auth_backend.approle.path
  role_name          = each.key
-  token_policies     = concat(
-    ["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"],
-    lookup(each.value, "extra_policies", [])
-  )
+  token_policies     = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
  secret_id_ttl      = 0 # Never expire (wrapped tokens provide time limit)
  token_ttl          = 3600
  token_max_ttl      = 3600
--- a/terraform/vault/secrets.tf
+++ b/terraform/vault/secrets.tf
@@ -10,6 +10,10 @@ resource "vault_mount" "kv" {
 locals {
  secrets = {
    # Example host-specific secrets
+    # "hosts/monitoring01/grafana-admin" = {
+    #   auto_generate   = true
+    #   password_length = 32
+    # }
    # "hosts/ha1/mqtt-password" = {
    #   auto_generate   = true
    #   password_length = 24
@@ -31,6 +35,11 @@ locals {
    #   }
    # }

+    "hosts/monitoring01/grafana-admin" = {
+      auto_generate   = true
+      password_length = 32
+    }
+
    "hosts/ha1/mqtt-password" = {
      auto_generate   = true
      password_length = 24
@@ -48,8 +57,8 @@ locals {
      data          = { nkey = var.nats_nkey }
    }

-    # PVE exporter config for monitoring02
-    "hosts/monitoring02/pve-exporter" = {
+    # PVE exporter config for monitoring01
+    "hosts/monitoring01/pve-exporter" = {
      auto_generate = false
      data          = { config = var.pve_exporter_config }
    }
@@ -140,16 +149,10 @@ locals {
    }

    # Bearer token for scraping apiary metrics
-    "hosts/monitoring02/apiary-token" = {
+    "hosts/monitoring01/apiary-token" = {
      auto_generate   = true
      password_length = 64
    }
-
-    # Loki push authentication (used by Promtail on all hosts)
-    "shared/loki/push-auth" = {
-      auto_generate   = true
-      password_length = 32
-    }
  }
 }
Author	SHA1	Message	Date
Torjus Håkestad	ef850d91a4	terraform: grant monitoring02 access to apiary-token secret Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:52:07 +01:00
Torjus Håkestad	a99fb5b959	grafana: remove one-time deleteDatasources cleanup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:49:27 +01:00
Torjus Håkestad	d385f02c89	grafana: fix datasource provisioning crash from renamed Prometheus datasource Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:44:35 +01:00
Torjus Håkestad	8dfd04b406	monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with Caddy TLS termination via internal ACME CA. Refactors Grafana's Caddy config from configFile to globalConfig + virtualHosts so both modules can contribute routes to the same Caddy instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:36:11 +01:00
Torjus Håkestad	63cf690598	victoriametrics: fix vmalert crash by adding notifier.blackhole vmalert requires either a notifier URL or -notifier.blackhole when alerting rules are present. Add blackhole flag for parallel operation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:31:53 +01:00
Torjus Håkestad	ef8eeaa2f5	monitoring02: add VictoriaMetrics, vmalert, and Alertmanager Set up the core metrics stack on monitoring02 as Phase 2 of the monitoring migration. VictoriaMetrics replaces Prometheus with identical scrape configs (22 jobs including auto-generated targets). - VictoriaMetrics with 3-month retention and all scrape configs - vmalert evaluating existing rules.yml (notifier disabled) - Alertmanager with same routing config (no alerts during parallel op) - Grafana datasources updated: local VictoriaMetrics as default - Static user override for credential file access (OpenBao, Apiary) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:29:34 +01:00
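
Note on the `filesystem_filling_up` change in `services/monitoring/rules.yml`: the diff shortens the `predict_linear` range from `[24h]` to `[6h]` while keeping the 24-hour look-ahead. The commit messages do not state a reason, so the following is only a rough sketch of how the window length can change the outcome. It assumes hourly samples in arbitrary units, with the newest sample taken at evaluation time, and uses a hypothetical helper `predict_free` that fits a least-squares line over the window and extrapolates it, which is the general idea behind `predict_linear`. The sample data is made up for illustration.

```shell
#!/bin/sh
# predict_free SAMPLES WINDOW_HOURS AHEAD_HOURS   (hypothetical helper, not part of the repo)
# SAMPLES: space-separated hourly free-space readings, oldest first.
# Fits a least-squares line over the last WINDOW_HOURS+1 samples and
# extrapolates AHEAD_HOURS past the newest sample.
predict_free() {
  echo "$1" | awk -v w="$2" -v ahead="$3" '{
    first = NF - w                 # oldest sample inside the window
    if (first < 1) first = 1
    n = 0; st = 0; sv = 0; stt = 0; stv = 0
    for (i = first; i <= NF; i++) {
      t = i - NF                   # hours relative to the newest sample (<= 0)
      n++; st += t; sv += $i; stt += t * t; stv += t * $i
    }
    slope = (n * stv - st * sv) / (n * stt - st * st)
    intercept = (sv - slope * st) / n    # fitted value at t = 0
    printf "%.1f\n", intercept + slope * ahead
  }'
}

# Made-up data: flat at 100 for 18 hours, then dropping by 10 per hour.
samples="100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 90 80 70 60 50 40"

predict_free "$samples" 6 24    # 6h window:  -200.0 (below zero, condition true)
predict_free "$samples" 24 24   # 24h window: 31.5   (above zero, condition false)
```

With this data the 6-hour window sees only the recent decline and predicts exhaustion, while the 24-hour window averages in the earlier flat period and does not. The real rule works in seconds on `node_filesystem_free_bytes` and additionally requires the condition to hold for `1h`.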