# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run in parallel with monitoring01 until validated, then switch over using a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring02** (10.69.13.24) - **PRIMARY**:

- 4 CPU cores, 8GB RAM, 60GB disk
- VictoriaMetrics with 3-month retention
- vmalert with alerting enabled (routes to local Alertmanager)
- Alertmanager -> alerttonotify -> NATS notification pipeline
- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
- Loki (log aggregation)
- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki

**monitoring01** (10.69.13.13) - **SHUT DOWN**:

- No longer running, pending decommission

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:

- Single-binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
               ┌─────────────────┐
               │  monitoring02   │
               │ VictoriaMetrics │
               │ + Grafana       │
  monitoring   │ + Loki          │
  CNAME ───────│ + Alertmanager  │
               │   (vmalert)     │
               └─────────────────┘
                        ▲
                        │ scrapes
        ┌───────────────┼───────────────┐
        │               │               │
   ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
   │   ns1   │    │   ha1    │    │   ...    │
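The scrape topology above maps onto the upstream NixOS module roughly as follows. This is only a sketch, not the repo's actual `services/victoriametrics/` module; the target names are taken from the diagram, and everything else is an assumption about the upstream module's options.

```nix
# Sketch only: option names follow the upstream NixOS VictoriaMetrics module;
# the real module in services/victoriametrics/ carries far more configuration.
{
  services.victoriametrics = {
    enable = true;
    retentionPeriod = "3";    # months, per this plan
    listenAddress = ":8428";
    prometheusConfig = {
      scrape_configs = [
        {
          job_name = "node";
          static_configs = [
            # node_exporter hosts from the diagram
            { targets = [ "ns1:9100" "ha1:9100" ]; }
          ];
        }
      ];
    };
  };
}
```

Because VictoriaMetrics accepts Prometheus-format scrape configs, the 22 existing jobs can be carried over without rewriting them.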
   │  :9100  │    │  :9100   │    │  :9100   │
   └─────────┘    └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:

- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as a test instance (`grafana-test.home.2rjus.net`)

### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]

New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager, imported by monitoring02 alongside the existing Grafana service.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `retentionPeriod = "3"` (3 months)
   - All scrape configs migrated from Prometheus (22 jobs, including auto-generated ones)
   - Static user override (DynamicUser disabled) for credential file access
   - OpenBao token fetch service + 30-minute refresh timer
   - Apiary bearer token via vault.secrets
2. **vmalert** for alerting rules:
   - Points to the VictoriaMetrics datasource at localhost:8428
   - Reuses the existing `services/monitoring/rules.yml` directly via `settings.rule`
   - Notifier sends to the local Alertmanager at localhost:9093
3. **Alertmanager** (port 9093):
   - Same configuration as monitoring01 (alerttonotify webhook routing)
   - alerttonotify imported on monitoring02, routes alerts via NATS
4. **Grafana** (port 3000):
   - VictoriaMetrics datasource (localhost:8428) as the default
   - Loki datasource pointing to localhost:3100
5. **Loki** (port 3100):
   - Same configuration as monitoring01, in a standalone `services/loki/` module
   - Grafana datasource updated to localhost:3100

**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02. pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics' native push support.

### Phase 3: Parallel Operation [COMPLETE]

Ran monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.
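The vmalert/Alertmanager wiring described in Phase 2 can be sketched like this. Option paths follow the upstream NixOS modules and the rules path is hypothetical; treat it as an illustration, not the repo's exact module.

```nix
# Sketch of the Phase 2 alerting pipeline; the ports come from this plan,
# the option names are assumptions about the upstream NixOS modules.
{
  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";   # VictoriaMetrics
      "notifier.url" = [ "http://localhost:9093" ]; # local Alertmanager
      rule = [ "/etc/vmalert/rules.yml" ];          # shared rules.yml (hypothetical path)
    };
  };

  services.prometheus.alertmanager = {
    enable = true;
    port = 9093;
    # Configuration carried over from monitoring01:
    # alerttonotify webhook receiver -> NATS notification pipeline.
  };
}
```

Pointing `rule` at the existing rules file is what lets the old Prometheus alerting rules be reused without a YAML-to-Nix conversion.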
### Phase 4: Add monitoring CNAME [COMPLETE]

Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.

### Phase 5: Update References [COMPLETE]

- Moved the alertmanager, grafana, and prometheus CNAMEs from http-proxy to monitoring02
- Removed the corresponding Caddy reverse proxy entries from http-proxy
- monitoring02's Caddy serves alertmanager, grafana, metrics, and vmalert directly

### Phase 6: Enable Alerting [COMPLETE]

- Switched vmalert from blackhole mode to the local Alertmanager
- alerttonotify service running on monitoring02 (NATS nkey from Vault)
- prometheus-metrics Vault policy added for OpenBao scraping
- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS

### Phase 7: Cutover and Decommission [IN PROGRESS]

- monitoring01 shut down (2026-02-17)
- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support

**Remaining cleanup (separate branch):**

- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
- [ ] Update `hosts/template2/bootstrap.nix` - bootstrap Loki URL still points to monitoring01
- [ ] Remove monitoring01 from flake.nix and the host configuration
- [ ] Destroy the monitoring01 VM in Proxmox
- [ ] Remove monitoring01 from the terraform state
- [ ] Remove or archive `services/monitoring/` (Prometheus config)

## Completed

- 2026-02-08: Phase 1 - monitoring02 host created
- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
- 2026-02-17: Phase 6 - Alerting enabled, CNAMEs migrated, monitoring01 shut down

## VictoriaMetrics Service Configuration

Implemented in `services/victoriametrics/default.nix`.
Key design decisions:

- **Static user**: the VictoriaMetrics NixOS module uses `DynamicUser`; it is overridden with a static `victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via a `settings.rule` path reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: uses the same `lib/monitoring.nix` functions and `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets

## Notes

- VictoriaMetrics listens on port 8428 vs Prometheus's 9090
- PromQL compatibility is excellent
- VictoriaMetrics' native push support replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the `create-host` script
- Grafana dashboards are defined declaratively via NixOS, not imported from monitoring01 state
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)
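The declarative Grafana setup mentioned in the notes might provision its two datasources along these lines (a sketch using the NixOS Grafana module's provisioning options; only the names, ports, and default flag come from this plan):

```nix
# Sketch: provision the two datasources from Phase 2 declaratively.
# VictoriaMetrics speaks the Prometheus HTTP API, so it is registered
# with type "prometheus".
{
  services.grafana.provision.datasources.settings.datasources = [
    {
      name = "VictoriaMetrics";
      type = "prometheus";
      url = "http://localhost:8428";
      isDefault = true;
    }
    {
      name = "Loki";
      type = "loki";
      url = "http://localhost:3100";
    }
  ];
}
```

Keeping datasources in the NixOS config (rather than Grafana's database) is what makes the monitoring01 state safe to discard at decommission time.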