# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) for better compression and longer retention. Run in parallel with monitoring01 until validated, then switch over via a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring01** (10.69.13.13):

- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)

**Hardcoded references to monitoring01:**

- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**

- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:

- Single-binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
                ┌─────────────────┐
                │  monitoring02   │
                │ VictoriaMetrics │
                │ + Grafana       │
  monitoring    │ + Loki          │
  CNAME ────────│ + Tempo         │
                │ + Pyroscope     │
                │ + Alertmanager  │
                │   (vmalert)     │
                └─────────────────┘
                         ▲
                         │ scrapes
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
    │   ns1   │    │   ha1    │    │   ...    │
    │  :9100  │    │  :9100   │    │  :9100   │
    └─────────┘    └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:

- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as a test instance (`grafana-test.home.2rjus.net`)

### Phase 2: Set Up VictoriaMetrics Stack

New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager, imported by monitoring02 alongside the existing Grafana service.

1. **VictoriaMetrics** (port 8428): [DONE]
   - `services.victoriametrics.enable = true`
   - `retentionPeriod = "3"` (3 months)
   - All scrape configs migrated from Prometheus (22 jobs, including auto-generated)
   - Static user override (DynamicUser disabled) for credential file access
   - OpenBao token fetch service + 30min refresh timer
   - Apiary bearer token via vault.secrets
2. **vmalert** for alerting rules: [DONE]
   - Points to the VictoriaMetrics datasource at localhost:8428
   - Reuses the existing `services/monitoring/rules.yml` directly via `settings.rule`
   - No notifier configured during parallel operation (prevents duplicate alerts)
3. **Alertmanager** (port 9093): [DONE]
   - Same configuration as monitoring01 (alerttonotify webhook routing)
   - Will only receive alerts after cutover (vmalert notifier disabled)
4. **Grafana** (port 3000): [DONE]
   - VictoriaMetrics datasource (localhost:8428) as default
   - monitoring01 Prometheus datasource kept for comparison during parallel operation
   - Loki datasource pointing to monitoring01 (until Loki is migrated)
5. **Loki** (port 3100):
   - TODO: same configuration as current
6. **Tempo** (ports 3200, 3201):
   - TODO: same configuration
7. **Pyroscope** (port 4040):
   - TODO: same Docker-based deployment

**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02. pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics' native push support.

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: both hosts scrape the same targets
   - Validates that VictoriaMetrics is collecting data correctly
2. **Dual log shipping**: configure Promtail to send logs to both Loki instances
   - Add a second client in `system/monitoring/logs.nix` pointing to monitoring02
3. **Validate dashboards**: access Grafana on monitoring02 and verify dashboards work
4. **Validate alerts**: verify vmalert evaluates rules correctly (no receiver = no notifications)
5. **Compare resource usage**: monitor disk/memory consumption on both hosts

### Phase 4: Add monitoring CNAME

Add a CNAME to monitoring02 once validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual shipping; point only to `http://monitoring.home.2rjus.net:3100`
2. **services/http-proxy/proxy.nix**: update the reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.

### Phase 6: Enable Alerting

Once ready to cut over:

1. Enable the Alertmanager receiver on monitoring02
2. Verify test alerts route correctly

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: prevents duplicate alerts during the transition
2. **Update bootstrap.nix**: point it to `monitoring.home.2rjus.net`
3. **Verify all targets are scraped**: check the VictoriaMetrics UI
4. **Verify logs are flowing**: check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove the host configuration
   - Destroy the VM in Proxmox
   - Remove from terraform state

## Current Progress

- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
- **Phase 2** in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, and Grafana datasources configured
- Remaining: Loki, Tempo, Pyroscope migration

## Open Questions

- [ ] What disk size for monitoring02? The current 60GB may need expansion for 3+ months of retention with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01's Grafana for the current set)
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as its successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so the Nix ergonomics are worse. The migration could be bundled with monitoring02 to consolidate disruption.

## VictoriaMetrics Service Configuration

Implemented in `services/victoriametrics/default.nix`. Key design decisions:

- **Static user**: the VictoriaMetrics NixOS module uses `DynamicUser`; this is overridden with a static `victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via a `settings.rule` path reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: uses the same `lib/monitoring.nix` functions and `services/monitoring/external-targets.nix` as Prometheus for the auto-generated targets

## Rollback Plan

If issues arise after cutover:

1. Move the `monitoring` CNAME back to monitoring01
2. Restart the monitoring01 services
3. Revert the Promtail config to point only to monitoring01
4. Revert the http-proxy backends

## Notes

- VictoriaMetrics listens on port 8428 vs Prometheus' 9090
- PromQL compatibility is strong: VictoriaMetrics' MetricsQL is backwards-compatible with PromQL, so existing queries should work unchanged
- VictoriaMetrics' native push support replaces Pushgateway (remove it from http-proxy if no longer needed)
- monitoring02 is deployed via OpenTofu using the `create-host` script
- Grafana dashboards are defined declaratively via NixOS, not imported from monitoring01 state