# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run in parallel with monitoring01 until validated, then switch over using a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring01** (10.69.13.13):

- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)

**Hardcoded References to monitoring01:**

- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**

- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:

- Single-binary replacement for Prometheus
- 5-10x better compression (30 days of retention could become 180+ days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
              ┌─────────────────┐
              │  monitoring02   │
              │ VictoriaMetrics │
              │  + Grafana      │
monitoring    │  + Loki         │
CNAME ────────│  + Tempo        │
              │  + Pyroscope    │
              │  + Alertmanager │
              │    (vmalert)    │
              └─────────────────┘
                       ▲
                       │ scrapes
       ┌───────────────┼───────────────┐
       │               │               │
  ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
  │   ns1   │    │   ha1    │    │   ...    │
  │  :9100  │    │  :9100   │    │  :9100   │
  └─────────┘    └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host

Use the `create-host` script, which handles flake.nix and terraform/vms.tf automatically.

1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
2. **Update VM resources** in `terraform/vms.tf`:
   - 4 cores (same as monitoring01)
   - 8GB RAM (double, for VictoriaMetrics headroom)
   - 100GB disk (for 3+ months of retention with compression)
3. **Update host configuration**: Import monitoring services
4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`

### Phase 2: Set Up VictoriaMetrics Stack

Create a new service module at `services/monitoring/victoriametrics/` for testing alongside the existing Prometheus config. Once validated, this can replace the Prometheus module.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `services.victoriametrics.retentionPeriod = "3"` (3 months — VictoriaMetrics counts suffix-less retention values in months; increase later based on disk usage)
   - Migrate scrape configs via `prometheusConfig`
   - Use native push support (replaces Pushgateway)
2. **vmalert** for alerting rules:
   - `services.vmalert.enable = true`
   - Point to VictoriaMetrics for metrics evaluation
   - Keep rules in a separate `rules.yml` file (same format as Prometheus)
   - No receiver configured during parallel operation (prevents duplicate alerts)
3. **Alertmanager** (port 9093):
   - Keep existing configuration (alerttonotify webhook routing)
   - Only enable the receiver after cutover from monitoring01
4. **Loki** (port 3100):
   - Same configuration as current
5. **Grafana** (port 3000):
   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
   - Reference existing dashboards on monitoring01 for content inspiration
   - Configure VictoriaMetrics datasource (port 8428)
   - Configure Loki datasource
6. **Tempo** (ports 3200, 3201):
   - Same configuration
7. **Pyroscope** (port 4040):
   - Same Docker-based deployment

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: Both hosts scrape the same targets
   - Validates VictoriaMetrics is collecting data correctly
2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add a second client in `system/monitoring/logs.nix` pointing to monitoring02
3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
5. **Compare resource usage**: Monitor disk/memory consumption between hosts

### Phase 4: Add monitoring CNAME

Add the CNAME to monitoring02 once validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.

### Phase 6: Enable Alerting

Once ready to cut over:

1. Enable the Alertmanager receiver on monitoring02
2. Verify test alerts route correctly

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: Prevent duplicate alerts during the transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check the VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state

## Current Progress

### monitoring02 Host Created (2026-02-08)

Host deployed at 10.69.13.24 (test tier) with:

- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled

### Grafana with Kanidm OIDC (2026-02-08)

Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):

- Kanidm OIDC authentication (PKCE enabled)
- Role mapping: `admins` → Admin, others → Viewer
- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
- Local Caddy for TLS termination via the internal ACME CA

This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing `services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/` module once monitoring02 becomes the primary monitoring host.

## Open Questions

- [ ] What disk size for monitoring02? The current 60GB may need expansion for 3+ months of retention with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01's Grafana for the current set)
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as its successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so the Nix ergonomics are weaker. The migration could be bundled with monitoring02 to consolidate disruption.
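The dual log-shipping step in Phase 3 amounts to adding a second client to Promtail. A minimal sketch against the NixOS `services.promtail` module — the actual attribute layout in `system/monitoring/logs.nix` may differ:

```nix
# system/monitoring/logs.nix (sketch) — ship logs to both Loki instances
# during parallel operation. The monitoring02 URL is an assumption based on
# the host naming used elsewhere in this plan.
services.promtail.configuration.clients = [
  # Existing client: monitoring01
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  # Second client added for Phase 3 dual shipping: monitoring02
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
];
```

In Phase 5 the list collapses back to a single client pointing at `http://monitoring.home.2rjus.net:3100/loki/api/v1/push`.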
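The Phase 5 proxy update can be sketched as Caddy virtual hosts repointed at the CNAME. This assumes the proxy is plain `services.caddy`; the real structure of `services/http-proxy/proxy.nix` may differ:

```nix
# services/http-proxy/proxy.nix (sketch) — repoint backends at the CNAME
# once monitoring02 is validated.
services.caddy.virtualHosts = {
  "prometheus.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:8428
  '';
  "alertmanager.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:9093
  '';
  "grafana.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:3000
  '';
  "pyroscope.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:4040
  '';
};
```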
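The Kanidm OIDC pattern validated above can be sketched with Grafana's `generic_oauth` settings. The client ID, scopes, and URLs below are placeholders (Kanidm issues the real values); only `use_pkce` and the role mapping reflect what this plan states:

```nix
# Sketch of the Grafana OIDC settings validated on monitoring02.
# client_id, scopes, and all URLs are assumed placeholders.
services.grafana.settings."auth.generic_oauth" = {
  enabled = true;
  name = "Kanidm";
  client_id = "grafana";                  # assumed client name in Kanidm
  use_pkce = true;                        # PKCE enabled, per the test instance
  scopes = "openid profile email groups"; # assumed scope set
  auth_url = "https://idm.example/ui/oauth2";   # placeholder Kanidm URLs
  token_url = "https://idm.example/oauth2/token";
  api_url = "https://idm.example/oauth2/openid/grafana/userinfo";
  # admins → Admin, everyone else → Viewer
  role_attribute_path = "contains(groups[*], 'admins') && 'Admin' || 'Viewer'";
};
```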
## VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

```nix
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3"; # 3 months (suffix-less values count as months); increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
  rule = [ ./rules.yml ];
};
```

## Rollback Plan

If issues arise after cutover:

1. Move the `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert the Promtail config to point only to monitoring01
4. Revert the http-proxy backends

## Notes

- VictoriaMetrics listens on port 8428 vs Prometheus's 9090
- PromQL compatibility is excellent
- VictoriaMetrics' native push support replaces Pushgateway (remove it from http-proxy if no longer needed)
- monitoring02 is deployed via OpenTofu using the `create-host` script
- Grafana dashboards are defined declaratively via NixOS, not imported from monitoring01 state
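The declaratively defined datasources mentioned above can be sketched with the standard NixOS Grafana provisioning options; because VictoriaMetrics serves the Prometheus query API, it is registered as a `prometheus`-type datasource (names and URLs are illustrative):

```nix
# Declarative Grafana datasources for monitoring02 (sketch).
services.grafana.provision.datasources.settings.datasources = [
  {
    name = "VictoriaMetrics";
    type = "prometheus";          # VictoriaMetrics speaks the Prometheus API
    url = "http://localhost:8428";
    isDefault = true;
  }
  {
    name = "Loki";
    type = "loki";
    url = "http://localhost:3100";
  }
];
```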