diff --git a/docs/plans/monitoring-migration-victoriametrics.md b/docs/plans/monitoring-migration-victoriametrics.md
new file mode 100644
index 0000000..7fc926a
--- /dev/null
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -0,0 +1,219 @@
+# Monitoring Stack Migration to VictoriaMetrics
+
+## Overview
+
+Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
+and longer retention. Run in parallel with monitoring01 until validated, then switch over using
+a `monitoring` CNAME for a seamless transition.
+
+## Current State
+
+**monitoring01** (10.69.13.13):
+- 4 CPU cores, 4GB RAM, 33GB disk
+- Prometheus with 30-day retention (15s scrape interval)
+- Alertmanager (routes to alerttonotify webhook)
+- Grafana (dashboards, datasources)
+- Loki (log aggregation from all hosts via Promtail)
+- Tempo (distributed tracing)
+- Pyroscope (continuous profiling)
+
+**Hardcoded References to monitoring01:**
+- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
+- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
+- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
+
+**Auto-generated:**
+- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
+- Node-exporter targets (from all hosts with static IPs)
+
+## Decision: VictoriaMetrics
+
+Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
+- Single binary replacement for Prometheus
+- 5-10x better compression (30 days could become 180+ days in the same space)
+- Same PromQL query language (Grafana dashboards work unchanged)
+- Same scrape config format (existing auto-generated configs work)
+
+If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
+
+## Architecture
+
+```
+                  ┌─────────────────┐
+                  │  monitoring02   │
+                  │  VictoriaMetrics│
+                  │  + Grafana      │
+  monitoring      │  + Loki         │
+  CNAME ──────────│  + Tempo        │
+                  │  + Pyroscope    │
+                  │  + Alertmanager │
+                  │    (vmalert)    │
+                  └─────────────────┘
+                           ▲
+                           │ scrapes
+           ┌───────────────┼───────────────┐
+           │               │               │
+      ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
+      │   ns1   │    │   ha1    │    │   ...    │
+      │  :9100  │    │  :9100   │    │  :9100   │
+      └─────────┘    └──────────┘    └──────────┘
+```
+
+## Implementation Plan
+
+### Phase 1: Create monitoring02 Host
+
+Use the `create-host` script, which handles `flake.nix` and `terraform/vms.tf` automatically.
+
+1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
+2. **Update VM resources** in `terraform/vms.tf`:
+   - 4 cores (same as monitoring01)
+   - 8GB RAM (double, for VictoriaMetrics headroom)
+   - 100GB disk (for 3+ months retention with compression)
+3. **Update host configuration**: Import monitoring services
+4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
+
+### Phase 2: Set Up VictoriaMetrics Stack
+
+Create a new service module at `services/monitoring/victoriametrics/` for testing alongside the existing
+Prometheus config. Once validated, this can replace the Prometheus module.
+
+1. **VictoriaMetrics** (port 8428):
+   - `services.victoriametrics.enable = true`
+   - `services.victoriametrics.retentionPeriod = "3"` (3 months; a value without a suffix is counted in months and the `m` suffix is not accepted; increase later based on disk usage)
+   - Migrate scrape configs via `prometheusConfig`
+   - Use native push support (replaces Pushgateway)
+
+2. **vmalert** for alerting rules:
+   - `services.vmalert.enable = true`
+   - Point to VictoriaMetrics for metrics evaluation
+   - Keep rules in separate `rules.yml` file (same format as Prometheus)
+   - No receiver configured during parallel operation (prevents duplicate alerts)
+
+3. **Alertmanager** (port 9093):
+   - Keep existing configuration (alerttonotify webhook routing)
+   - Only enable receiver after cutover from monitoring01
+
+4. **Loki** (port 3100):
+   - Same configuration as current
+
+5. **Grafana** (port 3000):
+   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
+   - Reference existing dashboards on monitoring01 for content inspiration
+   - Configure VictoriaMetrics datasource (port 8428)
+   - Configure Loki datasource
+
+6. **Tempo** (ports 3200, 3201):
+   - Same configuration
+
+7. **Pyroscope** (port 4040):
+   - Same Docker-based deployment
+
+### Phase 3: Parallel Operation
+
+Run both monitoring01 and monitoring02 simultaneously:
+
+1. **Dual scraping**: Both hosts scrape the same targets
+   - Validates VictoriaMetrics is collecting data correctly
+
+2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
+   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
+
+3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
+
+4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
+
+5. **Compare resource usage**: Monitor disk/memory consumption between hosts
+
+### Phase 4: Add monitoring CNAME
+
+Add CNAME to monitoring02 once validated:
+
+```nix
+# hosts/monitoring02/configuration.nix
+homelab.dns.cnames = [ "monitoring" ];
+```
+
+This creates `monitoring.home.2rjus.net` pointing to monitoring02.
+
+### Phase 5: Update References
+
+Update hardcoded references to use the CNAME:
+
+1. **system/monitoring/logs.nix**:
+   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
+
+2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
+   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
+   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
+   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
+   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
+
+Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
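+
+A minimal sketch of how the Promtail client list in `system/monitoring/logs.nix` might change between
+Phase 3 and Phase 5. It assumes the module configures Promtail through `services.promtail.configuration`
+and that monitoring02 resolves as `monitoring02.home.2rjus.net`. The `dualShipping` toggle and the `push`
+helper are hypothetical names used only for illustration.
+
+```nix
+# Sketch only: the real logs.nix may be structured differently.
+{ ... }:
+let
+  # Hypothetical toggle: true during Phase 3 (parallel operation), false from Phase 5 on.
+  dualShipping = true;
+  # Hypothetical helper that builds one Promtail client entry for a host name.
+  push = host: { url = "http://${host}.home.2rjus.net:3100/loki/api/v1/push"; };
+in
+{
+  services.promtail.configuration.clients =
+    if dualShipping then
+      [ (push "monitoring01") (push "monitoring02") ] # Phase 3: ship to both Loki instances
+    else
+      [ (push "monitoring") ]; # Phase 5: ship only to the CNAME
+}
+```
+
+Promtail sends each log line to every configured client, so the second client needs no change on
+monitoring01 and can be dropped again by flipping the toggle.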
+
+### Phase 6: Enable Alerting
+
+Once ready to cut over:
+1. Enable Alertmanager receiver on monitoring02
+2. Verify test alerts route correctly
+
+### Phase 7: Cutover and Decommission
+
+1. **Stop monitoring01**: Prevent duplicate alerts during transition
+2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
+3. **Verify all targets scraped**: Check VictoriaMetrics UI
+4. **Verify logs flowing**: Check Loki on monitoring02
+5. **Decommission monitoring01**:
+   - Remove from flake.nix
+   - Remove host configuration
+   - Destroy VM in Proxmox
+   - Remove from terraform state
+
+## Open Questions
+
+- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
+- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
+
+## VictoriaMetrics Service Configuration
+
+Example NixOS configuration for monitoring02:
+
+```nix
+# VictoriaMetrics replaces Prometheus
+services.victoriametrics = {
+  enable = true;
+  retentionPeriod = "3"; # 3 months (no suffix = months), increase based on disk usage
+  prometheusConfig = {
+    global.scrape_interval = "15s";
+    scrape_configs = [
+      # Auto-generated node-exporter targets
+      # Service-specific scrape targets
+      # External targets
+    ];
+  };
+};
+
+# vmalert for alerting rules (no receiver during parallel operation)
+services.vmalert = {
+  enable = true;
+  # vmalert flags are passed through `settings`, keyed by flag name
+  settings = {
+    "datasource.url" = "http://localhost:8428";
+    "notifier.blackhole" = true; # Evaluate rules without sending notifications
+    # "notifier.url" = [ "http://localhost:9093" ]; # Enable after cutover, in place of notifier.blackhole
+    rule = [ ./rules.yml ];
+  };
+};
+```
+
+## Rollback Plan
+
+If issues arise after cutover:
+1. Move `monitoring` CNAME back to monitoring01
+2. Restart monitoring01 services
+3. Revert Promtail config to point only to monitoring01
+4. Revert http-proxy backends
+
+## Notes
+
+- VictoriaMetrics uses port 8428 vs Prometheus 9090
+- PromQL compatibility is excellent: VictoriaMetrics' query language, MetricsQL, is designed to be backwards-compatible with PromQL
+- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
+- monitoring02 deployed via OpenTofu using the `create-host` script
+- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
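+
+A minimal sketch of the declarative Grafana setup mentioned in the notes, assuming the
+`services.grafana.provision` options from nixpkgs. The datasource names and the `./dashboards`
+directory are placeholders, not names or paths taken from this repository.
+
+```nix
+# Sketch only: names and paths are illustrative.
+{ ... }:
+{
+  services.grafana = {
+    enable = true;
+    provision = {
+      enable = true;
+      datasources.settings.datasources = [
+        {
+          name = "VictoriaMetrics"; # placeholder name
+          type = "prometheus"; # VictoriaMetrics serves the Prometheus query API
+          url = "http://localhost:8428";
+          isDefault = true;
+        }
+        {
+          name = "Loki"; # placeholder name
+          type = "loki";
+          url = "http://localhost:3100";
+        }
+      ];
+      # Load dashboard JSON files from a directory kept in the repository.
+      dashboards.settings.providers = [
+        {
+          name = "default";
+          options.path = ./dashboards; # hypothetical directory
+        }
+      ];
+    };
+  };
+}
+```
+
+Keeping datasources and dashboards in the module means monitoring02 can be rebuilt without copying
+any Grafana state from monitoring01.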