docs: add monitoring migration to VictoriaMetrics plan

Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 01:11:07 +01:00
parent 93dbb45802
commit 8959829f77
1 changed files with 219 additions and 0 deletions
--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -0,0 +1,219 @@
 # Monitoring Stack Migration to VictoriaMetrics
 ## Overview
 Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
 and longer retention. Run it in parallel with monitoring01 until validated, then switch over using
 a `monitoring` CNAME for a seamless transition.
 ## Current State
 **monitoring01** (10.69.13.13):
 - 4 CPU cores, 4GB RAM, 33GB disk
 - Prometheus with 30-day retention (15s scrape interval)
 - Alertmanager (routes to alerttonotify webhook)
 - Grafana (dashboards, datasources)
 - Loki (log aggregation from all hosts via Promtail)
 - Tempo (distributed tracing)
 - Pyroscope (continuous profiling)
 **Hardcoded References to monitoring01:**
 - `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
 - `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
 - `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
 **Auto-generated:**
 - Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
 - Node-exporter targets (from all hosts with static IPs)
 ## Decision: VictoriaMetrics
 Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
 - Single binary replacement for Prometheus
 - 5-10x better compression (30 days could become 180+ days in same space)
 - PromQL-compatible query language (MetricsQL, largely backward compatible with PromQL, so Grafana dashboards should work unchanged)
 - Same scrape config format (existing auto-generated configs work)
 If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
 ## Architecture
 ```
                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Tempo        │
                     │  + Pyroscope    │
                     │  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘
 ```
 ## Implementation Plan
 ### Phase 1: Create monitoring02 Host
 Use the `create-host` script, which updates `flake.nix` and `terraform/vms.tf` automatically.
 1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
 2. **Update VM resources** in `terraform/vms.tf`:
   - 4 cores (same as monitoring01)
   - 8GB RAM (double, for VictoriaMetrics headroom)
   - 100GB disk (for 3+ months retention with compression)
 3. **Update host configuration**: Import monitoring services
 4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
 ### Phase 2: Set Up VictoriaMetrics Stack
 Create a new service module at `services/monitoring/victoriametrics/` for testing alongside the existing
 Prometheus config. Once validated, this can replace the Prometheus module.
 1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `services.victoriametrics.retentionPeriod = "3"` (3 months; VictoriaMetrics reads a bare number as months and rejects an `m` suffix as ambiguous with minutes; increase later based on disk usage)
   - Migrate scrape configs via `prometheusConfig`
   - Use native push ingestion (for example the `/api/v1/import/prometheus` endpoint) in place of Pushgateway
 2. **vmalert** for alerting rules:
   - `services.vmalert.enable = true`
   - Point to VictoriaMetrics for metrics evaluation
   - Keep rules in separate `rules.yml` file (same format as Prometheus)
   - Send no notifications during parallel operation (prevents duplicate alerts); vmalert's `-notifier.blackhole` flag evaluates rules without notifying any receiver
 3. **Alertmanager** (port 9093):
   - Keep existing configuration (alerttonotify webhook routing)
   - Only enable receiver after cutover from monitoring01
 4. **Loki** (port 3100):
   - Same configuration as current
 5. **Grafana** (port 3000):
   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
   - Reference existing dashboards on monitoring01 for content inspiration
   - Configure VictoriaMetrics datasource (port 8428)
   - Configure Loki datasource
 6. **Tempo** (ports 3200, 3201):
   - Same configuration
 7. **Pyroscope** (port 4040):
   - Same Docker-based deployment
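 The Grafana items in step 5 could be expressed with the NixOS provisioning options roughly as follows. This is a minimal sketch, not the final module: the datasource names and the `./dashboards` directory are assumptions, and the URLs assume Grafana runs on the same host as VictoriaMetrics and Loki.
 ```nix
 # Sketch only: datasource names and the dashboards directory are assumptions.
 services.grafana = {
   enable = true;
   provision = {
     enable = true;
     datasources.settings.datasources = [
       {
         name = "VictoriaMetrics";
         type = "prometheus";  # VictoriaMetrics serves the Prometheus query API
         url = "http://localhost:8428";
         isDefault = true;
       }
       {
         name = "Loki";
         type = "loki";
         url = "http://localhost:3100";
       }
     ];
     dashboards.settings.providers = [
       {
         name = "default";
         options.path = ./dashboards;  # hypothetical directory of dashboard JSON files
       }
     ];
   };
 };
 ```
 Keeping dashboards as JSON files in the repository means monitoring02 can be rebuilt without copying any state from monitoring01.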
 ### Phase 3: Parallel Operation
 Run both monitoring01 and monitoring02 simultaneously:
 1. **Dual scraping**: Both hosts scrape the same targets
   - Validates VictoriaMetrics is collecting data correctly
 2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
 3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
 4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
 5. **Compare resource usage**: Monitor disk/memory consumption between hosts
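 Dual log shipping (step 2) might look like the sketch below. It assumes `system/monitoring/logs.nix` configures Promtail through `services.promtail.configuration`, that monitoring02 resolves as `monitoring02.home.2rjus.net`, and that both Loki instances accept pushes on the standard `/loki/api/v1/push` path; none of this is confirmed by the plan itself.
 ```nix
 # Sketch only: the monitoring02 hostname and the push path are assumptions.
 services.promtail.configuration.clients = [
   # Existing Loki instance
   { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
   # New Loki instance, added for the parallel-operation period
   { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
 ];
 ```
 Promtail sends each log entry to every configured client, so the second entry can be removed again in Phase 5 without touching the scrape configuration.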
 ### Phase 4: Add monitoring CNAME
 Add the `monitoring` CNAME to monitoring02 once it has been validated:
 ```nix
 # hosts/monitoring02/configuration.nix
 homelab.dns.cnames = [ "monitoring" ];
 ```
 This creates `monitoring.home.2rjus.net` pointing to monitoring02.
 ### Phase 5: Update References
 Update hardcoded references to use the CNAME:
 1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
 2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
 Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
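 The proxy backends listed in step 2 could be written along these lines. This is a sketch that assumes `services/http-proxy/proxy.nix` defines its sites through `services.caddy.virtualHosts`; the real file may be structured differently.
 ```nix
 # Sketch only: assumes proxy.nix uses services.caddy.virtualHosts.
 services.caddy.virtualHosts = {
   "prometheus.home.2rjus.net".extraConfig = ''
     reverse_proxy http://monitoring.home.2rjus.net:8428
   '';
   "alertmanager.home.2rjus.net".extraConfig = ''
     reverse_proxy http://monitoring.home.2rjus.net:9093
   '';
   "grafana.home.2rjus.net".extraConfig = ''
     reverse_proxy http://monitoring.home.2rjus.net:3000
   '';
   "pyroscope.home.2rjus.net".extraConfig = ''
     reverse_proxy http://monitoring.home.2rjus.net:4040
   '';
 };
 ```
 Because every backend uses the CNAME, a later host change only requires moving `monitoring` rather than editing the proxy again.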
 ### Phase 6: Enable Alerting
 Once ready to cut over:
 1. Enable Alertmanager receiver on monitoring02
 2. Verify test alerts route correctly
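 Step 1 might amount to a change like the one below. It is a sketch under the assumption that the nixpkgs vmalert module exposes command-line flags under `services.vmalert.settings`; the option layout differs between nixpkgs releases, so check it against the pinned version. `notifier.url` is vmalert's flag for the Alertmanager address.
 ```nix
 # Sketch only: option layout assumed from the nixpkgs vmalert module.
 services.vmalert.settings = {
   "datasource.url" = "http://localhost:8428";
   # Forward firing alerts to the local Alertmanager from now on
   "notifier.url" = [ "http://localhost:9093" ];
   rule = [ ./rules.yml ];
 };
 ```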
 ### Phase 7: Cutover and Decommission
 1. **Stop monitoring01**: Prevent duplicate alerts during transition
 2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
 3. **Verify all targets scraped**: Check VictoriaMetrics UI
 4. **Verify logs flowing**: Check Loki on monitoring02
 5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state
 ## Open Questions
 - [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
 - [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
 ## VictoriaMetrics Service Configuration
 Example NixOS configuration for monitoring02:
 ```nix
 # VictoriaMetrics replaces Prometheus
 services.victoriametrics = {
   enable = true;
   # 3 months; a bare number means months (the "m" suffix is rejected as ambiguous
   # with minutes). Increase based on disk usage.
   retentionPeriod = "3";
   prometheusConfig = {
     global.scrape_interval = "15s";
     scrape_configs = [
       # Auto-generated node-exporter targets
       # Service-specific scrape targets
       # External targets
     ];
   };
 };
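 # vmalert for alerting rules (no notifications during parallel operation)
 # NOTE: option layout assumes the nixpkgs vmalert module passes flags via `settings`;
 # verify against the pinned nixpkgs release.
 services.vmalert = {
   enable = true;
   settings = {
     "datasource.url" = "http://localhost:8428";
     # Evaluate rules without sending notifications anywhere
     "notifier.blackhole" = true;
     # After cutover, replace notifier.blackhole with:
     # "notifier.url" = [ "http://localhost:9093" ];
     rule = [ ./rules.yml ];
   };
 };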
 ```
 ## Rollback Plan
 If issues arise after cutover, and while monitoring01 still exists (before the Phase 7 decommission):
 1. Point the `monitoring` CNAME at monitoring01 instead of monitoring02
 2. Restart monitoring01 services
 3. Revert Promtail config to point only to monitoring01
 4. Revert http-proxy backends
 ## Notes
 - VictoriaMetrics uses port 8428 vs Prometheus 9090
 - PromQL compatibility is high (queries run as MetricsQL), but spot-check dashboards and alert rules during parallel operation
 - VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
 - monitoring02 deployed via OpenTofu using `create-host` script
 - Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state