docs: add monitoring migration to VictoriaMetrics plan

Plan for migrating from Prometheus to VictoriaMetrics on a new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 01:11:07 +01:00
parent 93dbb45802
commit 8959829f77
1 changed files with 219 additions and 0 deletions
--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -0,0 +1,219 @@
+# Monitoring Stack Migration to VictoriaMetrics
+
+## Overview
+
+Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
+and longer retention. Run in parallel with monitoring01 until validated, then switch over using
+a `monitoring` CNAME for a seamless transition.
+
+## Current State
+
+**monitoring01** (10.69.13.13):
+- 4 CPU cores, 4GB RAM, 33GB disk
+- Prometheus with 30-day retention (15s scrape interval)
+- Alertmanager (routes to alerttonotify webhook)
+- Grafana (dashboards, datasources)
+- Loki (log aggregation from all hosts via Promtail)
+- Tempo (distributed tracing)
+- Pyroscope (continuous profiling)
+
+**Hardcoded References to monitoring01:**
+- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
+- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
+- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
+
+**Auto-generated:**
+- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
+- Node-exporter targets (from all hosts with static IPs)
+
+## Decision: VictoriaMetrics
+
+Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
+- Single binary replacement for Prometheus
+- 5-10x better compression (at that ratio, the space holding 30 days today could hold roughly 150-300 days)
+- Same PromQL query language (Grafana dashboards work unchanged)
+- Same scrape config format (existing auto-generated configs work)
+
+If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
+
+## Architecture
+
+```
+                     ┌─────────────────┐
+                     │  monitoring02   │
+                     │  VictoriaMetrics│
+                     │  + Grafana      │
+     monitoring      │  + Loki         │
+     CNAME ──────────│  + Tempo        │
+                     │  + Pyroscope    │
+                     │  + Alertmanager │
+                     │  (vmalert)      │
+                     └─────────────────┘
+                            ▲
+                            │ scrapes
+            ┌───────────────┼───────────────┐
+            │               │               │
+       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
+       │  ns1    │    │  ha1     │    │  ...     │
+       │ :9100   │    │ :9100    │    │ :9100    │
+       └─────────┘    └──────────┘    └──────────┘
+```
+
+## Implementation Plan
+
+### Phase 1: Create monitoring02 Host
+
+Use the `create-host` script, which updates flake.nix and terraform/vms.tf automatically.
+
+1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
+2. **Update VM resources** in `terraform/vms.tf`:
+   - 4 cores (same as monitoring01)
+   - 8GB RAM (double, for VictoriaMetrics headroom)
+   - 100GB disk (for 3+ months retention with compression)
+3. **Update host configuration**: Import monitoring services
+4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
+
+### Phase 2: Set Up VictoriaMetrics Stack
+
+Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
+Prometheus config. Once validated, this can replace the Prometheus module.
+
+1. **VictoriaMetrics** (port 8428):
+   - `services.victoriametrics.enable = true`
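+   - `services.victoriametrics.retentionPeriod = "3"` (3 months; a bare number is counted in months, and VictoriaMetrics rejects an `m` suffix as ambiguous with minutes; increase later based on disk usage)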
+   - Migrate scrape configs via `prometheusConfig`
+   - Use native push support (replaces Pushgateway)
+
+2. **vmalert** for alerting rules:
+   - `services.vmalert.enable = true`
+   - Point to VictoriaMetrics for metrics evaluation
+   - Keep rules in separate `rules.yml` file (same format as Prometheus)
+   - No notifier configured during parallel operation (prevents duplicate alerts); set `-notifier.blackhole` so vmalert still evaluates rules without sending anything
+
+3. **Alertmanager** (port 9093):
+   - Keep existing configuration (alerttonotify webhook routing)
+   - Only enable receiver after cutover from monitoring01
+
+4. **Loki** (port 3100):
+   - Same configuration as current
+
+5. **Grafana** (port 3000):
+   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
+   - Reference existing dashboards on monitoring01 for content inspiration
+   - Configure VictoriaMetrics datasource (port 8428)
+   - Configure Loki datasource
+
+6. **Tempo** (ports 3200, 3201):
+   - Same configuration
+
+7. **Pyroscope** (port 4040):
+   - Same Docker-based deployment
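+
+A minimal sketch of the declarative Grafana setup from item 5, assuming the nixpkgs
+`services.grafana.provision` options. The datasource names and the `./dashboards` directory are
+placeholders, not existing repo paths. VictoriaMetrics is queried through Grafana's built-in
+Prometheus datasource type.
+
+```nix
+services.grafana = {
+  enable = true;
+  provision = {
+    enable = true;
+    datasources.settings.datasources = [
+      {
+        name = "VictoriaMetrics";  # placeholder name
+        type = "prometheus";       # VictoriaMetrics serves the Prometheus query API
+        url = "http://localhost:8428";
+        isDefault = true;
+      }
+      {
+        name = "Loki";  # placeholder name
+        type = "loki";
+        url = "http://localhost:3100";
+      }
+    ];
+    # Load dashboard JSON files from a directory kept in the repo (hypothetical path)
+    dashboards.settings.providers = [
+      {
+        name = "declarative";
+        options.path = ./dashboards;
+      }
+    ];
+  };
+};
+```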
+
+### Phase 3: Parallel Operation
+
+Run both monitoring01 and monitoring02 simultaneously:
+
+1. **Dual scraping**: Both hosts scrape the same targets
+   - Validates VictoriaMetrics is collecting data correctly
+
+2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
+   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02
+
+3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
+
+4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
+
+5. **Compare resource usage**: Monitor disk/memory consumption between hosts
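+
+Dual shipping in step 2 could look like the sketch below, assuming `system/monitoring/logs.nix`
+configures Promtail through `services.promtail.configuration` and that monitoring02 resolves as
+`monitoring02.home.2rjus.net` (both assumptions). `/loki/api/v1/push` is Loki's standard push endpoint.
+
+```nix
+services.promtail.configuration.clients = [
+  # Existing target on monitoring01
+  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
+  # Temporary second target for the parallel-operation phase (assumed hostname)
+  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
+];
+```
+
+Promtail sends each log entry to every configured client, so both Loki instances receive the same streams.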
+
+### Phase 4: Add monitoring CNAME
+
+Add CNAME to monitoring02 once validated:
+
+```nix
+# hosts/monitoring02/configuration.nix
+homelab.dns.cnames = [ "monitoring" ];
+```
+
+This creates `monitoring.home.2rjus.net` pointing to monitoring02.
+
+### Phase 5: Update References
+
+Update hardcoded references to use the CNAME:
+
+1. **system/monitoring/logs.nix**:
+   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
+
+2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
+   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
+   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
+   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
+   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
+
+Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
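+
+As a sketch of one updated backend, assuming `proxy.nix` defines its sites through the nixpkgs
+`services.caddy.virtualHosts` option (the actual module layout may differ):
+
+```nix
+services.caddy.virtualHosts."grafana.home.2rjus.net".extraConfig = ''
+  # Backend now addressed by CNAME instead of the monitoring01 hostname
+  reverse_proxy http://monitoring.home.2rjus.net:3000
+'';
+```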
+
+### Phase 6: Enable Alerting
+
+Once ready to cut over:
+1. Enable Alertmanager receiver on monitoring02
+2. Verify test alerts route correctly
+
+### Phase 7: Cutover and Decommission
+
+1. **Stop monitoring01**: Prevent duplicate alerts during transition
+2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
+3. **Verify all targets scraped**: Check VictoriaMetrics UI
+4. **Verify logs flowing**: Check Loki on monitoring02
+5. **Decommission monitoring01**:
+   - Remove from flake.nix
+   - Remove host configuration
+   - Destroy VM in Proxmox
+   - Remove from terraform state
+
+## Open Questions
+
+- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
+- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
+
+## VictoriaMetrics Service Configuration
+
+Example NixOS configuration for monitoring02:
+
+```nix
+# VictoriaMetrics replaces Prometheus
+services.victoriametrics = {
+  enable = true;
+  retentionPeriod = "3";  # 3 months (bare number = months; the `m` suffix is rejected), increase based on disk usage
+  prometheusConfig = {
+    global.scrape_interval = "15s";
+    scrape_configs = [
+      # Auto-generated node-exporter targets
+      # Service-specific scrape targets
+      # External targets
+    ];
+  };
+};
+
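+# vmalert for alerting rules (no notifications during parallel operation)
+# Assumes the nixpkgs module's freeform `settings`, whose keys map to vmalert flags
+services.vmalert = {
+  enable = true;
+  settings = {
+    "datasource.url" = "http://localhost:8428";
+    "notifier.blackhole" = true;  # Evaluate rules but send nothing; remove after cutover
+    # "notifier.url" = [ "http://localhost:9093" ];  # Enable after cutover
+    rule = [ ./rules.yml ];
+  };
+};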
+```
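+
+For illustration, one entry in `scrape_configs` might look like the following. The job name and
+target hostnames are assumptions based on the architecture diagram; in practice the list is
+generated from `lib/monitoring.nix`.
+
+```nix
+{
+  job_name = "node-exporter";  # hypothetical job name
+  static_configs = [
+    {
+      targets = [
+        "ns1.home.2rjus.net:9100"  # assumed FQDNs
+        "ha1.home.2rjus.net:9100"
+      ];
+    }
+  ];
+}
+```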
+
+## Rollback Plan
+
+If issues arise after cutover:
+1. Move `monitoring` CNAME back to monitoring01
+2. Restart monitoring01 services
+3. Revert Promtail config to point only to monitoring01
+4. Revert http-proxy backends
+
+## Notes
+
+- VictoriaMetrics uses port 8428 vs Prometheus 9090
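+- PromQL compatibility is excellent: VictoriaMetrics' MetricsQL is designed to be backwards-compatible with PromQL, with a few documented differences worth spot-checking in alert rules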
+- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
+- monitoring02 is deployed via OpenTofu using the `create-host` script
+- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state