Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover.
Monitoring Stack Migration to VictoriaMetrics
Overview
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a monitoring CNAME for seamless transition.
Current State
monitoring01 (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)
Hardcoded References to monitoring01:
- system/monitoring/logs.nix - Promtail sends logs to http://monitoring01.home.2rjus.net:3100
- hosts/template2/bootstrap.nix - Bootstrap logs to Loki (keep as-is until decommission)
- services/http-proxy/proxy.nix - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
Auto-generated:
- Prometheus scrape targets (from lib/monitoring.nix + homelab.monitoring.scrapeTargets)
- Node-exporter targets (from all hosts with static IPs)
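The auto-generated targets come from an in-repo option; as a rough sketch (the exact option shape is defined in lib/monitoring.nix and may differ), a host might declare:

```nix
# Hypothetical sketch of a host exposing a scrape target via the
# homelab.monitoring.scrapeTargets option; the real attribute names
# live in lib/monitoring.nix and may differ from this.
{
  homelab.monitoring.scrapeTargets = [
    {
      job_name = "node";
      port = 9100; # node-exporter
    }
  ];
}
```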
Decision: VictoriaMetrics
Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
Architecture
┌─────────────────┐
│ monitoring02 │
│ VictoriaMetrics│
│ + Grafana │
monitoring │ + Loki │
CNAME ──────────│ + Tempo │
│ + Pyroscope │
│ + Alertmanager │
│ (vmalert) │
└─────────────────┘
▲
│ scrapes
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌─────┴────┐ ┌─────┴────┐
│ ns1 │ │ ha1 │ │ ... │
│ :9100 │ │ :9100 │ │ :9100 │
└─────────┘ └──────────┘ └──────────┘
Implementation Plan
Phase 1: Create monitoring02 Host
Use create-host script which handles flake.nix and terraform/vms.tf automatically.
- Run create-host: nix develop -c create-host monitoring02 10.69.13.24
- Update VM resources in terraform/vms.tf:
  - 4 cores (same as monitoring01)
  - 8GB RAM (double, for VictoriaMetrics headroom)
  - 100GB disk (for 3+ months retention with compression)
- Update host configuration: import monitoring services
- Create Vault AppRole: add to terraform/vault/approle.tf
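A minimal sketch of the host-configuration step, assuming the module layout introduced in Phase 2 (the exact paths are assumptions about the repo layout):

```nix
# hosts/monitoring02/configuration.nix (sketch)
{ ... }:
{
  imports = [
    # New module created in Phase 2; other monitoring service imports
    # would follow the same pattern.
    ../../services/monitoring/victoriametrics
  ];

  networking.hostName = "monitoring02";
}
```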
Phase 2: Set Up VictoriaMetrics Stack
Create new service module at services/monitoring/victoriametrics/ for testing alongside existing
Prometheus config. Once validated, this can replace the Prometheus module.
- VictoriaMetrics (port 8428):
  - services.victoriametrics.enable = true
  - services.victoriametrics.retentionPeriod = "3m" (3 months; increase later based on disk usage)
  - Migrate scrape configs via prometheusConfig
  - Use native push support (replaces Pushgateway)
- vmalert for alerting rules:
  - services.vmalert.enable = true
  - Point to VictoriaMetrics for metrics evaluation
  - Keep rules in a separate rules.yml file (same format as Prometheus)
  - No receiver configured during parallel operation (prevents duplicate alerts)
- Alertmanager (port 9093):
  - Keep existing configuration (alerttonotify webhook routing)
  - Only enable receiver after cutover from monitoring01
- Loki (port 3100):
  - Same configuration as current
- Grafana (port 3000):
  - Define dashboards declaratively via NixOS options (not imported from monitoring01)
  - Reference existing dashboards on monitoring01 for content inspiration
  - Configure VictoriaMetrics datasource (port 8428)
  - Configure Loki datasource
- Tempo (ports 3200, 3201):
  - Same configuration
- Pyroscope (port 4040):
  - Same Docker-based deployment
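For the declarative Grafana setup, a sketch using the stock NixOS provisioning options (datasource names and the ./dashboards path are placeholders to be adapted):

```nix
# Sketch: declarative Grafana datasources and dashboards via NixOS
# provisioning. VictoriaMetrics is registered as a Prometheus-type
# datasource because it serves the Prometheus query API.
services.grafana = {
  enable = true;
  provision = {
    enable = true;
    datasources.settings.datasources = [
      {
        name = "VictoriaMetrics";
        type = "prometheus";
        url = "http://localhost:8428";
        isDefault = true;
      }
      {
        name = "Loki";
        type = "loki";
        url = "http://localhost:3100";
      }
    ];
    dashboards.settings.providers = [
      {
        name = "homelab";
        options.path = ./dashboards; # JSON dashboard files kept in the repo
      }
    ];
  };
};
```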
Phase 3: Parallel Operation
Run both monitoring01 and monitoring02 simultaneously:
- Dual scraping: both hosts scrape the same targets
  - Validates VictoriaMetrics is collecting data correctly
- Dual log shipping: configure Promtail to send logs to both Loki instances
  - Add a second client in system/monitoring/logs.nix pointing to monitoring02
- Validate dashboards: access Grafana on monitoring02, verify dashboards work
- Validate alerts: verify vmalert evaluates rules correctly (no receiver = no notifications)
- Compare resource usage: monitor disk/memory consumption between hosts
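The dual log shipping step amounts to a second entry in Promtail's clients list; a sketch assuming the stock services.promtail NixOS module:

```nix
# system/monitoring/logs.nix (sketch): ship logs to both Loki instances
# during parallel operation; drop the monitoring01 client after cutover.
services.promtail.configuration.clients = [
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
];
```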
Phase 4: Add monitoring CNAME
Add CNAME to monitoring02 once validated:
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
This creates monitoring.home.2rjus.net pointing to monitoring02.
Phase 5: Update References
Update hardcoded references to use the CNAME:
- system/monitoring/logs.nix: remove dual-shipping, point only to http://monitoring.home.2rjus.net:3100
- services/http-proxy/proxy.nix: update reverse proxy backends:
- prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
- alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
- grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
Note: hosts/template2/bootstrap.nix stays pointed at monitoring01 until decommission.
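Assuming the proxy is plain Caddy virtual hosts (the real proxy.nix structure may differ), the backend updates might look like:

```nix
# services/http-proxy/proxy.nix (sketch): point backends at the CNAME.
services.caddy.virtualHosts = {
  "prometheus.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:8428
  '';
  "grafana.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:3000
  '';
  # alertmanager (:9093) and pyroscope (:4040) follow the same pattern
};
```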
Phase 6: Enable Alerting
Once ready to cut over:
- Enable Alertmanager receiver on monitoring02
- Verify test alerts route correctly
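Using the option names from the vmalert configuration example in this plan, enabling the receiver is a one-line change (sketch):

```nix
# Add the notifier so vmalert fires evaluated alerts into Alertmanager
# (this is the line kept commented out during parallel operation).
services.vmalert.notifier.alertmanager.url = "http://localhost:9093";
```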
Phase 7: Cutover and Decommission
- Stop monitoring01: prevent duplicate alerts during transition
- Update bootstrap.nix: point to monitoring.home.2rjus.net
- Verify all targets scraped: check VictoriaMetrics UI
- Verify logs flowing: Check Loki on monitoring02
- Decommission monitoring01:
- Remove from flake.nix
- Remove host configuration
- Destroy VM in Proxmox
- Remove from terraform state
Open Questions
- What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
VictoriaMetrics Service Configuration
Example NixOS configuration for monitoring02:
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3m"; # 3 months, increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
  rule = [ ./rules.yml ];
};
Rollback Plan
If issues arise after cutover:
- Move the monitoring CNAME back to monitoring01
- Restart monitoring01 services
- Revert Promtail config to point only to monitoring01
- Revert http-proxy backends
Notes
- VictoriaMetrics listens on port 8428 vs Prometheus's 9090
- VictoriaMetrics' MetricsQL is backwards-compatible with PromQL, so existing queries and dashboards should work unchanged
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the create-host script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state