# Monitoring Stack Migration to VictoriaMetrics

## Overview
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a monitoring CNAME for seamless transition.
## Current State
monitoring01 (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)
### Hardcoded References to monitoring01

- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
### Auto-generated

- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)
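As a rough illustration, a host opts into scraping by declaring a target that the library expands into a standard Prometheus scrape config. The option shape below is a hypothetical sketch; the real definition lives in `lib/monitoring.nix` and may differ:

```nix
# Hypothetical sketch only -- the actual option shape is defined in
# lib/monitoring.nix. A host might declare something like:
{
  homelab.monitoring.scrapeTargets = [
    {
      job_name = "ns1-node";
      targets = [ "ns1.home.2rjus.net:9100" ]; # node-exporter port
    }
  ];
}
```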
## Decision: VictoriaMetrics
Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
## Architecture

```
                 ┌─────────────────┐
                 │  monitoring02   │
                 │ VictoriaMetrics │
                 │  + Grafana      │
  monitoring     │  + Loki         │
  CNAME ─────────│  + Tempo        │
                 │  + Pyroscope    │
                 │  + Alertmanager │
                 │    (vmalert)    │
                 └─────────────────┘
                         ▲
                         │ scrapes
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
    │   ns1   │    │   ha1    │    │   ...    │
    │  :9100  │    │  :9100   │    │  :9100   │
    └─────────┘    └──────────┘    └──────────┘
```
## Implementation Plan

### Phase 1: Create monitoring02 Host [COMPLETE]
Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
### Phase 2: Set Up VictoriaMetrics Stack
New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.
- **VictoriaMetrics (port 8428):** [DONE]
  - `services.victoriametrics.enable = true`
  - `retentionPeriod = "3"` (3 months)
  - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
  - Static user override (DynamicUser disabled) for credential file access
  - OpenBao token fetch service + 30min refresh timer
  - Apiary bearer token via vault.secrets
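A condensed sketch of the VictoriaMetrics side of this module. The `services.victoriametrics` options follow the upstream NixOS module; the static-user override shown here is illustrative of the approach, not a copy of the actual file:

```nix
# Sketch, assuming the upstream services.victoriametrics NixOS module.
{ lib, ... }:
{
  services.victoriametrics = {
    enable = true;
    retentionPeriod = "3"; # months; -retentionPeriod=3 on the CLI
  };

  # The upstream unit runs with DynamicUser; pin a static user so
  # credential files (OpenBao token, Apiary bearer token) stay readable.
  systemd.services.victoriametrics.serviceConfig = {
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };
}
```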
- **vmalert for alerting rules:** [DONE]
  - Points to VictoriaMetrics datasource at localhost:8428
  - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
  - No notifier configured during parallel operation (prevents duplicate alerts)
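The vmalert wiring might look roughly like the following (option names assume the nixpkgs `services.vmalert` module; the relative rules path is illustrative):

```nix
# Sketch of vmalert reusing the existing Prometheus rules file.
{
  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # point straight at the existing YAML -- no YAML-to-Nix conversion
      rule = [ "${./../monitoring/rules.yml}" ];
      # deliberately no "notifier.url" during parallel operation:
      # rules are evaluated but never delivered, so no duplicate alerts
    };
  };
}
```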
- **Alertmanager (port 9093):** [DONE]
  - Same configuration as monitoring01 (alerttonotify webhook routing)
  - Will only receive alerts after cutover (vmalert notifier disabled)
- **Grafana (port 3000):** [DONE]
  - VictoriaMetrics datasource (localhost:8428) as default
  - monitoring01 Prometheus datasource kept for comparison during parallel operation
  - Loki datasource pointing to monitoring01 (until Loki is migrated)
- **Loki (port 3100):**
  - TODO: Same configuration as current
- **Tempo (ports 3200, 3201):**
  - TODO: Same configuration
- **Pyroscope (port 4040):**
  - TODO: Same Docker-based deployment
Note: pve-exporter and pushgateway scrape targets are not included on monitoring02. pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics native push support.
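For jobs that previously pushed to Pushgateway, VictoriaMetrics accepts Prometheus-format samples directly on its `/api/v1/import/prometheus` endpoint. A hypothetical replacement (service name and metric are made up for illustration):

```nix
# Hypothetical: a oneshot job pushes a completion timestamp straight into
# VictoriaMetrics instead of going through Pushgateway.
{ pkgs, ... }:
{
  systemd.services.backup-report-metrics = {
    serviceConfig.Type = "oneshot";
    script = ''
      printf 'backup_last_success_timestamp_seconds %s\n' "$(date +%s)" \
        | ${pkgs.curl}/bin/curl --fail --data-binary @- \
            http://monitoring02.home.2rjus.net:8428/api/v1/import/prometheus
    '';
  };
}
```

Unlike Pushgateway, imported samples are written once with the current timestamp rather than re-exposed on every scrape, which suits one-shot batch jobs.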
### Phase 3: Parallel Operation
Run both monitoring01 and monitoring02 simultaneously:
- **Dual scraping:** Both hosts scrape the same targets
  - Validates VictoriaMetrics is collecting data correctly
- **Dual log shipping:** Configure Promtail to send logs to both Loki instances
  - Add a second client in `system/monitoring/logs.nix` pointing to monitoring02
- **Validate dashboards:** Access Grafana on monitoring02, verify dashboards work
- **Validate alerts:** Verify vmalert evaluates rules correctly (no receiver = no notifications)
- **Compare resource usage:** Monitor disk/memory consumption between hosts
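The dual log shipping step is a second entry in Promtail's `clients` list. A sketch, assuming `system/monitoring/logs.nix` uses the standard `services.promtail.configuration` option:

```nix
# Sketch: ship every log line to both Loki instances during parallel
# operation. The monitoring01 URL matches the current config; the
# monitoring02 entry is the addition.
{
  services.promtail.configuration.clients = [
    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```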
### Phase 4: Add monitoring CNAME
Add CNAME to monitoring02 once validated:
```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```
This creates monitoring.home.2rjus.net pointing to monitoring02.
### Phase 5: Update References
Update hardcoded references to use the CNAME:
- `system/monitoring/logs.nix`:
  - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
- `services/http-proxy/proxy.nix`: Update reverse proxy backends:
  - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
  - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
  - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
  - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
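Assuming `services/http-proxy/proxy.nix` uses the standard Caddy module, each backend update is a one-line change per vhost, roughly:

```nix
# Sketch (one vhost shown): point the existing reverse proxy at the
# monitoring CNAME instead of monitoring01 directly.
{
  services.caddy.virtualHosts."prometheus.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:8428
  '';
}
```

Using the CNAME here means a future host swap only requires moving the DNS record, not touching the proxy config again.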
Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
### Phase 6: Enable Alerting
Once ready to cut over:
- Enable Alertmanager receiver on monitoring02
- Verify test alerts route correctly
### Phase 7: Cutover and Decommission

- **Stop monitoring01:** Prevent duplicate alerts during transition
- **Update bootstrap.nix:** Point to `monitoring.home.2rjus.net`
- **Verify all targets scraped:** Check VictoriaMetrics UI
- **Verify logs flowing:** Check Loki on monitoring02
- **Decommission monitoring01:**
  - Remove from flake.nix
  - Remove host configuration
  - Destroy VM in Proxmox
  - Remove from terraform state
## Current Progress
- Phase 1 complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
- Phase 2 in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Grafana datasources configured
- Remaining: Loki, Tempo, Pyroscope migration
## Open Questions
- What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
- Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
- Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so it has less Nix-native ergonomics. The migration could be bundled with monitoring02 to consolidate disruption.
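If the Alloy switch happens, the NixOS side is small; most of the effort is translating the Promtail YAML into River. A minimal sketch, assuming the nixpkgs `services.alloy` module and a hypothetical config file path:

```nix
# Sketch: enable Alloy and point it at a River config that would replace
# the Promtail scrape + push pipeline. The file path is hypothetical.
{
  services.alloy = {
    enable = true;
    configPath = ./alloy/config.alloy;
  };
}
```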
## VictoriaMetrics Service Configuration
Implemented in `services/victoriametrics/default.nix`. Key design decisions:

- **Static user:** the VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static `victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules:** vmalert reuses `services/monitoring/rules.yml` via a `settings.rule` path reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse:** uses the same `lib/monitoring.nix` functions and `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
## Rollback Plan
If issues arise after cutover:
- Move `monitoring` CNAME back to monitoring01
- Restart monitoring01 services
- Revert Promtail config to point only to monitoring01
- Revert http-proxy backends
## Notes
- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state