nixos-servers/docs/plans/monitoring-migration-victoriametrics.md
Torjus Håkestad e329f87b0b monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00


Monitoring Stack Migration to VictoriaMetrics

Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run monitoring02 in parallel with monitoring01 until it is validated, then switch over using a monitoring CNAME for a seamless transition.

Current State

monitoring01 (10.69.13.13):

  • 4 CPU cores, 4GB RAM, 33GB disk
  • Prometheus with 30-day retention (15s scrape interval)
  • Alertmanager (routes to alerttonotify webhook)
  • Grafana (dashboards, datasources)
  • Loki (log aggregation from all hosts via Promtail)
  • Tempo (distributed tracing)
  • Pyroscope (continuous profiling)

Hardcoded References to monitoring01:

  • system/monitoring/logs.nix - Promtail sends logs to http://monitoring01.home.2rjus.net:3100
  • hosts/template2/bootstrap.nix - Bootstrap logs to Loki (keep as-is until decommission)
  • services/http-proxy/proxy.nix - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

Auto-generated:

  • Prometheus scrape targets (from lib/monitoring.nix + homelab.monitoring.scrapeTargets)
  • Node-exporter targets (from all hosts with static IPs)

Decision: VictoriaMetrics

Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:

  • Single binary replacement for Prometheus
  • 5-10x better compression (the space that holds 30 days today could hold roughly 150-300 days, so 180+ days is plausible)
  • Same PromQL query language (Grafana dashboards work unchanged)
  • Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

Architecture

                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Tempo        │
                     │  + Pyroscope    │
                     │  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘

Implementation Plan

Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:

  • 4 CPU cores, 8GB RAM, 60GB disk
  • Vault integration enabled
  • NATS-based remote deployment enabled
  • Grafana with Kanidm OIDC deployed as test instance (grafana-test.home.2rjus.net)

Phase 2: Set Up VictoriaMetrics Stack

New service module at services/victoriametrics/ for VictoriaMetrics + vmalert + Alertmanager. Imported by monitoring02 alongside the existing Grafana service.

  1. VictoriaMetrics (port 8428): [DONE]

    • services.victoriametrics.enable = true
    • retentionPeriod = "3" (3 months)
    • All scrape configs migrated from Prometheus (22 jobs including auto-generated)
    • Static user override (DynamicUser disabled) for credential file access
    • OpenBao token fetch service + 30min refresh timer
    • Apiary bearer token via vault.secrets
  2. vmalert for alerting rules: [DONE]

    • Points to VictoriaMetrics datasource at localhost:8428
    • Reuses existing services/monitoring/rules.yml directly via settings.rule
    • No notifier configured during parallel operation (prevents duplicate alerts)
  3. Alertmanager (port 9093): [DONE]

    • Same configuration as monitoring01 (alerttonotify webhook routing)
    • Will only receive alerts after cutover (vmalert notifier disabled)
  4. Grafana (port 3000): [DONE]

    • VictoriaMetrics datasource (localhost:8428) as default
    • monitoring01 Prometheus datasource kept for comparison during parallel operation
    • Loki datasource pointing to monitoring01 (until Loki migrated)
  5. Loki (port 3100):

    • TODO: Same configuration as current
  6. Tempo (ports 3200, 3201):

    • TODO: Same configuration
  7. Pyroscope (port 4040):

    • TODO: Same Docker-based deployment
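
The completed steps 1 to 3 could look roughly like the sketch below. It is a minimal illustration, not the contents of services/victoriametrics/default.nix: the scrape job, the target hostname, and the webhook URL are hypothetical placeholders, and option names follow the nixpkgs modules as commonly documented, so they may differ between nixpkgs releases.

```nix
# Sketch only. Assumes it lives in services/victoriametrics/, so that
# ../monitoring/rules.yml resolves to services/monitoring/rules.yml.
{ ... }:
{
  services.victoriametrics = {
    enable = true;
    # A bare number is interpreted by VictoriaMetrics as months.
    retentionPeriod = "3";
    # Scrape jobs use the Prometheus scrape_config format.
    prometheusConfig.scrape_configs = [
      {
        job_name = "node-exporter"; # hypothetical job name
        static_configs = [
          { targets = [ "ns1.home.2rjus.net:9100" ]; } # hypothetical target
        ];
      }
    ];
  };

  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # Reuse the existing rule file as-is.
      rule = [ ../monitoring/rules.yml ];
      # No "notifier.url" during parallel operation.
      # Assumption: if vmalert refuses to start without a notifier, its
      # -notifier.blackhole flag is one way to discard notifications.
      # "notifier.blackhole" = true;
    };
  };

  services.prometheus.alertmanager = {
    enable = true;
    port = 9093;
    configuration = {
      route.receiver = "alerttonotify";
      receivers = [
        {
          name = "alerttonotify";
          # Hypothetical webhook URL; the real one comes from monitoring01.
          webhook_configs = [ { url = "http://localhost:5001/alert"; } ];
        }
      ];
    };
  };
}
```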

Note: pve-exporter and pushgateway scrape targets are not included on monitoring02. pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics native push support.
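
The Grafana datasource layout from step 4 could be provisioned along these lines. This is a sketch under assumptions: the datasource names are made up for illustration, and VictoriaMetrics is registered with the stock prometheus datasource type because it serves a Prometheus-compatible query API.

```nix
# Sketch only; datasource names are hypothetical.
{ ... }:
{
  services.grafana.provision.datasources.settings.datasources = [
    {
      name = "VictoriaMetrics";
      type = "prometheus"; # queried through the Prometheus-compatible API
      url = "http://localhost:8428";
      isDefault = true;
    }
    {
      # Kept only for side-by-side comparison during parallel operation.
      name = "Prometheus (monitoring01)";
      type = "prometheus";
      url = "http://monitoring01.home.2rjus.net:9090";
    }
    {
      # Still served by monitoring01 until Loki is migrated.
      name = "Loki";
      type = "loki";
      url = "http://monitoring01.home.2rjus.net:3100";
    }
  ];
}
```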

Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

  1. Dual scraping: Both hosts scrape the same targets

    • Validates VictoriaMetrics is collecting data correctly
  2. Dual log shipping: Configure Promtail to send logs to both Loki instances

    • Add second client in system/monitoring/logs.nix pointing to monitoring02
  3. Validate dashboards: Access Grafana on monitoring02, verify dashboards work

  4. Validate alerts: Verify vmalert evaluates rules correctly (no notifier configured, so no notifications are sent)

  5. Compare resource usage: Monitor disk/memory consumption between hosts
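
Dual log shipping (step 2) might be expressed as a second Promtail client. The sketch assumes system/monitoring/logs.nix configures Promtail through services.promtail.configuration, that monitoring02 is reachable as monitoring02.home.2rjus.net, and that both Loki instances accept pushes on the standard /loki/api/v1/push path; none of these are confirmed by the plan.

```nix
# Sketch only; the monitoring02 hostname is assumed by analogy with monitoring01.
{ ... }:
{
  services.promtail.configuration.clients = [
    # Existing destination.
    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
    # Added for the parallel-operation phase.
    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```

Promtail sends each log entry to every configured client, so monitoring01 keeps receiving logs while monitoring02 is being validated.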

Phase 4: Add monitoring CNAME

Add CNAME to monitoring02 once validated:

# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];

This creates monitoring.home.2rjus.net pointing to monitoring02.

Phase 5: Update References

Update hardcoded references to use the CNAME:

  1. system/monitoring/logs.nix:

    • Remove dual-shipping, point only to http://monitoring.home.2rjus.net:3100
  2. services/http-proxy/proxy.nix: Update reverse proxy backends:

    • prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
    • alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
    • grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
    • pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: hosts/template2/bootstrap.nix stays pointed at monitoring01 until decommission.
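
As an illustration of the proxy changes, the backends listed above could be written as Caddy virtual hosts. This assumes services/http-proxy/proxy.nix uses the stock services.caddy module with one virtual host per service; the actual file may be structured differently.

```nix
# Sketch only; mirrors the backend list above using the monitoring CNAME.
{ ... }:
{
  services.caddy.virtualHosts = {
    "prometheus.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:8428
    '';
    "alertmanager.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:9093
    '';
    "grafana.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:3000
    '';
    "pyroscope.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:4040
    '';
  };
}
```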

Phase 6: Enable Alerting

Once ready to cut over:

  1. Enable the vmalert notifier on monitoring02 so alerts reach Alertmanager and its alerttonotify receiver
  2. Verify test alerts route correctly
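
Since the vmalert notifier is left unconfigured during parallel operation, enabling alerting could amount to pointing vmalert at the local Alertmanager. A minimal sketch, assuming the services.vmalert.settings layout used for the datasource and rule options:

```nix
# Sketch only; added at cutover, not before.
{ ... }:
{
  # Alertmanager listens on port 9093 on the same host.
  services.vmalert.settings."notifier.url" = [ "http://localhost:9093" ];
}
```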

Phase 7: Cutover and Decommission

  1. Stop monitoring01: Prevent duplicate alerts during transition
  2. Update bootstrap.nix: Point to monitoring.home.2rjus.net
  3. Verify all targets scraped: Check VictoriaMetrics UI
  4. Verify logs flowing: Check Loki on monitoring02
  5. Decommission monitoring01:
    • Remove from flake.nix
    • Remove host configuration
    • Destroy VM in Proxmox
    • Remove from terraform state

Current Progress

  • Phase 1 complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
  • Phase 2 in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Grafana datasources configured
    • Remaining: Loki, Tempo, Pyroscope migration

Open Questions

  • What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
  • Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
  • Consider replacing Promtail with Grafana Alloy (services.alloy, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as its successor. Alloy is a unified collector (logs, metrics, traces, profiles), but it uses its own "River" configuration syntax instead of YAML, so its config is harder to generate from Nix. The switch could be bundled with the monitoring02 migration to consolidate disruption.

VictoriaMetrics Service Configuration

Implemented in services/victoriametrics/default.nix. Key design decisions:

  • Static user: The VictoriaMetrics NixOS module uses DynamicUser by default. This is overridden with a static victoriametrics user so that vault.secrets and credential files work correctly
  • Shared rules: vmalert reuses services/monitoring/rules.yml via settings.rule path reference (no YAML-to-Nix conversion needed)
  • Scrape config reuse: Uses the same lib/monitoring.nix functions and services/monitoring/external-targets.nix as Prometheus for auto-generated targets
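
The static user override described above could be sketched as follows. This is an assumption about the shape of the override, not a copy of the implemented module: it creates a system user and group and forces the systemd unit to run as that user instead of a dynamic one.

```nix
# Sketch only; shows one way to replace DynamicUser with a fixed account.
{ lib, ... }:
{
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };

  systemd.services.victoriametrics.serviceConfig = {
    # mkForce is needed because the upstream module sets DynamicUser itself.
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };
}
```

A fixed account gives secret files a stable owner to be assigned to, which a dynamically allocated user cannot provide before the service starts.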

Rollback Plan

If issues arise after cutover:

  1. Move monitoring CNAME back to monitoring01
  2. Restart monitoring01 services
  3. Revert Promtail config to point only to monitoring01
  4. Revert http-proxy backends

Notes

  • VictoriaMetrics uses port 8428 vs Prometheus 9090
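  • VictoriaMetrics queries run through MetricsQL, which is backwards-compatible with PromQL for most queries, so existing dashboards and rules are expected to work unchanged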
  • VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
  • monitoring02 deployed via OpenTofu using create-host script
  • Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state