nixos-servers/docs/plans/monitoring-migration-victoriametrics.md
Torjus Håkestad · 2026-02-08
docs: add monitoring migration to VictoriaMetrics plan
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Monitoring Stack Migration to VictoriaMetrics

Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run in parallel with monitoring01 until validated, then switch over using a monitoring CNAME for seamless transition.

Current State

monitoring01 (10.69.13.13):

  • 4 CPU cores, 4GB RAM, 33GB disk
  • Prometheus with 30-day retention (15s scrape interval)
  • Alertmanager (routes to alerttonotify webhook)
  • Grafana (dashboards, datasources)
  • Loki (log aggregation from all hosts via Promtail)
  • Tempo (distributed tracing)
  • Pyroscope (continuous profiling)

Hardcoded References to monitoring01:

  • system/monitoring/logs.nix - Promtail sends logs to http://monitoring01.home.2rjus.net:3100
  • hosts/template2/bootstrap.nix - Bootstrap logs to Loki (keep as-is until decommission)
  • services/http-proxy/proxy.nix - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

Auto-generated:

  • Prometheus scrape targets (from lib/monitoring.nix + homelab.monitoring.scrapeTargets)
  • Node-exporter targets (from all hosts with static IPs)
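For orientation, a declaration under the custom homelab.monitoring.scrapeTargets option might look like the sketch below. This is hypothetical: the real schema is defined in lib/monitoring.nix, and the field names here (job_name, targets) are assumptions for illustration only.

```nix
# Hypothetical sketch: a host module declares a scrape target, and
# lib/monitoring.nix expands all such declarations into scrape_configs.
# Field names are assumed; see lib/monitoring.nix for the real schema.
homelab.monitoring.scrapeTargets = [
  {
    job_name = "example-service";
    targets = [ "somehost.home.2rjus.net:9100" ];
  }
];
```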

Decision: VictoriaMetrics

Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:

  • Single binary replacement for Prometheus
  • 5-10x better compression (30 days could become 180+ days in same space)
  • Same PromQL query language (Grafana dashboards work unchanged)
  • Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

Architecture

                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Tempo        │
                     │  + Pyroscope    │
                     │  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘

Implementation Plan

Phase 1: Create monitoring02 Host

Use the create-host script, which updates flake.nix and terraform/vms.tf automatically.

  1. Run create-host: nix develop -c create-host monitoring02 10.69.13.24
  2. Update VM resources in terraform/vms.tf:
    • 4 cores (same as monitoring01)
    • 8GB RAM (double, for VictoriaMetrics headroom)
    • 100GB disk (for 3+ months retention with compression)
  3. Update host configuration: Import monitoring services
  4. Create Vault AppRole: Add to terraform/vault/approle.tf
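After create-host runs, the host configuration needs the monitoring services imported. A minimal sketch, with the import path assumed to follow this repo's layout:

```nix
# hosts/monitoring02/configuration.nix (sketch; import path assumed)
{ ... }:
{
  imports = [
    ../../services/monitoring/victoriametrics  # new module from Phase 2
  ];

  networking.hostName = "monitoring02";
}
```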

Phase 2: Set Up VictoriaMetrics Stack

Create new service module at services/monitoring/victoriametrics/ for testing alongside existing Prometheus config. Once validated, this can replace the Prometheus module.

  1. VictoriaMetrics (port 8428):

    • services.victoriametrics.enable = true
    • services.victoriametrics.retentionPeriod = "3" (VictoriaMetrics counts suffix-less retention values in months, so this is 3 months; increase later based on disk usage)
    • Migrate scrape configs via prometheusConfig
    • Use native push support (replaces Pushgateway)
  2. vmalert for alerting rules:

    • services.vmalert.enable = true
    • Point to VictoriaMetrics for metrics evaluation
    • Keep rules in separate rules.yml file (same format as Prometheus)
    • No receiver configured during parallel operation (prevents duplicate alerts)
  3. Alertmanager (port 9093):

    • Keep existing configuration (alerttonotify webhook routing)
    • Only enable receiver after cutover from monitoring01
  4. Loki (port 3100):

    • Same configuration as current
  5. Grafana (port 3000):

    • Define dashboards declaratively via NixOS options (not imported from monitoring01)
    • Reference existing dashboards on monitoring01 for content inspiration
    • Configure VictoriaMetrics datasource (port 8428)
    • Configure Loki datasource
  6. Tempo (ports 3200, 3201):

    • Same configuration
  7. Pyroscope (port 4040):

    • Same Docker-based deployment
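The declarative Grafana setup in step 5 can be sketched with the nixpkgs provisioning options. This is a sketch under assumptions: the datasource names and the dashboards directory are placeholders, and the actual dashboards still need to be authored as JSON.

```nix
services.grafana = {
  enable = true;
  settings.server.http_port = 3000;
  provision = {
    enable = true;
    datasources.settings.datasources = [
      {
        name = "VictoriaMetrics";
        type = "prometheus";            # VictoriaMetrics serves the Prometheus query API
        url = "http://localhost:8428";
        isDefault = true;
      }
      {
        name = "Loki";
        type = "loki";
        url = "http://localhost:3100";
      }
    ];
    # Dashboards as JSON files checked into the repo
    dashboards.settings.providers = [
      {
        name = "declarative";
        options.path = ./dashboards;    # assumed directory of dashboard JSON
      }
    ];
  };
};
```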

Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

  1. Dual scraping: Both hosts scrape the same targets

    • Validates VictoriaMetrics is collecting data correctly
  2. Dual log shipping: Configure Promtail to send logs to both Loki instances

    • Add second client in system/monitoring/logs.nix pointing to monitoring02
  3. Validate dashboards: Access Grafana on monitoring02, verify dashboards work

  4. Validate alerts: Verify vmalert evaluates rules correctly (no receiver = no notifications)

  5. Compare resource usage: Monitor disk/memory consumption between hosts
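Dual log shipping in step 2 amounts to a second entry in Promtail's clients list. A sketch, assuming the repo manages Promtail through the standard services.promtail.configuration option:

```nix
# system/monitoring/logs.nix (sketch of dual shipping)
services.promtail.configuration.clients = [
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }  # added for parallel operation
];
```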

Phase 4: Add monitoring CNAME

Once monitoring02 is validated, add a CNAME pointing to it:

# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];

This creates monitoring.home.2rjus.net pointing to monitoring02.

Phase 5: Update References

Update hardcoded references to use the CNAME:

  1. system/monitoring/logs.nix:

    • Remove dual-shipping, point only to http://monitoring.home.2rjus.net:3100
  2. services/http-proxy/proxy.nix: Update reverse proxy backends:

    • prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
    • alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
    • grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
    • pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: hosts/template2/bootstrap.nix stays pointed at monitoring01 until decommission.
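Assuming proxy.nix uses the standard Caddy virtualHosts options, the backend updates look roughly like the following (the actual structure of proxy.nix may differ):

```nix
# services/http-proxy/proxy.nix (sketch; actual structure may differ)
services.caddy.virtualHosts = {
  "grafana.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:3000
  '';
  "prometheus.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:8428
  '';
};
```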

Phase 6: Enable Alerting

Once ready to cut over:

  1. Enable Alertmanager receiver on monitoring02
  2. Verify test alerts route correctly

Phase 7: Cutover and Decommission

  1. Stop monitoring01: Prevent duplicate alerts during transition
  2. Update bootstrap.nix: Point to monitoring.home.2rjus.net
  3. Verify all targets scraped: Check VictoriaMetrics UI
  4. Verify logs flowing: Check Loki on monitoring02
  5. Decommission monitoring01:
    • Remove from flake.nix
    • Remove host configuration
    • Destroy VM in Proxmox
    • Remove from terraform state

Open Questions

  • What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
  • Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)

VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3";  # suffix-less values are counted in months, so 3 months; increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093";  # Enable after cutover
  rule = [ ./rules.yml ];
};

Rollback Plan

If issues arise after cutover:

  1. Move monitoring CNAME back to monitoring01
  2. Restart monitoring01 services
  3. Revert Promtail config to point only to monitoring01
  4. Revert http-proxy backends
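Step 1 is the inverse of Phase 4, assuming the same homelab.dns.cnames option:

```nix
# hosts/monitoring01/configuration.nix (rollback sketch)
# Re-add the CNAME here and remove it from hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```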

Notes

  • VictoriaMetrics uses port 8428 vs Prometheus 9090
  • PromQL compatibility is high; VictoriaMetrics actually implements MetricsQL, a backwards-compatible superset, so existing queries and dashboards should work unchanged
  • VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
  • monitoring02 deployed via OpenTofu using create-host script
  • Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state