
Monitoring Stack Migration to VictoriaMetrics

Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run in parallel with monitoring01 until validated, then switch over using a monitoring CNAME for seamless transition.

Current State

monitoring01 (10.69.13.13):

  • 4 CPU cores, 4GB RAM, 33GB disk
  • Prometheus with 30-day retention (15s scrape interval)
  • Alertmanager (routes to alerttonotify webhook)
  • Grafana (dashboards, datasources)
  • Loki (log aggregation from all hosts via Promtail)
  • Tempo (distributed tracing)
  • Pyroscope (continuous profiling)

Hardcoded References to monitoring01:

  • system/monitoring/logs.nix - Promtail sends logs to http://monitoring01.home.2rjus.net:3100
  • hosts/template2/bootstrap.nix - Bootstrap logs to Loki (keep as-is until decommission)
  • services/http-proxy/proxy.nix - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

Auto-generated:

  • Prometheus scrape targets (from lib/monitoring.nix + homelab.monitoring.scrapeTargets)
  • Node-exporter targets (from all hosts with static IPs)
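
The auto-generation above could be expressed roughly like this (a sketch only; the actual shape of lib/monitoring.nix may differ, and the `homelab.hosts` attrset with a `staticIp` field is an assumption):

```nix
# Sketch: build scrape targets from per-host options (names are illustrative)
{ lib, config, ... }:
let
  # Every host with a static IP gets a node-exporter target on :9100
  nodeTargets = lib.mapAttrsToList
    (name: _host: "${name}.home.2rjus.net:9100")
    (lib.filterAttrs (_: host: (host.staticIp or null) != null) config.homelab.hosts);
in
{
  scrape_configs = [
    { job_name = "node"; static_configs = [ { targets = nodeTargets; } ]; }
  ] ++ config.homelab.monitoring.scrapeTargets;
}
```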

Decision: VictoriaMetrics

Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:

  • Single binary replacement for Prometheus
  • 5-10x better compression (30 days of data could become 180+ days in the same space)
  • Same PromQL query language (Grafana dashboards work unchanged)
  • Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

Architecture

                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Tempo        │
                     │  + Pyroscope    │
                     │  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘

Implementation Plan

Phase 1: Create monitoring02 Host

Use the create-host script, which updates flake.nix and terraform/vms.tf automatically.

  1. Run create-host: nix develop -c create-host monitoring02 10.69.13.24
  2. Update VM resources in terraform/vms.tf:
    • 4 cores (same as monitoring01)
    • 8GB RAM (double, for VictoriaMetrics headroom)
    • 100GB disk (for 3+ months retention with compression)
  3. Update host configuration: Import monitoring services
  4. Create Vault AppRole: Add to terraform/vault/approle.tf
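
Step 3 ("Import monitoring services") might look like the following in the new host's configuration (a sketch; the exact module paths are assumptions):

```nix
# hosts/monitoring02/configuration.nix (sketch)
{
  imports = [
    ../../services/monitoring/victoriametrics   # new module from Phase 2
    ../../services/grafana
    ../../services/monitoring/loki.nix          # path is an assumption
    ../../services/monitoring/tempo.nix         # path is an assumption
  ];
}
```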

Phase 2: Set Up VictoriaMetrics Stack

Create new service module at services/monitoring/victoriametrics/ for testing alongside existing Prometheus config. Once validated, this can replace the Prometheus module.

  1. VictoriaMetrics (port 8428):

    • services.victoriametrics.enable = true
    • services.victoriametrics.retentionPeriod = "3" (a bare number means months in VictoriaMetrics; there is no "m" month suffix. Increase later based on disk usage)
    • Migrate scrape configs via prometheusConfig
    • Use native push support (replaces Pushgateway)
  2. vmalert for alerting rules:

    • services.vmalert.enable = true
    • Point to VictoriaMetrics for metrics evaluation
    • Keep rules in separate rules.yml file (same format as Prometheus)
    • No receiver configured during parallel operation (prevents duplicate alerts)
  3. Alertmanager (port 9093):

    • Keep existing configuration (alerttonotify webhook routing)
    • Only enable receiver after cutover from monitoring01
  4. Loki (port 3100):

    • Same configuration as current
  5. Grafana (port 3000):

    • Define dashboards declaratively via NixOS options (not imported from monitoring01)
    • Reference existing dashboards on monitoring01 for content inspiration
    • Configure VictoriaMetrics datasource (port 8428)
    • Configure Loki datasource
  6. Tempo (ports 3200, 3201):

    • Same configuration
  7. Pyroscope (port 4040):

    • Same Docker-based deployment

Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

  1. Dual scraping: Both hosts scrape the same targets

    • Validates VictoriaMetrics is collecting data correctly
  2. Dual log shipping: Configure Promtail to send logs to both Loki instances

    • Add second client in system/monitoring/logs.nix pointing to monitoring02
  3. Validate dashboards: Access Grafana on monitoring02, verify dashboards work

  4. Validate alerts: Verify vmalert evaluates rules correctly (no receiver = no notifications)

  5. Compare resource usage: Monitor disk/memory consumption between hosts
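
Dual log shipping (step 2) is just a second entry in Promtail's clients list; a sketch against the NixOS services.promtail module (the push path is Loki's standard endpoint):

```nix
# system/monitoring/logs.nix (sketch of the dual-shipping change)
services.promtail.configuration.clients = [
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  # Second client added for parallel operation; removed again in Phase 5
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
];
```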

Phase 4: Add monitoring CNAME

Add CNAME to monitoring02 once validated:

# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];

This creates monitoring.home.2rjus.net pointing to monitoring02.

Phase 5: Update References

Update hardcoded references to use the CNAME:

  1. system/monitoring/logs.nix:

    • Remove dual-shipping, point only to http://monitoring.home.2rjus.net:3100
  2. services/http-proxy/proxy.nix: Update reverse proxy backends:

    • prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
    • alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
    • grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
    • pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: hosts/template2/bootstrap.nix stays pointed at monitoring01 until decommission.
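
Assuming Caddy is configured through services.caddy, the backend updates could look like this (the virtual-host names come from the list above; the option shape is a sketch, not copied from proxy.nix):

```nix
# services/http-proxy/proxy.nix (sketch)
services.caddy.virtualHosts = {
  "prometheus.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:8428
  '';
  "alertmanager.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:9093
  '';
  "grafana.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:3000
  '';
  "pyroscope.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:4040
  '';
};
```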

Phase 6: Enable Alerting

Once ready to cut over:

  1. Enable Alertmanager receiver on monitoring02
  2. Verify test alerts route correctly

Phase 7: Cutover and Decommission

  1. Stop monitoring01: Prevent duplicate alerts during transition
  2. Update bootstrap.nix: Point to monitoring.home.2rjus.net
  3. Verify all targets scraped: Check VictoriaMetrics UI
  4. Verify logs flowing: Check Loki on monitoring02
  5. Decommission monitoring01:
    • Remove from flake.nix
    • Remove host configuration
    • Destroy VM in Proxmox
    • Remove from terraform state

Current Progress

monitoring02 Host Created (2026-02-08)

Host deployed at 10.69.13.24 (test tier) with:

  • 4 CPU cores, 8GB RAM, 60GB disk
  • Vault integration enabled
  • NATS-based remote deployment enabled

Grafana with Kanidm OIDC (2026-02-08)

Grafana deployed on monitoring02 as a test instance (grafana-test.home.2rjus.net):

  • Kanidm OIDC authentication (PKCE enabled)
  • Role mapping: admins → Admin, others → Viewer
  • Declarative datasources pointing to monitoring01 (Prometheus, Loki)
  • Local Caddy for TLS termination via internal ACME CA

This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing services/monitoring/grafana.nix on monitoring01 can be replaced with the new services/grafana/ module once monitoring02 becomes the primary monitoring host.
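
The Kanidm OIDC setup can be sketched via Grafana's generic OAuth settings. Everything below marked as an assumption (client ID, Kanidm hostname and endpoint paths, role-mapping expression) is illustrative, not copied from the repo:

```nix
# Sketch: Grafana OIDC against Kanidm (values are assumptions)
services.grafana.settings."auth.generic_oauth" = {
  enabled = true;
  name = "Kanidm";
  client_id = "grafana";                                     # assumption
  scopes = "openid profile email groups";
  use_pkce = true;                                           # PKCE enabled, per above
  auth_url = "https://idm.home.2rjus.net/ui/oauth2";         # assumption
  token_url = "https://idm.home.2rjus.net/oauth2/token";     # assumption
  api_url = "https://idm.home.2rjus.net/oauth2/openid/grafana/userinfo";  # assumption
  # admins -> Admin, everyone else -> Viewer
  role_attribute_path = "contains(groups[*], 'admins') && 'Admin' || 'Viewer'";
};
```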

Open Questions

  • What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
  • Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
  • Consider replacing Promtail with Grafana Alloy (services.alloy, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.
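
If the Alloy route is taken, a minimal journal-to-Loki pipeline might look like the following (a sketch: the River snippet mirrors the current Promtail journal shipping, and the config-file path assumes services.alloy's default /etc/alloy directory):

```nix
# Sketch: minimal Alloy replacement for Promtail's journal shipping
services.alloy.enable = true;
environment.etc."alloy/config.alloy".text = ''
  loki.source.journal "journal" {
    forward_to = [loki.write.default.receiver]
  }
  loki.write "default" {
    endpoint {
      url = "http://monitoring.home.2rjus.net:3100/loki/api/v1/push"
    }
  }
'';
```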

VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3";  # bare number = months; "m" is not a month suffix
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093";  # Enable after cutover
  rule = [ ./rules.yml ];
};
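
The rules file keeps the Prometheus format, which vmalert consumes unchanged; a minimal rules.yml sketch (the alert name and threshold are illustrative):

```yaml
# rules.yml (sketch; standard Prometheus rule format)
groups:
  - name: node
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```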

Rollback Plan

If issues arise after cutover:

  1. Move monitoring CNAME back to monitoring01
  2. Restart monitoring01 services
  3. Revert Promtail config to point only to monitoring01
  4. Revert http-proxy backends

Notes

  • VictoriaMetrics uses port 8428 vs Prometheus 9090
  • PromQL compatibility is excellent (VictoriaMetrics' MetricsQL is a backwards-compatible superset of PromQL, with minor edge-case differences)
  • VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
  • monitoring02 deployed via OpenTofu using create-host script
  • Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state