nixos-servers/docs/plans/monitoring-migration-victoriametrics.md
Torjus Håkestad 6184f4cbbb
monitoring02: enable alerting and migrate CNAMEs from http-proxy
- Switch vmalert from blackhole mode to sending alerts to local
  Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
  http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
  extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
  only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
  basic auth from Vault secret
- Update migration plan with current status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:22:33 +01:00


Monitoring Stack Migration to VictoriaMetrics

Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run monitoring02 in parallel with monitoring01 until validated, then switch over using a monitoring CNAME for a seamless transition.

Current State

monitoring02 (10.69.13.24) - PRIMARY:

  • 4 CPU cores, 8GB RAM, 60GB disk
  • VictoriaMetrics with 3-month retention
  • vmalert with alerting enabled (routes to local Alertmanager)
  • Alertmanager -> alerttonotify -> NATS notification pipeline
  • Grafana with Kanidm OIDC (grafana.home.2rjus.net)
  • Loki (log aggregation)
  • CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki

monitoring01 (10.69.13.13) - SHUT DOWN:

  • No longer running, pending decommission

Decision: VictoriaMetrics

Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:

  • Single binary replacement for Prometheus
  • 5-10x better compression (30 days could become 180+ days in same space)
  • Same PromQL query language (Grafana dashboards work unchanged)
  • Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

Architecture

                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘

Implementation Plan

Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:

  • 4 CPU cores, 8GB RAM, 60GB disk
  • Vault integration enabled
  • NATS-based remote deployment enabled
  • Grafana with Kanidm OIDC deployed as test instance (grafana-test.home.2rjus.net)

Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]

New service module at services/victoriametrics/ for VictoriaMetrics + vmalert + Alertmanager. Imported by monitoring02 alongside the existing Grafana service.

  1. VictoriaMetrics (port 8428):

    • services.victoriametrics.enable = true
    • retentionPeriod = "3" (3 months)
    • All scrape configs migrated from Prometheus (22 jobs including auto-generated)
    • Static user override (DynamicUser disabled) for credential file access
    • OpenBao token fetch service + 30min refresh timer
    • Apiary bearer token via vault.secrets
  2. vmalert for alerting rules:

    • Points to VictoriaMetrics datasource at localhost:8428
    • Reuses existing services/monitoring/rules.yml directly via settings.rule
    • Notifier sends to local Alertmanager at localhost:9093
  3. Alertmanager (port 9093):

    • Same configuration as monitoring01 (alerttonotify webhook routing)
    • alerttonotify imported on monitoring02, routes alerts via NATS
  4. Grafana (port 3000):

    • VictoriaMetrics datasource (localhost:8428) as default
    • Loki datasource pointing to localhost:3100
  5. Loki (port 3100):

    • Same configuration as monitoring01 in standalone services/loki/ module
    • Grafana datasource updated to localhost:3100
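
As a rough illustration of items 1 and 2 above, the following is a minimal sketch of how the VictoriaMetrics and vmalert parts might look as a NixOS module. It is not the actual contents of services/victoriametrics/default.nix. The option names follow the nixpkgs modules as far as I know them and may differ between nixpkgs releases. The relative path to rules.yml assumes the module lives in services/victoriametrics/. Scrape configs, the OpenBao token service, and the Apiary token are left out.

```nix
{ ... }:
{
  # Time-series database; VictoriaMetrics listens on :8428 by default.
  services.victoriametrics = {
    enable = true;
    # A bare number is interpreted by VictoriaMetrics as months.
    retentionPeriod = "3";
  };

  # Rule evaluation against the local VictoriaMetrics instance.
  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # Reuse the existing Prometheus rule file as-is.
      # Assumption: this file sits in services/victoriametrics/.
      rule = [ ../monitoring/rules.yml ];
      # Deliver firing alerts to the local Alertmanager.
      "notifier.url" = [ "http://localhost:9093" ];
    };
  };
}
```

Passing the rule file by path keeps a single source of truth for alert rules while both stacks exist.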

Note: pve-exporter and pushgateway scrape targets are not included on monitoring02. pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics native push support.

Phase 3: Parallel Operation [COMPLETE]

Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.

Phase 4: Add monitoring CNAME [COMPLETE]

Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.

Phase 5: Update References [COMPLETE]

  • Moved alertmanager and grafana CNAMEs from http-proxy to monitoring02
  • Removed the prometheus, alertmanager, and grafana Caddy reverse proxy entries from http-proxy
  • monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly
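
The Caddy side of this phase could look roughly like the sketch below. Only grafana.home.2rjus.net appears in this plan. The other hostnames are assumed to follow the same pattern, and mapping the metrics name to VictoriaMetrics is likewise an assumption. Ports 3000, 9093, and 8428 come from Phase 2, and 8880 is vmalert's default HTTP port.

```nix
{ ... }:
{
  services.caddy = {
    enable = true;
    virtualHosts = {
      "grafana.home.2rjus.net".extraConfig = ''
        reverse_proxy localhost:3000
      '';
      # Hostnames below are assumed, derived from the CNAME list.
      "alertmanager.home.2rjus.net".extraConfig = ''
        reverse_proxy localhost:9093
      '';
      "metrics.home.2rjus.net".extraConfig = ''
        reverse_proxy localhost:8428
      '';
      "vmalert.home.2rjus.net".extraConfig = ''
        reverse_proxy localhost:8880
      '';
    };
  };
}
```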

Phase 6: Enable Alerting [COMPLETE]

  • Switched vmalert from blackhole mode to local Alertmanager
  • alerttonotify service running on monitoring02 (NATS nkey from Vault)
  • prometheus-metrics Vault policy added for OpenBao scraping
  • Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS
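
A minimal sketch of what the switch from blackhole mode might involve, assuming vmalert is configured through services.vmalert.settings. The alerttonotify webhook URL and the grouping key are hypothetical placeholders, since the plan does not state them. vmalert's notifier.blackhole flag cannot be combined with notifier.url, so only one of the two is active at a time.

```nix
{ ... }:
{
  services.vmalert.settings = {
    # During validation: evaluate rules but discard notifications.
    # "notifier.blackhole" = true;

    # After enabling alerting: deliver to the local Alertmanager.
    "notifier.url" = [ "http://localhost:9093" ];
  };

  services.prometheus.alertmanager = {
    enable = true;
    port = 9093;
    configuration = {
      route = {
        receiver = "alerttonotify";
        group_by = [ "alertname" ]; # assumed grouping key
      };
      receivers = [
        {
          name = "alerttonotify";
          webhook_configs = [
            # Hypothetical listen address for alerttonotify.
            { url = "http://localhost:5001/alert"; }
          ];
        }
      ];
    };
  };
}
```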

Phase 7: Cutover and Decommission [IN PROGRESS]

  • monitoring01 shut down (2026-02-17)
  • Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support

Remaining cleanup (separate branch):

  • Update system/monitoring/logs.nix - Promtail still points to monitoring01
  • Update hosts/template2/bootstrap.nix - Bootstrap Loki URL still points to monitoring01
  • Remove monitoring01 from flake.nix and host configuration
  • Destroy monitoring01 VM in Proxmox
  • Remove monitoring01 from terraform state
  • Remove or archive services/monitoring/ (Prometheus config)
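
For the Promtail item, the target state might resemble the following fragment, which points the client at the authenticated loki.home.2rjus.net endpoint named in the commit message. The https scheme, username, and password file path are assumptions. In practice the password would come from a Vault secret. The rest of the Promtail configuration (server, positions, scrape_configs) is omitted.

```nix
{ ... }:
{
  services.promtail.configuration.clients = [
    {
      # /loki/api/v1/push is Loki's standard push endpoint.
      url = "https://loki.home.2rjus.net/loki/api/v1/push";
      basic_auth = {
        username = "promtail"; # hypothetical username
        # Hypothetical path; the secret itself stays out of the Nix store.
        password_file = "/run/secrets/promtail-loki-password";
      };
    }
  ];
}
```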

Completed

  • 2026-02-08: Phase 1 - monitoring02 host created
  • 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
  • 2026-02-17: Phases 4-6 - CNAMEs migrated, alerting enabled; monitoring01 shut down (Phase 7)

VictoriaMetrics Service Configuration

Implemented in services/victoriametrics/default.nix. Key design decisions:

  • Static user: The VictoriaMetrics NixOS module uses DynamicUser by default; this is overridden with a static victoriametrics user so that vault.secrets and credential files work correctly
  • Shared rules: vmalert reuses services/monitoring/rules.yml via settings.rule path reference (no YAML-to-Nix conversion needed)
  • Scrape config reuse: Uses the same lib/monitoring.nix functions and services/monitoring/external-targets.nix as Prometheus for auto-generated targets
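
The static user decision above could be expressed roughly as follows. This is a sketch under the assumption that the upstream unit only needs DynamicUser forced off, not a copy of the real module. The group name is an assumption.

```nix
{ lib, ... }:
{
  # Static system account so file ownership is stable across restarts.
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };

  systemd.services.victoriametrics.serviceConfig = {
    # The upstream module enables DynamicUser; mkForce overrides it.
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };
}
```

With a fixed user, secret files can be owned by victoriametrics and read by the service directly.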

Notes

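  • VictoriaMetrics listens on port 8428, whereas Prometheus used 9090
  • PromQL compatibility is excellent: vmalert evaluates the existing services/monitoring/rules.yml without changes
  • VictoriaMetrics native push replaces Pushgateway (remove the Pushgateway entry from http-proxy if it is no longer needed)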
  • monitoring02 deployed via OpenTofu using create-host script
  • Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
  • Tempo and Pyroscope deferred (not actively used; can be added later if needed)
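
To illustrate the declarative Grafana setup mentioned in the notes, here is a minimal sketch using the NixOS Grafana provisioning options. The datasource URLs come from Phase 2. The datasource names and the ./dashboards directory are hypothetical. VictoriaMetrics is registered with Grafana's built-in prometheus datasource type, which works because it serves a Prometheus-compatible query API.

```nix
{ ... }:
{
  services.grafana = {
    enable = true;
    provision = {
      enable = true;
      datasources.settings.datasources = [
        {
          name = "VictoriaMetrics"; # hypothetical display name
          type = "prometheus";
          url = "http://localhost:8428";
          isDefault = true;
        }
        {
          name = "Loki"; # hypothetical display name
          type = "loki";
          url = "http://localhost:3100";
        }
      ];
      dashboards.settings.providers = [
        {
          name = "default";
          # Hypothetical directory of dashboard JSON files in the repo.
          options.path = ./dashboards;
        }
      ];
    };
  };
}
```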