- Switch vmalert from blackhole mode to sending alerts to local Alertmanager - Import alerttonotify service so alerts route to NATS notifications - Move alertmanager and grafana CNAMEs from http-proxy to monitoring02 - Add monitoring CNAME to monitoring02 - Add Caddy reverse proxy entries for alertmanager and grafana - Remove prometheus, alertmanager, and grafana Caddy entries from http-proxy (now served directly by monitoring02) - Move monitoring02 Vault AppRole to hosts-generated.tf with extra_policies support and prometheus-metrics policy - Update Promtail to use authenticated loki.home.2rjus.net endpoint only (remove unauthenticated monitoring01 client) - Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with basic auth from Vault secret - Move migration plan to completed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6.8 KiB
Monitoring Stack Migration to VictoriaMetrics
Overview
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a monitoring CNAME for seamless transition.
Current State
monitoring02 (10.69.13.24) - PRIMARY:
- 4 CPU cores, 8GB RAM, 60GB disk
- VictoriaMetrics with 3-month retention
- vmalert with alerting enabled (routes to local Alertmanager)
- Alertmanager -> alerttonotify -> NATS notification pipeline
- Grafana with Kanidm OIDC (
grafana.home.2rjus.net) - Loki (log aggregation)
- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki
monitoring01 (10.69.13.13) - SHUT DOWN:
- No longer running, pending decommission
Decision: VictoriaMetrics
Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
Architecture
┌─────────────────┐
│ monitoring02 │
│ VictoriaMetrics│
│ + Grafana │
monitoring │ + Loki │
CNAME ──────────│ + Alertmanager │
│ (vmalert) │
└─────────────────┘
▲
│ scrapes
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌─────┴────┐ ┌─────┴────┐
│ ns1 │ │ ha1 │ │ ... │
│ :9100 │ │ :9100 │ │ :9100 │
└─────────┘ └──────────┘ └──────────┘
Implementation Plan
Phase 1: Create monitoring02 Host [COMPLETE]
Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as test instance (
grafana-test.home.2rjus.net)
Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]
New service module at services/victoriametrics/ for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.
-
VictoriaMetrics (port 8428):
services.victoriametrics.enable = trueretentionPeriod = "3"(3 months)- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
- Static user override (DynamicUser disabled) for credential file access
- OpenBao token fetch service + 30min refresh timer
- Apiary bearer token via vault.secrets
-
vmalert for alerting rules:
- Points to VictoriaMetrics datasource at localhost:8428
- Reuses existing
services/monitoring/rules.ymldirectly viasettings.rule - Notifier sends to local Alertmanager at localhost:9093
-
Alertmanager (port 9093):
- Same configuration as monitoring01 (alerttonotify webhook routing)
- alerttonotify imported on monitoring02, routes alerts via NATS
-
Grafana (port 3000):
- VictoriaMetrics datasource (localhost:8428) as default
- Loki datasource pointing to localhost:3100
-
Loki (port 3100):
- Same configuration as monitoring01 in standalone
services/loki/module - Grafana datasource updated to localhost:3100
- Same configuration as monitoring01 in standalone
Note: pve-exporter and pushgateway scrape targets are not included on monitoring02. pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics native push support.
Phase 3: Parallel Operation [COMPLETE]
Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.
Phase 4: Add monitoring CNAME [COMPLETE]
Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.
Phase 5: Update References [COMPLETE]
- Moved alertmanager, grafana, prometheus CNAMEs from http-proxy to monitoring02
- Removed corresponding Caddy reverse proxy entries from http-proxy
- monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly
Phase 6: Enable Alerting [COMPLETE]
- Switched vmalert from blackhole mode to local Alertmanager
- alerttonotify service running on monitoring02 (NATS nkey from Vault)
- prometheus-metrics Vault policy added for OpenBao scraping
- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS
Phase 7: Cutover and Decommission [IN PROGRESS]
- monitoring01 shut down (2026-02-17)
- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support
Remaining cleanup (separate branch):
- Update
system/monitoring/logs.nix- Promtail still points to monitoring01 - Update
hosts/template2/bootstrap.nix- Bootstrap Loki URL still points to monitoring01 - Remove monitoring01 from flake.nix and host configuration
- Destroy monitoring01 VM in Proxmox
- Remove monitoring01 from terraform state
- Remove or archive
services/monitoring/(Prometheus config)
Completed
- 2026-02-08: Phase 1 - monitoring02 host created
- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
- 2026-02-17: Phase 6 - Alerting enabled, CNAMEs migrated, monitoring01 shut down
VictoriaMetrics Service Configuration
Implemented in services/victoriametrics/default.nix. Key design decisions:
- Static user: VictoriaMetrics NixOS module uses
DynamicUser, overridden with a staticvictoriametricsuser so vault.secrets and credential files work correctly - Shared rules: vmalert reuses
services/monitoring/rules.ymlviasettings.rulepath reference (no YAML-to-Nix conversion needed) - Scrape config reuse: Uses the same
lib/monitoring.nixfunctions andservices/monitoring/external-targets.nixas Prometheus for auto-generated targets
Notes
- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using
create-hostscript - Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)