
monitoring02: add VictoriaMetrics, vmalert, and Alertmanager

Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-17 00:55:08 +01:00


Monitoring Stack Migration to VictoriaMetrics

Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run in parallel with monitoring01 until validated, then switch over using a monitoring CNAME for seamless transition.

Current State

monitoring01 (10.69.13.13):

4 CPU cores, 4GB RAM, 33GB disk
Prometheus with 30-day retention (15s scrape interval)
Alertmanager (routes to alerttonotify webhook)
Grafana (dashboards, datasources)
Loki (log aggregation from all hosts via Promtail)
Tempo (distributed tracing)
Pyroscope (continuous profiling)

Hardcoded References to monitoring01:

system/monitoring/logs.nix - Promtail sends logs to http://monitoring01.home.2rjus.net:3100
hosts/template2/bootstrap.nix - Bootstrap logs to Loki (keep as-is until decommission)
services/http-proxy/proxy.nix - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

Auto-generated:

Prometheus scrape targets (from lib/monitoring.nix + homelab.monitoring.scrapeTargets)
Node-exporter targets (from all hosts with static IPs)

Decision: VictoriaMetrics

Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:

Single binary replacement for Prometheus
5-10x better compression (30 days could become roughly 150-300 days in the same space)
Same PromQL query language (Grafana dashboards work unchanged)
Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

Architecture

                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Tempo        │
                     │  + Pyroscope    │
                     │  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘

Implementation Plan

Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:

4 CPU cores, 8GB RAM, 60GB disk
Vault integration enabled
NATS-based remote deployment enabled
Grafana with Kanidm OIDC deployed as test instance (grafana-test.home.2rjus.net)

Phase 2: Set Up VictoriaMetrics Stack

New service module at services/victoriametrics/ for VictoriaMetrics + vmalert + Alertmanager. Imported by monitoring02 alongside the existing Grafana service.

VictoriaMetrics (port 8428): [DONE]
- services.victoriametrics.enable = true
- retentionPeriod = "3" (3 months)
- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
- Static user override (DynamicUser disabled) for credential file access
- OpenBao token fetch service + 30min refresh timer
- Apiary bearer token via vault.secrets
vmalert for alerting rules: [DONE]
- Points to VictoriaMetrics datasource at localhost:8428
- Reuses existing services/monitoring/rules.yml directly via settings.rule
- No notifier configured during parallel operation (prevents duplicate alerts)
Alertmanager (port 9093): [DONE]
- Same configuration as monitoring01 (alerttonotify webhook routing)
- Will only receive alerts after cutover (vmalert notifier disabled)
Grafana (port 3000): [DONE]
- VictoriaMetrics datasource (localhost:8428) as default
- monitoring01 Prometheus datasource kept for comparison during parallel operation
- Loki datasource pointing to monitoring01 (until Loki migrated)
Loki (port 3100):
- TODO: Same configuration as current
Tempo (ports 3200, 3201):
- TODO: Same configuration
Pyroscope (port 4040):
- TODO: Same Docker-based deployment
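
The completed items in the list above could look roughly like the following module. This is a minimal sketch, not the contents of services/victoriametrics/default.nix: the single scrape job stands in for the 22 migrated jobs, the ns1 target hostname and the alerttonotify webhook URL are placeholders, and using the notifier.blackhole flag to run vmalert without a notifier is an assumption about how "notifier disabled" is achieved. Depending on the nixpkgs release, the vmalert options may live under services.vmalert.instances.<name> instead.

```nix
# Illustrative sketch only; names marked "placeholder" are not from the repo.
{ ... }:
{
  services.victoriametrics = {
    enable = true;
    listenAddress = ":8428";
    # A value without a unit suffix is counted in months.
    retentionPeriod = "3";
    # Prometheus-style scrape configuration.
    prometheusConfig = {
      global.scrape_interval = "15s";
      scrape_configs = [
        {
          job_name = "node-exporter";
          # Placeholder target; the real list is auto-generated.
          static_configs = [ { targets = [ "ns1.home.2rjus.net:9100" ]; } ];
        }
      ];
    };
  };

  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # Reuse the existing rules file; path assumes this module sits in
      # services/victoriametrics/.
      rule = [ "${../monitoring/rules.yml}" ];
      # Assumption: evaluate rules but drop all notifications during
      # parallel operation.
      "notifier.blackhole" = true;
    };
  };

  services.prometheus.alertmanager = {
    enable = true;
    port = 9093;
    configuration = {
      route.receiver = "alerttonotify";
      receivers = [
        {
          name = "alerttonotify";
          # Placeholder URL for the alerttonotify webhook.
          webhook_configs = [ { url = "http://localhost:5001/alert"; } ];
        }
      ];
    };
  };
}
```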

Note: pve-exporter and pushgateway scrape targets are not included on monitoring02. pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics native push support.
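
For the Grafana item in the Phase 2 list, the three datasources could be provisioned along these lines. This is a sketch under assumptions: the datasource names are made up for illustration, and the monitoring01 Prometheus URL assumes the default port 9090 mentioned in the Notes section.

```nix
# Illustrative sketch; datasource names are hypothetical.
{ ... }:
{
  services.grafana.provision = {
    enable = true;
    datasources.settings = {
      apiVersion = 1;
      datasources = [
        {
          # VictoriaMetrics speaks the Prometheus query API, so the
          # standard "prometheus" datasource type is used.
          name = "VictoriaMetrics";
          type = "prometheus";
          url = "http://localhost:8428";
          isDefault = true;
        }
        {
          # Kept for side-by-side comparison during parallel operation.
          name = "Prometheus (monitoring01)";
          type = "prometheus";
          url = "http://monitoring01.home.2rjus.net:9090";
        }
        {
          # Still served by monitoring01 until Loki is migrated.
          name = "Loki";
          type = "loki";
          url = "http://monitoring01.home.2rjus.net:3100";
        }
      ];
    };
  };
}
```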

Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

Dual scraping: Both hosts scrape the same targets
- Validates VictoriaMetrics is collecting data correctly
Dual log shipping: Configure Promtail to send logs to both Loki instances
- Add second client in system/monitoring/logs.nix pointing to monitoring02
Validate dashboards: Access Grafana on monitoring02, verify dashboards work
Validate alerts: Verify vmalert evaluates rules correctly (no notifier = no notifications)
Compare resource usage: Monitor disk/memory consumption between hosts
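
Dual log shipping amounts to listing two clients in the Promtail configuration, since Promtail sends every log entry to each configured client. A minimal sketch, assuming monitoring02 is reachable as monitoring02.home.2rjus.net and that system/monitoring/logs.nix sets the rest of the Promtail configuration elsewhere:

```nix
# Illustrative sketch of the clients section during parallel operation.
{ ... }:
{
  services.promtail.configuration.clients = [
    # Existing Loki instance.
    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
    # Assumed hostname for the new Loki instance on monitoring02.
    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```

After Phase 5 this list would shrink back to a single client pointing at the monitoring CNAME.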

Phase 4: Add monitoring CNAME

Add CNAME to monitoring02 once validated:

# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];

This creates monitoring.home.2rjus.net pointing to monitoring02.

Phase 5: Update References

Update hardcoded references to use the CNAME:

system/monitoring/logs.nix:
- Remove dual-shipping, point only to http://monitoring.home.2rjus.net:3100
services/http-proxy/proxy.nix: Update reverse proxy backends:
- prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
- alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
- grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: hosts/template2/bootstrap.nix stays pointed at monitoring01 until decommission.
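
The backend changes above might look like this in Caddy terms. This is a sketch only: it assumes services/http-proxy/proxy.nix uses the standard services.caddy.virtualHosts option with plain reverse_proxy directives, which the actual file may not do (for example, it may add TLS or authentication settings).

```nix
# Illustrative sketch; the real proxy.nix may be structured differently.
{ ... }:
{
  services.caddy.virtualHosts = {
    "prometheus.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:8428
    '';
    "alertmanager.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:9093
    '';
    "grafana.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:3000
    '';
    "pyroscope.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:4040
    '';
  };
}
```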

Phase 6: Enable Alerting

Once ready to cut over:

Enable the vmalert notifier on monitoring02 so alerts reach the local Alertmanager
Verify test alerts route correctly
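
Since Phase 2 leaves vmalert without a notifier, enabling alerting is likely a one-line change that points vmalert at the local Alertmanager. A sketch, assuming the non-instanced services.vmalert.settings option and Alertmanager on its port from Phase 2:

```nix
# Illustrative sketch of the Phase 6 change.
{ ... }:
{
  # Send firing alerts to the Alertmanager running on the same host.
  # If a blackhole notifier flag was set during parallel operation, it
  # would need to be removed at the same time.
  services.vmalert.settings."notifier.url" = [ "http://localhost:9093" ];
}
```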

Phase 7: Cutover and Decommission

Stop monitoring01: Prevent duplicate alerts during transition
Update bootstrap.nix: Point to monitoring.home.2rjus.net
Verify all targets scraped: Check VictoriaMetrics UI
Verify logs flowing: Check Loki on monitoring02
Decommission monitoring01:
- Remove from flake.nix
- Remove host configuration
- Destroy VM in Proxmox
- Remove from terraform state

Current Progress

Phase 1 complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
Phase 2 in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Grafana datasources configured
- Remaining: Loki, Tempo, Pyroscope migration

Open Questions

What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
Consider replacing Promtail with Grafana Alloy (services.alloy, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles), but it uses its own configuration syntax (formerly called River) instead of YAML, so its config is harder to generate from Nix than Promtail's. The migration could be bundled with monitoring02 to consolidate disruption.

VictoriaMetrics Service Configuration

Implemented in services/victoriametrics/default.nix. Key design decisions:

Static user: The VictoriaMetrics NixOS module runs the service with DynamicUser. This is overridden with a static victoriametrics user so that vault.secrets and credential files can be owned by, and readable to, the service
Shared rules: vmalert reuses services/monitoring/rules.yml via settings.rule path reference (no YAML-to-Nix conversion needed)
Scrape config reuse: Uses the same lib/monitoring.nix functions and services/monitoring/external-targets.nix as Prometheus for auto-generated targets
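
The static user decision can be sketched as a small systemd override. This is an assumption about the shape of the override, not a copy of the repo's module: it presumes the upstream unit only sets DynamicUser and does not already define User or Group.

```nix
# Illustrative sketch of replacing DynamicUser with a static account.
{ lib, ... }:
{
  users.groups.victoriametrics = { };
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };

  systemd.services.victoriametrics.serviceConfig = {
    # mkForce is needed because the upstream module enables DynamicUser.
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };
}
```

With a fixed user name, secret files can be provisioned with that owner before the service starts, which is not possible when the UID is allocated dynamically at runtime.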

Rollback Plan

If issues arise after cutover:

Move monitoring CNAME back to monitoring01
Restart monitoring01 services
Revert Promtail config to point only to monitoring01
Revert http-proxy backends

Notes

VictoriaMetrics uses port 8428 vs Prometheus 9090
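VictoriaMetrics queries use MetricsQL, which is largely backwards-compatible with PromQL, so existing dashboards and rules are expected to work with few or no changes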
VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
monitoring02 deployed via OpenTofu using create-host script
Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
