Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover.
Monitoring Stack Migration to VictoriaMetrics
Overview
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a monitoring CNAME for seamless transition.
Current State
monitoring01 (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)
Hardcoded References to monitoring01:
- system/monitoring/logs.nix - Promtail sends logs to http://monitoring01.home.2rjus.net:3100
- hosts/template2/bootstrap.nix - Bootstrap logs to Loki (keep as-is until decommission)
- services/http-proxy/proxy.nix - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
Auto-generated:
- Prometheus scrape targets (from lib/monitoring.nix + homelab.monitoring.scrapeTargets)
- Node-exporter targets (from all hosts with static IPs)
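The auto-generated targets come from an in-repo option; as a rough sketch (the exact option shape is defined in lib/monitoring.nix and may differ), a host might declare:

```nix
# Hypothetical sketch of a host exposing a scrape target via the
# homelab.monitoring.scrapeTargets option; the real attribute names
# live in lib/monitoring.nix and may differ from this.
{
  homelab.monitoring.scrapeTargets = [
    {
      job_name = "node";
      port = 9100; # node-exporter
    }
  ];
}
```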
Decision: VictoriaMetrics
Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
Architecture
┌─────────────────┐
│ monitoring02 │
│ VictoriaMetrics│
│ + Grafana │
monitoring │ + Loki │
CNAME ──────────│ + Tempo │
│ + Pyroscope │
│ + Alertmanager │
│ (vmalert) │
└─────────────────┘
▲
│ scrapes
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌─────┴────┐ ┌─────┴────┐
│ ns1 │ │ ha1 │ │ ... │
│ :9100 │ │ :9100 │ │ :9100 │
└─────────┘ └──────────┘ └──────────┘
Implementation Plan
Phase 1: Create monitoring02 Host
Use create-host script which handles flake.nix and terraform/vms.tf automatically.
- Run create-host: nix develop -c create-host monitoring02 10.69.13.24
- Update VM resources in terraform/vms.tf:
  - 4 cores (same as monitoring01)
  - 8GB RAM (double, for VictoriaMetrics headroom)
  - 100GB disk (for 3+ months retention with compression)
- Update host configuration: import monitoring services
- Create Vault AppRole: add to terraform/vault/approle.tf
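A minimal sketch of the host-configuration step, assuming the module layout introduced in Phase 2 (the exact paths are assumptions about the repo layout):

```nix
# hosts/monitoring02/configuration.nix (sketch)
{ ... }:
{
  imports = [
    # New module created in Phase 2; other monitoring service imports
    # would follow the same pattern.
    ../../services/monitoring/victoriametrics
  ];

  networking.hostName = "monitoring02";
}
```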
Phase 2: Set Up VictoriaMetrics Stack
Create new service module at services/monitoring/victoriametrics/ for testing alongside existing
Prometheus config. Once validated, this can replace the Prometheus module.
- VictoriaMetrics (port 8428):
  - services.victoriametrics.enable = true
  - services.victoriametrics.retentionPeriod = "3m" (3 months; increase later based on disk usage)
  - Migrate scrape configs via prometheusConfig
  - Use native push support (replaces Pushgateway)
- vmalert for alerting rules:
  - services.vmalert.enable = true
  - Point to VictoriaMetrics for metrics evaluation
  - Keep rules in a separate rules.yml file (same format as Prometheus)
  - No receiver configured during parallel operation (prevents duplicate alerts)
- Alertmanager (port 9093):
  - Keep existing configuration (alerttonotify webhook routing)
  - Only enable receiver after cutover from monitoring01
- Loki (port 3100):
  - Same configuration as current
- Grafana (port 3000):
  - Define dashboards declaratively via NixOS options (not imported from monitoring01)
  - Reference existing dashboards on monitoring01 for content inspiration
  - Configure VictoriaMetrics datasource (port 8428)
  - Configure Loki datasource
- Tempo (ports 3200, 3201):
  - Same configuration
- Pyroscope (port 4040):
  - Same Docker-based deployment
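For the declarative Grafana setup, a sketch using the stock NixOS provisioning options (datasource names and the ./dashboards path are placeholders to be adapted):

```nix
# Sketch: declarative Grafana datasources and dashboards via NixOS
# provisioning. VictoriaMetrics is registered as a Prometheus-type
# datasource because it serves the Prometheus query API.
services.grafana = {
  enable = true;
  provision = {
    enable = true;
    datasources.settings.datasources = [
      {
        name = "VictoriaMetrics";
        type = "prometheus";
        url = "http://localhost:8428";
        isDefault = true;
      }
      {
        name = "Loki";
        type = "loki";
        url = "http://localhost:3100";
      }
    ];
    dashboards.settings.providers = [
      {
        name = "homelab";
        options.path = ./dashboards; # JSON dashboard files kept in the repo
      }
    ];
  };
};
```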
Phase 3: Parallel Operation
Run both monitoring01 and monitoring02 simultaneously:
- Dual scraping: both hosts scrape the same targets
  - Validates VictoriaMetrics is collecting data correctly
- Dual log shipping: configure Promtail to send logs to both Loki instances
  - Add a second client in system/monitoring/logs.nix pointing to monitoring02
- Validate dashboards: access Grafana on monitoring02, verify dashboards work
- Validate alerts: verify vmalert evaluates rules correctly (no receiver = no notifications)
- Compare resource usage: monitor disk/memory consumption between hosts
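The dual log shipping step amounts to a second entry in Promtail's clients list; a sketch assuming the stock services.promtail NixOS module:

```nix
# system/monitoring/logs.nix (sketch): ship logs to both Loki instances
# during parallel operation; drop the monitoring01 client after cutover.
services.promtail.configuration.clients = [
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
];
```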
Phase 4: Add monitoring CNAME
Add CNAME to monitoring02 once validated:
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
This creates monitoring.home.2rjus.net pointing to monitoring02.
Phase 5: Update References
Update hardcoded references to use the CNAME:
- system/monitoring/logs.nix: remove dual-shipping, point only to http://monitoring.home.2rjus.net:3100
- services/http-proxy/proxy.nix: update reverse proxy backends:
- prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
- alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
- grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
Note: hosts/template2/bootstrap.nix stays pointed at monitoring01 until decommission.
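Assuming the proxy is plain Caddy virtual hosts (the real proxy.nix structure may differ), the backend updates might look like:

```nix
# services/http-proxy/proxy.nix (sketch): point backends at the CNAME.
services.caddy.virtualHosts = {
  "prometheus.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:8428
  '';
  "grafana.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:3000
  '';
  # alertmanager (:9093) and pyroscope (:4040) follow the same pattern
};
```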
Phase 6: Enable Alerting
Once ready to cut over:
- Enable Alertmanager receiver on monitoring02
- Verify test alerts route correctly
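Using the option names from the vmalert configuration example in this plan, enabling the receiver is a one-line change (sketch):

```nix
# Add the notifier so vmalert fires evaluated alerts into Alertmanager
# (this is the line kept commented out during parallel operation).
services.vmalert.notifier.alertmanager.url = "http://localhost:9093";
```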
Phase 7: Cutover and Decommission
- Stop monitoring01: prevent duplicate alerts during transition
- Update bootstrap.nix: point to monitoring.home.2rjus.net
- Verify all targets scraped: check VictoriaMetrics UI
- Verify logs flowing: Check Loki on monitoring02
- Decommission monitoring01:
- Remove from flake.nix
- Remove host configuration
- Destroy VM in Proxmox
- Remove from terraform state
Open Questions
- What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
VictoriaMetrics Service Configuration
Example NixOS configuration for monitoring02:
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3m"; # 3 months, increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
  rule = [ ./rules.yml ];
};
Rollback Plan
If issues arise after cutover:
- Move the monitoring CNAME back to monitoring01
- Restart monitoring01 services
- Revert Promtail config to point only to monitoring01
- Revert http-proxy backends
Notes
- VictoriaMetrics listens on port 8428 vs Prometheus's 9090
- VictoriaMetrics' MetricsQL is backwards-compatible with PromQL, so existing queries and dashboards should work unchanged
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the create-host script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state