
Monitoring Stack Migration to VictoriaMetrics

Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression and longer retention. Run in parallel with monitoring01 until validated, then switch over using a monitoring CNAME for seamless transition.

Current State

monitoring01 (10.69.13.13):

  • 4 CPU cores, 4GB RAM, 33GB disk
  • Prometheus with 30-day retention (15s scrape interval)
  • Alertmanager (routes to alerttonotify webhook)
  • Grafana (dashboards, datasources)
  • Loki (log aggregation from all hosts via Promtail)
  • Tempo (distributed tracing)
  • Pyroscope (continuous profiling)

Hardcoded References to monitoring01:

  • system/monitoring/logs.nix - Promtail sends logs to http://monitoring01.home.2rjus.net:3100
  • hosts/template2/bootstrap.nix - Bootstrap logs to Loki (keep as-is until decommission)
  • services/http-proxy/proxy.nix - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

Auto-generated:

  • Prometheus scrape targets (from lib/monitoring.nix + homelab.monitoring.scrapeTargets)
  • Node-exporter targets (from all hosts with static IPs)
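
The auto-generation above could be expressed roughly like this (a sketch only; the actual shape of lib/monitoring.nix may differ, and the `homelab.hosts` attrset with a `staticIp` field is an assumption):

```nix
# Sketch: build scrape targets from per-host options (names are illustrative)
{ lib, config, ... }:
let
  # Every host with a static IP gets a node-exporter target on :9100
  nodeTargets = lib.mapAttrsToList
    (name: _host: "${name}.home.2rjus.net:9100")
    (lib.filterAttrs (_: host: (host.staticIp or null) != null) config.homelab.hosts);
in
{
  scrape_configs = [
    { job_name = "node"; static_configs = [ { targets = nodeTargets; } ]; }
  ] ++ config.homelab.monitoring.scrapeTargets;
}
```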

Decision: VictoriaMetrics

Per docs/plans/long-term-metrics-storage.md, VictoriaMetrics is the recommended starting point:

  • Single binary replacement for Prometheus
  • 5-10x better compression (30 days of data could become 180+ days in the same space)
  • Same PromQL query language (Grafana dashboards work unchanged)
  • Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

Architecture

                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Tempo        │
                     │  + Pyroscope    │
                     │  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘

Implementation Plan

Phase 1: Create monitoring02 Host

Use the create-host script, which updates flake.nix and terraform/vms.tf automatically.

  1. Run create-host: nix develop -c create-host monitoring02 10.69.13.24
  2. Update VM resources in terraform/vms.tf:
    • 4 cores (same as monitoring01)
    • 8GB RAM (double, for VictoriaMetrics headroom)
    • 100GB disk (for 3+ months retention with compression)
  3. Update host configuration: Import monitoring services
  4. Create Vault AppRole: Add to terraform/vault/approle.tf
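
Step 3 ("Import monitoring services") might look like the following in the new host's configuration (a sketch; the exact module paths are assumptions):

```nix
# hosts/monitoring02/configuration.nix (sketch)
{
  imports = [
    ../../services/monitoring/victoriametrics   # new module from Phase 2
    ../../services/grafana
    ../../services/monitoring/loki.nix          # path is an assumption
    ../../services/monitoring/tempo.nix         # path is an assumption
  ];
}
```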

Phase 2: Set Up VictoriaMetrics Stack

Create new service module at services/monitoring/victoriametrics/ for testing alongside existing Prometheus config. Once validated, this can replace the Prometheus module.

  1. VictoriaMetrics (port 8428):

    • services.victoriametrics.enable = true
    • services.victoriametrics.retentionPeriod = "3" (a bare number means months in VictoriaMetrics; there is no "m" month suffix. Increase later based on disk usage)
    • Migrate scrape configs via prometheusConfig
    • Use native push support (replaces Pushgateway)
  2. vmalert for alerting rules:

    • services.vmalert.enable = true
    • Point to VictoriaMetrics for metrics evaluation
    • Keep rules in separate rules.yml file (same format as Prometheus)
    • No receiver configured during parallel operation (prevents duplicate alerts)
  3. Alertmanager (port 9093):

    • Keep existing configuration (alerttonotify webhook routing)
    • Only enable receiver after cutover from monitoring01
  4. Loki (port 3100):

    • Same configuration as current
  5. Grafana (port 3000):

    • Define dashboards declaratively via NixOS options (not imported from monitoring01)
    • Reference existing dashboards on monitoring01 for content inspiration
    • Configure VictoriaMetrics datasource (port 8428)
    • Configure Loki datasource
  6. Tempo (ports 3200, 3201):

    • Same configuration
  7. Pyroscope (port 4040):

    • Same Docker-based deployment

Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

  1. Dual scraping: Both hosts scrape the same targets

    • Validates VictoriaMetrics is collecting data correctly
  2. Dual log shipping: Configure Promtail to send logs to both Loki instances

    • Add second client in system/monitoring/logs.nix pointing to monitoring02
  3. Validate dashboards: Access Grafana on monitoring02, verify dashboards work

  4. Validate alerts: Verify vmalert evaluates rules correctly (no receiver = no notifications)

  5. Compare resource usage: Monitor disk/memory consumption between hosts
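
Dual log shipping (step 2) is just a second entry in Promtail's clients list; a sketch against the NixOS services.promtail module (the push path is Loki's standard endpoint):

```nix
# system/monitoring/logs.nix (sketch of the dual-shipping change)
services.promtail.configuration.clients = [
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  # Second client added for parallel operation; removed again in Phase 5
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
];
```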

Phase 4: Add monitoring CNAME

Add CNAME to monitoring02 once validated:

# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];

This creates monitoring.home.2rjus.net pointing to monitoring02.

Phase 5: Update References

Update hardcoded references to use the CNAME:

  1. system/monitoring/logs.nix:

    • Remove dual-shipping, point only to http://monitoring.home.2rjus.net:3100
  2. services/http-proxy/proxy.nix: Update reverse proxy backends:

    • prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
    • alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
    • grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
    • pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: hosts/template2/bootstrap.nix stays pointed at monitoring01 until decommission.
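
Assuming Caddy is configured through services.caddy, the backend updates could look like this (the virtual-host names come from the list above; the option shape is a sketch, not copied from proxy.nix):

```nix
# services/http-proxy/proxy.nix (sketch)
services.caddy.virtualHosts = {
  "prometheus.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:8428
  '';
  "alertmanager.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:9093
  '';
  "grafana.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:3000
  '';
  "pyroscope.home.2rjus.net".extraConfig = ''
    reverse_proxy monitoring.home.2rjus.net:4040
  '';
};
```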

Phase 6: Enable Alerting

Once ready to cut over:

  1. Enable Alertmanager receiver on monitoring02
  2. Verify test alerts route correctly

Phase 7: Cutover and Decommission

  1. Stop monitoring01: Prevent duplicate alerts during transition
  2. Update bootstrap.nix: Point to monitoring.home.2rjus.net
  3. Verify all targets scraped: Check VictoriaMetrics UI
  4. Verify logs flowing: Check Loki on monitoring02
  5. Decommission monitoring01:
    • Remove from flake.nix
    • Remove host configuration
    • Destroy VM in Proxmox
    • Remove from terraform state

Current Progress

monitoring02 Host Created (2026-02-08)

Host deployed at 10.69.13.24 (test tier) with:

  • 4 CPU cores, 8GB RAM, 60GB disk
  • Vault integration enabled
  • NATS-based remote deployment enabled

Grafana with Kanidm OIDC (2026-02-08)

Grafana deployed on monitoring02 as a test instance (grafana-test.home.2rjus.net):

  • Kanidm OIDC authentication (PKCE enabled)
  • Role mapping: admins → Admin, others → Viewer
  • Declarative datasources pointing to monitoring01 (Prometheus, Loki)
  • Local Caddy for TLS termination via internal ACME CA

This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing services/monitoring/grafana.nix on monitoring01 can be replaced with the new services/grafana/ module once monitoring02 becomes the primary monitoring host.
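
The Kanidm OIDC setup can be sketched via Grafana's generic OAuth settings. Everything below marked as an assumption (client ID, Kanidm hostname and endpoint paths, role-mapping expression) is illustrative, not copied from the repo:

```nix
# Sketch: Grafana OIDC against Kanidm (values are assumptions)
services.grafana.settings."auth.generic_oauth" = {
  enabled = true;
  name = "Kanidm";
  client_id = "grafana";                                     # assumption
  scopes = "openid profile email groups";
  use_pkce = true;                                           # PKCE enabled, per above
  auth_url = "https://idm.home.2rjus.net/ui/oauth2";         # assumption
  token_url = "https://idm.home.2rjus.net/oauth2/token";     # assumption
  api_url = "https://idm.home.2rjus.net/oauth2/openid/grafana/userinfo";  # assumption
  # admins -> Admin, everyone else -> Viewer
  role_attribute_path = "contains(groups[*], 'admins') && 'Admin' || 'Viewer'";
};
```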

Open Questions

  • What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
  • Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
  • Consider replacing Promtail with Grafana Alloy (services.alloy, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.
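
If the Alloy route is taken, a minimal journal-to-Loki pipeline might look like the following (a sketch: the River snippet mirrors the current Promtail journal shipping, and the config-file path assumes services.alloy's default /etc/alloy directory):

```nix
# Sketch: minimal Alloy replacement for Promtail's journal shipping
services.alloy.enable = true;
environment.etc."alloy/config.alloy".text = ''
  loki.source.journal "journal" {
    forward_to = [loki.write.default.receiver]
  }
  loki.write "default" {
    endpoint {
      url = "http://monitoring.home.2rjus.net:3100/loki/api/v1/push"
    }
  }
'';
```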

VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3";  # bare number = months; "m" is not a month suffix
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  datasource.url = "http://localhost:8428";
  # notifier.alertmanager.url = "http://localhost:9093";  # Enable after cutover
  rule = [ ./rules.yml ];
};
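
The rules file keeps the Prometheus format, which vmalert consumes unchanged; a minimal rules.yml sketch (the alert name and threshold are illustrative):

```yaml
# rules.yml (sketch; standard Prometheus rule format)
groups:
  - name: node
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```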

Rollback Plan

If issues arise after cutover:

  1. Move monitoring CNAME back to monitoring01
  2. Restart monitoring01 services
  3. Revert Promtail config to point only to monitoring01
  4. Revert http-proxy backends

Notes

  • VictoriaMetrics uses port 8428 vs Prometheus 9090
  • PromQL compatibility is excellent (VictoriaMetrics' MetricsQL is a backwards-compatible superset of PromQL, with minor edge-case differences)
  • VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
  • monitoring02 deployed via OpenTofu using create-host script
  • Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state