nixos-servers/docs/plans/monitoring-migration-victoriametrics.md

# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run monitoring02 in parallel with monitoring01 until validated, then switch
over using a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring02** (10.69.13.24) - **PRIMARY**:
- 4 CPU cores, 8GB RAM, 60GB disk
- VictoriaMetrics with 3-month retention
- vmalert with alerting enabled (routes to local Alertmanager)
- Alertmanager -> alerttonotify -> NATS notification pipeline
- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
- Loki (log aggregation)
- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki

**monitoring01** (10.69.13.13) - **SHUT DOWN**:
- No longer running, pending decommission

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (at that ratio, 30 days of data could become roughly 150-300 days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as a test instance (`grafana-test.home.2rjus.net`)

### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]

New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `retentionPeriod = "3"` (3 months)
   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
   - Static user override (DynamicUser disabled) for credential file access
   - OpenBao token fetch service + 30min refresh timer
   - Apiary bearer token via vault.secrets

2. **vmalert** for alerting rules:
   - Points to VictoriaMetrics datasource at localhost:8428
   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
   - Notifier sends to local Alertmanager at localhost:9093

3. **Alertmanager** (port 9093):
   - Same configuration as monitoring01 (alerttonotify webhook routing)
   - alerttonotify imported on monitoring02, routes alerts via NATS

4. **Grafana** (port 3000):
   - VictoriaMetrics datasource (localhost:8428) as default
   - Loki datasource pointing to localhost:3100

5. **Loki** (port 3100):
   - Same configuration as monitoring01 in standalone `services/loki/` module
   - Grafana datasource updated to localhost:3100
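
The VictoriaMetrics and vmalert items above could look roughly like the following sketch. It is not the contents of `services/victoriametrics/default.nix`. It assumes the `services.vmalert.settings` layout of the nixpkgs module (newer releases may nest these options under a per-instance attribute), and the relative rules path assumes the module lives in `services/victoriametrics/`.

```nix
{ ... }:
{
  services.victoriametrics = {
    enable = true;
    # A bare number is interpreted by VictoriaMetrics as months.
    retentionPeriod = "3";
  };

  services.vmalert = {
    enable = true;
    settings = {
      # Query the local VictoriaMetrics instance when evaluating rules.
      "datasource.url" = "http://localhost:8428";
      # Send firing alerts to the local Alertmanager.
      "notifier.url" = [ "http://localhost:9093" ];
      # Reuse the existing Prometheus rule file as-is.
      rule = [ ../monitoring/rules.yml ];
    };
  };
}
```

Scrape configs, the OpenBao token refresh timer, and the Apiary bearer token are omitted here because their exact shape depends on repository helpers not shown in this document.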

**Note:** The pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics'
native push support.
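
As an illustration of what native push can replace, a job that previously pushed to Pushgateway could instead post Prometheus text-format samples to VictoriaMetrics' `/api/v1/import/prometheus` endpoint. The sketch below is hypothetical: the unit name, the metric name, and the target URL are assumptions, not taken from this repository.

```nix
{ pkgs, ... }:
{
  # Hypothetical one-shot unit that pushes a single sample after a job completes.
  systemd.services.example-metric-push = {
    serviceConfig.Type = "oneshot";
    script = ''
      # Emit one sample in Prometheus text exposition format and post it.
      printf 'example_job_last_success_timestamp_seconds %s\n' "$(date +%s)" \
        | ${pkgs.curl}/bin/curl --fail --silent --data-binary @- \
            http://localhost:8428/api/v1/import/prometheus
    '';
  };
}
```

A unit running on another host would need to target monitoring02 rather than `localhost`.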

### Phase 3: Parallel Operation [COMPLETE]

Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.

### Phase 4: Add monitoring CNAME [COMPLETE]

Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.

### Phase 5: Update References [COMPLETE]

- Moved alertmanager, grafana, prometheus CNAMEs from http-proxy to monitoring02
- Removed corresponding Caddy reverse proxy entries from http-proxy
- monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly
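
A minimal sketch of what serving these endpoints directly from Caddy on monitoring02 might look like. Only `grafana.home.2rjus.net` appears in this document; the other hostnames are assumed to follow the same pattern, and port 8880 is vmalert's upstream default rather than a value confirmed here. TLS and access-control settings are left out.

```nix
{ ... }:
{
  services.caddy = {
    enable = true;
    virtualHosts = {
      # Hostnames other than grafana are assumptions based on the CNAME list.
      "grafana.home.2rjus.net".extraConfig = "reverse_proxy localhost:3000";
      "metrics.home.2rjus.net".extraConfig = "reverse_proxy localhost:8428";
      "alertmanager.home.2rjus.net".extraConfig = "reverse_proxy localhost:9093";
      # 8880 is vmalert's default HTTP listen port (assumed unchanged).
      "vmalert.home.2rjus.net".extraConfig = "reverse_proxy localhost:8880";
    };
  };
}
```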

### Phase 6: Enable Alerting [COMPLETE]

- Switched vmalert from blackhole mode to local Alertmanager
- alerttonotify service running on monitoring02 (NATS nkey from Vault)
- prometheus-metrics Vault policy added for OpenBao scraping
- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS
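
The Alertmanager leg of this pipeline can be pictured with the sketch below. The receiver name and the alerttonotify webhook URL are placeholders, since the real values are not stated in this document; only the Alertmanager port (9093) comes from the plan.

```nix
{ ... }:
{
  services.prometheus.alertmanager = {
    enable = true;
    port = 9093;
    configuration = {
      # Route every alert to a single webhook receiver.
      route.receiver = "alerttonotify";
      receivers = [
        {
          name = "alerttonotify";
          # Placeholder URL: the actual alerttonotify listen address is an assumption.
          webhook_configs = [ { url = "http://localhost:5001/alert"; } ];
        }
      ];
    };
  };
}
```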

### Phase 7: Cutover and Decommission [IN PROGRESS]

- monitoring01 shut down (2026-02-17)
- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support

**Remaining cleanup (separate branch):**
- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
- [ ] Update `hosts/template2/bootstrap.nix` - Bootstrap Loki URL still points to monitoring01
- [ ] Remove monitoring01 from flake.nix and host configuration
- [ ] Destroy monitoring01 VM in Proxmox
- [ ] Remove monitoring01 from terraform state
- [ ] Remove or archive `services/monitoring/` (Prometheus config)
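
For the Promtail item, the change probably amounts to repointing the client URL at the new host. The fragment below is a sketch only: the actual structure of `system/monitoring/logs.nix` is not shown here, and the hostname, scheme, and port are assumptions based on the `loki` CNAME and Loki's port 3100.

```nix
{ ... }:
{
  services.promtail.configuration.clients = [
    # Assumed URL; /loki/api/v1/push is Loki's standard push endpoint.
    { url = "http://loki.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```

Using a CNAME rather than the monitoring02 hostname would keep clients untouched if the stack moves again.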

## Completed

- 2026-02-08: Phase 1 - monitoring02 host created
- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
- 2026-02-17: Phases 4-6 - CNAMEs migrated, alerting enabled; monitoring01 shut down (Phase 7, cleanup pending)

## VictoriaMetrics Service Configuration

Implemented in `services/victoriametrics/default.nix`. Key design decisions:

- **Static user**: The VictoriaMetrics NixOS module runs the service with `DynamicUser`; this is
  overridden with a static `victoriametrics` user so that vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
  reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
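
The static-user decision can be sketched as follows. This is a minimal illustration assuming the upstream module sets `DynamicUser = true` without a lower priority, which is why `lib.mkForce` is used; it is not a copy of the actual module.

```nix
{ lib, ... }:
{
  # Declare a fixed system user and group so file ownership is predictable.
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };

  systemd.services.victoriametrics.serviceConfig = {
    # Override the module's dynamic user so secret files can be owned by a known user.
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };
}
```

With a fixed user, secrets provisioned for `victoriametrics` keep a stable owner across service restarts.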

## Notes

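- VictoriaMetrics listens on port 8428; Prometheus used 9090
- VictoriaMetrics' query language (MetricsQL) is designed to be backwards-compatible with PromQL, so existing dashboards and alert rules are reused unchanged
- VictoriaMetrics' native push support replaces Pushgateway (remove the Pushgateway entry from http-proxy if it is no longer needed)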
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)
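
For reference, declarative Grafana provisioning of the two datasources and a dashboard directory might look like the sketch below. The datasource names and the `./dashboards` path are hypothetical. The `prometheus` datasource type is used on the assumption that VictoriaMetrics is queried through its Prometheus-compatible API.

```nix
{ ... }:
{
  services.grafana.provision = {
    enable = true;
    datasources.settings.datasources = [
      {
        name = "VictoriaMetrics";
        type = "prometheus";
        url = "http://localhost:8428";
        isDefault = true;
      }
      {
        name = "Loki";
        type = "loki";
        url = "http://localhost:3100";
      }
    ];
    # Hypothetical directory of dashboard JSON files kept in the repository.
    dashboards.settings.providers = [
      {
        name = "default";
        options.path = ./dashboards;
      }
    ];
  };
}
```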