
# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run monitoring02 in parallel with monitoring01 until it is validated, then
switch over using a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring01** (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing) - not actively used
- Pyroscope (continuous profiling) - not actively used

**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days of data could become roughly 150-300 days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Alertmanager │
                     │  (vmalert)      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)

### Phase 2: Set Up VictoriaMetrics Stack

New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.

1. **VictoriaMetrics** (port 8428): [DONE]
   - `services.victoriametrics.enable = true`
   - `retentionPeriod = "3"` (3 months)
   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
   - Static user override (DynamicUser disabled) for credential file access
   - OpenBao token fetch service + 30min refresh timer
   - Apiary bearer token via vault.secrets

2. **vmalert** for alerting rules: [DONE]
   - Points to VictoriaMetrics datasource at localhost:8428
   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
   - No notifier configured during parallel operation (prevents duplicate alerts)

3. **Alertmanager** (port 9093): [DONE]
   - Same configuration as monitoring01 (alerttonotify webhook routing)
   - Will only receive alerts after cutover (vmalert notifier disabled)

4. **Grafana** (port 3000): [DONE]
   - VictoriaMetrics datasource (localhost:8428) as default
   - monitoring01 Prometheus datasource kept for comparison during parallel operation
   - Loki datasource pointing to localhost (after Loki migrated to monitoring02)

5. **Loki** (port 3100): [DONE]
   - Same configuration as monitoring01 in standalone `services/loki/` module
   - Grafana datasource updated to localhost:3100

**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
native push support.
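
As a rough illustration of how the pieces above might fit together in `services/victoriametrics/default.nix`, here is a minimal sketch. It is not the actual module: the single static scrape job, the `ns1.home.2rjus.net` target name, the monitoring01 Prometheus URL, the datasource names, and the use of `notifier.blackhole` to suppress notifications are all assumptions. Option names follow the upstream NixOS modules and may differ between nixpkgs releases.

```nix
# Sketch only; values marked "assumption" are illustrative, not taken from the repo.
{ ... }:
{
  services.victoriametrics = {
    enable = true;
    # A bare number is interpreted by VictoriaMetrics as months.
    retentionPeriod = "3";
    # Prometheus-style scrape config. The real setup generates its jobs from
    # lib/monitoring.nix; this single static job is an assumption.
    prometheusConfig.scrape_configs = [
      {
        job_name = "node-exporter";
        static_configs = [ { targets = [ "ns1.home.2rjus.net:9100" ]; } ];
      }
    ];
  };

  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # Reuse the existing Prometheus rule file as-is.
      rule = [ ../monitoring/rules.yml ];
      # Assumption: evaluate rules but drop notifications during parallel operation.
      "notifier.blackhole" = true;
    };
  };

  services.grafana.provision = {
    enable = true;
    datasources.settings.datasources = [
      # VictoriaMetrics speaks the Prometheus query API, so the "prometheus" type is used.
      { name = "VictoriaMetrics"; type = "prometheus"; url = "http://localhost:8428"; isDefault = true; }
      # Assumption: monitoring01 Prometheus on its default port, kept for comparison.
      { name = "Prometheus (monitoring01)"; type = "prometheus"; url = "http://monitoring01.home.2rjus.net:9090"; }
      { name = "Loki"; type = "loki"; url = "http://localhost:3100"; }
    ];
  };
}
```

Depending on the nixpkgs version, the vmalert options may live under a per-instance attribute instead of `services.vmalert.settings`, so the option paths should be checked against the pinned channel.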

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: Both hosts scrape the same targets
   - Validates VictoriaMetrics is collecting data correctly

2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02

3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work

4. **Validate alerts**: Verify vmalert evaluates rules correctly (no notifier = no notifications)

5. **Compare resource usage**: Monitor disk/memory consumption between hosts
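
The dual log shipping step could look roughly like the following in `system/monitoring/logs.nix`. This is a sketch under assumptions: it presumes Promtail is configured through `services.promtail.configuration`, that monitoring02 resolves as `monitoring02.home.2rjus.net`, and that both Loki instances accept pushes on the standard `/loki/api/v1/push` path.

```nix
# Sketch only: the monitoring02 hostname and the push path are assumptions.
{ ... }:
{
  services.promtail.configuration.clients = [
    # Existing target, kept until cutover.
    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
    # Second client for the parallel run.
    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```

Promtail sends every log entry to each configured client, so both Loki instances should receive the same streams while this is in place.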

### Phase 4: Add monitoring CNAME

Add the CNAME to monitoring02 once the parallel run has been validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`

2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
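
A possible shape for the updated proxy entries is sketched below. It assumes `services/http-proxy/proxy.nix` declares its sites through `services.caddy.virtualHosts`; the actual file may be structured differently and may carry extra directives (TLS, authentication) that are omitted here.

```nix
# Sketch only: assumes plain reverse_proxy blocks with no additional directives.
{ ... }:
{
  services.caddy.virtualHosts = {
    # The "prometheus" name is kept, but the backend is now VictoriaMetrics on 8428.
    "prometheus.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:8428
    '';
    "alertmanager.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:9093
    '';
    "grafana.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:3000
    '';
  };
}
```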

### Phase 6: Enable Alerting

Once ready to cut over:
1. Enable the vmalert notifier on monitoring02 so alerts reach the local Alertmanager
2. Verify test alerts route correctly
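
Phase 2 notes that the vmalert notifier is disabled during parallel operation, so enabling alert delivery could amount to pointing vmalert at the local Alertmanager. The fragment below is a sketch, assuming the upstream `services.vmalert.settings` option layout and Alertmanager listening on localhost:9093; any setting used to suppress notifications during the parallel run would be removed at the same time.

```nix
# Sketch only: assumes Alertmanager runs on the same host on port 9093.
{ ... }:
{
  services.vmalert.settings."notifier.url" = [ "http://localhost:9093" ];
}
```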

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: Prevent duplicate alerts during transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state

## Current Progress

- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
- **Phase 2** complete (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana datasources configured
  - Tempo and Pyroscope deferred (not actively used; can be added later if needed)

## Open Questions

- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles), but it uses its own configuration syntax (formerly called "River") instead of YAML, so it offers less Nix-native ergonomics. The migration could be bundled with monitoring02 to consolidate disruption.

## VictoriaMetrics Service Configuration

Implemented in `services/victoriametrics/default.nix`. Key design decisions:

- **Static user**: The VictoriaMetrics NixOS module uses `DynamicUser`; this is overridden with a
  static `victoriametrics` user so that vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
  reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
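
The static user override described above might be expressed along these lines. This is a sketch, not the code in `services/victoriametrics/default.nix`: it assumes the upstream unit is named `victoriametrics` and that forcing `DynamicUser` off and declaring a matching system user is sufficient for the secrets tooling.

```nix
# Sketch only: user and group names mirror the service name by assumption.
{ lib, ... }:
{
  users.groups.victoriametrics = { };
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };

  systemd.services.victoriametrics.serviceConfig = {
    # mkForce overrides whatever the upstream module sets for these keys.
    DynamicUser = lib.mkForce false;
    User = lib.mkForce "victoriametrics";
    Group = lib.mkForce "victoriametrics";
  };
}
```

With a stable user, files provisioned for `victoriametrics` ahead of service start keep a predictable owner, which is the property the credential files rely on.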

## Rollback Plan

If issues arise after cutover:
1. Point the `monitoring` CNAME at monitoring01
2. Restart monitoring01 services
3. Revert Promtail config to point only to monitoring01
4. Revert http-proxy backends

## Notes

- VictoriaMetrics uses port 8428 vs Prometheus 9090
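- PromQL compatibility is excellent: VictoriaMetrics' query language (MetricsQL) is designed to be backwards compatible with PromQL, so existing queries and rules should need at most minor changes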
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state