# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run monitoring02 in parallel with monitoring01 until it is validated, then
switch over using a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring01** (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)

**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days of data could become roughly 150-300 days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
                     ┌─────────────────┐
                     │  monitoring02   │
                     │  VictoriaMetrics│
                     │  + Grafana      │
     monitoring      │  + Loki         │
     CNAME ──────────│  + Tempo        │
                     │  + Pyroscope    │
                     │  + Alertmanager │
                     │  + vmalert      │
                     └─────────────────┘
                            ▲
                            │ scrapes
            ┌───────────────┼───────────────┐
            │               │               │
       ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
       │  ns1    │    │  ha1     │    │  ...     │
       │ :9100   │    │ :9100    │    │ :9100    │
       └─────────┘    └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host

Use the `create-host` script, which updates `flake.nix` and `terraform/vms.tf` automatically.

1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
2. **Update VM resources** in `terraform/vms.tf`:
   - 4 cores (same as monitoring01)
   - 8GB RAM (double, for VictoriaMetrics headroom)
   - 100GB disk (for 3+ months retention with compression)
3. **Update host configuration**: Import monitoring services
4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
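
As a rough sketch of step 3, the new host's configuration could import the service module created in Phase 2. The relative import path and the presence of a `default.nix` in the module directory are assumptions, not taken from the repository.

```nix
# hosts/monitoring02/configuration.nix (sketch, paths assumed)
{ ... }:
{
  imports = [
    # Assumes services/monitoring/victoriametrics/ contains a default.nix
    ../../services/monitoring/victoriametrics
  ];
}
```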

### Phase 2: Set Up VictoriaMetrics Stack

Create a new service module at `services/monitoring/victoriametrics/` so it can be tested alongside
the existing Prometheus config. Once validated, it can replace the Prometheus module.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
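   - `services.victoriametrics.retentionPeriod = "3"` (3 months; a bare number is counted in months, and VictoriaMetrics rejects an `m` suffix as ambiguous with minutes. Increase later based on disk usage)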
   - Migrate scrape configs via `prometheusConfig`
   - Use native push support (replaces Pushgateway)

2. **vmalert** for alerting rules:
   - `services.vmalert.enable = true`
   - Point to VictoriaMetrics for metrics evaluation
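   - Keep rules in a separate `rules.yml` file (same format as Prometheus)
   - No Alertmanager URL configured during parallel operation (prevents duplicate alerts); use vmalert's `-notifier.blackhole` flag so alerting rules are still evaluated without sending notifications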

3. **Alertmanager** (port 9093):
   - Keep existing configuration (alerttonotify webhook routing)
   - Enable the receiver only at cutover (Phase 6), not during parallel operation

4. **Loki** (port 3100):
   - Same configuration as current

5. **Grafana** (port 3000):
   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
   - Use the existing dashboards on monitoring01 as a reference for content
   - Configure VictoriaMetrics datasource (port 8428)
   - Configure Loki datasource

6. **Tempo** (ports 3200, 3201):
   - Same configuration

7. **Pyroscope** (port 4040):
   - Same Docker-based deployment
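
The Grafana items above could be expressed with the upstream NixOS provisioning options roughly as follows. This is a minimal sketch: the datasource names, the provider name, and the `./dashboards` directory are placeholders, and it assumes VictoriaMetrics and Loki listen on localhost on the ports listed above.

```nix
services.grafana = {
  enable = true;
  settings.server.http_port = 3000;
  provision = {
    enable = true;
    datasources.settings = {
      apiVersion = 1;
      datasources = [
        {
          # VictoriaMetrics serves the Prometheus query API, so the
          # built-in "prometheus" datasource type is used here.
          name = "VictoriaMetrics";
          type = "prometheus";
          url = "http://localhost:8428";
          isDefault = true;
        }
        {
          name = "Loki";
          type = "loki";
          url = "http://localhost:3100";
        }
      ];
    };
    dashboards.settings.providers = [
      {
        name = "declarative";
        # Hypothetical directory of dashboard JSON files kept in the repo
        options.path = ./dashboards;
      }
    ];
  };
};
```

Keeping the dashboards as files in the repository means monitoring02 can be rebuilt without copying any Grafana state from monitoring01.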

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: Both hosts scrape the same targets
   - Validates VictoriaMetrics is collecting data correctly

2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02

3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work

4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)

5. **Compare resource usage**: Monitor disk/memory consumption between hosts
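
Dual log shipping (step 2) might look like the excerpt below. It is a sketch that assumes `logs.nix` sets clients through `services.promtail.configuration`, that monitoring02 is reachable as `monitoring02.home.2rjus.net`, and that both Loki instances accept pushes on the standard `/loki/api/v1/push` path.

```nix
# system/monitoring/logs.nix (excerpt, sketch)
services.promtail.configuration.clients = [
  # Existing Loki instance
  { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
  # New Loki instance; hostname assumed to follow the same naming pattern
  { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
];
```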

### Phase 4: Add monitoring CNAME

Add the CNAME to monitoring02 once it has been validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`

2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
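
The proxy changes could be sketched as follows. How `proxy.nix` is actually structured is not shown in this plan, so the use of `services.caddy.virtualHosts` with `extraConfig` is an assumption; only the hostnames and ports come from the list above.

```nix
# services/http-proxy/proxy.nix (excerpt, sketch)
services.caddy.virtualHosts = {
  "prometheus.home.2rjus.net".extraConfig = ''
    reverse_proxy http://monitoring.home.2rjus.net:8428
  '';
  "alertmanager.home.2rjus.net".extraConfig = ''
    reverse_proxy http://monitoring.home.2rjus.net:9093
  '';
  "grafana.home.2rjus.net".extraConfig = ''
    reverse_proxy http://monitoring.home.2rjus.net:3000
  '';
  "pyroscope.home.2rjus.net".extraConfig = ''
    reverse_proxy http://monitoring.home.2rjus.net:4040
  '';
};
```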

### Phase 6: Enable Alerting

Once ready to cut over:
1. Enable Alertmanager receiver on monitoring02
2. Verify test alerts route correctly
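
For orientation, an Alertmanager definition with a webhook receiver could look like the sketch below. The receiver name, grouping, and webhook URL are placeholders; the real values should be copied from the existing monitoring01 configuration.

```nix
services.prometheus.alertmanager = {
  enable = true;
  port = 9093;
  configuration = {
    route = {
      receiver = "alerttonotify";
      group_by = [ "alertname" ];
    };
    receivers = [
      {
        name = "alerttonotify";
        webhook_configs = [
          # Placeholder URL; use the endpoint from the monitoring01 config
          { url = "http://localhost:5001/alert"; }
        ];
      }
    ];
  };
};
```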

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: Prevent duplicate alerts during transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state

## Open Questions

- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)

## VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

```nix
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  retentionPeriod = "3";  # 3 months (a bare number is counted in months); increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

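# vmalert for alerting rules (notifications blackholed during parallel operation)
# Assumes the nixpkgs module passes vmalert flags through `settings`;
# verify the option names against the pinned nixpkgs revision.
services.vmalert = {
  enable = true;
  settings = {
    "datasource.url" = "http://localhost:8428";
    # Evaluate alerting rules without sending notifications
    "notifier.blackhole" = true;
    # After cutover, drop the blackhole flag and set:
    # "notifier.url" = [ "http://localhost:9093" ];
    rule = [ ./rules.yml ];
  };
};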
```
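
To make the placeholder comments in `scrape_configs` more concrete, a single node-exporter job could look like this. The two hosts come from the architecture diagram and their fully qualified names are assumed; in practice the list is generated from `lib/monitoring.nix` rather than written by hand.

```nix
services.victoriametrics.prometheusConfig.scrape_configs = [
  {
    job_name = "node-exporter";
    static_configs = [
      {
        # Example targets only; the real list is auto-generated
        # from all hosts with static IPs.
        targets = [
          "ns1.home.2rjus.net:9100"
          "ha1.home.2rjus.net:9100"
        ];
      }
    ];
  }
];
```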

## Rollback Plan

If issues arise after cutover:
1. Move `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert Promtail config to point only to monitoring01
4. Revert http-proxy backends
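
Step 1 amounts to moving the option from Phase 4 to the old host. A sketch, assuming monitoring01's configuration lives at `hosts/monitoring01/configuration.nix`:

```nix
# hosts/monitoring01/configuration.nix (rollback sketch)
# Remove the same line from hosts/monitoring02/configuration.nix first,
# so the two hosts do not both claim the name.
homelab.dns.cnames = [ "monitoring" ];
```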

## Notes

- VictoriaMetrics uses port 8428 vs Prometheus 9090
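- VictoriaMetrics queries use MetricsQL, which is designed to be backwards compatible with PromQL, so existing queries are expected to work; spot-check dashboards during parallel operation
- VictoriaMetrics native push replaces Pushgateway (remove the Pushgateway route from http-proxy if it is no longer needed)
- monitoring02 is deployed via OpenTofu using the `create-host` script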
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state