# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for seamless transition.

## Current State

**monitoring02** (10.69.13.24) - **PRIMARY**:
- 4 CPU cores, 8GB RAM, 60GB disk
- VictoriaMetrics with 3-month retention
- vmalert with alerting enabled (routes to local Alertmanager)
- Alertmanager -> alerttonotify -> NATS notification pipeline
- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
- Loki (log aggregation)
- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki

**monitoring01** (10.69.13.13) - **SHUT DOWN**:
- No longer running, pending decommission

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
                    ┌─────────────────┐
                    │  monitoring02   │
                    │  VictoriaMetrics│
                    │  + Grafana      │
     monitoring     │  + Loki         │
     CNAME ─────────│  + Alertmanager │
                    │  (vmalert)      │
                    └─────────────────┘
                             ▲
                             │ scrapes
             ┌───────────────┼───────────────┐
             │               │               │
        ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
        │  ns1    │    │   ha1    │    │   ...    │
        │  :9100  │    │  :9100   │    │  :9100   │
        └─────────┘    └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)

### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]

New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `retentionPeriod = "3"` (3 months)
   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
   - Static user override (DynamicUser disabled) for credential file access
   - OpenBao token fetch service + 30min refresh timer
   - Apiary bearer token via vault.secrets

2. **vmalert** for alerting rules:
   - Points to VictoriaMetrics datasource at localhost:8428
   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
   - Notifier sends to local Alertmanager at localhost:9093

3. **Alertmanager** (port 9093):
   - Same configuration as monitoring01 (alerttonotify webhook routing)
   - alerttonotify imported on monitoring02, routes alerts via NATS

4. **Grafana** (port 3000):
   - VictoriaMetrics datasource (localhost:8428) as default
   - Loki datasource pointing to localhost:3100

5. **Loki** (port 3100):
   - Same configuration as monitoring01 in standalone `services/loki/` module
   - Grafana datasource updated to localhost:3100

**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
native push support.

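The core of the list above can be sketched in Nix roughly as follows. This is a minimal sketch under stated assumptions, not the contents of `services/victoriametrics/`: the option paths are the upstream NixOS ones and may differ between nixpkgs releases, and the relative path to `rules.yml`, the datasource names, and the use of Grafana's `prometheus` datasource type for VictoriaMetrics are all assumptions.

```nix
{ ... }:
{
  # Single-node VictoriaMetrics; the upstream default listen address is :8428.
  services.victoriametrics = {
    enable = true;
    # A value without a unit suffix is interpreted as months.
    retentionPeriod = "3";
  };

  # vmalert evaluates the existing Prometheus-format rules against
  # VictoriaMetrics and forwards firing alerts to the local Alertmanager.
  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      "notifier.url" = [ "http://localhost:9093" ];
      # Assumed relative path from services/victoriametrics/ to the shared rules.
      rule = [ ../monitoring/rules.yml ];
    };
  };

  # Declarative Grafana datasources (names and types are assumptions).
  services.grafana.provision.datasources.settings.datasources = [
    {
      name = "VictoriaMetrics";
      type = "prometheus";
      url = "http://localhost:8428";
      isDefault = true;
    }
    {
      name = "Loki";
      type = "loki";
      url = "http://localhost:3100";
    }
  ];
}
```

Passing the rules file by path is what lets vmalert and the old Prometheus setup share one `rules.yml` without converting it to Nix.
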
### Phase 3: Parallel Operation [COMPLETE]

Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.

### Phase 4: Add monitoring CNAME [COMPLETE]

Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.

### Phase 5: Update References [COMPLETE]

- Moved alertmanager, grafana, prometheus CNAMEs from http-proxy to monitoring02
- Removed corresponding Caddy reverse proxy entries from http-proxy
- monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly

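Serving these names directly from monitoring02 could look something like the sketch below. Only `grafana.home.2rjus.net` and the ports 3000, 9093 and 8428 appear in this document; the other hostnames are extrapolated from the CNAME names, and the mapping of `metrics` to VictoriaMetrics is an assumption.

```nix
{ ... }:
{
  services.caddy = {
    enable = true;
    virtualHosts = {
      # Grafana listens on port 3000.
      "grafana.home.2rjus.net".extraConfig = ''
        reverse_proxy localhost:3000
      '';
      # Assumed hostname; Alertmanager listens on port 9093.
      "alertmanager.home.2rjus.net".extraConfig = ''
        reverse_proxy localhost:9093
      '';
      # Assumed hostname and backend; VictoriaMetrics listens on port 8428.
      "metrics.home.2rjus.net".extraConfig = ''
        reverse_proxy localhost:8428
      '';
    };
  };
}
```
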
### Phase 6: Enable Alerting [COMPLETE]

- Switched vmalert from blackhole mode to local Alertmanager
- alerttonotify service running on monitoring02 (NATS nkey from Vault)
- prometheus-metrics Vault policy added for OpenBao scraping
- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS

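The Alertmanager hop of that pipeline is a plain webhook receiver. A minimal sketch, assuming a hypothetical local listen address for alerttonotify (the real URL, grouping, and routing tree are not stated in this document):

```nix
{ ... }:
{
  services.prometheus.alertmanager = {
    enable = true;
    port = 9093;
    configuration = {
      # Send everything to a single receiver; grouping is illustrative only.
      route = {
        receiver = "alerttonotify";
        group_by = [ "alertname" ];
      };
      receivers = [
        {
          name = "alerttonotify";
          webhook_configs = [
            # Hypothetical endpoint; alerttonotify then publishes to NATS.
            { url = "http://localhost:5001/alert"; }
          ];
        }
      ];
    };
  };
}
```
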
### Phase 7: Cutover and Decommission [IN PROGRESS]

- monitoring01 shut down (2026-02-17)
- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support

**Remaining cleanup (separate branch):**
- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
- [ ] Update `hosts/template2/bootstrap.nix` - Bootstrap Loki URL still points to monitoring01
- [ ] Remove monitoring01 from flake.nix and host configuration
- [ ] Destroy monitoring01 VM in Proxmox
- [ ] Remove monitoring01 from terraform state
- [ ] Remove or archive `services/monitoring/` (Prometheus config)

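For the Promtail item in the checklist above, the change amounts to repointing the client at the new Loki endpoint. The sketch below is hedged throughout: the `loki.home.2rjus.net` hostname is inferred from the `loki` CNAME, and the use of basic auth, the username, and the password file path are hypothetical.

```nix
{ ... }:
{
  services.promtail = {
    enable = true;
    configuration = {
      clients = [
        {
          # Assumed endpoint behind the loki CNAME on monitoring02.
          url = "https://loki.home.2rjus.net/loki/api/v1/push";
          basic_auth = {
            username = "promtail"; # hypothetical
            password_file = "/run/secrets/promtail-loki-auth"; # hypothetical path
          };
        }
      ];
    };
  };
}
```
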
## Completed

- 2026-02-08: Phase 1 - monitoring02 host created
- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
- 2026-02-17: Phase 6 - Alerting enabled, CNAMEs migrated, monitoring01 shut down

## VictoriaMetrics Service Configuration

Implemented in `services/victoriametrics/default.nix`. Key design decisions:

- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
  `victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
  reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets

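The static-user override described above might look like the following. This is a sketch of the general technique, not a copy of `services/victoriametrics/default.nix`; it assumes the upstream unit sets `DynamicUser = true` and does not set `User` or `Group` itself.

```nix
{ lib, ... }:
{
  # Force DynamicUser off so the service runs as a fixed account whose
  # UID is stable and can be granted access to credential files.
  systemd.services.victoriametrics.serviceConfig = {
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };

  # Declare the static system user and group the unit now runs as.
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };
}
```

`lib.mkForce` is needed because the module already assigns `DynamicUser`; a plain assignment would conflict with the upstream definition.
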
## Notes

- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent (VictoriaMetrics' query language, MetricsQL, is designed to be backward compatible with PromQL)
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)