docs: add monitoring migration to VictoriaMetrics plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Plan for migrating from Prometheus to VictoriaMetrics on a new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
219
docs/plans/monitoring-migration-victoriametrics.md
Normal file
@@ -0,0 +1,219 @@
# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for seamless transition.

## Current State

**monitoring01** (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)

**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
                    ┌─────────────────┐
                    │  monitoring02   │
                    │  VictoriaMetrics│
                    │  + Grafana      │
    monitoring      │  + Loki         │
    CNAME ──────────│  + Tempo        │
                    │  + Pyroscope    │
                    │  + Alertmanager │
                    │  (vmalert)      │
                    └─────────────────┘
                             ▲
                             │ scrapes
             ┌───────────────┼───────────────┐
             │               │               │
        ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
        │   ns1   │    │   ha1    │    │   ...    │
        │  :9100  │    │  :9100   │    │  :9100   │
        └─────────┘    └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host

Use the `create-host` script, which handles `flake.nix` and `terraform/vms.tf` automatically.

1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
2. **Update VM resources** in `terraform/vms.tf`:
   - 4 cores (same as monitoring01)
   - 8GB RAM (double, for VictoriaMetrics headroom)
   - 100GB disk (for 3+ months retention with compression)
3. **Update host configuration**: Import monitoring services
4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`

### Phase 2: Set Up VictoriaMetrics Stack

Create a new service module at `services/monitoring/victoriametrics/` for testing alongside the existing
Prometheus config. Once validated, this can replace the Prometheus module.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `services.victoriametrics.retentionPeriod = "3"` (3 months, increase later based on disk usage; a value
     without a suffix is counted in months, and VictoriaMetrics rejects an `m` suffix as ambiguous with minutes)
   - Migrate scrape configs via `prometheusConfig`
   - Use native push support (replaces Pushgateway)

2. **vmalert** for alerting rules:
   - `services.vmalert.enable = true`
   - Point to VictoriaMetrics for metrics evaluation
   - Keep rules in separate `rules.yml` file (same format as Prometheus)
   - No receiver configured during parallel operation (prevents duplicate alerts)

3. **Alertmanager** (port 9093):
   - Keep existing configuration (alerttonotify webhook routing)
   - Only enable receiver after cutover from monitoring01

4. **Loki** (port 3100):
   - Same configuration as current

5. **Grafana** (port 3000):
   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
   - Reference existing dashboards on monitoring01 for content inspiration
   - Configure VictoriaMetrics datasource (port 8428)
   - Configure Loki datasource

6. **Tempo** (ports 3200, 3201):
   - Same configuration

7. **Pyroscope** (port 4040):
   - Same Docker-based deployment
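
The Grafana item above (declarative dashboards plus the VictoriaMetrics and Loki datasources) could be
expressed with the NixOS Grafana provisioning options. This is a minimal sketch, not the final module: it
assumes all services run on the same host, and the `./dashboards` directory and the provider name
`declarative` are hypothetical placeholders.

```nix
{ ... }:
{
  services.grafana = {
    enable = true;
    settings.server.http_port = 3000;

    provision = {
      enable = true;

      datasources.settings = {
        apiVersion = 1;
        datasources = [
          {
            # VictoriaMetrics serves the Prometheus query API, so the
            # built-in Prometheus datasource type is used here.
            name = "VictoriaMetrics";
            type = "prometheus";
            access = "proxy";
            url = "http://localhost:8428";
            isDefault = true;
          }
          {
            name = "Loki";
            type = "loki";
            access = "proxy";
            url = "http://localhost:3100";
          }
        ];
      };

      dashboards.settings = {
        apiVersion = 1;
        providers = [
          {
            name = "declarative"; # hypothetical provider name
            # Hypothetical path: directory of dashboard JSON files kept in the repo
            options.path = ./dashboards;
          }
        ];
      };
    };
  };
}
```

Keeping the dashboard JSON in the repository means a rebuilt monitoring02 comes up with the same dashboards,
which is the point of not importing state from monitoring01.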

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: Both hosts scrape the same targets
   - Validates VictoriaMetrics is collecting data correctly

2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
   - Add second client in `system/monitoring/logs.nix` pointing to monitoring02

3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work

4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)

5. **Compare resource usage**: Monitor disk/memory consumption between hosts
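
For the dual log shipping step above, Promtail accepts a list of clients, so the second Loki instance can be
added next to the first. A minimal sketch, assuming the existing config uses `services.promtail.configuration`
and Loki's standard push path `/loki/api/v1/push` (the plan only states host and port), and that monitoring02
gets an FQDN following the same naming scheme:

```nix
{ ... }:
{
  services.promtail.configuration.clients = [
    # Existing Loki instance on monitoring01
    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
    # Assumed FQDN for the new host; added only for the parallel-operation phase
    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```

Each client keeps its own send queue, so an outage on one Loki instance should not block delivery to the other.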

### Phase 4: Add monitoring CNAME

Add CNAME to monitoring02 once validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`

2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
   - pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
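
The backend mapping listed for `services/http-proxy/proxy.nix` might look roughly like this with the NixOS
Caddy module. This is a sketch under the assumption that the proxy uses `services.caddy.virtualHosts`; the
existing file may be structured differently (for example with a helper function generating the hosts).

```nix
{ ... }:
{
  services.caddy.virtualHosts = {
    # The prometheus.* name is kept, but now fronts VictoriaMetrics on 8428
    "prometheus.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:8428
    '';
    "alertmanager.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:9093
    '';
    "grafana.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:3000
    '';
    "pyroscope.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:4040
    '';
  };
}
```

Pointing every backend at the CNAME rather than at monitoring02 directly means a later host change only
touches DNS.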

### Phase 6: Enable Alerting

Once ready to cut over:
1. Enable Alertmanager receiver on monitoring02
2. Verify test alerts route correctly
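
Enabling the receiver could come down to a single setting on vmalert. The sketch below assumes the nixpkgs
vmalert module exposes command-line flags through a `settings` attribute set (the option layout has changed
between NixOS releases, so verify against the release in use), and that the deployed vmalert version supports
the `-notifier.blackhole` flag for evaluating rules without sending notifications.

```nix
{ ... }:
{
  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      rule = [ ./rules.yml ];

      # Parallel operation (Phases 3-5): evaluate rules but drop notifications.
      # Assumption: the flag is available in the packaged vmalert version.
      # "notifier.blackhole" = true;

      # After cutover (Phase 6): deliver alerts to the local Alertmanager.
      "notifier.url" = [ "http://localhost:9093" ];
    };
  };
}
```

Switching between the two notifier lines is the only change needed at cutover; the rules file and datasource
stay the same.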

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: Prevent duplicate alerts during transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state

## Open Questions

- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)

## VictoriaMetrics Service Configuration

Example NixOS configuration for monitoring02:

```nix
# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  # No suffix means months; VictoriaMetrics rejects an "m" suffix as ambiguous with minutes
  retentionPeriod = "3"; # 3 months, increase based on disk usage
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
# Flags are passed via `settings` (assumed nixpkgs module layout; verify for the NixOS release in use)
services.vmalert = {
  enable = true;
  settings = {
    "datasource.url" = "http://localhost:8428";
    # "notifier.url" = [ "http://localhost:9093" ]; # Enable after cutover
    rule = [ ./rules.yml ];
  };
};
```
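
The `scrape_configs` placeholders in the example above take the same structure as Prometheus scrape jobs,
written as Nix attribute sets. As an illustration only, a hand-written node-exporter job for the `ns1` and
`ha1` hosts from the architecture diagram might look like this; the FQDNs are assumed, and in the real setup
these targets are generated from `lib/monitoring.nix` rather than listed by hand.

```nix
{ ... }:
{
  services.victoriametrics.prometheusConfig.scrape_configs = [
    {
      job_name = "node-exporter";
      static_configs = [
        {
          # Hypothetical FQDNs for the hosts shown in the diagram
          targets = [
            "ns1.home.2rjus.net:9100"
            "ha1.home.2rjus.net:9100"
          ];
        }
      ];
    }
  ];
}
```

Because the attribute names match the Prometheus YAML keys, the list already produced for
`services.prometheus.scrapeConfigs` should translate with little or no change.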

## Rollback Plan

If issues arise after cutover:
1. Move `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert Promtail config to point only to monitoring01
4. Revert http-proxy backends
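
Moving the CNAME back (step 1) would be the mirror image of Phase 4. A sketch, assuming monitoring01 has a
`hosts/monitoring01/configuration.nix` analogous to monitoring02's and that the `homelab.dns.cnames` entry is
removed from monitoring02 in the same change so the name is not declared twice:

```nix
# hosts/monitoring01/configuration.nix (assumed path, mirroring monitoring02)
homelab.dns.cnames = [ "monitoring" ];
```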

## Notes

- VictoriaMetrics listens on port 8428; Prometheus uses 9090
- PromQL compatibility is excellent (VictoriaMetrics' MetricsQL is designed to be backwards compatible with PromQL)
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state