monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Set up the core metrics stack on monitoring02 as Phase 2 of the monitoring migration. VictoriaMetrics replaces Prometheus with identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -61,53 +61,53 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be

 ## Implementation Plan

-### Phase 1: Create monitoring02 Host
+### Phase 1: Create monitoring02 Host [COMPLETE]

-Use `create-host` script which handles flake.nix and terraform/vms.tf automatically.
-
-1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
-2. **Update VM resources** in `terraform/vms.tf`:
-   - 4 cores (same as monitoring01)
-   - 8GB RAM (double, for VictoriaMetrics headroom)
-   - 100GB disk (for 3+ months retention with compression)
-3. **Update host configuration**: Import monitoring services
-4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
+Host created and deployed at 10.69.13.24 (prod tier) with:
+- 4 CPU cores, 8GB RAM, 60GB disk
+- Vault integration enabled
+- NATS-based remote deployment enabled
+- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)

 ### Phase 2: Set Up VictoriaMetrics Stack

-Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
-Prometheus config. Once validated, this can replace the Prometheus module.
+New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
+Imported by monitoring02 alongside the existing Grafana service.

-1. **VictoriaMetrics** (port 8428):
+1. **VictoriaMetrics** (port 8428): [DONE]
    - `services.victoriametrics.enable = true`
-   - `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
-   - Migrate scrape configs via `prometheusConfig`
-   - Use native push support (replaces Pushgateway)
+   - `retentionPeriod = "3"` (3 months)
+   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
+   - Static user override (DynamicUser disabled) for credential file access
+   - OpenBao token fetch service + 30min refresh timer
+   - Apiary bearer token via vault.secrets

-2. **vmalert** for alerting rules:
-   - `services.vmalert.enable = true`
-   - Point to VictoriaMetrics for metrics evaluation
-   - Keep rules in separate `rules.yml` file (same format as Prometheus)
-   - No receiver configured during parallel operation (prevents duplicate alerts)
+2. **vmalert** for alerting rules: [DONE]
+   - Points to VictoriaMetrics datasource at localhost:8428
+   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
+   - No notifier configured during parallel operation (prevents duplicate alerts)

-3. **Alertmanager** (port 9093):
-   - Keep existing configuration (alerttonotify webhook routing)
-   - Only enable receiver after cutover from monitoring01
+3. **Alertmanager** (port 9093): [DONE]
+   - Same configuration as monitoring01 (alerttonotify webhook routing)
+   - Will only receive alerts after cutover (vmalert notifier disabled)

-4. **Loki** (port 3100):
-   - Same configuration as current
+4. **Grafana** (port 3000): [DONE]
+   - VictoriaMetrics datasource (localhost:8428) as default
+   - monitoring01 Prometheus datasource kept for comparison during parallel operation
+   - Loki datasource pointing to monitoring01 (until Loki migrated)

-5. **Grafana** (port 3000):
-   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
-   - Reference existing dashboards on monitoring01 for content inspiration
-   - Configure VictoriaMetrics datasource (port 8428)
-   - Configure Loki datasource
+5. **Loki** (port 3100):
+   - TODO: Same configuration as current

 6. **Tempo** (ports 3200, 3201):
-   - Same configuration
+   - TODO: Same configuration

 7. **Pyroscope** (port 4040):
-   - Same Docker-based deployment
+   - TODO: Same Docker-based deployment

+**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
+pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
+native push support.
+
 ### Phase 3: Parallel Operation

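The Grafana datasource layout listed under item 4 of Phase 2 (local VictoriaMetrics as default, monitoring01 Prometheus and Loki kept during parallel operation) could look roughly like the following. This is a minimal sketch and not the module from this commit: the datasource names and the `monitoring01.example` URLs are assumptions made for illustration, and only `localhost:8428` comes from the notes above.

```nix
# Sketch only: datasource names and monitoring01 URLs are hypothetical.
{ ... }:
{
  services.grafana.provision = {
    enable = true;
    datasources.settings.datasources = [
      {
        name = "VictoriaMetrics";
        # VictoriaMetrics serves the Prometheus query API, so the
        # stock Prometheus datasource type can be used.
        type = "prometheus";
        url = "http://localhost:8428";
        isDefault = true;
      }
      {
        # Kept for side-by-side comparison during parallel operation.
        name = "Prometheus (monitoring01)";
        type = "prometheus";
        url = "http://monitoring01.example:9090"; # hypothetical address
      }
      {
        # Loki stays on monitoring01 until it is migrated.
        name = "Loki (monitoring01)";
        type = "loki";
        url = "http://monitoring01.example:3100"; # hypothetical address
      }
    ];
  };
}
```

Keeping both metrics datasources provisioned makes it possible to run the same query against each backend and compare results before cutover.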
@@ -171,24 +171,9 @@ Once ready to cut over:

 ## Current Progress

-### monitoring02 Host Created (2026-02-08)
-
-Host deployed at 10.69.13.24 (test tier) with:
-- 4 CPU cores, 8GB RAM, 60GB disk
-- Vault integration enabled
-- NATS-based remote deployment enabled
-
-### Grafana with Kanidm OIDC (2026-02-08)
-
-Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):
-- Kanidm OIDC authentication (PKCE enabled)
-- Role mapping: `admins` → Admin, others → Viewer
-- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
-- Local Caddy for TLS termination via internal ACME CA
-
-This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing
-`services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/`
-module once monitoring02 becomes the primary monitoring host.
+- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
+- **Phase 2** in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Grafana datasources configured
+  - Remaining: Loki, Tempo, Pyroscope migration

 ## Open Questions

@@ -198,31 +183,14 @@ module once monitoring02 becomes the primary monitoring host.

 ## VictoriaMetrics Service Configuration

-Example NixOS configuration for monitoring02:
+Implemented in `services/victoriametrics/default.nix`. Key design decisions:

-```nix
-# VictoriaMetrics replaces Prometheus
-services.victoriametrics = {
-  enable = true;
-  retentionPeriod = "3m";  # 3 months, increase based on disk usage
-  prometheusConfig = {
-    global.scrape_interval = "15s";
-    scrape_configs = [
-      # Auto-generated node-exporter targets
-      # Service-specific scrape targets
-      # External targets
-    ];
-  };
-};
-
-# vmalert for alerting rules (no receiver during parallel operation)
-services.vmalert = {
-  enable = true;
-  datasource.url = "http://localhost:8428";
-  # notifier.alertmanager.url = "http://localhost:9093";  # Enable after cutover
-  rule = [ ./rules.yml ];
-};
-```
+- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
+  `victoriametrics` user so vault.secrets and credential files work correctly
+- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
+  reference (no YAML-to-Nix conversion needed)
+- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
+  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets

 ## Rollback Plan

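The static-user and shared-rules decisions listed under "VictoriaMetrics Service Configuration" could be sketched along these lines. This is an illustration under stated assumptions, not the contents of `services/victoriametrics/default.nix`: the relative rules path assumes the module lives in `services/victoriametrics/`, the option paths follow the `settings.rule` wording in the notes and may differ between nixpkgs releases, and the scrape configs, OpenBao token service, and vault.secrets wiring are left out. The source does not say how vmalert is started without a notifier, so that part is only hinted at in a comment.

```nix
# Sketch only: not the module from this commit.
{ lib, ... }:
{
  services.victoriametrics = {
    enable = true;
    # A bare number is interpreted by VictoriaMetrics as months.
    retentionPeriod = "3";
  };

  # Static account so credential files can be owned by a stable user.
  users.groups.victoriametrics = { };
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };

  # Override the DynamicUser setting mentioned in the notes.
  systemd.services.victoriametrics.serviceConfig = {
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };

  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # Assumed relative path to services/monitoring/rules.yml.
      rule = [ ../monitoring/rules.yml ];
      # Enable after cutover so alerts are not sent twice:
      # "notifier.url" = [ "http://localhost:9093" ];
    };
  };
}
```

Referencing the rules file by path keeps a single `rules.yml` shared by Prometheus and vmalert during parallel operation, so the two rule sets cannot drift apart.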