monitoring02: add VictoriaMetrics, vmalert, and Alertmanager

Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:29:34 +01:00
parent c151f31011
commit ef8eeaa2f5
4 changed files with 263 additions and 78 deletions


@@ -61,53 +61,53 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be
## Implementation Plan
### Phase 1: Create monitoring02 Host [COMPLETE]
Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as a test instance (`grafana-test.home.2rjus.net`)
### Phase 2: Set Up VictoriaMetrics Stack
New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.
1. **VictoriaMetrics** (port 8428): [DONE]
   - `retentionPeriod = "3"` (3 months)
   - All scrape configs migrated from Prometheus (22 jobs, including auto-generated targets)
   - Static user override (`DynamicUser` disabled) for credential file access
   - OpenBao token fetch service + 30-minute refresh timer
   - Apiary bearer token via vault.secrets
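The VictoriaMetrics side can be sketched roughly as follows. This is a minimal sketch, not the deployed module: option names follow the nixpkgs `services.victoriametrics` module, and the single scrape job shown is illustrative — the real config carries 22 jobs generated from `lib/monitoring.nix`.

```nix
# Sketch only; the real config lives in services/victoriametrics/default.nix.
{
  services.victoriametrics = {
    enable = true;
    # A bare number means months for VictoriaMetrics' -retentionPeriod flag.
    retentionPeriod = "3";
    # Scrape configs reuse the Prometheus format verbatim.
    prometheusConfig = {
      global.scrape_interval = "15s";
      scrape_configs = [
        {
          # Illustrative job; the module auto-generates the full set.
          job_name = "node-exporter";
          static_configs = [ { targets = [ "10.69.13.24:9100" ]; } ];
        }
      ];
    };
  };
}
```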
2. **vmalert** for alerting rules: [DONE]
   - Points to the VictoriaMetrics datasource at localhost:8428
   - Reuses the existing `services/monitoring/rules.yml` directly via `settings.rule`
   - No notifier configured during parallel operation (prevents duplicate alerts)
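The vmalert setup above can be sketched like this, assuming the nixpkgs `services.vmalert` module's freeform `settings` (which map to CLI flags); the relative path to the shared rules file is illustrative.

```nix
# Sketch: vmalert evaluating the shared rules file against local VictoriaMetrics.
{
  services.vmalert = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # Reuse the existing Prometheus-format rules file as-is.
      rule = [ "${../monitoring/rules.yml}" ];
      # No notifier during parallel operation; enable after cutover:
      # "notifier.url" = [ "http://localhost:9093" ];
    };
  };
}
```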
3. **Alertmanager** (port 9093): [DONE]
   - Same configuration as monitoring01 (alerttonotify webhook routing)
   - Will only receive alerts after cutover (vmalert notifier disabled)
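The Alertmanager shape can be sketched as below. This is an assumption-laden sketch: the option path follows the NixOS `services.prometheus.alertmanager` module, and the webhook URL and grouping are placeholders standing in for the real alerttonotify routing copied from monitoring01.

```nix
# Sketch: same routing shape as monitoring01; webhook URL is a placeholder.
{
  services.prometheus.alertmanager = {
    enable = true;
    port = 9093;
    configuration = {
      route = {
        receiver = "alerttonotify";
        group_by = [ "alertname" ];
      };
      receivers = [
        {
          name = "alerttonotify";
          # Placeholder endpoint for the alerttonotify webhook.
          webhook_configs = [ { url = "http://localhost:8080/alert"; } ];
        }
      ];
    };
  };
}
```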
4. **Grafana** (port 3000): [DONE]
   - VictoriaMetrics datasource (localhost:8428) as default
   - monitoring01 Prometheus datasource kept for comparison during parallel operation
   - Loki datasource pointing to monitoring01 (until Loki is migrated)
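The datasource arrangement during parallel operation can be sketched as follows; the monitoring01 hostnames and ports are assumptions for illustration, and VictoriaMetrics is provisioned as a `prometheus`-type datasource because it speaks the Prometheus query API.

```nix
# Sketch: declarative Grafana datasources during parallel operation.
{
  services.grafana.provision.datasources.settings.datasources = [
    {
      name = "VictoriaMetrics";
      type = "prometheus"; # VictoriaMetrics serves the Prometheus query API
      url = "http://localhost:8428";
      isDefault = true;
    }
    {
      name = "Prometheus (monitoring01)";
      type = "prometheus";
      url = "http://monitoring01.home.2rjus.net:9090"; # assumed hostname/port
    }
    {
      name = "Loki (monitoring01)";
      type = "loki";
      url = "http://monitoring01.home.2rjus.net:3100"; # assumed hostname
    }
  ];
}
```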
5. **Loki** (port 3100):
   - TODO: Same configuration as current
6. **Tempo** (ports 3200, 3201):
   - TODO: Same configuration
7. **Pyroscope** (port 4040):
   - TODO: Same Docker-based deployment
**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
native push support.
### Phase 3: Parallel Operation
@@ -171,24 +171,9 @@ Once ready to cut over:
## Current Progress
- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
- **Phase 2** in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, and Grafana datasources configured
- Remaining: Loki, Tempo, Pyroscope migration
## Open Questions
@@ -198,31 +183,14 @@ module once monitoring02 becomes the primary monitoring host.
## VictoriaMetrics Service Configuration
Implemented in `services/victoriametrics/default.nix`. Key design decisions:
- **Static user**: the VictoriaMetrics NixOS module uses `DynamicUser`; this is overridden with a static `victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via a `settings.rule` path reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: uses the same `lib/monitoring.nix` functions and `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
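The static-user override can be sketched like this. It is a sketch under stated assumptions: the user/group names follow the bullet above, `lib.mkForce` flips the module's `DynamicUser` default, and the token-refresh unit name is illustrative (only the 30-minute cadence comes from the Phase 2 notes).

```nix
# Sketch: replace DynamicUser with a stable system user so credential
# files written by vault.secrets stay readable across restarts.
{ lib, ... }:
{
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };

  systemd.services.victoriametrics.serviceConfig = {
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };

  # 30-minute OpenBao token refresh matching the Phase 2 timer
  # (unit name is illustrative; the fetch service itself is omitted).
  systemd.timers."openbao-token-refresh" = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "5min";
      OnUnitActiveSec = "30min";
    };
  };
}
```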
## Rollback Plan