monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
CI status: nix flake check / flake-check (push) was cancelled.

Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel operation)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:29:34 +01:00
parent c151f31011
commit ef8eeaa2f5
4 changed files with 263 additions and 78 deletions


@@ -61,53 +61,53 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be
 ## Implementation Plan
 
-### Phase 1: Create monitoring02 Host
+### Phase 1: Create monitoring02 Host [COMPLETE]
 
-Use `create-host` script which handles flake.nix and terraform/vms.tf automatically.
-
-1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
-2. **Update VM resources** in `terraform/vms.tf`:
-   - 4 cores (same as monitoring01)
-   - 8GB RAM (double, for VictoriaMetrics headroom)
-   - 100GB disk (for 3+ months retention with compression)
-3. **Update host configuration**: Import monitoring services
-4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
+Host created and deployed at 10.69.13.24 (prod tier) with:
+
+- 4 CPU cores, 8GB RAM, 60GB disk
+- Vault integration enabled
+- NATS-based remote deployment enabled
+- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
 
 ### Phase 2: Set Up VictoriaMetrics Stack
 
-Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
-Prometheus config. Once validated, this can replace the Prometheus module.
+New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
+Imported by monitoring02 alongside the existing Grafana service.
 
-1. **VictoriaMetrics** (port 8428):
+1. **VictoriaMetrics** (port 8428): [DONE]
    - `services.victoriametrics.enable = true`
-   - `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
-   - Migrate scrape configs via `prometheusConfig`
-   - Use native push support (replaces Pushgateway)
-2. **vmalert** for alerting rules:
-   - `services.vmalert.enable = true`
-   - Point to VictoriaMetrics for metrics evaluation
-   - Keep rules in separate `rules.yml` file (same format as Prometheus)
-   - No receiver configured during parallel operation (prevents duplicate alerts)
-3. **Alertmanager** (port 9093):
-   - Keep existing configuration (alerttonotify webhook routing)
-   - Only enable receiver after cutover from monitoring01
-4. **Loki** (port 3100):
-   - Same configuration as current
-5. **Grafana** (port 3000):
-   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
-   - Reference existing dashboards on monitoring01 for content inspiration
-   - Configure VictoriaMetrics datasource (port 8428)
-   - Configure Loki datasource
+   - `retentionPeriod = "3"` (3 months)
+   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
+   - Static user override (DynamicUser disabled) for credential file access
+   - OpenBao token fetch service + 30min refresh timer
+   - Apiary bearer token via vault.secrets
+2. **vmalert** for alerting rules: [DONE]
+   - Points to VictoriaMetrics datasource at localhost:8428
+   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
+   - No notifier configured during parallel operation (prevents duplicate alerts)
+3. **Alertmanager** (port 9093): [DONE]
+   - Same configuration as monitoring01 (alerttonotify webhook routing)
+   - Will only receive alerts after cutover (vmalert notifier disabled)
+4. **Grafana** (port 3000): [DONE]
+   - VictoriaMetrics datasource (localhost:8428) as default
+   - monitoring01 Prometheus datasource kept for comparison during parallel operation
+   - Loki datasource pointing to monitoring01 (until Loki migrated)
+5. **Loki** (port 3100):
+   - TODO: Same configuration as current
 6. **Tempo** (ports 3200, 3201):
-   - Same configuration
+   - TODO: Same configuration
 7. **Pyroscope** (port 4040):
-   - Same Docker-based deployment
+   - TODO: Same Docker-based deployment
+
+**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
+pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
+native push support.
 
 ### Phase 3: Parallel Operation
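The note on pushgateway above relies on VictoriaMetrics accepting pushed metrics directly in Prometheus text exposition format on its `/api/v1/import/prometheus` endpoint. A minimal sketch of what a batch job's push could look like; the metric name, labels, and host are hypothetical, and the curl itself is commented out since it requires a live instance:

```shell
# Prometheus text-format sample a batch job might push; the metric name,
# labels, and timestamp value here are made up for illustration.
METRIC='backup_last_success_timestamp_seconds{job="nightly-backup"} 1739750000'
echo "$METRIC"

# Push to VictoriaMetrics' native import endpoint (assumed hostname;
# uncomment against a live instance):
# curl -sf --data-binary "$METRIC" \
#   http://monitoring02.home.2rjus.net:8428/api/v1/import/prometheus
```

This is why no Pushgateway job appears in the monitoring02 scrape configs: pushed samples land in storage directly instead of being scraped from an intermediary.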
@@ -171,24 +171,9 @@ Once ready to cut over:
 ## Current Progress
 
-### monitoring02 Host Created (2026-02-08)
-
-Host deployed at 10.69.13.24 (test tier) with:
-
-- 4 CPU cores, 8GB RAM, 60GB disk
-- Vault integration enabled
-- NATS-based remote deployment enabled
-
-### Grafana with Kanidm OIDC (2026-02-08)
-
-Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):
-
-- Kanidm OIDC authentication (PKCE enabled)
-- Role mapping: `admins` → Admin, others → Viewer
-- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
-- Local Caddy for TLS termination via internal ACME CA
-
-This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing
-`services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/`
-module once monitoring02 becomes the primary monitoring host.
+- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
+- **Phase 2** in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Grafana datasources configured
+- Remaining: Loki, Tempo, Pyroscope migration
 
 ## Open Questions
@@ -198,31 +183,14 @@ module once monitoring02 becomes the primary monitoring host.
 ## VictoriaMetrics Service Configuration
 
-Example NixOS configuration for monitoring02:
-
-```nix
-# VictoriaMetrics replaces Prometheus
-services.victoriametrics = {
-  enable = true;
-  retentionPeriod = "3m"; # 3 months, increase based on disk usage
-  prometheusConfig = {
-    global.scrape_interval = "15s";
-    scrape_configs = [
-      # Auto-generated node-exporter targets
-      # Service-specific scrape targets
-      # External targets
-    ];
-  };
-};
-
-# vmalert for alerting rules (no receiver during parallel operation)
-services.vmalert = {
-  enable = true;
-  datasource.url = "http://localhost:8428";
-  # notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
-  rule = [ ./rules.yml ];
-};
-```
+Implemented in `services/victoriametrics/default.nix`. Key design decisions:
+
+- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
+  `victoriametrics` user so vault.secrets and credential files work correctly
+- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
+  reference (no YAML-to-Nix conversion needed)
+- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
+  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
 
 ## Rollback Plan


@@ -2,5 +2,6 @@
   imports = [
     ./configuration.nix
     ../../services/grafana
+    ../../services/victoriametrics
   ];
 }


@@ -34,15 +34,21 @@
     };
   };
 
-  # Declarative datasources pointing to monitoring01
+  # Declarative datasources
   provision.datasources.settings = {
     apiVersion = 1;
     datasources = [
       {
-        name = "Prometheus";
+        name = "VictoriaMetrics";
+        type = "prometheus";
+        url = "http://localhost:8428";
+        isDefault = true;
+        uid = "victoriametrics";
+      }
+      {
+        name = "Prometheus (monitoring01)";
         type = "prometheus";
         url = "http://monitoring01.home.2rjus.net:9090";
-        isDefault = true;
         uid = "prometheus";
       }
       {


@@ -0,0 +1,210 @@
{ self, config, lib, pkgs, ... }:
let
monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ../monitoring/external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token-vm";
runtimeInputs = [ pkgs.curl pkgs.jq ];
text = ''
VAULT_ADDR="https://vault01.home.2rjus.net:8200"
APPROLE_DIR="/var/lib/vault/approle"
OUTPUT_FILE="/run/secrets/victoriametrics/openbao-token"
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
echo "AppRole credentials not found at $APPROLE_DIR" >&2
exit 1
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
AUTH_RESPONSE=$(curl -sf -k -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login")
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "Failed to extract Vault token from response" >&2
exit 1
fi
# Write token to file
mkdir -p "$(dirname "$OUTPUT_FILE")"
echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
chown victoriametrics:victoriametrics "$OUTPUT_FILE"
chmod 0400 "$OUTPUT_FILE"
echo "Successfully fetched OpenBao token"
'';
};
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
{
job_name = "node-exporter";
static_configs = nodeExporterTargets;
}
# Systemd exporter on all hosts (same targets, different port)
{
job_name = "systemd-exporter";
static_configs = map
(cfg: cfg // {
targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
})
nodeExporterTargets;
}
# Local monitoring services
{
job_name = "victoriametrics";
static_configs = [{ targets = [ "localhost:8428" ]; }];
}
{
job_name = "loki";
static_configs = [{ targets = [ "localhost:3100" ]; }];
}
{
job_name = "grafana";
static_configs = [{ targets = [ "localhost:3000" ]; }];
}
{
job_name = "alertmanager";
static_configs = [{ targets = [ "localhost:9093" ]; }];
}
# Caddy metrics from nix-cache02
{
job_name = "nix-cache_caddy";
scheme = "https";
static_configs = [{ targets = [ "nix-cache.home.2rjus.net" ]; }];
}
# OpenBao metrics with bearer token auth
{
job_name = "openbao";
scheme = "https";
metrics_path = "/v1/sys/metrics";
params = { format = [ "prometheus" ]; };
static_configs = [{ targets = [ "vault01.home.2rjus.net:8200" ]; }];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/victoriametrics/openbao-token";
};
}
# Apiary external service
{
job_name = "apiary";
scheme = "https";
scrape_interval = "60s";
static_configs = [{ targets = [ "apiary.t-juice.club" ]; }];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/victoriametrics-apiary-token";
};
}
] ++ autoScrapeConfigs;
in
{
# Static user for VictoriaMetrics (overrides DynamicUser) so vault.secrets
# and credential files can be owned by this user
users.users.victoriametrics = {
isSystemUser = true;
group = "victoriametrics";
};
users.groups.victoriametrics = { };
# Override DynamicUser since we need a static user for credential file access
systemd.services.victoriametrics.serviceConfig = {
DynamicUser = lib.mkForce false;
User = "victoriametrics";
Group = "victoriametrics";
};
# Systemd service to fetch AppRole token for OpenBao scraping
systemd.services.victoriametrics-openbao-token = {
description = "Fetch OpenBao token for VictoriaMetrics metrics scraping";
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
before = [ "victoriametrics.service" ];
requiredBy = [ "victoriametrics.service" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe fetchOpenbaoToken;
};
};
# Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
systemd.timers.victoriametrics-openbao-token = {
description = "Refresh OpenBao token for VictoriaMetrics";
wantedBy = [ "timers.target" ];
timerConfig = {
OnBootSec = "5min";
OnUnitActiveSec = "30min";
RandomizedDelaySec = "5min";
};
};
# Fetch apiary bearer token from Vault
vault.secrets.victoriametrics-apiary-token = {
secretPath = "hosts/monitoring01/apiary-token";
extractKey = "password";
owner = "victoriametrics";
group = "victoriametrics";
services = [ "victoriametrics" ];
};
services.victoriametrics = {
enable = true;
retentionPeriod = "3"; # 3 months
# Disable config check since we reference external credential files
checkConfig = false;
prometheusConfig = {
global.scrape_interval = "15s";
scrape_configs = scrapeConfigs;
};
};
# vmalert for alerting rules - no notifier during parallel operation
services.vmalert.instances.default = {
enable = true;
settings = {
"datasource.url" = "http://localhost:8428";
# Notifier disabled during parallel operation to prevent duplicate alerts
# Uncomment after cutover from monitoring01:
# "notifier.url" = [ "http://localhost:9093" ];
"rule" = [ ../monitoring/rules.yml ];
};
};
# Alertmanager - same config as monitoring01 but will only receive
# alerts after cutover (vmalert notifier is disabled above)
services.prometheus.alertmanager = {
enable = true;
configuration = {
global = { };
route = {
receiver = "webhook_natstonotify";
group_wait = "30s";
group_interval = "5m";
repeat_interval = "1h";
group_by = [ "alertname" ];
};
receivers = [
{
name = "webhook_natstonotify";
webhook_configs = [
{
url = "http://localhost:5001/alert";
}
];
}
];
};
};
}
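The `fetch-openbao-token-vm` script above hinges on a single jq extraction from the AppRole login response. A runnable sketch of that step in isolation, using a canned response (the token value is invented; the JSON shape follows the Vault AppRole login API):

```shell
# Canned AppRole login response (shape per the Vault API; token is fake).
AUTH_RESPONSE='{"auth":{"client_token":"hvs.CAESIexample","lease_duration":3600}}'

# Same extraction and null-check the fetch script performs.
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
  echo "Failed to extract Vault token from response" >&2
  exit 1
fi
echo "$VAULT_TOKEN"  # prints hvs.CAESIexample
```

The `jq -r '.auth.client_token'` form yields the literal string `null` (not an empty string) when the key is absent, which is why the script checks both conditions before writing the token file.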