monitoring: auto-generate Prometheus scrape targets from host configs #16

Merged
torjus merged 2 commits from monitoring-improvements into master 2026-02-04 23:53:46 +00:00

Summary

  • Add homelab.monitoring NixOS options module (enable, scrapeTargets) following the same pattern as homelab.dns
  • Add lib/monitoring.nix library functions to auto-generate Prometheus scrape targets from flake host configurations
  • Add services/monitoring/external-targets.nix for non-flake hosts (gunter, restic)
  • Declare scrapeTargets in service modules: step-ca, home-assistant, caddy, jellyfin, nix-cache caddy, wireguard exporter
  • Replace hardcoded Prometheus target lists with auto-generated configs — new hosts are automatically discovered
  • Fix grafana scrape target port (3100 → 3000)
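
For illustration, an options module of this shape might look roughly like the sketch below. Only `homelab.monitoring`, `enable`, and `scrapeTargets` are named in this PR; the sub-option names (`job_name`, `port`, `metrics_path`) and the default values are assumptions, not the repository's actual code.

```nix
# Hypothetical sketch of a homelab.monitoring options module.
# Sub-option names and defaults are assumptions made for illustration.
{ lib, ... }:
let
  inherit (lib) mkOption types;

  # One scrape target exposed by a service on this host.
  scrapeTarget = types.submodule {
    options = {
      job_name = mkOption {
        type = types.str;
        description = "Prometheus job this target is grouped under.";
      };
      port = mkOption {
        type = types.port;
        description = "Port the exporter listens on.";
      };
      metrics_path = mkOption {
        type = types.str;
        default = "/metrics";
        description = "HTTP path serving the metrics.";
      };
    };
  };
in
{
  options.homelab.monitoring = {
    enable = mkOption {
      type = types.bool;
      default = true; # assumed: hosts are scraped unless they opt out
      description = "Whether this host is included in generated scrape configs.";
    };
    scrapeTargets = mkOption {
      type = types.listOf scrapeTarget;
      default = [ ];
      description = "Additional scrape targets declared by service modules.";
    };
  };
}
```

Under those assumptions, a service module would declare its target next to the service definition, for example `homelab.monitoring.scrapeTargets = [ { job_name = "jellyfin"; port = 8096; } ];` (the port here is a placeholder). Keeping the declaration in the service module means the target follows the service when it moves to another host.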

Alert rules cleanup

  • Standardize naming to snake_case (SmartCriticalWarning → smart_critical_warning, etc.)
  • Fix zigbee2qmtt_down typo → zigbee2mqtt_down
  • Remove duplicate pushgateway_not_running alert
  • Add missing for: 5m clauses to all monitoring_rules alerts
  • Remove hardcoded WireGuard public key from wireguard_handshake_timeout

New alerts

| Alert | Group | Severity |
|---|---|---|
| certificate_expiring_soon | certificate_rules | warning |
| certificate_check_error | certificate_rules | warning |
| step_ca_certificate_expiring | certificate_rules | critical |
| pve_node_down | proxmox_rules | critical |
| pve_guest_stopped | proxmox_rules | warning |
| caddy_upstream_unhealthy | http_proxy_rules | warning |
| caddy_high_error_rate | http_proxy_rules | warning |
| smartctl_high_temperature | smartctl_rules | warning |
| filesystem_filling_up | common_rules | warning |
| systemd_not_running | common_rules | critical |
| high_file_descriptors | common_rules | warning |
| host_reboot | common_rules | info |
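
As a rough illustration of what one of these rules could look like, here is `filesystem_filling_up` written as a Nix attribute set. The alert name, group, and severity come from the table above; the PromQL expression, the 6h/24h windows, the `for` duration, and the annotation text are assumptions, since the PR description only mentions a filesystem prediction alert.

```nix
# Hypothetical sketch: expression, thresholds, and durations are assumptions.
{
  groups = [
    {
      name = "common_rules";
      rules = [
        {
          alert = "filesystem_filling_up";
          # Fire when a linear fit over the last 6h predicts that available
          # space drops below zero within the next 24h.
          expr = ''predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24 * 3600) < 0'';
          for = "1h";
          labels.severity = "warning";
          annotations.summary = "Filesystem on {{ $labels.instance }} is predicted to fill up within 24h";
        }
      ];
    }
  ];
}
```

Because `services.prometheus.rules` takes a list of strings, an attribute set like this could be passed through `builtins.toJSON` before being handed to the module; whether this repository does that or keeps the rules as YAML is not stated here.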

Node-exporter target coverage

Previously, the node-exporter job had 10 hardcoded targets plus 1 external. It now covers 17 auto-discovered flake hosts plus 1 external, adding: auth01, media1, ns3, ns4, vault01, vaulttest01, nixos-test1.
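
The discovery step can be sketched as a small library function over the flake's `nixosConfigurations`. This is a minimal sketch under stated assumptions: the function name `nodeExporterTargets`, the bare-hostname target format, and the opt-out fallback are hypothetical; only the default node-exporter port (9100) and the `homelab.monitoring.enable` option name are known.

```nix
# Hypothetical sketch of a helper in the style of lib/monitoring.nix.
# Function name and target format are assumptions.
{ lib }:
{
  # `hosts` is expected to be the flake's nixosConfigurations attrset.
  # Returns a list of "host:port" strings for every host that has not
  # disabled monitoring.
  nodeExporterTargets =
    hosts:
    let
      # Keep hosts whose homelab.monitoring.enable is true; hosts that do
      # not define the option at all are assumed to be monitored.
      enabled = lib.filterAttrs (
        _: host: host.config.homelab.monitoring.enable or true
      ) hosts;
    in
    # 9100 is the default node_exporter listen port.
    lib.mapAttrsToList (name: _: "${name}:9100") enabled;
}
```

Called as `nodeExporterTargets self.nixosConfigurations`, the result could feed the `static_configs` of a node-exporter scrape job, so adding a host to the flake would be enough for it to be scraped.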

Test plan

  • nix build .#nixosConfigurations.monitoring01.config.system.build.toplevel — Prometheus config and rules validation pass
  • nix build for ca, ha1, http-proxy, jelly01, nix-cache01 — all hosts with new scrapeTargets build successfully
  • After deploy: verify all targets appear in Prometheus UI (/targets)
  • After deploy: verify no spurious alerts fire from new rules
torjus added 1 commit 2026-02-04 23:50:21 +00:00
monitoring: auto-generate Prometheus scrape targets from host configs
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m49s
Run nix flake check / flake-check (push) Has been cancelled
dd1b64de27
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.

Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.

Fixes grafana scrape target port from 3100 to 3000.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
torjus added 1 commit 2026-02-04 23:52:52 +00:00
docs: document monitoring auto-generation in CLAUDE.md
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m33s
Run nix flake check / flake-check (pull_request) Successful in 6m48s
e7980978c7
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
torjus merged commit da9dd02d10 into master 2026-02-04 23:53:46 +00:00
torjus deleted branch monitoring-improvements 2026-02-04 23:53:46 +00:00
Reference: torjus/nixos-servers#16