Commit Graph

11 Commits

Author SHA1 Message Date
59e1962d75 auth01: decommission host and remove authelia/lldap services
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 18m1s
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.

Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:35:45 +01:00
c515a6b4e1 home-assistant: fix zigbee sensor battery reporting
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.

Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).

Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:41:07 +01:00
3cccfc0487 monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:44:13 +01:00
28b8d7c115 monitoring: increase high_cpu_load duration for nix-cache01 to 2h
nix-cache01 regularly hits high CPU during nix builds, causing flappy
alerts. Keep the 15m threshold for all other hosts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:28:48 +01:00
3a9a47f1ad monitoring: exclude step-ca serving cert from general expiry alert
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m23s
Run nix flake check / flake-check (pull_request) Failing after 4m46s
The step-ca serving certificate is auto-renewed with a 24h lifetime,
so it always triggers the general < 86400s threshold. Exclude it and
add a dedicated step_ca_serving_cert_expiring alert at < 1h instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:12:42 +01:00
dd1b64de27 monitoring: auto-generate Prometheus scrape targets from host configs
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m49s
Run nix flake check / flake-check (push) Has been cancelled
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.

Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.

Fixes grafana scrape target port from 3100 to 3000.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:49:07 +01:00
14aa3a9340 Remove non-working timer rule
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m3s
Periodic flake update / flake-update (push) Successful in 3m9s
2025-05-29 10:15:40 +02:00
797f915939 Add monitoring rules for monitoring services
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-29 10:09:27 +02:00
3785b8047a Fix alert name for build-flakes alert
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m34s
Periodic flake update / flake-update (push) Successful in 3m3s
2025-05-28 21:28:04 +02:00
fb1a36a846 Rework build-flakes alert rules
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-28 21:26:04 +02:00
fe2e87658a Move prometheus roles to external file
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m7s
2025-05-18 14:54:09 +02:00