20 Commits

Author SHA1 Message Date
b03e2e8ee4 monitoring: add alerts for homelab-deploy build failures
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:45:07 +01:00
75210805d5 nix-cache01: decommission and remove all references
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Removed:
- hosts/nix-cache01/ directory
- services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder)
- Vault secret and AppRole for nix-cache01
- Old signing key variable from terraform
- Old trusted public key from system/nix.nix

Updated:
- flake.nix: removed nixosConfiguration
- README.md: nix-cache01 -> nix-cache02
- Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02
- Simplified proxy.nix (no longer needs hostname conditional)

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:40:51 +01:00
8e1753c2c8 monitoring: fix blackbox rules and add force-push policy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Move certificate alert rules to rules.yml instead of adding them as a
separate rules string in blackbox.nix. The previous approach caused a
YAML parse error due to duplicate 'groups' keys.

Also add policy to CLAUDE.md: never force push to master.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:26:05 +01:00
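
Note on 8e1753c2c8: the duplicate 'groups' problem can be illustrated with a minimal sketch. It assumes each rules document is modeled as a Python dict with one top-level `groups` list. The function `merge_rule_groups` and the group name `node_rules` are hypothetical and do not come from the repository. Appending a second rules string produces two top-level `groups` keys, whereas merging the lists keeps a single key.

```python
def merge_rule_groups(*rule_docs):
    """Combine several rule documents into one with a single 'groups' key."""
    merged = {"groups": []}
    for doc in rule_docs:
        # Each document contributes its groups to the one shared list.
        merged["groups"].extend(doc.get("groups", []))
    return merged


# Hypothetical stand-ins for the main rules file and the certificate rules.
base_rules = {"groups": [{"name": "node_rules", "rules": []}]}
cert_rules = {"groups": [{"name": "certificate_rules", "rules": []}]}

combined = merge_rule_groups(base_rules, cert_rules)
```

The commit itself takes the simpler route of moving the certificate rules into rules.yml, which gives the same single-key result without a merge step.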
ffad2dd205 monitoring: increase zigbee_sensor_stale threshold to 4 hours
The 2-hour threshold was too aggressive for temperature sensors in
stable environments. Historical data shows gaps up to 2.75 hours when
temperature hasn't changed (Home Assistant only updates last_updated
when values change). Increasing to 4 hours avoids false positives
while still catching genuine failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 16:10:54 +01:00
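
Note on ffad2dd205: the reasoning behind the 4-hour threshold can be sketched as a simple staleness check. This is an illustration only, not the alert expression used in the repository. The function `is_stale` and the sample timestamps are hypothetical; the 2.75-hour gap and the 2-hour and 4-hour thresholds come from the commit message.

```python
from datetime import datetime, timedelta, timezone

STALE_THRESHOLD = timedelta(hours=4)  # threshold chosen in the commit


def is_stale(last_updated, now, threshold=STALE_THRESHOLD):
    """Return True when the sensor has been silent longer than the threshold."""
    return now - last_updated > threshold


# Hypothetical sample times for illustration.
now = datetime(2026, 2, 9, 16, 0, tzinfo=timezone.utc)
quiet_sensor = now - timedelta(hours=2, minutes=45)  # longest gap seen in history
dead_sensor = now - timedelta(hours=5)               # a genuinely failed sensor
```

With the old 2-hour threshold, `quiet_sensor` would have been flagged even though it was healthy. With 4 hours, only `dead_sensor` is flagged.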
8ec2a083bd pgdb1: decommission postgresql host
Remove pgdb1 host configuration and postgres service module.
The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL.

Removed:
- hosts/pgdb1/ - host configuration
- services/postgres/ - service module (only used by pgdb1)
- postgres_rules from monitoring rules
- rebuild-all.sh (obsolete script)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 22:54:50 +01:00
21db7e9573 acme: migrate from step-ca to OpenBao PKI
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).

- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:20:10 +01:00
7d291f85bf monitoring: propagate host labels to Prometheus scrape targets
Extract homelab.host metadata (tier, priority, role, labels) from host
configurations and propagate them to Prometheus scrape targets. This
enables semantic alert filtering using labels instead of hardcoded
instance names.

Changes:
- lib/monitoring.nix: Extract host metadata, group targets by labels
- prometheus.nix: Use structured static_configs with labels
- rules.yml: Replace instance filters with role-based filters

Example labels in Prometheus:
- ns1/ns2: role=dns, dns_role=primary/secondary
- nix-cache01: role=build-host
- testvm*: tier=test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:04:50 +01:00
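
Note on 7d291f85bf: grouping targets by labels, as described for lib/monitoring.nix, might look roughly like the sketch below. The real implementation is written in Nix; this Python version only illustrates the idea. The function `group_targets_by_labels`, the port 9100, and the hostnames testvm01/testvm02 are assumptions. The label values are taken from the examples in the commit message.

```python
from collections import defaultdict


def group_targets_by_labels(hosts):
    """Group scrape targets that share an identical label set.

    Returns a list shaped like Prometheus static_configs entries.
    """
    grouped = defaultdict(list)
    for target, labels in hosts.items():
        # A sorted tuple of items makes the label set hashable and order-free.
        key = tuple(sorted(labels.items()))
        grouped[key].append(target)
    return [
        {"targets": sorted(targets), "labels": dict(key)}
        for key, targets in sorted(grouped.items())
    ]


# Hypothetical host metadata; port 9100 is assumed for illustration.
hosts = {
    "ns1:9100": {"role": "dns", "dns_role": "primary"},
    "ns2:9100": {"role": "dns", "dns_role": "secondary"},
    "nix-cache01:9100": {"role": "build-host"},
    "testvm01:9100": {"tier": "test"},
    "testvm02:9100": {"tier": "test"},
}

static_configs = group_targets_by_labels(hosts)
```

Once the labels are attached to the targets, an alert rule can match on `role` or `tier` rather than listing instance names.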
881e70df27 monitoring: relax systemd_not_running alert threshold
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Increase duration from 5m to 10m and demote severity from critical to
warning. Brief degraded states during nixos-rebuild are normal and were
causing false positive alerts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:22:29 +01:00
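
Note on 881e70df27: the effect of lengthening the alert duration can be sketched with a simplified model of a "for" clause, in which an alert fires only after its condition has held continuously for the whole duration. This is an approximation for illustration and not Prometheus's actual evaluation code. The function `fires`, the 60-second sampling interval, and the 6-minute degraded window are all hypothetical.

```python
def fires(samples, for_seconds):
    """Return True if the condition holds continuously for at least for_seconds.

    samples is a time-ordered list of (unix_time, condition_is_true) pairs.
    """
    pending_since = None
    for ts, active in samples:
        if not active:
            pending_since = None      # condition cleared, reset the timer
        elif pending_since is None:
            pending_since = ts        # condition just became true
        if pending_since is not None and ts - pending_since >= for_seconds:
            return True
    return False


# Hypothetical rebuild: degraded from t=0 to t=360 seconds, then healthy.
samples = [(t, t <= 360) for t in range(0, 900, 60)]
```

In this model the 6-minute degraded window triggers the alert at the old 5m duration but not at the new 10m duration.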
15c00393f1 monitoring: increase zigbee_sensor_stale threshold to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m59s
Sensors report every ~45-50 minutes on average, so 1 hour was too tight.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:26:56 +01:00
59e1962d75 auth01: decommission host and remove authelia/lldap services
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 18m1s
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.

Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:35:45 +01:00
c515a6b4e1 home-assistant: fix zigbee sensor battery reporting
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.

Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).

Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:41:07 +01:00
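
Note on c515a6b4e1: the commit does not show the formula used in the value_template, so the sketch below is only one plausible way to derive a battery percentage from voltage. It assumes a linear mapping with clamping. The function `battery_percent` and the 2500 mV and 3000 mV endpoints are hypothetical values picked for illustration.

```python
def battery_percent(voltage_mv, empty_mv=2500, full_mv=3000):
    """Map a battery voltage in millivolts to a 0-100 percentage.

    The endpoints are assumed values, not taken from the repository.
    """
    span = full_mv - empty_mv
    percent = (voltage_mv - empty_mv) * 100 / span
    # Clamp so readings outside the assumed range stay within 0-100.
    return max(0, min(100, round(percent)))
```

A voltage-based estimate like this avoids the firmware's `battery: 0` reading entirely. The separate staleness alert still covers sensors that stop reporting.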
3cccfc0487 monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:44:13 +01:00
28b8d7c115 monitoring: increase high_cpu_load duration for nix-cache01 to 2h
nix-cache01 regularly hits high CPU during nix builds, causing flappy
alerts. Keep the 15m threshold for all other hosts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:28:48 +01:00
3a9a47f1ad monitoring: exclude step-ca serving cert from general expiry alert
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m23s
Run nix flake check / flake-check (pull_request) Failing after 4m46s
The step-ca serving certificate is auto-renewed with a 24h lifetime,
so it always triggers the general < 86400s threshold. Exclude it and
add a dedicated step_ca_serving_cert_expiring alert at < 1h instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:12:42 +01:00
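
Note on 3a9a47f1ad: the two-tier expiry logic can be sketched as below. The thresholds (86400s for the general alert, 1h for the step-ca serving certificate) and the alert name `step_ca_serving_cert_expiring` come from the commit message. The function `expiry_alerts`, the certificate identifier `step-ca-serving`, and the general alert name `certificate_expiring` are hypothetical.

```python
GENERAL_THRESHOLD_S = 86400  # general alert: less than 24h remaining
STEP_CA_THRESHOLD_S = 3600   # dedicated alert: less than 1h remaining


def expiry_alerts(cert_name, seconds_left, step_ca_cert="step-ca-serving"):
    """Return the names of the alerts that would fire for one certificate."""
    if cert_name == step_ca_cert:
        # The 24h-lifetime cert is excluded from the general rule and
        # checked only against the tighter dedicated threshold.
        if seconds_left < STEP_CA_THRESHOLD_S:
            return ["step_ca_serving_cert_expiring"]
        return []
    if seconds_left < GENERAL_THRESHOLD_S:
        return ["certificate_expiring"]
    return []
```

A freshly renewed step-ca serving certificate never has more than 24 hours left, which is why it always tripped the general rule before the exclusion was added.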
dd1b64de27 monitoring: auto-generate Prometheus scrape targets from host configs
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m49s
Run nix flake check / flake-check (push) Has been cancelled
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.

Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.

Fixes grafana scrape target port from 3100 to 3000.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:49:07 +01:00
14aa3a9340 Remove non-working timer rule
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m3s
Periodic flake update / flake-update (push) Successful in 3m9s
2025-05-29 10:15:40 +02:00
797f915939 Add monitoring rules for monitoring services
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-29 10:09:27 +02:00
3785b8047a Fix alert name for build-flakes alert
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m34s
Periodic flake update / flake-update (push) Successful in 3m3s
2025-05-28 21:28:04 +02:00
fb1a36a846 Rework build-flakes alert rules
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-28 21:26:04 +02:00
fe2e87658a Move prometheus roles to external file
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m7s
2025-05-18 14:54:09 +02:00