Commit Graph

68 Commits

Author SHA1 Message Date
b322b1156b monitoring: fix openbao token output path
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m17s
Run nix flake check / flake-check (push) Failing after 8m57s
When extractKey is set, outputDir must be the full file path, not just
the parent directory.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:56:26 +01:00
3cccfc0487 monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:44:13 +01:00
0700033c0a secrets: migrate all hosts from sops to OpenBao vault
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.

Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:43:09 +01:00
28b8d7c115 monitoring: increase high_cpu_load duration for nix-cache01 to 2h
nix-cache01 regularly hits high CPU during nix builds, causing flapping
alerts. Keep the 15m threshold for all other hosts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:28:48 +01:00
3a9a47f1ad monitoring: exclude step-ca serving cert from general expiry alert
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m23s
Run nix flake check / flake-check (pull_request) Failing after 4m46s
The step-ca serving certificate is auto-renewed with a 24h lifetime,
so it always triggers the general < 86400s threshold. Exclude it and
add a dedicated step_ca_serving_cert_expiring alert at < 1h instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:12:42 +01:00
fa6380e767 monitoring: fix nix-cache_caddy scrape target TLS error
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m43s
Move nix-cache_caddy back to a manual config in prometheus.nix using the
service CNAME (nix-cache.home.2rjus.net) instead of the hostname. The
auto-generated target used nix-cache01.home.2rjus.net which doesn't
match the TLS certificate SAN.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:04:50 +01:00
dd1b64de27 monitoring: auto-generate Prometheus scrape targets from host configs
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m49s
Run nix flake check / flake-check (push) Has been cancelled
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.

Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.

Fixes grafana scrape target port from 3100 to 3000.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:49:07 +01:00
adf70999b9 Fix scrape config
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m7s
Periodic flake update / flake-update (push) Successful in 3m13s
2025-06-01 02:41:54 +02:00
acb9e59775 Scrape nix-cache caddy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-06-01 02:40:41 +02:00
14aa3a9340 Remove non-working timer rule
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m3s
Periodic flake update / flake-update (push) Successful in 3m9s
2025-05-29 10:15:40 +02:00
797f915939 Add monitoring rules for monitoring services
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-29 10:09:27 +02:00
3785b8047a Fix alert name for build-flakes alert
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m34s
Periodic flake update / flake-update (push) Successful in 3m3s
2025-05-28 21:28:04 +02:00
fb1a36a846 Rework build-flakes alert rules
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-28 21:26:04 +02:00
77d1782f36 Set honor_labels for pushgw scrape
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m37s
2025-05-28 20:34:17 +02:00
5b06a95222 Add prometheus pushgateway
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m59s
2025-05-28 17:10:50 +02:00
5ce8f46394 Configure tempo otlp receiver endpoint
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m42s
Periodic flake update / flake-update (push) Successful in 4m6s
2025-05-24 22:10:01 +02:00
feff1d06eb Configure tempo otlp receiver
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 22:08:36 +02:00
b75df7578f Configure tempo wal storage
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 22:03:56 +02:00
4d88644417 Configure tempo storage
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 21:55:08 +02:00
d4137f79aa Change tempo settings
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 21:32:19 +02:00
486320b0ec Add tempo to monitoring
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 21:29:05 +02:00
6fc4d42d16 Fix alloy config
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 12:42:40 +02:00
ebcdefd0ca Add alloy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 12:40:39 +02:00
2dae23560d Fix pyroscope ports attribute
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 12:01:30 +02:00
1988b36f03 Add pyroscope container to monitoring
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-24 12:00:02 +02:00
2a46da3761 Add labmon to scrape config
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m32s
2025-05-24 03:37:52 +02:00
4e870cda44 Scrape step-ca metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m52s
Periodic flake update / flake-update (push) Successful in 2m42s
2025-05-23 09:28:52 +02:00
6e6d5098c5 Collect ghettoptt stats
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m48s
2025-05-22 14:55:32 +02:00
aa2cbcda60 Add home assistant to prometheus
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m18s
2025-05-19 11:21:46 +02:00
78efb084ec Alerttonotify hardening part 3
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m10s
Periodic flake update / flake-update (push) Successful in 4m12s
2025-05-18 15:24:58 +02:00
16042b08c0 Alerttonotify hardening part 2
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m58s
2025-05-18 15:20:00 +02:00
8e0b97c9e0 Alerttonotify hardening part 1
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m30s
2025-05-18 15:08:26 +02:00
fe2e87658a Move prometheus roles to external file
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m7s
2025-05-18 14:54:09 +02:00
c07d96bbab Add alert for wireguard handshake
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m17s
Periodic flake update / flake-update (push) Successful in 2m15s
2025-05-18 01:12:04 +02:00
bd58d07001 Monitor wireguard
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m32s
2025-05-18 00:59:55 +02:00
3797526000 Add some alerting rules for smartctl
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-18 00:51:02 +02:00
afa3cc3a57 Collect smartctl metrics from gunter
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m53s
2025-05-18 00:43:15 +02:00
08a0ddaf30 Increase prometheus retention to 30d
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m58s
Periodic flake update / flake-update (push) Successful in 4m7s
2025-05-12 23:22:31 +02:00
518e3a3ded Fix flapping build-flakes alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m57s
Periodic flake update / flake-update (push) Successful in 3m59s
2025-04-07 10:41:35 +02:00
0dbdee65c5 Add harmonia alerting rule
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-24 18:29:41 +01:00
b468e9d533 Improve alerttonotify service
Some checks failed
Run nix flake check / flake-check (push) Failing after 2m56s
Periodic flake update / flake-update (push) Successful in 1m26s
2025-02-23 20:51:39 +01:00
874e30fb28 Tune cpu alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m18s
2025-02-23 20:46:25 +01:00
db9bf38ab6 Fix alerttonotify service
Some checks failed
Run nix flake check / flake-check (push) Failing after 26m40s
2025-02-23 18:16:13 +01:00
15e5ccb0ec Change alertmanager repeat time
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m41s
2025-02-23 18:10:14 +01:00
b8d058d23e Add alerting rules
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m51s
2025-02-12 20:34:22 +01:00
a5448c5fc1 Remove whitespace
Some checks failed
Run nix flake check / flake-check (push) Failing after 24m42s
Periodic flake update / flake-update (push) Successful in 1m24s
2025-02-12 00:26:14 +01:00
f1ca20a387 Add some alerting rules
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m34s
2025-02-11 23:24:35 +01:00
f0bc29ac5e Add nats host to monitoring
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-11 23:12:55 +01:00
539ff4eeac Change cpu load alert
Some checks are pending
Run nix flake check / flake-check (push) Waiting to run
2025-02-11 23:07:56 +01:00
3b500a25a7 Enable alerttonotify service
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-11 22:34:41 +01:00