44 Commits

Author SHA1 Message Date
3cccfc0487 monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:44:13 +01:00
fa6380e767 monitoring: fix nix-cache_caddy scrape target TLS error
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m43s
Move nix-cache_caddy back to a manual config in prometheus.nix using the
service CNAME (nix-cache.home.2rjus.net) instead of the hostname. The
auto-generated target used nix-cache01.home.2rjus.net which doesn't
match the TLS certificate SAN.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:04:50 +01:00
dd1b64de27 monitoring: auto-generate Prometheus scrape targets from host configs
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m49s
Run nix flake check / flake-check (push) Has been cancelled
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.

Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.

Fixes grafana scrape target port from 3100 to 3000.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:49:07 +01:00
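
The commit above describes generating Prometheus scrape configs from per-host options instead of hardcoded target lists. The repository does this in Nix; the following is only a rough Python sketch of the same idea, not the author's implementation. The `Host` and `ScrapeTarget` types, the `build_scrape_configs` helper, the host name `monitoring01`, and port 9100 are hypothetical names chosen for illustration. The domain `home.2rjus.net` and Grafana port 3000 are taken from the commit messages.

```python
from dataclasses import dataclass, field


@dataclass
class ScrapeTarget:
    """One exporter endpoint declared by a host (hypothetical shape)."""
    job_name: str
    port: int
    metrics_path: str = "/metrics"


@dataclass
class Host:
    """A host with monitoring options, loosely mirroring enable + scrapeTargets."""
    name: str
    monitoring_enabled: bool = True
    scrape_targets: list[ScrapeTarget] = field(default_factory=list)


def build_scrape_configs(hosts, domain="home.2rjus.net"):
    """Group declared targets by job name into Prometheus-style scrape configs."""
    jobs = {}
    for host in hosts:
        # Hosts that opt out of monitoring contribute no targets.
        if not host.monitoring_enabled:
            continue
        for target in host.scrape_targets:
            job = jobs.setdefault(
                target.job_name,
                {
                    "job_name": target.job_name,
                    "metrics_path": target.metrics_path,
                    "static_configs": [{"targets": []}],
                },
            )
            job["static_configs"][0]["targets"].append(
                f"{host.name}.{domain}:{target.port}"
            )
    # Sort by job name so the generated config is stable between runs.
    return [jobs[name] for name in sorted(jobs)]


if __name__ == "__main__":
    hosts = [
        Host("monitoring01", scrape_targets=[ScrapeTarget("grafana", 3000)]),
        Host("nix-cache01", scrape_targets=[ScrapeTarget("node", 9100)]),
        Host("testvm", monitoring_enabled=False,
             scrape_targets=[ScrapeTarget("node", 9100)]),
    ]
    for config in build_scrape_configs(hosts):
        print(config)
```

A target derived from the hostname this way can fail TLS verification when the certificate only covers a service CNAME, which is the situation the follow-up commit fa6380e767 handles by keeping that one target as a manual entry.
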
adf70999b9 Fix scrape config
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m7s
Periodic flake update / flake-update (push) Successful in 3m13s
2025-06-01 02:41:54 +02:00
acb9e59775 Scrape nix-cache caddy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-06-01 02:40:41 +02:00
77d1782f36 Set honor_labels for pushgw scrape
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m37s
2025-05-28 20:34:17 +02:00
5b06a95222 Add prometheus pushgateway
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m59s
2025-05-28 17:10:50 +02:00
2a46da3761 Add labmon to scrape config
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m32s
2025-05-24 03:37:52 +02:00
4e870cda44 Scrape step-ca metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m52s
Periodic flake update / flake-update (push) Successful in 2m42s
2025-05-23 09:28:52 +02:00
6e6d5098c5 Collect ghettoptt stats
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m48s
2025-05-22 14:55:32 +02:00
aa2cbcda60 Add home assistant to prometheus
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m18s
2025-05-19 11:21:46 +02:00
fe2e87658a Move prometheus roles to external file
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m7s
2025-05-18 14:54:09 +02:00
c07d96bbab Add alert for wireguard handshake
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m17s
Periodic flake update / flake-update (push) Successful in 2m15s
2025-05-18 01:12:04 +02:00
bd58d07001 Monitor wireguard
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m32s
2025-05-18 00:59:55 +02:00
3797526000 Add some alerting rules for smartctl
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-05-18 00:51:02 +02:00
afa3cc3a57 Collect smartctl metrics from gunter
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m53s
2025-05-18 00:43:15 +02:00
08a0ddaf30 Increase prometheus retention to 30d
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m58s
Periodic flake update / flake-update (push) Successful in 4m7s
2025-05-12 23:22:31 +02:00
518e3a3ded Fix flapping build-flakes alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m57s
Periodic flake update / flake-update (push) Successful in 3m59s
2025-04-07 10:41:35 +02:00
0dbdee65c5 Add harmonia alerting rule
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-24 18:29:41 +01:00
874e30fb28 Tune cpu alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m18s
2025-02-23 20:46:25 +01:00
15e5ccb0ec Change alertmanager repeat time
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m41s
2025-02-23 18:10:14 +01:00
b8d058d23e Add alerting rules
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m51s
2025-02-12 20:34:22 +01:00
a5448c5fc1 Remove whitespace
Some checks failed
Run nix flake check / flake-check (push) Failing after 24m42s
Periodic flake update / flake-update (push) Successful in 1m24s
2025-02-12 00:26:14 +01:00
f1ca20a387 Add some alerting rules
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m34s
2025-02-11 23:24:35 +01:00
f0bc29ac5e Add nats host to monitoring
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-11 23:12:55 +01:00
539ff4eeac Change cpu load alert
Some checks are pending
Run nix flake check / flake-check (push) Waiting to run
2025-02-11 23:07:56 +01:00
abb4cf58ea Add alerttonotify to monitoring host
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-02-11 22:25:54 +01:00
6079852cc6 Add missing hosts to prometheus scrape job
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m22s
Periodic flake update / flake-update (push) Successful in 1m30s
2025-01-26 00:56:21 +01:00
26bf43bba5 Collect restic rest metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m44s
Periodic flake update / flake-update (push) Successful in 1m29s
2025-01-24 23:43:02 +01:00
2824718e53 Collect alertmanager metrics
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2025-01-24 23:34:43 +01:00
25b2f1d1ee Collect grafana metrics
2025-01-24 23:33:49 +01:00
f2b5bb6f2a Collect loki metrics
2025-01-24 23:32:45 +01:00
e70e892ab2 Add build-flakes script for nix-cache
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m20s
2025-01-24 01:12:18 +01:00
43dfc0ec28 Add some alerting rules
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m23s
Periodic flake update / flake-update (push) Successful in 1m32s
2025-01-21 22:47:44 +01:00
79b6598d0d Add jellyfin
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m36s
Periodic flake update / flake-update (push) Successful in 1m29s
2024-12-22 04:33:00 +01:00
b3ebe3a3b0 Monitor prometheus metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m24s
Periodic flake update / flake-update (push) Successful in 1m59s
2024-12-05 19:36:55 +01:00
4c60f7b5c1 Fix caddy metrics endpoint
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m38s
2024-12-04 04:09:06 +01:00
5af18ca418 Gather caddy metrics
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
2024-12-04 04:02:24 +01:00
4b38158780 Add pve monitoring
Some checks failed
Run nix flake check / flake-check (push) Failing after 23m15s
Periodic flake update / flake-update (push) Successful in 1m47s
2024-12-03 18:01:48 +01:00
91a844fe4d Fix alerting
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m25s
Periodic flake update / flake-update (push) Successful in 2m19s
2024-12-03 00:47:00 +01:00
f08ac69003 Improve monitoring stuff
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m5s
2024-12-02 23:41:46 +01:00
b62a5c3db9 Disable alertmanager
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m25s
Periodic flake update / flake-update (push) Successful in 1m45s
2024-12-01 23:45:45 +01:00
a4592ffda3 Improve monitoring stuff
Some checks failed
Run nix flake check / flake-check (push) Failing after 23m19s
2024-12-01 20:51:14 +01:00
3c3eaaa042 Add monitoring host
2024-12-01 01:51:34 +01:00