Monitoring Gaps Audit
Overview
Audit of homelab services that lack monitoring coverage: missing Prometheus scrape targets, missing alerting rules, or both.
Services with No Monitoring
PostgreSQL (pgdb1)
- Current state: No scrape targets, no alert rules
- Risk: A database outage would go completely unnoticed by Prometheus
- Recommendation: Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.
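A minimal sketch of the exporter side on pgdb1, assuming a recent nixpkgs. The option names are those of the nixpkgs postgres exporter module. Running as the local superuser and opening the firewall are assumptions about this host, not findings of the audit.

```nix
# Sketch for the pgdb1 host configuration (assumed layout, adapt to the flake).
{ ... }:
{
  services.prometheus.exporters.postgres = {
    enable = true;
    # 9187 is the module's default port; stated explicitly so the scrape
    # target on the Prometheus host is easy to match.
    port = 9187;
    # Assumption: connect over the local socket as the postgres superuser
    # instead of provisioning a dedicated monitoring role and DSN.
    runAsLocalSuperUser = true;
    # Assumption: Prometheus scrapes pgdb1 over the network.
    openFirewall = true;
  };
}
```

The Prometheus host would then need a scrape target for port 9187 on pgdb1. For the connection exhaustion alert, `pg_stat_activity_count` and `pg_settings_max_connections` are likely candidates, but the exact metric names should be confirmed against the exporter's `/metrics` output.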
Authelia (auth01)
- Current state: No scrape targets, no alert rules
- Risk: The authentication gateway being down blocks access to all proxied services
- Recommendation: Authelia exposes Prometheus metrics natively at `/metrics`. Add a scrape target and, at minimum, an `authelia_down` systemd unit state alert.
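A possible scrape target for the Prometheus host, as a sketch only. Authelia's metrics listener is separate from its main HTTP port and is off by default, so it has to be enabled in Authelia's telemetry settings first. The FQDN and port below are assumptions: the hostname follows the `home.2rjus.net` pattern used elsewhere in this audit, and 9959 is Authelia's documented default metrics port.

```nix
# Sketch for the Prometheus host (assumes services.prometheus is already enabled).
{ ... }:
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "authelia";
      # metrics_path is left at its default, /metrics.
      static_configs = [
        {
          # Assumed hostname and port; verify against the auth01 configuration.
          targets = [ "auth01.home.2rjus.net:9959" ];
        }
      ];
    }
  ];
}
```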
LLDAP (auth01)
- Current state: No scrape targets, no alert rules
- Risk: LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
- Recommendation: Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
Vault / OpenBao (vault01)
- Current state: No scrape targets, no alert rules
- Risk: Secrets management service failures go undetected
- Recommendation: OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.
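A sketch of what the scrape target could look like, assuming OpenBao keeps the Vault-compatible `/v1/sys/metrics` API. The hostname, port 8200, and HTTPS scheme are assumptions. The endpoint normally requires a token, so this only works as written if unauthenticated metrics access is enabled on the listener, or if credentials are added to the scrape config.

```nix
# Sketch for the Prometheus host.
{ ... }:
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "openbao";
      # Vault-compatible metrics API; Prometheus text format must be requested.
      metrics_path = "/v1/sys/metrics";
      params.format = [ "prometheus" ];
      # Assumption: the API listener uses TLS on the default port.
      scheme = "https";
      static_configs = [
        { targets = [ "vault01.home.2rjus.net:8200" ]; }
      ];
    }
  ];
}
```

The metric to use for a seal status alert is deliberately left out. Its name should be read from the actual telemetry output rather than assumed.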
Gitea Actions Runner
- Current state: No scrape targets, no alert rules
- Risk: CI/CD failures go undetected
- Recommendation: Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.
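The systemd unit state alerts recommended in this section all follow the same shape, so they could be generated from a small table. This is a sketch, not the repository's existing rule layout: `mkUnitDown` is a hypothetical helper, the unit names are assumptions, and it relies on node-exporter running with the systemd collector enabled, as the existing `*_down` alerts suggest.

```nix
# Sketch for the Prometheus host.
{ lib, ... }:
let
  # Hypothetical helper: build one "<name>_down" rule from a systemd unit name.
  mkUnitDown = name: unit: {
    alert = "${name}_down";
    expr = ''node_systemd_unit_state{name="${unit}", state="active"} == 0'';
    for = "5m";
    labels.severity = "critical";
    annotations.summary = "${unit} is not active on {{ $labels.instance }}";
  };
in
{
  services.prometheus.rules = [
    # JSON is valid YAML, so the rule group can be written as a Nix attrset.
    (builtins.toJSON {
      groups = [
        {
          name = "unit_down";
          rules = lib.mapAttrsToList mkUnitDown {
            # Assumed unit names; confirm with `systemctl list-units` per host.
            postgres = "postgresql.service";
            lldap = "lldap.service";
            vault = "openbao.service";
          };
        }
      ];
    })
  ];
}
```

Authelia and the Gitea Actions runner are omitted because their unit names include an instance name that this audit does not record. They would be added to the same table once the names are known.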
Services with Partial Monitoring
Jellyfin (jelly01)
- Current state: Has scrape targets (port 8096), metrics are being collected, but zero alert rules
- Metrics available: 184 metrics, all at the .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
  - `microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
  - `microsoft_aspnetcore_hosting_current_requests` - in-flight requests
  - `process_working_set_bytes` - memory usage (~256 MB currently)
  - `dotnet_gc_pause_ratio` - GC pressure
  - `up{job="jellyfin"}` - basic availability
- Recommendation: Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on a sustained increase in the `failed_requests` rate.
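Since the `up` series is already being collected, the availability alert needs no changes on jelly01. A minimal sketch, where the 5 minute hold time and the `warning` severity are placeholders rather than values taken from the existing rules:

```nix
# Sketch for the Prometheus host.
{ ... }:
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "jellyfin";
          rules = [
            {
              alert = "jellyfin_down";
              # Fires when the existing jellyfin scrape job cannot reach its target.
              expr = ''up{job="jellyfin"} == 0'';
              for = "5m";
              labels.severity = "warning";
              annotations.summary = "Jellyfin target {{ $labels.instance }} is down";
            }
          ];
        }
      ];
    })
  ];
}
```

An alert on failed requests is left out on purpose. A sensible threshold depends on a baseline that this audit did not measure.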
NATS (nats1)
- Current state: Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics
- Metrics available: NATS exposes connection counts, message throughput, JetStream consumer lag, and more through its built-in HTTP monitoring endpoints (`/varz`, `/connz`, `/jsz`). These return JSON rather than Prometheus text format, so an exporter such as `prometheus-nats-exporter` is needed in front of them.
- Recommendation: Run the NATS exporter on nats1 and add a scrape target for it. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.
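Stock nats-server publishes its monitoring data as JSON, so this sketch assumes the nixpkgs NATS exporter module is used to translate it for Prometheus. The option names should be checked against the nixpkgs revision in use. The monitoring port and the firewall setting are assumptions.

```nix
# Sketch for the nats1 host configuration.
{ ... }:
{
  # Assumption: the NATS HTTP monitoring port is not enabled yet.
  # 8222 is the conventional port for it.
  services.nats.settings.http_port = 8222;

  services.prometheus.exporters.nats = {
    enable = true;
    # Where the exporter reads NATS monitoring data from.
    url = "http://127.0.0.1:8222";
    # Assumption: Prometheus reaches the exporter over the network.
    openFirewall = true;
  };
}
```

JetStream and per-connection statistics are opt-in on the exporter side, so extra exporter flags are probably needed before consumer lag or storage alerts become possible.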
DNS - Unbound (ns1, ns2)
- Current state: Has `unbound_down` alert (systemd unit state), but no DNS query metrics
- Available in nixpkgs: `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
- Recommendation: Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
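Enabling the exporter could look roughly like this on ns1 and ns2. It is a sketch under assumptions: the exporter talks to unbound's control interface, and a local control socket is used here to avoid managing control certificates. The socket path and the firewall setting are illustrative.

```nix
# Sketch for the ns1 and ns2 host configurations.
{ ... }:
{
  # Assumption: expose unbound's control interface on a local socket.
  services.unbound.localControlSocketPath = "/run/unbound/unbound.ctl";

  services.prometheus.exporters.unbound = {
    enable = true;
    # Point the exporter at the same socket.
    unbound.host = "unix:///run/unbound/unbound.ctl";
    # Assumption: Prometheus scrapes the name servers over the network.
    openFirewall = true;
  };
}
```

The exporter's user may need group access to the socket, depending on the nixpkgs revision. Metric names for the cache hit ratio and SERVFAIL alerts are best taken from the exporter's `/metrics` output once it is running.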
DNS - NSD (ns1, ns2)
- Current state: Has `nsd_down` alert (systemd unit state), no NSD-specific metrics
- Available in nixpkgs: Nothing. No exporter package or NixOS module. Community `nsd_exporter` exists but is not packaged.
- Recommendation: The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.
Existing Monitoring (for reference)
These services have adequate alerting and/or scrape targets:
| Service | Scrape Targets | Alert Rules |
|---|---|---|
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |
Suggested Priority
1. PostgreSQL - Critical infrastructure, easy to add with existing nixpkgs module
2. Authelia + LLDAP - Auth outage affects all proxied services
3. Unbound exporter - Ready-to-go NixOS module, just needs enabling
4. Jellyfin alerts - Metrics already collected, just needs alert rules
5. NATS metrics - Built-in monitoring endpoint, needs an exporter and a scrape target
6. Vault/OpenBao - Native telemetry support
7. Actions Runner - Lower priority, basic systemd alert sufficient
Node-Exporter Targets Currently Down
Noted during audit -- these node-exporter targets are failing:
- `nixos-test1.home.2rjus.net:9100` - no route to host
- `media1.home.2rjus.net:9100` - no route to host
- `ns3.home.2rjus.net:9100` - no route to host
- `ns4.home.2rjus.net:9100` - no route to host
These may be decommissioned or powered-off hosts that should be removed from the scrape config.