docs: add monitoring gaps audit plan

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:19:20 +01:00
parent 870fb3e532
commit fe80ec3576
1 changed files with 101 additions and 0 deletions
--- a/docs/plans/monitoring-gaps.md
+++ b/docs/plans/monitoring-gaps.md
@@ -0,0 +1,101 @@
+# Monitoring Gaps Audit
+
+## Overview
+
+Audit of homelab services that lack monitoring coverage: missing Prometheus scrape targets, missing alerting rules, or both.
+
+## Services with No Monitoring
+
+### PostgreSQL (`pgdb1`)
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** A database outage would go completely unnoticed by Prometheus
+- **Recommendation:** Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.
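+
+The two alerts named above could be sketched as a Prometheus rule group like the one below. This is a minimal sketch, not a tested configuration: it assumes node-exporter's systemd collector is enabled, that the unit is named `postgresql.service`, and that the exporter publishes `pg_stat_activity_count` and `pg_settings_max_connections` (the usual postgres_exporter names). The thresholds and the `severity` label are placeholders.
+
+```yaml
+groups:
+  - name: postgres
+    rules:
+      # Unit-state alert; needs node-exporter's systemd collector (assumption).
+      - alert: postgres_down
+        expr: node_systemd_unit_state{name="postgresql.service", state="active"} == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "PostgreSQL is not active on {{ $labels.instance }}"
+      # Connection saturation; metric names assume postgres_exporter defaults.
+      - alert: postgres_connections_high
+        expr: |
+          sum by (instance) (pg_stat_activity_count)
+            / on (instance) pg_settings_max_connections > 0.8
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "PostgreSQL connections above 80% of max_connections on {{ $labels.instance }}"
+```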
+
+### Authelia (`auth01`)
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** The authentication gateway being down blocks access to all proxied services
+- **Recommendation:** Authelia can expose Prometheus metrics natively at `/metrics`, served on a separate telemetry listener that has to be enabled in its configuration. Add a scrape target and at minimum an `authelia_down` systemd unit state alert.
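+
+A scrape target for Authelia might look like the sketch below. The hostname `auth01.home.2rjus.net` is an assumption derived from the domain used elsewhere in this document, and port 9959 assumes Authelia's telemetry listener is enabled on its default address.
+
+```yaml
+scrape_configs:
+  - job_name: authelia
+    metrics_path: /metrics
+    static_configs:
+      - targets:
+          # Assumed hostname and Authelia's default telemetry port.
+          - auth01.home.2rjus.net:9959
+```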
+
+### LLDAP (`auth01`)
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
+- **Recommendation:** Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
+
+### Vault / OpenBao (`vault01`)
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** Secrets management service failures go undetected
+- **Recommendation:** OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.
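+
+A possible scrape job for the telemetry endpoint is sketched below, assuming OpenBao keeps Vault's `/v1/sys/metrics` path with `format=prometheus`. The hostname, port 8200, and token file path are illustrative assumptions. The token is only needed if unauthenticated metrics access is not enabled on the listener.
+
+```yaml
+scrape_configs:
+  - job_name: openbao
+    scheme: https
+    metrics_path: /v1/sys/metrics
+    params:
+      format: [prometheus]
+    authorization:
+      # Hypothetical path to a token allowed to read sys/metrics.
+      credentials_file: /run/secrets/openbao-metrics-token
+    static_configs:
+      - targets:
+          # Assumed hostname and default API port.
+          - vault01.home.2rjus.net:8200
+```
+
+Vault reports seal state through a `vault_core_unsealed` gauge. Whether OpenBao publishes the same metric name should be checked against the scraped output before writing the seal status alert.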
+
+### Gitea Actions Runner
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** CI/CD failures go undetected
+- **Recommendation:** Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.
+
+## Services with Partial Monitoring
+
+### Jellyfin (`jelly01`)
+
+- **Current state:** Has scrape targets (port 8096), metrics are being collected, but zero alert rules
+- **Metrics available:** 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
+  - `microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
+  - `microsoft_aspnetcore_hosting_current_requests` - in-flight requests
+  - `process_working_set_bytes` - memory usage (~256 MB currently)
+  - `dotnet_gc_pause_ratio` - GC pressure
+  - `up{job="jellyfin"}` - basic availability
+- **Recommendation:** Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on a sustained increase in `microsoft_aspnetcore_hosting_failed_requests`.
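+
+Since the metrics are already collected, the alert rules could be as small as the sketch below. The `for` durations and the failure-rate threshold are placeholders, and the second rule assumes `microsoft_aspnetcore_hosting_failed_requests` behaves as a monotonically increasing counter, which should be verified first.
+
+```yaml
+groups:
+  - name: jellyfin
+    rules:
+      - alert: jellyfin_down
+        expr: up{job="jellyfin"} == 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Jellyfin scrape target {{ $labels.instance }} is down"
+      # Assumes the metric is a counter; the threshold is a placeholder.
+      - alert: jellyfin_failed_requests_high
+        expr: rate(microsoft_aspnetcore_hosting_failed_requests[10m]) > 1
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Jellyfin is returning a sustained rate of failed requests"
+```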
+
+### NATS (`nats1`)
+
+- **Current state:** Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics
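+- **Metrics available:** The NATS server exposes connection counts, message throughput, JetStream consumer state, and more as JSON on its HTTP monitoring port (`/varz`, `/connz`, `/jsz`). It does not serve Prometheus-format metrics itself; `prometheus-nats-exporter` translates these endpoints for scraping.
+- **Recommendation:** Run `prometheus-nats-exporter` against the NATS monitoring port and add a scrape target for it. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.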
+
+### DNS - Unbound (`ns1`, `ns2`)
+
+- **Current state:** Has `unbound_down` alert (systemd unit state), but no DNS query metrics
+- **Available in nixpkgs:** `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
+- **Recommendation:** Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
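+
+The two suggested alerts could be expressed roughly as follows. The metric names (`unbound_answer_rcodes_total`, `unbound_cache_hits_total`, `unbound_cache_misses_total`) are assumptions based on the commonly used unbound exporter and should be checked against the exporter's actual output. The thresholds are placeholders.
+
+```yaml
+groups:
+  - name: unbound
+    rules:
+      # Share of answers that are SERVFAIL over the last 5 minutes.
+      - alert: unbound_servfail_rate_high
+        expr: |
+          sum by (instance) (rate(unbound_answer_rcodes_total{rcode="SERVFAIL"}[5m]))
+            / sum by (instance) (rate(unbound_answer_rcodes_total[5m])) > 0.05
+        for: 10m
+        labels:
+          severity: warning
+      # Cache hit ratio = hits / (hits + misses).
+      - alert: unbound_cache_hit_ratio_low
+        expr: |
+          sum by (instance) (rate(unbound_cache_hits_total[15m]))
+            / (
+              sum by (instance) (rate(unbound_cache_hits_total[15m]))
+              + sum by (instance) (rate(unbound_cache_misses_total[15m]))
+            ) < 0.5
+        for: 30m
+        labels:
+          severity: warning
+```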
+
+### DNS - NSD (`ns1`, `ns2`)
+
+- **Current state:** Has `nsd_down` alert (systemd unit state), no NSD-specific metrics
+- **Available in nixpkgs:** Nothing. No exporter package or NixOS module. Community `nsd_exporter` exists but is not packaged.
+- **Recommendation:** The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.
+
+## Existing Monitoring (for reference)
+
+These services have adequate alerting and/or scrape targets:
+
+| Service | Scrape Targets | Alert Rules |
+|---|---|---|
+| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
+| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
+| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
+| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
+| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |
+
+## Suggested Priority
+
+1. **PostgreSQL** - Critical infrastructure, easy to add with existing nixpkgs module
+2. **Authelia + LLDAP** - Auth outage affects all proxied services
+3. **Unbound exporter** - Ready-to-go NixOS module, just needs enabling
+4. **Jellyfin alerts** - Metrics already collected, just needs alert rules
+5. **NATS metrics** - Monitoring endpoint is built in, needs an exporter and a scrape target
+6. **Vault/OpenBao** - Native telemetry support
+7. **Actions Runner** - Lower priority, basic systemd alert sufficient
+
+## Node-Exporter Targets Currently Down
+
+Noted during audit -- these node-exporter targets are failing:
+
+- `nixos-test1.home.2rjus.net:9100` - no route to host
+- `media1.home.2rjus.net:9100` - no route to host
+- `ns3.home.2rjus.net:9100` - no route to host
+- `ns4.home.2rjus.net:9100` - no route to host
+
+These may be decommissioned or powered-off hosts that should be removed from the scrape config.