From fe80ec3576d579c0e25a441591750a910c6117cd Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Thu, 5 Feb 2026 03:19:20 +0100
Subject: [PATCH] docs: add monitoring gaps audit plan

Co-Authored-By: Claude Opus 4.5
---
 docs/plans/monitoring-gaps.md | 101 ++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100644 docs/plans/monitoring-gaps.md

diff --git a/docs/plans/monitoring-gaps.md b/docs/plans/monitoring-gaps.md
new file mode 100644
index 0000000..296bc71
--- /dev/null
+++ b/docs/plans/monitoring-gaps.md
@@ -0,0 +1,101 @@
+# Monitoring Gaps Audit
+
+## Overview
+
+Audit of services running in the homelab that lack monitoring coverage: missing Prometheus scrape targets, alerting rules, or both.
+
+## Services with No Monitoring
+
+### PostgreSQL (`pgdb1`)
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** A database outage would go completely unnoticed by Prometheus
+- **Recommendation:** Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.
+
+### Authelia (`auth01`)
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** If the authentication gateway is down, access to all proxied services is blocked
+- **Recommendation:** Authelia exposes Prometheus metrics natively at `/metrics`. Add a scrape target and, at minimum, an `authelia_down` systemd unit state alert.
+
+### LLDAP (`auth01`)
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
+- **Recommendation:** Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
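+
+The PostgreSQL item above could be sketched in Nix roughly as follows. The exporter module and its `enable`/`runAsLocalSuperUser` options exist in nixpkgs; the hostname `pgdb1.home.2rjus.net`, the scrape job name, and the use of `services.prometheus.rules` are assumptions about this config and may need adapting. The same `node_systemd_unit_state` pattern would also cover the `authelia_down` and `lldap_down` alerts.
+
+```nix
+{
+  # On pgdb1: enable the nixpkgs postgres exporter
+  # (listens on :9187 by default).
+  services.prometheus.exporters.postgres = {
+    enable = true;
+    runAsLocalSuperUser = true;
+  };
+
+  # On the monitoring host: scrape the exporter, and alert when the
+  # postgresql systemd unit is no longer "active" as reported by
+  # node-exporter's systemd collector.
+  services.prometheus.scrapeConfigs = [{
+    job_name = "postgres";
+    static_configs = [{ targets = [ "pgdb1.home.2rjus.net:9187" ]; }];  # assumed hostname
+  }];
+  services.prometheus.rules = [''
+    groups:
+      - name: postgres
+        rules:
+          - alert: postgres_down
+            expr: node_systemd_unit_state{name="postgresql.service",state="active"} == 0
+            for: 5m
+            labels:
+              severity: critical
+            annotations:
+              summary: PostgreSQL is not running on {{ $labels.instance }}
+  ''];
+}
+```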
+
+### Vault / OpenBao (`vault01`)
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** Secrets management service failures go undetected
+- **Recommendation:** OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.
+
+### Gitea Actions Runner
+
+- **Current state:** No scrape targets, no alert rules
+- **Risk:** CI/CD failures go undetected
+- **Recommendation:** Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.
+
+## Services with Partial Monitoring
+
+### Jellyfin (`jelly01`)
+
+- **Current state:** Has scrape targets (port 8096) and metrics are being collected, but zero alert rules
+- **Metrics available:** 184 metrics, all at the .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
+  - `microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
+  - `microsoft_aspnetcore_hosting_current_requests` - in-flight requests
+  - `process_working_set_bytes` - memory usage (~256 MB currently)
+  - `dotnet_gc_pause_ratio` - GC pressure
+  - `up{job="jellyfin"}` - basic availability
+- **Recommendation:** Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on a sustained increase in the `failed_requests` rate.
+
+### NATS (`nats1`)
+
+- **Current state:** Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics
+- **Metrics available:** NATS has a built-in `/metrics` endpoint exposing connection counts, message throughput, JetStream consumer lag, and more
+- **Recommendation:** Add a scrape target for the NATS metrics endpoint. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.
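+
+Since the Jellyfin metrics are already being collected, the missing piece is just rules. A minimal sketch, assuming alert rules live in `services.prometheus.rules`; the error-rate threshold is a placeholder to tune:
+
+```nix
+{
+  services.prometheus.rules = [''
+    groups:
+      - name: jellyfin
+        rules:
+          - alert: jellyfin_down
+            expr: up{job="jellyfin"} == 0
+            for: 5m
+            labels:
+              severity: critical
+            annotations:
+              summary: Jellyfin scrape target is down
+          # Assumes failed_requests behaves as a counter, so rate()
+          # approximates HTTP errors per second; adjust if it is a gauge.
+          - alert: jellyfin_error_rate
+            expr: rate(microsoft_aspnetcore_hosting_failed_requests[5m]) > 1
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: Sustained HTTP error rate from Jellyfin
+  ''];
+}
+```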
+
+### DNS - Unbound (`ns1`, `ns2`)
+
+- **Current state:** Has an `unbound_down` alert (systemd unit state), but no DNS query metrics
+- **Available in nixpkgs:** `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), and upstream latency.
+- **Recommendation:** Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
+
+### DNS - NSD (`ns1`, `ns2`)
+
+- **Current state:** Has an `nsd_down` alert (systemd unit state), no NSD-specific metrics
+- **Available in nixpkgs:** Nothing. No exporter package or NixOS module. A community `nsd_exporter` exists but is not packaged.
+- **Recommendation:** The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.
+
+## Existing Monitoring (for reference)
+
+These services have adequate alerting and/or scrape targets:
+
+| Service | Scrape Targets | Alert Rules |
+|---|---|---|
+| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
+| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
+| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
+| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
+| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |
+
+## Suggested Priority
+
+1. **PostgreSQL** - Critical infrastructure, easy to add with the existing nixpkgs module
+2. **Authelia + LLDAP** - An auth outage affects all proxied services
+3. **Unbound exporter** - Ready-to-go NixOS module, just needs enabling
+4. **Jellyfin alerts** - Metrics already collected, just needs alert rules
+5. **NATS metrics** - Built-in endpoint, just needs a scrape target
+6. **Vault/OpenBao** - Native telemetry support
+7. **Actions Runner** - Lower priority, a basic systemd alert is sufficient
+
+## Node-Exporter Targets Currently Down
+
+Noted during audit -- these node-exporter targets are failing:
+
+- `nixos-test1.home.2rjus.net:9100` - no route to host
+- `media1.home.2rjus.net:9100` - no route to host
+- `ns3.home.2rjus.net:9100` - no route to host
+- `ns4.home.2rjus.net:9100` - no route to host
+
+These may be decommissioned or powered-off hosts that should be removed from the scrape config.
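+
+If the four hosts above are confirmed decommissioned, the fix is to drop them from the node job's target list. A sketch, assuming the targets are declared inline in `services.prometheus.scrapeConfigs` (the live hostnames shown are illustrative):
+
+```nix
+{
+  services.prometheus.scrapeConfigs = [{
+    job_name = "node";
+    static_configs = [{
+      targets = [
+        # nixos-test1, media1, ns3, ns4 removed: no route to host
+        "ns1.home.2rjus.net:9100"
+        "ns2.home.2rjus.net:9100"
+      ];
+    }];
+  }];
+}
+```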