# Monitoring Gaps Audit

## Overview

Audit of services running in the homelab that lack monitoring coverage, either missing Prometheus scrape targets, alerting rules, or both.

## Services with No Monitoring

### PostgreSQL (`pgdb1`)

- **Current state:** No scrape targets, no alert rules
- **Risk:** A database outage would go completely unnoticed by Prometheus
- **Recommendation:** Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.

### Authelia (`auth01`)

- **Current state:** No scrape targets, no alert rules
- **Risk:** The authentication gateway being down blocks access to all proxied services
- **Recommendation:** Authelia exposes Prometheus metrics natively at `/metrics`. Add a scrape target and at minimum an `authelia_down` systemd unit state alert.

### LLDAP (`auth01`)

- **Current state:** No scrape targets, no alert rules
- **Risk:** LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
- **Recommendation:** Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.

### Vault / OpenBao (`vault01`)

- **Current state:** No scrape targets, no alert rules
- **Risk:** Secrets management service failures go undetected
- **Recommendation:** OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.

### Gitea Actions Runner

- **Current state:** No scrape targets, no alert rules
- **Risk:** CI/CD failures go undetected
- **Recommendation:** Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.

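
The systemd unit state alerts recommended in this section (`postgres_down`, `authelia_down`, `lldap_down`, `vault_down`) all share one shape, so they could be generated from a small helper on the Prometheus host. The following is a minimal sketch, assuming node-exporter's systemd collector is already scraped from these hosts. The helper name `mkUnitDownAlert`, the group name, the unit names, the `for` duration, and the `severity` label are assumptions for illustration and are not taken from the existing config.

```nix
# Sketch of a NixOS module for the Prometheus host.
# Unit names below are assumptions; check the real names with `systemctl list-units`.
{ ... }:
let
  # Hypothetical helper: build one "<name>_down" alert for a systemd unit.
  mkUnitDownAlert = name: unit: {
    alert = "${name}_down";
    # node-exporter's systemd collector reports 1 for the state a unit is
    # currently in, so the "active" series dropping to 0 means it is not running.
    expr = ''node_systemd_unit_state{name="${unit}", state="active"} == 0'';
    for = "5m"; # assumed grace period
    labels.severity = "critical"; # assumed label
    annotations.summary = "${unit} is not active on {{ $labels.instance }}";
  };
in
{
  # Rules are passed as strings; JSON is valid YAML, so toJSON output is accepted.
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "unit_state"; # hypothetical group name
          rules = [
            (mkUnitDownAlert "postgres" "postgresql.service")
            (mkUnitDownAlert "authelia" "authelia-main.service") # depends on the instance name
            (mkUnitDownAlert "lldap" "lldap.service")
            (mkUnitDownAlert "vault" "openbao.service")
          ];
        }
      ];
    })
  ];
}
```

An alert for the Gitea Actions Runner could be added the same way once its unit name is known. Note that this expression returns nothing if the unit does not exist on any scraped host, so a misspelled unit name fails silently.
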
## Services with Partial Monitoring

### Jellyfin (`jelly01`)

- **Current state:** Has scrape targets (port 8096), metrics are being collected, but zero alert rules
- **Metrics available:** 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
  - `microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
  - `microsoft_aspnetcore_hosting_current_requests` - in-flight requests
  - `process_working_set_bytes` - memory usage (~256 MB currently)
  - `dotnet_gc_pause_ratio` - GC pressure
  - `up{job="jellyfin"}` - basic availability
- **Recommendation:** Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on sustained `failed_requests` rate increase.

### NATS (`nats1`)

- **Current state:** Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics
- **Metrics available:** NATS has a built-in `/metrics` endpoint exposing connection counts, message throughput, JetStream consumer lag, and more
- **Recommendation:** Add a scrape target for the NATS metrics endpoint. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.

### DNS - Unbound (`ns1`, `ns2`)

- **Current state:** Has `unbound_down` alert (systemd unit state), but no DNS query metrics
- **Available in nixpkgs:** `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
- **Recommendation:** Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.

### DNS - NSD (`ns1`, `ns2`)

- **Current state:** Has `nsd_down` alert (systemd unit state), no NSD-specific metrics
- **Available in nixpkgs:** Nothing. No exporter package or NixOS module. Community `nsd_exporter` exists but is not packaged.
- **Recommendation:** The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.

## Existing Monitoring (for reference)

These services have adequate alerting and/or scrape targets:

| Service | Scrape Targets | Alert Rules |
|---|---|---|
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |

## Per-Service Resource Metrics (systemd-exporter)

### Current State

No per-service CPU, memory, or IO metrics are collected. The existing node-exporter systemd collector only provides unit state (active/inactive/failed), socket stats, and timer triggers. While systemd tracks per-unit resource usage via cgroups internally (visible in `systemctl status` and `systemd-cgtop`), this data is not exported to Prometheus.

### Available Solution

The `prometheus-systemd-exporter` package (v0.7.0) is available in nixpkgs with a ready-made NixOS module:

```nix
services.prometheus.exporters.systemd.enable = true;
```

**Options:** `enable`, `port`, `extraFlags`, `user`, `group`

This exporter reads cgroup data and exposes per-unit metrics including:

- CPU seconds consumed per service
- Memory usage per service
- Task/process counts per service
- Restart counts
- IO usage

### Recommendation

Enable on all hosts via the shared `system/` config (same pattern as node-exporter). Add a corresponding scrape job on monitoring01. This would give visibility into resource consumption per service across the fleet, useful for capacity planning and diagnosing noisy-neighbor issues on shared hosts.

## Suggested Priority

1. **PostgreSQL** - Critical infrastructure, easy to add with existing nixpkgs module
2. **Authelia + LLDAP** - Auth outage affects all proxied services
3. **Unbound exporter** - Ready-to-go NixOS module, just needs enabling
4. **Jellyfin alerts** - Metrics already collected, just needs alert rules
5. **NATS metrics** - Built-in endpoint, just needs a scrape target
6. **Vault/OpenBao** - Native telemetry support
7. **Actions Runner** - Lower priority, basic systemd alert sufficient

## Node-Exporter Targets Currently Down

Noted during audit -- these node-exporter targets are failing:

- `nixos-test1.home.2rjus.net:9100` - no route to host
- `media1.home.2rjus.net:9100` - no route to host
- `ns3.home.2rjus.net:9100` - no route to host
- `ns4.home.2rjus.net:9100` - no route to host

These may be decommissioned or powered-off hosts that should be removed from the scrape config.
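
If the four hosts turn out to be decommissioned, one way to drop them is to filter them out where the target list is built. The following is a minimal sketch, assuming the node-exporter job uses `static_configs` and that its targets are derived from a list of host names. The names `allHosts`, `decommissioned`, and `activeHosts`, the job name `node-exporter`, and the assumption that every host lives under `home.2rjus.net` are hypothetical; the host list contains only hosts named in this audit, not the full fleet.

```nix
# Sketch of a NixOS module for the Prometheus host.
{ lib, ... }:
let
  # Hypothetical inventory: only hosts mentioned in this audit.
  allHosts = [
    "pgdb1" "auth01" "vault01" "jelly01" "nats1" "ns1" "ns2" "monitoring01"
    "nixos-test1" "media1" "ns3" "ns4"
  ];

  # Targets reported as "no route to host" during the audit.
  decommissioned = [ "nixos-test1" "media1" "ns3" "ns4" ];

  # lib.subtractLists removes every element of its first argument from its second.
  activeHosts = lib.subtractLists decommissioned allHosts;
in
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "node-exporter"; # assumed job name
      static_configs = [
        { targets = map (host: "${host}.home.2rjus.net:9100") activeHosts; }
      ];
    }
  ];
}
```

Keeping the removed hosts in an explicit list, rather than deleting them from the inventory, makes it easy to restore a target if a host is only powered off temporarily.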