docs: add monitoring gaps audit plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 20m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
101
docs/plans/monitoring-gaps.md
Normal file
@@ -0,0 +1,101 @@
# Monitoring Gaps Audit
## Overview
Audit of services running in the homelab that lack monitoring coverage: they are missing Prometheus scrape targets, alerting rules, or both.
## Services with No Monitoring
### PostgreSQL (`pgdb1`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** A database outage would go completely unnoticed by Prometheus
- **Recommendation:** Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.
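
The exporter side of this recommendation could look roughly like the NixOS module below. It is a sketch, not the repository's actual configuration: `runAsLocalSuperUser` and `openFirewall` are assumptions about how `pgdb1` should be set up, and the exporter's default port (9187) still has to be added as a scrape target on the Prometheus host.

```nix
# Hypothetical module for pgdb1; adjust to the repository's layout.
{ ... }:
{
  services.prometheus.exporters.postgres = {
    enable = true;
    # Assumption: connect over the local socket as the postgres superuser
    # instead of managing a dedicated monitoring role.
    runAsLocalSuperUser = true;
    # Assumption: the Prometheus host reaches pgdb1 directly on the LAN.
    openFirewall = true;
  };
}
```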
### Authelia (`auth01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** The authentication gateway being down blocks access to all proxied services
- **Recommendation:** Authelia exposes Prometheus metrics natively at `/metrics`. Add a scrape target and at minimum an `authelia_down` systemd unit state alert.
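
A scrape target for Authelia might be declared on the Prometheus host along these lines. The host name and port are assumptions (9959 is used here as the port of Authelia's telemetry listener, which has to be enabled in Authelia's own configuration); substitute whatever address the listener is actually bound to.

```nix
# Hypothetical addition to the Prometheus host's configuration.
{ ... }:
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "authelia";
      metrics_path = "/metrics";
      static_configs = [
        {
          # Assumption: host name and port of the Authelia metrics listener.
          targets = [ "auth01.home.2rjus.net:9959" ];
        }
      ];
    }
  ];
}
```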
### LLDAP (`auth01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** LLDAP is a dependency of Authelia: if LLDAP is down, authentication breaks even if Authelia is running
- **Recommendation:** Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
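
As a sketch, an `lldap_down` rule based on node-exporter's systemd collector could be added to the Prometheus host as shown below. The unit name `lldap.service`, the 5-minute delay, and the `severity` label are assumptions; the rule is serialized with `builtins.toJSON`, which works because JSON is valid YAML.

```nix
# Hypothetical alert rule; assumes node-exporter on auth01 runs with the
# systemd collector enabled.
{ ... }:
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "lldap";
          rules = [
            {
              alert = "lldap_down";
              # Fires when the unit is not in the "active" state.
              expr = ''node_systemd_unit_state{name="lldap.service", state="active"} == 0'';
              for = "5m";
              labels.severity = "critical";
              annotations.summary = "LLDAP is not running on {{ $labels.instance }}";
            }
          ];
        }
      ];
    })
  ];
}
```

The same pattern would apply to the other systemd unit state alerts suggested in this audit, with the unit name swapped.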
### Vault / OpenBao (`vault01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** Secrets management service failures go undetected
- **Recommendation:** OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.
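
A possible scrape target for the telemetry endpoint is sketched below. It assumes the Vault-compatible metrics API path (`/v1/sys/metrics` with `format=prometheus`), the default API port 8200, and HTTPS; the host name is also an assumption. The endpoint normally requires a token unless unauthenticated metrics access is enabled on the listener, which this sketch does not configure.

```nix
# Hypothetical addition to the Prometheus host's configuration.
{ ... }:
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "openbao";
      scheme = "https";
      # Assumption: Vault-compatible metrics endpoint.
      metrics_path = "/v1/sys/metrics";
      params.format = [ "prometheus" ];
      static_configs = [
        { targets = [ "vault01.home.2rjus.net:8200" ]; }
      ];
    }
  ];
}
```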
### Gitea Actions Runner
- **Current state:** No scrape targets, no alert rules
- **Risk:** CI/CD failures go undetected
- **Recommendation:** Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.
## Services with Partial Monitoring
### Jellyfin (`jelly01`)
- **Current state:** Scrape targets exist (port 8096) and metrics are being collected, but there are no alert rules
- **Metrics available:** 184 metrics, all at the .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). The most useful ones:
  - `microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
  - `microsoft_aspnetcore_hosting_current_requests` - in-flight requests
  - `process_working_set_bytes` - memory usage (~256 MB currently)
  - `dotnet_gc_pause_ratio` - GC pressure
  - `up{job="jellyfin"}` - basic availability
- **Recommendation:** Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on a sustained increase in the `microsoft_aspnetcore_hosting_failed_requests` rate.
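
Since the `up{job="jellyfin"}` series is already being collected, the availability alert needs only a rule. A minimal sketch follows; the 5-minute delay and the `severity` label are assumptions rather than values taken from the existing rule set.

```nix
# Hypothetical alert rule for the Prometheus host.
{ ... }:
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "jellyfin";
          rules = [
            {
              alert = "jellyfin_down";
              # Fires when the existing jellyfin scrape job cannot reach its target.
              expr = ''up{job="jellyfin"} == 0'';
              for = "5m";
              labels.severity = "warning";
              annotations.summary = "Jellyfin scrape target {{ $labels.instance }} is down";
            }
          ];
        }
      ];
    })
  ];
}
```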
### NATS (`nats1`)
- **Current state:** Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics are collected
- **Metrics available:** The NATS server exposes connection counts, message throughput, JetStream consumer lag, and more through its HTTP monitoring endpoints (such as `/varz`, `/connz`, and `/jsz`). These return JSON rather than the Prometheus exposition format, so scraping them requires `prometheus-nats-exporter`
- **Recommendation:** Run `prometheus-nats-exporter` against the NATS monitoring port and add a scrape target for it. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.
### DNS - Unbound (`ns1`, `ns2`)
- **Current state:** Has `unbound_down` alert (systemd unit state), but no DNS query metrics
- **Available in nixpkgs:** `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
- **Recommendation:** Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
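
Enabling the exporter on `ns1`/`ns2` might look like the sketch below. The exporter reads statistics through Unbound's remote-control interface, so this assumes a local control socket is acceptable; the socket path, the `unbound.host` value, and `openFirewall` are assumptions and should be checked against the option set in the pinned nixpkgs revision.

```nix
# Hypothetical module shared by ns1 and ns2.
{ ... }:
{
  # Assumption: expose unbound's control interface on a local socket.
  services.unbound.localControlSocketPath = "/run/unbound/unbound.ctl";

  services.prometheus.exporters.unbound = {
    enable = true;
    # Assumption: point the exporter at that socket instead of TCP + TLS.
    unbound.host = "unix:///run/unbound/unbound.ctl";
    # Assumption: the Prometheus host scrapes ns1/ns2 directly.
    openFirewall = true;
  };
}
```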
### DNS - NSD (`ns1`, `ns2`)
- **Current state:** Has `nsd_down` alert (systemd unit state), no NSD-specific metrics
- **Available in nixpkgs:** Nothing. No exporter package or NixOS module. Community `nsd_exporter` exists but is not packaged.
- **Recommendation:** The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.
## Existing Monitoring (for reference)
These services have adequate alerting and/or scrape targets:

| Service | Scrape Targets | Alert Rules |
|---|---|---|
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |
## Suggested Priority
1. **PostgreSQL** - Critical infrastructure, easy to add with existing nixpkgs module
2. **Authelia + LLDAP** - Auth outage affects all proxied services
3. **Unbound exporter** - Ready-to-go NixOS module, just needs enabling
4. **Jellyfin alerts** - Metrics already collected, just needs alert rules
5. **NATS metrics** - Monitoring endpoint is built in, just needs an exporter and a scrape target
6. **Vault/OpenBao** - Native telemetry support
7. **Actions Runner** - Lower priority, basic systemd alert sufficient
## Node-Exporter Targets Currently Down
These node-exporter targets were failing at the time of the audit:

- `nixos-test1.home.2rjus.net:9100` - no route to host
- `media1.home.2rjus.net:9100` - no route to host
- `ns3.home.2rjus.net:9100` - no route to host
- `ns4.home.2rjus.net:9100` - no route to host

These may be decommissioned or powered-off hosts that should be removed from the scrape config.