nixos-servers/docs/plans/monitoring-gaps.md
Torjus Håkestad fe80ec3576
docs: add monitoring gaps audit plan
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:19:20 +01:00

Monitoring Gaps Audit

Overview

Audit of homelab services that lack monitoring coverage: missing Prometheus scrape targets, missing alerting rules, or both.

Services with No Monitoring

PostgreSQL (pgdb1)

  • Current state: No scrape targets, no alert rules
  • Risk: A database outage would go completely unnoticed by Prometheus
  • Recommendation: Enable services.prometheus.exporters.postgres (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least postgres_down (systemd unit state) and connection exhaustion (active connections approaching max_connections).
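
A minimal sketch of the recommendation above, split into the exporter on pgdb1 and the scrape job plus alerts on the Prometheus host. The attribute names `pgdb1` and `monitoring`, the FQDN `pgdb1.home.2rjus.net`, the 80% threshold, and the `for` durations are assumptions for illustration. It also assumes the node-exporter `systemd` collector is enabled, as the existing `*_down` alerts suggest.

```nix
{
  # Hypothetical module for pgdb1: run the exporter next to PostgreSQL.
  pgdb1 = { ... }: {
    services.prometheus.exporters.postgres = {
      enable = true;
      # Connect over the local socket as the postgres superuser.
      runAsLocalSuperUser = true;
      openFirewall = true; # exporter listens on 9187 by default
    };
  };

  # Hypothetical module for the Prometheus host.
  monitoring = { pkgs, ... }: {
    services.prometheus.scrapeConfigs = [
      {
        job_name = "postgres";
        # Assumed FQDN, following the home.2rjus.net pattern used elsewhere.
        static_configs = [ { targets = [ "pgdb1.home.2rjus.net:9187" ]; } ];
      }
    ];

    # JSON is a subset of YAML, so builtins.toJSON yields a valid rule file.
    services.prometheus.ruleFiles = [
      (pkgs.writeText "postgres-alerts.yml" (builtins.toJSON {
        groups = [
          {
            name = "postgres";
            rules = [
              {
                alert = "postgres_down";
                expr = ''node_systemd_unit_state{name="postgresql.service",state="active"} == 0'';
                for = "5m";
                labels.severity = "critical";
                annotations.summary = "PostgreSQL is not active on {{ $labels.instance }}";
              }
              {
                alert = "postgres_connections_high";
                # Placeholder threshold: more than 80% of max_connections in use.
                expr = "sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8";
                for = "10m";
                labels.severity = "warning";
                annotations.summary = "PostgreSQL connections above 80% on {{ $labels.instance }}";
              }
            ];
          }
        ];
      }))
    ];
  };
}
```

If the repository already keeps its alerts in `services.prometheus.rules` or a shared rule file, the two rules would more naturally be added there.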

Authelia (auth01)

  • Current state: No scrape targets, no alert rules
  • Risk: The authentication gateway being down blocks access to all proxied services
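  • Recommendation: Authelia can expose Prometheus metrics natively, but the telemetry listener is disabled by default and runs on a separate address from the main server (port 9959, path /metrics by default). Enable it, add a scrape target, and at minimum add an authelia_down systemd unit state alert.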
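
A hedged sketch of how the metrics listener and scrape target could look. The instance name `main`, the module names `auth01` and `monitoring`, and the FQDN are assumptions; the `telemetry.metrics` keys mirror Authelia's own configuration schema and should be checked against the pinned nixpkgs module.

```nix
{
  # Hypothetical module for auth01.
  auth01 = { ... }: {
    # "main" is an assumed instance name.
    services.authelia.instances.main.settings.telemetry.metrics = {
      enabled = true;
      # Listen on all interfaces so the Prometheus host can reach it.
      address = "tcp://0.0.0.0:9959";
    };
    networking.firewall.allowedTCPPorts = [ 9959 ];
  };

  # Hypothetical module for the Prometheus host.
  monitoring = { ... }: {
    services.prometheus.scrapeConfigs = [
      {
        job_name = "authelia";
        # metrics_path defaults to /metrics, which matches Authelia's default.
        static_configs = [ { targets = [ "auth01.home.2rjus.net:9959" ]; } ];
      }
    ];
  };
}
```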

LLDAP (auth01)

  • Current state: No scrape targets, no alert rules
  • Risk: LLDAP is a dependency of Authelia -- if LLDAP is down, authentication breaks even if Authelia is running
  • Recommendation: Add an lldap_down systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
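
A sketch of a systemd unit state alert for LLDAP, assuming the unit is named `lldap.service` (the name used by the nixpkgs `services.lldap` module) and that alerts can be added through `services.prometheus.ruleFiles`. The module names and the `for` duration are illustrative.

```nix
{
  # Hypothetical module for auth01: the systemd collector is off by default.
  auth01 = { ... }: {
    services.prometheus.exporters.node.enabledCollectors = [ "systemd" ];
  };

  # Hypothetical module for the Prometheus host.
  monitoring = { pkgs, ... }: {
    services.prometheus.ruleFiles = [
      (pkgs.writeText "lldap-alerts.yml" (builtins.toJSON {
        groups = [
          {
            name = "lldap";
            rules = [
              {
                alert = "lldap_down";
                expr = ''node_systemd_unit_state{name="lldap.service",state="active"} == 0'';
                for = "5m";
                labels.severity = "critical";
                annotations.summary = "LLDAP is not active on {{ $labels.instance }}";
              }
            ];
          }
        ];
      }))
    ];
  };
}
```

The same pattern would apply to the other `*_down` alerts proposed in this audit by swapping the unit name.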

Vault / OpenBao (vault01)

  • Current state: No scrape targets, no alert rules
  • Risk: Secrets management service failures go undetected
  • Recommendation: OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for vault_down (systemd unit) and seal status.
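
A rough sketch of the telemetry and scrape configuration, assuming the host uses the nixpkgs `services.openbao` module, listens with TLS on the default port 8200, and is reachable as `vault01.home.2rjus.net`. All three are assumptions. Note that the metrics endpoint normally requires a token unless unauthenticated metrics access is enabled on the listener, which is not shown here.

```nix
{
  # Hypothetical module for vault01.
  vault01 = { ... }: {
    services.openbao.settings.telemetry = {
      # Keep metrics in memory long enough for Prometheus to scrape them.
      prometheus_retention_time = "30s";
      disable_hostname = true;
    };
  };

  # Hypothetical module for the Prometheus host.
  monitoring = { ... }: {
    services.prometheus.scrapeConfigs = [
      {
        job_name = "openbao";
        scheme = "https"; # assumption: the listener uses TLS
        metrics_path = "/v1/sys/metrics";
        params.format = [ "prometheus" ];
        static_configs = [ { targets = [ "vault01.home.2rjus.net:8200" ]; } ];
      }
    ];
  };
}
```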

Gitea Actions Runner

  • Current state: No scrape targets, no alert rules
  • Risk: CI/CD failures go undetected
  • Recommendation: Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.

Services with Partial Monitoring

Jellyfin (jelly01)

  • Current state: Has scrape targets (port 8096), metrics are being collected, but zero alert rules
  • Metrics available: 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
    • microsoft_aspnetcore_hosting_failed_requests - rate of HTTP errors
    • microsoft_aspnetcore_hosting_current_requests - in-flight requests
    • process_working_set_bytes - memory usage (~256 MB currently)
    • dotnet_gc_pause_ratio - GC pressure
    • up{job="jellyfin"} - basic availability
  • Recommendation: Add a jellyfin_down alert using either up{job="jellyfin"} == 0 or systemd unit state. Consider alerting on sustained failed_requests rate increase.
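
Since the scrape job already exists, the availability alert could be as small as the sketch below. It assumes alerts can be added through `services.prometheus.ruleFiles`; the `for` duration and severity label are placeholders.

```nix
{ pkgs, ... }:
{
  services.prometheus.ruleFiles = [
    (pkgs.writeText "jellyfin-alerts.yml" (builtins.toJSON {
      groups = [
        {
          name = "jellyfin";
          rules = [
            {
              alert = "jellyfin_down";
              # Fires when the existing scrape job cannot reach Jellyfin.
              expr = ''up{job="jellyfin"} == 0'';
              for = "5m";
              labels.severity = "warning";
              annotations.summary = "Jellyfin scrape target {{ $labels.instance }} is down";
            }
          ];
        }
      ];
    }))
  ];
}
```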

NATS (nats1)

  • Current state: Has a nats_down alert (systemd unit state via node-exporter), but no NATS-specific metrics
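  • Metrics available: The NATS server exposes monitoring data (connection counts, message throughput, JetStream state) as JSON on its HTTP monitoring port through endpoints such as /varz, /connz, and /jsz, not as a native Prometheus /metrics endpoint. The prometheus-nats-exporter converts this data into Prometheus metrics.
  • Recommendation: Enable the NATS monitoring port, run prometheus-nats-exporter against it, and add a scrape target for the exporter. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.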
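
One way to get NATS data into Prometheus on NixOS is the `prometheus-nats-exporter` module in nixpkgs, which reads the server's HTTP monitoring port. The sketch below assumes monitoring port 8222, the exporter's default port 7777, and the FQDN `nats1.home.2rjus.net`; the module names are hypothetical, and the `extraFlags` selection should be checked against the exporter version in use.

```nix
{
  # Hypothetical module for nats1.
  nats1 = { ... }: {
    # Enable the HTTP monitoring port that the exporter reads from.
    services.nats.settings.http_port = 8222;

    services.prometheus.exporters.nats = {
      enable = true;
      url = "http://127.0.0.1:8222";
      # Select which monitoring endpoints to export.
      extraFlags = [ "-varz" "-connz" "-jsz=all" ];
      openFirewall = true; # exporter listens on 7777 by default
    };
  };

  # Hypothetical module for the Prometheus host.
  monitoring = { ... }: {
    services.prometheus.scrapeConfigs = [
      {
        job_name = "nats";
        static_configs = [ { targets = [ "nats1.home.2rjus.net:7777" ]; } ];
      }
    ];
  };
}
```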

DNS - Unbound (ns1, ns2)

  • Current state: Has unbound_down alert (systemd unit state), but no DNS query metrics
  • Available in nixpkgs: services.prometheus.exporters.unbound.enable (package: prometheus-unbound-exporter v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
  • Recommendation: Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
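
A minimal sketch of enabling the exporter and scraping it. The exporter talks to Unbound's remote-control interface, so that interface has to be enabled and reachable by the exporter; the exact wiring options differ between nixpkgs releases and are left out here. The module names, the FQDNs for ns1/ns2, and the default port 9167 are assumptions to verify.

```nix
{
  # Hypothetical module shared by ns1 and ns2.
  nameserver = { ... }: {
    services.prometheus.exporters.unbound = {
      enable = true;
      openFirewall = true; # exporter listens on 9167 by default
    };
  };

  # Hypothetical module for the Prometheus host.
  monitoring = { ... }: {
    services.prometheus.scrapeConfigs = [
      {
        job_name = "unbound";
        static_configs = [
          {
            targets = [
              "ns1.home.2rjus.net:9167"
              "ns2.home.2rjus.net:9167"
            ];
          }
        ];
      }
    ];
  };
}
```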

DNS - NSD (ns1, ns2)

  • Current state: Has nsd_down alert (systemd unit state), no NSD-specific metrics
  • Available in nixpkgs: Nothing. No exporter package or NixOS module. Community nsd_exporter exists but is not packaged.
  • Recommendation: The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.

Existing Monitoring (for reference)

These services have adequate alerting and/or scrape targets:

| Service | Scrape Targets | Alert Rules |
|---|---|---|
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |

Suggested Priority

  1. PostgreSQL - Critical infrastructure, easy to add with existing nixpkgs module
  2. Authelia + LLDAP - Auth outage affects all proxied services
  3. Unbound exporter - Ready-to-go NixOS module, just needs enabling
  4. Jellyfin alerts - Metrics already collected, just needs alert rules
  5. NATS metrics - Needs the monitoring port enabled, plus prometheus-nats-exporter and a scrape target
  6. Vault/OpenBao - Native telemetry support
  7. Actions Runner - Lower priority, basic systemd alert sufficient

Node-Exporter Targets Currently Down

Noted during audit -- these node-exporter targets are failing:

  • nixos-test1.home.2rjus.net:9100 - no route to host
  • media1.home.2rjus.net:9100 - no route to host
  • ns3.home.2rjus.net:9100 - no route to host
  • ns4.home.2rjus.net:9100 - no route to host

These may be decommissioned or powered-off hosts that should be removed from the scrape config.