Monitoring Gaps Audit

Overview

Audit of services running in the homelab that lack monitoring coverage: they are missing Prometheus scrape targets, alerting rules, or both.

Services with No Monitoring

PostgreSQL (pgdb1)

  • Current state: No scrape targets, no alert rules
  • Risk: A database outage would go completely unnoticed by Prometheus
  • Recommendation: Enable services.prometheus.exporters.postgres (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least postgres_down (systemd unit state) and connection pool exhaustion.
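
A minimal sketch of both halves, assuming the nixpkgs module defaults (exporter port 9187) and local-socket access as the postgres superuser; the hostname follows the .home.2rjus.net pattern from the node-exporter targets and is illustrative:

# On pgdb1: enable the exporter (sketch; connects over the local socket)
services.prometheus.exporters.postgres = {
  enable = true;
  runAsLocalSuperUser = true;
};

# On monitoring01: scrape it (target hostname assumed)
services.prometheus.scrapeConfigs = [
  {
    job_name = "postgres";
    static_configs = [ { targets = [ "pgdb1.home.2rjus.net:9187" ]; } ];
  }
];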

Authelia (auth01)

  • Current state: No scrape targets, no alert rules
  • Risk: An authentication gateway outage blocks access to all proxied services
  • Recommendation: Authelia exposes Prometheus metrics natively at /metrics. Add a scrape target and at minimum an authelia_down systemd unit state alert.
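
A hedged sketch, assuming the services.authelia NixOS module with a hypothetical instance name "main" and Authelia's default metrics port (9959); the target hostname is illustrative:

# On auth01: turn on native telemetry metrics (instance name is illustrative)
services.authelia.instances.main.settings.telemetry.metrics = {
  enabled = true;
  address = "tcp://0.0.0.0:9959";
};

# On monitoring01: scrape target
services.prometheus.scrapeConfigs = [
  {
    job_name = "authelia";
    static_configs = [ { targets = [ "auth01.home.2rjus.net:9959" ]; } ];
  }
];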

LLDAP (auth01)

  • Current state: No scrape targets, no alert rules
  • Risk: LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
  • Recommendation: Add an lldap_down systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
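
Since node-exporter's systemd collector already exports unit state, an lldap_down rule could look roughly like this (a sketch; assumes the collector is enabled and the unit is named lldap.service, with illustrative for/severity values):

services.prometheus.rules = [
  ''
    groups:
      - name: auth
        rules:
          - alert: lldap_down
            expr: node_systemd_unit_state{name="lldap.service", state="active"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "LLDAP is not running on {{ $labels.instance }}"
  ''
];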

Vault / OpenBao (vault01)

  • Current state: No scrape targets, no alert rules
  • Risk: Secrets management service failures go undetected
  • Recommendation: OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for vault_down (systemd unit) and seal status.
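
A sketch of both sides, assuming the services.openbao module accepts freeform settings and that a token with read access to sys/metrics is provisioned separately; the secret path, scheme, port, and hostname below are illustrative:

# On vault01: emit Prometheus-format telemetry
services.openbao.settings.telemetry = {
  prometheus_retention_time = "1h";
  disable_hostname = true;
};

# On monitoring01: scrape /v1/sys/metrics with token auth (token path assumed)
services.prometheus.scrapeConfigs = [
  {
    job_name = "openbao";
    metrics_path = "/v1/sys/metrics";
    params.format = [ "prometheus" ];
    scheme = "https";
    bearer_token_file = "/run/secrets/prometheus-openbao-token";
    static_configs = [ { targets = [ "vault01.home.2rjus.net:8200" ]; } ];
  }
];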

Gitea Actions Runner

  • Current state: No scrape targets, no alert rules
  • Risk: CI/CD failures go undetected
  • Recommendation: Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.

Services with Partial Monitoring

Jellyfin (jelly01)

  • Current state: Has scrape targets (port 8096) and metrics are being collected, but there are no alert rules
  • Metrics available: 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
    • microsoft_aspnetcore_hosting_failed_requests - rate of HTTP errors
    • microsoft_aspnetcore_hosting_current_requests - in-flight requests
    • process_working_set_bytes - memory usage (~256 MB currently)
    • dotnet_gc_pause_ratio - GC pressure
    • up{job="jellyfin"} - basic availability
  • Recommendation: Add a jellyfin_down alert using either up{job="jellyfin"} == 0 or systemd unit state. Consider alerting on sustained failed_requests rate increase.
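
Since the metrics are already being scraped, this is only a rule change; a sketch using the existing job label (thresholds and durations are illustrative):

services.prometheus.rules = [
  ''
    groups:
      - name: jellyfin
        rules:
          - alert: jellyfin_down
            expr: up{job="jellyfin"} == 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Jellyfin metrics endpoint is down"
          - alert: jellyfin_failed_requests
            expr: rate(microsoft_aspnetcore_hosting_failed_requests[10m]) > 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Sustained HTTP error rate on Jellyfin"
  ''
];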

NATS (nats1)

  • Current state: Has a nats_down alert (systemd unit state via node-exporter), but no NATS-specific metrics
  • Metrics available: NATS has a built-in /metrics endpoint exposing connection counts, message throughput, JetStream consumer lag, and more
  • Recommendation: Add a scrape target for the NATS metrics endpoint. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.
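
A sketch, assuming the services.nats module's freeform settings, the conventional monitoring port 8222, and that the endpoint serves Prometheus-format metrics at /metrics as described above; the target hostname is illustrative:

# On nats1: enable the HTTP monitoring endpoint
services.nats.settings.http_port = 8222;

# On monitoring01: scrape it
services.prometheus.scrapeConfigs = [
  {
    job_name = "nats";
    static_configs = [ { targets = [ "nats1.home.2rjus.net:8222" ]; } ];
  }
];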

DNS - Unbound (ns1, ns2)

  • Current state: Has an unbound_down alert (systemd unit state), but no DNS query metrics
  • Available in nixpkgs: services.prometheus.exporters.unbound.enable (package: prometheus-unbound-exporter v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), and upstream latency.
  • Recommendation: Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
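
A sketch for ns1/ns2, assuming the exporter's default port (9167) and a local control socket at the conventional path; a scrape job on monitoring01 targeting port 9167 on both hosts would follow the same pattern as the examples above:

# Expose a control socket so the exporter can query unbound statistics
services.unbound.localControlSocketPath = "/run/unbound/unbound.ctl";

# Enable the exporter; depending on nixpkgs version it may need to be pointed
# at the control socket explicitly.
services.prometheus.exporters.unbound.enable = true;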

DNS - NSD (ns1, ns2)

  • Current state: Has an nsd_down alert (systemd unit state), but no NSD-specific metrics
  • Available in nixpkgs: Nothing. No exporter package or NixOS module. A community nsd_exporter exists but is not packaged.
  • Recommendation: The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.

Existing Monitoring (for reference)

These services have adequate alerting and/or scrape targets:

| Service | Scrape Targets | Alert Rules |
| --- | --- | --- |
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |

Per-Service Resource Metrics (systemd-exporter)

Current State

No per-service CPU, memory, or IO metrics are collected. The existing node-exporter systemd collector only provides unit state (active/inactive/failed), socket stats, and timer triggers. While systemd tracks per-unit resource usage via cgroups internally (visible in systemctl status and systemd-cgtop), this data is not exported to Prometheus.

Available Solution

The prometheus-systemd-exporter package (v0.7.0) is available in nixpkgs with a ready-made NixOS module:

services.prometheus.exporters.systemd.enable = true;

Options: enable, port, extraFlags, user, group

This exporter reads cgroup data and exposes per-unit metrics including:

  • CPU seconds consumed per service
  • Memory usage per service
  • Task/process counts per service
  • Restart counts
  • IO usage

Recommendation

Enable on all hosts via the shared system/ config (same pattern as node-exporter). Add a corresponding scrape job on monitoring01. This would give visibility into resource consumption per service across the fleet, useful for capacity planning and diagnosing noisy-neighbor issues on shared hosts.
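
A sketch of the scrape job on monitoring01, assuming the exporter's default port (9558); the target list below is an illustrative subset and would in practice mirror the node-exporter host list:

services.prometheus.scrapeConfigs = [
  {
    job_name = "systemd";
    static_configs = [
      {
        # Illustrative subset of hosts
        targets = [
          "monitoring01.home.2rjus.net:9558"
          "pgdb1.home.2rjus.net:9558"
          "auth01.home.2rjus.net:9558"
        ];
      }
    ];
  }
];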

Suggested Priority

  1. PostgreSQL - Critical infrastructure, easy to add with existing nixpkgs module
  2. Authelia + LLDAP - Auth outage affects all proxied services
  3. Unbound exporter - Ready-to-go NixOS module, just needs enabling
  4. Jellyfin alerts - Metrics already collected, just needs alert rules
  5. NATS metrics - Built-in endpoint, just needs a scrape target
  6. Vault/OpenBao - Native telemetry support
  7. Actions Runner - Lower priority, basic systemd alert sufficient

Node-Exporter Targets Currently Down

Noted during the audit -- these node-exporter targets are failing:

  • nixos-test1.home.2rjus.net:9100 - no route to host
  • media1.home.2rjus.net:9100 - no route to host
  • ns3.home.2rjus.net:9100 - no route to host
  • ns4.home.2rjus.net:9100 - no route to host

These may be decommissioned or powered-off hosts that should be removed from the scrape config.