Monitoring Gaps Audit

Overview

Audit of services running in the homelab that lack monitoring coverage: they are missing Prometheus scrape targets, alerting rules, or both.

Services with No Monitoring

PostgreSQL (pgdb1)

  • Current state: No scrape targets, no alert rules
  • Risk: A database outage would go completely unnoticed by Prometheus
  • Recommendation: Enable services.prometheus.exporters.postgres (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least postgres_down (systemd unit state) and connection pool exhaustion.
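
A minimal sketch of both halves, assuming the nixpkgs module defaults (exporter port 9187) and local-socket access as the postgres superuser; the hostname follows the .home.2rjus.net pattern from the node-exporter targets and is illustrative:

# On pgdb1: enable the exporter (sketch; connects over the local socket)
services.prometheus.exporters.postgres = {
  enable = true;
  runAsLocalSuperUser = true;
};

# On monitoring01: scrape it (target hostname assumed)
services.prometheus.scrapeConfigs = [
  {
    job_name = "postgres";
    static_configs = [ { targets = [ "pgdb1.home.2rjus.net:9187" ]; } ];
  }
];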

Authelia (auth01)

  • Current state: No scrape targets, no alert rules
  • Risk: An authentication gateway outage blocks access to all proxied services
  • Recommendation: Authelia exposes Prometheus metrics natively at /metrics. Add a scrape target and at minimum an authelia_down systemd unit state alert.
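
A hedged sketch, assuming the services.authelia NixOS module with a hypothetical instance name "main" and Authelia's default metrics port (9959); the target hostname is illustrative:

# On auth01: turn on native telemetry metrics (instance name is illustrative)
services.authelia.instances.main.settings.telemetry.metrics = {
  enabled = true;
  address = "tcp://0.0.0.0:9959";
};

# On monitoring01: scrape target
services.prometheus.scrapeConfigs = [
  {
    job_name = "authelia";
    static_configs = [ { targets = [ "auth01.home.2rjus.net:9959" ]; } ];
  }
];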

LLDAP (auth01)

  • Current state: No scrape targets, no alert rules
  • Risk: LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
  • Recommendation: Add an lldap_down systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
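
Since node-exporter's systemd collector already exports unit state, an lldap_down rule could look roughly like this (a sketch; assumes the collector is enabled and the unit is named lldap.service, with illustrative for/severity values):

services.prometheus.rules = [
  ''
    groups:
      - name: auth
        rules:
          - alert: lldap_down
            expr: node_systemd_unit_state{name="lldap.service", state="active"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "LLDAP is not running on {{ $labels.instance }}"
  ''
];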

Vault / OpenBao (vault01)

  • Current state: No scrape targets, no alert rules
  • Risk: Secrets management service failures go undetected
  • Recommendation: OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for vault_down (systemd unit) and seal status.
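
A sketch of both sides, assuming the services.openbao module accepts freeform settings and that a token with read access to sys/metrics is provisioned separately; the secret path, scheme, port, and hostname below are illustrative:

# On vault01: emit Prometheus-format telemetry
services.openbao.settings.telemetry = {
  prometheus_retention_time = "1h";
  disable_hostname = true;
};

# On monitoring01: scrape /v1/sys/metrics with token auth (token path assumed)
services.prometheus.scrapeConfigs = [
  {
    job_name = "openbao";
    metrics_path = "/v1/sys/metrics";
    params.format = [ "prometheus" ];
    scheme = "https";
    bearer_token_file = "/run/secrets/prometheus-openbao-token";
    static_configs = [ { targets = [ "vault01.home.2rjus.net:8200" ]; } ];
  }
];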

Gitea Actions Runner

  • Current state: No scrape targets, no alert rules
  • Risk: CI/CD failures go undetected
  • Recommendation: Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.

Services with Partial Monitoring

Jellyfin (jelly01)

  • Current state: Has scrape targets (port 8096) and metrics are being collected, but there are no alert rules
  • Metrics available: 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
    • microsoft_aspnetcore_hosting_failed_requests - rate of HTTP errors
    • microsoft_aspnetcore_hosting_current_requests - in-flight requests
    • process_working_set_bytes - memory usage (~256 MB currently)
    • dotnet_gc_pause_ratio - GC pressure
    • up{job="jellyfin"} - basic availability
  • Recommendation: Add a jellyfin_down alert using either up{job="jellyfin"} == 0 or systemd unit state. Consider alerting on sustained failed_requests rate increase.
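
Since the metrics are already being scraped, this is only a rule change; a sketch using the existing job label (thresholds and durations are illustrative):

services.prometheus.rules = [
  ''
    groups:
      - name: jellyfin
        rules:
          - alert: jellyfin_down
            expr: up{job="jellyfin"} == 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Jellyfin metrics endpoint is down"
          - alert: jellyfin_failed_requests
            expr: rate(microsoft_aspnetcore_hosting_failed_requests[10m]) > 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Sustained HTTP error rate on Jellyfin"
  ''
];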

NATS (nats1)

  • Current state: Has a nats_down alert (systemd unit state via node-exporter), but no NATS-specific metrics
  • Metrics available: NATS has a built-in /metrics endpoint exposing connection counts, message throughput, JetStream consumer lag, and more
  • Recommendation: Add a scrape target for the NATS metrics endpoint. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.
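
A sketch, assuming the services.nats module's freeform settings, the conventional monitoring port 8222, and that the endpoint serves Prometheus-format metrics at /metrics as described above; the target hostname is illustrative:

# On nats1: enable the HTTP monitoring endpoint
services.nats.settings.http_port = 8222;

# On monitoring01: scrape it
services.prometheus.scrapeConfigs = [
  {
    job_name = "nats";
    static_configs = [ { targets = [ "nats1.home.2rjus.net:8222" ]; } ];
  }
];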

DNS - Unbound (ns1, ns2)

  • Current state: Has an unbound_down alert (systemd unit state), but no DNS query metrics
  • Available in nixpkgs: services.prometheus.exporters.unbound.enable (package: prometheus-unbound-exporter v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), and upstream latency.
  • Recommendation: Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
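
A sketch for ns1/ns2, assuming the exporter's default port (9167) and a local control socket at the conventional path; a scrape job on monitoring01 targeting port 9167 on both hosts would follow the same pattern as the examples above:

# Expose a control socket so the exporter can query unbound statistics
services.unbound.localControlSocketPath = "/run/unbound/unbound.ctl";

# Enable the exporter; depending on nixpkgs version it may need to be pointed
# at the control socket explicitly.
services.prometheus.exporters.unbound.enable = true;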

DNS - NSD (ns1, ns2)

  • Current state: Has an nsd_down alert (systemd unit state), but no NSD-specific metrics
  • Available in nixpkgs: Nothing. No exporter package or NixOS module. A community nsd_exporter exists but is not packaged.
  • Recommendation: The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.

Existing Monitoring (for reference)

These services have adequate alerting and/or scrape targets:

| Service | Scrape Targets | Alert Rules |
| --- | --- | --- |
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |

Per-Service Resource Metrics (systemd-exporter)

Current State

No per-service CPU, memory, or IO metrics are collected. The existing node-exporter systemd collector only provides unit state (active/inactive/failed), socket stats, and timer triggers. While systemd tracks per-unit resource usage via cgroups internally (visible in systemctl status and systemd-cgtop), this data is not exported to Prometheus.

Available Solution

The prometheus-systemd-exporter package (v0.7.0) is available in nixpkgs with a ready-made NixOS module:

services.prometheus.exporters.systemd.enable = true;

Options: enable, port, extraFlags, user, group

This exporter reads cgroup data and exposes per-unit metrics including:

  • CPU seconds consumed per service
  • Memory usage per service
  • Task/process counts per service
  • Restart counts
  • IO usage

Recommendation

Enable on all hosts via the shared system/ config (same pattern as node-exporter). Add a corresponding scrape job on monitoring01. This would give visibility into resource consumption per service across the fleet, useful for capacity planning and diagnosing noisy-neighbor issues on shared hosts.
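
A sketch of the scrape job on monitoring01, assuming the exporter's default port (9558); the target list below is an illustrative subset and would in practice mirror the node-exporter host list:

services.prometheus.scrapeConfigs = [
  {
    job_name = "systemd";
    static_configs = [
      {
        # Illustrative subset of hosts
        targets = [
          "monitoring01.home.2rjus.net:9558"
          "pgdb1.home.2rjus.net:9558"
          "auth01.home.2rjus.net:9558"
        ];
      }
    ];
  }
];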

Suggested Priority

  1. PostgreSQL - Critical infrastructure, easy to add with existing nixpkgs module
  2. Authelia + LLDAP - Auth outage affects all proxied services
  3. Unbound exporter - Ready-to-go NixOS module, just needs enabling
  4. Jellyfin alerts - Metrics already collected, just needs alert rules
  5. NATS metrics - Built-in endpoint, just needs a scrape target
  6. Vault/OpenBao - Native telemetry support
  7. Actions Runner - Lower priority, basic systemd alert sufficient

Node-Exporter Targets Currently Down

Noted during the audit -- these node-exporter targets are failing:

  • nixos-test1.home.2rjus.net:9100 - no route to host
  • media1.home.2rjus.net:9100 - no route to host
  • ns3.home.2rjus.net:9100 - no route to host
  • ns4.home.2rjus.net:9100 - no route to host

These may be decommissioned or powered-off hosts that should be removed from the scrape config.