Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 03:19:20 +01:00

Monitoring Gaps Audit

Overview

This audit covers services running in the homelab that lack monitoring coverage: they are missing Prometheus scrape targets, alerting rules, or both.

Services with No Monitoring

PostgreSQL (`pgdb1`)

Current state: No scrape targets, no alert rules
Risk: A database outage would go completely unnoticed by Prometheus
Recommendation: Enable services.prometheus.exporters.postgres (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least postgres_down (systemd unit state) and connection pool exhaustion.
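
A minimal sketch of what this recommendation could look like, split into the database host and the Prometheus host. The FQDN `pgdb1.home.2rjus.net` is assumed from the domain used elsewhere in this audit, port 9187 is the exporter's default, and the 80% threshold and the use of `services.prometheus.rules` with `builtins.toJSON` are illustrative choices, not necessarily how this repository defines its rules.

```nix
# On pgdb1 (sketch): run the exporter next to the database.
{
  services.prometheus.exporters.postgres = {
    enable = true;
    runAsLocalSuperUser = true; # connect over the local socket as the postgres user
    openFirewall = true;        # allow the Prometheus host to reach the exporter port
  };
}
```

```nix
# On the Prometheus host (sketch): scrape target plus two alerts.
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "postgres";
      # Assumed hostname; 9187 is the exporter's default port.
      static_configs = [ { targets = [ "pgdb1.home.2rjus.net:9187" ]; } ];
    }
  ];

  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "postgres";
          rules = [
            {
              alert = "postgres_down";
              # Assumes the node-exporter systemd collector is enabled on pgdb1.
              expr = ''node_systemd_unit_state{name="postgresql.service", state="active"} == 0'';
              for = "5m";
              labels.severity = "critical";
            }
            {
              alert = "postgres_connections_high";
              # Placeholder threshold: more than 80% of max_connections in use.
              expr = ''sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8'';
              for = "10m";
              labels.severity = "warning";
            }
          ];
        }
      ];
    })
  ];
}
```

List-valued options such as `scrapeConfigs` and `rules` are merged across NixOS modules, so fragments like these can live in their own file without touching the existing definitions.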

Authelia (`auth01`)

Current state: No scrape targets, no alert rules
Risk: The authentication gateway being down blocks access to all proxied services
Recommendation: Authelia can expose Prometheus metrics natively at /metrics once its telemetry metrics listener is enabled. Add a scrape target and at minimum an authelia_down systemd unit state alert.
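
As a rough sketch, the Authelia side and the scrape target might look like the following. The instance name `main`, the FQDN `auth01.home.2rjus.net`, and binding the metrics listener to all interfaces are assumptions; 9959 is Authelia's default metrics port.

```nix
# On auth01 (sketch): turn on the metrics listener.
{
  services.authelia.instances.main.settings.telemetry.metrics = {
    enabled = true;
    address = "tcp://0.0.0.0:9959"; # assumed bind address; restrict as appropriate
  };
  networking.firewall.allowedTCPPorts = [ 9959 ];
}
```

```nix
# On the Prometheus host (sketch): scrape the listener.
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "authelia";
      static_configs = [ { targets = [ "auth01.home.2rjus.net:9959" ]; } ];
    }
  ];
}
```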

LLDAP (`auth01`)

Current state: No scrape targets, no alert rules
Risk: LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
Recommendation: Add an lldap_down systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
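
A systemd unit state alert of this kind could be sketched as below. It assumes the node-exporter systemd collector is already enabled on auth01 (as the existing `*_down` alerts suggest) and that the unit is named `lldap.service`; the `for` duration and severity label are placeholders.

```nix
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "lldap";
          rules = [
            {
              alert = "lldap_down";
              # Fires when the unit is reported but not in the "active" state.
              expr = ''node_systemd_unit_state{name="lldap.service", state="active"} == 0'';
              for = "5m";
              labels.severity = "critical";
              annotations.summary = "LLDAP is not running on {{ $labels.instance }}";
            }
          ];
        }
      ];
    })
  ];
}
```

The same shape would apply to the other systemd-only alerts in this audit by swapping the unit name.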

Vault / OpenBao (`vault01`)

Current state: No scrape targets, no alert rules
Risk: Secrets management service failures go undetected
Recommendation: OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for vault_down (systemd unit) and seal status.
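
One possible scrape target for the telemetry endpoint is sketched here. The FQDN, port 8200, and HTTPS scheme are assumptions about this deployment. The metrics path and `format=prometheus` parameter follow the upstream Vault-compatible API; the endpoint normally requires a token unless unauthenticated metrics access is enabled on the listener, which is not shown.

```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "openbao";
      scheme = "https";                  # assumed; use "http" if TLS is terminated elsewhere
      metrics_path = "/v1/sys/metrics";
      params.format = [ "prometheus" ];  # ask for Prometheus text format instead of JSON
      # Assumed hostname and port.
      static_configs = [ { targets = [ "vault01.home.2rjus.net:8200" ]; } ];
    }
  ];
}
```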

Gitea Actions Runner

Current state: No scrape targets, no alert rules
Risk: CI/CD failures go undetected
Recommendation: Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.

Services with Partial Monitoring

Jellyfin (`jelly01`)

Current state: Has scrape targets (port 8096), metrics are being collected, but zero alert rules
Metrics available: 184 metrics, all at the .NET runtime / ASP.NET Core level. There are no Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
- microsoft_aspnetcore_hosting_failed_requests - rate of HTTP errors
- microsoft_aspnetcore_hosting_current_requests - in-flight requests
- process_working_set_bytes - memory usage (~256 MB currently)
- dotnet_gc_pause_ratio - GC pressure
- up{job="jellyfin"} - basic availability
Recommendation: Add a jellyfin_down alert using either up{job="jellyfin"} == 0 or systemd unit state. Consider alerting on a sustained increase in microsoft_aspnetcore_hosting_failed_requests.
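
Since the metrics are already being scraped, the availability alert needs only a rule. A minimal sketch, assuming rules are added through `services.prometheus.rules` and using a placeholder `for` duration:

```nix
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "jellyfin";
          rules = [
            {
              alert = "jellyfin_down";
              # Uses the existing scrape job; no new exporter is needed.
              expr = ''up{job="jellyfin"} == 0'';
              for = "5m";
              labels.severity = "warning";
              annotations.summary = "Jellyfin scrape target {{ $labels.instance }} is down";
            }
          ];
        }
      ];
    })
  ];
}
```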

NATS (`nats1`)

Current state: Has a nats_down alert (systemd unit state via node-exporter), but no NATS-specific metrics
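Metrics available: NATS has built-in HTTP monitoring endpoints (/varz, /connz, /jsz) exposing connection counts, message throughput, JetStream consumer lag, and more. These return JSON rather than the Prometheus exposition format, so an exporter such as prometheus-nats-exporter is needed in front of them.
Recommendation: Enable the NATS monitoring port and the NATS exporter, then add a scrape target for the exporter. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.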
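
A hedged sketch of the nats1 side, assuming the nixpkgs `services.prometheus.exporters.nats` module and the upstream prometheus-nats-exporter, which translates the server's JSON monitoring endpoints into Prometheus metrics. Port 8222 is the conventional monitoring port, and the collector flags are an illustrative selection.

```nix
# On nats1 (sketch).
{
  # Enable the NATS HTTP monitoring port (serves /varz, /connz, /jsz as JSON).
  services.nats.settings.http_port = 8222;

  services.prometheus.exporters.nats = {
    enable = true;
    url = "http://127.0.0.1:8222";                 # monitoring endpoint to read from
    extraFlags = [ "-varz" "-connz" "-jsz=all" ];  # general, connection, and JetStream stats
    openFirewall = true;                           # let the Prometheus host reach the exporter
  };
}
```

The Prometheus host then needs a static scrape target pointing at the exporter's port on nats1.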

DNS - Unbound (`ns1`, `ns2`)

Current state: Has unbound_down alert (systemd unit state), but no DNS query metrics
Available in nixpkgs: services.prometheus.exporters.unbound.enable (package: prometheus-unbound-exporter v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
Recommendation: Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
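
Enabling the exporter could look roughly like this on each resolver. The control socket path is an assumption, and the certificate options are set to null on the assumption that TLS material is not needed when talking to Unbound over a local Unix socket; both should be checked against the module in the nixpkgs revision in use.

```nix
# On ns1 and ns2 (sketch).
{
  services.unbound = {
    # Expose unbound-control over a local socket for the exporter.
    localControlSocketPath = "/run/unbound/unbound.ctl";
    # Needed for per-rcode counters such as SERVFAIL and NXDOMAIN.
    settings.server.extended-statistics = true;
  };

  services.prometheus.exporters.unbound = {
    enable = true;
    unbound = {
      host = "unix:///run/unbound/unbound.ctl";
      ca = null;
      certificate = null;
      key = null;
    };
    openFirewall = true;
  };
}
```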

DNS - NSD (`ns1`, `ns2`)

Current state: Has nsd_down alert (systemd unit state), no NSD-specific metrics
Available in nixpkgs: Nothing. No exporter package or NixOS module. Community nsd_exporter exists but is not packaged.
Recommendation: The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.

Existing Monitoring (for reference)

These services have adequate alerting and/or scrape targets:

| Service | Scrape Targets | Alert Rules |
| --- | --- | --- |
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |

Suggested Priority

1. PostgreSQL - Critical infrastructure, easy to add with the existing nixpkgs module
2. Authelia + LLDAP - An auth outage affects all proxied services
3. Unbound exporter - Ready-to-go NixOS module, just needs enabling
4. Jellyfin alerts - Metrics already collected, just needs alert rules
5. NATS metrics - Monitoring endpoints are built in, needs an exporter and a scrape target
6. Vault/OpenBao - Native telemetry support
7. Actions Runner - Lower priority, a basic systemd alert is sufficient

Node-Exporter Targets Currently Down

Noted during audit -- these node-exporter targets are failing:

- nixos-test1.home.2rjus.net:9100 - no route to host
- media1.home.2rjus.net:9100 - no route to host
- ns3.home.2rjus.net:9100 - no route to host
- ns4.home.2rjus.net:9100 - no route to host

These may be decommissioned or powered-off hosts that should be removed from the scrape config.
