# Monitoring Gaps Audit

## Overview

Audit of homelab services that lack monitoring coverage: each is missing Prometheus scrape targets, alerting rules, or both.

## Services with No Monitoring

### PostgreSQL (`pgdb1`)

- **Current state:** No scrape targets, no alert rules
- **Risk:** A database outage would go completely unnoticed by Prometheus
- **Recommendation:** Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.
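A minimal sketch of the exporter side on `pgdb1`, assuming PostgreSQL runs locally under the NixOS `services.postgresql` module and that Prometheus scrapes from a different host. The choice of `runAsLocalSuperUser` over a dedicated monitoring role is an assumption made for brevity, not something this audit settled on.

```nix
# Sketch for pgdb1; assumes a local PostgreSQL instance.
{
  services.prometheus.exporters.postgres = {
    enable = true;
    # Connect over the local Unix socket as the postgres superuser,
    # which avoids managing a separate monitoring role and password.
    runAsLocalSuperUser = true;
    # Assumption: the scrape comes from another host, so the exporter
    # port (9187 by default) has to be reachable.
    openFirewall = true;
  };
}
```

A matching scrape job for port 9187 is still needed on the Prometheus host.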

### Authelia (`auth01`)

- **Current state:** No scrape targets, no alert rules
- **Risk:** The authentication gateway being down blocks access to all proxied services
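- **Recommendation:** Authelia can expose Prometheus metrics natively at `/metrics`, but the endpoint is disabled by default and is served on a dedicated telemetry listener (port 9959 by default) once `telemetry.metrics` is enabled. Enable it, add a scrape target, and at minimum add an `authelia_down` systemd unit state alert.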
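The scrape side could look roughly like the sketch below. The hostname `auth01.home.2rjus.net` and the job name are assumptions based on the naming used elsewhere in this audit, and port 9959 is Authelia's default metrics listener, which must be enabled under `telemetry.metrics` in the Authelia configuration and be reachable from the Prometheus host.

```nix
# Sketch for the Prometheus host; target hostname and port are assumptions.
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "authelia";
      # metrics_path defaults to /metrics, which matches Authelia.
      static_configs = [
        { targets = [ "auth01.home.2rjus.net:9959" ]; }
      ];
    }
  ];
}
```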

### LLDAP (`auth01`)

- **Current state:** No scrape targets, no alert rules
- **Risk:** LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
- **Recommendation:** Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
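A sketch of what such a unit state alert might look like, assuming rules are declared through `services.prometheus.rules` on the Prometheus host and that the unit is named `lldap.service`. The group name, `for` duration, and severity label are illustrative; the existing `*_down` alerts in this setup may be defined differently.

```nix
# Sketch for the Prometheus host; names and thresholds are illustrative.
{
  services.prometheus.rules = [
    # JSON is valid YAML, so toJSON output is accepted as a rule file.
    (builtins.toJSON {
      groups = [
        {
          name = "auth";
          rules = [
            {
              alert = "lldap_down";
              # node-exporter's systemd collector reports 1 for the
              # state a unit is currently in and 0 for all others.
              expr = ''node_systemd_unit_state{name="lldap.service", state="active"} == 0'';
              for = "5m";
              labels.severity = "critical";
              annotations.summary = "LLDAP is not active on {{ $labels.instance }}";
            }
          ];
        }
      ];
    })
  ];
}
```

The same pattern would cover `authelia_down`, `vault_down`, and the Actions Runner by swapping the unit name.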

### Vault / OpenBao (`vault01`)

- **Current state:** No scrape targets, no alert rules
- **Risk:** Secrets management service failures go undetected
- **Recommendation:** OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.
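A possible scrape job, sketched under the assumption that OpenBao keeps upstream Vault's metrics API: Prometheus-format output at `/v1/sys/metrics?format=prometheus`, available once the `telemetry` stanza sets `prometheus_retention_time`. The hostname, port 8200, and HTTPS scheme are assumptions. The endpoint normally requires a token unless unauthenticated metrics access is enabled on the listener, and that part is left out here.

```nix
# Sketch for the Prometheus host; target, scheme, and port are assumptions.
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "openbao";
      scheme = "https";
      metrics_path = "/v1/sys/metrics";
      # Without this parameter the endpoint returns JSON, not Prometheus text.
      params.format = [ "prometheus" ];
      static_configs = [
        { targets = [ "vault01.home.2rjus.net:8200" ]; }
      ];
    }
  ];
}
```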

### Gitea Actions Runner

- **Current state:** No scrape targets, no alert rules
- **Risk:** CI/CD failures go undetected
- **Recommendation:** Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.

## Services with Partial Monitoring

### Jellyfin (`jelly01`)

- **Current state:** Has scrape targets (port 8096), metrics are being collected, but zero alert rules
- **Metrics available:** 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
  - `microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
  - `microsoft_aspnetcore_hosting_current_requests` - in-flight requests
  - `process_working_set_bytes` - memory usage (~256 MB currently)
  - `dotnet_gc_pause_ratio` - GC pressure
  - `up{job="jellyfin"}` - basic availability
- **Recommendation:** Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on a sustained increase in the rate of `microsoft_aspnetcore_hosting_failed_requests`.
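Since the metrics are already collected, the alerts could be sketched as below. This assumes rules are declared through `services.prometheus.rules` and that `microsoft_aspnetcore_hosting_failed_requests` behaves as a counter. The threshold of 1 failed request per second and both `for` durations are placeholders that would need tuning against real traffic.

```nix
# Sketch for the Prometheus host; thresholds and durations are placeholders.
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "jellyfin";
          rules = [
            {
              alert = "jellyfin_down";
              expr = ''up{job="jellyfin"} == 0'';
              for = "5m";
              labels.severity = "warning";
              annotations.summary = "Jellyfin scrape target is down";
            }
            {
              alert = "jellyfin_failed_requests";
              # Assumes the metric is a counter, so rate() is meaningful.
              expr = "rate(microsoft_aspnetcore_hosting_failed_requests[10m]) > 1";
              for = "15m";
              labels.severity = "warning";
              annotations.summary = "Jellyfin is returning a sustained rate of failed requests";
            }
          ];
        }
      ];
    })
  ];
}
```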

### NATS (`nats1`)

- **Current state:** Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics
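- **Metrics available:** NATS has a built-in HTTP monitoring endpoint (JSON under `/varz`, `/connz`, `/jsz`) covering connection counts, message throughput, JetStream consumer state, and more. It does not serve the Prometheus exposition format itself, so a bridge such as `prometheus-nats-exporter` is needed.
- **Recommendation:** Enable the NATS monitoring port, run `prometheus-nats-exporter` against it, and add a scrape target for the exporter. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.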
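One way to get NATS data into Prometheus is sketched below, assuming the `prometheus-nats-exporter` bridge is placed in front of the NATS HTTP monitoring port. The monitoring port 8222 and the selected collector flags are assumptions, and the exporter's NixOS options should be checked against the nixpkgs revision in use.

```nix
# Sketch for nats1; port and flags are assumptions.
{
  # Turn on the NATS HTTP monitoring port (serves /varz, /connz, /jsz as JSON).
  services.nats.settings.http_port = 8222;

  services.prometheus.exporters.nats = {
    enable = true;
    url = "http://127.0.0.1:8222";
    # Select which monitoring endpoints the exporter translates.
    extraFlags = [ "-varz" "-connz" "-jsz=all" ];
    # Assumption: Prometheus scrapes from another host.
    openFirewall = true;
  };
}
```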

### DNS - Unbound (`ns1`, `ns2`)

- **Current state:** Has `unbound_down` alert (systemd unit state), but no DNS query metrics
- **Available in nixpkgs:** `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
- **Recommendation:** Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
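A sketch of enabling the exporter on `ns1`/`ns2`, assuming Unbound's control interface is exposed over a local Unix socket rather than TLS on TCP. The socket path is an assumption, and the `unbound.*` sub-options should be checked against the module in the nixpkgs revision in use.

```nix
# Sketch for ns1/ns2; socket path is an assumption.
{
  # Expose unbound-control over a local socket for the exporter to query.
  services.unbound.localControlSocketPath = "/run/unbound/unbound.ctl";

  services.prometheus.exporters.unbound = {
    enable = true;
    unbound = {
      host = "unix:///run/unbound/unbound.ctl";
      # No TLS material is needed when talking over the Unix socket.
      ca = null;
      certificate = null;
      key = null;
    };
    # Assumption: Prometheus scrapes from another host.
    openFirewall = true;
  };
}
```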

### DNS - NSD (`ns1`, `ns2`)

- **Current state:** Has `nsd_down` alert (systemd unit state), no NSD-specific metrics
- **Available in nixpkgs:** Nothing. No exporter package or NixOS module. Community `nsd_exporter` exists but is not packaged.
- **Recommendation:** The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.

## Existing Monitoring (for reference)

These services have adequate alerting and/or scrape targets:

| Service | Scrape Targets | Alert Rules |
|---|---|---|
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |

## Per-Service Resource Metrics (systemd-exporter)

### Current State

No per-service CPU, memory, or IO metrics are collected. The existing node-exporter systemd collector only provides unit state (active/inactive/failed), socket stats, and timer triggers. While systemd tracks per-unit resource usage via cgroups internally (visible in `systemctl status` and `systemd-cgtop`), this data is not exported to Prometheus.

### Available Solution

The `prometheus-systemd-exporter` package (v0.7.0) is available in nixpkgs with a ready-made NixOS module:

```nix
services.prometheus.exporters.systemd.enable = true;
```

**Options:** `enable`, `port`, `extraFlags`, `user`, `group`

This exporter reads cgroup data and exposes per-unit metrics including:
- CPU seconds consumed per service
- Memory usage per service
- Task/process counts per service
- Restart counts
- IO usage

### Recommendation

Enable the exporter on all hosts via the shared `system/` config (same pattern as node-exporter), and add a corresponding scrape job on monitoring01. This would give visibility into resource consumption per service across the fleet, which is useful for capacity planning and for diagnosing noisy-neighbor issues on shared hosts.
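The scrape job on monitoring01 might look like the following sketch. The job name and the two example targets are assumptions (the real target list would presumably be generated the same way as for node-exporter); 9558 is the exporter's default port.

```nix
# Sketch for monitoring01; job name and target list are illustrative.
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "systemd-exporter";
      static_configs = [
        {
          targets = [
            "pgdb1.home.2rjus.net:9558"
            "auth01.home.2rjus.net:9558"
          ];
        }
      ];
    }
  ];
}
```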

## Suggested Priority

1. **PostgreSQL** - Critical infrastructure, easy to add with existing nixpkgs module
2. **Authelia + LLDAP** - Auth outage affects all proxied services
3. **Unbound exporter** - Ready-to-go NixOS module, just needs enabling
4. **Jellyfin alerts** - Metrics already collected, just needs alert rules
5. **NATS metrics** - Built-in monitoring endpoint, needs an exporter and a scrape target
6. **Vault/OpenBao** - Native telemetry support
7. **Actions Runner** - Lower priority, basic systemd alert sufficient

## Node-Exporter Targets Currently Down

Noted during audit -- these node-exporter targets are failing:

- `nixos-test1.home.2rjus.net:9100` - no route to host
- `media1.home.2rjus.net:9100` - no route to host
- `ns3.home.2rjus.net:9100` - no route to host
- `ns4.home.2rjus.net:9100` - no route to host

These may be decommissioned or powered-off hosts that should be removed from the scrape config.