docs: add systemd-exporter findings to monitoring gaps plan
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -79,6 +79,33 @@ These services have adequate alerting and/or scrape targets:
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |

## Per-Service Resource Metrics (systemd-exporter)

### Current State

No per-service CPU, memory, or IO metrics are collected. The existing node-exporter systemd collector only provides unit state (active/inactive/failed), socket stats, and timer triggers. While systemd tracks per-unit resource usage via cgroups internally (visible in `systemctl status` and `systemd-cgtop`), this data is not exported to Prometheus.

### Available Solution

The `prometheus-systemd-exporter` package (v0.7.0) is available in nixpkgs with a ready-made NixOS module:

```nix
services.prometheus.exporters.systemd.enable = true;
```

**Options:** `enable`, `port`, `extraFlags`, `user`, `group`

This exporter reads cgroup data and exposes per-unit metrics including:
- CPU seconds consumed per service
- Memory usage per service
- Task/process counts per service
- Restart counts
- IO usage
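The host-side sketch below fills in the options listed above instead of relying only on `enable`. It rests on two assumptions that should be checked against `systemd_exporter --help` for v0.7.0:

- **Port.** The port shown (9558) is assumed to be the module's default listen port. It is set explicitly so that the scrape side can reference the same value.
- **Restart-count flag.** The flag passed through `extraFlags` is taken from the upstream systemd_exporter README, which describes restart counts as opt-in.

```nix
{ ... }:
{
  services.prometheus.exporters.systemd = {
    enable = true;
    # Assumed default listen port for systemd_exporter; pinned here so the
    # monitoring host's scrape job can use the same number.
    port = 9558;
    # Assumption: restart-count metrics are opt-in upstream and need this flag.
    extraFlags = [ "--systemd.collector.enable-restart-count" ];
    # `user` and `group` are left at the module defaults.
  };
}
```

If the flag turns out to be unnecessary or renamed in v0.7.0, `extraFlags` can simply be dropped. The rest of the module works unchanged.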
### Recommendation

Enable on all hosts via the shared `system/` config (same pattern as node-exporter). Add a corresponding scrape job on monitoring01. This would give visibility into resource consumption per service across the fleet, useful for capacity planning and diagnosing noisy-neighbor issues on shared hosts.
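The scrape job on monitoring01 could use the usual `services.prometheus.scrapeConfigs` shape. This is a minimal sketch under two assumptions:

- The host names are hypothetical placeholders.
- The port matches the exporter's `port` setting on every host.

```nix
{ ... }:
let
  # Hypothetical host list; replace with the fleet's real hostnames.
  hosts = [ "host-a" "host-b" "monitoring01" ];
  # Assumed to match services.prometheus.exporters.systemd.port on every host.
  systemdExporterPort = 9558;
in
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "systemd-exporter";
      static_configs = [
        {
          # Builds targets such as "host-a:9558" from the host list.
          targets = map (host: "${host}:${toString systemdExporterPort}") hosts;
        }
      ];
    }
  ];
}
```

If the node-exporter job already derives its targets from a shared host list, reusing that list here would keep the two jobs in sync.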
## Suggested Priority

1. **PostgreSQL** - Critical infrastructure, easy to add with existing nixpkgs module