monitoring-gaps-implementation #20

Merged
torjus merged 2 commits from monitoring-gaps-implementation into master 2026-02-05 20:57:32 +00:00
Owner

Summary

Implements monitoring coverage for services identified in the monitoring gaps audit. Adds Prometheus exporters, scrape targets, and alert rules for previously unmonitored services.

New Exporters & Scrape Targets

| Host | Service | Exporter | Port |
|------|---------|----------|------|
| All hosts | systemd | systemd-exporter | 9558 |
| pgdb1 | PostgreSQL | postgres-exporter | 9187 |
| auth01 | Authelia | native telemetry | 9959 |
| ns1, ns2 | Unbound | unbound-exporter | 9167 |
| nats1 | NATS | native HTTP monitoring | 8222 |
| vault01 | OpenBao | native telemetry | 8200 |
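
As a rough illustration of how the host and port pairs above map onto scrape targets, here is a minimal Python sketch. It is not the repository's Nix configuration. The job names and the use of bare hostnames as target addresses are assumptions, and the "All hosts" systemd-exporter row is left out because the full host list is not given here.

```python
# Hosts and ports come from the table above. Job names are hypothetical,
# and bare hostnames are assumed to resolve from the Prometheus server.
EXPORTERS = [
    # (job name, hosts, port)
    ("postgres", ["pgdb1"], 9187),
    ("authelia", ["auth01"], 9959),
    ("unbound", ["ns1", "ns2"], 9167),
    ("nats", ["nats1"], 8222),
    ("openbao", ["vault01"], 8200),
]


def scrape_configs(exporters):
    """Build Prometheus-style scrape config dicts with static targets."""
    return [
        {
            "job_name": job,
            "static_configs": [
                {"targets": [f"{host}:{port}" for host in hosts]}
            ],
        }
        for job, hosts, port in exporters
    ]


if __name__ == "__main__":
    for cfg in scrape_configs(EXPORTERS):
        print(cfg["job_name"], cfg["static_configs"][0]["targets"])
```

Each resulting dict mirrors the shape of one entry in a Prometheus `scrape_configs` list, with one `host:port` target per host.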

New Alert Rules

  • postgres_rules: postgres_down, postgres_exporter_down, postgres_high_connections
  • auth_rules: authelia_down, lldap_down
  • jellyfin_rules: jellyfin_down
  • vault_rules: openbao_down, openbao_sealed, openbao_scrape_down
  • nameserver_rules: unbound_low_cache_hit_ratio (extended)
  • nats_rules: nats_slow_consumers (extended)
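
To make the idea behind `unbound_low_cache_hit_ratio` concrete, the sketch below computes a cache hit ratio from hit and miss counts and compares it with a threshold. The real rule's expression and threshold live in `services/monitoring/rules.yml` and are not shown here, so the 0.5 threshold and the function names are purely hypothetical.

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Fraction of queries answered from cache over some time window."""
    total = hits + misses
    if total == 0:
        # No queries in the window: report a perfect ratio instead of
        # dividing by zero, so an idle resolver does not look unhealthy.
        return 1.0
    return hits / total


def is_low(hits: float, misses: float, threshold: float = 0.5) -> bool:
    """True when the ratio drops below the (hypothetical) threshold."""
    return cache_hit_ratio(hits, misses) < threshold


if __name__ == "__main__":
    print(cache_hit_ratio(90, 10))
    print(is_low(20, 80))
```

An alert rule would typically evaluate the same ratio over rate-of-change values and require the condition to hold for some time before firing.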

Terraform Changes

  • New prometheus-metrics policy for OpenBao metrics access
  • vault_token resource to auto-create Prometheus scrape token
  • Token stored in KV at hosts/monitoring01/openbao-token
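
For context on what the scrape token is used for, the following sketch builds (but does not send) an authenticated request for OpenBao metrics. It assumes OpenBao keeps the Vault-compatible `/v1/sys/metrics?format=prometheus` endpoint and accepts the token as a bearer credential. The HTTPS scheme, the bare `vault01` hostname, and the function name are assumptions for illustration only.

```python
import urllib.request


def build_metrics_request(
    token: str,
    host: str = "vault01",
    port: int = 8200,
    scheme: str = "https",  # assumed; the PR does not state the scheme
) -> urllib.request.Request:
    """Build a metrics scrape request carrying the token as a bearer header."""
    url = f"{scheme}://{host}:{port}/v1/sys/metrics?format=prometheus"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


if __name__ == "__main__":
    # Placeholder token; the real one is read from the KV path listed above.
    req = build_metrics_request("example-token")
    print(req.full_url)
```

Prometheus does the equivalent on every scrape when a job is configured with a credentials file, which is why the token only needs read access to the metrics path.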

Files Changed

  • system/monitoring/metrics.nix - systemd-exporter on all hosts
  • services/postgres/postgres.nix - postgres-exporter
  • services/authelia/default.nix - Authelia telemetry
  • services/ns/resolver.nix - unbound-exporter with remote-control
  • services/nats/default.nix - NATS HTTP monitoring
  • services/vault/default.nix - OpenBao telemetry
  • services/monitoring/prometheus.nix - new scrape configs + vault secret
  • services/monitoring/rules.yml - all new alert rules
  • terraform/vault/policies.tf - new file for prometheus-metrics policy
  • terraform/vault/secrets.tf - prometheus token secret
  • docs/plans/monitoring-gaps.md → docs/plans/completed/

Deployment Notes

  • Run tofu apply in terraform/vault/ first (already done)
  • Deploy monitoring01 first to pick up new scrape configs
  • Other hosts can be deployed via normal auto-upgrade
torjus added 2 commits 2026-02-05 20:57:25 +00:00
monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
3cccfc0487
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
monitoring: fix openbao token output path
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m17s
Run nix flake check / flake-check (push) Failing after 8m57s
b322b1156b
The outputDir with extractKey should be the full file path, not just
the parent directory.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
torjus merged commit 5be1f43c24 into master 2026-02-05 20:57:32 +00:00
torjus deleted branch monitoring-gaps-implementation 2026-02-05 20:57:32 +00:00
Reference: torjus/nixos-servers#20