Commit Graph

6 Commits

Author SHA1 Message Date
4f593126c0 monitoring01: remove host and migrate services to monitoring02
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m15s
Run nix flake check / flake-check (pull_request) Failing after 3m8s
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:50:20 +01:00
75210805d5 nix-cache01: decommission and remove all references
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Removed:
- hosts/nix-cache01/ directory
- services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder)
- Vault secret and AppRole for nix-cache01
- Old signing key variable from terraform
- Old trusted public key from system/nix.nix

Updated:
- flake.nix: removed nixosConfiguration
- README.md: nix-cache01 -> nix-cache02
- Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02
- Simplified proxy.nix (no longer needs hostname conditional)

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:40:51 +01:00
9bd48e0808 monitoring: explicitly list valid HTTP status codes
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Empty valid_status_codes defaults to 2xx only, not "any".
Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so
services returning 400/401 like ha and nzbget pass the probe.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:41:47 +01:00
d1b0a5dc20 monitoring: accept any HTTP status in TLS probe
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Only care about TLS handshake success for certificate monitoring.
Services like nzbget (401) and ha (400) return non-2xx but have
valid certificates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:33:45 +01:00
4d32707130 monitoring: remove duplicate rules from blackbox.nix
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
The rules were already added to rules.yml but the blackbox.nix file
still had them, causing duplicate 'groups' key errors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:28:42 +01:00
75e4fb61a5 monitoring: add blackbox exporter for TLS certificate monitoring
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Add blackbox exporter to monitoring01 to probe TLS endpoints and alert
on expiring certificates. Monitors all ACME-managed certificates from
OpenBao PKI including Caddy auto-TLS services.

Alerts:
- tls_certificate_expiring_soon (< 7 days, warning)
- tls_certificate_expiring_critical (< 24h, critical)
- tls_probe_failed (connectivity issues)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:21:42 +01:00