monitoring: add blackbox exporter for TLS certificate monitoring
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Add blackbox exporter to monitoring01 to probe TLS endpoints and alert on expiring certificates. Monitors all ACME-managed certificates from OpenBao PKI including Caddy auto-TLS services.

Alerts:
- tls_certificate_expiring_soon (< 7 days, warning)
- tls_certificate_expiring_critical (< 24h, critical)
- tls_probe_failed (connectivity issues)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
72
docs/plans/completed/cert-monitoring.md
Normal file
@@ -0,0 +1,72 @@
# Certificate Monitoring Plan

## Summary

This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.

## What Was Removed

### labmon Service

The `labmon` service was a custom Go application that provided:

1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration

The service exposed Prometheus metrics at `:9969` including:

- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration
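
For readers unfamiliar with what a "seconds left" metric measures, the following is a minimal Python sketch of the same kind of check. It is not labmon's Go implementation; the function names and the optional `cafile` parameter are assumptions made for illustration. Endpoints signed by an internal CA would need that CA bundle passed as `cafile`.

```python
import socket
import ssl
import time
from typing import Optional


def seconds_left(not_after: str, now: Optional[float] = None) -> float:
    """Seconds until expiry for a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    return expiry - (time.time() if now is None else now)


def fetch_not_after(host: str, port: int = 443, cafile: Optional[str] = None) -> str:
    """Open a verified TLS connection and return the peer certificate's notAfter field.

    Hypothetical helper: a failed handshake or connection raises an exception,
    which roughly corresponds to a "check error" condition.
    """
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A caller would combine the two, for example `seconds_left(fetch_not_after(host, port))`, and export the result as a gauge.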

### Affected Files

- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
- `services/monitoring/prometheus.nix` - Removed labmon scrape target
- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
- `services/monitoring/default.nix` - Removed alloy.nix import

### Removed Alerts

- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
- `certificate_check_error` - Warned when TLS connection check failed
- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates

## Why It Was Removed

1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
2. **Outdated codebase**: labmon was a custom tool that required maintenance
3. **Limited value**: With ACME auto-renewal, certificates should renew automatically

## Current State

ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.

## Future Needs

While ACME handles renewal automatically, we should consider monitoring for:

1. **ACME renewal failures**: Alert when a certificate fails to renew
   - Could monitor ACME client logs (via Loki queries)
   - Could check certificate file modification times

2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures

3. **Certificate transparency**: Monitor for unexpected certificate issuance
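
The file-modification-time idea under item 1 could be sketched roughly as follows. This is an illustrative Python sketch, not part of the repository: the function names are hypothetical, and the certificate path and the `max_age_seconds` threshold are assumptions that would depend on the ACME client and the certificate lifetime in use.

```python
import os
import time
from typing import Optional


def seconds_since_modified(path: str, now: Optional[float] = None) -> float:
    """Age of the file at `path`, based on its last modification time."""
    current = time.time() if now is None else now
    return current - os.stat(path).st_mtime


def renewal_overdue(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True when the certificate file has not been rewritten within the expected window.

    `max_age_seconds` is an assumed value: it should be somewhat longer than the
    interval at which the ACME client normally renews the certificate.
    """
    return seconds_since_modified(path, now) > max_age_seconds
```

A file that a successful renewal would have rewritten, but which is older than the expected renewal interval, suggests the renewal did not happen. The check says nothing about why, so it complements log-based alerting rather than replacing it.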

### Potential Solutions

1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
   - `probe_ssl_earliest_cert_expiry` metric
   - Already a standard tool, well-maintained

2. **Custom Loki alerting**: Query ACME service logs for renewal failures
   - Works with existing infrastructure
   - No additional services needed

3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics
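
To make the first option concrete: `probe_ssl_earliest_cert_expiry` is a Unix timestamp, so an expiry alert compares it against the current time. The Python sketch below shows only that comparison, using the 7-day and 24-hour thresholds named in the commit message. The function and constant names are hypothetical, and in practice this logic would be written as a Prometheus alerting rule rather than in Python.

```python
import time
from typing import Optional

WARNING_SECONDS = 7 * 24 * 3600  # 7 days, per the commit message
CRITICAL_SECONDS = 24 * 3600     # 24 hours, per the commit message


def expiry_severity(earliest_cert_expiry: float, now: Optional[float] = None) -> str:
    """Classify a certificate by time remaining until its expiry timestamp."""
    current = time.time() if now is None else now
    remaining = earliest_cert_expiry - current
    if remaining < CRITICAL_SECONDS:
        return "critical"
    if remaining < WARNING_SECONDS:
        return "warning"
    return "ok"
```

Checking the critical threshold first means a certificate with less than 24 hours left is reported once, at the higher severity, instead of matching both levels.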

## Status

**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.