Certificate Monitoring Plan
Summary
This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.
What Was Removed
labmon Service
The labmon service was a custom Go application that provided:
- StepMonitor: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
- TLSConnectionMonitor: Periodic TLS connection checks to verify certificate validity and expiration
The service exposed Prometheus metrics at :9969 including:
- labmon_tlsconmon_certificate_seconds_left - Time until certificate expiration
- labmon_tlsconmon_certificate_check_error - Whether the TLS check failed
- labmon_stepmon_certificate_seconds_left - Step-CA internal certificate expiration
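For reference, the kind of check behind a "seconds left" metric can be sketched in a few lines. This is not labmon's code (labmon was written in Go and its source is not shown here); it is a minimal Python illustration using only the standard library. The function names `seconds_left` and `check_endpoint` are hypothetical, and the sketch assumes the issuing CA is present in the system trust store, since `getpeercert()` only returns certificate fields for a validated peer.

```python
import socket
import ssl
import time


def seconds_left(not_after: str, now: float) -> float:
    """Seconds until expiry for a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    return ssl.cert_time_to_seconds(not_after) - now


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Open a TLS connection and return the seconds left on the peer certificate.

    Assumes the CA that issued the certificate is trusted by the default context;
    otherwise the handshake raises ssl.SSLError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return seconds_left(cert["notAfter"], time.time())
```

In such a design, an exception raised by `check_endpoint` (connection refused, handshake failure) would map naturally onto a separate "check error" metric, while the return value feeds the "seconds left" gauge.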
Affected Files
- hosts/monitoring01/configuration.nix - Removed labmon configuration block
- services/monitoring/prometheus.nix - Removed labmon scrape target
- services/monitoring/rules.yml - Removed certificate_rules alert group
- services/monitoring/alloy.nix - Deleted (was only used for labmon profiling)
- services/monitoring/default.nix - Removed alloy.nix import
Removed Alerts
- certificate_expiring_soon - Warned when any monitored TLS cert had < 24h validity
- step_ca_serving_cert_expiring - Critical alert for step-ca's own serving certificate
- certificate_check_error - Warned when TLS connection check failed
- step_ca_certificate_expiring - Critical alert for step-ca issued certificates
Why It Was Removed
- step-ca decommissioned: The primary monitoring target (step-ca) is no longer in use
- Outdated codebase: labmon was a custom tool that required maintenance
- Limited value: With ACME auto-renewal, certificates should renew automatically
Current State
ACME certificates are now issued by OpenBao PKI at vault.home.2rjus.net:8200. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.
Future Needs
While ACME handles renewal automatically, we should consider monitoring for:
- ACME renewal failures: Alert when a certificate fails to renew
  - Could monitor ACME client logs (via Loki queries)
  - Could check certificate file modification times
- Certificate expiration as backup: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures
- Certificate transparency: Monitor for unexpected certificate issuance
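The file-modification-time idea mentioned above could look roughly like the following. This is only a sketch: the helper name `is_stale` and the `max_age_seconds` threshold are hypothetical, and a sensible threshold depends on the certificate lifetime and the ACME client's renewal schedule, neither of which is specified in this document.

```python
import os
import time
from typing import Optional


def is_stale(cert_path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the certificate file was last modified more than max_age_seconds ago.

    A file that has not been rewritten within the expected renewal interval
    suggests that renewal may have silently failed.
    """
    if now is None:
        now = time.time()
    return now - os.path.getmtime(cert_path) > max_age_seconds
```

A modification-time check only shows that the file was rewritten, not that the new certificate is valid, so it would complement rather than replace an expiration-based alert.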
Potential Solutions
- Prometheus blackbox_exporter: Can probe TLS endpoints and export certificate expiration metrics
  - probe_ssl_earliest_cert_expiry metric
  - Already a standard tool, well-maintained
- Custom Loki alerting: Query ACME service logs for renewal failures
  - Works with existing infrastructure
  - No additional services needed
- Node-exporter textfile collector: Script that checks local certificate files and writes expiration metrics
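As an illustration of the textfile collector option, a script along these lines could read each certificate's end date and write a metrics file for node-exporter to pick up. Everything specific here is an assumption: the metric name `cert_file_not_after_timestamp_seconds`, the helper names, and the reliance on the `openssl` CLI being installed on the host. The script writes to a temporary file and renames it so the collector never reads a half-written file.

```python
import os
import ssl
import subprocess
import tempfile
from typing import Dict

METRIC = "cert_file_not_after_timestamp_seconds"  # hypothetical metric name


def parse_enddate(openssl_output: str) -> int:
    """Parse 'notAfter=Jan  5 09:34:43 2018 GMT' into a Unix timestamp."""
    _, _, value = openssl_output.strip().partition("=")
    return ssl.cert_time_to_seconds(value)


def read_not_after(cert_path: str) -> int:
    """Ask the openssl CLI (assumed to be installed) for the certificate's end date."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    )
    return parse_enddate(result.stdout)


def render(expiries: Dict[str, int]) -> str:
    """Render expiry timestamps in the Prometheus text exposition format.

    Assumes paths contain no quotes or backslashes (no label escaping is done).
    """
    lines = [
        f"# HELP {METRIC} Certificate notAfter as a Unix timestamp.",
        f"# TYPE {METRIC} gauge",
    ]
    for path, timestamp in sorted(expiries.items()):
        lines.append(f'{METRIC}{{path="{path}"}} {timestamp}')
    return "\n".join(lines) + "\n"


def write_atomically(text: str, dest: str) -> None:
    """Write to a temp file in the same directory, then rename over dest."""
    directory = os.path.dirname(dest) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the collector needs read access
    os.replace(tmp, dest)
```

A timer-driven job could call `read_not_after` for each certificate path, pass the results to `render`, and write the output to a `.prom` file in the collector's directory. An alert expression would then compare the metric against `time()`, in the same way as with probe_ssl_earliest_cert_expiry.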
Status
Not yet implemented. This document serves as a placeholder for future work on certificate monitoring.