# Certificate Monitoring Plan

## Summary

This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.

## What Was Removed

### labmon Service

The `labmon` service was a custom Go application that provided:

1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration

The service exposed Prometheus metrics at `:9969` including:

- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration

### Affected Files

- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
- `services/monitoring/prometheus.nix` - Removed labmon scrape target
- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
- `services/monitoring/default.nix` - Removed alloy.nix import

### Removed Alerts

- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
- `certificate_check_error` - Warned when TLS connection check failed
- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates

## Why It Was Removed

1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
2. **Outdated codebase**: labmon was a custom tool that required maintenance
3. **Limited value**: With ACME auto-renewal, certificates should renew automatically

## Current State

ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.

## Future Needs

While ACME handles renewal automatically, we should consider monitoring for:

1. **ACME renewal failures**: Alert when a certificate fails to renew
   - Could monitor ACME client logs (via Loki queries)
   - Could check certificate file modification times
2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures
3. **Certificate transparency**: Monitor for unexpected certificate issuance

### Potential Solutions

1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
   - `probe_ssl_earliest_cert_expiry` metric
   - Already a standard tool, well-maintained
2. **Custom Loki alerting**: Query ACME service logs for renewal failures
   - Works with existing infrastructure
   - No additional services needed
3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics

## Status

**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.
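
As a possible starting point for the node-exporter textfile collector option under Potential Solutions, the following is a minimal Python sketch, not a tested implementation. To stay within the standard library it probes the TLS endpoint instead of reading local certificate files, which is a deliberate swap from the option as described. The metric name `homelab_tls_certificate_seconds_left`, the output path, and the endpoint list are all hypothetical, and the sketch assumes the issuing CA is already trusted by the system store on the host running it.

```python
"""Sketch: export TLS certificate expiry for the node-exporter textfile collector."""
import os
import socket
import ssl
import tempfile
import time

# Hypothetical values: adjust to the real endpoints and textfile directory.
ENDPOINTS = [("vault.home.2rjus.net", 8200)]
OUTPUT_PATH = "/var/lib/prometheus-node-exporter/textfile/tls_cert.prom"
METRIC = "homelab_tls_certificate_seconds_left"  # hypothetical metric name


def fetch_not_after(host, port, timeout=5.0):
    """Return the served certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    # Assumes the issuing CA is in the system trust store; otherwise verification fails.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def seconds_left(not_after, now):
    """Seconds between `now` (epoch seconds) and the certificate's expiry."""
    return ssl.cert_time_to_seconds(not_after) - now


def render(results, now):
    """Render {(host, port): notAfter} as Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC} Seconds until the served certificate expires.",
        f"# TYPE {METRIC} gauge",
    ]
    for (host, port), not_after in sorted(results.items()):
        value = seconds_left(not_after, now)
        lines.append(f'{METRIC}{{address="{host}:{port}"}} {value:.0f}')
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write via a temp file and rename so a scrape never sees a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp, path)


def main():
    results = {}
    for host, port in ENDPOINTS:
        try:
            results[(host, port)] = fetch_not_after(host, port)
        except OSError as exc:  # covers connection and TLS errors
            print(f"check failed for {host}:{port}: {exc}")
    write_atomically(OUTPUT_PATH, render(results, time.time()))
```

Calling `main()` periodically, for example from a timer-driven unit, would keep the file fresh. An endpoint that fails the check is simply left out of the output, so an alert on the absence of the series would be needed to cover that case.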