nixos-servers/docs/plans/cert-monitoring.md
Torjus Håkestad 21db7e9573 acme: migrate from step-ca to OpenBao PKI
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).

- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:20:10 +01:00

Certificate Monitoring Plan

Summary

This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.

What Was Removed

labmon Service

The labmon service was a custom Go application that provided:

  1. StepMonitor: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
  2. TLSConnectionMonitor: Periodic TLS connection checks to verify certificate validity and expiration

The service exposed Prometheus metrics at :9969 including:

  • labmon_tlsconmon_certificate_seconds_left - Time until certificate expiration
  • labmon_tlsconmon_certificate_check_error - Whether the TLS check failed
  • labmon_stepmon_certificate_seconds_left - Step-CA internal certificate expiration
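For reference, the kind of check behind a "seconds left" metric can be sketched in a few lines. This is not labmon's code (labmon was written in Go); it is a minimal Python illustration, and the function names `seconds_left` and `check_endpoint` are hypothetical. It assumes the issuing CA is present in the system trust store, since the default SSL context verifies the peer certificate.

```python
import socket
import ssl
import time


def seconds_left(not_after: str, now: float) -> float:
    """Seconds until expiry for a notAfter string such as 'Jan  5 09:34:43 2030 GMT'."""
    return ssl.cert_time_to_seconds(not_after) - now


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Open a TLS connection and return the seconds left on the peer certificate.

    Raises ssl.SSLError or OSError if the handshake or connection fails, which
    is the situation a "check error" metric would report.
    """
    context = ssl.create_default_context()  # verifies against the system trust store
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return seconds_left(cert["notAfter"], time.time())
```

A result below 86400 corresponds to the 24h threshold used by the certificate_expiring_soon alert described later in this document.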

Affected Files

  • hosts/monitoring01/configuration.nix - Removed labmon configuration block
  • services/monitoring/prometheus.nix - Removed labmon scrape target
  • services/monitoring/rules.yml - Removed certificate_rules alert group
  • services/monitoring/alloy.nix - Deleted (was only used for labmon profiling)
  • services/monitoring/default.nix - Removed alloy.nix import

Removed Alerts

  • certificate_expiring_soon - Warned when any monitored TLS cert had < 24h validity
  • step_ca_serving_cert_expiring - Critical alert for step-ca's own serving certificate
  • certificate_check_error - Warned when TLS connection check failed
  • step_ca_certificate_expiring - Critical alert for step-ca issued certificates

Why It Was Removed

  1. step-ca decommissioned: The primary monitoring target (step-ca) is no longer in use
  2. Outdated codebase: labmon was a custom tool that required maintenance
  3. Limited value: With ACME auto-renewal, certificates renew without manual intervention, so expiry monitoring mainly serves as a safety net

Current State

ACME certificates are now issued by OpenBao PKI at vault.home.2rjus.net:8200 (ACME directory: /v1/pki_int/acme/directory). Renewal is handled automatically by the ACME clients, which typically renew certificates well before expiration.

Future Needs

While ACME handles renewal automatically, we should consider monitoring for:

  1. ACME renewal failures: Alert when a certificate fails to renew

    • Could monitor ACME client logs (via Loki queries)
    • Could check certificate file modification times
  2. Certificate expiration as backup: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures

  3. Certificate transparency: Monitor for unexpected certificate issuance
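The "certificate file modification times" idea from item 1 could look roughly like the sketch below. The function names and the `max_age_days` threshold are assumptions made for illustration; a sensible threshold depends on the certificate lifetime and renewal window, which this document does not specify. The paths to check are also left to the caller.

```python
import os
import time


def is_stale(mtime: float, now: float, max_age_days: float) -> bool:
    """True if a file was last modified more than max_age_days before now."""
    return (now - mtime) > max_age_days * 86400


def stale_certificates(paths, max_age_days, now=None):
    """Return the certificate files that have not been rewritten recently.

    A file that is older than the expected renewal interval suggests that
    renewal has stopped working for that certificate.
    """
    now = time.time() if now is None else now
    return [p for p in paths if is_stale(os.path.getmtime(p), now, max_age_days)]
```

Modification time is only a proxy: it shows that the file was rewritten, not that the new certificate is valid, so it complements rather than replaces an expiration check.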

Potential Solutions

  1. Prometheus blackbox_exporter: Can probe TLS endpoints and export certificate expiration metrics

    • probe_ssl_earliest_cert_expiry metric
    • Already a standard tool, well-maintained
  2. Custom Loki alerting: Query ACME service logs for renewal failures

    • Works with existing infrastructure
    • No additional services needed
  3. Node-exporter textfile collector: Script that checks local certificate files and writes expiration metrics
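As a rough illustration of option 3, a textfile collector script might look like the following. This is a sketch under assumptions, not a chosen design: the metric name `cert_file_not_after_seconds` and all function names are hypothetical, the `openssl` CLI is assumed to be on PATH, and certificate paths are assumed to contain no characters that need escaping in a Prometheus label.

```python
import os
import ssl
import subprocess
import tempfile


def parse_enddate(openssl_output: str) -> int:
    """Parse 'notAfter=Jan  5 09:34:43 2030 GMT' into a Unix timestamp."""
    _, _, value = openssl_output.strip().partition("=")
    return ssl.cert_time_to_seconds(value)


def cert_not_after(path: str) -> int:
    """Read the expiry of a PEM certificate file via the openssl CLI."""
    result = subprocess.run(
        ["openssl", "x509", "-in", path, "-noout", "-enddate"],
        check=True, capture_output=True, text=True,
    )
    return parse_enddate(result.stdout)


def render_metrics(expiries: dict) -> str:
    """Render {path: unix_timestamp} in the Prometheus text exposition format."""
    lines = [
        "# HELP cert_file_not_after_seconds Expiry of a local certificate file as a Unix timestamp.",
        "# TYPE cert_file_not_after_seconds gauge",
    ]
    for path, timestamp in sorted(expiries.items()):
        lines.append(f'cert_file_not_after_seconds{{path="{path}"}} {timestamp}')
    return "\n".join(lines) + "\n"


def write_atomically(text: str, dest: str) -> None:
    """Write to a temp file in the same directory, then rename over dest,
    so node_exporter never reads a partially written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter may run as another user
    os.replace(tmp, dest)
```

A timer-driven script would call `cert_not_after` for each certificate, pass the results to `render_metrics`, and write the output to a `.prom` file in the textfile collector directory. Exporting the expiry as a timestamp mirrors probe_ssl_earliest_cert_expiry, so an alert can compare either metric against `time()`.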

Status

Not yet implemented. This document serves as a placeholder for future work on certificate monitoring.