From 21db7e95730445c97275ef5c754dee5da1b9db7f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sat, 7 Feb 2026 18:20:10 +0100
Subject: [PATCH 1/3] acme: migrate from step-ca to OpenBao PKI

Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).

- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs

Co-Authored-By: Claude Opus 4.5
---
 docs/plans/cert-monitoring.md        | 72 ++++++++++++++++++++++++++++
 hosts/monitoring01/configuration.nix | 55 ---------------------
 services/http-proxy/proxy.nix        |  2 +-
 services/monitoring/alloy.nix        | 41 ----------------
 services/monitoring/default.nix      |  1 -
 services/monitoring/prometheus.nix   |  8 ----
 services/monitoring/rules.yml        | 34 -------------
 services/nix-cache/proxy.nix         |  2 +-
 system/acme.nix                      |  2 +-
 9 files changed, 75 insertions(+), 142 deletions(-)
 create mode 100644 docs/plans/cert-monitoring.md
 delete mode 100644 services/monitoring/alloy.nix

diff --git a/docs/plans/cert-monitoring.md b/docs/plans/cert-monitoring.md
new file mode 100644
index 0000000..7027a61
--- /dev/null
+++ b/docs/plans/cert-monitoring.md
@@ -0,0 +1,72 @@
+# Certificate Monitoring Plan
+
+## Summary
+
+This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.
+
+## What Was Removed
+
+### labmon Service
+
+The `labmon` service was a custom Go application that provided:
+
+1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
+2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration
+
+The service exposed Prometheus metrics at `:9969` including:
+- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
+- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
+- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration
+
+### Affected Files
+
+- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
+- `services/monitoring/prometheus.nix` - Removed labmon scrape target
+- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
+- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
+- `services/monitoring/default.nix` - Removed alloy.nix import
+
+### Removed Alerts
+
+- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
+- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
+- `certificate_check_error` - Warned when TLS connection check failed
+- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates
+
+## Why It Was Removed
+
+1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
+2. **Outdated codebase**: labmon was a custom tool that required maintenance
+3. **Limited value**: With ACME auto-renewal, certificates should renew automatically
+
+## Current State
+
+ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.
+
+## Future Needs
+
+While ACME handles renewal automatically, we should consider monitoring for:
+
+1. **ACME renewal failures**: Alert when a certificate fails to renew
+   - Could monitor ACME client logs (via Loki queries)
+   - Could check certificate file modification times
+
+2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures
+
+3. **Certificate transparency**: Monitor for unexpected certificate issuance
+
+### Potential Solutions
+
+1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
+   - `probe_ssl_earliest_cert_expiry` metric
+   - Already a standard tool, well-maintained
+
+2. **Custom Loki alerting**: Query ACME service logs for renewal failures
+   - Works with existing infrastructure
+   - No additional services needed
+
+3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics
+
+## Status
+
+**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.
diff --git a/hosts/monitoring01/configuration.nix b/hosts/monitoring01/configuration.nix
index 713dbf8..b014900 100644
--- a/hosts/monitoring01/configuration.nix
+++ b/hosts/monitoring01/configuration.nix
@@ -100,61 +100,6 @@
     ];
   };
 
-  labmon = {
-    enable = true;
-
-    settings = {
-      ListenAddr = ":9969";
-      Profiling = true;
-      StepMonitors = [
-        {
-          Enabled = true;
-          BaseURL = "https://ca.home.2rjus.net";
-          RootID = "3381bda8015a86b9a3cd1851439d1091890a79005e0f1f7c4301fe4bccc29d80";
-        }
-      ];
-
-      TLSConnectionMonitors = [
-        {
-          Enabled = true;
-          Address = "ca.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "jelly.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "grafana.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "prometheus.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "alertmanager.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-        {
-          Enabled = true;
-          Address = "pyroscope.home.2rjus.net:443";
-          Verify = true;
-          Duration = "12h";
-        }
-      ];
-    };
-  };
-
   # Open ports in the firewall.
   # networking.firewall.allowedTCPPorts = [ ... ];
   # networking.firewall.allowedUDPPorts = [ ... ];
diff --git a/services/http-proxy/proxy.nix b/services/http-proxy/proxy.nix
index c537d79..8756dd4 100644
--- a/services/http-proxy/proxy.nix
+++ b/services/http-proxy/proxy.nix
@@ -5,7 +5,7 @@
     package = pkgs.unstable.caddy;
     configFile = pkgs.writeText "Caddyfile" ''
       {
-        acme_ca https://ca.home.2rjus.net/acme/acme/directory
+        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
 
         metrics {
           per_host
diff --git a/services/monitoring/alloy.nix b/services/monitoring/alloy.nix
deleted file mode 100644
index 47ad83b..0000000
--- a/services/monitoring/alloy.nix
+++ /dev/null
@@ -1,41 +0,0 @@
-{ ... }:
-{
-  services.alloy = {
-    enable = true;
-  };
-
-  environment.etc."alloy/config.alloy" = {
-    enable = true;
-    mode = "0644";
-    text = ''
-      pyroscope.write "local_pyroscope" {
-        endpoint {
-          url = "http://localhost:4040"
-        }
-      }
-
-      pyroscope.scrape "labmon" {
-        targets = [{"__address__" = "localhost:9969", "service_name" = "labmon"}]
-        forward_to = [pyroscope.write.local_pyroscope.receiver]
-
-        profiling_config {
-          profile.process_cpu {
-            enabled = true
-          }
-          profile.memory {
-            enabled = true
-          }
-          profile.mutex {
-            enabled = true
-          }
-          profile.block {
-            enabled = true
-          }
-          profile.goroutine {
-            enabled = true
-          }
-        }
-      }
-    '';
-  };
-}
diff --git a/services/monitoring/default.nix b/services/monitoring/default.nix
index ba626b8..9c96ffd 100644
--- a/services/monitoring/default.nix
+++ b/services/monitoring/default.nix
@@ -7,7 +7,6 @@
     ./pve.nix
     ./alerttonotify.nix
     ./pyroscope.nix
-    ./alloy.nix
     ./tempo.nix
   ];
 }
diff --git a/services/monitoring/prometheus.nix b/services/monitoring/prometheus.nix
index c37bd32..5dc28f9 100644
--- a/services/monitoring/prometheus.nix
+++ b/services/monitoring/prometheus.nix
@@ -178,14 +178,6 @@ in
           }
         ];
       }
-      {
-        job_name = "labmon";
-        static_configs = [
-          {
-            targets = [ "monitoring01.home.2rjus.net:9969" ];
-          }
-        ];
-      }
       # TODO: nix-cache_caddy can't be auto-generated because the cert is issued
       # for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
       # Consider adding a target override to homelab.monitoring.scrapeTargets.
diff --git a/services/monitoring/rules.yml b/services/monitoring/rules.yml
index 88c5e6c..2530d46 100644
--- a/services/monitoring/rules.yml
+++ b/services/monitoring/rules.yml
@@ -338,40 +338,6 @@ groups:
         annotations:
           summary: "Pyroscope service not running on {{ $labels.instance }}"
           description: "Pyroscope service not running on {{ $labels.instance }}"
-  - name: certificate_rules
-    rules:
-      - alert: certificate_expiring_soon
-        expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "TLS certificate expiring soon for {{ $labels.instance }}"
-          description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
-      - alert: step_ca_serving_cert_expiring
-        expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Step-CA serving certificate expiring"
-          description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
-      - alert: certificate_check_error
-        expr: labmon_tlsconmon_certificate_check_error == 1
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Error checking certificate for {{ $labels.address }}"
-          description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
-      - alert: step_ca_certificate_expiring
-        expr: labmon_stepmon_certificate_seconds_left < 3600
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Step-CA certificate expiring for {{ $labels.instance }}"
-          description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
   - name: proxmox_rules
     rules:
       - alert: pve_node_down
diff --git a/services/nix-cache/proxy.nix b/services/nix-cache/proxy.nix
index 9762b41..f8eaab7 100644
--- a/services/nix-cache/proxy.nix
+++ b/services/nix-cache/proxy.nix
@@ -5,7 +5,7 @@
     package = pkgs.unstable.caddy;
     configFile = pkgs.writeText "Caddyfile" ''
       {
-        acme_ca https://ca.home.2rjus.net/acme/acme/directory
+        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
 
         metrics
       }
diff --git a/system/acme.nix b/system/acme.nix
index 4466c9e..0aa628e 100644
--- a/system/acme.nix
+++ b/system/acme.nix
@@ -3,7 +3,7 @@
   security.acme = {
     acceptTerms = true;
     defaults = {
-      server = "https://ca.home.2rjus.net/acme/acme/directory";
+      server = "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";
       email = "root@home.2rjus.net";
       dnsPropagationCheck = false;
     };

From 9d019f2b9abefd1a1e893a72b4d310c9aaa46f1f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sat, 7 Feb 2026 18:22:28 +0100
Subject: [PATCH 2/3] testvm01: add nginx with ACME certificate for PKI testing

Set up a simple nginx server with an ACME certificate from the new OpenBao PKI infrastructure. This allows testing the ACME migration before deploying to production hosts.

Co-Authored-By: Claude Opus 4.5
---
 hosts/testvm01/configuration.nix | 33 ++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/hosts/testvm01/configuration.nix b/hosts/testvm01/configuration.nix
index c493c76..572084b 100644
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -62,6 +62,39 @@
     git
   ];
 
+  # Test nginx with ACME certificate from OpenBao PKI
+  services.nginx = {
+    enable = true;
+    virtualHosts."testvm01.home.2rjus.net" = {
+      forceSSL = true;
+      enableACME = true;
+      locations."/" = {
+        root = pkgs.writeTextDir "index.html" ''
+
+
+
+          testvm01 - ACME Test
+
+
+
+          OpenBao PKI ACME Test
+          If you're seeing this over HTTPS, the migration worked!
+
+          Why do programmers prefer dark mode?
+          Because light attracts bugs.
+
+          Certificate issued by: vault.home.2rjus.net
+
+
+        '';
+      };
+    };
+  };
+
   # Open ports in the firewall.
   # networking.firewall.allowedTCPPorts = [ ... ];
   # networking.firewall.allowedUDPPorts = [ ... ];

From 46f03871f1d487dde3735e947522b58295703a9d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sat, 7 Feb 2026 18:32:27 +0100
Subject: [PATCH 3/3] docs: update CLAUDE.md for PR creation and labmon removal

- Add note that gh pr create is not supported
- Remove labmon from Prometheus job names list
- Remove labmon from flake inputs list

Co-Authored-By: Claude Opus 4.5
---
 CLAUDE.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index ccf67cc..4e89ad6 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -101,6 +101,8 @@ Legacy sops-nix is still present but only actively used by the `ca` host. Do not
 
 **Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.
 
+**Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.
+
 When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
 
 ### Plan Management
@@ -214,7 +216,7 @@ The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `
 - `wireguard` - VPN metrics (http-proxy)
 - `pushgateway` - Push-based metrics (e.g., backup results)
 - `restic_rest` - Backup server metrics
-- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
+- `ghettoptt` / `alertmanager` - Other service metrics
 
 **Example PromQL queries:**
 ```
@@ -374,7 +376,6 @@ Template hosts:
 - `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
 - Custom packages from git.t-juice.club:
   - `alerttonotify` - Alert routing
-  - `labmon` - Lab monitoring
 
 ### Network Architecture