monitoring: exclude step-ca serving cert from general expiry alert
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m23s
Run nix flake check / flake-check (pull_request) Failing after 4m46s

The step-ca serving certificate is auto-renewed with a 24h lifetime,
so it always triggers the general < 86400s threshold. Exclude it and
add a dedicated step_ca_serving_cert_expiring alert at < 1h instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:12:42 +01:00
parent fa6380e767
commit 3a9a47f1ad


@@ -307,13 +307,21 @@ groups:
   - name: certificate_rules
     rules:
       - alert: certificate_expiring_soon
-        expr: labmon_tlsconmon_certificate_seconds_left < 86400
+        expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
         for: 5m
         labels:
           severity: warning
         annotations:
           summary: "TLS certificate expiring soon for {{ $labels.instance }}"
           description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
+      - alert: step_ca_serving_cert_expiring
+        expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Step-CA serving certificate expiring"
+          description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
       - alert: certificate_check_error
         expr: labmon_tlsconmon_certificate_check_error == 1
         for: 5m
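
One way to check the split between the two alerts is a `promtool test rules` unit test. The sketch below is not part of this commit. It assumes the rules are available as a standalone file named `rules.yml` (hypothetical path) and that the series carries an `instance` label with the made-up value `labmon`. It feeds a constant 3000 seconds of remaining validity for the step-ca address and expects only the dedicated alert to fire once the `for: 5m` window has passed.

```yaml
# Hypothetical promtool unit test; file names and the instance label are assumptions.
rule_files:
  - rules.yml            # assumed location of the certificate_rules group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Constant 3000s left: below both the 3600s and the 86400s thresholds.
      - series: 'labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443", instance="labmon"}'
        values: '3000+0x10'
    alert_rule_test:
      # The dedicated alert should be firing after the 5m "for" window.
      - eval_time: 10m
        alertname: step_ca_serving_cert_expiring
        exp_alerts:
          - exp_labels:
              severity: critical
              address: ca.home.2rjus.net:443
              instance: labmon
            exp_annotations:
              summary: "Step-CA serving certificate expiring"
              description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
      # The general alert excludes this address, so nothing should fire.
      - eval_time: 10m
        alertname: certificate_expiring_soon
        exp_alerts: []
```

If saved as, say, `certificate_rules_test.yml`, it could be run with `promtool test rules certificate_rules_test.yml`. The label set in `exp_labels` has to match the real series labels exactly, so the values above would need adjusting to the actual exporter output.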