monitoring: fix blackbox rules and add force-push policy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Move certificate alert rules to rules.yml instead of adding them as a separate rules string in blackbox.nix. The previous approach caused a YAML parse error due to duplicate 'groups' keys.

Also add policy to CLAUDE.md: never force push to master.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
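
The duplicate 'groups' failure described above can be pictured roughly as follows. This is only a sketch of the likely shape, not the actual generated output, which this commit does not show: the group name `existing_rules` and the empty `rules: []` lists are placeholders chosen for illustration. The first document shows two rule strings ending up in one YAML document; the second shows the merged layout that a single rules.yml gives.

```yaml
# Assumed broken shape: two rule strings concatenated into one document,
# so the top-level mapping key `groups` appears twice.
groups:
  - name: existing_rules        # hypothetical placeholder name
    rules: []
groups:                         # duplicate top-level key, the cause of the parse error
  - name: certificate_rules
    rules: []
---
# Merged shape: one `groups` list that holds both rule groups.
groups:
  - name: existing_rules        # hypothetical placeholder name
    rules: []
  - name: certificate_rules
    rules: []
```

Keeping every group under one `groups` key in rules.yml avoids the collision regardless of how many rule groups are added later.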
@@ -132,6 +132,8 @@ Terraform manages the secrets and AppRole policies in `terraform/vault/`.
 
 **Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.
 
+**Important:** Never force push to `master`. If a commit on master has an error, fix it with a new commit rather than rewriting history.
+
 **Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.
 
 When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
@@ -392,3 +392,29 @@ groups:
         annotations:
           summary: "Cannot scrape OpenBao metrics from {{ $labels.instance }}"
           description: "OpenBao metrics endpoint is not responding on {{ $labels.instance }}."
+  - name: certificate_rules
+    rules:
+      - alert: tls_certificate_expiring_soon
+        expr: (probe_ssl_earliest_cert_expiry - time()) < 86400 * 7
+        for: 1h
+        labels:
+          severity: warning
+        annotations:
+          summary: "TLS certificate expiring soon on {{ $labels.instance }}"
+          description: "The TLS certificate for {{ $labels.instance }} expires in less than 7 days."
+      - alert: tls_certificate_expiring_critical
+        expr: (probe_ssl_earliest_cert_expiry - time()) < 86400
+        for: 0m
+        labels:
+          severity: critical
+        annotations:
+          summary: "TLS certificate expiring within 24h on {{ $labels.instance }}"
+          description: "The TLS certificate for {{ $labels.instance }} expires in less than 24 hours. Immediate action required."
+      - alert: tls_probe_failed
+        expr: probe_success{job="blackbox_tls"} == 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "TLS probe failed for {{ $labels.instance }}"
+          description: "Cannot connect to {{ $labels.instance }} to check TLS certificate. The service may be down or unreachable."