From 5bfb51a4976211190fda025626df2574a8bf8df2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Tue, 10 Feb 2026 22:46:38 +0100
Subject: [PATCH] docs: add observability phase to nix-cache plan

- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures

Co-Authored-By: Claude Opus 4.5
---
 docs/plans/nix-cache-reprovision.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/docs/plans/nix-cache-reprovision.md b/docs/plans/nix-cache-reprovision.md
index 94e405c..2bb57ef 100644
--- a/docs/plans/nix-cache-reprovision.md
+++ b/docs/plans/nix-cache-reprovision.md
@@ -118,6 +118,30 @@ Or trigger builds from CI after merges to master.
 - **Disk size?** 200GB for new host
 - **Build host specs?** 8 cores, 16-24GB RAM matches current nix-cache01
 
+### Phase 6: Observability
+
+1. **Alerting rules** for build failures:
+   ```promql
+   # Alert if any build fails
+   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
+
+   # Alert if no successful builds in 24h (scheduled builds stopped)
+   time() - homelab_deploy_build_last_success_timestamp > 86400
+   ```
+
+2. **Grafana dashboard** for build metrics:
+   - Build success/failure rate over time
+   - Average build duration per host (histogram)
+   - Build frequency (builds per hour/day)
+   - Last successful build timestamp per repo
+
+Available metrics:
+- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
+- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
+- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
+- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
+- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
+
 ## Open Questions
 
 - [ ] When to cut over DNS from nix-cache01 to nix-cache02?
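
Review note: the two PromQL expressions in Phase 6 could be wrapped in a Prometheus rule group roughly as follows. This is a minimal sketch, not the repository's actual configuration. The group name, alert names, `for` duration, severity labels, and annotation text are assumptions chosen for illustration; only the `expr` values and the `repo`/`host` labels come from the patch.

```yaml
# Hypothetical rule file; every name except the metric names and labels is illustrative.
groups:
  - name: nix-cache-builds            # assumed group name
    rules:
      # Fires when any per-host build failure was counted in the last hour.
      - alert: NixCacheBuildFailed    # assumed alert name
        expr: increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
        labels:
          severity: warning           # assumed severity
        annotations:
          summary: "Build failed for host {{ $labels.host }} (repo {{ $labels.repo }})"

      # Fires when a repo has had no successful build for more than 24h.
      - alert: NixCacheBuildsStale    # assumed alert name
        expr: time() - homelab_deploy_build_last_success_timestamp > 86400
        for: 15m                      # assumed buffer to avoid flapping
        labels:
          severity: critical          # assumed severity
        annotations:
          summary: "No successful build for repo {{ $labels.repo }} in over 24h"
```

One caveat with the staleness expression: it returns no series for a repo that has never reported a successful build, so the rule stays silent in that case. If that matters, a separate rule built on `absent()` may be worth considering.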