docs: add observability phase to nix-cache plan
- Add Phase 6 for alerting and Grafana dashboards - Document available Prometheus metrics - Include example alerting rules for build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -118,6 +118,30 @@ Or trigger builds from CI after merges to master.
|
||||
- **Disk size?** 200GB for new host
|
||||
- **Build host specs?** 8 cores, 16-24GB RAM matches current nix-cache01
|
||||
|
||||
### Phase 6: Observability
|
||||
|
||||
1. **Alerting rules** for build failures:
|
||||
```promql
|
||||
# Alert if any build fails
|
||||
increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
|
||||
|
||||
# Alert if no successful builds in 24h (scheduled builds stopped)
|
||||
time() - homelab_deploy_build_last_success_timestamp > 86400
|
||||
```
|
||||
|
||||
2. **Grafana dashboard** for build metrics:
|
||||
- Build success/failure rate over time
|
||||
- Average build duration per host (histogram)
|
||||
- Build frequency (builds per hour/day)
|
||||
- Last successful build timestamp per repo
|
||||
|
||||
Available metrics:
|
||||
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
|
||||
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
|
||||
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
|
||||
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
|
||||
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
|
||||
|
||||
## Open Questions
|
||||
|
||||
- [ ] When to cut over DNS from nix-cache01 to nix-cache02?
|
||||
|
||||
Reference in New Issue
Block a user