docs: add observability phase to nix-cache plan

- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -118,6 +118,30 @@ Or trigger builds from CI after merges to master.
- **Disk size?** 200GB for new host
- **Build host specs?** 8 cores, 16-24GB RAM matches current nix-cache01

### Phase 6: Observability

1. **Alerting rules** for build failures:
   ```promql
   # Alert if any build fails
   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0

   # Alert if no successful builds in 24h (scheduled builds stopped)
   time() - homelab_deploy_build_last_success_timestamp > 86400
   ```

2. **Grafana dashboard** for build metrics:
   - Build success/failure rate over time
   - Average build duration per host (histogram)
   - Build frequency (builds per hour/day)
   - Last successful build timestamp per repo

Available metrics:
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
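
The dashboard panels listed above could be backed by queries along these lines. This is a sketch under stated assumptions, not a tested dashboard. It assumes that successful builds are recorded with `status="success"` (only `status="failure"` appears in the alerting rules), that the duration histogram carries a `host` label, and that a 24h window suits the build schedule. Each expression would go into its own panel query.

```promql
# Build success ratio per repo over the last 24h
# (assumes the label value "success" exists)
sum by (repo) (increase(homelab_deploy_builds_total{status="success"}[24h]))
  /
sum by (repo) (increase(homelab_deploy_builds_total[24h]))

# Average build duration per host over the last 24h
# (assumes the histogram has a "host" label)
sum by (host) (rate(homelab_deploy_build_duration_seconds_sum[24h]))
  /
sum by (host) (rate(homelab_deploy_build_duration_seconds_count[24h]))

# 95th percentile build duration per host, from the histogram buckets
histogram_quantile(
  0.95,
  sum by (le, host) (rate(homelab_deploy_build_duration_seconds_bucket[24h]))
)

# Build frequency: builds per hour, per repo
sum by (repo) (increase(homelab_deploy_builds_total[1h]))

# Seconds since the last successful build, per repo
time() - homelab_deploy_build_last_success_timestamp
```

The last query is the same expression used in the 24h alerting rule, without the threshold, so the panel and the alert should stay consistent. If builds run only a few times per day, shorter windows will likely produce gaps in the graphs.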

## Open Questions

- [ ] When to cut over DNS from nix-cache01 to nix-cache02?