# Nix Cache Host Reprovision

## Overview

Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:

1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master

## Status

- **Phase 1: New Build Host** - COMPLETE
- **Phase 2: NATS Build Triggering** - COMPLETE
- **Phase 3: Safe Flake Update Workflow** - NOT STARTED
- **Phase 4: Decommission Old System** - NOT STARTED

## Completed Work

### New Build Host (nix-cache02)

Instead of reprovisioning nix-cache01 in place, we created a new host, `nix-cache02`, at 10.69.13.25:

- **Specs**: 8 CPU cores, 16 GB RAM (temporary; to be increased to 24 GB once nix-cache01 is decommissioned), 200 GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
  - `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
  - `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`

### NATS-Based Build Triggering

The `homelab-deploy` tool was extended with a builder mode:

**NATS Subjects:**
- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`

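As an illustration of the `build.<repo>.<target>` scheme, here is a minimal Python sketch that composes and parses such subjects. The function names `build_subject` and `parse_subject` are hypothetical and not part of `homelab-deploy`; the token validation reflects general NATS subject rules, not the tool's actual checks.

```python
def build_subject(repo: str, target: str) -> str:
    """Return the subject for a build request, e.g. build.nixos-servers.ns1."""
    for token in (repo, target):
        # NATS subject tokens must be non-empty and free of '.', wildcards, and whitespace.
        if not token or any(c in token for c in ".*> \t"):
            raise ValueError(f"invalid subject token: {token!r}")
    return f"build.{repo}.{target}"


def parse_subject(subject: str) -> tuple[str, str]:
    """Split a build subject back into (repo, target)."""
    parts = subject.split(".")
    if len(parts) != 3 or parts[0] != "build":
        raise ValueError(f"not a build subject: {subject!r}")
    return parts[1], parts[2]


if __name__ == "__main__":
    print(build_subject("nixos-servers", "all"))
    print(parse_subject("build.nixos-servers.ns1"))
```

Keeping the repo and target as separate tokens is what lets a subscriber filter on either one with NATS wildcards.
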
**NATS Permissions (in DEPLOY account):**

| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |

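To sanity-check what these permissions allow, the sketch below reimplements NATS wildcard matching in Python (`*` matches exactly one token, `>` matches one or more trailing tokens) and applies it to the patterns from the table. The user keys (`builder`, `test-deployer`, `admin-deployer`) and the helper names are assumptions made for this example; the real enforcement happens in the NATS server.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last pattern token and needs at least one token to consume.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Patterns copied from the permissions table; the user keys are hypothetical.
PERMISSIONS = {
    "builder": {
        "publish": ["build.responses.>"],
        "subscribe": ["build.>"],
    },
    "test-deployer": {
        "publish": ["deploy.test.>", "deploy.discover", "build.>"],
        "subscribe": ["deploy.responses.>", "deploy.discover", "build.responses.>"],
    },
    "admin-deployer": {
        "publish": ["deploy.>", "build.>"],
        "subscribe": ["deploy.>", "build.responses.>"],
    },
}


def allowed(user: str, action: str, subject: str) -> bool:
    """Return True if any pattern granted to the user covers the subject."""
    return any(subject_matches(p, subject) for p in PERMISSIONS[user][action])


if __name__ == "__main__":
    print(allowed("test-deployer", "publish", "build.nixos-servers.all"))
    print(allowed("test-deployer", "publish", "deploy.prod.ns1"))
```

One consequence worth noting: because `>` covers every remaining token, the builder's `build.>` subscription also matches subjects under `build.responses.`.
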
**Vault Secrets:**
- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication

**NixOS Configuration:**
- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user

**MCP Integration:**
- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code

**Tested:**
- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)

## Current State

### Old System (nix-cache01)
- Still running at 10.69.13.15
- Serves binary cache via Harmonia
- Runs Gitea Actions runner
- Has the old `build-flakes.sh` timer (every 30 min)
- Will be decommissioned after nix-cache02 is fully validated

### New System (nix-cache02)
- Running at 10.69.13.25
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Does NOT yet have:
  - Harmonia (binary cache server)
  - Actions runner
  - Cache signing key

## Remaining Work

### Phase 3: Safe Flake Update Workflow

1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run

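The workflow itself is not written yet, so the following is only a rough Python sketch of the gating idea from the Overview: update the lock file, build every host, and push only if all builds succeed. All names here (`safe_flake_update` and the three injected callables) are hypothetical placeholders for whatever the eventual workflow steps turn out to be.

```python
from typing import Callable, Iterable, List, Tuple


def safe_flake_update(
    update_lock: Callable[[], None],
    build_host: Callable[[str], bool],
    hosts: Iterable[str],
    push: Callable[[], None],
) -> Tuple[bool, List[str]]:
    """Update, build every host, and push only when nothing failed.

    Returns (pushed, failed_hosts).
    """
    update_lock()
    # Build sequentially and collect every failure rather than stopping at the first.
    failed = [host for host in hosts if not build_host(host)]
    if failed:
        # Leave master untouched; the caller can report the failing hosts.
        return False, failed
    push()
    return True, []


if __name__ == "__main__":
    pushed, failed = safe_flake_update(
        update_lock=lambda: print("updating flake.lock"),
        build_host=lambda host: host != "broken-host",
        hosts=["ns1", "broken-host"],
        push=lambda: print("pushing to master"),
    )
    print(pushed, failed)
```

Collecting all failures instead of aborting on the first one gives a complete picture of what a flake update breaks in a single run.
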
### Phase 4: Complete Migration

1. **Add Harmonia to nix-cache02** - Copy cache signing key, configure service
2. **Add Actions runner to nix-cache02** - Configure with Vault token
3. **Update DNS** - Point `nix-cache.home.2rjus.net` to nix-cache02
4. **Increase RAM** - Bump to 24GB after nix-cache01 is gone
5. **Decommission nix-cache01**:
   - Remove from `terraform/vms.tf`
   - Remove old build script (`services/nix-cache/build-flakes.nix`, `build-flakes.sh`)
   - Archive or delete host config

### Phase 5: Scheduled Builds (Optional)

Add a systemd timer on nix-cache02 to trigger periodic builds via NATS:

```nix
# Builds all hosts in each repo, one repo after the other.
systemd.services.scheduled-build = {
  script = ''
    homelab-deploy build nixos-servers --all
    homelab-deploy build nixos --all
  '';
};
# Runs at minute 30 of every hour.
systemd.timers.scheduled-build = {
  wantedBy = [ "timers.target" ];
  timerConfig.OnCalendar = "*-*-* *:30:00";
};
```

Or trigger builds from CI after merges to master.

## Resolved Questions

- **Parallel vs sequential builds?** Sequential. Hosts share packages, so builds after the first one are fast
- **What about gunter?** Configured as the `nixos` repo in builder settings
- **Disk size?** 200GB for the new host
- **Build host specs?** 8 cores and 16-24GB RAM, matching the current nix-cache01

### Phase 6: Observability

1. **Alerting rules** for build failures:
   ```promql
   # Alert if any build fails
   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0

   # Alert if no successful builds in 24h (scheduled builds stopped)
   time() - homelab_deploy_build_last_success_timestamp > 86400
   ```

2. **Grafana dashboard** for build metrics:
   - Build success/failure rate over time
   - Average build duration per host (histogram)
   - Build frequency (builds per hour/day)
   - Last successful build timestamp per repo

Available metrics:
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build

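To make the two alert conditions concrete, here is a small Python sketch that evaluates them on sample values. It is a simplified model, not how Prometheus computes them: `counter_increase` sums deltas between raw samples and handles counter resets, but skips the extrapolation that the real `increase()` applies. The function names are hypothetical.

```python
import time
from typing import List, Optional

STALE_AFTER_SECONDS = 86400  # 24h, the threshold used in the second rule


def counter_increase(samples: List[float]) -> float:
    """Approximate increase() over a window of counter samples."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a service restart), so count from zero.
        total += cur - prev if cur >= prev else cur
    return total


def build_failed(failure_samples: List[float]) -> bool:
    """First rule: any failure counted within the window."""
    return counter_increase(failure_samples) > 0


def builds_stale(last_success_ts: float, now: Optional[float] = None) -> bool:
    """Second rule: no successful build in the last 24 hours."""
    now = time.time() if now is None else now
    return now - last_success_ts > STALE_AFTER_SECONDS


if __name__ == "__main__":
    print(build_failed([3, 3, 4]))
    print(builds_stale(last_success_ts=time.time() - 2 * 86400))
```

The staleness check only fires if builds are expected at least daily, so it is most useful once scheduled builds are in place.
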
## Open Questions

- [ ] When to cut over DNS from nix-cache01 to nix-cache02?
- [ ] Keep Actions runner on nix-cache02 or separate host?
- [ ] Implement safe flake update workflow before or after full migration?