diff --git a/docs/plans/nix-cache-reprovision.md b/docs/plans/nix-cache-reprovision.md
new file mode 100644
index 0000000..74f5394
--- /dev/null
+++ b/docs/plans/nix-cache-reprovision.md
@@ -0,0 +1,212 @@
+# Nix Cache Host Reprovision
+
+## Overview
+
+Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
+1. NATS-based remote build triggering (replacing the current bash script)
+2. A safer flake update workflow that validates builds before pushing to master
+
+## Current State
+
+### Host Configuration
+- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
+- Runs a Gitea Actions runner for CI workflows
+- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
+- Uses a dedicated XFS volume at `/nix` for cache storage
+
+### Current Build System (`services/nix-cache/build-flakes.sh`)
+- Runs every 30 minutes via systemd timer
+- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
+- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
+- Pushes success/failure metrics to pushgateway
+- Simple, but has no filtering, no parallelism, and no remote triggering
+
+### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
+- Runs daily at midnight via cron
+- Runs `nix flake update --commit-lock-file`
+- Pushes directly to master
+- No build validation, so it can push broken inputs
+
+## Improvement 1: NATS-Based Remote Build Triggering
+
+### Design
+
+Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.
+
+| Approach | Pros | Cons |
+|----------|------|------|
+| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
+| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
+| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |
+
+**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.
+
+### Implementation
+
+1. Add new message type to homelab-deploy: `build.` subject
+2. Listener on nix-cache01 subscribes to `build.>` wildcard
+3. On message receipt, builds the specified host and returns success/failure
+4. CLI command: `homelab-deploy build ` or `homelab-deploy build --all`
+
+### Benefits
+- Trigger rebuild for specific host to ensure it's cached
+- Could be called from CI after merging PRs
+- Reuses existing NATS infrastructure and auth
+- Progress/status could stream back via NATS reply
+
+## Improvement 2: Smarter Flake Update Workflow
+
+### Current Problems
+1. Updates can push breaking changes to master
+2. No visibility into what broke when it does
+3. Hosts that auto-update can pull broken configs
+
+### Proposed Workflow
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                      Flake Update Workflow                      │
+├─────────────────────────────────────────────────────────────────┤
+│  1. nix flake update (on feature branch)                        │
+│  2. Build ALL hosts locally                                     │
+│  3. If all pass → fast-forward merge to master                  │
+│  4. If any fail → create PR with failure logs attached          │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Implementation Options
+
+| Option | Description | Pros | Cons |
+|--------|-------------|------|------|
+| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
+| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
+| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |
+
+**Recommendation:** Option A with nix-cache01 as the runner. The host is already running Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.
+
+### Workflow Steps
+
+1. Workflow runs on schedule (daily or weekly)
+2. Creates branch `flake-update/YYYY-MM-DD`
+3. Runs `nix flake update --commit-lock-file`
+4. Builds each host: `nix build .#nixosConfigurations..config.system.build.toplevel`
+5. If all succeed:
+   - Fast-forward merge to master
+   - Delete feature branch
+6. If any fail:
+   - Create PR from the update branch
+   - Attach build logs as PR comment
+   - Label PR with `needs-review` or `build-failure`
+   - Do NOT merge automatically
+
+### Workflow File Changes
+
+```yaml
+# New: .github/workflows/flake-update-safe.yaml
+name: Safe flake update
+on:
+  schedule:
+    - cron: "0 2 * * 0"  # Weekly on Sunday at 2 AM
+  workflow_dispatch:      # Manual trigger
+
+jobs:
+  update-and-validate:
+    runs-on: homelab  # Use self-hosted runner on nix-cache01
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: master
+          fetch-depth: 0  # Need full history for merge
+
+      - name: Create update branch
+        run: |
+          BRANCH="flake-update/$(date +%Y-%m-%d)"
+          git checkout -b "$BRANCH"
+          # Each step runs in its own shell, so export BRANCH for later steps
+          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"
+
+      - name: Update flake
+        run: nix flake update --commit-lock-file
+
+      - name: Build all hosts
+        id: build
+        run: |
+          # Without pipefail the pipeline returns tee's exit status and hides build failures
+          set -o pipefail
+          FAILED=""
+          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
+            echo "Building $host..."
+            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
+              FAILED="$FAILED $host"
+            fi
+          done
+          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"
+
+      - name: Merge to master (if all pass)
+        if: steps.build.outputs.failed == ''
+        run: |
+          git checkout master
+          git merge --ff-only "$BRANCH"
+          git push origin master
+          # The update branch was never pushed, so only the local copy needs deleting
+          git branch -d "$BRANCH"
+
+      - name: Create PR (if any fail)
+        if: steps.build.outputs.failed != ''
+        run: |
+          git push origin "$BRANCH"
+          # Create PR via Gitea API with build logs
+          # ... (PR creation with log attachment)
+```
+
+## Migration Steps
+
+### Phase 1: Reprovision Host via OpenTofu
+
+1. Add `nix-cache01` to `terraform/vms.tf`:
+   ```hcl
+   "nix-cache01" = {
+     ip        = "10.69.13.15/24"
+     cpu_cores = 4
+     memory    = 8192
+     disk_size = "100G"  # Larger for nix store
+   }
+   ```
+
+2. Shut down existing nix-cache01 VM
+3. Run `tofu apply` to provision new VM
+4. Verify bootstrap completes and cache is serving
+
+**Note:** The cache will be cold after reprovision.
+Run initial builds to populate.
+
+### Phase 2: Add Build Triggering to homelab-deploy
+
+1. Add `build` command to homelab-deploy CLI
+2. Add listener handler in NixOS module for `build.*` subjects
+3. Update nix-cache01 config to enable build listener
+4. Test with `homelab-deploy build testvm01`
+
+### Phase 3: Implement Safe Flake Update Workflow
+
+1. Create `.github/workflows/flake-update-safe.yaml`
+2. Disable or remove old `flake-update.yaml`
+3. Test manually with `workflow_dispatch`
+4. Monitor first automated run
+
+### Phase 4: Remove Old Build Script
+
+1. After new workflow is stable, remove:
+   - `services/nix-cache/build-flakes.nix`
+   - `services/nix-cache/build-flakes.sh`
+2. The new workflow handles scheduled builds
+
+## Open Questions
+
+- [ ] What runner labels should the self-hosted runner use for the update workflow?
+- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
+- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
+- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
+- [ ] What to do about `gunter` (external nixos repo) - include in validation?
+- [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?
+
+## Notes
+
+- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
+- The Harmonia service and cache signing key will work the same after reprovision
+- Actions runner token is in Vault, will be provisioned automatically
+- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering
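
Review note on the subject wildcards: the plan mentions both `build.>` (Implementation, step 2) and `build.*` (Phase 2, step 2). In NATS, `*` matches exactly one token, while `>` matches one or more trailing tokens and is only valid as the last token. The two patterns therefore behave the same only while build subjects have exactly two tokens. The sketch below is a standalone matcher illustrating that difference. It is not code from `homelab-deploy`; the subject layout `build.testvm01` and the extra `dry-run` token are assumptions made purely for illustration.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subscription pattern matches a concrete subject.

    Illustrative re-implementation of NATS wildcard rules:
    '*' matches exactly one token, '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last pattern token and needs at least one token to consume
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # subject ran out of tokens before the pattern did
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # No '>' seen: pattern and subject must have the same number of tokens
    return len(p_tokens) == len(s_tokens)


if __name__ == "__main__":
    # Hypothetical subjects; the real subject scheme is not specified in the plan.
    for subject in ("build.testvm01", "build.testvm01.dry-run", "build"):
        for pattern in ("build.>", "build.*"):
            print(f"{pattern!r} vs {subject!r}: {subject_matches(pattern, subject)}")
```

If build subjects might later gain extra tokens (for example a per-host action suffix), subscribing with `build.>` would keep receiving them, whereas `build.*` would silently stop matching. Settling on one pattern in both sections would avoid that ambiguity.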