# Nix Cache Host Reprovision

## Overview

Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:

1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master

## Current State

### Host Configuration

- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
- Runs a Gitea Actions runner for CI workflows
- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
- Uses a dedicated XFS volume at `/nix` for cache storage

### Current Build System (`services/nix-cache/build-flakes.sh`)

- Runs every 30 minutes via systemd timer
- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
- Pushes success/failure metrics to pushgateway
- Simple but has no filtering, no parallelism, no remote triggering

### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)

- Runs daily at midnight via cron
- Runs `nix flake update --commit-lock-file`
- Pushes directly to master
- No build validation, so broken inputs can reach master

## Improvement 1: NATS-Based Remote Build Triggering

### Design

Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.

| Approach | Pros | Cons |
|----------|------|------|
| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |

**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.

### Implementation

1. Add new message type to homelab-deploy: `build.<host>` subject
2. Listener on nix-cache01 subscribes to `build.>` wildcard
3. On message receipt, builds the specified host and returns success/failure
4. CLI command: `homelab-deploy build <host>` or `homelab-deploy build --all`

### Benefits

- Trigger rebuild for specific host to ensure it's cached
- Could be called from CI after merging PRs
- Reuses existing NATS infrastructure and auth
- Progress/status could stream back via NATS reply

## Improvement 2: Smarter Flake Update Workflow

### Current Problems

1. Updates can push breaking changes to master
2. No visibility into what broke when it does
3. Hosts that auto-update can pull broken configs

### Proposed Workflow

```
┌─────────────────────────────────────────────────────────────────┐
│                      Flake Update Workflow                      │
├─────────────────────────────────────────────────────────────────┤
│ 1. nix flake update (on feature branch)                         │
│ 2. Build ALL hosts locally                                      │
│ 3. If all pass → fast-forward merge to master                   │
│ 4. If any fail → create PR with failure logs attached           │
└─────────────────────────────────────────────────────────────────┘
```

### Implementation Options

| Option | Description | Pros | Cons |
|--------|-------------|------|------|
| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |

**Recommendation:** Option A with nix-cache01 as the runner. The host already runs a Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.

### Workflow Steps

1. Workflow runs on schedule (daily or weekly)
2. Creates branch `flake-update/YYYY-MM-DD`
3. Runs `nix flake update --commit-lock-file`
4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
5. If all succeed:
   - Fast-forward merge to master
   - Delete feature branch
6. If any fail:
   - Create PR from the update branch
   - Attach build logs as PR comment
   - Label PR with `needs-review` or `build-failure`
   - Do NOT merge automatically

### Workflow File Changes

```yaml
# New: .github/workflows/flake-update-safe.yaml
name: Safe flake update

on:
  schedule:
    - cron: "0 2 * * 0"  # Weekly on Sunday at 2 AM
  workflow_dispatch:     # Manual trigger

jobs:
  update-and-validate:
    runs-on: homelab  # Use self-hosted runner on nix-cache01
    steps:
      - uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 0  # Need full history for merge

      - name: Create update branch
        run: |
          BRANCH="flake-update/$(date +%Y-%m-%d)"
          git checkout -b "$BRANCH"
          # Each step runs in its own shell, so export BRANCH for later steps
          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"

      - name: Update flake
        run: nix flake update --commit-lock-file

      - name: Build all hosts
        id: build
        run: |
          # Without pipefail, the `if` below would test tee's exit status
          set -o pipefail
          FAILED=""
          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
            echo "Building $host..."
            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
              FAILED="$FAILED $host"
            fi
          done
          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"

      - name: Merge to master (if all pass)
        if: steps.build.outputs.failed == ''
        run: |
          git checkout master
          git merge --ff-only "$BRANCH"
          git push origin master
          # The branch is only pushed on failure, so delete the local copy
          git branch -d "$BRANCH"

      - name: Create PR (if any fail)
        if: steps.build.outputs.failed != ''
        run: |
          git push origin "$BRANCH"
          # Create PR via Gitea API with build logs
          # ... (PR creation with log attachment)
```

## Migration Steps

### Phase 1: Reprovision Host via OpenTofu

1. Add `nix-cache01` to `terraform/vms.tf`:

   ```hcl
   "nix-cache01" = {
     ip        = "10.69.13.15/24"
     cpu_cores = 4
     memory    = 8192
     disk_size = "100G"  # Larger for nix store
   }
   ```

2. Shut down existing nix-cache01 VM
3. Run `tofu apply` to provision new VM
4. Verify bootstrap completes and cache is serving

**Note:** The cache will be cold after reprovision. Run initial builds to populate.

### Phase 2: Add Build Triggering to homelab-deploy

1. Add `build` command to homelab-deploy CLI
2. Add listener handler in NixOS module for `build.*` subjects
3. Update nix-cache01 config to enable build listener
4. Test with `homelab-deploy build testvm01`

### Phase 3: Implement Safe Flake Update Workflow

1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run

### Phase 4: Remove Old Build Script

1. After new workflow is stable, remove:
   - `services/nix-cache/build-flakes.nix`
   - `services/nix-cache/build-flakes.sh`
2. The new workflow handles scheduled builds

## Open Questions

- [ ] What runner labels should the self-hosted runner use for the update workflow?
- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
- [ ] What to do about `gunter` (external nixos repo): include it in validation?
- [ ] Disk size for new nix-cache01: is 100G enough for cache + builds?

## Notes

- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
- The Harmonia service and cache signing key will work the same after reprovision
- Actions runner token is in Vault, will be provisioned automatically
- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering
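
As a rough illustration of the per-host build subject proposed under Improvement 1, the listener would need to turn an incoming subject such as `build.testvm01` into the flake attribute to build. The sketch below shows one way that mapping might look in bash. The function name `subject_to_attr` and the hostname validation rule are assumptions made for illustration; they are not part of `homelab-deploy` today.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of homelab-deploy): map a NATS subject of the
# form build.<host> to the flake attribute the listener would build.

subject_to_attr() {
  local subject="$1"
  # Strip the leading "build." prefix; leaves the subject unchanged if absent.
  local host="${subject#build.}"

  # Reject subjects without the build. prefix or with an empty host token.
  if [[ "$subject" != build.* || -z "$host" ]]; then
    echo "unsupported subject: $subject" >&2
    return 1
  fi

  # Assumed rule: accept a single lowercase hostname-like token only, so the
  # value is safe to interpolate into a flake attribute path.
  if [[ ! "$host" =~ ^[a-z0-9][a-z0-9-]*$ ]]; then
    echo "invalid host name: $host" >&2
    return 1
  fi

  printf '.#nixosConfigurations.%s.config.system.build.toplevel\n' "$host"
}
```

A handler could then run something like `nix build "$(subject_to_attr "$subject")"` and report the exit status in the NATS reply. Validating the host token before building keeps a malformed or multi-token subject from reaching `nix build` at all.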