Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Nix Cache Host Reprovision
Overview
Reprovision nix-cache01 using the OpenTofu workflow, and improve the build/cache system with:
- NATS-based remote build triggering (replacing the current bash script)
- Safer flake update workflow that validates builds before pushing to master
Current State
Host Configuration
- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
- Runs Gitea Actions runner for CI workflows
- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
- Uses a dedicated XFS volume at `/nix` for cache storage
Current Build System (services/nix-cache/build-flakes.sh)
- Runs every 30 minutes via systemd timer
- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
- Pushes success/failure metrics to pushgateway
- Simple but has no filtering, no parallelism, no remote triggering
Current Flake Update Workflow (.github/workflows/flake-update.yaml)
- Runs daily at midnight via cron
- Runs `nix flake update --commit-lock-file`
- Pushes directly to master
- No build validation — can push broken inputs
Improvement 1: NATS-Based Remote Build Triggering
Design
Extend the existing homelab-deploy tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.
| Approach | Pros | Cons |
|---|---|---|
| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |
Recommendation: Extend homelab-deploy with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.
Implementation
- Add new message type to homelab-deploy: `build.<host>` subject
- Listener on nix-cache01 subscribes to `build.>` wildcard
- On message receipt, builds the specified host and returns success/failure
- CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`
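As a rough illustration of the listener side, the sketch below parses a `build.<host>` subject and runs one build. It is not homelab-deploy's actual code: `host_from_subject`, `handle_build_request`, and the `BUILD_CMD` override are hypothetical names, and the NATS subscription itself is left out. In NATS, `build.>` matches one or more trailing tokens while `build.*` matches exactly one, so the parser here accepts only a single-token host.

```bash
# Hypothetical helper: extract the host from a "build.<host>" subject.
# Prints the host and returns 0, or returns 1 for anything else.
host_from_subject() {
  subject=$1
  case $subject in
    build.*) host=${subject#build.} ;;
    *) return 1 ;;
  esac
  # Accept a single token of letters, digits, '-' and '_' only, so the value
  # is safe to splice into a flake attribute path.
  case $host in
    ''|*[!A-Za-z0-9_-]*) return 1 ;;
  esac
  printf '%s\n' "$host"
}

# Hypothetical handler: build one host and print a one-line result that a
# listener could send back as the NATS reply. BUILD_CMD is an assumed override
# so the function can be exercised without Nix installed.
handle_build_request() {
  host=$(host_from_subject "$1") || { echo "error: bad subject $1"; return 2; }
  if ${BUILD_CMD:-nix build --no-link} \
      ".#nixosConfigurations.$host.config.system.build.toplevel" >/dev/null 2>&1; then
    echo "ok: $host"
  else
    echo "failed: $host"
    return 1
  fi
}
```

A listener could publish the printed line as the reply, for example `ok: testvm01` or `failed: testvm01`. How `build --all` maps onto subjects is not specified above and is not covered by this sketch.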
Benefits
- Trigger rebuild for specific host to ensure it's cached
- Could be called from CI after merging PRs
- Reuses existing NATS infrastructure and auth
- Progress/status could stream back via NATS reply
Improvement 2: Smarter Flake Update Workflow
Current Problems
- Updates can push breaking changes to master
- No visibility into what broke when it does
- Hosts that auto-update can pull broken configs
Proposed Workflow
```
┌─────────────────────────────────────────────────────────────────┐
│                      Flake Update Workflow                      │
├─────────────────────────────────────────────────────────────────┤
│ 1. nix flake update (on feature branch)                         │
│ 2. Build ALL hosts locally                                      │
│ 3. If all pass → fast-forward merge to master                   │
│ 4. If any fail → create PR with failure logs attached           │
└─────────────────────────────────────────────────────────────────┘
```
Implementation Options
| Option | Description | Pros | Cons |
|---|---|---|---|
| A: Self-hosted runner | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
| B: Gitea Actions only | Use container runner | Clean separation | Slow (no cache), resource limits |
| C: Hybrid | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |
Recommendation: Option A with nix-cache01 as the runner. The host is already running Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.
Workflow Steps
- Workflow runs on schedule (daily or weekly)
- Creates branch `flake-update/YYYY-MM-DD`
- Runs `nix flake update --commit-lock-file`
- Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
- If all succeed:
  - Fast-forward merge to master
  - Delete feature branch
- If any fail:
  - Create PR from the update branch
  - Attach build logs as PR comment
  - Label PR with `needs-review` or `build-failure`
  - Do NOT merge automatically
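The build-and-decide part of these steps can be sketched as a small shell helper that is easy to run outside CI. `build_hosts`, `next_action`, `LOG_DIR`, and `BUILD_CMD` are hypothetical names introduced for illustration. The sketch redirects output to a log file rather than piping through `tee`, because a pipeline's exit status comes from its last command unless `pipefail` is set.

```bash
# Build every host named on the command line, writing one log per host into
# LOG_DIR (default: current directory), and print the names of the hosts that
# failed, one per line. BUILD_CMD is an assumed override for stubbing out Nix.
build_hosts() {
  log_dir=${LOG_DIR:-.}
  for host in "$@"; do
    if ! ${BUILD_CMD:-nix build --no-link} \
        ".#nixosConfigurations.$host.config.system.build.toplevel" \
        >"$log_dir/build-$host.log" 2>&1; then
      printf '%s\n' "$host"
    fi
  done
}

# Decide the next workflow action from the failure list:
# an empty list means merge, anything else means open a PR.
next_action() {
  if [ -z "$1" ]; then
    echo "merge"
  else
    echo "open-pr"
  fi
}
```

Usage might look like `failed=$(build_hosts $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'))` followed by `next_action "$failed"`.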
Workflow File Changes
```yaml
# New: .github/workflows/flake-update-safe.yaml
name: Safe flake update

on:
  schedule:
    - cron: "0 2 * * 0" # Weekly on Sunday at 2 AM
  workflow_dispatch: # Manual trigger

jobs:
  update-and-validate:
    runs-on: homelab # Use self-hosted runner on nix-cache01
    steps:
      - uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 0 # Need full history for merge

      - name: Create update branch
        run: |
          BRANCH="flake-update/$(date +%Y-%m-%d)"
          git checkout -b "$BRANCH"
          # Each step runs in its own shell, so export BRANCH for later steps
          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"

      - name: Update flake
        run: nix flake update --commit-lock-file

      - name: Build all hosts
        id: build
        run: |
          # Without pipefail the pipeline below reports tee's exit status,
          # which would hide nix build failures
          set -o pipefail
          FAILED=""
          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
            echo "Building $host..."
            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
              FAILED="$FAILED $host"
            fi
          done
          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"

      - name: Merge to master (if all pass)
        if: steps.build.outputs.failed == ''
        run: |
          git checkout master
          git merge --ff-only "$BRANCH"
          git push origin master
          # The branch was never pushed in this path, so delete it locally
          git branch -d "$BRANCH"

      - name: Create PR (if any fail)
        if: steps.build.outputs.failed != ''
        run: |
          git push origin "$BRANCH"
          # Create PR via Gitea API with build logs
          # ... (PR creation with log attachment)
```
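The PR creation step is left open in the workflow. One possible shape, offered only as a sketch, uses the Gitea REST endpoints `POST /api/v1/repos/{owner}/{repo}/pulls` and `POST /api/v1/repos/{owner}/{repo}/issues/{index}/comments`. The environment variables `GITEA_URL`, `GITEA_REPO` (as `owner/repo`), and `GITEA_TOKEN`, and all three function names, are assumptions; it needs `curl` and `jq`, and it does not handle labels.

```bash
# Render the PR description from a space-separated list of failed hosts.
pr_body() {
  printf 'Automated flake update. The following hosts failed to build:\n\n'
  for host in $1; do
    printf '%s\n' "- $host"
  done
}

# Open a PR from branch $1 into master and print the new PR number.
# $2 is the space-separated list of failed hosts.
open_pr() {
  branch=$1
  payload=$(jq -n --arg head "$branch" --arg body "$(pr_body "$2")" \
    '{base: "master", head: $head, title: ("Flake update failed: " + $head), body: $body}')
  curl -fsS -X POST \
    -H "Authorization: token $GITEA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$payload" \
    "$GITEA_URL/api/v1/repos/$GITEA_REPO/pulls" | jq -r '.number'
}

# Post the last 50 lines of one host's build log as a comment on PR $1.
comment_log() {
  pr=$1
  host=$2
  payload=$(tail -n 50 "build-$host.log" |
    jq -Rs --arg host "$host" '{body: ("Build log tail for " + $host + ":\n\n" + .)}')
  curl -fsS -X POST \
    -H "Authorization: token $GITEA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$payload" \
    "$GITEA_URL/api/v1/repos/$GITEA_REPO/issues/$pr/comments" >/dev/null
}
```

In the workflow this could be called as `pr=$(open_pr "$BRANCH" "$FAILED")`, then `comment_log "$pr" "$host"` for each failed host. `FAILED` would need to be passed into that step, for example from `steps.build.outputs.failed`.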
Migration Steps
Phase 1: Reprovision Host via OpenTofu
- Add `nix-cache01` to `terraform/vms.tf`:

  ```hcl
  "nix-cache01" = {
    ip        = "10.69.13.15/24"
    cpu_cores = 4
    memory    = 8192
    disk_size = "100G" # Larger for nix store
  }
  ```

- Shut down existing nix-cache01 VM
- Run `tofu apply` to provision new VM
- Verify bootstrap completes and cache is serving
Note: The cache will be cold after reprovision. Run initial builds to populate.
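One way to check that the reprovisioned cache is serving: Nix binary caches expose a `nix-cache-info` document at the cache root, and its `StoreDir` line should name `/nix/store`. The sketch below fetches and checks it. Both function names are hypothetical, and the base URL (scheme and port) depends on how Harmonia is exposed, so it is passed in rather than assumed.

```bash
# Return 0 if the text on stdin looks like a nix-cache-info document for the
# standard store directory.
is_valid_cache_info() {
  grep -q '^StoreDir: /nix/store$'
}

# Fetch nix-cache-info from the cache at base URL $1 and validate it.
# If curl fails, grep sees empty input and the check fails as well.
check_cache() {
  curl -fsS "$1/nix-cache-info" | is_valid_cache_info
}
```

For example, `check_cache "$CACHE_URL" && echo "cache is serving"`, where `CACHE_URL` is the base URL of the cache.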
Phase 2: Add Build Triggering to homelab-deploy
- Add `build` command to homelab-deploy CLI
- Add listener handler in NixOS module for `build.*` subjects
- Update nix-cache01 config to enable build listener
- Test with `homelab-deploy build testvm01`
Phase 3: Implement Safe Flake Update Workflow
- Create `.github/workflows/flake-update-safe.yaml`
- Disable or remove old `flake-update.yaml`
- Test manually with `workflow_dispatch`
- Monitor first automated run
Phase 4: Remove Old Build Script
- After new workflow is stable, remove:
  - `services/nix-cache/build-flakes.nix`
  - `services/nix-cache/build-flakes.sh`
- The new workflow handles scheduled builds
Open Questions
- What runner labels should the self-hosted runner use for the update workflow?
- Should we build hosts in parallel (faster) or sequentially (easier to debug)?
- How long to keep flake-update PRs open before auto-closing stale ones?
- Should successful updates trigger a NATS notification to rebuild all hosts?
- What to do about `gunter` (external nixos repo) - include in validation?
- Disk size for new nix-cache01 - is 100G enough for cache + builds?
Notes
- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
- The Harmonia service and cache signing key will work the same after reprovision
- Actions runner token is in Vault, will be provisioned automatically
- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering