nixos-servers/docs/plans/nix-cache-reprovision.md
Torjus Håkestad f0950b33de
docs: add plan for nix-cache01 reprovision
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 20:34:52 +01:00

Nix Cache Host Reprovision

Overview

Reprovision nix-cache01 using the OpenTofu workflow, and improve the build/cache system with:

  1. NATS-based remote build triggering (replacing the current bash script)
  2. Safer flake update workflow that validates builds before pushing to master

Current State

Host Configuration

  • nix-cache01 at 10.69.13.15 serves the binary cache via Harmonia
  • Runs a Gitea Actions runner for CI workflows
  • Has homelab.deploy.enable = true (already supports NATS-based deployment)
  • Uses a dedicated XFS volume at /nix for cache storage

Current Build System (services/nix-cache/build-flakes.sh)

  • Runs every 30 minutes via systemd timer
  • Clones/pulls two repos: nixos-servers and nixos (gunter)
  • Builds all hosts with nixos-rebuild build (no blacklist despite docs mentioning it)
  • Pushes success/failure metrics to pushgateway
  • Simple, but has no filtering, no parallelism, and no remote triggering
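
As a rough illustration of the metrics step, a build result could be rendered in the Prometheus text exposition format and posted to a Pushgateway. This is a sketch, not the contents of build-flakes.sh: the metric name `nixos_build_success`, the job name `nix_build`, and the `PUSHGATEWAY_URL` variable are all assumptions.

```shell
# Sketch only: metric, label and job names are hypothetical, not taken from build-flakes.sh.

# Print one gauge sample in Prometheus text exposition format.
# $1 = host name, $2 = exit status of the build (0 = success).
format_build_metric() {
  local host="$1" status="$2" value=0
  if [ "$status" -eq 0 ]; then
    value=1
  fi
  printf '# TYPE nixos_build_success gauge\n'
  printf 'nixos_build_success{host="%s"} %s\n' "$host" "$value"
}

# Push the sample to a Pushgateway, grouped by job and instance.
# PUSHGATEWAY_URL is an assumed variable holding the gateway base URL.
push_build_metric() {
  format_build_metric "$1" "$2" |
    curl --silent --fail --data-binary @- \
      "${PUSHGATEWAY_URL}/metrics/job/nix_build/instance/$1"
}
```

Keeping the formatting separate from the push makes it possible to inspect the metric text without a gateway available.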

Current Flake Update Workflow (.github/workflows/flake-update.yaml)

  • Runs daily at midnight via cron
  • Runs nix flake update --commit-lock-file
  • Pushes directly to master
  • No build validation, so broken inputs can be pushed

Improvement 1: NATS-Based Remote Build Triggering

Design

Extend the existing homelab-deploy tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.

| Approach | Pros | Cons |
|----------|------|------|
| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |

Recommendation: Extend homelab-deploy with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.

Implementation

  1. Add new message type to homelab-deploy: build.<host> subject
  2. Listener on nix-cache01 subscribes to build.> wildcard
  3. On message receipt, builds the specified host and returns success/failure
  4. CLI command: homelab-deploy build <hostname> or homelab-deploy build --all
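
To make the subject scheme concrete, the sketch below builds the `build.<host>` subject and mimics NATS wildcard matching in plain bash: `*` matches exactly one token, while `>` matches one or more trailing tokens. The helper names are hypothetical and this is not homelab-deploy code; a real listener would leave matching to the NATS server.

```shell
# Sketch only: helper names are illustrative and not part of homelab-deploy.

# Build the subject for a host build request.
build_subject() {
  printf 'build.%s\n' "$1"
}

# Return 0 if SUBJECT ($2) matches PATTERN ($1) under NATS wildcard rules.
nats_subject_matches() {
  local pattern="$1" subject="$2" i
  local -a p s
  IFS='.' read -r -a p <<< "$pattern"
  IFS='.' read -r -a s <<< "$subject"
  for i in "${!p[@]}"; do
    if [ "${p[$i]}" = ">" ]; then
      # '>' needs at least one remaining subject token.
      [ "${#s[@]}" -gt "$i" ]
      return
    fi
    # The subject ran out of tokens before the pattern did.
    [ "$i" -lt "${#s[@]}" ] || return 1
    if [ "${p[$i]}" != "*" ] && [ "${p[$i]}" != "${s[$i]}" ]; then
      return 1
    fi
  done
  # Without '>', pattern and subject must have the same token count.
  [ "${#p[@]}" -eq "${#s[@]}" ]
}
```

For host names without dots, `build.*` and `build.>` accept the same subjects; `build.>` would additionally accept deeper subjects such as `build.nix-cache01.status` if those were ever introduced.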

Benefits

  • Trigger rebuild for specific host to ensure it's cached
  • Could be called from CI after merging PRs
  • Reuses existing NATS infrastructure and auth
  • Progress/status could stream back via NATS reply

Improvement 2: Smarter Flake Update Workflow

Current Problems

  1. Updates can push breaking changes to master
  2. No visibility into what broke when an update fails
  3. Hosts that auto-update can pull broken configs

Proposed Workflow

┌─────────────────────────────────────────────────────────────────┐
│                      Flake Update Workflow                      │
├─────────────────────────────────────────────────────────────────┤
│  1. nix flake update (on feature branch)                        │
│  2. Build ALL hosts locally                                     │
│  3. If all pass → fast-forward merge to master                  │
│  4. If any fail → create PR with failure logs attached          │
└─────────────────────────────────────────────────────────────────┘

Implementation Options

| Option | Description | Pros | Cons |
|--------|-------------|------|------|
| A: Self-hosted runner | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
| B: Gitea Actions only | Use container runner | Clean separation | Slow (no cache), resource limits |
| C: Hybrid | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |

Recommendation: Option A with nix-cache01 as the runner. The host already runs a Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.

Workflow Steps

  1. Workflow runs on schedule (daily or weekly)
  2. Creates branch flake-update/YYYY-MM-DD
  3. Runs nix flake update --commit-lock-file
  4. Builds each host: nix build .#nixosConfigurations.<host>.config.system.build.toplevel
  5. If all succeed:
    • Fast-forward merge to master
    • Delete feature branch
  6. If any fail:
    • Create PR from the update branch
    • Attach build logs as PR comment
    • Label PR with needs-review or build-failure
    • Do NOT merge automatically
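
The pass/fail decision in steps 4 to 6 can be sketched as a few shell functions. This is a minimal sketch under assumptions: the function names are illustrative, and `BUILD_CMD` is a hypothetical hook that lets the loop be exercised without Nix.

```shell
# Sketch of the validate-then-decide logic; all names here are illustrative.

# Build one host. BUILD_CMD is a hypothetical override for dry runs;
# by default the nix build command from step 4 is used.
build_host() {
  local host="$1"
  if [ -n "${BUILD_CMD:-}" ]; then
    "$BUILD_CMD" "$host"
  else
    nix build --no-link ".#nixosConfigurations.${host}.config.system.build.toplevel"
  fi
}

# Build every host passed as an argument, writing build-<host>.log for each,
# and print the names of the hosts that failed, one per line.
collect_failures() {
  local host
  for host in "$@"; do
    if ! build_host "$host" >"build-${host}.log" 2>&1; then
      printf '%s\n' "$host"
    fi
  done
}

# Map the failure list ($1) to the action the workflow should take.
decide_action() {
  if [ -z "$1" ]; then
    echo "merge"
  else
    echo "open-pr"
  fi
}
```

Building every host before deciding, rather than stopping at the first failure, means a single run reports all broken hosts in one PR.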

Workflow File Changes

# New: .github/workflows/flake-update-safe.yaml
name: Safe flake update
on:
  schedule:
    - cron: "0 2 * * 0"  # Weekly on Sunday at 2 AM
  workflow_dispatch:  # Manual trigger

jobs:
  update-and-validate:
    runs-on: homelab  # Use self-hosted runner on nix-cache01
    steps:
      - uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 0  # Need full history for merge

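      - name: Create update branch
        run: |
          BRANCH="flake-update/$(date +%Y-%m-%d)"
          git checkout -b "$BRANCH"
          # Each run step starts a fresh shell, so export BRANCH for the later steps
          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"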

      - name: Update flake
        run: nix flake update --commit-lock-file

      - name: Build all hosts
        id: build
        run: |
          # Without pipefail the pipeline reports tee's exit status and build failures go unnoticed
          set -o pipefail
          FAILED=""
          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
            echo "Building $host..."
            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
              FAILED="$FAILED $host"
            fi
          done
          echo "failed=$FAILED" >> "$GITHUB_OUTPUT"

      - name: Merge to master (if all pass)
        if: steps.build.outputs.failed == ''
        run: |
          git checkout master
          git merge --ff-only "$BRANCH"
          git push origin master
          # The update branch was never pushed on this path, so only the local copy exists
          git branch -d "$BRANCH"

      - name: Create PR (if any fail)
        if: steps.build.outputs.failed != ''
        run: |
          git push origin "$BRANCH"
          # Create PR via Gitea API with build logs
          # ... (PR creation with log attachment)
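
The PR creation step is left open above. As one possible piece of it, the comment text could be assembled from the per-host log files written by the build step. The sketch below only formats text and makes no API call; the function name and the `LOG_TAIL_LINES` knob (defaulting to 20 lines) are assumptions.

```shell
# Sketch only: formats a plain-text comment, does not talk to Gitea.

# Print a PR comment for the failed hosts given as arguments.
# Expects build-<host>.log files in the current directory.
# LOG_TAIL_LINES is a hypothetical setting for how much of each log to include.
format_failure_comment() {
  local host lines="${LOG_TAIL_LINES:-20}"
  printf 'Flake update build failures: %s\n' "$*"
  for host in "$@"; do
    printf '\n%s (last %s lines of build-%s.log):\n' "$host" "$lines" "$host"
    if [ -f "build-${host}.log" ]; then
      tail -n "$lines" "build-${host}.log"
    else
      printf '(log file not found)\n'
    fi
  done
}
```

It would be called with the failed host names as separate arguments, and its output passed as the comment body to whatever PR creation mechanism is chosen.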

Migration Steps

Phase 1: Reprovision Host via OpenTofu

  1. Add nix-cache01 to terraform/vms.tf:

    "nix-cache01" = {
      ip        = "10.69.13.15/24"
      cpu_cores = 4
      memory    = 8192
      disk_size = "100G"  # Larger for nix store
    }
    
  2. Shut down existing nix-cache01 VM

  3. Run tofu apply to provision new VM

  4. Verify bootstrap completes and cache is serving

Note: The cache will be cold after reprovision. Run initial builds to populate.
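
One way to confirm the cache is serving after bootstrap is to fetch `/nix-cache-info`, which Nix binary caches expose, and check the reported store directory. This is a minimal sketch; the helper names are hypothetical, and the base URL passed to `cache_is_serving` (including the port) has to come from the actual Harmonia configuration.

```shell
# Sketch only: helper names are illustrative.

# Read nix-cache-info text on stdin and print the value of the field named in $1.
cache_info_field() {
  local field="$1" key value
  while IFS=':' read -r key value; do
    if [ "$key" = "$field" ]; then
      # Strip the single space that follows the colon.
      printf '%s\n' "${value# }"
      return 0
    fi
  done
  return 1
}

# Succeed only if the cache at base URL $1 responds and serves /nix/store.
cache_is_serving() {
  local info
  info="$(curl --silent --fail "$1/nix-cache-info")" || return 1
  [ "$(printf '%s\n' "$info" | cache_info_field StoreDir)" = "/nix/store" ]
}
```

A passing check only shows that the service answers; the cache is still cold until the initial builds have run.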

Phase 2: Add Build Triggering to homelab-deploy

  1. Add build command to homelab-deploy CLI
  2. Add listener handler in NixOS module for build.<host> subjects (matched by the build.> subscription)
  3. Update nix-cache01 config to enable build listener
  4. Test with homelab-deploy build testvm01

Phase 3: Implement Safe Flake Update Workflow

  1. Create .github/workflows/flake-update-safe.yaml
  2. Disable or remove old flake-update.yaml
  3. Test manually with workflow_dispatch
  4. Monitor first automated run

Phase 4: Remove Old Build Script

  1. After new workflow is stable, remove:
    • services/nix-cache/build-flakes.nix
    • services/nix-cache/build-flakes.sh
  2. The new workflow handles scheduled builds

Open Questions

  • What runner labels should the self-hosted runner use for the update workflow?
  • Should we build hosts in parallel (faster) or sequentially (easier to debug)?
  • How long to keep flake-update PRs open before auto-closing stale ones?
  • Should successful updates trigger a NATS notification to rebuild all hosts?
  • What to do about gunter (external nixos repo) - include in validation?
  • Disk size for new nix-cache01 - is 100G enough for cache + builds?

Notes

  • The existing homelab.deploy.enable = true on nix-cache01 means it already has NATS connectivity
  • The Harmonia service and cache signing key will work the same after reprovision
  • The Actions runner token is in Vault and will be provisioned automatically
  • Consider adding a homelab.host.role = "build-host" label for monitoring/filtering