docs: add plan for nix-cache01 reprovision

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-07 20:34:52 +01:00

Nix Cache Host Reprovision

Overview

Reprovision nix-cache01 using the OpenTofu workflow, and improve the build/cache system with:

NATS-based remote build triggering (replacing the current bash script)
Safer flake update workflow that validates builds before pushing to master

Current State

Host Configuration

nix-cache01 at 10.69.13.15 serves the binary cache via Harmonia
Runs Gitea Actions runner for CI workflows
Has homelab.deploy.enable = true (already supports NATS-based deployment)
Uses a dedicated XFS volume at /nix for cache storage

Current Build System (`services/nix-cache/build-flakes.sh`)

Runs every 30 minutes via systemd timer
Clones/pulls two repos: nixos-servers and nixos (gunter)
Builds all hosts with nixos-rebuild build (no blacklist despite docs mentioning it)
Pushes success/failure metrics to pushgateway
Simple but has no filtering, no parallelism, no remote triggering

Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)

Runs daily at midnight via cron
Runs nix flake update --commit-lock-file
Pushes directly to master
No build validation — can push broken inputs

Improvement 1: NATS-Based Remote Build Triggering

Design

Extend the existing homelab-deploy tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.

Approach	Pros	Cons
Extend homelab-deploy	Reuses existing NATS auth, NKey handling, CLI	Adds scope to existing tool
New nix-cache-tool	Clean separation	Duplicate NATS boilerplate, new credentials
Gitea Actions webhook	No custom tooling	Less flexible, tied to Gitea

Recommendation: Extend homelab-deploy with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.

Implementation

Add new message type to homelab-deploy: build.<host> subject
Listener on nix-cache01 subscribes to build.> wildcard
On message receipt, builds the specified host and returns success/failure
CLI command: homelab-deploy build <hostname> or homelab-deploy build --all
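
As a rough illustration of the listener side, the sketch below maps an incoming subject to the flake attribute to build. The function name `subject_to_attr` and the hostname character set are assumptions made for this example; they are not part of homelab-deploy. Validating the host token before it reaches `nix build` keeps a malformed subject from turning into an unexpected command argument.

```shell
# Hypothetical helper, not part of homelab-deploy: map a NATS subject such as
# "build.testvm01" to the flake attribute the listener would build.
subject_to_attr() {
  subject=$1
  case $subject in
    build.*) host=${subject#build.} ;;
    *) echo "unexpected subject: $subject" >&2; return 1 ;;
  esac
  # Assumed rule: host names contain only letters, digits and hyphens.
  # This also rejects an empty name and multi-token subjects like build.a.b.
  case $host in
    ''|*[!A-Za-z0-9-]*) echo "invalid host: $host" >&2; return 1 ;;
  esac
  printf '.#nixosConfigurations.%s.config.system.build.toplevel\n' "$host"
}
```

A listener could then run something like `nix build "$(subject_to_attr "$subject")"` and reply with the exit status.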

Benefits

Trigger rebuild for specific host to ensure it's cached
Could be called from CI after merging PRs
Reuses existing NATS infrastructure and auth
Progress/status could stream back via NATS reply

Improvement 2: Smarter Flake Update Workflow

Current Problems

Updates can push breaking changes to master
No visibility into what broke when an update fails
Hosts that auto-update can pull broken configs

Proposed Workflow

┌─────────────────────────────────────────────────────────────────┐
│                    Flake Update Workflow                         │
├─────────────────────────────────────────────────────────────────┤
│  1. nix flake update (on feature branch)                        │
│  2. Build ALL hosts locally                                      │
│  3. If all pass → fast-forward merge to master                  │
│  4. If any fail → create PR with failure logs attached          │
└─────────────────────────────────────────────────────────────────┘

Implementation Options

Option	Description	Pros	Cons
A: Self-hosted runner	Build on nix-cache01	Fast (local cache), simple	Ties up cache host during build
B: Gitea Actions only	Use container runner	Clean separation	Slow (no cache), resource limits
C: Hybrid	Trigger builds on nix-cache01 via NATS from Actions	Best of both	More complex

Recommendation: Option A with nix-cache01 as the runner. The host already runs a Gitea Actions runner and holds the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.

Workflow Steps

Workflow runs on schedule (daily or weekly)
Creates branch flake-update/YYYY-MM-DD
Runs nix flake update --commit-lock-file
Builds each host: nix build .#nixosConfigurations.<host>.config.system.build.toplevel
If all succeed:
- Fast-forward merge to master
- Delete feature branch
If any fail:
- Create PR from the update branch
- Attach build logs as PR comment
- Label PR with needs-review or build-failure
- Do NOT merge automatically

Workflow File Changes

# New: .github/workflows/flake-update-safe.yaml
name: Safe flake update
on:
  schedule:
    - cron: "0 2 * * 0"  # Weekly on Sunday at 2 AM
  workflow_dispatch:  # Manual trigger

jobs:
  update-and-validate:
    runs-on: homelab  # Use self-hosted runner on nix-cache01
    steps:
      - uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 0  # Need full history for merge

      - name: Create update branch
        run: |
          BRANCH="flake-update/$(date +%Y-%m-%d)"
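          git checkout -b "$BRANCH"
          # Shell variables do not persist across steps; export BRANCH for the later ones
          echo "BRANCH=$BRANCH" >> "$GITHUB_ENV"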

      - name: Update flake
        run: nix flake update --commit-lock-file

      - name: Build all hosts
        id: build
        run: |
          # Without pipefail the pipeline reports tee's exit status and hides build failures
          set -o pipefail
          FAILED=""
          for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
            echo "Building $host..."
            if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
              FAILED="$FAILED $host"
            fi
          done
          echo "failed=$FAILED" >> $GITHUB_OUTPUT

      - name: Merge to master (if all pass)
        if: steps.build.outputs.failed == ''
        run: |
          git checkout master
          git merge --ff-only "$BRANCH"
          git push origin master
          # The update branch is never pushed in this path, so only the local branch exists
          git branch -d "$BRANCH"

      - name: Create PR (if any fail)
        if: steps.build.outputs.failed != ''
        run: |
          git push origin "$BRANCH"
          # Create PR via Gitea API with build logs
          # ... (PR creation with log attachment)
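
The PR creation step is left open in the workflow above. One possible shape is sketched here, using Gitea's `POST /api/v1/repos/{owner}/{repo}/pulls` endpoint. The variables `GITEA_URL`, `REPO` (in `owner/name` form) and `GITEA_TOKEN`, as well as the function names and PR wording, are assumptions for this example and do not appear in the plan. Attaching logs and labels is not covered.

```shell
# Sketch only: GITEA_URL, REPO and GITEA_TOKEN are assumed to be provided by
# the workflow environment.

# Compose the PR description from a space-separated list of failed hosts.
pr_body() {
  printf 'Automated flake update. Builds failed for:\n'
  for host in $1; do
    printf '* %s\n' "$host"
  done
}

# Open a PR from the update branch against master via the Gitea API.
create_pr() {
  branch=$1
  failed=$2
  # jq handles JSON escaping of the title and multi-line body.
  payload=$(jq -n \
    --arg head "$branch" \
    --arg title "flake update: build failures ($branch)" \
    --arg body "$(pr_body "$failed")" \
    '{base: "master", head: $head, title: $title, body: $body}')
  curl --fail --silent --show-error \
    -X POST \
    -H "Authorization: token $GITEA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$payload" \
    "$GITEA_URL/api/v1/repos/$REPO/pulls"
}
```

In the failure step this could be called as `create_pr "$BRANCH" "${{ steps.build.outputs.failed }}"` after the branch has been pushed.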

Migration Steps

Phase 1: Reprovision Host via OpenTofu

Add nix-cache01 to terraform/vms.tf:

"nix-cache01" = {
  ip        = "10.69.13.15/24"
  cpu_cores = 4
  memory    = 8192
  disk_size = "100G"  # Larger for nix store
}

Shut down existing nix-cache01 VM
Run tofu apply to provision new VM
Verify bootstrap completes and cache is serving

Note: The cache will be cold after reprovision. Run initial builds to populate.
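
To make "cache is serving" checkable, a small probe can fetch the cache's `nix-cache-info` document and confirm it advertises the expected store directory. This is a sketch: the helper name `cache_info_ok` is made up for this example, and the port in the usage comment is an assumption because the plan does not state where Harmonia listens.

```shell
# Read a nix-cache-info document on stdin and succeed only if it contains
# the exact line "StoreDir: /nix/store".
cache_info_ok() {
  grep -qx 'StoreDir: /nix/store'
}

# Usage (URL and port are assumptions; adjust to the actual Harmonia bind address):
#   curl -fsS http://10.69.13.15:5000/nix-cache-info | cache_info_ok && echo "cache is serving"
```

If curl fails, the helper receives empty input and returns non-zero, so the check fails closed.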

Phase 2: Add Build Triggering to homelab-deploy

Add build command to homelab-deploy CLI
Add listener handler in NixOS module for build.<host> subjects (subscribed via build.>)
Update nix-cache01 config to enable build listener
Test with homelab-deploy build testvm01

Phase 3: Implement Safe Flake Update Workflow

Create .github/workflows/flake-update-safe.yaml
Disable or remove old flake-update.yaml
Test manually with workflow_dispatch
Monitor first automated run

Phase 4: Remove Old Build Script

After new workflow is stable, remove:
- services/nix-cache/build-flakes.nix
- services/nix-cache/build-flakes.sh
The new workflow handles scheduled builds

Open Questions

What runner labels should the self-hosted runner use for the update workflow?
Should we build hosts in parallel (faster) or sequentially (easier to debug)?
How long to keep flake-update PRs open before auto-closing stale ones?
Should successful updates trigger a NATS notification to rebuild all hosts?
What to do about gunter (external nixos repo) - include in validation?
Disk size for new nix-cache01 - is 100G enough for cache + builds?

Notes

The existing homelab.deploy.enable = true on nix-cache01 means it already has NATS connectivity
The Harmonia service and cache signing key will work the same after reprovision
The Actions runner token is in Vault and will be provisioned automatically
Consider adding a homelab.host.role = "build-host" label for monitoring/filtering
