nixos-servers/docs/plans/nix-cache-reprovision.md
Torjus Håkestad f83145d97a
docs: update nix-cache-reprovision plan with progress
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:43:48 +01:00

Nix Cache Host Reprovision

Overview

Reprovision nix-cache01 using the OpenTofu workflow, and improve the build/cache system with:

  1. NATS-based remote build triggering (replacing the current bash script)
  2. Safer flake update workflow that validates builds before pushing to master

Status

  • Phase 1: New Build Host - COMPLETE
  • Phase 2: NATS Build Triggering - COMPLETE
  • Phase 3: Safe Flake Update Workflow - NOT STARTED
  • Phase 4: Decommission Old System - NOT STARTED

Completed Work

New Build Host (nix-cache02)

Instead of reprovisioning nix-cache01 in-place, we created a new host nix-cache02 at 10.69.13.25:

  • Specs: 8 CPU cores, 16GB RAM (temporary; will increase to 24GB after nix-cache01 is decommissioned), 200GB disk
  • Provisioned via OpenTofu with automatic Vault credential bootstrapping
  • Builder service configured with two repos:
    • nixos-servers → git+https://git.t-juice.club/torjus/nixos-servers.git
    • nixos (gunter) → git+https://git.t-juice.club/torjus/nixos.git

NATS-Based Build Triggering

The homelab-deploy tool was extended with a builder mode:

NATS Subjects:

  • build.<repo>.<target> - e.g., build.nixos-servers.all or build.nixos-servers.ns1

NATS Permissions (in DEPLOY account):

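| User          | Publish                                 | Subscribe                                             |
|---------------|-----------------------------------------|-------------------------------------------------------|
| Builder       | build.responses.>                       | build.>                                               |
| Test deployer | deploy.test.>, deploy.discover, build.> | deploy.responses.>, deploy.discover, build.responses.> |
| Admin deployer | deploy.>, build.>                      | deploy.>, build.responses.>                           |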
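
The permissions above could be expressed roughly as follows. This is a sketch only, assuming the NATS server is configured through the freeform services.nats.settings option; the actual layout of services/nats/default.nix may differ, and the nkey values are placeholders rather than real public keys.

```nix
{
  # Sketch only: the real services/nats/default.nix may be structured differently.
  services.nats.settings.accounts.DEPLOY.users = [
    {
      # Builder: listens for build requests and publishes results.
      nkey = "U_BUILDER_PUBLIC_NKEY"; # placeholder, not a real key
      permissions = {
        publish = [ "build.responses.>" ];
        subscribe = [ "build.>" ];
      };
    }
    {
      # Test deployer: may deploy to test hosts and request builds.
      nkey = "U_TEST_DEPLOYER_PUBLIC_NKEY"; # placeholder, not a real key
      permissions = {
        publish = [ "deploy.test.>" "deploy.discover" "build.>" ];
        subscribe = [ "deploy.responses.>" "deploy.discover" "build.responses.>" ];
      };
    }
    {
      # Admin deployer: may deploy to any host and request builds.
      nkey = "U_ADMIN_DEPLOYER_PUBLIC_NKEY"; # placeholder, not a real key
      permissions = {
        publish = [ "deploy.>" "build.>" ];
        subscribe = [ "deploy.>" "build.responses.>" ];
      };
    }
  ];
}
```

In NATS subjects, the `>` wildcard matches one or more trailing tokens, so build.> covers subjects such as build.nixos-servers.all and build.nixos-servers.ns1.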

Vault Secrets:

  • shared/homelab-deploy/builder-nkey - NKey seed for builder authentication

NixOS Configuration:

  • hosts/nix-cache02/builder.nix - Builder service configuration
  • services/nats/default.nix - Updated with builder NATS user

MCP Integration:

  • .mcp.json updated with --enable-builds flag
  • Build tool available via MCP for Claude Code

Tested:

  • Single host build: build nixos-servers testvm01 (~30s)
  • All hosts build: build nixos-servers all (16 hosts in ~226s)

Current State

Old System (nix-cache01)

  • Still running at 10.69.13.15
  • Serves binary cache via Harmonia
  • Runs Gitea Actions runner
  • Has the old build-flakes.sh timer (every 30 min)
  • Will be decommissioned after nix-cache02 is fully validated

New System (nix-cache02)

  • Running at 10.69.13.25
  • Builder service active, responding to NATS build requests
  • Metrics exposed on port 9973 (homelab-deploy-builder job)
  • Does NOT yet have:
    • Harmonia (binary cache server)
    • Actions runner
    • Cache signing key

Remaining Work

Phase 3: Safe Flake Update Workflow

  1. Create .github/workflows/flake-update-safe.yaml
  2. Disable or remove old flake-update.yaml
  3. Test manually with workflow_dispatch
  4. Monitor first automated run

Phase 4: Complete Migration

  1. Add Harmonia to nix-cache02 - Copy cache signing key, configure service
  2. Add Actions runner to nix-cache02 - Configure with Vault token
  3. Update DNS - Point nix-cache.home.2rjus.net to nix-cache02
  4. Increase RAM - Bump to 24GB after nix-cache01 is gone
  5. Decommission nix-cache01:
    • Remove from terraform/vms.tf
    • Remove old build script (services/nix-cache/build-flakes.nix, build-flakes.sh)
    • Archive or delete host config

Phase 5: Scheduled Builds (Optional)

Add a systemd timer on nix-cache02 to trigger periodic builds via NATS:

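systemd.services.scheduled-build = {
  # One-shot unit: runs both builds to completion, then exits until the next timer tick.
  serviceConfig.Type = "oneshot";
  # Assumes homelab-deploy is available on the unit's PATH.
  script = ''
    homelab-deploy build nixos-servers --all
    homelab-deploy build nixos --all
  '';
};
systemd.timers.scheduled-build = {
  wantedBy = [ "timers.target" ];
  # Fires once an hour, at 30 minutes past the hour.
  timerConfig.OnCalendar = "*-*-* *:30:00";
};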

Or trigger builds from CI after merges to master.

Resolved Questions

  • Parallel vs sequential builds? Sequential. Hosts share packages, so builds after the first are fast.
  • What about gunter? Configured as nixos repo in builder settings
  • Disk size? 200GB for new host
  • Build host specs? 8 cores and 16-24GB RAM, matching the current nix-cache01

Open Questions

  • When to cut over DNS from nix-cache01 to nix-cache02?
  • Run the Actions runner on nix-cache02 or on a separate host?
  • Implement safe flake update workflow before or after full migration?