nixos-servers/docs/plans/nix-cache-reprovision.md
Torjus Håkestad ade0538717
docs: mark nix-cache DNS cutover complete
nix-cache.home.2rjus.net now served by nix-cache02.
nix-cache01 ready for decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:34:04 +01:00


Nix Cache Host Reprovision

Overview

Reprovision nix-cache01 using the OpenTofu workflow, and improve the build/cache system with:

  1. NATS-based remote build triggering (replacing the current bash script)
  2. Safer flake update workflow that validates builds before pushing to master

Status

  • Phase 1: New Build Host - COMPLETE
  • Phase 2: NATS Build Triggering - COMPLETE
  • Phase 3: Safe Flake Update Workflow - NOT STARTED
  • Phase 4: Complete Migration - COMPLETE (cleanup pending)

Completed Work

New Build Host (nix-cache02)

Instead of reprovisioning nix-cache01 in-place, we created a new host nix-cache02 at 10.69.13.25:

  • Specs: 8 CPU cores, 16GB RAM (temporary; to be increased to 24GB once nix-cache01 is decommissioned), 200GB disk
  • Provisioned via OpenTofu with automatic Vault credential bootstrapping
  • Builder service configured with two repos:
    • nixos-servers → git+https://git.t-juice.club/torjus/nixos-servers.git
    • nixos (gunter) → git+https://git.t-juice.club/torjus/nixos.git

NATS-Based Build Triggering

The homelab-deploy tool was extended with a builder mode:

NATS Subjects:

  • build.<repo>.<target> - e.g., build.nixos-servers.all or build.nixos-servers.ns1

NATS Permissions (in DEPLOY account):

| User | Publish | Subscribe |
|---|---|---|
| Builder | build.responses.> | build.> |
| Test deployer | deploy.test.>, deploy.discover, build.> | deploy.responses.>, deploy.discover, build.responses.> |
| Admin deployer | deploy.>, build.> | deploy.>, build.responses.> |
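
As an illustration, the Builder permissions might translate into a NATS user entry like the one below. This is a sketch, not the contents of services/nats/default.nix: it assumes the server config is rendered through the NixOS services.nats.settings option, and the nkey value is a placeholder.

```nix
{
  # Sketch only: the surrounding account layout is an assumption.
  services.nats.settings.accounts.DEPLOY.users = [
    {
      # Placeholder for the builder's public nkey (not the real value).
      nkey = "UBUILDER_PUBLIC_NKEY_PLACEHOLDER";
      permissions = {
        # The builder may only publish build results...
        publish = [ "build.responses.>" ];
        # ...and listen for build requests.
        subscribe = [ "build.>" ];
      };
    }
  ];
}
```

In NATS subject syntax, `>` matches one or more trailing tokens, so `build.>` covers both build.nixos-servers.all and build.nixos-servers.ns1.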

Vault Secrets:

  • shared/homelab-deploy/builder-nkey - NKey seed for builder authentication

NixOS Configuration:

  • hosts/nix-cache02/builder.nix - Builder service configuration
  • services/nats/default.nix - Updated with builder NATS user

MCP Integration:

  • .mcp.json updated with --enable-builds flag
  • Build tool available via MCP for Claude Code

Tested:

  • Single host build: build nixos-servers testvm01 (~30s)
  • All hosts build: build nixos-servers all (16 hosts in ~226s)

Harmonia Binary Cache

  • Parameterized services/nix-cache/harmonia.nix to use hostname-based Vault paths
  • Parameterized services/nix-cache/proxy.nix for hostname-based domain
  • New signing key: nix-cache02.home.2rjus.net-1
  • Vault secret: hosts/nix-cache02/cache-secret
  • Removed unused Gitea Actions runner from nix-cache01
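
The hostname-based parameterization can be pictured with a small Nix expression. This is a minimal sketch: `mkCacheNames` and its field names are hypothetical and are not taken from harmonia.nix or proxy.nix; only the resulting strings follow the names listed above.

```nix
# Hypothetical helper: derives the per-host names from a hostname.
let
  domain = "home.2rjus.net";
  mkCacheNames = hostName: {
    # Vault path holding the cache signing secret
    vaultSecret = "hosts/${hostName}/cache-secret";
    # Name component of the signing key
    signingKeyName = "${hostName}.${domain}-1";
    # Per-host domain served by the proxy
    proxyDomain = "${hostName}.${domain}";
  };
in
mkCacheNames "nix-cache02"
```

For nix-cache02 this evaluates to an attribute set containing hosts/nix-cache02/cache-secret, nix-cache02.home.2rjus.net-1 and nix-cache02.home.2rjus.net.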

Current State

Old System (nix-cache01) - PENDING DECOMMISSION

  • Running at 10.69.13.15
  • No longer serving the canonical nix-cache.home.2rjus.net (now serves nix-cache01.home.2rjus.net)
  • Still has the old build-flakes.sh timer (every 30 min) - to be removed
  • Ready for decommission

New System (nix-cache02) - NOW ACTIVE

  • Running at 10.69.13.25
  • Now serving https://nix-cache.home.2rjus.net (canonical URL)
  • Builder service active, responding to NATS build requests
  • Metrics exposed on port 9973 (homelab-deploy-builder job)
  • Harmonia binary cache server running
  • New signing key: nix-cache02.home.2rjus.net-1
  • Trusted public key deployed to all hosts
  • Promoted to prod tier with build-host role
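
On the client side, trusting the new cache roughly amounts to the following. This is a sketch rather than the actual content of system/nix.nix, and the public key body is a placeholder that must be replaced with the real key.

```nix
{
  nix.settings = {
    # Canonical cache URL, now served by nix-cache02
    substituters = [ "https://nix-cache.home.2rjus.net" ];
    trusted-public-keys = [
      # Placeholder: "<base64-public-key>" stands in for the real public key.
      "nix-cache02.home.2rjus.net-1:<base64-public-key>"
    ];
  };
}
```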

Remaining Work

Phase 3: Safe Flake Update Workflow

  1. Create .github/workflows/flake-update-safe.yaml
  2. Disable or remove old flake-update.yaml
  3. Test manually with workflow_dispatch
  4. Monitor first automated run

Phase 4: Complete Migration

  1. Add Harmonia to nix-cache02 - Done (new signing key, parameterized service)
  2. Add trusted public key to all hosts - Done (system/nix.nix updated)
  3. Test cache from other hosts - Done (verified from testvm01)
  4. Update proxy and DNS - Done (nix-cache.home.2rjus.net CNAME now points to nix-cache02)
  5. Deploy to all hosts - Done (all hosts have the new trusted key)
  6. Increase RAM - Bump to 24GB after nix-cache01 is gone
  7. Decommission nix-cache01:
    • Remove from terraform/vms.tf
    • Remove old build script (services/nix-cache/build-flakes.nix, build-flakes.sh)
    • Archive or delete host config
    • Remove old signing key from system/nix.nix trusted-public-keys

Phase 5: Scheduled Builds (Optional)

Add a systemd timer on nix-cache02 to trigger periodic builds via NATS:

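systemd.services.scheduled-build = {
  # Run to completion on each timer activation rather than staying resident
  serviceConfig.Type = "oneshot";
  # Build every host in both repos, one repo after the other
  # (assumes homelab-deploy is available on the unit's PATH)
  script = ''
    homelab-deploy build nixos-servers --all
    homelab-deploy build nixos --all
  '';
};
systemd.timers.scheduled-build = {
  wantedBy = [ "timers.target" ];
  # Fire at 30 minutes past every hour
  timerConfig.OnCalendar = "*-*-* *:30:00";
};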

Alternatively, trigger builds from CI after merges to master.

Resolved Questions

  • Parallel vs sequential builds? Sequential - hosts share most packages, so builds after the first are fast
  • What about gunter? Configured as the nixos repo in the builder settings
  • Disk size? 200GB for the new host
  • Build host specs? 8 cores and 16-24GB RAM, which matches the current nix-cache01

Phase 6: Observability

  1. Alerting rules for build failures:

    # Alert if any build fails
    increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
    
    # Alert if no successful builds in 24h (scheduled builds stopped)
    time() - homelab_deploy_build_last_success_timestamp > 86400
    
  2. Grafana dashboard for build metrics:

    • Build success/failure rate over time
    • Average build duration per host (histogram)
    • Build frequency (builds per hour/day)
    • Last successful build timestamp per repo

Available metrics:

  • homelab_deploy_builds_total{repo, status} - total builds by repo and status
  • homelab_deploy_build_host_total{repo, host, status} - per-host build counts
  • homelab_deploy_build_duration_seconds_{bucket,sum,count} - build duration histogram
  • homelab_deploy_build_last_timestamp{repo} - last build attempt
  • homelab_deploy_build_last_success_timestamp{repo} - last successful build
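
The two alert expressions in this phase could be wired into Prometheus from Nix roughly as follows. This is a sketch under assumptions: it presumes the host running Prometheus uses the NixOS services.prometheus module, and the group name, alert names and severity labels are hypothetical. The rules are passed as JSON, which Prometheus accepts because JSON is valid YAML.

```nix
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "homelab-deploy-builds"; # hypothetical group name
          rules = [
            {
              # Fires if any host build failed during the last hour
              alert = "BuildFailed"; # hypothetical alert name
              expr = ''increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0'';
              labels.severity = "warning";
            }
            {
              # Fires if a repo has had no successful build for 24h
              alert = "BuildsStale"; # hypothetical alert name
              expr = "time() - homelab_deploy_build_last_success_timestamp > 86400";
              labels.severity = "warning";
            }
          ];
        }
      ];
    })
  ];
}
```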

Open Questions

  • When to cut over DNS from nix-cache01 to nix-cache02? Done - 2026-02-10
  • Implement safe flake update workflow before or after full migration?