Nix Cache Host Reprovision
Overview
Reprovision nix-cache01 using the OpenTofu workflow, and improve the build/cache system with:
- NATS-based remote build triggering (replacing the current bash script)
- Safer flake update workflow that validates builds before pushing to master
Status
- Phase 1: New Build Host - COMPLETE
- Phase 2: NATS Build Triggering - COMPLETE
- Phase 3: Safe Flake Update Workflow - NOT STARTED
- Phase 4: Complete Migration - COMPLETE
- Phase 5: Scheduled Builds - COMPLETE
Completed Work
New Build Host (nix-cache02)
Instead of reprovisioning nix-cache01 in-place, we created a new host nix-cache02 at 10.69.13.25:
- Specs: 8 CPU cores, 16GB RAM (temporary; to be increased to 24GB after nix-cache01 is decommissioned), 200GB disk
- Provisioned via OpenTofu with automatic Vault credential bootstrapping
- Builder service configured with two repos:
  - nixos-servers → git+https://git.t-juice.club/torjus/nixos-servers.git
  - nixos (gunter) → git+https://git.t-juice.club/torjus/nixos.git
NATS-Based Build Triggering
The homelab-deploy tool was extended with a builder mode:
NATS Subjects:
- build.<repo>.<target> - e.g., build.nixos-servers.all or build.nixos-servers.ns1
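The subject scheme above can be sketched as a pair of helpers that format and parse build.<repo>.<target> subjects. This is a minimal illustration only: the names `build_subject`, `parse_build_subject`, and `BuildSubject` are hypothetical and not taken from homelab-deploy, and the token validation is an assumption based on general NATS subject rules.

```python
from typing import NamedTuple


class BuildSubject(NamedTuple):
    """Hypothetical container for the two variable tokens of a build subject."""
    repo: str
    target: str


def _check_token(token: str) -> None:
    # NATS subjects are dot-separated tokens; a literal token must be
    # non-empty and free of dots, whitespace, and the wildcards '*' and '>'.
    if not token or any(c in ".*>" or c.isspace() for c in token):
        raise ValueError(f"invalid subject token: {token!r}")


def build_subject(repo: str, target: str) -> str:
    """Format a build.<repo>.<target> subject."""
    _check_token(repo)
    _check_token(target)
    return f"build.{repo}.{target}"


def parse_build_subject(subject: str) -> BuildSubject:
    """Split a build.<repo>.<target> subject back into its parts."""
    parts = subject.split(".")
    if len(parts) != 3 or parts[0] != "build":
        raise ValueError(f"not a build subject: {subject!r}")
    _check_token(parts[1])
    _check_token(parts[2])
    return BuildSubject(repo=parts[1], target=parts[2])
```

Note that under this scheme a repo literally named responses would collide with the build.responses.> reply subjects, so a real implementation may want to reserve that name.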
NATS Permissions (in DEPLOY account):
| User | Publish | Subscribe |
|---|---|---|
| Builder | build.responses.> | build.> |
| Test deployer | deploy.test.>, deploy.discover, build.> | deploy.responses.>, deploy.discover, build.responses.> |
| Admin deployer | deploy.>, build.> | deploy.>, build.responses.> |
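To make the table easier to reason about, the following sketch reimplements NATS subject wildcard matching (`*` matches exactly one token, `>` matches one or more trailing tokens) and checks subjects against the publish column. It only models the allow-lists shown in the table; actual enforcement happens in the NATS server. The user keys (`builder`, `test-deployer`, `admin-deployer`) and the function names are illustrative assumptions, not identifiers from the real configuration.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subject matches a pattern with '*' / '>' wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one
            # remaining subject token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Publish allow-lists copied from the table above (user keys are hypothetical).
PUBLISH = {
    "builder": ["build.responses.>"],
    "test-deployer": ["deploy.test.>", "deploy.discover", "build.>"],
    "admin-deployer": ["deploy.>", "build.>"],
}


def may_publish(user: str, subject: str) -> bool:
    """Check a subject against the user's publish allow-list."""
    return any(subject_matches(p, subject) for p in PUBLISH[user])
```

For example, `may_publish("test-deployer", "build.nixos-servers.all")` is allowed by build.>, while the builder itself can only publish under build.responses.>.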
Vault Secrets:
- shared/homelab-deploy/builder-nkey - NKey seed for builder authentication
NixOS Configuration:
- hosts/nix-cache02/builder.nix - Builder service configuration
- services/nats/default.nix - Updated with builder NATS user
MCP Integration:
- .mcp.json updated with --enable-builds flag
- Build tool available via MCP for Claude Code
Tested:
- Single host build: build nixos-servers testvm01 (~30s)
- All hosts build: build nixos-servers all (16 hosts in ~226s)
Harmonia Binary Cache
- Parameterized services/nix-cache/harmonia.nix to use hostname-based Vault paths
- Parameterized services/nix-cache/proxy.nix for hostname-based domain
- New signing key: nix-cache02.home.2rjus.net-1
- Vault secret: hosts/nix-cache02/cache-secret
- Removed unused Gitea Actions runner from nix-cache01
Current State
nix-cache02 (Active)
- Running at 10.69.13.25
- Serving https://nix-cache.home.2rjus.net (canonical URL)
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (homelab-deploy-builder job)
- Harmonia binary cache server running
- Signing key: nix-cache02.home.2rjus.net-1
- Prod tier with build-host role
nix-cache01 (Decommissioned)
- VM deleted from Proxmox
- Host configuration removed from repo
- Vault AppRole and secrets removed
- Old signing key removed from trusted-public-keys
Remaining Work
Phase 3: Safe Flake Update Workflow
- Create .github/workflows/flake-update-safe.yaml
- Disable or remove old flake-update.yaml
- Test manually with workflow_dispatch
- Monitor first automated run
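The workflow itself has not been written yet. As a rough sketch of the intended logic (validate before pushing to master), the steps could look like the following. Everything here is an assumption for illustration: the function names, the commit message, and the choice of `nix flake check` as the validation step are not from the plan, and the real implementation would be a workflow YAML file rather than Python.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def safe_flake_update(run: Runner = run_command, branch: str = "master") -> bool:
    """Update flake.lock, validate, and push only if every step succeeds."""
    steps = [
        ["nix", "flake", "update"],  # refresh flake.lock
        ["nix", "flake", "check"],   # assumed validation step
        # Commit only flake.lock; exits non-zero when nothing changed,
        # in which case nothing is pushed.
        ["git", "commit", "-m", "flake.lock: update", "--", "flake.lock"],
        ["git", "push", "origin", branch],
    ]
    for argv in steps:
        if run(argv) != 0:
            return False  # stop at the first failure; later steps never run
    return True
```

Passing the runner as a parameter keeps the ordering logic testable without touching a real repository.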
Phase 4: Complete Migration ✅
- Add Harmonia to nix-cache02 ✅ Done - new signing key, parameterized service
- Add trusted public key to all hosts ✅ Done - system/nix.nix updated
- Test cache from other hosts ✅ Done - verified from testvm01
- Update proxy and DNS ✅ Done - nix-cache.home.2rjus.net CNAME now points to nix-cache02
- Deploy to all hosts ✅ Done - all hosts have new trusted key
- Decommission nix-cache01 ✅ Done - 2026-02-10:
  - Removed hosts/nix-cache01/ directory
  - Removed services/nix-cache/build-flakes.{nix,sh}
  - Removed Vault AppRole and secrets
  - Removed old signing key from system/nix.nix
  - Removed from flake.nix
  - Deleted VM from Proxmox
  - Removed
Phase 5: Scheduled Builds ✅
Implemented a systemd timer on nix-cache02 that triggers builds every 2 hours:
- Timer: scheduled-build.timer runs every 2 hours with 5m random jitter
- Service: scheduled-build.service calls homelab-deploy build for both repos
- Authentication: Dedicated scheduler NKey stored in Vault
- NATS user: Added to DEPLOY account with publish build.> and subscribe build.responses.>
Files:
- hosts/nix-cache02/scheduler.nix - Timer and service configuration
- services/nats/default.nix - Scheduler NATS user
- terraform/vault/secrets.tf - Scheduler NKey secret
- terraform/vault/variables.tf - Variable for scheduler NKey
Resolved Questions
- Parallel vs sequential builds? Sequential - hosts share packages, so subsequent builds are fast after the first
- What about gunter? Configured as nixos repo in builder settings
- Disk size? 200GB for new host
- Build host specs? 8 cores, 16-24GB RAM matches current nix-cache01
Phase 6: Observability
- Alerting rules for build failures:

      # Alert if any build fails
      increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0

      # Alert if no successful builds in 24h (scheduled builds stopped)
      time() - homelab_deploy_build_last_success_timestamp > 86400

- Grafana dashboard for build metrics:
  - Build success/failure rate over time
  - Average build duration per host (histogram)
  - Build frequency (builds per hour/day)
  - Last successful build timestamp per repo
Available metrics:
- homelab_deploy_builds_total{repo, status} - total builds by repo and status
- homelab_deploy_build_host_total{repo, host, status} - per-host build counts
- homelab_deploy_build_duration_seconds_{bucket,sum,count} - build duration histogram
- homelab_deploy_build_last_timestamp{repo} - last build attempt
- homelab_deploy_build_last_success_timestamp{repo} - last successful build
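The two alert conditions proposed in this phase can be restated as plain logic, which may help when choosing thresholds. The sketch below is illustrative only: the alert names and the `BuildSamples` container are hypothetical, and the inputs stand in for values Prometheus would compute from the metrics listed above.

```python
from dataclasses import dataclass
from typing import List

STALE_AFTER_SECONDS = 86400  # 24h, as in the second alert expression


@dataclass
class BuildSamples:
    """Hypothetical snapshot of the values the alert expressions operate on."""
    failures_last_hour: float      # increase(homelab_deploy_build_host_total{status="failure"}[1h])
    last_success_timestamp: float  # homelab_deploy_build_last_success_timestamp
    now: float                     # time()


def firing_alerts(samples: BuildSamples) -> List[str]:
    """Return the (hypothetically named) alerts that would fire."""
    alerts = []
    if samples.failures_last_hour > 0:
        alerts.append("build_failed")
    if samples.now - samples.last_success_timestamp > STALE_AFTER_SECONDS:
        alerts.append("no_successful_build_24h")
    return alerts
```

With builds scheduled every 2 hours, a 24h staleness window tolerates roughly eleven consecutive missed or failed runs before the second alert fires, so a shorter window may be worth considering.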
Open Questions
- When to cut over DNS from nix-cache01 to nix-cache02? Done - 2026-02-10
- Implement safe flake update workflow before or after full migration?