Nix Cache Host Reprovision
Overview
Reprovision nix-cache01 using the OpenTofu workflow, and improve the build/cache system with:
- NATS-based remote build triggering (replacing the current bash script)
- Safer flake update workflow that validates builds before pushing to master
Status
- Phase 1: New Build Host - COMPLETE
- Phase 2: NATS Build Triggering - COMPLETE
- Phase 3: Safe Flake Update Workflow - NOT STARTED
- Phase 4: Decommission Old System - NOT STARTED
Completed Work
New Build Host (nix-cache02)
Instead of reprovisioning nix-cache01 in-place, we created a new host nix-cache02 at 10.69.13.25:
- Specs: 8 CPU cores, 16GB RAM (temporary; will increase to 24GB once nix-cache01 is decommissioned), 200GB disk
- Provisioned via OpenTofu with automatic Vault credential bootstrapping
- Builder service configured with two repos:
  - nixos-servers → git+https://git.t-juice.club/torjus/nixos-servers.git
  - nixos (gunter) → git+https://git.t-juice.club/torjus/nixos.git
NATS-Based Build Triggering
The homelab-deploy tool was extended with a builder mode:
NATS Subjects:
- build.<repo>.<target> - e.g., build.nixos-servers.all or build.nixos-servers.ns1
NATS Permissions (in DEPLOY account):
| User | Publish | Subscribe |
|---|---|---|
| Builder | build.responses.> | build.> |
| Test deployer | deploy.test.>, deploy.discover, build.> | deploy.responses.>, deploy.discover, build.responses.> |
| Admin deployer | deploy.>, build.> | deploy.>, build.responses.> |
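As a rough illustration, the builder row of the permissions table could be expressed through the NixOS `services.nats.settings` option as shown here. This is a sketch under assumptions, not the contents of services/nats/default.nix: the NKey public key is a placeholder, and the real module may structure the DEPLOY account differently.

```nix
# Sketch only. The nkey value is a placeholder, not a real public key.
{
  services.nats.settings.accounts.DEPLOY.users = [
    {
      # Builder: listens for build requests and publishes the results.
      nkey = "UBUILDER_PUBLIC_NKEY_PLACEHOLDER";
      permissions = {
        publish = [ "build.responses.>" ];
        subscribe = [ "build.>" ];
      };
    }
  ];
}
```

The deployer users would follow the same shape, with the publish and subscribe lists taken from their rows in the table.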
Vault Secrets:
- shared/homelab-deploy/builder-nkey - NKey seed for builder authentication
NixOS Configuration:
- hosts/nix-cache02/builder.nix - Builder service configuration
- services/nats/default.nix - Updated with builder NATS user
MCP Integration:
- .mcp.json updated with --enable-builds flag
- Build tool available via MCP for Claude Code
Tested:
- Single host build: build nixos-servers testvm01 (~30s)
- All hosts build: build nixos-servers all (16 hosts in ~226s)
Current State
Old System (nix-cache01)
- Still running at 10.69.13.15
- Serves binary cache via Harmonia
- Has the old build-flakes.sh timer (every 30 min)
- Will be decommissioned after nix-cache02 is fully validated
New System (nix-cache02)
- Running at 10.69.13.25
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (homelab-deploy-builder job)
- Does NOT yet have:
  - Harmonia (binary cache server)
  - Cache signing key
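For reference, a scrape job for the builder metrics might look roughly like the fragment below on whichever host runs Prometheus. The job name, address, and port come from this document; the static target list and the default /metrics path are assumptions, and the repository may generate its scrape configs differently.

```nix
# Sketch only: assumes a statically configured target and the default
# /metrics path on the builder's metrics port.
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "homelab-deploy-builder";
      static_configs = [
        { targets = [ "10.69.13.25:9973" ]; }
      ];
    }
  ];
}
```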
Remaining Work
Phase 3: Safe Flake Update Workflow
- Create .github/workflows/flake-update-safe.yaml
- Disable or remove old flake-update.yaml
- Test manually with workflow_dispatch
- Monitor first automated run
Phase 4: Complete Migration
- Add Harmonia to nix-cache02 - Copy cache signing key, configure service
- Update DNS - Point nix-cache.home.2rjus.net to nix-cache02
- Increase RAM - Bump to 24GB after nix-cache01 is gone
- Decommission nix-cache01:
  - Remove from terraform/vms.tf
  - Remove old build script (services/nix-cache/build-flakes.nix, build-flakes.sh)
  - Archive or delete host config
Phase 5: Scheduled Builds (Optional)
Add a systemd timer on nix-cache02 to trigger periodic builds via NATS:
    systemd.services.scheduled-build = {
      # Builds run one after the other, matching the sequential build decision.
      script = ''
        homelab-deploy build nixos-servers --all
        homelab-deploy build nixos --all
      '';
    };
    systemd.timers.scheduled-build = {
      wantedBy = [ "timers.target" ];
      # Fires at minute 30 of every hour.
      timerConfig.OnCalendar = "*-*-* *:30:00";
    };
Or trigger builds from CI after merges to master.
Resolved Questions
- Parallel vs sequential builds? Sequential - hosts share packages, subsequent builds are fast after first
- What about gunter? Configured as nixos repo in builder settings
- Disk size? 200GB for new host
- Build host specs? 8 cores, 16-24GB RAM matches current nix-cache01
Phase 6: Observability
- Alerting rules for build failures:

      # Alert if any build fails
      increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0

      # Alert if no successful builds in 24h (scheduled builds stopped)
      time() - homelab_deploy_build_last_success_timestamp > 86400

- Grafana dashboard for build metrics:
  - Build success/failure rate over time
  - Average build duration per host (histogram)
  - Build frequency (builds per hour/day)
  - Last successful build timestamp per repo
Available metrics:
- homelab_deploy_builds_total{repo, status} - total builds by repo and status
- homelab_deploy_build_host_total{repo, host, status} - per-host build counts
- homelab_deploy_build_duration_seconds_{bucket,sum,count} - build duration histogram
- homelab_deploy_build_last_timestamp{repo} - last build attempt
- homelab_deploy_build_last_success_timestamp{repo} - last successful build
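One possible way to wire the two alert expressions from this phase into a NixOS Prometheus host is sketched here, using `services.prometheus.rules` with a JSON-encoded rule group (JSON is valid YAML, so Prometheus accepts it). The group name, alert names, severity labels, and summaries are hypothetical; only the two expressions and the metric labels are taken from this document.

```nix
# Sketch only: group name, alert names, labels, and annotations are
# hypothetical. The expr values are the expressions listed in this plan.
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "homelab-deploy-builds";
          rules = [
            {
              alert = "HomelabBuildFailed";
              expr = ''increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0'';
              labels.severity = "warning";
              annotations.summary = "Build failed for {{ $labels.host }} in {{ $labels.repo }}";
            }
            {
              alert = "HomelabNoSuccessfulBuild";
              expr = "time() - homelab_deploy_build_last_success_timestamp > 86400";
              labels.severity = "warning";
              annotations.summary = "No successful build of {{ $labels.repo }} in 24h";
            }
          ];
        }
      ];
    })
  ];
}
```

The second rule only becomes meaningful once scheduled builds exist, since without them a quiet day would also trip the 24h threshold.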
Open Questions
- When to cut over DNS from nix-cache01 to nix-cache02?
- Implement safe flake update workflow before or after full migration?