# Nix Cache Host Reprovision

## Overview

Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:

1. NATS-based remote build triggering (replacing the current bash script)
2. A safer flake update workflow that validates builds before pushing to master

## Status

- **Phase 1: New Build Host** - COMPLETE
- **Phase 2: NATS Build Triggering** - COMPLETE
- **Phase 3: Safe Flake Update Workflow** - NOT STARTED
- **Phase 4: Decommission Old System** - NOT STARTED

## Completed Work

### New Build Host (nix-cache02)

Instead of reprovisioning nix-cache01 in place, we created a new host, `nix-cache02`, at 10.69.13.25:

- **Specs**: 8 CPU cores, 16GB RAM (temporary; will increase to 24GB after nix-cache01 is decommissioned), 200GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
  - `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
  - `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`

### NATS-Based Build Triggering

The `homelab-deploy` tool was extended with a builder mode:

**NATS Subjects:**

- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`

**NATS Permissions (in DEPLOY account):**

| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |

**Vault Secrets:**

- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication

**NixOS Configuration:**

- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user

**MCP Integration:**

- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code

**Tested:**

- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)

## Current State

### Old System (nix-cache01)

- Still running at 10.69.13.15
- Serves binary cache via Harmonia
- Runs Gitea Actions runner
- Has the old `build-flakes.sh` timer (every 30 min)
- Will be decommissioned after nix-cache02 is fully validated

### New System (nix-cache02)

- Running at 10.69.13.25
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Does NOT yet have:
  - Harmonia (binary cache server)
  - Actions runner
  - Cache signing key

## Remaining Work

### Phase 3: Safe Flake Update Workflow

1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run

### Phase 4: Complete Migration

1. **Add Harmonia to nix-cache02** - Copy cache signing key, configure service
2. **Add Actions runner to nix-cache02** - Configure with Vault token
3. **Update DNS** - Point `nix-cache.home.2rjus.net` to nix-cache02
4. **Increase RAM** - Bump to 24GB after nix-cache01 is gone
5. **Decommission nix-cache01**:
   - Remove from `terraform/vms.tf`
   - Remove old build script (`services/nix-cache/build-flakes.nix`, `build-flakes.sh`)
   - Archive or delete host config

### Phase 5: Scheduled Builds (Optional)

Add a systemd timer on nix-cache02 to trigger periodic builds via NATS:

```nix
systemd.services.scheduled-build = {
  # Run to completion on each timer activation instead of staying resident.
  serviceConfig.Type = "oneshot";
  # Build every host in both repos, one repo after the other.
  script = ''
    homelab-deploy build nixos-servers --all
    homelab-deploy build nixos --all
  '';
};

systemd.timers.scheduled-build = {
  wantedBy = [ "timers.target" ];
  # Fire once per hour, at minute 30.
  timerConfig.OnCalendar = "*-*-* *:30:00";
};
```

Or trigger builds from CI after merges to master.

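
For the Harmonia step in Phase 4, a minimal sketch of what the nix-cache02 configuration might look like. The key path `/run/secrets/cache-secret` is a hypothetical placeholder for wherever the signing key copied from nix-cache01 ends up, and opening port 5000 assumes clients reach Harmonia directly on its default port rather than through a reverse proxy. Older nixpkgs releases use the singular `signKeyPath` option instead of `signKeyPaths`.

```nix
{ ... }:
{
  # Sketch only: serve the local Nix store as a binary cache via Harmonia.
  services.harmonia = {
    enable = true;
    # Hypothetical path; assumes the existing cache signing key is
    # delivered to this location (for example, fetched from Vault).
    signKeyPaths = [ "/run/secrets/cache-secret" ];
  };

  # Assumption: Harmonia keeps its default port (5000) and is exposed directly.
  networking.firewall.allowedTCPPorts = [ 5000 ];
}
```

Reusing the existing signing key means clients that already trust the nix-cache01 public key would not need a `trusted-public-keys` change when DNS is cut over.
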
## Resolved Questions

- **Parallel vs sequential builds?** Sequential - hosts share packages, subsequent builds are fast after first
- **What about gunter?** Configured as `nixos` repo in builder settings
- **Disk size?** 200GB for new host
- **Build host specs?** 8 cores, 16-24GB RAM matches current nix-cache01

## Open Questions

- [ ] When to cut over DNS from nix-cache01 to nix-cache02?
- [ ] Keep Actions runner on nix-cache02 or separate host?
- [ ] Implement safe flake update workflow before or after full migration?