# Nix Cache Host Reprovision

## Overview

Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:

1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master

## Status

**Phase 1: New Build Host** - COMPLETE

**Phase 2: NATS Build Triggering** - COMPLETE

**Phase 3: Safe Flake Update Workflow** - NOT STARTED

**Phase 4: Complete Migration** - COMPLETE

**Phase 5: Scheduled Builds** - COMPLETE

## Completed Work

### New Build Host (nix-cache02)

Instead of reprovisioning nix-cache01 in-place, we created a new host `nix-cache02` at 10.69.13.25:

- **Specs**: 8 CPU cores, 16GB RAM (temporarily, will increase to 24GB after nix-cache01 decommissioned), 200GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
  - `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
  - `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`

### NATS-Based Build Triggering

The `homelab-deploy` tool was extended with a builder mode:

**NATS Subjects:**

- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`

**NATS Permissions (in DEPLOY account):**

| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |

**Vault Secrets:**

- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication

**NixOS Configuration:**

- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user

**MCP Integration:**

- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code

**Tested:**

- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)

### Harmonia Binary Cache

- Parameterized `services/nix-cache/harmonia.nix` to use hostname-based Vault paths
- Parameterized `services/nix-cache/proxy.nix` for hostname-based domain
- New signing key: `nix-cache02.home.2rjus.net-1`
- Vault secret: `hosts/nix-cache02/cache-secret`
- Removed unused Gitea Actions runner from nix-cache01

## Current State

### nix-cache02 (Active)

- Running at 10.69.13.25
- Serving `https://nix-cache.home.2rjus.net` (canonical URL)
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Harmonia binary cache server running
- Signing key: `nix-cache02.home.2rjus.net-1`
- Prod tier with `build-host` role

### nix-cache01 (Decommissioned)

- VM deleted from Proxmox
- Host configuration removed from repo
- Vault AppRole and secrets removed
- Old signing key removed from trusted-public-keys

## Remaining Work

### Phase 3: Safe Flake Update Workflow

1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run

### Phase 4: Complete Migration ✅

1. ~~**Add Harmonia to nix-cache02**~~ ✅ Done - new signing key, parameterized service
2. ~~**Add trusted public key to all hosts**~~ ✅ Done - `system/nix.nix` updated
3. ~~**Test cache from other hosts**~~ ✅ Done - verified from testvm01
4. ~~**Update proxy and DNS**~~ ✅ Done - `nix-cache.home.2rjus.net` CNAME now points to nix-cache02
5. ~~**Deploy to all hosts**~~ ✅ Done - all hosts have new trusted key
6. ~~**Decommission nix-cache01**~~ ✅ Done - 2026-02-10:
   - Removed `hosts/nix-cache01/` directory
   - Removed `services/nix-cache/build-flakes.{nix,sh}`
   - Removed Vault AppRole and secrets
   - Removed old signing key from `system/nix.nix`
   - Removed from `flake.nix`
   - Deleted VM from Proxmox

### Phase 5: Scheduled Builds ✅

Implemented a systemd timer on nix-cache02 that triggers builds every 2 hours:

- **Timer**: `scheduled-build.timer` runs every 2 hours with 5m random jitter
- **Service**: `scheduled-build.service` calls `homelab-deploy build` for both repos
- **Authentication**: Dedicated scheduler NKey stored in Vault
- **NATS user**: Added to DEPLOY account with publish `build.>` and subscribe `build.responses.>`

Files:

- `hosts/nix-cache02/scheduler.nix` - Timer and service configuration
- `services/nats/default.nix` - Scheduler NATS user
- `terraform/vault/secrets.tf` - Scheduler NKey secret
- `terraform/vault/variables.tf` - Variable for scheduler NKey

## Resolved Questions

- **Parallel vs sequential builds?** Sequential - hosts share packages, subsequent builds are fast after first
- **What about gunter?** Configured as `nixos` repo in builder settings
- **Disk size?** 200GB for new host
- **Build host specs?** 8 cores, 16-24GB RAM matches current nix-cache01

### Phase 6: Observability

1. **Alerting rules** for build failures:

   ```promql
   # Alert if any build fails
   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0

   # Alert if no successful builds in 24h (scheduled builds stopped)
   time() - homelab_deploy_build_last_success_timestamp > 86400
   ```

2. **Grafana dashboard** for build metrics:
   - Build success/failure rate over time
   - Average build duration per host (histogram)
   - Build frequency (builds per hour/day)
   - Last successful build timestamp per repo

Available metrics:

- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build

## Open Questions

- [x] ~~When to cut over DNS from nix-cache01 to nix-cache02?~~ Done - 2026-02-10
- [ ] Implement safe flake update workflow before or after full migration?
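
As a starting point for the Phase 6 alerting rules, the two PromQL expressions could be wrapped in a Prometheus rule group roughly like the sketch below. It uses the standard Prometheus rule-file format. The group name, alert names, `for` duration, and severity labels are placeholders (assumptions, not taken from this repo), and how the rules would be wired into the NixOS Prometheus configuration is not covered here.

```yaml
groups:
  - name: homelab-deploy-builds        # hypothetical group name
    rules:
      # Fires when any per-host build has failed within the last hour.
      - alert: HomelabBuildFailed      # hypothetical alert name
        expr: increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
        labels:
          severity: warning            # assumed severity scheme
        annotations:
          summary: "Build failed for {{ $labels.host }} ({{ $labels.repo }})"

      # Fires when a repo has had no successful build for 24h,
      # which would suggest the scheduled builds have stopped.
      - alert: HomelabBuildStale       # hypothetical alert name
        expr: time() - homelab_deploy_build_last_success_timestamp > 86400
        for: 15m                       # assumed grace period
        labels:
          severity: critical           # assumed severity scheme
        annotations:
          summary: "No successful build of {{ $labels.repo }} in 24h"
```

The `host` and `repo` labels used in the annotations come from the metric definitions listed under Phase 6. The staleness rule only evaluates for repos that have reported at least one successful build, so a repo that has never built successfully would need a separate `absent()`-style check.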