nixos-servers/docs/plans/nix-cache-reprovision.md
Torjus Håkestad ade0538717
docs: mark nix-cache DNS cutover complete
nix-cache.home.2rjus.net now served by nix-cache02.
nix-cache01 ready for decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:34:04 +01:00


# Nix Cache Host Reprovision
## Overview
Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master
## Status
- **Phase 1: New Build Host** - COMPLETE
- **Phase 2: NATS Build Triggering** - COMPLETE
- **Phase 3: Safe Flake Update Workflow** - NOT STARTED
- **Phase 4: Complete Migration** - COMPLETE (cleanup pending)
## Completed Work
### New Build Host (nix-cache02)
Instead of reprovisioning nix-cache01 in-place, we created a new host `nix-cache02` at 10.69.13.25:
- **Specs**: 8 CPU cores, 16GB RAM (temporary; to be increased to 24GB once nix-cache01 is decommissioned), 200GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
  - `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
  - `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`
### NATS-Based Build Triggering
The `homelab-deploy` tool was extended with a builder mode:
**NATS Subjects:**
- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`
**NATS Permissions (in DEPLOY account):**
| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |
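
As a rough illustration, the builder and admin deployer rows above could be expressed through the NixOS `services.nats.settings` option, which is rendered into the NATS server configuration. This is a sketch, not the actual contents of `services/nats/default.nix`: the `nkey` values are placeholders for each user's public NKey, and the surrounding layout is an assumption.

```nix
{
  services.nats.settings.accounts.DEPLOY.users = [
    {
      # Builder: receives build requests and publishes results.
      nkey = "U<builder-public-nkey>"; # placeholder, not a real key
      permissions = {
        publish = [ "build.responses.>" ];
        subscribe = [ "build.>" ];
      };
    }
    {
      # Admin deployer: may trigger any deploy or build and read responses.
      nkey = "U<admin-deployer-public-nkey>"; # placeholder, not a real key
      permissions = {
        publish = [ "deploy.>" "build.>" ];
        subscribe = [ "deploy.>" "build.responses.>" ];
      };
    }
  ];
}
```

In NATS subject matching, `>` matches one or more trailing tokens, so `build.>` covers both `build.nixos-servers.all` and `build.nixos-servers.ns1`.
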
**Vault Secrets:**
- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication
**NixOS Configuration:**
- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user
**MCP Integration:**
- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code
**Tested:**
- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)
### Harmonia Binary Cache
- Parameterized `services/nix-cache/harmonia.nix` to use hostname-based Vault paths
- Parameterized `services/nix-cache/proxy.nix` for hostname-based domain
- New signing key: `nix-cache02.home.2rjus.net-1`
- Vault secret: `hosts/nix-cache02/cache-secret`
- Removed unused Gitea Actions runner from nix-cache01
## Current State
### Old System (nix-cache01) - PENDING DECOMMISSION
- Running at 10.69.13.15
- No longer serving the canonical `nix-cache.home.2rjus.net` (now serves `nix-cache01.home.2rjus.net`)
- Still has the old `build-flakes.sh` timer (every 30 min) - to be removed
- Ready for decommission
### New System (nix-cache02) - NOW ACTIVE
- Running at 10.69.13.25
- **Now serving `https://nix-cache.home.2rjus.net`** (canonical URL)
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Harmonia binary cache server running
- New signing key: `nix-cache02.home.2rjus.net-1`
- Trusted public key deployed to all hosts
- Promoted to prod tier with `build-host` role
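
A client-side sketch of what trusting the new cache involves, using the standard `nix.settings` options. Only the key name appears in this plan, so the public key value below is a placeholder, and the real `system/nix.nix` may be organised differently.

```nix
{
  nix.settings = {
    # Canonical cache URL, now served by nix-cache02.
    substituters = [ "https://nix-cache.home.2rjus.net" ];
    # Key name taken from this plan; the part after the colon is a placeholder.
    trusted-public-keys = [
      "nix-cache02.home.2rjus.net-1:<base64-public-key>"
    ];
  };
}
```
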
## Remaining Work
### Phase 3: Safe Flake Update Workflow
1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run
### Phase 4: Complete Migration
1. ~~**Add Harmonia to nix-cache02**~~ ✅ Done - new signing key, parameterized service
2. ~~**Add trusted public key to all hosts**~~ ✅ Done - `system/nix.nix` updated
3. ~~**Test cache from other hosts**~~ ✅ Done - verified from testvm01
4. ~~**Update proxy and DNS**~~ ✅ Done - `nix-cache.home.2rjus.net` CNAME now points to nix-cache02
5. ~~**Deploy to all hosts**~~ ✅ Done - all hosts have new trusted key
6. **Increase RAM** - Bump to 24GB after nix-cache01 is gone
7. **Decommission nix-cache01**:
   - Remove from `terraform/vms.tf`
   - Remove old build script (`services/nix-cache/build-flakes.nix`, `build-flakes.sh`)
   - Archive or delete host config
   - Remove old signing key from `system/nix.nix` trusted-public-keys
### Phase 5: Scheduled Builds (Optional)
Add a systemd timer on nix-cache02 to trigger periodic builds via NATS:
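```nix
# One-shot service that requests a build of every host in both repos.
systemd.services.scheduled-build = {
  serviceConfig.Type = "oneshot";
  script = ''
    homelab-deploy build nixos-servers --all
    homelab-deploy build nixos --all
  '';
};
# Run the service at half past every hour.
systemd.timers.scheduled-build = {
  wantedBy = [ "timers.target" ];
  timerConfig.OnCalendar = "*-*-* *:30:00";
};
```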
Or trigger builds from CI after merges to master.
## Resolved Questions
- **Parallel vs sequential builds?** Sequential - hosts share packages, so builds after the first are fast
- **What about gunter?** Configured as `nixos` repo in builder settings
- **Disk size?** 200GB for new host
- **Build host specs?** 8 cores and 16-24GB RAM, matching the current nix-cache01
### Phase 6: Observability
1. **Alerting rules** for build failures:
   ```promql
   # Alert if any build fails
   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
   # Alert if no successful builds in 24h (scheduled builds stopped)
   time() - homelab_deploy_build_last_success_timestamp > 86400
   ```
2. **Grafana dashboard** for build metrics:
   - Build success/failure rate over time
   - Average build duration per host (histogram)
   - Build frequency (builds per hour/day)
   - Last successful build timestamp per repo
Available metrics:
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
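
The two alert expressions above could be wired into Prometheus roughly as follows, assuming the monitoring host uses the NixOS `services.prometheus.rules` option (a list of rule-file strings; JSON works because it is valid YAML). The group name, alert names, severity labels, and summaries are hypothetical.

```nix
{
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "homelab-deploy-builds"; # hypothetical group name
          rules = [
            {
              alert = "HomelabBuildFailed"; # hypothetical alert name
              expr = ''increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0'';
              labels.severity = "warning";
              annotations.summary = "Build failed for {{ $labels.host }} ({{ $labels.repo }})";
            }
            {
              alert = "HomelabNoSuccessfulBuild"; # hypothetical alert name
              expr = "time() - homelab_deploy_build_last_success_timestamp > 86400";
              labels.severity = "warning";
              annotations.summary = "No successful build of {{ $labels.repo }} in 24h";
            }
          ];
        }
      ];
    })
  ];
}
```

The second rule only makes sense once scheduled builds exist; without them, a quiet day would trigger it.
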
## Open Questions
- [x] ~~When to cut over DNS from nix-cache01 to nix-cache02?~~ Done - 2026-02-10
- [ ] Implement safe flake update workflow before or after full migration?