docs: move completed plans to completed folder
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs/plans/completed/nix-cache-reprovision.md
# Nix Cache Host Reprovision

## Overview

Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:

1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master

## Status

- **Phase 1: New Build Host** - COMPLETE
- **Phase 2: NATS Build Triggering** - COMPLETE
- **Phase 3: Safe Flake Update Workflow** - NOT STARTED
- **Phase 4: Complete Migration** - COMPLETE
- **Phase 5: Scheduled Builds** - COMPLETE

## Completed Work

### New Build Host (nix-cache02)

Instead of reprovisioning nix-cache01 in place, we created a new host `nix-cache02` at 10.69.13.25:

- **Specs**: 8 CPU cores, 16GB RAM (temporary; to be increased to 24GB once nix-cache01 is decommissioned), 200GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
  - `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
  - `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`

### NATS-Based Build Triggering

The `homelab-deploy` tool was extended with a builder mode:

**NATS Subjects:**
- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`

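
The subject scheme above can be sketched as a pair of helpers. This is a minimal illustration, not code from `homelab-deploy`: the function names `build_subject` and `parse_subject`, the default target `all`, and the token validation are assumptions made for the example.

```python
from typing import Tuple

# Characters that would break a NATS subject token: '.' separates tokens,
# '*' and '>' are wildcards, and whitespace is not allowed in subjects.
_FORBIDDEN = set(".*> \t")


def build_subject(repo: str, target: str = "all") -> str:
    """Return the subject build.<repo>.<target> (hypothetical helper)."""
    for token in (repo, target):
        if not token or any(ch in _FORBIDDEN for ch in token):
            raise ValueError(f"invalid subject token: {token!r}")
    return f"build.{repo}.{target}"


def parse_subject(subject: str) -> Tuple[str, str]:
    """Split build.<repo>.<target> back into (repo, target)."""
    parts = subject.split(".")
    if len(parts) != 3 or parts[0] != "build":
        raise ValueError(f"not a build subject: {subject!r}")
    return parts[1], parts[2]


print(build_subject("nixos-servers", "ns1"))
print(parse_subject("build.nixos-servers.all"))
```
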
**NATS Permissions (in DEPLOY account):**

| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |

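
To make the publish column concrete, the sketch below checks subjects against it using NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens). It only models the matching; real enforcement happens in the NATS server. The dictionary keys and the function names `subject_matches` and `may_publish` are hypothetical labels for this example.

```python
from typing import Dict, List


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches a NATS-style `pattern`."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least
            # one subject token left to consume.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Publish permissions copied from the table; the keys are made-up labels.
PUBLISH: Dict[str, List[str]] = {
    "builder": ["build.responses.>"],
    "test-deployer": ["deploy.test.>", "deploy.discover", "build.>"],
    "admin-deployer": ["deploy.>", "build.>"],
}


def may_publish(user: str, subject: str) -> bool:
    """Check a subject against the user's publish patterns."""
    return any(subject_matches(p, subject) for p in PUBLISH[user])


# The test deployer can request builds but cannot deploy outside deploy.test.
print(may_publish("test-deployer", "build.nixos-servers.all"))
print(may_publish("test-deployer", "deploy.prod.ns1"))
# The builder may only publish responses, not new build requests.
print(may_publish("builder", "build.nixos-servers.all"))
```
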
**Vault Secrets:**
- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication

**NixOS Configuration:**
- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user

**MCP Integration:**
- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code

**Tested:**
- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)

### Harmonia Binary Cache

- Parameterized `services/nix-cache/harmonia.nix` to use hostname-based Vault paths
- Parameterized `services/nix-cache/proxy.nix` for hostname-based domain
- New signing key: `nix-cache02.home.2rjus.net-1`
- Vault secret: `hosts/nix-cache02/cache-secret`
- Removed unused Gitea Actions runner from nix-cache01

## Current State

### nix-cache02 (Active)
- Running at 10.69.13.25
- Serving `https://nix-cache.home.2rjus.net` (canonical URL)
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Harmonia binary cache server running
- Signing key: `nix-cache02.home.2rjus.net-1`
- Prod tier with `build-host` role

### nix-cache01 (Decommissioned)
- VM deleted from Proxmox
- Host configuration removed from repo
- Vault AppRole and secrets removed
- Old signing key removed from trusted-public-keys

## Remaining Work

### Phase 3: Safe Flake Update Workflow

1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run

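
The plan does not yet specify what `flake-update-safe.yaml` will run. As a rough sketch of the "validate before pushing to master" idea from the overview, the Python below updates the lock file on a scratch branch and only fast-forwards master if validation passes. Everything specific here is an assumption: the branch name, the commit message, the `origin` remote, and the use of `nix flake check` as the validation step (the real workflow might instead trigger a build of all hosts).

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def safe_flake_update(run_cmd: Runner = run,
                      branch: str = "flake-update-candidate") -> bool:
    """Update flake inputs on a scratch branch; push only if validation passes."""
    prepare_and_validate: List[List[str]] = [
        ["git", "switch", "-C", branch],                    # scratch branch
        ["nix", "flake", "update"],                         # refresh flake.lock
        ["git", "commit", "-am", "flake: update inputs"],   # fails if nothing changed
        ["nix", "flake", "check"],                          # assumed validation step
    ]
    promote: List[List[str]] = [
        ["git", "switch", "master"],
        ["git", "merge", "--ff-only", branch],
        ["git", "push", "origin", "master"],
    ]
    # Commands run in order and stop at the first failure, so the promote
    # steps are never reached when validation fails.
    for cmd in prepare_and_validate + promote:
        if run_cmd(cmd) != 0:
            return False
    return True
```

Passing the runner in as a parameter keeps the sequence testable without touching a real repository; a workflow would simply call `safe_flake_update()` with the default runner.
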
### Phase 4: Complete Migration ✅

1. ~~**Add Harmonia to nix-cache02**~~ ✅ Done - new signing key, parameterized service
2. ~~**Add trusted public key to all hosts**~~ ✅ Done - `system/nix.nix` updated
3. ~~**Test cache from other hosts**~~ ✅ Done - verified from testvm01
4. ~~**Update proxy and DNS**~~ ✅ Done - `nix-cache.home.2rjus.net` CNAME now points to nix-cache02
5. ~~**Deploy to all hosts**~~ ✅ Done - all hosts have new trusted key
6. ~~**Decommission nix-cache01**~~ ✅ Done - 2026-02-10:
   - Removed `hosts/nix-cache01/` directory
   - Removed `services/nix-cache/build-flakes.{nix,sh}`
   - Removed Vault AppRole and secrets
   - Removed old signing key from `system/nix.nix`
   - Removed from `flake.nix`
   - Deleted VM from Proxmox

### Phase 5: Scheduled Builds ✅

Implemented a systemd timer on nix-cache02 that triggers builds every 2 hours:

- **Timer**: `scheduled-build.timer` runs every 2 hours with 5m random jitter
- **Service**: `scheduled-build.service` calls `homelab-deploy build` for both repos
- **Authentication**: Dedicated scheduler NKey stored in Vault
- **NATS user**: Added to DEPLOY account with publish `build.>` and subscribe `build.responses.>`

Files:
- `hosts/nix-cache02/scheduler.nix` - Timer and service configuration
- `services/nats/default.nix` - Scheduler NATS user
- `terraform/vault/secrets.tf` - Scheduler NKey secret
- `terraform/vault/variables.tf` - Variable for scheduler NKey

## Resolved Questions

- **Parallel vs sequential builds?** Sequential. Hosts share packages, so builds after the first one are fast.
- **What about gunter?** Configured as the `nixos` repo in the builder settings
- **Disk size?** 200GB for the new host
- **Build host specs?** 8 cores and 16-24GB RAM, matching nix-cache01

### Phase 6: Observability

1. **Alerting rules** for build failures:
   ```promql
   # Alert if any build fails
   increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0

   # Alert if no successful builds in 24h (scheduled builds stopped)
   time() - homelab_deploy_build_last_success_timestamp > 86400
   ```

2. **Grafana dashboard** for build metrics:
   - Build success/failure rate over time
   - Average build duration per host (histogram)
   - Build frequency (builds per hour/day)
   - Last successful build timestamp per repo

Available metrics:
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build

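
As a worked illustration of the two alert conditions, the sketch below applies the same comparisons to plain numbers. It is a simplification: Prometheus's `increase()` also extrapolates and handles counter resets, which this ignores. The function names and the sample values are made up for the example; the 2-hour interval comes from the scheduled-build timer.

```python
SECONDS_PER_DAY = 86400
BUILD_INTERVAL = 2 * 60 * 60  # scheduled builds run every 2 hours


def build_failure_alert(failures_now: float, failures_1h_ago: float) -> bool:
    """Fire if the failure counter grew during the last hour."""
    return failures_now - failures_1h_ago > 0


def stale_build_alert(now: float, last_success_ts: float,
                      max_age: float = SECONDS_PER_DAY) -> bool:
    """Fire if the last successful build is older than max_age seconds."""
    return now - last_success_ts > max_age


# With a 2-hour schedule, a 24h threshold tolerates this many missed runs
# before the staleness alert fires.
missed_runs_before_alert = SECONDS_PER_DAY // BUILD_INTERVAL

print(build_failure_alert(4, 3))
print(stale_build_alert(now=100_000, last_success_ts=10_000))
print(missed_runs_before_alert)
```

The 24h threshold is deliberately loose relative to the 2-hour schedule, so a single skipped or slow run does not trigger the staleness alert.
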
## Open Questions

- [x] ~~When to cut over DNS from nix-cache01 to nix-cache02?~~ Done - 2026-02-10
- [ ] Implement safe flake update workflow before or after full migration?