docs: update nix-cache-reprovision plan with progress
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -31,7 +31,8 @@
|
|||||||
"--",
|
"--",
|
||||||
"mcp",
|
"mcp",
|
||||||
"--nats-url", "nats://nats1.home.2rjus.net:4222",
|
"--nats-url", "nats://nats1.home.2rjus.net:4222",
|
||||||
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
|
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey",
|
||||||
|
"--enable-builds"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"git-explorer": {
|
"git-explorer": {
|
||||||
|
|||||||
@@ -6,207 +6,120 @@ Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cac
|
|||||||
1. NATS-based remote build triggering (replacing the current bash script)
|
1. NATS-based remote build triggering (replacing the current bash script)
|
||||||
2. Safer flake update workflow that validates builds before pushing to master
|
2. Safer flake update workflow that validates builds before pushing to master
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
**Phase 1: New Build Host** - COMPLETE
|
||||||
|
**Phase 2: NATS Build Triggering** - COMPLETE
|
||||||
|
**Phase 3: Safe Flake Update Workflow** - NOT STARTED
|
||||||
|
**Phase 4: Decommission Old System** - NOT STARTED
|
||||||
|
|
||||||
|
## Completed Work
|
||||||
|
|
||||||
|
### New Build Host (nix-cache02)
|
||||||
|
|
||||||
|
Instead of reprovisioning nix-cache01 in-place, we created a new host `nix-cache02` at 10.69.13.25:
|
||||||
|
|
||||||
|
- **Specs**: 8 CPU cores, 16GB RAM (temporary; will increase to 24GB after nix-cache01 is decommissioned), 200GB disk
|
||||||
|
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
|
||||||
|
- **Builder service** configured with two repos:
|
||||||
|
- `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
|
||||||
|
- `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`
|
||||||
|
|
||||||
|
### NATS-Based Build Triggering
|
||||||
|
|
||||||
|
The `homelab-deploy` tool was extended with a builder mode:
|
||||||
|
|
||||||
|
**NATS Subjects:**
|
||||||
|
- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`
|
||||||
|
|
||||||
|
**NATS Permissions (in DEPLOY account):**
|
||||||
|
| User | Publish | Subscribe |
|
||||||
|
|------|---------|-----------|
|
||||||
|
| Builder | `build.responses.>` | `build.>` |
|
||||||
|
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
|
||||||
|
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |
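
The grants in this table rely on NATS subject wildcards: `*` matches exactly one token, and `>` matches one or more trailing tokens. Below is a minimal sketch of that matching rule, which can help when checking which subjects a grant covers. `subject_matches` is a hypothetical helper written for illustration; it is not part of `homelab-deploy` or of any NATS client library.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subject pattern covers a concrete subject.

    '*' matches exactly one dot-separated token.
    '>' is only valid as the last token and matches one or more tokens.
    """
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")

    for i, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' must be the final pattern token and needs at least one
            # remaining subject token to match.
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False  # pattern is longer than the subject
        if token != "*" and token != subject_tokens[i]:
            return False  # literal token mismatch

    # No '>' seen: token counts must be identical.
    return len(pattern_tokens) == len(subject_tokens)


if __name__ == "__main__":
    # Subjects taken from the examples in this document.
    print(subject_matches("build.>", "build.nixos-servers.all"))
    print(subject_matches("deploy.test.>", "build.nixos-servers.ns1"))
```

Under this rule, a `build.>` grant also covers subjects beneath `build.responses.`, which may be worth keeping in mind when reading the Builder row.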
|
||||||
|
|
||||||
|
**Vault Secrets:**
|
||||||
|
- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication
|
||||||
|
|
||||||
|
**NixOS Configuration:**
|
||||||
|
- `hosts/nix-cache02/builder.nix` - Builder service configuration
|
||||||
|
- `services/nats/default.nix` - Updated with builder NATS user
|
||||||
|
|
||||||
|
**MCP Integration:**
|
||||||
|
- `.mcp.json` updated with `--enable-builds` flag
|
||||||
|
- Build tool available via MCP for Claude Code
|
||||||
|
|
||||||
|
**Tested:**
|
||||||
|
- Single host build: `build nixos-servers testvm01` (~30s)
|
||||||
|
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)
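
As a rough reading of the two timings above, the arithmetic below compares the measured all-hosts run with a naive estimate of 16 independent single-host builds. The figures are the approximate values reported here; treating ~30s as the cost of every independent build is an assumption made only for illustration.

```python
# Approximate figures from the tested builds listed above.
hosts = 16
all_hosts_seconds = 226
single_host_seconds = 30  # assumption: each independent build costs about this much

average_per_host = all_hosts_seconds / hosts
naive_total = hosts * single_host_seconds

print(f"average per host in the full run: {average_per_host:.1f}s")
print(f"naive estimate for {hosts} independent builds: {naive_total}s")
print(f"difference: {naive_total - all_hosts_seconds}s")
```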
|
||||||
|
|
||||||
## Current State
|
## Current State
|
||||||
|
|
||||||
### Host Configuration
|
### Old System (nix-cache01)
|
||||||
- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
|
- Still running at 10.69.13.15
|
||||||
- Runs Gitea Actions runner for CI workflows
|
- Serves binary cache via Harmonia
|
||||||
- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
|
- Runs Gitea Actions runner
|
||||||
- Uses a dedicated XFS volume at `/nix` for cache storage
|
- Has the old `build-flakes.sh` timer (every 30 min)
|
||||||
|
- Will be decommissioned after nix-cache02 is fully validated
|
||||||
|
|
||||||
### Current Build System (`services/nix-cache/build-flakes.sh`)
|
### New System (nix-cache02)
|
||||||
- Runs every 30 minutes via systemd timer
|
- Running at 10.69.13.25
|
||||||
- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
|
- Builder service active, responding to NATS build requests
|
||||||
- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
|
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
|
||||||
- Pushes success/failure metrics to pushgateway
|
- Does NOT yet have:
|
||||||
- Simple but has no filtering, no parallelism, no remote triggering
|
- Harmonia (binary cache server)
|
||||||
|
- Actions runner
|
||||||
|
- Cache signing key
|
||||||
|
|
||||||
### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
|
## Remaining Work
|
||||||
- Runs daily at midnight via cron
|
|
||||||
- Runs `nix flake update --commit-lock-file`
|
|
||||||
- Pushes directly to master
|
|
||||||
- No build validation — can push broken inputs
|
|
||||||
|
|
||||||
## Improvement 1: NATS-Based Remote Build Triggering
|
### Phase 3: Safe Flake Update Workflow
|
||||||
|
|
||||||
### Design
|
|
||||||
|
|
||||||
Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.
|
|
||||||
|
|
||||||
| Approach | Pros | Cons |
|
|
||||||
|----------|------|------|
|
|
||||||
| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
|
|
||||||
| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
|
|
||||||
| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |
|
|
||||||
|
|
||||||
**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.
|
|
||||||
|
|
||||||
### Implementation
|
|
||||||
|
|
||||||
1. Add new message type to homelab-deploy: `build.<host>` subject
|
|
||||||
2. Listener on nix-cache01 subscribes to `build.>` wildcard
|
|
||||||
3. On message receipt, builds the specified host and returns success/failure
|
|
||||||
4. CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`
|
|
||||||
|
|
||||||
### Benefits
|
|
||||||
- Trigger rebuild for specific host to ensure it's cached
|
|
||||||
- Could be called from CI after merging PRs
|
|
||||||
- Reuses existing NATS infrastructure and auth
|
|
||||||
- Progress/status could stream back via NATS reply
|
|
||||||
|
|
||||||
## Improvement 2: Smarter Flake Update Workflow
|
|
||||||
|
|
||||||
### Current Problems
|
|
||||||
1. Updates can push breaking changes to master
|
|
||||||
2. No visibility into what broke when it does
|
|
||||||
3. Hosts that auto-update can pull broken configs
|
|
||||||
|
|
||||||
### Proposed Workflow
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────────────┐
|
|
||||||
│ Flake Update Workflow │
|
|
||||||
├─────────────────────────────────────────────────────────────────┤
|
|
||||||
│ 1. nix flake update (on feature branch) │
|
|
||||||
│ 2. Build ALL hosts locally │
|
|
||||||
│ 3. If all pass → fast-forward merge to master │
|
|
||||||
│ 4. If any fail → create PR with failure logs attached │
|
|
||||||
└─────────────────────────────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
### Implementation Options
|
|
||||||
|
|
||||||
| Option | Description | Pros | Cons |
|
|
||||||
|--------|-------------|------|------|
|
|
||||||
| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
|
|
||||||
| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
|
|
||||||
| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |
|
|
||||||
|
|
||||||
**Recommendation:** Option A with nix-cache01 as the runner. The host is already running Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.
|
|
||||||
|
|
||||||
### Workflow Steps
|
|
||||||
|
|
||||||
1. Workflow runs on schedule (daily or weekly)
|
|
||||||
2. Creates branch `flake-update/YYYY-MM-DD`
|
|
||||||
3. Runs `nix flake update --commit-lock-file`
|
|
||||||
4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
|
|
||||||
5. If all succeed:
|
|
||||||
- Fast-forward merge to master
|
|
||||||
- Delete feature branch
|
|
||||||
6. If any fail:
|
|
||||||
- Create PR from the update branch
|
|
||||||
- Attach build logs as PR comment
|
|
||||||
- Label PR with `needs-review` or `build-failure`
|
|
||||||
- Do NOT merge automatically
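
The build-and-decide part of these steps (steps 4 to 6) can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: `build_host`, `validate_hosts`, and `next_action` are hypothetical names, and the only external command used is the `nix build` invocation quoted in step 4. Checking the return code of `nix build` directly avoids relying on the exit status of a shell pipeline.

```python
import subprocess
from typing import Callable, Iterable


def build_host(host: str) -> bool:
    """Build one host's system closure; return True if nix build succeeded."""
    attr = f".#nixosConfigurations.{host}.config.system.build.toplevel"
    result = subprocess.run(["nix", "build", attr])
    return result.returncode == 0


def validate_hosts(
    hosts: Iterable[str],
    build: Callable[[str], bool] = build_host,
) -> list[str]:
    """Build every host sequentially and return the names that failed."""
    failed = []
    for host in hosts:
        if not build(host):
            failed.append(host)
    return failed


def next_action(failed: list[str]) -> str:
    """Map the build result to the step that follows in the workflow."""
    # All builds passed: fast-forward merge. Otherwise: open a PR for review.
    return "merge" if not failed else "open-pr"
```

Passing the builder as a parameter keeps the decision logic testable without invoking Nix, for example `validate_hosts(["ns1", "testvm01"], build=lambda h: True)`.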
|
|
||||||
|
|
||||||
### Workflow File Changes
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
# New: .github/workflows/flake-update-safe.yaml
|
|
||||||
name: Safe flake update
|
|
||||||
on:
|
|
||||||
schedule:
|
|
||||||
- cron: "0 2 * * 0" # Weekly on Sunday at 2 AM
|
|
||||||
workflow_dispatch: # Manual trigger
|
|
||||||
|
|
||||||
jobs:
|
|
||||||
update-and-validate:
|
|
||||||
runs-on: homelab # Use self-hosted runner on nix-cache01
|
|
||||||
steps:
|
|
||||||
- uses: actions/checkout@v4
|
|
||||||
with:
|
|
||||||
ref: master
|
|
||||||
fetch-depth: 0 # Need full history for merge
|
|
||||||
|
|
||||||
- name: Create update branch
|
|
||||||
run: |
|
|
||||||
BRANCH="flake-update/$(date +%Y-%m-%d)"
|
|
||||||
git checkout -b "$BRANCH"
|
|
||||||
|
|
||||||
- name: Update flake
|
|
||||||
run: nix flake update --commit-lock-file
|
|
||||||
|
|
||||||
- name: Build all hosts
|
|
||||||
id: build
|
|
||||||
run: |
|
|
||||||
FAILED=""
|
|
||||||
for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
|
|
||||||
echo "Building $host..."
|
|
||||||
if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
|
|
||||||
FAILED="$FAILED $host"
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
echo "failed=$FAILED" >> $GITHUB_OUTPUT
|
|
||||||
|
|
||||||
- name: Merge to master (if all pass)
|
|
||||||
if: steps.build.outputs.failed == ''
|
|
||||||
run: |
|
|
||||||
git checkout master
|
|
||||||
git merge --ff-only "$BRANCH"
|
|
||||||
git push origin master
|
|
||||||
git push origin --delete "$BRANCH"
|
|
||||||
|
|
||||||
- name: Create PR (if any fail)
|
|
||||||
if: steps.build.outputs.failed != ''
|
|
||||||
run: |
|
|
||||||
git push origin "$BRANCH"
|
|
||||||
# Create PR via Gitea API with build logs
|
|
||||||
# ... (PR creation with log attachment)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Migration Steps
|
|
||||||
|
|
||||||
### Phase 1: Reprovision Host via OpenTofu
|
|
||||||
|
|
||||||
1. Add `nix-cache01` to `terraform/vms.tf`:
|
|
||||||
```hcl
|
|
||||||
"nix-cache01" = {
|
|
||||||
ip = "10.69.13.15/24"
|
|
||||||
cpu_cores = 4
|
|
||||||
memory = 8192
|
|
||||||
disk_size = "100G" # Larger for nix store
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Shut down existing nix-cache01 VM
|
|
||||||
3. Run `tofu apply` to provision new VM
|
|
||||||
4. Verify bootstrap completes and cache is serving
|
|
||||||
|
|
||||||
**Note:** The cache will be cold after reprovision. Run initial builds to populate.
|
|
||||||
|
|
||||||
### Phase 2: Add Build Triggering to homelab-deploy
|
|
||||||
|
|
||||||
1. Add `build` command to homelab-deploy CLI
|
|
||||||
2. Add listener handler in NixOS module for `build.*` subjects
|
|
||||||
3. Update nix-cache01 config to enable build listener
|
|
||||||
4. Test with `homelab-deploy build testvm01`
|
|
||||||
|
|
||||||
### Phase 3: Implement Safe Flake Update Workflow
|
|
||||||
|
|
||||||
1. Create `.github/workflows/flake-update-safe.yaml`
|
1. Create `.github/workflows/flake-update-safe.yaml`
|
||||||
2. Disable or remove old `flake-update.yaml`
|
2. Disable or remove old `flake-update.yaml`
|
||||||
3. Test manually with `workflow_dispatch`
|
3. Test manually with `workflow_dispatch`
|
||||||
4. Monitor first automated run
|
4. Monitor first automated run
|
||||||
|
|
||||||
### Phase 4: Remove Old Build Script
|
### Phase 4: Complete Migration
|
||||||
|
|
||||||
1. After new workflow is stable, remove:
|
1. **Add Harmonia to nix-cache02** - Copy cache signing key, configure service
|
||||||
- `services/nix-cache/build-flakes.nix`
|
2. **Add Actions runner to nix-cache02** - Configure with Vault token
|
||||||
- `services/nix-cache/build-flakes.sh`
|
3. **Update DNS** - Point `nix-cache.home.2rjus.net` to nix-cache02
|
||||||
2. The new workflow handles scheduled builds
|
4. **Increase RAM** - Bump to 24GB after nix-cache01 is gone
|
||||||
|
5. **Decommission nix-cache01**:
|
||||||
|
- Remove from `terraform/vms.tf`
|
||||||
|
- Remove old build script (`services/nix-cache/build-flakes.nix`, `build-flakes.sh`)
|
||||||
|
- Archive or delete host config
|
||||||
|
|
||||||
|
### Phase 5: Scheduled Builds (Optional)
|
||||||
|
|
||||||
|
Add a systemd timer on nix-cache02 to trigger periodic builds via NATS:
|
||||||
|
|
||||||
|
```nix
|
||||||
|
systemd.services.scheduled-build = {
|
||||||
|
script = ''
|
||||||
|
homelab-deploy build nixos-servers --all
|
||||||
|
homelab-deploy build nixos --all
|
||||||
|
'';
|
||||||
|
};
|
||||||
|
systemd.timers.scheduled-build = {
|
||||||
|
wantedBy = [ "timers.target" ];
|
||||||
|
timerConfig.OnCalendar = "*-*-* *:30:00";
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
Or trigger builds from CI after merges to master.
|
||||||
|
|
||||||
|
## Resolved Questions
|
||||||
|
|
||||||
|
- **Parallel vs sequential builds?** Sequential. Hosts share packages, so builds after the first one are fast
|
||||||
|
- **What about gunter?** Configured as `nixos` repo in builder settings
|
||||||
|
- **Disk size?** 200GB for new host
|
||||||
|
- **Build host specs?** 8 cores, 16-24GB RAM matches current nix-cache01
|
||||||
|
|
||||||
## Open Questions
|
## Open Questions
|
||||||
|
|
||||||
- [ ] What runner labels should the self-hosted runner use for the update workflow?
|
- [ ] When to cut over DNS from nix-cache01 to nix-cache02?
|
||||||
- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
|
- [ ] Keep Actions runner on nix-cache02 or separate host?
|
||||||
- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
|
- [ ] Implement safe flake update workflow before or after full migration?
|
||||||
- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
|
|
||||||
- [ ] What to do about `gunter` (external nixos repo) - include in validation?
|
|
||||||
- [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
|
|
||||||
- The Harmonia service and cache signing key will work the same after reprovision
|
|
||||||
- Actions runner token is in Vault, will be provisioned automatically
|
|
||||||
- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering
|
|
||||||
|
|||||||