nixos-servers/docs/plans/nix-cache-reprovision.md
Torjus Håkestad 5bfb51a497
2026-02-10 22:46:38 +01:00

Nix Cache Host Reprovision

Overview

Reprovision nix-cache01 using the OpenTofu workflow, and improve the build/cache system with:

  1. NATS-based remote build triggering (replacing the current bash script)
  2. Safer flake update workflow that validates builds before pushing to master

Status

  • Phase 1: New Build Host - COMPLETE
  • Phase 2: NATS Build Triggering - COMPLETE
  • Phase 3: Safe Flake Update Workflow - NOT STARTED
  • Phase 4: Decommission Old System - NOT STARTED

Completed Work

New Build Host (nix-cache02)

Instead of reprovisioning nix-cache01 in place, we created a new host, nix-cache02, at 10.69.13.25:

  • Specs: 8 CPU cores, 16GB RAM (temporary; will increase to 24GB once nix-cache01 is decommissioned), 200GB disk
  • Provisioned via OpenTofu with automatic Vault credential bootstrapping
  • Builder service configured with two repos:
    • nixos-servers → git+https://git.t-juice.club/torjus/nixos-servers.git
    • nixos (gunter) → git+https://git.t-juice.club/torjus/nixos.git

NATS-Based Build Triggering

The homelab-deploy tool was extended with a builder mode:

NATS Subjects:

  • build.<repo>.<target> - e.g., build.nixos-servers.all or build.nixos-servers.ns1

NATS Permissions (in DEPLOY account):

| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | build.responses.> | build.> |
| Test deployer | deploy.test.>, deploy.discover, build.> | deploy.responses.>, deploy.discover, build.responses.> |
| Admin deployer | deploy.>, build.> | deploy.>, build.responses.> |
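
As a rough illustration of how the Builder row could be expressed in the NATS server configuration, here is a minimal sketch using the NixOS services.nats.settings option. The NKey value is a placeholder, and the actual layout of services/nats/default.nix may differ; only the account name and subjects are taken from the table above.

```nix
{
  # Sketch only: the real definition lives in services/nats/default.nix
  # and may be structured differently.
  services.nats.settings.accounts.DEPLOY.users = [
    {
      # Placeholder value. The real entry would be the public NKey that
      # corresponds to the seed stored in Vault at
      # shared/homelab-deploy/builder-nkey.
      nkey = "UBUILDER_PUBLIC_NKEY_PLACEHOLDER";
      permissions = {
        # The builder only publishes build results...
        publish = [ "build.responses.>" ];
        # ...and listens for build requests such as build.nixos-servers.all
        subscribe = [ "build.>" ];
      };
    }
  ];
}
```

The deployer users would follow the same shape, with the publish and subscribe lists from their rows in the table.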

Vault Secrets:

  • shared/homelab-deploy/builder-nkey - NKey seed for builder authentication

NixOS Configuration:

  • hosts/nix-cache02/builder.nix - Builder service configuration
  • services/nats/default.nix - Updated with builder NATS user

MCP Integration:

  • .mcp.json updated with --enable-builds flag
  • Build tool available via MCP for Claude Code

Tested:

  • Single host build: build nixos-servers testvm01 (~30s)
  • All hosts build: build nixos-servers all (16 hosts in ~226s)

Current State

Old System (nix-cache01)

  • Still running at 10.69.13.15
  • Serves binary cache via Harmonia
  • Runs Gitea Actions runner
  • Has the old build-flakes.sh timer (every 30 min)
  • Will be decommissioned after nix-cache02 is fully validated

New System (nix-cache02)

  • Running at 10.69.13.25
  • Builder service active, responding to NATS build requests
  • Metrics exposed on port 9973 (homelab-deploy-builder job)
  • Does NOT yet have:
    • Harmonia (binary cache server)
    • Actions runner
    • Cache signing key

Remaining Work

Phase 3: Safe Flake Update Workflow

  1. Create .github/workflows/flake-update-safe.yaml
  2. Disable or remove old flake-update.yaml
  3. Test manually with workflow_dispatch
  4. Monitor first automated run

Phase 4: Complete Migration

  1. Add Harmonia to nix-cache02 - Copy cache signing key, configure service
  2. Add Actions runner to nix-cache02 - Configure with Vault token
  3. Update DNS - Point nix-cache.home.2rjus.net to nix-cache02
  4. Increase RAM - Bump to 24GB after nix-cache01 is gone
  5. Decommission nix-cache01:
    • Remove from terraform/vms.tf
    • Remove old build script (services/nix-cache/build-flakes.nix, build-flakes.sh)
    • Archive or delete host config

Phase 5: Scheduled Builds (Optional)

Add a systemd timer on nix-cache02 to trigger periodic builds via NATS:

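systemd.services.scheduled-build = {
  # Run to completion on each timer activation rather than as a daemon
  serviceConfig.Type = "oneshot";
  # homelab-deploy must be on the unit's PATH (e.g. via the `path` option).
  # NixOS runs the script with `set -e`, so a failure in the first build
  # skips the second.
  script = ''
    homelab-deploy build nixos-servers --all
    homelab-deploy build nixos --all
  '';
};
systemd.timers.scheduled-build = {
  wantedBy = [ "timers.target" ];
  # Fires once an hour, at half past
  timerConfig.OnCalendar = "*-*-* *:30:00";
};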

Or trigger builds from CI after merges to master.

Resolved Questions

  • Parallel vs sequential builds? Sequential. Hosts share packages, so builds after the first are fast
  • What about gunter? Configured as the nixos repo in the builder settings
  • Disk size? 200GB for the new host
  • Build host specs? 8 cores and 16-24GB RAM, matching the current nix-cache01

Phase 6: Observability

  1. Alerting rules for build failures:

    # Alert if any build fails
    increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
    
    # Alert if no successful builds in 24h (scheduled builds stopped)
    time() - homelab_deploy_build_last_success_timestamp > 86400
    
  2. Grafana dashboard for build metrics:

    • Build success/failure rate over time
    • Average build duration per host (histogram)
    • Build frequency (builds per hour/day)
    • Last successful build timestamp per repo

Available metrics:

  • homelab_deploy_builds_total{repo, status} - total builds by repo and status
  • homelab_deploy_build_host_total{repo, host, status} - per-host build counts
  • homelab_deploy_build_duration_seconds_{bucket,sum,count} - build duration histogram
  • homelab_deploy_build_last_timestamp{repo} - last build attempt
  • homelab_deploy_build_last_success_timestamp{repo} - last successful build
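
The two alert expressions from Phase 6 could be wired into Prometheus along the lines of the sketch below. The group name, alert names, severity labels and summaries are illustrative assumptions; only the expressions and metric names come from this plan. It relies on JSON being valid YAML, so builtins.toJSON can produce the rule file content.

```nix
{
  # Sketch only: group name, alert names, severities and annotations are
  # placeholders, not decided values.
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "homelab-deploy-builder";
          rules = [
            {
              # Fires if any host build failed within the last hour
              alert = "BuildFailed";
              expr = ''increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0'';
              labels.severity = "warning";
              annotations.summary = "Build failed for {{ $labels.host }} ({{ $labels.repo }})";
            }
            {
              # Fires if a repo has had no successful build for 24h
              alert = "NoSuccessfulBuild";
              expr = "time() - homelab_deploy_build_last_success_timestamp > 86400";
              labels.severity = "warning";
              annotations.summary = "No successful build of {{ $labels.repo }} in 24h";
            }
          ];
        }
      ];
    })
  ];
}
```

Note that the second rule can only fire for repos that have already reported a success timestamp; a repo that has never built successfully produces no series, so catching that case would need an additional absent() rule.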

Open Questions

  • When to cut over DNS from nix-cache01 to nix-cache02?
  • Keep Actions runner on nix-cache02 or separate host?
  • Implement safe flake update workflow before or after full migration?