From 07e86acbaa802bdadd28ce2cc3e384f6df6c8a95 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Tue, 10 Mar 2026 01:01:14 +0100
Subject: [PATCH] docs: add plan for bare metal actions runner on nix-cache02

Co-Authored-By: Claude Opus 4.6
---
 docs/plans/bare-metal-actions-runner.md | 155 ++++++++++++++++++++++++
 1 file changed, 155 insertions(+)
 create mode 100644 docs/plans/bare-metal-actions-runner.md

diff --git a/docs/plans/bare-metal-actions-runner.md b/docs/plans/bare-metal-actions-runner.md
new file mode 100644
index 0000000..1b3cdcd
--- /dev/null
+++ b/docs/plans/bare-metal-actions-runner.md
@@ -0,0 +1,155 @@
+# Bare Metal Forgejo Actions Runner on nix-cache02
+
+## Goal
+
+Add a second Forgejo Actions runner instance on nix-cache02 that executes jobs directly on the host (bare metal). This allows CI builds to populate the nix binary cache automatically, reducing reliance on manually triggered builds before deployments.
+
+## Motivation
+
+Currently the workflow for updating a flake input (e.g. nixos-exporter) is:
+
+1. Update flake lock
+2. Push to master
+3. Manually trigger a build on nix-cache02 (or wait for the scheduled builder)
+4. Deploy to hosts
+
+With a bare metal runner, repos like nixos-exporter can have CI workflows that run `nix build`, and those derivations automatically end up in the cache (served by harmonia). By the time hosts auto-upgrade, everything is already cached.
+
+## Design
+
+### Two Runner Instances
+
+- **actions1** (existing): Container-based, available to all Forgejo repos. Unchanged.
+- **actions2** (new): Host-based, restricted to trusted repos only via Forgejo runner scoping.
+
+### Trusted Repos
+
+Repos that should be allowed to use the bare metal runner:
+
+- `torjus/nixos-servers`
+- `torjus/nixos-exporter`
+- `torjus/nixos` (gunter/magicman configs)
+- Other repos with nix builds that benefit from cache population (add as needed)
+
+Restriction is configured in the Forgejo web UI when registering the runner: scope it to specific repos or the org.
+
+### Label Configuration
+
+The new instance would use a host label:
+
+```nix
+labels = [ "native:host" ];
+```
+
+Workflow files in trusted repos would target this with `runs-on: native`.
+
+### Host Packages
+
+The runner needs nix and basic tools available:
+
+```nix
+hostPackages = with pkgs; [
+  bash
+  coreutils
+  curl
+  gawk
+  gitMinimal
+  gnused
+  nodejs
+  wget
+  nix
+];
+```
+
+## Security Analysis
+
+### What the runner CAN access
+
+- **Nix store**: Can read and write derivations. This is the whole point; harmonia serves the store to all hosts.
+- **Network**: Full network access during job execution.
+- **World-readable files**: Standard for any process on the system.
+
+### What the runner CANNOT access
+
+- **Cache signing key**: `/run/secrets/cache-secret` is mode `0400` root-owned. Harmonia signs derivations on serve, not on store write.
+- **Vault AppRole credentials**: `/var/lib/vault/approle/` is root-owned.
+- **Other vault secrets**: All in `/run/secrets/` with restrictive permissions.
+
+### Mitigations
+
+- **Trusted repos only**: Forgejo runner scoping restricts which repos can submit jobs. Only repos we control should have access.
+- **DynamicUser**: The runner uses systemd DynamicUser, so no persistent user account. Each invocation gets an ephemeral UID.
+- **Separate instance**: Container-based jobs (untrusted repos) remain on actions1 and never get host access.
+
+### Accepted Risks
+
+- A compromised trusted repo could inject bad derivations into the nix store/cache. This is an accepted risk since those repos already have deploy access to production hosts.
+- Jobs can consume host resources (CPU, memory, disk). The `runner.capacity` setting limits concurrent jobs.
+
+## Implementation
+
+### 1. NixOS Configuration
+
+**File:** `hosts/nix-cache02/actions-runner.nix`
+
+Add a second instance alongside the existing overrides:
+
+```nix
+{ pkgs, ... }:
+{
+  # ... existing actions1 overrides ...
+
+  services.gitea-actions-runner.instances.actions2 = {
+    enable = true;
+    name = "nix-cache02-native";
+    url = "https://code.t-juice.club";
+    tokenFile = "/run/secrets/forgejo-runner-token-native";
+    labels = [ "native:host" ];
+    hostPackages = with pkgs; [
+      bash coreutils curl gawk gitMinimal gnused nodejs wget nix
+    ];
+    settings = {
+      runner.capacity = 4;
+      cache = {
+        enabled = true;
+        dir = "/var/lib/gitea-runner/actions2/cache";
+      };
+    };
+  };
+}
+```
+
+### 2. Vault Secret
+
+The native runner needs its own registration token (separate from actions1):
+
+- Add `hosts/nix-cache02/forgejo-runner-token-native` to `terraform/vault/secrets.tf`
+- Add `forgejo_runner_token_native` variable to `terraform/vault/variables.tf`
+- Add vault secret config in `actions-runner.nix` pointing to the new path
+
+### 3. Forgejo Setup
+
+1. Generate a new runner token in Forgejo, scoped to trusted repos only
+2. Store in Vault: `bao kv put secret/hosts/nix-cache02/forgejo-runner-token-native token=`
+3. Set the tfvar and run `tofu apply` in `terraform/vault/`
+
+### 4. Example Workflow
+
+In a trusted repo (e.g. nixos-exporter):
+
+```yaml
+name: Build
+on: [push]
+jobs:
+  build:
+    runs-on: native
+    steps:
+      - uses: actions/checkout@v4
+      - run: nix build
+```
+
+## Open Questions
+
+- Should `hostPackages` include additional tools (e.g. `cachix`, `nix-prefetch-*`)?
+- Should we set resource limits on the runner (systemd MemoryMax, CPUQuota)?
+- Do we want a separate capacity for the native runner vs container runner, or is 4 fine for both?
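
On the open question about resource limits, here is a minimal sketch of how `MemoryMax` and `CPUQuota` might be applied to the new instance. It assumes the NixOS module names the systemd unit `gitea-runner-actions2` after the instance (worth confirming with `systemctl list-units 'gitea-runner-*'` on the host). The limit values are hypothetical placeholders, not tuned recommendations.

```nix
{ ... }:
{
  # Assumption: the runner module creates a unit named
  # "gitea-runner-<instance>"; verify the real unit name on the host.
  systemd.services."gitea-runner-actions2".serviceConfig = {
    # Hypothetical cap on memory for the runner and its child processes.
    MemoryMax = "16G";
    # Hypothetical CPU cap; 100% corresponds to one full core.
    CPUQuota = "400%";
  };
}
```

One caveat for this design: on a multi-user Nix install, `nix build` hands the actual build work to the nix daemon, which runs under its own unit. Limits set on the runner unit would therefore constrain the job's shell steps but probably not the builds themselves, so capping build resource use may need to be handled on the daemon side instead.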