# Bare Metal Forgejo Actions Runner on nix-cache02
## Goal
Add a second Forgejo Actions runner instance on nix-cache02 that executes jobs directly on the host (bare metal). This allows CI builds to populate the nix binary cache automatically, reducing reliance on manually triggered builds before deployments.
## Motivation
Currently the workflow for updating a flake input (e.g. nixos-exporter) is:
1. Update flake lock
2. Push to master
3. Manually trigger a build on nix-cache02 (or wait for the scheduled builder)
4. Deploy to hosts

With a bare metal runner, repos like nixos-exporter can have CI workflows that run `nix build` on nix-cache02 itself. The build outputs land in the host's Nix store, which harmonia serves as the binary cache, so by the time hosts auto-upgrade everything is already cached.
## Design
### Two Runner Instances
- **actions1** (existing) — Container-based, available to all Forgejo repos. Unchanged.
- **actions2** (new) — Host-based, restricted to trusted repos only via Forgejo runner scoping.
### Trusted Repos
Repos that should be allowed to use the bare metal runner:
- `torjus/nixos-servers`
- `torjus/nixos-exporter`
- `torjus/nixos` (gunter/magicman configs)
- Other repos with nix builds that benefit from cache population (add as needed)

Restriction is configured in the Forgejo web UI when registering the runner: scope it to specific repos or the org.
### Label Configuration
The new instance would use a host label:
```nix
labels = [ "native:host" ];
```
Workflow files in trusted repos would target this with `runs-on: native`.
### Host Packages
The runner needs nix and basic tools available:
```nix
hostPackages = with pkgs; [
  bash
  coreutils
  curl
  gawk
  gitMinimal
  gnused
  nodejs
  wget
  nix
];
```
## Security Analysis
### What the runner CAN access
- **Nix store** — Can read store paths and add new ones through the Nix daemon. This is the whole point; harmonia serves the store to all hosts.
- **Network** — Full network access during job execution.
- **World-readable files** — Standard for any process on the system.
### What the runner CANNOT access
- **Cache signing key** — `/run/secrets/cache-secret` is root-owned with mode `0400`. Harmonia signs store paths when it serves them, not when they are written to the store, so the runner never needs the key.
- **Vault AppRole credentials** — `/var/lib/vault/approle/` is root-owned.
- **Other vault secrets** — All in `/run/secrets/` with restrictive permissions.
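
The signing arrangement described above could look roughly like this on nix-cache02. This is only a sketch: it assumes the nixpkgs `services.harmonia` module and its `signKeyPaths` option (option names have changed between nixpkgs releases), and the actual harmonia configuration is not part of this plan.

```nix
{ ... }:
{
  services.harmonia = {
    enable = true;
    # Assumption: option name as in the nixpkgs harmonia module; older
    # releases used the singular `signKeyPath`.
    # The key path is the one listed above. Only the harmonia service
    # reads it; the runner user has no access to it.
    signKeyPaths = [ "/run/secrets/cache-secret" ];
  };
}
```

Because signing happens at serve time, a job that adds paths to the store does not need, and cannot obtain, the key.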
### Mitigations
- **Trusted repos only** — Forgejo runner scoping restricts which repos can submit jobs. Only repos we control should have access.
- **DynamicUser** — The runner service uses systemd DynamicUser, so there is no persistent user account. systemd allocates an ephemeral UID each time the service starts.
- **Separate instance** — Container-based jobs (untrusted repos) remain on actions1 and never get host access.
### Accepted Risks
- A compromised trusted repo could inject bad derivations into the nix store/cache. This is an accepted risk since those repos already have deploy access to production hosts.
- Jobs can consume host resources (CPU, memory, disk). The `runner.capacity` setting limits concurrent jobs.
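
If `runner.capacity` alone turns out to be too coarse, systemd resource controls are one option. The following is a minimal sketch, assuming the nixpkgs module names the unit `gitea-runner-<instance>`; the limit values are hypothetical placeholders, not measured recommendations.

```nix
{ ... }:
{
  # Assumption: the unit for instance "actions2" is "gitea-runner-actions2".
  systemd.services."gitea-runner-actions2".serviceConfig = {
    MemoryMax = "16G"; # hypothetical hard memory cap for the runner and its job steps
    CPUQuota = "400%"; # hypothetical; 100% equals one full CPU
  };
}
```

Note that `nix build` hands the actual builds to the Nix daemon, which runs in its own unit, so these limits would cover the runner process and job steps but not the builds themselves. Build parallelism is governed by Nix settings such as `max-jobs` and `cores` instead.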
## Implementation
### 1. NixOS Configuration
**File:** `hosts/nix-cache02/actions-runner.nix`

Add a second instance alongside the existing overrides:
```nix
{ pkgs, ... }:
{
  # ... existing actions1 overrides ...
  services.gitea-actions-runner.instances.actions2 = {
    enable = true;
    name = "nix-cache02-native";
    url = "https://code.t-juice.club";
    tokenFile = "/run/secrets/forgejo-runner-token-native";
    labels = [ "native:host" ];
    hostPackages = with pkgs; [
      bash coreutils curl gawk gitMinimal gnused nodejs wget nix
    ];
    settings = {
      runner.capacity = 4;
      cache = {
        enabled = true;
        dir = "/var/lib/gitea-runner/actions2/cache";
      };
    };
  };
}
```
### 2. Vault Secret
The native runner needs its own registration token (separate from actions1):
- Add `hosts/nix-cache02/forgejo-runner-token-native` to `terraform/vault/secrets.tf`
- Add `forgejo_runner_token_native` variable to `terraform/vault/variables.tf`
- Add vault secret config in `actions-runner.nix` pointing to the new path
### 3. Forgejo Setup
1. Generate a new runner token in Forgejo, scoped to trusted repos only
2. Store in Vault: `bao kv put secret/hosts/nix-cache02/forgejo-runner-token-native token=<token>`
3. Set the tfvar and run `tofu apply` in `terraform/vault/`
### 4. Example Workflow
In a trusted repo (e.g. nixos-exporter):
```yaml
name: Build
on: [push]
jobs:
  build:
    runs-on: native
    steps:
      - uses: actions/checkout@v4
      - run: nix build
```
## Open Questions
- Should `hostPackages` include additional tools (e.g. `cachix`, `nix-prefetch-*`)?
- Should we set resource limits on the runner (systemd MemoryMax, CPUQuota)?
- Do we want a separate capacity for the native runner vs container runner, or is 4 fine for both?