# NATS-Based Deployment Service
## Overview
Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.
## Goals
1. **On-demand deployment** - Trigger config updates immediately via NATS message
2. **Targeted deployment** - Deploy to specific hosts or all hosts
3. **Branch/revision support** - Test feature branches before merging to master
4. **MCP integration** - Allow Claude Code to trigger deployments during development
## Current State
- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: `nixos-rebuild-test <action> <branch>` helper exists on all hosts
- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)
## Architecture
```
┌─────────────┐                           ┌─────────────┐
│  MCP Tool   │ deploy.test.>             │  Admin CLI  │ deploy.test.> + deploy.prod.>
│  (claude)   │────────────┐   ┌──────────│  (torjus)   │
└─────────────┘            │   │          └─────────────┘
                           ▼   ▼
                    ┌──────────────┐
                    │    nats1     │
                    │   (authz)    │
                    └──────┬───────┘
          ┌────────────────┼────────────────┐
          │                │                │
          ▼                ▼                ▼
    ┌──────────┐     ┌──────────┐     ┌──────────┐
    │ template1│     │   ns1    │     │   ha1    │
    │ tier=test│     │ tier=prod│     │ tier=prod│
    └──────────┘     └──────────┘     └──────────┘
```
## Repository Structure
The project lives in a **separate repository** (e.g., `homelab-deploy`) containing:
```
homelab-deploy/
├── flake.nix              # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go        # CLI entrypoint with subcommands
├── internal/
│   ├── listener/          # Listener mode logic
│   ├── mcp/               # MCP server mode logic
│   └── deploy/            # Shared deployment logic
└── nixos/
    └── module.nix         # NixOS module for listener service
```
The `nixos-servers` repo then imports this flake as an input and uses its NixOS module.
## Single Binary with Subcommands
The `homelab-deploy` binary supports multiple modes:
```bash
# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222
# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222
# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch # single host
homelab-deploy deploy --tier test --all --action boot # all test hosts
homelab-deploy deploy --tier prod --all --action boot # all prod hosts (admin only)
homelab-deploy deploy --tier prod --role dns --action switch # all prod dns hosts
homelab-deploy status
```
## Components
### Listener Mode
A systemd service on each host that:
- Subscribes to multiple subjects for targeted and group deployments
- Validates incoming messages (revision, action)
- Executes `nixos-rebuild` with specified parameters
- Reports status back via NATS

**Subject structure:**
```
deploy.<tier>.<hostname> # specific host (e.g., deploy.prod.ns1)
deploy.<tier>.all # all hosts in tier (e.g., deploy.test.all)
deploy.<tier>.role.<role> # all hosts with role in tier (e.g., deploy.prod.role.dns)
```
**Listener subscriptions** (based on `homelab.host` config):
- `deploy.<tier>.<hostname>` - direct messages to this host
- `deploy.<tier>.all` - broadcast to all hosts in tier
- `deploy.<tier>.role.<role>` - broadcast to hosts with matching role (if role is set)

Example: ns1 with `tier=prod, role=dns` subscribes to:
- `deploy.prod.ns1`
- `deploy.prod.all`
- `deploy.prod.role.dns`
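
A minimal sketch of how the listener might derive and register these subscriptions with the nats.go client (function names are illustrative; `tier`, `hostname`, and `role` would come from flags set by the NixOS module):

```go
// Sketch: deriving this host's subscriptions from its metadata.
package listener

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

// subjects follows the deploy.<tier>.<target> layout described above.
func subjects(tier, hostname, role string) []string {
	subs := []string{
		fmt.Sprintf("deploy.%s.%s", tier, hostname), // direct to this host
		fmt.Sprintf("deploy.%s.all", tier),          // tier-wide broadcast
	}
	if role != "" {
		subs = append(subs, fmt.Sprintf("deploy.%s.role.%s", tier, role)) // role broadcast
	}
	return subs
}

// subscribe registers the deploy handler on every derived subject.
func subscribe(nc *nats.Conn, tier, hostname, role string, handler nats.MsgHandler) error {
	for _, subj := range subjects(tier, hostname, role) {
		if _, err := nc.Subscribe(subj, handler); err != nil {
			return fmt.Errorf("subscribe %s: %w", subj, err)
		}
	}
	return nil
}
```
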
**NixOS module configuration:**
```nix
services.homelab-deploy.listener = {
  enable = true;
  timeout = 600; # seconds, default 10 minutes
};
```
The listener reads tier and role from `config.homelab.host` (see Host Metadata below).

**Request message format:**
```json
{
"action": "switch" | "boot" | "test" | "dry-activate",
"revision": "master" | "feature-branch" | "abc123...",
"reply_to": "deploy.responses.<request-id>"
}
```
**Response message format:**
```json
{
"status": "accepted" | "rejected" | "started" | "completed" | "failed",
"error": "invalid_revision" | "already_running" | "build_failed" | null,
"message": "human-readable details"
}
```
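
These schemas map naturally onto Go types in `internal/deploy`; a sketch (type and field names are assumptions):

```go
// Sketch: wire types for the request/response schemas above.
package deploy

// DeployRequest is published to a deploy.<tier>.<target> subject.
type DeployRequest struct {
	Action   string `json:"action"`   // "switch", "boot", "test", or "dry-activate"
	Revision string `json:"revision"` // branch name or commit hash
	ReplyTo  string `json:"reply_to"` // deploy.responses.<request-id>
}

// DeployResponse is published to the request's reply_to subject.
type DeployResponse struct {
	Status  string `json:"status"`          // accepted, rejected, started, completed, failed
	Error   string `json:"error,omitempty"` // invalid_revision, already_running, build_failed
	Message string `json:"message"`         // human-readable details
}
```
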
**Request/Reply flow:**
1. MCP/CLI sends deploy request with unique `reply_to` subject
2. Listener validates request (e.g., `git ls-remote` to check revision exists)
3. Listener sends immediate response:
- `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}`, or
- `{"status": "started", "message": "starting nixos-rebuild switch"}`
4. If started, listener runs nixos-rebuild
5. Listener sends final response:
- `{"status": "completed", "message": "successfully switched to generation 42"}`, or
- `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}`

This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.
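
A listener-side sketch of this two-phase flow, assuming the wire types above; the module path and the stubbed `validateRevision`/`runRebuild` helpers are placeholders for the git check and the nixos-rebuild execution:

```go
// Sketch: handling a deploy request with immediate and final replies.
package listener

import (
	"encoding/json"

	"github.com/nats-io/nats.go"

	"homelab-deploy/internal/deploy"
)

type Listener struct {
	nc *nats.Conn
}

func (l *Listener) handleDeploy(msg *nats.Msg) {
	var req deploy.DeployRequest
	if err := json.Unmarshal(msg.Data, &req); err != nil || req.ReplyTo == "" {
		return // malformed request: nowhere to reply to
	}

	// Phase 1: validate and reply immediately.
	if err := l.validateRevision(req.Revision); err != nil {
		l.reply(req.ReplyTo, deploy.DeployResponse{
			Status: "rejected", Error: "invalid_revision", Message: err.Error(),
		})
		return
	}
	l.reply(req.ReplyTo, deploy.DeployResponse{
		Status: "started", Message: "starting nixos-rebuild " + req.Action,
	})

	// Phase 2: run the rebuild, then report the outcome.
	if err := l.runRebuild(req.Action, req.Revision); err != nil {
		l.reply(req.ReplyTo, deploy.DeployResponse{
			Status: "failed", Error: "build_failed", Message: err.Error(),
		})
		return
	}
	l.reply(req.ReplyTo, deploy.DeployResponse{
		Status: "completed", Message: "nixos-rebuild " + req.Action + " succeeded",
	})
}

func (l *Listener) reply(subject string, resp deploy.DeployResponse) {
	data, _ := json.Marshal(resp)
	_ = l.nc.Publish(subject, data)
}

// validateRevision checks format and existence (e.g. git ls-remote); stubbed here.
func (l *Listener) validateRevision(rev string) error { return nil }

// runRebuild fetches the revision and executes nixos-rebuild; stubbed here.
func (l *Listener) runRebuild(action, rev string) error { return nil }
```
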
### MCP Mode
Runs as an MCP server providing tools for Claude Code.

**Tools:**

| Tool | Description | Tier Access |
|------|-------------|-------------|
| `deploy` | Deploy to test hosts (individual, all, or by role) | test only |
| `deploy_admin` | Deploy to any host (requires `--enable-admin` flag) | test + prod |
| `deploy_status` | Check deployment status/history | n/a |
| `list_hosts` | List available deployment targets | n/a |
**CLI flags:**
```bash
# Default: only test-tier deployments available
homelab-deploy mcp --nats-url nats://nats1:4222
# Enable admin tool (requires admin NKey to be configured)
homelab-deploy mcp --nats-url nats://nats1:4222 --enable-admin --admin-nkey-file /path/to/admin.nkey
```
**Security layers:**
1. **MCP flag**: `deploy_admin` tool only exposed when `--enable-admin` is passed
2. **NATS authz**: Even if tool is exposed, NATS rejects publishes without valid admin NKey
3. **Claude Code permissions**: Can set `mcp__homelab-deploy__deploy_admin` to `ask` mode for confirmation popup

By default, the MCP only loads test-tier credentials and exposes the `deploy` tool. Claude can:
- Deploy to individual test hosts
- Deploy to all test hosts at once (`deploy.test.all`)
- Deploy to test hosts by role (`deploy.test.role.<role>`)
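
A sketch of how the `mcp` subcommand might gate tool registration on the flag (flag names mirror the CLI above; the MCP server wiring itself is omitted):

```go
// Sketch: opt-in exposure of the admin tool in MCP mode.
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	natsURL := flag.String("nats-url", "nats://nats1:4222", "NATS server URL")
	enableAdmin := flag.Bool("enable-admin", false, "expose the deploy_admin tool")
	adminNkeyFile := flag.String("admin-nkey-file", "", "path to the admin NKey seed")
	flag.Parse()

	// Layer 1: deploy_admin is only registered when explicitly enabled.
	tools := []string{"deploy", "deploy_status", "list_hosts"}
	if *enableAdmin {
		if *adminNkeyFile == "" {
			fmt.Fprintln(os.Stderr, "--enable-admin requires --admin-nkey-file")
			os.Exit(1)
		}
		tools = append(tools, "deploy_admin")
	}

	// Layer 2 (NATS authz) and layer 3 (Claude Code permissions) are
	// enforced outside this process even if the tool is exposed.
	fmt.Printf("serving MCP with tools %v against %s\n", tools, *natsURL)
	// ... construct the MCP server and register a handler per tool ...
}
```
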
### Tiered Permissions
Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:

**NATS user configuration (on nats1):**
```nix
accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC..."; # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ..."; # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF..."; # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};
```

**Host tier assignments** (via `homelab.host.tier`):

| Tier | Hosts |
|------|-------|
| test | template1, nix-cache01, future test hosts |
| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. |

**Example deployment scenarios:**

| Command | Subject | MCP | Admin |
|---------|---------|-----|-------|
| Deploy to ns1 | `deploy.prod.ns1` | ❌ | ✅ |
| Deploy to template1 | `deploy.test.template1` | ✅ | ✅ |
| Deploy to all test hosts | `deploy.test.all` | ✅ | ✅ |
| Deploy to all prod hosts | `deploy.prod.all` | ❌ | ✅ |
| Deploy to all DNS servers | `deploy.prod.role.dns` | ❌ | ✅ |
All NKeys are stored in Vault: the MCP gets the limited test-tier credentials, while the admin CLI gets the full-access credentials.
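
On the publishing side, both the MCP and the admin CLI would connect with their respective NKey and listen on a unique reply subject for the two-phase responses. A sketch with nats.go; the seed-file path and request-ID scheme are illustrative:

```go
// Sketch: publishing a deploy request and collecting both replies.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// NKey seed fetched from Vault; the path here is illustrative.
	nkey, err := nats.NkeyOptionFromSeed("/run/secrets/mcp-deployer.nk")
	if err != nil {
		log.Fatal(err)
	}
	nc, err := nats.Connect("nats://nats1:4222", nkey)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Unique reply subject inside the permitted deploy.responses.> space.
	replyTo := fmt.Sprintf("deploy.responses.%d", time.Now().UnixNano())
	sub, err := nc.SubscribeSync(replyTo)
	if err != nil {
		log.Fatal(err)
	}

	// Publishing to a prod subject with the mcp-deployer NKey would be
	// dropped by NATS authz (surfaced as an async permissions violation).
	req, _ := json.Marshal(map[string]string{
		"action": "switch", "revision": "master", "reply_to": replyTo,
	})
	if err := nc.Publish("deploy.test.template1", req); err != nil {
		log.Fatal(err)
	}

	// Expect started + completed/failed; a rejected request yields only
	// one reply, so a real client would stop on terminal statuses.
	for i := 0; i < 2; i++ {
		msg, err := sub.NextMsg(10 * time.Minute)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("reply: %s\n", msg.Data)
	}
}
```
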
### Host Metadata
Rather than defining `tier` in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.

**Module definition (in `modules/homelab/host.nix`):**
```nix
homelab.host = {
  tier = lib.mkOption {
    type = lib.types.enum [ "test" "prod" ];
    default = "prod";
    description = "Deployment tier - controls which credentials can deploy to this host";
  };
  priority = lib.mkOption {
    type = lib.types.enum [ "high" "low" ];
    default = "high";
    description = "Alerting priority - low priority hosts have relaxed thresholds";
  };
  role = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
    description = "Primary role of this host (dns, database, monitoring, etc.)";
  };
  labels = lib.mkOption {
    type = lib.types.attrsOf lib.types.str;
    default = { };
    description = "Additional free-form labels";
  };
};
```
**Consumers:**
- `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription
- Prometheus scrape config reads `priority`, `role`, `labels` for target labels
- Future services can consume the same metadata

**Example host config:**
```nix
# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";       # can be deployed by MCP
  priority = "low";    # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod";       # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```
## Implementation Steps
### Phase 1: Core Binary + Listener
1. **Create homelab-deploy repository**
- Initialize Go module
- Set up flake.nix with Go package build
2. **Implement listener mode**
- NATS subscription logic
- nixos-rebuild execution
- Status reporting via NATS reply
3. **Create NixOS module**
- Systemd service definition
- Configuration options (hostname, NATS URL, NKey path)
- Vault secret integration for NKeys
4. **Create `homelab.host` module** (in nixos-servers)
- Define `tier`, `priority`, `role`, `labels` options
- This module is shared with Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`)
5. **Integrate with nixos-servers**
- Add flake input for homelab-deploy
- Import listener module in `system/`
- Set `homelab.host.tier` per host (test vs prod)
6. **Configure NATS tiered permissions**
- Add deployer users to nats1 config (mcp-deployer, admin-deployer)
- Set up subject ACLs per user (test-only vs full access)
- Add deployer NKeys to Vault
- Create Terraform resources for NKey secrets
### Phase 2: MCP + CLI
7. **Implement MCP mode**
- MCP server with deploy/status tools
- Request/reply pattern for deployment feedback
8. **Implement CLI commands**
- `deploy` command for manual deployments
- `status` command to check deployment state
9. **Configure Claude Code**
- Add MCP server to configuration
- Document usage
### Phase 3: Enhancements
10. Add deployment locking to prevent concurrent deploys (see the sketch after this list)
11. Prometheus metrics for deployment status
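
For the locking enhancement, a non-blocking `sync.Mutex.TryLock` (Go 1.18+) per listener is likely sufficient, since at most one rebuild should run per host; it maps directly to the `already_running` error in the response schema. A sketch:

```go
// Sketch: rejecting concurrent deploys with a non-blocking lock.
package listener

import "sync"

var deployMu sync.Mutex

// tryDeploy runs at most one rebuild at a time; a second request while
// one is in flight maps to status=rejected, error=already_running.
func tryDeploy(run func() error) (started bool, err error) {
	if !deployMu.TryLock() {
		return false, nil
	}
	defer deployMu.Unlock()
	return true, run()
}
```
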
## Security Considerations
- **Privilege escalation**: Listener runs as root to execute nixos-rebuild
- **Input validation**: Strictly validate the revision format (branch name or commit hash) before it reaches a shell; see the sketch after this list
- **Rate limiting**: Prevent rapid-fire deployments
- **Audit logging**: Log all deployment requests with source identity
- **Network isolation**: NATS only accessible from internal network
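
For the input-validation point, a conservative allowlist pattern plus a `git ls-remote` lookup keeps injection risk low. A sketch; the exact pattern is an assumption about acceptable ref names:

```go
// Sketch: strict revision validation before anything reaches nixos-rebuild.
package deploy

import (
	"fmt"
	"os/exec"
	"regexp"
)

// Allow 40-char commit hashes or simple branch names; anything fancier
// (refspecs, option-like strings, path traversal) is rejected outright.
// The leading alphanumeric also prevents option injection via "-...".
var revisionPattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]{0,127}$`)

func ValidateRevision(repoURL, revision string) error {
	if !revisionPattern.MatchString(revision) {
		return fmt.Errorf("invalid revision format: %q", revision)
	}
	// Confirm the revision actually exists on the remote.
	out, err := exec.Command("git", "ls-remote", repoURL, revision).Output()
	if err != nil {
		return fmt.Errorf("git ls-remote: %w", err)
	}
	if len(out) == 0 {
		// ls-remote only matches refs; a bare commit hash yields no output
		// and would need a separate existence check after fetch.
		return fmt.Errorf("revision %q not found on remote", revision)
	}
	return nil
}
```
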
## Decisions
All open questions have been resolved. See Notes section for decision rationale.
## Notes
- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp)
- **Timeout decision**: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times.
- **Rollback**: Not needed as a separate feature - deploying an older commit hash effectively rolls back.
- **Offline hosts**: No message persistence - if host is offline, deploy fails. Daily auto-upgrade is the safety net. Avoids complexity of JetStream deduplication (host coming online and applying 10 queued updates instead of just the latest).
- **Deploy history**: Use existing Loki - listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- **Naming**: `homelab-deploy` - ties it to the infrastructure rather than implementation details.