docs: add homelab-deploy plan, unify host metadata
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add plan for NATS-based deployment service (homelab-deploy) that enables on-demand NixOS configuration updates via messaging. Features tiered permissions (test/prod) enforced at NATS layer. Update prometheus-scrape-target-labels plan to share the homelab.host module for host metadata (tier, priority, role, labels) - single source of truth for both deployment tiers and prometheus labels. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/plans/nats-deploy-service.md (new file, 323 lines)
# NATS-Based Deployment Service

## Overview

Create a message-based deployment system that allows triggering NixOS configuration updates on demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.

## Goals

1. **On-demand deployment** - Trigger config updates immediately via NATS message
2. **Targeted deployment** - Deploy to specific hosts or all hosts
3. **Branch/revision support** - Test feature branches before merging to master
4. **MCP integration** - Allow Claude Code to trigger deployments during development

## Current State

- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: `nixos-rebuild-test <action> <branch>` helper exists on all hosts
- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)

## Architecture

```
┌─────────────┐                      ┌─────────────┐
│  MCP Tool   │ deploy.test.>        │  Admin CLI  │ deploy.test.> + deploy.prod.>
│  (claude)   │────────────┐   ┌─────│  (torjus)   │
└─────────────┘            │   │     └─────────────┘
                           ▼   ▼
                      ┌──────────────┐
                      │    nats1     │
                      │   (authz)    │
                      └──────┬───────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
            ▼                ▼                ▼
      ┌──────────┐      ┌──────────┐      ┌──────────┐
      │ template1│      │   ns1    │      │   ha1    │
      │ tier=test│      │ tier=prod│      │ tier=prod│
      └──────────┘      └──────────┘      └──────────┘
```

## Repository Structure

The project lives in a **separate repository** (e.g., `homelab-deploy`) containing:

```
homelab-deploy/
├── flake.nix              # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go        # CLI entrypoint with subcommands
├── internal/
│   ├── listener/          # Listener mode logic
│   ├── mcp/               # MCP server mode logic
│   └── deploy/            # Shared deployment logic
└── nixos/
    └── module.nix         # NixOS module for listener service
```

The nixos-servers repo imports this flake as an input and uses the NixOS module.
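
A flake along these lines could tie the pieces together (a sketch only; the description, version, nixpkgs branch, and `vendorHash` are placeholders, not values from the actual repo):

```nix
{
  description = "NATS-based deployment service for the homelab";

  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let pkgs = nixpkgs.legacyPackages.x86_64-linux; in
    {
      # The Go binary, built from this repo
      packages.x86_64-linux.default = pkgs.buildGoModule {
        pname = "homelab-deploy";
        version = "0.1.0";
        src = self;
        vendorHash = null; # fill in after the first build
      };

      # Listener service module, imported by nixos-servers
      nixosModules.default = import ./nixos/module.nix;
    };
}
```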

## Single Binary with Subcommands

The `homelab-deploy` binary supports multiple modes:

```bash
# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222

# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222

# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch
homelab-deploy deploy --all --action boot
homelab-deploy status
```

## Components

### Listener Mode

A systemd service on each host that:

- Subscribes to `deploy.<tier>.<hostname>` and `deploy.<tier>.all` subjects
- Validates incoming messages (revision, action)
- Executes `nixos-rebuild` with specified parameters
- Reports status back via NATS

**NixOS module configuration:**

```nix
services.homelab-deploy.listener = {
  enable = true;
  timeout = 600; # seconds, default 10 minutes
};
```

The listener reads its tier from `config.homelab.host.tier` (see Host Metadata below) and subscribes to tier-specific subjects (e.g., `deploy.prod.ns1` and `deploy.prod.all`).

**Request message format:**

```json
{
  "action": "switch" | "boot" | "test" | "dry-activate",
  "revision": "master" | "feature-branch" | "abc123...",
  "reply_to": "deploy.responses.<request-id>"
}
```

**Response message format:**

```json
{
  "status": "accepted" | "rejected" | "started" | "completed" | "failed",
  "error": "invalid_revision" | "already_running" | "build_failed" | null,
  "message": "human-readable details"
}
```

**Request/Reply flow:**

1. MCP/CLI sends deploy request with unique `reply_to` subject
2. Listener validates request (e.g., `git ls-remote` to check revision exists)
3. Listener sends immediate response:
   - `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}`, or
   - `{"status": "started", "message": "starting nixos-rebuild switch"}`
4. If started, listener runs nixos-rebuild
5. Listener sends final response:
   - `{"status": "completed", "message": "successfully switched to generation 42"}`, or
   - `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}`

This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.

### MCP Mode

Runs as an MCP server providing tools for Claude Code:

- `deploy` - Deploy to specific host(s) with optional revision
- `deploy_status` - Check deployment status/history
- `list_hosts` - List available deployment targets

The MCP server runs with limited credentials (test-tier only), so Claude can deploy to test hosts but not production.

### Tiered Permissions

Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:

**NATS user configuration (on nats1):**

```nix
accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC..."; # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ..."; # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF..."; # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};
```

**Host tier assignments** (via `homelab.host.tier`):

| Tier | Hosts |
|------|-------|
| test | template1, nix-cache01, future test hosts |
| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. |

**How it works:**

1. MCP tries to deploy to ns1 → publishes to `deploy.prod.ns1`
2. NATS server rejects publish (mcp-deployer lacks `deploy.prod.>` permission)
3. MCP tries to deploy to template1 → publishes to `deploy.test.template1`
4. NATS allows it, listener receives and executes

All NKeys stored in Vault - MCP gets limited credentials, admin CLI gets full-access credentials.

### Host Metadata

Rather than defining `tier` in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.

**Module definition (in `modules/homelab/host.nix`):**

```nix
{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };

    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };

    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels";
    };
  };
}
```

**Consumers:**

- `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription
- Prometheus scrape config reads `priority`, `role`, `labels` for target labels
- Future services can consume the same metadata

**Example host config:**

```nix
# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";    # can be deployed by MCP
  priority = "low"; # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod";    # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```
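
On the Prometheus side, the scrape config could derive target labels from the same module (a sketch; `mkTargetLabels` is a hypothetical helper, and the exact label names and host enumeration depend on the prometheus-scrape-target-labels plan):

```nix
# Hypothetical helper: build static_config labels from a host's metadata.
mkTargetLabels = hostCfg:
  {
    tier = hostCfg.homelab.host.tier;
    priority = hostCfg.homelab.host.priority;
  }
  // lib.optionalAttrs (hostCfg.homelab.host.role != null) {
    role = hostCfg.homelab.host.role;
  }
  // hostCfg.homelab.host.labels;
```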

## Implementation Steps

### Phase 1: Core Binary + Listener

1. **Create homelab-deploy repository**
   - Initialize Go module
   - Set up flake.nix with Go package build

2. **Implement listener mode**
   - NATS subscription logic
   - nixos-rebuild execution
   - Status reporting via NATS reply

3. **Create NixOS module**
   - Systemd service definition
   - Configuration options (hostname, NATS URL, NKey path)
   - Vault secret integration for NKeys

4. **Create `homelab.host` module** (in nixos-servers)
   - Define `tier`, `priority`, `role`, `labels` options
   - This module is shared with the Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`)

5. **Integrate with nixos-servers**
   - Add flake input for homelab-deploy
   - Import listener module in `system/`
   - Set `homelab.host.tier` per host (test vs prod)

6. **Configure NATS tiered permissions**
   - Add deployer users to nats1 config (mcp-deployer, admin-deployer)
   - Set up subject ACLs per user (test-only vs full access)
   - Add deployer NKeys to Vault
   - Create Terraform resources for NKey secrets

### Phase 2: MCP + CLI

7. **Implement MCP mode**
   - MCP server with deploy/status tools
   - Request/reply pattern for deployment feedback

8. **Implement CLI commands**
   - `deploy` command for manual deployments
   - `status` command to check deployment state

9. **Configure Claude Code**
   - Add MCP server to configuration
   - Document usage

### Phase 3: Enhancements

10. Add deployment locking (prevent concurrent deploys)
11. Prometheus metrics for deployment status

## Security Considerations

- **Privilege escalation**: The listener runs as root to execute nixos-rebuild, so the deploy subjects are effectively a root-level control plane and depend on the NATS ACLs above
- **Input validation**: Strictly validate revision format (branch name or commit hash)
- **Rate limiting**: Prevent rapid-fire deployments
- **Audit logging**: Log all deployment requests with source identity
- **Network isolation**: NATS only accessible from internal network

## Decisions

All open questions have been resolved. See Notes section for decision rationale.

## Notes

- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp)
- **Timeout decision**: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. A per-host override is available for hosts with known longer build times.
- **Rollback**: Not needed as a separate feature - deploying an older commit hash effectively rolls back.
- **Offline hosts**: No message persistence - if a host is offline, the deploy fails. The daily auto-upgrade is the safety net. This avoids the complexity of JetStream deduplication (a host coming online and applying 10 queued updates instead of just the latest).
- **Deploy history**: Use existing Loki - the listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- **Naming**: `homelab-deploy` - ties it to the infrastructure rather than implementation details.