# NATS-Based Deployment Service

## Overview
Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.
## Goals

- **On-demand deployment** - Trigger config updates immediately via NATS message
- **Targeted deployment** - Deploy to specific hosts or all hosts
- **Branch/revision support** - Test feature branches before merging to master
- **MCP integration** - Allow Claude Code to trigger deployments during development
## Current State

- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: `nixos-rebuild-test <action> <branch>` helper exists on all hosts
- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)
## Architecture

```
┌─────────────┐                          ┌─────────────┐
│  MCP Tool   │ deploy.test.>            │  Admin CLI  │ deploy.test.> + deploy.prod.>
│  (claude)   │────────────┐   ┌─────────│  (torjus)   │
└─────────────┘            │   │         └─────────────┘
                           ▼   ▼
                    ┌──────────────┐
                    │    nats1     │
                    │   (authz)    │
                    └──────┬───────┘
                           │
         ┌─────────────────┼─────────────────┐
         │                 │                 │
         ▼                 ▼                 ▼
   ┌──────────┐      ┌──────────┐      ┌──────────┐
   │ template1│      │   ns1    │      │   ha1    │
   │ tier=test│      │ tier=prod│      │ tier=prod│
   └──────────┘      └──────────┘      └──────────┘
```
## Repository Structure

The project lives in a separate repository (e.g., homelab-deploy) containing:

```
homelab-deploy/
├── flake.nix            # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go      # CLI entrypoint with subcommands
├── internal/
│   ├── listener/        # Listener mode logic
│   ├── mcp/             # MCP server mode logic
│   └── deploy/          # Shared deployment logic
└── nixos/
    └── module.nix       # NixOS module for listener service
```
The nixos-servers repo imports this flake as an input and uses its NixOS module.
## Single Binary with Subcommands

The `homelab-deploy` binary supports multiple modes:

```shell
# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222

# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222

# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch   # single host
homelab-deploy deploy --tier test --all --action boot          # all test hosts
homelab-deploy deploy --tier prod --all --action boot          # all prod hosts (admin only)
homelab-deploy deploy --tier prod --role dns --action switch   # all prod dns hosts
homelab-deploy status
```
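As a sketch of how these deploy flags could map to publish subjects, here is a hypothetical `buildSubject` helper (illustrative only; the real CLI would also need to resolve a host's tier from its `homelab.host` metadata):

```go
package main

import (
	"errors"
	"fmt"
)

// buildSubject maps deploy-command flags to a NATS subject following the
// deploy.<tier>.<target> scheme. Hypothetical helper, not the repo's
// actual code.
func buildSubject(tier, host, role string, all bool) (string, error) {
	switch {
	case host != "":
		return fmt.Sprintf("deploy.%s.%s", tier, host), nil
	case role != "":
		return fmt.Sprintf("deploy.%s.role.%s", tier, role), nil
	case all:
		return fmt.Sprintf("deploy.%s.all", tier), nil
	default:
		return "", errors.New("specify a hostname, --role, or --all")
	}
}
```

For example, `deploy --tier prod --role dns` would publish to `deploy.prod.role.dns`, which only the admin credentials are permitted to do.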
## Components

### Listener Mode
A systemd service on each host that:

- Subscribes to multiple subjects for targeted and group deployments
- Validates incoming messages (revision, action)
- Executes `nixos-rebuild` with the specified parameters
- Reports status back via NATS
Subject structure:

```
deploy.<tier>.<hostname>     # specific host (e.g., deploy.prod.ns1)
deploy.<tier>.all            # all hosts in tier (e.g., deploy.test.all)
deploy.<tier>.role.<role>    # all hosts with role in tier (e.g., deploy.prod.role.dns)
```
Listener subscriptions (based on homelab.host config):

- `deploy.<tier>.<hostname>` - direct messages to this host
- `deploy.<tier>.all` - broadcast to all hosts in tier
- `deploy.<tier>.role.<role>` - broadcast to hosts with matching role (if role is set)
Example: ns1 with tier=prod, role=dns subscribes to:

- `deploy.prod.ns1`
- `deploy.prod.all`
- `deploy.prod.role.dns`
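The subscription list above can be derived mechanically from the host's metadata. A minimal sketch (the function name `subjectsFor` is illustrative; the real listener would read these values from config rendered by the NixOS module):

```go
package main

import "fmt"

// subjectsFor derives a listener's subscriptions from the host's tier
// and optional role, per the subject scheme above.
func subjectsFor(hostname, tier, role string) []string {
	subs := []string{
		fmt.Sprintf("deploy.%s.%s", tier, hostname), // direct messages
		fmt.Sprintf("deploy.%s.all", tier),          // tier-wide broadcast
	}
	if role != "" {
		subs = append(subs, fmt.Sprintf("deploy.%s.role.%s", tier, role))
	}
	return subs
}
```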
NixOS module configuration:

```nix
services.homelab-deploy.listener = {
  enable = true;
  timeout = 600; # seconds, default 10 minutes
};
```
The listener reads `tier` and `role` from `config.homelab.host` (see Host Metadata below).
Request message format:

```json
{
  "action": "switch" | "boot" | "test" | "dry-activate",
  "revision": "master" | "feature-branch" | "abc123...",
  "reply_to": "deploy.responses.<request-id>"
}
```
Response message format:

```json
{
  "status": "accepted" | "rejected" | "started" | "completed" | "failed",
  "error": "invalid_revision" | "already_running" | "build_failed" | null,
  "message": "human-readable details"
}
```
Request/Reply flow:

1. MCP/CLI sends deploy request with a unique `reply_to` subject
2. Listener validates the request (e.g., `git ls-remote` to check the revision exists)
3. Listener sends an immediate response: `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}`, or `{"status": "started", "message": "starting nixos-rebuild switch"}`
4. If started, listener runs nixos-rebuild
5. Listener sends a final response: `{"status": "completed", "message": "successfully switched to generation 42"}`, or `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}`
This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.
### MCP Mode
Runs as an MCP server providing tools for Claude Code.
Tools:

| Tool | Description | Tier Access |
|---|---|---|
| `deploy` | Deploy to test hosts (individual, all, or by role) | test only |
| `deploy_admin` | Deploy to any host (requires `--enable-admin` flag) | test + prod |
| `deploy_status` | Check deployment status/history | n/a |
| `list_hosts` | List available deployment targets | n/a |
CLI flags:

```shell
# Default: only test-tier deployments available
homelab-deploy mcp --nats-url nats://nats1:4222

# Enable admin tool (requires admin NKey to be configured)
homelab-deploy mcp --nats-url nats://nats1:4222 --enable-admin --admin-nkey-file /path/to/admin.nkey
```
Security layers:

1. **MCP flag**: `deploy_admin` tool only exposed when `--enable-admin` is passed
2. **NATS authz**: Even if the tool is exposed, NATS rejects publishes without a valid admin NKey
3. **Claude Code permissions**: Can set `mcp__homelab-deploy__deploy_admin` to `ask` mode for a confirmation popup
By default, the MCP server only loads test-tier credentials and exposes the `deploy` tool. Claude can:

- Deploy to individual test hosts
- Deploy to all test hosts at once (`deploy.test.all`)
- Deploy to test hosts by role (`deploy.test.role.<role>`)
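The first security layer, gating `deploy_admin` behind the CLI flag, is just a conditional at tool-registration time. A sketch of that logic only (`availableTools` is a hypothetical name; actual registration depends on the MCP library used):

```go
package main

// availableTools returns the MCP tool names to register. deploy_admin
// is only included when the server was started with --enable-admin,
// so by default the tool simply does not exist from Claude's view.
func availableTools(enableAdmin bool) []string {
	tools := []string{"deploy", "deploy_status", "list_hosts"}
	if enableAdmin {
		tools = append(tools, "deploy_admin")
	}
	return tools
}
```

Note this is defense in depth, not the only barrier: even if `deploy_admin` were exposed, the mcp-deployer NKey cannot publish to `deploy.prod.>`.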
### Tiered Permissions
Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:
NATS user configuration (on nats1):

```nix
accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC..."; # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ..."; # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF..."; # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};
```
Host tier assignments (via `homelab.host.tier`):
| Tier | Hosts |
|---|---|
| test | template1, nix-cache01, future test hosts |
| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. |
Example deployment scenarios:

| Command | Subject | MCP | Admin |
|---|---|---|---|
| Deploy to ns1 | `deploy.prod.ns1` | ❌ | ✅ |
| Deploy to template1 | `deploy.test.template1` | ✅ | ✅ |
| Deploy to all test hosts | `deploy.test.all` | ✅ | ✅ |
| Deploy to all prod hosts | `deploy.prod.all` | ❌ | ✅ |
| Deploy to all DNS servers | `deploy.prod.role.dns` | ❌ | ✅ |
All NKeys stored in Vault - MCP gets limited credentials, admin CLI gets full-access credentials.
## Host Metadata

Rather than defining tier in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.
Module definition (in `modules/homelab/host.nix`):

```nix
homelab.host = {
  tier = lib.mkOption {
    type = lib.types.enum [ "test" "prod" ];
    default = "prod";
    description = "Deployment tier - controls which credentials can deploy to this host";
  };
  priority = lib.mkOption {
    type = lib.types.enum [ "high" "low" ];
    default = "high";
    description = "Alerting priority - low priority hosts have relaxed thresholds";
  };
  role = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
    description = "Primary role of this host (dns, database, monitoring, etc.)";
  };
  labels = lib.mkOption {
    type = lib.types.attrsOf lib.types.str;
    default = { };
    description = "Additional free-form labels";
  };
};
```
Consumers:

- `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription
- Prometheus scrape config reads `priority`, `role`, `labels` for target labels
- Future services can consume the same metadata
Example host config:

```nix
# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";     # can be deployed by MCP
  priority = "low";  # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod";     # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```
## Implementation Steps

### Phase 1: Core Binary + Listener

- Create homelab-deploy repository
  - Initialize Go module
  - Set up flake.nix with Go package build
- Implement listener mode
  - NATS subscription logic
  - nixos-rebuild execution
  - Status reporting via NATS reply
- Create NixOS module
  - Systemd service definition
  - Configuration options (hostname, NATS URL, NKey path)
  - Vault secret integration for NKeys
- Create `homelab.host` module (in nixos-servers)
  - Define `tier`, `priority`, `role`, `labels` options
  - This module is shared with the Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`)
- Integrate with nixos-servers
  - Add flake input for homelab-deploy
  - Import listener module in `system/`
  - Set `homelab.host.tier` per host (test vs prod)
- Configure NATS tiered permissions
  - Add deployer users to nats1 config (mcp-deployer, admin-deployer)
  - Set up subject ACLs per user (test-only vs full access)
  - Add deployer NKeys to Vault
  - Create Terraform resources for NKey secrets
### Phase 2: MCP + CLI

- Implement MCP mode
  - MCP server with deploy/status tools
  - Request/reply pattern for deployment feedback
- Implement CLI commands
  - `deploy` command for manual deployments
  - `status` command to check deployment state
- Configure Claude Code
  - Add MCP server to configuration
  - Document usage
### Phase 3: Enhancements
- Add deployment locking (prevent concurrent deploys)
- Prometheus metrics for deployment status
## Security Considerations
- Privilege escalation: Listener runs as root to execute nixos-rebuild
- Input validation: Strictly validate revision format (branch name or commit hash)
- Rate limiting: Prevent rapid-fire deployments
- Audit logging: Log all deployment requests with source identity
- Network isolation: NATS only accessible from internal network
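For the input-validation point, one conservative approach is a single allowlist pattern covering both branch names and commit hashes. A sketch under assumptions (the exact pattern is illustrative, not the repo's actual rule; git's real ref-name rules are looser):

```go
package main

import "regexp"

// revisionPattern accepts a commit hash or a conservative branch name:
// a leading alphanumeric followed by up to 127 characters from a small
// safe set. Everything else - spaces, shell metacharacters, leading
// dots or dashes - is rejected before the value reaches git or
// nixos-rebuild.
var revisionPattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]{0,127}$`)

// validRevision reports whether rev is safe to pass along to git.
func validRevision(rev string) bool {
	return revisionPattern.MatchString(rev)
}
```

A revision that passes this check still gets verified against the remote (`git ls-remote`) before the listener reports `started`, so the regex only needs to be safe, not complete.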
## Decisions
All open questions have been resolved. See Notes section for decision rationale.
## Notes
- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp)
- Timeout decision: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times.
- Rollback: Not needed as a separate feature - deploying an older commit hash effectively rolls back.
- Offline hosts: No message persistence - if host is offline, deploy fails. Daily auto-upgrade is the safety net. Avoids complexity of JetStream deduplication (host coming online and applying 10 queued updates instead of just the latest).
- Deploy history: Use existing Loki - listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- Naming: `homelab-deploy` - ties it to the infrastructure rather than implementation details.