# NATS-Based Deployment Service

## Overview
Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.
## Goals

- **On-demand deployment** - trigger config updates immediately via a NATS message
- **Targeted deployment** - deploy to specific hosts or all hosts
- **Branch/revision support** - test feature branches before merging to master
- **MCP integration** - allow Claude Code to trigger deployments during development
## Current State

- **Auto-upgrade**: all hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: a `nixos-rebuild-test <action> <branch>` helper exists on all hosts
- **NATS**: running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)
## Architecture

```
┌─────────────┐                         ┌─────────────┐
│  MCP Tool   │ deploy.test.>           │  Admin CLI  │ deploy.test.> + deploy.prod.>
│  (claude)   │────────────┐     ┌──────│  (torjus)   │
└─────────────┘            │     │      └─────────────┘
                           ▼     ▼
                       ┌──────────────┐
                       │    nats1     │
                       │   (authz)    │
                       └──────┬───────┘
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
            ▼                 ▼                 ▼
      ┌──────────┐      ┌──────────┐      ┌──────────┐
      │ template1│      │   ns1    │      │   ha1    │
      │ tier=test│      │ tier=prod│      │ tier=prod│
      └──────────┘      └──────────┘      └──────────┘
```
## Repository Structure
The project lives in a separate repository (e.g., homelab-deploy) containing:
```
homelab-deploy/
├── flake.nix            # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go      # CLI entrypoint with subcommands
├── internal/
│   ├── listener/        # Listener mode logic
│   ├── mcp/             # MCP server mode logic
│   └── deploy/          # Shared deployment logic
└── nixos/
    └── module.nix       # NixOS module for listener service
```
The main nixos-servers repo imports this flake as an input and uses the NixOS module.
## Single Binary with Subcommands

The `homelab-deploy` binary supports multiple modes:
```sh
# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222

# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222

# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch
homelab-deploy deploy --all --action boot
homelab-deploy status
```
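A minimal sketch of how `cmd/homelab-deploy/main.go` could dispatch these subcommands using only the standard library; the `run*` functions are illustrative stand-ins for the `internal/` packages, not a committed API:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: homelab-deploy <listener|mcp|deploy|status> [flags]")
		os.Exit(2)
	}
	// Each subcommand parses its own flags (--hostname, --nats-url, ...).
	switch cmd, args := os.Args[1], os.Args[2:]; cmd {
	case "listener":
		runListener(args)
	case "mcp":
		runMCP(args)
	case "deploy":
		runDeploy(args)
	case "status":
		runStatus(args)
	default:
		fmt.Fprintf(os.Stderr, "unknown command %q\n", cmd)
		os.Exit(2)
	}
}

// Stubs standing in for internal/listener, internal/mcp, internal/deploy.
func runListener(args []string) {}
func runMCP(args []string)      {}
func runDeploy(args []string)   {}
func runStatus(args []string)   {}
```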
## Components

### Listener Mode

A systemd service on each host that:
- Subscribes to the `deploy.<tier>.<hostname>` and `deploy.<tier>.all` subjects
- Validates incoming messages (revision, action)
- Executes `nixos-rebuild` with the specified parameters
- Reports status back via NATS
NixOS module configuration:
```nix
services.homelab-deploy.listener = {
  enable = true;
  timeout = 600; # seconds, default 10 minutes
};
```
The listener reads its tier from `config.homelab.host.tier` (see Host Metadata below) and subscribes to tier-specific subjects (e.g., `deploy.prod.ns1` and `deploy.prod.all`).
Request message format:
```json
{
  "action": "switch" | "boot" | "test" | "dry-activate",
  "revision": "master" | "feature-branch" | "abc123...",
  "reply_to": "deploy.responses.<request-id>"
}
```
Response message format:
```json
{
  "status": "accepted" | "rejected" | "started" | "completed" | "failed",
  "error": "invalid_revision" | "already_running" | "build_failed" | null,
  "message": "human-readable details"
}
```
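These payloads map naturally onto Go structs in the shared `internal/deploy` package. A sketch; the type and field names are illustrative, but the JSON tags follow the formats above:

```go
package deploy

// DeployRequest is the payload published to deploy.<tier>.<host>.
type DeployRequest struct {
	Action   string `json:"action"`   // "switch", "boot", "test", or "dry-activate"
	Revision string `json:"revision"` // branch name or commit hash
	ReplyTo  string `json:"reply_to"` // unique deploy.responses.<request-id> subject
}

// DeployResponse is published by the listener to the reply_to subject.
type DeployResponse struct {
	Status  string `json:"status"`          // accepted | rejected | started | completed | failed
	Error   string `json:"error,omitempty"` // invalid_revision | already_running | build_failed
	Message string `json:"message"`         // human-readable details
}
```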
**Request/Reply flow:**

1. MCP/CLI sends a deploy request with a unique `reply_to` subject
2. Listener validates the request (e.g., `git ls-remote` to check that the revision exists)
3. Listener sends an immediate response: `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}` or `{"status": "started", "message": "starting nixos-rebuild switch"}`
4. If started, listener runs nixos-rebuild
5. Listener sends a final response: `{"status": "completed", "message": "successfully switched to generation 42"}` or `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}`
This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.
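A sketch of the listener's core loop with nats.go, reusing the structs above; the module path, repo URL, flake URI syntax, and exact `nixos-rebuild` invocation are assumptions to be aligned with the existing `nixos-rebuild-test` helper. The timeout comes from the `services.homelab-deploy.listener.timeout` option:

```go
package listener

import (
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
	"time"

	"github.com/nats-io/nats.go"

	"example.com/homelab-deploy/internal/deploy" // hypothetical module path
)

var repoURL = "https://example.com/nixos-servers.git" // assumed config-repo URL

// Run subscribes to this host's tier-specific subjects and serves requests.
func Run(nc *nats.Conn, tier, hostname string, timeout time.Duration) error {
	for _, subj := range []string{
		fmt.Sprintf("deploy.%s.%s", tier, hostname),
		fmt.Sprintf("deploy.%s.all", tier),
	} {
		if _, err := nc.Subscribe(subj, func(m *nats.Msg) { handle(nc, m, timeout) }); err != nil {
			return err
		}
	}
	select {} // block forever; the real service would handle signals
}

func handle(nc *nats.Conn, m *nats.Msg, timeout time.Duration) {
	var req deploy.DeployRequest
	if err := json.Unmarshal(m.Data, &req); err != nil {
		return // malformed request, no reply_to to answer on
	}
	reply := func(r deploy.DeployResponse) {
		b, _ := json.Marshal(r)
		nc.Publish(req.ReplyTo, b)
	}

	// Validate before building: does the revision exist? (ls-remote covers
	// branches/tags; commit hashes would need a fetch-based check.)
	if err := exec.Command("git", "ls-remote", "--exit-code", repoURL, req.Revision).Run(); err != nil {
		reply(deploy.DeployResponse{Status: "rejected", Error: "invalid_revision",
			Message: fmt.Sprintf("revision %q not found", req.Revision)})
		return
	}
	reply(deploy.DeployResponse{Status: "started", Message: "starting nixos-rebuild " + req.Action})

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	// Flake URI is illustrative; match whatever nixos-rebuild-test does.
	out, err := exec.CommandContext(ctx, "nixos-rebuild", req.Action,
		"--flake", "git+"+repoURL+"?rev="+req.Revision).CombinedOutput()
	if err != nil {
		reply(deploy.DeployResponse{Status: "failed", Error: "build_failed",
			Message: fmt.Sprintf("nixos-rebuild failed: %v: %s", err, out)})
		return
	}
	reply(deploy.DeployResponse{Status: "completed",
		Message: "nixos-rebuild " + req.Action + " succeeded"})
}
```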
### MCP Mode

Runs as an MCP server providing tools for Claude Code:

- `deploy` - deploy to specific host(s) with an optional revision
- `deploy_status` - check deployment status/history
- `list_hosts` - list available deployment targets
The MCP server runs with limited credentials (test-tier only), so Claude can deploy to test hosts but not production.
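The MCP `deploy` tool and the CLI `deploy` command can share the same request core in `internal/deploy`. A sketch; the tier argument and the uuid dependency are assumptions. Note that a publish denied by NATS permissions (next section) is reported asynchronously by the server, so from this function's perspective it simply times out:

```go
package deploy

import (
	"encoding/json"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/nats-io/nats.go"
)

// Deploy publishes one request and prints responses until a terminal status.
func Deploy(nc *nats.Conn, tier, host, action, revision string, timeout time.Duration) error {
	replyTo := "deploy.responses." + uuid.NewString()
	sub, err := nc.SubscribeSync(replyTo)
	if err != nil {
		return err
	}
	defer sub.Unsubscribe()

	req, _ := json.Marshal(DeployRequest{Action: action, Revision: revision, ReplyTo: replyTo})
	if err := nc.Publish(fmt.Sprintf("deploy.%s.%s", tier, host), req); err != nil {
		return err
	}

	deadline := time.Now().Add(timeout)
	for {
		msg, err := sub.NextMsg(time.Until(deadline))
		if err != nil {
			return fmt.Errorf("no terminal response from %s: %w", host, err)
		}
		var resp DeployResponse
		if err := json.Unmarshal(msg.Data, &resp); err != nil {
			continue // skip malformed responses
		}
		fmt.Printf("%s: %s - %s\n", host, resp.Status, resp.Message)
		switch resp.Status {
		case "rejected", "failed":
			return fmt.Errorf("%s: %s", host, resp.Error)
		case "completed":
			return nil
		}
	}
}
```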
## Tiered Permissions
Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:
NATS user configuration (on nats1):
```nix
accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC..."; # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ..."; # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF..."; # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};
```
Host tier assignments (via `homelab.host.tier`):

| Tier | Hosts |
|------|-------|
| test | template1, nix-cache01, future test hosts |
| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. |
**How it works:**

1. MCP tries to deploy to ns1 → publishes to `deploy.prod.ns1`
2. The NATS server rejects the publish (mcp-deployer lacks the `deploy.prod.>` permission)
3. MCP tries to deploy to template1 → publishes to `deploy.test.template1`
4. NATS allows it; the listener receives and executes
All NKeys are stored in Vault - the MCP server gets limited credentials, the admin CLI gets full-access credentials.
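One nats.go subtlety worth planning for: with plain NKey auth, permission violations arrive asynchronously (the server sends an error event rather than failing the publish call), so deployers should register an error handler. A sketch; the seed-file path is illustrative:

```go
package deploy

import (
	"log"

	"github.com/nats-io/nats.go"
)

// Connect authenticates with an NKey seed file provisioned from Vault.
func Connect(url, seedFile string) (*nats.Conn, error) {
	nkey, err := nats.NkeyOptionFromSeed(seedFile) // e.g. /run/secrets/mcp-deployer.nk
	if err != nil {
		return nil, err
	}
	return nats.Connect(url,
		nkey,
		// Surfaces server-side errors such as
		// "Permissions Violation for Publish to deploy.prod.ns1".
		nats.ErrorHandler(func(_ *nats.Conn, _ *nats.Subscription, err error) {
			log.Printf("nats async error: %v", err)
		}),
	)
}
```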
## Host Metadata

Rather than defining the tier in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.
Module definition (in `modules/homelab/host.nix`):

```nix
options.homelab.host = {
  tier = lib.mkOption {
    type = lib.types.enum [ "test" "prod" ];
    default = "prod";
    description = "Deployment tier - controls which credentials can deploy to this host";
  };
  priority = lib.mkOption {
    type = lib.types.enum [ "high" "low" ];
    default = "high";
    description = "Alerting priority - low priority hosts have relaxed thresholds";
  };
  role = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
    description = "Primary role of this host (dns, database, monitoring, etc.)";
  };
  labels = lib.mkOption {
    type = lib.types.attrsOf lib.types.str;
    default = { };
    description = "Additional free-form labels";
  };
};
```
**Consumers:**

- The `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription
- The Prometheus scrape config reads `priority`, `role`, and `labels` for target labels
- Future services can consume the same metadata
Example host config:
```nix
# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";    # can be deployed by MCP
  priority = "low"; # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod";     # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```
## Implementation Steps

### Phase 1: Core Binary + Listener
- Create homelab-deploy repository
  - Initialize Go module
  - Set up flake.nix with Go package build
- Implement listener mode
  - NATS subscription logic
  - nixos-rebuild execution
  - Status reporting via NATS reply
- Create NixOS module
  - Systemd service definition
  - Configuration options (hostname, NATS URL, NKey path)
  - Vault secret integration for NKeys
- Create `homelab.host` module (in nixos-servers)
  - Define `tier`, `priority`, `role`, `labels` options
  - This module is shared with the Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`)
- Integrate with nixos-servers
  - Add flake input for homelab-deploy
  - Import listener module in `system/`
  - Set `homelab.host.tier` per host (test vs prod)
- Configure NATS tiered permissions
  - Add deployer users to nats1 config (mcp-deployer, admin-deployer)
  - Set up subject ACLs per user (test-only vs full access)
  - Add deployer NKeys to Vault
  - Create Terraform resources for NKey secrets
### Phase 2: MCP + CLI

- Implement MCP mode
  - MCP server with deploy/status tools
  - Request/reply pattern for deployment feedback
- Implement CLI commands
  - `deploy` command for manual deployments
  - `status` command to check deployment state
- Configure Claude Code
  - Add MCP server to configuration
  - Document usage
### Phase 3: Enhancements

- Add deployment locking to prevent concurrent deploys (see the sketch below)
- Prometheus metrics for deployment status
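A minimal sketch of per-host deployment locking inside the listener, using an atomic flag (Go 1.19+); the function names are illustrative, and the caller would reply with `already_running` when `tryStartDeploy` fails:

```go
package listener

import "sync/atomic"

// deploying guards against concurrent nixos-rebuild runs on this host.
var deploying atomic.Bool

// tryStartDeploy returns false if a deployment is already in progress;
// the caller should reply with error "already_running".
func tryStartDeploy() bool { return deploying.CompareAndSwap(false, true) }

// finishDeploy releases the lock; defer it right after a successful tryStartDeploy.
func finishDeploy() { deploying.Store(false) }
```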
## Security Considerations

- **Privilege escalation**: the listener runs as root in order to execute nixos-rebuild
- **Input validation**: strictly validate the revision format (branch name or commit hash); see the sketch below
- **Rate limiting**: prevent rapid-fire deployments
- **Audit logging**: log all deployment requests with the source identity
- **Network isolation**: NATS is only accessible from the internal network
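For the input-validation point, a conservative allow-list check before the revision ever reaches git or a shell; the exact patterns are an assumption to be tightened as needed:

```go
package listener

import "regexp"

var (
	commitRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)                    // abbreviated or full hash
	branchRe = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]{0,100}$`) // simple branch names
)

// validRevision rejects anything that is neither a hex commit hash nor a
// plain branch name (no spaces, no leading dashes, no shell metacharacters).
func validRevision(rev string) bool {
	return commitRe.MatchString(rev) || branchRe.MatchString(rev)
}
```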
## Decisions

All open questions have been resolved; see the Notes section for decision rationale.
## Notes

- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses the NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp); see the sketch below
- Timeout decision: metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. A per-host override is available for hosts with known longer build times.
- Rollback: not needed as a separate feature - deploying an older commit hash is an effective rollback.
- Offline hosts: no message persistence - if a host is offline, the deploy fails. The daily auto-upgrade is the safety net. This avoids the complexity of JetStream deduplication (a host coming online and applying 10 queued updates instead of just the latest).
- Deploy history: use the existing Loki setup - the listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- Naming: `homelab-deploy` ties the project to the infrastructure rather than to implementation details.
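For the headers note above, a sketch of attaching request metadata as NATS headers rather than widening the JSON body; the header names are illustrative:

```go
package deploy

import (
	"time"

	"github.com/google/uuid"
	"github.com/nats-io/nats.go"
)

// publishWithHeaders attaches request metadata as NATS headers so the
// payload stays identical to the documented request format.
func publishWithHeaders(nc *nats.Conn, subject string, payload []byte) error {
	return nc.PublishMsg(&nats.Msg{
		Subject: subject,
		Header: nats.Header{
			"Request-Id": []string{uuid.NewString()},
			"Timestamp":  []string{time.Now().UTC().Format(time.RFC3339)},
		},
		Data: payload,
	})
}
```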