homelab-deploy Design Document
A message-based deployment system for NixOS configurations using NATS for messaging. This binary runs in multiple modes to enable on-demand NixOS configuration updates across a fleet of hosts.
Overview
The homelab-deploy binary provides three operational modes:
- Listener mode - Runs on each NixOS host as a systemd service, subscribing to NATS subjects and executing nixos-rebuild when deployment requests arrive
- MCP mode - Runs as an MCP (Model Context Protocol) server, exposing deployment tools for AI assistants
- CLI mode - Manual deployment commands for administrators
Architecture
┌─────────────┐                        ┌─────────────┐
│  MCP Tool   │  deploy.test.>         │  Admin CLI  │  deploy.test.> + deploy.prod.>
│             │────────────┐     ┌─────│             │
└─────────────┘            │     │     └─────────────┘
                           ▼     ▼
                      ┌──────────────┐
                      │ NATS Server  │
                      │   (authz)    │
                      └──────┬───────┘
                             │
           ┌─────────────────┼─────────────────┐
           │                 │                 │
           ▼                 ▼                 ▼
      ┌──────────┐      ┌──────────┐      ┌──────────┐
      │  host-a  │      │  host-b  │      │  host-c  │
      │ tier=test│      │ tier=prod│      │ tier=prod│
      └──────────┘      └──────────┘      └──────────┘
Repository Structure
homelab-deploy/
├── flake.nix              # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go        # CLI entrypoint with subcommands
├── internal/
│   ├── listener/          # Listener mode logic
│   ├── mcp/               # MCP server mode logic
│   ├── nats/              # NATS client wrapper
│   └── deploy/            # Shared deployment execution logic
└── nixos/
    └── module.nix         # NixOS module for listener service
CLI Interface
# Listener mode (runs as systemd service on each host)
homelab-deploy listener \
--hostname <hostname> \
--tier <test|prod> \
--nats-url nats://server:4222 \
--nkey-file /path/to/listener.nkey \
--flake-url <git+https://...> \
[--role <role>] \
[--timeout 600] \
[--deploy-subject <subject>]... \
[--discover-subject <subject>]
# Subject flags can be repeated and use template variables:
homelab-deploy listener \
--hostname ns1 \
--tier prod \
--role dns \
--deploy-subject "deploy.<tier>.<hostname>" \
--deploy-subject "deploy.<tier>.all" \
--deploy-subject "deploy.<tier>.role.<role>" \
--discover-subject "deploy.discover" \
...
# MCP server mode (for AI assistants)
homelab-deploy mcp \
--nats-url nats://server:4222 \
--nkey-file /path/to/mcp.nkey \
[--enable-admin --admin-nkey-file /path/to/admin.nkey]
# CLI commands for manual use
# Deploy to a specific subject
homelab-deploy deploy <subject> \
--nats-url nats://server:4222 \
--nkey-file /path/to/deployer.nkey \
[--branch <branch>] \
[--action <switch|boot|test|dry-activate>]
# Examples:
homelab-deploy deploy deploy.prod.ns1 # Deploy to specific host
homelab-deploy deploy deploy.test.all # Deploy to all test hosts
homelab-deploy deploy deploy.prod.role.dns # Deploy to all prod DNS hosts
# Using aliases (configured via environment variables)
homelab-deploy deploy test # Expands to configured subject
homelab-deploy deploy prod-dns # Expands to configured subject
CLI Subject Aliases
The CLI supports subject aliases via environment variables. If the <subject> argument doesn't look like a NATS subject (no dots), the CLI checks for an alias.
Environment variable format: HOMELAB_DEPLOY_ALIAS_<NAME>=<subject>
export HOMELAB_DEPLOY_ALIAS_TEST="deploy.test.all"
export HOMELAB_DEPLOY_ALIAS_PROD="deploy.prod.all"
export HOMELAB_DEPLOY_ALIAS_PROD_DNS="deploy.prod.role.dns"
# Now these work:
homelab-deploy deploy test # -> deploy.test.all
homelab-deploy deploy prod # -> deploy.prod.all
homelab-deploy deploy prod-dns # -> deploy.prod.role.dns
Alias names are case-insensitive and hyphens are converted to underscores when looking up the environment variable.
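The alias lookup described above could be sketched as follows. This is a minimal illustration, not the project's implementation: the function name resolveSubject, the injected getenv parameter, and the error text are assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// aliasEnvPrefix is the environment variable prefix described above.
const aliasEnvPrefix = "HOMELAB_DEPLOY_ALIAS_"

// resolveSubject returns arg unchanged when it already looks like a NATS
// subject (it contains a dot). Otherwise it looks up the alias variable,
// upper-casing the name and converting hyphens to underscores.
// The name and signature are hypothetical.
func resolveSubject(arg string, getenv func(string) string) (string, error) {
	if strings.Contains(arg, ".") {
		return arg, nil
	}
	name := aliasEnvPrefix + strings.ToUpper(strings.ReplaceAll(arg, "-", "_"))
	if subject := getenv(name); subject != "" {
		return subject, nil
	}
	return "", fmt.Errorf("%q is not a subject and %s is not set", arg, name)
}

func main() {
	os.Setenv("HOMELAB_DEPLOY_ALIAS_PROD_DNS", "deploy.prod.role.dns")
	subject, err := resolveSubject("prod-dns", os.Getenv)
	fmt.Println(subject, err)
}
```

Passing the environment lookup as a function keeps the resolver easy to test without touching the process environment.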
NATS Subject Structure
Subjects follow the pattern deploy.<tier>.<target> by default, but are fully configurable:
| Subject Pattern | Description |
|---|---|
| deploy.<tier>.<hostname> | Deploy to specific host (e.g., deploy.prod.ns1) |
| deploy.<tier>.all | Deploy to all hosts in tier (e.g., deploy.test.all) |
| deploy.<tier>.role.<role> | Deploy to hosts with role in tier (e.g., deploy.prod.role.dns) |
| deploy.responses.<uuid> | Response subject for request/reply (UUID generated by CLI) |
| deploy.discover | Host discovery requests |
Subject Customization
Listeners can configure custom subject patterns using template variables:
- <hostname> - The listener's hostname
- <tier> - The listener's tier (test/prod)
- <role> - The listener's role (if configured)
This allows prefixing subjects for multi-tenant setups (e.g., homelab.deploy.<tier>.<hostname>).
Listener Mode
Responsibilities
- Connect to NATS using NKey authentication
- Subscribe to configured deploy subjects (with template expansion)
- Subscribe to discovery subject and respond with host metadata
- Validate incoming deployment requests
- Execute nixos-rebuild with the specified parameters
- Report status back via NATS reply subject
Subject Subscriptions
Listeners subscribe to a configurable list of subjects. The configuration uses template variables that are expanded at runtime:
listener:
  hostname: ns1
  tier: prod
  role: dns
  deploy_subjects:
    - "deploy.<tier>.<hostname>"
    - "deploy.<tier>.all"
    - "deploy.<tier>.role.<role>"
  discover_subject: "deploy.discover"
Template variables:
- <hostname> - Replaced with the configured hostname
- <tier> - Replaced with the configured tier
- <role> - Replaced with the configured role (subject skipped if role is null)
Example: With the above configuration, the listener subscribes to:
- deploy.prod.ns1
- deploy.prod.all
- deploy.prod.role.dns
- deploy.discover
Prefixed example: For multi-tenant setups:
listener:
  hostname: ns1
  tier: prod
  deploy_subjects:
    - "homelab.deploy.<tier>.<hostname>"
    - "homelab.deploy.<tier>.all"
  discover_subject: "homelab.deploy.discover"
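Template expansion for the subject lists above might look like the following sketch. The function name expandSubjects is hypothetical, and the sketch assumes an empty string stands in for a null role.

```go
package main

import (
	"fmt"
	"strings"
)

// expandSubjects substitutes <hostname>, <tier> and <role> in each template.
// Templates that reference <role> are skipped when role is empty, matching
// the "subject skipped if role is null" rule. Treating "" as null is an
// assumption of this sketch.
func expandSubjects(templates []string, hostname, tier, role string) []string {
	replacer := strings.NewReplacer(
		"<hostname>", hostname,
		"<tier>", tier,
		"<role>", role,
	)
	subjects := make([]string, 0, len(templates))
	for _, tmpl := range templates {
		if role == "" && strings.Contains(tmpl, "<role>") {
			continue // no role configured, so this subject cannot be formed
		}
		subjects = append(subjects, replacer.Replace(tmpl))
	}
	return subjects
}

func main() {
	templates := []string{
		"deploy.<tier>.<hostname>",
		"deploy.<tier>.all",
		"deploy.<tier>.role.<role>",
	}
	for _, s := range expandSubjects(templates, "ns1", "prod", "dns") {
		fmt.Println(s)
	}
}
```

The same function covers the prefixed multi-tenant case, since the prefix is just literal text in the template.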
Message Formats
Request message:
{
  "action": "switch",
  "revision": "master",
  "reply_to": "deploy.responses.abc123"
}
| Field | Type | Required | Description |
|---|---|---|---|
| action | string | yes | One of: switch, boot, test, dry-activate |
| revision | string | yes | Git branch name or commit hash |
| reply_to | string | yes | Subject to publish responses to |
Response message:
{
  "hostname": "ns1",
  "status": "completed",
  "error": null,
  "message": "Successfully switched to generation 42"
}
| Field | Type | Description |
|---|---|---|
| hostname | string | The responding host's name |
| status | string | One of: accepted, rejected, started, completed, failed |
| error | string or null | Error code if status is rejected or failed |
| message | string | Human-readable details |
Error codes:
- invalid_revision - The specified branch/commit does not exist
- invalid_action - The action is not recognized
- already_running - A deployment is already in progress on this host
- build_failed - nixos-rebuild exited with non-zero status
- timeout - Deployment exceeded the configured timeout
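The message formats above map naturally onto Go structs. The following is a sketch only: the type names, parseRequest, and rejected are hypothetical helpers, and the required-field check is one possible reading of the "Required" column.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// DeployRequest mirrors the request message fields.
type DeployRequest struct {
	Action   string `json:"action"`
	Revision string `json:"revision"`
	ReplyTo  string `json:"reply_to"`
}

// DeployResponse mirrors the response message. Error is a pointer so that
// "no error" serializes as JSON null, as in the example response.
type DeployResponse struct {
	Hostname string  `json:"hostname"`
	Status   string  `json:"status"`
	Error    *string `json:"error"`
	Message  string  `json:"message"`
}

// validActions lists the nixos-rebuild actions a request may name.
var validActions = map[string]bool{
	"switch": true, "boot": true, "test": true, "dry-activate": true,
}

// parseRequest decodes a request and checks that all required fields are set.
func parseRequest(data []byte) (DeployRequest, error) {
	var req DeployRequest
	if err := json.Unmarshal(data, &req); err != nil {
		return req, fmt.Errorf("decode request: %w", err)
	}
	if req.Action == "" || req.Revision == "" || req.ReplyTo == "" {
		return req, errors.New("action, revision and reply_to are required")
	}
	return req, nil
}

// rejected builds a "rejected" response carrying one of the documented error codes.
func rejected(hostname, code, message string) DeployResponse {
	return DeployResponse{Hostname: hostname, Status: "rejected", Error: &code, Message: message}
}

func main() {
	raw := []byte(`{"action":"reboot","revision":"master","reply_to":"deploy.responses.abc123"}`)
	req, err := parseRequest(raw)
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	if !validActions[req.Action] {
		out, _ := json.Marshal(rejected("ns1", "invalid_action", "unknown action "+req.Action))
		fmt.Println(string(out))
	}
}
```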
Request/Reply Flow
1. CLI generates a UUID for the request (e.g., 550e8400-e29b-41d4-a716-446655440000)
2. CLI subscribes to deploy.responses.<uuid>
3. CLI publishes deploy request to target subject with reply_to: "deploy.responses.<uuid>"
4. Listener validates request:
   - Checks revision exists using git ls-remote
   - Checks no other deployment is running
5. Listener publishes response to the reply_to subject: {"status": "rejected", ...} if validation fails, or {"status": "started", ...} if deployment begins
6. If started, listener executes nixos-rebuild
7. Listener publishes final response to the same reply_to subject: {"status": "completed", ...} on success, or {"status": "failed", ...} on failure
8. CLI receives responses and displays progress/results
9. CLI unsubscribes after receiving final status or timeout
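The listener side of this flow can be sketched without any NATS dependency by injecting the validation, rebuild, and publish steps as functions. All names here (Response, handleRequest, isFinal) are hypothetical, and the sketch reports every rebuild error as build_failed, which glosses over the separate timeout code.

```go
package main

import "fmt"

// Response carries the two response fields the flow depends on.
type Response struct {
	Status string
	Error  string // empty when there is no error code
}

// handleRequest walks the listener side of the flow: validate, publish
// "rejected" or "started", run the rebuild, then publish the final status.
// validate returns an error code ("" when the request is acceptable) and
// rebuild stands in for executing nixos-rebuild.
func handleRequest(validate func() string, rebuild func() error, publish func(Response)) {
	if code := validate(); code != "" {
		publish(Response{Status: "rejected", Error: code})
		return
	}
	publish(Response{Status: "started"})
	if err := rebuild(); err != nil {
		publish(Response{Status: "failed", Error: "build_failed"})
		return
	}
	publish(Response{Status: "completed"})
}

// isFinal reports whether a status ends the exchange, which is the point at
// which the CLI can unsubscribe from its response subject.
func isFinal(status string) bool {
	switch status {
	case "rejected", "completed", "failed":
		return true
	}
	return false
}

func main() {
	publish := func(r Response) { fmt.Printf("%+v final=%v\n", r, isFinal(r.Status)) }
	handleRequest(func() string { return "" }, func() error { return nil }, publish)
}
```

In a real listener, publish would send the JSON-encoded response to the request's reply_to subject.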
Deployment Execution
The listener executes nixos-rebuild with the following command pattern:
nixos-rebuild <action> --flake <flake-url>?ref=<revision>#<hostname>
Where:
- <action> is one of: switch, boot, test, dry-activate
- <flake-url> is the configured git flake URL (e.g., git+https://git.example.com/user/nixos-configs.git)
- <revision> is the branch name or commit hash from the request
- <hostname> is the listener's configured hostname
Environment requirements:
- Must run as root (nixos-rebuild requires root)
- Nix must be configured with proper git credentials if the flake is private
- Network access to the git repository
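As a sketch of the command pattern above, the arguments can be assembled as an argv slice and handed to os/exec directly, so nothing passes through a shell. The helper names are hypothetical; the 600-second value is the default timeout from the CLI section, and the context is one possible way to enforce it.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// rebuildArgs builds the nixos-rebuild argument list following the pattern
// <action> --flake <flake-url>?ref=<revision>#<hostname>.
func rebuildArgs(action, flakeURL, revision, hostname string) []string {
	flakeRef := fmt.Sprintf("%s?ref=%s#%s", flakeURL, revision, hostname)
	return []string{action, "--flake", flakeRef}
}

// rebuildCommand prepares (but does not start) the command. Cancelling ctx,
// for example when the deployment timeout expires, kills the process.
func rebuildCommand(ctx context.Context, action, flakeURL, revision, hostname string) *exec.Cmd {
	return exec.CommandContext(ctx, "nixos-rebuild", rebuildArgs(action, flakeURL, revision, hostname)...)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 600*time.Second)
	defer cancel()
	cmd := rebuildCommand(ctx, "switch",
		"git+https://git.example.com/user/nixos-configs.git", "master", "ns1")
	// Only print the arguments here; a listener would call cmd.Run or
	// cmd.CombinedOutput and report the result.
	fmt.Println(cmd.Args)
}
```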
Concurrency Control
Only one deployment may run at a time per host. The listener maintains a simple lock:
- Before starting a deployment, acquire lock
- If lock is held, reject with already_running error
- Release lock when deployment completes (success or failure)
- Lock should be in-memory (no persistence needed - restarts clear it)
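One way to realize the in-memory lock is a compare-and-swap on an atomic flag, which never blocks the caller. This is a minimal sketch assuming Go 1.19 or later (for atomic.Bool); deployLock and runExclusive are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// deployLock is an in-memory, non-blocking lock. It is not persisted, so a
// restart of the listener clears it.
type deployLock struct {
	held atomic.Bool
}

// TryAcquire takes the lock if it is free and reports whether it succeeded.
func (l *deployLock) TryAcquire() bool { return l.held.CompareAndSwap(false, true) }

// Release frees the lock.
func (l *deployLock) Release() { l.held.Store(false) }

// runExclusive runs deploy while holding the lock. If the lock is already
// held it returns the already_running error code without running deploy.
// The lock is released whether deploy succeeds or fails.
func runExclusive(l *deployLock, deploy func() error) (errorCode string, err error) {
	if !l.TryAcquire() {
		return "already_running", nil
	}
	defer l.Release()
	return "", deploy()
}

func main() {
	var lock deployLock
	code, err := runExclusive(&lock, func() error {
		// A second request arriving mid-deployment is rejected.
		inner, _ := runExclusive(&lock, func() error { return nil })
		fmt.Println("nested attempt:", inner)
		return nil
	})
	fmt.Printf("outer attempt: code=%q err=%v\n", code, err)
}
```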
Logging
All deployment events should be logged to stdout/stderr (captured by systemd journal):
- Request received (with subject, action, revision)
- Validation result
- Deployment start
- Deployment completion (with exit code)
- Any errors
This enables integration with log aggregation systems (e.g., Loki via Promtail).
MCP Mode
Purpose
Exposes deployment functionality as MCP tools for AI assistants (e.g., Claude Code).
Tools
| Tool | Description | Parameters |
|---|---|---|
| deploy | Deploy to test-tier hosts | hostname or all, optional role, branch, action |
| deploy_admin | Deploy to any tier (requires --enable-admin) | tier, hostname or all, optional role, branch, action |
| list_hosts | List available deployment targets | tier (optional) |
Tool Schemas
deploy:
{
  "name": "deploy",
  "description": "Deploy NixOS configuration to test-tier hosts",
  "inputSchema": {
    "type": "object",
    "properties": {
      "hostname": {
        "type": "string",
        "description": "Target hostname, or omit to use 'all' or 'role' targeting"
      },
      "all": {
        "type": "boolean",
        "description": "Deploy to all test-tier hosts"
      },
      "role": {
        "type": "string",
        "description": "Deploy to all test-tier hosts with this role"
      },
      "branch": {
        "type": "string",
        "description": "Git branch or commit to deploy (default: master)"
      },
      "action": {
        "type": "string",
        "enum": ["switch", "boot", "test", "dry-activate"],
        "description": "nixos-rebuild action (default: switch)"
      }
    }
  }
}
deploy_admin:
{
  "name": "deploy_admin",
  "description": "Deploy NixOS configuration to any host (admin access required)",
  "inputSchema": {
    "type": "object",
    "properties": {
      "tier": {
        "type": "string",
        "enum": ["test", "prod"],
        "description": "Target tier"
      },
      "hostname": {
        "type": "string",
        "description": "Target hostname, or omit to use 'all' or 'role' targeting"
      },
      "all": {
        "type": "boolean",
        "description": "Deploy to all hosts in tier"
      },
      "role": {
        "type": "string",
        "description": "Deploy to all hosts with this role in tier"
      },
      "branch": {
        "type": "string",
        "description": "Git branch or commit to deploy (default: master)"
      },
      "action": {
        "type": "string",
        "enum": ["switch", "boot", "test", "dry-activate"],
        "description": "nixos-rebuild action (default: switch)"
      }
    },
    "required": ["tier"]
  }
}
list_hosts:
{
  "name": "list_hosts",
  "description": "List available deployment targets",
  "inputSchema": {
    "type": "object",
    "properties": {
      "tier": {
        "type": "string",
        "enum": ["test", "prod"],
        "description": "Filter by tier (optional)"
      }
    }
  }
}
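The schemas above leave open how hostname, all, and role combine. The sketch below shows one possible mapping from tool parameters to the default subject layout; requiring exactly one of the three is an assumption of this example, not something the schemas enforce, and the names Target and subjectFor are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Target holds the targeting parameters shared by deploy and deploy_admin.
type Target struct {
	Hostname string
	All      bool
	Role     string
}

// subjectFor maps targeting parameters onto the default subject layout.
// It requires exactly one of hostname, all, or role (an assumption).
func subjectFor(tier string, t Target) (string, error) {
	set := 0
	if t.Hostname != "" {
		set++
	}
	if t.All {
		set++
	}
	if t.Role != "" {
		set++
	}
	if set != 1 {
		return "", errors.New("specify exactly one of hostname, all, or role")
	}
	switch {
	case t.Hostname != "":
		return fmt.Sprintf("deploy.%s.%s", tier, t.Hostname), nil
	case t.All:
		return fmt.Sprintf("deploy.%s.all", tier), nil
	default:
		return fmt.Sprintf("deploy.%s.role.%s", tier, t.Role), nil
	}
}

func main() {
	// The deploy tool is limited to the test tier; deploy_admin would pass
	// the tier parameter through instead.
	subject, err := subjectFor("test", Target{Role: "dns"})
	fmt.Println(subject, err)
}
```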
Security Layers
- MCP flag: deploy_admin tool only registered when --enable-admin is passed
- NATS authz: MCP credentials can only publish to authorized subjects
- AI assistant permissions: The assistant's configuration can require confirmation for admin operations
Multi-Host Deployments
When deploying to multiple hosts (via all or role), the MCP server should:
- Publish the request to the appropriate broadcast subject
- Collect responses from all responding hosts
- Return aggregated results showing each host's status
Timeout handling:
- Set a reasonable timeout for collecting responses (e.g., 30 seconds after last response, or max 15 minutes)
- Return partial results if some hosts don't respond
- Indicate which hosts did not respond
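The timeout rules above (an idle timeout that restarts after each response, plus an overall maximum) can be sketched with a select loop. The names HostResult, collect, and missing are hypothetical, and the list of expected hosts is assumed to come from elsewhere, for example a prior discovery round.

```go
package main

import (
	"fmt"
	"time"
)

// HostResult is one response from one host.
type HostResult struct {
	Hostname string
	Status   string
}

// collect reads responses until none arrives for the idle duration, the
// overall maxWait elapses, or the channel is closed. It returns the latest
// status seen per host, which may be a partial result.
func collect(responses <-chan HostResult, idle, maxWait time.Duration) map[string]string {
	results := make(map[string]string)
	deadline := time.After(maxWait)
	for {
		select {
		case r, ok := <-responses:
			if !ok {
				return results
			}
			results[r.Hostname] = r.Status
		case <-time.After(idle): // a fresh idle timer after every response
			return results
		case <-deadline:
			return results
		}
	}
}

// missing lists the expected hosts that never responded.
func missing(expected []string, results map[string]string) []string {
	var out []string
	for _, h := range expected {
		if _, ok := results[h]; !ok {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	responses := make(chan HostResult, 4)
	responses <- HostResult{Hostname: "host-b", Status: "started"}
	responses <- HostResult{Hostname: "host-b", Status: "completed"}
	results := collect(responses, 50*time.Millisecond, time.Second)
	fmt.Println(results, "no response from:", missing([]string{"host-b", "host-c"}, results))
}
```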
Host Discovery
The list_hosts tool needs to know available hosts. Options:
1. Static configuration: Read from a config file or environment variable
2. NATS request: Publish to a discovery subject and collect responses from listeners
Option 2 is recommended: listeners subscribe to their configured discover_subject and respond with metadata.
Discovery request:
{
  "reply_to": "deploy.responses.discover-abc123"
}
Discovery response:
{
  "hostname": "ns1",
  "tier": "prod",
  "role": "dns",
  "deploy_subjects": [
    "deploy.prod.ns1",
    "deploy.prod.all",
    "deploy.prod.role.dns"
  ]
}
The response includes the expanded deploy_subjects so clients know exactly which subjects reach this host.
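A list_hosts implementation could decode each discovery response and apply the optional tier filter on the client side. This is a sketch under assumptions: HostInfo and filterByTier are hypothetical names, and the second sample response (host-a) is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HostInfo mirrors the discovery response fields.
type HostInfo struct {
	Hostname       string   `json:"hostname"`
	Tier           string   `json:"tier"`
	Role           string   `json:"role"`
	DeploySubjects []string `json:"deploy_subjects"`
}

// filterByTier returns the hosts in the given tier, or all hosts when tier
// is empty (the tier parameter of list_hosts is optional).
func filterByTier(hosts []HostInfo, tier string) []HostInfo {
	if tier == "" {
		return hosts
	}
	var out []HostInfo
	for _, h := range hosts {
		if h.Tier == tier {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	// Raw payloads as they might arrive on the discovery reply subject.
	raw := [][]byte{
		[]byte(`{"hostname":"ns1","tier":"prod","role":"dns","deploy_subjects":["deploy.prod.ns1","deploy.prod.all","deploy.prod.role.dns"]}`),
		[]byte(`{"hostname":"host-a","tier":"test","role":null,"deploy_subjects":["deploy.test.host-a","deploy.test.all"]}`),
	}
	var hosts []HostInfo
	for _, data := range raw {
		var h HostInfo
		if err := json.Unmarshal(data, &h); err != nil {
			continue // skip malformed responses
		}
		hosts = append(hosts, h)
	}
	for _, h := range filterByTier(hosts, "prod") {
		fmt.Println(h.Hostname, h.DeploySubjects)
	}
}
```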
NixOS Module
The NixOS module configures the listener as a systemd service with appropriate hardening.
Module Options
{
  options.services.homelab-deploy.listener = {
    enable = lib.mkEnableOption "homelab-deploy listener service";
    package = lib.mkPackageOption pkgs "homelab-deploy" { };
    hostname = lib.mkOption {
      type = lib.types.str;
      default = config.networking.hostName;
      description = "Hostname for this listener (used in subject templates)";
    };
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      description = "Deployment tier for this host";
    };
    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Role for role-based deployment targeting";
    };
    natsUrl = lib.mkOption {
      type = lib.types.str;
      description = "NATS server URL";
      example = "nats://nats.example.com:4222";
    };
    nkeyFile = lib.mkOption {
      type = lib.types.path;
      description = "Path to NKey seed file for NATS authentication";
      example = "/run/secrets/homelab-deploy-nkey";
    };
    flakeUrl = lib.mkOption {
      type = lib.types.str;
      description = "Git flake URL for nixos-rebuild";
      example = "git+https://git.example.com/user/nixos-configs.git";
    };
    timeout = lib.mkOption {
      type = lib.types.int;
      default = 600;
      description = "Deployment timeout in seconds";
    };
    deploySubjects = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [
        "deploy.<tier>.<hostname>"
        "deploy.<tier>.all"
        "deploy.<tier>.role.<role>"
      ];
      description = ''
        List of NATS subjects to subscribe to for deployment requests.
        Template variables: <hostname>, <tier>, <role>
      '';
    };
    discoverSubject = lib.mkOption {
      type = lib.types.str;
      default = "deploy.discover";
      description = "NATS subject for host discovery requests";
    };
    environment = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional environment variables for the service";
      example = { GIT_SSH_COMMAND = "ssh -i /run/secrets/deploy-key"; };
    };
  };
}
Systemd Service
The module creates a hardened systemd service:
systemd.services.homelab-deploy-listener = {
  description = "homelab-deploy listener";
  wantedBy = [ "multi-user.target" ];
  after = [ "network-online.target" ];
  wants = [ "network-online.target" ];
  environment = cfg.environment;
  serviceConfig = {
    Type = "simple";
    ExecStart = "${cfg.package}/bin/homelab-deploy listener ...";
    Restart = "always";
    RestartSec = 10;
    # Hardening (compatible with nixos-rebuild requirements)
    NoNewPrivileges = false; # nixos-rebuild may need to spawn privileged processes
    ProtectSystem = "false"; # nixos-rebuild modifies /nix/store and /run
    ProtectHome = "read-only";
    PrivateTmp = true;
    PrivateDevices = true;
    ProtectKernelTunables = true;
    ProtectKernelModules = true;
    ProtectControlGroups = true;
    RestrictAddressFamilies = [ "AF_UNIX" "AF_INET" "AF_INET6" ];
    RestrictNamespaces = false; # nix build uses namespaces
    RestrictSUIDSGID = true;
    LockPersonality = true;
    MemoryDenyWriteExecute = false; # nix may need this
    SystemCallArchitectures = "native";
  };
};
Note: Some hardening options are relaxed because nixos-rebuild requires:
- Write access to /nix/store for building
- Ability to activate system configurations
- Network access for fetching from git/cache
- Namespace support for nix sandbox builds
NATS Authentication
All NATS connections use NKey authentication. NKeys are ed25519 keypairs where:
- The seed (private key) is stored in a file readable by the service
- The public key is configured in the NATS server's user list
Credential Types
| Credential | Purpose | Publish Permissions | Subscribe Permissions |
|---|---|---|---|
| listener | Host listener service | deploy.responses.> | deploy.*.> |
| mcp-deployer | MCP test-tier access | deploy.test.> | deploy.responses.>, deploy.discover |
| admin-deployer | Full deployment access | deploy.test.>, deploy.prod.> | deploy.responses.>, deploy.discover |
Flake Structure
The flake.nix should provide:
- Package: The Go binary
- NixOS module: The listener service configuration
- Development shell: Go toolchain for development
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  };
  outputs = { self, nixpkgs }: {
    packages.x86_64-linux.default = /* Go package build */;
    packages.x86_64-linux.homelab-deploy = self.packages.x86_64-linux.default;
    nixosModules.default = import ./nixos/module.nix;
    nixosModules.homelab-deploy = self.nixosModules.default;
    devShells.x86_64-linux.default = /* Go dev shell */;
  };
}
Implementation Notes
Go Dependencies
Recommended libraries:
- github.com/urfave/cli/v3 - CLI framework
- github.com/nats-io/nats.go - NATS client
- github.com/mark3labs/mcp-go - MCP server implementation
- Standard library for JSON, logging, process execution
Error Handling
- NATS connection errors: Retry with exponential backoff
- nixos-rebuild failures: Capture stdout/stderr, report in response message
- Timeout: Kill the nixos-rebuild process, report timeout error
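For the retry rule above, the delay schedule of an exponential backoff can be computed with a small helper. The base and cap values in the example are illustrative assumptions, and backoffDelay is a hypothetical name; the sketch only computes delays and does not perform the reconnect itself.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (0-based):
// base doubled once per previous attempt, capped at limit.
func backoffDelay(attempt int, base, limit time.Duration) time.Duration {
	delay := base
	for i := 0; i < attempt; i++ {
		delay *= 2
		if delay >= limit {
			return limit // stop early so the value cannot overflow
		}
	}
	if delay > limit {
		return limit
	}
	return delay
}

func main() {
	// Illustrative values: start at 1s and never wait longer than 30s.
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, time.Second, 30*time.Second))
	}
}
```

Adding random jitter to each delay is a common refinement when many listeners reconnect at once, though the design above does not require it.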
Testing
- Unit tests for message parsing and validation
- Integration tests using a local NATS server
- End-to-end tests with a NixOS VM (optional, can be done in consuming repo)
Security Considerations
- Privilege: Listener runs as root to execute nixos-rebuild
- Input validation: Strictly validate revision format (alphanumeric, dashes, underscores, dots, slashes for branch names; hex for commit hashes)
- Command injection: Never interpolate user input into shell commands without validation
- Rate limiting: Consider adding rate limiting to prevent rapid-fire deployments
- Audit logging: Log all deployment requests with full context
- Network isolation: NATS should only be accessible from trusted networks
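The revision format check from the input validation item could be sketched with a regular expression. The character set comes from the list above; the requirement that the first character be alphanumeric, the 255-character cap, and the rejection of ".." are extra assumptions of this example, added so a revision can never look like a command-line option or a path traversal.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// revisionPattern allows alphanumerics, dots, underscores, slashes and
// dashes. Commit hashes (hex) are a subset of this set. The leading
// alphanumeric and the length cap are assumptions of this sketch.
var revisionPattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]{0,254}$`)

// validRevision reports whether rev is acceptable as a branch name or
// commit hash before it is passed on to git or nixos-rebuild.
func validRevision(rev string) bool {
	if !revisionPattern.MatchString(rev) {
		return false
	}
	// ".." is never valid in a git ref name.
	return !strings.Contains(rev, "..")
}

func main() {
	for _, rev := range []string{"master", "feature/dns-fix", "--upload-pack=x", "a;reboot"} {
		fmt.Printf("%-18q valid=%v\n", rev, validRevision(rev))
	}
}
```

A format check like this complements, rather than replaces, the git ls-remote existence check in the request flow.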
Future Enhancements
These are not required for initial implementation:
- Deployment locking - Cluster-wide lock to prevent fleet-wide concurrent deploys
- Prometheus metrics - Export deployment count, duration, success/failure rates
- Webhook triggers - HTTP endpoint for CI/CD integration
- Scheduled deployments - Deploy at specific times (though this overlaps with existing auto-upgrade)