# NATS-Based Deployment Service

## Overview

Create a message-based deployment system that allows triggering NixOS configuration updates on demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.

## Goals

1. **On-demand deployment** - Trigger config updates immediately via NATS message
2. **Targeted deployment** - Deploy to specific hosts or all hosts
3. **Branch/revision support** - Test feature branches before merging to master
4. **MCP integration** - Allow Claude Code to trigger deployments during development

## Current State

- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: `nixos-rebuild-test` helper exists on all hosts
- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)

## Architecture

```
┌─────────────┐                              ┌─────────────┐
│  MCP Tool   │ deploy.test.>                │  Admin CLI  │ deploy.test.> + deploy.prod.>
│  (claude)   │──────────┐        ┌──────────│  (torjus)   │
└─────────────┘          │        │          └─────────────┘
                         ▼        ▼
                     ┌──────────────┐
                     │    nats1     │
                     │   (authz)    │
                     └──────┬───────┘
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
          ▼                 ▼                 ▼
    ┌──────────┐      ┌──────────┐      ┌──────────┐
    │ template1│      │   ns1    │      │   ha1    │
    │ tier=test│      │ tier=prod│      │ tier=prod│
    └──────────┘      └──────────┘      └──────────┘
```

## Repository Structure

The project lives in a **separate repository** (e.g., `homelab-deploy`) containing:

```
homelab-deploy/
├── flake.nix              # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go        # CLI entrypoint with subcommands
├── internal/
│   ├── listener/          # Listener mode logic
│   ├── mcp/               # MCP server mode logic
│   └── deploy/            # Shared deployment logic
└── nixos/
    └── module.nix         # NixOS module for listener service
```

The nixos-servers repo then imports this flake as an input and uses the NixOS module.
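A minimal sketch of how the subcommand dispatch in `cmd/homelab-deploy/main.go` could look — the structure and error messages here are illustrative, not a final implementation; real subcommands would each parse their own flags (`--nats-url`, `--hostname`, etc.):

```go
package main

import (
	"fmt"
	"os"
)

// run selects the mode from the first argument. Per-subcommand flag
// parsing (e.g. --nats-url) would happen inside each branch.
func run(args []string) error {
	if len(args) < 1 {
		return fmt.Errorf("usage: homelab-deploy <listener|mcp|deploy|status> [flags]")
	}
	switch args[0] {
	case "listener", "mcp", "deploy", "status":
		// Placeholder: dispatch to internal/listener, internal/mcp, etc.
		fmt.Printf("mode: %s\n", args[0])
		return nil
	default:
		return fmt.Errorf("unknown subcommand %q", args[0])
	}
}

func main() {
	if err := run(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```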
## Single Binary with Subcommands

The `homelab-deploy` binary supports multiple modes:

```bash
# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222

# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222

# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch   # single host
homelab-deploy deploy --tier test --all --action boot          # all test hosts
homelab-deploy deploy --tier prod --all --action boot          # all prod hosts (admin only)
homelab-deploy deploy --tier prod --role dns --action switch   # all prod dns hosts
homelab-deploy status
```

## Components

### Listener Mode

A systemd service on each host that:

- Subscribes to multiple subjects for targeted and group deployments
- Validates incoming messages (revision, action)
- Executes `nixos-rebuild` with the specified parameters
- Reports status back via NATS

**Subject structure:**

```
deploy.<tier>.<hostname>     # specific host (e.g., deploy.prod.ns1)
deploy.<tier>.all            # all hosts in tier (e.g., deploy.test.all)
deploy.<tier>.role.<role>    # all hosts with role in tier (e.g., deploy.prod.role.dns)
```

**Listener subscriptions** (based on `homelab.host` config):

- `deploy.<tier>.<hostname>` - direct messages to this host
- `deploy.<tier>.all` - broadcast to all hosts in tier
- `deploy.<tier>.role.<role>` - broadcast to hosts with matching role (if role is set)

Example: ns1 with `tier=prod, role=dns` subscribes to:

- `deploy.prod.ns1`
- `deploy.prod.all`
- `deploy.prod.role.dns`

**NixOS module configuration:**

```nix
services.homelab-deploy.listener = {
  enable = true;
  timeout = 600; # seconds, default 10 minutes
};
```

The listener reads tier and role from `config.homelab.host` (see Host Metadata below).

**Request message format:**

```json
{
  "action": "switch" | "boot" | "test" | "dry-activate",
  "revision": "master" | "feature-branch" | "abc123...",
  "reply_to": "deploy.responses.<request-id>"
}
```

**Response message format:**

```json
{
  "status": "accepted" | "rejected" | "started" | "completed" | "failed",
  "error": "invalid_revision" | "already_running" | "build_failed" | null,
  "message": "human-readable details"
}
```

**Request/Reply flow:**

1. MCP/CLI sends a deploy request with a unique `reply_to` subject
2. Listener validates the request (e.g., `git ls-remote` to check the revision exists)
3. Listener sends an immediate response:
   - `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}`, or
   - `{"status": "started", "message": "starting nixos-rebuild switch"}`
4. If started, listener runs nixos-rebuild
5. Listener sends a final response:
   - `{"status": "completed", "message": "successfully switched to generation 42"}`, or
   - `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}`

This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.

### MCP Mode

Runs as an MCP server providing tools for Claude Code.

**Tools:**

| Tool | Description | Tier Access |
|------|-------------|-------------|
| `deploy` | Deploy to test hosts (individual, all, or by role) | test only |
| `deploy_admin` | Deploy to any host (requires `--enable-admin` flag) | test + prod |
| `deploy_status` | Check deployment status/history | n/a |
| `list_hosts` | List available deployment targets | n/a |

**CLI flags:**

```bash
# Default: only test-tier deployments available
homelab-deploy mcp --nats-url nats://nats1:4222

# Enable admin tool (requires admin NKey to be configured)
homelab-deploy mcp --nats-url nats://nats1:4222 --enable-admin --admin-nkey-file /path/to/admin.nkey
```

**Security layers:**

1. **MCP flag**: `deploy_admin` tool is only exposed when `--enable-admin` is passed
2. **NATS authz**: Even if the tool is exposed, NATS rejects publishes without a valid admin NKey
3. **Claude Code permissions**: `mcp__homelab-deploy__deploy_admin` can be set to `ask` mode for a confirmation popup

By default, the MCP server only loads test-tier credentials and exposes the `deploy` tool. Claude can:

- Deploy to individual test hosts
- Deploy to all test hosts at once (`deploy.test.all`)
- Deploy to test hosts by role (`deploy.test.role.<role>`)

### Tiered Permissions

Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights.

**NATS user configuration (on nats1):**

```nix
accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC..."; # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ..."; # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF..."; # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};
```

**Host tier assignments** (via `homelab.host.tier`):

| Tier | Hosts |
|------|-------|
| test | template1, nix-cache01, future test hosts |
| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. |

**Example deployment scenarios:**

| Command | Subject | MCP | Admin |
|---------|---------|-----|-------|
| Deploy to ns1 | `deploy.prod.ns1` | ❌ | ✅ |
| Deploy to template1 | `deploy.test.template1` | ✅ | ✅ |
| Deploy to all test hosts | `deploy.test.all` | ✅ | ✅ |
| Deploy to all prod hosts | `deploy.prod.all` | ❌ | ✅ |
| Deploy to all DNS servers | `deploy.prod.role.dns` | ❌ | ✅ |

All NKeys are stored in Vault - the MCP server gets limited credentials, the admin CLI gets full-access credentials.

### Host Metadata

Rather than defining `tier` in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers.
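The listener's subscription set can be derived mechanically from the host metadata. A minimal sketch in Go (the function name `subjectsFor` is hypothetical; the subject layout follows the subject structure defined above):

```go
package main

import "fmt"

// subjectsFor returns the NATS subjects a host's listener subscribes to,
// derived from homelab.host metadata: a direct subject, a tier-wide
// broadcast, and (if a role is set) a role broadcast within the tier.
func subjectsFor(tier, hostname, role string) []string {
	subs := []string{
		fmt.Sprintf("deploy.%s.%s", tier, hostname), // direct messages
		fmt.Sprintf("deploy.%s.all", tier),          // tier broadcast
	}
	if role != "" {
		subs = append(subs, fmt.Sprintf("deploy.%s.role.%s", tier, role))
	}
	return subs
}

func main() {
	// ns1 with tier=prod, role=dns from the example above.
	for _, s := range subjectsFor("prod", "ns1", "dns") {
		fmt.Println(s)
	}
}
```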
This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.

**Module definition (in `modules/homelab/host.nix`):**

```nix
options.homelab.host = {
  tier = lib.mkOption {
    type = lib.types.enum [ "test" "prod" ];
    default = "prod";
    description = "Deployment tier - controls which credentials can deploy to this host";
  };
  priority = lib.mkOption {
    type = lib.types.enum [ "high" "low" ];
    default = "high";
    description = "Alerting priority - low priority hosts have relaxed thresholds";
  };
  role = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
    description = "Primary role of this host (dns, database, monitoring, etc.)";
  };
  labels = lib.mkOption {
    type = lib.types.attrsOf lib.types.str;
    default = { };
    description = "Additional free-form labels";
  };
};
```

**Consumers:**

- `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription
- Prometheus scrape config reads `priority`, `role`, `labels` for target labels
- Future services can consume the same metadata

**Example host config:**

```nix
# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";    # can be deployed by MCP
  priority = "low"; # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod"; # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```

## Implementation Steps

### Phase 1: Core Binary + Listener

1. **Create homelab-deploy repository**
   - Initialize Go module
   - Set up flake.nix with Go package build
2. **Implement listener mode**
   - NATS subscription logic
   - nixos-rebuild execution
   - Status reporting via NATS reply
3. **Create NixOS module**
   - Systemd service definition
   - Configuration options (hostname, NATS URL, NKey path)
   - Vault secret integration for NKeys
4. **Create `homelab.host` module** (in nixos-servers)
   - Define `tier`, `priority`, `role`, `labels` options
   - This module is shared with the Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`)
5. **Integrate with nixos-servers**
   - Add flake input for homelab-deploy
   - Import listener module in `system/`
   - Set `homelab.host.tier` per host (test vs prod)
6. **Configure NATS tiered permissions**
   - Add deployer users to nats1 config (mcp-deployer, admin-deployer)
   - Set up subject ACLs per user (test-only vs full access)
   - Add deployer NKeys to Vault
   - Create Terraform resources for NKey secrets

### Phase 2: MCP + CLI

7. **Implement MCP mode**
   - MCP server with deploy/status tools
   - Request/reply pattern for deployment feedback
8. **Implement CLI commands**
   - `deploy` command for manual deployments
   - `status` command to check deployment state
9. **Configure Claude Code**
   - Add MCP server to configuration
   - Document usage

### Phase 3: Enhancements

10. Add deployment locking (prevent concurrent deploys)
11. Prometheus metrics for deployment status

## Security Considerations

- **Privilege escalation**: Listener runs as root to execute nixos-rebuild
- **Input validation**: Strictly validate revision format (branch name or commit hash)
- **Rate limiting**: Prevent rapid-fire deployments
- **Audit logging**: Log all deployment requests with source identity
- **Network isolation**: NATS only accessible from the internal network

## Decisions

All open questions have been resolved. See the Notes section for decision rationale.

## Notes

- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses the NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp)
- **Timeout decision**: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads.
  A per-host override is available for hosts with known longer build times.
- **Rollback**: Not needed as a separate feature - deploy an older commit hash to effectively roll back.
- **Offline hosts**: No message persistence - if a host is offline, the deploy fails. The daily auto-upgrade is the safety net. This avoids the complexity of JetStream deduplication (a host coming online and applying 10 queued updates instead of just the latest).
- **Deploy history**: Use existing Loki - the listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- **Naming**: `homelab-deploy` - ties it to the infrastructure rather than implementation details.
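The request/response formats from the Listener Mode section map naturally onto Go structs with JSON tags. A hedged sketch of how the listener's validation step could be modeled — the `validate` helper and the `invalid_action` error code are illustrative additions not specified in this plan (the plan only defines `invalid_revision`, `already_running`, and `build_failed`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeployRequest mirrors the request message format.
type DeployRequest struct {
	Action   string `json:"action"`   // switch | boot | test | dry-activate
	Revision string `json:"revision"` // branch name or commit hash
	ReplyTo  string `json:"reply_to"` // unique response subject
}

// DeployResponse mirrors the response message format.
type DeployResponse struct {
	Status  string `json:"status"`          // accepted | rejected | started | completed | failed
	Error   string `json:"error,omitempty"` // e.g. invalid_revision | already_running | build_failed
	Message string `json:"message"`
}

// validActions whitelists what may be passed to nixos-rebuild.
var validActions = map[string]bool{
	"switch": true, "boot": true, "test": true, "dry-activate": true,
}

// validate returns a rejection response for malformed requests, or nil
// if the request may proceed. A real listener would additionally check
// the revision against the remote (git ls-remote) before starting.
func validate(r DeployRequest) *DeployResponse {
	if !validActions[r.Action] {
		return &DeployResponse{Status: "rejected", Error: "invalid_action", // hypothetical error code
			Message: fmt.Sprintf("unknown action %q", r.Action)}
	}
	if r.Revision == "" {
		return &DeployResponse{Status: "rejected", Error: "invalid_revision",
			Message: "revision must be a branch name or commit hash"}
	}
	return nil
}

func main() {
	raw := []byte(`{"action":"switch","revision":"master","reply_to":"deploy.responses.abc"}`)
	var req DeployRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		panic(err)
	}
	if resp := validate(req); resp != nil {
		fmt.Println("rejected:", resp.Message)
	} else {
		fmt.Println("accepted:", req.Action, req.Revision)
	}
}
```

Keeping the shared types in `internal/deploy/` would let the listener, MCP, and CLI modes all marshal the same messages.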