NATS-Based Deployment Service

Overview

Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.

Goals

  1. On-demand deployment - Trigger config updates immediately via NATS message
  2. Targeted deployment - Deploy to specific hosts or all hosts
  3. Branch/revision support - Test feature branches before merging to master
  4. MCP integration - Allow Claude Code to trigger deployments during development

Current State

  • Auto-upgrade: All hosts run nixos-upgrade.service daily, pulling from master
  • Manual testing: nixos-rebuild-test <action> <branch> helper exists on all hosts
  • NATS: Running on nats1 with JetStream enabled, using NKey authentication
  • Accounts: ADMIN (system) and HOMELAB (user workloads with JetStream)

Architecture

┌─────────────┐                        ┌─────────────┐
│  MCP Tool   │  deploy.test.>         │  Admin CLI  │  deploy.test.> + deploy.prod.>
│  (claude)   │────────────┐     ┌─────│  (torjus)   │
└─────────────┘            │     │     └─────────────┘
                           ▼     ▼
                      ┌──────────────┐
                      │    nats1     │
                      │  (authz)     │
                      └──────┬───────┘
                             │
           ┌─────────────────┼─────────────────┐
           │                 │                 │
           ▼                 ▼                 ▼
     ┌──────────┐      ┌──────────┐      ┌──────────┐
     │ template1│      │   ns1    │      │   ha1    │
     │ tier=test│      │ tier=prod│      │ tier=prod│
     └──────────┘      └──────────┘      └──────────┘

Repository Structure

The project lives in a separate repository (e.g., homelab-deploy) containing:

homelab-deploy/
├── flake.nix           # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go     # CLI entrypoint with subcommands
├── internal/
│   ├── listener/       # Listener mode logic
│   ├── mcp/            # MCP server mode logic
│   └── deploy/         # Shared deployment logic
└── nixos/
    └── module.nix      # NixOS module for listener service

This repository (nixos-servers) then imports the homelab-deploy flake as an input and uses its NixOS module.

Single Binary with Subcommands

The homelab-deploy binary supports multiple modes:

# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222

# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222

# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch  # single host
homelab-deploy deploy --tier test --all --action boot          # all test hosts
homelab-deploy deploy --tier prod --all --action boot          # all prod hosts (admin only)
homelab-deploy deploy --tier prod --role dns --action switch   # all prod dns hosts
homelab-deploy status
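
A minimal sketch of the subcommand dispatch in cmd/homelab-deploy/main.go could look like the following; the run* functions are hypothetical stand-ins for the internal packages:

package main

import (
  "fmt"
  "os"
)

func main() {
  if len(os.Args) < 2 {
    fmt.Fprintln(os.Stderr, "usage: homelab-deploy <listener|mcp|deploy|status> [flags]")
    os.Exit(2)
  }

  // Each mode parses its own flag set (--hostname, --nats-url, --tier, ...).
  var err error
  switch os.Args[1] {
  case "listener":
    err = runListener(os.Args[2:]) // internal/listener
  case "mcp":
    err = runMCP(os.Args[2:]) // internal/mcp
  case "deploy":
    err = runDeploy(os.Args[2:]) // internal/deploy
  case "status":
    err = runStatus(os.Args[2:])
  default:
    err = fmt.Errorf("unknown subcommand %q", os.Args[1])
  }
  if err != nil {
    fmt.Fprintln(os.Stderr, err)
    os.Exit(1)
  }
}

// Hypothetical stubs; the real logic lives in the internal packages.
func runListener(args []string) error { return nil }
func runMCP(args []string) error      { return nil }
func runDeploy(args []string) error   { return nil }
func runStatus(args []string) error   { return nil }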

Components

Listener Mode

A systemd service on each host that:

  • Subscribes to multiple subjects for targeted and group deployments
  • Validates incoming messages (revision, action)
  • Executes nixos-rebuild with specified parameters
  • Reports status back via NATS

Subject structure:

deploy.<tier>.<hostname>      # specific host (e.g., deploy.prod.ns1)
deploy.<tier>.all             # all hosts in tier (e.g., deploy.test.all)
deploy.<tier>.role.<role>     # all hosts with role in tier (e.g., deploy.prod.role.dns)

Listener subscriptions (based on homelab.host config):

  • deploy.<tier>.<hostname> - direct messages to this host
  • deploy.<tier>.all - broadcast to all hosts in tier
  • deploy.<tier>.role.<role> - broadcast to hosts with matching role (if role is set)

Example: ns1 with tier=prod, role=dns subscribes to:

  • deploy.prod.ns1
  • deploy.prod.all
  • deploy.prod.role.dns
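
Deriving those subscriptions from the host metadata is mechanical; a minimal sketch using the nats.go client, where the function name and handler callback are assumptions:

package listener

import (
  "fmt"

  "github.com/nats-io/nats.go"
)

// subscribeAll sets up the host's deploy subjects from its tier, hostname
// and optional role (read from config.homelab.host via the NixOS module).
func subscribeAll(nc *nats.Conn, tier, hostname, role string, handle nats.MsgHandler) error {
  subjects := []string{
    fmt.Sprintf("deploy.%s.%s", tier, hostname), // direct messages
    fmt.Sprintf("deploy.%s.all", tier),          // tier-wide broadcast
  }
  if role != "" {
    subjects = append(subjects, fmt.Sprintf("deploy.%s.role.%s", tier, role))
  }
  for _, subj := range subjects {
    if _, err := nc.Subscribe(subj, handle); err != nil {
      return fmt.Errorf("subscribe %s: %w", subj, err)
    }
  }
  return nil
}

The connection itself would use nats.Connect with nats.NkeyOptionFromSeed pointing at the host's Vault-provisioned seed file.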

NixOS module configuration:

services.homelab-deploy.listener = {
  enable = true;
  timeout = 600;  # seconds, default 10 minutes
};

The listener reads tier and role from config.homelab.host (see Host Metadata below).

Request message format:

{
  "action": "switch" | "boot" | "test" | "dry-activate",
  "revision": "master" | "feature-branch" | "abc123...",
  "reply_to": "deploy.responses.<request-id>"
}

Response message format:

{
  "status": "accepted" | "rejected" | "started" | "completed" | "failed",
  "error": "invalid_revision" | "already_running" | "build_failed" | null,
  "message": "human-readable details"
}
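
In Go, both formats could map directly onto small structs shared by the listener, CLI, and MCP modes; a sketch where the field tags mirror the JSON keys and a null error is represented by omitting the field:

package deploy

// DeployRequest is the message published to a deploy.* subject.
type DeployRequest struct {
  Action   string `json:"action"`   // switch | boot | test | dry-activate
  Revision string `json:"revision"` // branch name or commit hash
  ReplyTo  string `json:"reply_to"` // deploy.responses.<request-id>
}

// DeployResponse is published to the request's reply_to subject.
type DeployResponse struct {
  Status  string `json:"status"`          // accepted | rejected | started | completed | failed
  Error   string `json:"error,omitempty"` // invalid_revision | already_running | build_failed
  Message string `json:"message"`         // human-readable details
}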

Request/Reply flow:

  1. MCP/CLI sends deploy request with unique reply_to subject
  2. Listener validates request (e.g., git ls-remote to check revision exists)
  3. Listener sends immediate response:
    • {"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}, or
    • {"status": "started", "message": "starting nixos-rebuild switch"}
  4. If started, listener runs nixos-rebuild
  5. Listener sends final response:
    • {"status": "completed", "message": "successfully switched to generation 42"}, or
    • {"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}

This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.
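
A sketch of the listener side of this flow, reusing the request/response types sketched above; the module path, the repoURL field, and delegating the rebuild to the existing nixos-rebuild-test helper are assumptions:

package listener

import (
  "context"
  "encoding/json"
  "fmt"
  "os/exec"
  "time"

  "github.com/nats-io/nats.go"

  "homelab-deploy/internal/deploy" // module path is an assumption
)

type Listener struct {
  nc      *nats.Conn
  repoURL string        // git remote used to validate revisions (assumption)
  timeout time.Duration // from services.homelab-deploy.listener.timeout
}

// reply publishes a response message on the request's reply_to subject.
func (l *Listener) reply(subject, status, errCode, message string) {
  data, _ := json.Marshal(deploy.DeployResponse{Status: status, Error: errCode, Message: message})
  _ = l.nc.Publish(subject, data)
}

// handleDeploy implements steps 2-5 of the flow above.
func (l *Listener) handleDeploy(msg *nats.Msg) {
  var req deploy.DeployRequest
  if err := json.Unmarshal(msg.Data, &req); err != nil || req.ReplyTo == "" {
    return // nowhere to send a response
  }

  // Validate the revision before doing any work, then answer immediately.
  if err := exec.Command("git", "ls-remote", "--exit-code", l.repoURL, req.Revision).Run(); err != nil {
    l.reply(req.ReplyTo, "rejected", "invalid_revision",
      fmt.Sprintf("revision %q not found", req.Revision))
    return
  }
  l.reply(req.ReplyTo, "started", "", "starting nixos-rebuild "+req.Action)

  // Run the rebuild under the configured timeout; here the existing
  // nixos-rebuild-test helper performs the actual nixos-rebuild invocation.
  ctx, cancel := context.WithTimeout(context.Background(), l.timeout)
  defer cancel()
  out, err := exec.CommandContext(ctx, "nixos-rebuild-test", req.Action, req.Revision).CombinedOutput()
  if err != nil {
    l.reply(req.ReplyTo, "failed", "build_failed", string(out))
    return
  }
  l.reply(req.ReplyTo, "completed", "", "nixos-rebuild "+req.Action+" finished")
}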

MCP Mode

Runs as an MCP server providing tools for Claude Code:

  • deploy - Deploy to specific host, all hosts in tier, or all hosts with a role
  • deploy_status - Check deployment status/history
  • list_hosts - List available deployment targets

The MCP server runs with limited credentials (test-tier only), so Claude can:

  • Deploy to individual test hosts
  • Deploy to all test hosts at once (deploy.test.all)
  • Deploy to test hosts by role (deploy.test.role.<role>)

Production deployments require admin credentials.
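
Internally, the deploy tool only has to map its arguments onto one of the subjects defined earlier; a sketch of that mapping (parameter names are assumptions), with actual authorization left entirely to the NATS permissions:

package mcp

import (
  "errors"
  "fmt"
)

// deploySubject maps the deploy tool's arguments onto a NATS subject.
// NATS permissions, not this code, decide whether the publish is allowed,
// so the MCP credentials can only ever reach deploy.test.> subjects.
func deploySubject(tier, host, role string, all bool) (string, error) {
  switch {
  case host != "":
    return fmt.Sprintf("deploy.%s.%s", tier, host), nil
  case role != "":
    return fmt.Sprintf("deploy.%s.role.%s", tier, role), nil
  case all:
    return fmt.Sprintf("deploy.%s.all", tier), nil
  default:
    return "", errors.New("need a host, a role, or --all")
  }
}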

Tiered Permissions

Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:

NATS user configuration (on nats1):

accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC...";  # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ...";  # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF...";  # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};

Host tier assignments (via homelab.host.tier):

Tier    Hosts
test    template1, nix-cache01, future test hosts
prod    ns1, ns2, ha1, monitoring01, http-proxy, etc.

Example deployment scenarios:

Command                     Subject                 MCP   Admin
Deploy to ns1               deploy.prod.ns1         no    yes
Deploy to template1         deploy.test.template1   yes   yes
Deploy to all test hosts    deploy.test.all         yes   yes
Deploy to all prod hosts    deploy.prod.all         no    yes
Deploy to all DNS servers   deploy.prod.role.dns    no    yes

All NKeys are stored in Vault; the MCP server gets the limited test-tier credentials, while the admin CLI gets the full-access credentials.

Host Metadata

Rather than defining tier in the listener config, use a central homelab.host module that provides host metadata for multiple consumers. This aligns with the approach proposed in docs/plans/prometheus-scrape-target-labels.md.

Module definition (in modules/homelab/host.nix):

options.homelab.host = {
  tier = lib.mkOption {
    type = lib.types.enum [ "test" "prod" ];
    default = "prod";
    description = "Deployment tier - controls which credentials can deploy to this host";
  };

  priority = lib.mkOption {
    type = lib.types.enum [ "high" "low" ];
    default = "high";
    description = "Alerting priority - low priority hosts have relaxed thresholds";
  };

  role = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
    description = "Primary role of this host (dns, database, monitoring, etc.)";
  };

  labels = lib.mkOption {
    type = lib.types.attrsOf lib.types.str;
    default = { };
    description = "Additional free-form labels";
  };
};

Consumers:

  • homelab-deploy listener reads config.homelab.host.tier for subject subscription
  • Prometheus scrape config reads priority, role, labels for target labels
  • Future services can consume the same metadata

Example host config:

# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";      # can be deployed by MCP
  priority = "low";   # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod";      # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};

Implementation Steps

Phase 1: Core Binary + Listener

  1. Create homelab-deploy repository

    • Initialize Go module
    • Set up flake.nix with Go package build
  2. Implement listener mode

    • NATS subscription logic
    • nixos-rebuild execution
    • Status reporting via NATS reply
  3. Create NixOS module

    • Systemd service definition
    • Configuration options (hostname, NATS URL, NKey path)
    • Vault secret integration for NKeys
  4. Create homelab.host module (in nixos-servers)

    • Define tier, priority, role, labels options
    • This module is shared with Prometheus label work (see docs/plans/prometheus-scrape-target-labels.md)
  5. Integrate with nixos-servers

    • Add flake input for homelab-deploy
    • Import listener module in system/
    • Set homelab.host.tier per host (test vs prod)
  6. Configure NATS tiered permissions

    • Add deployer users to nats1 config (mcp-deployer, admin-deployer)
    • Set up subject ACLs per user (test-only vs full access)
    • Add deployer NKeys to Vault
    • Create Terraform resources for NKey secrets

Phase 2: MCP + CLI

  1. Implement MCP mode

    • MCP server with deploy/status tools
    • Request/reply pattern for deployment feedback
  2. Implement CLI commands

    • deploy command for manual deployments
    • status command to check deployment state
  3. Configure Claude Code

    • Add MCP server to configuration
    • Document usage

Phase 3: Enhancements

  1. Add deployment locking (prevent concurrent deploys)
  2. Prometheus metrics for deployment status (both enhancements are sketched below)
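
A sketch of how both enhancements could look inside the listener; the metric name is an assumption, and prometheus/client_golang is one option for exposition:

package listener

import (
  "sync/atomic"

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promauto"
)

var deployTotal = promauto.NewCounterVec(
  prometheus.CounterOpts{
    Name: "homelab_deploy_total", // metric name is an assumption
    Help: "Deployments handled by this listener, by final status.",
  },
  []string{"status"},
)

// busy guards against concurrent deploys on the same host.
var busy atomic.Bool

func tryStartDeploy() bool {
  // CompareAndSwap returns false if a deploy is already running, which
  // maps to the "already_running" error in the response format.
  return busy.CompareAndSwap(false, true)
}

func finishDeploy(status string) {
  deployTotal.WithLabelValues(status).Inc()
  busy.Store(false)
}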

Security Considerations

  • Privilege escalation: Listener runs as root to execute nixos-rebuild
  • Input validation: Strictly validate the revision format (branch name or commit hash); see the sketch after this list
  • Rate limiting: Prevent rapid-fire deployments
  • Audit logging: Log all deployment requests with source identity
  • Network isolation: NATS only accessible from internal network
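
As an illustration of the input-validation point above, a minimal sketch of a strict revision check (the exact allowed character set is an assumption):

package listener

import "regexp"

// Allow either a full or abbreviated commit hash, or a conservative subset
// of branch-name characters; anything else is rejected before it ever
// reaches git or nixos-rebuild.
var (
  commitRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)
  branchRe = regexp.MustCompile(`^[A-Za-z0-9._/-]{1,100}$`)
)

func validRevision(rev string) bool {
  if commitRe.MatchString(rev) {
    return true
  }
  // Reject a leading '-' so the value can never be parsed as a flag.
  return branchRe.MatchString(rev) && rev[0] != '-'
}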

Decisions

All open questions have been resolved; see the Notes section for decision rationale.

Notes

  • The existing nixos-rebuild-test helper provides a good reference for the rebuild logic
  • Uses NATS request/reply pattern for immediate validation feedback and completion status
  • Consider using NATS headers for metadata (request ID, timestamp)
  • Timeout decision: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times.
  • Rollback: Not needed as a separate feature - deploying an older commit hash effectively rolls back.
  • Offline hosts: No message persistence - if a host is offline, the deploy fails. The daily auto-upgrade is the safety net. This avoids the complexity of JetStream deduplication (a host coming back online and applying 10 queued updates instead of just the latest).
  • Deploy history: Use the existing Loki setup - the listener logs deployments to journald, which is queryable via Loki. No need for separate JetStream persistence.
  • Naming: homelab-deploy - ties it to the infrastructure rather than implementation details.