NATS-Based Deployment Service

Overview

Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.

Goals

  1. On-demand deployment - Trigger config updates immediately via NATS message
  2. Targeted deployment - Deploy to specific hosts or all hosts
  3. Branch/revision support - Test feature branches before merging to master
  4. MCP integration - Allow Claude Code to trigger deployments during development

Current State

  • Auto-upgrade: All hosts run nixos-upgrade.service daily, pulling from master
  • Manual testing: nixos-rebuild-test <action> <branch> helper exists on all hosts
  • NATS: Running on nats1 with JetStream enabled, using NKey authentication
  • Accounts: ADMIN (system) and HOMELAB (user workloads with JetStream)

Architecture

┌─────────────┐                        ┌─────────────┐
│  MCP Tool   │  deploy.test.>         │  Admin CLI  │  deploy.test.> + deploy.prod.>
│  (claude)   │────────────┐     ┌─────│  (torjus)   │
└─────────────┘            │     │     └─────────────┘
                           ▼     ▼
                      ┌──────────────┐
                      │    nats1     │
                      │  (authz)     │
                      └──────┬───────┘
                             │
           ┌─────────────────┼─────────────────┐
           │                 │                 │
           ▼                 ▼                 ▼
     ┌──────────┐      ┌──────────┐      ┌──────────┐
     │ template1│      │   ns1    │      │   ha1    │
     │ tier=test│      │ tier=prod│      │ tier=prod│
     └──────────┘      └──────────┘      └──────────┘

Repository Structure

The project lives in a separate repository (e.g., homelab-deploy) containing:

homelab-deploy/
├── flake.nix           # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go     # CLI entrypoint with subcommands
├── internal/
│   ├── listener/       # Listener mode logic
│   ├── mcp/            # MCP server mode logic
│   └── deploy/         # Shared deployment logic
└── nixos/
    └── module.nix      # NixOS module for listener service

The nixos-servers repo imports the homelab-deploy flake as an input and uses its NixOS module.

Single Binary with Subcommands

The homelab-deploy binary supports multiple modes:

# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222

# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222

# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch
homelab-deploy deploy --all --action boot
homelab-deploy status
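
A minimal sketch of the subcommand dispatch in cmd/homelab-deploy/main.go. The run* functions are stand-ins for the real entrypoints in internal/listener, internal/mcp and internal/deploy:

package main

import (
    "errors"
    "fmt"
    "os"
)

// Stand-ins for the real implementations in internal/listener, internal/mcp
// and internal/deploy.
func runListener(args []string) error { return errors.New("listener: not implemented") }
func runMCP(args []string) error      { return errors.New("mcp: not implemented") }
func runDeploy(args []string) error   { return errors.New("deploy: not implemented") }
func runStatus(args []string) error   { return errors.New("status: not implemented") }

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: homelab-deploy <listener|mcp|deploy|status> [flags]")
        os.Exit(2)
    }

    // Each mode parses its own flags (--hostname, --nats-url, --branch, ...).
    var err error
    switch os.Args[1] {
    case "listener":
        err = runListener(os.Args[2:])
    case "mcp":
        err = runMCP(os.Args[2:])
    case "deploy":
        err = runDeploy(os.Args[2:])
    case "status":
        err = runStatus(os.Args[2:])
    default:
        err = fmt.Errorf("unknown subcommand %q", os.Args[1])
    }
    if err != nil {
        fmt.Fprintln(os.Stderr, "homelab-deploy:", err)
        os.Exit(1)
    }
}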

Components

Listener Mode

A systemd service on each host that:

  • Subscribes to deploy.<tier>.<hostname> and deploy.<tier>.all subjects
  • Validates incoming messages (revision, action)
  • Executes nixos-rebuild with specified parameters
  • Reports status back via NATS

NixOS module configuration:

services.homelab-deploy.listener = {
  enable = true;
  timeout = 600;  # seconds, default 10 minutes
};

The listener reads its tier from config.homelab.host.tier (see Host Metadata below) and subscribes to tier-specific subjects (e.g., deploy.prod.ns1 and deploy.prod.all).
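
A sketch of the subscription setup in listener mode using the nats.go client. Flag parsing, authentication and reconnect handling are omitted, and the hard-coded values stand in for the --hostname/--nats-url flags and the tier passed in by the NixOS module:

package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    // Normally supplied via flags / the NixOS module; hard-coded for the sketch.
    hostname, tier, natsURL := "ns1", "prod", "nats://nats1:4222"

    nc, err := nats.Connect(natsURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    handler := func(msg *nats.Msg) {
        // Validation and nixos-rebuild execution happen here
        // (see the request/reply flow below).
        log.Printf("deploy request on %s: %s", msg.Subject, msg.Data)
    }

    // Host-specific subject, e.g. deploy.prod.ns1
    if _, err := nc.Subscribe(fmt.Sprintf("deploy.%s.%s", tier, hostname), handler); err != nil {
        log.Fatal(err)
    }
    // Tier-wide broadcast subject, e.g. deploy.prod.all
    if _, err := nc.Subscribe(fmt.Sprintf("deploy.%s.all", tier), handler); err != nil {
        log.Fatal(err)
    }

    select {} // keep running; systemd manages the process lifecycle
}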

Request message format:

{
  "action": "switch" | "boot" | "test" | "dry-activate",
  "revision": "master" | "feature-branch" | "abc123...",
  "reply_to": "deploy.responses.<request-id>"
}

Response message format:

{
  "status": "accepted" | "rejected" | "started" | "completed" | "failed",
  "error": "invalid_revision" | "already_running" | "build_failed" | null,
  "message": "human-readable details"
}
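
The same payloads expressed as Go types (a sketch; field names mirror the JSON above, with the null error represented by an empty string):

package deploy

// DeployRequest is published to deploy.<tier>.<hostname> or deploy.<tier>.all.
type DeployRequest struct {
    Action   string `json:"action"`   // "switch", "boot", "test" or "dry-activate"
    Revision string `json:"revision"` // branch name or commit hash
    ReplyTo  string `json:"reply_to"` // deploy.responses.<request-id>
}

// DeployResponse is published to the request's reply_to subject: once when the
// request is accepted/rejected/started, and again when the rebuild finishes.
type DeployResponse struct {
    Status  string `json:"status"`          // "accepted", "rejected", "started", "completed" or "failed"
    Error   string `json:"error,omitempty"` // "invalid_revision", "already_running", "build_failed"
    Message string `json:"message"`         // human-readable details
}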

Request/Reply flow:

  1. MCP/CLI sends deploy request with unique reply_to subject
  2. Listener validates request (e.g., git ls-remote to check revision exists)
  3. Listener sends immediate response:
    • {"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}, or
    • {"status": "started", "message": "starting nixos-rebuild switch"}
  4. If started, listener runs nixos-rebuild
  5. Listener sends final response:
    • {"status": "completed", "message": "successfully switched to generation 42"}, or
    • {"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}

This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.
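
A sketch of that flow as the listener's message handler, reusing the DeployRequest/DeployResponse types above. The repository URL, the ref handling (branches only) and the fixed timeout are simplifications, not final choices:

package deploy

import (
    "context"
    "encoding/json"
    "fmt"
    "os/exec"
    "time"

    "github.com/nats-io/nats.go"
)

func handleDeploy(nc *nats.Conn, msg *nats.Msg) {
    var req DeployRequest
    if err := json.Unmarshal(msg.Data, &req); err != nil {
        return // malformed request: log and drop
    }
    respond := func(r DeployResponse) {
        data, _ := json.Marshal(r)
        nc.Publish(req.ReplyTo, data)
    }

    // Step 2: validate before doing any work. git ls-remote only resolves
    // refs (branches/tags); commit hashes would need a different check.
    check := exec.Command("git", "ls-remote", "--exit-code",
        "https://git.example.com/nixos-servers.git", req.Revision)
    if err := check.Run(); err != nil {
        respond(DeployResponse{Status: "rejected", Error: "invalid_revision",
            Message: fmt.Sprintf("revision %q not found", req.Revision)})
        return
    }

    // Step 3: immediate feedback, then step 4: run the rebuild with a timeout.
    respond(DeployResponse{Status: "started",
        Message: fmt.Sprintf("starting nixos-rebuild %s", req.Action)})

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
    defer cancel()
    // Without an explicit attribute, nixos-rebuild builds the configuration
    // matching the local hostname.
    flakeRef := fmt.Sprintf("git+https://git.example.com/nixos-servers.git?ref=%s", req.Revision)
    cmd := exec.CommandContext(ctx, "nixos-rebuild", req.Action, "--flake", flakeRef)

    // Step 5: final status.
    if err := cmd.Run(); err != nil {
        respond(DeployResponse{Status: "failed", Error: "build_failed",
            Message: fmt.Sprintf("nixos-rebuild exited with error: %v", err)})
        return
    }
    respond(DeployResponse{Status: "completed",
        Message: fmt.Sprintf("nixos-rebuild %s finished", req.Action)})
}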

MCP Mode

Runs as an MCP server providing tools for Claude Code:

  • deploy - Deploy to specific host(s) with optional revision
  • deploy_status - Check deployment status/history
  • list_hosts - List available deployment targets

The MCP server runs with limited credentials (test-tier only), so Claude can deploy to test hosts but not production.
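
Both the MCP tools and the CLI share the same client-side pattern: subscribe to a unique reply subject, publish the request, and relay responses until a terminal status arrives. A sketch of that shared logic (the MCP protocol framing is left to whichever MCP library is chosen; the timestamp-based request ID is a stand-in for a proper UUID):

package deploy

import (
    "encoding/json"
    "fmt"
    "time"

    "github.com/nats-io/nats.go"
)

// requestDeploy publishes a deploy request for one host and prints each
// response ("started", then "completed"/"failed") until the deployment ends.
func requestDeploy(nc *nats.Conn, tier, host, action, revision string) error {
    replyTo := fmt.Sprintf("deploy.responses.%d", time.Now().UnixNano())

    sub, err := nc.SubscribeSync(replyTo)
    if err != nil {
        return err
    }
    defer sub.Unsubscribe()

    req, err := json.Marshal(DeployRequest{Action: action, Revision: revision, ReplyTo: replyTo})
    if err != nil {
        return err
    }
    if err := nc.Publish(fmt.Sprintf("deploy.%s.%s", tier, host), req); err != nil {
        return err
    }

    for {
        msg, err := sub.NextMsg(15 * time.Minute) // generous: covers the rebuild itself
        if err != nil {
            return fmt.Errorf("no response from %s: %w", host, err)
        }
        var resp DeployResponse
        if err := json.Unmarshal(msg.Data, &resp); err != nil {
            continue
        }
        fmt.Printf("%s: %s - %s\n", host, resp.Status, resp.Message)
        switch resp.Status {
        case "rejected", "completed", "failed":
            return nil
        }
    }
}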

Tiered Permissions

Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:

NATS user configuration (on nats1):

accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC...";  # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ...";  # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF...";  # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};
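
On the Go side, authenticating with one of these NKeys is a single option in nats.go, given the seed file that the Vault integration writes out (the path is illustrative):

package deploy

import "github.com/nats-io/nats.go"

// connectWithNKey connects to NATS using an NKey seed file provisioned from
// Vault; NkeyOptionFromSeed reads the seed and signs the server's auth challenge.
func connectWithNKey(url, seedFile string) (*nats.Conn, error) {
    opt, err := nats.NkeyOptionFromSeed(seedFile)
    if err != nil {
        return nil, err
    }
    return nats.Connect(url, opt)
}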

Host tier assignments (via homelab.host.tier):

Tier   Hosts
test   template1, nix-cache01, future test hosts
prod   ns1, ns2, ha1, monitoring01, http-proxy, etc.

How it works:

  1. MCP tries to deploy to ns1 → publishes to deploy.prod.ns1
  2. NATS server rejects publish (mcp-deployer lacks deploy.prod.> permission)
  3. MCP tries to deploy to template1 → publishes to deploy.test.template1
  4. NATS allows it, listener receives and executes
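
One subtlety, assuming nats.go on the client side: the server reports permission violations asynchronously, so the denied publish in step 2 does not itself return an error; the violation surfaces via the connection's error handler, and the deploy request simply times out waiting for a response. Registering a handler makes this visible:

package deploy

import (
    "log"

    "github.com/nats-io/nats.go"
)

// connectWithErrorLogging registers an async error handler so server-side
// errors such as publish permission violations end up in the logs instead
// of being silently dropped.
func connectWithErrorLogging(url string, opts ...nats.Option) (*nats.Conn, error) {
    opts = append(opts, nats.ErrorHandler(func(_ *nats.Conn, _ *nats.Subscription, err error) {
        // e.g. "Permissions Violation for Publish to \"deploy.prod.ns1\""
        log.Printf("nats async error: %v", err)
    }))
    return nats.Connect(url, opts...)
}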

All NKeys are stored in Vault: the MCP server gets the limited test-tier credentials, while the admin CLI gets the full-access credentials.

Host Metadata

Rather than defining tier in the listener config, use a central homelab.host module that provides host metadata for multiple consumers. This aligns with the approach proposed in docs/plans/prometheus-scrape-target-labels.md.

Module definition (in modules/homelab/host.nix):

options.homelab.host = {
  tier = lib.mkOption {
    type = lib.types.enum [ "test" "prod" ];
    default = "prod";
    description = "Deployment tier - controls which credentials can deploy to this host";
  };

  priority = lib.mkOption {
    type = lib.types.enum [ "high" "low" ];
    default = "high";
    description = "Alerting priority - low priority hosts have relaxed thresholds";
  };

  role = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
    description = "Primary role of this host (dns, database, monitoring, etc.)";
  };

  labels = lib.mkOption {
    type = lib.types.attrsOf lib.types.str;
    default = { };
    description = "Additional free-form labels";
  };
};

Consumers:

  • homelab-deploy listener reads config.homelab.host.tier for subject subscription
  • Prometheus scrape config reads priority, role, labels for target labels
  • Future services can consume the same metadata

Example host config:

# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";      # can be deployed by MCP
  priority = "low";   # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod";      # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};

Implementation Steps

Phase 1: Core Binary + Listener

  1. Create homelab-deploy repository

    • Initialize Go module
    • Set up flake.nix with Go package build
  2. Implement listener mode

    • NATS subscription logic
    • nixos-rebuild execution
    • Status reporting via NATS reply
  3. Create NixOS module

    • Systemd service definition
    • Configuration options (hostname, NATS URL, NKey path)
    • Vault secret integration for NKeys
  4. Create homelab.host module (in nixos-servers)

    • Define tier, priority, role, labels options
    • This module is shared with Prometheus label work (see docs/plans/prometheus-scrape-target-labels.md)
  5. Integrate with nixos-servers

    • Add flake input for homelab-deploy
    • Import listener module in system/
    • Set homelab.host.tier per host (test vs prod)
  6. Configure NATS tiered permissions

    • Add deployer users to nats1 config (mcp-deployer, admin-deployer)
    • Set up subject ACLs per user (test-only vs full access)
    • Add deployer NKeys to Vault
    • Create Terraform resources for NKey secrets

Phase 2: MCP + CLI

  1. Implement MCP mode

    • MCP server with deploy/status tools
    • Request/reply pattern for deployment feedback
  2. Implement CLI commands

    • deploy command for manual deployments
    • status command to check deployment state
  3. Configure Claude Code

    • Add MCP server to configuration
    • Document usage

Phase 3: Enhancements

  1. Add deployment locking (prevent concurrent deploys) - see the sketch below
  2. Prometheus metrics for deployment status
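
For the locking in step 1, a per-process guard is probably sufficient since each host runs a single listener; a sketch using sync.Mutex.TryLock (Go 1.18+), mapping a busy lock to the already_running error from the response protocol:

package deploy

import (
    "errors"
    "sync"
)

// ErrAlreadyRunning maps to the "already_running" error in deploy responses.
var ErrAlreadyRunning = errors.New("already_running")

// deployLock ensures only one nixos-rebuild runs at a time; a second request
// is rejected immediately rather than queued.
var deployLock sync.Mutex

func tryStartDeploy(run func() error) error {
    if !deployLock.TryLock() {
        return ErrAlreadyRunning
    }
    defer deployLock.Unlock()
    return run()
}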

Security Considerations

  • Privilege escalation: Listener runs as root to execute nixos-rebuild
  • Input validation: Strictly validate the revision format (branch name or commit hash) - see the sketch after this list
  • Rate limiting: Prevent rapid-fire deployments
  • Audit logging: Log all deployment requests with source identity
  • Network isolation: NATS only accessible from internal network
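
A sketch of the revision validation called out above, using a conservative allow-list so nothing unexpected reaches git or nixos-rebuild (the exact character set and length limits are assumptions to refine):

package deploy

import (
    "regexp"
    "strings"
)

var (
    commitRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)         // abbreviated or full commit hash
    branchRe = regexp.MustCompile(`^[A-Za-z0-9._/-]{1,100}$`) // simple branch names only
)

// validRevision rejects anything that is not a plain branch name or a hex
// commit hash before it is passed to git or nixos-rebuild.
func validRevision(rev string) bool {
    if strings.HasPrefix(rev, "-") {
        return false // would otherwise be parsed as a flag by git
    }
    return commitRe.MatchString(rev) || branchRe.MatchString(rev)
}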

Decisions

All open questions have been resolved. See Notes section for decision rationale.

Notes

  • The existing nixos-rebuild-test helper provides a good reference for the rebuild logic
  • Uses NATS request/reply pattern for immediate validation feedback and completion status
  • Consider using NATS headers for metadata (request ID, timestamp)
  • Timeout decision: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times.
  • Rollback: Not needed as a separate feature - deploying an older commit hash effectively rolls back.
  • Offline hosts: No message persistence - if a host is offline, the deploy fails and the daily auto-upgrade acts as the safety net. This avoids the complexity of JetStream deduplication (a host coming back online and applying 10 queued updates instead of just the latest).
  • Deploy history: Use existing Loki - listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
  • Naming: homelab-deploy - ties it to the infrastructure rather than implementation details.