This repository has been archived on 2026-03-09.
homelab-deploy/design.md
Torjus Håkestad 1f23a6ddc9 docs: update design with configurable subjects and improved module
- Add configurable NATS subject patterns with template variables
  (<hostname>, <tier>, <role>) for multi-tenant setups
- Add deploy.discover subject for host discovery
- Simplify CLI to use direct subjects with optional aliases via
  HOMELAB_DEPLOY_ALIAS_* environment variables
- Clarify request/reply flow with UUID-based response subjects
- Expand NixOS module with hardening options, package option,
  and configurable deploy/discover subjects
- Switch CLI framework from cobra to urfave/cli/v3

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 03:52:01 +01:00


homelab-deploy Design Document

A message-based deployment system for NixOS configurations, built on NATS. The homelab-deploy binary runs in multiple modes to enable on-demand NixOS configuration updates across a fleet of hosts.

Overview

The homelab-deploy binary provides three operational modes:

  1. Listener mode - Runs on each NixOS host as a systemd service, subscribing to NATS subjects and executing nixos-rebuild when deployment requests arrive
  2. MCP mode - Runs as an MCP (Model Context Protocol) server, exposing deployment tools for AI assistants
  3. CLI mode - Manual deployment commands for administrators

Architecture

┌─────────────┐                        ┌─────────────┐
│  MCP Tool   │  deploy.test.>         │  Admin CLI  │  deploy.test.> + deploy.prod.>
│             │────────────┐     ┌─────│             │
└─────────────┘            │     │     └─────────────┘
                           ▼     ▼
                      ┌──────────────┐
                      │ NATS Server  │
                      │  (authz)     │
                      └──────┬───────┘
                             │
           ┌─────────────────┼─────────────────┐
           │                 │                 │
           ▼                 ▼                 ▼
     ┌──────────┐      ┌──────────┐      ┌──────────┐
     │  host-a  │      │  host-b  │      │  host-c  │
     │ tier=test│      │ tier=prod│      │ tier=prod│
     └──────────┘      └──────────┘      └──────────┘

Repository Structure

homelab-deploy/
├── flake.nix           # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go     # CLI entrypoint with subcommands
├── internal/
│   ├── listener/       # Listener mode logic
│   ├── mcp/            # MCP server mode logic
│   ├── nats/           # NATS client wrapper
│   └── deploy/         # Shared deployment execution logic
└── nixos/
    └── module.nix      # NixOS module for listener service

CLI Interface

# Listener mode (runs as systemd service on each host)
homelab-deploy listener \
  --hostname <hostname> \
  --tier <test|prod> \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/listener.nkey \
  --flake-url <git+https://...> \
  [--role <role>] \
  [--timeout 600] \
  [--deploy-subject <subject>]... \
  [--discover-subject <subject>]

# Subject flags can be repeated and use template variables:
homelab-deploy listener \
  --hostname ns1 \
  --tier prod \
  --role dns \
  --deploy-subject "deploy.<tier>.<hostname>" \
  --deploy-subject "deploy.<tier>.all" \
  --deploy-subject "deploy.<tier>.role.<role>" \
  --discover-subject "deploy.discover" \
  ...

# MCP server mode (for AI assistants)
homelab-deploy mcp \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/mcp.nkey \
  [--enable-admin --admin-nkey-file /path/to/admin.nkey]

# CLI commands for manual use
# Deploy to a specific subject
homelab-deploy deploy <subject> \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/deployer.nkey \
  [--branch <branch>] \
  [--action <switch|boot|test|dry-activate>]

# Examples:
homelab-deploy deploy deploy.prod.ns1       # Deploy to specific host
homelab-deploy deploy deploy.test.all       # Deploy to all test hosts
homelab-deploy deploy deploy.prod.role.dns  # Deploy to all prod DNS hosts

# Using aliases (configured via environment variables)
homelab-deploy deploy test                  # Expands to configured subject
homelab-deploy deploy prod-dns              # Expands to configured subject

CLI Subject Aliases

The CLI supports subject aliases via environment variables. If the <subject> argument doesn't look like a NATS subject (no dots), the CLI checks for an alias.

Environment variable format: HOMELAB_DEPLOY_ALIAS_<NAME>=<subject>

export HOMELAB_DEPLOY_ALIAS_TEST="deploy.test.all"
export HOMELAB_DEPLOY_ALIAS_PROD="deploy.prod.all"
export HOMELAB_DEPLOY_ALIAS_PROD_DNS="deploy.prod.role.dns"

# Now these work:
homelab-deploy deploy test       # -> deploy.test.all
homelab-deploy deploy prod       # -> deploy.prod.all
homelab-deploy deploy prod-dns   # -> deploy.prod.role.dns

Alias names are case-insensitive and hyphens are converted to underscores when looking up the environment variable.

NATS Subject Structure

Subjects follow the pattern deploy.<tier>.<target> by default, but are fully configurable:

Subject Pattern              Description
deploy.<tier>.<hostname>     Deploy to a specific host (e.g., deploy.prod.ns1)
deploy.<tier>.all            Deploy to all hosts in a tier (e.g., deploy.test.all)
deploy.<tier>.role.<role>    Deploy to hosts with a given role in a tier (e.g., deploy.prod.role.dns)
deploy.responses.<uuid>      Response subject for request/reply (UUID generated by the CLI)
deploy.discover              Host discovery requests

Subject Customization

Listeners can configure custom subject patterns using template variables:

  • <hostname> - The listener's hostname
  • <tier> - The listener's tier (test/prod)
  • <role> - The listener's role (if configured)

This allows prefixing subjects for multi-tenant setups (e.g., homelab.deploy.<tier>.<hostname>).

Listener Mode

Responsibilities

  1. Connect to NATS using NKey authentication
  2. Subscribe to configured deploy subjects (with template expansion)
  3. Subscribe to discovery subject and respond with host metadata
  4. Validate incoming deployment requests
  5. Execute nixos-rebuild with the specified parameters
  6. Report status back via NATS reply subject

Subject Subscriptions

Listeners subscribe to a configurable list of subjects. The configuration uses template variables that are expanded at runtime:

listener:
  hostname: ns1
  tier: prod
  role: dns

  deploy_subjects:
    - "deploy.<tier>.<hostname>"
    - "deploy.<tier>.all"
    - "deploy.<tier>.role.<role>"

  discover_subject: "deploy.discover"

Template variables:

  • <hostname> - Replaced with the configured hostname
  • <tier> - Replaced with the configured tier
  • <role> - Replaced with the configured role (subject skipped if role is null)

Example: With the above configuration, the listener subscribes to:

  • deploy.prod.ns1
  • deploy.prod.all
  • deploy.prod.role.dns
  • deploy.discover

Prefixed example: For multi-tenant setups:

listener:
  hostname: ns1
  tier: prod
  deploy_subjects:
    - "homelab.deploy.<tier>.<hostname>"
    - "homelab.deploy.<tier>.all"
  discover_subject: "homelab.deploy.discover"
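The template expansion above is a simple string substitution; a minimal sketch (expandSubjects is an illustrative name), including the rule that role-based patterns are skipped when no role is configured:

```go
package main

import (
	"fmt"
	"strings"
)

// expandSubjects expands <hostname>, <tier>, and <role> in each configured
// subject pattern. Patterns referencing <role> are dropped when no role is
// set, matching the behaviour described above.
func expandSubjects(patterns []string, hostname, tier, role string) []string {
	var out []string
	for _, p := range patterns {
		if strings.Contains(p, "<role>") && role == "" {
			continue // no role configured: skip role-based subjects
		}
		r := strings.NewReplacer("<hostname>", hostname, "<tier>", tier, "<role>", role)
		out = append(out, r.Replace(p))
	}
	return out
}

func main() {
	patterns := []string{
		"deploy.<tier>.<hostname>",
		"deploy.<tier>.all",
		"deploy.<tier>.role.<role>",
	}
	fmt.Println(expandSubjects(patterns, "ns1", "prod", "dns"))
	// [deploy.prod.ns1 deploy.prod.all deploy.prod.role.dns]
}
```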

Message Formats

Request message:

{
  "action": "switch",
  "revision": "master",
  "reply_to": "deploy.responses.abc123"
}
Field     Type    Required  Description
action    string  yes       One of: switch, boot, test, dry-activate
revision  string  yes       Git branch name or commit hash
reply_to  string  yes       Subject to publish responses to

Response message:

{
  "hostname": "ns1",
  "status": "completed",
  "error": null,
  "message": "Successfully switched to generation 42"
}
Field     Type            Description
hostname  string          The responding host's name
status    string          One of: accepted, rejected, started, completed, failed
error     string or null  Error code if status is rejected or failed
message   string          Human-readable details

Error codes:

  • invalid_revision - The specified branch/commit does not exist
  • invalid_action - The action is not recognized
  • already_running - A deployment is already in progress on this host
  • build_failed - nixos-rebuild exited with non-zero status
  • timeout - Deployment exceeded the configured timeout

Request/Reply Flow

  1. CLI generates a UUID for the request (e.g., 550e8400-e29b-41d4-a716-446655440000)
  2. CLI subscribes to deploy.responses.<uuid>
  3. CLI publishes deploy request to target subject with reply_to: "deploy.responses.<uuid>"
  4. Listener validates request:
    • Checks revision exists using git ls-remote
    • Checks no other deployment is running
  5. Listener publishes response to the reply_to subject:
    • {"status": "rejected", ...} if validation fails, or
    • {"status": "started", ...} if deployment begins
  6. If started, listener executes nixos-rebuild
  7. Listener publishes final response to the same reply_to subject:
    • {"status": "completed", ...} on success, or
    • {"status": "failed", ...} on failure
  8. CLI receives responses and displays progress/results
  9. CLI unsubscribes after receiving final status or timeout

Deployment Execution

The listener executes nixos-rebuild with the following command pattern:

nixos-rebuild <action> --flake <flake-url>?ref=<revision>#<hostname>

Where:

  • <action> is one of: switch, boot, test, dry-activate
  • <flake-url> is the configured git flake URL (e.g., git+https://git.example.com/user/nixos-configs.git)
  • <revision> is the branch name or commit hash from the request
  • <hostname> is the listener's configured hostname

Environment requirements:

  • Must run as root (nixos-rebuild requires root)
  • Nix must be configured with proper git credentials if the flake is private
  • Network access to the git repository
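A sketch of the execution step under the assumptions above (rebuildArgs and runRebuild are hypothetical names). Using exec.CommandContext avoids shell interpolation entirely and kills the process when the deadline passes:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// rebuildArgs assembles the nixos-rebuild arguments from the command pattern
// above: <action> --flake <flake-url>?ref=<revision>#<hostname>
func rebuildArgs(action, flakeURL, revision, hostname string) []string {
	flakeRef := fmt.Sprintf("%s?ref=%s#%s", flakeURL, revision, hostname)
	return []string{action, "--flake", flakeRef}
}

// runRebuild executes nixos-rebuild with the configured timeout; the process
// is killed if the context deadline passes.
func runRebuild(action, flakeURL, revision, hostname string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, "nixos-rebuild", rebuildArgs(action, flakeURL, revision, hostname)...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("nixos-rebuild failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	fmt.Println(rebuildArgs("switch", "git+https://git.example.com/user/nixos-configs.git", "master", "ns1"))
}
```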

Concurrency Control

Only one deployment may run at a time per host. The listener maintains a simple lock:

  • Before starting a deployment, acquire lock
  • If lock is held, reject with already_running error
  • Release lock when deployment completes (success or failure)
  • Lock should be in-memory (no persistence needed - restarts clear it)
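Because the lock is in-memory and per-process, it can be as small as a sync.Mutex with TryLock (Go 1.18+). A sketch mirroring the rules above (deployLock is an illustrative name):

```go
package main

import (
	"fmt"
	"sync"
)

// deployLock is the in-memory lock described above: acquire either succeeds
// immediately or reports that a deployment is already in progress.
type deployLock struct{ mu sync.Mutex }

func (l *deployLock) acquire() bool { return l.mu.TryLock() }
func (l *deployLock) release()      { l.mu.Unlock() }

func main() {
	var lock deployLock
	fmt.Println(lock.acquire()) // true: deployment may start
	fmt.Println(lock.acquire()) // false: respond with already_running
	lock.release()
	fmt.Println(lock.acquire()) // true again after release
}
```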

Logging

All deployment events should be logged to stdout/stderr (captured by systemd journal):

  • Request received (with subject, action, revision)
  • Validation result
  • Deployment start
  • Deployment completion (with exit code)
  • Any errors

This enables integration with log aggregation systems (e.g., Loki via Promtail).

MCP Mode

Purpose

Exposes deployment functionality as MCP tools for AI assistants (e.g., Claude Code).

Tools

Tool          Description                                    Parameters
deploy        Deploy to test-tier hosts                      hostname or all, optional role, branch, action
deploy_admin  Deploy to any tier (requires --enable-admin)   tier, hostname or all, optional role, branch, action
list_hosts    List available deployment targets              tier (optional)

Tool Schemas

deploy:

{
  "name": "deploy",
  "description": "Deploy NixOS configuration to test-tier hosts",
  "inputSchema": {
    "type": "object",
    "properties": {
      "hostname": {
        "type": "string",
        "description": "Target hostname, or omit to use 'all' or 'role' targeting"
      },
      "all": {
        "type": "boolean",
        "description": "Deploy to all test-tier hosts"
      },
      "role": {
        "type": "string",
        "description": "Deploy to all test-tier hosts with this role"
      },
      "branch": {
        "type": "string",
        "description": "Git branch or commit to deploy (default: master)"
      },
      "action": {
        "type": "string",
        "enum": ["switch", "boot", "test", "dry-activate"],
        "description": "nixos-rebuild action (default: switch)"
      }
    }
  }
}

deploy_admin:

{
  "name": "deploy_admin",
  "description": "Deploy NixOS configuration to any host (admin access required)",
  "inputSchema": {
    "type": "object",
    "properties": {
      "tier": {
        "type": "string",
        "enum": ["test", "prod"],
        "description": "Target tier"
      },
      "hostname": {
        "type": "string",
        "description": "Target hostname, or omit to use 'all' or 'role' targeting"
      },
      "all": {
        "type": "boolean",
        "description": "Deploy to all hosts in tier"
      },
      "role": {
        "type": "string",
        "description": "Deploy to all hosts with this role in tier"
      },
      "branch": {
        "type": "string",
        "description": "Git branch or commit to deploy (default: master)"
      },
      "action": {
        "type": "string",
        "enum": ["switch", "boot", "test", "dry-activate"],
        "description": "nixos-rebuild action (default: switch)"
      }
    },
    "required": ["tier"]
  }
}

list_hosts:

{
  "name": "list_hosts",
  "description": "List available deployment targets",
  "inputSchema": {
    "type": "object",
    "properties": {
      "tier": {
        "type": "string",
        "enum": ["test", "prod"],
        "description": "Filter by tier (optional)"
      }
    }
  }
}

Security Layers

  1. MCP flag: deploy_admin tool only registered when --enable-admin is passed
  2. NATS authz: MCP credentials can only publish to authorized subjects
  3. AI assistant permissions: The assistant's configuration can require confirmation for admin operations

Multi-Host Deployments

When deploying to multiple hosts (via all or role), the MCP should:

  1. Publish the request to the appropriate broadcast subject
  2. Collect responses from all responding hosts
  3. Return aggregated results showing each host's status

Timeout handling:

  • Set a reasonable timeout for collecting responses (e.g., 30 seconds after last response, or max 15 minutes)
  • Return partial results if some hosts don't respond
  • Indicate which hosts did not respond

Host Discovery

The list_hosts tool needs to know available hosts. Options:

  1. Static configuration: Read from a config file or environment variable
  2. NATS request: Publish to a discovery subject and collect responses from listeners

Recommend option 2: Listeners subscribe to their configured discover_subject and respond with metadata.

Discovery request:

{
  "reply_to": "deploy.responses.discover-abc123"
}

Discovery response:

{
  "hostname": "ns1",
  "tier": "prod",
  "role": "dns",
  "deploy_subjects": [
    "deploy.prod.ns1",
    "deploy.prod.all",
    "deploy.prod.role.dns"
  ]
}

The response includes the expanded deploy_subjects so clients know exactly which subjects reach this host.

NixOS Module

The NixOS module configures the listener as a systemd service with appropriate hardening.

Module Options

{
  options.services.homelab-deploy.listener = {
    enable = lib.mkEnableOption "homelab-deploy listener service";

    package = lib.mkPackageOption pkgs "homelab-deploy" { };

    hostname = lib.mkOption {
      type = lib.types.str;
      default = config.networking.hostName;
      description = "Hostname for this listener (used in subject templates)";
    };

    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      description = "Deployment tier for this host";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Role for role-based deployment targeting";
    };

    natsUrl = lib.mkOption {
      type = lib.types.str;
      description = "NATS server URL";
      example = "nats://nats.example.com:4222";
    };

    nkeyFile = lib.mkOption {
      type = lib.types.path;
      description = "Path to NKey seed file for NATS authentication";
      example = "/run/secrets/homelab-deploy-nkey";
    };

    flakeUrl = lib.mkOption {
      type = lib.types.str;
      description = "Git flake URL for nixos-rebuild";
      example = "git+https://git.example.com/user/nixos-configs.git";
    };

    timeout = lib.mkOption {
      type = lib.types.int;
      default = 600;
      description = "Deployment timeout in seconds";
    };

    deploySubjects = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [
        "deploy.<tier>.<hostname>"
        "deploy.<tier>.all"
        "deploy.<tier>.role.<role>"
      ];
      description = ''
        List of NATS subjects to subscribe to for deployment requests.
        Template variables: <hostname>, <tier>, <role>
      '';
    };

    discoverSubject = lib.mkOption {
      type = lib.types.str;
      default = "deploy.discover";
      description = "NATS subject for host discovery requests";
    };

    environment = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional environment variables for the service";
      example = { GIT_SSH_COMMAND = "ssh -i /run/secrets/deploy-key"; };
    };
  };
}

Systemd Service

The module creates a hardened systemd service:

systemd.services.homelab-deploy-listener = {
  description = "homelab-deploy listener";
  wantedBy = [ "multi-user.target" ];
  after = [ "network-online.target" ];
  wants = [ "network-online.target" ];

  environment = cfg.environment;

  serviceConfig = {
    Type = "simple";
    ExecStart = "${cfg.package}/bin/homelab-deploy listener ...";
    Restart = "always";
    RestartSec = 10;

    # Hardening (compatible with nixos-rebuild requirements)
    NoNewPrivileges = false;  # nixos-rebuild may need to spawn privileged processes
    ProtectSystem = false;    # nixos-rebuild modifies /nix/store and /run
    ProtectHome = "read-only";
    PrivateTmp = true;
    PrivateDevices = true;
    ProtectKernelTunables = true;
    ProtectKernelModules = true;
    ProtectControlGroups = true;
    RestrictAddressFamilies = [ "AF_UNIX" "AF_INET" "AF_INET6" ];
    RestrictNamespaces = false;  # nix build uses namespaces
    RestrictSUIDSGID = true;
    LockPersonality = true;
    MemoryDenyWriteExecute = false;  # nix may need this
    SystemCallArchitectures = "native";
  };
};

Note: Some hardening options are relaxed because nixos-rebuild requires:

  • Write access to /nix/store for building
  • Ability to activate system configurations
  • Network access for fetching from git/cache
  • Namespace support for nix sandbox builds

NATS Authentication

All NATS connections use NKey authentication. NKeys are ed25519 keypairs where:

  • The seed (private key) is stored in a file readable by the service
  • The public key is configured in the NATS server's user list

Credential Types

Credential      Purpose                 Publish Permissions                            Subscribe Permissions
listener        Host listener service   deploy.responses.>                             deploy.*.>, deploy.discover
mcp-deployer    MCP test-tier access    deploy.test.>, deploy.discover                 deploy.responses.>
admin-deployer  Full deployment access  deploy.test.>, deploy.prod.>, deploy.discover  deploy.responses.>

Note: deploy.discover appears under publish for the deployer credentials because discovery is a request they send. Listeners need an explicit subscribe permission for it, since deploy.*.> only matches subjects with at least three tokens.
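On the NATS server side, these credentials might translate into an authorization block like the following sketch (public keys are placeholders). Note that the deployer credentials need publish permission on deploy.discover to send discovery requests, and the listener needs an explicit subscribe permission for it, since deploy.*.> only matches subjects with at least three tokens:

```
authorization {
  users = [
    # Listener: answers on response subjects, receives deploys and discovery.
    { nkey: "U...LISTENER_PUBLIC_KEY"
      permissions: { publish: ["deploy.responses.>"],
                     subscribe: ["deploy.*.>", "deploy.discover"] } }

    # MCP deployer: test tier only.
    { nkey: "U...MCP_PUBLIC_KEY"
      permissions: { publish: ["deploy.test.>", "deploy.discover"],
                     subscribe: ["deploy.responses.>"] } }

    # Admin deployer: both tiers.
    { nkey: "U...ADMIN_PUBLIC_KEY"
      permissions: { publish: ["deploy.test.>", "deploy.prod.>", "deploy.discover"],
                     subscribe: ["deploy.responses.>"] } }
  ]
}
```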

Flake Structure

The flake.nix should provide:

  1. Package: The Go binary
  2. NixOS module: The listener service configuration
  3. Development shell: Go toolchain for development

{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = { self, nixpkgs }: {
    packages.x86_64-linux.default = /* Go package build */;
    packages.x86_64-linux.homelab-deploy = self.packages.x86_64-linux.default;

    nixosModules.default = import ./nixos/module.nix;
    nixosModules.homelab-deploy = self.nixosModules.default;

    devShells.x86_64-linux.default = /* Go dev shell */;
  };
}

Implementation Notes

Go Dependencies

Recommended libraries:

  • github.com/urfave/cli/v3 - CLI framework
  • github.com/nats-io/nats.go - NATS client
  • github.com/mark3labs/mcp-go - MCP server implementation
  • Standard library for JSON, logging, process execution

Error Handling

  • NATS connection errors: Retry with exponential backoff
  • nixos-rebuild failures: Capture stdout/stderr, report in response message
  • Timeout: Kill the nixos-rebuild process, report timeout error

Testing

  • Unit tests for message parsing and validation
  • Integration tests using a local NATS server
  • End-to-end tests with a NixOS VM (optional, can be done in consuming repo)

Security Considerations

  • Privilege: Listener runs as root to execute nixos-rebuild
  • Input validation: Strictly validate revision format (alphanumeric, dashes, underscores, dots, slashes for branch names; hex for commit hashes)
  • Command injection: Never interpolate user input into shell commands without validation
  • Rate limiting: Consider adding rate limiting to prevent rapid-fire deployments
  • Audit logging: Log all deployment requests with full context
  • Network isolation: NATS should only be accessible from trusted networks
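The revision validation rule above can be captured with two regular expressions; the exact patterns here are an illustrative starting point, not the spec (and since the listener uses exec rather than a shell, validation is defense in depth):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Branch/ref names: alphanumerics plus dashes, underscores, dots, slashes.
	branchRe = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]*$`)
	// Full or abbreviated commit hashes: 7-40 hex characters.
	commitRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)
)

// validRevision accepts plausible branch names and commit hashes and rejects
// anything that could smuggle shell metacharacters.
func validRevision(rev string) bool {
	return branchRe.MatchString(rev) || commitRe.MatchString(rev)
}

func main() {
	fmt.Println(validRevision("master"))           // true
	fmt.Println(validRevision("feature/new-dns"))  // true
	fmt.Println(validRevision("1f23a6ddc9"))       // true
	fmt.Println(validRevision("master; rm -rf /")) // false
}
```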

Future Enhancements

These are not required for initial implementation:

  1. Deployment locking - Cluster-wide lock to prevent fleet-wide concurrent deploys
  2. Prometheus metrics - Export deployment count, duration, success/failure rates
  3. Webhook triggers - HTTP endpoint for CI/CD integration
  4. Scheduled deployments - Deploy at specific times (though this overlaps with existing auto-upgrade)