This repository has been archived on 2026-03-09.
homelab-deploy/design.md
Torjus Håkestad 1f23a6ddc9 docs: update design with configurable subjects and improved module
- Add configurable NATS subject patterns with template variables
  (<hostname>, <tier>, <role>) for multi-tenant setups
- Add deploy.discover subject for host discovery
- Simplify CLI to use direct subjects with optional aliases via
  HOMELAB_DEPLOY_ALIAS_* environment variables
- Clarify request/reply flow with UUID-based response subjects
- Expand NixOS module with hardening options, package option,
  and configurable deploy/discover subjects
- Switch CLI framework from cobra to urfave/cli/v3

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 03:52:01 +01:00


homelab-deploy Design Document

A message-based deployment system for NixOS configurations, built on NATS. The homelab-deploy binary runs in multiple modes to enable on-demand NixOS configuration updates across a fleet of hosts.

Overview

The homelab-deploy binary provides three operational modes:

  1. Listener mode - Runs on each NixOS host as a systemd service, subscribing to NATS subjects and executing nixos-rebuild when deployment requests arrive
  2. MCP mode - Runs as an MCP (Model Context Protocol) server, exposing deployment tools for AI assistants
  3. CLI mode - Manual deployment commands for administrators

Architecture

┌─────────────┐                        ┌─────────────┐
│  MCP Tool   │  deploy.test.>         │  Admin CLI  │  deploy.test.> + deploy.prod.>
│             │────────────┐     ┌─────│             │
└─────────────┘            │     │     └─────────────┘
                           ▼     ▼
                      ┌──────────────┐
                      │ NATS Server  │
                      │  (authz)     │
                      └──────┬───────┘
                             │
           ┌─────────────────┼─────────────────┐
           │                 │                 │
           ▼                 ▼                 ▼
     ┌──────────┐      ┌──────────┐      ┌──────────┐
     │  host-a  │      │  host-b  │      │  host-c  │
     │ tier=test│      │ tier=prod│      │ tier=prod│
     └──────────┘      └──────────┘      └──────────┘

Repository Structure

homelab-deploy/
├── flake.nix           # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go     # CLI entrypoint with subcommands
├── internal/
│   ├── listener/       # Listener mode logic
│   ├── mcp/            # MCP server mode logic
│   ├── nats/           # NATS client wrapper
│   └── deploy/         # Shared deployment execution logic
└── nixos/
    └── module.nix      # NixOS module for listener service

CLI Interface

# Listener mode (runs as systemd service on each host)
homelab-deploy listener \
  --hostname <hostname> \
  --tier <test|prod> \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/listener.nkey \
  --flake-url <git+https://...> \
  [--role <role>] \
  [--timeout 600] \
  [--deploy-subject <subject>]... \
  [--discover-subject <subject>]

# Subject flags can be repeated and use template variables:
homelab-deploy listener \
  --hostname ns1 \
  --tier prod \
  --role dns \
  --deploy-subject "deploy.<tier>.<hostname>" \
  --deploy-subject "deploy.<tier>.all" \
  --deploy-subject "deploy.<tier>.role.<role>" \
  --discover-subject "deploy.discover" \
  ...

# MCP server mode (for AI assistants)
homelab-deploy mcp \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/mcp.nkey \
  [--enable-admin --admin-nkey-file /path/to/admin.nkey]

# CLI commands for manual use
# Deploy to a specific subject
homelab-deploy deploy <subject> \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/deployer.nkey \
  [--branch <branch>] \
  [--action <switch|boot|test|dry-activate>]

# Examples:
homelab-deploy deploy deploy.prod.ns1       # Deploy to specific host
homelab-deploy deploy deploy.test.all       # Deploy to all test hosts
homelab-deploy deploy deploy.prod.role.dns  # Deploy to all prod DNS hosts

# Using aliases (configured via environment variables)
homelab-deploy deploy test                  # Expands to configured subject
homelab-deploy deploy prod-dns              # Expands to configured subject

CLI Subject Aliases

The CLI supports subject aliases via environment variables. If the <subject> argument doesn't look like a NATS subject (no dots), the CLI checks for an alias.

Environment variable format: HOMELAB_DEPLOY_ALIAS_<NAME>=<subject>

export HOMELAB_DEPLOY_ALIAS_TEST="deploy.test.all"
export HOMELAB_DEPLOY_ALIAS_PROD="deploy.prod.all"
export HOMELAB_DEPLOY_ALIAS_PROD_DNS="deploy.prod.role.dns"

# Now these work:
homelab-deploy deploy test       # -> deploy.test.all
homelab-deploy deploy prod       # -> deploy.prod.all
homelab-deploy deploy prod-dns   # -> deploy.prod.role.dns

Alias names are case-insensitive and hyphens are converted to underscores when looking up the environment variable.

NATS Subject Structure

Subjects follow the pattern deploy.<tier>.<target> by default, but are fully configurable:

Subject Pattern              Description
deploy.<tier>.<hostname>     Deploy to a specific host (e.g., deploy.prod.ns1)
deploy.<tier>.all            Deploy to all hosts in a tier (e.g., deploy.test.all)
deploy.<tier>.role.<role>    Deploy to hosts with a given role in a tier (e.g., deploy.prod.role.dns)
deploy.responses.<uuid>      Response subject for request/reply (UUID generated by the CLI)
deploy.discover              Host discovery requests

Subject Customization

Listeners can configure custom subject patterns using template variables:

  • <hostname> - The listener's hostname
  • <tier> - The listener's tier (test/prod)
  • <role> - The listener's role (if configured)

This allows prefixing subjects for multi-tenant setups (e.g., homelab.deploy.<tier>.<hostname>).

Listener Mode

Responsibilities

  1. Connect to NATS using NKey authentication
  2. Subscribe to configured deploy subjects (with template expansion)
  3. Subscribe to discovery subject and respond with host metadata
  4. Validate incoming deployment requests
  5. Execute nixos-rebuild with the specified parameters
  6. Report status back via NATS reply subject

Subject Subscriptions

Listeners subscribe to a configurable list of subjects. The configuration uses template variables that are expanded at runtime:

listener:
  hostname: ns1
  tier: prod
  role: dns

  deploy_subjects:
    - "deploy.<tier>.<hostname>"
    - "deploy.<tier>.all"
    - "deploy.<tier>.role.<role>"

  discover_subject: "deploy.discover"

Template variables:

  • <hostname> - Replaced with the configured hostname
  • <tier> - Replaced with the configured tier
  • <role> - Replaced with the configured role (subject skipped if role is null)

Example: With the above configuration, the listener subscribes to:

  • deploy.prod.ns1
  • deploy.prod.all
  • deploy.prod.role.dns
  • deploy.discover

Prefixed example: For multi-tenant setups:

listener:
  hostname: ns1
  tier: prod
  deploy_subjects:
    - "homelab.deploy.<tier>.<hostname>"
    - "homelab.deploy.<tier>.all"
  discover_subject: "homelab.deploy.discover"
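The template expansion above is a simple string substitution; a minimal sketch (expandSubjects is an illustrative name), including the rule that role-based patterns are skipped when no role is configured:

```go
package main

import (
	"fmt"
	"strings"
)

// expandSubjects expands <hostname>, <tier>, and <role> in each configured
// subject pattern. Patterns referencing <role> are dropped when no role is
// set, matching the behaviour described above.
func expandSubjects(patterns []string, hostname, tier, role string) []string {
	var out []string
	for _, p := range patterns {
		if strings.Contains(p, "<role>") && role == "" {
			continue // no role configured: skip role-based subjects
		}
		r := strings.NewReplacer("<hostname>", hostname, "<tier>", tier, "<role>", role)
		out = append(out, r.Replace(p))
	}
	return out
}

func main() {
	patterns := []string{
		"deploy.<tier>.<hostname>",
		"deploy.<tier>.all",
		"deploy.<tier>.role.<role>",
	}
	fmt.Println(expandSubjects(patterns, "ns1", "prod", "dns"))
	// [deploy.prod.ns1 deploy.prod.all deploy.prod.role.dns]
}
```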

Message Formats

Request message:

{
  "action": "switch",
  "revision": "master",
  "reply_to": "deploy.responses.abc123"
}
Field     Type    Required  Description
action    string  yes       One of: switch, boot, test, dry-activate
revision  string  yes       Git branch name or commit hash
reply_to  string  yes       Subject to publish responses to

Response message:

{
  "hostname": "ns1",
  "status": "completed",
  "error": null,
  "message": "Successfully switched to generation 42"
}
Field     Type            Description
hostname  string          The responding host's name
status    string          One of: accepted, rejected, started, completed, failed
error     string or null  Error code if status is rejected or failed
message   string          Human-readable details

Error codes:

  • invalid_revision - The specified branch/commit does not exist
  • invalid_action - The action is not recognized
  • already_running - A deployment is already in progress on this host
  • build_failed - nixos-rebuild exited with non-zero status
  • timeout - Deployment exceeded the configured timeout

Request/Reply Flow

  1. CLI generates a UUID for the request (e.g., 550e8400-e29b-41d4-a716-446655440000)
  2. CLI subscribes to deploy.responses.<uuid>
  3. CLI publishes deploy request to target subject with reply_to: "deploy.responses.<uuid>"
  4. Listener validates request:
    • Checks revision exists using git ls-remote
    • Checks no other deployment is running
  5. Listener publishes response to the reply_to subject:
    • {"status": "rejected", ...} if validation fails, or
    • {"status": "started", ...} if deployment begins
  6. If started, listener executes nixos-rebuild
  7. Listener publishes final response to the same reply_to subject:
    • {"status": "completed", ...} on success, or
    • {"status": "failed", ...} on failure
  8. CLI receives responses and displays progress/results
  9. CLI unsubscribes after receiving final status or timeout

Deployment Execution

The listener executes nixos-rebuild with the following command pattern:

nixos-rebuild <action> --flake <flake-url>?ref=<revision>#<hostname>

Where:

  • <action> is one of: switch, boot, test, dry-activate
  • <flake-url> is the configured git flake URL (e.g., git+https://git.example.com/user/nixos-configs.git)
  • <revision> is the branch name or commit hash from the request
  • <hostname> is the listener's configured hostname

Environment requirements:

  • Must run as root (nixos-rebuild requires root)
  • Nix must be configured with proper git credentials if the flake is private
  • Network access to the git repository
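A sketch of the execution step under the assumptions above (rebuildArgs and runRebuild are hypothetical names). Using exec.CommandContext avoids shell interpolation entirely and kills the process when the deadline passes:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// rebuildArgs assembles the nixos-rebuild arguments from the command pattern
// above: <action> --flake <flake-url>?ref=<revision>#<hostname>
func rebuildArgs(action, flakeURL, revision, hostname string) []string {
	flakeRef := fmt.Sprintf("%s?ref=%s#%s", flakeURL, revision, hostname)
	return []string{action, "--flake", flakeRef}
}

// runRebuild executes nixos-rebuild with the configured timeout; the process
// is killed if the context deadline passes.
func runRebuild(action, flakeURL, revision, hostname string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, "nixos-rebuild", rebuildArgs(action, flakeURL, revision, hostname)...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("nixos-rebuild failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	fmt.Println(rebuildArgs("switch", "git+https://git.example.com/user/nixos-configs.git", "master", "ns1"))
}
```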

Concurrency Control

Only one deployment may run at a time per host. The listener maintains a simple lock:

  • Before starting a deployment, acquire lock
  • If lock is held, reject with already_running error
  • Release lock when deployment completes (success or failure)
  • Lock should be in-memory (no persistence needed - restarts clear it)
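Because the lock is in-memory and per-process, it can be as small as a sync.Mutex with TryLock (Go 1.18+). A sketch mirroring the rules above (deployLock is an illustrative name):

```go
package main

import (
	"fmt"
	"sync"
)

// deployLock is the in-memory lock described above: acquire either succeeds
// immediately or reports that a deployment is already in progress.
type deployLock struct{ mu sync.Mutex }

func (l *deployLock) acquire() bool { return l.mu.TryLock() }
func (l *deployLock) release()      { l.mu.Unlock() }

func main() {
	var lock deployLock
	fmt.Println(lock.acquire()) // true: deployment may start
	fmt.Println(lock.acquire()) // false: respond with already_running
	lock.release()
	fmt.Println(lock.acquire()) // true again after release
}
```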

Logging

All deployment events should be logged to stdout/stderr (captured by systemd journal):

  • Request received (with subject, action, revision)
  • Validation result
  • Deployment start
  • Deployment completion (with exit code)
  • Any errors

This enables integration with log aggregation systems (e.g., Loki via Promtail).

MCP Mode

Purpose

Exposes deployment functionality as MCP tools for AI assistants (e.g., Claude Code).

Tools

Tool          Description                                    Parameters
deploy        Deploy to test-tier hosts                      hostname or all, optional role, branch, action
deploy_admin  Deploy to any tier (requires --enable-admin)   tier, hostname or all, optional role, branch, action
list_hosts    List available deployment targets              tier (optional)

Tool Schemas

deploy:

{
  "name": "deploy",
  "description": "Deploy NixOS configuration to test-tier hosts",
  "inputSchema": {
    "type": "object",
    "properties": {
      "hostname": {
        "type": "string",
        "description": "Target hostname, or omit to use 'all' or 'role' targeting"
      },
      "all": {
        "type": "boolean",
        "description": "Deploy to all test-tier hosts"
      },
      "role": {
        "type": "string",
        "description": "Deploy to all test-tier hosts with this role"
      },
      "branch": {
        "type": "string",
        "description": "Git branch or commit to deploy (default: master)"
      },
      "action": {
        "type": "string",
        "enum": ["switch", "boot", "test", "dry-activate"],
        "description": "nixos-rebuild action (default: switch)"
      }
    }
  }
}

deploy_admin:

{
  "name": "deploy_admin",
  "description": "Deploy NixOS configuration to any host (admin access required)",
  "inputSchema": {
    "type": "object",
    "properties": {
      "tier": {
        "type": "string",
        "enum": ["test", "prod"],
        "description": "Target tier"
      },
      "hostname": {
        "type": "string",
        "description": "Target hostname, or omit to use 'all' or 'role' targeting"
      },
      "all": {
        "type": "boolean",
        "description": "Deploy to all hosts in tier"
      },
      "role": {
        "type": "string",
        "description": "Deploy to all hosts with this role in tier"
      },
      "branch": {
        "type": "string",
        "description": "Git branch or commit to deploy (default: master)"
      },
      "action": {
        "type": "string",
        "enum": ["switch", "boot", "test", "dry-activate"],
        "description": "nixos-rebuild action (default: switch)"
      }
    },
    "required": ["tier"]
  }
}

list_hosts:

{
  "name": "list_hosts",
  "description": "List available deployment targets",
  "inputSchema": {
    "type": "object",
    "properties": {
      "tier": {
        "type": "string",
        "enum": ["test", "prod"],
        "description": "Filter by tier (optional)"
      }
    }
  }
}

Security Layers

  1. MCP flag: deploy_admin tool only registered when --enable-admin is passed
  2. NATS authz: MCP credentials can only publish to authorized subjects
  3. AI assistant permissions: The assistant's configuration can require confirmation for admin operations

Multi-Host Deployments

When deploying to multiple hosts (via all or role), the MCP should:

  1. Publish the request to the appropriate broadcast subject
  2. Collect responses from all responding hosts
  3. Return aggregated results showing each host's status

Timeout handling:

  • Set a reasonable timeout for collecting responses (e.g., 30 seconds after last response, or max 15 minutes)
  • Return partial results if some hosts don't respond
  • Indicate which hosts did not respond

Host Discovery

The list_hosts tool needs to know available hosts. Options:

  1. Static configuration: Read from a config file or environment variable
  2. NATS request: Publish to a discovery subject and collect responses from listeners

Recommend option 2: Listeners subscribe to their configured discover_subject and respond with metadata.

Discovery request:

{
  "reply_to": "deploy.responses.discover-abc123"
}

Discovery response:

{
  "hostname": "ns1",
  "tier": "prod",
  "role": "dns",
  "deploy_subjects": [
    "deploy.prod.ns1",
    "deploy.prod.all",
    "deploy.prod.role.dns"
  ]
}

The response includes the expanded deploy_subjects so clients know exactly which subjects reach this host.

NixOS Module

The NixOS module configures the listener as a systemd service with appropriate hardening.

Module Options

{
  options.services.homelab-deploy.listener = {
    enable = lib.mkEnableOption "homelab-deploy listener service";

    package = lib.mkPackageOption pkgs "homelab-deploy" { };

    hostname = lib.mkOption {
      type = lib.types.str;
      default = config.networking.hostName;
      description = "Hostname for this listener (used in subject templates)";
    };

    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      description = "Deployment tier for this host";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Role for role-based deployment targeting";
    };

    natsUrl = lib.mkOption {
      type = lib.types.str;
      description = "NATS server URL";
      example = "nats://nats.example.com:4222";
    };

    nkeyFile = lib.mkOption {
      type = lib.types.path;
      description = "Path to NKey seed file for NATS authentication";
      example = "/run/secrets/homelab-deploy-nkey";
    };

    flakeUrl = lib.mkOption {
      type = lib.types.str;
      description = "Git flake URL for nixos-rebuild";
      example = "git+https://git.example.com/user/nixos-configs.git";
    };

    timeout = lib.mkOption {
      type = lib.types.int;
      default = 600;
      description = "Deployment timeout in seconds";
    };

    deploySubjects = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [
        "deploy.<tier>.<hostname>"
        "deploy.<tier>.all"
        "deploy.<tier>.role.<role>"
      ];
      description = ''
        List of NATS subjects to subscribe to for deployment requests.
        Template variables: <hostname>, <tier>, <role>
      '';
    };

    discoverSubject = lib.mkOption {
      type = lib.types.str;
      default = "deploy.discover";
      description = "NATS subject for host discovery requests";
    };

    environment = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional environment variables for the service";
      example = { GIT_SSH_COMMAND = "ssh -i /run/secrets/deploy-key"; };
    };
  };
}

Systemd Service

The module creates a hardened systemd service:

systemd.services.homelab-deploy-listener = {
  description = "homelab-deploy listener";
  wantedBy = [ "multi-user.target" ];
  after = [ "network-online.target" ];
  wants = [ "network-online.target" ];

  environment = cfg.environment;

  serviceConfig = {
    Type = "simple";
    ExecStart = "${cfg.package}/bin/homelab-deploy listener ...";
    Restart = "always";
    RestartSec = 10;

    # Hardening (compatible with nixos-rebuild requirements)
    NoNewPrivileges = false;  # nixos-rebuild may need to spawn privileged processes
    ProtectSystem = false;    # nixos-rebuild modifies /nix/store and /run
    ProtectHome = "read-only";
    PrivateTmp = true;
    PrivateDevices = true;
    ProtectKernelTunables = true;
    ProtectKernelModules = true;
    ProtectControlGroups = true;
    RestrictAddressFamilies = [ "AF_UNIX" "AF_INET" "AF_INET6" ];
    RestrictNamespaces = false;  # nix build uses namespaces
    RestrictSUIDSGID = true;
    LockPersonality = true;
    MemoryDenyWriteExecute = false;  # nix may need this
    SystemCallArchitectures = "native";
  };
};

Note: Some hardening options are relaxed because nixos-rebuild requires:

  • Write access to /nix/store for building
  • Ability to activate system configurations
  • Network access for fetching from git/cache
  • Namespace support for nix sandbox builds

NATS Authentication

All NATS connections use NKey authentication. NKeys are ed25519 keypairs where:

  • The seed (private key) is stored in a file readable by the service
  • The public key is configured in the NATS server's user list

Credential Types

Credential      Purpose                 Publish Permissions                            Subscribe Permissions
listener        Host listener service   deploy.responses.>                             deploy.*.>, deploy.discover
mcp-deployer    MCP test-tier access    deploy.test.>, deploy.discover                 deploy.responses.>
admin-deployer  Full deployment access  deploy.test.>, deploy.prod.>, deploy.discover  deploy.responses.>

Note: deploy.discover appears under publish for the deployer credentials because discovery is a request they send. Listeners need an explicit subscribe permission for it, since deploy.*.> only matches subjects with at least three tokens.
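On the NATS server side, these credentials might translate into an authorization block like the following sketch (public keys are placeholders). Note that the deployer credentials need publish permission on deploy.discover to send discovery requests, and the listener needs an explicit subscribe permission for it, since deploy.*.> only matches subjects with at least three tokens:

```
authorization {
  users = [
    # Listener: answers on response subjects, receives deploys and discovery.
    { nkey: "U...LISTENER_PUBLIC_KEY"
      permissions: { publish: ["deploy.responses.>"],
                     subscribe: ["deploy.*.>", "deploy.discover"] } }

    # MCP deployer: test tier only.
    { nkey: "U...MCP_PUBLIC_KEY"
      permissions: { publish: ["deploy.test.>", "deploy.discover"],
                     subscribe: ["deploy.responses.>"] } }

    # Admin deployer: both tiers.
    { nkey: "U...ADMIN_PUBLIC_KEY"
      permissions: { publish: ["deploy.test.>", "deploy.prod.>", "deploy.discover"],
                     subscribe: ["deploy.responses.>"] } }
  ]
}
```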

Flake Structure

The flake.nix should provide:

  1. Package: The Go binary
  2. NixOS module: The listener service configuration
  3. Development shell: Go toolchain for development

{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = { self, nixpkgs }: {
    packages.x86_64-linux.default = /* Go package build */;
    packages.x86_64-linux.homelab-deploy = self.packages.x86_64-linux.default;

    nixosModules.default = import ./nixos/module.nix;
    nixosModules.homelab-deploy = self.nixosModules.default;

    devShells.x86_64-linux.default = /* Go dev shell */;
  };
}

Implementation Notes

Go Dependencies

Recommended libraries:

  • github.com/urfave/cli/v3 - CLI framework
  • github.com/nats-io/nats.go - NATS client
  • github.com/mark3labs/mcp-go - MCP server implementation
  • Standard library for JSON, logging, process execution

Error Handling

  • NATS connection errors: Retry with exponential backoff
  • nixos-rebuild failures: Capture stdout/stderr, report in response message
  • Timeout: Kill the nixos-rebuild process, report timeout error

Testing

  • Unit tests for message parsing and validation
  • Integration tests using a local NATS server
  • End-to-end tests with a NixOS VM (optional, can be done in consuming repo)

Security Considerations

  • Privilege: Listener runs as root to execute nixos-rebuild
  • Input validation: Strictly validate revision format (alphanumeric, dashes, underscores, dots, slashes for branch names; hex for commit hashes)
  • Command injection: Never interpolate user input into shell commands without validation
  • Rate limiting: Consider adding rate limiting to prevent rapid-fire deployments
  • Audit logging: Log all deployment requests with full context
  • Network isolation: NATS should only be accessible from trusted networks
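The revision validation rule above can be captured with two regular expressions; the exact patterns here are an illustrative starting point, not the spec (and since the listener uses exec rather than a shell, validation is defense in depth):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Branch/ref names: alphanumerics plus dashes, underscores, dots, slashes.
	branchRe = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]*$`)
	// Full or abbreviated commit hashes: 7-40 hex characters.
	commitRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)
)

// validRevision accepts plausible branch names and commit hashes and rejects
// anything that could smuggle shell metacharacters.
func validRevision(rev string) bool {
	return branchRe.MatchString(rev) || commitRe.MatchString(rev)
}

func main() {
	fmt.Println(validRevision("master"))           // true
	fmt.Println(validRevision("feature/new-dns"))  // true
	fmt.Println(validRevision("1f23a6ddc9"))       // true
	fmt.Println(validRevision("master; rm -rf /")) // false
}
```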

Future Enhancements

These are not required for initial implementation:

  1. Deployment locking - Cluster-wide lock to prevent fleet-wide concurrent deploys
  2. Prometheus metrics - Export deployment count, duration, success/failure rates
  3. Webhook triggers - HTTP endpoint for CI/CD integration
  4. Scheduled deployments - Deploy at specific times (though this overlaps with existing auto-upgrade)