# homelab-deploy Design Document

A message-based deployment system for NixOS configurations using NATS for messaging. This binary runs in multiple modes to enable on-demand NixOS configuration updates across a fleet of hosts.

## Overview

The `homelab-deploy` binary provides three operational modes:

1. **Listener mode** - Runs on each NixOS host as a systemd service, subscribing to NATS subjects and executing `nixos-rebuild` when deployment requests arrive
2. **MCP mode** - Runs as an MCP (Model Context Protocol) server, exposing deployment tools for AI assistants
3. **CLI mode** - Manual deployment commands for administrators

## Architecture

```
┌─────────────┐                       ┌─────────────┐
│  MCP Tool   │ deploy.test.>         │  Admin CLI  │ deploy.test.> + deploy.prod.>
│             │────────────┐    ┌─────│             │
└─────────────┘            │    │     └─────────────┘
                           ▼    ▼
                     ┌──────────────┐
                     │ NATS Server  │
                     │   (authz)    │
                     └──────┬───────┘
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
          ▼                 ▼                 ▼
    ┌──────────┐      ┌──────────┐      ┌──────────┐
    │  host-a  │      │  host-b  │      │  host-c  │
    │ tier=test│      │ tier=prod│      │ tier=prod│
    └──────────┘      └──────────┘      └──────────┘
```

## Repository Structure

```
homelab-deploy/
├── flake.nix              # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go        # CLI entrypoint with subcommands
├── internal/
│   ├── listener/          # Listener mode logic
│   ├── mcp/               # MCP server mode logic
│   ├── nats/              # NATS client wrapper
│   └── deploy/            # Shared deployment execution logic
└── nixos/
    └── module.nix         # NixOS module for listener service
```

## CLI Interface

```bash
# Listener mode (runs as systemd service on each host)
homelab-deploy listener \
  --hostname <hostname> \
  --tier <tier> \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/listener.nkey \
  --flake-url <url> \
  [--role <role>] \
  [--timeout 600] \
  [--deploy-subject <pattern>]... \
  [--discover-subject <subject>]

# Subject flags can be repeated and use template variables:
homelab-deploy listener \
  --hostname ns1 \
  --tier prod \
  --role dns \
  --deploy-subject "deploy.<tier>.<hostname>" \
  --deploy-subject "deploy.<tier>.all" \
  --deploy-subject "deploy.<tier>.role.<role>" \
  --discover-subject "deploy.discover" \
  ...

# MCP server mode (for AI assistants)
homelab-deploy mcp \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/mcp.nkey \
  [--enable-admin --admin-nkey-file /path/to/admin.nkey]

# CLI commands for manual use
# Deploy to a specific subject
homelab-deploy deploy <subject> \
  --nats-url nats://server:4222 \
  --nkey-file /path/to/deployer.nkey \
  [--branch <branch>] \
  [--action <action>]

# Examples:
homelab-deploy deploy deploy.prod.ns1        # Deploy to specific host
homelab-deploy deploy deploy.test.all        # Deploy to all test hosts
homelab-deploy deploy deploy.prod.role.dns   # Deploy to all prod DNS hosts

# Using aliases (configured via environment variables)
homelab-deploy deploy test       # Expands to configured subject
homelab-deploy deploy prod-dns   # Expands to configured subject
```

### CLI Subject Aliases

The CLI supports subject aliases via environment variables. If the `<subject>` argument doesn't look like a NATS subject (no dots), the CLI checks for an alias.

**Environment variable format:** `HOMELAB_DEPLOY_ALIAS_<NAME>=<subject>`

```bash
export HOMELAB_DEPLOY_ALIAS_TEST="deploy.test.all"
export HOMELAB_DEPLOY_ALIAS_PROD="deploy.prod.all"
export HOMELAB_DEPLOY_ALIAS_PROD_DNS="deploy.prod.role.dns"

# Now these work:
homelab-deploy deploy test      # -> deploy.test.all
homelab-deploy deploy prod      # -> deploy.prod.all
homelab-deploy deploy prod-dns  # -> deploy.prod.role.dns
```

Alias names are case-insensitive and hyphens are converted to underscores when looking up the environment variable.
## NATS Subject Structure

Subjects follow the pattern `deploy.<tier>.<target>` by default, but are fully configurable:

| Subject Pattern | Description |
|-----------------|-------------|
| `deploy.<tier>.<hostname>` | Deploy to specific host (e.g., `deploy.prod.ns1`) |
| `deploy.<tier>.all` | Deploy to all hosts in tier (e.g., `deploy.test.all`) |
| `deploy.<tier>.role.<role>` | Deploy to hosts with role in tier (e.g., `deploy.prod.role.dns`) |
| `deploy.responses.<uuid>` | Response subject for request/reply (UUID generated by CLI) |
| `deploy.discover` | Host discovery requests |

### Subject Customization

Listeners can configure custom subject patterns using template variables:

- `<hostname>` - The listener's hostname
- `<tier>` - The listener's tier (test/prod)
- `<role>` - The listener's role (if configured)

This allows prefixing subjects for multi-tenant setups (e.g., `homelab.deploy.<tier>.<hostname>`).

## Listener Mode

### Responsibilities

1. Connect to NATS using NKey authentication
2. Subscribe to configured deploy subjects (with template expansion)
3. Subscribe to the discovery subject and respond with host metadata
4. Validate incoming deployment requests
5. Execute `nixos-rebuild` with the specified parameters
6. Report status back via the NATS reply subject

### Subject Subscriptions

Listeners subscribe to a configurable list of subjects. The configuration uses template variables that are expanded at runtime:

```yaml
listener:
  hostname: ns1
  tier: prod
  role: dns
  deploy_subjects:
    - "deploy.<tier>.<hostname>"
    - "deploy.<tier>.all"
    - "deploy.<tier>.role.<role>"
  discover_subject: "deploy.discover"
```

Template variables:

- `<hostname>` - Replaced with the configured hostname
- `<tier>` - Replaced with the configured tier
- `<role>` - Replaced with the configured role (subject skipped if role is null)

**Example:** With the above configuration, the listener subscribes to:

- `deploy.prod.ns1`
- `deploy.prod.all`
- `deploy.prod.role.dns`
- `deploy.discover`

**Prefixed example:** For multi-tenant setups:

```yaml
listener:
  hostname: ns1
  tier: prod
  deploy_subjects:
    - "homelab.deploy.<tier>.<hostname>"
    - "homelab.deploy.<tier>.all"
  discover_subject: "homelab.deploy.discover"
```

### Message Formats

**Request message:**

```json
{
  "action": "switch",
  "revision": "master",
  "reply_to": "deploy.responses.abc123"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `action` | string | yes | One of: `switch`, `boot`, `test`, `dry-activate` |
| `revision` | string | yes | Git branch name or commit hash |
| `reply_to` | string | yes | Subject to publish responses to |

**Response message:**

```json
{
  "hostname": "ns1",
  "status": "completed",
  "error": null,
  "message": "Successfully switched to generation 42"
}
```

| Field | Type | Description |
|-------|------|-------------|
| `hostname` | string | The responding host's name |
| `status` | string | One of: `accepted`, `rejected`, `started`, `completed`, `failed` |
| `error` | string or null | Error code if status is `rejected` or `failed` |
| `message` | string | Human-readable details |

**Error codes:**

- `invalid_revision` - The specified branch/commit does not exist
- `invalid_action` - The action is not recognized
- `already_running` - A deployment is already in progress on this host
- `build_failed` - nixos-rebuild exited with non-zero status
- `timeout` - Deployment exceeded the configured timeout

### Request/Reply Flow

1. CLI generates a UUID for the request (e.g., `550e8400-e29b-41d4-a716-446655440000`)
2. CLI subscribes to `deploy.responses.<uuid>`
3. CLI publishes the deploy request to the target subject with `reply_to: "deploy.responses.<uuid>"`
4. Listener validates the request:
   - Checks the revision exists using `git ls-remote`
   - Checks no other deployment is running
5. Listener publishes a response to the `reply_to` subject:
   - `{"status": "rejected", ...}` if validation fails, or
   - `{"status": "started", ...}` if deployment begins
6. If started, listener executes nixos-rebuild
7. Listener publishes a final response to the same `reply_to` subject:
   - `{"status": "completed", ...}` on success, or
   - `{"status": "failed", ...}` on failure
8. CLI receives responses and displays progress/results
9. CLI unsubscribes after receiving the final status or timing out

### Deployment Execution

The listener executes `nixos-rebuild` with the following command pattern:

```bash
nixos-rebuild <action> --flake <flake-url>?ref=<revision>#<hostname>
```

Where:

- `<action>` is one of: `switch`, `boot`, `test`, `dry-activate`
- `<flake-url>` is the configured git flake URL (e.g., `git+https://git.example.com/user/nixos-configs.git`)
- `<revision>` is the branch name or commit hash from the request
- `<hostname>` is the listener's configured hostname

**Environment requirements:**

- Must run as root (nixos-rebuild requires root)
- Nix must be configured with proper git credentials if the flake is private
- Network access to the git repository

### Concurrency Control

Only one deployment may run at a time per host. The listener maintains a simple lock:

- Before starting a deployment, acquire the lock
- If the lock is held, reject with an `already_running` error
- Release the lock when the deployment completes (success or failure)
- The lock should be in-memory (no persistence needed - restarts clear it)

### Logging

All deployment events should be logged to stdout/stderr (captured by the systemd journal):

- Request received (with subject, action, revision)
- Validation result
- Deployment start
- Deployment completion (with exit code)
- Any errors

This enables integration with log aggregation systems (e.g., Loki via Promtail).

## MCP Mode

### Purpose

Exposes deployment functionality as MCP tools for AI assistants (e.g., Claude Code).
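The request and response messages defined earlier map directly onto Go structs. A minimal sketch, where the type names and the `parseRequest` helper are illustrative rather than prescribed by the design:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request mirrors the deployment request message format.
type Request struct {
	Action   string `json:"action"`   // switch | boot | test | dry-activate
	Revision string `json:"revision"` // git branch name or commit hash
	ReplyTo  string `json:"reply_to"` // subject to publish responses to
}

// Response mirrors the response message format.
type Response struct {
	Hostname string  `json:"hostname"`
	Status   string  `json:"status"` // accepted | rejected | started | completed | failed
	Error    *string `json:"error"`  // error code; null unless rejected/failed
	Message  string  `json:"message"`
}

// validActions guards against the invalid_action error case.
var validActions = map[string]bool{
	"switch": true, "boot": true, "test": true, "dry-activate": true,
}

// parseRequest decodes and validates an incoming deployment request.
func parseRequest(data []byte) (Request, error) {
	var req Request
	if err := json.Unmarshal(data, &req); err != nil {
		return req, err
	}
	if !validActions[req.Action] {
		return req, fmt.Errorf("invalid_action: %q", req.Action)
	}
	return req, nil
}

func main() {
	req, err := parseRequest([]byte(`{"action":"switch","revision":"master","reply_to":"deploy.responses.abc123"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Action, req.Revision) // prints "switch master"
}
```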
### Tools

| Tool | Description | Parameters |
|------|-------------|------------|
| `deploy` | Deploy to test-tier hosts | `hostname` or `all`, optional `role`, `branch`, `action` |
| `deploy_admin` | Deploy to any tier (requires `--enable-admin`) | `tier`, `hostname` or `all`, optional `role`, `branch`, `action` |
| `list_hosts` | List available deployment targets | `tier` (optional) |

### Tool Schemas

**deploy:**

```json
{
  "name": "deploy",
  "description": "Deploy NixOS configuration to test-tier hosts",
  "inputSchema": {
    "type": "object",
    "properties": {
      "hostname": {
        "type": "string",
        "description": "Target hostname, or omit to use 'all' or 'role' targeting"
      },
      "all": {
        "type": "boolean",
        "description": "Deploy to all test-tier hosts"
      },
      "role": {
        "type": "string",
        "description": "Deploy to all test-tier hosts with this role"
      },
      "branch": {
        "type": "string",
        "description": "Git branch or commit to deploy (default: master)"
      },
      "action": {
        "type": "string",
        "enum": ["switch", "boot", "test", "dry-activate"],
        "description": "nixos-rebuild action (default: switch)"
      }
    }
  }
}
```

**deploy_admin:**

```json
{
  "name": "deploy_admin",
  "description": "Deploy NixOS configuration to any host (admin access required)",
  "inputSchema": {
    "type": "object",
    "properties": {
      "tier": {
        "type": "string",
        "enum": ["test", "prod"],
        "description": "Target tier"
      },
      "hostname": {
        "type": "string",
        "description": "Target hostname, or omit to use 'all' or 'role' targeting"
      },
      "all": {
        "type": "boolean",
        "description": "Deploy to all hosts in tier"
      },
      "role": {
        "type": "string",
        "description": "Deploy to all hosts with this role in tier"
      },
      "branch": {
        "type": "string",
        "description": "Git branch or commit to deploy (default: master)"
      },
      "action": {
        "type": "string",
        "enum": ["switch", "boot", "test", "dry-activate"],
        "description": "nixos-rebuild action (default: switch)"
      }
    },
    "required": ["tier"]
  }
}
```

**list_hosts:**

```json
{
  "name": "list_hosts",
  "description": "List available deployment targets",
  "inputSchema": {
    "type": "object",
    "properties": {
      "tier": {
        "type": "string",
        "enum": ["test", "prod"],
        "description": "Filter by tier (optional)"
      }
    }
  }
}
```

### Security Layers

1. **MCP flag**: The `deploy_admin` tool is only registered when `--enable-admin` is passed
2. **NATS authz**: MCP credentials can only publish to authorized subjects
3. **AI assistant permissions**: The assistant's configuration can require confirmation for admin operations

### Multi-Host Deployments

When deploying to multiple hosts (via `all` or `role`), the MCP server should:

1. Publish the request to the appropriate broadcast subject
2. Collect responses from all responding hosts
3. Return aggregated results showing each host's status

**Timeout handling:**

- Set a reasonable timeout for collecting responses (e.g., 30 seconds after the last response, or 15 minutes maximum)
- Return partial results if some hosts don't respond
- Indicate which hosts did not respond

### Host Discovery

The `list_hosts` tool needs to know which hosts are available. Options:

1. **Static configuration**: Read from a config file or environment variable
2. **NATS request**: Publish to a discovery subject and collect responses from listeners

Option 2 is recommended: listeners subscribe to their configured `discover_subject` and respond with metadata.

**Discovery request:**

```json
{
  "reply_to": "deploy.responses.discover-abc123"
}
```

**Discovery response:**

```json
{
  "hostname": "ns1",
  "tier": "prod",
  "role": "dns",
  "deploy_subjects": [
    "deploy.prod.ns1",
    "deploy.prod.all",
    "deploy.prod.role.dns"
  ]
}
```

The response includes the expanded `deploy_subjects` so clients know exactly which subjects reach this host.

## NixOS Module

The NixOS module configures the listener as a systemd service with appropriate hardening.
### Module Options

```nix
{
  options.services.homelab-deploy.listener = {
    enable = lib.mkEnableOption "homelab-deploy listener service";

    package = lib.mkPackageOption pkgs "homelab-deploy" { };

    hostname = lib.mkOption {
      type = lib.types.str;
      default = config.networking.hostName;
      description = "Hostname for this listener (used in subject templates)";
    };

    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      description = "Deployment tier for this host";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Role for role-based deployment targeting";
    };

    natsUrl = lib.mkOption {
      type = lib.types.str;
      description = "NATS server URL";
      example = "nats://nats.example.com:4222";
    };

    nkeyFile = lib.mkOption {
      type = lib.types.path;
      description = "Path to NKey seed file for NATS authentication";
      example = "/run/secrets/homelab-deploy-nkey";
    };

    flakeUrl = lib.mkOption {
      type = lib.types.str;
      description = "Git flake URL for nixos-rebuild";
      example = "git+https://git.example.com/user/nixos-configs.git";
    };

    timeout = lib.mkOption {
      type = lib.types.int;
      default = 600;
      description = "Deployment timeout in seconds";
    };

    deploySubjects = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [
        "deploy.<tier>.<hostname>"
        "deploy.<tier>.all"
        "deploy.<tier>.role.<role>"
      ];
      description = ''
        List of NATS subjects to subscribe to for deployment requests.
        Template variables: <hostname>, <tier>, <role>
      '';
    };

    discoverSubject = lib.mkOption {
      type = lib.types.str;
      default = "deploy.discover";
      description = "NATS subject for host discovery requests";
    };

    environment = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional environment variables for the service";
      example = { GIT_SSH_COMMAND = "ssh -i /run/secrets/deploy-key"; };
    };
  };
}
```

### Systemd Service

The module creates a hardened systemd service:

```nix
systemd.services.homelab-deploy-listener = {
  description = "homelab-deploy listener";
  wantedBy = [ "multi-user.target" ];
  after = [ "network-online.target" ];
  wants = [ "network-online.target" ];

  environment = cfg.environment;

  serviceConfig = {
    Type = "simple";
    ExecStart = "${cfg.package}/bin/homelab-deploy listener ...";
    Restart = "always";
    RestartSec = 10;

    # Hardening (compatible with nixos-rebuild requirements)
    NoNewPrivileges = false;        # nixos-rebuild may need to spawn privileged processes
    ProtectSystem = false;          # nixos-rebuild modifies /nix/store and /run
    ProtectHome = "read-only";
    PrivateTmp = true;
    PrivateDevices = true;
    ProtectKernelTunables = true;
    ProtectKernelModules = true;
    ProtectControlGroups = true;
    RestrictAddressFamilies = [ "AF_UNIX" "AF_INET" "AF_INET6" ];
    RestrictNamespaces = false;     # nix builds use namespaces
    RestrictSUIDSGID = true;
    LockPersonality = true;
    MemoryDenyWriteExecute = false; # nix may need this
    SystemCallArchitectures = "native";
  };
};
```

**Note:** Some hardening options are relaxed because `nixos-rebuild` requires:

- Write access to `/nix/store` for building
- The ability to activate system configurations
- Network access for fetching from git/cache
- Namespace support for nix sandbox builds

## NATS Authentication

All NATS connections use NKey authentication.
NKeys are ed25519 keypairs where:

- The seed (private key) is stored in a file readable by the service
- The public key is configured in the NATS server's user list

### Credential Types

| Credential | Purpose | Publish Permissions | Subscribe Permissions |
|------------|---------|---------------------|-----------------------|
| listener | Host listener service | `deploy.responses.>` | `deploy.*.>` |
| mcp-deployer | MCP test-tier access | `deploy.test.>` | `deploy.responses.>`, `deploy.discover` |
| admin-deployer | Full deployment access | `deploy.test.>`, `deploy.prod.>` | `deploy.responses.>`, `deploy.discover` |

## Flake Structure

The flake.nix should provide:

1. **Package**: The Go binary
2. **NixOS module**: The listener service configuration
3. **Development shell**: Go toolchain for development

```nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = { self, nixpkgs }: {
    packages.x86_64-linux.default = /* Go package build */;
    packages.x86_64-linux.homelab-deploy = self.packages.x86_64-linux.default;

    nixosModules.default = import ./nixos/module.nix;
    nixosModules.homelab-deploy = self.nixosModules.default;

    devShells.x86_64-linux.default = /* Go dev shell */;
  };
}
```

## Implementation Notes

### Go Dependencies

Recommended libraries:

- `github.com/urfave/cli/v3` - CLI framework
- `github.com/nats-io/nats.go` - NATS client
- `github.com/mark3labs/mcp-go` - MCP server implementation
- Standard library for JSON, logging, process execution

### Error Handling

- NATS connection errors: Retry with exponential backoff
- nixos-rebuild failures: Capture stdout/stderr, report in the response message
- Timeout: Kill the nixos-rebuild process, report a timeout error

### Testing

- Unit tests for message parsing and validation
- Integration tests using a local NATS server
- End-to-end tests with a NixOS VM (optional, can be done in the consuming repo)

## Security Considerations

- **Privilege**: The listener runs as root to execute nixos-rebuild
- **Input validation**: Strictly validate the revision format (alphanumeric, dashes, underscores, dots, slashes for branch names; hex for commit hashes)
- **Command injection**: Never interpolate user input into shell commands without validation
- **Rate limiting**: Consider adding rate limiting to prevent rapid-fire deployments
- **Audit logging**: Log all deployment requests with full context
- **Network isolation**: NATS should only be accessible from trusted networks

## Future Enhancements

These are not required for the initial implementation:

1. **Deployment locking** - Cluster-wide lock to prevent fleet-wide concurrent deploys
2. **Prometheus metrics** - Export deployment count, duration, success/failure rates
3. **Webhook triggers** - HTTP endpoint for CI/CD integration
4. **Scheduled deployments** - Deploy at specific times (though this overlaps with existing auto-upgrade)
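The input-validation rule under Security Considerations can be sketched as follows. The `validRevision` helper and exact character classes are one interpretation of that rule, not a prescribed implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// branchRe allows alphanumerics, dashes, underscores, dots, and slashes,
	// per the input-validation rule above. It deliberately rejects spaces and
	// shell metacharacters so a revision can never smuggle a command.
	branchRe = regexp.MustCompile(`^[A-Za-z0-9._/-]+$`)
	// commitRe matches abbreviated or full lowercase hex commit hashes.
	commitRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)
)

// validRevision reports whether a requested revision is safe to pass to
// nixos-rebuild as a git ref.
func validRevision(rev string) bool {
	return branchRe.MatchString(rev) || commitRe.MatchString(rev)
}

func main() {
	fmt.Println(validRevision("feature/dns-v2"))   // true
	fmt.Println(validRevision("master; rm -rf /")) // false
}
```

Note that validation alone is defense in depth: the listener should also pass the revision as a discrete `exec` argument rather than through a shell, which is what the command-injection bullet above requires.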