docs: add homelab-deploy plan, unify host metadata

Add plan for NATS-based deployment service (homelab-deploy) that enables
on-demand NixOS configuration updates via messaging. Features tiered
permissions (test/prod) enforced at NATS layer.

Update prometheus-scrape-target-labels plan to share the homelab.host
module for host metadata (tier, priority, role, labels) - single source
of truth for both deployment tiers and prometheus labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:10:54 +01:00
parent 881e70df27
commit 4d724329a6
2 changed files with 388 additions and 14 deletions

docs/plans/nats-deploy-service.md

@@ -0,0 +1,323 @@
# NATS-Based Deployment Service
## Overview
Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.
## Goals
1. **On-demand deployment** - Trigger config updates immediately via NATS message
2. **Targeted deployment** - Deploy to specific hosts or all hosts
3. **Branch/revision support** - Test feature branches before merging to master
4. **MCP integration** - Allow Claude Code to trigger deployments during development
## Current State
- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: `nixos-rebuild-test <action> <branch>` helper exists on all hosts
- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)
## Architecture
```
┌─────────────┐                        ┌─────────────┐
│  MCP Tool   │ deploy.test.>          │  Admin CLI  │ deploy.test.> + deploy.prod.>
│  (claude)   │──────────────┐  ┌──────│  (torjus)   │
└─────────────┘              │  │      └─────────────┘
                             ▼  ▼
                      ┌──────────────┐
                      │    nats1     │
                      │   (authz)    │
                      └──────┬───────┘
           ┌─────────────────┼─────────────────┐
           │                 │                 │
           ▼                 ▼                 ▼
      ┌──────────┐      ┌──────────┐      ┌──────────┐
      │ template1│      │   ns1    │      │   ha1    │
      │ tier=test│      │ tier=prod│      │ tier=prod│
      └──────────┘      └──────────┘      └──────────┘
```
## Repository Structure
The project lives in a **separate repository** (e.g., `homelab-deploy`) containing:
```
homelab-deploy/
├── flake.nix            # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go      # CLI entrypoint with subcommands
├── internal/
│   ├── listener/        # Listener mode logic
│   ├── mcp/             # MCP server mode logic
│   └── deploy/          # Shared deployment logic
└── nixos/
    └── module.nix       # NixOS module for listener service
```
The nixos-servers repository imports the homelab-deploy flake as an input and uses its NixOS module.
## Single Binary with Subcommands
The `homelab-deploy` binary supports multiple modes:
```bash
# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222
# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222
# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch
homelab-deploy deploy --all --action boot
homelab-deploy status
```
## Components
### Listener Mode
A systemd service on each host that:
- Subscribes to `deploy.<tier>.<hostname>` and `deploy.<tier>.all` subjects
- Validates incoming messages (revision, action)
- Executes `nixos-rebuild` with specified parameters
- Reports status back via NATS
**NixOS module configuration:**
```nix
services.homelab-deploy.listener = {
  enable = true;
  timeout = 600; # seconds, default 10 minutes
};
```
The listener reads its tier from `config.homelab.host.tier` (see Host Metadata below) and subscribes to tier-specific subjects (e.g., `deploy.prod.ns1` and `deploy.prod.all`).
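A minimal sketch of the subscription setup, assuming `github.com/nats-io/nats.go` and that the hostname, tier, NATS URL, and NKey seed path arrive as flags from the NixOS module (flag names and the seed path are illustrative, not the final interface):
```go
// Listener mode: connect with an NKey and subscribe to the host-specific and
// tier-wide deploy subjects. Handler body is sketched under "Request/Reply flow".
package main

import (
	"flag"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	hostname := flag.String("hostname", "", "this host's name, e.g. ns1")
	tier := flag.String("tier", "prod", "value of config.homelab.host.tier")
	natsURL := flag.String("nats-url", "nats://nats1:4222", "NATS server URL")
	seedFile := flag.String("nkey-seed", "/run/keys/deploy-listener.nk", "NKey seed file (hypothetical path)")
	flag.Parse()

	auth, err := nats.NkeyOptionFromSeed(*seedFile)
	if err != nil {
		log.Fatalf("loading nkey seed: %v", err)
	}
	nc, err := nats.Connect(*natsURL, auth, nats.Name("homelab-deploy-listener"))
	if err != nil {
		log.Fatalf("connecting to NATS: %v", err)
	}
	defer nc.Drain()

	// e.g. deploy.prod.ns1 and deploy.prod.all
	for _, subject := range []string{
		"deploy." + *tier + "." + *hostname,
		"deploy." + *tier + ".all",
	} {
		if _, err := nc.Subscribe(subject, func(msg *nats.Msg) {
			log.Printf("deploy request on %s: %s", msg.Subject, msg.Data)
			// Validation, nixos-rebuild execution, and status replies are
			// sketched under "Request/Reply flow" below.
		}); err != nil {
			log.Fatalf("subscribing to %s: %v", subject, err)
		}
	}
	select {} // block; systemd supervises and restarts the process
}
```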
**Request message format:**
```json
{
  "action": "switch" | "boot" | "test" | "dry-activate",
  "revision": "master" | "feature-branch" | "abc123...",
  "reply_to": "deploy.responses.<request-id>"
}
```
**Response message format:**
```json
{
  "status": "accepted" | "rejected" | "started" | "completed" | "failed",
  "error": "invalid_revision" | "already_running" | "build_failed" | null,
  "message": "human-readable details"
}
```
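One possible Go representation of these messages, assuming `encoding/json` tags matching the field names above (type names are illustrative):
```go
// Shared message types for the deploy request/reply protocol.
package deploy

// Request is published to deploy.<tier>.<hostname> or deploy.<tier>.all.
type Request struct {
	Action   string `json:"action"`   // "switch", "boot", "test", or "dry-activate"
	Revision string `json:"revision"` // branch name or commit hash
	ReplyTo  string `json:"reply_to"` // unique subject under deploy.responses.>
}

// Response is published by the listener to the request's reply_to subject.
// Error is omitted when there is nothing to report.
type Response struct {
	Status  string `json:"status"`          // accepted | rejected | started | completed | failed
	Error   string `json:"error,omitempty"` // invalid_revision | already_running | build_failed
	Message string `json:"message"`         // human-readable details
}
```
A `*string` could be used for `error` if an explicit JSON `null` is preferred over omitting the field.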
**Request/Reply flow:**
1. MCP/CLI sends deploy request with unique `reply_to` subject
2. Listener validates request (e.g., `git ls-remote` to check revision exists)
3. Listener sends immediate response:
   - `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}`, or
   - `{"status": "started", "message": "starting nixos-rebuild switch"}`
4. If started, listener runs nixos-rebuild
5. Listener sends final response:
   - `{"status": "completed", "message": "successfully switched to generation 42"}`, or
   - `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}`
This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.
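A sketch of a listener handler following this flow, assuming placeholder repository URL and flake reference (the real invocation would likely mirror the existing `nixos-rebuild-test` helper):
```go
// Handle one deploy request: validate the revision, reply "started",
// run nixos-rebuild, then reply "completed" or "failed". Not the final code.
package listener

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
	"time"

	"github.com/nats-io/nats.go"
)

type deployRequest struct {
	Action   string `json:"action"`
	Revision string `json:"revision"`
	ReplyTo  string `json:"reply_to"`
}

type deployResponse struct {
	Status  string `json:"status"`
	Error   string `json:"error,omitempty"`
	Message string `json:"message"`
}

// repoURL is a placeholder; the listener would be configured with the real repository URL.
const repoURL = "https://git.example.invalid/nixos-servers.git"

func handleRequest(nc *nats.Conn, msg *nats.Msg, timeout time.Duration) {
	var req deployRequest
	if err := json.Unmarshal(msg.Data, &req); err != nil {
		log.Printf("dropping malformed request: %v", err)
		return
	}
	// Steps 2-3: cheap validation first, so bad revisions are rejected immediately.
	if err := exec.Command("git", "ls-remote", "--exit-code", repoURL, req.Revision).Run(); err != nil {
		reply(nc, req.ReplyTo, deployResponse{Status: "rejected", Error: "invalid_revision",
			Message: fmt.Sprintf("revision %q not found", req.Revision)})
		return
	}
	reply(nc, req.ReplyTo, deployResponse{Status: "started", Message: "starting nixos-rebuild " + req.Action})

	// Step 4: run the rebuild under the configured timeout (default 10 minutes).
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	_, err := exec.CommandContext(ctx, "nixos-rebuild", req.Action,
		"--flake", "git+"+repoURL+"?ref="+req.Revision).CombinedOutput()

	// Step 5: report the final outcome; full output goes to journald in the real service.
	if err != nil {
		reply(nc, req.ReplyTo, deployResponse{Status: "failed", Error: "build_failed",
			Message: fmt.Sprintf("nixos-rebuild failed: %v", err)})
		return
	}
	reply(nc, req.ReplyTo, deployResponse{Status: "completed", Message: "nixos-rebuild " + req.Action + " succeeded"})
}

func reply(nc *nats.Conn, subject string, r deployResponse) {
	data, _ := json.Marshal(r)
	_ = nc.Publish(subject, data)
}
```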
### MCP Mode
Runs as an MCP server providing tools for Claude Code:
- `deploy` - Deploy to specific host(s) with optional revision
- `deploy_status` - Check deployment status/history
- `list_hosts` - List available deployment targets
The MCP server runs with limited credentials (test-tier only), so Claude can deploy to test hosts but not production.
### Tiered Permissions
Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:
**NATS user configuration (on nats1):**
```nix
accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC..."; # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ..."; # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF..."; # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};
```
**Host tier assignments** (via `homelab.host.tier`):
| Tier | Hosts |
|------|-------|
| test | template1, nix-cache01, future test hosts |
| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. |
**How it works:**
1. MCP tries to deploy to ns1 → publishes to `deploy.prod.ns1`
2. NATS server rejects publish (mcp-deployer lacks `deploy.prod.>` permission)
3. MCP tries to deploy to template1 → publishes to `deploy.test.template1`
4. NATS allows it, listener receives and executes
All NKeys stored in Vault - MCP gets limited credentials, admin CLI gets full-access credentials.
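From the client side a denied publish never reaches any listener; with the `nats.go` client the server's permissions violation is surfaced through the asynchronous error callback, and a request simply times out. A small sketch (credentials options omitted for brevity):
```go
// What mcp-deployer credentials would observe when publishing to a prod subject.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats1:4222",
		nats.ErrorHandler(func(_ *nats.Conn, _ *nats.Subscription, err error) {
			// e.g. "Permissions Violation for Publish to deploy.prod.ns1"
			log.Printf("async NATS error: %v", err)
		}))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// The server drops this publish for mcp-deployer, so the request times
	// out instead of reaching ns1's listener.
	if _, err := nc.Request("deploy.prod.ns1", []byte(`{"action":"dry-activate"}`), 2*time.Second); err != nil {
		log.Printf("deploy to prod denied or unanswered: %v", err)
	}
}
```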
### Host Metadata
Rather than defining `tier` in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.
**Module definition (in `modules/homelab/host.nix`):**
```nix
options.homelab.host = {
  tier = lib.mkOption {
    type = lib.types.enum [ "test" "prod" ];
    default = "prod";
    description = "Deployment tier - controls which credentials can deploy to this host";
  };
  priority = lib.mkOption {
    type = lib.types.enum [ "high" "low" ];
    default = "high";
    description = "Alerting priority - low priority hosts have relaxed thresholds";
  };
  role = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
    description = "Primary role of this host (dns, database, monitoring, etc.)";
  };
  labels = lib.mkOption {
    type = lib.types.attrsOf lib.types.str;
    default = { };
    description = "Additional free-form labels";
  };
};
```
**Consumers:**
- `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription
- Prometheus scrape config reads `priority`, `role`, `labels` for target labels
- Future services can consume the same metadata
**Example host config:**
```nix
# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";        # can be deployed by MCP
  priority = "low";     # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod";        # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```
## Implementation Steps
### Phase 1: Core Binary + Listener
1. **Create homelab-deploy repository**
   - Initialize Go module
   - Set up flake.nix with Go package build
2. **Implement listener mode**
   - NATS subscription logic
   - nixos-rebuild execution
   - Status reporting via NATS reply
3. **Create NixOS module**
   - Systemd service definition
   - Configuration options (hostname, NATS URL, NKey path)
   - Vault secret integration for NKeys
4. **Create `homelab.host` module** (in nixos-servers)
   - Define `tier`, `priority`, `role`, `labels` options
   - This module is shared with Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`)
5. **Integrate with nixos-servers**
   - Add flake input for homelab-deploy
   - Import listener module in `system/`
   - Set `homelab.host.tier` per host (test vs prod)
6. **Configure NATS tiered permissions**
   - Add deployer users to nats1 config (mcp-deployer, admin-deployer)
   - Set up subject ACLs per user (test-only vs full access)
   - Add deployer NKeys to Vault
   - Create Terraform resources for NKey secrets
### Phase 2: MCP + CLI
7. **Implement MCP mode**
   - MCP server with deploy/status tools
   - Request/reply pattern for deployment feedback
8. **Implement CLI commands**
   - `deploy` command for manual deployments
   - `status` command to check deployment state
9. **Configure Claude Code**
   - Add MCP server to configuration
   - Document usage
### Phase 3: Enhancements
10. Add deployment locking (prevent concurrent deploys; see the sketch after this list)
11. Prometheus metrics for deployment status
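One possible shape for the locking in item 10, assuming a single listener process per host so an in-process `sync.Mutex` with `TryLock` is enough to answer `already_running` immediately:
```go
// Reject overlapping deployments instead of queueing them.
package main

import (
	"fmt"
	"sync"
)

type deployer struct {
	mu sync.Mutex // held for the duration of one nixos-rebuild
}

// tryDeploy runs fn if no deployment is in flight, otherwise reports busy.
func (d *deployer) tryDeploy(fn func() error) error {
	if !d.mu.TryLock() {
		return fmt.Errorf("already_running: a deployment is in progress")
	}
	defer d.mu.Unlock()
	return fn()
}

func main() {
	d := &deployer{}
	err := d.tryDeploy(func() error {
		fmt.Println("pretend to run nixos-rebuild here")
		return nil
	})
	fmt.Println("result:", err)
}
```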
## Security Considerations
- **Privilege escalation**: Listener runs as root to execute nixos-rebuild
- **Input validation**: Strictly validate revision format (branch name or commit hash); see the sketch after this list
- **Rate limiting**: Prevent rapid-fire deployments
- **Audit logging**: Log all deployment requests with source identity
- **Network isolation**: NATS only accessible from internal network
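A sketch of the revision validation mentioned above, assuming only abbreviated or full commit hashes and conservative branch names are accepted (the exact character set is a policy choice):
```go
// Reject anything that does not look like a commit hash or a safe branch
// name before it reaches git or the shell.
package main

import (
	"fmt"
	"regexp"
)

var (
	commitHashRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)
	branchNameRe = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]{0,100}$`)
)

// validRevision reports whether rev looks like a commit hash or a safe branch name.
func validRevision(rev string) bool {
	return commitHashRe.MatchString(rev) || branchNameRe.MatchString(rev)
}

func main() {
	for _, rev := range []string{"master", "feature-x", "abc123def456", "foo; rm -rf /"} {
		fmt.Printf("%-20q valid=%v\n", rev, validRevision(rev))
	}
}
```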
## Decisions
All open questions have been resolved. See Notes section for decision rationale.
## Notes
- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp)
- **Timeout decision**: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times.
- **Rollback**: Not needed as a separate feature - deploying an older commit hash effectively rolls back.
- **Offline hosts**: No message persistence - if host is offline, deploy fails. Daily auto-upgrade is the safety net. Avoids complexity of JetStream deduplication (host coming online and applying 10 queued updates instead of just the latest).
- **Deploy history**: Use existing Loki - listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- **Naming**: `homelab-deploy` - ties it to the infrastructure rather than implementation details.

docs/plans/prometheus-scrape-target-labels.md

@@ -4,6 +4,8 @@
Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
## Motivation
Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.
@@ -52,22 +54,59 @@ or
## Implementation
This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md`, which uses the same module for deployment tier assignment.
### 1. Create `homelab.host` module
Create `modules/homelab/host.nix` with shared host metadata options:
```nix
{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };
    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };
    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };
    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels (e.g., dns_role = 'primary')";
    };
  };
}
```
Import this module in `modules/homelab/default.nix`.
### 2. Update `lib/monitoring.nix`
- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:
```nix
# Combine structured options + free-form labels
effectiveLabels =
  (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
  // (lib.optionalAttrs (host.role != null) { role = host.role; })
  // host.labels;
```
- `generateNodeExporterTargets` returns structured `static_configs` entries, grouping targets by their label sets:
```nix
# Before (flat list):
@@ -80,7 +119,7 @@ labels = lib.mkOption {
]
```
This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior).
### 3. Update `services/monitoring/prometheus.nix`
@@ -94,17 +133,29 @@ static_configs = [{ targets = nodeExporterTargets; }];
static_configs = nodeExporterTargets;
```
### 4. Set metadata on hosts
Example in `hosts/nix-cache01/configuration.nix`:
```nix
homelab.host = {
  tier = "test";        # can be deployed by MCP (used by homelab-deploy)
  priority = "low";     # relaxed alerting thresholds
  role = "build-host";
};
```
Example in `hosts/ns1/configuration.nix`:
```nix
homelab.host = {
  tier = "prod";
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```
### 5. Update alert rules
After implementing labels, review and update `services/monitoring/rules.yml`: