diff --git a/docs/plans/nats-deploy-service.md b/docs/plans/nats-deploy-service.md new file mode 100644 index 0000000..de2d577 --- /dev/null +++ b/docs/plans/nats-deploy-service.md @@ -0,0 +1,323 @@ +# NATS-Based Deployment Service + +## Overview + +Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments. + +## Goals + +1. **On-demand deployment** - Trigger config updates immediately via NATS message +2. **Targeted deployment** - Deploy to specific hosts or all hosts +3. **Branch/revision support** - Test feature branches before merging to master +4. **MCP integration** - Allow Claude Code to trigger deployments during development + +## Current State + +- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master +- **Manual testing**: `nixos-rebuild-test` helper exists on all hosts +- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication +- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream) + +## Architecture + +``` +┌─────────────┐ ┌─────────────┐ +│ MCP Tool │ deploy.test.> │ Admin CLI │ deploy.test.> + deploy.prod.> +│ (claude) │────────────┐ ┌─────│ (torjus) │ +└─────────────┘ │ │ └─────────────┘ + ▼ ▼ + ┌──────────────┐ + │ nats1 │ + │ (authz) │ + └──────┬───────┘ + │ + ┌─────────────────┼─────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌──────────┐ ┌──────────┐ ┌──────────┐ + │ template1│ │ ns1 │ │ ha1 │ + │ tier=test│ │ tier=prod│ │ tier=prod│ + └──────────┘ └──────────┘ └──────────┘ +``` + +## Repository Structure + +The project lives in a **separate repository** (e.g., `homelab-deploy`) containing: + +``` +homelab-deploy/ +├── flake.nix # Nix flake with Go package + NixOS module +├── go.mod +├── go.sum +├── cmd/ +│ └── homelab-deploy/ +│ └── main.go # CLI entrypoint with subcommands +├── internal/ +│ ├── listener/ # Listener mode logic +│ ├── mcp/ # MCP server mode logic +│ └── deploy/ # Shared deployment logic +└── nixos/ + └── module.nix # NixOS module for listener service +``` + +This repository (nixos-servers) imports the flake as an input and uses the NixOS module. + +## Single Binary with Subcommands + +The `homelab-deploy` binary supports multiple modes: + +```bash +# Run as listener on a host (systemd service) +homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222 + +# Run as MCP server (for Claude Code) +homelab-deploy mcp --nats-url nats://nats1:4222 + +# CLI commands for manual use +homelab-deploy deploy ns1 --branch feature-x --action switch +homelab-deploy deploy --all --action boot +homelab-deploy status +``` + +## Components + +### Listener Mode + +A systemd service on each host that: +- Subscribes to the `deploy.<tier>.<hostname>` and `deploy.<tier>.all` subjects +- Validates incoming messages (revision, action) +- Executes `nixos-rebuild` with specified parameters +- Reports status back via NATS + +**NixOS module configuration:** +```nix +services.homelab-deploy.listener = { + enable = true; + timeout = 600; # seconds, default 10 minutes +}; +``` + +The listener reads its tier from `config.homelab.host.tier` (see Host Metadata below) and subscribes to tier-specific subjects (e.g., `deploy.prod.ns1` and `deploy.prod.all`). + +**Request message format:** +```json +{ + "action": "switch" | "boot" | "test" | "dry-activate", + "revision": "master" | "feature-branch" | "abc123...", + "reply_to": "deploy.responses.<request-id>"
+} +``` + +**Response message format:** +```json +{ + "status": "accepted" | "rejected" | "started" | "completed" | "failed", + "error": "invalid_revision" | "already_running" | "build_failed" | null, + "message": "human-readable details" +} +``` + +**Request/Reply flow:** +1. MCP/CLI sends deploy request with unique `reply_to` subject +2. Listener validates request (e.g., `git ls-remote` to check revision exists) +3. Listener sends immediate response: + - `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}`, or + - `{"status": "started", "message": "starting nixos-rebuild switch"}` +4. If started, listener runs nixos-rebuild +5. Listener sends final response: + - `{"status": "completed", "message": "successfully switched to generation 42"}`, or + - `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}` + +This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail. + +### MCP Mode + +Runs as an MCP server providing tools for Claude Code: +- `deploy` - Deploy to specific host(s) with optional revision +- `deploy_status` - Check deployment status/history +- `list_hosts` - List available deployment targets + +The MCP server runs with limited credentials (test-tier only), so Claude can deploy to test hosts but not production. + +### Tiered Permissions + +Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights: + +**NATS user configuration (on nats1):** +```nix +accounts = { + HOMELAB = { + users = [ + # MCP/Claude - test tier only + { + nkey = "UABC..."; # mcp-deployer + permissions = { + publish = [ "deploy.test.>" ]; + subscribe = [ "deploy.responses.>" ]; + }; + } + # Admin - full access to all tiers + { + nkey = "UXYZ..."; # admin-deployer + permissions = { + publish = [ "deploy.test.>" "deploy.prod.>" ]; + subscribe = [ "deploy.responses.>" ]; + }; + } + # Host listeners - subscribe to their tier, publish responses + { + nkey = "UDEF..."; # host-listener (one per host) + permissions = { + subscribe = [ "deploy.*.>" ]; + publish = [ "deploy.responses.>" ]; + }; + } + ]; + }; +}; +``` + +**Host tier assignments** (via `homelab.host.tier`): +| Tier | Hosts | +|------|-------| +| test | template1, nix-cache01, future test hosts | +| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. | + +**How it works:** +1. MCP tries to deploy to ns1 → publishes to `deploy.prod.ns1` +2. NATS server rejects publish (mcp-deployer lacks `deploy.prod.>` permission) +3. MCP tries to deploy to template1 → publishes to `deploy.test.template1` +4. NATS allows it, listener receives and executes + +All NKeys stored in Vault - MCP gets limited credentials, admin CLI gets full-access credentials. + +### Host Metadata + +Rather than defining `tier` in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`. 
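+
+To make the listener flow concrete, here is a minimal Go sketch of the subscription and request/reply handling described above (a sketch of `internal/listener`, not a final implementation). It assumes the `nats.go` client, the `deploy.<tier>.<hostname>` / `deploy.<tier>.all` subject scheme, and a simplified `nixos-rebuild` invocation; revision checkout, NKey authentication, locking and timeouts are omitted, and `validRevision` is a stub standing in for the `git ls-remote` check.
+
+```go
+// Sketch of the listener mode: subscribe to tier-scoped subjects and report
+// progress on each request's reply_to subject. Illustrative only.
+package listener
+
+import (
+	"encoding/json"
+	"fmt"
+	"log"
+	"os/exec"
+
+	"github.com/nats-io/nats.go"
+)
+
+type Request struct {
+	Action   string `json:"action"`
+	Revision string `json:"revision"`
+	ReplyTo  string `json:"reply_to"`
+}
+
+type Response struct {
+	Status  string `json:"status"`
+	Error   string `json:"error,omitempty"`
+	Message string `json:"message"`
+}
+
+// validRevision is a stub; the real check would use `git ls-remote`.
+func validRevision(rev string) bool { return rev != "" }
+
+func handleDeploy(nc *nats.Conn, m *nats.Msg) {
+	var req Request
+	if err := json.Unmarshal(m.Data, &req); err != nil || req.ReplyTo == "" {
+		log.Printf("dropping malformed deploy request: %v", err)
+		return
+	}
+	respond := func(r Response) {
+		b, _ := json.Marshal(r)
+		if err := nc.Publish(req.ReplyTo, b); err != nil {
+			log.Printf("publishing response: %v", err)
+		}
+	}
+	if !validRevision(req.Revision) {
+		respond(Response{Status: "rejected", Error: "invalid_revision", Message: "revision not found: " + req.Revision})
+		return
+	}
+	respond(Response{Status: "started", Message: "starting nixos-rebuild " + req.Action})
+	// Real code would fetch and check out req.Revision first; flags are simplified.
+	if out, err := exec.Command("nixos-rebuild", req.Action).CombinedOutput(); err != nil {
+		respond(Response{Status: "failed", Error: "build_failed", Message: string(out)})
+		return
+	}
+	respond(Response{Status: "completed", Message: "nixos-rebuild " + req.Action + " finished"})
+}
+
+func Run(natsURL, tier, hostname string) error {
+	nc, err := nats.Connect(natsURL) // NKey credentials omitted in this sketch
+	if err != nil {
+		return err
+	}
+	defer nc.Drain()
+	subjects := []string{
+		fmt.Sprintf("deploy.%s.%s", tier, hostname),
+		fmt.Sprintf("deploy.%s.all", tier),
+	}
+	for _, subj := range subjects {
+		if _, err := nc.Subscribe(subj, func(m *nats.Msg) { handleDeploy(nc, m) }); err != nil {
+			return err
+		}
+	}
+	select {} // block; a real service would handle signals and clean shutdown
+}
+```
+
+The `tier` and `hostname` values would come from the NixOS module and the `homelab.host` metadata described in this section.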
+ +**Module definition (in `modules/homelab/host.nix`):** +```nix +{ lib, ... }: +{ + options.homelab.host = { + tier = lib.mkOption { + type = lib.types.enum [ "test" "prod" ]; + default = "prod"; + description = "Deployment tier - controls which credentials can deploy to this host"; + }; + + priority = lib.mkOption { + type = lib.types.enum [ "high" "low" ]; + default = "high"; + description = "Alerting priority - low priority hosts have relaxed thresholds"; + }; + + role = lib.mkOption { + type = lib.types.nullOr lib.types.str; + default = null; + description = "Primary role of this host (dns, database, monitoring, etc.)"; + }; + + labels = lib.mkOption { + type = lib.types.attrsOf lib.types.str; + default = { }; + description = "Additional free-form labels"; + }; + }; +} +``` + +**Consumers:** +- `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription +- Prometheus scrape config reads `priority`, `role`, `labels` for target labels +- Future services can consume the same metadata + +**Example host config:** +```nix +# hosts/nix-cache01/configuration.nix +homelab.host = { + tier = "test"; # can be deployed by MCP + priority = "low"; # relaxed alerting thresholds + role = "build-host"; +}; + +# hosts/ns1/configuration.nix +homelab.host = { + tier = "prod"; # requires admin credentials + priority = "high"; + role = "dns"; + labels.dns_role = "primary"; +}; +``` + +## Implementation Steps + +### Phase 1: Core Binary + Listener + +1. **Create homelab-deploy repository** + - Initialize Go module + - Set up flake.nix with Go package build + +2. **Implement listener mode** + - NATS subscription logic + - nixos-rebuild execution + - Status reporting via NATS reply + +3. **Create NixOS module** + - Systemd service definition + - Configuration options (hostname, NATS URL, NKey path) + - Vault secret integration for NKeys + +4. **Create `homelab.host` module** (in nixos-servers) + - Define `tier`, `priority`, `role`, `labels` options + - This module is shared with Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`) + +5. **Integrate with nixos-servers** + - Add flake input for homelab-deploy + - Import listener module in `system/` + - Set `homelab.host.tier` per host (test vs prod) + +6. **Configure NATS tiered permissions** + - Add deployer users to nats1 config (mcp-deployer, admin-deployer) + - Set up subject ACLs per user (test-only vs full access) + - Add deployer NKeys to Vault + - Create Terraform resources for NKey secrets + +### Phase 2: MCP + CLI + +7. **Implement MCP mode** + - MCP server with deploy/status tools + - Request/reply pattern for deployment feedback (see the client-side sketch after the Decisions section) + +8. **Implement CLI commands** + - `deploy` command for manual deployments + - `status` command to check deployment state + +9. **Configure Claude Code** + - Add MCP server to configuration + - Document usage + +### Phase 3: Enhancements + +10. Add deployment locking (prevent concurrent deploys) +11. Add Prometheus metrics for deployment status + +## Security Considerations + +- **Privilege escalation**: Listener runs as root to execute nixos-rebuild +- **Input validation**: Strictly validate revision format (branch name or commit hash) +- **Rate limiting**: Prevent rapid-fire deployments +- **Audit logging**: Log all deployment requests with source identity +- **Network isolation**: NATS only accessible from internal network + +## Decisions + +All open questions have been resolved. See Notes section for decision rationale.
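+
+For the MCP and CLI modes (Phase 2, steps 7-8), the client side of the same request/reply pattern can be sketched as follows. The same assumptions apply: the `nats.go` client, the subject scheme from the Architecture section, and the `Request`/`Response` structs from the listener sketch above (plus the standard `time` package); the reply-subject naming, timeout, and function name are illustrative.
+
+```go
+// Sketch of the deploy client used by the MCP server and CLI: publish a
+// request for one host, then wait on a per-request reply subject for the
+// started and completed (or rejected/failed) responses. Illustrative only.
+func deployHost(nc *nats.Conn, tier, host, revision, action string) error {
+	// Unique-enough reply subject for a sketch; it must fall under
+	// deploy.responses.> so the deployer credentials may subscribe to it.
+	replyTo := fmt.Sprintf("deploy.responses.%s.%d", host, time.Now().UnixNano())
+	sub, err := nc.SubscribeSync(replyTo)
+	if err != nil {
+		return err
+	}
+	defer sub.Unsubscribe()
+
+	payload, err := json.Marshal(Request{Action: action, Revision: revision, ReplyTo: replyTo})
+	if err != nil {
+		return err
+	}
+	// Note: a NATS permission violation (e.g. mcp-deployer publishing to a
+	// deploy.prod.* subject) is reported asynchronously, not as a Publish error.
+	if err := nc.Publish(fmt.Sprintf("deploy.%s.%s", tier, host), payload); err != nil {
+		return err
+	}
+
+	for {
+		msg, err := sub.NextMsg(10 * time.Minute) // matches the listener's default timeout
+		if err != nil {
+			return fmt.Errorf("no response from %s: %w", host, err)
+		}
+		var resp Response
+		if err := json.Unmarshal(msg.Data, &resp); err != nil {
+			return err
+		}
+		log.Printf("%s: %s %s", host, resp.Status, resp.Message)
+		switch resp.Status {
+		case "completed":
+			return nil
+		case "rejected", "failed":
+			return fmt.Errorf("deploy to %s: %s: %s", host, resp.Error, resp.Message)
+		}
+		// "started" and other intermediate statuses: keep waiting.
+	}
+}
+```
+
+A `deploy --all` invocation would publish to `deploy.<tier>.all` instead and collect responses from every listener on that tier; that fan-in is omitted here.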
+ +## Notes + +- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic +- Uses NATS request/reply pattern for immediate validation feedback and completion status +- Consider using NATS headers for metadata (request ID, timestamp) +- **Timeout decision**: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times. +- **Rollback**: Not needed as a separate feature - deploy an older commit hash to effectively rollback. +- **Offline hosts**: No message persistence - if host is offline, deploy fails. Daily auto-upgrade is the safety net. Avoids complexity of JetStream deduplication (host coming online and applying 10 queued updates instead of just the latest). +- **Deploy history**: Use existing Loki - listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence. +- **Naming**: `homelab-deploy` - ties it to the infrastructure rather than implementation details. diff --git a/docs/plans/prometheus-scrape-target-labels.md b/docs/plans/prometheus-scrape-target-labels.md index a6a76f2..2261dc8 100644 --- a/docs/plans/prometheus-scrape-target-labels.md +++ b/docs/plans/prometheus-scrape-target-labels.md @@ -4,6 +4,8 @@ Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names. +**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment. + ## Motivation Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale. @@ -52,22 +54,59 @@ or ## Implementation -### 1. Add `labels` option to `homelab.monitoring` +This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment. -In `modules/homelab/monitoring.nix`, add: +### 1. Create `homelab.host` module + +Create `modules/homelab/host.nix` with shared host metadata options: ```nix -labels = lib.mkOption { - type = lib.types.attrsOf lib.types.str; - default = { }; - description = "Custom labels to attach to this host's scrape targets"; -}; +{ lib, ... }: +{ + options.homelab.host = { + tier = lib.mkOption { + type = lib.types.enum [ "test" "prod" ]; + default = "prod"; + description = "Deployment tier - controls which credentials can deploy to this host"; + }; + + priority = lib.mkOption { + type = lib.types.enum [ "high" "low" ]; + default = "high"; + description = "Alerting priority - low priority hosts have relaxed thresholds"; + }; + + role = lib.mkOption { + type = lib.types.nullOr lib.types.str; + default = null; + description = "Primary role of this host (dns, database, monitoring, etc.)"; + }; + + labels = lib.mkOption { + type = lib.types.attrsOf lib.types.str; + default = { }; + description = "Additional free-form labels (e.g., dns_role = 'primary')"; + }; + }; +} ``` +Import this module in `modules/homelab/default.nix`. + ### 2. 
Update `lib/monitoring.nix` -- `extractHostMonitoring` should carry `labels` through in its return value. -- `generateNodeExporterTargets` currently returns a flat list of target strings. It needs to return structured `static_configs` entries instead, grouping targets by their label sets: +- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels). +- Build the combined label set from `homelab.host`: + +```nix +# Combine structured options + free-form labels +effectiveLabels = + (lib.optionalAttrs (host.priority != "high") { priority = host.priority; }) + // (lib.optionalAttrs (host.role != null) { role = host.role; }) + // host.labels; +``` + +- `generateNodeExporterTargets` returns structured `static_configs` entries, grouping targets by their label sets: ```nix # Before (flat list): @@ -80,7 +119,7 @@ labels = lib.mkOption { ] ``` -This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with no custom labels get grouped together with no extra labels (preserving current behavior). +This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior). ### 3. Update `services/monitoring/prometheus.nix` @@ -94,17 +133,29 @@ static_configs = [{ targets = nodeExporterTargets; }]; static_configs = nodeExporterTargets; ``` -### 4. Set labels on hosts +### 4. Set metadata on hosts -Example in `hosts/nix-cache01/configuration.nix` or the relevant service module: +Example in `hosts/nix-cache01/configuration.nix`: ```nix -homelab.monitoring.labels = { - priority = "low"; +homelab.host = { + tier = "test"; # can be deployed by MCP (used by homelab-deploy) + priority = "low"; # relaxed alerting thresholds role = "build-host"; }; ``` +Example in `hosts/ns1/configuration.nix`: + +```nix +homelab.host = { + tier = "prod"; + priority = "high"; + role = "dns"; + labels.dns_role = "primary"; +}; +``` + ### 5. Update alert rules After implementing labels, review and update `services/monitoring/rules.yml`: