Commit Graph

25 Commits

Author SHA1 Message Date
746e30b24f fix: initialize counter and histogram metrics at startup
Counter and histogram metrics were absent from Prometheus scrapes until
the first deployment occurred, making it impossible to distinguish
"no deployments" from "exporter not running" in dashboards and alerts.

Initialize all expected label combinations with zero values when the
collector is created so metrics appear in every scrape from startup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:29:36 +01:00
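The zero-initialization described in the commit above can be sketched with the standard library alone. This is a minimal illustration, not the project's collector: the `labelSet` type, the `newZeroedCounters` helper, and the status and action values are all assumptions, and the `error_code` label is left out for brevity. With the Prometheus Go client, the comparable step is calling `WithLabelValues` once for each expected combination when the collector is created.

```go
package main

import (
	"fmt"
	"sort"
)

// labelSet identifies one time series by its label values.
// Hypothetical type; the real metric also carries an error_code label.
type labelSet struct {
	Status string
	Action string
}

// newZeroedCounters returns a counter map in which every status/action
// combination is already present with value zero, so each series can be
// exported before the first deployment increments it.
func newZeroedCounters(statuses, actions []string) map[labelSet]float64 {
	counters := make(map[labelSet]float64, len(statuses)*len(actions))
	for _, s := range statuses {
		for _, a := range actions {
			counters[labelSet{Status: s, Action: a}] = 0
		}
	}
	return counters
}

func main() {
	// The label values here are assumed for illustration.
	counters := newZeroedCounters(
		[]string{"completed", "failed"},
		[]string{"switch", "boot"},
	)

	// Sort the keys so the output order is stable.
	keys := make([]labelSet, 0, len(counters))
	for k := range counters {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Status != keys[j].Status {
			return keys[i].Status < keys[j].Status
		}
		return keys[i].Action < keys[j].Action
	})

	// Print each series in a Prometheus-like text form.
	for _, k := range keys {
		fmt.Printf("homelab_deploy_deployments_total{status=%q,action=%q} %g\n",
			k.Status, k.Action, counters[k])
	}
}
```

Because every series exists from startup, a dashboard can treat a missing series as "exporter not running" and a zero value as "no deployments yet".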
fd0d63b103 fix: flush NATS buffer after sending completed response
The CLI was reporting deployment failures even when the listener showed
success. This was a race condition: after a successful switch deployment,
the listener would send the "completed" response then immediately signal
restart. The NATS connection closed before the buffered message was
actually sent to the broker, so the CLI never received it.

Adding Flush() after sending the completed response ensures the message
reaches NATS before the listener can exit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:30:34 +01:00
36a74b8cf9 feat: add heartbeat status updates during deployment
Send periodic "running" status messages while nixos-rebuild executes,
preventing the idle timeout from triggering before deployments complete.
This fixes false "Some deployments failed" warnings in MCP when builds
take longer than 30 seconds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:23:33 +01:00
79db119d1c feat: add Prometheus metrics to listener service
Add an optional Prometheus metrics HTTP endpoint to the listener for
monitoring deployment operations. Includes four metrics:

- homelab_deploy_deployments_total (counter with status/action/error_code)
- homelab_deploy_deployment_duration_seconds (histogram with action/success)
- homelab_deploy_deployment_in_progress (gauge)
- homelab_deploy_info (gauge with hostname/tier/role/version)

New CLI flags: --metrics-enabled, --metrics-addr (default :9972)
New NixOS options: metrics.enable, metrics.address, metrics.openFirewall

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:58:22 +01:00
56365835c7 feat: add list-hosts command to CLI
Adds a list-hosts command that mirrors the MCP list_hosts functionality,
allowing discovery of available deployment targets from the command line.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:30:57 +01:00
95b795dcfd fix: remove systemd hardening to allow nix sandbox namespace creation
The previous hardening options (ProtectControlGroups, LockPersonality,
SystemCallArchitectures, etc.) prevented Nix from creating the kernel
namespaces required for build sandboxing. This follows the approach of
the NixOS auto-upgrade module, which has no hardening because
nixos-rebuild requires broad system access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:52:16 +01:00
71d6aa8b61 fix: disable PrivateDevices to allow nix sandbox namespace creation
The PrivateDevices=true systemd hardening option was preventing Nix
from creating the kernel namespaces required for its build sandbox.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:26:53 +01:00
2c97b6140c fix: check only final responses in AllSucceeded to determine deployment success
The CLI was incorrectly reporting "some deployments failed" even when
deployments succeeded. This was because AllSucceeded() checked if every
response had StatusCompleted, but the Responses slice contains all
messages including intermediate ones like "started". Since started !=
completed, it returned false.

Now AllSucceeded() only examines final responses (using IsFinal()) and
checks that each host's final status is completed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:19:08 +01:00
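The fix in the commit above can be sketched as follows. The `AllSucceeded` and `IsFinal` names and the `StatusCompleted` constant come from the commit message; the `Response` fields, the other status constants, and the rule that at least one final response must exist are assumptions made for this illustration.

```go
package main

import "fmt"

// Status is a stand-in for the project's response status type.
type Status string

const (
	StatusStarted   Status = "started"   // intermediate (assumed value)
	StatusCompleted Status = "completed" // final, success
	StatusFailed    Status = "failed"    // final, failure (assumed value)
)

// Response is an illustrative message shape; the real fields may differ.
type Response struct {
	Hostname string
	Status   Status
}

// IsFinal reports whether the response ends a deployment for its host.
func (r Response) IsFinal() bool {
	return r.Status == StatusCompleted || r.Status == StatusFailed
}

// AllSucceeded skips intermediate messages and returns true only if
// every final response is "completed". Returning false when no final
// response exists is an assumption of this sketch.
func AllSucceeded(responses []Response) bool {
	sawFinal := false
	for _, r := range responses {
		if !r.IsFinal() {
			continue
		}
		sawFinal = true
		if r.Status != StatusCompleted {
			return false
		}
	}
	return sawFinal
}

func main() {
	responses := []Response{
		{Hostname: "web1", Status: StatusStarted},
		{Hostname: "web1", Status: StatusCompleted},
	}
	// The intermediate "started" message no longer counts as a failure.
	fmt.Println(AllSucceeded(responses))
}
```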
efacb13b86 feat: exit listener after successful switch for automatic restart
After a successful switch deployment, the listener now exits gracefully
so systemd can restart it with the new binary. This works together with
stopIfChanged/restartIfChanged to ensure deployments complete before
the service restarts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:11:03 +01:00
ac3c9c7de6 fix: prevent listener service from restarting during deployment
Add stopIfChanged and restartIfChanged options to prevent the listener
from being interrupted when nixos-rebuild switch activates a new
configuration that changes the service definition.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:06:47 +01:00
9f205fee5e fix: add writable cache directory for nix git flake fetching
The listener service had ProtectHome=read-only which prevented Nix
from writing to /root/.cache when fetching git flakes. This adds a
CacheDirectory managed by systemd and sets XDG_CACHE_HOME to use it.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:57:59 +01:00
5f3cfc3d21 fix: add nixos-rebuild to PATH and fix CLI hanging after deploy failure
- Add nixos-rebuild to listener service PATH in NixOS module
- Fix CLI deploy command hanging after receiving final status by properly
  tracking lastResponse time and exiting when all hosts have responded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:53:22 +01:00
c9b85435ba fix: add git to listener service PATH for revision validation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:43:23 +01:00
cf3b1ce2c9 refactor: use flake package directly in NixOS module
Instead of requiring users to provide the package via overlay,
the module now receives `self` from the flake and uses the
package directly from `self.packages`.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:08:02 +01:00
9237814fed docs: add NATS subject structure and example server configuration
Document the complete subject hierarchy including deploy subjects,
response subjects, and discovery subject. Add example NATS server
configuration demonstrating tiered authentication with listener,
test deployer, and admin deployer permission patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:48:11 +01:00
f03eb5f7dc feat: add environment variable support for deploy command flags
Allows setting --nats-url, --nkey-file, --branch, --action, and --timeout
via HOMELAB_DEPLOY_* environment variables.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:43:50 +01:00
f51058964d fix: verify NKey file has secure permissions before reading
Reject NKey files that are readable by group or others (permissions
more permissive than 0600). This prevents accidental exposure of
private keys through overly permissive file permissions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:40:53 +01:00
95fbfb2339 chore: extract version from main.go in flake.nix
Use builtins.match to parse version from cmd/homelab-deploy/main.go
so only one location needs updating when bumping versions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:25:38 +01:00
e1ab4599a8 docs: update CLAUDE.md to match implementation
- Add github.com/google/uuid to dependencies list
- Fix version bumping: both main.go and flake.nix need updates
- Add section on updating vendorHash when dependencies change
- Use nix run .#default instead of nix build for verification

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:24:21 +01:00
a19ed50394 docs: add README with CLI, MCP, and NixOS module documentation
Document all three operational modes, CLI flags, MCP tools,
NixOS module options, and the message protocol.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:21:10 +01:00
fa49e9322a feat: implement NATS-based NixOS deployment system
Implement the complete homelab-deploy system with three operational modes:

- Listener mode: Runs on NixOS hosts as a systemd service, subscribes to
  NATS subjects with configurable templates, executes nixos-rebuild on
  deployment requests with concurrency control

- MCP mode: MCP server exposing deploy, deploy_admin, and list_hosts
  tools for AI assistants with tiered access control

- CLI mode: Manual deployment commands with subject alias support via
  environment variables

Key components:
- internal/messages: Request/response types with validation
- internal/nats: Client wrapper with NKey authentication
- internal/deploy: Executor with timeout and lock for concurrency
- internal/listener: Subject template expansion and request handling
- internal/cli: Deploy logic with alias resolution
- internal/mcp: MCP server with mcp-go integration
- nixos/module.nix: NixOS module with hardened systemd service

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:19:47 +01:00
ad7d1a650c chore: add gopls LSP configuration for Claude Code
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 03:52:25 +01:00
1f23a6ddc9 docs: update design with configurable subjects and improved module
- Add configurable NATS subject patterns with template variables
  (<hostname>, <tier>, <role>) for multi-tenant setups
- Add deploy.discover subject for host discovery
- Simplify CLI to use direct subjects with optional aliases via
  HOMELAB_DEPLOY_ALIAS_* environment variables
- Clarify request/reply flow with UUID-based response subjects
- Expand NixOS module with hardening options, package option,
  and configurable deploy/discover subjects
- Switch CLI framework from cobra to urfave/cli/v3

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 03:52:01 +01:00
1460cc533d chore: add CLAUDE.md and configure nix dev shell
Add CLAUDE.md with project guidance for Claude Code including
architecture overview, build commands, and testing procedures.

Update flake.nix with proper Go development shell (go, gopls,
gotools, golangci-lint, govulncheck, delve) and buildGoModule
package definition.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 03:16:20 +01:00
737bb162c9 chore: initial commit with scaffolding and design doc
2026-02-07 03:07:30 +01:00