Commit Graph

14 Commits

Author SHA1 Message Date
c13914bf5a fix(builder): truncate large error output to prevent log overflow
Build errors from nix can be very large (100k+ chars). This truncates
error output to the first 50 and last 50 lines when it exceeds 100
lines, preventing journal and NATS message overflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:42:13 +01:00
c52e88ca7e fix: add validation for config and reply subjects
Address medium severity security issues:

- Validate repo names in config only allow alphanumeric, dash, underscore
  (prevents NATS subject injection via dots or wildcards)
- Validate repo URLs must start with git+https://, git+ssh://, or git+file://
- Validate ReplyTo field must start with "build.responses." to prevent
  publishing responses to arbitrary NATS subjects

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:09:51 +01:00
08f1fcc6ac fix: validate target and hostname inputs to prevent injection
Add input validation to address security concerns:

- Validate Target field in BuildRequest against safe character pattern
  (must be "all" or match alphanumeric/dash/underscore/dot pattern)
- Filter hostnames discovered from nix flake show output, skipping any
  with invalid characters before using them in build commands

This prevents potential command injection via crafted NATS messages or
malicious flake configurations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:07:26 +01:00
14f5b31faf feat: add builder mode for centralized Nix builds
Add a new "builder" capability to trigger Nix builds on a dedicated
build host via NATS messaging. This allows pre-building NixOS
configurations before deployment.

New components:
- Builder mode: subscribes to build.<repo>.* subjects, executes nix build
- Build CLI command: triggers builds with progress tracking
- MCP build tool: available with --enable-builds flag
- Builder metrics: tracks build success/failure per repo and host
- NixOS module: services.homelab-deploy.builder

The builder uses a YAML config file to define allowed repositories
with their URLs and default branches. Builds can target all hosts
or specific hosts, with real-time progress updates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:03:14 +01:00
bc02393c5a fix: wait for metrics scrape before restarting after switch deployment
After a successful switch deployment, the listener now waits for Prometheus
to scrape the /metrics endpoint before exiting for restart. This ensures
deployment metrics are captured before the process restarts and resets
in-memory counters. Falls back to a 60 second timeout if no scrape occurs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 15:44:14 +01:00
746e30b24f fix: initialize counter and histogram metrics at startup
Counter and histogram metrics were absent from Prometheus scrapes until
the first deployment occurred, making it impossible to distinguish
"no deployments" from "exporter not running" in dashboards and alerts.

Initialize all expected label combinations with zero values when the
collector is created so metrics appear in every scrape from startup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:29:36 +01:00
fd0d63b103 fix: flush NATS buffer after sending completed response
The CLI was reporting deployment failures even when the listener showed
success. This was a race condition: after a successful switch deployment,
the listener would send the "completed" response then immediately signal
restart. The NATS connection closed before the buffered message was
actually sent to the broker, so the CLI never received it.

Adding Flush() after sending the completed response ensures the message
reaches NATS before the listener can exit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:30:34 +01:00
36a74b8cf9 feat: add heartbeat status updates during deployment
Send periodic "running" status messages while nixos-rebuild executes,
preventing the idle timeout from triggering before deployments complete.
This fixes false "Some deployments failed" warnings in MCP when builds
take longer than 30 seconds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:23:33 +01:00
79db119d1c feat: add Prometheus metrics to listener service
Add an optional Prometheus metrics HTTP endpoint to the listener for
monitoring deployment operations. Includes four metrics:

- homelab_deploy_deployments_total (counter with status/action/error_code)
- homelab_deploy_deployment_duration_seconds (histogram with action/success)
- homelab_deploy_deployment_in_progress (gauge)
- homelab_deploy_info (gauge with hostname/tier/role/version)

New CLI flags: --metrics-enabled, --metrics-addr (default :9972)
New NixOS options: metrics.enable, metrics.address, metrics.openFirewall

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:58:22 +01:00
2c97b6140c fix: check only final responses in AllSucceeded to determine deployment success
The CLI was incorrectly reporting "some deployments failed" even when
deployments succeeded. This was because AllSucceeded() checked if every
response had StatusCompleted, but the Responses slice contains all
messages including intermediate ones like "started". Since started !=
completed, it returned false.

Now AllSucceeded() only examines final responses (using IsFinal()) and
checks that each host's final status is completed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:19:08 +01:00
efacb13b86 feat: exit listener after successful switch for automatic restart
After a successful switch deployment, the listener now exits gracefully
so systemd can restart it with the new binary. This works together with
stopIfChanged/restartIfChanged to ensure deployments complete before
the service restarts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:11:03 +01:00
5f3cfc3d21 fix: add nixos-rebuild to PATH and fix CLI hanging after deploy failure
- Add nixos-rebuild to listener service PATH in NixOS module
- Fix CLI deploy command hanging after receiving final status by properly
  tracking lastResponse time and exiting when all hosts have responded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:53:22 +01:00
f51058964d fix: verify NKey file has secure permissions before reading
Reject NKey files that are readable by group or others (permissions
more permissive than 0600). This prevents accidental exposure of
private keys through overly permissive file permissions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:40:53 +01:00
fa49e9322a feat: implement NATS-based NixOS deployment system
Implement the complete homelab-deploy system with three operational modes:

- Listener mode: Runs on NixOS hosts as a systemd service, subscribes to
  NATS subjects with configurable templates, executes nixos-rebuild on
  deployment requests with concurrency control

- MCP mode: MCP server exposing deploy, deploy_admin, and list_hosts
  tools for AI assistants with tiered access control

- CLI mode: Manual deployment commands with subject alias support via
  environment variables

Key components:
- internal/messages: Request/response types with validation
- internal/nats: Client wrapper with NKey authentication
- internal/deploy: Executor with timeout and lock for concurrency
- internal/listener: Subject template expansion and request handling
- internal/cli: Deploy logic with alias resolution
- internal/mcp: MCP server with mcp-go integration
- nixos/module.nix: NixOS module with hardened systemd service

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:19:47 +01:00