4 Commits

Author SHA1 Message Date
c272ce6903 docs: document --debug flag and extraArgs module option
Add documentation for:
- --debug flag in Listener Flags table
- --heartbeat-interval flag (was missing)
- extraArgs NixOS module option
- New Troubleshooting section with debug logging examples
  and guidance for diagnosing metrics issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 01:28:21 +01:00
c934d1ba38 feat: add --debug flag for metrics troubleshooting
Add a --debug flag to the listener command that enables debug-level
logging. When enabled, the listener logs detailed information about
metrics recording including:

- When deployment start/end metrics are recorded
- The action, success status, and duration being recorded
- Whether metrics are enabled or disabled (skipped)

This helps troubleshoot issues where deployment metrics appear to
remain at zero after deployments.

Also add extraArgs option to the NixOS module to allow passing
additional arguments like --debug to the service.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 01:26:12 +01:00
723a1f769f test: add histogram verification to deployment metrics tests
The existing tests for RecordDeploymentEnd and RecordDeploymentFailure
only verified the counter was incremented, not that the histogram was
updated with duration observations. Add histogram verification to:

- TestCollector_RecordDeploymentEnd_Success
- TestCollector_RecordDeploymentEnd_Failure
- TestCollector_RecordDeploymentFailure

Also add listener tests to verify metrics are properly initialized when
MetricsEnabled is true and that the recording functions work correctly
in the context of deployment handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 01:13:20 +01:00
46fc6a7e96 fix: wait for metrics scrape before restarting after switch deployment
After a successful switch deployment, the listener now waits for Prometheus
to scrape the /metrics endpoint before exiting for restart. This ensures
deployment metrics are captured before the process restarts and resets
in-memory counters. Falls back to a 60 second timeout if no scrape occurs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 23:38:26 +01:00
24 changed files with 623 additions and 1908 deletions

README.md

@@ -4,12 +4,11 @@ A message-based deployment system for NixOS configurations using NATS for messag
## Overview
-The `homelab-deploy` binary provides four operational modes:
+The `homelab-deploy` binary provides three operational modes:
1. **Listener mode** - Runs on each NixOS host as a systemd service, subscribing to NATS subjects and executing `nixos-rebuild` when deployment requests arrive
-2. **Builder mode** - Runs on a dedicated build host, subscribing to NATS subjects and executing `nix build` to pre-build configurations
-3. **MCP mode** - Runs as an MCP (Model Context Protocol) server, exposing deployment tools for AI assistants
-4. **CLI mode** - Manual deployment and build commands for administrators
+2. **MCP mode** - Runs as an MCP (Model Context Protocol) server, exposing deployment tools for AI assistants
+3. **CLI mode** - Manual deployment commands for administrators
## Installation
@@ -64,6 +63,8 @@ homelab-deploy listener \
| `--discover-subject` | No | Discovery subject (default: `deploy.discover`) |
| `--metrics-enabled` | No | Enable Prometheus metrics endpoint |
| `--metrics-addr` | No | Metrics HTTP server address (default: `:9972`) |
| `--heartbeat-interval` | No | Status update interval in seconds during deployment (default: 15) |
| `--debug` | No | Enable debug logging for troubleshooting |
#### Subject Templates
@@ -129,82 +130,6 @@ homelab-deploy deploy prod-dns --nats-url ... --nkey-file ...
Alias lookup: `HOMELAB_DEPLOY_ALIAS_<NAME>` where name is uppercased and hyphens become underscores.
### Builder Mode
Run on a dedicated build host to pre-build NixOS configurations:
```bash
homelab-deploy builder \
--nats-url nats://nats.example.com:4222 \
--nkey-file /run/secrets/builder.nkey \
--config /etc/homelab-deploy/builder.yaml \
--timeout 1800 \
--metrics-enabled \
--metrics-addr :9973
```
#### Builder Configuration File
The builder uses a YAML configuration file to define allowed repositories:
```yaml
repos:
nixos-servers:
url: "git+https://git.example.com/org/nixos-servers.git"
default_branch: "master"
homelab:
url: "git+ssh://git@github.com/user/homelab.git"
default_branch: "main"
```
#### Builder Flags
| Flag | Required | Description |
|------|----------|-------------|
| `--nats-url` | Yes | NATS server URL |
| `--nkey-file` | Yes | Path to NKey seed file |
| `--config` | Yes | Path to builder configuration file |
| `--timeout` | No | Build timeout per host in seconds (default: 1800) |
| `--metrics-enabled` | No | Enable Prometheus metrics endpoint |
| `--metrics-addr` | No | Metrics HTTP server address (default: `:9973`) |
### Build Command
Trigger a build on the build server:
```bash
# Build all hosts in a repository
homelab-deploy build nixos-servers --all \
--nats-url nats://nats.example.com:4222 \
--nkey-file /run/secrets/deployer.nkey
# Build a specific host
homelab-deploy build nixos-servers myhost \
--nats-url nats://nats.example.com:4222 \
--nkey-file /run/secrets/deployer.nkey
# Build with a specific branch
homelab-deploy build nixos-servers --all --branch feature-x \
--nats-url nats://nats.example.com:4222 \
--nkey-file /run/secrets/deployer.nkey
# JSON output for scripting
homelab-deploy build nixos-servers --all --json \
--nats-url nats://nats.example.com:4222 \
--nkey-file /run/secrets/deployer.nkey
```
#### Build Flags
| Flag | Required | Env Var | Description |
|------|----------|---------|-------------|
| `--nats-url` | Yes | `HOMELAB_DEPLOY_NATS_URL` | NATS server URL |
| `--nkey-file` | Yes | `HOMELAB_DEPLOY_NKEY_FILE` | Path to NKey seed file |
| `--branch` | No | `HOMELAB_DEPLOY_BRANCH` | Git branch (uses repo default if not specified) |
| `--all` | No | - | Build all hosts in the repository |
| `--timeout` | No | `HOMELAB_DEPLOY_BUILD_TIMEOUT` | Response timeout in seconds (default: 3600) |
| `--json` | No | - | Output results as JSON |
### MCP Server Mode
Run as an MCP server for AI assistant integration:
@@ -221,12 +146,6 @@ homelab-deploy mcp \
--nkey-file /run/secrets/mcp.nkey \
--enable-admin \
--admin-nkey-file /run/secrets/admin.nkey
# With build tool enabled
homelab-deploy mcp \
--nats-url nats://nats.example.com:4222 \
--nkey-file /run/secrets/mcp.nkey \
--enable-builds
```
#### MCP Tools
@@ -236,7 +155,6 @@ homelab-deploy mcp \
| `deploy` | Deploy to test-tier hosts only |
| `deploy_admin` | Deploy to any tier (requires `--enable-admin`) |
| `list_hosts` | Discover available deployment targets |
| `build` | Trigger builds on the build server (requires `--enable-builds`) |
#### Tool Parameters
@@ -251,12 +169,6 @@ homelab-deploy mcp \
**list_hosts:**
- `tier` - Filter by tier (optional)
**build:**
- `repo` - Repository name (required, must match builder config)
- `target` - Target hostname (optional, defaults to all)
- `all` - Build all hosts (default if no target specified)
- `branch` - Git branch (uses repo default if not specified)
## NixOS Module
Add the module to your NixOS configuration:
@@ -304,6 +216,7 @@ Add the module to your NixOS configuration:
| `metrics.enable` | bool | `false` | Enable Prometheus metrics endpoint |
| `metrics.address` | string | `":9972"` | Metrics HTTP server address |
| `metrics.openFirewall` | bool | `false` | Open firewall for metrics port |
| `extraArgs` | list of string | `[]` | Extra command line arguments (e.g., `["--debug"]`) |
Default `deploySubjects`:
```nix
@@ -314,65 +227,6 @@ Default `deploySubjects`:
]
```
### Builder Module Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `enable` | bool | `false` | Enable the builder service |
| `package` | package | from flake | Package to use |
| `natsUrl` | string | required | NATS server URL |
| `nkeyFile` | path | required | Path to NKey seed file |
| `configFile` | path | `null` | Path to builder config file (alternative to `settings`) |
| `settings.repos` | attrs | `{}` | Repository configuration (see below) |
| `timeout` | int | `1800` | Build timeout per host in seconds |
| `environment` | attrs | `{}` | Additional environment variables |
| `metrics.enable` | bool | `false` | Enable Prometheus metrics endpoint |
| `metrics.address` | string | `":9973"` | Metrics HTTP server address |
| `metrics.openFirewall` | bool | `false` | Open firewall for metrics port |
Each entry in `settings.repos` is an attribute set with:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `url` | string | required | Git flake URL (must start with `git+https://`, `git+ssh://`, or `git+file://`) |
| `defaultBranch` | string | `"master"` | Default branch to build when not specified |
Example builder configuration using `settings`:
```nix
services.homelab-deploy.builder = {
enable = true;
natsUrl = "nats://nats.example.com:4222";
nkeyFile = "/run/secrets/homelab-deploy-builder-nkey";
settings.repos = {
nixos-servers = {
url = "git+https://git.example.com/org/nixos-servers.git";
defaultBranch = "master";
};
homelab = {
url = "git+ssh://git@github.com/user/homelab.git";
defaultBranch = "main";
};
};
metrics = {
enable = true;
address = ":9973";
openFirewall = true;
};
};
```
Alternatively, you can use `configFile` to point to an external YAML file:
```nix
services.homelab-deploy.builder = {
enable = true;
natsUrl = "nats://nats.example.com:4222";
nkeyFile = "/run/secrets/homelab-deploy-builder-nkey";
configFile = "/etc/homelab-deploy/builder.yaml";
};
```
## Prometheus Metrics
The listener can expose Prometheus metrics for monitoring deployment operations.
@@ -447,23 +301,56 @@ histogram_quantile(0.95, rate(homelab_deploy_deployment_duration_seconds_bucket[
sum(homelab_deploy_deployment_in_progress)
```
-### Builder Metrics
-When running in builder mode, additional metrics are available:
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `homelab_deploy_builds_total` | Counter | `repo`, `status` | Total builds processed |
-| `homelab_deploy_build_host_total` | Counter | `repo`, `host`, `status` | Total host builds processed |
-| `homelab_deploy_build_duration_seconds` | Histogram | `repo`, `host` | Build execution time per host |
-| `homelab_deploy_build_last_timestamp` | Gauge | `repo` | Timestamp of last build attempt |
-| `homelab_deploy_build_last_success_timestamp` | Gauge | `repo` | Timestamp of last successful build |
-| `homelab_deploy_build_last_failure_timestamp` | Gauge | `repo` | Timestamp of last failed build |
-**Label values:**
-- `status`: `success`, `failure`
-- `repo`: Repository name from config
-- `host`: Host name being built
+## Troubleshooting
+### Debug Logging
+Enable debug logging to diagnose issues with deployments or metrics:
+**CLI:**
+```bash
+homelab-deploy listener --debug \
+--hostname myhost \
+--tier prod \
+--nats-url nats://nats.example.com:4222 \
+--nkey-file /run/secrets/listener.nkey \
+--flake-url git+https://git.example.com/user/nixos-configs.git \
+--metrics-enabled
+```
**NixOS module:**
```nix
services.homelab-deploy.listener = {
enable = true;
tier = "prod";
natsUrl = "nats://nats.example.com:4222";
nkeyFile = "/run/secrets/homelab-deploy-nkey";
flakeUrl = "git+https://git.example.com/user/nixos-configs.git";
metrics.enable = true;
extraArgs = [ "--debug" ];
};
```
With debug logging enabled, the listener outputs detailed information about metrics recording:
```json
{"level":"DEBUG","msg":"recording deployment start metric","metrics_enabled":true}
{"level":"DEBUG","msg":"recording deployment end metric (success)","action":"switch","success":true,"duration_seconds":120.5}
```
### Metrics Showing Zero
If deployment metrics remain at zero after deployments:
1. **Check metrics are enabled**: Verify `--metrics-enabled` is set and the metrics endpoint is accessible at `/metrics`
2. **Enable debug logging**: Use `--debug` to confirm metrics recording is being called
3. **Check deployment status**: Metrics are only recorded for deployments that complete (success or failure). Rejected requests (e.g., already running) increment the counter with `status="rejected"` but don't record duration
4. **Check after restart**: After a successful `switch` deployment, the listener restarts. Metrics reset to zero in the new instance. The listener waits up to 60 seconds for a Prometheus scrape before restarting to capture the final metrics
5. **Verify Prometheus scrape timing**: Ensure Prometheus scrapes frequently enough to capture metrics before the listener restarts
## Message Protocol
@@ -492,37 +379,6 @@ When running in builder mode, additional metrics are available:
**Error codes:** `invalid_revision`, `invalid_action`, `already_running`, `build_failed`, `timeout`
### Build Request
```json
{
"repo": "nixos-servers",
"target": "all",
"branch": "main",
"reply_to": "build.responses.abc123"
}
```
### Build Response
```json
{
"status": "completed",
"message": "built 5/5 hosts successfully",
"results": [
{"host": "host1", "success": true, "duration_seconds": 120.5},
{"host": "host2", "success": true, "duration_seconds": 95.3}
],
"total_duration_seconds": 450.2,
"succeeded": 5,
"failed": 0
}
```
**Status values:** `started`, `progress`, `completed`, `failed`, `rejected`
Progress updates include `host`, `host_success`, `hosts_completed`, and `hosts_total` fields.
## NATS Authentication
All connections use NKey authentication. Generate keys with:
@@ -552,22 +408,13 @@ The deployment system uses the following NATS subject hierarchy:
- `deploy.prod.all` - Deploy to all production hosts
- `deploy.prod.role.dns` - Deploy to all DNS servers in production
### Build Subjects
| Subject Pattern | Purpose |
|-----------------|---------|
| `build.<repo>.*` | Build requests for a repository |
| `build.<repo>.all` | Build all hosts in a repository |
| `build.<repo>.<hostname>` | Build a specific host |
### Response Subjects
| Subject Pattern | Purpose |
|-----------------|---------|
| `deploy.responses.<uuid>` | Unique reply subject for each deployment request |
| `build.responses.<uuid>` | Unique reply subject for each build request |
-Deployers and build clients create a unique response subject for each request and include it in the `reply_to` field. Listeners and builders publish status updates to this subject.
+Deployers create a unique response subject for each request and include it in the `reply_to` field. Listeners publish status updates to this subject.
### Discovery Subject
@@ -658,9 +505,7 @@ authorization {
| Credential Type | Publish | Subscribe |
|-----------------|---------|-----------|
| Listener | `deploy.responses.>`, `deploy.discover` | Own subjects, `deploy.discover` |
| Builder | `build.responses.>` | `build.<repo>.*` for each configured repo |
| Test deployer | `deploy.test.>`, `deploy.discover` | `deploy.responses.>`, `deploy.discover` |
| Build client | `build.<repo>.*` | `build.responses.>` |
| Admin deployer | `deploy.>` | `deploy.>` |
### Generating NKeys


@@ -9,15 +9,14 @@ import (
"syscall"
"time"
-"code.t-juice.club/torjus/homelab-deploy/internal/builder"
-deploycli "code.t-juice.club/torjus/homelab-deploy/internal/cli"
-"code.t-juice.club/torjus/homelab-deploy/internal/listener"
-"code.t-juice.club/torjus/homelab-deploy/internal/mcp"
-"code.t-juice.club/torjus/homelab-deploy/internal/messages"
+deploycli "git.t-juice.club/torjus/homelab-deploy/internal/cli"
+"git.t-juice.club/torjus/homelab-deploy/internal/listener"
+"git.t-juice.club/torjus/homelab-deploy/internal/mcp"
+"git.t-juice.club/torjus/homelab-deploy/internal/messages"
"github.com/urfave/cli/v3"
)
-const version = "0.2.5"
+const version = "0.1.14"
func main() {
app := &cli.Command{
@@ -26,10 +25,8 @@ func main() {
Version: version,
Commands: []*cli.Command{
listenerCommand(),
builderCommand(),
mcpCommand(),
deployCommand(),
buildCommand(),
listHostsCommand(),
},
}
@@ -45,6 +42,10 @@ func listenerCommand() *cli.Command {
Name: "listener",
Usage: "Run as a deployment listener (systemd service mode)",
Flags: []cli.Flag{
&cli.BoolFlag{
Name: "debug",
Usage: "Enable debug logging for troubleshooting",
},
&cli.StringFlag{
Name: "hostname",
Usage: "Hostname for this listener",
@@ -128,10 +129,16 @@ func listenerCommand() *cli.Command {
MetricsEnabled: c.Bool("metrics-enabled"),
MetricsAddr: c.String("metrics-addr"),
Version: version,
Debug: c.Bool("debug"),
}
logLevel := slog.LevelInfo
if c.Bool("debug") {
logLevel = slog.LevelDebug
}
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
-Level: slog.LevelInfo,
+Level: logLevel,
}))
l := listener.New(cfg, logger)
@@ -178,10 +185,6 @@ func mcpCommand() *cli.Command {
Usage: "Timeout in seconds for deployment operations",
Value: 900,
},
&cli.BoolFlag{
Name: "enable-builds",
Usage: "Enable build tool",
},
},
Action: func(_ context.Context, c *cli.Command) error {
enableAdmin := c.Bool("enable-admin")
@@ -196,7 +199,6 @@ func mcpCommand() *cli.Command {
NKeyFile: c.String("nkey-file"),
EnableAdmin: enableAdmin,
AdminNKeyFile: adminNKeyFile,
EnableBuilds: c.Bool("enable-builds"),
DiscoverSubject: c.String("discover-subject"),
Timeout: time.Duration(c.Int("timeout")) * time.Second,
}
@@ -382,204 +384,3 @@ func listHostsCommand() *cli.Command {
},
}
}
func builderCommand() *cli.Command {
return &cli.Command{
Name: "builder",
Usage: "Run as a build server (systemd service mode)",
Flags: []cli.Flag{
&cli.StringFlag{
Name: "nats-url",
Usage: "NATS server URL",
Required: true,
},
&cli.StringFlag{
Name: "nkey-file",
Usage: "Path to NKey seed file for NATS authentication",
Required: true,
},
&cli.StringFlag{
Name: "config",
Usage: "Path to builder configuration file",
Required: true,
},
&cli.IntFlag{
Name: "timeout",
Usage: "Build timeout in seconds per host",
Value: 1800,
},
&cli.BoolFlag{
Name: "metrics-enabled",
Usage: "Enable Prometheus metrics endpoint",
},
&cli.StringFlag{
Name: "metrics-addr",
Usage: "Address for Prometheus metrics HTTP server",
Value: ":9973",
},
},
Action: func(ctx context.Context, c *cli.Command) error {
repoCfg, err := builder.LoadConfig(c.String("config"))
if err != nil {
return fmt.Errorf("failed to load config: %w", err)
}
cfg := builder.BuilderConfig{
NATSUrl: c.String("nats-url"),
NKeyFile: c.String("nkey-file"),
ConfigFile: c.String("config"),
Timeout: time.Duration(c.Int("timeout")) * time.Second,
MetricsEnabled: c.Bool("metrics-enabled"),
MetricsAddr: c.String("metrics-addr"),
}
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
Level: slog.LevelInfo,
}))
b := builder.New(cfg, repoCfg, logger)
// Handle shutdown signals
ctx, cancel := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM)
defer cancel()
return b.Run(ctx)
},
}
}
func buildCommand() *cli.Command {
return &cli.Command{
Name: "build",
Usage: "Trigger a build on the build server",
ArgsUsage: "<repo> [hostname]",
Flags: []cli.Flag{
&cli.StringFlag{
Name: "nats-url",
Usage: "NATS server URL",
Sources: cli.EnvVars("HOMELAB_DEPLOY_NATS_URL"),
Required: true,
},
&cli.StringFlag{
Name: "nkey-file",
Usage: "Path to NKey seed file for NATS authentication",
Sources: cli.EnvVars("HOMELAB_DEPLOY_NKEY_FILE"),
Required: true,
},
&cli.StringFlag{
Name: "branch",
Usage: "Git branch to build (uses repo default if not specified)",
Sources: cli.EnvVars("HOMELAB_DEPLOY_BRANCH"),
},
&cli.BoolFlag{
Name: "all",
Usage: "Build all hosts in the repo",
},
&cli.IntFlag{
Name: "timeout",
Usage: "Timeout in seconds for collecting responses",
Sources: cli.EnvVars("HOMELAB_DEPLOY_BUILD_TIMEOUT"),
Value: 3600,
},
&cli.BoolFlag{
Name: "json",
Usage: "Output results as JSON",
},
},
Action: func(ctx context.Context, c *cli.Command) error {
if c.Args().Len() < 1 {
return fmt.Errorf("repo argument required")
}
repo := c.Args().First()
target := c.Args().Get(1)
all := c.Bool("all")
if target == "" && !all {
return fmt.Errorf("must specify hostname or --all")
}
if target != "" && all {
return fmt.Errorf("cannot specify both hostname and --all")
}
if all {
target = "all"
}
cfg := deploycli.BuildConfig{
NATSUrl: c.String("nats-url"),
NKeyFile: c.String("nkey-file"),
Repo: repo,
Target: target,
Branch: c.String("branch"),
Timeout: time.Duration(c.Int("timeout")) * time.Second,
}
jsonOutput := c.Bool("json")
if !jsonOutput {
branchStr := cfg.Branch
if branchStr == "" {
branchStr = "(default)"
}
fmt.Printf("Building %s target=%s branch=%s\n", repo, target, branchStr)
}
// Handle shutdown signals
ctx, cancel := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM)
defer cancel()
result, err := deploycli.Build(ctx, cfg, func(resp *messages.BuildResponse) {
if jsonOutput {
return
}
switch resp.Status {
case messages.BuildStatusStarted:
fmt.Printf("Started: %s\n", resp.Message)
case messages.BuildStatusProgress:
successStr := "..."
if resp.HostSuccess != nil {
if *resp.HostSuccess {
successStr = "success"
} else {
successStr = "failed"
}
}
fmt.Printf("[%d/%d] %s: %s\n", resp.HostsCompleted, resp.HostsTotal, resp.Host, successStr)
case messages.BuildStatusCompleted, messages.BuildStatusFailed:
fmt.Printf("\n%s\n", resp.Message)
case messages.BuildStatusRejected:
fmt.Printf("Rejected: %s\n", resp.Message)
}
})
if err != nil {
return fmt.Errorf("build failed: %w", err)
}
if jsonOutput {
data, err := result.MarshalJSON()
if err != nil {
return fmt.Errorf("failed to marshal result: %w", err)
}
fmt.Println(string(data))
} else if result.FinalResponse != nil {
fmt.Printf("\nBuild complete: %d succeeded, %d failed (%.1fs)\n",
result.FinalResponse.Succeeded,
result.FinalResponse.Failed,
result.FinalResponse.TotalDurationSeconds)
for _, hr := range result.FinalResponse.Results {
if !hr.Success {
fmt.Printf("\n--- %s (error: %s) ---\n", hr.Host, hr.Error)
if hr.Output != "" {
fmt.Println(hr.Output)
}
}
}
}
if !result.AllSucceeded() {
return fmt.Errorf("some builds failed")
}
return nil
},
}
}

flake.lock generated

@@ -2,11 +2,11 @@
"nodes": {
"nixpkgs": {
"locked": {
-"lastModified": 1770562336,
-"narHash": "sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84=",
+"lastModified": 1770197578,
+"narHash": "sha256-AYqlWrX09+HvGs8zM6ebZ1pwUqjkfpnv8mewYwAo+iM=",
"owner": "nixos",
"repo": "nixpkgs",
-"rev": "d6c71932130818840fc8fe9509cf50be8c64634f",
+"rev": "00c21e4c93d963c50d4c0c89bfa84ed6e0694df2",
"type": "github"
},
"original": {

go.mod

@@ -1,4 +1,4 @@
-module code.t-juice.club/torjus/homelab-deploy
+module git.t-juice.club/torjus/homelab-deploy
go 1.25.5
@@ -9,7 +9,6 @@ require (
github.com/nats-io/nkeys v0.4.15
github.com/prometheus/client_golang v1.23.2
github.com/urfave/cli/v3 v3.6.2
gopkg.in/yaml.v3 v3.0.1
)
require (
@@ -33,4 +32,5 @@ require (
golang.org/x/crypto v0.47.0 // indirect
golang.org/x/sys v0.40.0 // indirect
google.golang.org/protobuf v1.36.8 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)


@@ -1,377 +0,0 @@
package builder
import (
"context"
"fmt"
"log/slog"
"regexp"
"sort"
"strings"
"sync"
"time"
"code.t-juice.club/torjus/homelab-deploy/internal/messages"
"code.t-juice.club/torjus/homelab-deploy/internal/metrics"
"code.t-juice.club/torjus/homelab-deploy/internal/nats"
)
// hostnameRegex validates hostnames from flake output.
// Allows: alphanumeric, dashes, underscores, dots.
var hostnameRegex = regexp.MustCompile(`^[a-zA-Z0-9._-]+$`)
// truncateOutputLines truncates output to the first and last N lines if it exceeds 2*N lines,
// returning the result as a slice of strings.
func truncateOutputLines(output string, keepLines int) []string {
lines := strings.Split(output, "\n")
if len(lines) <= keepLines*2 {
return lines
}
head := lines[:keepLines]
tail := lines[len(lines)-keepLines:]
omitted := len(lines) - keepLines*2
result := make([]string, 0, keepLines*2+1)
result = append(result, head...)
result = append(result, fmt.Sprintf("... (%d lines omitted) ...", omitted))
result = append(result, tail...)
return result
}
// truncateOutput truncates output to the first and last N lines if it exceeds 2*N lines.
func truncateOutput(output string, keepLines int) string {
lines := strings.Split(output, "\n")
if len(lines) <= keepLines*2 {
return output
}
head := lines[:keepLines]
tail := lines[len(lines)-keepLines:]
omitted := len(lines) - keepLines*2
return strings.Join(head, "\n") + fmt.Sprintf("\n\n... (%d lines omitted) ...\n\n", omitted) + strings.Join(tail, "\n")
}
// BuilderConfig holds the configuration for the builder.
type BuilderConfig struct {
NATSUrl string
NKeyFile string
ConfigFile string
Timeout time.Duration
MetricsEnabled bool
MetricsAddr string
}
// Builder handles build requests from NATS.
type Builder struct {
cfg BuilderConfig
repoCfg *Config
client *nats.Client
executor *Executor
lock sync.Mutex
busy bool
logger *slog.Logger
// metrics server and collector (nil if metrics disabled)
metricsServer *metrics.Server
metrics *metrics.BuildCollector
}
// New creates a new builder with the given configuration.
func New(cfg BuilderConfig, repoCfg *Config, logger *slog.Logger) *Builder {
if logger == nil {
logger = slog.Default()
}
b := &Builder{
cfg: cfg,
repoCfg: repoCfg,
executor: NewExecutor(cfg.Timeout),
logger: logger,
}
if cfg.MetricsEnabled {
b.metricsServer = metrics.NewServer(metrics.ServerConfig{
Addr: cfg.MetricsAddr,
Logger: logger,
})
b.metrics = metrics.NewBuildCollector(b.metricsServer.Registry())
}
return b
}
// Run starts the builder and blocks until the context is cancelled.
func (b *Builder) Run(ctx context.Context) error {
// Start metrics server if enabled
if b.metricsServer != nil {
if err := b.metricsServer.Start(); err != nil {
return fmt.Errorf("failed to start metrics server: %w", err)
}
defer func() {
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_ = b.metricsServer.Shutdown(shutdownCtx)
}()
}
// Connect to NATS
b.logger.Info("connecting to NATS", "url", b.cfg.NATSUrl)
client, err := nats.Connect(nats.Config{
URL: b.cfg.NATSUrl,
NKeyFile: b.cfg.NKeyFile,
Name: "homelab-deploy-builder",
})
if err != nil {
return fmt.Errorf("failed to connect to NATS: %w", err)
}
b.client = client
defer b.client.Close()
b.logger.Info("connected to NATS")
// Subscribe to build subjects for each repo
for repoName := range b.repoCfg.Repos {
// Subscribe to build.<repo>.all and build.<repo>.<hostname>
allSubject := fmt.Sprintf("build.%s.*", repoName)
b.logger.Info("subscribing to build subject", "subject", allSubject)
if _, err := b.client.Subscribe(allSubject, b.handleBuildRequest); err != nil {
return fmt.Errorf("failed to subscribe to %s: %w", allSubject, err)
}
}
b.logger.Info("builder started", "repos", len(b.repoCfg.Repos))
// Wait for context cancellation
<-ctx.Done()
b.logger.Info("shutting down builder")
return nil
}
func (b *Builder) handleBuildRequest(subject string, data []byte) {
req, err := messages.UnmarshalBuildRequest(data)
if err != nil {
b.logger.Error("failed to unmarshal build request",
"subject", subject,
"error", err,
)
return
}
b.logger.Info("received build request",
"subject", subject,
"repo", req.Repo,
"target", req.Target,
"branch", req.Branch,
"reply_to", req.ReplyTo,
)
// Validate request
if err := req.Validate(); err != nil {
b.logger.Warn("invalid build request", "error", err)
b.sendResponse(req.ReplyTo, messages.NewBuildResponse(
messages.BuildStatusRejected,
err.Error(),
))
return
}
// Get repo config
repo, err := b.repoCfg.GetRepo(req.Repo)
if err != nil {
b.logger.Warn("unknown repo", "repo", req.Repo)
b.sendResponse(req.ReplyTo, messages.NewBuildResponse(
messages.BuildStatusRejected,
fmt.Sprintf("unknown repo: %s", req.Repo),
))
return
}
// Try to acquire lock
b.lock.Lock()
if b.busy {
b.lock.Unlock()
b.logger.Warn("build already in progress")
b.sendResponse(req.ReplyTo, messages.NewBuildResponse(
messages.BuildStatusRejected,
"another build is already in progress",
))
return
}
b.busy = true
b.lock.Unlock()
defer func() {
b.lock.Lock()
b.busy = false
b.lock.Unlock()
}()
// Use default branch if not specified
branch := req.Branch
if branch == "" {
branch = repo.DefaultBranch
}
// Determine hosts to build
var hosts []string
if req.Target == "all" {
// List hosts from flake
b.sendResponse(req.ReplyTo, messages.NewBuildResponse(
messages.BuildStatusStarted,
"discovering hosts...",
))
hosts, err = b.executor.ListHosts(context.Background(), repo.URL, branch)
if err != nil {
b.logger.Error("failed to list hosts", "error", err)
b.sendResponse(req.ReplyTo, messages.NewBuildResponse(
messages.BuildStatusFailed,
fmt.Sprintf("failed to list hosts: %v", err),
).WithError(err.Error()))
if b.metrics != nil {
b.metrics.RecordBuildFailure(req.Repo, "")
}
return
}
// Filter out hostnames with invalid characters (security: prevent injection)
validHosts := make([]string, 0, len(hosts))
for _, host := range hosts {
if hostnameRegex.MatchString(host) {
validHosts = append(validHosts, host)
} else {
b.logger.Warn("skipping hostname with invalid characters", "hostname", host)
}
}
hosts = validHosts
// Sort hosts for consistent ordering
sort.Strings(hosts)
} else {
hosts = []string{req.Target}
}
if len(hosts) == 0 {
b.sendResponse(req.ReplyTo, messages.NewBuildResponse(
messages.BuildStatusFailed,
"no hosts to build",
))
return
}
// Send started response
b.sendResponse(req.ReplyTo, &messages.BuildResponse{
Status: messages.BuildStatusStarted,
Message: fmt.Sprintf("building %d host(s)", len(hosts)),
HostsTotal: len(hosts),
})
// Build each host sequentially
startTime := time.Now()
results := make([]messages.BuildHostResult, 0, len(hosts))
succeeded := 0
failed := 0
for i, host := range hosts {
hostStart := time.Now()
b.logger.Info("building host",
"host", host,
"repo", req.Repo,
"rev", branch,
"progress", fmt.Sprintf("%d/%d", i+1, len(hosts)),
"command", b.executor.BuildCommand(repo.URL, branch, host),
)
result := b.executor.Build(context.Background(), repo.URL, branch, host)
hostDuration := time.Since(hostStart).Seconds()
hostResult := messages.BuildHostResult{
Host: host,
Success: result.Success,
DurationSeconds: hostDuration,
}
if !result.Success {
if result.Error != nil {
hostResult.Error = result.Error.Error()
}
if result.Stderr != "" {
hostResult.Output = truncateOutput(result.Stderr, 50)
}
}
results = append(results, hostResult)
if result.Success {
succeeded++
b.logger.Info("host build succeeded", "host", host, "repo", req.Repo, "rev", branch, "duration", hostDuration)
if b.metrics != nil {
b.metrics.RecordHostBuildSuccess(req.Repo, host, hostDuration)
}
} else {
failed++
b.logger.Error("host build failed", "host", host, "repo", req.Repo, "rev", branch, "error", hostResult.Error)
if result.Stderr != "" {
for _, line := range truncateOutputLines(result.Stderr, 50) {
b.logger.Warn("build output", "host", host, "repo", req.Repo, "line", line)
}
}
if b.metrics != nil {
b.metrics.RecordHostBuildFailure(req.Repo, host, hostDuration)
}
}
// Send progress update
success := result.Success
b.sendResponse(req.ReplyTo, &messages.BuildResponse{
Status: messages.BuildStatusProgress,
Host: host,
HostSuccess: &success,
HostsCompleted: i + 1,
HostsTotal: len(hosts),
})
}
totalDuration := time.Since(startTime).Seconds()
// Send final response
status := messages.BuildStatusCompleted
message := fmt.Sprintf("built %d/%d hosts successfully", succeeded, len(hosts))
if failed > 0 {
status = messages.BuildStatusFailed
message = fmt.Sprintf("build failed: %d/%d hosts failed", failed, len(hosts))
}
b.sendResponse(req.ReplyTo, &messages.BuildResponse{
Status: status,
Message: message,
Results: results,
TotalDurationSeconds: totalDuration,
Succeeded: succeeded,
Failed: failed,
})
// Record overall build metrics
if b.metrics != nil {
if failed == 0 {
b.metrics.RecordBuildSuccess(req.Repo)
} else {
b.metrics.RecordBuildFailure(req.Repo, "")
}
}
}
func (b *Builder) sendResponse(replyTo string, resp *messages.BuildResponse) {
data, err := resp.Marshal()
if err != nil {
b.logger.Error("failed to marshal build response", "error", err)
return
}
if err := b.client.Publish(replyTo, data); err != nil {
b.logger.Error("failed to publish build response",
"reply_to", replyTo,
"error", err,
)
}
// Flush to ensure response is sent immediately
if err := b.client.Flush(); err != nil {
b.logger.Error("failed to flush", "error", err)
}
}
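The builder above subscribes to `build.<repo>.*` and replies on the subject carried in the request's `reply_to` field. A minimal sketch of how a request and its subject fit together — the struct mirrors the fields read by `handleBuildRequest`, but the JSON tags and example values are assumptions for illustration, not the project's actual `messages.BuildRequest`:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BuildRequest mirrors the fields handleBuildRequest reads.
// The snake_case JSON tags are an assumption for illustration.
type BuildRequest struct {
	Repo    string `json:"repo"`
	Target  string `json:"target"`
	Branch  string `json:"branch"`
	ReplyTo string `json:"reply_to"`
}

// buildSubject returns the subject a request is published to,
// matching the builder's subscription pattern build.<repo>.*.
func buildSubject(repo, target string) string {
	return fmt.Sprintf("build.%s.%s", repo, target)
}

func main() {
	req := BuildRequest{
		Repo:    "homelab", // hypothetical repo name
		Target:  "all",     // or a single hostname
		Branch:  "",        // empty means the repo's default_branch
		ReplyTo: "build.responses.1234", // unique per request
	}
	payload, _ := json.Marshal(req)
	fmt.Println(buildSubject(req.Repo, req.Target))
	fmt.Println(string(payload))
}
```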


@@ -1,164 +0,0 @@
package builder
import (
"fmt"
"strings"
"testing"
)
func TestTruncateOutput(t *testing.T) {
tests := []struct {
name string
input string
keepLines int
wantLines int
wantOmit bool
}{
{
name: "short output unchanged",
input: "line1\nline2\nline3",
keepLines: 50,
wantLines: 3,
wantOmit: false,
},
{
name: "exactly at threshold unchanged",
input: strings.Join(makeLines(100), "\n"),
keepLines: 50,
wantLines: 100,
wantOmit: false,
},
{
name: "over threshold truncated",
input: strings.Join(makeLines(150), "\n"),
keepLines: 50,
wantLines: 103, // 50 + 1 (empty) + 1 (omitted msg) + 1 (empty) + 50
wantOmit: true,
},
{
name: "large output truncated",
input: strings.Join(makeLines(1000), "\n"),
keepLines: 50,
wantLines: 103,
wantOmit: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := truncateOutput(tt.input, tt.keepLines)
gotLines := strings.Split(got, "\n")
if len(gotLines) != tt.wantLines {
t.Errorf("got %d lines, want %d", len(gotLines), tt.wantLines)
}
hasOmit := strings.Contains(got, "lines omitted")
if hasOmit != tt.wantOmit {
t.Errorf("got omit marker = %v, want %v", hasOmit, tt.wantOmit)
}
if tt.wantOmit {
// Verify first and last lines are preserved
inputLines := strings.Split(tt.input, "\n")
firstLine := inputLines[0]
lastLine := inputLines[len(inputLines)-1]
if !strings.HasPrefix(got, firstLine+"\n") {
t.Errorf("first line not preserved, got prefix %q, want %q",
gotLines[0], firstLine)
}
if !strings.HasSuffix(got, lastLine) {
t.Errorf("last line not preserved, got suffix %q, want %q",
gotLines[len(gotLines)-1], lastLine)
}
}
})
}
}
func makeLines(n int) []string {
lines := make([]string, n)
for i := range lines {
lines[i] = "line " + strings.Repeat("x", i%80)
}
return lines
}
func TestTruncateOutputLines(t *testing.T) {
t.Run("short output returns all lines", func(t *testing.T) {
input := "line1\nline2\nline3"
got := truncateOutputLines(input, 50)
if len(got) != 3 {
t.Errorf("got %d lines, want 3", len(got))
}
if got[0] != "line1" || got[1] != "line2" || got[2] != "line3" {
t.Errorf("unexpected lines: %v", got)
}
})
t.Run("over threshold returns head + marker + tail", func(t *testing.T) {
lines := makeLines(200)
input := strings.Join(lines, "\n")
got := truncateOutputLines(input, 50)
// Should be 50 head + 1 marker + 50 tail = 101
if len(got) != 101 {
t.Errorf("got %d lines, want 101", len(got))
}
// Check first and last lines preserved
if got[0] != lines[0] {
t.Errorf("first line = %q, want %q", got[0], lines[0])
}
if got[len(got)-1] != lines[len(lines)-1] {
t.Errorf("last line = %q, want %q", got[len(got)-1], lines[len(lines)-1])
}
// Check omitted marker
marker := got[50]
expected := fmt.Sprintf("... (%d lines omitted) ...", 100)
if marker != expected {
t.Errorf("marker = %q, want %q", marker, expected)
}
})
t.Run("exactly at threshold returns all lines", func(t *testing.T) {
lines := makeLines(100)
input := strings.Join(lines, "\n")
got := truncateOutputLines(input, 50)
if len(got) != 100 {
t.Errorf("got %d lines, want 100", len(got))
}
})
}
func TestTruncateOutputPreservesContent(t *testing.T) {
// Create input with distinct first and last lines
lines := make([]string, 200)
for i := range lines {
lines[i] = "middle"
}
lines[0] = "FIRST"
lines[49] = "LAST_OF_HEAD"
lines[150] = "FIRST_OF_TAIL"
lines[199] = "LAST"
input := strings.Join(lines, "\n")
got := truncateOutput(input, 50)
if !strings.Contains(got, "FIRST") {
t.Error("missing FIRST")
}
if !strings.Contains(got, "LAST_OF_HEAD") {
t.Error("missing LAST_OF_HEAD")
}
if !strings.Contains(got, "FIRST_OF_TAIL") {
t.Error("missing FIRST_OF_TAIL")
}
if !strings.Contains(got, "LAST") {
t.Error("missing LAST")
}
if !strings.Contains(got, "(100 lines omitted)") {
t.Errorf("wrong omitted count, got: %s", got)
}
}


@@ -1,96 +0,0 @@
package builder
import (
"fmt"
"os"
"regexp"
"strings"
"gopkg.in/yaml.v3"
)
// repoNameRegex validates repository names for safe use in NATS subjects.
// Only allows alphanumeric, dashes, and underscores (no dots or wildcards).
var repoNameRegex = regexp.MustCompile(`^[a-zA-Z0-9_-]+$`)
// validURLPrefixes are the allowed prefixes for repository URLs.
var validURLPrefixes = []string{
"git+https://",
"git+ssh://",
"git+file://",
}
// RepoConfig holds configuration for a single repository.
type RepoConfig struct {
URL string `yaml:"url"`
DefaultBranch string `yaml:"default_branch"`
}
// Config holds the builder configuration.
type Config struct {
Repos map[string]RepoConfig `yaml:"repos"`
}
// LoadConfig loads configuration from a YAML file.
func LoadConfig(path string) (*Config, error) {
data, err := os.ReadFile(path)
if err != nil {
return nil, fmt.Errorf("failed to read config file: %w", err)
}
var cfg Config
if err := yaml.Unmarshal(data, &cfg); err != nil {
return nil, fmt.Errorf("failed to parse config file: %w", err)
}
if err := cfg.Validate(); err != nil {
return nil, err
}
return &cfg, nil
}
// Validate checks that the configuration is valid.
func (c *Config) Validate() error {
if len(c.Repos) == 0 {
return fmt.Errorf("no repos configured")
}
for name, repo := range c.Repos {
// Validate repo name for safe use in NATS subjects
if !repoNameRegex.MatchString(name) {
return fmt.Errorf("repo name %q contains invalid characters (only alphanumeric, dash, underscore allowed)", name)
}
if repo.URL == "" {
return fmt.Errorf("repo %q: url is required", name)
}
// Validate URL format
validURL := false
for _, prefix := range validURLPrefixes {
if strings.HasPrefix(repo.URL, prefix) {
validURL = true
break
}
}
if !validURL {
return fmt.Errorf("repo %q: url must start with git+https://, git+ssh://, or git+file://", name)
}
if repo.DefaultBranch == "" {
return fmt.Errorf("repo %q: default_branch is required", name)
}
}
return nil
}
// GetRepo returns the configuration for a repository, or an error if not found.
func (c *Config) GetRepo(name string) (*RepoConfig, error) {
repo, ok := c.Repos[name]
if !ok {
return nil, fmt.Errorf("repo %q not found in configuration", name)
}
return &repo, nil
}
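A config file that passes `Validate` might look like the sketch below. The repo names, URLs, and file path are illustrative; names must match `^[a-zA-Z0-9_-]+$` and URLs must use one of the `git+https://`, `git+ssh://`, or `git+file://` prefixes:

```yaml
# builder config (path illustrative, e.g. /etc/homelab-deploy/builder.yaml)
repos:
  homelab:
    url: git+https://example.com/infra.git
    default_branch: main
  dotfiles:
    url: git+ssh://git@example.com/dotfiles.git
    default_branch: master
```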


@@ -1,116 +0,0 @@
package builder
import (
"bytes"
"context"
"encoding/json"
"fmt"
"os/exec"
"time"
)
// Executor handles the execution of nix build commands.
type Executor struct {
timeout time.Duration
}
// NewExecutor creates a new build executor.
func NewExecutor(timeout time.Duration) *Executor {
return &Executor{
timeout: timeout,
}
}
// BuildResult contains the result of a build execution.
type BuildResult struct {
Success bool
ExitCode int
Stdout string
Stderr string
Error error
}
// FlakeShowResult contains the parsed output of nix flake show.
type FlakeShowResult struct {
NixosConfigurations map[string]any `json:"nixosConfigurations"`
}
// ListHosts returns the list of hosts (nixosConfigurations) available in a flake.
func (e *Executor) ListHosts(ctx context.Context, flakeURL, branch string) ([]string, error) {
ctx, cancel := context.WithTimeout(ctx, 60*time.Second)
defer cancel()
flakeRef := fmt.Sprintf("%s?ref=%s", flakeURL, branch)
cmd := exec.CommandContext(ctx, "nix", "flake", "show", "--json", flakeRef)
var stdout, stderr bytes.Buffer
cmd.Stdout = &stdout
cmd.Stderr = &stderr
if err := cmd.Run(); err != nil {
if ctx.Err() == context.DeadlineExceeded {
return nil, fmt.Errorf("timeout listing hosts")
}
return nil, fmt.Errorf("failed to list hosts: %w\n%s", err, stderr.String())
}
var result FlakeShowResult
if err := json.Unmarshal(stdout.Bytes(), &result); err != nil {
return nil, fmt.Errorf("failed to parse flake show output: %w", err)
}
hosts := make([]string, 0, len(result.NixosConfigurations))
for host := range result.NixosConfigurations {
hosts = append(hosts, host)
}
return hosts, nil
}
// Build builds a single host's system configuration.
func (e *Executor) Build(ctx context.Context, flakeURL, branch, host string) *BuildResult {
ctx, cancel := context.WithTimeout(ctx, e.timeout)
defer cancel()
// Build the flake reference for the system toplevel
flakeRef := fmt.Sprintf("%s?ref=%s#nixosConfigurations.%s.config.system.build.toplevel", flakeURL, branch, host)
cmd := exec.CommandContext(ctx, "nix", "build", "--no-link", flakeRef)
var stdout, stderr bytes.Buffer
cmd.Stdout = &stdout
cmd.Stderr = &stderr
err := cmd.Run()
result := &BuildResult{
Stdout: stdout.String(),
Stderr: stderr.String(),
}
if err != nil {
result.Success = false
result.Error = err
if ctx.Err() == context.DeadlineExceeded {
result.Error = fmt.Errorf("build timed out after %v", e.timeout)
}
if exitErr, ok := err.(*exec.ExitError); ok {
result.ExitCode = exitErr.ExitCode()
} else {
result.ExitCode = -1
}
} else {
result.Success = true
result.ExitCode = 0
}
return result
}
// BuildCommand returns the command that would be executed (for logging/debugging).
func (e *Executor) BuildCommand(flakeURL, branch, host string) string {
flakeRef := fmt.Sprintf("%s?ref=%s#nixosConfigurations.%s.config.system.build.toplevel", flakeURL, branch, host)
return fmt.Sprintf("nix build --no-link %s", flakeRef)
}
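For a concrete host, the flake reference format used by `Build` and `BuildCommand` expands like this (the repo URL and hostname are illustrative):

```go
package main

import "fmt"

// flakeRef reproduces the reference format used by Build and BuildCommand:
// <url>?ref=<branch>#nixosConfigurations.<host>.config.system.build.toplevel
func flakeRef(flakeURL, branch, host string) string {
	return fmt.Sprintf("%s?ref=%s#nixosConfigurations.%s.config.system.build.toplevel", flakeURL, branch, host)
}

func main() {
	ref := flakeRef("git+https://example.com/infra.git", "main", "web01")
	fmt.Println("nix build --no-link " + ref)
	// prints: nix build --no-link git+https://example.com/infra.git?ref=main#nixosConfigurations.web01.config.system.build.toplevel
}
```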


@@ -1,140 +0,0 @@
package cli
import (
"context"
"encoding/json"
"fmt"
"sync"
"time"
"github.com/google/uuid"
"code.t-juice.club/torjus/homelab-deploy/internal/messages"
"code.t-juice.club/torjus/homelab-deploy/internal/nats"
)
// BuildConfig holds configuration for a build operation.
type BuildConfig struct {
NATSUrl string
NKeyFile string
Repo string
Target string
Branch string
Timeout time.Duration
}
// BuildResult contains the aggregated results from a build.
type BuildResult struct {
Responses []*messages.BuildResponse
FinalResponse *messages.BuildResponse
Errors []error
}
// AllSucceeded returns true if the build completed successfully.
func (r *BuildResult) AllSucceeded() bool {
if len(r.Errors) > 0 {
return false
}
if r.FinalResponse == nil {
return false
}
return r.FinalResponse.Status == messages.BuildStatusCompleted && r.FinalResponse.Failed == 0
}
// MarshalJSON returns the JSON representation of the build result.
func (r *BuildResult) MarshalJSON() ([]byte, error) {
if r.FinalResponse != nil {
return json.Marshal(r.FinalResponse)
}
return json.Marshal(map[string]any{
"status": "unknown",
"responses": r.Responses,
"errors": r.Errors,
})
}
// Build triggers a build and collects responses.
func Build(ctx context.Context, cfg BuildConfig, onResponse func(*messages.BuildResponse)) (*BuildResult, error) {
// Connect to NATS
client, err := nats.Connect(nats.Config{
URL: cfg.NATSUrl,
NKeyFile: cfg.NKeyFile,
Name: "homelab-deploy-build-cli",
})
if err != nil {
return nil, fmt.Errorf("failed to connect to NATS: %w", err)
}
defer client.Close()
// Generate unique reply subject
requestID := uuid.New().String()
replySubject := fmt.Sprintf("build.responses.%s", requestID)
var mu sync.Mutex
result := &BuildResult{}
done := make(chan struct{})
// Subscribe to reply subject
sub, err := client.Subscribe(replySubject, func(subject string, data []byte) {
resp, err := messages.UnmarshalBuildResponse(data)
if err != nil {
mu.Lock()
result.Errors = append(result.Errors, fmt.Errorf("failed to unmarshal response: %w", err))
mu.Unlock()
return
}
mu.Lock()
result.Responses = append(result.Responses, resp)
if resp.Status.IsFinal() {
result.FinalResponse = resp
select {
case <-done:
default:
close(done)
}
}
mu.Unlock()
if onResponse != nil {
onResponse(resp)
}
})
if err != nil {
return nil, fmt.Errorf("failed to subscribe to reply subject: %w", err)
}
defer func() { _ = sub.Unsubscribe() }()
// Build and send request
req := &messages.BuildRequest{
Repo: cfg.Repo,
Target: cfg.Target,
Branch: cfg.Branch,
ReplyTo: replySubject,
}
data, err := req.Marshal()
if err != nil {
return nil, fmt.Errorf("failed to marshal request: %w", err)
}
// Publish to build.<repo>.<target>
buildSubject := fmt.Sprintf("build.%s.%s", cfg.Repo, cfg.Target)
if err := client.Publish(buildSubject, data); err != nil {
return nil, fmt.Errorf("failed to publish request: %w", err)
}
if err := client.Flush(); err != nil {
return nil, fmt.Errorf("failed to flush: %w", err)
}
// Wait for final response or timeout
select {
case <-ctx.Done():
return result, ctx.Err()
case <-done:
return result, nil
case <-time.After(cfg.Timeout):
return result, nil
}
}
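The select-with-default guard around `close(done)` above is safe only because it runs while holding `mu`; without a lock, two goroutines could both hit the `default` branch and double-close the channel. A standalone sketch of the same close-exactly-once idea using `sync.Once`, which needs no external lock — this is an alternative pattern for illustration, not the code's actual approach:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	done = make(chan struct{})
	once sync.Once
)

// signalDone closes done exactly once, no matter how many
// goroutines call it concurrently.
func signalDone() {
	once.Do(func() { close(done) })
}

func main() {
	var wg sync.WaitGroup
	// Several handlers racing to deliver a "final" response.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			signalDone()
		}()
	}
	wg.Wait()
	<-done // does not block: the channel is closed
	fmt.Println("done closed exactly once")
}
```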


@@ -8,8 +8,8 @@ import (
 	"github.com/google/uuid"
 
-	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
-	"code.t-juice.club/torjus/homelab-deploy/internal/nats"
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
+	"git.t-juice.club/torjus/homelab-deploy/internal/nats"
 )
 
 // DeployConfig holds configuration for a deploy operation.


@@ -3,7 +3,7 @@ package cli
 import (
 	"testing"
 
-	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
 )
 
 func TestDeployResult_AllSucceeded(t *testing.T) {


@@ -7,7 +7,7 @@ import (
 	"os/exec"
 	"time"
 
-	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
 )
 
 // Executor handles the execution of nixos-rebuild commands.


@@ -4,7 +4,7 @@ import (
 	"testing"
 	"time"
 
-	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
 )
 
 func TestExecutor_BuildCommand(t *testing.T) {


@@ -6,10 +6,10 @@ import (
 	"log/slog"
 	"time"
 
-	"code.t-juice.club/torjus/homelab-deploy/internal/deploy"
-	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
-	"code.t-juice.club/torjus/homelab-deploy/internal/metrics"
-	"code.t-juice.club/torjus/homelab-deploy/internal/nats"
+	"git.t-juice.club/torjus/homelab-deploy/internal/deploy"
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
+	"git.t-juice.club/torjus/homelab-deploy/internal/metrics"
+	"git.t-juice.club/torjus/homelab-deploy/internal/nats"
 )
 
 // Config holds the configuration for the listener.
@@ -27,6 +27,7 @@ type Config struct {
 	MetricsEnabled bool
 	MetricsAddr    string
 	Version        string
+	Debug          bool
 }
 
 // Listener handles deployment requests from NATS.
@@ -203,7 +204,14 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
 	// Record deployment start for metrics
 	if l.metrics != nil {
+		l.logger.Debug("recording deployment start metric",
+			"metrics_enabled", true,
+		)
 		l.metrics.RecordDeploymentStart()
+	} else {
+		l.logger.Debug("skipping deployment start metric",
+			"metrics_enabled", false,
+		)
 	}
 
 	startTime := time.Now()
@@ -219,9 +227,19 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
 			messages.StatusFailed,
 			fmt.Sprintf("revision validation failed: %v", err),
 		).WithError(messages.ErrorInvalidRevision))
-		if l.metrics != nil {
-			duration := time.Since(startTime).Seconds()
+		duration := time.Since(startTime).Seconds()
+		if l.metrics != nil {
+			l.logger.Debug("recording deployment failure metric (revision validation)",
+				"action", req.Action,
+				"error_code", messages.ErrorInvalidRevision,
+				"duration_seconds", duration,
+			)
 			l.metrics.RecordDeploymentFailure(req.Action, messages.ErrorInvalidRevision, duration)
+		} else {
+			l.logger.Debug("skipping deployment failure metric",
+				"metrics_enabled", false,
+				"duration_seconds", duration,
+			)
 		}
 		return
 	}
@@ -265,7 +283,17 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
 		l.logger.Error("failed to flush completed response", "error", err)
 	}
 
 	if l.metrics != nil {
+		l.logger.Debug("recording deployment end metric (success)",
+			"action", req.Action,
+			"success", true,
+			"duration_seconds", duration,
+		)
 		l.metrics.RecordDeploymentEnd(req.Action, true, duration)
+	} else {
+		l.logger.Debug("skipping deployment end metric",
+			"metrics_enabled", false,
+			"duration_seconds", duration,
+		)
 	}
 
 	// After a successful switch, signal restart so we pick up any new version
@@ -305,7 +333,17 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
 			fmt.Sprintf("deployment failed (exit code %d): %s", result.ExitCode, result.Stderr),
 		).WithError(errorCode))
 		if l.metrics != nil {
+			l.logger.Debug("recording deployment failure metric",
+				"action", req.Action,
+				"error_code", errorCode,
+				"duration_seconds", duration,
+			)
 			l.metrics.RecordDeploymentFailure(req.Action, errorCode, duration)
+		} else {
+			l.logger.Debug("skipping deployment failure metric",
+				"metrics_enabled", false,
+				"duration_seconds", duration,
+			)
 		}
 	}
 }


@@ -2,8 +2,14 @@ package listener
 import (
 	"log/slog"
+	"strings"
 	"testing"
 	"time"
+
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
+	"git.t-juice.club/torjus/homelab-deploy/internal/metrics"
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/testutil"
 )
 
 func TestNew(t *testing.T) {
@@ -51,3 +57,148 @@ func TestNew_WithLogger(t *testing.T) {
		t.Error("should use provided logger")
	}
}
func TestNew_WithMetricsEnabled(t *testing.T) {
cfg := Config{
Hostname: "test-host",
Tier: "test",
MetricsEnabled: true,
MetricsAddr: ":0",
}
l := New(cfg, nil)
if l.metricsServer == nil {
t.Error("metricsServer should not be nil when MetricsEnabled is true")
}
if l.metrics == nil {
t.Error("metrics should not be nil when MetricsEnabled is true")
}
}
func TestListener_MetricsRecordedOnDeployment(t *testing.T) {
// This test verifies that the listener correctly calls metrics functions
// when processing deployments. We test this by directly calling the internal
// metrics recording logic that handleDeployRequest uses.
reg := prometheus.NewRegistry()
collector := metrics.NewCollector(reg)
// Simulate what handleDeployRequest does for a successful deployment
collector.RecordDeploymentStart()
collector.RecordDeploymentEnd(messages.ActionSwitch, true, 120.5)
// Verify counter was incremented
counterExpected := `
# HELP homelab_deploy_deployments_total Total deployment requests processed
# TYPE homelab_deploy_deployments_total counter
homelab_deploy_deployments_total{action="boot",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="boot",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="completed"} 1
homelab_deploy_deployments_total{action="switch",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
t.Errorf("unexpected counter metrics: %v", err)
}
// Verify histogram was updated (120.5 seconds falls into le="300" and higher buckets)
histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 1
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 120.5
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
		t.Errorf("unexpected histogram metrics: %v", err)
	}
}


@@ -1,109 +0,0 @@
package mcp

import (
	"context"
	"fmt"
	"strings"

	"github.com/mark3labs/mcp-go/mcp"

	deploycli "code.t-juice.club/torjus/homelab-deploy/internal/cli"
	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
)

// BuildTool creates the build tool definition.
func BuildTool() mcp.Tool {
	return mcp.NewTool(
		"build",
		mcp.WithDescription("Trigger a Nix build on the build server"),
		mcp.WithString("repo",
			mcp.Required(),
			mcp.Description("Repository name (must match builder config)"),
		),
		mcp.WithString("target",
			mcp.Description("Target hostname, or omit to build all hosts"),
		),
		mcp.WithBoolean("all",
			mcp.Description("Build all hosts in the repository (default if no target specified)"),
		),
		mcp.WithString("branch",
			mcp.Description("Git branch to build (uses repo default if not specified)"),
		),
	)
}

// HandleBuild handles the build tool.
func (h *ToolHandler) HandleBuild(ctx context.Context, request mcp.CallToolRequest) (*mcp.CallToolResult, error) {
	repo, err := request.RequireString("repo")
	if err != nil {
		return mcp.NewToolResultError("repo is required"), nil
	}
	target := request.GetString("target", "")
	all := request.GetBool("all", false)
	branch := request.GetString("branch", "")

	// Default to "all" if no target specified
	if target == "" {
		if !all {
			all = true
		}
		target = "all"
	}
	if all && target != "all" {
		return mcp.NewToolResultError("cannot specify both target and all"), nil
	}

	cfg := deploycli.BuildConfig{
		NATSUrl:  h.cfg.NATSUrl,
		NKeyFile: h.cfg.NKeyFile,
		Repo:     repo,
		Target:   target,
		Branch:   branch,
		Timeout:  h.cfg.Timeout,
	}

	var output strings.Builder
	branchStr := branch
	if branchStr == "" {
		branchStr = "(default)"
	}
	output.WriteString(fmt.Sprintf("Building %s target=%s branch=%s\n\n", repo, target, branchStr))

	result, err := deploycli.Build(ctx, cfg, func(resp *messages.BuildResponse) {
		switch resp.Status {
		case messages.BuildStatusStarted:
			output.WriteString(fmt.Sprintf("Started: %s\n", resp.Message))
		case messages.BuildStatusProgress:
			successStr := "..."
			if resp.HostSuccess != nil {
				if *resp.HostSuccess {
					successStr = "success"
				} else {
					successStr = "failed"
				}
			}
			output.WriteString(fmt.Sprintf("[%d/%d] %s: %s\n", resp.HostsCompleted, resp.HostsTotal, resp.Host, successStr))
		case messages.BuildStatusCompleted, messages.BuildStatusFailed:
			output.WriteString(fmt.Sprintf("\n%s\n", resp.Message))
		case messages.BuildStatusRejected:
			output.WriteString(fmt.Sprintf("Rejected: %s\n", resp.Message))
		}
	})
	if err != nil {
		return mcp.NewToolResultError(fmt.Sprintf("build failed: %v", err)), nil
	}

	if result.FinalResponse != nil {
		output.WriteString(fmt.Sprintf("\nBuild complete: %d succeeded, %d failed (%.1fs)\n",
			result.FinalResponse.Succeeded,
			result.FinalResponse.Failed,
			result.FinalResponse.TotalDurationSeconds))
	}
	if !result.AllSucceeded() {
		output.WriteString("WARNING: Some builds failed\n")
	}
	return mcp.NewToolResultText(output.String()), nil
}


@@ -12,7 +12,6 @@ type ServerConfig struct {
	NKeyFile        string
	EnableAdmin     bool
	AdminNKeyFile   string
-	EnableBuilds    bool
	DiscoverSubject string
	Timeout         time.Duration
}
@@ -50,11 +49,6 @@ func New(cfg ServerConfig) *Server {
		s.AddTool(DeployAdminTool(), handler.HandleDeployAdmin)
	}
-	// Optionally register build tool
-	if cfg.EnableBuilds {
-		s.AddTool(BuildTool(), handler.HandleBuild)
-	}
	return &Server{
		cfg:    cfg,
		server: s,


@@ -9,8 +9,8 @@ import (
	"github.com/mark3labs/mcp-go/mcp"

-	deploycli "code.t-juice.club/torjus/homelab-deploy/internal/cli"
-	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
+	deploycli "git.t-juice.club/torjus/homelab-deploy/internal/cli"
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
)

// ToolConfig holds configuration for the MCP tools.


@@ -1,135 +0,0 @@
package messages

import (
	"encoding/json"
	"fmt"
	"strings"
)

// BuildStatus represents the status of a build response.
type BuildStatus string

const (
	BuildStatusStarted   BuildStatus = "started"
	BuildStatusProgress  BuildStatus = "progress"
	BuildStatusCompleted BuildStatus = "completed"
	BuildStatusFailed    BuildStatus = "failed"
	BuildStatusRejected  BuildStatus = "rejected"
)

// IsFinal returns true if this status indicates a terminal state.
func (s BuildStatus) IsFinal() bool {
	switch s {
	case BuildStatusCompleted, BuildStatusFailed, BuildStatusRejected:
		return true
	default:
		return false
	}
}

// BuildRequest is the message sent to request a build.
type BuildRequest struct {
	Repo    string `json:"repo"`             // Must match config
	Target  string `json:"target"`           // Hostname or "all"
	Branch  string `json:"branch,omitempty"` // Optional, uses repo default
	ReplyTo string `json:"reply_to"`
}

// Validate checks that the request is valid.
func (r *BuildRequest) Validate() error {
	if r.Repo == "" {
		return fmt.Errorf("repo is required")
	}
	if !revisionRegex.MatchString(r.Repo) {
		return fmt.Errorf("invalid repo name format: %q", r.Repo)
	}
	if r.Target == "" {
		return fmt.Errorf("target is required")
	}
	// Target must be "all" or a valid hostname (same format as revision/branch)
	if r.Target != "all" && !revisionRegex.MatchString(r.Target) {
		return fmt.Errorf("invalid target format: %q", r.Target)
	}
	if r.Branch != "" && !revisionRegex.MatchString(r.Branch) {
		return fmt.Errorf("invalid branch format: %q", r.Branch)
	}
	if r.ReplyTo == "" {
		return fmt.Errorf("reply_to is required")
	}
	// Validate reply_to format to prevent publishing to arbitrary subjects
	if !strings.HasPrefix(r.ReplyTo, "build.responses.") {
		return fmt.Errorf("invalid reply_to format: must start with 'build.responses.'")
	}
	return nil
}

// Marshal serializes the request to JSON.
func (r *BuildRequest) Marshal() ([]byte, error) {
	return json.Marshal(r)
}

// UnmarshalBuildRequest deserializes a request from JSON.
func UnmarshalBuildRequest(data []byte) (*BuildRequest, error) {
	var r BuildRequest
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, fmt.Errorf("failed to unmarshal build request: %w", err)
	}
	return &r, nil
}

// BuildHostResult contains the result of building a single host.
type BuildHostResult struct {
	Host            string  `json:"host"`
	Success         bool    `json:"success"`
	Error           string  `json:"error,omitempty"`
	Output          string  `json:"output,omitempty"`
	DurationSeconds float64 `json:"duration_seconds"`
}

// BuildResponse is the message sent in response to a build request.
type BuildResponse struct {
	Status  BuildStatus `json:"status"`
	Message string      `json:"message,omitempty"`

	// Progress updates
	Host           string `json:"host,omitempty"`
	HostSuccess    *bool  `json:"host_success,omitempty"`
	HostsCompleted int    `json:"hosts_completed,omitempty"`
	HostsTotal     int    `json:"hosts_total,omitempty"`

	// Final response
	Results              []BuildHostResult `json:"results,omitempty"`
	TotalDurationSeconds float64           `json:"total_duration_seconds,omitempty"`
	Succeeded            int               `json:"succeeded,omitempty"`
	Failed               int               `json:"failed,omitempty"`
	Error                string            `json:"error,omitempty"`
}

// NewBuildResponse creates a new response with the given status and message.
func NewBuildResponse(status BuildStatus, message string) *BuildResponse {
	return &BuildResponse{
		Status:  status,
		Message: message,
	}
}

// WithError adds an error message to the response.
func (r *BuildResponse) WithError(err string) *BuildResponse {
	r.Error = err
	return r
}

// Marshal serializes the response to JSON.
func (r *BuildResponse) Marshal() ([]byte, error) {
	return json.Marshal(r)
}

// UnmarshalBuildResponse deserializes a response from JSON.
func UnmarshalBuildResponse(data []byte) (*BuildResponse, error) {
	var r BuildResponse
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, fmt.Errorf("failed to unmarshal build response: %w", err)
	}
	return &r, nil
}
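`BuildRequest.Validate` above rejects malformed fields and pins `reply_to` to the `build.responses.` prefix so a request cannot direct responses to an arbitrary NATS subject. A standalone sketch of the same checks follows; the real `revisionRegex` is defined elsewhere in the package and its pattern is not shown in this diff, so the pattern below is an assumption for illustration only:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// revisionRegex is a hypothetical stand-in for the package's real
// pattern, which is not visible in this diff.
var revisionRegex = regexp.MustCompile(`^[A-Za-z0-9._/-]+$`)

// BuildRequest mirrors the fields validated above.
type BuildRequest struct {
	Repo, Target, Branch, ReplyTo string
}

// Validate applies the same checks as the deleted messages package code.
func (r *BuildRequest) Validate() error {
	if r.Repo == "" || !revisionRegex.MatchString(r.Repo) {
		return fmt.Errorf("invalid repo: %q", r.Repo)
	}
	if r.Target != "all" && !revisionRegex.MatchString(r.Target) {
		return fmt.Errorf("invalid target: %q", r.Target)
	}
	// Pin reply_to to prevent publishing to arbitrary subjects.
	if !strings.HasPrefix(r.ReplyTo, "build.responses.") {
		return fmt.Errorf("invalid reply_to: must start with 'build.responses.'")
	}
	return nil
}

func main() {
	ok := BuildRequest{Repo: "homelab", Target: "all", ReplyTo: "build.responses.abc"}
	fmt.Println(ok.Validate()) // <nil>
	bad := BuildRequest{Repo: "homelab", Target: "all", ReplyTo: "evil.subject"}
	fmt.Println(bad.Validate())
}
```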


@@ -1,99 +0,0 @@
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

// BuildCollector holds all Prometheus metrics for the builder.
type BuildCollector struct {
	buildsTotal          *prometheus.CounterVec
	buildHostTotal       *prometheus.CounterVec
	buildDuration        *prometheus.HistogramVec
	buildLastTimestamp   *prometheus.GaugeVec
	buildLastSuccessTime *prometheus.GaugeVec
	buildLastFailureTime *prometheus.GaugeVec
}

// NewBuildCollector creates a new build metrics collector and registers it with the given registerer.
func NewBuildCollector(reg prometheus.Registerer) *BuildCollector {
	c := &BuildCollector{
		buildsTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "homelab_deploy_builds_total",
				Help: "Total builds processed",
			},
			[]string{"repo", "status"},
		),
		buildHostTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "homelab_deploy_build_host_total",
				Help: "Total host builds processed",
			},
			[]string{"repo", "host", "status"},
		),
		buildDuration: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name:    "homelab_deploy_build_duration_seconds",
				Help:    "Build execution time per host",
				Buckets: []float64{5, 10, 30, 60, 120, 300, 600, 1800, 3600, 7200, 14400},
			},
			[]string{"repo", "host"},
		),
		buildLastTimestamp: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "homelab_deploy_build_last_timestamp",
				Help: "Timestamp of last build attempt",
			},
			[]string{"repo"},
		),
		buildLastSuccessTime: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "homelab_deploy_build_last_success_timestamp",
				Help: "Timestamp of last successful build",
			},
			[]string{"repo"},
		),
		buildLastFailureTime: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "homelab_deploy_build_last_failure_timestamp",
				Help: "Timestamp of last failed build",
			},
			[]string{"repo"},
		),
	}

	reg.MustRegister(c.buildsTotal)
	reg.MustRegister(c.buildHostTotal)
	reg.MustRegister(c.buildDuration)
	reg.MustRegister(c.buildLastTimestamp)
	reg.MustRegister(c.buildLastSuccessTime)
	reg.MustRegister(c.buildLastFailureTime)

	return c
}

// RecordBuildSuccess records a successful build.
func (c *BuildCollector) RecordBuildSuccess(repo string) {
	c.buildsTotal.WithLabelValues(repo, "success").Inc()
	c.buildLastTimestamp.WithLabelValues(repo).SetToCurrentTime()
	c.buildLastSuccessTime.WithLabelValues(repo).SetToCurrentTime()
}

// RecordBuildFailure records a failed build.
func (c *BuildCollector) RecordBuildFailure(repo, errorCode string) {
	c.buildsTotal.WithLabelValues(repo, "failure").Inc()
	c.buildLastTimestamp.WithLabelValues(repo).SetToCurrentTime()
	c.buildLastFailureTime.WithLabelValues(repo).SetToCurrentTime()
}

// RecordHostBuildSuccess records a successful host build.
func (c *BuildCollector) RecordHostBuildSuccess(repo, host string, durationSeconds float64) {
	c.buildHostTotal.WithLabelValues(repo, host, "success").Inc()
	c.buildDuration.WithLabelValues(repo, host).Observe(durationSeconds)
}

// RecordHostBuildFailure records a failed host build.
func (c *BuildCollector) RecordHostBuildFailure(repo, host string, durationSeconds float64) {
	c.buildHostTotal.WithLabelValues(repo, host, "failure").Inc()
	c.buildDuration.WithLabelValues(repo, host).Observe(durationSeconds)
}


@@ -2,7 +2,7 @@
package metrics

import (
-	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
	"github.com/prometheus/client_golang/prometheus"
)


@@ -8,7 +8,7 @@ import (
	"testing"
	"time"

-	"code.t-juice.club/torjus/homelab-deploy/internal/messages"
+	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)
@@ -78,6 +78,103 @@ homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
	if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
		t.Errorf("unexpected counter metrics: %v", err)
	}
	// Check histogram recorded the duration (120.5 seconds falls into le="300" and higher buckets)
	histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 1
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 120.5
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
		t.Errorf("unexpected histogram metrics: %v", err)
	}
}

func TestCollector_RecordDeploymentEnd_Failure(t *testing.T) {
@@ -102,6 +199,103 @@ homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
	if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
		t.Errorf("unexpected counter metrics: %v", err)
	}
	// Check histogram recorded the duration (60.0 seconds falls into le="60" and higher buckets)
	histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 1
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 60
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
		t.Errorf("unexpected histogram metrics: %v", err)
	}
}

func TestCollector_RecordDeploymentFailure(t *testing.T) {
@@ -127,6 +321,103 @@ homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
	if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
		t.Errorf("unexpected counter metrics: %v", err)
	}
	// Check histogram recorded the duration (300.0 seconds falls into le="300" and higher buckets)
	histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 1
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 300
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
		t.Errorf("unexpected histogram metrics: %v", err)
	}
}
func TestCollector_RecordRejection(t *testing.T) {

View File

@@ -36,7 +36,7 @@ func NewServer(cfg ServerConfig) *Server {
	registry := prometheus.NewRegistry()
	collector := NewCollector(registry)
-	scrapeCh := make(chan struct{})
+	scrapeCh := make(chan struct{}, 1)
	metricsHandler := promhttp.HandlerFor(registry, promhttp.HandlerOpts{
		Registry: registry,
@@ -74,11 +74,6 @@ func (s *Server) Collector() *Collector {
	return s.collector
}

-// Registry returns the Prometheus registry.
-func (s *Server) Registry() *prometheus.Registry {
-	return s.registry
-}

// ScrapeCh returns a channel that receives a signal each time the metrics endpoint is scraped.
func (s *Server) ScrapeCh() <-chan struct{} {
	return s.scrapeCh

View File

@@ -2,61 +2,33 @@
{ config, lib, pkgs, ... }:
let
-  listenerCfg = config.services.homelab-deploy.listener;
-  builderCfg = config.services.homelab-deploy.builder;
-  # Generate YAML config from settings
-  generatedConfigFile = pkgs.writeText "builder.yaml" (lib.generators.toYAML {} {
-    repos = lib.mapAttrs (name: repo: {
-      url = repo.url;
-      default_branch = repo.defaultBranch;
-    }) builderCfg.settings.repos;
-  });
-  # Use provided configFile or generate from settings
-  builderConfigFile =
-    if builderCfg.configFile != null
-    then builderCfg.configFile
-    else generatedConfigFile;
-  # Build command line arguments for listener from configuration
-  listenerArgs = lib.concatStringsSep " " ([
-    "--hostname ${lib.escapeShellArg listenerCfg.hostname}"
-    "--tier ${listenerCfg.tier}"
-    "--nats-url ${lib.escapeShellArg listenerCfg.natsUrl}"
-    "--nkey-file ${lib.escapeShellArg listenerCfg.nkeyFile}"
-    "--flake-url ${lib.escapeShellArg listenerCfg.flakeUrl}"
-    "--timeout ${toString listenerCfg.timeout}"
-    "--discover-subject ${lib.escapeShellArg listenerCfg.discoverSubject}"
-  ]
-  ++ lib.optional (listenerCfg.role != null) "--role ${lib.escapeShellArg listenerCfg.role}"
-  ++ map (s: "--deploy-subject ${lib.escapeShellArg s}") listenerCfg.deploySubjects
-  ++ lib.optionals listenerCfg.metrics.enable [
-    "--metrics-enabled"
-    "--metrics-addr ${lib.escapeShellArg listenerCfg.metrics.address}"
-  ]);
-  # Build command line arguments for builder from configuration
-  builderArgs = lib.concatStringsSep " " ([
-    "--nats-url ${lib.escapeShellArg builderCfg.natsUrl}"
-    "--nkey-file ${lib.escapeShellArg builderCfg.nkeyFile}"
-    "--config ${builderConfigFile}"
-    "--timeout ${toString builderCfg.timeout}"
-  ]
-  ++ lib.optionals builderCfg.metrics.enable [
-    "--metrics-enabled"
-    "--metrics-addr ${lib.escapeShellArg builderCfg.metrics.address}"
-  ]);
-  # Extract port from metrics address for firewall rule
-  extractPort = addr: let
-    # Handle both ":9972" and "0.0.0.0:9972" formats
-    parts = lib.splitString ":" addr;
-  in lib.toInt (lib.last parts);
-  listenerMetricsPort = extractPort listenerCfg.metrics.address;
-  builderMetricsPort = extractPort builderCfg.metrics.address;
+  cfg = config.services.homelab-deploy.listener;
+  # Build command line arguments from configuration
+  args = lib.concatStringsSep " " ([
+    "--hostname ${lib.escapeShellArg cfg.hostname}"
+    "--tier ${cfg.tier}"
+    "--nats-url ${lib.escapeShellArg cfg.natsUrl}"
+    "--nkey-file ${lib.escapeShellArg cfg.nkeyFile}"
+    "--flake-url ${lib.escapeShellArg cfg.flakeUrl}"
+    "--timeout ${toString cfg.timeout}"
+    "--discover-subject ${lib.escapeShellArg cfg.discoverSubject}"
+  ]
+  ++ lib.optional (cfg.role != null) "--role ${lib.escapeShellArg cfg.role}"
+  ++ map (s: "--deploy-subject ${lib.escapeShellArg s}") cfg.deploySubjects
+  ++ lib.optionals cfg.metrics.enable [
+    "--metrics-enabled"
+    "--metrics-addr ${lib.escapeShellArg cfg.metrics.address}"
+  ]
+  ++ cfg.extraArgs);
+  # Extract port from metrics address for firewall rule
+  metricsPort = let
+    addr = cfg.metrics.address;
+    # Handle both ":9972" and "0.0.0.0:9972" formats
+    parts = lib.splitString ":" addr;
+  in lib.toInt (lib.last parts);
in
{
  options.services.homelab-deploy.listener = {
@@ -151,118 +123,16 @@ in
        description = "Open firewall for metrics port";
      };
    };
-  };
-  options.services.homelab-deploy.builder = {
-    enable = lib.mkEnableOption "homelab-deploy builder service";
-    package = lib.mkOption {
-      type = lib.types.package;
-      default = self.packages.${pkgs.system}.homelab-deploy;
-      description = "The homelab-deploy package to use";
-    };
-    natsUrl = lib.mkOption {
-      type = lib.types.str;
-      description = "NATS server URL";
-      example = "nats://nats.example.com:4222";
-    };
-    nkeyFile = lib.mkOption {
-      type = lib.types.path;
-      description = "Path to NKey seed file for NATS authentication";
-      example = "/run/secrets/homelab-deploy-builder-nkey";
-    };
-    configFile = lib.mkOption {
-      type = lib.types.nullOr lib.types.path;
-      default = null;
-      description = ''
-        Path to builder configuration file (YAML).
-        If not specified, a config file will be generated from the `settings` option.
-      '';
-      example = "/etc/homelab-deploy/builder.yaml";
-    };
-    settings = {
-      repos = lib.mkOption {
-        type = lib.types.attrsOf (lib.types.submodule {
-          options = {
-            url = lib.mkOption {
-              type = lib.types.str;
-              description = "Git flake URL for the repository";
-              example = "git+https://git.example.com/org/nixos-configs.git";
-            };
-            defaultBranch = lib.mkOption {
-              type = lib.types.str;
-              default = "master";
-              description = "Default branch to build when not specified in request";
-              example = "main";
-            };
-          };
-        });
-        default = {};
-        description = ''
-          Repository configuration for the builder.
-          Each key is the repository name used in build requests.
-        '';
-        example = lib.literalExpression ''
-          {
-            nixos-servers = {
-              url = "git+https://git.example.com/org/nixos-servers.git";
-              defaultBranch = "master";
-            };
-            homelab = {
-              url = "git+ssh://git@github.com/user/homelab.git";
-              defaultBranch = "main";
-            };
-          }
-        '';
-      };
-    };
-    timeout = lib.mkOption {
-      type = lib.types.int;
-      default = 1800;
-      description = "Build timeout in seconds per host";
-    };
-    environment = lib.mkOption {
-      type = lib.types.attrsOf lib.types.str;
-      default = { };
-      description = "Additional environment variables for the service";
-      example = { GIT_SSH_COMMAND = "ssh -i /run/secrets/deploy-key"; };
-    };
-    metrics = {
-      enable = lib.mkEnableOption "Prometheus metrics endpoint";
-      address = lib.mkOption {
-        type = lib.types.str;
-        default = ":9973";
-        description = "Address for Prometheus metrics HTTP server";
-        example = "127.0.0.1:9973";
-      };
-      openFirewall = lib.mkOption {
-        type = lib.types.bool;
-        default = false;
-        description = "Open firewall for metrics port";
-      };
-    };
-  };
-  config = lib.mkMerge [
-    (lib.mkIf builderCfg.enable {
-      assertions = [
-        {
-          assertion = builderCfg.configFile != null || builderCfg.settings.repos != {};
-          message = "services.homelab-deploy.builder: either configFile or settings.repos must be specified";
-        }
-      ];
-    })
-    (lib.mkIf listenerCfg.enable {
+    extraArgs = lib.mkOption {
+      type = lib.types.listOf lib.types.str;
+      default = [ ];
+      description = "Extra command line arguments to pass to the listener";
+      example = [ "--debug" ];
+    };
+  };
+  config = lib.mkIf cfg.enable {
    systemd.services.homelab-deploy-listener = {
      description = "homelab-deploy listener";
      wantedBy = [ "multi-user.target" ];
@@ -274,7 +144,7 @@ in
      stopIfChanged = false;
      restartIfChanged = false;
-      environment = listenerCfg.environment // {
+      environment = cfg.environment // {
        # Nix needs a writable cache for git flake fetching
        XDG_CACHE_HOME = "/var/cache/homelab-deploy";
      };
@@ -284,7 +154,7 @@ in
      serviceConfig = {
        CacheDirectory = "homelab-deploy";
        Type = "simple";
-        ExecStart = "${listenerCfg.package}/bin/homelab-deploy listener ${listenerArgs}";
+        ExecStart = "${cfg.package}/bin/homelab-deploy listener ${args}";
        Restart = "always";
        RestartSec = 10;
@@ -297,42 +167,8 @@ in
      };
    };
-    networking.firewall.allowedTCPPorts = lib.mkIf (listenerCfg.metrics.enable && listenerCfg.metrics.openFirewall) [
-      listenerMetricsPort
-    ];
-  })
-  (lib.mkIf builderCfg.enable {
-    systemd.services.homelab-deploy-builder = {
-      description = "homelab-deploy builder";
-      wantedBy = [ "multi-user.target" ];
-      after = [ "network-online.target" ];
-      wants = [ "network-online.target" ];
-      environment = builderCfg.environment // {
-        # Nix needs a writable cache for git flake fetching
-        XDG_CACHE_HOME = "/var/cache/homelab-deploy-builder";
-      };
-      path = [ pkgs.git pkgs.nix ];
-      serviceConfig = {
-        CacheDirectory = "homelab-deploy-builder";
-        Type = "simple";
-        ExecStart = "${builderCfg.package}/bin/homelab-deploy builder ${builderArgs}";
-        Restart = "always";
-        RestartSec = 10;
-        # Minimal hardening - nix build requires broad system access:
-        # - Write access to /nix/store for building
-        # - Kernel namespace support for nix sandbox builds
-        # - Network access for fetching from git/cache
-      };
-    };
-    networking.firewall.allowedTCPPorts = lib.mkIf (builderCfg.metrics.enable && builderCfg.metrics.openFirewall) [
-      builderMetricsPort
-    ];
-  })
-  ];
+    networking.firewall.allowedTCPPorts = lib.mkIf (cfg.metrics.enable && cfg.metrics.openFirewall) [
+      metricsPort
+    ];
+  };
}
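With the new `extraArgs` option, enabling the `--debug` flag from a host configuration might look like the sketch below. The hostname, tier, URLs, and file paths are illustrative placeholders, not values from the repository:

```nix
services.homelab-deploy.listener = {
  enable = true;
  hostname = "host01";
  tier = "dev";
  natsUrl = "nats://nats.example.com:4222";
  nkeyFile = "/run/secrets/listener-nkey";
  flakeUrl = "git+https://git.example.com/org/nixos-configs.git";
  # Forwarded verbatim onto the listener command line via `args`:
  extraArgs = [ "--debug" ];
};
```

Because `extraArgs` is appended last in the `args` list, any flag the listener binary accepts can be passed through without a dedicated module option.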