Compare commits

23 Commits

Author SHA1 Message Date
c272ce6903 docs: document --debug flag and extraArgs module option
Add documentation for:
- --debug flag in Listener Flags table
- --heartbeat-interval flag (was missing)
- extraArgs NixOS module option
- New Troubleshooting section with debug logging examples
  and guidance for diagnosing metrics issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 01:28:21 +01:00
c934d1ba38 feat: add --debug flag for metrics troubleshooting
Add a --debug flag to the listener command that enables debug-level
logging. When enabled, the listener logs detailed information about
metrics recording including:

- When deployment start/end metrics are recorded
- The action, success status, and duration being recorded
- Whether metrics are enabled or disabled (skipped)

This helps troubleshoot issues where deployment metrics appear to
remain at zero after deployments.

Also add extraArgs option to the NixOS module to allow passing
additional arguments like --debug to the service.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 01:26:12 +01:00
723a1f769f test: add histogram verification to deployment metrics tests
The existing tests for RecordDeploymentEnd and RecordDeploymentFailure
only verified the counter was incremented, not that the histogram was
updated with duration observations. Add histogram verification to:

- TestCollector_RecordDeploymentEnd_Success
- TestCollector_RecordDeploymentEnd_Failure
- TestCollector_RecordDeploymentFailure

Also add listener tests to verify metrics are properly initialized when
MetricsEnabled is true and that the recording functions work correctly
in the context of deployment handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 01:13:20 +01:00
46fc6a7e96 fix: wait for metrics scrape before restarting after switch deployment
After a successful switch deployment, the listener now waits for Prometheus
to scrape the /metrics endpoint before exiting for restart. This ensures
deployment metrics are captured before the process restarts and resets
in-memory counters. Falls back to a 60 second timeout if no scrape occurs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 23:38:26 +01:00
746e30b24f fix: initialize counter and histogram metrics at startup
Counter and histogram metrics were absent from Prometheus scrapes until
the first deployment occurred, making it impossible to distinguish
"no deployments" from "exporter not running" in dashboards and alerts.

Initialize all expected label combinations with zero values when the
collector is created so metrics appear in every scrape from startup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:29:36 +01:00
fd0d63b103 fix: flush NATS buffer after sending completed response
The CLI was reporting deployment failures even when the listener showed
success. This was a race condition: after a successful switch deployment,
the listener would send the "completed" response then immediately signal
restart. The NATS connection closed before the buffered message was
actually sent to the broker, so the CLI never received it.

Adding Flush() after sending the completed response ensures the message
reaches NATS before the listener can exit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:30:34 +01:00
36a74b8cf9 feat: add heartbeat status updates during deployment
Send periodic "running" status messages while nixos-rebuild executes,
preventing the idle timeout from triggering before deployments complete.
This fixes false "Some deployments failed" warnings in MCP when builds
take longer than 30 seconds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:23:33 +01:00
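As a rough sketch of the heartbeat pattern this commit describes: a ticker sends "running" at a fixed interval while the long-running work executes, and stops once the work returns. The names `runWithHeartbeat` and `simulatedBuild` are hypothetical, the `send` callback stands in for publishing a NATS status message, and treating a non-positive interval as "disabled" is an assumption based on the `--heartbeat-interval` help text.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runWithHeartbeat (hypothetical) runs work and calls send("running")
// every interval until work returns. An interval <= 0 disables heartbeats.
func runWithHeartbeat(interval time.Duration, send func(status string), work func() error) error {
	if interval <= 0 {
		return work()
	}

	done := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				send("running")
			case <-done:
				return
			}
		}
	}()

	err := work()
	close(done)
	wg.Wait() // no heartbeat is sent after this returns
	return err
}

// simulatedBuild stands in for a nixos-rebuild run of the given length.
func simulatedBuild(d time.Duration) func() error {
	return func() error {
		time.Sleep(d)
		return nil
	}
}

func main() {
	count := 0
	err := runWithHeartbeat(10*time.Millisecond, func(status string) {
		count++
		fmt.Println("status:", status)
	}, simulatedBuild(55*time.Millisecond))
	fmt.Println("heartbeats sent:", count, "error:", err)
}
```

Waiting for the heartbeat goroutine before returning ensures that a "running" message can never be sent after the final status.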
79db119d1c feat: add Prometheus metrics to listener service
Add an optional Prometheus metrics HTTP endpoint to the listener for
monitoring deployment operations. Includes four metrics:

- homelab_deploy_deployments_total (counter with status/action/error_code)
- homelab_deploy_deployment_duration_seconds (histogram with action/success)
- homelab_deploy_deployment_in_progress (gauge)
- homelab_deploy_info (gauge with hostname/tier/role/version)

New CLI flags: --metrics-enabled, --metrics-addr (default :9972)
New NixOS options: metrics.enable, metrics.address, metrics.openFirewall

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:58:22 +01:00
56365835c7 feat: add list-hosts command to CLI
Adds a list-hosts command that mirrors the MCP list_hosts functionality,
allowing discovery of available deployment targets from the command line.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:30:57 +01:00
95b795dcfd fix: remove systemd hardening to allow nix sandbox namespace creation
The previous hardening options (ProtectControlGroups, LockPersonality,
SystemCallArchitectures, etc.) prevented Nix from creating the kernel
namespaces required for build sandboxing. This follows the approach of
the NixOS auto-upgrade module, which has no hardening because
nixos-rebuild requires broad system access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:52:16 +01:00
71d6aa8b61 fix: disable PrivateDevices to allow nix sandbox namespace creation
The PrivateDevices=true systemd hardening option was preventing Nix
from creating the kernel namespaces required for its build sandbox.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:26:53 +01:00
2c97b6140c fix: check only final responses in AllSucceeded to determine deployment success
The CLI was incorrectly reporting "some deployments failed" even when
deployments succeeded. This was because AllSucceeded() checked if every
response had StatusCompleted, but the Responses slice contains all
messages including intermediate ones like "started". Since started !=
completed, it returned false.

Now AllSucceeded() only examines final responses (using IsFinal()) and
checks that each host's final status is completed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:19:08 +01:00
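The fix described above can be illustrated with a small sketch. The `Status` constants, the `Response` struct, and the set of statuses treated as final are assumptions modelled on the names in the commit message and README, not the project's actual definitions. Requiring at least one final response is also an assumption of this sketch.

```go
package main

import "fmt"

// Status values are assumptions based on the statuses named in the
// commit messages and README.
type Status string

const (
	StatusStarted   Status = "started"
	StatusRunning   Status = "running"
	StatusCompleted Status = "completed"
	StatusFailed    Status = "failed"
	StatusRejected  Status = "rejected"
)

// Response is a simplified, hypothetical deployment response.
type Response struct {
	Hostname string
	Status   Status
}

// IsFinal reports whether the response ends a deployment.
func (r Response) IsFinal() bool {
	switch r.Status {
	case StatusCompleted, StatusFailed, StatusRejected:
		return true
	}
	return false
}

// AllSucceeded skips intermediate messages and returns true only if
// there is at least one final response and every final response is
// "completed".
func AllSucceeded(responses []Response) bool {
	sawFinal := false
	for _, r := range responses {
		if !r.IsFinal() {
			continue // "started" and "running" do not count as failures
		}
		sawFinal = true
		if r.Status != StatusCompleted {
			return false
		}
	}
	return sawFinal
}

func main() {
	responses := []Response{
		{Hostname: "host1", Status: StatusStarted},
		{Hostname: "host1", Status: StatusCompleted},
	}
	fmt.Println(AllSucceeded(responses))
}
```

Comparing every entry against `StatusCompleted`, as the old code did, would return false for this slice because of the leading "started" message.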
efacb13b86 feat: exit listener after successful switch for automatic restart
After a successful switch deployment, the listener now exits gracefully
so systemd can restart it with the new binary. This works together with
stopIfChanged/restartIfChanged to ensure deployments complete before
the service restarts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:11:03 +01:00
ac3c9c7de6 fix: prevent listener service from restarting during deployment
Add stopIfChanged and restartIfChanged options to prevent the listener
from being interrupted when nixos-rebuild switch activates a new
configuration that changes the service definition.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:06:47 +01:00
9f205fee5e fix: add writable cache directory for nix git flake fetching
The listener service had ProtectHome=read-only which prevented Nix
from writing to /root/.cache when fetching git flakes. This adds a
CacheDirectory managed by systemd and sets XDG_CACHE_HOME to use it.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:57:59 +01:00
5f3cfc3d21 fix: add nixos-rebuild to PATH and fix CLI hanging after deploy failure
- Add nixos-rebuild to listener service PATH in NixOS module
- Fix CLI deploy command hanging after receiving final status by properly
  tracking lastResponse time and exiting when all hosts have responded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:53:22 +01:00
c9b85435ba fix: add git to listener service PATH for revision validation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:43:23 +01:00
cf3b1ce2c9 refactor: use flake package directly in NixOS module
Instead of requiring users to provide the package via overlay,
the module now receives `self` from the flake and uses the
package directly from `self.packages`.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:08:02 +01:00
9237814fed docs: add NATS subject structure and example server configuration
Document the complete subject hierarchy including deploy subjects,
response subjects, and discovery subject. Add example NATS server
configuration demonstrating tiered authentication with listener,
test deployer, and admin deployer permission patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:48:11 +01:00
f03eb5f7dc feat: add environment variable support for deploy command flags
Allows setting --nats-url, --nkey-file, --branch, --action, and --timeout
via HOMELAB_DEPLOY_* environment variables.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:43:50 +01:00
f51058964d fix: verify NKey file has secure permissions before reading
Reject NKey files that are readable by group or others (permissions
more permissive than 0600). This prevents accidental exposure of
private keys through overly permissive file permissions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:40:53 +01:00
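A permission check of this kind can be sketched with the standard library alone. The helper names `permsTooOpen` and `checkKeyFile` are hypothetical, and the mask `0o077` (any group or other bit set) is one way to express "more permissive than 0600"; the project's actual check may differ.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
)

// permsTooOpen reports whether any group or other permission bit is set.
func permsTooOpen(mode fs.FileMode) bool {
	return mode.Perm()&0o077 != 0
}

// checkKeyFile (hypothetical) rejects key files that group or others
// can access, before the file is ever read.
func checkKeyFile(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat key file: %w", err)
	}
	if permsTooOpen(info.Mode()) {
		return fmt.Errorf("key file %s has insecure permissions %04o (want 0600 or stricter)",
			path, uint32(info.Mode().Perm()))
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkkey <path-to-nkey-seed>")
		return
	}
	if err := checkKeyFile(os.Args[1]); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("ok")
}
```

With this mask, modes such as `0600` and `0400` pass, while `0640` and `0644` are rejected.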
95fbfb2339 chore: extract version from main.go in flake.nix
Use builtins.match to parse version from cmd/homelab-deploy/main.go
so only one location needs updating when bumping versions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:25:38 +01:00
e1ab4599a8 docs: update CLAUDE.md to match implementation
- Add github.com/google/uuid to dependencies list
- Fix version bumping: both main.go and flake.nix need updates
- Add section on updating vendorHash when dependencies change
- Use nix run .#default instead of nix build for verification

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 04:24:21 +01:00
18 changed files with 1881 additions and 101 deletions


@@ -56,13 +56,14 @@ Key Go libraries:
- `github.com/nats-io/nats.go` - NATS client - `github.com/nats-io/nats.go` - NATS client
- `github.com/nats-io/nkeys` - NKey authentication - `github.com/nats-io/nkeys` - NKey authentication
- `github.com/mark3labs/mcp-go` - MCP server implementation - `github.com/mark3labs/mcp-go` - MCP server implementation
- `github.com/google/uuid` - UUID generation for reply subjects
## Build Commands ## Build Commands
Run commands through the Nix development shell using `nix develop -c`: Run commands through the Nix development shell using `nix develop -c`:
```bash ```bash
# Build # Build (for quick syntax checking)
nix develop -c go build ./... nix develop -c go build ./...
# Run tests # Run tests
@@ -77,10 +78,7 @@ nix develop -c golangci-lint run
# Vulnerability check # Vulnerability check
nix develop -c govulncheck ./... nix develop -c govulncheck ./...
# Test Nix build # Run the binary (preferred method - builds and runs via Nix)
nix build
# Run the binary (prefer this over go build + running binary)
# To pass arguments, use -- before them: nix run .#default -- --help # To pass arguments, use -- before them: nix run .#default -- --help
nix run .#default nix run .#default
``` ```
@@ -92,7 +90,7 @@ Before committing, run the following checks:
1. `nix develop -c go test ./...` - Unit tests 1. `nix develop -c go test ./...` - Unit tests
2. `nix develop -c golangci-lint run` - Linting 2. `nix develop -c golangci-lint run` - Linting
3. `nix develop -c govulncheck ./...` - Vulnerability scanning 3. `nix develop -c govulncheck ./...` - Vulnerability scanning
4. `nix build` - Verify nix build works 4. `nix run .#default -- --version` - Verify nix build works
## Commit Message Format ## Commit Message Format
@@ -115,6 +113,16 @@ Follow semantic versioning:
- **Minor** (0.x.0): Non-breaking changes adding features - **Minor** (0.x.0): Non-breaking changes adding features
- **Major** (x.0.0): Breaking changes - **Major** (x.0.0): Breaking changes
Update the `const version` in `main.go`. The Nix build extracts the version from there automatically. Update `const version` in `cmd/homelab-deploy/main.go`. The Nix build extracts the version from there automatically.
**When to bump**: If any Go code has changed, bump the version before committing. Do this automatically when asked to commit. On feature branches, only bump once per branch (check if version has already been bumped compared to master). **When to bump**: If any Go code has changed, bump the version before committing. Do this automatically when asked to commit. On feature branches, only bump once per branch (check if version has already been bumped compared to master).
## Updating Dependencies
When adding or updating Go dependencies:
1. Run `go get <package>` or `go mod tidy`
2. Update `vendorHash` in `flake.nix`:
- Set to a fake hash: `vendorHash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";`
- Run `nix run .#default -- --version` - the error will show the correct hash
- Replace with the correct hash from the error message

282
README.md

@@ -61,6 +61,10 @@ homelab-deploy listener \
| `--timeout` | No | Deployment timeout in seconds (default: 600) | | `--timeout` | No | Deployment timeout in seconds (default: 600) |
| `--deploy-subject` | No | NATS subjects to subscribe to (repeatable) | | `--deploy-subject` | No | NATS subjects to subscribe to (repeatable) |
| `--discover-subject` | No | Discovery subject (default: `deploy.discover`) | | `--discover-subject` | No | Discovery subject (default: `deploy.discover`) |
| `--metrics-enabled` | No | Enable Prometheus metrics endpoint |
| `--metrics-addr` | No | Metrics HTTP server address (default: `:9972`) |
| `--heartbeat-interval` | No | Status update interval in seconds during deployment (default: 15) |
| `--debug` | No | Enable debug logging for troubleshooting |
#### Subject Templates #### Subject Templates
@@ -102,13 +106,13 @@ homelab-deploy deploy deploy.prod.role.dns \
#### Deploy Flags #### Deploy Flags
| Flag | Required | Description | | Flag | Required | Env Var | Description |
|------|----------|-------------| |------|----------|---------|-------------|
| `--nats-url` | Yes | NATS server URL | | `--nats-url` | Yes | `HOMELAB_DEPLOY_NATS_URL` | NATS server URL |
| `--nkey-file` | Yes | Path to NKey seed file | | `--nkey-file` | Yes | `HOMELAB_DEPLOY_NKEY_FILE` | Path to NKey seed file |
| `--branch` | No | Git branch or commit (default: `master`) | | `--branch` | No | `HOMELAB_DEPLOY_BRANCH` | Git branch or commit (default: `master`) |
| `--action` | No | nixos-rebuild action (default: `switch`) | | `--action` | No | `HOMELAB_DEPLOY_ACTION` | nixos-rebuild action (default: `switch`) |
| `--timeout` | No | Response timeout in seconds (default: 900) | | `--timeout` | No | `HOMELAB_DEPLOY_TIMEOUT` | Response timeout in seconds (default: 900) |
#### Subject Aliases #### Subject Aliases
@@ -198,7 +202,7 @@ Add the module to your NixOS configuration:
| Option | Type | Default | Description | | Option | Type | Default | Description |
|--------|------|---------|-------------| |--------|------|---------|-------------|
| `enable` | bool | `false` | Enable the listener service | | `enable` | bool | `false` | Enable the listener service |
| `package` | package | `pkgs.homelab-deploy` | Package to use | | `package` | package | from flake | Package to use |
| `hostname` | string | `config.networking.hostName` | Hostname for subject templates | | `hostname` | string | `config.networking.hostName` | Hostname for subject templates |
| `tier` | enum | required | `"test"` or `"prod"` | | `tier` | enum | required | `"test"` or `"prod"` |
| `role` | string | `null` | Role for role-based targeting | | `role` | string | `null` | Role for role-based targeting |
@@ -209,6 +213,10 @@ Add the module to your NixOS configuration:
| `deploySubjects` | list of string | see below | Subjects to subscribe to | | `deploySubjects` | list of string | see below | Subjects to subscribe to |
| `discoverSubject` | string | `"deploy.discover"` | Discovery subject | | `discoverSubject` | string | `"deploy.discover"` | Discovery subject |
| `environment` | attrs | `{}` | Additional environment variables | | `environment` | attrs | `{}` | Additional environment variables |
| `metrics.enable` | bool | `false` | Enable Prometheus metrics endpoint |
| `metrics.address` | string | `":9972"` | Metrics HTTP server address |
| `metrics.openFirewall` | bool | `false` | Open firewall for metrics port |
| `extraArgs` | list of string | `[]` | Extra command line arguments (e.g., `["--debug"]`) |
Default `deploySubjects`: Default `deploySubjects`:
```nix ```nix
@@ -219,6 +227,131 @@ Default `deploySubjects`:
] ]
``` ```
## Prometheus Metrics
The listener can expose Prometheus metrics for monitoring deployment operations.
### Enabling Metrics
**CLI:**
```bash
homelab-deploy listener \
--hostname myhost \
--tier prod \
--nats-url nats://nats.example.com:4222 \
--nkey-file /run/secrets/listener.nkey \
--flake-url git+https://git.example.com/user/nixos-configs.git \
--metrics-enabled \
--metrics-addr :9972
```
**NixOS module:**
```nix
services.homelab-deploy.listener = {
enable = true;
tier = "prod";
natsUrl = "nats://nats.example.com:4222";
nkeyFile = "/run/secrets/homelab-deploy-nkey";
flakeUrl = "git+https://git.example.com/user/nixos-configs.git";
metrics = {
enable = true;
address = ":9972";
openFirewall = true; # Optional: open firewall for Prometheus scraping
};
};
```
### Available Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `homelab_deploy_deployments_total` | Counter | `status`, `action`, `error_code` | Total deployment requests processed |
| `homelab_deploy_deployment_duration_seconds` | Histogram | `action`, `success` | Deployment execution time |
| `homelab_deploy_deployment_in_progress` | Gauge | - | 1 if deployment running, 0 otherwise |
| `homelab_deploy_info` | Gauge | `hostname`, `tier`, `role`, `version` | Static instance metadata |
**Label values:**
- `status`: `completed`, `failed`, `rejected`
- `action`: `switch`, `boot`, `test`, `dry-activate`
- `error_code`: `invalid_action`, `invalid_revision`, `already_running`, `build_failed`, `timeout`, or empty
- `success`: `true`, `false`
### HTTP Endpoints
| Endpoint | Description |
|----------|-------------|
| `/metrics` | Prometheus metrics in text format |
| `/health` | Health check (returns `ok`) |
### Example Prometheus Queries
```promql
# Average deployment duration (last hour)
rate(homelab_deploy_deployment_duration_seconds_sum[1h]) /
rate(homelab_deploy_deployment_duration_seconds_count[1h])
# Deployment success rate (last 24 hours)
sum(rate(homelab_deploy_deployments_total{status="completed"}[24h])) /
sum(rate(homelab_deploy_deployments_total{status=~"completed|failed"}[24h]))
# 95th percentile deployment time
histogram_quantile(0.95, rate(homelab_deploy_deployment_duration_seconds_bucket[1h]))
# Currently running deployments across all hosts
sum(homelab_deploy_deployment_in_progress)
```
## Troubleshooting
### Debug Logging
Enable debug logging to diagnose issues with deployments or metrics:
**CLI:**
```bash
homelab-deploy listener --debug \
--hostname myhost \
--tier prod \
--nats-url nats://nats.example.com:4222 \
--nkey-file /run/secrets/listener.nkey \
--flake-url git+https://git.example.com/user/nixos-configs.git \
--metrics-enabled
```
**NixOS module:**
```nix
services.homelab-deploy.listener = {
enable = true;
tier = "prod";
natsUrl = "nats://nats.example.com:4222";
nkeyFile = "/run/secrets/homelab-deploy-nkey";
flakeUrl = "git+https://git.example.com/user/nixos-configs.git";
metrics.enable = true;
extraArgs = [ "--debug" ];
};
```
With debug logging enabled, the listener outputs detailed information about metrics recording:
```json
{"level":"DEBUG","msg":"recording deployment start metric","metrics_enabled":true}
{"level":"DEBUG","msg":"recording deployment end metric (success)","action":"switch","success":true,"duration_seconds":120.5}
```
### Metrics Showing Zero
If deployment metrics remain at zero after deployments:
1. **Check metrics are enabled**: Verify `--metrics-enabled` is set and the metrics endpoint is accessible at `/metrics`
2. **Enable debug logging**: Use `--debug` to confirm metrics recording is being called
3. **Check deployment status**: Duration is only recorded for deployments that run to completion (success or failure). Rejected requests (e.g., already running) increment the counter with `status="rejected"` but do not record a duration
4. **Check after restart**: After a successful `switch` deployment, the listener restarts. Metrics reset to zero in the new instance. The listener waits up to 60 seconds for a Prometheus scrape before restarting to capture the final metrics
5. **Verify Prometheus scrape timing**: Ensure Prometheus scrapes frequently enough to capture metrics before the listener restarts
## Message Protocol ## Message Protocol
### Deploy Request ### Deploy Request
@@ -256,6 +389,139 @@ nk -gen user -pubout
Configure appropriate publish/subscribe permissions in your NATS server for each credential type. Configure appropriate publish/subscribe permissions in your NATS server for each credential type.
## NATS Subject Structure
The deployment system uses the following NATS subject hierarchy:
### Deploy Subjects
| Subject Pattern | Purpose |
|-----------------|---------|
| `deploy.<tier>.<hostname>` | Deploy to a specific host |
| `deploy.<tier>.all` | Deploy to all hosts in a tier |
| `deploy.<tier>.role.<role>` | Deploy to hosts with a specific role in a tier |
**Tier values:** `test`, `prod`
**Examples:**
- `deploy.test.myhost` - Deploy to myhost in test tier
- `deploy.prod.all` - Deploy to all production hosts
- `deploy.prod.role.dns` - Deploy to all DNS servers in production
### Response Subjects
| Subject Pattern | Purpose |
|-----------------|---------|
| `deploy.responses.<uuid>` | Unique reply subject for each deployment request |
Deployers create a unique response subject for each request and include it in the `reply_to` field. Listeners publish status updates to this subject.
### Discovery Subject
| Subject Pattern | Purpose |
|-----------------|---------|
| `deploy.discover` | Host discovery requests and responses |
Used by the `list_hosts` MCP tool and for discovering available deployment targets.
## Example NATS Configuration
Below is an example NATS server configuration implementing tiered authentication. This setup provides:
- **Listeners** - Each host has credentials to subscribe to its own subjects and publish responses
- **Test deployer** - Can deploy to test tier only (suitable for MCP without admin access)
- **Admin deployer** - Can deploy to all tiers (for CLI or MCP with admin access)
```conf
authorization {
users = [
# Listener for a test-tier host
{
nkey: "UTEST_HOST1_PUBLIC_KEY_HERE"
permissions: {
subscribe: [
"deploy.test.testhost1"
"deploy.test.all"
"deploy.test.role.>"
"deploy.discover"
]
publish: [
"deploy.responses.>"
"deploy.discover"
]
}
}
# Listener for a prod-tier host with 'dns' role
{
nkey: "UPROD_DNS1_PUBLIC_KEY_HERE"
permissions: {
subscribe: [
"deploy.prod.dns1"
"deploy.prod.all"
"deploy.prod.role.dns"
"deploy.discover"
]
publish: [
"deploy.responses.>"
"deploy.discover"
]
}
}
# Test-tier deployer (MCP without admin)
{
nkey: "UTEST_DEPLOYER_PUBLIC_KEY_HERE"
permissions: {
publish: [
"deploy.test.>"
"deploy.discover"
]
subscribe: [
"deploy.responses.>"
"deploy.discover"
]
}
}
# Admin deployer (full access to all tiers)
{
nkey: "UADMIN_DEPLOYER_PUBLIC_KEY_HERE"
permissions: {
publish: [
"deploy.>"
]
subscribe: [
"deploy.>"
]
}
}
]
}
```
### Key Permission Patterns
| Credential Type | Publish | Subscribe |
|-----------------|---------|-----------|
| Listener | `deploy.responses.>`, `deploy.discover` | Own subjects, `deploy.discover` |
| Test deployer | `deploy.test.>`, `deploy.discover` | `deploy.responses.>`, `deploy.discover` |
| Admin deployer | `deploy.>` | `deploy.>` |
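To make the permission patterns above easier to reason about, here is a small sketch of NATS subject matching: `*` matches exactly one token and `>` matches one or more trailing tokens. The function `subjectMatches` is a hypothetical helper for checking patterns by hand; it is not part of homelab-deploy or the NATS client, and the NATS server performs the real matching.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether subject matches a NATS-style pattern.
// "*" matches exactly one token; ">" must be the last token and matches
// one or more remaining tokens.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(p) == len(s)
}

func main() {
	checks := []struct{ pattern, subject string }{
		{"deploy.test.>", "deploy.test.testhost1"},
		{"deploy.test.>", "deploy.prod.all"},
		{"deploy.>", "deploy.responses.some-uuid"},
		{"deploy.discover", "deploy.discover"},
	}
	for _, c := range checks {
		fmt.Printf("%-18s %-28s %v\n", c.pattern, c.subject, subjectMatches(c.pattern, c.subject))
	}
}
```

This shows why the test deployer's `deploy.test.>` publish permission cannot reach `deploy.prod.all`, while the admin deployer's `deploy.>` covers every deploy, response, and discovery subject.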
### Generating NKeys
```bash
# Generate a keypair (outputs public key, saves seed to file)
nk -gen user -pubout > mykey.pub
# The seed (private key) is printed to stderr - save it securely
# Or generate and save seed directly
nk -gen user > mykey.seed
nk -inkey mykey.seed -pubout # Get public key from seed
```
The public key (starting with `U`) goes in the NATS server config. The seed file (starting with `SU`) is used by homelab-deploy via `--nkey-file`.
## License ## License
MIT MIT


@@ -16,7 +16,7 @@ import (
"github.com/urfave/cli/v3" "github.com/urfave/cli/v3"
) )
const version = "0.1.0" const version = "0.1.14"
func main() { func main() {
app := &cli.Command{ app := &cli.Command{
@@ -27,6 +27,7 @@ func main() {
listenerCommand(), listenerCommand(),
mcpCommand(), mcpCommand(),
deployCommand(), deployCommand(),
listHostsCommand(),
}, },
} }
@@ -41,6 +42,10 @@ func listenerCommand() *cli.Command {
Name: "listener", Name: "listener",
Usage: "Run as a deployment listener (systemd service mode)", Usage: "Run as a deployment listener (systemd service mode)",
Flags: []cli.Flag{ Flags: []cli.Flag{
&cli.BoolFlag{
Name: "debug",
Usage: "Enable debug logging for troubleshooting",
},
&cli.StringFlag{ &cli.StringFlag{
Name: "hostname", Name: "hostname",
Usage: "Hostname for this listener", Usage: "Hostname for this listener",
@@ -89,6 +94,20 @@ func listenerCommand() *cli.Command {
Usage: "NATS subject for host discovery requests", Usage: "NATS subject for host discovery requests",
Value: "deploy.discover", Value: "deploy.discover",
}, },
&cli.BoolFlag{
Name: "metrics-enabled",
Usage: "Enable Prometheus metrics endpoint",
},
&cli.StringFlag{
Name: "metrics-addr",
Usage: "Address for Prometheus metrics HTTP server",
Value: ":9972",
},
&cli.IntFlag{
Name: "heartbeat-interval",
Usage: "Interval in seconds for sending status updates during deployment (0 to disable)",
Value: 15,
},
}, },
Action: func(ctx context.Context, c *cli.Command) error { Action: func(ctx context.Context, c *cli.Command) error {
tier := c.String("tier") tier := c.String("tier")
@@ -104,12 +123,22 @@ func listenerCommand() *cli.Command {
NKeyFile: c.String("nkey-file"), NKeyFile: c.String("nkey-file"),
FlakeURL: c.String("flake-url"), FlakeURL: c.String("flake-url"),
Timeout: time.Duration(c.Int("timeout")) * time.Second, Timeout: time.Duration(c.Int("timeout")) * time.Second,
HeartbeatInterval: time.Duration(c.Int("heartbeat-interval")) * time.Second,
DeploySubjects: c.StringSlice("deploy-subject"), DeploySubjects: c.StringSlice("deploy-subject"),
DiscoverSubject: c.String("discover-subject"), DiscoverSubject: c.String("discover-subject"),
MetricsEnabled: c.Bool("metrics-enabled"),
MetricsAddr: c.String("metrics-addr"),
Version: version,
Debug: c.Bool("debug"),
}
logLevel := slog.LevelInfo
if c.Bool("debug") {
logLevel = slog.LevelDebug
} }
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{ logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
Level: slog.LevelInfo, Level: logLevel,
})) }))
l := listener.New(cfg, logger) l := listener.New(cfg, logger)
@@ -189,26 +218,31 @@ func deployCommand() *cli.Command {
&cli.StringFlag{ &cli.StringFlag{
Name: "nats-url", Name: "nats-url",
Usage: "NATS server URL", Usage: "NATS server URL",
Sources: cli.EnvVars("HOMELAB_DEPLOY_NATS_URL"),
Required: true, Required: true,
}, },
&cli.StringFlag{ &cli.StringFlag{
Name: "nkey-file", Name: "nkey-file",
Usage: "Path to NKey seed file for NATS authentication", Usage: "Path to NKey seed file for NATS authentication",
Sources: cli.EnvVars("HOMELAB_DEPLOY_NKEY_FILE"),
Required: true, Required: true,
}, },
&cli.StringFlag{ &cli.StringFlag{
Name: "branch", Name: "branch",
Usage: "Git branch or commit to deploy", Usage: "Git branch or commit to deploy",
Sources: cli.EnvVars("HOMELAB_DEPLOY_BRANCH"),
Value: "master", Value: "master",
}, },
&cli.StringFlag{ &cli.StringFlag{
Name: "action", Name: "action",
Usage: "nixos-rebuild action (switch, boot, test, dry-activate)", Usage: "nixos-rebuild action (switch, boot, test, dry-activate)",
Sources: cli.EnvVars("HOMELAB_DEPLOY_ACTION"),
Value: "switch", Value: "switch",
}, },
&cli.IntFlag{ &cli.IntFlag{
Name: "timeout", Name: "timeout",
Usage: "Timeout in seconds for collecting responses", Usage: "Timeout in seconds for collecting responses",
Sources: cli.EnvVars("HOMELAB_DEPLOY_TIMEOUT"),
Value: 900, Value: 900,
}, },
}, },
@@ -265,3 +299,88 @@ func deployCommand() *cli.Command {
}, },
} }
} }
func listHostsCommand() *cli.Command {
return &cli.Command{
Name: "list-hosts",
Usage: "List available deployment targets",
Flags: []cli.Flag{
&cli.StringFlag{
Name: "nats-url",
Usage: "NATS server URL",
Sources: cli.EnvVars("HOMELAB_DEPLOY_NATS_URL"),
Required: true,
},
&cli.StringFlag{
Name: "nkey-file",
Usage: "Path to NKey seed file for NATS authentication",
Sources: cli.EnvVars("HOMELAB_DEPLOY_NKEY_FILE"),
Required: true,
},
&cli.StringFlag{
Name: "tier",
Usage: "Filter by tier (test or prod)",
Sources: cli.EnvVars("HOMELAB_DEPLOY_TIER"),
},
&cli.StringFlag{
Name: "discover-subject",
Usage: "NATS subject for host discovery",
Sources: cli.EnvVars("HOMELAB_DEPLOY_DISCOVER_SUBJECT"),
Value: "deploy.discover",
},
&cli.IntFlag{
Name: "timeout",
Usage: "Timeout in seconds for discovery",
Sources: cli.EnvVars("HOMELAB_DEPLOY_DISCOVER_TIMEOUT"),
Value: 5,
},
},
Action: func(ctx context.Context, c *cli.Command) error {
tierFilter := c.String("tier")
if tierFilter != "" && tierFilter != "test" && tierFilter != "prod" {
return fmt.Errorf("tier must be 'test' or 'prod', got %q", tierFilter)
}
// Handle shutdown signals
ctx, cancel := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM)
defer cancel()
responses, err := deploycli.Discover(
ctx,
c.String("nats-url"),
c.String("nkey-file"),
c.String("discover-subject"),
time.Duration(c.Int("timeout"))*time.Second,
)
if err != nil {
return fmt.Errorf("discovery failed: %w", err)
}
if len(responses) == 0 {
fmt.Println("No hosts responded to discovery request")
return nil
}
fmt.Println("Available deployment targets:")
fmt.Println()
for _, resp := range responses {
if tierFilter != "" && resp.Tier != tierFilter {
continue
}
role := resp.Role
if role == "" {
role = "(none)"
}
fmt.Printf("- %s (tier=%s, role=%s)\n", resp.Hostname, resp.Tier, role)
for _, subj := range resp.DeploySubjects {
fmt.Printf(" %s\n", subj)
}
}
return nil
},
}
}


@@ -15,13 +15,18 @@
packages = forAllSystems (system: packages = forAllSystems (system:
let let
pkgs = pkgsFor system; pkgs = pkgsFor system;
# Extract version from main.go
version = builtins.head (
builtins.match ''.*const version = "([^"]+)".*''
(builtins.readFile ./cmd/homelab-deploy/main.go)
);
in in
{ {
homelab-deploy = pkgs.buildGoModule { homelab-deploy = pkgs.buildGoModule {
pname = "homelab-deploy"; pname = "homelab-deploy";
version = "0.1.0"; inherit version;
src = ./.; src = ./.;
vendorHash = "sha256-JXa+obN62zrrwXlplqojY7dvEunUqDdSTee6N8c5JTg="; vendorHash = "sha256-CN+l0JbQu+HDfotkt3PUFzBexHCHpCKIIZpAQRyojBk=";
subPackages = [ "cmd/homelab-deploy" ]; subPackages = [ "cmd/homelab-deploy" ];
}; };
default = self.packages.${system}.homelab-deploy; default = self.packages.${system}.homelab-deploy;
@@ -44,7 +49,7 @@
}; };
}); });
nixosModules.default = import ./nixos/module.nix; nixosModules.default = import ./nixos/module.nix { inherit self; };
nixosModules.homelab-deploy = self.nixosModules.default; nixosModules.homelab-deploy = self.nixosModules.default;
}; };
} }

10
go.mod

@@ -7,20 +7,30 @@ require (
github.com/mark3labs/mcp-go v0.43.2 github.com/mark3labs/mcp-go v0.43.2
github.com/nats-io/nats.go v1.48.0 github.com/nats-io/nats.go v1.48.0
github.com/nats-io/nkeys v0.4.15 github.com/nats-io/nkeys v0.4.15
github.com/prometheus/client_golang v1.23.2
github.com/urfave/cli/v3 v3.6.2 github.com/urfave/cli/v3 v3.6.2
) )
require ( require (
github.com/bahlo/generic-list-go v0.2.0 // indirect github.com/bahlo/generic-list-go v0.2.0 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/buger/jsonparser v1.1.1 // indirect github.com/buger/jsonparser v1.1.1 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/invopop/jsonschema v0.13.0 // indirect github.com/invopop/jsonschema v0.13.0 // indirect
github.com/klauspost/compress v1.18.0 // indirect github.com/klauspost/compress v1.18.0 // indirect
github.com/kylelemons/godebug v1.1.0 // indirect
github.com/mailru/easyjson v0.7.7 // indirect github.com/mailru/easyjson v0.7.7 // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/nats-io/nuid v1.0.1 // indirect github.com/nats-io/nuid v1.0.1 // indirect
github.com/prometheus/client_model v0.6.2 // indirect
github.com/prometheus/common v0.66.1 // indirect
github.com/prometheus/procfs v0.16.1 // indirect
github.com/spf13/cast v1.7.1 // indirect github.com/spf13/cast v1.7.1 // indirect
github.com/wk8/go-ordered-map/v2 v2.1.8 // indirect github.com/wk8/go-ordered-map/v2 v2.1.8 // indirect
github.com/yosida95/uritemplate/v3 v3.0.2 // indirect github.com/yosida95/uritemplate/v3 v3.0.2 // indirect
go.yaml.in/yaml/v2 v2.4.2 // indirect
golang.org/x/crypto v0.47.0 // indirect golang.org/x/crypto v0.47.0 // indirect
golang.org/x/sys v0.40.0 // indirect golang.org/x/sys v0.40.0 // indirect
google.golang.org/protobuf v1.36.8 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect gopkg.in/yaml.v3 v3.0.1 // indirect
) )

go.sum

@@ -1,13 +1,17 @@
github.com/bahlo/generic-list-go v0.2.0 h1:5sz/EEAK+ls5wF+NeqDpk5+iNdMDXrh3z3nPnH1Wvgk= github.com/bahlo/generic-list-go v0.2.0 h1:5sz/EEAK+ls5wF+NeqDpk5+iNdMDXrh3z3nPnH1Wvgk=
github.com/bahlo/generic-list-go v0.2.0/go.mod h1:2KvAjgMlE5NNynlg/5iLrrCCZ2+5xWbdbCW3pNTGyYg= github.com/bahlo/generic-list-go v0.2.0/go.mod h1:2KvAjgMlE5NNynlg/5iLrrCCZ2+5xWbdbCW3pNTGyYg=
github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM=
github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
github.com/buger/jsonparser v1.1.1 h1:2PnMjfWD7wBILjqQbt530v576A/cAbQvEW9gGIpYMUs= github.com/buger/jsonparser v1.1.1 h1:2PnMjfWD7wBILjqQbt530v576A/cAbQvEW9gGIpYMUs=
github.com/buger/jsonparser v1.1.1/go.mod h1:6RYKKt7H4d4+iWqouImQ9R2FZql3VbhNgx27UK13J/0= github.com/buger/jsonparser v1.1.1/go.mod h1:6RYKKt7H4d4+iWqouImQ9R2FZql3VbhNgx27UK13J/0=
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/frankban/quicktest v1.14.6 h1:7Xjx+VpznH+oBnejlPUj8oUpdxnVs4f8XU8WnHkI4W8= github.com/frankban/quicktest v1.14.6 h1:7Xjx+VpznH+oBnejlPUj8oUpdxnVs4f8XU8WnHkI4W8=
github.com/frankban/quicktest v1.14.6/go.mod h1:4ptaffx2x8+WTWXmUCuVU6aPUX1/Mz7zb5vbUoiM6w0= github.com/frankban/quicktest v1.14.6/go.mod h1:4ptaffx2x8+WTWXmUCuVU6aPUX1/Mz7zb5vbUoiM6w0=
github.com/google/go-cmp v0.5.9 h1:O2Tfq5qg4qc4AmwVlvv0oLiVAGB7enBSJ2x2DqQFi38= github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=
github.com/google/go-cmp v0.5.9/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY= github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/invopop/jsonschema v0.13.0 h1:KvpoAJWEjR3uD9Kbm2HWJmqsEaHt8lBUpd0qHcIi21E= github.com/invopop/jsonschema v0.13.0 h1:KvpoAJWEjR3uD9Kbm2HWJmqsEaHt8lBUpd0qHcIi21E=
@@ -19,10 +23,14 @@ github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk= github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY= github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE= github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
github.com/kylelemons/godebug v1.1.0 h1:RPNrshWIDI6G2gRW9EHilWtl7Z6Sb1BR0xunSBf0SNc=
github.com/kylelemons/godebug v1.1.0/go.mod h1:9/0rRGxNHcop5bhtWyNeEfOS8JIWk580+fNqagV/RAw=
github.com/mailru/easyjson v0.7.7 h1:UGYAvKxe3sBsEDzO8ZeWOSlIQfWFlxbzLZe7hwFURr0= github.com/mailru/easyjson v0.7.7 h1:UGYAvKxe3sBsEDzO8ZeWOSlIQfWFlxbzLZe7hwFURr0=
github.com/mailru/easyjson v0.7.7/go.mod h1:xzfreul335JAWq5oZzymOObrkdz5UnU4kGfJJLY9Nlc= github.com/mailru/easyjson v0.7.7/go.mod h1:xzfreul335JAWq5oZzymOObrkdz5UnU4kGfJJLY9Nlc=
github.com/mark3labs/mcp-go v0.43.2 h1:21PUSlWWiSbUPQwXIJ5WKlETixpFpq+WBpbMGDSVy/I= github.com/mark3labs/mcp-go v0.43.2 h1:21PUSlWWiSbUPQwXIJ5WKlETixpFpq+WBpbMGDSVy/I=
github.com/mark3labs/mcp-go v0.43.2/go.mod h1:YnJfOL382MIWDx1kMY+2zsRHU/q78dBg9aFb8W6Thdw= github.com/mark3labs/mcp-go v0.43.2/go.mod h1:YnJfOL382MIWDx1kMY+2zsRHU/q78dBg9aFb8W6Thdw=
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA=
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ=
github.com/nats-io/nats.go v1.48.0 h1:pSFyXApG+yWU/TgbKCjmm5K4wrHu86231/w84qRVR+U= github.com/nats-io/nats.go v1.48.0 h1:pSFyXApG+yWU/TgbKCjmm5K4wrHu86231/w84qRVR+U=
github.com/nats-io/nats.go v1.48.0/go.mod h1:iRWIPokVIFbVijxuMQq4y9ttaBTMe0SFdlZfMDd+33g= github.com/nats-io/nats.go v1.48.0/go.mod h1:iRWIPokVIFbVijxuMQq4y9ttaBTMe0SFdlZfMDd+33g=
github.com/nats-io/nkeys v0.4.15 h1:JACV5jRVO9V856KOapQ7x+EY8Jo3qw1vJt/9Jpwzkk4= github.com/nats-io/nkeys v0.4.15 h1:JACV5jRVO9V856KOapQ7x+EY8Jo3qw1vJt/9Jpwzkk4=
@@ -31,8 +39,16 @@ github.com/nats-io/nuid v1.0.1 h1:5iA8DT8V7q8WK2EScv2padNa/rTESc1KdnPw4TC2paw=
github.com/nats-io/nuid v1.0.1/go.mod h1:19wcPz3Ph3q0Jbyiqsd0kePYG7A95tJPxeL+1OSON2c= github.com/nats-io/nuid v1.0.1/go.mod h1:19wcPz3Ph3q0Jbyiqsd0kePYG7A95tJPxeL+1OSON2c=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/rogpeppe/go-internal v1.9.0 h1:73kH8U+JUqXU8lRuOHeVHaa/SZPifC7BkcraZVejAe8= github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h0RJWRi/o0o=
github.com/rogpeppe/go-internal v1.9.0/go.mod h1:WtVeX8xhTBvf0smdhujwtBcq4Qrzq/fJaraNFVN+nFs= github.com/prometheus/client_golang v1.23.2/go.mod h1:Tb1a6LWHB3/SPIzCoaDXI4I8UHKeFTEQ1YCr+0Gyqmg=
github.com/prometheus/client_model v0.6.2 h1:oBsgwpGs7iVziMvrGhE53c/GrLUsZdHnqNwqPLxwZyk=
github.com/prometheus/client_model v0.6.2/go.mod h1:y3m2F6Gdpfy6Ut/GBsUqTWZqCUvMVzSfMLjcu6wAwpE=
github.com/prometheus/common v0.66.1 h1:h5E0h5/Y8niHc5DlaLlWLArTQI7tMrsfQjHV+d9ZoGs=
github.com/prometheus/common v0.66.1/go.mod h1:gcaUsgf3KfRSwHY4dIMXLPV0K/Wg1oZ8+SbZk/HH/dA=
github.com/prometheus/procfs v0.16.1 h1:hZ15bTNuirocR6u0JZ6BAHHmwS1p8B4P6MRqxtzMyRg=
github.com/prometheus/procfs v0.16.1/go.mod h1:teAbpZRB1iIAJYREa1LsoWUXykVXA1KlTmWl8x/U+Is=
github.com/rogpeppe/go-internal v1.10.0 h1:TMyTOH3F/DB16zRVcYyreMH6GnZZrwQVAoYjRBZyWFQ=
github.com/rogpeppe/go-internal v1.10.0/go.mod h1:UQnix2H7Ngw/k4C5ijL5+65zddjncjaFoBhdsK/akog=
github.com/spf13/cast v1.7.1 h1:cuNEagBQEHWN1FnbGEjCXL2szYEXqfJPbP2HNUaca9Y= github.com/spf13/cast v1.7.1 h1:cuNEagBQEHWN1FnbGEjCXL2szYEXqfJPbP2HNUaca9Y=
github.com/spf13/cast v1.7.1/go.mod h1:ancEpBxwJDODSW/UG4rDrAqiKolqNNh2DX3mk86cAdo= github.com/spf13/cast v1.7.1/go.mod h1:ancEpBxwJDODSW/UG4rDrAqiKolqNNh2DX3mk86cAdo=
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U= github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
@@ -43,11 +59,18 @@ github.com/wk8/go-ordered-map/v2 v2.1.8 h1:5h/BUHu93oj4gIdvHHHGsScSTMijfx5PeYkE/
github.com/wk8/go-ordered-map/v2 v2.1.8/go.mod h1:5nJHM5DyteebpVlHnWMV0rPz6Zp7+xBAnxjb1X5vnTw= github.com/wk8/go-ordered-map/v2 v2.1.8/go.mod h1:5nJHM5DyteebpVlHnWMV0rPz6Zp7+xBAnxjb1X5vnTw=
github.com/yosida95/uritemplate/v3 v3.0.2 h1:Ed3Oyj9yrmi9087+NczuL5BwkIc4wvTb5zIM+UJPGz4= github.com/yosida95/uritemplate/v3 v3.0.2 h1:Ed3Oyj9yrmi9087+NczuL5BwkIc4wvTb5zIM+UJPGz4=
github.com/yosida95/uritemplate/v3 v3.0.2/go.mod h1:ILOh0sOhIJR3+L/8afwt/kE++YT040gmv5BQTMR2HP4= github.com/yosida95/uritemplate/v3 v3.0.2/go.mod h1:ILOh0sOhIJR3+L/8afwt/kE++YT040gmv5BQTMR2HP4=
go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
go.yaml.in/yaml/v2 v2.4.2 h1:DzmwEr2rDGHl7lsFgAHxmNz/1NlQ7xLIrlN2h5d1eGI=
go.yaml.in/yaml/v2 v2.4.2/go.mod h1:081UH+NErpNdqlCXm3TtEran0rJZGxAYx9hb/ELlsPU=
golang.org/x/crypto v0.47.0 h1:V6e3FRj+n4dbpw86FJ8Fv7XVOql7TEwpHapKoMJ/GO8= golang.org/x/crypto v0.47.0 h1:V6e3FRj+n4dbpw86FJ8Fv7XVOql7TEwpHapKoMJ/GO8=
golang.org/x/crypto v0.47.0/go.mod h1:ff3Y9VzzKbwSSEzWqJsJVBnWmRwRSHt/6Op5n9bQc4A= golang.org/x/crypto v0.47.0/go.mod h1:ff3Y9VzzKbwSSEzWqJsJVBnWmRwRSHt/6Op5n9bQc4A=
golang.org/x/sys v0.40.0 h1:DBZZqJ2Rkml6QMQsZywtnjnnGvHza6BTfYFWY9kjEWQ= golang.org/x/sys v0.40.0 h1:DBZZqJ2Rkml6QMQsZywtnjnnGvHza6BTfYFWY9kjEWQ=
golang.org/x/sys v0.40.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks= golang.org/x/sys v0.40.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM= google.golang.org/protobuf v1.36.8 h1:xHScyCOEuuwZEc6UtSOvPbAT4zRh0xcNRYekJwfqyMc=
google.golang.org/protobuf v1.36.8/go.mod h1:fuxRtAxBytpl4zzqUh6/eyUujkJdNiuEkXntxiD/uRU=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=


@@ -28,14 +28,32 @@ type DeployResult struct {
Errors []error Errors []error
} }
// AllSucceeded returns true if all responses indicate success. // AllSucceeded returns true if all hosts' final responses indicate success.
func (r *DeployResult) AllSucceeded() bool { func (r *DeployResult) AllSucceeded() bool {
if len(r.Errors) > 0 {
return false
}
// Track the final status for each host
finalStatus := make(map[string]messages.Status)
for _, resp := range r.Responses { for _, resp := range r.Responses {
if resp.Status != messages.StatusCompleted { if resp.Status.IsFinal() {
finalStatus[resp.Hostname] = resp.Status
}
}
// Need at least one host with a final status
if len(finalStatus) == 0 {
return false
}
// All final statuses must be completed
for _, status := range finalStatus {
if status != messages.StatusCompleted {
return false return false
} }
} }
return len(r.Responses) > 0 && len(r.Errors) == 0 return true
} }
// HostCount returns the number of unique hosts that responded. // HostCount returns the number of unique hosts that responded.
@@ -68,6 +86,8 @@ func Deploy(ctx context.Context, cfg DeployConfig, onResponse func(*messages.Dep
var mu sync.Mutex var mu sync.Mutex
result := &DeployResult{} result := &DeployResult{}
hostFinal := make(map[string]bool) // track which hosts have sent final status hostFinal := make(map[string]bool) // track which hosts have sent final status
hostSeen := make(map[string]bool) // track all hosts that have responded
lastResponse := time.Now()
// Subscribe to reply subject // Subscribe to reply subject
sub, err := client.Subscribe(replySubject, func(subject string, data []byte) { sub, err := client.Subscribe(replySubject, func(subject string, data []byte) {
@@ -81,9 +101,11 @@ func Deploy(ctx context.Context, cfg DeployConfig, onResponse func(*messages.Dep
mu.Lock() mu.Lock()
result.Responses = append(result.Responses, resp) result.Responses = append(result.Responses, resp)
hostSeen[resp.Hostname] = true
if resp.Status.IsFinal() { if resp.Status.IsFinal() {
hostFinal[resp.Hostname] = true hostFinal[resp.Hostname] = true
} }
lastResponse = time.Now()
mu.Unlock() mu.Unlock()
if onResponse != nil { if onResponse != nil {
@@ -119,8 +141,7 @@ func Deploy(ctx context.Context, cfg DeployConfig, onResponse func(*messages.Dep
// Use a dynamic timeout: wait for initial responses, then extend // Use a dynamic timeout: wait for initial responses, then extend
// timeout after each response until no new responses or max timeout // timeout after each response until no new responses or max timeout
deadline := time.Now().Add(cfg.Timeout) deadline := time.Now().Add(cfg.Timeout)
lastResponse := time.Now() idleTimeout := 30 * time.Second // wait this long after last response for new hosts
idleTimeout := 30 * time.Second // wait this long after last response
for { for {
select { select {
@@ -128,7 +149,9 @@ func Deploy(ctx context.Context, cfg DeployConfig, onResponse func(*messages.Dep
return result, ctx.Err() return result, ctx.Err()
case <-time.After(1 * time.Second): case <-time.After(1 * time.Second):
mu.Lock() mu.Lock()
responseCount := len(result.Responses) seenCount := len(hostSeen)
finalCount := len(hostFinal)
lastResponseTime := lastResponse
mu.Unlock() mu.Unlock()
now := time.Now() now := time.Now()
@@ -138,21 +161,19 @@ func Deploy(ctx context.Context, cfg DeployConfig, onResponse func(*messages.Dep
return result, nil return result, nil
} }
// If we have responses, use idle timeout // If all hosts that responded have sent final status, we're done
if responseCount > 0 { // Add a short grace period for late arrivals from other hosts
mu.Lock() if seenCount > 0 && seenCount == finalCount {
lastResponseTime := lastResponse // Wait a bit for any other hosts to respond
// Update lastResponse time if we got new responses if now.Sub(lastResponseTime) > 2*time.Second {
if responseCount > 0 {
// Simple approximation - in practice you'd track this more precisely
lastResponseTime = now
}
mu.Unlock()
if now.Sub(lastResponseTime) > idleTimeout {
return result, nil return result, nil
} }
} }
// If we have responses but waiting for more hosts, use idle timeout
if seenCount > 0 && now.Sub(lastResponseTime) > idleTimeout {
return result, nil
}
} }
} }
} }


@@ -49,6 +49,40 @@ func TestDeployResult_AllSucceeded(t *testing.T) {
errors: []error{nil}, // placeholder error errors: []error{nil}, // placeholder error
want: false, want: false,
}, },
{
name: "with intermediate responses - success",
responses: []*messages.DeployResponse{
{Hostname: "host1", Status: messages.StatusStarted},
{Hostname: "host1", Status: messages.StatusCompleted},
},
want: true,
},
{
name: "with intermediate responses - failure",
responses: []*messages.DeployResponse{
{Hostname: "host1", Status: messages.StatusStarted},
{Hostname: "host1", Status: messages.StatusFailed},
},
want: false,
},
{
name: "multiple hosts with intermediate responses",
responses: []*messages.DeployResponse{
{Hostname: "host1", Status: messages.StatusStarted},
{Hostname: "host2", Status: messages.StatusStarted},
{Hostname: "host1", Status: messages.StatusCompleted},
{Hostname: "host2", Status: messages.StatusCompleted},
},
want: true,
},
{
name: "only intermediate responses - no final",
responses: []*messages.DeployResponse{
{Hostname: "host1", Status: messages.StatusStarted},
{Hostname: "host1", Status: messages.StatusAccepted},
},
want: false,
},
} }
for _, tc := range tests { for _, tc := range tests {
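The new test cases pin down one rule: only the last final status per host counts, and intermediate statuses are ignored. A self-contained sketch of that rule follows. The `Status` type, its constants and `IsFinal` are stand-ins for the `messages` package, which is not shown in this diff. In particular, which statuses count as final is an assumption here.

```go
package main

import "fmt"

// Status is a stand-in for messages.Status; the string values are assumptions.
type Status string

const (
	StatusAccepted  Status = "accepted"
	StatusStarted   Status = "started"
	StatusCompleted Status = "completed"
	StatusFailed    Status = "failed"
)

// IsFinal reports whether no further responses are expected after s.
// Assumption: only completed and failed are treated as final in this sketch.
func (s Status) IsFinal() bool {
	return s == StatusCompleted || s == StatusFailed
}

// Response is a reduced stand-in for messages.DeployResponse.
type Response struct {
	Hostname string
	Status   Status
}

// allSucceeded returns true only if there are no errors, at least one host
// reached a final status, and every host's final status is completed.
func allSucceeded(responses []Response, errs []error) bool {
	if len(errs) > 0 {
		return false
	}
	final := make(map[string]Status)
	for _, r := range responses {
		if r.Status.IsFinal() {
			final[r.Hostname] = r.Status // later final statuses overwrite earlier ones
		}
	}
	if len(final) == 0 {
		return false
	}
	for _, s := range final {
		if s != StatusCompleted {
			return false
		}
	}
	return true
}

func main() {
	ok := allSucceeded([]Response{
		{"host1", StatusStarted},
		{"host1", StatusCompleted},
	}, nil)
	fmt.Println(ok) // true
}
```

With the previous implementation the same input would have returned false, because the intermediate `started` response is not `completed`.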


@@ -35,6 +35,15 @@ type Result struct {
Error error Error error
} }
// ExecuteOptions contains optional settings for Execute.
type ExecuteOptions struct {
// HeartbeatInterval is how often to call the heartbeat callback.
// If zero, no heartbeat is sent.
HeartbeatInterval time.Duration
// HeartbeatCallback is called periodically with elapsed time while the command runs.
HeartbeatCallback func(elapsed time.Duration)
}
// ValidateRevision checks if a revision exists in the remote repository. // ValidateRevision checks if a revision exists in the remote repository.
// It uses git ls-remote to verify the ref exists. // It uses git ls-remote to verify the ref exists.
func (e *Executor) ValidateRevision(ctx context.Context, revision string) error { func (e *Executor) ValidateRevision(ctx context.Context, revision string) error {
@@ -65,6 +74,11 @@ func (e *Executor) ValidateRevision(ctx context.Context, revision string) error
// Execute runs nixos-rebuild with the specified action and revision. // Execute runs nixos-rebuild with the specified action and revision.
func (e *Executor) Execute(ctx context.Context, action messages.Action, revision string) *Result { func (e *Executor) Execute(ctx context.Context, action messages.Action, revision string) *Result {
return e.ExecuteWithOptions(ctx, action, revision, nil)
}
// ExecuteWithOptions runs nixos-rebuild with the specified action, revision, and options.
func (e *Executor) ExecuteWithOptions(ctx context.Context, action messages.Action, revision string, opts *ExecuteOptions) *Result {
ctx, cancel := context.WithTimeout(ctx, e.timeout) ctx, cancel := context.WithTimeout(ctx, e.timeout)
defer cancel() defer cancel()
@@ -77,7 +91,41 @@ func (e *Executor) Execute(ctx context.Context, action messages.Action, revision
cmd.Stdout = &stdout cmd.Stdout = &stdout
cmd.Stderr = &stderr cmd.Stderr = &stderr
err := cmd.Run() // Start the command
startTime := time.Now()
if err := cmd.Start(); err != nil {
return &Result{
Success: false,
ExitCode: -1,
Error: fmt.Errorf("failed to start command: %w", err),
}
}
// Set up heartbeat if configured
var heartbeatDone chan struct{}
if opts != nil && opts.HeartbeatInterval > 0 && opts.HeartbeatCallback != nil {
heartbeatDone = make(chan struct{})
go func() {
ticker := time.NewTicker(opts.HeartbeatInterval)
defer ticker.Stop()
for {
select {
case <-heartbeatDone:
return
case <-ticker.C:
opts.HeartbeatCallback(time.Since(startTime))
}
}
}()
}
// Wait for command to complete
err := cmd.Wait()
// Stop heartbeat goroutine
if heartbeatDone != nil {
close(heartbeatDone)
}
result := &Result{ result := &Result{
Stdout: stdout.String(), Stdout: stdout.String(),


@@ -8,6 +8,7 @@ import (
"git.t-juice.club/torjus/homelab-deploy/internal/deploy" "git.t-juice.club/torjus/homelab-deploy/internal/deploy"
"git.t-juice.club/torjus/homelab-deploy/internal/messages" "git.t-juice.club/torjus/homelab-deploy/internal/messages"
"git.t-juice.club/torjus/homelab-deploy/internal/metrics"
"git.t-juice.club/torjus/homelab-deploy/internal/nats" "git.t-juice.club/torjus/homelab-deploy/internal/nats"
) )
@@ -20,8 +21,13 @@ type Config struct {
NKeyFile string NKeyFile string
FlakeURL string FlakeURL string
Timeout time.Duration Timeout time.Duration
HeartbeatInterval time.Duration
DeploySubjects []string DeploySubjects []string
DiscoverSubject string DiscoverSubject string
MetricsEnabled bool
MetricsAddr string
Version string
Debug bool
} }
// Listener handles deployment requests from NATS. // Listener handles deployment requests from NATS.
@@ -34,6 +40,14 @@ type Listener struct {
// Expanded subjects for discovery responses // Expanded subjects for discovery responses
expandedSubjects []string expandedSubjects []string
// restartCh signals that the listener should exit for restart
// (e.g., after a successful switch deployment)
restartCh chan struct{}
// metrics server and collector (nil if metrics disabled)
metricsServer *metrics.Server
metrics *metrics.Collector
} }
// New creates a new listener with the given configuration. // New creates a new listener with the given configuration.
@@ -42,16 +56,42 @@ func New(cfg Config, logger *slog.Logger) *Listener {
logger = slog.Default() logger = slog.Default()
} }
return &Listener{ l := &Listener{
cfg: cfg, cfg: cfg,
executor: deploy.NewExecutor(cfg.FlakeURL, cfg.Hostname, cfg.Timeout), executor: deploy.NewExecutor(cfg.FlakeURL, cfg.Hostname, cfg.Timeout),
lock: deploy.NewLock(), lock: deploy.NewLock(),
logger: logger, logger: logger,
restartCh: make(chan struct{}, 1),
} }
if cfg.MetricsEnabled {
l.metricsServer = metrics.NewServer(metrics.ServerConfig{
Addr: cfg.MetricsAddr,
Logger: logger,
})
l.metrics = l.metricsServer.Collector()
}
return l
} }
// Run starts the listener and blocks until the context is cancelled. // Run starts the listener and blocks until the context is cancelled.
func (l *Listener) Run(ctx context.Context) error { func (l *Listener) Run(ctx context.Context) error {
// Start metrics server if enabled
if l.metricsServer != nil {
if err := l.metricsServer.Start(); err != nil {
return fmt.Errorf("failed to start metrics server: %w", err)
}
defer func() {
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_ = l.metricsServer.Shutdown(shutdownCtx)
}()
// Set instance info metric
l.metrics.SetInfo(l.cfg.Hostname, l.cfg.Tier, l.cfg.Role, l.cfg.Version)
}
// Connect to NATS // Connect to NATS
l.logger.Info("connecting to NATS", l.logger.Info("connecting to NATS",
"url", l.cfg.NATSUrl, "url", l.cfg.NATSUrl,
@@ -93,9 +133,13 @@ func (l *Listener) Run(ctx context.Context) error {
l.logger.Info("listener started", "deploy_subjects", l.expandedSubjects, "discover_subject", discoverSubject) l.logger.Info("listener started", "deploy_subjects", l.expandedSubjects, "discover_subject", discoverSubject)
// Wait for context cancellation // Wait for context cancellation or restart signal
<-ctx.Done() select {
case <-ctx.Done():
l.logger.Info("shutting down listener") l.logger.Info("shutting down listener")
case <-l.restartCh:
l.logger.Info("exiting for restart after successful switch deployment")
}
return nil return nil
} }
@@ -127,6 +171,9 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
messages.StatusRejected, messages.StatusRejected,
err.Error(), err.Error(),
).WithError(messages.ErrorInvalidAction)) ).WithError(messages.ErrorInvalidAction))
if l.metrics != nil {
l.metrics.RecordRejection(req.Action, messages.ErrorInvalidAction)
}
return return
} }
@@ -141,6 +188,9 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
messages.StatusRejected, messages.StatusRejected,
"another deployment is already in progress", "another deployment is already in progress",
).WithError(messages.ErrorAlreadyRunning)) ).WithError(messages.ErrorAlreadyRunning))
if l.metrics != nil {
l.metrics.RecordRejection(req.Action, messages.ErrorAlreadyRunning)
}
return return
} }
defer l.lock.Release() defer l.lock.Release()
@@ -152,6 +202,19 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
fmt.Sprintf("starting deployment: %s", l.executor.BuildCommand(req.Action, req.Revision)), fmt.Sprintf("starting deployment: %s", l.executor.BuildCommand(req.Action, req.Revision)),
)) ))
// Record deployment start for metrics
if l.metrics != nil {
l.logger.Debug("recording deployment start metric",
"metrics_enabled", true,
)
l.metrics.RecordDeploymentStart()
} else {
l.logger.Debug("skipping deployment start metric",
"metrics_enabled", false,
)
}
startTime := time.Now()
// Validate revision // Validate revision
ctx := context.Background() ctx := context.Background()
if err := l.executor.ValidateRevision(ctx, req.Revision); err != nil { if err := l.executor.ValidateRevision(ctx, req.Revision); err != nil {
@@ -164,6 +227,20 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
messages.StatusFailed, messages.StatusFailed,
fmt.Sprintf("revision validation failed: %v", err), fmt.Sprintf("revision validation failed: %v", err),
).WithError(messages.ErrorInvalidRevision)) ).WithError(messages.ErrorInvalidRevision))
duration := time.Since(startTime).Seconds()
if l.metrics != nil {
l.logger.Debug("recording deployment failure metric (revision validation)",
"action", req.Action,
"error_code", messages.ErrorInvalidRevision,
"duration_seconds", duration,
)
l.metrics.RecordDeploymentFailure(req.Action, messages.ErrorInvalidRevision, duration)
} else {
l.logger.Debug("skipping deployment failure metric",
"metrics_enabled", false,
"duration_seconds", duration,
)
}
return return
} }
@@ -174,7 +251,23 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
"command", l.executor.BuildCommand(req.Action, req.Revision), "command", l.executor.BuildCommand(req.Action, req.Revision),
) )
result := l.executor.Execute(ctx, req.Action, req.Revision) // Set up heartbeat options to send periodic status updates
var opts *deploy.ExecuteOptions
if l.cfg.HeartbeatInterval > 0 {
opts = &deploy.ExecuteOptions{
HeartbeatInterval: l.cfg.HeartbeatInterval,
HeartbeatCallback: func(elapsed time.Duration) {
l.sendResponse(req.ReplyTo, messages.NewDeployResponse(
l.cfg.Hostname,
messages.StatusRunning,
fmt.Sprintf("deployment in progress (%s elapsed)", elapsed.Round(time.Second)),
))
},
}
}
result := l.executor.ExecuteWithOptions(ctx, req.Action, req.Revision, opts)
duration := time.Since(startTime).Seconds()
if result.Success { if result.Success {
l.logger.Info("deployment completed successfully", l.logger.Info("deployment completed successfully",
@@ -185,6 +278,43 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
messages.StatusCompleted, messages.StatusCompleted,
"deployment completed successfully", "deployment completed successfully",
)) ))
// Flush to ensure the completed response is sent before we potentially restart
if err := l.client.Flush(); err != nil {
l.logger.Error("failed to flush completed response", "error", err)
}
if l.metrics != nil {
l.logger.Debug("recording deployment end metric (success)",
"action", req.Action,
"success", true,
"duration_seconds", duration,
)
l.metrics.RecordDeploymentEnd(req.Action, true, duration)
} else {
l.logger.Debug("skipping deployment end metric",
"metrics_enabled", false,
"duration_seconds", duration,
)
}
// After a successful switch, signal restart so we pick up any new version
if req.Action == messages.ActionSwitch {
// Wait for metrics scrape before restarting (if metrics enabled)
if l.metricsServer != nil {
l.logger.Info("waiting for metrics scrape before restart")
select {
case <-l.metricsServer.ScrapeCh():
l.logger.Info("metrics scraped, proceeding with restart")
case <-time.After(60 * time.Second):
l.logger.Warn("no metrics scrape within timeout, proceeding with restart anyway")
}
}
select {
case l.restartCh <- struct{}{}:
default:
// Channel already has a signal pending
}
}
} else { } else {
l.logger.Error("deployment failed", l.logger.Error("deployment failed",
"exit_code", result.ExitCode, "exit_code", result.ExitCode,
@@ -202,6 +332,19 @@ func (l *Listener) handleDeployRequest(subject string, data []byte) {
messages.StatusFailed, messages.StatusFailed,
fmt.Sprintf("deployment failed (exit code %d): %s", result.ExitCode, result.Stderr), fmt.Sprintf("deployment failed (exit code %d): %s", result.ExitCode, result.Stderr),
).WithError(errorCode)) ).WithError(errorCode))
if l.metrics != nil {
l.logger.Debug("recording deployment failure metric",
"action", req.Action,
"error_code", errorCode,
"duration_seconds", duration,
)
l.metrics.RecordDeploymentFailure(req.Action, errorCode, duration)
} else {
l.logger.Debug("skipping deployment failure metric",
"metrics_enabled", false,
"duration_seconds", duration,
)
}
} }
} }


@@ -2,8 +2,14 @@ package listener
import ( import (
"log/slog" "log/slog"
"strings"
"testing" "testing"
"time" "time"
"git.t-juice.club/torjus/homelab-deploy/internal/messages"
"git.t-juice.club/torjus/homelab-deploy/internal/metrics"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
) )
func TestNew(t *testing.T) { func TestNew(t *testing.T) {
@@ -51,3 +57,148 @@ func TestNew_WithLogger(t *testing.T) {
t.Error("should use provided logger") t.Error("should use provided logger")
} }
} }
func TestNew_WithMetricsEnabled(t *testing.T) {
cfg := Config{
Hostname: "test-host",
Tier: "test",
MetricsEnabled: true,
MetricsAddr: ":0",
}
l := New(cfg, nil)
if l.metricsServer == nil {
t.Error("metricsServer should not be nil when MetricsEnabled is true")
}
if l.metrics == nil {
t.Error("metrics should not be nil when MetricsEnabled is true")
}
}
func TestListener_MetricsRecordedOnDeployment(t *testing.T) {
// This test verifies that the listener correctly calls metrics functions
// when processing deployments. We test this by directly calling the internal
// metrics recording logic that handleDeployRequest uses.
reg := prometheus.NewRegistry()
collector := metrics.NewCollector(reg)
// Simulate what handleDeployRequest does for a successful deployment
collector.RecordDeploymentStart()
collector.RecordDeploymentEnd(messages.ActionSwitch, true, 120.5)
// Verify counter was incremented
counterExpected := `
# HELP homelab_deploy_deployments_total Total deployment requests processed
# TYPE homelab_deploy_deployments_total counter
homelab_deploy_deployments_total{action="boot",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="boot",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="completed"} 1
homelab_deploy_deployments_total{action="switch",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
t.Errorf("unexpected counter metrics: %v", err)
}
// Verify histogram was updated (120.5 seconds falls into le="300" and higher buckets)
histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 1
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 120.5
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
		t.Errorf("unexpected histogram metrics: %v", err)
	}
}

@@ -35,6 +35,7 @@ const (
	StatusAccepted Status = "accepted"
	StatusRejected Status = "rejected"
	StatusStarted Status = "started"
+	StatusRunning Status = "running"
	StatusCompleted Status = "completed"
	StatusFailed Status = "failed"
)

internal/metrics/metrics.go Normal file

@@ -0,0 +1,125 @@
// Package metrics provides Prometheus metrics for the homelab-deploy listener.
package metrics

import (
	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
	"github.com/prometheus/client_golang/prometheus"
)

// Collector holds all Prometheus metrics for the listener.
type Collector struct {
	deploymentsTotal     *prometheus.CounterVec
	deploymentDuration   *prometheus.HistogramVec
	deploymentInProgress prometheus.Gauge
	info                 *prometheus.GaugeVec
}

// NewCollector creates a new metrics collector and registers it with the given registerer.
func NewCollector(reg prometheus.Registerer) *Collector {
	c := &Collector{
		deploymentsTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "homelab_deploy_deployments_total",
				Help: "Total deployment requests processed",
			},
			[]string{"status", "action", "error_code"},
		),
		deploymentDuration: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name: "homelab_deploy_deployment_duration_seconds",
				Help: "Deployment execution time",
				// Bucket boundaries for typical NixOS build times
				Buckets: []float64{30, 60, 120, 300, 600, 900, 1200, 1800},
			},
			[]string{"action", "success"},
		),
		deploymentInProgress: prometheus.NewGauge(
			prometheus.GaugeOpts{
				Name: "homelab_deploy_deployment_in_progress",
				Help: "1 if deployment running, 0 otherwise",
			},
		),
		info: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "homelab_deploy_info",
				Help: "Static instance metadata",
			},
			[]string{"hostname", "tier", "role", "version"},
		),
	}

	reg.MustRegister(c.deploymentsTotal)
	reg.MustRegister(c.deploymentDuration)
	reg.MustRegister(c.deploymentInProgress)
	reg.MustRegister(c.info)

	c.initMetrics()

	return c
}

// initMetrics initializes all metric label combinations with zero values.
// This ensures metrics appear in Prometheus scrapes before any deployments occur.
func (c *Collector) initMetrics() {
	actions := []messages.Action{
		messages.ActionSwitch,
		messages.ActionBoot,
		messages.ActionTest,
		messages.ActionDryActivate,
	}

	// Initialize deployment counter for common status/action combinations
	for _, action := range actions {
		// Successful completions (no error code)
		c.deploymentsTotal.WithLabelValues("completed", string(action), "")
		// Failed deployments (no error code - from RecordDeploymentEnd)
		c.deploymentsTotal.WithLabelValues("failed", string(action), "")
	}

	// Initialize histogram for all action/success combinations
	for _, action := range actions {
		c.deploymentDuration.WithLabelValues(string(action), "true")
		c.deploymentDuration.WithLabelValues(string(action), "false")
	}
}
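
The effect of pre-initializing label combinations can be illustrated with a toy model. The sketch below does not use the Prometheus client; `toyCounterVec` and its methods are hypothetical names invented for this illustration, and it only models the idea that a labeled series becomes visible to a scrape once its label combination has been created.

```go
package main

import (
	"fmt"
	"sort"
)

// toyCounterVec is a hypothetical stand-in for a labeled counter. A series is
// only visible to a scrape once its label combination has been created.
type toyCounterVec struct {
	series map[string]float64
}

func newToyCounterVec() *toyCounterVec {
	return &toyCounterVec{series: make(map[string]float64)}
}

// withLabels creates the series at zero if it does not exist yet and returns its key.
func (c *toyCounterVec) withLabels(status, action string) string {
	key := status + "/" + action
	if _, ok := c.series[key]; !ok {
		c.series[key] = 0
	}
	return key
}

// inc increments the series for the given labels, creating it if needed.
func (c *toyCounterVec) inc(status, action string) {
	c.series[c.withLabels(status, action)]++
}

// scrape returns the visible series as sorted "key value" lines.
func (c *toyCounterVec) scrape() []string {
	lines := make([]string, 0, len(c.series))
	for k, v := range c.series {
		lines = append(lines, fmt.Sprintf("%s %g", k, v))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	// Without pre-initialization, nothing is visible before the first event.
	lazy := newToyCounterVec()
	fmt.Println(len(lazy.scrape()))

	// With pre-initialization, every combination is visible at zero.
	eager := newToyCounterVec()
	for _, action := range []string{"switch", "boot", "test", "dry-activate"} {
		eager.withLabels("completed", action)
		eager.withLabels("failed", action)
	}
	eager.inc("completed", "switch")
	for _, line := range eager.scrape() {
		fmt.Println(line)
	}
}
```

In this model the pre-initialized vector exposes all eight series, seven of them at zero, which mirrors the expected output that the tests in this change compare against.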

// SetInfo sets the static instance metadata.
func (c *Collector) SetInfo(hostname, tier, role, version string) {
	c.info.WithLabelValues(hostname, tier, role, version).Set(1)
}

// RecordDeploymentStart marks the start of a deployment.
func (c *Collector) RecordDeploymentStart() {
	c.deploymentInProgress.Set(1)
}

// RecordDeploymentEnd records the completion of a deployment.
func (c *Collector) RecordDeploymentEnd(action messages.Action, success bool, durationSeconds float64) {
	c.deploymentInProgress.Set(0)

	successLabel := "false"
	if success {
		successLabel = "true"
	}

	c.deploymentDuration.WithLabelValues(string(action), successLabel).Observe(durationSeconds)

	status := "completed"
	if !success {
		status = "failed"
	}

	c.deploymentsTotal.WithLabelValues(status, string(action), "").Inc()
}

// RecordDeploymentFailure records a deployment failure with an error code.
func (c *Collector) RecordDeploymentFailure(action messages.Action, errorCode messages.ErrorCode, durationSeconds float64) {
	c.deploymentInProgress.Set(0)
	c.deploymentDuration.WithLabelValues(string(action), "false").Observe(durationSeconds)
	c.deploymentsTotal.WithLabelValues("failed", string(action), string(errorCode)).Inc()
}

// RecordRejection records a rejected deployment request.
func (c *Collector) RecordRejection(action messages.Action, errorCode messages.ErrorCode) {
	c.deploymentsTotal.WithLabelValues("rejected", string(action), string(errorCode)).Inc()
}

@@ -0,0 +1,650 @@
package metrics

import (
	"context"
	"io"
	"net/http"
	"strings"
	"testing"
	"time"

	"git.t-juice.club/torjus/homelab-deploy/internal/messages"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestCollector_SetInfo(t *testing.T) {
	reg := prometheus.NewRegistry()
	c := NewCollector(reg)

	c.SetInfo("testhost", "test", "web", "1.0.0")

	expected := `
# HELP homelab_deploy_info Static instance metadata
# TYPE homelab_deploy_info gauge
homelab_deploy_info{hostname="testhost",role="web",tier="test",version="1.0.0"} 1
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "homelab_deploy_info"); err != nil {
		t.Errorf("unexpected metrics: %v", err)
	}
}

func TestCollector_RecordDeploymentStart(t *testing.T) {
	reg := prometheus.NewRegistry()
	c := NewCollector(reg)

	c.RecordDeploymentStart()

	expected := `
# HELP homelab_deploy_deployment_in_progress 1 if deployment running, 0 otherwise
# TYPE homelab_deploy_deployment_in_progress gauge
homelab_deploy_deployment_in_progress 1
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "homelab_deploy_deployment_in_progress"); err != nil {
		t.Errorf("unexpected metrics: %v", err)
	}
}

func TestCollector_RecordDeploymentEnd_Success(t *testing.T) {
	reg := prometheus.NewRegistry()
	c := NewCollector(reg)

	c.RecordDeploymentStart()
	c.RecordDeploymentEnd(messages.ActionSwitch, true, 120.5)

	// Check in_progress is 0
	inProgressExpected := `
# HELP homelab_deploy_deployment_in_progress 1 if deployment running, 0 otherwise
# TYPE homelab_deploy_deployment_in_progress gauge
homelab_deploy_deployment_in_progress 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(inProgressExpected), "homelab_deploy_deployment_in_progress"); err != nil {
		t.Errorf("unexpected in_progress metrics: %v", err)
	}

	// Check counter incremented (includes all pre-initialized metrics)
	counterExpected := `
# HELP homelab_deploy_deployments_total Total deployment requests processed
# TYPE homelab_deploy_deployments_total counter
homelab_deploy_deployments_total{action="boot",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="boot",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="completed"} 1
homelab_deploy_deployments_total{action="switch",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
		t.Errorf("unexpected counter metrics: %v", err)
	}

	// Check histogram recorded the duration (120.5 seconds falls into le="300" and higher buckets)
	histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 1
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 120.5
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
		t.Errorf("unexpected histogram metrics: %v", err)
	}
}

func TestCollector_RecordDeploymentEnd_Failure(t *testing.T) {
	reg := prometheus.NewRegistry()
	c := NewCollector(reg)

	c.RecordDeploymentStart()
	c.RecordDeploymentEnd(messages.ActionBoot, false, 60.0)

	counterExpected := `
# HELP homelab_deploy_deployments_total Total deployment requests processed
# TYPE homelab_deploy_deployments_total counter
homelab_deploy_deployments_total{action="boot",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="boot",error_code="",status="failed"} 1
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
		t.Errorf("unexpected counter metrics: %v", err)
	}

	// Check histogram recorded the duration (60.0 seconds falls into le="60" and higher buckets)
	histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 1
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 60
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
t.Errorf("unexpected histogram metrics: %v", err)
}
}
func TestCollector_RecordDeploymentFailure(t *testing.T) {
reg := prometheus.NewRegistry()
c := NewCollector(reg)
c.RecordDeploymentStart()
c.RecordDeploymentFailure(messages.ActionSwitch, messages.ErrorBuildFailed, 300.0)
counterExpected := `
# HELP homelab_deploy_deployments_total Total deployment requests processed
# TYPE homelab_deploy_deployments_total counter
homelab_deploy_deployments_total{action="boot",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="boot",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="build_failed",status="failed"} 1
homelab_deploy_deployments_total{action="test",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
t.Errorf("unexpected counter metrics: %v", err)
}
// Check histogram recorded the duration (300.0 seconds falls into le="300" and higher buckets)
histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 1
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 300
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 1
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
t.Errorf("unexpected histogram metrics: %v", err)
}
}
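
The histogram expectations in TestCollector_RecordDeploymentFailure rely on Prometheus buckets being cumulative with inclusive upper bounds, which is why an observation of exactly 300 seconds is counted in `le="300"` and in every larger bucket. A minimal, standard-library-only sketch of that counting rule follows. `cumulativeCounts` is a hypothetical helper for illustration; it is not part of the collector or of the Prometheus client library.

```go
package main

import (
	"fmt"
	"math"
)

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it. This mirrors the cumulative "le" semantics of the
// Prometheus text format; it is an illustrative helper, not library code.
func cumulativeCounts(bounds, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds))
	for _, v := range observations {
		for i, b := range bounds {
			if v <= b {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	// Bucket bounds taken from the expected output in the tests, plus +Inf.
	bounds := []float64{30, 60, 120, 300, 600, 900, 1200, 1800, math.Inf(1)}
	counts := cumulativeCounts(bounds, []float64{300.0})
	for i, b := range bounds {
		fmt.Printf("le=%v count=%d\n", b, counts[i])
	}
}
```

Because the comparison is `<=`, a duration that lands exactly on a boundary belongs to that boundary's bucket, which matches the `le="300"` line in the expected output.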
func TestCollector_RecordRejection(t *testing.T) {
reg := prometheus.NewRegistry()
c := NewCollector(reg)
c.RecordRejection(messages.ActionSwitch, messages.ErrorAlreadyRunning)
expected := `
# HELP homelab_deploy_deployments_total Total deployment requests processed
# TYPE homelab_deploy_deployments_total counter
homelab_deploy_deployments_total{action="boot",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="boot",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="already_running",status="rejected"} 1
homelab_deploy_deployments_total{action="test",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "homelab_deploy_deployments_total"); err != nil {
t.Errorf("unexpected metrics: %v", err)
}
}
func TestCollector_MetricsInitializedAtStartup(t *testing.T) {
reg := prometheus.NewRegistry()
_ = NewCollector(reg)
// Verify counter metrics are initialized with zero values before any deployments
counterExpected := `
# HELP homelab_deploy_deployments_total Total deployment requests processed
# TYPE homelab_deploy_deployments_total counter
homelab_deploy_deployments_total{action="boot",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="boot",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="dry-activate",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="switch",error_code="",status="failed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="completed"} 0
homelab_deploy_deployments_total{action="test",error_code="",status="failed"} 0
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(counterExpected), "homelab_deploy_deployments_total"); err != nil {
t.Errorf("counter metrics not initialized: %v", err)
}
// Verify histogram metrics are initialized with zero values before any deployments
histogramExpected := `
# HELP homelab_deploy_deployment_duration_seconds Deployment execution time
# TYPE homelab_deploy_deployment_duration_seconds histogram
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="boot",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="boot",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="dry-activate",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="dry-activate",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="switch",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="switch",success="true"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="false",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="false"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="30"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="60"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="120"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="300"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="600"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="900"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1200"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="1800"} 0
homelab_deploy_deployment_duration_seconds_bucket{action="test",success="true",le="+Inf"} 0
homelab_deploy_deployment_duration_seconds_sum{action="test",success="true"} 0
homelab_deploy_deployment_duration_seconds_count{action="test",success="true"} 0
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(histogramExpected), "homelab_deploy_deployment_duration_seconds"); err != nil {
t.Errorf("histogram metrics not initialized: %v", err)
}
}
func TestServer_StartShutdown(t *testing.T) {
srv := NewServer(ServerConfig{
Addr: ":0", // Let OS pick a free port
})
if err := srv.Start(); err != nil {
t.Fatalf("failed to start server: %v", err)
}
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := srv.Shutdown(ctx); err != nil {
t.Errorf("failed to shutdown server: %v", err)
}
}
func TestServer_Endpoints(t *testing.T) {
srv := NewServer(ServerConfig{
Addr: "127.0.0.1:19972", // Use a fixed port for testing
})
if err := srv.Start(); err != nil {
t.Fatalf("failed to start server: %v", err)
}
defer func() {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_ = srv.Shutdown(ctx)
}()
// Give server time to start
time.Sleep(50 * time.Millisecond)
t.Run("health endpoint", func(t *testing.T) {
resp, err := http.Get("http://127.0.0.1:19972/health")
if err != nil {
t.Fatalf("failed to get health endpoint: %v", err)
}
defer func() { _ = resp.Body.Close() }()
if resp.StatusCode != http.StatusOK {
t.Errorf("expected status 200, got %d", resp.StatusCode)
}
body, _ := io.ReadAll(resp.Body)
if string(body) != "ok" {
t.Errorf("expected body 'ok', got %q", string(body))
}
})
t.Run("metrics endpoint", func(t *testing.T) {
// Set some info to have metrics to display
srv.Collector().SetInfo("testhost", "test", "web", "1.0.0")
resp, err := http.Get("http://127.0.0.1:19972/metrics")
if err != nil {
t.Fatalf("failed to get metrics endpoint: %v", err)
}
defer func() { _ = resp.Body.Close() }()
if resp.StatusCode != http.StatusOK {
t.Errorf("expected status 200, got %d", resp.StatusCode)
}
body, _ := io.ReadAll(resp.Body)
bodyStr := string(body)
if !strings.Contains(bodyStr, "homelab_deploy_info") {
t.Error("expected metrics to contain homelab_deploy_info")
}
})
}
func TestServer_Collector(t *testing.T) {
srv := NewServer(ServerConfig{
Addr: ":0",
})
collector := srv.Collector()
if collector == nil {
t.Error("expected non-nil collector")
}
}
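
TestServer_Endpoints binds a fixed port and sleeps before issuing requests, which can collide with another process or race the listener. One alternative, sketched here with only the standard library, is to bind `127.0.0.1:0` first and read the chosen address back from the listener. The names `startOnFreePort`, `healthHandler`, and `fetch` are hypothetical; the `Server` type in this change does not expose its listener, so adopting this approach would likely need a small API change.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// healthHandler mimics the /health route from the diff for demonstration.
func healthHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})
	return mux
}

// startOnFreePort binds the socket before returning, so the port is known
// and connections are queued by the kernel; no sleep is needed.
func startOnFreePort(h http.Handler) (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	srv := &http.Server{Handler: h, ReadHeaderTimeout: 10 * time.Second}
	go func() { _ = srv.Serve(ln) }()
	stop = func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_ = srv.Shutdown(ctx)
	}
	return ln.Addr().String(), stop, nil
}

// fetch returns the status code and body of a GET request.
func fetch(url string) (int, string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer func() { _ = resp.Body.Close() }()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	addr, stop, err := startOnFreePort(healthHandler())
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer stop()
	code, body, err := fetch("http://" + addr + "/health")
	fmt.Println(code, body, err)
}
```

Binding before serving also surfaces "address already in use" as a returned error, whereas `Start` in this change only logs a failed `ListenAndServe` from its goroutine.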

internal/metrics/server.go

@@ -0,0 +1,102 @@
package metrics
import (
"context"
"fmt"
"log/slog"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// ServerConfig holds configuration for the metrics server.
type ServerConfig struct {
Addr string
Logger *slog.Logger
}
// Server serves Prometheus metrics over HTTP.
type Server struct {
httpServer *http.Server
registry *prometheus.Registry
collector *Collector
logger *slog.Logger
scrapeCh chan struct{}
}
// NewServer creates a new metrics server.
func NewServer(cfg ServerConfig) *Server {
logger := cfg.Logger
if logger == nil {
logger = slog.Default()
}
registry := prometheus.NewRegistry()
collector := NewCollector(registry)
scrapeCh := make(chan struct{}, 1)
metricsHandler := promhttp.HandlerFor(registry, promhttp.HandlerOpts{
Registry: registry,
})
mux := http.NewServeMux()
mux.Handle("/metrics", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
metricsHandler.ServeHTTP(w, r)
// Signal that a scrape occurred (non-blocking)
select {
case scrapeCh <- struct{}{}:
default:
}
}))
mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("ok"))
})
return &Server{
httpServer: &http.Server{
Addr: cfg.Addr,
Handler: mux,
ReadHeaderTimeout: 10 * time.Second,
},
registry: registry,
collector: collector,
logger: logger,
scrapeCh: scrapeCh,
}
}
// Collector returns the metrics collector.
func (s *Server) Collector() *Collector {
return s.collector
}
// ScrapeCh returns a channel that receives a signal each time the metrics endpoint is scraped.
func (s *Server) ScrapeCh() <-chan struct{} {
return s.scrapeCh
}
// Start starts the HTTP server in a goroutine.
func (s *Server) Start() error {
s.logger.Info("starting metrics server", "addr", s.httpServer.Addr)
go func() {
if err := s.httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
s.logger.Error("metrics server error", "error", err)
}
}()
return nil
}
// Shutdown gracefully shuts down the server.
func (s *Server) Shutdown(ctx context.Context) error {
s.logger.Info("shutting down metrics server")
if err := s.httpServer.Shutdown(ctx); err != nil {
return fmt.Errorf("failed to shutdown metrics server: %w", err)
}
return nil
}
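
The `scrapeCh` channel has capacity one and the `/metrics` handler sends on it without blocking, so a caller can wait for the next scrape while bounding the wait. A minimal sketch of such a wait follows. `drain`, `waitForScrape`, and the timeout values are assumptions for illustration; the listener's actual restart logic is not shown in this part of the diff.

```go
package main

import (
	"fmt"
	"time"
)

// drain discards a signal left over from an earlier scrape, so a following
// wait only observes scrapes that happen afterwards.
func drain(scrapeCh <-chan struct{}) {
	select {
	case <-scrapeCh:
	default:
	}
}

// waitForScrape blocks until a scrape signal arrives or the timeout elapses.
// It reports whether a scrape was observed.
func waitForScrape(scrapeCh <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-scrapeCh:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	ch := make(chan struct{}, 1) // same capacity as scrapeCh in NewServer

	// Simulate the /metrics handler signalling shortly after a deployment.
	go func() {
		time.Sleep(10 * time.Millisecond)
		select {
		case ch <- struct{}{}:
		default:
		}
	}()

	drain(ch)
	fmt.Println("scraped:", waitForScrape(ch, time.Second))
}
```

Draining first matters because the buffered slot may still hold a signal from a scrape that happened before the deployment finished; without it, the wait could return immediately on stale data.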


@@ -25,6 +25,15 @@ type Client struct {
 // Connect establishes a connection to NATS using NKey authentication.
 func Connect(cfg Config) (*Client, error) {
+	// Verify NKey file has secure permissions (no group/other access)
+	info, err := os.Stat(cfg.NKeyFile)
+	if err != nil {
+		return nil, fmt.Errorf("failed to stat nkey file: %w", err)
+	}
+	if perm := info.Mode().Perm(); perm&0o077 != 0 {
+		return nil, fmt.Errorf("nkey file has insecure permissions %04o: must not be accessible by group or others", perm)
+	}
 	seed, err := os.ReadFile(cfg.NKeyFile)
 	if err != nil {
 		return nil, fmt.Errorf("failed to read nkey file: %w", err)
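
The check `perm&0o077 != 0` masks the group and other permission bits, so any read, write, or execute access for anyone but the owner causes `Connect` to reject the key file. The sketch below shows how a few common modes fare against that mask. `insecurePerm` is a hypothetical helper that restates the condition; it does not exist in the change itself.

```go
package main

import (
	"fmt"
	"io/fs"
)

// insecurePerm reports whether any group or other permission bit is set.
// 0o077 covers rwx for group (0o070) and rwx for others (0o007).
func insecurePerm(mode fs.FileMode) bool {
	return mode.Perm()&0o077 != 0
}

func main() {
	for _, m := range []fs.FileMode{0o600, 0o400, 0o640, 0o644} {
		fmt.Printf("%04o insecure=%t\n", m.Perm(), insecurePerm(m))
	}
}
```

A key file restricted with `chmod 600` passes the check, while the `0644` file written by TestConnect_InsecureNKeyFilePermissions does not.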


@@ -21,6 +21,29 @@ func TestConnect_InvalidNKeyFile(t *testing.T) {
 	}
 }
+func TestConnect_InsecureNKeyFilePermissions(t *testing.T) {
+	// Create a temp file with insecure permissions
+	tmpDir := t.TempDir()
+	keyFile := filepath.Join(tmpDir, "insecure.nkey")
+	if err := os.WriteFile(keyFile, []byte("test-content"), 0644); err != nil {
+		t.Fatalf("failed to write temp file: %v", err)
+	}
+	cfg := Config{
+		URL:      "nats://localhost:4222",
+		NKeyFile: keyFile,
+		Name:     "test",
+	}
+	_, err := Connect(cfg)
+	if err == nil {
+		t.Error("expected error for insecure nkey file permissions")
+	}
+	if err != nil && !contains(err.Error(), "insecure permissions") {
+		t.Errorf("expected insecure permissions error, got: %v", err)
+	}
+}
 func TestConnect_InvalidNKeySeed(t *testing.T) {
 	// Create a temp file with invalid content
 	tmpDir := t.TempDir()


@@ -1,3 +1,4 @@
+{ self }:
 { config, lib, pkgs, ... }:
 let
@@ -14,14 +15,30 @@ let
     "--discover-subject ${lib.escapeShellArg cfg.discoverSubject}"
   ]
   ++ lib.optional (cfg.role != null) "--role ${lib.escapeShellArg cfg.role}"
-  ++ map (s: "--deploy-subject ${lib.escapeShellArg s}") cfg.deploySubjects);
+  ++ map (s: "--deploy-subject ${lib.escapeShellArg s}") cfg.deploySubjects
+  ++ lib.optionals cfg.metrics.enable [
+    "--metrics-enabled"
+    "--metrics-addr ${lib.escapeShellArg cfg.metrics.address}"
+  ]
+  ++ cfg.extraArgs);
+  # Extract port from metrics address for firewall rule
+  metricsPort = let
+    addr = cfg.metrics.address;
+    # Handle both ":9972" and "0.0.0.0:9972" formats
+    parts = lib.splitString ":" addr;
+  in lib.toInt (lib.last parts);
 in
 {
   options.services.homelab-deploy.listener = {
     enable = lib.mkEnableOption "homelab-deploy listener service";
-    package = lib.mkPackageOption pkgs "homelab-deploy" { };
+    package = lib.mkOption {
+      type = lib.types.package;
+      default = self.packages.${pkgs.system}.homelab-deploy;
+      description = "The homelab-deploy package to use";
+    };
     hostname = lib.mkOption {
       type = lib.types.str;
@@ -89,6 +106,30 @@ in
       description = "Additional environment variables for the service";
       example = { GIT_SSH_COMMAND = "ssh -i /run/secrets/deploy-key"; };
     };
+    metrics = {
+      enable = lib.mkEnableOption "Prometheus metrics endpoint";
+      address = lib.mkOption {
+        type = lib.types.str;
+        default = ":9972";
+        description = "Address for Prometheus metrics HTTP server";
+        example = "127.0.0.1:9972";
+      };
+      openFirewall = lib.mkOption {
+        type = lib.types.bool;
+        default = false;
+        description = "Open firewall for metrics port";
+      };
+    };
+    extraArgs = lib.mkOption {
+      type = lib.types.listOf lib.types.str;
+      default = [ ];
+      description = "Extra command line arguments to pass to the listener";
+      example = [ "--debug" ];
+    };
   };
   config = lib.mkIf cfg.enable {
@@ -98,35 +139,36 @@ in
       after = [ "network-online.target" ];
       wants = [ "network-online.target" ];
-      environment = cfg.environment;
+      # Prevent self-interruption during nixos-rebuild switch
+      # The service will continue running the old version until manually restarted
+      stopIfChanged = false;
+      restartIfChanged = false;
+      environment = cfg.environment // {
+        # Nix needs a writable cache for git flake fetching
+        XDG_CACHE_HOME = "/var/cache/homelab-deploy";
+      };
+      path = [ pkgs.git config.system.build.nixos-rebuild ];
       serviceConfig = {
+        CacheDirectory = "homelab-deploy";
         Type = "simple";
         ExecStart = "${cfg.package}/bin/homelab-deploy listener ${args}";
         Restart = "always";
         RestartSec = 10;
-        # Hardening (compatible with nixos-rebuild requirements)
-        # Note: Some options are relaxed because nixos-rebuild requires:
+        # Minimal hardening - nixos-rebuild requires broad system access:
         # - Write access to /nix/store for building
+        # - Kernel namespace support for nix sandbox builds
         # - Ability to activate system configurations
         # - Network access for fetching from git/cache
-        # - Namespace support for nix sandbox builds
-        NoNewPrivileges = false;
-        ProtectSystem = "false";
-        ProtectHome = "read-only";
-        PrivateTmp = true;
-        PrivateDevices = true;
-        ProtectKernelTunables = true;
-        ProtectKernelModules = true;
-        ProtectControlGroups = true;
-        RestrictAddressFamilies = [ "AF_UNIX" "AF_INET" "AF_INET6" ];
-        RestrictNamespaces = false;
-        RestrictSUIDSGID = true;
-        LockPersonality = true;
-        MemoryDenyWriteExecute = false;
-        SystemCallArchitectures = "native";
+        # Following the approach of nixos auto-upgrade which has no hardening
       };
     };
+    networking.firewall.allowedTCPPorts = lib.mkIf (cfg.metrics.enable && cfg.metrics.openFirewall) [
+      metricsPort
+    ];
   };
 }
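
The module derives the firewall port by taking the text after the last colon of `metrics.address`. For readers more at home in Go, roughly the same parse can be sketched with the standard library. The Go function `metricsPort` below is hypothetical and only shares a name with the Nix binding; unlike the Nix expression, it goes through `net.SplitHostPort`, so it rejects a value that has no colon at all.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// metricsPort extracts the numeric port from a listen address such as
// ":9972" or "0.0.0.0:9972". Hypothetical helper for illustration only.
func metricsPort(addr string) (int, error) {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return 0, fmt.Errorf("invalid metrics address %q: %w", addr, err)
	}
	n, err := strconv.Atoi(port)
	if err != nil {
		return 0, fmt.Errorf("invalid metrics port %q: %w", port, err)
	}
	return n, nil
}

func main() {
	for _, addr := range []string{":9972", "0.0.0.0:9972", "9972"} {
		port, err := metricsPort(addr)
		fmt.Println(addr, port, err)
	}
}
```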