feat: add Prometheus metrics to listener service
Add an optional Prometheus metrics HTTP endpoint to the listener for
monitoring deployment operations. Includes four metrics:

- homelab_deploy_deployments_total (counter with status/action/error_code)
- homelab_deploy_deployment_duration_seconds (histogram with action/success)
- homelab_deploy_deployment_in_progress (gauge)
- homelab_deploy_info (gauge with hostname/tier/role/version)

New CLI flags: --metrics-enabled, --metrics-addr (default :9972)
New NixOS options: metrics.enable, metrics.address, metrics.openFirewall

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
79
README.md
@@ -61,6 +61,8 @@ homelab-deploy listener \
| `--timeout` | No | Deployment timeout in seconds (default: 600) |
| `--deploy-subject` | No | NATS subjects to subscribe to (repeatable) |
| `--discover-subject` | No | Discovery subject (default: `deploy.discover`) |
| `--metrics-enabled` | No | Enable Prometheus metrics endpoint |
| `--metrics-addr` | No | Metrics HTTP server address (default: `:9972`) |

#### Subject Templates

@@ -209,6 +211,9 @@ Add the module to your NixOS configuration:
| `deploySubjects` | list of string | see below | Subjects to subscribe to |
| `discoverSubject` | string | `"deploy.discover"` | Discovery subject |
| `environment` | attrs | `{}` | Additional environment variables |
| `metrics.enable` | bool | `false` | Enable Prometheus metrics endpoint |
| `metrics.address` | string | `":9972"` | Metrics HTTP server address |
| `metrics.openFirewall` | bool | `false` | Open firewall for metrics port |

Default `deploySubjects`:
```nix
@@ -219,6 +224,80 @@ Default `deploySubjects`:
]
```

## Prometheus Metrics

The listener can expose Prometheus metrics for monitoring deployment operations.

### Enabling Metrics

**CLI:**
```bash
homelab-deploy listener \
  --hostname myhost \
  --tier prod \
  --nats-url nats://nats.example.com:4222 \
  --nkey-file /run/secrets/listener.nkey \
  --flake-url git+https://git.example.com/user/nixos-configs.git \
  --metrics-enabled \
  --metrics-addr :9972
```

**NixOS module:**
```nix
services.homelab-deploy.listener = {
  enable = true;
  tier = "prod";
  natsUrl = "nats://nats.example.com:4222";
  nkeyFile = "/run/secrets/homelab-deploy-nkey";
  flakeUrl = "git+https://git.example.com/user/nixos-configs.git";
  metrics = {
    enable = true;
    address = ":9972";
    openFirewall = true; # Optional: open firewall for Prometheus scraping
  };
};
```

### Available Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `homelab_deploy_deployments_total` | Counter | `status`, `action`, `error_code` | Total deployment requests processed |
| `homelab_deploy_deployment_duration_seconds` | Histogram | `action`, `success` | Deployment execution time |
| `homelab_deploy_deployment_in_progress` | Gauge | - | 1 if deployment running, 0 otherwise |
| `homelab_deploy_info` | Gauge | `hostname`, `tier`, `role`, `version` | Static instance metadata |

**Label values:**
- `status`: `completed`, `failed`, `rejected`
- `action`: `switch`, `boot`, `test`, `dry-activate`
- `error_code`: `invalid_action`, `invalid_revision`, `already_running`, `build_failed`, `timeout`, or empty
- `success`: `true`, `false`

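To make the label scheme above concrete, here is a minimal Python sketch that parses a hand-written sample of what the counter could look like in a scrape and computes the share of completed deployments. The sample values and the helper names `parse_samples` and `success_ratio` are illustrative assumptions, not output captured from the listener, and the parser only handles the simple quoted-label lines shown here.

```python
import re

# Hypothetical scrape output; the counts are made up for illustration.
SAMPLE = '''\
# TYPE homelab_deploy_deployments_total counter
homelab_deploy_deployments_total{action="switch",error_code="",status="completed"} 8
homelab_deploy_deployments_total{action="switch",error_code="build_failed",status="failed"} 1
homelab_deploy_deployments_total{action="boot",error_code="timeout",status="failed"} 1
homelab_deploy_deployments_total{action="switch",error_code="already_running",status="rejected"} 3
'''

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples.

    Only labelled samples without escaped quotes are handled;
    comment lines and anything else are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def success_ratio(samples):
    """Completed / (completed + failed); rejected requests are left out."""
    totals = {"completed": 0.0, "failed": 0.0}
    for name, labels, value in samples:
        if name == "homelab_deploy_deployments_total" and labels.get("status") in totals:
            totals[labels["status"]] += value
    attempted = totals["completed"] + totals["failed"]
    return totals["completed"] / attempted if attempted else None


if __name__ == "__main__":
    print(success_ratio(parse_samples(SAMPLE)))
```

Rejected requests are excluded from the ratio here because they never ran a deployment; whether that is the right denominator depends on what you want the number to mean.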
### HTTP Endpoints

| Endpoint | Description |
|----------|-------------|
| `/metrics` | Prometheus metrics in text format |
| `/health` | Health check (returns `ok`) |

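A liveness probe against `/health` can be sketched with the Python standard library as follows. The function name `check_health` is hypothetical, and the sketch assumes the endpoint answers with HTTP 200 and a body of `ok` (possibly followed by a newline); only the `ok` body is stated above, the status code is an assumption.

```python
import urllib.request


def check_health(base_url, timeout=5.0):
    """Return True if GET <base_url>/health answers 200 with body 'ok'."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace").strip()
            return response.status == 200 and body == "ok"
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False
```

With a listener running locally on the default port, `check_health("http://localhost:9972")` would be the call to make; any connection failure is reported as `False` rather than raised.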
### Example Prometheus Queries

```promql
# Average deployment duration (last hour)
rate(homelab_deploy_deployment_duration_seconds_sum[1h]) /
  rate(homelab_deploy_deployment_duration_seconds_count[1h])

# Deployment success rate (last 24 hours)
sum(rate(homelab_deploy_deployments_total{status="completed"}[24h])) /
  sum(rate(homelab_deploy_deployments_total{status=~"completed|failed"}[24h]))

# 95th percentile deployment time
histogram_quantile(0.95, rate(homelab_deploy_deployment_duration_seconds_bucket[1h]))

# Currently running deployments across all hosts
sum(homelab_deploy_deployment_in_progress)
```

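The 95th percentile query relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The Python sketch below reimplements that idea for classic histograms to show where the number comes from. It is a simplification, not Prometheus's code: it raises on malformed input, and the bucket boundaries in the usage note are assumptions, since the listener's actual buckets are not listed here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include an upper bound of math.inf, like the
    le="+Inf" series of a Prometheus histogram.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last one +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)

    if index == len(buckets) - 1:
        # The quantile falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    upper, count = buckets[index]
    if index == 0:
        if upper <= 0:
            return upper
        lower, below = 0.0, 0.0  # durations cannot be negative
    else:
        lower, below = buckets[index - 1]

    if count == below:
        return lower  # empty bucket (only reachable for q = 0)
    # Linear interpolation, assuming observations are spread evenly in the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

For example, with assumed cumulative buckets `[(30, 2), (60, 5), (120, 9), (300, 10), (math.inf, 10)]`, the 0.95 quantile lands halfway into the 120 to 300 second bucket, so the estimate is about 210 seconds even if the single slow deployment actually took 130. Coarse buckets make the percentile correspondingly coarse.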
## Message Protocol

### Deploy Request
