Deployment metrics not being recorded #1

Closed
opened 2026-02-08 23:58:29 +00:00 by torjus · 1 comment
Owner

Bug: Deployment metrics not being recorded

Summary

The homelab_deploy_deployment_duration_seconds histogram and homelab_deploy_deployments_total counter metrics are never populated with deployment data. All values remain at 0 even after successful deployments.

Environment

  • homelab-deploy version: 0.1.12
  • Tested on: testvm01, testvm02, testvm03, monitoring02, kanidm01

Steps to Reproduce

  1. Deploy to test hosts: deploy(all=true, action="switch")
  2. Wait 60 seconds for Prometheus scrape
  3. Query metrics:
    homelab_deploy_deployments_total{tier="test", action="switch"}
    homelab_deploy_deployment_duration_seconds_count{tier="test", action="switch"}
    homelab_deploy_deployment_duration_seconds_sum{tier="test", action="switch"}
    

Expected Behavior

After a successful deployment:

  • deployments_total{status="completed"} should increment by 1
  • deployment_duration_seconds_count{success="true"} should increment by 1
  • deployment_duration_seconds_sum{success="true"} should increase by the deployment duration in seconds

Actual Behavior

All metric values are 0:

  • deployments_total: 0
  • deployment_duration_seconds_count: 0
  • deployment_duration_seconds_sum: 0

The metric labels (action, success, hostname, tier, etc.) are present, indicating the metrics are registered correctly. However, no observations are being recorded.

Working Functionality

The following features work correctly:

  • Deployment execution completes successfully
  • Logs show: "deployment completed successfully" with exit_code: 0
  • Logs show: "waiting for metrics scrape before restart"
  • Logs show: "metrics scraped, proceeding with restart"
  • Logs show: "exiting for restart after successful switch deployment"
  • homelab_deploy_info metric is populated correctly with version 0.1.12

Root Cause Hypothesis

The histogram .Observe() and counter .Inc() methods are not being called after deployment completion. The scrape-wait logic is functioning, but there's nothing to scrape because the metrics were never updated.

Additional Context

Verified using max_over_time(...[10m]) that Prometheus never captured non-zero values, confirming the metrics are never populated rather than being reset before scrape.

Author
Owner

Testing Debug Logging for Metrics

Update your flake input to use the branch:

inputs.homelab-deploy.url = "github:torjus/homelab-deploy/fix/metrics-not-recorded";

Then enable debug logging in your NixOS config:

services.homelab-deploy.listener = {
  # ... existing config ...
  extraArgs = [ "--debug" ];
};

After rebuilding and running a deployment, check the logs:

journalctl -u homelab-deploy-listener -f

You should see entries like:

{"level":"DEBUG","msg":"recording deployment end metric (success)","action":"switch","success":true,"duration_seconds":120.5}

If the entries contain "metrics_enabled":false, or no debug messages appear at all, report which one you see; either outcome helps pinpoint the issue.

This repo is archived. You cannot comment on issues.
Reference: torjus/homelab-deploy#1