Reference guide for exploring Prometheus metrics when troubleshooting homelab issues, including the new nixos_flake_info metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
| name | description |
|---|---|
| metrics | Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health. |
Metrics Troubleshooting Guide
Quick reference for exploring Prometheus metrics to troubleshoot homelab issues.
Available Tools
Use the lab-monitoring MCP server tools:
- search_metrics - Find metrics by name substring
- get_metric_metadata - Get type/help for a specific metric
- query - Execute PromQL queries
- list_targets - Check scrape target health
- list_alerts - View active alerts
Key Metrics Reference
Deployment & Version Status
Check which NixOS revision hosts are running:
nixos_flake_info
Labels:
- current_rev - Git commit of the running NixOS configuration
- remote_rev - Latest commit on the remote repository
- nixpkgs_rev - Nixpkgs revision used to build the system
- nixos_version - Full NixOS version string (e.g., 25.11.20260203.e576e3c)
Check if hosts are behind on updates:
nixos_flake_revision_behind == 1
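A minimal sketch of reading the result of this query outside the MCP tools. It assumes the response follows the standard Prometheus HTTP API instant-query shape (a `data.result` list whose entries carry `metric` labels and a `[timestamp, value]` pair). The `hosts_behind` helper, the sample payload, and the port 9100 are hypothetical.

```python
def hosts_behind(response: dict) -> list[str]:
    """Return instance labels whose nixos_flake_revision_behind sample equals 1."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    behind = []
    for series in response["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the API encodes the value as a string
        if float(raw_value) == 1.0:
            behind.append(series["metric"].get("instance", "<unknown>"))
    return sorted(behind)


# Hypothetical payload for the unfiltered query `nixos_flake_revision_behind`.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "ns1.home.2rjus.net:9100"},
             "value": [1770000000.0, "1"]},
            {"metric": {"instance": "monitoring01.home.2rjus.net:9100"},
             "value": [1770000000.0, "0"]},
        ],
    },
}

print(hosts_behind(sample))
```

With the `== 1` filter applied in PromQL, every returned series is already a host that is behind, so the value check only matters for the unfiltered query.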
View flake input versions:
nixos_flake_input_info
Labels: input (name), rev (revision), type (git/github)
Check flake input age:
nixos_flake_input_age_seconds / 86400
Returns age in days for each flake input.
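The division by 86400 converts seconds to days (60 * 60 * 24). A small sketch of the same arithmetic applied client-side; the input names, the sample ages, and the 7-day threshold are assumptions for illustration only.

```python
SECONDS_PER_DAY = 60 * 60 * 24  # 86400


def age_days(age_seconds: float) -> float:
    """Convert a nixos_flake_input_age_seconds sample to days."""
    return age_seconds / SECONDS_PER_DAY


def stale_inputs(ages: dict[str, float], max_days: float) -> list[str]:
    """Return the names of inputs older than max_days, sorted by name."""
    return sorted(name for name, secs in ages.items() if age_days(secs) > max_days)


# Hypothetical samples keyed by the `input` label.
ages = {"nixpkgs": 1_209_600.0, "sops-nix": 259_200.0}

print(age_days(ages["nixpkgs"]))   # 14.0
print(stale_inputs(ages, max_days=7))
```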
System Health
Basic host availability:
up{job="node-exporter"}
CPU usage by host:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
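To make the CPU expression concrete, here is a rough worked version of the arithmetic for a single host. It is a simplification: the real `rate()` also handles counter resets and extrapolates to the window edges, which this sketch ignores. The function name and the sample counter values are hypothetical.

```python
def cpu_usage_percent(idle_start: list[float], idle_end: list[float],
                      window_seconds: float) -> float:
    """Approximate 100 - avg(rate(idle[window])) * 100 for one host.

    idle_start / idle_end hold the idle-mode counter per CPU core at the
    beginning and end of the window.
    """
    if not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need matching, non-empty per-core samples")
    # Per-core idle rate: idle seconds gained per wall-clock second (0 to 1).
    rates = [(end - start) / window_seconds
             for start, end in zip(idle_start, idle_end)]
    avg_idle_fraction = sum(rates) / len(rates)
    return 100 - avg_idle_fraction * 100


# Two cores over a 5 minute (300 s) window:
# core 0 was idle for 225 s (0.75), core 1 for 75 s (0.25).
usage = cpu_usage_percent([1000.0, 2000.0], [1225.0, 2075.0], 300.0)
print(usage)
```

The average idle fraction here is 0.5, so the host is about 50% busy.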
Memory usage (as a fraction of total, 0 to 1):
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
Free disk space on the root filesystem (as a fraction of total, 0 to 1):
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
Service-Specific Metrics
Common job names:
- node-exporter - System metrics (all hosts)
- nixos-exporter - NixOS version/generation metrics
- caddy - Reverse proxy metrics
- prometheus/loki/grafana - Monitoring stack
- home-assistant - Home automation
- step-ca - Internal CA
Instance Label Format
The instance label uses FQDN format:
<hostname>.home.2rjus.net:<port>
Example queries filtering by host:
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
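A sketch of working with this label format in a script. PromQL regex matchers are fully anchored, so a pattern like `ns1.*` also matches a host named `ns10`; using `[.]` after the hostname avoids that without needing backslash escapes. The helper names and the port 9100 are assumptions, and `host_pattern` assumes hostnames contain only letters, digits, and hyphens.

```python
import re


def split_instance(instance: str) -> tuple[str, int]:
    """Split '<fqdn>:<port>' into (fqdn, port)."""
    fqdn, _, port = instance.rpartition(":")
    if not fqdn or not port.isdigit():
        raise ValueError(f"unexpected instance label: {instance!r}")
    return fqdn, int(port)


def short_hostname(instance: str) -> str:
    """Return the first DNS label, e.g. 'monitoring01'."""
    fqdn, _port = split_instance(instance)
    return fqdn.split(".", 1)[0]


def host_pattern(hostname: str) -> str:
    """Regex for an instance=~ matcher that selects exactly one host."""
    return f"{hostname}[.].*"


label = "monitoring01.home.2rjus.net:9100"
print(split_instance(label))
print(short_hostname(label))
print(f'up{{instance=~"{host_pattern("ns1")}"}}')

# re.fullmatch mimics the anchoring PromQL applies to regex matchers.
print(bool(re.fullmatch("ns1.*", "ns10.home.2rjus.net:9100")))               # True
print(bool(re.fullmatch(host_pattern("ns1"), "ns10.home.2rjus.net:9100")))   # False
```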
Troubleshooting Workflows
Check Deployment Status Across Fleet
- Query nixos_flake_info to see all hosts' current revisions
- Check nixos_flake_revision_behind for hosts needing updates
- Investigate specific hosts with nixos_flake_input_info
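For the first step of this workflow, grouping hosts by their `current_rev` label makes stragglers easy to spot. A minimal sketch, assuming the `result` list has the shape returned by a Prometheus instant query; the helper name, revisions, and port are hypothetical.

```python
from collections import defaultdict


def hosts_by_revision(result: list[dict]) -> dict[str, list[str]]:
    """Group instance labels by the current_rev label of nixos_flake_info."""
    groups: dict[str, list[str]] = defaultdict(list)
    for series in result:
        labels = series["metric"]
        rev = labels.get("current_rev", "<missing>")
        groups[rev].append(labels.get("instance", "<unknown>"))
    return {rev: sorted(hosts) for rev, hosts in groups.items()}


# Hypothetical nixos_flake_info series (revisions are made up).
fleet = [
    {"metric": {"instance": "ns1.home.2rjus.net:9100", "current_rev": "aaaa111"},
     "value": [1770000000.0, "1"]},
    {"metric": {"instance": "monitoring01.home.2rjus.net:9100", "current_rev": "bbbb222"},
     "value": [1770000000.0, "1"]},
    {"metric": {"instance": "ns2.home.2rjus.net:9100", "current_rev": "aaaa111"},
     "value": [1770000000.0, "1"]},
]

for rev, hosts in hosts_by_revision(fleet).items():
    print(rev, hosts)
```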
Investigate Service Issues
- Check up{job="<service>"} for scrape failures
- Use list_targets to see target health details
- Query service-specific metrics
- Check list_alerts for related alerts
After Deploying Changes
- Verify current_rev updated in nixos_flake_info
- Confirm nixos_flake_revision_behind == 0
- Check service metrics are being scraped
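The three post-deploy checks can be folded into one helper once the values have been fetched. This is a sketch only: `deploy_problems` is a hypothetical name, and the comparison assumes the expected revision is written in the same form (full or abbreviated hash) as the `current_rev` label.

```python
def deploy_problems(expected_rev: str, current_rev: str,
                    revision_behind: float, up: float) -> list[str]:
    """Return a list of human-readable problems; empty means the deploy looks good."""
    problems = []
    if current_rev != expected_rev:
        problems.append(f"current_rev is {current_rev}, expected {expected_rev}")
    if revision_behind != 0:
        problems.append("nixos_flake_revision_behind is not 0")
    if up != 1:
        problems.append("scrape target is down")
    return problems


# Hypothetical values for one host after a deploy.
print(deploy_problems("aaaa111", "aaaa111", revision_behind=0, up=1))
print(deploy_problems("aaaa111", "bbbb222", revision_behind=1, up=1))
```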
Notes
- Default scrape interval is 15s for most targets
- Use rate() for counter metrics, direct queries for gauges
- The instance label includes the port, so use regex matching (=~) for hostname-only filters