chore: add metrics troubleshooting skill

Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-07 01:11:40 +01:00

3.1 KiB


name: metrics
description: Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health.

Metrics Troubleshooting Guide

Quick reference for exploring Prometheus metrics to troubleshoot homelab issues.

Available Tools

Use the lab-monitoring MCP server tools:

search_metrics - Find metrics by name substring
get_metric_metadata - Get type/help for a specific metric
query - Execute PromQL queries
list_targets - Check scrape target health
list_alerts - View active alerts

Key Metrics Reference

Deployment & Version Status

Check which NixOS revision hosts are running:

nixos_flake_info

Labels:

current_rev - Git commit of the running NixOS configuration
remote_rev - Latest commit on the remote repository
nixpkgs_rev - Nixpkgs revision used to build the system
nixos_version - Full NixOS version string (e.g., 25.11.20260203.e576e3c)
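
As a sketch of how these labels can be used, the queries below group hosts by label value. They assume nixos_flake_info exposes one series per host, which this guide implies but does not state.

```promql
# Number of hosts on each running configuration commit.
count by (current_rev) (nixos_flake_info)

# Number of hosts on each NixOS version string.
count by (nixos_version) (nixos_flake_info)
```

More than one row in the first result means the fleet is not on a single revision.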

Check if hosts are behind on updates:

nixos_flake_revision_behind == 1

View flake input versions:

nixos_flake_input_info

Labels: input (name), rev (revision), type (git/github)

Check flake input age:

nixos_flake_input_age_seconds / 86400

Returns age in days for each flake input.
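
To narrow the result to stale inputs, a threshold can be added. The 30-day cutoff below is an arbitrary assumption, not a value from this guide.

```promql
# Flake inputs older than 30 days, with age expressed in days.
(nixos_flake_input_age_seconds / 86400) > 30

# Age of the oldest input on each host, in days.
max by (instance) (nixos_flake_input_age_seconds) / 86400
```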

System Health

Basic host availability:

up{job="node-exporter"}

CPU usage by host (percent):

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage (fraction of total, 0 to 1):

1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

Available disk space on the root filesystem (fraction of total, 0 to 1):

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
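
The health queries in this section can be filtered with a comparison so that only problem hosts are returned. The following is a sketch; the 90% and 10% thresholds are illustrative assumptions.

```promql
# Hosts whose node-exporter target is currently down.
up{job="node-exporter"} == 0

# Memory usage as a percentage, only for hosts above 90%.
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90

# Root filesystems with less than 10% of their space available.
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
```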

Service-Specific Metrics

Common job names:

node-exporter - System metrics (all hosts)
nixos-exporter - NixOS version/generation metrics
caddy - Reverse proxy metrics
prometheus / loki / grafana - Monitoring stack
home-assistant - Home automation
step-ca - Internal CA

Instance Label Format

The instance label uses FQDN format:

<hostname>.home.2rjus.net:<port>

Example queries filtering by host:

up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
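
PromQL regex matchers are fully anchored, so a pattern such as ns1.* matches any instance that starts with ns1, including a host named ns10. A stricter sketch requires a literal dot after the hostname. The "host" label name used with label_replace is a hypothetical choice.

```promql
# Match only ns1, not ns10: require a literal dot after the hostname.
node_load1{instance=~"ns1\\..*"}

# Copy the short hostname into a new "host" label for easier grouping.
label_replace(up, "host", "$1", "instance", "([^.:]+).*")
```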

Troubleshooting Workflows

Check Deployment Status Across Fleet

1. Query nixos_flake_info to see all hosts' current revisions
2. Check nixos_flake_revision_behind for hosts needing updates
3. Investigate specific hosts with nixos_flake_input_info
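
The first two steps can be combined into a single query. This is a sketch that assumes both metrics carry the same instance label, which is likely if they come from the same exporter but is not confirmed here. <hostname> is a placeholder.

```promql
# Revision details for only those hosts that are behind the remote.
nixos_flake_info and on (instance) (nixos_flake_revision_behind == 1)

# Flake inputs for one host; replace <hostname> with the host to inspect.
nixos_flake_input_info{instance=~"<hostname>\\..*"}
```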

Investigate Service Issues

1. Check up{job="<service>"} for scrape failures
2. Use list_targets to see target health details
3. Query service-specific metrics
4. Check list_alerts for related alerts
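
The first and last steps can also be expressed as queries. ALERTS is a synthetic series that Prometheus itself exposes for pending and firing alerts; that the query tool returns it like any other series is an assumption.

```promql
# Number of down scrape targets, grouped by job.
count by (job) (up == 0)

# Currently firing alerts, via the built-in ALERTS series.
ALERTS{alertstate="firing"}
```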

After Deploying Changes

1. Verify current_rev updated in nixos_flake_info
2. Confirm nixos_flake_revision_behind == 0
3. Check service metrics are being scraped
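
A possible post-deploy check, assuming nixos_flake_revision_behind is 0 for up-to-date hosts as the step above implies. <hostname> is a placeholder. Allow at least one scrape interval after the deploy before trusting the result.

```promql
# Hosts still reporting as behind; an empty result means every host is current.
nixos_flake_revision_behind != 0

# Every scrape job on one host and whether each target is up.
up{instance=~"<hostname>\\..*"}
```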

Notes

Default scrape interval is 15s for most targets
Use rate() for counter metrics; query gauges directly
The instance label includes the port, so use regex matching (=~) for hostname-only filters
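
To illustrate the counter versus gauge distinction, here is a sketch using standard node-exporter metrics. node_network_receive_bytes_total does not appear elsewhere in this guide and is assumed to be exported. rate() needs at least two samples in its window, so with a 15s scrape interval the range should be comfortably longer than 30s.

```promql
# Counter: per-second receive rate averaged over 5 minutes.
rate(node_network_receive_bytes_total[5m])

# Gauge: query the current value directly.
node_load1
```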
