From fcf1a66103735d1d8000d6da53336b11774bfa71 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sat, 7 Feb 2026 01:11:40 +0100
Subject: [PATCH] chore: add metrics troubleshooting skill

Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.

Co-Authored-By: Claude Opus 4.5
---
 .claude/skills/metrics/SKILL.md | 133 ++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 .claude/skills/metrics/SKILL.md

diff --git a/.claude/skills/metrics/SKILL.md b/.claude/skills/metrics/SKILL.md
new file mode 100644
index 0000000..254858a
--- /dev/null
+++ b/.claude/skills/metrics/SKILL.md
@@ -0,0 +1,133 @@
+---
+name: metrics
+description: Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health.
+---
+
+# Metrics Troubleshooting Guide
+
+Quick reference for exploring Prometheus metrics to troubleshoot homelab issues.
+
+## Available Tools
+
+Use the `lab-monitoring` MCP server tools:
+- `search_metrics` - Find metrics by name substring
+- `get_metric_metadata` - Get type/help for a specific metric
+- `query` - Execute PromQL queries
+- `list_targets` - Check scrape target health
+- `list_alerts` - View active alerts
+
+## Key Metrics Reference
+
+### Deployment & Version Status
+
+Check which NixOS revision hosts are running:
+
+```promql
+nixos_flake_info
+```
+
+Labels:
+- `current_rev` - Git commit of the running NixOS configuration
+- `remote_rev` - Latest commit on the remote repository
+- `nixpkgs_rev` - Nixpkgs revision used to build the system
+- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)
+
+Check if hosts are behind on updates:
+
+```promql
+nixos_flake_revision_behind == 1
+```
+
+View flake input versions:
+
+```promql
+nixos_flake_input_info
+```
+
+Labels: `input` (name), `rev` (revision), `type` (git/github)
+
+Check flake input age:
+
+```promql
+nixos_flake_input_age_seconds / 86400
+```
+
+Returns age in days for each flake input.
+
+### System Health
+
+Basic host availability:
+
+```promql
+up{job="node-exporter"}
+```
+
+CPU usage by host:
+
+```promql
+100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+```
+
+Memory usage:
+
+```promql
+1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
+```
+
+Disk space (root filesystem):
+
+```promql
+node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
+```
+
+### Service-Specific Metrics
+
+Common job names:
+- `node-exporter` - System metrics (all hosts)
+- `nixos-exporter` - NixOS version/generation metrics
+- `caddy` - Reverse proxy metrics
+- `prometheus` / `loki` / `grafana` - Monitoring stack
+- `home-assistant` - Home automation
+- `step-ca` - Internal CA
+
+### Instance Label Format
+
+The `instance` label uses FQDN format:
+
+```
+.home.2rjus.net:
+```
+
+Example queries filtering by host:
+
+```promql
+up{instance=~"monitoring01.*"}
+node_load1{instance=~"ns1.*"}
+```
+
+## Troubleshooting Workflows
+
+### Check Deployment Status Across Fleet
+
+1. Query `nixos_flake_info` to see all hosts' current revisions
+2. Check `nixos_flake_revision_behind` for hosts needing updates
+3. Investigate specific hosts with `nixos_flake_input_info`
+
+### Investigate Service Issues
+
+1. Check `up{job=""}` for scrape failures
+2. Use `list_targets` to see target health details
+3. Query service-specific metrics
+4. Check `list_alerts` for related alerts
+
+### After Deploying Changes
+
+1. Verify `current_rev` updated in `nixos_flake_info`
+2. Confirm `nixos_flake_revision_behind == 0`
+3. Check service metrics are being scraped
+
+## Notes
+
+- Default scrape interval is 15s for most targets
+- Use `rate()` for counter metrics, direct queries for gauges
+- The `instance` label includes the port; use regex matching (`=~`) for hostname-only filters
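
The last two notes above can be sketched as a pair of queries: a counter wrapped in `rate()` and a gauge queried directly, both filtered by host name while ignoring the port. This is only an illustration, not part of the committed file. `node_network_receive_bytes_total` is a standard node-exporter counter, but whether it is scraped in this homelab is an assumption. PromQL regex matchers are fully anchored, so the shorter pattern `ns1.*` would also match a host named `ns10`; spelling out the domain avoids that.

```promql
# Counter: per-second receive rate over the last 5 minutes
# (metric assumed to be exported by node-exporter on this host)
rate(node_network_receive_bytes_total{instance=~"ns1\\.home\\.2rjus\\.net:.*"}[5m])

# Gauge: queried directly, with the same hostname-only filter
node_load1{instance=~"ns1\\.home\\.2rjus\\.net:.*"}
```

The doubled backslashes are needed because `\\` inside a double-quoted PromQL string yields a single backslash, which then escapes the dot in the regular expression.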