From fcf1a66103735d1d8000d6da53336b11774bfa71 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= <torjus@usit.uio.no>
Date: Sat, 7 Feb 2026 01:11:40 +0100
Subject: [PATCH] chore: add metrics troubleshooting skill

Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 .claude/skills/metrics/SKILL.md | 133 ++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 .claude/skills/metrics/SKILL.md
diff --git a/.claude/skills/metrics/SKILL.md b/.claude/skills/metrics/SKILL.md
new file mode 100644
index 0000000..254858a
--- /dev/null
+++ b/.claude/skills/metrics/SKILL.md
@@ -0,0 +1,133 @@
+---
+name: metrics
+description: Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health.
+---
+
+# Metrics Troubleshooting Guide
+
+Quick reference for exploring Prometheus metrics to troubleshoot homelab issues.
+
+## Available Tools
+
+Use the `lab-monitoring` MCP server tools:
+- `search_metrics` - Find metrics by name substring
+- `get_metric_metadata` - Get type/help for a specific metric
+- `query` - Execute PromQL queries
+- `list_targets` - Check scrape target health
+- `list_alerts` - View active alerts
+
+## Key Metrics Reference
+
+### Deployment & Version Status
+
+Check which NixOS revision each host is running:
+
+```promql
+nixos_flake_info
+```
+
+Labels:
+- `current_rev` - Git commit of the running NixOS configuration
+- `remote_rev` - Latest commit on the remote repository
+- `nixpkgs_rev` - Nixpkgs revision used to build the system
+- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)
+
+Check if hosts are behind on updates:
+
+```promql
+nixos_flake_revision_behind == 1
+```
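+
+To get a single number rather than a list of hosts, the same selector can be wrapped in `count()`. This is a minimal sketch that assumes the metric is exported once per host; the `or vector(0)` fallback is only there so the query returns 0 instead of an empty result when no host is behind:
+
+```promql
+count(nixos_flake_revision_behind == 1) or vector(0)
+```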
+
+View flake input versions:
+
+```promql
+nixos_flake_input_info
+```
+
+Labels: `input` (name), `rev` (revision), `type` (git/github)
+
+Check flake input age:
+
+```promql
+nixos_flake_input_age_seconds / 86400
+```
+
+Returns age in days for each flake input.
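+
+To surface only stale inputs, a comparison can be appended. The 30-day threshold below is an illustrative assumption, not a value taken from this setup:
+
+```promql
+nixos_flake_input_age_seconds / 86400 > 30
+```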
+
+### System Health
+
+Basic host availability:
+
+```promql
+up{job="node-exporter"}
+```
+
+CPU usage by host (percent):
+
+```promql
+100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+```
+
+Memory usage (fraction of total, 0 to 1):
+
+```promql
+1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
+```
+
+Free disk space on the root filesystem (fraction of total, 0 to 1):
+
+```promql
+node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
+```
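+
+The ratio above can be turned into a filter that returns only hosts running low on space. The 10% cutoff is an assumed example threshold; adjust it to whatever the hosts actually need:
+
+```promql
+node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
+```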
+
+### Service-Specific Metrics
+
+Common job names:
+- `node-exporter` - System metrics (all hosts)
+- `nixos-exporter` - NixOS version/generation metrics
+- `caddy` - Reverse proxy metrics
+- `prometheus` / `loki` / `grafana` - Monitoring stack
+- `home-assistant` - Home automation
+- `step-ca` - Internal CA
+
+### Instance Label Format
+
+The `instance` label is the host's FQDN followed by the scrape port:
+
+```
+<hostname>.home.2rjus.net:<port>
+```
+
+Example queries filtering by host:
+
+```promql
+up{instance=~"monitoring01\\..*"}
+node_load1{instance=~"ns1\\..*"}
+```
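+
+When results should be grouped or displayed by short hostname, `label_replace()` can derive one from `instance`. This is a sketch; the `host` label name is a hypothetical choice and does not exist on the scraped series:
+
+```promql
+label_replace(up{job="node-exporter"}, "host", "$1", "instance", "([^.:]+).*")
+```
+
+The regex is fully anchored, so the capture group takes everything before the first dot or colon and the trailing `.*` consumes the domain and port.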
+
+## Troubleshooting Workflows
+
+### Check Deployment Status Across Fleet
+
+1. Query `nixos_flake_info` to see all hosts' current revisions
+2. Check `nixos_flake_revision_behind` for hosts needing updates
+3. Investigate specific hosts with `nixos_flake_input_info`
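+
+For step 1, grouping by the `current_rev` label gives a quick view of how many hosts run each revision. This is a sketch that assumes one `nixos_flake_info` series per host:
+
+```promql
+count by (current_rev) (nixos_flake_info)
+```
+
+A single result row suggests the whole fleet is on the same commit; several rows point to hosts that have drifted.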
+
+### Investigate Service Issues
+
+1. Check `up{job="<service>"}` for scrape failures
+2. Use `list_targets` to see target health details
+3. Query service-specific metrics
+4. Check `list_alerts` for related alerts
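+
+When the affected service is not yet known, step 1 can be widened to every job. A minimal sketch that counts failing scrape targets per job:
+
+```promql
+count by (job) (up == 0)
+```
+
+An empty result means every target was reachable at its last scrape.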
+
+### After Deploying Changes
+
+1. Verify `current_rev` updated in `nixos_flake_info`
+2. Confirm `nixos_flake_revision_behind == 0`
+3. Check service metrics are being scraped
+
+## Notes
+
+- Default scrape interval is 15s for most targets
+- Use `rate()` for counter metrics; query gauges directly
+- The `instance` label includes the port, so use regex matching (`=~`) to filter by hostname alone