Files
nixos-servers/.claude/skills/metrics/SKILL.md
Torjus Håkestad fcf1a66103
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
chore: add metrics troubleshooting skill
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:11:40 +01:00

134 lines
3.1 KiB
Markdown

---
name: metrics
description: Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health.
---
# Metrics Troubleshooting Guide
Quick reference for exploring Prometheus metrics to troubleshoot homelab issues.
## Available Tools
Use the `lab-monitoring` MCP server tools:
- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` - View active alerts
## Key Metrics Reference
### Deployment & Version Status
Check which NixOS revision hosts are running:
```promql
nixos_flake_info
```
Labels:
- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)
Check if hosts are behind on updates:
```promql
nixos_flake_revision_behind == 1
```
View flake input versions:
```promql
nixos_flake_input_info
```
Labels: `input` (name), `rev` (revision), `type` (git/github)
Check flake input age:
```promql
nixos_flake_input_age_seconds / 86400
```
Returns age in days for each flake input.
### System Health
Basic host availability:
```promql
up{job="node-exporter"}
```
CPU usage by host:
```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
Memory usage:
```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```
Disk space (root filesystem):
```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
### Service-Specific Metrics
Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA
### Instance Label Format
The `instance` label uses FQDN format:
```
<hostname>.home.2rjus.net:<port>
```
Example queries filtering by host:
```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
```
## Troubleshooting Workflows
### Check Deployment Status Across Fleet
1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Investigate specific hosts with `nixos_flake_input_info`
### Investigate Service Issues
1. Check `up{job="<service>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service-specific metrics
4. Check `list_alerts` for related alerts
### After Deploying Changes
1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service metrics are being scraped
## Notes
- Default scrape interval is 15s for most targets
- Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters