--- name: metrics description: Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health. --- # Metrics Troubleshooting Guide Quick reference for exploring Prometheus metrics to troubleshoot homelab issues. ## Available Tools Use the `lab-monitoring` MCP server tools: - `search_metrics` - Find metrics by name substring - `get_metric_metadata` - Get type/help for a specific metric - `query` - Execute PromQL queries - `list_targets` - Check scrape target health - `list_alerts` - View active alerts ## Key Metrics Reference ### Deployment & Version Status Check which NixOS revision hosts are running: ```promql nixos_flake_info ``` Labels: - `current_rev` - Git commit of the running NixOS configuration - `remote_rev` - Latest commit on the remote repository - `nixpkgs_rev` - Nixpkgs revision used to build the system - `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`) Check if hosts are behind on updates: ```promql nixos_flake_revision_behind == 1 ``` View flake input versions: ```promql nixos_flake_input_info ``` Labels: `input` (name), `rev` (revision), `type` (git/github) Check flake input age: ```promql nixos_flake_input_age_seconds / 86400 ``` Returns age in days for each flake input. ### System Health Basic host availability: ```promql up{job="node-exporter"} ``` CPU usage by host: ```promql 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) ``` Memory usage: ```promql 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) ``` Disk space (root filesystem): ```promql node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ``` ### Service-Specific Metrics Common job names: - `node-exporter` - System metrics (all hosts) - `nixos-exporter` - NixOS version/generation metrics - `caddy` - Reverse proxy metrics - `prometheus` / `loki` / `grafana` - Monitoring stack - `home-assistant` - Home automation - `step-ca` - Internal CA ### Instance Label Format The `instance` label uses FQDN format: ``` .home.2rjus.net: ``` Example queries filtering by host: ```promql up{instance=~"monitoring01.*"} node_load1{instance=~"ns1.*"} ``` ## Troubleshooting Workflows ### Check Deployment Status Across Fleet 1. Query `nixos_flake_info` to see all hosts' current revisions 2. Check `nixos_flake_revision_behind` for hosts needing updates 3. Investigate specific hosts with `nixos_flake_input_info` ### Investigate Service Issues 1. Check `up{job=""}` for scrape failures 2. Use `list_targets` to see target health details 3. Query service-specific metrics 4. Check `list_alerts` for related alerts ### After Deploying Changes 1. Verify `current_rev` updated in `nixos_flake_info` 2. Confirm `nixos_flake_revision_behind == 0` 3. Check service metrics are being scraped ## Notes - Default scrape interval is 15s for most targets - Use `rate()` for counter metrics, direct queries for gauges - The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters