chore: rename metrics skill to observability, add logs reference

Merge Prometheus metrics and Loki logs into a unified troubleshooting skill. Add LogQL query patterns, a label reference, and common service units for log searching.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:17:41 +01:00
parent fcf1a66103
commit b9a269d280
2 changed files with 250 additions and 133 deletions
--- a/.claude/skills/metrics/SKILL.md
+++ b/.claude/skills/metrics/SKILL.md
@@ -1,133 +0,0 @@
 ---
 name: metrics
 description: Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health.
 ---
 # Metrics Troubleshooting Guide
 Quick reference for exploring Prometheus metrics to troubleshoot homelab issues.
 ## Available Tools
 Use the `lab-monitoring` MCP server tools:
 - `search_metrics` - Find metrics by name substring
 - `get_metric_metadata` - Get type/help for a specific metric
 - `query` - Execute PromQL queries
 - `list_targets` - Check scrape target health
 - `list_alerts` - View active alerts
 ## Key Metrics Reference
 ### Deployment & Version Status
 Check which NixOS revision hosts are running:
 ```promql
 nixos_flake_info
 ```
 Labels:
 - `current_rev` - Git commit of the running NixOS configuration
 - `remote_rev` - Latest commit on the remote repository
 - `nixpkgs_rev` - Nixpkgs revision used to build the system
 - `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)
 Check if hosts are behind on updates:
 ```promql
 nixos_flake_revision_behind == 1
 ```
 View flake input versions:
 ```promql
 nixos_flake_input_info
 ```
 Labels: `input` (name), `rev` (revision), `type` (git/github)
 Check flake input age:
 ```promql
 nixos_flake_input_age_seconds / 86400
 ```
 Returns age in days for each flake input.
 ### System Health
 Basic host availability:
 ```promql
 up{job="node-exporter"}
 ```
 CPU usage by host:
 ```promql
 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
 ```
 Memory usage:
 ```promql
 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
 ```
 Disk space (root filesystem):
 ```promql
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```
 ### Service-Specific Metrics
 Common job names:
 - `node-exporter` - System metrics (all hosts)
 - `nixos-exporter` - NixOS version/generation metrics
 - `caddy` - Reverse proxy metrics
 - `prometheus` / `loki` / `grafana` - Monitoring stack
 - `home-assistant` - Home automation
 - `step-ca` - Internal CA
 ### Instance Label Format
 The `instance` label uses FQDN format:
 ```
 <hostname>.home.2rjus.net:<port>
 ```
 Example queries filtering by host:
 ```promql
 up{instance=~"monitoring01.*"}
 node_load1{instance=~"ns1.*"}
 ```
 ## Troubleshooting Workflows
 ### Check Deployment Status Across Fleet
 1. Query `nixos_flake_info` to see all hosts' current revisions
 2. Check `nixos_flake_revision_behind` for hosts needing updates
 3. Investigate specific hosts with `nixos_flake_input_info`
 ### Investigate Service Issues
 1. Check `up{job="<service>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service-specific metrics
 4. Check `list_alerts` for related alerts
 ### After Deploying Changes
 1. Verify `current_rev` updated in `nixos_flake_info`
 2. Confirm `nixos_flake_revision_behind == 0`
 3. Check service metrics are being scraped
 ## Notes
 - Default scrape interval is 15s for most targets
 - Use `rate()` for counter metrics, direct queries for gauges
 - The `instance` label includes the port; use regex matching (`=~`) for hostname-only filters
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -0,0 +1,250 @@
 ---
 name: observability
 description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
 ---
 # Observability Troubleshooting Guide
 Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.
 ## Available Tools
 Use the `lab-monitoring` MCP server tools:
 **Metrics:**
 - `search_metrics` - Find metrics by name substring
 - `get_metric_metadata` - Get type/help for a specific metric
 - `query` - Execute PromQL queries
 - `list_targets` - Check scrape target health
 - `list_alerts` / `get_alert` - View active alerts
 **Logs:**
 - `query_logs` - Execute LogQL queries against Loki
 - `list_labels` - List available log labels
 - `list_label_values` - List values for a specific label
 ---
 ## Logs Reference
 ### Label Reference
 Available labels for log queries:
 - `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
 - `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
 - `filename` - For `varlog` job, the log file path
 - `hostname` - Used in place of `host` on some streams (e.g., `varlog` streams)
 ### Log Format
 Journal logs are JSON-formatted. Key fields:
 - `MESSAGE` - The actual log message
 - `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
 - `SYSLOG_IDENTIFIER` - Program name
 ### Basic LogQL Queries
 **Logs from a specific service on a host:**
 ```logql
 {host="ns1", systemd_unit="nsd.service"}
 ```
 **All logs from a host:**
 ```logql
 {host="monitoring01"}
 ```
 **Logs from a service across all hosts:**
 ```logql
 {systemd_unit="nixos-upgrade.service"}
 ```
 **Substring matching (case-sensitive):**
 ```logql
 {host="ha1"} |= "error"
 ```
 **Exclude pattern:**
 ```logql
 {host="ns1"} != "routine"
 ```
 **Regex matching:**
 ```logql
 {systemd_unit="prometheus.service"} |~ "scrape.*failed"
 ```
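 **Case-insensitive or chained filters:**
 A sketch, assuming standard LogQL semantics: the `(?i)` flag makes a regex filter case-insensitive, and line filters can be chained so that each stage narrows the previous result. The `"healthcheck"` string is only an illustrative placeholder.
 ```logql
 {host="ha1"} |~ "(?i)error"
 {host="ha1"} |~ "(?i)error" != "healthcheck"
 ```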
 **File-based logs (Caddy access logs, etc.):**
 ```logql
 {job="varlog", hostname="nix-cache01"}
 {job="varlog", filename="/var/log/caddy/nix-cache.log"}
 ```
 ### Time Ranges
 Default lookback is 1 hour. Use the `start` parameter for older logs:
 - `start: "1h"` - Last hour (default)
 - `start: "24h"` - Last 24 hours
 - `start: "168h"` - Last 7 days
 ### Common Services
 Useful systemd units for troubleshooting:
 - `nixos-upgrade.service` - Daily auto-upgrade logs
 - `nsd.service` - DNS server (ns1/ns2)
 - `prometheus.service` - Metrics collection
 - `loki.service` - Log aggregation
 - `caddy.service` - Reverse proxy
 - `home-assistant.service` - Home automation
 - `step-ca.service` - Internal CA
 - `openbao.service` - Secrets management
 - `sshd.service` - SSH daemon
 - `nix-gc.service` - Nix garbage collection
 ### Extracting JSON Fields
 Parse JSON and filter on fields:
 ```logql
 {systemd_unit="prometheus.service"} | json | PRIORITY="3"
 ```
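 To show only the message text, or to match a range of priorities, a sketch along these lines may help. It assumes the `json` stage exposes `MESSAGE` and `PRIORITY` as labels, as the filter above implies, and that `PRIORITY` always parses as a number (lower numbers are more severe in syslog):
 ```logql
 {host="ns1", systemd_unit="nsd.service"} | json | line_format "{{.MESSAGE}}"
 {systemd_unit="prometheus.service"} | json | PRIORITY <= 3
 ```
 The second query keeps entries at error severity or worse (priorities 0-3).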
 ---
 ## Metrics Reference
 ### Deployment & Version Status
 Check which NixOS revision hosts are running:
 ```promql
 nixos_flake_info
 ```
 Labels:
 - `current_rev` - Git commit of the running NixOS configuration
 - `remote_rev` - Latest commit on the remote repository
 - `nixpkgs_rev` - Nixpkgs revision used to build the system
 - `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)
 Check if hosts are behind on updates:
 ```promql
 nixos_flake_revision_behind == 1
 ```
 View flake input versions:
 ```promql
 nixos_flake_input_info
 ```
 Labels: `input` (name), `rev` (revision), `type` (git/github)
 Check flake input age:
 ```promql
 nixos_flake_input_age_seconds / 86400
 ```
 Returns age in days for each flake input.
 ### System Health
 Basic host availability:
 ```promql
 up{job="node-exporter"}
 ```
 CPU usage by host:
 ```promql
 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
 ```
 Memory usage:
 ```promql
 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
 ```
 Disk space (root filesystem):
 ```promql
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```
 ### Service-Specific Metrics
 Common job names:
 - `node-exporter` - System metrics (all hosts)
 - `nixos-exporter` - NixOS version/generation metrics
 - `caddy` - Reverse proxy metrics
 - `prometheus` / `loki` / `grafana` - Monitoring stack
 - `home-assistant` - Home automation
 - `step-ca` - Internal CA
 ### Instance Label Format
 The `instance` label uses FQDN format:
 ```
 <hostname>.home.2rjus.net:<port>
 ```
 Example queries filtering by host:
 ```promql
 up{instance=~"monitoring01.*"}
 node_load1{instance=~"ns1.*"}
 ```
 ---
 ## Troubleshooting Workflows
 ### Check Deployment Status Across Fleet
 1. Query `nixos_flake_info` to see all hosts' current revisions
 2. Check `nixos_flake_revision_behind` for hosts needing updates
 3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`
 ### Investigate Service Issues
 1. Check `up{job="<service>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
 4. Search for errors: `{host="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
 ### After Deploying Changes
 1. Verify `current_rev` updated in `nixos_flake_info`
 2. Confirm `nixos_flake_revision_behind == 0`
 3. Check service logs for startup issues
 4. Check service metrics are being scraped
 ### Debug SSH/Access Issues
 ```logql
 {host="<host>", systemd_unit="sshd.service"}
 ```
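 To narrow this to failed logins, one option is a regex line filter. The message strings here are assumptions based on typical OpenSSH output and may differ with the sshd configuration or version:
 ```logql
 {host="<host>", systemd_unit="sshd.service"} |~ "Failed password|Invalid user"
 ```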
 ### Check Recent Upgrades
 ```logql
 {systemd_unit="nixos-upgrade.service"}
 ```
 Use `start: "24h"` to see the last 24 hours of upgrades across all hosts.
 ---
 ## Notes
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h; use the `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
 - The `instance` label includes the port; use regex matching (`=~`) for hostname-only filters
 - Journal log lines are JSON; the `MESSAGE` field holds the actual log text