diff --git a/.claude/skills/metrics/SKILL.md b/.claude/skills/metrics/SKILL.md
deleted file mode 100644
index 254858a..0000000
--- a/.claude/skills/metrics/SKILL.md
+++ /dev/null
@@ -1,133 +0,0 @@
----
-name: metrics
-description: Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health.
----
-
-# Metrics Troubleshooting Guide
-
-Quick reference for exploring Prometheus metrics to troubleshoot homelab issues.
-
-## Available Tools
-
-Use the `lab-monitoring` MCP server tools:
-- `search_metrics` - Find metrics by name substring
-- `get_metric_metadata` - Get type/help for a specific metric
-- `query` - Execute PromQL queries
-- `list_targets` - Check scrape target health
-- `list_alerts` - View active alerts
-
-## Key Metrics Reference
-
-### Deployment & Version Status
-
-Check which NixOS revision hosts are running:
-
-```promql
-nixos_flake_info
-```
-
-Labels:
-- `current_rev` - Git commit of the running NixOS configuration
-- `remote_rev` - Latest commit on the remote repository
-- `nixpkgs_rev` - Nixpkgs revision used to build the system
-- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)
-
-Check if hosts are behind on updates:
-
-```promql
-nixos_flake_revision_behind == 1
-```
-
-View flake input versions:
-
-```promql
-nixos_flake_input_info
-```
-
-Labels: `input` (name), `rev` (revision), `type` (git/github)
-
-Check flake input age:
-
-```promql
-nixos_flake_input_age_seconds / 86400
-```
-
-Returns age in days for each flake input.
-
-### System Health
-
-Basic host availability:
-
-```promql
-up{job="node-exporter"}
-```
-
-CPU usage by host:
-
-```promql
-100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
-```
-
-Memory usage:
-
-```promql
-1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
-```
-
-Disk space (root filesystem):
-
-```promql
-node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
-```
-
-### Service-Specific Metrics
-
-Common job names:
-- `node-exporter` - System metrics (all hosts)
-- `nixos-exporter` - NixOS version/generation metrics
-- `caddy` - Reverse proxy metrics
-- `prometheus` / `loki` / `grafana` - Monitoring stack
-- `home-assistant` - Home automation
-- `step-ca` - Internal CA
-
-### Instance Label Format
-
-The `instance` label uses FQDN format:
-
-```
-.home.2rjus.net:
-```
-
-Example queries filtering by host:
-
-```promql
-up{instance=~"monitoring01.*"}
-node_load1{instance=~"ns1.*"}
-```
-
-## Troubleshooting Workflows
-
-### Check Deployment Status Across Fleet
-
-1. Query `nixos_flake_info` to see all hosts' current revisions
-2. Check `nixos_flake_revision_behind` for hosts needing updates
-3. Investigate specific hosts with `nixos_flake_input_info`
-
-### Investigate Service Issues
-
-1. Check `up{job=""}` for scrape failures
-2. Use `list_targets` to see target health details
-3. Query service-specific metrics
-4. Check `list_alerts` for related alerts
-
-### After Deploying Changes
-
-1. Verify `current_rev` updated in `nixos_flake_info`
-2. Confirm `nixos_flake_revision_behind == 0`
-3. Check service metrics are being scraped
-
-## Notes
-
-- Default scrape interval is 15s for most targets
-- Use `rate()` for counter metrics, direct queries for gauges
-- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
diff --git a/.claude/skills/observability/SKILL.md b/.claude/skills/observability/SKILL.md
new file mode 100644
index 0000000..69be240
--- /dev/null
+++ b/.claude/skills/observability/SKILL.md
@@ -0,0 +1,250 @@
+---
+name: observability
+description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
+---
+
+# Observability Troubleshooting Guide
+
+Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.
+
+## Available Tools
+
+Use the `lab-monitoring` MCP server tools:
+
+**Metrics:**
+- `search_metrics` - Find metrics by name substring
+- `get_metric_metadata` - Get type/help for a specific metric
+- `query` - Execute PromQL queries
+- `list_targets` - Check scrape target health
+- `list_alerts` / `get_alert` - View active alerts
+
+**Logs:**
+- `query_logs` - Execute LogQL queries against Loki
+- `list_labels` - List available log labels
+- `list_label_values` - List values for a specific label
+
+---
+
+## Logs Reference
+
+### Label Reference
+
+Available labels for log queries:
+- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
+- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
+- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
+- `filename` - For `varlog` job, the log file path
+- `hostname` - Alternative to `host` for some streams
+
+### Log Format
+
+Journal logs are JSON-formatted. Key fields:
+- `MESSAGE` - The actual log message
+- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
+- `SYSLOG_IDENTIFIER` - Program name
+
+### Basic LogQL Queries
+
+**Logs from a specific service on a host:**
+```logql
+{host="ns1", systemd_unit="nsd.service"}
+```
+
+**All logs from a host:**
+```logql
+{host="monitoring01"}
+```
+
+**Logs from a service across all hosts:**
+```logql
+{systemd_unit="nixos-upgrade.service"}
+```
+
+**Substring matching (case-sensitive):**
+```logql
+{host="ha1"} |= "error"
+```
+
+**Exclude pattern:**
+```logql
+{host="ns1"} != "routine"
+```
+
+**Regex matching:**
+```logql
+{systemd_unit="prometheus.service"} |~ "scrape.*failed"
+```
+
+**File-based logs (caddy access logs, etc):**
+```logql
+{job="varlog", hostname="nix-cache01"}
+{job="varlog", filename="/var/log/caddy/nix-cache.log"}
+```
+
+### Time Ranges
+
+Default lookback is 1 hour. Use `start` parameter for older logs:
+- `start: "1h"` - Last hour (default)
+- `start: "24h"` - Last 24 hours
+- `start: "168h"` - Last 7 days
+
+### Common Services
+
+Useful systemd units for troubleshooting:
+- `nixos-upgrade.service` - Daily auto-upgrade logs
+- `nsd.service` - DNS server (ns1/ns2)
+- `prometheus.service` - Metrics collection
+- `loki.service` - Log aggregation
+- `caddy.service` - Reverse proxy
+- `home-assistant.service` - Home automation
+- `step-ca.service` - Internal CA
+- `openbao.service` - Secrets management
+- `sshd.service` - SSH daemon
+- `nix-gc.service` - Nix garbage collection
+
+### Extracting JSON Fields
+
+Parse JSON and filter on fields:
+```logql
+{systemd_unit="prometheus.service"} | json | PRIORITY="3"
+```
+
+---
+
+## Metrics Reference
+
+### Deployment & Version Status
+
+Check which NixOS revision hosts are running:
+
+```promql
+nixos_flake_info
+```
+
+Labels:
+- `current_rev` - Git commit of the running NixOS configuration
+- `remote_rev` - Latest commit on the remote repository
+- `nixpkgs_rev` - Nixpkgs revision used to build the system
+- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)
+
+Check if hosts are behind on updates:
+
+```promql
+nixos_flake_revision_behind == 1
+```
+
+View flake input versions:
+
+```promql
+nixos_flake_input_info
+```
+
+Labels: `input` (name), `rev` (revision), `type` (git/github)
+
+Check flake input age:
+
+```promql
+nixos_flake_input_age_seconds / 86400
+```
+
+Returns age in days for each flake input.
+
+### System Health
+
+Basic host availability:
+
+```promql
+up{job="node-exporter"}
+```
+
+CPU usage by host:
+
+```promql
+100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+```
+
+Memory usage:
+
+```promql
+1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
+```
+
+Disk space (root filesystem):
+
+```promql
+node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
+```
+
+### Service-Specific Metrics
+
+Common job names:
+- `node-exporter` - System metrics (all hosts)
+- `nixos-exporter` - NixOS version/generation metrics
+- `caddy` - Reverse proxy metrics
+- `prometheus` / `loki` / `grafana` - Monitoring stack
+- `home-assistant` - Home automation
+- `step-ca` - Internal CA
+
+### Instance Label Format
+
+The `instance` label uses FQDN format:
+
+```
+.home.2rjus.net:
+```
+
+Example queries filtering by host:
+
+```promql
+up{instance=~"monitoring01.*"}
+node_load1{instance=~"ns1.*"}
+```
+
+---
+
+## Troubleshooting Workflows
+
+### Check Deployment Status Across Fleet
+
+1. Query `nixos_flake_info` to see all hosts' current revisions
+2. Check `nixos_flake_revision_behind` for hosts needing updates
+3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`
+
+### Investigate Service Issues
+
+1. Check `up{job=""}` for scrape failures
+2. Use `list_targets` to see target health details
+3. Query service logs: `{host="", systemd_unit=".service"}`
+4. Search for errors: `{host=""} |= "error"`
+5. Check `list_alerts` for related alerts
+
+### After Deploying Changes
+
+1. Verify `current_rev` updated in `nixos_flake_info`
+2. Confirm `nixos_flake_revision_behind == 0`
+3. Check service logs for startup issues
+4. Check service metrics are being scraped
+
+### Debug SSH/Access Issues
+
+```logql
+{host="", systemd_unit="sshd.service"}
+```
+
+### Check Recent Upgrades
+
+```logql
+{systemd_unit="nixos-upgrade.service"}
+```
+
+With `start: "24h"` to see last 24 hours of upgrades across all hosts.
+
+---
+
+## Notes
+
+- Default scrape interval is 15s for most metrics targets
+- Default log lookback is 1h - use `start` parameter for older logs
+- Use `rate()` for counter metrics, direct queries for gauges
+- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Log `MESSAGE` field contains the actual log content in JSON format