chore: rename metrics skill to observability, add logs reference
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Merge Prometheus metrics and Loki logs into a unified troubleshooting skill.
Adds LogQL query patterns, label reference, and common service units for log searching.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -1,133 +0,0 @@
---
name: metrics
description: Reference guide for exploring Prometheus metrics when troubleshooting homelab issues. Use when investigating system state, deployments, or service health.
---

# Metrics Troubleshooting Guide

Quick reference for exploring Prometheus metrics to troubleshoot homelab issues.

## Available Tools

Use the `lab-monitoring` MCP server tools:
- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` - View active alerts

## Key Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```promql
nixos_flake_info
```

Labels:
- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```promql
nixos_flake_revision_behind == 1
```

View flake input versions:

```promql
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```promql
nixos_flake_input_age_seconds / 86400
```

Returns age in days for each flake input.

### System Health

Basic host availability:

```promql
up{job="node-exporter"}
```

CPU usage by host:

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```

### Service-Specific Metrics

Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA

### Instance Label Format

The `instance` label uses FQDN format:

```
<hostname>.home.2rjus.net:<port>
```

Example queries filtering by host:

```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
```

## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Investigate specific hosts with `nixos_flake_input_info`

### Investigate Service Issues

1. Check `up{job="<service>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service-specific metrics
4. Check `list_alerts` for related alerts

### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service metrics are being scraped

## Notes

- Default scrape interval is 15s for most targets
- Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
250
.claude/skills/observability/SKILL.md
Normal file
@@ -0,0 +1,250 @@
---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---

# Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

## Available Tools

Use the `lab-monitoring` MCP server tools:

**Metrics:**
- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**
- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label

---

## Logs Reference

### Label Reference

Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
- `filename` - For `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams

### Log Format

Journal logs are JSON-formatted. Key fields:
- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name

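As a rough illustration of the fields listed above, the following Python sketch decodes one journal line and prints a short summary. The sample line is made up for illustration, and the `summarize` helper is hypothetical; only the three field names come from this guide. The severity names follow the standard syslog numbering.

```python
import json

# Standard syslog severity names for the numeric PRIORITY values.
PRIORITY_NAMES = {
    "0": "emerg", "1": "alert", "2": "crit", "3": "error",
    "4": "warning", "5": "notice", "6": "info", "7": "debug",
}


def summarize(line: str) -> str:
    """Turn one JSON journal line into 'severity program: message'."""
    entry = json.loads(line)
    # PRIORITY may arrive as a string or a number, so normalize to str.
    severity = PRIORITY_NAMES.get(str(entry.get("PRIORITY")), "unknown")
    program = entry.get("SYSLOG_IDENTIFIER", "?")
    return f"{severity} {program}: {entry.get('MESSAGE', '')}"


# Hypothetical log line, not taken from a real host.
sample = '{"MESSAGE": "zone reload failed", "PRIORITY": "3", "SYSLOG_IDENTIFIER": "nsd"}'
print(summarize(sample))  # error nsd: zone reload failed
```
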
### Basic LogQL Queries

**Logs from a specific service on a host:**
```logql
{host="ns1", systemd_unit="nsd.service"}
```

**All logs from a host:**
```logql
{host="monitoring01"}
```

**Logs from a service across all hosts:**
```logql
{systemd_unit="nixos-upgrade.service"}
```

**Substring matching (case-sensitive):**
```logql
{host="ha1"} |= "error"
```

**Exclude pattern:**
```logql
{host="ns1"} != "routine"
```

**Regex matching:**
```logql
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```

**File-based logs (caddy access logs, etc):**
```logql
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```

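The queries above all share one shape: a stream selector in braces, optionally followed by a line filter. A minimal Python sketch of that shape is shown here; `build_logql` is a hypothetical helper, not part of the `lab-monitoring` server, and it does not escape quotes or backslashes inside values.

```python
from typing import Dict, Optional


def build_logql(labels: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL stream selector, optionally with a |= substring filter.

    Assumes label values and the filter text contain no quotes or
    backslashes; a real helper would need to escape them.
    """
    matchers = ", ".join(f'{name}="{value}"' for name, value in labels.items())
    query = "{" + matchers + "}"
    if contains is not None:
        query += f' |= "{contains}"'
    return query


print(build_logql({"host": "ns1", "systemd_unit": "nsd.service"}))
# {host="ns1", systemd_unit="nsd.service"}
print(build_logql({"host": "ha1"}, contains="error"))
# {host="ha1"} |= "error"
```
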
### Time Ranges

Default lookback is 1 hour. Use the `start` parameter for older logs:
- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days

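The `start` values above are plain hour counts, so longer windows are a matter of multiplying days by 24. A small Python sketch of that arithmetic follows; `lookback` is a hypothetical helper, and it assumes `start` accepts any hour-suffixed string of the form shown in the list.

```python
def lookback(days: int = 0, hours: int = 0) -> str:
    """Return an hour-suffixed lookback string such as '168h'."""
    total_hours = days * 24 + hours
    if total_hours <= 0:
        raise ValueError("lookback must be at least one hour")
    return f"{total_hours}h"


print(lookback(days=7))            # 168h
print(lookback(days=1, hours=12))  # 36h
```
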
### Common Services

Useful systemd units for troubleshooting:
- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Extracting JSON Fields

Parse JSON and filter on fields:
```logql
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
```

---

## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```promql
nixos_flake_info
```

Labels:
- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```promql
nixos_flake_revision_behind == 1
```

View flake input versions:

```promql
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```promql
nixos_flake_input_age_seconds / 86400
```

Returns age in days for each flake input.

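The division by 86400 in the query above is just seconds per day (24 × 60 × 60). The Python sketch below reproduces that conversion on made-up values; the input names and ages are hypothetical, not real query results.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400, the divisor used in the PromQL query


def age_in_days(age_seconds: float) -> float:
    """Convert an age in seconds to days."""
    return age_seconds / SECONDS_PER_DAY


# Hypothetical sample values for two flake inputs.
samples = {"nixpkgs": 1_209_600, "some-other-input": 259_200}
for name, seconds in samples.items():
    print(f"{name}: {age_in_days(seconds)} days")
# nixpkgs: 14.0 days
# some-other-input: 3.0 days
```
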
### System Health

Basic host availability:

```promql
up{job="node-exporter"}
```

CPU usage by host:

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```

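To make the arithmetic behind the CPU and memory expressions above concrete, here is a small Python sketch using made-up numbers. The function names and sample values are hypothetical. Note the units: the CPU expression yields a percentage used, the memory expression a fraction used (0 to 1), and the disk expression a fraction still available rather than used.

```python
from typing import List


def cpu_usage_percent(idle_rates: List[float]) -> float:
    """Mirror of: 100 - avg(idle rate per core) * 100."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


def memory_used_fraction(available_bytes: float, total_bytes: float) -> float:
    """Mirror of: 1 - MemAvailable / MemTotal."""
    return 1 - available_bytes / total_bytes


# Hypothetical host with four cores, idle 75%, 75%, 50% and 100% of the time.
print(cpu_usage_percent([0.75, 0.75, 0.5, 1.0]))  # 25.0

# Hypothetical host with 2 GiB available out of 8 GiB.
print(memory_used_fraction(2 * 2**30, 8 * 2**30))  # 0.75
```
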
### Service-Specific Metrics

Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA

### Instance Label Format

The `instance` label uses FQDN format:

```
<hostname>.home.2rjus.net:<port>
```

Example queries filtering by host:

```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
```

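Prometheus regex matchers are anchored, so the pattern has to match the entire label value, which is why the examples above end in `.*`. The Python sketch below emulates this with `re.fullmatch`. The port `9100` and the host `ns10` are assumptions made up for illustration. It also shows that `ns1.*` would match a host named `ns10`, and that escaping the dot (`ns1\..*`) avoids this.

```python
import re


def matches(pattern: str, instance: str) -> bool:
    """Emulate an anchored =~ matcher: the whole label value must match."""
    return re.fullmatch(pattern, instance) is not None


# Port 9100 and the ns10 host are hypothetical.
print(matches("ns1.*", "ns1.home.2rjus.net:9100"))      # True
print(matches("ns1", "ns1.home.2rjus.net:9100"))        # False: port and domain not covered
print(matches("ns1.*", "ns10.home.2rjus.net:9100"))     # True: also catches ns10
print(matches(r"ns1\..*", "ns10.home.2rjus.net:9100"))  # False: escaped dot is stricter
```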
---

## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`

### Investigate Service Issues

1. Check `up{job="<service>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts

### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check service metrics are being scraped

### Debug SSH/Access Issues

```logql
{host="<host>", systemd_unit="sshd.service"}
```

### Check Recent Upgrades

```logql
{systemd_unit="nixos-upgrade.service"}
```

Pass `start: "24h"` to see the last 24 hours of upgrades across all hosts.

---

## Notes

- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h; use the `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port, so use regex matching (`=~`) for hostname-only filters
- Journal log lines are JSON; the `MESSAGE` field holds the actual log text
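
To illustrate why counters are read through `rate()` rather than directly, the following Python sketch computes a per-second increase from a few counter samples. It is a simplified stand-in under stated assumptions: the real `rate()` also handles counter resets and extrapolates to the window edges, which this hypothetical `simple_rate` helper does not. The sample values are made up.

```python
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase between the first and last (timestamp, value) samples.

    Simplification: ignores counter resets and window extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 15 s: the raw value keeps growing,
# but the rate shows a steady 2 events per second.
samples = [(0.0, 1000.0), (15.0, 1030.0), (30.0, 1060.0)]
print(simple_rate(samples))  # 2.0
```

A gauge such as `node_load1` already reports a current level, so it can be queried directly without this step.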