---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---

Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

Available Tools

Use the lab-monitoring MCP server tools:

Metrics:

  • search_metrics - Find metrics by name substring
  • get_metric_metadata - Get type/help for a specific metric
  • query - Execute PromQL queries
  • list_targets - Check scrape target health
  • list_alerts / get_alert - View active alerts

Logs:

  • query_logs - Execute LogQL queries against Loki
  • list_labels - List available log labels
  • list_label_values - List values for a specific label

Logs Reference

Label Reference

Available labels for log queries:

  • host - Hostname (e.g., ns1, monitoring01, ha1)
  • systemd_unit - Systemd unit name (e.g., nsd.service, nixos-upgrade.service)
  • job - Either systemd-journal (most logs) or varlog (file-based logs)
  • filename - For varlog job, the log file path
  • hostname - Alternative to host for some streams

Log Format

Journal logs are JSON-formatted. Key fields:

  • MESSAGE - The actual log message
  • PRIORITY - Syslog priority (6=info, 4=warning, 3=error)
  • SYSLOG_IDENTIFIER - Program name

Basic LogQL Queries

Logs from a specific service on a host:

{host="ns1", systemd_unit="nsd.service"}

All logs from a host:

{host="monitoring01"}

Logs from a service across all hosts:

{systemd_unit="nixos-upgrade.service"}

Substring matching (case-sensitive):

{host="ha1"} |= "error"

Exclude pattern:

{host="ns1"} != "routine"

Regex matching:

{systemd_unit="prometheus.service"} |~ "scrape.*failed"
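
Case-insensitive matching (LogQL regexes use Go regex syntax, so the standard (?i) flag works in |~ and !~):

{host="ha1"} |~ "(?i)error"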

File-based logs (Caddy access logs, etc.):

{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}

Time Ranges

Default lookback is 1 hour. Use the start parameter for older logs:

  • start: "1h" - Last hour (default)
  • start: "24h" - Last 24 hours
  • start: "168h" - Last 7 days
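
Ranges can also appear inside a query itself: LogQL metric functions such as count_over_time aggregate over a range selector, independently of the start parameter. For example, error lines per unit on a host over the last day:

sum by (systemd_unit) (count_over_time({host="ns1"} |= "error" [24h]))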

Common Services

Useful systemd units for troubleshooting:

  • nixos-upgrade.service - Daily auto-upgrade logs
  • nsd.service - DNS server (ns1/ns2)
  • prometheus.service - Metrics collection
  • loki.service - Log aggregation
  • caddy.service - Reverse proxy
  • home-assistant.service - Home automation
  • step-ca.service - Internal CA
  • openbao.service - Secrets management
  • sshd.service - SSH daemon
  • nix-gc.service - Nix garbage collection

Extracting JSON Fields

Parse JSON and filter on fields:

{systemd_unit="prometheus.service"} | json | PRIORITY="3"
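
Label filters after | json can also compare numerically (standard LogQL). Since lower syslog priorities are more severe, this catches everything at error level or worse on a host:

{host="ns1"} | json | PRIORITY <= 3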

Metrics Reference

Deployment & Version Status

Check which NixOS revision hosts are running:

nixos_flake_info

Labels:

  • current_rev - Git commit of the running NixOS configuration
  • remote_rev - Latest commit on the remote repository
  • nixpkgs_rev - Nixpkgs revision used to build the system
  • nixos_version - Full NixOS version string (e.g., 25.11.20260203.e576e3c)

Check if hosts are behind on updates:

nixos_flake_revision_behind == 1

View flake input versions:

nixos_flake_input_info

Labels: input (name), rev (revision), type (git/github)

Check flake input age:

nixos_flake_input_age_seconds / 86400

Returns age in days for each flake input.
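
To inspect a single input, filter on the input label (nixpkgs here is just an example input name):

nixos_flake_input_age_seconds{input="nixpkgs"} / 86400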

System Health

Basic host availability:

up{job="node-exporter"}

CPU usage by host:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage:

1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

Disk space (root filesystem):

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
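
A threshold form turns this into a quick low-space check (the 10% cutoff is an arbitrary example):

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1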

Service-Specific Metrics

Common job names:

  • node-exporter - System metrics (all hosts)
  • nixos-exporter - NixOS version/generation metrics
  • caddy - Reverse proxy metrics
  • prometheus / loki / grafana - Monitoring stack
  • home-assistant - Home automation
  • step-ca - Internal CA

Instance Label Format

The instance label uses FQDN format:

<hostname>.home.2rjus.net:<port>

Example queries filtering by host:

up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}

Troubleshooting Workflows

Check Deployment Status Across Fleet

  1. Query nixos_flake_info to see all hosts' current revisions
  2. Check nixos_flake_revision_behind for hosts needing updates
  3. Look at upgrade logs: {systemd_unit="nixos-upgrade.service"} with start: "24h"
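
The first two steps can also be scripted against query results. A minimal sketch, assuming the standard result shape of a Prometheus instant query for nixos_flake_info (the sample revisions and port below are illustrative, not real fleet state):

```python
def hosts_behind(results):
    """Return hostnames whose running revision differs from the remote.

    `results` is the "result" list from a Prometheus instant-query
    response: one series per host, with current_rev/remote_rev as labels.
    """
    behind = []
    for series in results:
        labels = series["metric"]
        if labels.get("current_rev") != labels.get("remote_rev"):
            # instance looks like "<hostname>.home.2rjus.net:<port>"
            behind.append(labels["instance"].split(".")[0])
    return sorted(behind)

# Illustrative sample data (revisions and port are made up).
sample = [
    {"metric": {"instance": "ns1.home.2rjus.net:9101",
                "current_rev": "abc123", "remote_rev": "def456"}},
    {"metric": {"instance": "monitoring01.home.2rjus.net:9101",
                "current_rev": "def456", "remote_rev": "def456"}},
]

print(hosts_behind(sample))  # → ['ns1']
```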

Investigate Service Issues

  1. Check up{job="<service>"} for scrape failures
  2. Use list_targets to see target health details
  3. Query service logs: {host="<host>", systemd_unit="<service>.service"}
  4. Search for errors: {host="<host>"} |= "error"
  5. Check list_alerts for related alerts

After Deploying Changes

  1. Verify current_rev updated in nixos_flake_info
  2. Confirm nixos_flake_revision_behind == 0
  3. Check service logs for startup issues
  4. Check service metrics are being scraped
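
If the service exposes standard process metrics, a restart can also be confirmed from metrics alone. This assumes the target exports process_start_time_seconds, which most Prometheus client libraries do:

time() - process_start_time_seconds{job="<service>"}

Returns seconds since the process started.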

Debug SSH/Access Issues

{host="<host>", systemd_unit="sshd.service"}

Check Recent Upgrades

{systemd_unit="nixos-upgrade.service"}

Use start: "24h" to see the last 24 hours of upgrades across all hosts.


Notes

  • Default scrape interval is 15s for most metrics targets
  • Default log lookback is 1h - use start parameter for older logs
  • Use rate() for counter metrics, direct queries for gauges
  • The instance label includes the port; use regex matching (=~) for hostname-only filters
  • Journal log lines are JSON; the MESSAGE field holds the actual log content