
---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---

# Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

## Available Tools

Use the lab-monitoring MCP server tools:

**Metrics:**

- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**

- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label

## Logs Reference

### Label Reference

Available labels for log queries:

- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
- `filename` - For the `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams

### Log Format

Journal logs are JSON-formatted. Key fields:

- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name

### Basic LogQL Queries

Logs from a specific service on a host:

```
{host="ns1", systemd_unit="nsd.service"}
```

All logs from a host:

```
{host="monitoring01"}
```

Logs from a service across all hosts:

```
{systemd_unit="nixos-upgrade.service"}
```

Substring matching (case-sensitive):

```
{host="ha1"} |= "error"
```

Exclude a pattern:

```
{host="ns1"} != "routine"
```

Regex matching:

```
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```

File-based logs (caddy access logs, etc.):

```
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```

### Time Ranges

The default lookback is 1 hour. Use the `start` parameter for older logs:

- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days

### Common Services

Useful systemd units for troubleshooting:

- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Extracting JSON Fields

Parse JSON and filter on fields:

```
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
```
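
To reduce the filtered output to just the human-readable text, LogQL's `line_format` can reshape each line using the extracted fields (an illustrative sketch; the field names follow the journal fields listed above):

```
{systemd_unit="prometheus.service"} | json | PRIORITY="3" | line_format "{{.MESSAGE}}"
```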

## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```
nixos_flake_info
```

Labels:

- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```
nixos_flake_revision_behind == 1
```

View flake input versions:

```
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```
nixos_flake_input_age_seconds / 86400
```

Returns the age in days for each flake input.
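
To surface the stalest inputs first, standard PromQL `topk` works on this gauge (illustrative; the cutoff of 5 is arbitrary):

```
topk(5, nixos_flake_input_age_seconds / 86400)
```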

### System Health

Basic host availability:

```
up{job="node-exporter"}
```

CPU usage by host:

```
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
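
As an illustrative variant, the same ratio can flag any filesystem running low rather than just root (the `tmpfs`/`overlay` exclusion is an assumption about which mounts matter here):

```
node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  / node_filesystem_size_bytes < 0.10
```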

### Service-Specific Metrics

Common job names:

- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA

### Target Labels

All scrape targets carry the following labels.

Standard labels:

- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering

Host metadata labels (when configured in `homelab.host`):

- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)

### Filtering by Host

Use the `hostname` label for easy host filtering across all jobs:

```
{hostname="ns1"}                    # All metrics from ns1
node_load1{hostname="monitoring01"} # Specific metric by hostname
up{hostname="ha1"}                  # Check if ha1 is up
```

This is simpler than wildcarding the `instance` label:

```
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}

# New way (preferred)
up{hostname="monitoring01"}
```

### Filtering by Role/Tier

Filter hosts by their role or tier:

```
up{role="dns"}                      # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
up{tier="test"}                     # All test-tier VMs
up{dns_role="primary"}              # Primary DNS only (ns1)
```
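
These labels also make per-group aggregation straightforward, e.g. average non-idle CPU per role (an illustrative sketch; hosts without a configured `role` are excluded from the result):

```
avg by (role) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```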

Current host labels:

| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| testvm01/02/03 | `tier=test` |

## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`
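
The fleet-wide check in step 2 can be collapsed to a single number (a sketch; note the result is an empty vector rather than 0 when no host is behind):

```
count(nixos_flake_revision_behind == 1)
```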

### Investigate Service Issues

1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group-wide issues: `up{role="dns"}` checks all DNS servers at once
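
The error search in step 4 can be quantified with a LogQL metric query to see which host is noisiest, e.g. matching lines per host over the last hour (illustrative):

```
sum by (host) (count_over_time({job="systemd-journal"} |= "error" [1h]))
```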

### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check that service metrics are being scraped

### Debug SSH/Access Issues

```
{host="<host>", systemd_unit="sshd.service"}
```
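
To narrow this to authentication failures, add a line filter (illustrative; "Failed password" is standard sshd log wording for rejected password attempts):

```
{host="<host>", systemd_unit="sshd.service"} |= "Failed password"
```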

### Check Recent Upgrades

```
{systemd_unit="nixos-upgrade.service"}
```

Use `start: "24h"` to see the last 24 hours of upgrades across all hosts.
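
To spot only problem runs, a case-insensitive regex filter helps (illustrative; LogQL regexes use RE2 syntax, where `(?i)` disables case sensitivity):

```
{systemd_unit="nixos-upgrade.service"} |~ "(?i)error|failed"
```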


## Notes

- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use the `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- Journal log lines are JSON; the `MESSAGE` field holds the actual log content