---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---
# Observability Troubleshooting Guide
Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.
## Available Tools

Use the `lab-monitoring` MCP server tools:

**Metrics:**

- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**

- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label
## Logs Reference

### Label Reference

Available labels for log queries:

- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path
- `tier` - Deployment tier (`test` or `prod`)
- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
- `level` - Log level mapped from journal `PRIORITY` (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only
### Log Format

Journal logs are JSON-formatted. Key fields:

- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name
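As an illustration of these fields, a minimal parser sketch (the `PRIORITY`-to-`level` mapping below follows standard syslog severities and is an assumption about how the journal scrape derives the `level` label; the sample log line is fabricated):

```python
import json

# Assumed mapping from journal PRIORITY to the level label values
# listed above (standard syslog severities 2-7).
PRIORITY_TO_LEVEL = {
    "2": "critical", "3": "error", "4": "warning",
    "5": "notice", "6": "info", "7": "debug",
}

def parse_journal_line(line: str) -> tuple[str, str, str]:
    """Extract program, level, and message from a JSON journal log line."""
    entry = json.loads(line)
    level = PRIORITY_TO_LEVEL.get(entry.get("PRIORITY", ""), "unknown")
    return entry.get("SYSLOG_IDENTIFIER", ""), level, entry.get("MESSAGE", "")

sample = '{"MESSAGE": "zone example reloaded", "PRIORITY": "6", "SYSLOG_IDENTIFIER": "nsd"}'
print(parse_journal_line(sample))  # ('nsd', 'info', 'zone example reloaded')
```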
### Basic LogQL Queries

Logs from a specific service on a host:

```
{hostname="ns1", systemd_unit="nsd.service"}
```

All logs from a host:

```
{hostname="monitoring01"}
```

Logs from a service across all hosts:

```
{systemd_unit="nixos-upgrade.service"}
```

Substring matching (case-sensitive):

```
{hostname="ha1"} |= "error"
```

Exclude pattern:

```
{hostname="ns1"} != "routine"
```

Regex matching:

```
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```

Filter by level (journal scrape only):

```
{level="error"}                        # All errors across the fleet
{level=~"critical|error", tier="prod"} # Prod errors and criticals
{hostname="ns1", level="warning"}      # Warnings from a specific host
```

Filter by tier/role:

```
{tier="prod"} |= "error"               # All errors on prod hosts
{role="dns"}                           # All DNS server logs
{tier="test", job="systemd-journal"}   # Journal logs from test hosts
```

File-based logs (caddy access logs, etc.):

```
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```
### Time Ranges

Default lookback is 1 hour. Use the `start` parameter for older logs:

- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days
### Common Services

Useful systemd units for troubleshooting:

- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection
### Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `hostname` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)
Bootstrap stages:

| Stage | Message | Meaning |
|---|---|---|
| `starting` | `Bootstrap starting for <host> (branch: <branch>)` | Bootstrap service has started |
| `network_ok` | `Network connectivity confirmed` | Can reach git server |
| `vault_ok` | `Vault credentials unwrapped and stored` | AppRole credentials provisioned |
| `vault_skip` | `No Vault token provided - skipping credential setup` | No wrapped token was provided |
| `vault_warn` | `Failed to unwrap Vault token - continuing without secrets` | Token unwrap failed (expired/used) |
| `building` | `Starting nixos-rebuild boot` | NixOS build starting |
| `success` | `Build successful - rebooting into new configuration` | Build complete, rebooting |
| `failed` | `nixos-rebuild failed - manual intervention required` | Build failed |
Bootstrap queries:

```
{job="bootstrap"}                            # All bootstrap logs
{job="bootstrap", hostname="myhost"}         # Specific host
{job="bootstrap", stage="failed"}            # All failures
{job="bootstrap", stage=~"building|success"} # Track build progress
```
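For reference, the bootstrap script's curl push boils down to a JSON body sent to Loki's standard `/loki/api/v1/push` endpoint; a minimal sketch in Python (the Loki URL, port, and helper name are illustrative assumptions, not the actual script):

```python
import json
import time
import urllib.request

def bootstrap_push_body(hostname: str, branch: str, stage: str, message: str) -> bytes:
    """Build a Loki push-API body carrying one bootstrap log line."""
    payload = {
        "streams": [{
            "stream": {"job": "bootstrap", "hostname": hostname,
                       "branch": branch, "stage": stage},
            # Loki expects [<unix ns timestamp as string>, <log line>] pairs
            "values": [[str(time.time_ns()), message]],
        }]
    }
    return json.dumps(payload).encode()

body = bootstrap_push_body("myhost", "main", "building", "Starting nixos-rebuild boot")
# Loki server address and port are assumptions for illustration:
req = urllib.request.Request(
    "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push",
    data=body, headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # left commented: requires network access
```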
### Extracting JSON Fields

Parse JSON and filter on fields:

```
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
```
## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```
nixos_flake_info
```

Labels:

- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```
nixos_flake_revision_behind == 1
```

View flake input versions:

```
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```
nixos_flake_input_age_seconds / 86400
```

Returns age in days for each flake input.
### System Health

Basic host availability:

```
up{job="node-exporter"}
```

CPU usage by host:

```
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
### Prometheus Jobs

All available Prometheus job names:

System exporters (on all/most hosts):

- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics

Service-specific exporters:

- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)

Monitoring stack (localhost on monitoring01):

- `prometheus` - Prometheus self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway

External/infrastructure:

- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)
### Target Labels

All scrape targets have these labels:

Standard labels:

- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering

Host metadata labels (when configured in `homelab.host`):

- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
### Filtering by Host

Use the `hostname` label for easy host filtering across all jobs:

```
{hostname="ns1"}                    # All metrics from ns1
node_load1{hostname="monitoring01"} # Specific metric by hostname
up{hostname="ha1"}                  # Check if ha1 is up
```

This is simpler than wildcarding the `instance` label:

```
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}

# New way (preferred)
up{hostname="monitoring01"}
```
### Filtering by Role/Tier

Filter hosts by their role or tier:

```
up{role="dns"}                            # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01)
up{tier="test"}                           # All test-tier VMs
up{dns_role="primary"}                    # Primary DNS only (ns1)
```
Current host labels:
| Host | Labels |
|---|---|
| ns1 | role=dns, dns_role=primary |
| ns2 | role=dns, dns_role=secondary |
| nix-cache01 | role=build-host |
| vault01 | role=vault |
| kanidm01 | role=auth, tier=test |
| testvm01/02/03 | tier=test |
## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`
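Step 2 can be mechanized; a hypothetical helper that takes an instant-query result for `nixos_flake_revision_behind` (assuming the `query` tool returns the standard Prometheus HTTP API result shape) and lists the hosts needing updates:

```python
def hosts_behind(result: list[dict]) -> list[str]:
    """Return hostnames where nixos_flake_revision_behind == 1.

    `result` follows the Prometheus instant-query vector shape:
    [{"metric": {...labels...}, "value": [<ts>, "<value>"]}, ...]
    """
    return sorted(
        sample["metric"]["hostname"]
        for sample in result
        if float(sample["value"][1]) == 1.0
    )

# Fabricated sample data for illustration:
result = [
    {"metric": {"hostname": "ns1"}, "value": [1700000000, "0"]},
    {"metric": {"hostname": "ns2"}, "value": [1700000000, "1"]},
    {"metric": {"hostname": "vault01"}, "value": [1700000000, "1"]},
]
print(hosts_behind(result))  # ['ns2', 'vault01']
```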
### Investigate Service Issues

1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{hostname="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{hostname="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check service metrics are being scraped
### Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

1. Watch bootstrap logs: `{job="bootstrap", hostname="<hostname>"}`
2. Check for failures: `{job="bootstrap", hostname="<hostname>", stage="failed"}`
3. After success, verify the host appears in metrics: `up{hostname="<hostname>"}`
4. Check logs are flowing: `{hostname="<hostname>"}`

See docs/host-creation.md for the full host creation pipeline.
### Debug SSH/Access Issues

```
{hostname="<host>", systemd_unit="sshd.service"}
```

### Check Recent Upgrades

```
{systemd_unit="nixos-upgrade.service"}
```

Use `start: "24h"` to see the last 24 hours of upgrades across all hosts.
## Notes

- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use the `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- Journal logs are JSON; the `MESSAGE` field contains the actual log content