Torjus Håkestad 4f593126c0
monitoring01: remove host and migrate services to monitoring02
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:50:20 +01:00


name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.

Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

Available Tools

Use the lab-monitoring MCP server tools:

Metrics:

  • search_metrics - Find metrics by name substring
  • get_metric_metadata - Get type/help for a specific metric
  • query - Execute PromQL queries
  • list_targets - Check scrape target health
  • list_alerts / get_alert - View active alerts

Logs:

  • query_logs - Execute LogQL queries against Loki
  • list_labels - List available log labels
  • list_label_values - List values for a specific label
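
A typical investigation chains these tools together. The call shapes below are illustrative only (the actual invocation syntax depends on the MCP client), though the start parameter matches the one used throughout this guide:

search_metrics("filesystem")                       # find candidate metric names
get_metric_metadata("node_filesystem_avail_bytes") # confirm type and help text
query('up{hostname="ns1"}')                        # check the host is being scraped
query_logs('{hostname="ns1"} |= "error"', start="24h")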

Logs Reference

Label Reference

Available labels for log queries:

  • hostname - Hostname (e.g., ns1, monitoring02, ha1) - matches the Prometheus hostname label
  • systemd_unit - Systemd unit name (e.g., nsd.service, nixos-upgrade.service)
  • job - Either systemd-journal (most logs), varlog (file-based logs), or bootstrap (VM bootstrap logs)
  • filename - For varlog job, the log file path
  • tier - Deployment tier (test or prod)
  • role - Host role (e.g., dns, vault, monitoring) - matches the Prometheus role label
  • level - Log level mapped from journal PRIORITY (critical, error, warning, notice, info, debug) - journal scrape only

Log Format

Journal logs are JSON-formatted. Key fields:

  • MESSAGE - The actual log message
  • PRIORITY - Syslog priority (6=info, 4=warning, 3=error)
  • SYSLOG_IDENTIFIER - Program name
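
For orientation, a raw journal line in Loki looks roughly like this (the field values are illustrative, not taken from a real log):

{"MESSAGE": "zone transfer completed", "PRIORITY": "6", "SYSLOG_IDENTIFIER": "nsd"}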

Basic LogQL Queries

Logs from a specific service on a host:

{hostname="ns1", systemd_unit="nsd.service"}

All logs from a host:

{hostname="monitoring02"}

Logs from a service across all hosts:

{systemd_unit="nixos-upgrade.service"}

Substring matching (case-sensitive):

{hostname="ha1"} |= "error"

Exclude pattern:

{hostname="ns1"} != "routine"

Regex matching:

{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"

Filter by level (journal scrape only):

{level="error"}                                  # All errors across the fleet
{level=~"critical|error", tier="prod"}           # Prod errors and criticals
{hostname="ns1", level="warning"}                # Warnings from a specific host

Filter by tier/role:

{tier="prod"} |= "error"                        # All errors on prod hosts
{role="dns"}                                     # All DNS server logs
{tier="test", job="systemd-journal"}             # Journal logs from test hosts

File-based logs (caddy access logs, etc):

{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}

Time Ranges

Default lookback is 1 hour. Use the start parameter for older logs:

  • start: "1h" - Last hour (default)
  • start: "24h" - Last 24 hours
  • start: "168h" - Last 7 days

Common Services

Useful systemd units for troubleshooting:

  • nixos-upgrade.service - Daily auto-upgrade logs
  • nsd.service - DNS server (ns1/ns2)
  • victoriametrics.service - Metrics collection
  • loki.service - Log aggregation
  • caddy.service - Reverse proxy
  • home-assistant.service - Home automation
  • step-ca.service - Internal CA
  • openbao.service - Secrets management
  • sshd.service - SSH daemon
  • nix-gc.service - Nix garbage collection

Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use job="bootstrap" with additional labels:

  • hostname - Target hostname
  • branch - Git branch being deployed
  • stage - Bootstrap stage (see table below)

Bootstrap stages (stage - log message - meaning):

  • starting - "Bootstrap starting for <host> (branch: <branch>)" - Bootstrap service has started
  • network_ok - "Network connectivity confirmed" - Can reach git server
  • vault_ok - "Vault credentials unwrapped and stored" - AppRole credentials provisioned
  • vault_skip - "No Vault token provided - skipping credential setup" - No wrapped token was provided
  • vault_warn - "Failed to unwrap Vault token - continuing without secrets" - Token unwrap failed (expired/used)
  • building - "Starting nixos-rebuild boot" - NixOS build starting
  • success - "Build successful - rebooting into new configuration" - Build complete, rebooting
  • failed - "nixos-rebuild failed - manual intervention required" - Build failed

Bootstrap queries:

{job="bootstrap"}                              # All bootstrap logs
{job="bootstrap", hostname="myhost"}           # Specific host
{job="bootstrap", stage="failed"}              # All failures
{job="bootstrap", stage=~"building|success"}   # Track build progress

Extracting JSON Fields

Parse JSON and filter on fields:

{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
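
To keep only the message text after parsing, pipe through line_format (a standard LogQL stage; the unit shown is just an example):

{systemd_unit="victoriametrics.service"} | json | PRIORITY="3" | line_format "{{.MESSAGE}}"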

Metrics Reference

Deployment & Version Status

Check which NixOS revision hosts are running:

nixos_flake_info

Labels:

  • current_rev - Git commit of the running NixOS configuration
  • remote_rev - Latest commit on the remote repository
  • nixpkgs_rev - Nixpkgs revision used to build the system
  • nixos_version - Full NixOS version string (e.g., 25.11.20260203.e576e3c)

Check if hosts are behind on updates:

nixos_flake_revision_behind == 1

View flake input versions:

nixos_flake_input_info

Labels: input (name), rev (revision), type (git/github)

Check flake input age:

nixos_flake_input_age_seconds / 86400

Returns age in days for each flake input.
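
A threshold query flags inputs that have gone stale (30 days is an arbitrary cutoff for illustration, not a project convention):

nixos_flake_input_age_seconds / 86400 > 30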

System Health

Basic host availability:

up{job="node-exporter"}

CPU usage by host:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage:

1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

Disk space (root filesystem):

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
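
The same ratio can be turned into a fleet-wide low-space check (the 10% threshold is an example, not an existing alert rule):

node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10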

Prometheus Jobs

All available Prometheus job names:

System exporters (on all/most hosts):

  • node-exporter - System metrics (CPU, memory, disk, network)
  • nixos-exporter - NixOS flake revision and generation info
  • systemd-exporter - Systemd unit status metrics
  • homelab-deploy - Deployment listener metrics

Service-specific exporters:

  • caddy - Reverse proxy metrics (http-proxy)
  • nix-cache_caddy - Nix binary cache metrics
  • home-assistant - Home automation metrics (ha1)
  • jellyfin - Media server metrics (jelly01)
  • kanidm - Authentication server metrics (kanidm01)
  • nats - NATS messaging metrics (nats1)
  • openbao - Secrets management metrics (vault01)
  • unbound - DNS resolver metrics (ns1, ns2)
  • wireguard - VPN tunnel metrics (http-proxy)

Monitoring stack (localhost on monitoring02):

  • victoriametrics - VictoriaMetrics self-metrics
  • loki - Loki self-metrics
  • grafana - Grafana self-metrics
  • alertmanager - Alertmanager metrics

External/infrastructure:

  • pve-exporter - Proxmox hypervisor metrics
  • smartctl - Disk SMART health (gunter)
  • restic_rest - Backup server metrics
  • ghettoptt - PTT service metrics (gunter)

Target Labels

All scrape targets have these labels:

Standard labels:

  • instance - Full target address (<hostname>.home.2rjus.net:<port>)
  • job - Job name (e.g., node-exporter, unbound, nixos-exporter)
  • hostname - Short hostname (e.g., ns1, monitoring02) - use this for host filtering

Host metadata labels (when configured in homelab.host):

  • role - Host role (e.g., dns, build-host, vault)
  • tier - Deployment tier (test for test VMs, absent for prod)
  • dns_role - DNS-specific role (primary or secondary for ns1/ns2)

Filtering by Host

Use the hostname label for easy host filtering across all jobs:

{hostname="ns1"}                    # All metrics from ns1
node_load1{hostname="monitoring02"} # Specific metric by hostname
up{hostname="ha1"}                  # Check if ha1 is up

This is simpler than wildcarding the instance label:

# Old way (still works but verbose)
up{instance=~"monitoring02.*"}

# New way (preferred)
up{hostname="monitoring02"}

Filtering by Role/Tier

Filter hosts by their role or tier:

up{role="dns"}                      # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
up{tier="test"}                     # All test-tier VMs
up{dns_role="primary"}              # Primary DNS only (ns1)

Current host labels:

  • ns1 - role=dns, dns_role=primary
  • ns2 - role=dns, dns_role=secondary
  • nix-cache01 - role=build-host
  • vault01 - role=vault
  • kanidm01 - role=auth, tier=test
  • testvm01/02/03 - tier=test

Troubleshooting Workflows

Check Deployment Status Across Fleet

  1. Query nixos_flake_info to see all hosts' current revisions
  2. Check nixos_flake_revision_behind for hosts needing updates
  3. Look at upgrade logs: {systemd_unit="nixos-upgrade.service"} with start: "24h"
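
To spot revision drift at a glance, group the info metric by revision (this relies on the current_rev label described above):

count by (current_rev) (nixos_flake_info)

Hosts split across more than one current_rev value indicate an incomplete rollout.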

Investigate Service Issues

  1. Check up{job="<service>"} or up{hostname="<host>"} for scrape failures
  2. Use list_targets to see target health details
  3. Query service logs: {hostname="<host>", systemd_unit="<service>.service"}
  4. Search for errors: {hostname="<host>"} |= "error"
  5. Check list_alerts for related alerts
  6. Use role filters for group issues: up{role="dns"} to check all DNS servers
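
Failed units can also be spotted from metrics, assuming the systemd-exporter job exposes the common systemd_unit_state series (verify the exact metric name with search_metrics first):

systemd_unit_state{state="failed"} == 1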

After Deploying Changes

  1. Verify current_rev updated in nixos_flake_info
  2. Confirm nixos_flake_revision_behind == 0
  3. Check service logs for startup issues
  4. Check service metrics are being scraped

Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

  1. Watch bootstrap logs: {job="bootstrap", hostname="<hostname>"}
  2. Check for failures: {job="bootstrap", hostname="<hostname>", stage="failed"}
  3. After success, verify host appears in metrics: up{hostname="<hostname>"}
  4. Check logs are flowing: {hostname="<hostname>"}

See docs/host-creation.md for the full host creation pipeline.

Debug SSH/Access Issues

{hostname="<host>", systemd_unit="sshd.service"}

Check Recent Upgrades

{systemd_unit="nixos-upgrade.service"}

Use start: "24h" to see the last 24 hours of upgrades across all hosts.
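
To surface only problems, add a case-insensitive regex filter:

{systemd_unit="nixos-upgrade.service"} |~ "(?i)error|fail"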


Notes

  • Default scrape interval is 15s for most metrics targets
  • Default log lookback is 1h - use start parameter for older logs
  • Use rate() for counter metrics, direct queries for gauges
  • Use the hostname label to filter metrics by host (simpler than regex on instance)
  • Host metadata labels (role, tier, dns_role) are propagated to all scrape targets
  • Log MESSAGE field contains the actual log content in JSON format
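
As an example of the rate-vs-gauge note above:

rate(node_network_receive_bytes_total[5m])   # counter: per-second receive rate
node_load1                                   # gauge: query directly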