| name | description |
|---|---|
| observability | Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs. |
# Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.
## Available Tools

Use the `lab-monitoring` MCP server tools:

**Metrics:**

- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**

- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label
## Logs Reference

### Label Reference

Available labels for log queries:

- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path
- `tier` - Deployment tier (`test` or `prod`)
- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
- `level` - Log level mapped from journal `PRIORITY` (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only

### Log Format

Journal logs are JSON-formatted. Key fields:

- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name
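Since journal entries arrive as JSON, a small parser helps when inspecting raw lines outside LogQL. A minimal Python sketch, assuming standard syslog severity numbering — the exact `PRIORITY`-to-`level` mapping used by the journal scrape is an assumption (severities 0-2 are grouped under `critical` here):

```python
import json

# Assumed mapping from syslog PRIORITY to the journal scrape's level label.
PRIORITY_TO_LEVEL = {
    "0": "critical", "1": "critical", "2": "critical",
    "3": "error", "4": "warning", "5": "notice",
    "6": "info", "7": "debug",
}

def parse_journal_line(raw: str) -> dict:
    """Parse one JSON journal log line into the key fields listed above."""
    entry = json.loads(raw)
    return {
        "message": entry.get("MESSAGE", ""),
        "level": PRIORITY_TO_LEVEL.get(entry.get("PRIORITY", "6"), "info"),
        "program": entry.get("SYSLOG_IDENTIFIER", ""),
    }

line = '{"MESSAGE": "zone transfer failed", "PRIORITY": "3", "SYSLOG_IDENTIFIER": "nsd"}'
print(parse_journal_line(line))
# → {'message': 'zone transfer failed', 'level': 'error', 'program': 'nsd'}
```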
### Basic LogQL Queries

Logs from a specific service on a host:

```
{hostname="ns1", systemd_unit="nsd.service"}
```

All logs from a host:

```
{hostname="monitoring02"}
```

Logs from a service across all hosts:

```
{systemd_unit="nixos-upgrade.service"}
```

Substring matching (case-sensitive):

```
{hostname="ha1"} |= "error"
```

Exclude a pattern:

```
{hostname="ns1"} != "routine"
```

Regex matching:

```
{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
```

Filter by level (journal scrape only):

```
{level="error"}                          # All errors across the fleet
{level=~"critical|error", tier="prod"}   # Prod errors and criticals
{hostname="ns1", level="warning"}        # Warnings from a specific host
```

Filter by tier/role:

```
{tier="prod"} |= "error"                 # All errors on prod hosts
{role="dns"}                             # All DNS server logs
{tier="test", job="systemd-journal"}     # Journal logs from test hosts
```

File-based logs (caddy access logs, etc.):

```
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```
### Time Ranges

The default lookback is 1 hour. Use the `start` parameter for older logs:

- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days
### Common Services

Useful systemd units for troubleshooting:

- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `victoriametrics.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection
### Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `hostname` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)

Bootstrap stages:

| Stage | Message | Meaning |
|---|---|---|
| `starting` | Bootstrap starting for `<host>` (branch: `<branch>`) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

Bootstrap queries:

```
{job="bootstrap"}                              # All bootstrap logs
{job="bootstrap", hostname="myhost"}           # Specific host
{job="bootstrap", stage="failed"}              # All failures
{job="bootstrap", stage=~"building|success"}   # Track build progress
```
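The direct-to-Loki push described above can be sketched in Python. This assumes Loki's standard push API on its default port 3100; the `LOKI_PUSH_URL` address is an assumption, not taken from the source:

```python
import json
import time

# Assumed Loki endpoint - adjust host/port for your deployment.
LOKI_PUSH_URL = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"

def bootstrap_push_payload(hostname: str, branch: str, stage: str, message: str) -> str:
    """Build the JSON body for Loki's push API with the bootstrap labels."""
    ts_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    body = {
        "streams": [{
            "stream": {"job": "bootstrap", "hostname": hostname,
                       "branch": branch, "stage": stage},
            "values": [[ts_ns, message]],
        }]
    }
    return json.dumps(body)

# The payload could then be sent with something like:
#   curl -s -H "Content-Type: application/json" -X POST -d "$PAYLOAD" "$LOKI_PUSH_URL"
```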
### Extracting JSON Fields

Parse JSON and filter on fields:

```
{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
```
## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```
nixos_flake_info
```

Labels:

- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```
nixos_flake_revision_behind == 1
```

View flake input versions:

```
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```
nixos_flake_input_age_seconds / 86400
```

Returns the age in days for each flake input.
### System Health

Basic host availability:

```
up{job="node-exporter"}
```

CPU usage by host:

```
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
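These queries can also be run over VictoriaMetrics' Prometheus-compatible HTTP API, outside the MCP tools. A hedged sketch: the host/port in `VM_QUERY_URL` (8428 is VictoriaMetrics' default) and the `hosts_low_on_disk` helper are assumptions, shown against the disk-space query above:

```python
import json
import urllib.parse

# Assumed VictoriaMetrics endpoint - adjust for your deployment.
VM_QUERY_URL = "http://monitoring02.home.2rjus.net:8428/api/v1/query"

def instant_query_url(promql: str) -> str:
    """Build a Prometheus-compatible instant query URL."""
    return VM_QUERY_URL + "?" + urllib.parse.urlencode({"query": promql})

def hosts_low_on_disk(response: dict, threshold: float = 0.10) -> list[str]:
    """From an instant-vector API response, list hostnames whose free-space
    ratio (avail / size, per the query above) is below the threshold."""
    return [
        sample["metric"].get("hostname", sample["metric"].get("instance", "?"))
        for sample in response["data"]["result"]
        if float(sample["value"][1]) < threshold
    ]
```

For example, fetching `instant_query_url('node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}')` and passing the decoded JSON to `hosts_low_on_disk` yields the hosts under 10% free space.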
### Prometheus Jobs

All available Prometheus job names:

System exporters (on all/most hosts):

- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics

Service-specific exporters:

- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)

Monitoring stack (localhost on monitoring02):

- `victoriametrics` - VictoriaMetrics self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics

External/infrastructure:

- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)
### Target Labels

All scrape targets have these labels:

Standard labels:

- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering

Host metadata labels (when configured in `homelab.host`):

- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
### Filtering by Host

Use the `hostname` label for easy host filtering across all jobs:

```
{hostname="ns1"}                      # All metrics from ns1
node_load1{hostname="monitoring02"}   # Specific metric by hostname
up{hostname="ha1"}                    # Check if ha1 is up
```

This is simpler than wildcarding the `instance` label:

```
# Old way (still works but verbose)
up{instance=~"monitoring02.*"}

# New way (preferred)
up{hostname="monitoring02"}
```
### Filtering by Role/Tier

Filter hosts by their role or tier:

```
up{role="dns"}                              # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"}   # Build hosts only (nix-cache01)
up{tier="test"}                             # All test-tier VMs
up{dns_role="primary"}                      # Primary DNS only (ns1)
```

Current host labels:

| Host | Labels |
|---|---|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| kanidm01 | `role=auth`, `tier=test` |
| testvm01/02/03 | `tier=test` |
## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`

### Investigate Service Issues

1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{hostname="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{hostname="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group-wide issues: `up{role="dns"}` checks all DNS servers
### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check that service metrics are being scraped

### Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

1. Watch bootstrap logs: `{job="bootstrap", hostname="<hostname>"}`
2. Check for failures: `{job="bootstrap", hostname="<hostname>", stage="failed"}`
3. After success, verify the host appears in metrics: `up{hostname="<hostname>"}`
4. Check that logs are flowing: `{hostname="<hostname>"}`

See docs/host-creation.md for the full host creation pipeline.
### Debug SSH/Access Issues

```
{hostname="<host>", systemd_unit="sshd.service"}
```

### Check Recent Upgrades

```
{systemd_unit="nixos-upgrade.service"}
```

Use `start: "24h"` to see the last 24 hours of upgrades across all hosts.
## Notes

- The default scrape interval is 15s for most metrics targets
- The default log lookback is 1h - use the `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- The journal `MESSAGE` field contains the actual log content in JSON format