---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---

# Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

## Available Tools

Use the `lab-monitoring` MCP server tools:

**Metrics:**
- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**
- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label

---

## Logs Reference

### Label Reference

Available labels for log queries:

- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path
- `tier` - Deployment tier (`test` or `prod`)
- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
- `level` - Log level mapped from journal `PRIORITY` (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only

### Log Format

Journal logs are JSON-formatted.
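A single journal entry as it appears in Loki looks roughly like this (a sketch only - the field values are illustrative, not taken from a real host):

```json
{
  "MESSAGE": "zone example.home loaded",
  "PRIORITY": "6",
  "SYSLOG_IDENTIFIER": "nsd"
}
```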
Key fields:

- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name

### Basic LogQL Queries

**Logs from a specific service on a host:**

```logql
{hostname="ns1", systemd_unit="nsd.service"}
```

**All logs from a host:**

```logql
{hostname="monitoring02"}
```

**Logs from a service across all hosts:**

```logql
{systemd_unit="nixos-upgrade.service"}
```

**Substring matching (case-sensitive):**

```logql
{hostname="ha1"} |= "error"
```

**Exclude pattern:**

```logql
{hostname="ns1"} != "routine"
```

**Regex matching:**

```logql
{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
```

**Filter by level (journal scrape only):**

```logql
{level="error"}                          # All errors across the fleet
{level=~"critical|error", tier="prod"}   # Prod errors and criticals
{hostname="ns1", level="warning"}        # Warnings from a specific host
```

**Filter by tier/role:**

```logql
{tier="prod"} |= "error"                 # All errors on prod hosts
{role="dns"}                             # All DNS server logs
{tier="test", job="systemd-journal"}     # Journal logs from test hosts
```

**File-based logs (caddy access logs, etc.):**

```logql
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```

### Time Ranges

Default lookback is 1 hour.
Use the `start` parameter for older logs:

- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days

### Common Services

Useful systemd units for troubleshooting:

- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `victoriametrics.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `hostname` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)

**Bootstrap stages:**

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<hostname\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

**Bootstrap queries:**

```logql
{job="bootstrap"}                            # All bootstrap logs
{job="bootstrap", hostname="myhost"}         # Specific host
{job="bootstrap", stage="failed"}            # All failures
{job="bootstrap", stage=~"building|success"} # Track build progress
```

### Extracting JSON Fields

Parse JSON and filter
on fields:

```logql
{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
```

---

## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```promql
nixos_flake_info
```

Labels:

- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```promql
nixos_flake_revision_behind == 1
```

View flake input versions:

```promql
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```promql
nixos_flake_input_age_seconds / 86400
```

Returns age in days for each flake input.

### System Health

Basic host availability:

```promql
up{job="node-exporter"}
```

CPU usage by host:

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```

### Prometheus Jobs

All available Prometheus job names:

**System exporters (on all/most hosts):**
- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics

**Service-specific exporters:**
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN
tunnel metrics (http-proxy)

**Monitoring stack (localhost on monitoring02):**
- `victoriametrics` - VictoriaMetrics self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics

**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)

### Target Labels

All scrape targets have these labels:

**Standard labels:**
- `instance` - Full target address (`<host>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering

**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)

### Filtering by Host

Use the `hostname` label for easy host filtering across all jobs:

```promql
{hostname="ns1"}                     # All metrics from ns1
node_load1{hostname="monitoring02"}  # Specific metric by hostname
up{hostname="ha1"}                   # Check if ha1 is up
```

This is simpler than wildcarding the `instance` label:

```promql
# Old way (still works but verbose)
up{instance=~"monitoring02.*"}

# New way (preferred)
up{hostname="monitoring02"}
```

### Filtering by Role/Tier

Filter hosts by their role or tier:

```promql
up{role="dns"}                             # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
up{tier="test"}                            # All test-tier VMs
up{dns_role="primary"}                     # Primary DNS only (ns1)
```

Current host labels:

| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| kanidm01 | `role=auth`, `tier=test` |
| testvm01/02/03 | `tier=test` |

---
## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`

### Investigate Service Issues

1. Check `up{job="<job>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{hostname="<host>", systemd_unit="<unit>.service"}`
4. Search for errors: `{hostname="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers

### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check service metrics are being scraped

### Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

1. Watch bootstrap logs: `{job="bootstrap", hostname="<host>"}`
2. Check for failures: `{job="bootstrap", hostname="<host>", stage="failed"}`
3. After success, verify the host appears in metrics: `up{hostname="<host>"}`
4. Check logs are flowing: `{hostname="<host>"}`

See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.

### Debug SSH/Access Issues

```logql
{hostname="<host>", systemd_unit="sshd.service"}
```

### Check Recent Upgrades

```logql
{systemd_unit="nixos-upgrade.service"}
```

Use `start: "24h"` to see the last 24 hours of upgrades across all hosts.

---

## Notes

- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use the `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- The log `MESSAGE` field contains the actual log content in JSON-formatted journal logs
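To illustrate the counter-vs-gauge note above: counters only ever increase, so the raw value is rarely meaningful on its own and should be wrapped in `rate()`, while gauges can be read directly. A minimal pair of queries, using metrics that already appear in this guide:

```promql
# Counter: per-second rate of idle CPU time over a 5-minute window
rate(node_cpu_seconds_total{mode="idle"}[5m])

# Gauge: current value, queried directly
node_memory_MemAvailable_bytes
```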