---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---

# Observability Troubleshooting Guide
Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

## Available Tools

Use the `lab-monitoring` MCP server tools:

**Metrics:**
- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**
- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label
---

## Logs Reference

### Label Reference

Available labels for log queries:
- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path
- `tier` - Deployment tier (`test` or `prod`)
- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
- `level` - Log level mapped from journal `PRIORITY` (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only
### Log Format

Journal logs are JSON-formatted. Key fields:
- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name
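
The fields above can be surfaced with LogQL's `json` parser; for example, to show only messages from a given program (a sketch - `sshd` is just an illustrative identifier):

```logql
{job="systemd-journal", hostname="ns1"} | json | SYSLOG_IDENTIFIER="sshd" | line_format "{{.MESSAGE}}"
```

`line_format` rewrites each line to just the message text, which is easier to read than raw JSON.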

### Basic LogQL Queries

**Logs from a specific service on a host:**
```logql
{hostname="ns1", systemd_unit="nsd.service"}
```

**All logs from a host:**
```logql
{hostname="monitoring01"}
```

**Logs from a service across all hosts:**
```logql
{systemd_unit="nixos-upgrade.service"}
```

**Substring matching (case-sensitive):**
```logql
{hostname="ha1"} |= "error"
```

**Exclude pattern:**
```logql
{hostname="ns1"} != "routine"
```

**Regex matching:**
```logql
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```
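
Line filters chain left to right, so the patterns above combine in a single query (a sketch):

```logql
{hostname="ha1"} |= "error" != "routine" |~ "timeout|refused"
```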

**Filter by level (journal scrape only):**
```logql
{level="error"}                          # All errors across the fleet
{level=~"critical|error", tier="prod"}   # Prod errors and criticals
{hostname="ns1", level="warning"}        # Warnings from a specific host
```

**Filter by tier/role:**
```logql
{tier="prod"} |= "error"                 # All errors on prod hosts
{role="dns"}                             # All DNS server logs
{tier="test", job="systemd-journal"}     # Journal logs from test hosts
```

**File-based logs (Caddy access logs, etc.):**
```logql
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```
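
These selectors also work in metric-style queries; for example, to count prod errors per host over 5-minute windows (a sketch):

```logql
sum by (hostname) (count_over_time({tier="prod", level="error"}[5m]))
```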

### Time Ranges

Default lookback is 1 hour. Use the `start` parameter for older logs:
- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days

### Common Services

Useful systemd units for troubleshooting:
- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `hostname` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)

**Bootstrap stages:**

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach the git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired or already used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

**Bootstrap queries:**

```logql
{job="bootstrap"}                              # All bootstrap logs
{job="bootstrap", hostname="myhost"}           # Specific host
{job="bootstrap", stage="failed"}              # All failures
{job="bootstrap", stage=~"building|success"}   # Track build progress
```

### Extracting JSON Fields

Parse JSON and filter on fields:
```logql
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
```
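
Extracted fields accept the same matchers as stream labels, so a regex can cover several priorities at once (a sketch; syslog priorities 0-3 span emergency through error):

```logql
{systemd_unit="caddy.service"} | json | PRIORITY=~"[0-3]"
```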

---

## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```promql
nixos_flake_info
```

Labels:
- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```promql
nixos_flake_revision_behind == 1
```
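
To see the full revision labels only for hosts that are behind, the two series can be joined (a sketch, assuming both series carry the `hostname` label):

```promql
nixos_flake_info and on (hostname) (nixos_flake_revision_behind == 1)
```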

View flake input versions:

```promql
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```promql
nixos_flake_input_age_seconds / 86400
```

Returns the age in days for each flake input.
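
The same expression works as a threshold check; for example, to flag inputs older than 30 days (a sketch):

```promql
nixos_flake_input_age_seconds / 86400 > 30
```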

### System Health

Basic host availability:

```promql
up{job="node-exporter"}
```

CPU usage by host:

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
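
For disk trends, `predict_linear` extrapolates recent usage; this sketch flags root filesystems projected to fill within four hours, based on the last six hours of data:

```promql
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
```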

### Prometheus Jobs

All available Prometheus job names:

**System exporters (on all/most hosts):**
- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics

**Service-specific exporters:**
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)

**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway

**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)
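
A quick health sweep across all of the jobs above (a sketch):

```promql
up == 0                    # Every target that is currently down
count by (job) (up == 0)   # Down targets grouped by job
```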

### Target Labels

All scrape targets have these labels:

**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering

**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)

### Filtering by Host

Use the `hostname` label for easy host filtering across all jobs:

```promql
{hostname="ns1"}                      # All metrics from ns1
node_load1{hostname="monitoring01"}   # Specific metric by hostname
up{hostname="ha1"}                    # Check if ha1 is up
```

This is simpler than wildcarding the `instance` label:

```promql
# Old way (still works, but verbose)
up{instance=~"monitoring01.*"}

# New way (preferred)
up{hostname="monitoring01"}
```

### Filtering by Role/Tier

Filter hosts by their role or tier:

```promql
up{role="dns"}                              # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"}   # Build hosts only (nix-cache01)
up{tier="test"}                             # All test-tier VMs
up{dns_role="primary"}                      # Primary DNS only (ns1)
```

Current host labels:

| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| kanidm01 | `role=auth`, `tier=test` |
| testvm01/02/03 | `tier=test` |
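
These labels also aggregate; for example, to count healthy hosts per role (a sketch - hosts without a `role` label fall into a series with no role value):

```promql
count by (role) (up{job="node-exporter"} == 1)
```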

---

## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see each host's current revision
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`

### Investigate Service Issues

1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{hostname="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{hostname="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
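
If the systemd exporter exposes unit states (the metric name varies by exporter; `systemd_unit_state` is assumed here), failed units can also be listed directly from metrics (a sketch):

```promql
systemd_unit_state{state="failed"} == 1
```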

### After Deploying Changes

1. Verify that `current_rev` has updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check that service metrics are being scraped

### Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

1. Watch bootstrap logs: `{job="bootstrap", hostname="<hostname>"}`
2. Check for failures: `{job="bootstrap", hostname="<hostname>", stage="failed"}`
3. After success, verify the host appears in metrics: `up{hostname="<hostname>"}`
4. Check that logs are flowing: `{hostname="<hostname>"}`

See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.

### Debug SSH/Access Issues

```logql
{hostname="<host>", systemd_unit="sshd.service"}
```

### Check Recent Upgrades

```logql
{systemd_unit="nixos-upgrade.service"}
```

Use `start: "24h"` to see the last 24 hours of upgrades across all hosts.
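
To surface only problem runs, add a case-insensitive filter (a sketch):

```logql
{systemd_unit="nixos-upgrade.service"} |~ "(?i)(error|failed)"
```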

---

## Notes

- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use the `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- Use the `hostname` label to filter metrics by host (simpler than a regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- Journal log lines are JSON; the `MESSAGE` field contains the actual log content
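
The counter-versus-gauge rule in practice (a sketch; `node_network_receive_bytes_total` is a counter, `node_load1` a gauge):

```promql
rate(node_network_receive_bytes_total[5m])   # Counter: wrap in rate()
node_load1                                   # Gauge: query directly
```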