---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---

# Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

## Available Tools

Use the `lab-monitoring` MCP server tools:

**Metrics:**

- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**

- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label

---

## Logs Reference

### Label Reference

Available labels for log queries:

- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For the `varlog` job, the log file path
- `tier` - Deployment tier (`test` or `prod`)
- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
- `level` - Log level mapped from journal PRIORITY (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only

### Log Format

Journal logs are JSON-formatted. Key fields:

- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name

### Basic LogQL Queries

**Logs from a specific service on a host:**

```logql
{hostname="ns1", systemd_unit="nsd.service"}
```

**All logs from a host:**

```logql
{hostname="monitoring02"}
```

**Logs from a service across all hosts:**

```logql
{systemd_unit="nixos-upgrade.service"}
```

**Substring matching (case-sensitive):**

```logql
{hostname="ha1"} |= "error"
```

**Exclude pattern:**

```logql
{hostname="ns1"} != "routine"
```

**Regex matching:**

```logql
{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
```

**Filter by level (journal scrape only):**

```logql
{level="error"}                        # All errors across the fleet
{level=~"critical|error", tier="prod"} # Prod errors and criticals
{hostname="ns1", level="warning"}      # Warnings from a specific host
```

**Filter by tier/role:**

```logql
{tier="prod"} |= "error"               # All errors on prod hosts
{role="dns"}                           # All DNS server logs
{tier="test", job="systemd-journal"}   # Journal logs from test hosts
```

**File-based logs (caddy access logs, etc.):**

```logql
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```

### Time Ranges

Default lookback is 1 hour. Use the `start` parameter for older logs:

- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days

### Common Services

Useful systemd units for troubleshooting:

- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `victoriametrics.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Bootstrap Logs

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `hostname` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)

**Bootstrap stages:**

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

**Bootstrap queries:**

```logql
{job="bootstrap"}                             # All bootstrap logs
{job="bootstrap", hostname="myhost"}          # Specific host
{job="bootstrap", stage="failed"}             # All failures
{job="bootstrap", stage=~"building|success"}  # Track build progress
```

### Extracting JSON Fields

Parse JSON and filter on fields:

```logql
{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
```
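
The parsed fields can also drive `line_format` to reshape output. `line_format` is standard LogQL; the journal field names used here are the ones listed under Log Format:

```logql
{systemd_unit="victoriametrics.service"} | json | PRIORITY="3" | line_format "{{.SYSLOG_IDENTIFIER}}: {{.MESSAGE}}"
```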

---

## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```promql
nixos_flake_info
```

Labels:

- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```promql
nixos_flake_revision_behind == 1
```

View flake input versions:

```promql
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```promql
nixos_flake_input_age_seconds / 86400
```

Returns age in days for each flake input.

### System Health

Basic host availability:

```promql
up{job="node-exporter"}
```

CPU usage by host:

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
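
A low-free-space variant of the same ratio (the 10% threshold is arbitrary; adjust to taste):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
```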

### Prometheus Jobs

All available Prometheus job names:

**System exporters (on all/most hosts):**

- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics

**Service-specific exporters:**

- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)

**Monitoring stack (localhost on monitoring02):**

- `victoriametrics` - VictoriaMetrics self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics

**External/infrastructure:**

- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)

### Target Labels

All scrape targets have these labels:

**Standard labels:**

- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering

**Host metadata labels** (when configured in `homelab.host`):

- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)

### Filtering by Host

Use the `hostname` label for easy host filtering across all jobs:

```promql
{hostname="ns1"}                     # All metrics from ns1
node_load1{hostname="monitoring02"}  # Specific metric by hostname
up{hostname="ha1"}                   # Check if ha1 is up
```

This is simpler than wildcarding the `instance` label:

```promql
# Old way (still works but verbose)
up{instance=~"monitoring02.*"}

# New way (preferred)
up{hostname="monitoring02"}
```

### Filtering by Role/Tier

Filter hosts by their role or tier:

```promql
up{role="dns"}                             # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
up{tier="test"}                            # All test-tier VMs
up{dns_role="primary"}                     # Primary DNS only (ns1)
```

Current host labels:

| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| kanidm01 | `role=auth`, `tier=test` |
| testvm01/02/03 | `tier=test` |

---

## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`
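
The first two steps map onto queries like these:

```promql
count by (current_rev) (nixos_flake_info)  # revision spread across the fleet
nixos_flake_revision_behind == 1           # hosts that still need an update
```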

### Investigate Service Issues

1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{hostname="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{hostname="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
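
As a worked example, checking the DNS servers (the `unbound` job and `dns` role come from the lists above):

```promql
up{job="unbound"} == 0  # returns series only for failing unbound targets
```

Then pull recent errors from those hosts with `{role="dns"} |= "error"` in Loki.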

### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check that service metrics are being scraped
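
Step 2 can be phrased as a single fleet-wide check:

```promql
count(nixos_flake_revision_behind == 1)  # returns no series when every host is current
```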

### Monitor VM Bootstrap

When provisioning new VMs, track bootstrap progress:

1. Watch bootstrap logs: `{job="bootstrap", hostname="<hostname>"}`
2. Check for failures: `{job="bootstrap", hostname="<hostname>", stage="failed"}`
3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
4. Check logs are flowing: `{hostname="<hostname>"}`

See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.

### Debug SSH/Access Issues

```logql
{hostname="<host>", systemd_unit="sshd.service"}
```
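
To narrow this to authentication failures (the exact message text varies between sshd versions, so treat the matcher as a starting point):

```logql
{hostname="<host>", systemd_unit="sshd.service"} |~ "Failed password|Invalid user"
```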

### Check Recent Upgrades

```logql
{systemd_unit="nixos-upgrade.service"}
```

Use `start: "24h"` to see the last 24 hours of upgrades across all hosts.

---

## Notes

- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use the `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- Journal logs are stored as JSON; the `MESSAGE` field holds the actual log content
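
A quick illustration of the counter-vs-gauge note (both are standard node-exporter series):

```promql
rate(node_network_receive_bytes_total[5m])  # counter: wrap in rate()
node_load1                                  # gauge: query directly
```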