---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---

# Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

## Available Tools

Use the `lab-monitoring` MCP server tools:

**Metrics:**
- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**
- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label

---

## Logs Reference

### Label Reference

Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
- `filename` - For `varlog` job, the log file path
- `hostname` - Hostname label used instead of `host` on some streams (see the `varlog` example below)

### Log Format

Journal logs are JSON-formatted. Key fields:
- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority; lower is more severe (3=error, 4=warning, 6=info)
- `SYSLOG_IDENTIFIER` - Program name
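
To make the field layout concrete, here is a minimal Python sketch that decodes one journal line and maps `PRIORITY` to its syslog level name. The sample line and the `summarize` helper are made up for illustration; real journal entries carry many more fields.

```python
import json

# Standard syslog severity levels (0 is the most severe).
SYSLOG_LEVELS = {
    "0": "emerg", "1": "alert", "2": "crit", "3": "error",
    "4": "warning", "5": "notice", "6": "info", "7": "debug",
}

def summarize(line: str) -> str:
    """Return '<level> <program>: <message>' for one JSON journal line."""
    entry = json.loads(line)
    # PRIORITY may arrive as a string or a number, so normalize to str.
    level = SYSLOG_LEVELS.get(str(entry.get("PRIORITY")), "unknown")
    program = entry.get("SYSLOG_IDENTIFIER", "?")
    return f"{level} {program}: {entry.get('MESSAGE', '')}"

# Hypothetical log line, shaped like the fields listed above.
sample = '{"MESSAGE": "zone reloaded", "PRIORITY": "6", "SYSLOG_IDENTIFIER": "nsd"}'
print(summarize(sample))  # info nsd: zone reloaded
```

The table covers all eight syslog levels, so `PRIORITY` values other than the three listed above resolve as well.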

### Basic LogQL Queries

**Logs from a specific service on a host:**
```logql
{host="ns1", systemd_unit="nsd.service"}
```

**All logs from a host:**
```logql
{host="monitoring01"}
```

**Logs from a service across all hosts:**
```logql
{systemd_unit="nixos-upgrade.service"}
```

**Substring matching (case-sensitive):**
```logql
{host="ha1"} |= "error"
```

**Exclude lines containing a substring:**
```logql
{host="ns1"} != "routine"
```

**Regex matching:**
```logql
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```

**File-based logs (Caddy access logs, etc.), selected by host or by file:**
```logql
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```
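
The queries above share one shape: a label selector in braces, optionally followed by a line filter. The sketch below composes that shape in Python. The `logql` helper is hypothetical and only builds the query string; it does not call any of the MCP tools.

```python
from typing import Dict, Optional

def logql(labels: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build '{k="v", ...}' plus an optional |= substring filter."""
    def quote(value: str) -> str:
        # Escape backslashes and double quotes for a double-quoted string.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    selector = ", ".join(f"{key}={quote(val)}" for key, val in labels.items())
    query = "{" + selector + "}"
    if contains is not None:
        query += f" |= {quote(contains)}"
    return query

print(logql({"host": "ns1", "systemd_unit": "nsd.service"}))
print(logql({"host": "ha1"}, contains="error"))
```

The two calls reproduce the first and fourth queries in this section.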

### Time Ranges

The default lookback is 1 hour. Pass the `start` parameter to look further back:
- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days
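
The `start` values above are all expressed in hours, so longer windows need a quick conversion. A small sketch of that arithmetic, assuming `start` only needs an hour-suffixed string like the ones listed; `lookback` is a hypothetical helper name.

```python
def lookback(days: float) -> str:
    """Convert a number of days into an hour-based `start` value."""
    if days <= 0:
        raise ValueError("days must be positive")
    hours = max(round(days * 24), 1)  # never go below one hour
    return f"{hours}h"

print(lookback(1))  # 24h
print(lookback(7))  # 168h
```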

### Common Services

Useful systemd units for troubleshooting:
- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Extracting JSON Fields

Parse each line as JSON with `| json`, then filter on the extracted fields. This example keeps only error-priority lines:
```logql
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
```

---

## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```promql
nixos_flake_info
```

Labels:
- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```promql
nixos_flake_revision_behind == 1
```

View flake input versions:

```promql
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```promql
nixos_flake_input_age_seconds / 86400
```

Dividing by 86400 (seconds per day) gives each flake input's age in days.

### System Health

Basic host availability:

```promql
up{job="node-exporter"}
```

CPU usage by host (percent, averaged across cores):

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage (fraction of total, 0 to 1):

```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Free disk space on the root filesystem (fraction available, 0 to 1):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```

### Service-Specific Metrics

Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA

### Instance Label Format

The `instance` label holds the host's FQDN followed by the scrape port:

```
<hostname>.home.2rjus.net:<port>
```

Example queries filtering by host:

```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
```
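
PromQL regex matchers are fully anchored, so a pattern such as `ns1.*` has to match the whole label value, and it also matches any other hostname that merely starts with `ns1`. The sketch below uses Python's `re.fullmatch` to mimic that anchoring. The `ns10` host and port `9100` are hypothetical values chosen only to show the difference.

```python
import re

def promql_regex_matches(pattern: str, value: str) -> bool:
    """Mimic PromQL's fully anchored regex matching with re.fullmatch."""
    return re.fullmatch(pattern, value) is not None

ns1 = "ns1.home.2rjus.net:9100"    # port 9100 is an assumed example
ns10 = "ns10.home.2rjus.net:9100"  # hypothetical second host

loose = "ns1.*"      # pattern from the example above
strict = r"ns1\..*"  # requires a literal dot right after the hostname

for name, pattern in (("loose", loose), ("strict", strict)):
    print(name, promql_regex_matches(pattern, ns1), promql_regex_matches(pattern, ns10))
# loose True True
# strict True False
```

Inside a double-quoted PromQL string the backslash has to be doubled, so the strict form is written `instance=~"ns1\\..*"`.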

---

## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`
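
The first two steps amount to grouping hosts by revision and spotting the outliers. A rough Python sketch of that reasoning follows, assuming the `nixos_flake_info` series have already been reduced to plain label dictionaries; the actual output shape of the `query` tool is not specified here. The helper names, revisions, and port are placeholders, and comparing `current_rev` with `remote_rev` only approximates what `nixos_flake_revision_behind` reports.

```python
from collections import defaultdict

def hosts_by_revision(series):
    """Group instance names by their `current_rev` label."""
    groups = defaultdict(list)
    for labels in series:
        groups[labels["current_rev"]].append(labels["instance"])
    return dict(groups)

def stale_hosts(series):
    """Instances whose running revision differs from the remote one."""
    return [s["instance"] for s in series if s["current_rev"] != s["remote_rev"]]

# Hypothetical label sets; revisions and the port are placeholders.
series = [
    {"instance": "ns1.home.2rjus.net:9100", "current_rev": "aaa1111", "remote_rev": "bbb2222"},
    {"instance": "ha1.home.2rjus.net:9100", "current_rev": "bbb2222", "remote_rev": "bbb2222"},
    {"instance": "monitoring01.home.2rjus.net:9100", "current_rev": "bbb2222", "remote_rev": "bbb2222"},
]

print(hosts_by_revision(series))
print(stale_hosts(series))  # ['ns1.home.2rjus.net:9100']
```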

### Investigate Service Issues

1. Check `up{job="<service>"}` for scrape failures (a value of 0 means the last scrape failed)
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts

### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check that the service's metrics are being scraped (`up{job="<service>"}` should be 1)

### Debug SSH/Access Issues

```logql
{host="<host>", systemd_unit="sshd.service"}
```

### Check Recent Upgrades

```logql
{systemd_unit="nixos-upgrade.service"}
```

Pass `start: "24h"` to see the last 24 hours of upgrades across all hosts.

---

## Notes

- The default scrape interval is 15s for most metrics targets
- The default log lookback is 1h; use the `start` parameter for older logs
- Use `rate()` for counter metrics; query gauges directly
- The `instance` label includes the port, so use regex matching (`=~`) for hostname-only filters
- Journal log lines are JSON; the `MESSAGE` field holds the actual log text