From c2ec34cab9e0acf40896efcbeb20d5a89f9c6f21 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sun, 8 Feb 2026 02:15:02 +0100
Subject: [PATCH] docs: consolidate monitoring docs into observability skill

- Move detailed Prometheus/Loki reference from CLAUDE.md to the observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info

Co-Authored-By: Claude Opus 4.5
---
 .claude/skills/observability/SKILL.md | 84 ++++++++++++++++++++++++---
 CLAUDE.md                             | 80 +++----------------------
 2 files changed, 82 insertions(+), 82 deletions(-)

diff --git a/.claude/skills/observability/SKILL.md b/.claude/skills/observability/SKILL.md
index 6053a10..c2b758c 100644
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:
 Available labels for log queries:
 - `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
+- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
 - `hostname` - Alternative to `host` for some streams
 
@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:
 - `sshd.service` - SSH daemon
 - `nix-gc.service` - Nix garbage collection
 
+### Bootstrap Logs
+
+VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
+
+- `host` - Target hostname
+- `branch` - Git branch being deployed
+- `stage` - Bootstrap stage (see table below)
+
+**Bootstrap stages:**
+
+| Stage | Message | Meaning |
+|-------|---------|---------|
+| `starting` | Bootstrap starting for \ (branch: \) | Bootstrap service has started |
+| `network_ok` | Network connectivity confirmed | Can reach git server |
+| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
+| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
+| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
+| `building` | Starting nixos-rebuild boot | NixOS build starting |
+| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
+| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
+
+**Bootstrap queries:**
+
+```logql
+{job="bootstrap"} # All bootstrap logs
+{job="bootstrap", host="myhost"} # Specific host
+{job="bootstrap", stage="failed"} # All failures
+{job="bootstrap", stage=~"building|success"} # Track build progress
+```
+
 ### Extracting JSON Fields
 
 Parse JSON and filter on fields:
@@ -175,15 +205,39 @@ Disk space (root filesystem):
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```
 
-### Service-Specific Metrics
+### Prometheus Jobs
 
-Common job names:
-- `node-exporter` - System metrics (all hosts)
-- `nixos-exporter` - NixOS version/generation metrics
-- `caddy` - Reverse proxy metrics
-- `prometheus` / `loki` / `grafana` - Monitoring stack
-- `home-assistant` - Home automation
-- `step-ca` - Internal CA
+All available Prometheus job names:
+
+**System exporters (on all/most hosts):**
+- `node-exporter` - System metrics (CPU, memory, disk, network)
+- `nixos-exporter` - NixOS flake revision and generation info
+- `systemd-exporter` - Systemd unit status metrics
+- `homelab-deploy` - Deployment listener metrics
+
+**Service-specific exporters:**
+- `caddy` - Reverse proxy metrics (http-proxy)
+- `nix-cache_caddy` - Nix binary cache metrics
+- `home-assistant` - Home automation metrics (ha1)
+- `jellyfin` - Media server metrics (jelly01)
+- `kanidm` - Authentication server metrics (kanidm01)
+- `nats` - NATS messaging metrics (nats1)
+- `openbao` - Secrets management metrics (vault01)
+- `unbound` - DNS resolver metrics (ns1, ns2)
+- `wireguard` - VPN tunnel metrics (http-proxy)
+
+**Monitoring stack (localhost on monitoring01):**
+- `prometheus` - Prometheus self-metrics
+- `loki` - Loki self-metrics
+- `grafana` - Grafana self-metrics
+- `alertmanager` - Alertmanager metrics
+- `pushgateway` - Push-based metrics gateway
+
+**External/infrastructure:**
+- `pve-exporter` - Proxmox hypervisor metrics
+- `smartctl` - Disk SMART health (gunter)
+- `restic_rest` - Backup server metrics
+- `ghettoptt` - PTT service metrics (gunter)
 
 ### Target Labels
 
@@ -237,6 +291,7 @@ Current host labels:
 | ns2 | `role=dns`, `dns_role=secondary` |
 | nix-cache01 | `role=build-host` |
 | vault01 | `role=vault` |
+| kanidm01 | `role=auth`, `tier=test` |
 | testvm01/02/03 | `tier=test` |
 
 ---
@@ -265,6 +320,17 @@ Current host labels:
 3. Check service logs for startup issues
 4. Check service metrics are being scraped
 
+### Monitor VM Bootstrap
+
+When provisioning new VMs, track bootstrap progress:
+
+1. Watch bootstrap logs: `{job="bootstrap", host=""}`
+2. Check for failures: `{job="bootstrap", host="", stage="failed"}`
+3. After success, verify host appears in metrics: `up{hostname=""}`
+4. Check logs are flowing: `{host=""}`
+
+See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
+
 ### Debug SSH/Access Issues
 
 ```logql
diff --git a/CLAUDE.md b/CLAUDE.md
index d1e42df..511e4eb 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -152,82 +152,16 @@ Two MCP servers are available for searching NixOS options and packages:
 
 This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
 
-### Lab Monitoring Log Queries
+### Lab Monitoring
 
-The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
+The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:
 
-**Loki Label Reference:**
+- Available Prometheus jobs and exporters
+- Loki labels and LogQL query syntax
+- Bootstrap log monitoring for new VMs
+- Common troubleshooting workflows
 
-- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
-- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
-- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
-
-Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
-
-**Bootstrap Logs:**
-
-VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
-
-- `host` - Target hostname
-- `branch` - Git branch being deployed
-- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`
-
-Query bootstrap status:
-```
-{job="bootstrap"} # All bootstrap logs
-{job="bootstrap", host="testvm01"} # Specific host
-{job="bootstrap", stage="failed"} # All failures
-{job="bootstrap", stage=~"building|success"} # Track build progress
-```
-
-**Example LogQL queries:**
-```
-# Logs from a specific service on a host
-{host="ns2", systemd_unit="nsd.service"}
-
-# Substring match on log content
-{host="ns1", systemd_unit="nsd.service"} |= "error"
-
-# File-based logs (e.g., caddy access logs)
-{job="varlog", hostname="nix-cache01"}
-```
-
-Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
-
-### Lab Monitoring Prometheus Queries
-
-The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `.home.2rjus.net:`.
-
-**Prometheus Job Names:**
-
-- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
-- `caddy` - Reverse proxy metrics (http-proxy)
-- `nix-cache_caddy` - Nix binary cache metrics
-- `home-assistant` - Home automation metrics
-- `jellyfin` - Media server metrics
-- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
-- `pve-exporter` - Proxmox hypervisor metrics
-- `smartctl` - Disk SMART health (gunter)
-- `wireguard` - VPN metrics (http-proxy)
-- `pushgateway` - Push-based metrics (e.g., backup results)
-- `restic_rest` - Backup server metrics
-- `ghettoptt` / `alertmanager` - Other service metrics
-
-**Example PromQL queries:**
-```
-# Check all targets are up
-up
-
-# CPU usage for a specific host
-rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
-
-# Memory usage across all hosts
-node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
-
-# Disk space
-node_filesystem_avail_bytes{mountpoint="/"}
-```
+The skill contains up-to-date information about all scrape targets, host labels, and example queries.
 
 ### Deploying to Test Hosts
 
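
Review note: the Bootstrap Logs section says VMs push progress straight to Loki via curl, but the patch does not show what such a push looks like. The sketch below is one plausible shape, assuming Loki's standard HTTP push endpoint (`/loki/api/v1/push`) and its JSON stream format. The helper names, the `loki_url` argument, and the example values are hypothetical; the real bootstrap script is not part of this patch and may differ.

```python
import json
import time
import urllib.request

# Stage names copied from the bootstrap stages table in SKILL.md.
BOOTSTRAP_STAGES = {
    "starting", "network_ok", "vault_ok", "vault_skip",
    "vault_warn", "building", "success", "failed",
}


def build_bootstrap_payload(host, branch, stage, message, ts_ns=None):
    """Build a Loki push body for one bootstrap log line (hypothetical helper)."""
    if stage not in BOOTSTRAP_STAGES:
        raise ValueError(f"unknown bootstrap stage: {stage!r}")
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {
        "streams": [
            {
                # Labels documented for job="bootstrap" in the skill.
                "stream": {
                    "job": "bootstrap",
                    "host": host,
                    "branch": branch,
                    "stage": stage,
                },
                # Loki expects [nanosecond timestamp as string, log line] pairs.
                "values": [[str(ts_ns), message]],
            }
        ]
    }


def push_to_loki(loki_url, payload, timeout=5):
    """POST the payload to Loki; loki_url is an assumed base URL, not from the patch."""
    req = urllib.request.Request(
        f"{loki_url}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

A line pushed this way would then match the documented `{job="bootstrap", stage="building"}` style queries. Keeping `stage` to the fixed set from the table keeps label cardinality low, which is presumably why the stages are enumerated.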