From c2ec34cab9e0acf40896efcbeb20d5a89f9c6f21 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= <torjus@usit.uio.no>
Date: Sun, 8 Feb 2026 02:15:02 +0100
Subject: [PATCH] docs: consolidate monitoring docs into observability skill

- Move detailed Prometheus/Loki reference from CLAUDE.md to the
  observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 .claude/skills/observability/SKILL.md | 84 ++++++++++++++++++++++++---
 CLAUDE.md                             | 80 +++----------------------
 2 files changed, 82 insertions(+), 82 deletions(-)

diff --git a/.claude/skills/observability/SKILL.md b/.claude/skills/observability/SKILL.md
index 6053a10..c2b758c 100644
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:
 Available labels for log queries:
 - `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
+- `job` - One of `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
 - `hostname` - Alternative to `host` for some streams
 
@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:
 - `sshd.service` - SSH daemon
 - `nix-gc.service` - Nix garbage collection
 
+### Bootstrap Logs
+
+VMs provisioned from template2 send bootstrap progress directly to Loki via curl, because Promtail is not yet available at that point. These logs use `job="bootstrap"` with additional labels:
+
+- `host` - Target hostname
+- `branch` - Git branch being deployed
+- `stage` - Bootstrap stage (see table below)
+
+**Bootstrap stages:**
+
+| Stage | Message | Meaning |
+|-------|---------|---------|
+| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
+| `network_ok` | Network connectivity confirmed | Can reach git server |
+| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
+| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
+| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
+| `building` | Starting nixos-rebuild boot | NixOS build starting |
+| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
+| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
+
+**Bootstrap queries:**
+
+```logql
+{job="bootstrap"}                              # All bootstrap logs
+{job="bootstrap", host="myhost"}               # Specific host
+{job="bootstrap", stage="failed"}              # All failures
+{job="bootstrap", stage=~"building|success"}   # Track build progress
+```
+
 ### Extracting JSON Fields
 
 Parse JSON and filter on fields:
@@ -175,15 +205,39 @@ Disk space (root filesystem):
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```
 
-### Service-Specific Metrics
+### Prometheus Jobs
 
-Common job names:
-- `node-exporter` - System metrics (all hosts)
-- `nixos-exporter` - NixOS version/generation metrics
-- `caddy` - Reverse proxy metrics
-- `prometheus` / `loki` / `grafana` - Monitoring stack
-- `home-assistant` - Home automation
-- `step-ca` - Internal CA
+All available Prometheus job names:
+
+**System exporters (on all/most hosts):**
+- `node-exporter` - System metrics (CPU, memory, disk, network)
+- `nixos-exporter` - NixOS flake revision and generation info
+- `systemd-exporter` - Systemd unit status metrics
+- `homelab-deploy` - Deployment listener metrics
+
+**Service-specific exporters:**
+- `caddy` - Reverse proxy metrics (http-proxy)
+- `nix-cache_caddy` - Nix binary cache metrics
+- `home-assistant` - Home automation metrics (ha1)
+- `jellyfin` - Media server metrics (jelly01)
+- `kanidm` - Authentication server metrics (kanidm01)
+- `nats` - NATS messaging metrics (nats1)
+- `openbao` - Secrets management metrics (vault01)
+- `unbound` - DNS resolver metrics (ns1, ns2)
+- `wireguard` - VPN tunnel metrics (http-proxy)
+
+**Monitoring stack (localhost on monitoring01):**
+- `prometheus` - Prometheus self-metrics
+- `loki` - Loki self-metrics
+- `grafana` - Grafana self-metrics
+- `alertmanager` - Alertmanager metrics
+- `pushgateway` - Push-based metrics gateway
+
+**External/infrastructure:**
+- `pve-exporter` - Proxmox hypervisor metrics
+- `smartctl` - Disk SMART health (gunter)
+- `restic_rest` - Backup server metrics
+- `ghettoptt` - PTT service metrics (gunter)
 
 ### Target Labels
 
@@ -237,6 +291,7 @@ Current host labels:
 | ns2 | `role=dns`, `dns_role=secondary` |
 | nix-cache01 | `role=build-host` |
 | vault01 | `role=vault` |
+| kanidm01 | `role=auth`, `tier=test` |
 | testvm01/02/03 | `tier=test` |
 
 ---
@@ -265,6 +320,17 @@ Current host labels:
 3. Check service logs for startup issues
 4. Check service metrics are being scraped
 
+### Monitor VM Bootstrap
+
+When provisioning new VMs, track bootstrap progress:
+
+1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
+2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
+3. After success, verify that the host appears in metrics: `up{hostname="<hostname>"}`
+4. Check that logs are flowing: `{host="<hostname>"}`
+
+See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
+
 ### Debug SSH/Access Issues
 
 ```logql
diff --git a/CLAUDE.md b/CLAUDE.md
index d1e42df..511e4eb 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -152,82 +152,16 @@ Two MCP servers are available for searching NixOS options and packages:
 
 This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
 
-### Lab Monitoring Log Queries
+### Lab Monitoring
 
-The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
+The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for a detailed reference on:
 
-**Loki Label Reference:**
+- Available Prometheus jobs and exporters
+- Loki labels and LogQL query syntax
+- Bootstrap log monitoring for new VMs
+- Common troubleshooting workflows
 
-- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
-- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
-- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
-
-Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
-
-**Bootstrap Logs:**
-
-VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
-
-- `host` - Target hostname
-- `branch` - Git branch being deployed
-- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`
-
-Query bootstrap status:
-```
-{job="bootstrap"}                              # All bootstrap logs
-{job="bootstrap", host="testvm01"}             # Specific host
-{job="bootstrap", stage="failed"}              # All failures
-{job="bootstrap", stage=~"building|success"}   # Track build progress
-```
-
-**Example LogQL queries:**
-```
-# Logs from a specific service on a host
-{host="ns2", systemd_unit="nsd.service"}
-
-# Substring match on log content
-{host="ns1", systemd_unit="nsd.service"} |= "error"
-
-# File-based logs (e.g., caddy access logs)
-{job="varlog", hostname="nix-cache01"}
-```
-
-Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
-
-### Lab Monitoring Prometheus Queries
-
-The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.
-
-**Prometheus Job Names:**
-
-- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
-- `caddy` - Reverse proxy metrics (http-proxy)
-- `nix-cache_caddy` - Nix binary cache metrics
-- `home-assistant` - Home automation metrics
-- `jellyfin` - Media server metrics
-- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
-- `pve-exporter` - Proxmox hypervisor metrics
-- `smartctl` - Disk SMART health (gunter)
-- `wireguard` - VPN metrics (http-proxy)
-- `pushgateway` - Push-based metrics (e.g., backup results)
-- `restic_rest` - Backup server metrics
-- `ghettoptt` / `alertmanager` - Other service metrics
-
-**Example PromQL queries:**
-```
-# Check all targets are up
-up
-
-# CPU usage for a specific host
-rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
-
-# Memory usage across all hosts
-node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
-
-# Disk space
-node_filesystem_avail_bytes{mountpoint="/"}
-```
+The skill contains up-to-date information about all scrape targets, host labels, and example queries.
 
 ### Deploying to Test Hosts