docs: consolidate monitoring docs into observability skill
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s

- Move detailed Prometheus/Loki reference from CLAUDE.md to the
  observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
2026-02-08 02:15:02 +01:00
parent 8fbf1224fa
commit c2ec34cab9
2 changed files with 82 additions and 82 deletions

View File

@@ -152,82 +152,16 @@ Two MCP servers are available for searching NixOS options and packages:
This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
### Lab Monitoring Log Queries
### Lab Monitoring
The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:
**Loki Label Reference:**
- Available Prometheus jobs and exporters
- Loki labels and LogQL query syntax
- Bootstrap log monitoring for new VMs
- Common troubleshooting workflows
- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
**Bootstrap Logs:**
VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`
Query bootstrap status:
```
{job="bootstrap"} # All bootstrap logs
{job="bootstrap", host="testvm01"} # Specific host
{job="bootstrap", stage="failed"} # All failures
{job="bootstrap", stage=~"building|success"} # Track build progress
```
**Example LogQL queries:**
```
# Logs from a specific service on a host
{host="ns2", systemd_unit="nsd.service"}
# Substring match on log content
{host="ns1", systemd_unit="nsd.service"} |= "error"
# File-based logs (e.g., caddy access logs)
{job="varlog", hostname="nix-cache01"}
```
Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
### Lab Monitoring Prometheus Queries
The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.
**Prometheus Job Names:**
- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics
- `jellyfin` - Media server metrics
- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `ghettoptt` / `alertmanager` - Other service metrics
**Example PromQL queries:**
```
# Check all targets are up
up
# CPU usage for a specific host
rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
# Memory usage across all hosts
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
# Disk space
node_filesystem_avail_bytes{mountpoint="/"}
```
The skill contains up-to-date information about all scrape targets, host labels, and example queries.
### Deploying to Test Hosts