docs: consolidate monitoring docs into observability skill

- Move detailed Prometheus/Loki reference from CLAUDE.md to the
  observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 02:15:02 +01:00
parent 8fbf1224fa
commit c2ec34cab9
2 changed files with 82 additions and 82 deletions


@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:
Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams
@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection
### Bootstrap Logs
VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage (see table below)
**Bootstrap stages:**
| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
**Bootstrap queries:**
```logql
{job="bootstrap"} # All bootstrap logs
{job="bootstrap", host="myhost"} # Specific host
{job="bootstrap", stage="failed"} # All failures
{job="bootstrap", stage=~"building|success"} # Track build progress
```
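As a rough illustration of what a bootstrap log push might look like, the sketch below builds a request body in the shape accepted by Loki's push API (`POST /loki/api/v1/push`), using the labels described above. The real bootstrap script uses curl and its exact payload is not shown here; the function name `build_bootstrap_payload` and the example host and branch values are hypothetical.

```python
import json
import time


def build_bootstrap_payload(host, branch, stage, message, ts_ns=None):
    """Build a Loki push-API body carrying one bootstrap log line.

    Labels mirror the ones documented for job="bootstrap".
    ts_ns is a Unix timestamp in nanoseconds; defaults to the current time.
    """
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {
        "streams": [
            {
                "stream": {
                    "job": "bootstrap",
                    "host": host,
                    "branch": branch,
                    "stage": stage,
                },
                # Loki expects [timestamp-as-string, log line] pairs.
                "values": [[str(ts_ns), message]],
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical host and branch, for illustration only.
    body = build_bootstrap_payload(
        "myhost", "master", "network_ok", "Network connectivity confirmed"
    )
    print(json.dumps(body, indent=2))
```

Because `stage` is sent as a stream label rather than embedded in the message, queries such as `{job="bootstrap", stage="failed"}` can select failures without parsing log lines.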
### Extracting JSON Fields
Parse JSON and filter on fields:
@@ -175,15 +205,39 @@ Disk space (root filesystem):
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
### Service-Specific Metrics
### Prometheus Jobs
Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA
All available Prometheus job names:
**System exporters (on all/most hosts):**
- `node-exporter` - System metrics (CPU, memory, disk, network)
- `nixos-exporter` - NixOS flake revision and generation info
- `systemd-exporter` - Systemd unit status metrics
- `homelab-deploy` - Deployment listener metrics
**Service-specific exporters:**
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics (ha1)
- `jellyfin` - Media server metrics (jelly01)
- `kanidm` - Authentication server metrics (kanidm01)
- `nats` - NATS messaging metrics (nats1)
- `openbao` - Secrets management metrics (vault01)
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)
**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway
**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `restic_rest` - Backup server metrics
- `ghettoptt` - PTT service metrics (gunter)
### Target Labels
@@ -237,6 +291,7 @@ Current host labels:
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| kanidm01 | `role=auth`, `tier=test` |
| testvm01/02/03 | `tier=test` |
---
@@ -265,6 +320,17 @@ Current host labels:
3. Check service logs for startup issues
4. Check service metrics are being scraped
### Monitor VM Bootstrap
When provisioning new VMs, track bootstrap progress:
1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
4. Check logs are flowing: `{host="<hostname>"}`
See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
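The four checks above can be generated for a given host with a small helper. This is only a convenience sketch: `bootstrap_checks` is a hypothetical name, and the queries are copied from the steps above with the hostname substituted in.

```python
def bootstrap_checks(hostname):
    """Return (kind, description, query) tuples for the bootstrap workflow."""
    return [
        ("logql", "Watch bootstrap logs",
         f'{{job="bootstrap", host="{hostname}"}}'),
        ("logql", "Check for failures",
         f'{{job="bootstrap", host="{hostname}", stage="failed"}}'),
        ("promql", "Verify host appears in metrics",
         f'up{{hostname="{hostname}"}}'),
        ("logql", "Check logs are flowing",
         f'{{host="{hostname}"}}'),
    ]


if __name__ == "__main__":
    # testvm01 is one of the hosts listed in the host labels table.
    for kind, description, query in bootstrap_checks("testvm01"):
        print(f"[{kind}] {description}: {query}")
```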
### Debug SSH/Access Issues
```logql