docs: consolidate monitoring docs into observability skill
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Move detailed Prometheus/Loki reference from CLAUDE.md to the observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -32,7 +32,7 @@ Use the `lab-monitoring` MCP server tools:
 Available labels for log queries:
 - `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
+- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
 - `hostname` - Alternative to `host` for some streams
 
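For reference, the labels above combine into LogQL stream selectors. A sketch using values that appear elsewhere in this repo's docs:

```logql
{host="ns1", systemd_unit="nsd.service", job="systemd-journal"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```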
@@ -102,6 +102,36 @@ Useful systemd units for troubleshooting:
 - `sshd.service` - SSH daemon
 - `nix-gc.service` - Nix garbage collection
 
+### Bootstrap Logs
+
+VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
+
+- `host` - Target hostname
+- `branch` - Git branch being deployed
+- `stage` - Bootstrap stage (see table below)
+
+**Bootstrap stages:**
+
+| Stage | Message | Meaning |
+|-------|---------|---------|
+| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
+| `network_ok` | Network connectivity confirmed | Can reach git server |
+| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
+| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
+| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
+| `building` | Starting nixos-rebuild boot | NixOS build starting |
+| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
+| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
+
+**Bootstrap queries:**
+
+```logql
+{job="bootstrap"}                              # All bootstrap logs
+{job="bootstrap", host="myhost"}               # Specific host
+{job="bootstrap", stage="failed"}              # All failures
+{job="bootstrap", stage=~"building|success"}   # Track build progress
+```
+
 ### Extracting JSON Fields
 
 Parse JSON and filter on fields:
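The hunk above cuts off at the lead-in for the JSON example. For reference, a typical LogQL JSON pipeline over the journal fields this repo's docs describe (`MESSAGE`, `PRIORITY`) looks like this; the host value is illustrative:

```logql
{host="ns1", job="systemd-journal"} | json | PRIORITY <= 3 | line_format "{{.MESSAGE}}"
```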
@@ -175,15 +205,39 @@ Disk space (root filesystem):
 node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
 ```
 
-### Service-Specific Metrics
-
-Common job names:
-- `node-exporter` - System metrics (all hosts)
-- `nixos-exporter` - NixOS version/generation metrics
-- `caddy` - Reverse proxy metrics
-- `prometheus` / `loki` / `grafana` - Monitoring stack
-- `home-assistant` - Home automation
-- `step-ca` - Internal CA
+### Prometheus Jobs
+
+All available Prometheus job names:
+
+**System exporters (on all/most hosts):**
+- `node-exporter` - System metrics (CPU, memory, disk, network)
+- `nixos-exporter` - NixOS flake revision and generation info
+- `systemd-exporter` - Systemd unit status metrics
+- `homelab-deploy` - Deployment listener metrics
+
+**Service-specific exporters:**
+- `caddy` - Reverse proxy metrics (http-proxy)
+- `nix-cache_caddy` - Nix binary cache metrics
+- `home-assistant` - Home automation metrics (ha1)
+- `jellyfin` - Media server metrics (jelly01)
+- `kanidm` - Authentication server metrics (kanidm01)
+- `nats` - NATS messaging metrics (nats1)
+- `openbao` - Secrets management metrics (vault01)
+- `unbound` - DNS resolver metrics (ns1, ns2)
+- `wireguard` - VPN tunnel metrics (http-proxy)
+
+**Monitoring stack (localhost on monitoring01):**
+- `prometheus` - Prometheus self-metrics
+- `loki` - Loki self-metrics
+- `grafana` - Grafana self-metrics
+- `alertmanager` - Alertmanager metrics
+- `pushgateway` - Push-based metrics gateway
+
+**External/infrastructure:**
+- `pve-exporter` - Proxmox hypervisor metrics
+- `smartctl` - Disk SMART health (gunter)
+- `restic_rest` - Backup server metrics
+- `ghettoptt` - PTT service metrics (gunter)
 
 ### Target Labels
 
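As a quick sanity check against the job list above, example PromQL (job names are from the list; results depend on which targets are live):

```
# Scrape target count per job
count by (job) (up)

# Any node-exporter targets currently down
up{job="node-exporter"} == 0
```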
@@ -237,6 +291,7 @@ Current host labels:
 | ns2 | `role=dns`, `dns_role=secondary` |
 | nix-cache01 | `role=build-host` |
 | vault01 | `role=vault` |
+| kanidm01 | `role=auth`, `tier=test` |
 | testvm01/02/03 | `tier=test` |
 
 ---
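Since these host labels are attached to scrape targets, they can be used directly in PromQL selectors. A sketch, assuming the labels are relabeled onto all scraped series as the examples in this doc imply:

```
# Target status for all DNS-role hosts
up{role="dns"}

# Root filesystem headroom on test-tier hosts
node_filesystem_avail_bytes{tier="test", mountpoint="/"}
```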
@@ -265,6 +320,17 @@ Current host labels:
 3. Check service logs for startup issues
 4. Check service metrics are being scraped
 
+### Monitor VM Bootstrap
+
+When provisioning new VMs, track bootstrap progress:
+
+1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
+2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
+3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
+4. Check logs are flowing: `{host="<hostname>"}`
+
+See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
+
 ### Debug SSH/Access Issues
 
 ```logql
80 CLAUDE.md
@@ -152,82 +152,16 @@ Two MCP servers are available for searching NixOS options and packages:
 This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
 
-### Lab Monitoring Log Queries
-
-The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
-
-**Loki Label Reference:**
-
-- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
-- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
-- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
-
-Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
-
-**Bootstrap Logs:**
-
-VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
-
-- `host` - Target hostname
-- `branch` - Git branch being deployed
-- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`
-
-Query bootstrap status:
-```
-{job="bootstrap"}                              # All bootstrap logs
-{job="bootstrap", host="testvm01"}             # Specific host
-{job="bootstrap", stage="failed"}              # All failures
-{job="bootstrap", stage=~"building|success"}   # Track build progress
-```
-
-**Example LogQL queries:**
-```
-# Logs from a specific service on a host
-{host="ns2", systemd_unit="nsd.service"}
-
-# Substring match on log content
-{host="ns1", systemd_unit="nsd.service"} |= "error"
-
-# File-based logs (e.g., caddy access logs)
-{job="varlog", hostname="nix-cache01"}
-```
-
-Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
-
-### Lab Monitoring Prometheus Queries
-
-The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.
-
-**Prometheus Job Names:**
-
-- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
-- `caddy` - Reverse proxy metrics (http-proxy)
-- `nix-cache_caddy` - Nix binary cache metrics
-- `home-assistant` - Home automation metrics
-- `jellyfin` - Media server metrics
-- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
-- `pve-exporter` - Proxmox hypervisor metrics
-- `smartctl` - Disk SMART health (gunter)
-- `wireguard` - VPN metrics (http-proxy)
-- `pushgateway` - Push-based metrics (e.g., backup results)
-- `restic_rest` - Backup server metrics
-- `ghettoptt` / `alertmanager` - Other service metrics
-
-**Example PromQL queries:**
-```
-# Check all targets are up
-up
-
-# CPU usage for a specific host
-rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
-
-# Memory usage across all hosts
-node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
-
-# Disk space
-node_filesystem_avail_bytes{mountpoint="/"}
-```
-
+### Lab Monitoring
+
+The **lab-monitoring** MCP server provides access to Prometheus metrics and Loki logs. Use the `/observability` skill for detailed reference on:
+
+- Available Prometheus jobs and exporters
+- Loki labels and LogQL query syntax
+- Bootstrap log monitoring for new VMs
+- Common troubleshooting workflows
+
+The skill contains up-to-date information about all scrape targets, host labels, and example queries.
+
 ### Deploying to Test Hosts
 