diff --git a/CLAUDE.md b/CLAUDE.md
index 5f87677..3566bad 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -115,6 +115,68 @@ Two MCP servers are available for searching NixOS options and packages:
 
 This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
 
+### Lab Monitoring Log Queries
+
+The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
+
+**Loki Label Reference:**
+
+- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
+- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
+- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs such as caddy access logs)
+- `filename` - For the `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
+
+Journal log entries are JSON-formatted, with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
+
+**Example LogQL queries:**
+```
+# Logs from a specific service on a host
+{host="ns2", systemd_unit="nsd.service"}
+
+# Substring match on log content
+{host="ns1", systemd_unit="nsd.service"} |= "error"
+
+# File-based logs (e.g., caddy access logs)
+{job="varlog", host="nix-cache01"}
+```
+
+The default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) to reach older logs.
+
+### Lab Monitoring Prometheus Queries
+
+The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<hostname>.home.2rjus.net:<port>`.
+
+**Prometheus Job Names:**
+
+- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
+- `caddy` - Reverse proxy metrics (http-proxy)
+- `nix-cache_caddy` - Nix binary cache metrics
+- `home-assistant` - Home automation metrics
+- `jellyfin` - Media server metrics
+- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
+- `step-ca` - Internal CA metrics
+- `pve-exporter` - Proxmox hypervisor metrics
+- `smartctl` - Disk SMART health (gunter)
+- `wireguard` - VPN metrics (http-proxy)
+- `pushgateway` - Push-based metrics (e.g., backup results)
+- `restic_rest` - Backup server metrics
+- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
+
+**Example PromQL queries:**
+```
+# Check all targets are up
+up
+
+# CPU usage for a specific host
+rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
+
+# Fraction of memory available across all hosts
+node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
+
+# Available disk space on the root filesystem
+node_filesystem_avail_bytes{mountpoint="/"}
+```
+
 ## Architecture
 
 ### Directory Structure
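
The Loki section in this change says journal entries are JSON-formatted with the text in `MESSAGE`, so query results may need unpacking on the client side. Below is a minimal Python sketch, assuming each returned log line is a single JSON object. The function names and sample entries are hypothetical. Treating `PRIORITY` values of 3 or lower as errors follows the standard syslog severity levels, not anything this repository defines.

```python
import json


def extract_message(line: str) -> str:
    """Return the MESSAGE field of a JSON journal entry.

    Falls back to the raw line when it is not a JSON object,
    which is the likely case for file-based (varlog) entries.
    """
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return line
    if not isinstance(entry, dict):
        return line
    return str(entry.get("MESSAGE", line))


def is_error(line: str, max_priority: int = 3) -> bool:
    """Return True when the entry's PRIORITY is at or below max_priority.

    Syslog priorities run from 0 (emerg) to 7 (debug); 3 is "err".
    Lines without a usable PRIORITY field are treated as non-errors.
    """
    try:
        entry = json.loads(line)
        return int(entry["PRIORITY"]) <= max_priority
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return False


if __name__ == "__main__":
    # Hypothetical sample entries, shaped like the fields named in the docs.
    samples = [
        '{"MESSAGE": "zone reloaded", "PRIORITY": "6", "SYSLOG_IDENTIFIER": "nsd"}',
        '{"MESSAGE": "failed to bind socket", "PRIORITY": "3", "SYSLOG_IDENTIFIER": "nsd"}',
    ]
    for raw in samples:
        marker = "ERR" if is_error(raw) else "   "
        print(marker, extract_message(raw))
```

Lines that are not JSON pass through `extract_message` unchanged, so the same helper can be applied to results from both the `systemd-journal` and `varlog` jobs.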