skills: update observability with new target labels

Document the new hostname and host metadata labels available on all
Prometheus scrape targets:

- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:12:17 +01:00
parent 50a85daa44
commit b794aa89db
1 changed file with 51 additions and 10 deletions
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -185,21 +185,60 @@ Common job names:
 - `home-assistant` - Home automation
 - `step-ca` - Internal CA
-### Instance Label Format
+### Target Labels
-The `instance` label uses FQDN format:
+All scrape targets have these labels:
-```
+**Standard labels:**
-<hostname>.home.2rjus.net:<port>
+- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
-```
+- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
 - `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
-Example queries filtering by host:
+**Host metadata labels** (when configured in `homelab.host`):
 - `role` - Host role (e.g., `dns`, `build-host`, `vault`)
 - `tier` - Deployment tier (`test` for test VMs, absent for prod)
 - `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
 ### Filtering by Host
 Use the `hostname` label for easy host filtering across all jobs:
 ```promql
-up{instance=~"monitoring01.*"}
+{hostname="ns1"}                    # All metrics from ns1
-node_load1{instance=~"ns1.*"}
+node_load1{hostname="monitoring01"} # Specific metric by hostname
 up{hostname="ha1"}                  # Check if ha1 is up
 ```
 This is simpler than wildcarding the `instance` label:
 ```promql
 # Old way (still works but verbose)
 up{instance=~"monitoring01.*"}
 # New way (preferred)
 up{hostname="monitoring01"}
 ```
 ### Filtering by Role/Tier
 Filter hosts by their role or tier:
 ```promql
 up{role="dns"}                      # All DNS servers (ns1, ns2)
 node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
 up{tier="test"}                     # All test-tier VMs
 up{dns_role="primary"}              # Primary DNS only (ns1)
 ```
 Current host labels:
 | Host | Labels |
 |------|--------|
 | ns1 | `role=dns`, `dns_role=primary` |
 | ns2 | `role=dns`, `dns_role=secondary` |
 | nix-cache01 | `role=build-host` |
 | vault01 | `role=vault` |
 | testvm01/02/03 | `tier=test` |
 ---
 ## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
 ### Investigate Service Issues
-1. Check `up{job="<service>"}` for scrape failures
+1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
 4. Search for errors: `{host="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
 6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
 ### After Deploying Changes
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
-- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
 - Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
 - Log `MESSAGE` field contains the actual log content in JSON format
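
The labels documented in this change can also be combined in a single selector. The queries below are a minimal sketch and are not part of the commit. They assume the label values listed in the host table above, and they rely on standard PromQL matcher behavior, where an empty-string matcher such as `tier=""` also selects series that lack the label entirely.

```promql
# Illustrative only: assumes the role/tier/dns_role values from the host table in this diff.
up{role="dns", dns_role="secondary"}   # Only the secondary DNS server (ns2)
up{tier=""}                            # Targets with no tier label, i.e. prod hosts
count by (role) (up{role!=""})         # Number of scrape targets per role
up{hostname=~"ns1|ns2"} == 0           # Targets on ns1 or ns2 that are failing scrapes
```

Because `tier` is absent on prod hosts rather than set to a value, `tier=""` (or `tier!="test"`) is likely the simplest way to exclude test VMs from a query.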