From b794aa89db83067fa688be92ca9a5a3ceeffb780 Mon Sep 17 00:00:00 2001
From: Torjus Håkestad
Date: Sat, 7 Feb 2026 17:12:17 +0100
Subject: [PATCH] skills: update observability with new target labels

Document the new hostname and host metadata labels available on all
Prometheus scrape targets:
- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers

Co-Authored-By: Claude Opus 4.5
---
 .claude/skills/observability/SKILL.md | 61 ++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/.claude/skills/observability/SKILL.md b/.claude/skills/observability/SKILL.md
index 69be240..6053a10 100644
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -185,21 +185,60 @@ Common job names:
 - `home-assistant` - Home automation
 - `step-ca` - Internal CA
 
-### Instance Label Format
+### Target Labels
 
-The `instance` label uses FQDN format:
+All scrape targets have these labels:
 
-```
-<hostname>.home.2rjus.net:<port>
-```
+**Standard labels:**
+- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
+- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
 
-Example queries filtering by host:
+**Host metadata labels** (when configured in `homelab.host`):
+- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
+- `tier` - Deployment tier (`test` for test VMs, absent for prod)
+- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
+
+### Filtering by Host
+
+Use the `hostname` label for easy host filtering across all jobs:
 
 ```promql
-up{instance=~"monitoring01.*"}
-node_load1{instance=~"ns1.*"}
+{hostname="ns1"}                     # All metrics from ns1
+node_load1{hostname="monitoring01"}  # Specific metric by hostname
+up{hostname="ha1"}                   # Check if ha1 is up
 ```
 
+This is simpler than wildcarding the `instance` label:
+
+```promql
+# Old way (still works but verbose)
+up{instance=~"monitoring01.*"}
+
+# New way (preferred)
+up{hostname="monitoring01"}
+```
+
+### Filtering by Role/Tier
+
+Filter hosts by their role or tier:
+
+```promql
+up{role="dns"}                             # All DNS servers (ns1, ns2)
+node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
+up{tier="test"}                            # All test-tier VMs
+up{dns_role="primary"}                     # Primary DNS only (ns1)
+```
+
+Current host labels:
+| Host | Labels |
+|------|--------|
+| ns1 | `role=dns`, `dns_role=primary` |
+| ns2 | `role=dns`, `dns_role=secondary` |
+| nix-cache01 | `role=build-host` |
+| vault01 | `role=vault` |
+| testvm01/02/03 | `tier=test` |
+
 ---
 
 ## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
 
 ### Investigate Service Issues
 
-1. Check `up{job="<job>"}` for scrape failures
+1. Check `up{job="<job>"}` or `up{hostname="<hostname>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<hostname>", systemd_unit="<unit>.service"}`
 4. Search for errors: `{host="<hostname>"} |= "error"`
 5. Check `list_alerts` for related alerts
+6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
 
 ### After Deploying Changes
 
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
-- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
+- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
 - Log `MESSAGE` field contains the actual log content in JSON format