skills: update observability with new target labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Document the new hostname and host metadata labels available on all Prometheus scrape targets: - hostname: short hostname for easy filtering - role: host role (dns, build-host, vault) - tier: deployment tier (test for test VMs) - dns_role: primary/secondary for DNS servers Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -185,21 +185,60 @@ Common job names:
|
||||
- `home-assistant` - Home automation
|
||||
- `step-ca` - Internal CA
|
||||
|
||||
### Instance Label Format
|
||||
### Target Labels
|
||||
|
||||
The `instance` label uses FQDN format:
|
||||
All scrape targets have these labels:
|
||||
|
||||
```
|
||||
<hostname>.home.2rjus.net:<port>
|
||||
```
|
||||
**Standard labels:**
|
||||
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
|
||||
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
|
||||
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
|
||||
|
||||
Example queries filtering by host:
|
||||
**Host metadata labels** (when configured in `homelab.host`):
|
||||
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
|
||||
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
|
||||
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
|
||||
|
||||
### Filtering by Host
|
||||
|
||||
Use the `hostname` label for easy host filtering across all jobs:
|
||||
|
||||
```promql
|
||||
up{instance=~"monitoring01.*"}
|
||||
node_load1{instance=~"ns1.*"}
|
||||
{hostname="ns1"} # All metrics from ns1
|
||||
node_load1{hostname="monitoring01"} # Specific metric by hostname
|
||||
up{hostname="ha1"} # Check if ha1 is up
|
||||
```
|
||||
|
||||
This is simpler than wildcarding the `instance` label:
|
||||
|
||||
```promql
|
||||
# Old way (still works but verbose)
|
||||
up{instance=~"monitoring01.*"}
|
||||
|
||||
# New way (preferred)
|
||||
up{hostname="monitoring01"}
|
||||
```
|
||||
|
||||
### Filtering by Role/Tier
|
||||
|
||||
Filter hosts by their role or tier:
|
||||
|
||||
```promql
|
||||
up{role="dns"} # All DNS servers (ns1, ns2)
|
||||
node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01)
|
||||
up{tier="test"} # All test-tier VMs
|
||||
up{dns_role="primary"} # Primary DNS only (ns1)
|
||||
```
|
||||
|
||||
Current host labels:
|
||||
| Host | Labels |
|
||||
|------|--------|
|
||||
| ns1 | `role=dns`, `dns_role=primary` |
|
||||
| ns2 | `role=dns`, `dns_role=secondary` |
|
||||
| nix-cache01 | `role=build-host` |
|
||||
| vault01 | `role=vault` |
|
||||
| testvm01/02/03 | `tier=test` |
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting Workflows
|
||||
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
|
||||
|
||||
### Investigate Service Issues
|
||||
|
||||
1. Check `up{job="<service>"}` for scrape failures
|
||||
1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
|
||||
2. Use `list_targets` to see target health details
|
||||
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
|
||||
4. Search for errors: `{host="<host>"} |= "error"`
|
||||
5. Check `list_alerts` for related alerts
|
||||
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
|
||||
|
||||
### After Deploying Changes
|
||||
|
||||
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
|
||||
- Default scrape interval is 15s for most metrics targets
|
||||
- Default log lookback is 1h - use `start` parameter for older logs
|
||||
- Use `rate()` for counter metrics, direct queries for gauges
|
||||
- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
|
||||
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
|
||||
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
|
||||
- Log `MESSAGE` field contains the actual log content in JSON format
|
||||
|
||||
Reference in New Issue
Block a user