skills: update observability with new target labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Document the new hostname and host metadata labels available on all Prometheus scrape targets:

- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -185,21 +185,60 @@ Common job names:
 - `home-assistant` - Home automation
 - `step-ca` - Internal CA
 
-### Instance Label Format
+### Target Labels
 
-The `instance` label uses FQDN format:
+All scrape targets have these labels:
 
-```
-<hostname>.home.2rjus.net:<port>
-```
+**Standard labels:**
+- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
+- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
 
-Example queries filtering by host:
+**Host metadata labels** (when configured in `homelab.host`):
+- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
+- `tier` - Deployment tier (`test` for test VMs, absent for prod)
+- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
 
+### Filtering by Host
+
+Use the `hostname` label for easy host filtering across all jobs:
+
 ```promql
-up{instance=~"monitoring01.*"}
-node_load1{instance=~"ns1.*"}
+{hostname="ns1"} # All metrics from ns1
+node_load1{hostname="monitoring01"} # Specific metric by hostname
+up{hostname="ha1"} # Check if ha1 is up
 ```
 
+This is simpler than wildcarding the `instance` label:
+
+```promql
+# Old way (still works but verbose)
+up{instance=~"monitoring01.*"}
+
+# New way (preferred)
+up{hostname="monitoring01"}
+```
+
+### Filtering by Role/Tier
+
+Filter hosts by their role or tier:
+
+```promql
+up{role="dns"} # All DNS servers (ns1, ns2)
+node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01)
+up{tier="test"} # All test-tier VMs
+up{dns_role="primary"} # Primary DNS only (ns1)
+```
+
+Current host labels:
+| Host | Labels |
+|------|--------|
+| ns1 | `role=dns`, `dns_role=primary` |
+| ns2 | `role=dns`, `dns_role=secondary` |
+| nix-cache01 | `role=build-host` |
+| vault01 | `role=vault` |
+| testvm01/02/03 | `tier=test` |
+
 ---
 
 ## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
 
 ### Investigate Service Issues
 
-1. Check `up{job="<service>"}` for scrape failures
+1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
 4. Search for errors: `{host="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
+6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
 
 ### After Deploying Changes
 
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
-- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
+- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
 - Log `MESSAGE` field contains the actual log content in JSON format
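The labels documented in this change can also be combined with ordinary PromQL matchers and aggregations. The queries below are a rough sketch rather than part of the commit. They assume the labels are attached exactly as described above, that `tier` really is absent on production hosts, and that `node_cpu_seconds_total` carries node_exporter's usual `mode` label.

```promql
# Production targets only: a missing label matches the empty string,
# so this selects every target that has no `tier` label
up{tier=""}

# DNS servers whose scrape is currently failing
up{role="dns"} == 0

# Number of healthy targets per role
# (targets without a role are grouped under an empty `role`)
sum by (role) (up)

# Non-idle CPU rate over 5 minutes on build hosts, one series per host
sum by (hostname) (rate(node_cpu_seconds_total{role="build-host", mode!="idle"}[5m]))
```

Because `tier` is only set on test VMs, `up{tier!="test"}` should return the same production set as `up{tier=""}` as long as `test` is the only tier value in use.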