prometheus-scrape-target-labels #30

Merged
torjus merged 5 commits from prometheus-scrape-target-labels into master 2026-02-07 16:27:38 +00:00
Showing only changes of commit b794aa89db - Show all commits

View File

@@ -185,21 +185,60 @@ Common job names:
- `home-assistant` - Home automation
- `step-ca` - Internal CA
### Instance Label Format
### Target Labels
The `instance` label uses FQDN format:
All scrape targets have these labels:
```
<hostname>.home.2rjus.net:<port>
```
**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
Example queries filtering by host:
**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
### Filtering by Host
Use the `hostname` label for easy host filtering across all jobs:
```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
{hostname="ns1"} # All metrics from ns1
node_load1{hostname="monitoring01"} # Specific metric by hostname
up{hostname="ha1"} # Check if ha1 is up
```
This is simpler than wildcarding the `instance` label:
```promql
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}
# New way (preferred)
up{hostname="monitoring01"}
```
### Filtering by Role/Tier
Filter hosts by their role or tier:
```promql
up{role="dns"} # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01)
up{tier="test"} # All test-tier VMs
up{dns_role="primary"} # Primary DNS only (ns1)
```
Current host labels:
| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| testvm01/02/03 | `tier=test` |
---
## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
### Investigate Service Issues
1. Check `up{job="<service>"}` for scrape failures
1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
### After Deploying Changes
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- Log `MESSAGE` field contains the actual log content in JSON format