docs: move prometheus-scrape-target-labels plan to completed
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
205
docs/plans/completed/prometheus-scrape-target-labels.md
Normal file
@@ -0,0 +1,205 @@
# Prometheus Scrape Target Labels

## Implementation Status

| Step | Status | Notes |
|------|--------|-------|
| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |

**Hosts with metadata configured:**
- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
- `nix-cache01`: `role = "build-host"`
- `vault01`: `role = "vault"`
- `testvm01/02/03`: `tier = "test"`

**Implementation complete.** Branch: `prometheus-scrape-target-labels`

**Query examples:**
- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
- `up{role="dns"}` - all DNS servers
- `up{tier="test"}` - all test-tier hosts

---

## Goal

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.

## Motivation

Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.

With per-host labels, alert rules can use semantic filters like `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`.

## Proposed Labels

### `priority`

Indicates alerting importance. Hosts with `priority = "low"` can have relaxed thresholds or longer durations in alert rules.

Values: `"high"` (default), `"low"`

### `role`

Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.

Values: free-form string, e.g. `"dns"`, `"build-host"`, `"database"`, `"monitoring"`

**Note on multiple roles:** Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:

- **Separate boolean labels:** `role_build_host = "true"`, `role_cache_server = "true"` -- flexible but verbose, and requires updating the module when new roles are added.
- **Delimited string:** `role = "build-host,cache-server"` -- works with regex matchers (`{role=~".*build-host.*"}`), but regex matching is less clean and more error-prone.
- **Pick a primary role:** `role = "build-host"` -- simplest, and probably sufficient since most hosts have one primary role.

Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.
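
If the boolean-label fallback is ever needed, the labels could be derived from a list of role names rather than written by hand. The following is only a sketch: `roles` and `roleLabels` are hypothetical names that are not part of the `homelab.host` module, and it assumes nixpkgs is reachable as `<nixpkgs>`. Hyphens are rewritten to underscores because Prometheus label names may not contain `-`.

```nix
let
  lib = import <nixpkgs/lib>;

  # Hypothetical: a host that serves two roles.
  roles = [ "build-host" "cache-server" ];

  # Map each role name to a boolean-style label,
  # e.g. "build-host" becomes role_build_host = "true".
  roleLabels = lib.listToAttrs (map
    (r: lib.nameValuePair "role_${lib.replaceStrings [ "-" ] [ "_" ] r}" "true")
    roles);
in
roleLabels
```

The resulting attribute set has the same shape as `labels`, so it could be merged into `homelab.host.labels` without changing the module's option types.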

### `dns_role`

For DNS servers specifically, distinguish between primary and secondary resolvers. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.

Values: `"primary"`, `"secondary"`

Example use case: The `unbound_low_cache_hit_ratio` alert fires on ns2 because its cache hit ratio (~62%) is lower than ns1 (~90%). This is expected behavior since ns2 gets ~100x less traffic. With a `dns_role` label, the alert can either exclude secondaries or use different thresholds:

```promql
# Only alert on primary DNS
unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"}

# Or use different thresholds
(unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"})
or
(unbound_cache_hit_ratio < 0.5 and on(instance) unbound_up{dns_role="secondary"})
```

## Implementation

This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.

### 1. Create `homelab.host` module

✅ **Complete.** The module is in `modules/homelab/host.nix`.

Create `modules/homelab/host.nix` with shared host metadata options:

```nix
{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };

    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };

    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels (e.g., dns_role = 'primary')";
    };
  };
}
```

Import this module in `modules/homelab/default.nix`.
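
A minimal sketch of that import, assuming `modules/homelab/default.nix` is an ordinary NixOS module that aggregates its siblings through `imports`. The real file may list other modules that are not shown here.

```nix
# modules/homelab/default.nix (sketch; any other existing imports are omitted)
{ ... }:
{
  imports = [
    ./host.nix # declares options.homelab.host.{tier,priority,role,labels}
  ];
}
```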

### 2. Update `lib/monitoring.nix`

✅ **Complete.** Labels are now extracted and propagated.

- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:

```nix
# Combine structured options + free-form labels
effectiveLabels =
  (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
  // (lib.optionalAttrs (host.role != null) { role = host.role; })
  // host.labels;
```

- `generateNodeExporterTargets` returns structured `static_configs` entries, grouping targets by their label sets:

```nix
# Before (flat list):
[ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" /* ... */ ]

# After (grouped by labels):
[
  { targets = [ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" /* ... */ ]; }
  { targets = [ "nix-cache01.home.2rjus.net:9100" ]; labels = { priority = "low"; role = "build-host"; }; }
]
```

This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior).
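
One way the grouping could be written is sketched below. It is not taken from `lib/monitoring.nix`: the `hosts` input shape and the `groupByLabels` name are assumptions made for illustration, and nixpkgs is assumed to be reachable as `<nixpkgs>`. Because attribute sets cannot serve as grouping keys directly, each label set is serialised with `builtins.toJSON` first.

```nix
let
  lib = import <nixpkgs/lib>;

  # Hypothetical input: one record per host, with its label set already computed.
  hosts = [
    { target = "ns1.home.2rjus.net:9100"; labels = { }; }
    { target = "ns2.home.2rjus.net:9100"; labels = { }; }
    { target = "nix-cache01.home.2rjus.net:9100"; labels = { priority = "low"; role = "build-host"; }; }
  ];

  # Produce one static_configs entry per distinct label set.
  groupByLabels = hostList:
    lib.mapAttrsToList
      (_key: members:
        let
          # All members of a group share the same labels, so take the first.
          labels = (lib.head members).labels;
        in
        { targets = map (h: h.target) members; }
        # Leave `labels` out entirely for hosts that only have defaults.
        // lib.optionalAttrs (labels != { }) { inherit labels; })
      (lib.groupBy (h: builtins.toJSON h.labels) hostList);
in
groupByLabels hosts
```

Evaluating this yields two entries, one holding both DNS targets without a `labels` attribute and one holding `nix-cache01` with its labels. Serialising to JSON works as a key here because Nix attribute sets have a deterministic attribute order.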

### 3. Update `services/monitoring/prometheus.nix`

✅ **Complete.** Now uses structured static_configs output.

Change the node-exporter scrape config to use the new structured output:

```nix
# Before:
static_configs = [{ targets = nodeExporterTargets; }];

# After:
static_configs = nodeExporterTargets;
```
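
For context, the changed line might sit in a scrape config roughly like the sketch below. `services.prometheus.scrapeConfigs` is the standard NixOS option, but the job name and the inline placeholder value for `nodeExporterTargets` are assumptions; in the real configuration the value comes from `generateNodeExporterTargets`.

```nix
{ ... }:
let
  # Placeholder in the shape produced by step 2 (assumed values).
  nodeExporterTargets = [
    { targets = [ "ns1.home.2rjus.net:9100" ]; labels = { role = "dns"; dns_role = "primary"; }; }
    { targets = [ "nix-cache01.home.2rjus.net:9100" ]; labels = { role = "build-host"; }; }
  ];
in
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "node-exporter"; # assumed job name
      # Each entry already carries its own targets and labels.
      static_configs = nodeExporterTargets;
    }
  ];
}
```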

### 4. Set metadata on hosts

✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.

Example in `hosts/nix-cache01/configuration.nix`:

```nix
homelab.host = {
  priority = "low"; # relaxed alerting thresholds
  role = "build-host";
};
```

**Note:** The current implementation sets only `role = "build-host"` on this host. Consider adding `priority = "low"` now that label propagation is in place.

Example in `hosts/ns1/configuration.nix`:

```nix
homelab.host = {
  role = "dns";
  labels.dns_role = "primary";
};
```

**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
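
By analogy, the secondary resolver would presumably look like the sketch below. The file path is assumed to mirror the ns1 layout; the values come from the status summary at the top of this plan.

```nix
# hosts/ns2/configuration.nix (path assumed by analogy with ns1)
{ ... }:
{
  homelab.host = {
    role = "dns";
    labels.dns_role = "secondary"; # cold cache, excluded from cache hit ratio alerts
  };
}
```

The test VMs need only `homelab.host.tier = "test";`, since they have no role or extra labels.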

### 5. Update alert rules

✅ **Complete.** Updated `services/monitoring/rules.yml`:

- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).

### 6. Labels for `generateScrapeConfigs` (service targets)

✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
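
As a rough illustration of that propagation, a service target could reuse the host's label set and add the `hostname` label mentioned in the status table. This is a sketch under assumptions, not the actual `generateScrapeConfigs` code: the `host` record shape, the `mkServiceTarget` helper, and the port number are all hypothetical.

```nix
let
  # Hypothetical per-host record.
  host = {
    hostname = "ns1";
    fqdn = "ns1.home.2rjus.net";
    labels = { role = "dns"; dns_role = "primary"; };
  };

  # Build one static_configs entry for a service running on `host`,
  # carrying over the host labels and adding `hostname`.
  mkServiceTarget = h: port: {
    targets = [ "${h.fqdn}:${toString port}" ];
    labels = h.labels // { hostname = h.hostname; };
  };
in
{
  job_name = "unbound";
  static_configs = [ (mkServiceTarget host 9167) ]; # port is an assumption
}
```

With labels attached this way, a query such as `up{job="unbound", dns_role="primary"}` selects the service target without referencing its instance name.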