docs: add plan for prometheus scrape target labels
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
101
docs/plans/prometheus-scrape-target-labels.md
Normal file
@@ -0,0 +1,101 @@
# Prometheus Scrape Target Labels

## Goal

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

## Motivation

Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.

With per-host labels, alert rules can use semantic filters like `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`.
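
## Proposed Labels

### `priority`

Indicates alerting importance. Hosts with `priority = "low"` can have relaxed thresholds or longer durations in alert rules.

Values: `"high"` (default), `"low"`

A host that sets no `priority` label is treated as high priority: in PromQL, a negative matcher such as `priority!="low"` also selects series that do not carry the label at all, so the default does not need to be emitted explicitly.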

### `role`

Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.

Values: free-form string, e.g. `"dns"`, `"build-host"`, `"database"`, `"monitoring"`

**Note on multiple roles:** Prometheus label values are plain strings, not lists. For hosts that serve multiple roles there are a few options:

- **Separate boolean labels:** `role_build_host = "true"`, `role_cache_server = "true"` -- flexible but verbose, and requires updating the module when new roles are added.
- **Delimited string:** `role = "build-host,cache-server"` -- works with regex matchers (`{role=~".*build-host.*"}`), but the pattern is easy to get wrong: this substring match would also hit a role such as `rebuild-host`.
- **Pick a primary role:** `role = "build-host"` -- simplest, and probably sufficient since most hosts have one primary role.

Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.
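
If the boolean-label fallback is ever needed, the labels could be derived from a list of role names rather than written by hand. The following is only a sketch: `roleLabelName` and `rolesToLabels` are hypothetical helpers, not part of the existing module. It relies on the fact that Prometheus label names may not contain hyphens, so they are replaced with underscores.

```nix
# Hypothetical helper: turn a list of role names into boolean-style labels.
let
  # Prometheus label names must match [a-zA-Z_][a-zA-Z0-9_]*,
  # so "build-host" becomes "role_build_host".
  roleLabelName = role: "role_" + builtins.replaceStrings [ "-" ] [ "_" ] role;

  # Build an attrset of { role_<name> = "true"; } entries.
  rolesToLabels = roles:
    builtins.listToAttrs (map
      (role: { name = roleLabelName role; value = "true"; })
      roles);
in
rolesToLabels [ "build-host" "cache-server" ]
# evaluates to { role_build_host = "true"; role_cache_server = "true"; }
```

The result is an attribute set of strings, so it would fit an `attrsOf str` labels option, and alert rules could then match on `{role_build_host="true"}`.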

## Implementation

### 1. Add `labels` option to `homelab.monitoring`

In `modules/homelab/monitoring.nix`, add:

```nix
labels = lib.mkOption {
  type = lib.types.attrsOf lib.types.str;
  default = { };
  description = "Custom labels to attach to this host's scrape targets";
};
```

### 2. Update `lib/monitoring.nix`

- `extractHostMonitoring` should carry `labels` through in its return value.
- `generateNodeExporterTargets` currently returns a flat list of target strings. It needs to return structured `static_configs` entries instead, grouping targets by their label sets:

```nix
# Before (flat list):
[ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" /* ... */ ]

# After (grouped by labels):
[
  { targets = [ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" /* ... */ ]; }
  { targets = [ "nix-cache01.home.2rjus.net:9100" ]; labels = { priority = "low"; role = "build-host"; }; }
]
```

This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with no custom labels get grouped together with no extra labels (preserving current behavior).
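
One way the grouping step might look, as a minimal sketch rather than the actual implementation: the function name `groupTargetsByLabels` and the input shape (a list of `{ target; labels; }` records) are assumptions made for illustration. Attribute sets cannot serve as grouping keys directly, so the sketch serializes each label set with `builtins.toJSON`, which emits attribute names in sorted order and therefore gives equal label sets the same key.

```nix
# Hypothetical sketch: group scrape targets by their label set.
# Assumes each host is described as { target = "<host>:<port>"; labels = { ... }; }.
let
  groupTargetsByLabels = hosts:
    let
      # Group hosts under the JSON encoding of their labels.
      grouped = builtins.groupBy (h: builtins.toJSON h.labels) hosts;

      # Turn one group into a static_configs entry.
      toStaticConfig = key:
        let
          members = grouped.${key};
          labels = (builtins.head members).labels;
        in
        { targets = map (h: h.target) members; }
        # Omit the labels attribute entirely for hosts without custom labels.
        // (if labels == { } then { } else { inherit labels; });
    in
    map toStaticConfig (builtins.attrNames grouped);
in
groupTargetsByLabels [
  { target = "ns1.home.2rjus.net:9100"; labels = { }; }
  { target = "ns2.home.2rjus.net:9100"; labels = { }; }
  {
    target = "nix-cache01.home.2rjus.net:9100";
    labels = { priority = "low"; role = "build-host"; };
  }
]
```

Saved to a file, the expression can be checked with `nix-instantiate --eval --strict`. It yields two entries: one holding both `ns` targets without a `labels` attribute, and one for `nix-cache01` with its two labels. The order of the entries follows the sorted JSON keys, which should not matter to Prometheus.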

### 3. Update `services/monitoring/prometheus.nix`

Change the node-exporter scrape config to use the new structured output:

```nix
# Before:
static_configs = [{ targets = nodeExporterTargets; }];

# After:
static_configs = nodeExporterTargets;
```

### 4. Set labels on hosts

Example in `hosts/nix-cache01/configuration.nix` or the relevant service module:

```nix
homelab.monitoring.labels = {
  priority = "low";
  role = "build-host";
};
```

### 5. Update alert rules

After implementing labels, review and update `services/monitoring/rules.yml`:

- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.

Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.

### 6. Consider labels for `generateScrapeConfigs` (service targets)

The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.