From e602e8d70bb4bdc13ec15fcbaf87dbf4f004a68f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Thu, 5 Feb 2026 02:36:41 +0100
Subject: [PATCH] docs: add plan for prometheus scrape target labels

Co-Authored-By: Claude Opus 4.5
---
 docs/plans/prometheus-scrape-target-labels.md | 101 ++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100644 docs/plans/prometheus-scrape-target-labels.md

diff --git a/docs/plans/prometheus-scrape-target-labels.md b/docs/plans/prometheus-scrape-target-labels.md
new file mode 100644
index 0000000..cc754b9
--- /dev/null
+++ b/docs/plans/prometheus-scrape-target-labels.md
@@ -0,0 +1,101 @@
+# Prometheus Scrape Target Labels
+
+## Goal
+
+Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
+
+## Motivation
+
+Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.
+
+With per-host labels, alert rules can use semantic filters like `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`.
+
+## Proposed Labels
+
+### `priority`
+
+Indicates alerting importance. Hosts with `priority = "low"` can have relaxed thresholds or longer durations in alert rules.
+
+Values: `"high"` (default), `"low"`
+
+### `role`
+
+Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.
+
+Values: free-form string, e.g. `"dns"`, `"build-host"`, `"database"`, `"monitoring"`
+
+**Note on multiple roles:** Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:
+
+- **Separate boolean labels:** `role_build_host = "true"`, `role_cache_server = "true"` -- flexible but verbose, and requires updating the module when new roles are added.
+- **Delimited string:** `role = "build-host,cache-server"` -- works with regex matchers (`{role=~".*build-host.*"}`), but regex matching is less clean and more error-prone.
+- **Pick a primary role:** `role = "build-host"` -- simplest, and probably sufficient since most hosts have one primary role.
+
+Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.
+
+## Implementation
+
+### 1. Add `labels` option to `homelab.monitoring`
+
+In `modules/homelab/monitoring.nix`, add:
+
+```nix
+labels = lib.mkOption {
+  type = lib.types.attrsOf lib.types.str;
+  default = { };
+  description = "Custom labels to attach to this host's scrape targets";
+};
+```
+
+### 2. Update `lib/monitoring.nix`
+
+- `extractHostMonitoring` should carry `labels` through in its return value.
+- `generateNodeExporterTargets` currently returns a flat list of target strings. It needs to return structured `static_configs` entries instead, grouping targets by their label sets:
+
+```nix
+# Before (flat list):
+["ns1.home.2rjus.net:9100", "ns2.home.2rjus.net:9100", ...]
+
+# After (grouped by labels):
+[
+  { targets = ["ns1.home.2rjus.net:9100", "ns2.home.2rjus.net:9100", ...]; }
+  { targets = ["nix-cache01.home.2rjus.net:9100"]; labels = { priority = "low"; role = "build-host"; }; }
+]
+```
+
+This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with no custom labels get grouped together with no extra labels (preserving current behavior).
+
+### 3. Update `services/monitoring/prometheus.nix`
+
+Change the node-exporter scrape config to use the new structured output:
+
+```nix
+# Before:
+static_configs = [{ targets = nodeExporterTargets; }];
+
+# After:
+static_configs = nodeExporterTargets;
+```
+
+### 4. Set labels on hosts
+
+Example in `hosts/nix-cache01/configuration.nix` or the relevant service module:
+
+```nix
+homelab.monitoring.labels = {
+  priority = "low";
+  role = "build-host";
+};
+```
+
+### 5. Update alert rules
+
+After implementing labels, review and update `services/monitoring/rules.yml`:
+
+- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
+- Consider whether any other rules should differentiate by priority or role.
+
+Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
+
+### 6. Consider labels for `generateScrapeConfigs` (service targets)
+
+The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
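
The grouping described in step 2 of the plan (one `static_configs` entry per unique label combination) could be sketched roughly as follows. This is not part of the patch and is not taken from `lib/monitoring.nix`: the function name `groupTargetsByLabels` and the `{ target, labels }` input shape are hypothetical, chosen only for illustration. It assumes the nixpkgs `lib` is available.

```nix
# Sketch only: `groupTargetsByLabels` and the input shape are assumptions,
# not the existing API of lib/monitoring.nix.
{ lib }:
let
  # hosts: list of { target = "host:port"; labels = { ... }; }
  groupTargetsByLabels = hosts:
    let
      # An attrset cannot be used as an attribute name, so its JSON form
      # serves as the grouping key. Attribute names are serialized in sorted
      # order, so equal label sets yield equal keys.
      grouped = lib.groupBy (h: builtins.toJSON h.labels) hosts;
    in
    lib.mapAttrsToList
      (_key: group:
        { targets = map (h: h.target) group; }
        # Hosts without custom labels get no `labels` attribute at all,
        # which keeps the current scrape config shape for them.
        // lib.optionalAttrs ((builtins.head group).labels != { }) {
          labels = (builtins.head group).labels;
        })
      grouped;
in
groupTargetsByLabels
```

Called with the `ns1`, `ns2`, and `nix-cache01` hosts from the plan, this should produce entries of the "After" shape shown in step 2. Entry order follows the sorted JSON keys, so it may differ from the order in that listing; Prometheus does not depend on the order of `static_configs` entries.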