nixos-servers/docs/plans/prometheus-scrape-target-labels.md
Torjus Håkestad 787c14c7a6
docs: add dns_role label to scrape target labels plan
Add proposed dns_role label to distinguish primary/secondary DNS
resolvers. This addresses the unbound_low_cache_hit_ratio alert
firing on ns2, which has a cold cache due to low traffic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:23:34 +01:00

Prometheus Scrape Target Labels

Goal

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

Motivation

Some hosts have workloads that make generic alert thresholds inappropriate. For example, nix-cache01 regularly hits high CPU during builds, so the high_cpu_load alert needs a longer "for" duration on that host. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.

With per-host labels, alert rules can use semantic filters like {priority!="low"} instead of {instance!="nix-cache01.home.2rjus.net:9100"}.

Proposed Labels

priority

Indicates alerting importance. Hosts with priority = "low" can have relaxed thresholds or longer durations in alert rules.

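Values: "high" (default), "low"

Since the labels option described under Implementation defaults to { }, a host that sets nothing gets no priority label at all. That still behaves like "high" for negative matchers: {priority!="low"} also matches series that have no priority label. A positive matcher such as {priority="high"} would not match those hosts unless the label is set explicitly.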

role

Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.

Values: free-form string, e.g. "dns", "build-host", "database", "monitoring"

Note on multiple roles: Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:

  • Separate boolean labels: role_build_host = "true", role_cache_server = "true" -- flexible but verbose, and requires updating the module when new roles are added.
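  • Delimited string: role = "build-host,cache-server" -- works with regex matchers ({role=~".*build-host.*"}), but the substring match also hits unintended values such as "build-host-test". Matching an exact list element needs the clumsier {role=~"(.*,)?build-host(,.*)?"}.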
  • Pick a primary role: role = "build-host" -- simplest, and probably sufficient since most hosts have one primary role.

Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.

dns_role

For DNS servers specifically, distinguish between primary and secondary resolvers. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.

Values: "primary", "secondary"

Example use case: The unbound_low_cache_hit_ratio alert fires on ns2 because its cache hit ratio (~62%) is lower than that of ns1 (~90%). This is expected, since ns2 receives roughly 100x less traffic. With a dns_role label, the alert can either exclude secondaries or use different thresholds:

# Only alert on primary DNS
unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"}

# Or use different thresholds
(unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"})
or
(unbound_cache_hit_ratio < 0.5 and on(instance) unbound_up{dns_role="secondary"})

Implementation

1. Add labels option to homelab.monitoring

In modules/homelab/monitoring.nix, add:

labels = lib.mkOption {
  type = lib.types.attrsOf lib.types.str;
  default = { };
  description = "Custom labels to attach to this host's scrape targets";
};

2. Update lib/monitoring.nix

  • extractHostMonitoring should carry labels through in its return value.
  • generateNodeExporterTargets currently returns a flat list of target strings. It needs to return structured static_configs entries instead, grouping targets by their label sets:

# Before (flat list):
["ns1.home.2rjus.net:9100", "ns2.home.2rjus.net:9100", ...]

# After (grouped by labels):
[
  { targets = ["ns1.home.2rjus.net:9100", "ns2.home.2rjus.net:9100", ...]; }
  { targets = ["nix-cache01.home.2rjus.net:9100"]; labels = { priority = "low"; role = "build-host"; }; }
]

This requires grouping hosts by their label attrset and producing one static_configs entry per unique label combination. Hosts with no custom labels get grouped together with no extra labels (preserving current behavior).
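The grouping step could look roughly like the sketch below. It is not the actual lib/monitoring.nix code: the function name groupTargetsByLabels and the input shape (a list of { target, labels } attrsets) are assumptions made for illustration. Because attribute sets cannot serve as grouping keys directly, the sketch serializes each label set with builtins.toJSON, which emits attribute names in sorted order, so equal label sets produce equal keys. builtins.groupBy needs Nix 2.5 or newer.

```nix
let
  # Hypothetical helper: turns a list of
  #   { target = "host:9100"; labels = { ... }; }
  # into one static_configs entry per unique label set.
  groupTargetsByLabels = hosts:
    let
      # Group hosts by the JSON form of their label attrset.
      grouped = builtins.groupBy (h: builtins.toJSON h.labels) hosts;

      # Build a single static_configs entry from one group of hosts.
      toStaticConfig = group:
        let
          # All hosts in a group share the same labels; take the first.
          labels = (builtins.head group).labels;
        in
        { targets = map (h: h.target) group; }
        # Omit the labels key for unlabelled hosts, preserving current behavior.
        // (if labels == { } then { } else { inherit labels; });
    in
    map toStaticConfig (builtins.attrValues grouped);
in
# Example input mirroring the hosts used in this plan.
groupTargetsByLabels [
  { target = "ns1.home.2rjus.net:9100"; labels = { }; }
  { target = "ns2.home.2rjus.net:9100"; labels = { }; }
  {
    target = "nix-cache01.home.2rjus.net:9100";
    labels = { priority = "low"; role = "build-host"; };
  }
]
```

Saved to a file, the expression can be checked with nix-instantiate --eval --strict. The order of the resulting entries follows the serialized keys and does not matter to Prometheus.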

3. Update services/monitoring/prometheus.nix

Change the node-exporter scrape config to use the new structured output:

# Before:
static_configs = [{ targets = nodeExporterTargets; }];

# After:
static_configs = nodeExporterTargets;

4. Set labels on hosts

Example in hosts/nix-cache01/configuration.nix or the relevant service module:

homelab.monitoring.labels = {
  priority = "low";
  role = "build-host";
};
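
For the DNS case, the secondary resolver might be labelled as in the following sketch. The file location (for example hosts/ns2/configuration.nix) is an assumption by analogy with the nix-cache01 example. The label names and values are the ones proposed above.

```nix
{
  # Assumed to live in the ns2 host configuration.
  homelab.monitoring.labels = {
    role = "dns";
    dns_role = "secondary";
  };
}
```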

5. Update alert rules

After implementing labels, review and update services/monitoring/rules.yml:

  • Replace instance-name exclusions with label-based filters (e.g. {priority!="low"} instead of {instance!="nix-cache01.home.2rjus.net:9100"}).
  • Consider whether any other rules should differentiate by priority or role.

Specifically, the high_cpu_load rule currently has a nix-cache01 exclusion that should be replaced with a priority-based filter.

6. Consider labels for generateScrapeConfigs (service targets)

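The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering. One exception to keep in mind: the dns_role example above filters on unbound_up, which presumably comes from a service-level unbound target rather than from node-exporter, so that use case would depend on this step.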