nixos-servers/docs/plans/prometheus-scrape-target-labels.md
2026-02-07 17:09:46 +01:00
Prometheus Scrape Target Labels

Implementation Status

| Step | Status | Notes |
|------|--------|-------|
| 1. Create homelab.host module | Complete | modules/homelab/host.nix |
| 2. Update lib/monitoring.nix | Complete | Labels extracted and propagated |
| 3. Update Prometheus config | Complete | Uses structured static_configs |
| 4. Set metadata on hosts | Complete | All relevant hosts configured |
| 5. Update alert rules | Complete | Role-based filtering implemented |
| 6. Labels for service targets | Complete | Host labels propagated to all services |
| 7. Add hostname label | Complete | All targets have hostname label for easy filtering |

Hosts with metadata configured:

  • ns1, ns2: role = "dns", labels.dns_role = "primary"/"secondary"
  • nix-cache01: role = "build-host"
  • vault01: role = "vault"
  • testvm01/02/03: tier = "test"

Implementation complete. Branch: prometheus-scrape-target-labels

Query examples:

  • {hostname="ns1"} - all metrics from ns1 (any job/port)
  • node_cpu_seconds_total{hostname="monitoring01"} - specific metric by hostname
  • up{role="dns"} - all DNS servers
  • up{tier="test"} - all test-tier hosts

Goal

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

Related: This plan shares the homelab.host module with docs/plans/completed/nats-deploy-service.md, which uses the same metadata for deployment tier assignment.

Motivation

Some hosts have workloads that make generic alert thresholds inappropriate. For example, nix-cache01 regularly hits high CPU during builds, so the high_cpu_load alert needs a longer for duration before it fires there. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.

With per-host labels, alert rules can use semantic filters like {priority!="low"} instead of {instance!="nix-cache01.home.2rjus.net:9100"}.

Proposed Labels

priority

Indicates alerting importance. Hosts with priority = "low" can have relaxed thresholds or longer durations in alert rules.

Values: "high" (default), "low"

role

Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.

Values: free-form string, e.g. "dns", "build-host", "database", "monitoring"

Note on multiple roles: Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:

  • Separate boolean labels: role_build_host = "true", role_cache_server = "true" -- flexible but verbose, and requires updating the module when new roles are added.
  • Delimited string: role = "build-host,cache-server" -- works with regex matchers ({role=~".*build-host.*"}), but regex matching is less clean and more error-prone.
  • Pick a primary role: role = "build-host" -- simplest, and probably sufficient since most hosts have one primary role.

Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.

dns_role

For DNS servers specifically, distinguish between primary and secondary resolvers. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.

Values: "primary", "secondary"

Example use case: The unbound_low_cache_hit_ratio alert fires on ns2 because its cache hit ratio (~62%) is lower than ns1 (~90%). This is expected behavior since ns2 gets ~100x less traffic. With a dns_role label, the alert can either exclude secondaries or use different thresholds:

# Only alert on primary DNS
unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"}

# Or use different thresholds
(unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"})
or
(unbound_cache_hit_ratio < 0.5 and on(instance) unbound_up{dns_role="secondary"})

Implementation

This implementation uses a shared homelab.host module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also docs/plans/completed/nats-deploy-service.md which uses the same module for deployment tier assignment.

1. Create homelab.host module

Complete. The module is in modules/homelab/host.nix.

Create modules/homelab/host.nix with shared host metadata options:

{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };

    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };

    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels (e.g., dns_role = 'primary')";
    };
  };
}

Import this module in modules/homelab/default.nix.
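The import wiring might look roughly like this (a sketch; the actual contents of modules/homelab/default.nix may differ):

# modules/homelab/default.nix (sketch -- actual file may aggregate other modules)
{ ... }:
{
  imports = [
    ./host.nix
  ];
}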

2. Update lib/monitoring.nix

Complete. Labels are now extracted and propagated.

  • extractHostMonitoring should also extract homelab.host values (priority, role, labels).
  • Build the combined label set from homelab.host:
# Combine structured options + free-form labels
effectiveLabels =
  (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
  // (lib.optionalAttrs (host.role != null) { role = host.role; })
  // host.labels;
  • generateNodeExporterTargets returns structured static_configs entries, grouping targets by their label sets:
# Before (flat list):
[ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" ... ]

# After (grouped by labels):
[
  { targets = [ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" ... ]; }
  { targets = [ "nix-cache01.home.2rjus.net:9100" ]; labels = { priority = "low"; role = "build-host"; }; }
]

This requires grouping hosts by their label attrset and producing one static_configs entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior).
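One way to sketch that grouping in Nix (illustrative only, not the actual lib/monitoring.nix code; hostsWithLabels is a hypothetical list of { target, labels } attrsets):

# Sketch: one static_configs entry per unique label set.
# builtins.toJSON gives a stable string key for each label attrset.
let
  grouped = lib.groupBy (h: builtins.toJSON h.labels) hostsWithLabels;
in
lib.mapAttrsToList
  (_: hosts:
    { targets = map (h: h.target) hosts; }
    # Omit the labels attribute entirely for the default (empty) label set,
    # preserving the current behavior for unlabeled hosts.
    // lib.optionalAttrs ((builtins.head hosts).labels != { }) {
      labels = (builtins.head hosts).labels;
    })
  grouped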

3. Update services/monitoring/prometheus.nix

Complete. Now uses structured static_configs output.

Change the node-exporter scrape config to use the new structured output:

# Before:
static_configs = [{ targets = nodeExporterTargets; }];

# After:
static_configs = nodeExporterTargets;

4. Set metadata on hosts

Complete. All relevant hosts have metadata configured. Note: The implementation filters by role rather than priority, which matches the existing nix-cache01 configuration.

Example in hosts/nix-cache01/configuration.nix:

homelab.host = {
  priority = "low";    # relaxed alerting thresholds
  role = "build-host";
};

Note: The current nix-cache01 configuration only sets role = "build-host". Adding priority = "low" remains optional now that label propagation is in place.

Example in hosts/ns1/configuration.nix:

homelab.host = {
  role = "dns";
  labels.dns_role = "primary";
};

Note: tier and priority use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.

5. Update alert rules

Complete. Updated services/monitoring/rules.yml:

  • high_cpu_load: Replaced instance!="nix-cache01..." with role!="build-host" for standard hosts (15m duration) and role="build-host" for build hosts (2h duration).
  • unbound_low_cache_hit_ratio: Added dns_role="primary" filter to only alert on the primary DNS resolver (secondary has a cold cache).
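Reconstructed from the description above, the split high_cpu_load rules might look roughly like this (a sketch; the metric, threshold, and exact expressions are illustrative -- the real definitions live in services/monitoring/rules.yml):

groups:
  - name: cpu
    rules:
      # Standard hosts: sustained load for 15m triggers the alert
      - alert: high_cpu_load
        expr: node_load5{role!="build-host"} > 8   # metric/threshold illustrative
        for: 15m
      # Build hosts routinely spike during builds: require 2h of sustained load
      - alert: high_cpu_load
        expr: node_load5{role="build-host"} > 8    # metric/threshold illustrative
        for: 2h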

6. Labels for generateScrapeConfigs (service targets)

Complete. Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using dns_role="primary" with the unbound job.