nixos-servers/docs/plans/prometheus-scrape-target-labels.md
Torjus Håkestad 4d724329a6
docs: add homelab-deploy plan, unify host metadata
Add plan for NATS-based deployment service (homelab-deploy) that enables
on-demand NixOS configuration updates via messaging. Features tiered
permissions (test/prod) enforced at NATS layer.

Update prometheus-scrape-target-labels plan to share the homelab.host
module for host metadata (tier, priority, role, labels) - single source
of truth for both deployment tiers and prometheus labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:10:54 +01:00

Prometheus Scrape Target Labels

Goal

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

Related: This plan shares the homelab.host module with docs/plans/nats-deploy-service.md, which uses the same metadata for deployment tier assignment.

Motivation

Some hosts have workloads that make generic alert thresholds inappropriate. For example, nix-cache01 regularly hits high CPU during builds, so the high_cpu_load alert needs a longer for: duration on that host. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale as more hosts need exceptions.

With per-host labels, alert rules can use semantic filters like {priority!="low"} instead of {instance!="nix-cache01.home.2rjus.net:9100"}.

Proposed Labels

priority

Indicates alerting importance. Hosts with priority = "low" can have relaxed thresholds or longer durations in alert rules.

Values: "high" (default), "low"

role

Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.

Values: free-form string, e.g. "dns", "build-host", "database", "monitoring"

Note on multiple roles: Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:

  • Separate boolean labels: role_build_host = "true", role_cache_server = "true" -- flexible but verbose, and requires updating the module when new roles are added.
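  • Delimited string: role = "build-host,cache-server" -- works with regex matchers ({role=~".*build-host.*"}), but the pattern is a plain substring match, so it would also catch an unrelated role such as "not-build-host", which makes it more error-prone than an exact match.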
  • Pick a primary role: role = "build-host" -- simplest, and probably sufficient since most hosts have one primary role.

Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.

dns_role

For DNS servers specifically, distinguish between primary and secondary resolvers. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.

Values: "primary", "secondary"

Example use case: The unbound_low_cache_hit_ratio alert fires on ns2 because its cache hit ratio (~62%) is lower than ns1 (~90%). This is expected behavior since ns2 gets ~100x less traffic. With a dns_role label, the alert can either exclude secondaries or use different thresholds:

# Only alert on primary DNS
unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"}

# Or use different thresholds
(unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"})
or
(unbound_cache_hit_ratio < 0.5 and on(instance) unbound_up{dns_role="secondary"})

Implementation

This implementation uses a shared homelab.host module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also docs/plans/nats-deploy-service.md, which uses the same module for deployment tier assignment.

1. Create homelab.host module

Create modules/homelab/host.nix with shared host metadata options:

{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };

    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };

    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels (e.g., dns_role = 'primary')";
    };
  };
}

Import this module in modules/homelab/default.nix.
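As a minimal sketch of that import, assuming modules/homelab/default.nix is an ordinary NixOS module with an imports list (its actual contents are not shown in this plan):

```nix
# modules/homelab/default.nix (sketch; any existing imports stay as they are)
{ ... }:
{
  imports = [
    ./host.nix  # shared host metadata: tier, priority, role, labels
  ];
}
```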

2. Update lib/monitoring.nix

  • extractHostMonitoring should also extract homelab.host values (priority, role, labels).
  • Build the combined label set from homelab.host:

# Combine structured options + free-form labels.
# host.labels is merged last, so a free-form label with the same name
# overrides the structured priority/role values.
effectiveLabels =
  (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
  // (lib.optionalAttrs (host.role != null) { role = host.role; })
  // host.labels;

  • generateNodeExporterTargets returns structured static_configs entries, grouping targets by their label sets:

# Before (flat list):
[ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" /* ... */ ]

# After (grouped by labels):
[
  { targets = [ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" /* ... */ ]; }
  { targets = [ "nix-cache01.home.2rjus.net:9100" ]; labels = { priority = "low"; role = "build-host"; }; }
]

This requires grouping hosts by their label attrset and producing one static_configs entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior).
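The grouping step described above could be sketched as follows. This is an illustration under stated assumptions, not the actual lib/monitoring.nix code: the function name groupTargetsByLabels and the input shape (a list of { target, labels } attrsets) are hypothetical. Attrsets cannot be used as grouping keys directly, so the sketch serializes each label set with builtins.toJSON, which emits attribute names in sorted order and therefore gives equal label sets the same key.

```nix
let
  # Hypothetical helper: turn a list of { target; labels; } attrsets into
  # static_configs entries, one entry per unique label set.
  groupTargetsByLabels = hosts:
    let
      # Group by the JSON form of the label set; equal attrsets serialize identically.
      grouped = builtins.groupBy (h: builtins.toJSON h.labels) hosts;

      # Build one static_configs entry from the hosts sharing a label set.
      toEntry = members:
        let
          labels = (builtins.head members).labels;
        in
        { targets = map (h: h.target) members; }
        # Omit the labels key entirely for hosts with only default values,
        # which keeps the entry shaped like the current unlabelled config.
        // (if labels == { } then { } else { inherit labels; });
    in
    map toEntry (builtins.attrValues grouped);
in
groupTargetsByLabels [
  { target = "ns1.home.2rjus.net:9100"; labels = { }; }
  { target = "ns2.home.2rjus.net:9100"; labels = { }; }
  {
    target = "nix-cache01.home.2rjus.net:9100";
    labels = { priority = "low"; role = "build-host"; };
  }
]
```

Evaluating this yields two entries: one with both ns targets and no labels key, and one with nix-cache01 and its two labels. Inside lib/monitoring.nix, lib.groupBy from nixpkgs would do the same job as builtins.groupBy.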

3. Update services/monitoring/prometheus.nix

Change the node-exporter scrape config to use the new structured output:

# Before:
static_configs = [{ targets = nodeExporterTargets; }];

# After:
static_configs = nodeExporterTargets;

4. Set metadata on hosts

Example in hosts/nix-cache01/configuration.nix:

homelab.host = {
  tier = "test";       # can be deployed by MCP (used by homelab-deploy)
  priority = "low";    # relaxed alerting thresholds
  role = "build-host";
};

Example in hosts/ns1/configuration.nix:

homelab.host = {
  tier = "prod";
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
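To make the effect of these two host configurations concrete, the sketch below applies the effectiveLabels expression from step 2 to both. It is self-contained for illustration: optionalAttrs is redefined locally with the same behaviour as lib.optionalAttrs, and mkEffectiveLabels is a hypothetical wrapper name. Note that tier is not turned into a Prometheus label by that expression.

```nix
let
  # Local stand-in for lib.optionalAttrs, so the snippet evaluates without nixpkgs.
  optionalAttrs = cond: attrs: if cond then attrs else { };

  # Hypothetical wrapper around the effectiveLabels expression from step 2.
  mkEffectiveLabels = host:
    (optionalAttrs (host.priority != "high") { priority = host.priority; })
    // (optionalAttrs (host.role != null) { role = host.role; })
    // host.labels;
in
{
  # Non-default priority and a role: both become labels.
  nix-cache01 = mkEffectiveLabels {
    priority = "low";
    role = "build-host";
    labels = { };
  };
  # evaluates to { priority = "low"; role = "build-host"; }

  # Default priority is omitted; role and the free-form label are kept.
  ns1 = mkEffectiveLabels {
    priority = "high";
    role = "dns";
    labels = { dns_role = "primary"; };
  };
  # evaluates to { dns_role = "primary"; role = "dns"; }
}
```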

5. Update alert rules

After implementing labels, review and update services/monitoring/rules.yml:

  • Replace instance-name exclusions with label-based filters (e.g. {priority!="low"} instead of {instance!="nix-cache01.home.2rjus.net:9100"}).
  • Consider whether any other rules should differentiate by priority or role.

Specifically, the high_cpu_load rule currently has a nix-cache01 exclusion that should be replaced with a priority-based filter.

6. Consider labels for generateScrapeConfigs (service targets)

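The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering. One caveat: the dns_role example above matches on unbound_up, which presumably comes from a service-level target rather than from node-exporter, so that particular use case would depend on this step.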