Prometheus Scrape Target Labels
Implementation Status
| Step | Status | Notes |
|---|---|---|
| 1. Create homelab.host module | ✅ Complete | modules/homelab/host.nix |
| 2. Update lib/monitoring.nix | ✅ Complete | Labels extracted and propagated |
| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
| 7. Add hostname label | ✅ Complete | All targets have hostname label for easy filtering |
Hosts with metadata configured:
- ns1, ns2: `role = "dns"`, `labels.dns_role = "primary"` / `"secondary"`
- nix-cache01: `role = "build-host"`
- vault01: `role = "vault"`
- testvm01/02/03: `tier = "test"`
Implementation complete. Branch: prometheus-scrape-target-labels
Query examples:
- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
- `up{role="dns"}` - all DNS servers
- `up{tier="test"}` - all test-tier hosts
Goal
Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
Related: This plan shares the homelab.host module with docs/plans/completed/nats-deploy-service.md, which uses the same metadata for deployment tier assignment.
Motivation
Some hosts have workloads that make generic alert thresholds inappropriate. For example, nix-cache01 regularly hits high CPU during builds, requiring a longer `for:` duration on the `high_cpu_load` alert. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.
With per-host labels, alert rules can use semantic filters like `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`.
Proposed Labels
priority
Indicates alerting importance. Hosts with priority = "low" can have relaxed thresholds or longer durations in alert rules.
Values: "high" (default), "low"
role
Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.
Values: free-form string, e.g. "dns", "build-host", "database", "monitoring"
Note on multiple roles: Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:
- Separate boolean labels: `role_build_host = "true"`, `role_cache_server = "true"`. Flexible but verbose, and requires updating the module when new roles are added.
- Delimited string: `role = "build-host,cache-server"`. Works with regex matchers (`{role=~".*build-host.*"}`), but regex matching is less clean and more error-prone.
- Pick a primary role: `role = "build-host"`. Simplest, and probably sufficient since most hosts have one primary role.
Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.
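If multi-role matching later becomes necessary, the boolean-label variant can already be expressed through the existing free-form `labels` option without any module change. A hypothetical sketch (the `role_cache_server` label name is illustrative, not from any real host config):

```nix
# Hypothetical multi-role host: primary role as a plain string, secondary
# role encoded as a boolean label via the free-form labels option.
homelab.host = {
  role = "build-host";                # primary role, per the recommendation
  labels.role_cache_server = "true";  # secondary role as a boolean label
};
```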
dns_role
For DNS servers specifically, distinguish between primary and secondary resolvers. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.
Values: "primary", "secondary"
Example use case: The unbound_low_cache_hit_ratio alert fires on ns2 because its cache hit ratio (~62%) is lower than ns1 (~90%). This is expected behavior since ns2 gets ~100x less traffic. With a dns_role label, the alert can either exclude secondaries or use different thresholds:
```promql
# Only alert on primary DNS
unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"}

# Or use different thresholds
(unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"})
or
(unbound_cache_hit_ratio < 0.5 and on(instance) unbound_up{dns_role="secondary"})
```
Implementation
This implementation uses a shared homelab.host module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also docs/plans/completed/nats-deploy-service.md which uses the same module for deployment tier assignment.
1. Create homelab.host module
✅ Complete. The module is in modules/homelab/host.nix.
Create modules/homelab/host.nix with shared host metadata options:
```nix
{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };

    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };

    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels (e.g., dns_role = 'primary')";
    };
  };
}
```
Import this module in modules/homelab/default.nix.
2. Update lib/monitoring.nix
✅ Complete. Labels are now extracted and propagated.
- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:

```nix
# Combine structured options + free-form labels
effectiveLabels =
  (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
  // (lib.optionalAttrs (host.role != null) { role = host.role; })
  // host.labels;
```
- `generateNodeExporterTargets` returns structured `static_configs` entries, grouping targets by their label sets:

```nix
# Before (flat list):
["ns1.home.2rjus.net:9100", "ns2.home.2rjus.net:9100", ...]

# After (grouped by labels):
[
  { targets = ["ns1.home.2rjus.net:9100", "ns2.home.2rjus.net:9100", ...]; }
  { targets = ["nix-cache01.home.2rjus.net:9100"]; labels = { priority = "low"; role = "build-host"; }; }
]
```
This requires grouping hosts by their label attrset and producing one static_configs entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior).
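The grouping step can be sketched in Nix by keying each host on a serialized form of its label attrset. This is a sketch under assumptions, not the actual lib/monitoring.nix code: `hosts` is assumed to be a list of `{ target, labels }` pairs, and the name `generateNodeExporterTargets` is reused from the plan.

```nix
# Sketch: one static_configs entry per unique label combination.
# lib.groupBy needs string keys, so the label attrset is serialized
# with builtins.toJSON (deterministic: Nix attrs iterate in sorted order).
generateNodeExporterTargets = hosts:
  lib.mapAttrsToList
    (_labelsJson: group:
      { targets = map (h: h.target) group; }
      # Hosts with empty labels (all defaults) get a plain entry,
      # preserving the current behavior.
      // lib.optionalAttrs ((builtins.head group).labels != { }) {
        labels = (builtins.head group).labels;
      })
    (lib.groupBy (h: builtins.toJSON h.labels) hosts);
```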
3. Update services/monitoring/prometheus.nix
✅ Complete. Now uses structured static_configs output.
Change the node-exporter scrape config to use the new structured output:
```nix
# Before:
static_configs = [{ targets = nodeExporterTargets; }];

# After:
static_configs = nodeExporterTargets;
```
4. Set metadata on hosts
✅ Complete. All relevant hosts have metadata configured. Note: The implementation filters by role rather than priority, which matches the existing nix-cache01 configuration.
Example in hosts/nix-cache01/configuration.nix:
```nix
homelab.host = {
  priority = "low"; # relaxed alerting thresholds
  role = "build-host";
};
```
Note: the current configuration only sets role = "build-host"; adding priority = "low" as shown above remains an open follow-up now that label propagation is in place.
Example in hosts/ns1/configuration.nix:
```nix
homelab.host = {
  role = "dns";
  labels.dns_role = "primary";
};
```
Note: tier and priority use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
5. Update alert rules
✅ Complete. Updated services/monitoring/rules.yml:
- `high_cpu_load`: replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
- `unbound_low_cache_hit_ratio`: added a `dns_role="primary"` filter to only alert on the primary DNS resolver (the secondary has a cold cache).
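As an illustration, the role-based split might look like this in rules.yml. The metric name and threshold below are placeholders; only the role filters and the 15m/2h durations come from this plan:

```yaml
# Sketch only: node_load15 and the > 8 threshold are illustrative, not the
# actual rules.yml contents. Role filters and durations match the plan.
- alert: high_cpu_load
  expr: node_load15{role!="build-host"} > 8
  for: 15m
- alert: high_cpu_load_build_hosts
  expr: node_load15{role="build-host"} > 8
  for: 2h
```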
6. Labels for generateScrapeConfigs (service targets)
✅ Complete. Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using dns_role="primary" with the unbound job.
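A minimal sketch of what this propagation might look like inside lib/monitoring.nix, assuming a per-host attrset carrying an FQDN and the combined label set; `mkServiceTarget`, `host.fqdn`, and `host.effectiveLabels` are illustrative names, not necessarily the real ones:

```nix
# Hypothetical helper: build one static_configs entry for a service port,
# attaching the host's effective labels plus the hostname label from step 7.
mkServiceTarget = host: port: {
  targets = [ "${host.fqdn}:${toString port}" ];
  labels = host.effectiveLabels // { hostname = host.name; };
};
```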