# Prometheus Scrape Target Labels

## Implementation Status

| Step | Status | Notes |
|------|--------|-------|
| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
| 2. Update `lib/monitoring.nix` | ❌ Not started | Labels not extracted or propagated |
| 3. Update Prometheus config | ❌ Not started | Still uses flat target list |
| 4. Set metadata on hosts | ⚠️ Partial | Some hosts configured, see below |
| 5. Update alert rules | ❌ Not started | |
| 6. Labels for service targets | ❌ Not started | Optional |

**Hosts with metadata configured:**

- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
- `nix-cache01`: `role = "build-host"` (missing `priority = "low"` from plan)
- `vault01`: `role = "vault"`
- `jump`: `role = "bastion"`
- `template`, `template2`, `testvm*`: `tier` and `priority` set

**Key gap:** The `homelab.host` module exists and some hosts use it, but `lib/monitoring.nix` does not extract these values, so they are not propagated to Prometheus scrape targets.

---

## Goal

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.

## Motivation

Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.

With per-host labels, alert rules can use semantic filters like `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`.

## Proposed Labels

### `priority`

Indicates alerting importance.
Hosts with `priority = "low"` can have relaxed thresholds or longer durations in alert rules.

Values: `"high"` (default), `"low"`

### `role`

Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.

Values: free-form string, e.g. `"dns"`, `"build-host"`, `"database"`, `"monitoring"`

**Note on multiple roles:** Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:

- **Separate boolean labels:** `role_build_host = "true"`, `role_cache_server = "true"` -- flexible but verbose, and requires updating the module when new roles are added.
- **Delimited string:** `role = "build-host,cache-server"` -- works with regex matchers (`{role=~".*build-host.*"}`), but regex matching is less clean and more error-prone.
- **Pick a primary role:** `role = "build-host"` -- simplest, and probably sufficient since most hosts have one primary role.

Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.

### `dns_role`

For DNS servers specifically, distinguishes between the primary and secondary resolver. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.

Values: `"primary"`, `"secondary"`

Example use case: the `unbound_low_cache_hit_ratio` alert fires on ns2 because its cache hit ratio (~62%) is lower than ns1's (~90%). This is expected behavior, since ns2 gets ~100x less traffic.
With a `dns_role` label, the alert can either exclude secondaries or use different thresholds:

```promql
# Only alert on primary DNS
unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"}

# Or use different thresholds per role
(unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"})
or
(unbound_cache_hit_ratio < 0.5 and on(instance) unbound_up{dns_role="secondary"})
```

## Implementation

This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md`, which uses the same module for deployment tier assignment.

### 1. Create `homelab.host` module

✅ **Complete.** The module is in `modules/homelab/host.nix` and defines the shared host metadata options:

```nix
{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };

    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };

    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels (e.g., dns_role = 'primary')";
    };
  };
}
```

The module is imported in `modules/homelab/default.nix`.

### 2. Update `lib/monitoring.nix`

❌ **Not started.** The current implementation does not extract `homelab.host` values.

- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:

  ```nix
  # Combine structured options + free-form labels
  effectiveLabels =
    (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
    // (lib.optionalAttrs (host.role != null) { role = host.role; })
    // host.labels;
  ```

- `generateNodeExporterTargets` should return structured `static_configs` entries, grouping targets by their label sets:

  ```nix
  # Before (flat list):
  # [ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" ... ]

  # After (grouped by labels):
  [
    { targets = [ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" ... ]; }
    {
      targets = [ "nix-cache01.home.2rjus.net:9100" ];
      labels = { priority = "low"; role = "build-host"; };
    }
  ]
  ```

This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with default values (priority=high, no role, no labels) are grouped together with no extra labels, preserving current behavior.

### 3. Update `services/monitoring/prometheus.nix`

❌ **Not started.** Still uses a flat target list (`static_configs = [{ targets = nodeExporterTargets; }]`).

Change the node-exporter scrape config to use the new structured output:

```nix
# Before:
static_configs = [{ targets = nodeExporterTargets; }];

# After:
static_configs = nodeExporterTargets;
```

### 4. Set metadata on hosts

⚠️ **Partial.** Some hosts are configured (see the status table above).

Example in `hosts/nix-cache01/configuration.nix`:

```nix
homelab.host = {
  priority = "low"; # relaxed alerting thresholds
  role = "build-host";
};
```

**Note:** The current configuration only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.
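For reference, once steps 2-4 land, the generated Prometheus config for this host would contain something like the following sketch (the `node` job name and surrounding layout are assumptions, not taken from this repo):

```yaml
# Sketch of the generated prometheus.yml fragment (illustrative;
# the job name "node" is an assumption)
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["nix-cache01.home.2rjus.net:9100"]
        labels:
          priority: "low"
          role: "build-host"
```

Because the labels are attached in `static_configs`, every series scraped from this target carries them, which is what makes the label-based alert filters in step 5 possible.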
Example in `hosts/ns1/configuration.nix`:

```nix
homelab.host = {
  role = "dns";
  labels.dns_role = "primary";
};
```

**Note:** `tier` and `priority` use their defaults (`"prod"` and `"high"`), which is the intended behavior. The current ns1/ns2 configurations match this pattern.

### 5. Update alert rules

❌ **Not started.** Requires steps 2-3 to be completed first.

After implementing labels, review and update `services/monitoring/rules.yml`:

- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.

Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.

### 6. Consider labels for `generateScrapeConfigs` (service targets)

❌ **Not started.** Optional enhancement.

The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred: service targets are more specialized and less likely to need generic label-based filtering.
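As a sketch of the step-5 change in `services/monitoring/rules.yml`: the actual `high_cpu_load` expression is not reproduced in this plan, so the metric, threshold, duration, and group name below are placeholders; only the swap from an instance-name exclusion to a `priority` matcher is the point.

```yaml
# Illustrative rules.yml fragment; node_load5, the threshold, and the
# group name are placeholders -- only the label matcher change matters.
groups:
  - name: node-alerts
    rules:
      - alert: high_cpu_load
        # Before: expr: node_load5{instance!="nix-cache01.home.2rjus.net:9100"} > 8
        expr: node_load5{priority!="low"} > 8
        for: 10m
```

With this shape, adding another low-priority host is a one-line `homelab.host` change on that host, with no edits to the alert rules.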