prometheus-scrape-target-labels #30
@@ -5,20 +5,19 @@
 | Step | Status | Notes |
 |------|--------|-------|
 | 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
-| 2. Update `lib/monitoring.nix` | ❌ Not started | Labels not extracted or propagated |
-| 3. Update Prometheus config | ❌ Not started | Still uses flat target list |
-| 4. Set metadata on hosts | ⚠️ Partial | Some hosts configured, see below |
-| 5. Update alert rules | ❌ Not started | |
-| 6. Labels for service targets | ❌ Not started | Optional |
+| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
+| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
+| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
+| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
+| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
 
 **Hosts with metadata configured:**
 - `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
-- `nix-cache01`: `role = "build-host"` (missing `priority = "low"` from plan)
+- `nix-cache01`: `role = "build-host"`
 - `vault01`: `role = "vault"`
-- `jump`: `role = "bastion"`
-- `template`, `template2`, `testvm*`: `tier` and `priority` set
+- `testvm01/02/03`: `tier = "test"`
 
-**Key gap:** The `homelab.host` module exists and some hosts use it, but `lib/monitoring.nix` does not extract these values—they are not propagated to Prometheus scrape targets.
+**Implementation complete.** Branch: `prometheus-scrape-target-labels`
 
 ---
 
@@ -119,7 +118,7 @@ Import this module in `modules/homelab/default.nix`.
 
 ### 2. Update `lib/monitoring.nix`
 
-❌ **Not started.** The current implementation does not extract `homelab.host` values.
+✅ **Complete.** Labels are now extracted and propagated.
 
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:
@@ -149,7 +148,7 @@ This requires grouping hosts by their label attrset and producing one `static_co
 
 ### 3. Update `services/monitoring/prometheus.nix`
 
-❌ **Not started.** Still uses flat target list (`static_configs = [{ targets = nodeExporterTargets; }]`).
+✅ **Complete.** Now uses structured static_configs output.
 
 Change the node-exporter scrape config to use the new structured output:
 
@@ -163,7 +162,7 @@ static_configs = nodeExporterTargets;
 
 ### 4. Set metadata on hosts
 
-⚠️ **Partial.** Some hosts configured (see status table above). Current `nix-cache01` only has `role`, missing the `priority = "low"` suggested below.
+✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
 
 Example in `hosts/nix-cache01/configuration.nix`:
 
@@ -189,17 +188,11 @@ homelab.host = {
 
 ### 5. Update alert rules
 
-❌ **Not started.** Requires steps 2-3 to be completed first.
+✅ **Complete.** Updated `services/monitoring/rules.yml`:
 
-After implementing labels, review and update `services/monitoring/rules.yml`:
+- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
+- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).
 
-- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
-- Consider whether any other rules should differentiate by priority or role.
+### 6. Labels for `generateScrapeConfigs` (service targets)
 
-Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
-
-### 6. Consider labels for `generateScrapeConfigs` (service targets)
-
-❌ **Not started.** Optional enhancement.
-
-The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
+✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
@@ -21,6 +21,7 @@ let
       cfg = hostConfig.config;
       monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
       dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
+      hostConfig' = (cfg.homelab or { }).host or { };
       hostname = cfg.networking.hostName;
       networks = cfg.systemd.network.networks or { };
 
@@ -49,20 +50,64 @@ let
       inherit hostname;
       ip = extractIP firstAddress;
       scrapeTargets = monConfig.scrapeTargets or [ ];
+      # Host metadata for label propagation
+      tier = hostConfig'.tier or "prod";
+      priority = hostConfig'.priority or "high";
+      role = hostConfig'.role or null;
+      labels = hostConfig'.labels or { };
     };
 
+  # Build effective labels for a host (only include non-default values)
+  buildEffectiveLabels = host:
+    (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
+    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
+    // (lib.optionalAttrs (host.role != null) { role = host.role; })
+    // host.labels;
+
   # Generate node-exporter targets from all flake hosts
+  # Returns a list of static_configs entries with labels
   generateNodeExporterTargets = self: externalTargets:
     let
       nixosConfigs = self.nixosConfigurations or { };
       hostList = lib.filter (x: x != null) (
         lib.mapAttrsToList extractHostMonitoring nixosConfigs
       );
-      flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
+
+      # Build target entries with labels for each host
+      flakeEntries = map
+        (host: {
+          target = "${host.hostname}.home.2rjus.net:9100";
+          labels = buildEffectiveLabels host;
+        })
+        hostList;
+
+      # External targets have no labels
+      externalEntries = map
+        (target: { inherit target; labels = { }; })
+        (externalTargets.nodeExporter or [ ]);
+
+      allEntries = flakeEntries ++ externalEntries;
+
+      # Group entries by their label set for efficient static_configs
+      # Convert labels attrset to a string key for grouping
+      labelKey = entry: builtins.toJSON entry.labels;
+      grouped = lib.groupBy labelKey allEntries;
+
+      # Convert groups to static_configs format
+      staticConfigs = lib.mapAttrsToList
+        (key: entries:
+          let
+            labels = (builtins.head entries).labels;
+          in
+          { targets = map (e: e.target) entries; }
+          // (lib.optionalAttrs (labels != { }) { inherit labels; })
+        )
+        grouped;
     in
-    flakeTargets ++ (externalTargets.nodeExporter or [ ]);
+    staticConfigs;
 
   # Generate scrape configs from all flake hosts and external targets
+  # Host labels are propagated to service targets for semantic alert filtering
   generateScrapeConfigs = self: externalTargets:
     let
       nixosConfigs = self.nixosConfigurations or { };
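Review note: the `buildEffectiveLabels` helper added in the hunk above can be evaluated on its own to see which labels a host ends up with. The following is a minimal sketch, not the repository's code: `optionalAttrs` is a local stand-in for `lib.optionalAttrs` so that no nixpkgs import is needed, and the `defaults`, `dnsHost`, and `testHost` records are hypothetical hosts shaped like the output of `extractHostMonitoring`.

```nix
# Standalone sketch; evaluate with: nix-instantiate --eval --strict sketch.nix
let
  # Local stand-in for lib.optionalAttrs (assumption: same semantics).
  optionalAttrs = cond: attrs: if cond then attrs else { };

  # Same shape as the helper in the diff: defaults are dropped,
  # free-form labels are merged last.
  buildEffectiveLabels = host:
    (optionalAttrs (host.tier != "prod") { tier = host.tier; })
    // (optionalAttrs (host.priority != "high") { priority = host.priority; })
    // (optionalAttrs (host.role != null) { role = host.role; })
    // host.labels;

  # Hypothetical host records using the defaults from the diff.
  defaults = { tier = "prod"; priority = "high"; role = null; labels = { }; };
  dnsHost = defaults // { role = "dns"; labels = { dns_role = "primary"; }; };
  testHost = defaults // { tier = "test"; };
in
{
  plain = buildEffectiveLabels defaults; # no labels at all
  dns = buildEffectiveLabels dnsHost; # role and dns_role
  test = buildEffectiveLabels testHost; # tier only
}
```

Because `host.labels` is merged last with `//`, a free-form label named `tier`, `priority`, or `role` would override the value from the dedicated option.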
@@ -70,13 +115,14 @@ let
         lib.mapAttrsToList extractHostMonitoring nixosConfigs
       );
 
-      # Collect all scrapeTargets from all hosts, grouped by job_name
+      # Collect all scrapeTargets from all hosts, including host labels
       allTargets = lib.flatten (map
         (host:
           map
             (target: {
               inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
               hostname = host.hostname;
+              hostLabels = buildEffectiveLabels host;
             })
             host.scrapeTargets
         )
@@ -87,22 +133,32 @@ let
       grouped = lib.groupBy (t: t.job_name) allTargets;
 
       # Generate a scrape config for each job
+      # Within each job, group targets by their host labels for efficient static_configs
       flakeScrapeConfigs = lib.mapAttrsToList
         (jobName: targets:
           let
             first = builtins.head targets;
-            targetAddrs = map
-              (t:
+
+            # Group targets within this job by their host labels
+            labelKey = t: builtins.toJSON t.hostLabels;
+            groupedByLabels = lib.groupBy labelKey targets;
+
+            staticConfigs = lib.mapAttrsToList
+              (key: labelTargets:
                 let
-                  portStr = toString t.port;
+                  labels = (builtins.head labelTargets).hostLabels;
+                  targetAddrs = map
+                    (t: "${t.hostname}.home.2rjus.net:${toString t.port}")
+                    labelTargets;
                 in
-                "${t.hostname}.home.2rjus.net:${portStr}")
-              targets;
+                { targets = targetAddrs; }
+                // (lib.optionalAttrs (labels != { }) { inherit labels; })
+              )
+              groupedByLabels;
+
             config = {
               job_name = jobName;
-              static_configs = [{
-                targets = targetAddrs;
-              }];
+              static_configs = staticConfigs;
             }
             // (lib.optionalAttrs (first.metrics_path != "/metrics") {
               metrics_path = first.metrics_path;
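Review note: both `generateNodeExporterTargets` and `generateScrapeConfigs` group targets by serializing the label attrset with `builtins.toJSON` and using the string as the grouping key. The sketch below reproduces that idea with Nix builtins only. It assumes a Nix release that provides `builtins.groupBy` (the diff itself uses `lib.groupBy` from nixpkgs), and the host names and labels are made up for illustration.

```nix
# Standalone sketch of grouping targets by label set.
let
  # Hypothetical entries, shaped like allEntries in the diff.
  entries = [
    { target = "ns1.example.net:9100"; labels = { role = "dns"; dns_role = "primary"; }; }
    { target = "ns2.example.net:9100"; labels = { role = "dns"; dns_role = "secondary"; }; }
    { target = "web1.example.net:9100"; labels = { }; }
    { target = "web2.example.net:9100"; labels = { }; }
  ];

  # Attribute sets cannot serve as grouping keys directly, so serialize them.
  # Attribute names are emitted in sorted order, which makes the key stable.
  labelKey = entry: builtins.toJSON entry.labels;

  grouped = builtins.groupBy labelKey entries;

  # One static_configs entry per distinct label set; omit empty label sets.
  toStaticConfig = group:
    let
      labels = (builtins.head group).labels;
    in
    { targets = map (e: e.target) group; }
    // (if labels != { } then { inherit labels; } else { });
in
map toStaticConfig (builtins.attrValues grouped)
```

With these inputs the result has three entries: one per DNS host, because their `dns_role` differs, and a single unlabeled entry shared by the two remaining hosts.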
@@ -121,22 +121,20 @@ in
 
     scrapeConfigs = [
       # Auto-generated node-exporter targets from flake hosts + external
+      # Each static_config entry may have labels from homelab.host metadata
       {
         job_name = "node-exporter";
-        static_configs = [
-          {
-            targets = nodeExporterTargets;
-          }
-        ];
+        static_configs = nodeExporterTargets;
       }
       # Systemd exporter on all hosts (same targets, different port)
+      # Preserves the same label grouping as node-exporter
       {
         job_name = "systemd-exporter";
-        static_configs = [
-          {
-            targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
-          }
-        ];
+        static_configs = map
+          (cfg: cfg // {
+            targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+          })
+          nodeExporterTargets;
       }
       # Local monitoring services (not auto-generated)
       {
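Review note: the systemd-exporter job now maps over whole `static_configs` entries instead of a flat target list, so any `labels` attribute is carried over unchanged while only the port is rewritten. A minimal sketch of that transformation follows; the `nodeExporterTargets` value here is a hypothetical example of the structured output, not data from the repository.

```nix
# Standalone sketch of rewriting ports while keeping labels.
let
  # Hypothetical static_configs, shaped like the generateNodeExporterTargets output.
  nodeExporterTargets = [
    { targets = [ "ns1.example.net:9100" ]; labels = { role = "dns"; }; }
    { targets = [ "web1.example.net:9100" "web2.example.net:9100" ]; }
  ];

  # `cfg // { targets = ...; }` replaces only `targets`; `labels`, if present, is kept.
  toSystemdExporter = cfg: cfg // {
    targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
  };
in
map toSystemdExporter nodeExporterTargets
```

One thing to keep in mind: `builtins.replaceStrings` only rewrites targets that contain `:9100`, so an external node-exporter target on a different port would pass through with its original port.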
@@ -17,8 +17,9 @@ groups:
         annotations:
           summary: "Disk space low on {{ $labels.instance }}"
           description: "Disk space is low on {{ $labels.instance }}. Please check."
+      # Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
       - alert: high_cpu_load
-        expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+        expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
         for: 15m
         labels:
           severity: warning
@@ -26,7 +27,7 @@ groups:
           summary: "High CPU load on {{ $labels.instance }}"
           description: "CPU load is high on {{ $labels.instance }}. Please check."
       - alert: high_cpu_load
-        expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+        expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
         for: 2h
         labels:
           severity: warning
@@ -115,8 +116,9 @@ groups:
         annotations:
           summary: "NSD not running on {{ $labels.instance }}"
           description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
+      # Only alert on primary DNS (secondary has cold cache after failover)
       - alert: unbound_low_cache_hit_ratio
-        expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
+        expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
         for: 15m
         labels:
           severity: warning