From 7d291f85bf5a901605363bed195f06af81fec2ee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= Date: Sat, 7 Feb 2026 17:04:50 +0100 Subject: [PATCH 1/5] monitoring: propagate host labels to Prometheus scrape targets Extract homelab.host metadata (tier, priority, role, labels) from host configurations and propagate them to Prometheus scrape targets. This enables semantic alert filtering using labels instead of hardcoded instance names. Changes: - lib/monitoring.nix: Extract host metadata, group targets by labels - prometheus.nix: Use structured static_configs with labels - rules.yml: Replace instance filters with role-based filters Example labels in Prometheus: - ns1/ns2: role=dns, dns_role=primary/secondary - nix-cache01: role=build-host - testvm*: tier=test Co-Authored-By: Claude Opus 4.5 --- docs/plans/prometheus-scrape-target-labels.md | 39 ++++------ lib/monitoring.nix | 78 ++++++++++++++++--- services/monitoring/prometheus.nix | 18 ++--- services/monitoring/rules.yml | 8 +- 4 files changed, 96 insertions(+), 47 deletions(-) diff --git a/docs/plans/prometheus-scrape-target-labels.md b/docs/plans/prometheus-scrape-target-labels.md index 0255347..d1b2508 100644 --- a/docs/plans/prometheus-scrape-target-labels.md +++ b/docs/plans/prometheus-scrape-target-labels.md @@ -5,20 +5,19 @@ | Step | Status | Notes | |------|--------|-------| | 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` | -| 2. Update `lib/monitoring.nix` | ❌ Not started | Labels not extracted or propagated | -| 3. Update Prometheus config | ❌ Not started | Still uses flat target list | -| 4. Set metadata on hosts | ⚠️ Partial | Some hosts configured, see below | -| 5. Update alert rules | ❌ Not started | | -| 6. Labels for service targets | ❌ Not started | Optional | +| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated | +| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs | +| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured | +| 5. Update alert rules | ✅ Complete | Role-based filtering implemented | +| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services | **Hosts with metadata configured:** - `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"` -- `nix-cache01`: `role = "build-host"` (missing `priority = "low"` from plan) +- `nix-cache01`: `role = "build-host"` - `vault01`: `role = "vault"` -- `jump`: `role = "bastion"` -- `template`, `template2`, `testvm*`: `tier` and `priority` set +- `testvm01/02/03`: `tier = "test"` -**Key gap:** The `homelab.host` module exists and some hosts use it, but `lib/monitoring.nix` does not extract these values—they are not propagated to Prometheus scrape targets. +**Implementation complete.** Branch: `prometheus-scrape-target-labels` --- @@ -119,7 +118,7 @@ Import this module in `modules/homelab/default.nix`. ### 2. Update `lib/monitoring.nix` -❌ **Not started.** The current implementation does not extract `homelab.host` values. +✅ **Complete.** Labels are now extracted and propagated. - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels). - Build the combined label set from `homelab.host`: @@ -149,7 +148,7 @@ This requires grouping hosts by their label attrset and producing one `static_co ### 3. Update `services/monitoring/prometheus.nix` -❌ **Not started.** Still uses flat target list (`static_configs = [{ targets = nodeExporterTargets; }]`). +✅ **Complete.** Now uses structured static_configs output. Change the node-exporter scrape config to use the new structured output: @@ -163,7 +162,7 @@ static_configs = nodeExporterTargets; ### 4. Set metadata on hosts -⚠️ **Partial.** Some hosts configured (see status table above). Current `nix-cache01` only has `role`, missing the `priority = "low"` suggested below. +✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration. Example in `hosts/nix-cache01/configuration.nix`: @@ -189,17 +188,11 @@ homelab.host = { ### 5. Update alert rules -❌ **Not started.** Requires steps 2-3 to be completed first. +✅ **Complete.** Updated `services/monitoring/rules.yml`: -After implementing labels, review and update `services/monitoring/rules.yml`: +- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration). +- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache). -- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`). -- Consider whether any other rules should differentiate by priority or role. +### 6. Labels for `generateScrapeConfigs` (service targets) -Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter. - -### 6. Consider labels for `generateScrapeConfigs` (service targets) - -❌ **Not started.** Optional enhancement. - -The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering. +✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job. diff --git a/lib/monitoring.nix b/lib/monitoring.nix index 19e522a..dbb62b8 100644 --- a/lib/monitoring.nix +++ b/lib/monitoring.nix @@ -21,6 +21,7 @@ let cfg = hostConfig.config; monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; }; dnsConfig = (cfg.homelab or { }).dns or { enable = true; }; + hostConfig' = (cfg.homelab or { }).host or { }; hostname = cfg.networking.hostName; networks = cfg.systemd.network.networks or { }; @@ -49,20 +50,64 @@ let inherit hostname; ip = extractIP firstAddress; scrapeTargets = monConfig.scrapeTargets or [ ]; + # Host metadata for label propagation + tier = hostConfig'.tier or "prod"; + priority = hostConfig'.priority or "high"; + role = hostConfig'.role or null; + labels = hostConfig'.labels or { }; }; + # Build effective labels for a host (only include non-default values) + buildEffectiveLabels = host: + (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; }) + // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; }) + // (lib.optionalAttrs (host.role != null) { role = host.role; }) + // host.labels; + # Generate node-exporter targets from all flake hosts + # Returns a list of static_configs entries with labels generateNodeExporterTargets = self: externalTargets: let nixosConfigs = self.nixosConfigurations or { }; hostList = lib.filter (x: x != null) ( lib.mapAttrsToList extractHostMonitoring nixosConfigs ); - flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList; + + # Build target entries with labels for each host + flakeEntries = map + (host: { + target = "${host.hostname}.home.2rjus.net:9100"; + labels = buildEffectiveLabels host; + }) + hostList; + + # External targets have no labels + externalEntries = map + (target: { inherit target; labels = { }; }) + (externalTargets.nodeExporter or [ ]); + + allEntries = flakeEntries ++ externalEntries; + + # Group entries by their label set for efficient static_configs + # Convert labels attrset to a string key for grouping + labelKey = entry: builtins.toJSON entry.labels; + grouped = lib.groupBy labelKey allEntries; + + # Convert groups to static_configs format + staticConfigs = lib.mapAttrsToList + (key: entries: + let + labels = (builtins.head entries).labels; + in + { targets = map (e: e.target) entries; } + // (lib.optionalAttrs (labels != { }) { inherit labels; }) + ) + grouped; in - flakeTargets ++ (externalTargets.nodeExporter or [ ]); + staticConfigs; # Generate scrape configs from all flake hosts and external targets + # Host labels are propagated to service targets for semantic alert filtering generateScrapeConfigs = self: externalTargets: let nixosConfigs = self.nixosConfigurations or { }; @@ -70,13 +115,14 @@ let lib.mapAttrsToList extractHostMonitoring nixosConfigs ); - # Collect all scrapeTargets from all hosts, grouped by job_name + # Collect all scrapeTargets from all hosts, including host labels allTargets = lib.flatten (map (host: map (target: { inherit (target) job_name port metrics_path scheme scrape_interval honor_labels; hostname = host.hostname; + hostLabels = buildEffectiveLabels host; }) host.scrapeTargets ) @@ -87,22 +133,32 @@ let grouped = lib.groupBy (t: t.job_name) allTargets; # Generate a scrape config for each job + # Within each job, group targets by their host labels for efficient static_configs flakeScrapeConfigs = lib.mapAttrsToList (jobName: targets: let first = builtins.head targets; - targetAddrs = map - (t: + + # Group targets within this job by their host labels + labelKey = t: builtins.toJSON t.hostLabels; + groupedByLabels = lib.groupBy labelKey targets; + + staticConfigs = lib.mapAttrsToList + (key: labelTargets: let - portStr = toString t.port; + labels = (builtins.head labelTargets).hostLabels; + targetAddrs = map + (t: "${t.hostname}.home.2rjus.net:${toString t.port}") + labelTargets; in - "${t.hostname}.home.2rjus.net:${portStr}") - targets; + { targets = targetAddrs; } + // (lib.optionalAttrs (labels != { }) { inherit labels; }) + ) + groupedByLabels; + config = { job_name = jobName; - static_configs = [{ - targets = targetAddrs; - }]; + static_configs = staticConfigs; } // (lib.optionalAttrs (first.metrics_path != "/metrics") { metrics_path = first.metrics_path; diff --git a/services/monitoring/prometheus.nix b/services/monitoring/prometheus.nix index 57bc86d..c37bd32 100644 --- a/services/monitoring/prometheus.nix +++ b/services/monitoring/prometheus.nix @@ -121,22 +121,20 @@ in scrapeConfigs = [ # Auto-generated node-exporter targets from flake hosts + external + # Each static_config entry may have labels from homelab.host metadata { job_name = "node-exporter"; - static_configs = [ - { - targets = nodeExporterTargets; - } - ]; + static_configs = nodeExporterTargets; } # Systemd exporter on all hosts (same targets, different port) + # Preserves the same label grouping as node-exporter { job_name = "systemd-exporter"; - static_configs = [ - { - targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets; - } - ]; + static_configs = map + (cfg: cfg // { + targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets; + }) + nodeExporterTargets; } # Local monitoring services (not auto-generated) { diff --git a/services/monitoring/rules.yml b/services/monitoring/rules.yml index 9e612eb..88c5e6c 100644 --- a/services/monitoring/rules.yml +++ b/services/monitoring/rules.yml @@ -17,8 +17,9 @@ groups: annotations: summary: "Disk space low on {{ $labels.instance }}" description: "Disk space is low on {{ $labels.instance }}. Please check." + # Build hosts (e.g., nix-cache01) are expected to have high CPU during builds - alert: high_cpu_load - expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7) + expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7) for: 15m labels: severity: warning @@ -26,7 +27,7 @@ groups: summary: "High CPU load on {{ $labels.instance }}" description: "CPU load is high on {{ $labels.instance }}. Please check." - alert: high_cpu_load - expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7) + expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7) for: 2h labels: severity: warning @@ -115,8 +116,9 @@ groups: annotations: summary: "NSD not running on {{ $labels.instance }}" description: "NSD has been down on {{ $labels.instance }} more than 5 minutes." + # Only alert on primary DNS (secondary has cold cache after failover) - alert: unbound_low_cache_hit_ratio - expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5 + expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5 for: 15m labels: severity: warning From 23e561cf494c366dcffc4f10a19541f4ea1c7505 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= Date: Sat, 7 Feb 2026 17:09:19 +0100 Subject: [PATCH 2/5] monitoring: add hostname label to all scrape targets Add a `hostname` label to all Prometheus scrape targets, making it easy to query all metrics for a host without wildcarding the instance label. Example queries: - {hostname="ns1"} - all metrics from ns1 - node_cpu_seconds_total{hostname="monitoring01"} - specific metric For external targets (like gunter), the hostname is extracted from the target string. Co-Authored-By: Claude Opus 4.5 --- lib/monitoring.nix | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/lib/monitoring.nix b/lib/monitoring.nix index dbb62b8..57bffb4 100644 --- a/lib/monitoring.nix +++ b/lib/monitoring.nix @@ -57,9 +57,11 @@ let labels = hostConfig'.labels or { }; }; - # Build effective labels for a host (only include non-default values) + # Build effective labels for a host + # Always includes hostname; only includes tier/priority/role if non-default buildEffectiveLabels = host: - (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; }) + { hostname = host.hostname; } + // (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; }) // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; }) // (lib.optionalAttrs (host.role != null) { role = host.role; }) // host.labels; @@ -73,6 +75,10 @@ let lib.mapAttrsToList extractHostMonitoring nixosConfigs ); + # Extract hostname from a target string like "gunter.home.2rjus.net:9100" + extractHostnameFromTarget = target: + builtins.head (lib.splitString "." target); + # Build target entries with labels for each host flakeEntries = map (host: { @@ -81,9 +87,12 @@ let }) hostList; - # External targets have no labels + # External targets get hostname extracted from the target string externalEntries = map - (target: { inherit target; labels = { }; }) + (target: { + inherit target; + labels = { hostname = extractHostnameFromTarget target; }; + }) (externalTargets.nodeExporter or [ ]); allEntries = flakeEntries ++ externalEntries; @@ -94,13 +103,13 @@ let grouped = lib.groupBy labelKey allEntries; # Convert groups to static_configs format + # Every flake host now has at least a hostname label staticConfigs = lib.mapAttrsToList (key: entries: let labels = (builtins.head entries).labels; in - { targets = map (e: e.target) entries; } - // (lib.optionalAttrs (labels != { }) { inherit labels; }) + { targets = map (e: e.target) entries; labels = labels; } ) grouped; in @@ -143,6 +152,7 @@ let labelKey = t: builtins.toJSON t.hostLabels; groupedByLabels = lib.groupBy labelKey targets; + # Every flake host now has at least a hostname label staticConfigs = lib.mapAttrsToList (key: labelTargets: let @@ -151,8 +161,7 @@ let (t: "${t.hostname}.home.2rjus.net:${toString t.port}") labelTargets; in - { targets = targetAddrs; } - // (lib.optionalAttrs (labels != { }) { inherit labels; }) + { targets = targetAddrs; labels = labels; } ) groupedByLabels; From 50a85daa445f828ce50078e3c4448c8c5b8cc639 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= Date: Sat, 7 Feb 2026 17:09:46 +0100 Subject: [PATCH 3/5] docs: update plan with hostname label documentation Co-Authored-By: Claude Opus 4.5 --- docs/plans/prometheus-scrape-target-labels.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/plans/prometheus-scrape-target-labels.md b/docs/plans/prometheus-scrape-target-labels.md index d1b2508..d756ebf 100644 --- a/docs/plans/prometheus-scrape-target-labels.md +++ b/docs/plans/prometheus-scrape-target-labels.md @@ -10,6 +10,7 @@ | 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured | | 5. Update alert rules | ✅ Complete | Role-based filtering implemented | | 6. Labels for service targets | ✅ Complete | Host labels propagated to all services | +| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering | **Hosts with metadata configured:** - `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"` @@ -19,6 +20,12 @@ **Implementation complete.** Branch: `prometheus-scrape-target-labels` +**Query examples:** +- `{hostname="ns1"}` - all metrics from ns1 (any job/port) +- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname +- `up{role="dns"}` - all DNS servers +- `up{tier="test"}` - all test-tier hosts + --- ## Goal From b794aa89db83067fa688be92ca9a5a3ceeffb780 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= Date: Sat, 7 Feb 2026 17:12:17 +0100 Subject: [PATCH 4/5] skills: update observability with new target labels Document the new hostname and host metadata labels available on all Prometheus scrape targets: - hostname: short hostname for easy filtering - role: host role (dns, build-host, vault) - tier: deployment tier (test for test VMs) - dns_role: primary/secondary for DNS servers Co-Authored-By: Claude Opus 4.5 --- .claude/skills/observability/SKILL.md | 61 ++++++++++++++++++++++----- 1 file changed, 51 insertions(+), 10 deletions(-) diff --git a/.claude/skills/observability/SKILL.md b/.claude/skills/observability/SKILL.md index 69be240..6053a10 100644 --- a/.claude/skills/observability/SKILL.md +++ b/.claude/skills/observability/SKILL.md @@ -185,21 +185,60 @@ Common job names: - `home-assistant` - Home automation - `step-ca` - Internal CA -### Instance Label Format +### Target Labels -The `instance` label uses FQDN format: +All scrape targets have these labels: -``` -.home.2rjus.net: -``` +**Standard labels:** +- `instance` - Full target address (`.home.2rjus.net:`) +- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`) +- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering -Example queries filtering by host: +**Host metadata labels** (when configured in `homelab.host`): +- `role` - Host role (e.g., `dns`, `build-host`, `vault`) +- `tier` - Deployment tier (`test` for test VMs, absent for prod) +- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2) + +### Filtering by Host + +Use the `hostname` label for easy host filtering across all jobs: ```promql -up{instance=~"monitoring01.*"} -node_load1{instance=~"ns1.*"} +{hostname="ns1"} # All metrics from ns1 +node_load1{hostname="monitoring01"} # Specific metric by hostname +up{hostname="ha1"} # Check if ha1 is up ``` +This is simpler than wildcarding the `instance` label: + +```promql +# Old way (still works but verbose) +up{instance=~"monitoring01.*"} + +# New way (preferred) +up{hostname="monitoring01"} +``` + +### Filtering by Role/Tier + +Filter hosts by their role or tier: + +```promql +up{role="dns"} # All DNS servers (ns1, ns2) +node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01) +up{tier="test"} # All test-tier VMs +up{dns_role="primary"} # Primary DNS only (ns1) +``` + +Current host labels: +| Host | Labels | +|------|--------| +| ns1 | `role=dns`, `dns_role=primary` | +| ns2 | `role=dns`, `dns_role=secondary` | +| nix-cache01 | `role=build-host` | +| vault01 | `role=vault` | +| testvm01/02/03 | `tier=test` | + --- ## Troubleshooting Workflows @@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"} ### Investigate Service Issues -1. Check `up{job=""}` for scrape failures +1. Check `up{job=""}` or `up{hostname=""}` for scrape failures 2. Use `list_targets` to see target health details 3. Query service logs: `{host="", systemd_unit=".service"}` 4. Search for errors: `{host=""} |= "error"` 5. Check `list_alerts` for related alerts +6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers ### After Deploying Changes @@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts. - Default scrape interval is 15s for most metrics targets - Default log lookback is 1h - use `start` parameter for older logs - Use `rate()` for counter metrics, direct queries for gauges -- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters +- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`) +- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets - Log `MESSAGE` field contains the actual log content in JSON format From 116abf3bec700aa03b9b6b8c2a12bc458e959575 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= Date: Sat, 7 Feb 2026 17:23:10 +0100 Subject: [PATCH 5/5] CLAUDE.md: document homelab-deploy CLI for prod hosts Add instructions for deploying to prod hosts using the CLI directly, since the MCP server only handles test-tier deployments. Co-Authored-By: Claude Opus 4.5 --- CLAUDE.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/CLAUDE.md b/CLAUDE.md index 1084e84..ccf67cc 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -266,6 +266,21 @@ deploy(role="vault", action="switch") **Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments. +**Deploying to Prod Hosts:** + +The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly: + +```bash +nix develop -c homelab-deploy -- deploy \ + --nats-url nats://nats1.home.2rjus.net:4222 \ + --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \ + --branch \ + --action switch \ + deploy.prod. +``` + +Subject format: `deploy..` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`) + **Verifying Deployments:** After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision: