Merge pull request 'prometheus-scrape-target-labels' (#30) from prometheus-scrape-target-labels into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s

Reviewed-on: #30
This commit was merged in pull request #30.
2026-02-07 16:27:38 +00:00
6 changed files with 178 additions and 57 deletions

View File

@@ -185,21 +185,60 @@ Common job names:
- `home-assistant` - Home automation
- `step-ca` - Internal CA
-### Instance Label Format
+### Target Labels
-The `instance` label uses FQDN format:
+All scrape targets have these labels:
-```
-<hostname>.home.2rjus.net:<port>
-```
+**Standard labels:**
+- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
+- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
-Example queries filtering by host:
+**Host metadata labels** (when configured in `homelab.host`):
+- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
+- `tier` - Deployment tier (`test` for test VMs, absent for prod)
+- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
+### Filtering by Host
+Use the `hostname` label for easy host filtering across all jobs:
```promql
-up{instance=~"monitoring01.*"}
-node_load1{instance=~"ns1.*"}
+{hostname="ns1"}                      # All metrics from ns1
+node_load1{hostname="monitoring01"}   # Specific metric by hostname
+up{hostname="ha1"}                    # Check if ha1 is up
```
This is simpler than wildcarding the `instance` label:
```promql
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}
# New way (preferred)
up{hostname="monitoring01"}
```
### Filtering by Role/Tier
Filter hosts by their role or tier:
```promql
up{role="dns"} # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01)
up{tier="test"} # All test-tier VMs
up{dns_role="primary"} # Primary DNS only (ns1)
```
Current host labels:
| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| testvm01/02/03 | `tier=test` |
---
## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
### Investigate Service Issues
-1. Check `up{job="<service>"}` for scrape failures
+1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
### After Deploying Changes
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
-- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
+- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- Log `MESSAGE` field contains the actual log content in JSON format
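The last note can be made concrete. A hedged LogQL sketch (assuming Loki's standard `json` and `line_format` pipeline stages; the specific host and unit are illustrative):

```logql
# Parse the JSON journal entry and print only the MESSAGE field
{host="ns1", systemd_unit="unbound.service"} | json | line_format "{{.MESSAGE}}"
```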

View File

@@ -266,6 +266,21 @@ deploy(role="vault", action="switch")
**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
**Deploying to Prod Hosts:**
The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:
```bash
nix develop -c homelab-deploy -- deploy \
--nats-url nats://nats1.home.2rjus.net:4222 \
--nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
--branch <branch-name> \
--action switch \
deploy.prod.<hostname>
```
Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
**Verifying Deployments:**
After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
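A hedged query sketch (the metric's label names are an assumption here; check the actual nixos-exporter output for the revision label):

```promql
# Illustrative: assumes nixos_flake_info carries the flake revision as a label
nixos_flake_info{hostname="monitoring01"}
```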

View File

@@ -5,20 +5,26 @@
| Step | Status | Notes |
|------|--------|-------|
| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
-| 2. Update `lib/monitoring.nix` | ❌ Not started | Labels not extracted or propagated |
-| 3. Update Prometheus config | ❌ Not started | Still uses flat target list |
-| 4. Set metadata on hosts | ⚠️ Partial | Some hosts configured, see below |
-| 5. Update alert rules | ❌ Not started | |
-| 6. Labels for service targets | ❌ Not started | Optional |
+| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
+| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
+| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
+| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
+| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
+| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |
**Hosts with metadata configured:**
- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
- `nix-cache01`: `role = "build-host"` (missing `priority = "low"` from plan)
- `nix-cache01`: `role = "build-host"`
- `vault01`: `role = "vault"`
- `jump`: `role = "bastion"`
-- `template`, `template2`, `testvm*`: `tier` and `priority` set
+- `testvm01/02/03`: `tier = "test"`
-**Key gap:** The `homelab.host` module exists and some hosts use it, but `lib/monitoring.nix` does not extract these values—they are not propagated to Prometheus scrape targets.
+**Implementation complete.** Branch: `prometheus-scrape-target-labels`
**Query examples:**
- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
- `up{role="dns"}` - all DNS servers
- `up{tier="test"}` - all test-tier hosts
---
@@ -119,7 +125,7 @@ Import this module in `modules/homelab/default.nix`.
### 2. Update `lib/monitoring.nix`
-**Not started.** The current implementation does not extract `homelab.host` values.
+**Complete.** Labels are now extracted and propagated.
-- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
-- Build the combined label set from `homelab.host`:
@@ -149,7 +155,7 @@ This requires grouping hosts by their label attrset and producing one `static_co
### 3. Update `services/monitoring/prometheus.nix`
-**Not started.** Still uses flat target list (`static_configs = [{ targets = nodeExporterTargets; }]`).
+**Complete.** Now uses structured static_configs output.
-Change the node-exporter scrape config to use the new structured output:
@@ -163,7 +169,7 @@ static_configs = nodeExporterTargets;
### 4. Set metadata on hosts
-⚠️ **Partial.** Some hosts configured (see status table above). Current `nix-cache01` only has `role`, missing the `priority = "low"` suggested below.
+**Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
Example in `hosts/nix-cache01/configuration.nix`:
@@ -189,17 +195,11 @@ homelab.host = {
### 5. Update alert rules
-**Not started.** Requires steps 2-3 to be completed first.
+**Complete.** Updated `services/monitoring/rules.yml`:
-After implementing labels, review and update `services/monitoring/rules.yml`:
+- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
+- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).
-- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
-- Consider whether any other rules should differentiate by priority or role.
-Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
-### 6. Consider labels for `generateScrapeConfigs` (service targets)
+### 6. Labels for `generateScrapeConfigs` (service targets)
-**Not started.** Optional enhancement.
-The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
+**Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.

View File

@@ -21,6 +21,7 @@ let
cfg = hostConfig.config;
monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
hostConfig' = (cfg.homelab or { }).host or { };
hostname = cfg.networking.hostName;
networks = cfg.systemd.network.networks or { };
@@ -49,20 +50,73 @@ let
inherit hostname;
ip = extractIP firstAddress;
scrapeTargets = monConfig.scrapeTargets or [ ];
# Host metadata for label propagation
tier = hostConfig'.tier or "prod";
priority = hostConfig'.priority or "high";
role = hostConfig'.role or null;
labels = hostConfig'.labels or { };
};
# Build effective labels for a host
# Always includes hostname; only includes tier/priority/role if non-default
buildEffectiveLabels = host:
{ hostname = host.hostname; }
// (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
// (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
// (lib.optionalAttrs (host.role != null) { role = host.role; })
// host.labels;
# Generate node-exporter targets from all flake hosts
# Returns a list of static_configs entries with labels
generateNodeExporterTargets = self: externalTargets:
let
nixosConfigs = self.nixosConfigurations or { };
hostList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostMonitoring nixosConfigs
);
-flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
# Extract hostname from a target string like "gunter.home.2rjus.net:9100"
extractHostnameFromTarget = target:
builtins.head (lib.splitString "." target);
# Build target entries with labels for each host
flakeEntries = map
(host: {
target = "${host.hostname}.home.2rjus.net:9100";
labels = buildEffectiveLabels host;
})
hostList;
# External targets get hostname extracted from the target string
externalEntries = map
(target: {
inherit target;
labels = { hostname = extractHostnameFromTarget target; };
})
(externalTargets.nodeExporter or [ ]);
allEntries = flakeEntries ++ externalEntries;
# Group entries by their label set for efficient static_configs
# Convert labels attrset to a string key for grouping
labelKey = entry: builtins.toJSON entry.labels;
grouped = lib.groupBy labelKey allEntries;
# Convert groups to static_configs format
# Every flake host now has at least a hostname label
staticConfigs = lib.mapAttrsToList
(key: entries:
let
labels = (builtins.head entries).labels;
in
-flakeTargets ++ (externalTargets.nodeExporter or [ ]);
{ targets = map (e: e.target) entries; labels = labels; }
)
grouped;
in
staticConfigs;
# Generate scrape configs from all flake hosts and external targets
# Host labels are propagated to service targets for semantic alert filtering
generateScrapeConfigs = self: externalTargets:
let
nixosConfigs = self.nixosConfigurations or { };
@@ -70,13 +124,14 @@ let
lib.mapAttrsToList extractHostMonitoring nixosConfigs
);
-# Collect all scrapeTargets from all hosts, grouped by job_name
+# Collect all scrapeTargets from all hosts, including host labels
allTargets = lib.flatten (map
(host:
map
(target: {
inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
hostname = host.hostname;
hostLabels = buildEffectiveLabels host;
})
host.scrapeTargets
)
@@ -87,22 +142,32 @@ let
grouped = lib.groupBy (t: t.job_name) allTargets;
# Generate a scrape config for each job
# Within each job, group targets by their host labels for efficient static_configs
flakeScrapeConfigs = lib.mapAttrsToList
(jobName: targets:
let
first = builtins.head targets;
-targetAddrs = map
-(t:
# Group targets within this job by their host labels
labelKey = t: builtins.toJSON t.hostLabels;
groupedByLabels = lib.groupBy labelKey targets;
# Every flake host now has at least a hostname label
staticConfigs = lib.mapAttrsToList
(key: labelTargets:
let
-portStr = toString t.port;
labels = (builtins.head labelTargets).hostLabels;
targetAddrs = map
(t: "${t.hostname}.home.2rjus.net:${toString t.port}")
labelTargets;
in
"${t.hostname}.home.2rjus.net:${portStr}")
targets;
{ targets = targetAddrs; labels = labels; }
)
groupedByLabels;
config = {
job_name = jobName;
-static_configs = [{
-targets = targetAddrs;
-}];
+static_configs = staticConfigs;
}
// (lib.optionalAttrs (first.metrics_path != "/metrics") {
metrics_path = first.metrics_path;
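The grouping trick used in this file (serialize the label attrset with `builtins.toJSON` and use the string as a `lib.groupBy` key) is language-agnostic. A minimal Python sketch of the same idea, illustrative only and not part of the repo:

```python
import json
from collections import defaultdict

def group_by_labels(entries):
    """Group targets that share an identical label set, mirroring the
    lib.groupBy-on-builtins.toJSON approach in lib/monitoring.nix."""
    groups = defaultdict(list)
    for entry in entries:
        # Sorted-key JSON gives a stable string key per label set
        groups[json.dumps(entry["labels"], sort_keys=True)].append(entry)
    return [
        {"targets": [e["target"] for e in group], "labels": group[0]["labels"]}
        for group in groups.values()
    ]

# Because every flake host carries a unique hostname label, each host
# typically ends up in its own static_config entry
entries = [
    {"target": "ns1.home.2rjus.net:9100",
     "labels": {"hostname": "ns1", "role": "dns", "dns_role": "primary"}},
    {"target": "testvm01.home.2rjus.net:9100",
     "labels": {"hostname": "testvm01", "tier": "test"}},
]
static_configs = group_by_labels(entries)
```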

View File

@@ -121,22 +121,20 @@ in
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
# Each static_config entry may have labels from homelab.host metadata
{
job_name = "node-exporter";
-static_configs = [
-{
-targets = nodeExporterTargets;
-}
-];
+static_configs = nodeExporterTargets;
}
# Systemd exporter on all hosts (same targets, different port)
# Preserves the same label grouping as node-exporter
{
job_name = "systemd-exporter";
-static_configs = [
-{
-targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
-}
-];
+static_configs = map
+(cfg: cfg // {
+targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+})
+nodeExporterTargets;
}
# Local monitoring services (not auto-generated)
{

View File

@@ -17,8 +17,9 @@ groups:
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Disk space is low on {{ $labels.instance }}. Please check."
# Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
- alert: high_cpu_load
-expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
for: 15m
labels:
severity: warning
@@ -26,7 +27,7 @@ groups:
summary: "High CPU load on {{ $labels.instance }}"
description: "CPU load is high on {{ $labels.instance }}. Please check."
- alert: high_cpu_load
-expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
for: 2h
labels:
severity: warning
@@ -115,8 +116,9 @@ groups:
annotations:
summary: "NSD not running on {{ $labels.instance }}"
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
# Only alert on primary DNS (secondary has cold cache after failover)
- alert: unbound_low_cache_hit_ratio
-expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
+expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
for: 15m
labels:
severity: warning