diff --git a/.claude/skills/observability/SKILL.md b/.claude/skills/observability/SKILL.md
index 69be240..6053a10 100644
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -185,21 +185,60 @@ Common job names:
 - `home-assistant` - Home automation
 - `step-ca` - Internal CA
 
-### Instance Label Format
+### Target Labels
 
-The `instance` label uses FQDN format:
+All scrape targets have these labels:
 
-```
-<hostname>.home.2rjus.net:<port>
-```
+**Standard labels:**
+- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
+- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
 
-Example queries filtering by host:
+**Host metadata labels** (when configured in `homelab.host`):
+- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
+- `tier` - Deployment tier (`test` for test VMs, absent for prod)
+- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
+
+### Filtering by Host
+
+Use the `hostname` label for easy host filtering across all jobs:
 
 ```promql
-up{instance=~"monitoring01.*"}
-node_load1{instance=~"ns1.*"}
+{hostname="ns1"}                     # All metrics from ns1
+node_load1{hostname="monitoring01"}  # Specific metric by hostname
+up{hostname="ha1"}                   # Check if ha1 is up
 ```
 
+This is simpler than wildcarding the `instance` label:
+
+```promql
+# Old way (still works but verbose)
+up{instance=~"monitoring01.*"}
+
+# New way (preferred)
+up{hostname="monitoring01"}
+```
+
+### Filtering by Role/Tier
+
+Filter hosts by their role or tier:
+
+```promql
+up{role="dns"}                             # All DNS servers (ns1, ns2)
+node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
+up{tier="test"}                            # All test-tier VMs
+up{dns_role="primary"}                     # Primary DNS only (ns1)
+```
+
+Current host labels:
+| Host | Labels |
+|------|--------|
+| ns1 | `role=dns`, `dns_role=primary` |
+| ns2 | `role=dns`, `dns_role=secondary` |
+| nix-cache01 | `role=build-host` |
+| vault01 | `role=vault` |
+| testvm01/02/03 | `tier=test` |
+
 ---
 
 ## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
 
 ### Investigate Service Issues
 
-1. Check `up{job="<job>"}` for scrape failures
+1. Check `up{job="<job>"}` or `up{hostname="<hostname>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<hostname>", systemd_unit="<unit>.service"}`
 4. Search for errors: `{host="<hostname>"} |= "error"`
 5. Check `list_alerts` for related alerts
+6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
 
 ### After Deploying Changes
 
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
-- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
+- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
 - Log `MESSAGE` field contains the actual log content in JSON format
diff --git a/CLAUDE.md b/CLAUDE.md
index 1084e84..ccf67cc 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -266,6 +266,21 @@ deploy(role="vault", action="switch")
 
 **Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
 
+**Deploying to Prod Hosts:**
+
+The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:
+
+```bash
+nix develop -c homelab-deploy -- deploy \
+  --nats-url nats://nats1.home.2rjus.net:4222 \
+  --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
+  --branch <branch> \
+  --action switch \
+  deploy.prod.<hostname>
+```
+
+Subject format: `deploy.<env>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
+
 **Verifying Deployments:**
 
 After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
diff --git a/docs/plans/prometheus-scrape-target-labels.md b/docs/plans/prometheus-scrape-target-labels.md
index 0255347..d756ebf 100644
--- a/docs/plans/prometheus-scrape-target-labels.md
+++ b/docs/plans/prometheus-scrape-target-labels.md
@@ -5,20 +5,26 @@
 | Step | Status | Notes |
 |------|--------|-------|
 | 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
-| 2. Update `lib/monitoring.nix` | ❌ Not started | Labels not extracted or propagated |
-| 3. Update Prometheus config | ❌ Not started | Still uses flat target list |
-| 4. Set metadata on hosts | ⚠️ Partial | Some hosts configured, see below |
-| 5. Update alert rules | ❌ Not started | |
-| 6. Labels for service targets | ❌ Not started | Optional |
+| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
+| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
+| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
+| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
+| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
+| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |
 
 **Hosts with metadata configured:**
 
 - `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
-- `nix-cache01`: `role = "build-host"` (missing `priority = "low"` from plan)
+- `nix-cache01`: `role = "build-host"`
 - `vault01`: `role = "vault"`
-- `jump`: `role = "bastion"`
-- `template`, `template2`, `testvm*`: `tier` and `priority` set
+- `testvm01/02/03`: `tier = "test"`
 
-**Key gap:** The `homelab.host` module exists and some hosts use it, but `lib/monitoring.nix` does not extract these values—they are not propagated to Prometheus scrape targets.
+**Implementation complete.** Branch: `prometheus-scrape-target-labels`
+
+**Query examples:**
+- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
+- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
+- `up{role="dns"}` - all DNS servers
+- `up{tier="test"}` - all test-tier hosts
 
 ---
 
@@ -119,7 +125,7 @@ Import this module in `modules/homelab/default.nix`.
 
 ### 2. Update `lib/monitoring.nix`
 
-❌ **Not started.** The current implementation does not extract `homelab.host` values.
+✅ **Complete.** Labels are now extracted and propagated.
 
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:
 
@@ -149,7 +155,7 @@ This requires grouping hosts by their label attrset and producing one `static_co
 
 ### 3. Update `services/monitoring/prometheus.nix`
 
-❌ **Not started.** Still uses flat target list (`static_configs = [{ targets = nodeExporterTargets; }]`).
+✅ **Complete.** Now uses structured static_configs output.
 
 Change the node-exporter scrape config to use the new structured output:
 
@@ -163,7 +169,7 @@ static_configs = nodeExporterTargets;
 
 ### 4. Set metadata on hosts
 
-⚠️ **Partial.** Some hosts configured (see status table above). Current `nix-cache01` only has `role`, missing the `priority = "low"` suggested below.
+✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
 
 Example in `hosts/nix-cache01/configuration.nix`:
 
@@ -189,17 +195,11 @@ homelab.host = {
 
 ### 5. Update alert rules
 
-❌ **Not started.** Requires steps 2-3 to be completed first.
+✅ **Complete.** Updated `services/monitoring/rules.yml`:
 
-After implementing labels, review and update `services/monitoring/rules.yml`:
+- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
+- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).
 
-- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
-- Consider whether any other rules should differentiate by priority or role.
+### 6. Labels for `generateScrapeConfigs` (service targets)
 
-Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
-
-### 6. Consider labels for `generateScrapeConfigs` (service targets)
-
-❌ **Not started.** Optional enhancement.
-
-The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
+✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
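As an aside on the grouping step the plan calls out ("grouping hosts by their label attrset and producing one `static_config` per group"), the technique is straightforward to sketch outside Nix. A minimal Python sketch under assumed inputs: the entries and their labels are hypothetical, and the JSON-string group key mirrors the `builtins.toJSON` trick used in `lib/monitoring.nix` (Nix serializes attrsets with sorted keys, hence `sort_keys=True`):

```python
import json
from collections import defaultdict

# Hypothetical entries mirroring what buildEffectiveLabels produces:
# every host gets a hostname label, plus role/tier when configured.
entries = [
    {"target": "ns1.home.2rjus.net:9100", "labels": {"hostname": "ns1", "role": "dns"}},
    {"target": "ns1.home.2rjus.net:9558", "labels": {"hostname": "ns1", "role": "dns"}},
    {"target": "vault01.home.2rjus.net:9100", "labels": {"hostname": "vault01", "role": "vault"}},
]

def group_static_configs(entries):
    """Group targets that share an identical label set into one
    static_config entry, keyed by the JSON-serialized labels."""
    groups = defaultdict(list)
    for entry in entries:
        groups[json.dumps(entry["labels"], sort_keys=True)].append(entry["target"])
    return [
        {"targets": targets, "labels": json.loads(key)}
        for key, targets in groups.items()
    ]

static_configs = group_static_configs(entries)
# ns1's two ports share one label set, so they collapse into one entry.
```

Since `hostname` is always present and unique per host, groups usually hold one host; the grouping mainly merges multiple targets of the same host within a job.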
diff --git a/lib/monitoring.nix b/lib/monitoring.nix
index 19e522a..57bffb4 100644
--- a/lib/monitoring.nix
+++ b/lib/monitoring.nix
@@ -21,6 +21,7 @@ let
       cfg = hostConfig.config;
       monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
       dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
+      hostConfig' = (cfg.homelab or { }).host or { };
 
       hostname = cfg.networking.hostName;
       networks = cfg.systemd.network.networks or { };
@@ -49,20 +50,73 @@
       inherit hostname;
       ip = extractIP firstAddress;
       scrapeTargets = monConfig.scrapeTargets or [ ];
+      # Host metadata for label propagation
+      tier = hostConfig'.tier or "prod";
+      priority = hostConfig'.priority or "high";
+      role = hostConfig'.role or null;
+      labels = hostConfig'.labels or { };
     };
 
+  # Build effective labels for a host
+  # Always includes hostname; only includes tier/priority/role if non-default
+  buildEffectiveLabels = host:
+    { hostname = host.hostname; }
+    // (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
+    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
+    // (lib.optionalAttrs (host.role != null) { role = host.role; })
+    // host.labels;
+
   # Generate node-exporter targets from all flake hosts
+  # Returns a list of static_configs entries with labels
   generateNodeExporterTargets = self: externalTargets:
     let
       nixosConfigs = self.nixosConfigurations or { };
       hostList = lib.filter (x: x != null) (
         lib.mapAttrsToList extractHostMonitoring nixosConfigs
       );
-      flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
+
+      # Extract hostname from a target string like "gunter.home.2rjus.net:9100"
+      extractHostnameFromTarget = target:
+        builtins.head (lib.splitString "." target);
+
+      # Build target entries with labels for each host
+      flakeEntries = map
+        (host: {
+          target = "${host.hostname}.home.2rjus.net:9100";
+          labels = buildEffectiveLabels host;
+        })
+        hostList;
+
+      # External targets get hostname extracted from the target string
+      externalEntries = map
+        (target: {
+          inherit target;
+          labels = { hostname = extractHostnameFromTarget target; };
+        })
+        (externalTargets.nodeExporter or [ ]);
+
+      allEntries = flakeEntries ++ externalEntries;
+
+      # Group entries by their label set for efficient static_configs
+      # Convert labels attrset to a string key for grouping
+      labelKey = entry: builtins.toJSON entry.labels;
+      grouped = lib.groupBy labelKey allEntries;
+
+      # Convert groups to static_configs format
+      # Every flake host now has at least a hostname label
+      staticConfigs = lib.mapAttrsToList
+        (key: entries:
+          let
+            labels = (builtins.head entries).labels;
+          in
+          { targets = map (e: e.target) entries; labels = labels; }
+        )
+        grouped;
     in
-    flakeTargets ++ (externalTargets.nodeExporter or [ ]);
+    staticConfigs;
 
   # Generate scrape configs from all flake hosts and external targets
+  # Host labels are propagated to service targets for semantic alert filtering
   generateScrapeConfigs = self: externalTargets:
     let
       nixosConfigs = self.nixosConfigurations or { };
@@ -70,13 +124,14 @@
         lib.mapAttrsToList extractHostMonitoring nixosConfigs
       );
 
-      # Collect all scrapeTargets from all hosts, grouped by job_name
+      # Collect all scrapeTargets from all hosts, including host labels
      allTargets = lib.flatten (
        map
          (host:
            map
              (target: {
                inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
                hostname = host.hostname;
+               hostLabels = buildEffectiveLabels host;
              })
              host.scrapeTargets
          )
@@ -87,22 +142,32 @@
      grouped = lib.groupBy (t: t.job_name) allTargets;
 
      # Generate a scrape config for each job
+     # Within each job, group targets by their host labels for efficient static_configs
      flakeScrapeConfigs = lib.mapAttrsToList
        (jobName: targets:
          let
            first = builtins.head targets;
-           targetAddrs = map
-             (t:
+
+           # Group targets within this job by their host labels
+           labelKey = t: builtins.toJSON t.hostLabels;
+           groupedByLabels = lib.groupBy labelKey targets;
+
+           # Every flake host now has at least a hostname label
+           staticConfigs = lib.mapAttrsToList
+             (key: labelTargets:
                let
-                 portStr = toString t.port;
+                 labels = (builtins.head labelTargets).hostLabels;
+                 targetAddrs = map
+                   (t: "${t.hostname}.home.2rjus.net:${toString t.port}")
+                   labelTargets;
               in
-               "${t.hostname}.home.2rjus.net:${portStr}")
-             targets;
+               { targets = targetAddrs; labels = labels; }
+             )
+             groupedByLabels;
+
            config = {
              job_name = jobName;
-             static_configs = [{
-               targets = targetAddrs;
-             }];
+             static_configs = staticConfigs;
            } // (lib.optionalAttrs (first.metrics_path != "/metrics") {
              metrics_path = first.metrics_path;
diff --git a/services/monitoring/prometheus.nix b/services/monitoring/prometheus.nix
index 57bc86d..c37bd32 100644
--- a/services/monitoring/prometheus.nix
+++ b/services/monitoring/prometheus.nix
@@ -121,22 +121,20 @@ in
   scrapeConfigs = [
     # Auto-generated node-exporter targets from flake hosts + external
+    # Each static_config entry may have labels from homelab.host metadata
     {
       job_name = "node-exporter";
-      static_configs = [
-        {
-          targets = nodeExporterTargets;
-        }
-      ];
+      static_configs = nodeExporterTargets;
     }
     # Systemd exporter on all hosts (same targets, different port)
+    # Preserves the same label grouping as node-exporter
     {
       job_name = "systemd-exporter";
-      static_configs = [
-        {
-          targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
-        }
-      ];
+      static_configs = map
+        (cfg: cfg // {
+          targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+        })
+        nodeExporterTargets;
     }
     # Local monitoring services (not auto-generated)
     {
diff --git a/services/monitoring/rules.yml b/services/monitoring/rules.yml
index 9e612eb..88c5e6c 100644
--- a/services/monitoring/rules.yml
+++ b/services/monitoring/rules.yml
@@ -17,8 +17,9 @@ groups:
       annotations:
         summary: "Disk space low on {{ $labels.instance }}"
         description: "Disk space is low on {{ $labels.instance }}. Please check."
+    # Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
     - alert: high_cpu_load
-      expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+      expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
       for: 15m
       labels:
         severity: warning
@@ -26,7 +27,7 @@ groups:
       annotations:
         summary: "High CPU load on {{ $labels.instance }}"
         description: "CPU load is high on {{ $labels.instance }}. Please check."
     - alert: high_cpu_load
-      expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+      expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
       for: 2h
       labels:
         severity: warning
@@ -115,8 +116,9 @@ groups:
       annotations:
         summary: "NSD not running on {{ $labels.instance }}"
         description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
+    # Only alert on primary DNS (secondary has cold cache after failover)
     - alert: unbound_low_cache_hit_ratio
-      expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
+      expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
       for: 15m
       labels:
         severity: warning
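For reference, the threshold logic behind the two updated alert expressions can be sketched in plain Python. This is a minimal sketch with hypothetical input values; the real evaluation happens inside Prometheus against the labeled series:

```python
def high_cpu_load(load5: float, cpu_count: int) -> bool:
    """Mirror of the high_cpu_load expression: fire when the 5-minute
    load average exceeds 0.7 x the host's CPU count."""
    return load5 > cpu_count * 0.7

def low_cache_hit_ratio(hit_rate: float, miss_rate: float) -> bool:
    """Mirror of unbound_low_cache_hit_ratio: fire when cache hits make
    up less than half of all lookups (per-second rates over 5m)."""
    return hit_rate / (hit_rate + miss_rate) < 0.5

# An 8-core host at load 5.0 stays quiet (threshold 5.6); 6.0 would fire.
print(high_cpu_load(5.0, 8))            # False
print(low_cache_hit_ratio(30.0, 90.0))  # True (ratio 0.25)
```

Prometheus's `promtool test rules` can encode the same expectations against the real expressions in `services/monitoring/rules.yml`.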