docs: update Loki queries from host to hostname label

Update all LogQL examples, agent instructions, and scripts to use the hostname label instead of host, matching the Prometheus label naming convention. Also update pipe-to-loki and bootstrap scripts to push hostname instead of host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:43:47 +01:00
parent 7b804450a3
commit d485948df0
8 changed files with 63 additions and 47 deletions
--- a/.claude/agents/auditor.md
+++ b/.claude/agents/auditor.md
@@ -19,7 +19,7 @@ You may receive:
 ## Audit Log Structure

 Logs are shipped to Loki via promtail. Audit events use these labels:
- `host` - hostname
+- `hostname` - hostname
 - `systemd_unit` - typically `auditd.service` for audit logs
 - `job` - typically `systemd-journal`

@@ -36,7 +36,7 @@ Audit log entries contain structured data:

 Find SSH logins and session activity:
 ```logql
-{host="<hostname>", systemd_unit="sshd.service"}
+{hostname="<hostname>", systemd_unit="sshd.service"}
 ```

 Look for:
@@ -48,7 +48,7 @@ Look for:

 Query executed commands (filter out noise):
 ```logql
-{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
+{hostname="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
 ```

 Further filtering:
@@ -60,28 +60,28 @@ Further filtering:

 Check for privilege escalation:
 ```logql
-{host="<hostname>"} |= "sudo" |= "COMMAND"
+{hostname="<hostname>"} |= "sudo" |= "COMMAND"
 ```

 Or via audit:
 ```logql
-{host="<hostname>"} |= "USER_CMD"
+{hostname="<hostname>"} |= "USER_CMD"
 ```

 ### 4. Service Manipulation

 Check if services were manually stopped/started:
 ```logql
-{host="<hostname>"} |= "EXECVE" |= "systemctl"
+{hostname="<hostname>"} |= "EXECVE" |= "systemctl"
 ```

 ### 5. File Operations

 Look for file modifications (if auditd rules are configured):
 ```logql
-{host="<hostname>"} |= "EXECVE" |= "vim"
-{host="<hostname>"} |= "EXECVE" |= "nano"
-{host="<hostname>"} |= "EXECVE" |= "rm"
+{hostname="<hostname>"} |= "EXECVE" |= "vim"
+{hostname="<hostname>"} |= "EXECVE" |= "nano"
+{hostname="<hostname>"} |= "EXECVE" |= "rm"
 ```

 ## Query Guidelines
@@ -99,7 +99,7 @@ Look for file modifications (if auditd rules are configured):
 **Time-bounded queries:**
 When investigating around a specific event:
 ```logql
-{host="<hostname>"} |= "EXECVE" != "systemd"
+{hostname="<hostname>"} |= "EXECVE" != "systemd"
 ```
 With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`

--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -41,13 +41,13 @@ Search for relevant log entries using `query_logs`. Focus on service-specific lo
 **Query strategies (start narrow, expand if needed):**
 - Start with `limit: 20-30`, increase only if needed
 - Use tight time windows: `start: "15m"` or `start: "30m"` initially
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
+- Filter to specific services: `{hostname="<hostname>", systemd_unit="<service>.service"}`
+- Search for errors: `{hostname="<hostname>"} |= "error"` or `|= "failed"`

 **Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- All errors on host: `{host="<hostname>"} |= "error"`
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
+- Service logs: `{hostname="<hostname>", systemd_unit="<service>.service"}`
+- All errors on host: `{hostname="<hostname>"} |= "error"`
+- Journal for a unit: `{hostname="<hostname>", systemd_unit="nginx.service"} |= "failed"`

 **Avoid:**
 - Using `start: "1h"` with no filters on busy hosts
--- a/.claude/skills/observability/SKILL.md
+++ b/.claude/skills/observability/SKILL.md
@@ -30,11 +30,13 @@ Use the `lab-monitoring` MCP server tools:
 ### Label Reference

 Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
+- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
 - `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams
+- `tier` - Deployment tier (`test` or `prod`)
+- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
+- `level` - Log level mapped from journal PRIORITY (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only

 ### Log Format

@@ -47,12 +49,12 @@ Journal logs are JSON-formatted. Key fields:

 **Logs from a specific service on a host:**
 ```logql
-{host="ns1", systemd_unit="nsd.service"}
+{hostname="ns1", systemd_unit="nsd.service"}
 ```

 **All logs from a host:**
 ```logql
-{host="monitoring01"}
+{hostname="monitoring01"}
 ```

 **Logs from a service across all hosts:**
@@ -62,12 +64,12 @@ Journal logs are JSON-formatted. Key fields:

 **Substring matching (case-sensitive):**
 ```logql
-{host="ha1"} |= "error"
+{hostname="ha1"} |= "error"
 ```

 **Exclude pattern:**
 ```logql
-{host="ns1"} != "routine"
+{hostname="ns1"} != "routine"
 ```

 **Regex matching:**
@@ -75,6 +77,20 @@ Journal logs are JSON-formatted. Key fields:
 {systemd_unit="prometheus.service"} |~ "scrape.*failed"
 ```

+**Filter by level (journal scrape only):**
+```logql
+{level="error"}                                  # All errors across the fleet
+{level=~"critical|error", tier="prod"}           # Prod errors and criticals
+{hostname="ns1", level="warning"}                # Warnings from a specific host
+```
+
+**Filter by tier/role:**
+```logql
+{tier="prod"} |= "error"                        # All errors on prod hosts
+{role="dns"}                                     # All DNS server logs
+{tier="test", job="systemd-journal"}             # Journal logs from test hosts
+```
+
 **File-based logs (caddy access logs, etc):**
 ```logql
 {job="varlog", hostname="nix-cache01"}
@@ -106,7 +122,7 @@ Useful systemd units for troubleshooting:

 VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `host` - Target hostname
+- `hostname` - Target hostname
 - `branch` - Git branch being deployed
 - `stage` - Bootstrap stage (see table below)

@@ -127,7 +143,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl

 ```logql
 {job="bootstrap"}                              # All bootstrap logs
-{job="bootstrap", host="myhost"}               # Specific host
+{job="bootstrap", hostname="myhost"}            # Specific host
 {job="bootstrap", stage="failed"}              # All failures
 {job="bootstrap", stage=~"building|success"}   # Track build progress
 ```
@@ -308,8 +324,8 @@ Current host labels:

 1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
 2. Use `list_targets` to see target health details
-3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
-4. Search for errors: `{host="<host>"} |= "error"`
+3. Query service logs: `{hostname="<host>", systemd_unit="<service>.service"}`
+4. Search for errors: `{hostname="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
 6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers

@@ -324,17 +340,17 @@ Current host labels:

 When provisioning new VMs, track bootstrap progress:

-1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
-2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
+1. Watch bootstrap logs: `{job="bootstrap", hostname="<hostname>"}`
+2. Check for failures: `{job="bootstrap", hostname="<hostname>", stage="failed"}`
 3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
-4. Check logs are flowing: `{host="<hostname>"}`
+4. Check logs are flowing: `{hostname="<hostname>"}`

 See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.

 ### Debug SSH/Access Issues

 ```logql
-{host="<host>", systemd_unit="sshd.service"}
+{hostname="<host>", systemd_unit="sshd.service"}
 ```

 ### Check Recent Upgrades