docs: update Loki queries from host to hostname label
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all LogQL examples, agent instructions, and scripts to use the hostname label instead of host, matching the Prometheus label naming convention. Also update pipe-to-loki and bootstrap scripts to push hostname instead of host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -19,7 +19,7 @@ You may receive:
|
|||||||
## Audit Log Structure
|
## Audit Log Structure
|
||||||
|
|
||||||
Logs are shipped to Loki via promtail. Audit events use these labels:
|
Logs are shipped to Loki via promtail. Audit events use these labels:
|
||||||
- `host` - hostname
|
- `hostname` - hostname
|
||||||
- `systemd_unit` - typically `auditd.service` for audit logs
|
- `systemd_unit` - typically `auditd.service` for audit logs
|
||||||
- `job` - typically `systemd-journal`
|
- `job` - typically `systemd-journal`
|
||||||
|
|
||||||
@@ -36,7 +36,7 @@ Audit log entries contain structured data:
|
|||||||
|
|
||||||
Find SSH logins and session activity:
|
Find SSH logins and session activity:
|
||||||
```logql
|
```logql
|
||||||
{host="<hostname>", systemd_unit="sshd.service"}
|
{hostname="<hostname>", systemd_unit="sshd.service"}
|
||||||
```
|
```
|
||||||
|
|
||||||
Look for:
|
Look for:
|
||||||
@@ -48,7 +48,7 @@ Look for:
|
|||||||
|
|
||||||
Query executed commands (filter out noise):
|
Query executed commands (filter out noise):
|
||||||
```logql
|
```logql
|
||||||
{host="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
|
{hostname="<hostname>"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"
|
||||||
```
|
```
|
||||||
|
|
||||||
Further filtering:
|
Further filtering:
|
||||||
@@ -60,28 +60,28 @@ Further filtering:
|
|||||||
|
|
||||||
Check for privilege escalation:
|
Check for privilege escalation:
|
||||||
```logql
|
```logql
|
||||||
{host="<hostname>"} |= "sudo" |= "COMMAND"
|
{hostname="<hostname>"} |= "sudo" |= "COMMAND"
|
||||||
```
|
```
|
||||||
|
|
||||||
Or via audit:
|
Or via audit:
|
||||||
```logql
|
```logql
|
||||||
{host="<hostname>"} |= "USER_CMD"
|
{hostname="<hostname>"} |= "USER_CMD"
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. Service Manipulation
|
### 4. Service Manipulation
|
||||||
|
|
||||||
Check if services were manually stopped/started:
|
Check if services were manually stopped/started:
|
||||||
```logql
|
```logql
|
||||||
{host="<hostname>"} |= "EXECVE" |= "systemctl"
|
{hostname="<hostname>"} |= "EXECVE" |= "systemctl"
|
||||||
```
|
```
|
||||||
|
|
||||||
### 5. File Operations
|
### 5. File Operations
|
||||||
|
|
||||||
Look for file modifications (if auditd rules are configured):
|
Look for file modifications (if auditd rules are configured):
|
||||||
```logql
|
```logql
|
||||||
{host="<hostname>"} |= "EXECVE" |= "vim"
|
{hostname="<hostname>"} |= "EXECVE" |= "vim"
|
||||||
{host="<hostname>"} |= "EXECVE" |= "nano"
|
{hostname="<hostname>"} |= "EXECVE" |= "nano"
|
||||||
{host="<hostname>"} |= "EXECVE" |= "rm"
|
{hostname="<hostname>"} |= "EXECVE" |= "rm"
|
||||||
```
|
```
|
||||||
|
|
||||||
## Query Guidelines
|
## Query Guidelines
|
||||||
@@ -99,7 +99,7 @@ Look for file modifications (if auditd rules are configured):
|
|||||||
**Time-bounded queries:**
|
**Time-bounded queries:**
|
||||||
When investigating around a specific event:
|
When investigating around a specific event:
|
||||||
```logql
|
```logql
|
||||||
{host="<hostname>"} |= "EXECVE" != "systemd"
|
{hostname="<hostname>"} |= "EXECVE" != "systemd"
|
||||||
```
|
```
|
||||||
With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
|
With `start: "2026-02-08T14:30:00Z"` and `end: "2026-02-08T14:35:00Z"`
|
||||||
|
|
||||||
|
|||||||
@@ -41,13 +41,13 @@ Search for relevant log entries using `query_logs`. Focus on service-specific lo
|
|||||||
**Query strategies (start narrow, expand if needed):**
|
**Query strategies (start narrow, expand if needed):**
|
||||||
- Start with `limit: 20-30`, increase only if needed
|
- Start with `limit: 20-30`, increase only if needed
|
||||||
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
|
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
|
||||||
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
|
- Filter to specific services: `{hostname="<hostname>", systemd_unit="<service>.service"}`
|
||||||
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
|
- Search for errors: `{hostname="<hostname>"} |= "error"` or `|= "failed"`
|
||||||
|
|
||||||
**Common patterns:**
|
**Common patterns:**
|
||||||
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
|
- Service logs: `{hostname="<hostname>", systemd_unit="<service>.service"}`
|
||||||
- All errors on host: `{host="<hostname>"} |= "error"`
|
- All errors on host: `{hostname="<hostname>"} |= "error"`
|
||||||
- Journal for a unit: `{host="<hostname>", systemd_unit="nginx.service"} |= "failed"`
|
- Journal for a unit: `{hostname="<hostname>", systemd_unit="nginx.service"} |= "failed"`
|
||||||
|
|
||||||
**Avoid:**
|
**Avoid:**
|
||||||
- Using `start: "1h"` with no filters on busy hosts
|
- Using `start: "1h"` with no filters on busy hosts
|
||||||
|
|||||||
@@ -30,11 +30,13 @@ Use the `lab-monitoring` MCP server tools:
|
|||||||
### Label Reference
|
### Label Reference
|
||||||
|
|
||||||
Available labels for log queries:
|
Available labels for log queries:
|
||||||
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
|
- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
|
||||||
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
|
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
|
||||||
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
|
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
|
||||||
- `filename` - For `varlog` job, the log file path
|
- `filename` - For `varlog` job, the log file path
|
||||||
- `hostname` - Alternative to `host` for some streams
|
- `tier` - Deployment tier (`test` or `prod`)
|
||||||
|
- `role` - Host role (e.g., `dns`, `vault`, `monitoring`) - matches the Prometheus `role` label
|
||||||
|
- `level` - Log level mapped from journal PRIORITY (`critical`, `error`, `warning`, `notice`, `info`, `debug`) - journal scrape only
|
||||||
|
|
||||||
### Log Format
|
### Log Format
|
||||||
|
|
||||||
@@ -47,12 +49,12 @@ Journal logs are JSON-formatted. Key fields:
|
|||||||
|
|
||||||
**Logs from a specific service on a host:**
|
**Logs from a specific service on a host:**
|
||||||
```logql
|
```logql
|
||||||
{host="ns1", systemd_unit="nsd.service"}
|
{hostname="ns1", systemd_unit="nsd.service"}
|
||||||
```
|
```
|
||||||
|
|
||||||
**All logs from a host:**
|
**All logs from a host:**
|
||||||
```logql
|
```logql
|
||||||
{host="monitoring01"}
|
{hostname="monitoring01"}
|
||||||
```
|
```
|
||||||
|
|
||||||
**Logs from a service across all hosts:**
|
**Logs from a service across all hosts:**
|
||||||
@@ -62,12 +64,12 @@ Journal logs are JSON-formatted. Key fields:
|
|||||||
|
|
||||||
**Substring matching (case-sensitive):**
|
**Substring matching (case-sensitive):**
|
||||||
```logql
|
```logql
|
||||||
{host="ha1"} |= "error"
|
{hostname="ha1"} |= "error"
|
||||||
```
|
```
|
||||||
|
|
||||||
**Exclude pattern:**
|
**Exclude pattern:**
|
||||||
```logql
|
```logql
|
||||||
{host="ns1"} != "routine"
|
{hostname="ns1"} != "routine"
|
||||||
```
|
```
|
||||||
|
|
||||||
**Regex matching:**
|
**Regex matching:**
|
||||||
@@ -75,6 +77,20 @@ Journal logs are JSON-formatted. Key fields:
|
|||||||
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
|
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Filter by level (journal scrape only):**
|
||||||
|
```logql
|
||||||
|
{level="error"} # All errors across the fleet
|
||||||
|
{level=~"critical|error", tier="prod"} # Prod errors and criticals
|
||||||
|
{hostname="ns1", level="warning"} # Warnings from a specific host
|
||||||
|
```
|
||||||
|
|
||||||
|
**Filter by tier/role:**
|
||||||
|
```logql
|
||||||
|
{tier="prod"} |= "error" # All errors on prod hosts
|
||||||
|
{role="dns"} # All DNS server logs
|
||||||
|
{tier="test", job="systemd-journal"} # Journal logs from test hosts
|
||||||
|
```
|
||||||
|
|
||||||
**File-based logs (caddy access logs, etc):**
|
**File-based logs (caddy access logs, etc):**
|
||||||
```logql
|
```logql
|
||||||
{job="varlog", hostname="nix-cache01"}
|
{job="varlog", hostname="nix-cache01"}
|
||||||
@@ -106,7 +122,7 @@ Useful systemd units for troubleshooting:
|
|||||||
|
|
||||||
VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
|
VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
|
||||||
|
|
||||||
- `host` - Target hostname
|
- `hostname` - Target hostname
|
||||||
- `branch` - Git branch being deployed
|
- `branch` - Git branch being deployed
|
||||||
- `stage` - Bootstrap stage (see table below)
|
- `stage` - Bootstrap stage (see table below)
|
||||||
|
|
||||||
@@ -127,7 +143,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl
|
|||||||
|
|
||||||
```logql
|
```logql
|
||||||
{job="bootstrap"} # All bootstrap logs
|
{job="bootstrap"} # All bootstrap logs
|
||||||
{job="bootstrap", host="myhost"} # Specific host
|
{job="bootstrap", hostname="myhost"} # Specific host
|
||||||
{job="bootstrap", stage="failed"} # All failures
|
{job="bootstrap", stage="failed"} # All failures
|
||||||
{job="bootstrap", stage=~"building|success"} # Track build progress
|
{job="bootstrap", stage=~"building|success"} # Track build progress
|
||||||
```
|
```
|
||||||
@@ -308,8 +324,8 @@ Current host labels:
|
|||||||
|
|
||||||
1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
|
1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
|
||||||
2. Use `list_targets` to see target health details
|
2. Use `list_targets` to see target health details
|
||||||
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
|
3. Query service logs: `{hostname="<host>", systemd_unit="<service>.service"}`
|
||||||
4. Search for errors: `{host="<host>"} |= "error"`
|
4. Search for errors: `{hostname="<host>"} |= "error"`
|
||||||
5. Check `list_alerts` for related alerts
|
5. Check `list_alerts` for related alerts
|
||||||
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
|
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
|
||||||
|
|
||||||
@@ -324,17 +340,17 @@ Current host labels:
|
|||||||
|
|
||||||
When provisioning new VMs, track bootstrap progress:
|
When provisioning new VMs, track bootstrap progress:
|
||||||
|
|
||||||
1. Watch bootstrap logs: `{job="bootstrap", host="<hostname>"}`
|
1. Watch bootstrap logs: `{job="bootstrap", hostname="<hostname>"}`
|
||||||
2. Check for failures: `{job="bootstrap", host="<hostname>", stage="failed"}`
|
2. Check for failures: `{job="bootstrap", hostname="<hostname>", stage="failed"}`
|
||||||
3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
|
3. After success, verify host appears in metrics: `up{hostname="<hostname>"}`
|
||||||
4. Check logs are flowing: `{host="<hostname>"}`
|
4. Check logs are flowing: `{hostname="<hostname>"}`
|
||||||
|
|
||||||
See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
|
See [docs/host-creation.md](../../../docs/host-creation.md) for the full host creation pipeline.
|
||||||
|
|
||||||
### Debug SSH/Access Issues
|
### Debug SSH/Access Issues
|
||||||
|
|
||||||
```logql
|
```logql
|
||||||
{host="<host>", systemd_unit="sshd.service"}
|
{hostname="<host>", systemd_unit="sshd.service"}
|
||||||
```
|
```
|
||||||
|
|
||||||
### Check Recent Upgrades
|
### Check Recent Upgrades
|
||||||
|
|||||||
@@ -59,7 +59,7 @@ The script prints the session ID which the user can share. Query results with:
|
|||||||
```logql
|
```logql
|
||||||
{job="pipe-to-loki"} # All entries
|
{job="pipe-to-loki"} # All entries
|
||||||
{job="pipe-to-loki", id="my-test"} # Specific ID
|
{job="pipe-to-loki", id="my-test"} # Specific ID
|
||||||
{job="pipe-to-loki", host="testvm01"} # From specific host
|
{job="pipe-to-loki", hostname="testvm01"} # From specific host
|
||||||
{job="pipe-to-loki", type="session"} # Only sessions
|
{job="pipe-to-loki", type="session"} # Only sessions
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -50,7 +50,7 @@ homelab.host.tier = "test"; # or "prod"
|
|||||||
During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
|
During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
|
||||||
|
|
||||||
```
|
```
|
||||||
{job="bootstrap", host="<hostname>"}
|
{job="bootstrap", hostname="<hostname>"}
|
||||||
```
|
```
|
||||||
|
|
||||||
### Bootstrap Stages
|
### Bootstrap Stages
|
||||||
@@ -72,7 +72,7 @@ The bootstrap process reports these stages via the `stage` label:
|
|||||||
|
|
||||||
```
|
```
|
||||||
# All bootstrap activity for a host
|
# All bootstrap activity for a host
|
||||||
{job="bootstrap", host="myhost"}
|
{job="bootstrap", hostname="myhost"}
|
||||||
|
|
||||||
# Track all failures
|
# Track all failures
|
||||||
{job="bootstrap", stage="failed"}
|
{job="bootstrap", stage="failed"}
|
||||||
@@ -87,7 +87,7 @@ Once the VM reboots with its full configuration, it will start publishing metric
|
|||||||
|
|
||||||
1. Check bootstrap completed successfully:
|
1. Check bootstrap completed successfully:
|
||||||
```
|
```
|
||||||
{job="bootstrap", host="<hostname>", stage="success"}
|
{job="bootstrap", hostname="<hostname>", stage="success"}
|
||||||
```
|
```
|
||||||
|
|
||||||
2. Verify the host is up and reporting metrics:
|
2. Verify the host is up and reporting metrics:
|
||||||
@@ -102,7 +102,7 @@ Once the VM reboots with its full configuration, it will start publishing metric
|
|||||||
|
|
||||||
4. Check logs are flowing:
|
4. Check logs are flowing:
|
||||||
```
|
```
|
||||||
{host="<hostname>"}
|
{hostname="<hostname>"}
|
||||||
```
|
```
|
||||||
|
|
||||||
5. Confirm expected services are running and producing logs
|
5. Confirm expected services are running and producing logs
|
||||||
@@ -119,7 +119,7 @@ Once the VM reboots with its full configuration, it will start publishing metric
|
|||||||
|
|
||||||
1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
|
1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
|
||||||
```
|
```
|
||||||
{job="bootstrap", host="<hostname>"}
|
{job="bootstrap", hostname="<hostname>"}
|
||||||
```
|
```
|
||||||
|
|
||||||
2. **USER**: SSH into the host and check the bootstrap service:
|
2. **USER**: SSH into the host and check the bootstrap service:
|
||||||
@@ -149,7 +149,7 @@ Usually caused by running the `create-host` script without proper credentials, o
|
|||||||
|
|
||||||
2. Check bootstrap logs for vault-related stages:
|
2. Check bootstrap logs for vault-related stages:
|
||||||
```
|
```
|
||||||
{job="bootstrap", host="<hostname>", stage=~"vault.*"}
|
{job="bootstrap", hostname="<hostname>", stage=~"vault.*"}
|
||||||
```
|
```
|
||||||
|
|
||||||
3. **USER**: Regenerate and provision credentials manually:
|
3. **USER**: Regenerate and provision credentials manually:
|
||||||
|
|||||||
@@ -86,13 +86,13 @@ These are generous limits that shouldn't affect normal operation but protect aga
|
|||||||
- The `varlog` scrape config uses `hostname` while journal uses `host` (different label name)
|
- The `varlog` scrape config uses `hostname` while journal uses `host` (different label name)
|
||||||
- No `tier` or `role` labels, making it hard to filter logs by deployment tier or host function
|
- No `tier` or `role` labels, making it hard to filter logs by deployment tier or host function
|
||||||
|
|
||||||
**Recommendations:**
|
**Implemented:** Standardized on `hostname` to match Prometheus labels. The journal scrape previously used a relabel from `__journal__hostname` to `host`; now both scrape configs use a static `hostname` label from `config.networking.hostName`. Also updated `pipe-to-loki` and bootstrap scripts to use `hostname` instead of `host`.
|
||||||
|
|
||||||
1. **Fix varlog label:** Rename `hostname` to `host` for consistency with journal scrape config
|
1. **Standardized label:** Both scrape configs use `hostname` (matching Prometheus) via shared `hostLabels`
|
||||||
2. **Add `tier` label:** Static label from `config.homelab.host.tier` (`test`/`prod`) on both scrape configs
|
2. **Added `tier` label:** Static label from `config.homelab.host.tier` (`test`/`prod`) on both scrape configs
|
||||||
3. **Add `role` label:** Static label from `config.homelab.host.role` on both scrape configs, only when set (10 hosts have no role, so omit to keep labels clean)
|
3. **Added `role` label:** Static label from `config.homelab.host.role` on both scrape configs (conditionally, only when non-null)
|
||||||
|
|
||||||
No cardinality impact - `tier` and `role` are 1:1 with `host`, so they add metadata to existing streams without creating new ones.
|
No cardinality impact - `tier` and `role` are 1:1 with `hostname`, so they add metadata to existing streams without creating new ones.
|
||||||
|
|
||||||
This enables queries like:
|
This enables queries like:
|
||||||
- `{tier="prod"} |= "error"` - all errors on prod hosts
|
- `{tier="prod"} |= "error"` - all errors on prod hosts
|
||||||
@@ -167,10 +167,10 @@ For each service, check whether it supports a JSON log format option and whether
|
|||||||
1. Add `compactor` section to `services/monitoring/loki.nix`
|
1. Add `compactor` section to `services/monitoring/loki.nix`
|
||||||
2. Add `limits_config` with 30-day retention and basic rate limits
|
2. Add `limits_config` with 30-day retention and basic rate limits
|
||||||
3. Update `system/monitoring/logs.nix`:
|
3. Update `system/monitoring/logs.nix`:
|
||||||
- Fix `hostname` → `host` label in varlog scrape config
|
- ~~Fix `hostname` → `host` label in varlog scrape config~~ Done: standardized on `hostname` (matching Prometheus)
|
||||||
- Add `tier` static label from `config.homelab.host.tier` to both scrape configs
|
- ~~Add `tier` static label from `config.homelab.host.tier` to both scrape configs~~ Done
|
||||||
- Add `role` static label from `config.homelab.host.role` (conditionally, only when set) to both scrape configs
|
- ~~Add `role` static label from `config.homelab.host.role` (conditionally, only when set) to both scrape configs~~ Done
|
||||||
- Add pipeline stages to journal scrape config: `json` to extract PRIORITY, `template` to map to level name, `labels` to attach as `level`
|
- ~~Add pipeline stages to journal scrape config: `json` to extract PRIORITY, `template` to map to level name, `labels` to attach as `level`~~ Done
|
||||||
4. Deploy to monitoring01, verify compactor runs and old data gets cleaned
|
4. Deploy to monitoring01, verify compactor runs and old data gets cleaned
|
||||||
5. Verify `level` label works: `{level="error"}` should return results, and match cases where `detected_level="unknown"`
|
5. Verify `level` label works: `{level="error"}` should return results, and match cases where `detected_level="unknown"`
|
||||||
|
|
||||||
|
|||||||
@@ -28,7 +28,7 @@ let
|
|||||||
streams: [{
|
streams: [{
|
||||||
stream: {
|
stream: {
|
||||||
job: "bootstrap",
|
job: "bootstrap",
|
||||||
host: $host,
|
hostname: $host,
|
||||||
stage: $stage,
|
stage: $stage,
|
||||||
branch: $branch
|
branch: $branch
|
||||||
},
|
},
|
||||||
|
|||||||
@@ -61,7 +61,7 @@ let
|
|||||||
streams: [{
|
streams: [{
|
||||||
stream: {
|
stream: {
|
||||||
job: $job,
|
job: $job,
|
||||||
host: $host,
|
hostname: $host,
|
||||||
type: $type,
|
type: $type,
|
||||||
id: $id
|
id: $id
|
||||||
},
|
},
|
||||||
|
|||||||
Reference in New Issue
Block a user