claude: improve investigate-alarm log query guidelines

Add best practices for querying Loki to avoid overwhelming responses:
- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 03:14:54 +01:00
parent 7fcc043a4d
commit 3f1d966919


@@ -35,14 +35,29 @@ Gather evidence about the current system state:
### 3. Check Logs
Search for relevant log entries:
- Use `query_logs` to search Loki for the affected host/service
- Common patterns:
- `{host="<hostname>", systemd_unit="<service>.service"}`
- `{host="<hostname>"} |= "error"`
- `{systemd_unit="<service>.service"}` across all hosts
- Look for errors, warnings, or unusual patterns around the alert time
- Use `start: "1h"` or longer for context
Search for relevant log entries using `query_logs`. **Be careful to avoid overly broad queries that return too much data.**
**Query strategies (start narrow, expand if needed):**
- Start with `limit: 20-30`, increase only if needed
- Use tight time windows: `start: "15m"` or `start: "30m"` initially
- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
**For audit logs (SSH sessions, command execution):**
- Filter to just commands: `{host="<hostname>"} |= "EXECVE"`
- Exclude verbose noise: `!= "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`
- Example: `{host="testvm01"} |= "EXECVE" != "systemd"` (user commands only)
**Common patterns:**
- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
- SSH activity: `{host="<hostname>", systemd_unit="sshd.service"}`
- All errors on host: `{host="<hostname>"} |= "error"`
- Specific command: `{host="<hostname>"} |= "EXECVE" |= "stress"`
**Avoid:**
- Querying all audit logs without filtering (very verbose)
- Using `start: "1h"` with no filters on busy hosts
- Limits over 50 without specific filters
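
The filter patterns above can be composed programmatically. The following is a minimal sketch, not part of the skill itself: `build_logql` and `AUDIT_NOISE` are hypothetical names introduced for illustration, and the helper assumes label values and search strings contain no double quotes (it does no escaping).

```python
# Hypothetical helper: assembles a LogQL query from label matchers
# plus "contains" (|=) and "excludes" (!=) line filters.
def build_logql(labels, contains=(), excludes=()):
    # Stream selector, e.g. {host="testvm01", systemd_unit="sshd.service"}
    selector = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    query = "{" + selector + "}"
    # Lines must contain every string in `contains`
    for needle in contains:
        query += f' |= "{needle}"'
    # Lines containing any string in `excludes` are dropped
    for needle in excludes:
        query += f' != "{needle}"'
    return query


# Noise strings taken from the audit log guidance above
AUDIT_NOISE = ("PATH item", "PROCTITLE", "SYSCALL", "BPF")

# Commands executed on one host, with verbose audit records excluded
audit_query = build_logql(
    {"host": "testvm01"},
    contains=("EXECVE",),
    excludes=AUDIT_NOISE,
)
print(audit_query)
```

The resulting string would then be passed to `query_logs` together with a small `limit` and a tight `start` window.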
### 4. Check Configuration (if relevant)
@@ -119,3 +134,5 @@ Provide a concise report with one of two outcomes:
- Consider the host's tier (test vs prod) when assessing severity
- Build a timeline from log timestamps and metrics to show the sequence of events
- Include precursor events (logins, config changes, restarts) that led to the issue
- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
- **Avoid broad audit log queries**: always filter to EXECVE and exclude noise (PATH, PROCTITLE, SYSCALL, BPF)
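
One way to picture the incremental approach is a loop that widens the time window and limit only when the narrower query returns too little. This is a sketch under stated assumptions: `run_query` is a hypothetical stand-in for whatever actually executes the Loki query (the real `query_logs` signature is not shown here), and the step values are illustrative choices based on the guidance above.

```python
# (start, limit) pairs, narrowest first. Illustrative values only.
STEPS = (("15m", 20), ("30m", 30), ("1h", 50))


def query_incrementally(run_query, query, steps=STEPS, enough=1):
    """Run `query` with progressively wider windows until enough lines return.

    `run_query(query, start, limit)` is a hypothetical callable that
    returns a list of log lines; it stands in for the real log tool.
    """
    results = []
    for start, limit in steps:
        results = run_query(query, start, limit)
        # Stop widening as soon as the narrow query is informative enough
        if len(results) >= enough:
            break
    return results
```

The query passed in should already carry specific filters, since the widest step reaches `start: "1h"` and `limit: 50`.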