claude: improve investigate-alarm log query guidelines
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Add best practices for querying Loki to avoid overwhelming responses:

- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -35,14 +35,29 @@ Gather evidence about the current system state:
 
 ### 3. Check Logs
 
-Search for relevant log entries:
-- Use `query_logs` to search Loki for the affected host/service
-- Common patterns:
-  - `{host="<hostname>", systemd_unit="<service>.service"}`
-  - `{host="<hostname>"} |= "error"`
-  - `{systemd_unit="<service>.service"}` across all hosts
-- Look for errors, warnings, or unusual patterns around the alert time
-- Use `start: "1h"` or longer for context
+Search for relevant log entries using `query_logs`. **Be careful to avoid overly broad queries that return too much data.**
+
+**Query strategies (start narrow, expand if needed):**
+- Start with `limit: 20-30`, increase only if needed
+- Use tight time windows: `start: "15m"` or `start: "30m"` initially
+- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
+- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
+
+**For audit logs (SSH sessions, command execution):**
+- Filter to just commands: `{host="<hostname>"} |= "EXECVE"`
+- Exclude verbose noise: `!= "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`
+- Example: `{host="testvm01"} |= "EXECVE" != "systemd"` (user commands only)
+
+**Common patterns:**
+- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
+- SSH activity: `{host="<hostname>", systemd_unit="sshd.service"}`
+- All errors on host: `{host="<hostname>"} |= "error"`
+- Specific command: `{host="<hostname>"} |= "EXECVE" |= "stress"`
+
+**Avoid:**
+- Querying all audit logs without filtering (very verbose)
+- Using `start: "1h"` with no filters on busy hosts
+- Limits over 50 without specific filters
 
 ### 4. Check Configuration (if relevant)
 
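The audit-log filters added in this hunk chain one inclusion filter (`|= "EXECVE"`) with several exclusion filters (`!=`). As a rough illustration only, the composition might be sketched in Python as below. The helper name `audit_command_query` and its parameters are hypothetical and not part of the repository. The sketch assumes the host name contains no characters that need escaping in a LogQL string.

```python
from typing import Iterable

# Record types the guidelines list as verbose noise in audit logs.
AUDIT_NOISE = ("PATH item", "PROCTITLE", "SYSCALL", "BPF")


def audit_command_query(host: str, extra_excludes: Iterable[str] = ()) -> str:
    """Build a LogQL query that keeps EXECVE lines and drops noisy ones.

    Hypothetical helper for illustration; assumes `host` and the filter
    terms contain no double quotes or backslashes.
    """
    parts = ['{host="%s"}' % host, '|= "EXECVE"']
    # Each exclusion becomes its own `!=` line filter, applied in order.
    for term in (*AUDIT_NOISE, *extra_excludes):
        parts.append('!= "%s"' % term)
    return " ".join(parts)


print(audit_command_query("testvm01"))
```

Passing `extra_excludes=["systemd"]` would append `!= "systemd"`, which matches the "user commands only" example in the guidelines.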
@@ -119,3 +134,5 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - Include precursor events (logins, config changes, restarts) that led to the issue
+- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
+- **Avoid broad audit log queries**: always filter to EXECVE and exclude noise (PATH, SYSCALL, BPF)
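The "start narrow, expand only if needed" rule can be read as a small escalation ladder over the time window and the result limit. The following is a minimal sketch under stated assumptions, not how the agent actually calls the tool. `run_query` is a hypothetical stand-in for `query_logs`, assumed to accept `start` and `limit` keyword arguments and return a list of log lines. The step values are picked from the ranges mentioned in the guidelines.

```python
from typing import Callable, List, Sequence, Tuple

# (start window, limit) pairs, narrowest first. Values are illustrative,
# chosen to stay within the ranges the guidelines mention.
DEFAULT_STEPS: Sequence[Tuple[str, int]] = (("15m", 20), ("30m", 30), ("1h", 50))


def query_incrementally(
    run_query: Callable[..., List[str]],
    logql: str,
    min_results: int = 1,
    steps: Sequence[Tuple[str, int]] = DEFAULT_STEPS,
) -> List[str]:
    """Run `logql` with widening windows until enough lines come back.

    `run_query` is a hypothetical callable standing in for `query_logs`.
    Returns the results of the last step tried (possibly an empty list).
    """
    results: List[str] = []
    for start, limit in steps:
        results = run_query(logql, start=start, limit=limit)
        if len(results) >= min_results:
            break  # enough evidence; do not widen the query further
    return results
```

The widest step should only be reached with a query that already carries specific filters, since the guidelines warn against `start: "1h"` with no filters on busy hosts.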