claude: improve investigate-alarm log query guidelines

Add best practices for querying Loki to avoid overwhelming responses:
- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 03:14:54 +01:00
parent 7fcc043a4d
commit 3f1d966919
1 changed file with 25 additions and 8 deletions
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -35,14 +35,29 @@ Gather evidence about the current system state:

 ### 3. Check Logs

-Search for relevant log entries:
-- Use `query_logs` to search Loki for the affected host/service
-- Common patterns:
-  - `{host="<hostname>", systemd_unit="<service>.service"}`
-  - `{host="<hostname>"} |= "error"`
-  - `{systemd_unit="<service>.service"}` across all hosts
-- Look for errors, warnings, or unusual patterns around the alert time
-- Use `start: "1h"` or longer for context
+Search for relevant log entries using `query_logs`. **Be careful to avoid overly broad queries that return too much data.**
+
+**Query strategies (start narrow, expand if needed):**
+- Start with `limit: 20-30`, increase only if needed
+- Use tight time windows: `start: "15m"` or `start: "30m"` initially
+- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
+- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
+
+**For audit logs (SSH sessions, command execution):**
+- Filter to just commands: `{host="<hostname>"} |= "EXECVE"`
+- Exclude verbose noise: `!= "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`
+- Example: `{host="testvm01"} |= "EXECVE" != "systemd"` (user commands only)
+
+**Common patterns:**
+- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
+- SSH activity: `{host="<hostname>", systemd_unit="sshd.service"}`
+- All errors on host: `{host="<hostname>"} |= "error"`
+- Specific command: `{host="<hostname>"} |= "EXECVE" |= "stress"`
+
+**Avoid:**
+- Querying all audit logs without filtering (very verbose)
+- Using `start: "1h"` with no filters on busy hosts
+- Limits over 50 without specific filters

 ### 4. Check Configuration (if relevant)

@@ -119,3 +134,5 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - Include precursor events (logins, config changes, restarts) that led to the issue
+- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
+- **Avoid broad audit log queries**: always filter to EXECVE and exclude noise (PATH, SYSCALL, BPF)
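
The "start narrow, expand if needed" guidance added in this commit could be sketched roughly as follows. This is an illustration only, not part of the commit: `audit_query`, `query_incrementally`, `EXPANSION_STEPS`, and the `run_query(logql, start, limit)` callable are hypothetical names, and the real `query_logs` tool may take its parameters in a different form.

```python
from typing import Callable, List, Sequence, Tuple

# Noise markers listed in the guidelines; these are substring filters.
AUDIT_NOISE = ("PATH item", "PROCTITLE", "SYSCALL", "BPF")

# Hypothetical expansion schedule of (start, limit) pairs, narrow first.
EXPANSION_STEPS: Tuple[Tuple[str, int], ...] = (("15m", 20), ("30m", 30), ("1h", 50))


def audit_query(host: str, extra_contains: Sequence[str] = ()) -> str:
    """Build a LogQL query for executed commands on a single host."""
    parts = [f'{{host="{host}"}}', '|= "EXECVE"']
    # Optional extra terms that matching lines must contain.
    parts += [f'|= "{term}"' for term in extra_contains]
    # Exclude the verbose audit record types.
    parts += [f'!= "{noise}"' for noise in AUDIT_NOISE]
    return " ".join(parts)


def query_incrementally(
    run_query: Callable[[str, str, int], List[str]],
    logql: str,
    min_results: int = 1,
    steps: Sequence[Tuple[str, int]] = EXPANSION_STEPS,
) -> List[str]:
    """Run the query with the narrowest window first, widening only if needed.

    `run_query` is an assumed stand-in for whatever actually calls the log
    backend; it receives the query, a relative start, and a limit.
    """
    results: List[str] = []
    for start, limit in steps:
        results = run_query(logql, start, limit)
        if len(results) >= min_results:
            break  # enough evidence, stop before widening further
    return results
```

For example, `audit_query("testvm01", ["stress"])` builds `{host="testvm01"} |= "EXECVE" |= "stress" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`. The backend call is injected as a callable so the expansion logic stays independent of how `query_logs` is actually invoked.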