From 3f1d966919d8438a444b77a75177f5955a830db9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= <torjus@usit.uio.no>
Date: Sun, 8 Feb 2026 03:14:54 +0100
Subject: [PATCH] claude: improve investigate-alarm log query guidelines

Add best practices for querying Loki to avoid overwhelming responses:
- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
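Note for reviewers: the two audit-log bullets in the hunk below are meant to be combined into a single query. A minimal sketch, assuming the `host` label and the `testvm01` hostname from the patch's own example; the exact record text depends on how audit output reaches Loki.

```logql
# First pass: run with limit: 20 and start: "15m"
{host="testvm01"} |= "EXECVE" != "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"

# Narrow to one command if the first pass is still too noisy
{host="testvm01"} |= "EXECVE" |= "stress"
```

Chained line filters must all match, so each added `!=` can only shrink the result set.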
 .claude/agents/investigate-alarm.md | 33 ++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/.claude/agents/investigate-alarm.md b/.claude/agents/investigate-alarm.md
index 9ffbb12..009770c 100644
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -35,14 +35,29 @@ Gather evidence about the current system state:
 
 ### 3. Check Logs
 
-Search for relevant log entries:
-- Use `query_logs` to search Loki for the affected host/service
-- Common patterns:
-  - `{host="<hostname>", systemd_unit="<service>.service"}`
-  - `{host="<hostname>"} |= "error"`
-  - `{systemd_unit="<service>.service"}` across all hosts
-- Look for errors, warnings, or unusual patterns around the alert time
-- Use `start: "1h"` or longer for context
+Search for relevant log entries using `query_logs`. **Avoid overly broad queries that return too much data.**
+
+**Query strategies (start narrow, expand if needed):**
+- Start with a small `limit` (20-30), increase only if needed
+- Use tight time windows: `start: "15m"` or `start: "30m"` initially
+- Filter to specific services: `{host="<hostname>", systemd_unit="<service>.service"}`
+- Search for errors: `{host="<hostname>"} |= "error"` or `|= "failed"`
+
+**For audit logs (SSH sessions, command execution):**
+- Filter to just commands: `{host="<hostname>"} |= "EXECVE"`
+- Exclude verbose noise: `!= "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`
+- Example: `{host="testvm01"} |= "EXECVE" != "systemd"` (drops lines mentioning systemd, leaving mostly user-run commands)
+
+**Common patterns:**
+- Service logs: `{host="<hostname>", systemd_unit="<service>.service"}`
+- SSH activity: `{host="<hostname>", systemd_unit="sshd.service"}`
+- All errors on host: `{host="<hostname>"} |= "error"`
+- Specific command: `{host="<hostname>"} |= "EXECVE" |= "stress"`
+
+**Avoid:**
+- Querying all audit logs without filtering (very verbose)
+- Using `start: "1h"` with no filters on busy hosts
+- Limits over 50 without specific filters
 
 ### 4. Check Configuration (if relevant)
 
@@ -119,3 +134,5 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - Include precursor events (logins, config changes, restarts) that led to the issue
+- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
+- **Avoid broad audit log queries**: always filter to EXECVE and exclude noise (PATH, PROCTITLE, SYSCALL, BPF)