From 3f1d966919d8438a444b77a75177f5955a830db9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sun, 8 Feb 2026 03:14:54 +0100
Subject: [PATCH] claude: improve investigate-alarm log query guidelines

Add best practices for querying Loki to avoid overwhelming responses:
- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed

Co-Authored-By: Claude Opus 4.5
---
 .claude/agents/investigate-alarm.md | 33 ++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/.claude/agents/investigate-alarm.md b/.claude/agents/investigate-alarm.md
index 9ffbb12..009770c 100644
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -35,14 +35,29 @@ Gather evidence about the current system state:
 
 ### 3. Check Logs
 
-Search for relevant log entries:
-- Use `query_logs` to search Loki for the affected host/service
-- Common patterns:
-  - `{host="", systemd_unit=".service"}`
-  - `{host=""} |= "error"`
-  - `{systemd_unit=".service"}` across all hosts
-- Look for errors, warnings, or unusual patterns around the alert time
-- Use `start: "1h"` or longer for context
+Search for relevant log entries using `query_logs`. **Be careful to avoid overly broad queries that return too much data.**
+
+**Query strategies (start narrow, expand if needed):**
+- Start with `limit: 20-30`, increase only if needed
+- Use tight time windows: `start: "15m"` or `start: "30m"` initially
+- Filter to specific services: `{host="", systemd_unit=".service"}`
+- Search for errors: `{host=""} |= "error"` or `|= "failed"`
+
+**For audit logs (SSH sessions, command execution):**
+- Filter to just commands: `{host=""} |= "EXECVE"`
+- Exclude verbose noise: `!= "PATH item" != "PROCTITLE" != "SYSCALL" != "BPF"`
+- Example: `{host="testvm01"} |= "EXECVE" != "systemd"` (user commands only)
+
+**Common patterns:**
+- Service logs: `{host="", systemd_unit=".service"}`
+- SSH activity: `{host="", systemd_unit="sshd.service"}`
+- All errors on host: `{host=""} |= "error"`
+- Specific command: `{host=""} |= "EXECVE" |= "stress"`
+
+**Avoid:**
+- Querying all audit logs without filtering (very verbose)
+- Using `start: "1h"` with no filters on busy hosts
+- Limits over 50 without specific filters
 
 ### 4. Check Configuration (if relevant)
 
@@ -119,3 +134,5 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - Include precursor events (logins, config changes, restarts) that led to the issue
+- **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
+- **Avoid broad audit log queries**: always filter to EXECVE and exclude noise (PATH, SYSCALL, BPF)
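
The "start narrow, expand if needed" guidance in this patch can be sketched as a small helper that assembles a LogQL string and lists progressively wider attempts. This is only an illustration under stated assumptions: `build_query`, `expansion_steps`, `AUDIT_NOISE`, and the final `1h`/`50` step are hypothetical and not part of the agent or of the `query_logs` tool. Only the stream selector syntax, the `|=` / `!=` line filters, and the filter strings themselves come from the patch.

```python
# Hypothetical helper names; none of these exist in the agent or in query_logs.

# Noise strings the patch recommends excluding from audit log queries.
AUDIT_NOISE = ("PATH item", "PROCTITLE", "SYSCALL", "BPF")


def build_query(labels, include=(), exclude=()):
    """Assemble a LogQL query: a stream selector followed by line filters."""
    selector = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    parts = ["{" + selector + "}"]
    parts += [f'|= "{text}"' for text in include]   # lines must contain text
    parts += [f'!= "{text}"' for text in exclude]   # lines must not contain text
    return " ".join(parts)


def expansion_steps():
    """Yield (start, limit) pairs from narrow to wide.

    The first two steps mirror the patch; the last one is an assumed upper
    bound that stays within the "limits over 50" warning.
    """
    yield from (("15m", 20), ("30m", 30), ("1h", 50))


# Audit query: commands only, with the verbose record types filtered out.
audit_query = build_query(
    {"host": "testvm01"}, include=("EXECVE",), exclude=AUDIT_NOISE
)
print(audit_query)

# Try each window in order and stop as soon as the results are sufficient.
for start, limit in expansion_steps():
    print(f"start={start} limit={limit}")
```

Keeping the filters fixed while only widening `start` and `limit` means each retry returns a superset of the previous one, so earlier findings stay valid as the window grows.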