claude: make auditor delegation explicit in investigate-alarm

- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:11:09 +01:00
parent e2dd21c994
commit 11cbb64097
1 changed file with 21 additions and 11 deletions
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -53,24 +53,31 @@ Search for relevant log entries using `query_logs`. Focus on service-specific lo
 - Using `start: "1h"` with no filters on busy hosts
 - Limits over 50 without specific filters
-### 4. Investigate User Activity (if needed)
+### 4. Investigate User Activity
-If you suspect the issue may be related to user actions (manual commands, SSH sessions, service manipulation), **spawn the `auditor` agent** to investigate.
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
-Use the auditor when you need to know:
+**Always call the auditor when:**
-- What commands were run around the time of an incident
+- A service stopped unexpectedly (may have been manually stopped)
-- Whether a service was manually stopped/restarted
+- A process was killed or a config was changed
-- Who was logged in and what they did
+- You need to know who was logged in around the time of an incident
-- Whether there's suspicious activity on the host
+- You need to understand what commands led to the current state
 - The cause isn't obvious from service logs alone
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
 **Example prompt for auditor:**
 ```
 Investigate user activity on <hostname> between <start_time> and <end_time>.
-Context: The nginx service stopped unexpectedly at 14:32. Check if it was
+Context: The prometheus-node-exporter service stopped at 14:32.
-manually stopped or if any commands were run around that time.
+Determine if it was manually stopped and by whom.
 ```
-The auditor will return a focused report on relevant user activity that you can incorporate into your investigation.
+Incorporate the auditor's findings into your timeline and root cause analysis.
 ### 5. Check Configuration (if relevant)
@@ -133,6 +140,7 @@ get_commit_info(<hash>)            # Get full details of a specific change
 ### 7. Consider Common Causes
 For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
 - **Configuration drift**: Host running outdated config, fix already in master
 - **Disk space**: Nix store growth, logs, temp files
 - **Memory pressure**: Service memory leaks, insufficient limits
@@ -141,6 +149,8 @@ For infrastructure alerts, common causes include:
 - **Service restarts**: Failed upgrades, configuration errors
 - **Scrape failures**: Service down, firewall issues, port changes
+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
 ## Output Format
 Provide a concise report with one of two outcomes:
@@ -198,4 +208,4 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
-- **Use the auditor agent** for user activity analysis (commands, SSH sessions, sudo usage)
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
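
The delegation rule this commit introduces can be sketched in a few lines of Python. This is only an illustration, not part of the change: `likely_manual_intervention` and `build_auditor_prompt` are hypothetical helper names, and the template simply mirrors the example prompt from section 4. How the `auditor` agent is actually spawned is not shown in the diff, so it is left out here.

```python
# Hypothetical template, copied from the shape of the example prompt in section 4.
AUDITOR_PROMPT_TEMPLATE = (
    "Investigate user activity on {hostname} between {start_time} and {end_time}.\n"
    "Context: {context}"
)


def likely_manual_intervention(stopped_unexpectedly: bool,
                               logs_show_crash_or_error: bool) -> bool:
    """Heuristic from the added note: an unexpected stop with no crash or
    error in the service logs points towards manual intervention."""
    return stopped_unexpectedly and not logs_show_crash_or_error


def build_auditor_prompt(hostname: str, start_time: str, end_time: str,
                         context: str) -> str:
    """Fill in the auditor prompt; the investigating agent delegates with
    this text instead of querying EXECVE or USER_LOGIN records itself."""
    return AUDITOR_PROMPT_TEMPLATE.format(
        hostname=hostname,
        start_time=start_time,
        end_time=end_time,
        context=context,
    )


if __name__ == "__main__":
    if likely_manual_intervention(stopped_unexpectedly=True,
                                  logs_show_crash_or_error=False):
        print(build_auditor_prompt(
            "<hostname>", "<start_time>", "<end_time>",
            "The prometheus-node-exporter service stopped at 14:32. "
            "Determine if it was manually stopped and by whom.",
        ))
```

The heuristic only decides how strongly to suspect manual intervention; under the updated guidelines the auditor is called for any user activity analysis regardless of its result.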