diff --git a/.claude/agents/investigate-alarm.md b/.claude/agents/investigate-alarm.md
index 6c83bbe..e4607ff 100644
--- a/.claude/agents/investigate-alarm.md
+++ b/.claude/agents/investigate-alarm.md
@@ -53,24 +53,31 @@ Search for relevant log entries using `query_logs`. Focus on service-specific lo
 - Using `start: "1h"` with no filters on busy hosts
 - Limits over 50 without specific filters
 
-### 4. Investigate User Activity (if needed)
+### 4. Investigate User Activity
 
-If you suspect the issue may be related to user actions (manual commands, SSH sessions, service manipulation), **spawn the `auditor` agent** to investigate.
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
 
-Use the auditor when you need to know:
-- What commands were run around the time of an incident
-- Whether a service was manually stopped/restarted
-- Who was logged in and what they did
-- Whether there's suspicious activity on the host
+**Always call the auditor when:**
+- A service stopped unexpectedly (may have been manually stopped)
+- A process was killed or a config was changed
+- You need to know who was logged in around the time of an incident
+- You need to understand what commands led to the current state
+- The cause isn't obvious from service logs alone
+
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
 
 **Example prompt for auditor:**
 ```
 Investigate user activity on between and .
-Context: The nginx service stopped unexpectedly at 14:32. Check if it was
-manually stopped or if any commands were run around that time.
+Context: The prometheus-node-exporter service stopped at 14:32.
+Determine if it was manually stopped and by whom.
 ```
 
-The auditor will return a focused report on relevant user activity that you can incorporate into your investigation.
+Incorporate the auditor's findings into your timeline and root cause analysis.
 
 ### 5. Check Configuration (if relevant)
 
@@ -133,6 +140,7 @@ get_commit_info() # Get full details of a specific change
 ### 7. Consider Common Causes
 
 For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
 - **Configuration drift**: Host running outdated config, fix already in master
 - **Disk space**: Nix store growth, logs, temp files
 - **Memory pressure**: Service memory leaks, insufficient limits
@@ -141,6 +149,8 @@ For infrastructure alerts, common causes include:
 - **Service restarts**: Failed upgrades, configuration errors
 - **Scrape failures**: Service down, firewall issues, port changes
 
+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
+
 ## Output Format
 
 Provide a concise report with one of two outcomes:
@@ -198,4 +208,4 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
-- **Use the auditor agent** for user activity analysis (commands, SSH sessions, sudo usage)
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly