claude: make auditor delegation explicit in investigate-alarm

- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:11:09 +01:00
parent e2dd21c994
commit 11cbb64097

@@ -53,24 +53,31 @@ Search for relevant log entries using `query_logs`. Focus on service-specific lo
 - Using `start: "1h"` with no filters on busy hosts
 - Limits over 50 without specific filters
-### 4. Investigate User Activity (if needed)
+### 4. Investigate User Activity
-If you suspect the issue may be related to user actions (manual commands, SSH sessions, service manipulation), **spawn the `auditor` agent** to investigate.
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
-Use the auditor when you need to know:
+**Always call the auditor when:**
-- What commands were run around the time of an incident
-- Whether a service was manually stopped/restarted
-- Who was logged in and what they did
-- Whether there's suspicious activity on the host
+- A service stopped unexpectedly (may have been manually stopped)
+- A process was killed or a config was changed
+- You need to know who was logged in around the time of an incident
+- You need to understand what commands led to the current state
+- The cause isn't obvious from service logs alone
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
 **Example prompt for auditor:**
 ```
 Investigate user activity on <hostname> between <start_time> and <end_time>.
-Context: The nginx service stopped unexpectedly at 14:32. Check if it was
-manually stopped or if any commands were run around that time.
+Context: The prometheus-node-exporter service stopped at 14:32.
+Determine if it was manually stopped and by whom.
 ```
-The auditor will return a focused report on relevant user activity that you can incorporate into your investigation.
+Incorporate the auditor's findings into your timeline and root cause analysis.
 ### 5. Check Configuration (if relevant)
@@ -133,6 +140,7 @@ get_commit_info(<hash>) # Get full details of a specific change
 ### 7. Consider Common Causes
 For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
 - **Configuration drift**: Host running outdated config, fix already in master
 - **Disk space**: Nix store growth, logs, temp files
 - **Memory pressure**: Service memory leaks, insufficient limits
@@ -141,6 +149,8 @@ For infrastructure alerts, common causes include:
 - **Service restarts**: Failed upgrades, configuration errors
 - **Scrape failures**: Service down, firewall issues, port changes
+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
 ## Output Format
 Provide a concise report with one of two outcomes:
@@ -198,4 +208,4 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
-- **Use the auditor agent** for user activity analysis (commands, SSH sessions, sudo usage)
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
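
The delegation rule introduced in section 4 can be read as a simple predicate plus a prompt template. The Python sketch below is illustrative only and is not part of the commit: `AlarmContext`, `should_call_auditor`, and `build_auditor_prompt` are hypothetical names, and the boolean fields are an assumed way of summarizing the triggers listed under "Always call the auditor when".

```python
from dataclasses import dataclass


@dataclass
class AlarmContext:
    """Hypothetical summary of what the investigation has found so far."""
    service_stopped_unexpectedly: bool = False
    process_killed_or_config_changed: bool = False
    need_login_or_command_history: bool = False
    cause_obvious_from_service_logs: bool = True


def should_call_auditor(ctx: AlarmContext) -> bool:
    """Return True when any trigger from the section 4 list applies."""
    return (
        ctx.service_stopped_unexpectedly
        or ctx.process_killed_or_config_changed
        or ctx.need_login_or_command_history
        or not ctx.cause_obvious_from_service_logs
    )


def build_auditor_prompt(hostname: str, start_time: str, end_time: str, context: str) -> str:
    """Fill in the example prompt template shown in section 4."""
    return (
        f"Investigate user activity on {hostname} between {start_time} and {end_time}.\n"
        f"Context: {context}"
    )


if __name__ == "__main__":
    # Hypothetical host name and time window, chosen only for illustration.
    ctx = AlarmContext(service_stopped_unexpectedly=True,
                       cause_obvious_from_service_logs=False)
    if should_call_auditor(ctx):
        print(build_auditor_prompt(
            "host01", "14:00", "15:00",
            "The prometheus-node-exporter service stopped at 14:32.\n"
            "Determine if it was manually stopped and by whom.",
        ))
```

Keeping the triggers as explicit fields mirrors the bullet list in the skill text, so a change to that list maps to a one-line change in the predicate.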