claude: make auditor delegation explicit in investigate-alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -53,24 +53,31 @@ Search for relevant log entries using `query_logs`. Focus on service-specific lo
 - Using `start: "1h"` with no filters on busy hosts
 - Limits over 50 without specific filters
 
-### 4. Investigate User Activity (if needed)
+### 4. Investigate User Activity
 
-If you suspect the issue may be related to user actions (manual commands, SSH sessions, service manipulation), **spawn the `auditor` agent** to investigate.
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
 
-Use the auditor when you need to know:
-- What commands were run around the time of an incident
-- Whether a service was manually stopped/restarted
-- Who was logged in and what they did
-- Whether there's suspicious activity on the host
+**Always call the auditor when:**
+- A service stopped unexpectedly (may have been manually stopped)
+- A process was killed or a config was changed
+- You need to know who was logged in around the time of an incident
+- You need to understand what commands led to the current state
+- The cause isn't obvious from service logs alone
+
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
 
 **Example prompt for auditor:**
 ```
 Investigate user activity on <hostname> between <start_time> and <end_time>.
-Context: The nginx service stopped unexpectedly at 14:32. Check if it was
-manually stopped or if any commands were run around that time.
+Context: The prometheus-node-exporter service stopped at 14:32.
+Determine if it was manually stopped and by whom.
 ```
 
-The auditor will return a focused report on relevant user activity that you can incorporate into your investigation.
+Incorporate the auditor's findings into your timeline and root cause analysis.
 
 ### 5. Check Configuration (if relevant)
 
@@ -133,6 +140,7 @@ get_commit_info(<hash>) # Get full details of a specific change
 ### 7. Consider Common Causes
 
 For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
 - **Configuration drift**: Host running outdated config, fix already in master
 - **Disk space**: Nix store growth, logs, temp files
 - **Memory pressure**: Service memory leaks, insufficient limits
@@ -141,6 +149,8 @@ For infrastructure alerts, common causes include:
 - **Service restarts**: Failed upgrades, configuration errors
 - **Scrape failures**: Service down, firewall issues, port changes
 
+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
+
 ## Output Format
 
 Provide a concise report with one of two outcomes:
@@ -198,4 +208,4 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
-- **Use the auditor agent** for user activity analysis (commands, SSH sessions, sudo usage)
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly