claude: make auditor delegation explicit in investigate-alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -53,24 +53,31 @@ Search for relevant log entries using `query_logs`. Focus on service-specific lo
 - Using `start: "1h"` with no filters on busy hosts
 - Limits over 50 without specific filters
 
-### 4. Investigate User Activity (if needed)
+### 4. Investigate User Activity
 
-If you suspect the issue may be related to user actions (manual commands, SSH sessions, service manipulation), **spawn the `auditor` agent** to investigate.
+For any analysis of user activity, **always spawn the `auditor` agent**. Do not query audit logs (EXECVE, USER_LOGIN, etc.) directly - delegate this to the auditor.
 
-Use the auditor when you need to know:
-- What commands were run around the time of an incident
-- Whether a service was manually stopped/restarted
-- Who was logged in and what they did
-- Whether there's suspicious activity on the host
+**Always call the auditor when:**
+- A service stopped unexpectedly (may have been manually stopped)
+- A process was killed or a config was changed
+- You need to know who was logged in around the time of an incident
+- You need to understand what commands led to the current state
+- The cause isn't obvious from service logs alone
+
+**Do NOT try to query audit logs yourself.** The auditor is specialized for:
+- Parsing EXECVE records and reconstructing command lines
+- Correlating SSH sessions with commands executed
+- Identifying suspicious patterns
+- Filtering out systemd/nix-store noise
 
 **Example prompt for auditor:**
 ```
 Investigate user activity on <hostname> between <start_time> and <end_time>.
-Context: The nginx service stopped unexpectedly at 14:32. Check if it was
-manually stopped or if any commands were run around that time.
+Context: The prometheus-node-exporter service stopped at 14:32.
+Determine if it was manually stopped and by whom.
 ```
 
-The auditor will return a focused report on relevant user activity that you can incorporate into your investigation.
+Incorporate the auditor's findings into your timeline and root cause analysis.
 
 ### 5. Check Configuration (if relevant)
 
@@ -133,6 +140,7 @@ get_commit_info(<hash>) # Get full details of a specific change
 ### 7. Consider Common Causes
 
 For infrastructure alerts, common causes include:
+- **Manual intervention**: Service manually stopped/restarted (call auditor to confirm)
 - **Configuration drift**: Host running outdated config, fix already in master
 - **Disk space**: Nix store growth, logs, temp files
 - **Memory pressure**: Service memory leaks, insufficient limits
@@ -141,6 +149,8 @@ For infrastructure alerts, common causes include:
 - **Service restarts**: Failed upgrades, configuration errors
 - **Scrape failures**: Service down, firewall issues, port changes
 
+**Note:** If a service stopped unexpectedly and service logs don't show a crash or error, it was likely manual intervention - call the auditor to investigate.
+
 ## Output Format
 
 Provide a concise report with one of two outcomes:
@@ -198,4 +208,4 @@ Provide a concise report with one of two outcomes:
 - Consider the host's tier (test vs prod) when assessing severity
 - Build a timeline from log timestamps and metrics to show the sequence of events
 - **Query logs incrementally**: start with narrow filters and small limits, expand only if needed
-- **Use the auditor agent** for user activity analysis (commands, SSH sessions, sudo usage)
+- **Always delegate to the auditor agent** for any user activity analysis - never query EXECVE or audit logs directly
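For readers who want the new delegation rule in executable form, here is a minimal Python sketch of the "Always call the auditor when" list and the example prompt from section 4. It is only an illustration: `IncidentSignals`, `must_call_auditor`, and `build_auditor_prompt` are hypothetical names that do not exist in the repository, and the real agent follows the prose instructions rather than any such code.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical summary of what the investigation has found so far."""
    service_stopped_unexpectedly: bool = False
    process_killed_or_config_changed: bool = False
    need_login_history: bool = False
    need_command_history: bool = False
    cause_obvious_from_service_logs: bool = True


def must_call_auditor(s: IncidentSignals) -> bool:
    """Mirror the bullet list: any single trigger is enough to delegate."""
    return (
        s.service_stopped_unexpectedly
        or s.process_killed_or_config_changed
        or s.need_login_history
        or s.need_command_history
        or not s.cause_obvious_from_service_logs
    )


def build_auditor_prompt(hostname: str, start_time: str, end_time: str,
                         context: str) -> str:
    """Fill in the example prompt template shown in the diff."""
    return (
        f"Investigate user activity on {hostname} "
        f"between {start_time} and {end_time}.\n"
        f"Context: {context}"
    )


if __name__ == "__main__":
    # Assumed example values; the hostname and times are placeholders.
    signals = IncidentSignals(service_stopped_unexpectedly=True)
    if must_call_auditor(signals):
        print(build_auditor_prompt(
            "<hostname>", "<start_time>", "<end_time>",
            "The prometheus-node-exporter service stopped at 14:32. "
            "Determine if it was manually stopped and by whom.",
        ))
```

The sketch deliberately has no path that reads EXECVE or USER_LOGIN records itself: its only output is a prompt for the auditor, which matches the intent of the change.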