- Changed section 4 to always spawn the auditor instead of only "if needed"
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add new auditor agent for security-focused audit log analysis:
- SSH session tracking, command execution, sudo usage
- Suspicious activity detection patterns
- Can be used standalone or as a sub-agent of investigate-alarm
Update investigate-alarm to delegate audit analysis to auditor
and add git-explorer MCP for configuration drift detection.
Add git-explorer to .mcp.json for repository inspection.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add best practices for querying Loki to avoid overwhelming responses:
- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed
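
A minimal sketch of the practices above, not the skill's actual
implementation. The helper names, the example selector, and the limit
values are hypothetical; matching on the bare record-type name assumes
the audit lines contain those tokens as plain text.

```python
# Hypothetical helpers illustrating narrow-first Loki querying.
NOISY_TYPES = ("PATH", "PROCTITLE", "SYSCALL", "BPF")


def build_audit_query(selector, execve_only=True, exclude=NOISY_TYPES):
    """Return a LogQL query string for audit logs.

    selector    -- stream selector, e.g. '{job="x"}' (caller-supplied)
    execve_only -- keep only lines containing EXECVE when True;
                   otherwise drop the noisy record types in `exclude`
    """
    if execve_only:
        filters = ['|= "EXECVE"']
    else:
        filters = [f'!= "{record_type}"' for record_type in exclude]
    return " ".join([selector, *filters])


def expanding_limits(start=20, factor=5, cap=500):
    """Yield result limits from small to large, ending at `cap`."""
    limit = start
    while limit < cap:
        yield limit
        limit *= factor
    yield cap
```

The intended usage is to run the narrowest query with the first limit
and move to the next limit only if the answer is not yet visible.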
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sub-agent for investigating system alarms using Prometheus metrics
and Loki logs. Provides root cause analysis with timeline of events.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move detailed Prometheus/Loki reference from CLAUDE.md to the
observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the new hostname and host metadata labels available on all
Prometheus scrape targets:
- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers
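
As a rough illustration of filtering on these labels, the sketch below
builds a PromQL selector string. The helper name is hypothetical, and
the label values in the test are examples taken from the list above.

```python
# Hypothetical helper: build a PromQL instant-vector selector.
def promql_selector(metric, **labels):
    """Return 'metric{k="v",...}' with label matchers sorted by name."""
    matchers = ",".join(
        f'{name}="{value}"' for name, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}"
```

For example, promql_selector("up", role="dns", dns_role="primary")
selects the `up` series for primary DNS hosts.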
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge Prometheus metrics and Loki logs into a unified troubleshooting
skill. Adds LogQL query patterns, label reference, and common service
units for log searching.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>