Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add new auditor agent for security-focused audit log analysis:
- SSH session tracking, command execution, sudo usage
- Suspicious activity detection patterns
- Can be used standalone or as sub-agent by investigate-alarm
Update investigate-alarm to delegate audit analysis to auditor
and add git-explorer MCP for configuration drift detection.
Add git-explorer to .mcp.json for repository inspection.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add best practices for querying Loki to avoid overwhelming responses:
- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sub-agent for investigating system alarms using Prometheus metrics
and Loki logs. Provides root cause analysis with timeline of events.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move detailed Prometheus/Loki reference from CLAUDE.md to the
observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the new hostname and host metadata labels available on all
Prometheus scrape targets:
- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge Prometheus metrics and Loki logs into a unified troubleshooting
skill. Adds LogQL query patterns, label reference, and common service
units for log searching.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>