11 Commits

Author SHA1 Message Date
4f593126c0 monitoring01: remove host and migrate services to monitoring02
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m15s
Run nix flake check / flake-check (pull_request) Failing after 3m8s
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:50:20 +01:00
d485948df0 docs: update Loki queries from host to hostname label
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:43:47 +01:00
11cbb64097 claude: make auditor delegation explicit in investigate-alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:11:09 +01:00
e2dd21c994 claude: add auditor agent and git-explorer MCP
Add new auditor agent for security-focused audit log analysis:
- SSH session tracking, command execution, sudo usage
- Suspicious activity detection patterns
- Can be used standalone or as sub-agent by investigate-alarm

Update investigate-alarm to delegate audit analysis to auditor
and add git-explorer MCP for configuration drift detection.

Add git-explorer to .mcp.json for repository inspection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 04:48:55 +01:00
3f1d966919 claude: improve investigate-alarm log query guidelines
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Add best practices for querying Loki to avoid overwhelming responses:
- Start with narrow filters and small limits
- Filter audit logs to EXECVE only
- Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF)
- Expand queries incrementally if needed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 03:14:54 +01:00
70ec5f8109 claude: add investigate-alarm agent
Sub-agent for investigating system alarms using Prometheus metrics
and Loki logs. Provides root cause analysis with timeline of events.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 03:07:03 +01:00
c2ec34cab9 docs: consolidate monitoring docs into observability skill
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Move detailed Prometheus/Loki reference from CLAUDE.md to the
  observability skill
- Add complete list of Prometheus jobs organized by category
- Add bootstrap log documentation with stages table
- Add kanidm01 to host labels table
- CLAUDE.md now references the skill instead of duplicating info

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 02:15:02 +01:00
b794aa89db skills: update observability with new target labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Document the new hostname and host metadata labels available on all
Prometheus scrape targets:
- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:12:17 +01:00
b9a269d280 chore: rename metrics skill to observability, add logs reference
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Merge Prometheus metrics and Loki logs into a unified troubleshooting
skill. Adds LogQL query patterns, label reference, and common service
units for log searching.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:17:41 +01:00
fcf1a66103 chore: add metrics troubleshooting skill
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:11:40 +01:00
f2c30cc24f chore: give claude the quick-plan skill
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m57s
2026-02-06 21:58:30 +01:00