nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	3f1d966919	claude: improve investigate-alarm log query guidelines [Some checks failed: Run nix flake check / flake-check (push), failing after 1s] Add best practices for querying Loki to avoid overwhelming responses: - Start with narrow filters and small limits - Filter audit logs to EXECVE only - Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF) - Expand queries incrementally if needed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:14:54 +01:00
Torjus Håkestad	70ec5f8109	claude: add investigate-alarm agent Sub-agent for investigating system alarms using Prometheus metrics and Loki logs. Provides root cause analysis with timeline of events. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:07:03 +01:00
Torjus Håkestad	c2ec34cab9	docs: consolidate monitoring docs into observability skill [Some checks failed: Run nix flake check / flake-check (push), failing after 1s] - Move detailed Prometheus/Loki reference from CLAUDE.md to the observability skill - Add complete list of Prometheus jobs organized by category - Add bootstrap log documentation with stages table - Add kanidm01 to host labels table - CLAUDE.md now references the skill instead of duplicating info Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 02:15:02 +01:00
Torjus Håkestad	b794aa89db	skills: update observability with new target labels [Some checks failed: Run nix flake check / flake-check (push), failing after 1s] Document the new hostname and host metadata labels available on all Prometheus scrape targets: - hostname: short hostname for easy filtering - role: host role (dns, build-host, vault) - tier: deployment tier (test for test VMs) - dns_role: primary/secondary for DNS servers Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:12:17 +01:00
Torjus Håkestad	b9a269d280	chore: rename metrics skill to observability, add logs reference [All checks were successful: Run nix flake check / flake-check (push), successful in 2m4s] Merge Prometheus metrics and Loki logs into a unified troubleshooting skill. Adds LogQL query patterns, label reference, and common service units for log searching. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 01:17:41 +01:00
Torjus Håkestad	fcf1a66103	chore: add metrics troubleshooting skill [Some checks failed: Run nix flake check / flake-check (push), has been cancelled] Reference guide for exploring Prometheus metrics when troubleshooting homelab issues, including the new nixos_flake_info metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 01:11:40 +01:00
Torjus Håkestad	f2c30cc24f	chore: give claude the quick-plan skill [Some checks failed: Run nix flake check / flake-check (push), failing after 13m57s]	2026-02-06 21:58:30 +01:00

7 Commits