nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	1e7200b494	quick-plan: add mermaid diagram guideline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 16:35:53 +01:00
Torjus Håkestad	4f593126c0	monitoring01: remove host and migrate services to monitoring02 Remove monitoring01 host configuration and unused service modules (prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox, exportarr, and pve exporters to monitoring02 with scrape configs moved to VictoriaMetrics. Update alert rules, terraform vault policies/secrets, http-proxy entries, and documentation to reflect the monitoring02 migration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:50:20 +01:00
Torjus Håkestad	d485948df0	docs: update Loki queries from host to hostname label Update all LogQL examples, agent instructions, and scripts to use the hostname label instead of host, matching the Prometheus label naming convention. Also update pipe-to-loki and bootstrap scripts to push hostname instead of host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:43:47 +01:00
Torjus Håkestad	11cbb64097	claude: make auditor delegation explicit in investigate-alarm - Changed section 4 from "if needed" to always spawn auditor - Added explicit "Do NOT query audit logs yourself" guidance - Listed specific scenarios requiring auditor (service stopped, etc.) - Added manual intervention as first common cause - Updated guidelines to emphasize mandatory delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:11:09 +01:00
Torjus Håkestad	e2dd21c994	claude: add auditor agent and git-explorer MCP Add new auditor agent for security-focused audit log analysis: - SSH session tracking, command execution, sudo usage - Suspicious activity detection patterns - Can be used standalone or as sub-agent by investigate-alarm Update investigate-alarm to delegate audit analysis to auditor and add git-explorer MCP for configuration drift detection. Add git-explorer to .mcp.json for repository inspection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 04:48:55 +01:00
Torjus Håkestad	3f1d966919	claude: improve investigate-alarm log query guidelines Add best practices for querying Loki to avoid overwhelming responses: - Start with narrow filters and small limits - Filter audit logs to EXECVE only - Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF) - Expand queries incrementally if needed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:14:54 +01:00
Torjus Håkestad	70ec5f8109	claude: add investigate-alarm agent Sub-agent for investigating system alarms using Prometheus metrics and Loki logs. Provides root cause analysis with timeline of events. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:07:03 +01:00
Torjus Håkestad	c2ec34cab9	docs: consolidate monitoring docs into observability skill - Move detailed Prometheus/Loki reference from CLAUDE.md to the observability skill - Add complete list of Prometheus jobs organized by category - Add bootstrap log documentation with stages table - Add kanidm01 to host labels table - CLAUDE.md now references the skill instead of duplicating info Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 02:15:02 +01:00
Torjus Håkestad	b794aa89db	skills: update observability with new target labels Document the new hostname and host metadata labels available on all Prometheus scrape targets: - hostname: short hostname for easy filtering - role: host role (dns, build-host, vault) - tier: deployment tier (test for test VMs) - dns_role: primary/secondary for DNS servers Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:12:17 +01:00
Torjus Håkestad	b9a269d280	chore: rename metrics skill to observability, add logs reference Merge Prometheus metrics and Loki logs into a unified troubleshooting skill. Adds LogQL query patterns, label reference, and common service units for log searching. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 01:17:41 +01:00
Torjus Håkestad	fcf1a66103	chore: add metrics troubleshooting skill Reference guide for exploring Prometheus metrics when troubleshooting homelab issues, including the new nixos_flake_info metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 01:11:40 +01:00
Torjus Håkestad	f2c30cc24f	chore: give claude the quick-plan skill	2026-02-06 21:58:30 +01:00

12 Commits