nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	c64d299daf	grafana: extract MESSAGE field in log panels All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Use LogQL json parser and line_format to show only the MESSAGE field instead of the full JSON blob in log panels. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:50:47 +01:00
Torjus Håkestad	ef026d9bfc	grafana: add NixOS operations dashboard All checks were successful Run nix flake check / flake-check (push) Successful in 2m6s Details Loki-based dashboard for tracking NixOS operations including: - Upgrade activity and success/failure stats - Build activity during upgrades - Bootstrap logs for new VM deployments - ACME certificate renewal activity Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:32:34 +01:00
Torjus Håkestad	79a6a72719	Merge pull request 'grafana-dashboards-permissions' (#36) from grafana-dashboards-permissions into master All checks were successful Run nix flake check / flake-check (push) Successful in 2m4s Details Reviewed-on: #36	2026-02-08 20:18:22 +00:00
Torjus Håkestad	89d0a6f358	grafana: add systemd services dashboard Some checks failed Run nix flake check / flake-check (push) Failing after 8m30s Details Run nix flake check / flake-check (pull_request) Failing after 16m49s Details Dashboard for monitoring systemd across the fleet: - Summary stats: failed/active/inactive units, restarts, timers - Failed units table (shows any units in failed state) - Service restarts table (top 15 services by restart count) - Active units per host bar chart - NixOS upgrade timer table with last trigger time - Backup timers table (restic jobs) - Service restarts over time chart - Hostname filter to focus on specific hosts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:06:59 +01:00
Torjus Håkestad	03ebee4d82	grafana: fix proxmox table __name__ column All checks were successful Run nix flake check / flake-check (push) Successful in 2m9s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:04:41 +01:00
Torjus Håkestad	05630eb4d4	grafana: add Proxmox dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard for monitoring Proxmox VMs: - Summary stats: VMs running/stopped, node CPU/memory, uptime - VM status table with name, status, CPU%, memory%, uptime - VM CPU usage over time - VM memory usage over time - Network traffic (RX/TX) per VM - Disk I/O (read/write) per VM - Storage usage gauges and capacity table - VM filter to focus on specific VMs Filters out template VMs, shows only actual guests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:02:28 +01:00
Torjus Håkestad	1e52eec02a	monitoring: always include tier label in scrape configs All checks were successful Run nix flake check / flake-check (push) Successful in 2m8s Details Previously tier was only included if non-default (not "prod"), which meant prod hosts had no tier label. This made the Grafana tier filter only show "test" since "prod" never appeared in label_values(). Now tier is always included, so both "prod" and "test" appear in the fleet dashboard tier selector. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:58:52 +01:00
Torjus Håkestad	d333aa0164	grafana: fix fleet table __name__ columns All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Exclude the __name__ columns that were leaking through the table transformations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:52:39 +01:00
Torjus Håkestad	a5d5827dcc	grafana: add NixOS fleet dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard for monitoring NixOS deployments across the homelab: - Hosts behind remote / needing reboot stat panels - Fleet status table with revision, behind status, reboot needed, age - Generation age bar chart (shows stale configs) - Generations per host bar chart - Deployment activity time series (see when hosts were updated) - Flake input ages table - Pie charts for hosts by revision and tier - Tier filter variable Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:50:08 +01:00
Torjus Håkestad	1c13ec12a4	grafana: add temperature dashboard All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Dashboard includes: - Current temperatures per room (stat panel) - Average home temperature (gauge) - Current humidity (stat panel) - 30-day temperature history with mean/min/max in legend - Temperature trend (rate of change per hour) - 24h min/max/avg table per room - 30-day humidity history Filters out device_temperature (internal sensor) metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:45:52 +01:00
Torjus Håkestad	4bf0eeeadb	grafana: add dashboards and fix permissions All checks were successful Run nix flake check / flake-check (push) Successful in 2m3s Details - Change default OIDC role from Viewer to Editor for Explore access - Add declarative dashboard provisioning - Add node-exporter dashboard (CPU, memory, disk, load, network, I/O) - Add Loki logs dashboard with host/job filters Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:39:21 +01:00
Torjus Håkestad	304cb117ce	Merge pull request 'grafana-kanidm-oidc' (#35) from grafana-kanidm-oidc into master All checks were successful Run nix flake check / flake-check (push) Successful in 2m7s Details Reviewed-on: #35	2026-02-08 19:30:20 +00:00
Torjus Håkestad	02270a0e4a	docs: update plans with Grafana OIDC progress Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m7s Details Run nix flake check / flake-check (push) Failing after 16m31s Details - auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed, document key findings (PKCE, attribute paths, user requirements) - monitoring-migration-victoriametrics.md: Note Grafana deployment on monitoring02 with Kanidm OIDC as test instance Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:28:10 +01:00
Torjus Håkestad	030e8518c5	grafana: add Grafana on monitoring02 with Kanidm OIDC Some checks failed Run nix flake check / flake-check (push) Failing after 4m3s Details Deploy Grafana test instance on monitoring02 with: - Kanidm OIDC authentication (admins -> Admin role, others -> Viewer) - PKCE enabled for secure OAuth2 flow (required by Kanidm) - Declarative datasources for Prometheus and Loki on monitoring01 - Local Caddy for TLS termination via internal ACME CA - DNS CNAME grafana-test.home.2rjus.net Terraform changes add OAuth2 client secret and AppRole policies for kanidm01 and monitoring02. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:23:26 +01:00
Torjus Håkestad	9ffdd4f862	terraform: increase monitoring02 disk to 60G Some checks failed Run nix flake check / flake-check (push) Failing after 11m8s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 19:23:40 +01:00
Torjus Håkestad	0b977808ca	hosts: add monitoring02 configuration Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details New test-tier host for monitoring stack expansion with: - Static IP 10.69.13.24 - 4 CPU cores, 4GB RAM, 20GB disk - Vault integration and NATS-based deployment enabled Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 19:19:38 +01:00
Torjus Håkestad	8786113f8f	docs: add OpenBao + Kanidm OIDC integration plan Some checks failed Run nix flake check / flake-check (push) Failing after 3m10s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:45:44 +01:00
Torjus Håkestad	fdb2c31f84	docs: add pipe-to-loki documentation to CLAUDE.md All checks were successful Run nix flake check / flake-check (push) Successful in 2m1s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:34:01 +01:00
Torjus Håkestad	78eb04205f	system: add pipe-to-loki helper script Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Adds a system-wide script for sending command output or interactive sessions to Loki for easy sharing with Claude. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:30:53 +01:00
Torjus Håkestad	19cb61ebbc	Merge pull request 'kanidm-pam-client' (#34) from kanidm-pam-client into master All checks were successful Run nix flake check / flake-check (push) Successful in 3m19s Details Reviewed-on: #34	2026-02-08 14:14:53 +00:00
Torjus Håkestad	9ed09c9a9c	docs: add user-management documentation All checks were successful Run nix flake check / flake-check (pull_request) Successful in 3m33s Details Run nix flake check / flake-check (push) Successful in 2m0s Details - CLI workflows for creating users and groups - Troubleshooting guide (nscd, cache invalidation) - Home directory behavior (UUID-based with symlinks) - Update auth-system-replacement plan with progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:21 +01:00
Torjus Håkestad	b31c64f1b9	kanidm: remove declarative user provisioning Keep base groups (admins, users, ssh-users) provisioned declaratively but manage regular users via the kanidm CLI. This allows setting POSIX attributes and passwords in a single workflow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:03 +01:00
Torjus Håkestad	54b6e37420	flake: add kanidm to devshell Add kanidm_1_8 CLI for administering the Kanidm server. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	b845a8bb8b	system: add kanidm PAM/NSS client module Add homelab.kanidm.enable option for central authentication via Kanidm. The module configures: - PAM/NSS integration with kanidm-unixd - Client connection to auth.home.2rjus.net - Login authorization for ssh-users group Enable on testvm01-03 for testing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	bfbf0cea68	template2: enable zram for bootstrap Some checks failed Run nix flake check / flake-check (push) Failing after 3m34s Details Prevents OOM during initial nixos-rebuild on 2GB VMs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:34:08 +01:00
Torjus Håkestad	3abe5e83a7	docs: add memory ballooning as fallback option All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:29:42 +01:00
Torjus Håkestad	67c27555f3	docs: add memory issues follow-up plan All checks were successful Run nix flake check / flake-check (push) Successful in 2m2s Details Track zram change effectiveness for OOM prevention during upgrades. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:26:31 +01:00
Torjus Håkestad	1674b6a844	system: enable zram swap for all hosts Some checks failed Run nix flake check / flake-check (push) Failing after 12m6s Details Provides compressed swap in RAM to prevent OOM kills during nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram configs from jelly01 and nix-cache01. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:02:58 +01:00
Torjus Håkestad	311be282b6	docs: add security hardening plan Some checks failed Run nix flake check / flake-check (push) Failing after 2s Details Based on security review findings, covering SSH hardening, firewall enablement, log transport TLS, security alerting, and secrets management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:26:15 +01:00
Torjus Håkestad	11cbb64097	claude: make auditor delegation explicit in investigate-alarm Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details - Changed section 4 from "if needed" to always spawn auditor - Added explicit "Do NOT query audit logs yourself" guidance - Listed specific scenarios requiring auditor (service stopped, etc.) - Added manual intervention as first common cause - Updated guidelines to emphasize mandatory delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:11:09 +01:00
Torjus Håkestad	e2dd21c994	claude: add auditor agent and git-explorer MCP Add new auditor agent for security-focused audit log analysis: - SSH session tracking, command execution, sudo usage - Suspicious activity detection patterns - Can be used standalone or as sub-agent by investigate-alarm Update investigate-alarm to delegate audit analysis to auditor and add git-explorer MCP for configuration drift detection. Add git-explorer to .mcp.json for repository inspection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 04:48:55 +01:00
Torjus Håkestad	463342133e	kanidm: remove non-functional metrics scrape target All checks were successful Run nix flake check / flake-check (push) Successful in 1m56s Details Kanidm does not expose a Prometheus /metrics endpoint. The scrape target was causing 404 errors after the TLS certificate issue was fixed. Also add SSH command restriction to CLAUDE.md. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:34:12 +01:00
Torjus Håkestad	de36b9d016	kanidm: add hostname SAN to ACME certificate Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Include both auth.home.2rjus.net (CNAME) and kanidm01.home.2rjus.net (A record) as SANs in the TLS certificate. This fixes Prometheus scraping which connects via the hostname, not the CNAME. Fixes: x509: certificate is valid for auth.home.2rjus.net, not kanidm01.home.2rjus.net Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:29:54 +01:00
Torjus Håkestad	3f1d966919	claude: improve investigate-alarm log query guidelines Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Add best practices for querying Loki to avoid overwhelming responses: - Start with narrow filters and small limits - Filter audit logs to EXECVE only - Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF) - Expand queries incrementally if needed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:14:54 +01:00
Torjus Håkestad	7fcc043a4d	testvm: add SSH session command auditing Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Enable Linux audit to log execve syscalls from interactive SSH sessions. Uses auid filter to exclude system services and nix builds. Logs forwarded to journald for Loki ingestion. Query with: {host="testvmXX"} |= "EXECVE" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:07:10 +01:00
Torjus Håkestad	70ec5f8109	claude: add investigate-alarm agent Sub-agent for investigating system alarms using Prometheus metrics and Loki logs. Provides root cause analysis with timeline of events. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:07:03 +01:00
Torjus Håkestad	c2ec34cab9	docs: consolidate monitoring docs into observability skill Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details - Move detailed Prometheus/Loki reference from CLAUDE.md to the observability skill - Add complete list of Prometheus jobs organized by category - Add bootstrap log documentation with stages table - Add kanidm01 to host labels table - CLAUDE.md now references the skill instead of duplicating info Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 02:15:02 +01:00
Torjus Håkestad	8fbf1224fa	docs: add host creation pipeline documentation Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Document the end-to-end host creation workflow including: - Prerequisites and step-by-step process - Tier specification (test vs prod) - Bootstrap observability via Loki - Verification steps - Troubleshooting guide - Related files reference Update CLAUDE.md to reference the new document. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 02:05:21 +01:00
Torjus Håkestad	8959829f77	docs: add monitoring migration to VictoriaMetrics plan Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 01:11:07 +01:00
Torjus Håkestad	93dbb45802	docs: update auth-system-replacement plan with progress Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Periodic flake update / flake-update (push) Failing after 5s Details - Mark completed implementation steps - Document deployed kanidm01 configuration - Record UID/GID range decision (65,536-69,999) - Add verified working items (WebUI, LDAP, certs) - Update next steps and resolved questions Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:50:36 +01:00
Torjus Håkestad	538c2ad097	kanidm: fix secret file permissions for provisioning Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Set owner/group to kanidm so the post-start provisioning script can read the idm_admin password. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:24:41 +01:00
Torjus Håkestad	d99c82c74c	kanidm: fix service ordering for vault secret Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Ensure vault-secret-kanidm-idm-admin runs before kanidm.service by adding services dependency. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:21:11 +01:00
Torjus Håkestad	ca0e3fd629	kanidm01: add kanidm authentication server Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details - New test-tier VM at 10.69.13.23 with role=auth - Kanidm 1.8 server with HTTPS (443) and LDAPS (636) - ACME certificate from internal CA (auth.home.2rjus.net) - Provisioned groups: admins, users, ssh-users - Provisioned user: torjus - Daily backups at 22:00 (7 versions) - Prometheus monitoring scrape target Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:13:59 +01:00
Torjus Håkestad	732e9b8c22	docs: move bootstrap-cache plan to completed Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:41:05 +01:00
Torjus Håkestad	3a14ffd6b5	template2: add nix cache configuration Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details New VMs bootstrapped from template2 will now use the local nix cache during initial nixos-rebuild, speeding up bootstrap times. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:40:53 +01:00
Torjus Håkestad	f9a3961457	docs: move ns1-recreation plan to completed Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:35:04 +01:00
Torjus Håkestad	003d4ccf03	docs: mark ns1 migration to OpenTofu as complete Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:34:44 +01:00
Torjus Håkestad	735b8a9ee3	terraform: add dns and homelab-deploy secrets to ns1 policy Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details ns1 needs access to shared/dns/* for zone transfer key and shared/homelab-deploy/* for the NATS listener. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:33:36 +01:00
Torjus Håkestad	94feae82a0	ns1: recreate with OpenTofu workflow Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs that didn't match actual disk layout, causing boot failure (emergency mode). Recreated using template2-based configuration for OpenTofu provisioning. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:18:08 +01:00
Torjus Håkestad	3f94f7ee95	docs: update pgdb1 decommission progress Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:55:55 +01:00


866 Commits