nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	83fce5f927	nix-cache: switch DNS to nix-cache02 Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details - Move nix-cache CNAME from nix-cache01 to nix-cache02 - Remove actions1 CNAME (service removed) - Update proxy.nix to serve canonical domain on nix-cache02 - Promote nix-cache02 to prod tier with build-host role Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:22:23 +01:00
Torjus Håkestad	afff3f28ca	docs: update nix-cache-reprovision plan with Harmonia progress Some checks failed Run nix flake check / flake-check (push) Failing after 52s Details - Phase 4 now in progress - Harmonia configured on nix-cache02 with new signing key - Trusted public key deployed to all hosts - Cache tested successfully from testvm01 - Actions runner removed from scope Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:17:51 +01:00
Torjus Håkestad	49f7e3ae2e	nix-cache: use hostname-based domain for Caddy proxy All checks were successful Run nix flake check / flake-check (push) Successful in 2m18s Details nix-cache01 serves nix-cache.home.2rjus.net (canonical) nix-cache02 serves nix-cache02.home.2rjus.net (for testing) This allows testing nix-cache02 independently before DNS cutover. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:14:14 +01:00
Torjus Håkestad	751edfc11d	nix-cache02: add Harmonia binary cache service Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details - Parameterize harmonia.nix to use hostname-based Vault paths - Add nix-cache services to nix-cache02 - Add Vault secret and variable for nix-cache02 signing key - Add nix-cache02 public key to trusted-public-keys on all hosts - Update plan doc to remove actions runner references Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:08:48 +01:00
Torjus Håkestad	98a7301985	nix-cache: remove unused Gitea Actions runner All checks were successful Run nix flake check / flake-check (push) Successful in 2m23s Details The actions runner on nix-cache01 was never actively used. Removing it before migrating to nix-cache02. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:57:08 +01:00
Torjus Håkestad	34efa58cfe	Merge pull request 'nix-cache02-builder' (#39) from nix-cache02-builder into master All checks were successful Run nix flake check / flake-check (push) Successful in 2m27s Details Reviewed-on: #39	2026-02-10 21:47:58 +00:00
Torjus Håkestad	5bfb51a497	docs: add observability phase to nix-cache plan Some checks failed Run nix flake check / flake-check (push) Successful in 2m35s Details Run nix flake check / flake-check (pull_request) Failing after 16m1s Details - Add Phase 6 for alerting and Grafana dashboards - Document available Prometheus metrics - Include example alerting rules for build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:46:38 +01:00
Torjus Håkestad	f83145d97a	docs: update nix-cache-reprovision plan with progress Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details - Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete - Document nix-cache02 configuration and tested build times - Add remaining work for Harmonia, Actions runner, and DNS cutover - Enable --enable-builds flag in MCP config Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:43:48 +01:00
Torjus Håkestad	47747329c4	nix-cache02: add homelab-deploy builder service Some checks failed Run nix flake check / flake-check (push) Failing after 4m51s Details - Configure builder to build nixos-servers and nixos (gunter) repos - Add builder NKey to Vault secrets - Update NATS permissions for builder, test-deployer, and admin-deployer - Grant nix-cache02 access to shared homelab-deploy secrets Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:26:40 +01:00
Torjus Håkestad	2d9ca2a73f	hosts: add nix-cache02 build host Some checks failed Run nix flake check / flake-check (push) Failing after 16m26s Details New build host to replace nix-cache01 with: - 8 CPU cores, 16GB RAM, 200GB disk - Static IP 10.69.13.25 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 21:53:29 +01:00
Torjus Håkestad	98ea679ef2	docs: add monitoring02 reboot alert investigation Some checks failed Run nix flake check / flake-check (push) Failing after 13m41s Details Document findings from false positive host_reboot alert caused by NTP clock adjustment affecting node_boot_time_seconds metric. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 17:59:53 +01:00
Torjus Håkestad	b709c0b703	monitoring: disable radarr exporter (version mismatch) Some checks failed Run nix flake check / flake-check (push) Failing after 15m20s Details Periodic flake update / flake-update (push) Successful in 2m23s Details Radarr on TrueNAS jail is too old - exportarr fails on /api/v3/wanted/cutoff endpoint (404). Keep sonarr which works. Vault secret kept for when Radarr is updated. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:59:45 +01:00
Torjus Håkestad	33c5d5b3f0	monitoring: add exportarr for radarr/sonarr metrics All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Add prometheus exportarr exporters for Radarr and Sonarr media services. Runs on monitoring01, queries remote APIs. - Radarr exporter on port 9708 - Sonarr exporter on port 9709 - API keys fetched from Vault Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:56:03 +01:00
Torjus Håkestad	0a28c5f495	terraform: add radarr/sonarr API keys for exportarr Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Add vault secrets for Radarr and Sonarr API keys to enable exportarr metrics collection on monitoring01. - services/exportarr/radarr - Radarr API key - services/exportarr/sonarr - Sonarr API key - Grant monitoring01 access to services/exportarr/* Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:52:34 +01:00
Torjus Håkestad	9bd48e0808	monitoring: explicitly list valid HTTP status codes All checks were successful Run nix flake check / flake-check (push) Successful in 2m6s Details Empty valid_status_codes defaults to 2xx only, not "any". Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so services returning 400/401 like ha and nzbget pass the probe. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:41:47 +01:00
Torjus Håkestad	1460eea700	grafana: fix probe status table join All checks were successful Run nix flake check / flake-check (push) Successful in 2m9s Details Use joinByField transformation instead of merge to properly align rows by instance. Also exclude duplicate Time/job columns from join. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:38:02 +01:00
Torjus Håkestad	98c4f54f94	grafana: add TLS certificates dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard includes: - Stat panels for endpoints monitored, probe failures, expiring certs - Gauge showing minimum days until any cert expires - Table of all endpoints sorted by expiry (color-coded) - Probe status table with HTTP status and duration - Time series graphs for expiry trends and probe success rate Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:35:44 +01:00
Torjus Håkestad	d1b0a5dc20	monitoring: accept any HTTP status in TLS probe Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Only care about TLS handshake success for certificate monitoring. Services like nzbget (401) and ha (400) return non-2xx but have valid certificates. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:33:45 +01:00
Torjus Håkestad	4d32707130	monitoring: remove duplicate rules from blackbox.nix All checks were successful Run nix flake check / flake-check (push) Successful in 2m7s Details The rules were already added to rules.yml but the blackbox.nix file still had them, causing duplicate 'groups' key errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:28:42 +01:00
Torjus Håkestad	8e1753c2c8	monitoring: fix blackbox rules and add force-push policy Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Move certificate alert rules to rules.yml instead of adding them as a separate rules string in blackbox.nix. The previous approach caused a YAML parse error due to duplicate 'groups' keys. Also add policy to CLAUDE.md: never force push to master. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:26:05 +01:00
Torjus Håkestad	75e4fb61a5	monitoring: add blackbox exporter for TLS certificate monitoring All checks were successful Run nix flake check / flake-check (push) Successful in 2m6s Details Add blackbox exporter to monitoring01 to probe TLS endpoints and alert on expiring certificates. Monitors all ACME-managed certificates from OpenBao PKI including Caddy auto-TLS services. Alerts: - tls_certificate_expiring_soon (< 7 days, warning) - tls_certificate_expiring_critical (< 24h, critical) - tls_probe_failed (connectivity issues) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:21:42 +01:00
Torjus Håkestad	2be213e454	terraform: update default template to nixos-25.11.20260207 Some checks failed Run nix flake check / flake-check (push) Failing after 12m13s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 21:57:17 +01:00
Torjus Håkestad	12c252653b	ansible: add reboot playbook and short hostname support - Add reboot.yml playbook with rolling reboot (serial: 1) - Uses systemd reboot.target for NixOS compatibility - Waits for each host to come back before proceeding - Update dynamic inventory to use short hostnames - ansible_host set to FQDN for connections - Allows -l testvm01 instead of -l testvm01.home.2rjus.net - Update static.yml to match short hostname convention Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 21:56:32 +01:00
Torjus Håkestad	6493338c4c	ansible: fix deprecated yaml callback plugin Use result_format=yaml with builtin default callback instead of the removed community.general.yaml plugin. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 21:47:16 +01:00
Torjus Håkestad	6e08ba9720	ansible: restructure with dynamic inventory from flake - Move playbooks/ to ansible/playbooks/ - Add dynamic inventory script that extracts hosts from flake - Groups by tier (tier_test, tier_prod) and role (role_dns, etc.) - Reads homelab.host.* options for metadata - Add static inventory for non-flake hosts (Proxmox) - Add ansible.cfg with inventory path and SSH optimizations - Add group_vars/all.yml for common variables - Add restart-service.yml playbook for restarting systemd services - Update provision-approle.yml with single-host safeguard - Add ANSIBLE_CONFIG to devshell for automatic inventory discovery - Add ansible = "false" label to template2 to exclude from inventory - Update CLAUDE.md to reference ansible/README.md for details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 21:41:29 +01:00
Torjus Håkestad	7ff3d2a09b	docs: move openbao-kanidm-oidc plan to completed All checks were successful Run nix flake check / flake-check (push) Successful in 2m7s Details	2026-02-09 19:44:06 +01:00
Torjus Håkestad	e85f15b73d	vault: add OpenBao OIDC integration with Kanidm All checks were successful Run nix flake check / flake-check (push) Successful in 2m9s Details Enable Kanidm users to authenticate to OpenBao via OIDC for Web UI access. Members of the admins group get full read/write access to secrets. Changes: - Add OIDC auth backend in Terraform (oidc.tf) - Add oidc-admin and oidc-default policies - Add openbao OAuth2 client to Kanidm - Enable legacy crypto (RS256) for OpenBao compatibility - Allow imperative group membership management in Kanidm Limitations: - CLI login not supported (Kanidm requires HTTPS for confidential client redirects) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 19:42:26 +01:00
Torjus Håkestad	2f5a2a4bf1	grafana: use instant queries for fleet dashboard stat panels All checks were successful Run nix flake check / flake-check (push) Successful in 2m6s Details Prevents stat panels from being affected by dashboard time range selection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 19:00:33 +01:00
Torjus Håkestad	287141c623	hosts: add role metadata to all hosts Some checks failed Run nix flake check / flake-check (push) Failing after 13m51s Details Assign roles to hosts for better organization and filtering: - ha1: home-automation - monitoring01, monitoring02: monitoring - jelly01: media - nats1: messaging - http-proxy: proxy - testvm01-03: test Also promote kanidm01 and monitoring02 from test to prod tier. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 16:21:08 +01:00
Torjus Håkestad	9ed11b712f	home-assistant: fix Jinja2 battery template syntax All checks were successful Run nix flake check / flake-check (push) Successful in 2m13s Details The template used | min(100) | max(0) which is invalid Jinja2 syntax. These filters expect iterables (lists), not scalar arguments. This caused TypeError warnings on every MQTT message and left battery sensors unavailable. Fixed by using proper list-based min/max: [[[value, 100] | min, 0] | max Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 16:12:59 +01:00
Torjus Håkestad	ffad2dd205	monitoring: increase zigbee_sensor_stale threshold to 4 hours The 2-hour threshold was too aggressive for temperature sensors in stable environments. Historical data shows gaps up to 2.75 hours when temperature hasn't changed (Home Assistant only updates last_updated when values change). Increasing to 4 hours avoids false positives while still catching genuine failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 16:10:54 +01:00
Torjus Håkestad	ed7d2aa727	grafana: add deployment metrics to nixos-fleet dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 15:58:28 +01:00
Torjus Håkestad	bf7a025364	flake: update homelab-deploy input Some checks failed Run nix flake check / flake-check (push) Failing after 3m49s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 15:45:30 +01:00
torjus-bot	4ae99dbc89	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:nixos/nixpkgs/e576e3c9cf9bad747afcddd9e34f51d18c855b4e?narHash=sha256-tlFqNG/uzz2%2B%2BaAmn4v8J0vAkV3z7XngeIIB3rM3650%3D' (2026-02-03) → 'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07) • Updated input 'nixpkgs-unstable': 'github:nixos/nixpkgs/00c21e4c93d963c50d4c0c89bfa84ed6e0694df2?narHash=sha256-AYqlWrX09%2BHvGs8zM6ebZ1pwUqjkfpnv8mewYwAo%2BiM%3D' (2026-02-04) → 'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08)	2026-02-09 00:01:58 +00:00
Torjus Håkestad	5c142b1323	flake: update homelab-deploy input Some checks failed Run nix flake check / flake-check (push) Failing after 10m7s Details Periodic flake update / flake-update (push) Successful in 2m51s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 00:42:51 +01:00
Torjus Håkestad	4091e51f41	nixos-exporter: use nkeySeedFile option Some checks failed Run nix flake check / flake-check (push) Failing after 4m26s Details Use the new nkeySeedFile option instead of credentialsFile for NATS authentication. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 00:34:22 +01:00
Torjus Håkestad	a8e558a6b7	flake: update nixos-exporter input Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 00:32:56 +01:00
Torjus Håkestad	4efc798c38	nixos-exporter: fix nkey file permissions All checks were successful Run nix flake check / flake-check (push) Successful in 2m6s Details Set owner/group to nixos-exporter so the service can read the NATS credentials file. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 00:18:10 +01:00
Torjus Håkestad	016f8c9119	terraform: add nixos-exporter shared policy Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details - Create shared policy granting all hosts access to nixos-exporter nkey - Add policy to both manual and generated host AppRoles - Remove duplicate kanidm01/monitoring02 entries from hosts-generated.tf Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 00:04:17 +01:00
Torjus Håkestad	fec2a261ab	Merge pull request 'nixos-exporter: enable NATS cache sharing' (#38) from nixos-exporter-nats-cache into master All checks were successful Run nix flake check / flake-check (push) Successful in 2m18s Details Reviewed-on: #38	2026-02-08 22:58:24 +00:00
Torjus Håkestad	60c04a2052	nixos-exporter: enable NATS cache sharing Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m17s Details Run nix flake check / flake-check (push) Failing after 5m16s Details When one host fetches the latest flake revision, it publishes to NATS and all other hosts receive the update immediately. This reduces redundant nix flake metadata calls across the fleet. - Add nkeys to devshell for key generation - Add nixos-exporter user to NATS HOMELAB account - Add Vault secret for NKey storage - Configure all hosts to use NATS for revision sharing - Update nixos-exporter input to version with NATS support Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 23:57:28 +01:00
Torjus Håkestad	39e3f37263	flake: update homelab-deploy input Some checks failed Run nix flake check / flake-check (push) Failing after 15m17s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 22:49:44 +01:00
Torjus Håkestad	a2d93baba8	Merge pull request 'grafana: add NixOS operations dashboard' (#37) from grafana-nixos-operations-dashboard into master All checks were successful Run nix flake check / flake-check (push) Successful in 3m54s Details Reviewed-on: #37	2026-02-08 21:04:19 +00:00
Torjus Håkestad	f66dfc753c	grafana: add NixOS operations dashboard All checks were successful Run nix flake check / flake-check (push) Successful in 3m24s Details Run nix flake check / flake-check (pull_request) Successful in 4m5s Details Loki-based dashboard for tracking NixOS operations including: - Upgrade activity and success/failure stats - Build activity during upgrades - Bootstrap logs for new VM deployments - ACME certificate renewal activity Log panels use LogQL json parsing with | keep host to show clean messages with host labels. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 22:03:28 +01:00
Torjus Håkestad	79a6a72719	Merge pull request 'grafana-dashboards-permissions' (#36) from grafana-dashboards-permissions into master All checks were successful Run nix flake check / flake-check (push) Successful in 2m4s Details Reviewed-on: #36	2026-02-08 20:18:22 +00:00
Torjus Håkestad	89d0a6f358	grafana: add systemd services dashboard Some checks failed Run nix flake check / flake-check (push) Failing after 8m30s Details Run nix flake check / flake-check (pull_request) Failing after 16m49s Details Dashboard for monitoring systemd across the fleet: - Summary stats: failed/active/inactive units, restarts, timers - Failed units table (shows any units in failed state) - Service restarts table (top 15 services by restart count) - Active units per host bar chart - NixOS upgrade timer table with last trigger time - Backup timers table (restic jobs) - Service restarts over time chart - Hostname filter to focus on specific hosts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:06:59 +01:00
Torjus Håkestad	03ebee4d82	grafana: fix proxmox table __name__ column All checks were successful Run nix flake check / flake-check (push) Successful in 2m9s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:04:41 +01:00
Torjus Håkestad	05630eb4d4	grafana: add Proxmox dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard for monitoring Proxmox VMs: - Summary stats: VMs running/stopped, node CPU/memory, uptime - VM status table with name, status, CPU%, memory%, uptime - VM CPU usage over time - VM memory usage over time - Network traffic (RX/TX) per VM - Disk I/O (read/write) per VM - Storage usage gauges and capacity table - VM filter to focus on specific VMs Filters out template VMs, shows only actual guests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:02:28 +01:00
Torjus Håkestad	1e52eec02a	monitoring: always include tier label in scrape configs All checks were successful Run nix flake check / flake-check (push) Successful in 2m8s Details Previously tier was only included if non-default (not "prod"), which meant prod hosts had no tier label. This made the Grafana tier filter only show "test" since "prod" never appeared in label_values(). Now tier is always included, so both "prod" and "test" appear in the fleet dashboard tier selector. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:58:52 +01:00
Torjus Håkestad	d333aa0164	grafana: fix fleet table __name__ columns All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Exclude the __name__ columns that were leaking through the table transformations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:52:39 +01:00


958 Commits