Commit Graph

986 Commits

Author SHA1 Message Date
12c252653b ansible: add reboot playbook and short hostname support
- Add reboot.yml playbook with rolling reboot (serial: 1)
  - Uses systemd reboot.target for NixOS compatibility
  - Waits for each host to come back before proceeding
- Update dynamic inventory to use short hostnames
  - ansible_host set to FQDN for connections
  - Allows -l testvm01 instead of -l testvm01.home.2rjus.net
- Update static.yml to match short hostname convention

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 21:56:32 +01:00
6493338c4c ansible: fix deprecated yaml callback plugin
Use result_format=yaml with builtin default callback instead of
the removed community.general.yaml plugin.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 21:47:16 +01:00
6e08ba9720 ansible: restructure with dynamic inventory from flake
- Move playbooks/ to ansible/playbooks/
- Add dynamic inventory script that extracts hosts from flake
  - Groups by tier (tier_test, tier_prod) and role (role_dns, etc.)
  - Reads homelab.host.* options for metadata
- Add static inventory for non-flake hosts (Proxmox)
- Add ansible.cfg with inventory path and SSH optimizations
- Add group_vars/all.yml for common variables
- Add restart-service.yml playbook for restarting systemd services
- Update provision-approle.yml with single-host safeguard
- Add ANSIBLE_CONFIG to devshell for automatic inventory discovery
- Add ansible = "false" label to template2 to exclude from inventory
- Update CLAUDE.md to reference ansible/README.md for details

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 21:41:29 +01:00
7ff3d2a09b docs: move openbao-kanidm-oidc plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
2026-02-09 19:44:06 +01:00
e85f15b73d vault: add OpenBao OIDC integration with Kanidm
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Enable Kanidm users to authenticate to OpenBao via OIDC for Web UI access.
Members of the admins group get full read/write access to secrets.

Changes:
- Add OIDC auth backend in Terraform (oidc.tf)
- Add oidc-admin and oidc-default policies
- Add openbao OAuth2 client to Kanidm
- Enable legacy crypto (RS256) for OpenBao compatibility
- Allow imperative group membership management in Kanidm

Limitations:
- CLI login not supported (Kanidm requires HTTPS for confidential client redirects)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 19:42:26 +01:00
2f5a2a4bf1 grafana: use instant queries for fleet dashboard stat panels
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Prevents stat panels from being affected by dashboard time range selection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 19:00:33 +01:00
287141c623 hosts: add role metadata to all hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m51s
Assign roles to hosts for better organization and filtering:
- ha1: home-automation
- monitoring01, monitoring02: monitoring
- jelly01: media
- nats1: messaging
- http-proxy: proxy
- testvm01-03: test

Also promote kanidm01 and monitoring02 from test to prod tier.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 16:21:08 +01:00
9ed11b712f home-assistant: fix Jinja2 battery template syntax
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m13s
The template used | min(100) | max(0) which is invalid Jinja2 syntax.
These filters expect iterables (lists), not scalar arguments. This
caused TypeError warnings on every MQTT message and left battery
sensors unavailable.

Fixed by using proper list-based min/max:
  [[[value, 100] | min, 0] | max

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 16:12:59 +01:00
ffad2dd205 monitoring: increase zigbee_sensor_stale threshold to 4 hours
The 2-hour threshold was too aggressive for temperature sensors in
stable environments. Historical data shows gaps up to 2.75 hours when
temperature hasn't changed (Home Assistant only updates last_updated
when values change). Increasing to 4 hours avoids false positives
while still catching genuine failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 16:10:54 +01:00
ed7d2aa727 grafana: add deployment metrics to nixos-fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 15:58:28 +01:00
bf7a025364 flake: update homelab-deploy input
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m49s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 15:45:30 +01:00
4ae99dbc89 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/e576e3c9cf9bad747afcddd9e34f51d18c855b4e?narHash=sha256-tlFqNG/uzz2%2B%2BaAmn4v8J0vAkV3z7XngeIIB3rM3650%3D' (2026-02-03)
  → 'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/00c21e4c93d963c50d4c0c89bfa84ed6e0694df2?narHash=sha256-AYqlWrX09%2BHvGs8zM6ebZ1pwUqjkfpnv8mewYwAo%2BiM%3D' (2026-02-04)
  → 'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08)
2026-02-09 00:01:58 +00:00
5c142b1323 flake: update homelab-deploy input
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m7s
Periodic flake update / flake-update (push) Successful in 2m51s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:42:51 +01:00
4091e51f41 nixos-exporter: use nkeySeedFile option
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m26s
Use the new nkeySeedFile option instead of credentialsFile for NATS
authentication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:34:22 +01:00
a8e558a6b7 flake: update nixos-exporter input
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:32:56 +01:00
4efc798c38 nixos-exporter: fix nkey file permissions
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Set owner/group to nixos-exporter so the service can read the
NATS credentials file.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:18:10 +01:00
016f8c9119 terraform: add nixos-exporter shared policy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Create shared policy granting all hosts access to nixos-exporter nkey
- Add policy to both manual and generated host AppRoles
- Remove duplicate kanidm01/monitoring02 entries from hosts-generated.tf

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:04:17 +01:00
fec2a261ab Merge pull request 'nixos-exporter: enable NATS cache sharing' (#38) from nixos-exporter-nats-cache into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m18s
Reviewed-on: #38
2026-02-08 22:58:24 +00:00
60c04a2052 nixos-exporter: enable NATS cache sharing
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m17s
Run nix flake check / flake-check (push) Failing after 5m16s
When one host fetches the latest flake revision, it publishes to NATS
and all other hosts receive the update immediately. This reduces
redundant nix flake metadata calls across the fleet.

- Add nkeys to devshell for key generation
- Add nixos-exporter user to NATS HOMELAB account
- Add Vault secret for NKey storage
- Configure all hosts to use NATS for revision sharing
- Update nixos-exporter input to version with NATS support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 23:57:28 +01:00
39e3f37263 flake: update homelab-deploy input
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m17s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 22:49:44 +01:00
a2d93baba8 Merge pull request 'grafana: add NixOS operations dashboard' (#37) from grafana-nixos-operations-dashboard into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m54s
Reviewed-on: #37
2026-02-08 21:04:19 +00:00
f66dfc753c grafana: add NixOS operations dashboard
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m24s
Run nix flake check / flake-check (pull_request) Successful in 4m5s
Loki-based dashboard for tracking NixOS operations including:
- Upgrade activity and success/failure stats
- Build activity during upgrades
- Bootstrap logs for new VM deployments
- ACME certificate renewal activity

Log panels use LogQL json parsing with | keep host to show
clean messages with host labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 22:03:28 +01:00
79a6a72719 Merge pull request 'grafana-dashboards-permissions' (#36) from grafana-dashboards-permissions into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Reviewed-on: #36
2026-02-08 20:18:22 +00:00
89d0a6f358 grafana: add systemd services dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m30s
Run nix flake check / flake-check (pull_request) Failing after 16m49s
Dashboard for monitoring systemd across the fleet:
- Summary stats: failed/active/inactive units, restarts, timers
- Failed units table (shows any units in failed state)
- Service restarts table (top 15 services by restart count)
- Active units per host bar chart
- NixOS upgrade timer table with last trigger time
- Backup timers table (restic jobs)
- Service restarts over time chart
- Hostname filter to focus on specific hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:06:59 +01:00
03ebee4d82 grafana: fix proxmox table __name__ column
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:04:41 +01:00
05630eb4d4 grafana: add Proxmox dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Dashboard for monitoring Proxmox VMs:
- Summary stats: VMs running/stopped, node CPU/memory, uptime
- VM status table with name, status, CPU%, memory%, uptime
- VM CPU usage over time
- VM memory usage over time
- Network traffic (RX/TX) per VM
- Disk I/O (read/write) per VM
- Storage usage gauges and capacity table
- VM filter to focus on specific VMs

Filters out template VMs, shows only actual guests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:02:28 +01:00
1e52eec02a monitoring: always include tier label in scrape configs
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m8s
Previously tier was only included if non-default (not "prod"), which
meant prod hosts had no tier label. This made the Grafana tier filter
only show "test" since "prod" never appeared in label_values().

Now tier is always included, so both "prod" and "test" appear in the
fleet dashboard tier selector.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:58:52 +01:00
d333aa0164 grafana: fix fleet table __name__ columns
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Exclude the __name__ columns that were leaking through the
table transformations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:52:39 +01:00
a5d5827dcc grafana: add NixOS fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Dashboard for monitoring NixOS deployments across the homelab:
- Hosts behind remote / needing reboot stat panels
- Fleet status table with revision, behind status, reboot needed, age
- Generation age bar chart (shows stale configs)
- Generations per host bar chart
- Deployment activity time series (see when hosts were updated)
- Flake input ages table
- Pie charts for hosts by revision and tier
- Tier filter variable

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:50:08 +01:00
1c13ec12a4 grafana: add temperature dashboard
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Dashboard includes:
- Current temperatures per room (stat panel)
- Average home temperature (gauge)
- Current humidity (stat panel)
- 30-day temperature history with mean/min/max in legend
- Temperature trend (rate of change per hour)
- 24h min/max/avg table per room
- 30-day humidity history

Filters out device_temperature (internal sensor) metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:45:52 +01:00
4bf0eeeadb grafana: add dashboards and fix permissions
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
- Change default OIDC role from Viewer to Editor for Explore access
- Add declarative dashboard provisioning
- Add node-exporter dashboard (CPU, memory, disk, load, network, I/O)
- Add Loki logs dashboard with host/job filters

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:39:21 +01:00
304cb117ce Merge pull request 'grafana-kanidm-oidc' (#35) from grafana-kanidm-oidc into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
Reviewed-on: #35
2026-02-08 19:30:20 +00:00
02270a0e4a docs: update plans with Grafana OIDC progress
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m7s
Run nix flake check / flake-check (push) Failing after 16m31s
- auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed,
  document key findings (PKCE, attribute paths, user requirements)
- monitoring-migration-victoriametrics.md: Note Grafana deployment on
  monitoring02 with Kanidm OIDC as test instance

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:28:10 +01:00
030e8518c5 grafana: add Grafana on monitoring02 with Kanidm OIDC
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Deploy Grafana test instance on monitoring02 with:
- Kanidm OIDC authentication (admins -> Admin role, others -> Viewer)
- PKCE enabled for secure OAuth2 flow (required by Kanidm)
- Declarative datasources for Prometheus and Loki on monitoring01
- Local Caddy for TLS termination via internal ACME CA
- DNS CNAME grafana-test.home.2rjus.net

Terraform changes add OAuth2 client secret and AppRole policies for
kanidm01 and monitoring02.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:23:26 +01:00
9ffdd4f862 terraform: increase monitoring02 disk to 60G
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m8s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 19:23:40 +01:00
0b977808ca hosts: add monitoring02 configuration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
New test-tier host for monitoring stack expansion with:
- Static IP 10.69.13.24
- 4 CPU cores, 4GB RAM, 20GB disk
- Vault integration and NATS-based deployment enabled

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 19:19:38 +01:00
8786113f8f docs: add OpenBao + Kanidm OIDC integration plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m10s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:45:44 +01:00
fdb2c31f84 docs: add pipe-to-loki documentation to CLAUDE.md
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:34:01 +01:00
78eb04205f system: add pipe-to-loki helper script
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Adds a system-wide script for sending command output or interactive
sessions to Loki for easy sharing with Claude.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:30:53 +01:00
19cb61ebbc Merge pull request 'kanidm-pam-client' (#34) from kanidm-pam-client into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m19s
Reviewed-on: #34
2026-02-08 14:14:53 +00:00
9ed09c9a9c docs: add user-management documentation
All checks were successful
Run nix flake check / flake-check (pull_request) Successful in 3m33s
Run nix flake check / flake-check (push) Successful in 2m0s
- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:14:21 +01:00
b31c64f1b9 kanidm: remove declarative user provisioning
Keep base groups (admins, users, ssh-users) provisioned declaratively
but manage regular users via the kanidm CLI. This allows setting POSIX
attributes and passwords in a single workflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:14:03 +01:00
54b6e37420 flake: add kanidm to devshell
Add kanidm_1_8 CLI for administering the Kanidm server.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:12:19 +01:00
b845a8bb8b system: add kanidm PAM/NSS client module
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group

Enable on testvm01-03 for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:12:19 +01:00
bfbf0cea68 template2: enable zram for bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m34s
Prevents OOM during initial nixos-rebuild on 2GB VMs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:34:08 +01:00
3abe5e83a7 docs: add memory ballooning as fallback option
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:29:42 +01:00
67c27555f3 docs: add memory issues follow-up plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m2s
Track zram change effectiveness for OOM prevention during upgrades.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:26:31 +01:00
1674b6a844 system: enable zram swap for all hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m6s
Provides compressed swap in RAM to prevent OOM kills during
nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram
configs from jelly01 and nix-cache01.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:02:58 +01:00
311be282b6 docs: add security hardening plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 2s
Based on security review findings, covering SSH hardening, firewall
enablement, log transport TLS, security alerting, and secrets management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:26:15 +01:00
11cbb64097 claude: make auditor delegation explicit in investigate-alarm
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:11:09 +01:00