Enable Kanidm users to authenticate to OpenBao via OIDC for Web UI access.
Members of the admins group get full read/write access to secrets.
Changes:
- Add OIDC auth backend in Terraform (oidc.tf)
- Add oidc-admin and oidc-default policies
- Add openbao OAuth2 client to Kanidm
- Enable legacy crypto (RS256) for OpenBao compatibility
- Allow imperative group membership management in Kanidm
Limitations:
- CLI login not supported (Kanidm requires HTTPS for confidential client redirects)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Assign roles to hosts for better organization and filtering:
- ha1: home-automation
- monitoring01, monitoring02: monitoring
- jelly01: media
- nats1: messaging
- http-proxy: proxy
- testvm01-03: test
Also promote kanidm01 and monitoring02 from test to prod tier.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The template used | min(100) | max(0) which is invalid Jinja2 syntax.
These filters expect iterables (lists), not scalar arguments. This
caused TypeError warnings on every MQTT message and left battery
sensors unavailable.
Fixed by using proper list-based min/max:
[[[value, 100] | min, 0] | max
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The 2-hour threshold was too aggressive for temperature sensors in
stable environments. Historical data shows gaps up to 2.75 hours when
temperature hasn't changed (Home Assistant only updates last_updated
when values change). Increasing to 4 hours avoids false positives
while still catching genuine failures.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create shared policy granting all hosts access to nixos-exporter nkey
- Add policy to both manual and generated host AppRoles
- Remove duplicate kanidm01/monitoring02 entries from hosts-generated.tf
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When one host fetches the latest flake revision, it publishes to NATS
and all other hosts receive the update immediately. This reduces
redundant nix flake metadata calls across the fleet.
- Add nkeys to devshell for key generation
- Add nixos-exporter user to NATS HOMELAB account
- Add Vault secret for NKey storage
- Configure all hosts to use NATS for revision sharing
- Update nixos-exporter input to version with NATS support
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Loki-based dashboard for tracking NixOS operations including:
- Upgrade activity and success/failure stats
- Build activity during upgrades
- Bootstrap logs for new VM deployments
- ACME certificate renewal activity
Log panels use LogQL json parsing with | keep host to show
clean messages with host labels.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard for monitoring systemd across the fleet:
- Summary stats: failed/active/inactive units, restarts, timers
- Failed units table (shows any units in failed state)
- Service restarts table (top 15 services by restart count)
- Active units per host bar chart
- NixOS upgrade timer table with last trigger time
- Backup timers table (restic jobs)
- Service restarts over time chart
- Hostname filter to focus on specific hosts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard for monitoring Proxmox VMs:
- Summary stats: VMs running/stopped, node CPU/memory, uptime
- VM status table with name, status, CPU%, memory%, uptime
- VM CPU usage over time
- VM memory usage over time
- Network traffic (RX/TX) per VM
- Disk I/O (read/write) per VM
- Storage usage gauges and capacity table
- VM filter to focus on specific VMs
Filters out template VMs, shows only actual guests.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously tier was only included if non-default (not "prod"), which
meant prod hosts had no tier label. This made the Grafana tier filter
only show "test" since "prod" never appeared in label_values().
Now tier is always included, so both "prod" and "test" appear in the
fleet dashboard tier selector.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard for monitoring NixOS deployments across the homelab:
- Hosts behind remote / needing reboot stat panels
- Fleet status table with revision, behind status, reboot needed, age
- Generation age bar chart (shows stale configs)
- Generations per host bar chart
- Deployment activity time series (see when hosts were updated)
- Flake input ages table
- Pie charts for hosts by revision and tier
- Tier filter variable
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard includes:
- Current temperatures per room (stat panel)
- Average home temperature (gauge)
- Current humidity (stat panel)
- 30-day temperature history with mean/min/max in legend
- Temperature trend (rate of change per hour)
- 24h min/max/avg table per room
- 30-day humidity history
Filters out device_temperature (internal sensor) metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed,
document key findings (PKCE, attribute paths, user requirements)
- monitoring-migration-victoriametrics.md: Note Grafana deployment on
monitoring02 with Kanidm OIDC as test instance
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deploy Grafana test instance on monitoring02 with:
- Kanidm OIDC authentication (admins -> Admin role, others -> Viewer)
- PKCE enabled for secure OAuth2 flow (required by Kanidm)
- Declarative datasources for Prometheus and Loki on monitoring01
- Local Caddy for TLS termination via internal ACME CA
- DNS CNAME grafana-test.home.2rjus.net
Terraform changes add OAuth2 client secret and AppRole policies for
kanidm01 and monitoring02.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New test-tier host for monitoring stack expansion with:
- Static IP 10.69.13.24
- 4 CPU cores, 4GB RAM, 20GB disk
- Vault integration and NATS-based deployment enabled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a system-wide script for sending command output or interactive
sessions to Loki for easy sharing with Claude.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep base groups (admins, users, ssh-users) provisioned declaratively
but manage regular users via the kanidm CLI. This allows setting POSIX
attributes and passwords in a single workflow.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group
Enable on testvm01-03 for testing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Provides compressed swap in RAM to prevent OOM kills during
nixos-rebuild on low-memory VMs (2GB). Removes duplicate zram
configs from jelly01 and nix-cache01.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Based on security review findings, covering SSH hardening, firewall
enablement, log transport TLS, security alerting, and secrets management.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Changed section 4 from "if needed" to always spawn auditor
- Added explicit "Do NOT query audit logs yourself" guidance
- Listed specific scenarios requiring auditor (service stopped, etc.)
- Added manual intervention as first common cause
- Updated guidelines to emphasize mandatory delegation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add new auditor agent for security-focused audit log analysis:
- SSH session tracking, command execution, sudo usage
- Suspicious activity detection patterns
- Can be used standalone or as sub-agent by investigate-alarm
Update investigate-alarm to delegate audit analysis to auditor
and add git-explorer MCP for configuration drift detection.
Add git-explorer to .mcp.json for repository inspection.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Kanidm does not expose a Prometheus /metrics endpoint.
The scrape target was causing 404 errors after the TLS
certificate issue was fixed.
Also add SSH command restriction to CLAUDE.md.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Include both auth.home.2rjus.net (CNAME) and kanidm01.home.2rjus.net
(A record) as SANs in the TLS certificate. This fixes Prometheus
scraping which connects via the hostname, not the CNAME.
Fixes: x509: certificate is valid for auth.home.2rjus.net, not kanidm01.home.2rjus.net
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>