The Radarr version in the TrueNAS jail is too old: exportarr fails on
the /api/v3/wanted/cutoff endpoint (404). Keep the Sonarr exporter,
which works. The Vault secret is kept for when Radarr is updated.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add prometheus exportarr exporters for Radarr and Sonarr media
services. Runs on monitoring01, queries remote APIs.
- Radarr exporter on port 9708
- Sonarr exporter on port 9709
- API keys fetched from Vault
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Empty valid_status_codes defaults to 2xx only, not "any".
Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so
services returning 400/401 like ha and nzbget pass the probe.
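A minimal Python sketch of the acceptance rule described above, assuming
an empty list means "2xx only". The name probe_passes and the contents of
EXPLICIT_CODES are illustrative and not taken from the repository.

```python
def probe_passes(status: int, valid_status_codes: list[int]) -> bool:
    """Return True if the probe would accept this HTTP status."""
    if not valid_status_codes:
        # An empty list falls back to the 2xx range, not "accept anything".
        return 200 <= status <= 299
    return status in valid_status_codes


# Hypothetical explicit list; the real module defines its own set of codes.
EXPLICIT_CODES = [200, 204, 301, 302, 400, 401, 403, 404, 500, 502, 503]
```

With the empty default, a 401 from a service with a valid certificate
fails the probe; with the explicit list it passes.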
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use joinByField transformation instead of merge to properly align
rows by instance. Also exclude duplicate Time/job columns from join.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard includes:
- Stat panels for endpoints monitored, probe failures, expiring certs
- Gauge showing minimum days until any cert expires
- Table of all endpoints sorted by expiry (color-coded)
- Probe status table with HTTP status and duration
- Time series graphs for expiry trends and probe success rate
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Only care about TLS handshake success for certificate monitoring.
Services like nzbget (401) and ha (400) return non-2xx but have
valid certificates.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The rules were already added to rules.yml but the blackbox.nix file
still had them, causing duplicate 'groups' key errors.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move certificate alert rules to rules.yml instead of adding them as a
separate rules string in blackbox.nix. The previous approach caused a
YAML parse error due to duplicate 'groups' keys.
Also add policy to CLAUDE.md: never force push to master.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable Kanidm users to authenticate to OpenBao via OIDC for Web UI access.
Members of the admins group get full read/write access to secrets.
Changes:
- Add OIDC auth backend in Terraform (oidc.tf)
- Add oidc-admin and oidc-default policies
- Add openbao OAuth2 client to Kanidm
- Enable legacy crypto (RS256) for OpenBao compatibility
- Allow imperative group membership management in Kanidm
Limitations:
- CLI login not supported (Kanidm requires HTTPS for confidential client redirects)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The template used | min(100) | max(0), which is not a valid use of the
Jinja2 min/max filters. They operate on an iterable (list) piped into
them and do not take the bound as an argument. This caused TypeError
warnings on every MQTT message and left battery sensors unavailable.
Fixed by using proper list-based min/max:
[[value, 100] | min, 0] | max
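For comparison, the same clamp can be sketched in Python, where min and
max also need an iterable when called with a single argument. The
function name clamp_percent is illustrative only.

```python
def clamp_percent(value: float) -> float:
    """Clamp a battery percentage into the range 0..100."""
    # Mirrors [[value, 100] | min, 0] | max:
    # cap at 100 first, then raise anything below 0 up to 0.
    return max([min([value, 100]), 0])
```

Calling min(100) with a bare scalar raises TypeError in Python as well,
which is the same class of mistake as the original template.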
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The 2-hour threshold was too aggressive for temperature sensors in
stable environments. Historical data shows gaps up to 2.75 hours when
temperature hasn't changed (Home Assistant only updates last_updated
when values change). Increasing to 4 hours avoids false positives
while still catching genuine failures.
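The threshold logic can be sketched as follows. This is an illustration
in Python, not the actual alert expression; is_stale and STALE_AFTER are
hypothetical names.

```python
from datetime import datetime, timedelta

# Threshold from this commit; the previous value was 2 hours.
STALE_AFTER = timedelta(hours=4)


def is_stale(last_updated: datetime, now: datetime,
             threshold: timedelta = STALE_AFTER) -> bool:
    """Return True if the sensor has not updated within the threshold."""
    return now - last_updated > threshold
```

A sensor last updated 2.75 hours ago is stale under the old 2-hour
threshold but not under the new 4-hour one.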
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When one host fetches the latest flake revision, it publishes to NATS
and all other hosts receive the update immediately. This reduces
redundant nix flake metadata calls across the fleet.
- Add nkeys to devshell for key generation
- Add nixos-exporter user to NATS HOMELAB account
- Add Vault secret for NKey storage
- Configure all hosts to use NATS for revision sharing
- Update nixos-exporter input to version with NATS support
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Loki-based dashboard for tracking NixOS operations including:
- Upgrade activity and success/failure stats
- Build activity during upgrades
- Bootstrap logs for new VM deployments
- ACME certificate renewal activity
Log panels use LogQL json parsing with | keep host to show
clean messages with host labels.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard for monitoring systemd across the fleet:
- Summary stats: failed/active/inactive units, restarts, timers
- Failed units table (shows any units in failed state)
- Service restarts table (top 15 services by restart count)
- Active units per host bar chart
- NixOS upgrade timer table with last trigger time
- Backup timers table (restic jobs)
- Service restarts over time chart
- Hostname filter to focus on specific hosts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard for monitoring Proxmox VMs:
- Summary stats: VMs running/stopped, node CPU/memory, uptime
- VM status table with name, status, CPU%, memory%, uptime
- VM CPU usage over time
- VM memory usage over time
- Network traffic (RX/TX) per VM
- Disk I/O (read/write) per VM
- Storage usage gauges and capacity table
- VM filter to focus on specific VMs
Filters out template VMs, shows only actual guests.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard for monitoring NixOS deployments across the homelab:
- Hosts behind remote / needing reboot stat panels
- Fleet status table with revision, behind status, reboot needed, age
- Generation age bar chart (shows stale configs)
- Generations per host bar chart
- Deployment activity time series (see when hosts were updated)
- Flake input ages table
- Pie charts for hosts by revision and tier
- Tier filter variable
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard includes:
- Current temperatures per room (stat panel)
- Average home temperature (gauge)
- Current humidity (stat panel)
- 30-day temperature history with mean/min/max in legend
- Temperature trend (rate of change per hour)
- 24h min/max/avg table per room
- 30-day humidity history
Filters out device_temperature (internal sensor) metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deploy Grafana test instance on monitoring02 with:
- Kanidm OIDC authentication (admins -> Admin role, others -> Viewer)
- PKCE enabled for secure OAuth2 flow (required by Kanidm)
- Declarative datasources for Prometheus and Loki on monitoring01
- Local Caddy for TLS termination via internal ACME CA
- DNS CNAME grafana-test.home.2rjus.net
Terraform changes add OAuth2 client secret and AppRole policies for
kanidm01 and monitoring02.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep base groups (admins, users, ssh-users) provisioned declaratively
but manage regular users via the kanidm CLI. This allows setting POSIX
attributes and passwords in a single workflow.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Kanidm does not expose a Prometheus /metrics endpoint.
The scrape target was causing 404 errors after the TLS
certificate issue was fixed.
Also add SSH command restriction to CLAUDE.md.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Include both auth.home.2rjus.net (CNAME) and kanidm01.home.2rjus.net
(A record) as SANs in the TLS certificate. This fixes Prometheus
scraping which connects via the hostname, not the CNAME.
Fixes: x509: certificate is valid for auth.home.2rjus.net, not kanidm01.home.2rjus.net
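A simplified Python sketch of why the scrape failed: the name the client
connects to must appear in the certificate's SAN list. The helper
cert_covers is hypothetical and ignores wildcard matching.

```python
def cert_covers(hostname: str, sans: list[str]) -> bool:
    """Return True if hostname exactly matches one of the SAN entries."""
    # DNS names are case-insensitive, so compare in lower case.
    return hostname.lower() in {san.lower() for san in sans}
```

With only auth.home.2rjus.net in the list, a connection to
kanidm01.home.2rjus.net is rejected; listing both names accepts either.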
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set owner/group to kanidm so the post-start provisioning
script can read the idm_admin password.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- New test-tier VM at 10.69.13.23 with role=auth
- Kanidm 1.8 server with HTTPS (443) and LDAPS (636)
- ACME certificate from internal CA (auth.home.2rjus.net)
- Provisioned groups: admins, users, ssh-users
- Provisioned user: torjus
- Daily backups at 22:00 (7 versions)
- Prometheus monitoring scrape target
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove pgdb1 host configuration and postgres service module.
The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL.
Removed:
- hosts/pgdb1/ - host configuration
- services/postgres/ - service module (only used by pgdb1)
- postgres_rules from monitoring rules
- rebuild-all.sh (obsolete script)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Configure Unbound to query both ns1 and ns2 for the home.2rjus.net
zone, in addition to local NSD. This provides redundancy during
bootstrap or if local NSD is temporarily unavailable.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove the step-ca host and labmon flake input now that ACME has been
migrated to OpenBao PKI.
Removed:
- hosts/ca/ - step-ca host configuration
- services/ca/ - step-ca service module
- labmon flake input and module (no longer used)
Updated:
- flake.nix - removed ca host and labmon references
- flake.lock - removed labmon input
- rebuild-all.sh - removed ca from host list
- CLAUDE.md - updated documentation
Note: secrets/ca/ should be manually removed by the user.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).
- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract homelab.host metadata (tier, priority, role, labels) from host
configurations and propagate them to Prometheus scrape targets. This
enables semantic alert filtering using labels instead of hardcoded
instance names.
Changes:
- lib/monitoring.nix: Extract host metadata, group targets by labels
- prometheus.nix: Use structured static_configs with labels
- rules.yml: Replace instance filters with role-based filters
Example labels in Prometheus:
- ns1/ns2: role=dns, dns_role=primary/secondary
- nix-cache01: role=build-host
- testvm*: tier=test
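The grouping step can be sketched in Python as below. The real logic
lives in lib/monitoring.nix; the function name build_static_configs, the
bare host names and the default port 9100 are assumptions made for
illustration.

```python
from collections import defaultdict


def build_static_configs(hosts: dict[str, dict[str, str]],
                         port: int = 9100) -> list[dict]:
    """Group hosts that share an identical label set into one entry."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for name, labels in sorted(hosts.items()):
        # A sorted tuple of label pairs is hashable and order-independent.
        key = tuple(sorted(labels.items()))
        groups[key].append(f"{name}:{port}")
    return [{"targets": targets, "labels": dict(key)}
            for key, targets in groups.items()]
```

Hosts with the same labels (for example several tier=test VMs) end up in
a single static_configs entry, so alert rules can filter on labels
instead of instance names.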
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab-deploy flake input and NixOS module for message-based
deployments across the fleet. Configure DEPLOY account in NATS with
tiered access control (listener, test-deployer, admin-deployer).
Enable listener on vaulttest01 as initial test host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increase duration from 5m to 10m and demote severity from critical to
warning. Brief degraded states during nixos-rebuild are normal and were
causing false positive alerts.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
RemainAfterExit=true kept the service in "active" state, which
prevented OnUnitActiveSec from scheduling new triggers since there
was no new "activation" event. Removing it allows the service to
properly go inactive, enabling the timer to reschedule correctly.
Also fix ExecStart to use lib.getExe for proper path resolution
with writeShellApplication.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The homeassistant override key should match the entity type in the
MQTT discovery topic path. For battery sensors, the topic is
homeassistant/sensor/<device>/battery/config, so the key should be
"battery" not "sensor_battery".
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of creating a long-lived Vault token in Terraform (which gets
invalidated when Terraform recreates it), monitoring01 now uses its
existing AppRole credentials to fetch a fresh token for Prometheus.
Changes:
- Add prometheus-metrics policy to monitoring01's AppRole
- Remove vault_token.prometheus_metrics resource from Terraform
- Remove openbao-token KV secret from Terraform
- Add systemd service to fetch AppRole token on boot
- Add systemd timer to refresh token every 30 minutes
This ensures Prometheus always has a valid token without depending on
Terraform state or manual intervention.
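A rough Python sketch of the token fetch, assuming the AppRole backend is
mounted at the default path auth/approle. The function names and the
address used in the usage note are hypothetical; the actual systemd
service may be implemented differently.

```python
import json
import urllib.request


def build_login_request(vault_addr: str, role_id: str,
                        secret_id: str) -> urllib.request.Request:
    """Build the AppRole login request (default mount path assumed)."""
    body = json.dumps({"role_id": role_id, "secret_id": secret_id}).encode()
    return urllib.request.Request(
        f"{vault_addr}/v1/auth/approle/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(response_body: bytes) -> str:
    """Pull the client token out of a login response."""
    return json.loads(response_body)["auth"]["client_token"]
```

A refresh job would send the request with urllib.request.urlopen, pass
the response body to extract_token, and write the result to the file
Prometheus reads its bearer token from.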
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove auth01 host configuration and associated services in preparation
for a new auth stack with a different provisioning system.
Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WSDCGQ12LM sensors report battery: 0 due to a firmware quirk. Override
the battery calculation using voltage via a homeassistant value_template.
Also add a zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).
Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
NATS HTTP monitoring endpoint serves JSON, not Prometheus format.
Use the prometheus-nats-exporter which queries the NATS endpoint
and exposes proper Prometheus metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics
Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.
Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.
Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
nix-cache01 regularly hits high CPU during nix builds, causing flapping
alerts. Keep the 15m threshold for all other hosts.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>