Evaluate options for replacing LLDAP+Authelia with a unified auth solution.
Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated plan with:
- Full device inventory from ha1
- Backup verification details
- Branch and commit references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.
Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).
Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
NATS HTTP monitoring endpoint serves JSON, not Prometheus format.
Use the prometheus-nats-exporter which queries the NATS endpoint
and exposes proper Prometheus metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics
Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.
Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document Loki log query labels and patterns, and Prometheus job names
with example queries for the lab-monitoring MCP server.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update CLAUDE.md and README.md to reflect that secrets are now managed
by OpenBao, with sops only remaining for ca. Update migration plans
with sops cleanup checklist and auth01 decommission.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The read-based loop split multiline values on newlines, causing only
the first line to be written. Use jq -j to write each key's value
directly to files, preserving multiline content.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove backup_helper_secret variable and switch shared/backup/password
to auto_generate. New password will be added alongside existing restic
repository key.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.
Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
nix-cache01 regularly hits high CPU during nix builds, causing flappy
alerts. Keep the 15m threshold for all other hosts.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The step-ca serving certificate is auto-renewed with a 24h lifetime,
so it always triggers the general < 86400s threshold. Exclude it and
add a dedicated step_ca_serving_cert_expiring alert at < 1h instead.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move nix-cache_caddy back to a manual config in prometheus.nix using the
service CNAME (nix-cache.home.2rjus.net) instead of the hostname. The
auto-generated target used nix-cache01.home.2rjus.net which doesn't
match the TLS certificate SAN.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.
Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.
Fixes grafana scrape target port from 3100 to 3000.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add /modules/ and /lib/ to directory structure
- Document homelab.dns options and zone auto-generation
- Update "Adding a New Host" workflow (no manual zone editing)
- Expand DNS Architecture section with auto-generation details
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace static zone file with dynamically generated records:
- Add homelab.dns module with enable/cnames options
- Extract IPs from systemd.network configs (filters VPN interfaces)
- Use git commit timestamp as zone serial number
- Move external hosts to separate external-hosts.nix
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename nixos-options to nixpkgs-options and add new nixpkgs-packages
server for package search functionality. Update CLAUDE.md to document
both MCP servers and their available tools.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>