- Move nix-cache CNAME from nix-cache01 to nix-cache02
- Remove actions1 CNAME (service removed)
- Update proxy.nix to serve canonical domain on nix-cache02
- Promote nix-cache02 to prod tier with build-host role
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
nix-cache01 serves nix-cache.home.2rjus.net (canonical)
nix-cache02 serves nix-cache02.home.2rjus.net (for testing)
This allows testing nix-cache02 independently before DNS cutover.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The actions runner on nix-cache01 was never actively used.
Removing it before migrating to nix-cache02.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Configure builder to build nixos-servers and nixos (gunter) repos
- Add builder NKey to Vault secrets
- Update NATS permissions for builder, test-deployer, and admin-deployer
- Grant nix-cache02 access to shared homelab-deploy secrets
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New build host to replace nix-cache01 with:
- 8 CPU cores, 16GB RAM, 200GB disk
- Static IP 10.69.13.25
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document findings from false positive host_reboot alert caused by
NTP clock adjustment affecting node_boot_time_seconds metric.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Radarr on TrueNAS jail is too old - exportarr fails on
/api/v3/wanted/cutoff endpoint (404). Keep sonarr which works.
Vault secret kept for when Radarr is updated.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add prometheus exportarr exporters for Radarr and Sonarr media
services. Runs on monitoring01, queries remote APIs.
- Radarr exporter on port 9708
- Sonarr exporter on port 9709
- API keys fetched from Vault
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add vault secrets for Radarr and Sonarr API keys to enable
exportarr metrics collection on monitoring01.
- services/exportarr/radarr - Radarr API key
- services/exportarr/sonarr - Sonarr API key
- Grant monitoring01 access to services/exportarr/*
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Empty valid_status_codes defaults to 2xx only, not "any".
Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so
services returning 400/401 like ha and nzbget pass the probe.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use joinByField transformation instead of merge to properly align
rows by instance. Also exclude duplicate Time/job columns from join.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard includes:
- Stat panels for endpoints monitored, probe failures, expiring certs
- Gauge showing minimum days until any cert expires
- Table of all endpoints sorted by expiry (color-coded)
- Probe status table with HTTP status and duration
- Time series graphs for expiry trends and probe success rate
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Only care about TLS handshake success for certificate monitoring.
Services like nzbget (401) and ha (400) return non-2xx but have
valid certificates.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The rules were already added to rules.yml but the blackbox.nix file
still had them, causing duplicate 'groups' key errors.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move certificate alert rules to rules.yml instead of adding them as a
separate rules string in blackbox.nix. The previous approach caused a
YAML parse error due to duplicate 'groups' keys.
Also add policy to CLAUDE.md: never force push to master.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add reboot.yml playbook with rolling reboot (serial: 1)
- Uses systemd reboot.target for NixOS compatibility
- Waits for each host to come back before proceeding
- Update dynamic inventory to use short hostnames
- ansible_host set to FQDN for connections
- Allows -l testvm01 instead of -l testvm01.home.2rjus.net
- Update static.yml to match short hostname convention
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use result_format=yaml with builtin default callback instead of
the removed community.general.yaml plugin.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move playbooks/ to ansible/playbooks/
- Add dynamic inventory script that extracts hosts from flake
- Groups by tier (tier_test, tier_prod) and role (role_dns, etc.)
- Reads homelab.host.* options for metadata
- Add static inventory for non-flake hosts (Proxmox)
- Add ansible.cfg with inventory path and SSH optimizations
- Add group_vars/all.yml for common variables
- Add restart-service.yml playbook for restarting systemd services
- Update provision-approle.yml with single-host safeguard
- Add ANSIBLE_CONFIG to devshell for automatic inventory discovery
- Add ansible = "false" label to template2 to exclude from inventory
- Update CLAUDE.md to reference ansible/README.md for details
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable Kanidm users to authenticate to OpenBao via OIDC for Web UI access.
Members of the admins group get full read/write access to secrets.
Changes:
- Add OIDC auth backend in Terraform (oidc.tf)
- Add oidc-admin and oidc-default policies
- Add openbao OAuth2 client to Kanidm
- Enable legacy crypto (RS256) for OpenBao compatibility
- Allow imperative group membership management in Kanidm
Limitations:
- CLI login not supported (Kanidm requires HTTPS for confidential client redirects)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Assign roles to hosts for better organization and filtering:
- ha1: home-automation
- monitoring01, monitoring02: monitoring
- jelly01: media
- nats1: messaging
- http-proxy: proxy
- testvm01-03: test
Also promote kanidm01 and monitoring02 from test to prod tier.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The template used | min(100) | max(0) which is invalid Jinja2 syntax.
These filters expect iterables (lists), not scalar arguments. This
caused TypeError warnings on every MQTT message and left battery
sensors unavailable.
Fixed by using proper list-based min/max:
[[[value, 100] | min, 0] | max
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The 2-hour threshold was too aggressive for temperature sensors in
stable environments. Historical data shows gaps up to 2.75 hours when
temperature hasn't changed (Home Assistant only updates last_updated
when values change). Increasing to 4 hours avoids false positives
while still catching genuine failures.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create shared policy granting all hosts access to nixos-exporter nkey
- Add policy to both manual and generated host AppRoles
- Remove duplicate kanidm01/monitoring02 entries from hosts-generated.tf
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When one host fetches the latest flake revision, it publishes to NATS
and all other hosts receive the update immediately. This reduces
redundant nix flake metadata calls across the fleet.
- Add nkeys to devshell for key generation
- Add nixos-exporter user to NATS HOMELAB account
- Add Vault secret for NKey storage
- Configure all hosts to use NATS for revision sharing
- Update nixos-exporter input to version with NATS support
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Loki-based dashboard for tracking NixOS operations including:
- Upgrade activity and success/failure stats
- Build activity during upgrades
- Bootstrap logs for new VM deployments
- ACME certificate renewal activity
Log panels use LogQL json parsing with | keep host to show
clean messages with host labels.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard for monitoring systemd across the fleet:
- Summary stats: failed/active/inactive units, restarts, timers
- Failed units table (shows any units in failed state)
- Service restarts table (top 15 services by restart count)
- Active units per host bar chart
- NixOS upgrade timer table with last trigger time
- Backup timers table (restic jobs)
- Service restarts over time chart
- Hostname filter to focus on specific hosts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard for monitoring Proxmox VMs:
- Summary stats: VMs running/stopped, node CPU/memory, uptime
- VM status table with name, status, CPU%, memory%, uptime
- VM CPU usage over time
- VM memory usage over time
- Network traffic (RX/TX) per VM
- Disk I/O (read/write) per VM
- Storage usage gauges and capacity table
- VM filter to focus on specific VMs
Filters out template VMs, shows only actual guests.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously tier was only included if non-default (not "prod"), which
meant prod hosts had no tier label. This made the Grafana tier filter
only show "test" since "prod" never appeared in label_values().
Now tier is always included, so both "prod" and "test" appear in the
fleet dashboard tier selector.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>