Add a shared `homelab.host` module that provides host metadata for
multiple consumers:
- tier: deployment tier (test/prod) for future homelab-deploy service
- priority: alerting priority (high/low) for Prometheus label filtering
- role: primary role of the host (dns, database, monitoring, etc.)
- labels: free-form labels for additional metadata
Host configurations updated with appropriate values:
- ns1, ns2: role=dns with dns_role labels
- nix-cache01: priority=low, role=build-host
- vault01: role=vault
- jump: role=bastion
- template, template2, testvm01, vaulttest01: tier=test, priority=low
The module is now imported via commonModules in flake.nix, making it
available to all hosts including minimal configurations like template2.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MCP exposes two tools:
- deploy: test-tier only, always available
- deploy_admin: all tiers, requires --enable-admin flag
Three security layers: CLI flag, NATS authz, Claude Code permissions.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Support deploying to all hosts in a tier or all hosts with a role:
- deploy.<tier>.all - broadcast to all hosts in tier
- deploy.<tier>.role.<role> - broadcast to hosts with matching role
MCP can deploy to all test hosts at once, admin can deploy to any group.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add plan for NATS-based deployment service (homelab-deploy) that enables
on-demand NixOS configuration updates via messaging. Features tiered
permissions (test/prod) enforced at NATS layer.
Update prometheus-scrape-target-labels plan to share the homelab.host
module for host metadata (tier, priority, role, labels) - single source
of truth for both deployment tiers and prometheus labels.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increase duration from 5m to 10m and demote severity from critical to
warning. Brief degraded states during nixos-rebuild are normal and were
causing false positive alerts.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge Prometheus metrics and Loki logs into a unified troubleshooting
skill. Adds LogQL query patterns, label reference, and common service
units for log searching.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update nixos-exporter to 0.2.3
- Set system.configurationRevision for all hosts so the exporter
can report the flake's git revision
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add nixos-exporter prometheus exporter to track NixOS generation metrics
and flake revision status across all hosts.
Changes:
- Add nixos-exporter flake input
- Add commonModules list in flake.nix for modules shared by all hosts
- Enable nixos-exporter in system/monitoring/metrics.nix
- Configure Prometheus to scrape nixos-exporter on all hosts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
RemainAfterExit=true kept the service in "active" state, which
prevented OnUnitActiveSec from scheduling new triggers since there
was no new "activation" event. Removing it allows the service to
properly go inactive, enabling the timer to reschedule correctly.
Also fix ExecStart to use lib.getExe for proper path resolution
with writeShellApplication.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add proposed dns_role label to distinguish primary/secondary DNS
resolvers. This addresses the unbound_low_cache_hit_ratio alert
firing on ns2, which has a cold cache due to low traffic.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Backups to the shared restic repository were all scheduled at exactly
midnight, causing lock conflicts. Adding RandomizedDelaySec spreads
them out over a 2-hour window to prevent simultaneous access.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The homeassistant override key should match the entity type in the
MQTT discovery topic path. For battery sensors, the topic is
homeassistant/sensor/<device>/battery/config, so the key should be
"battery" not "sensor_battery".
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Displays FQDN and flake commit hash with timestamp on login.
Templates can override with their own MOTD via mkDefault.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a helper script deployed to all hosts for testing feature branches.
Usage: nixos-rebuild-test <action> <branch>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of creating a long-lived Vault token in Terraform (which gets
invalidated when Terraform recreates it), monitoring01 now uses its
existing AppRole credentials to fetch a fresh token for Prometheus.
Changes:
- Add prometheus-metrics policy to monitoring01's AppRole
- Remove vault_token.prometheus_metrics resource from Terraform
- Remove openbao-token KV secret from Terraform
- Add systemd service to fetch AppRole token on boot
- Add systemd timer to refresh token every 30 minutes
This ensures Prometheus always has a valid token without depending on
Terraform state or manual intervention.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.
Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document TrueNAS CORE LDAP integration approach (NFS-only) and
future NixOS NAS migration path with native Kanidm PAM/NSS.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Evaluate options for replacing LLDAP+Authelia with a unified auth solution.
Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated plan with:
- Full device inventory from ha1
- Backup verification details
- Branch and commit references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.
Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).
Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
NATS HTTP monitoring endpoint serves JSON, not Prometheus format.
Use the prometheus-nats-exporter which queries the NATS endpoint
and exposes proper Prometheus metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>