Commit Graph

751 Commits

Author SHA1 Message Date
2f195d26d3 Merge pull request 'homelab-host-module' (#27) from homelab-host-module into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m8s
Reviewed-on: #27
2026-02-07 01:56:38 +00:00
a926d34287 nix-cache01: set priority to high
All checks were successful
Run nix flake check / flake-check (pull_request) Successful in 2m14s
Run nix flake check / flake-check (push) Successful in 2m17s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:54:32 +01:00
be2421746e gitignore: add result-* for parallel nix builds
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m4s
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:51:27 +01:00
12bf0683f5 modules: add homelab.host for host metadata
Add a shared `homelab.host` module that provides host metadata for
multiple consumers:
- tier: deployment tier (test/prod) for future homelab-deploy service
- priority: alerting priority (high/low) for Prometheus label filtering
- role: primary role of the host (dns, database, monitoring, etc.)
- labels: free-form labels for additional metadata

Host configurations updated with appropriate values:
- ns1, ns2: role=dns with dns_role labels
- nix-cache01: priority=low, role=build-host
- vault01: role=vault
- jump: role=bastion
- template, template2, testvm01, vaulttest01: tier=test, priority=low

The module is now imported via commonModules in flake.nix, making it
available to all hosts including minimal configurations like template2.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:49:58 +01:00
e8a43c6715 docs: add deploy_admin tool with opt-in flag to homelab-deploy plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
MCP exposes two tools:
- deploy: test-tier only, always available
- deploy_admin: all tiers, requires --enable-admin flag

Three security layers: CLI flag, NATS authz, Claude Code permissions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:29:13 +01:00
eef52bb8c5 docs: add group deployment support to homelab-deploy plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Support deploying to all hosts in a tier or all hosts with a role:
- deploy.<tier>.all - broadcast to all hosts in tier
- deploy.<tier>.role.<role> - broadcast to hosts with matching role

MCP can deploy to all test hosts at once, admin can deploy to any group.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:22:17 +01:00
c6cdbc6799 docs: move nixos-exporter plan to completed
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:13:14 +01:00
4d724329a6 docs: add homelab-deploy plan, unify host metadata
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add plan for NATS-based deployment service (homelab-deploy) that enables
on-demand NixOS configuration updates via messaging. Features tiered
permissions (test/prod) enforced at NATS layer.

Update prometheus-scrape-target-labels plan to share the homelab.host
module for host metadata (tier, priority, role, labels) - single source
of truth for both deployment tiers and prometheus labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:10:54 +01:00
881e70df27 monitoring: relax systemd_not_running alert threshold
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Increase duration from 5m to 10m and demote severity from critical to
warning. Brief degraded states during nixos-rebuild are normal and were
causing false positive alerts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:22:29 +01:00
b9a269d280 chore: rename metrics skill to observability, add logs reference
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Merge Prometheus metrics and Loki logs into a unified troubleshooting
skill. Adds LogQL query patterns, label reference, and common service
units for log searching.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:17:41 +01:00
fcf1a66103 chore: add metrics troubleshooting skill
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:11:40 +01:00
2034004280 flake: update nixos-exporter and set configurationRevision
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m33s
- Update nixos-exporter to 0.2.3
- Set system.configurationRevision for all hosts so the exporter
  can report the flake's git revision

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:06:47 +01:00
af43f88394 flake.lock: Update
Flake lock file updates:

• Updated input 'nixos-exporter':
    'git+https://git.t-juice.club/torjus/nixos-exporter?ref=refs/heads/master&rev=9c29505814954352b2af99b97910ee12a736b8dd' (2026-02-06)
  → 'git+https://git.t-juice.club/torjus/nixos-exporter?ref=refs/heads/master&rev=04eba77ac028033b6dfed604eb1b5664b46acc77' (2026-02-06)
2026-02-07 00:01:02 +00:00
a834497fe8 flake: update nixos-exporter input
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m27s
Periodic flake update / flake-update (push) Successful in 1m7s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 00:17:54 +01:00
d3de2a1511 Merge pull request 'monitoring: add nixos-exporter to all hosts' (#26) from nixos-exporter into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m6s
Reviewed-on: #26
2026-02-06 22:56:04 +00:00
97ff774d3f monitoring: add nixos-exporter to all hosts
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m16s
Run nix flake check / flake-check (pull_request) Successful in 3m14s
Add nixos-exporter prometheus exporter to track NixOS generation metrics
and flake revision status across all hosts.

Changes:
- Add nixos-exporter flake input
- Add commonModules list in flake.nix for modules shared by all hosts
- Enable nixos-exporter in system/monitoring/metrics.nix
- Configure Prometheus to scrape nixos-exporter on all hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 23:55:29 +01:00
f2c30cc24f chore: give claude the quick-plan skill
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m57s
2026-02-06 21:58:30 +01:00
7e80d2e0bc docs: add plans for nixos and homelab prometheus exporters
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:56:55 +01:00
1f5b7b13e2 monitoring: enable restart-count and ip-accounting collectors
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:30:47 +01:00
c53e36c3f3 Revert "monitoring: enable additional systemd-exporter collectors"
This reverts commit 04a252b857.
2026-02-06 21:30:05 +01:00
04a252b857 monitoring: enable additional systemd-exporter collectors
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enables restart-count, file-descriptor-size, and ip-accounting collectors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:28:44 +01:00
5d26f52e0d Revert "monitoring: enable cpu, memory, io collectors for systemd-exporter"
This reverts commit 506a692548.
2026-02-06 21:26:20 +01:00
506a692548 monitoring: enable cpu, memory, io collectors for systemd-exporter
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:23:19 +01:00
fa8f4f0784 docs: add notes about lib.getExe and not amending master
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:41:45 +01:00
025570dea1 monitoring: fix openbao token refresh timer not triggering
RemainAfterExit=true kept the service in "active" state, which
prevented OnUnitActiveSec from scheduling new triggers since there
was no new "activation" event. Removing it allows the service to
properly go inactive, enabling the timer to reschedule correctly.

Also fix ExecStart to use lib.getExe for proper path resolution
with writeShellApplication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:41:45 +01:00
15c00393f1 monitoring: increase zigbee_sensor_stale threshold to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m59s
Sensors report every ~45-50 minutes on average, so 1 hour was too tight.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:26:56 +01:00
787c14c7a6 docs: add dns_role label to scrape target labels plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Add proposed dns_role label to distinguish primary/secondary DNS
resolvers. This addresses the unbound_low_cache_hit_ratio alert
firing on ns2, which has a cold cache due to low traffic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:23:34 +01:00
eee3dde04f restic: add randomized delay to backup timers
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Backups to the shared restic repository were all scheduled at exactly
midnight, causing lock conflicts. Adding RandomizedDelaySec spreads
them out over a 2-hour window to prevent simultaneous access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:09:38 +01:00
682b07b977 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/bf922a59c5c9998a6584645f7d0de689512e444c?narHash=sha256-ksTL7P9QC1WfZasNlaAdLOzqD8x5EPyods69YBqxSfk%3D' (2026-02-04)
  → 'github:nixos/nixpkgs/00c21e4c93d963c50d4c0c89bfa84ed6e0694df2?narHash=sha256-AYqlWrX09%2BHvGs8zM6ebZ1pwUqjkfpnv8mewYwAo%2BiM%3D' (2026-02-04)
2026-02-06 00:01:04 +00:00
70661ac3d9 Merge pull request 'home-assistant: fix zigbee battery value_template override key' (#25) from fix-zigbee-battery-template into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Periodic flake update / flake-update (push) Successful in 1m11s
Reviewed-on: #25
2026-02-05 23:56:45 +00:00
506e93a5e2 home-assistant: fix zigbee battery value_template override key
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m39s
Run nix flake check / flake-check (pull_request) Failing after 12m37s
The homeassistant override key should match the entity type in the
MQTT discovery topic path. For battery sensors, the topic is
homeassistant/sensor/<device>/battery/config, so the key should be
"battery" not "sensor_battery".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:48:30 +01:00
b6c41aa910 system: add UTC suffix to MOTD commit timestamp
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:34:24 +01:00
aa6e00a327 Merge pull request 'add-nixos-rebuild-test' (#24) from add-nixos-rebuild-test into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Reviewed-on: #24
2026-02-05 23:26:34 +00:00
258e350b89 system: add MOTD banner with hostname and commit info
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m8s
Run nix flake check / flake-check (push) Failing after 3m53s
Displays FQDN and flake commit hash with timestamp on login.
Templates can override with their own MOTD via mkDefault.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:26:01 +01:00
eba195c192 docs: add nixos-rebuild-test usage to CLAUDE.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:19:49 +01:00
bbb22e588e system: replace writeShellScript with writeShellApplication
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m3s
Run nix flake check / flake-check (push) Failing after 5m57s
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:17:24 +01:00
879e7aba60 templates: use writeShellApplication for prepare-host script
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:14:05 +01:00
39a4ea98ab system: add nixos-rebuild-test helper script
Adds a helper script deployed to all hosts for testing feature branches.
Usage: nixos-rebuild-test <action> <branch>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:12:16 +01:00
1d90dc2181 Merge pull request 'monitoring: use AppRole token for OpenBao metrics scraping' (#23) from fix-prometheus-openbao-token into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m21s
Reviewed-on: #23
2026-02-05 22:52:42 +00:00
e9857afc11 monitoring: use AppRole token for OpenBao metrics scraping
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m12s
Run nix flake check / flake-check (pull_request) Successful in 2m19s
Instead of creating a long-lived Vault token in Terraform (which gets
invalidated when Terraform recreates it), monitoring01 now uses its
existing AppRole credentials to fetch a fresh token for Prometheus.

Changes:
- Add prometheus-metrics policy to monitoring01's AppRole
- Remove vault_token.prometheus_metrics resource from Terraform
- Remove openbao-token KV secret from Terraform
- Add systemd service to fetch AppRole token on boot
- Add systemd timer to refresh token every 30 minutes

This ensures Prometheus always has a valid token without depending on
Terraform state or manual intervention.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:51:11 +01:00
88e9036cb4 Merge pull request 'auth01: decommission host and remove authelia/lldap services' (#22) from decommission-auth01 into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Reviewed-on: #22
2026-02-05 22:37:38 +00:00
59e1962d75 auth01: decommission host and remove authelia/lldap services
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 18m1s
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.

Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:35:45 +01:00
3dc4422ba0 docs: add NAS integration notes to auth plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Document TrueNAS CORE LDAP integration approach (NFS-only) and
future NixOS NAS migration path with native Kanidm PAM/NSS.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:24:37 +01:00
f0963624bc docs: add auth system replacement plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Evaluate options for replacing LLDAP+Authelia with a unified auth solution.
Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:18:38 +01:00
7b46f94e48 Merge pull request 'zigbee-battery-fix' (#21) from zigbee-battery-fix into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #21
2026-02-05 21:51:41 +00:00
32968147b5 docs: move zigbee battery plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m17s
Run nix flake check / flake-check (pull_request) Successful in 2m19s
Updated plan with:
- Full device inventory from ha1
- Backup verification details
- Branch and commit references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:49:49 +01:00
c515a6b4e1 home-assistant: fix zigbee sensor battery reporting
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.

Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).

Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:41:07 +01:00
4d8b94ce83 monitoring: add collector flags to nats exporter
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m53s
The exporter requires explicit collector flags to specify what
metrics to collect.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:23:30 +01:00
8b0a4ea33a monitoring: use nats exporter instead of direct scrape
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
NATS HTTP monitoring endpoint serves JSON, not Prometheus format.
Use the prometheus-nats-exporter which queries the NATS endpoint
and exposes proper Prometheus metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:22:04 +01:00
5be1f43c24 Merge pull request 'monitoring-gaps-implementation' (#20) from monitoring-gaps-implementation into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #20
2026-02-05 20:57:31 +00:00