Commit Graph

735 Commits

Author SHA1 Message Date
f2c30cc24f chore: give claude the quick-plan skill
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m57s
2026-02-06 21:58:30 +01:00
7e80d2e0bc docs: add plans for nixos and homelab prometheus exporters
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:56:55 +01:00
1f5b7b13e2 monitoring: enable restart-count and ip-accounting collectors
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:30:47 +01:00
c53e36c3f3 Revert "monitoring: enable additional systemd-exporter collectors"
This reverts commit 04a252b857.
2026-02-06 21:30:05 +01:00
04a252b857 monitoring: enable additional systemd-exporter collectors
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enables restart-count, file-descriptor-size, and ip-accounting collectors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:28:44 +01:00
5d26f52e0d Revert "monitoring: enable cpu, memory, io collectors for systemd-exporter"
This reverts commit 506a692548.
2026-02-06 21:26:20 +01:00
506a692548 monitoring: enable cpu, memory, io collectors for systemd-exporter
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:23:19 +01:00
fa8f4f0784 docs: add notes about lib.getExe and not amending master
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:41:45 +01:00
025570dea1 monitoring: fix openbao token refresh timer not triggering
RemainAfterExit=true kept the service in "active" state, which
prevented OnUnitActiveSec from scheduling new triggers since there
was no new "activation" event. Removing it allows the service to
properly go inactive, enabling the timer to reschedule correctly.

Also fix ExecStart to use lib.getExe for proper path resolution
with writeShellApplication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:41:45 +01:00
15c00393f1 monitoring: increase zigbee_sensor_stale threshold to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m59s
Sensors report every ~45-50 minutes on average, so 1 hour was too tight.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:26:56 +01:00
787c14c7a6 docs: add dns_role label to scrape target labels plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Add proposed dns_role label to distinguish primary/secondary DNS
resolvers. This addresses the unbound_low_cache_hit_ratio alert
firing on ns2, which has a cold cache due to low traffic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:23:34 +01:00
eee3dde04f restic: add randomized delay to backup timers
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Backups to the shared restic repository were all scheduled at exactly
midnight, causing lock conflicts. Adding RandomizedDelaySec spreads
them out over a 2-hour window to prevent simultaneous access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:09:38 +01:00
682b07b977 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/bf922a59c5c9998a6584645f7d0de689512e444c?narHash=sha256-ksTL7P9QC1WfZasNlaAdLOzqD8x5EPyods69YBqxSfk%3D' (2026-02-04)
  → 'github:nixos/nixpkgs/00c21e4c93d963c50d4c0c89bfa84ed6e0694df2?narHash=sha256-AYqlWrX09%2BHvGs8zM6ebZ1pwUqjkfpnv8mewYwAo%2BiM%3D' (2026-02-04)
2026-02-06 00:01:04 +00:00
70661ac3d9 Merge pull request 'home-assistant: fix zigbee battery value_template override key' (#25) from fix-zigbee-battery-template into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Periodic flake update / flake-update (push) Successful in 1m11s
Reviewed-on: #25
2026-02-05 23:56:45 +00:00
506e93a5e2 home-assistant: fix zigbee battery value_template override key
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m39s
Run nix flake check / flake-check (pull_request) Failing after 12m37s
The homeassistant override key should match the entity type in the
MQTT discovery topic path. For battery sensors, the topic is
homeassistant/sensor/<device>/battery/config, so the key should be
"battery" not "sensor_battery".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:48:30 +01:00
b6c41aa910 system: add UTC suffix to MOTD commit timestamp
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:34:24 +01:00
aa6e00a327 Merge pull request 'add-nixos-rebuild-test' (#24) from add-nixos-rebuild-test into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Reviewed-on: #24
2026-02-05 23:26:34 +00:00
258e350b89 system: add MOTD banner with hostname and commit info
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m8s
Run nix flake check / flake-check (push) Failing after 3m53s
Displays FQDN and flake commit hash with timestamp on login.
Templates can override with their own MOTD via mkDefault.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:26:01 +01:00
eba195c192 docs: add nixos-rebuild-test usage to CLAUDE.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:19:49 +01:00
bbb22e588e system: replace writeShellScript with writeShellApplication
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m3s
Run nix flake check / flake-check (push) Failing after 5m57s
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:17:24 +01:00
879e7aba60 templates: use writeShellApplication for prepare-host script
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:14:05 +01:00
39a4ea98ab system: add nixos-rebuild-test helper script
Adds a helper script deployed to all hosts for testing feature branches.
Usage: nixos-rebuild-test <action> <branch>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:12:16 +01:00
1d90dc2181 Merge pull request 'monitoring: use AppRole token for OpenBao metrics scraping' (#23) from fix-prometheus-openbao-token into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m21s
Reviewed-on: #23
2026-02-05 22:52:42 +00:00
e9857afc11 monitoring: use AppRole token for OpenBao metrics scraping
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m12s
Run nix flake check / flake-check (pull_request) Successful in 2m19s
Instead of creating a long-lived Vault token in Terraform (which gets
invalidated when Terraform recreates it), monitoring01 now uses its
existing AppRole credentials to fetch a fresh token for Prometheus.

Changes:
- Add prometheus-metrics policy to monitoring01's AppRole
- Remove vault_token.prometheus_metrics resource from Terraform
- Remove openbao-token KV secret from Terraform
- Add systemd service to fetch AppRole token on boot
- Add systemd timer to refresh token every 30 minutes

This ensures Prometheus always has a valid token without depending on
Terraform state or manual intervention.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:51:11 +01:00
88e9036cb4 Merge pull request 'auth01: decommission host and remove authelia/lldap services' (#22) from decommission-auth01 into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Reviewed-on: #22
2026-02-05 22:37:38 +00:00
59e1962d75 auth01: decommission host and remove authelia/lldap services
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 18m1s
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.

Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:35:45 +01:00
3dc4422ba0 docs: add NAS integration notes to auth plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Document TrueNAS CORE LDAP integration approach (NFS-only) and
future NixOS NAS migration path with native Kanidm PAM/NSS.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:24:37 +01:00
f0963624bc docs: add auth system replacement plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Evaluate options for replacing LLDAP+Authelia with a unified auth solution.
Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:18:38 +01:00
7b46f94e48 Merge pull request 'zigbee-battery-fix' (#21) from zigbee-battery-fix into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #21
2026-02-05 21:51:41 +00:00
32968147b5 docs: move zigbee battery plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m17s
Run nix flake check / flake-check (pull_request) Successful in 2m19s
Updated plan with:
- Full device inventory from ha1
- Backup verification details
- Branch and commit references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:49:49 +01:00
c515a6b4e1 home-assistant: fix zigbee sensor battery reporting
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.

Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).

Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:41:07 +01:00
4d8b94ce83 monitoring: add collector flags to nats exporter
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m53s
The exporter requires explicit collector flags to specify what
metrics to collect.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:23:30 +01:00
8b0a4ea33a monitoring: use nats exporter instead of direct scrape
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
NATS HTTP monitoring endpoint serves JSON, not Prometheus format.
Use the prometheus-nats-exporter which queries the NATS endpoint
and exposes proper Prometheus metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:22:04 +01:00
5be1f43c24 Merge pull request 'monitoring-gaps-implementation' (#20) from monitoring-gaps-implementation into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #20
2026-02-05 20:57:31 +00:00
b322b1156b monitoring: fix openbao token output path
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m17s
Run nix flake check / flake-check (push) Failing after 8m57s
The outputDir with extractKey should be the full file path, not just
the parent directory.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:56:26 +01:00
3cccfc0487 monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:44:13 +01:00
41d4226812 mcp: add Loki URL to lab-monitoring server config
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m8s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:18:39 +01:00
351fb6f720 docs: add lab-monitoring query reference to CLAUDE.md
Document Loki log query labels and patterns, and Prometheus job names
with example queries for the lab-monitoring MCP server.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:18:17 +01:00
7d92c55d37 docs: update for sops-to-openbao migration completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 18m17s
Update CLAUDE.md and README.md to reflect that secrets are now managed
by OpenBao, with sops only remaining for ca. Update migration plans
with sops cleanup checklist and auth01 decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:06:21 +01:00
6d117d68ca docs: move sops-to-openbao migration plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:45:42 +01:00
a46fbdaa70 Merge pull request 'sops-to-openbao-migration' (#19) from sops-to-openbao-migration into master
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reviewed-on: #19
2026-02-05 18:44:53 +00:00
2c9d86eaf2 vault-fetch: fix multiline secret values being truncated
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 16m11s
The read-based loop split multiline values on newlines, causing only
the first line to be written. Use jq -j to write each key's value
directly to files, preserving multiline content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:36:51 +01:00
ccb1c3fe2e terraform: auto-generate backup password instead of manual
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m19s
Remove backup_helper_secret variable and switch shared/backup/password
to auto_generate. New password will be added alongside existing restic
repository key.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:58:39 +01:00
0700033c0a secrets: migrate all hosts from sops to OpenBao vault
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.

Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:43:09 +01:00
4d33018285 docs: add ha1 memory recommendation to migration plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m28s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 17:48:45 +01:00
678fd3d6de docs: add systemd-exporter findings to monitoring gaps plan
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 10:19:33 +01:00
9d74aa5c04 docs: add zigbee sensor battery monitoring findings
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 09:21:54 +01:00
fe80ec3576 docs: add monitoring gaps audit plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 20m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:19:20 +01:00
870fb3e532 docs: add plan for remote access to homelab services
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:53:27 +01:00
e602e8d70b docs: add plan for prometheus scrape target labels
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:36:41 +01:00