84 Commits

Author SHA1 Message Date
26ca6817f0 homelab-deploy: enable prometheus metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m57s
- Update homelab-deploy input to get metrics support
- Enable metrics endpoint on port 9972
- Add scrape target for prometheus auto-discovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 08:04:23 +01:00
b03a9b3b64 docs: add long-term metrics storage plan
Compare VictoriaMetrics and Thanos as options for extending
metrics retention beyond 30 days while managing disk usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:56:10 +01:00
f805b9f629 mcp: add homelab-deploy MCP server
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:27:12 +01:00
f3adf7e77f CLAUDE.md: add homelab-deploy MCP documentation
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:25:44 +01:00
f6eca9decc vaulttest01: add htop for deploy verification test
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:23:22 +01:00
6e93b8eae3 Merge pull request 'add-deploy-homelab' (#28) from add-deploy-homelab into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Reviewed-on: #28
2026-02-07 05:56:51 +00:00
c214f8543c homelab: add deploy.enable option with assertion
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Run nix flake check / flake-check (pull_request) Successful in 2m7s
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
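A hedged sketch of the option-plus-assertion pattern this commit describes; the module shape and message wording are assumptions, only the option name and the vault.enable requirement come from the commit message.

```nix
# Hypothetical sketch of homelab.deploy.enable with the vault.enable
# assertion named in the commit above; exact repo code is assumed.
{ config, lib, ... }:
{
  options.homelab.deploy.enable =
    lib.mkEnableOption "the homelab-deploy listener on this host";

  config = lib.mkIf config.homelab.deploy.enable {
    assertions = [{
      assertion = config.vault.enable;
      message = "homelab.deploy.enable requires vault.enable = true";
    }];
  };
}
```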
7933127d77 system: enable homelab-deploy listener for all vault hosts
Add system/homelab-deploy.nix module that automatically enables the
listener on all hosts with vault.enable=true. Uses homelab.host.tier
and homelab.host.role for NATS subject subscriptions.

- Add homelab-deploy access to all host AppRole policies
- Remove manual listener config from vaulttest01 (now handled by system module)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
13c3897e86 flake: update homelab-deploy, add to devShell
Update homelab-deploy to include bugfix. Add CLI to devShell for
easier testing and deployment operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
0643f23281 vaulttest01: add vault secret dependency to listener
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m32s
Ensure homelab-deploy-listener waits for the NKey secret to be
fetched from Vault before starting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:29:29 +01:00
ad8570f8db homelab-deploy: add NATS-based deployment system
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m45s
Add homelab-deploy flake input and NixOS module for message-based
deployments across the fleet. Configure DEPLOY account in NATS with
tiered access control (listener, test-deployer, admin-deployer).
Enable listener on vaulttest01 as initial test host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:22:06 +01:00
2f195d26d3 Merge pull request 'homelab-host-module' (#27) from homelab-host-module into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m8s
Reviewed-on: #27
2026-02-07 01:56:38 +00:00
a926d34287 nix-cache01: set priority to high
All checks were successful
Run nix flake check / flake-check (pull_request) Successful in 2m14s
Run nix flake check / flake-check (push) Successful in 2m17s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:54:32 +01:00
be2421746e gitignore: add result-* for parallel nix builds
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m4s
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:51:27 +01:00
12bf0683f5 modules: add homelab.host for host metadata
Add a shared `homelab.host` module that provides host metadata for
multiple consumers:
- tier: deployment tier (test/prod) for future homelab-deploy service
- priority: alerting priority (high/low) for Prometheus label filtering
- role: primary role of the host (dns, database, monitoring, etc.)
- labels: free-form labels for additional metadata

Host configurations updated with appropriate values:
- ns1, ns2: role=dns with dns_role labels
- nix-cache01: priority=low, role=build-host
- vault01: role=vault
- jump: role=bastion
- template, template2, testvm01, vaulttest01: tier=test, priority=low

The module is now imported via commonModules in flake.nix, making it
available to all hosts including minimal configurations like template2.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:49:58 +01:00
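A hedged sketch of how a host might set these options, based only on the option names and values listed in this commit message; the exact syntax of the module in `modules/` is assumed.

```nix
# Hypothetical host snippet using the homelab.host options described above;
# attribute names follow the commit message, values are illustrative.
{
  homelab.host = {
    tier = "test";                  # deployment tier: test or prod
    priority = "low";               # alerting priority for Prometheus label filtering
    role = "dns";                   # primary role of the host
    labels.dns_role = "primary";    # free-form extra metadata
  };
}
```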
e8a43c6715 docs: add deploy_admin tool with opt-in flag to homelab-deploy plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
MCP exposes two tools:
- deploy: test-tier only, always available
- deploy_admin: all tiers, requires --enable-admin flag

Three security layers: CLI flag, NATS authz, Claude Code permissions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:29:13 +01:00
eef52bb8c5 docs: add group deployment support to homelab-deploy plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Support deploying to all hosts in a tier or all hosts with a role:
- deploy.<tier>.all - broadcast to all hosts in tier
- deploy.<tier>.role.<role> - broadcast to hosts with matching role

MCP can deploy to all test hosts at once; admin can deploy to any group.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:22:17 +01:00
c6cdbc6799 docs: move nixos-exporter plan to completed
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:13:14 +01:00
4d724329a6 docs: add homelab-deploy plan, unify host metadata
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add plan for NATS-based deployment service (homelab-deploy) that enables
on-demand NixOS configuration updates via messaging. Features tiered
permissions (test/prod) enforced at NATS layer.

Update prometheus-scrape-target-labels plan to share the homelab.host
module for host metadata (tier, priority, role, labels) - single source
of truth for both deployment tiers and prometheus labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:10:54 +01:00
881e70df27 monitoring: relax systemd_not_running alert threshold
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Increase duration from 5m to 10m and demote severity from critical to
warning. Brief degraded states during nixos-rebuild are normal and were
causing false positive alerts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:22:29 +01:00
b9a269d280 chore: rename metrics skill to observability, add logs reference
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Merge Prometheus metrics and Loki logs into a unified troubleshooting
skill. Adds LogQL query patterns, label reference, and common service
units for log searching.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:17:41 +01:00
fcf1a66103 chore: add metrics troubleshooting skill
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:11:40 +01:00
2034004280 flake: update nixos-exporter and set configurationRevision
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m33s
- Update nixos-exporter to 0.2.3
- Set system.configurationRevision for all hosts so the exporter
  can report the flake's git revision

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:06:47 +01:00
af43f88394 flake.lock: Update
Flake lock file updates:

• Updated input 'nixos-exporter':
    'git+https://git.t-juice.club/torjus/nixos-exporter?ref=refs/heads/master&rev=9c29505814954352b2af99b97910ee12a736b8dd' (2026-02-06)
  → 'git+https://git.t-juice.club/torjus/nixos-exporter?ref=refs/heads/master&rev=04eba77ac028033b6dfed604eb1b5664b46acc77' (2026-02-06)
2026-02-07 00:01:02 +00:00
a834497fe8 flake: update nixos-exporter input
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m27s
Periodic flake update / flake-update (push) Successful in 1m7s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 00:17:54 +01:00
d3de2a1511 Merge pull request 'monitoring: add nixos-exporter to all hosts' (#26) from nixos-exporter into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m6s
Reviewed-on: #26
2026-02-06 22:56:04 +00:00
97ff774d3f monitoring: add nixos-exporter to all hosts
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m16s
Run nix flake check / flake-check (pull_request) Successful in 3m14s
Add nixos-exporter prometheus exporter to track NixOS generation metrics
and flake revision status across all hosts.

Changes:
- Add nixos-exporter flake input
- Add commonModules list in flake.nix for modules shared by all hosts
- Enable nixos-exporter in system/monitoring/metrics.nix
- Configure Prometheus to scrape nixos-exporter on all hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 23:55:29 +01:00
f2c30cc24f chore: give claude the quick-plan skill
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m57s
2026-02-06 21:58:30 +01:00
7e80d2e0bc docs: add plans for nixos and homelab prometheus exporters
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:56:55 +01:00
1f5b7b13e2 monitoring: enable restart-count and ip-accounting collectors
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:30:47 +01:00
c53e36c3f3 Revert "monitoring: enable additional systemd-exporter collectors"
This reverts commit 04a252b857.
2026-02-06 21:30:05 +01:00
04a252b857 monitoring: enable additional systemd-exporter collectors
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enables restart-count, file-descriptor-size, and ip-accounting collectors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:28:44 +01:00
5d26f52e0d Revert "monitoring: enable cpu, memory, io collectors for systemd-exporter"
This reverts commit 506a692548.
2026-02-06 21:26:20 +01:00
506a692548 monitoring: enable cpu, memory, io collectors for systemd-exporter
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:23:19 +01:00
fa8f4f0784 docs: add notes about lib.getExe and not amending master
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:41:45 +01:00
025570dea1 monitoring: fix openbao token refresh timer not triggering
RemainAfterExit=true kept the service in "active" state, which
prevented OnUnitActiveSec from scheduling new triggers since there
was no new "activation" event. Removing it allows the service to
properly go inactive, enabling the timer to reschedule correctly.

Also fix ExecStart to use lib.getExe for proper path resolution
with writeShellApplication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:41:45 +01:00
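A minimal sketch of the corrected unit shape this commit describes, assuming a unit name for illustration: without `RemainAfterExit` the oneshot service goes inactive after each run, so `OnUnitActiveSec` gets a fresh activation to schedule against.

```nix
# Sketch of the fix, not the actual repo module. The service returns to
# "inactive" after each run, allowing the timer to re-trigger every 30m.
{ pkgs, lib, ... }:
let
  refreshToken = pkgs.writeShellApplication {
    name = "openbao-token-refresh";  # hypothetical script name
    text = ''
      echo "fetch a fresh AppRole token here"
    '';
  };
in
{
  systemd.services.openbao-token-refresh = {
    serviceConfig = {
      Type = "oneshot";
      # RemainAfterExit = true;  # removed: kept the unit permanently active
      ExecStart = lib.getExe refreshToken;  # proper bin/ path resolution
    };
  };
  systemd.timers.openbao-token-refresh = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnBootSec = "1m";
    timerConfig.OnUnitActiveSec = "30m";
  };
}
```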
15c00393f1 monitoring: increase zigbee_sensor_stale threshold to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m59s
Sensors report every ~45-50 minutes on average, so 1 hour was too tight.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:26:56 +01:00
787c14c7a6 docs: add dns_role label to scrape target labels plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Add proposed dns_role label to distinguish primary/secondary DNS
resolvers. This addresses the unbound_low_cache_hit_ratio alert
firing on ns2, which has a cold cache due to low traffic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:23:34 +01:00
eee3dde04f restic: add randomized delay to backup timers
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Backups to the shared restic repository were all scheduled at exactly
midnight, causing lock conflicts. Adding RandomizedDelaySec spreads
them out over a 2-hour window to prevent simultaneous access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:09:38 +01:00
682b07b977 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/bf922a59c5c9998a6584645f7d0de689512e444c?narHash=sha256-ksTL7P9QC1WfZasNlaAdLOzqD8x5EPyods69YBqxSfk%3D' (2026-02-04)
  → 'github:nixos/nixpkgs/00c21e4c93d963c50d4c0c89bfa84ed6e0694df2?narHash=sha256-AYqlWrX09%2BHvGs8zM6ebZ1pwUqjkfpnv8mewYwAo%2BiM%3D' (2026-02-04)
2026-02-06 00:01:04 +00:00
70661ac3d9 Merge pull request 'home-assistant: fix zigbee battery value_template override key' (#25) from fix-zigbee-battery-template into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Periodic flake update / flake-update (push) Successful in 1m11s
Reviewed-on: #25
2026-02-05 23:56:45 +00:00
506e93a5e2 home-assistant: fix zigbee battery value_template override key
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m39s
Run nix flake check / flake-check (pull_request) Failing after 12m37s
The homeassistant override key should match the entity type in the
MQTT discovery topic path. For battery sensors, the topic is
homeassistant/sensor/<device>/battery/config, so the key should be
"battery" not "sensor_battery".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:48:30 +01:00
b6c41aa910 system: add UTC suffix to MOTD commit timestamp
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:34:24 +01:00
aa6e00a327 Merge pull request 'add-nixos-rebuild-test' (#24) from add-nixos-rebuild-test into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Reviewed-on: #24
2026-02-05 23:26:34 +00:00
258e350b89 system: add MOTD banner with hostname and commit info
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m8s
Run nix flake check / flake-check (push) Failing after 3m53s
Displays FQDN and flake commit hash with timestamp on login.
Templates can override with their own MOTD via mkDefault.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:26:01 +01:00
eba195c192 docs: add nixos-rebuild-test usage to CLAUDE.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:19:49 +01:00
bbb22e588e system: replace writeShellScript with writeShellApplication
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m3s
Run nix flake check / flake-check (push) Failing after 5m57s
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:17:24 +01:00
879e7aba60 templates: use writeShellApplication for prepare-host script
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:14:05 +01:00
39a4ea98ab system: add nixos-rebuild-test helper script
Adds a helper script deployed to all hosts for testing feature branches.
Usage: nixos-rebuild-test <action> <branch>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:12:16 +01:00
1d90dc2181 Merge pull request 'monitoring: use AppRole token for OpenBao metrics scraping' (#23) from fix-prometheus-openbao-token into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m21s
Reviewed-on: #23
2026-02-05 22:52:42 +00:00
e9857afc11 monitoring: use AppRole token for OpenBao metrics scraping
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m12s
Run nix flake check / flake-check (pull_request) Successful in 2m19s
Instead of creating a long-lived Vault token in Terraform (which gets
invalidated when Terraform recreates it), monitoring01 now uses its
existing AppRole credentials to fetch a fresh token for Prometheus.

Changes:
- Add prometheus-metrics policy to monitoring01's AppRole
- Remove vault_token.prometheus_metrics resource from Terraform
- Remove openbao-token KV secret from Terraform
- Add systemd service to fetch AppRole token on boot
- Add systemd timer to refresh token every 30 minutes

This ensures Prometheus always has a valid token without depending on
Terraform state or manual intervention.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:51:11 +01:00
88e9036cb4 Merge pull request 'auth01: decommission host and remove authelia/lldap services' (#22) from decommission-auth01 into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Reviewed-on: #22
2026-02-05 22:37:38 +00:00
59e1962d75 auth01: decommission host and remove authelia/lldap services
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 18m1s
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.

Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:35:45 +01:00
3dc4422ba0 docs: add NAS integration notes to auth plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Document TrueNAS CORE LDAP integration approach (NFS-only) and
future NixOS NAS migration path with native Kanidm PAM/NSS.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:24:37 +01:00
f0963624bc docs: add auth system replacement plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Evaluate options for replacing LLDAP+Authelia with a unified auth solution.
Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:18:38 +01:00
7b46f94e48 Merge pull request 'zigbee-battery-fix' (#21) from zigbee-battery-fix into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #21
2026-02-05 21:51:41 +00:00
32968147b5 docs: move zigbee battery plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m17s
Run nix flake check / flake-check (pull_request) Successful in 2m19s
Updated plan with:
- Full device inventory from ha1
- Backup verification details
- Branch and commit references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:49:49 +01:00
c515a6b4e1 home-assistant: fix zigbee sensor battery reporting
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.

Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).

Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:41:07 +01:00
4d8b94ce83 monitoring: add collector flags to nats exporter
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m53s
The exporter requires explicit collector flags to specify what
metrics to collect.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:23:30 +01:00
8b0a4ea33a monitoring: use nats exporter instead of direct scrape
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
NATS HTTP monitoring endpoint serves JSON, not Prometheus format.
Use the prometheus-nats-exporter which queries the NATS endpoint
and exposes proper Prometheus metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:22:04 +01:00
5be1f43c24 Merge pull request 'monitoring-gaps-implementation' (#20) from monitoring-gaps-implementation into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #20
2026-02-05 20:57:31 +00:00
b322b1156b monitoring: fix openbao token output path
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m17s
Run nix flake check / flake-check (push) Failing after 8m57s
The outputDir with extractKey should be the full file path, not just
the parent directory.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:56:26 +01:00
3cccfc0487 monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:44:13 +01:00
41d4226812 mcp: add Loki URL to lab-monitoring server config
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m8s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:18:39 +01:00
351fb6f720 docs: add lab-monitoring query reference to CLAUDE.md
Document Loki log query labels and patterns, and Prometheus job names
with example queries for the lab-monitoring MCP server.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:18:17 +01:00
7d92c55d37 docs: update for sops-to-openbao migration completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 18m17s
Update CLAUDE.md and README.md to reflect that secrets are now managed
by OpenBao, with sops only remaining for ca. Update migration plans
with sops cleanup checklist and auth01 decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:06:21 +01:00
6d117d68ca docs: move sops-to-openbao migration plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:45:42 +01:00
a46fbdaa70 Merge pull request 'sops-to-openbao-migration' (#19) from sops-to-openbao-migration into master
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reviewed-on: #19
2026-02-05 18:44:53 +00:00
2c9d86eaf2 vault-fetch: fix multiline secret values being truncated
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 16m11s
The read-based loop split multiline values on newlines, causing only
the first line to be written. Use jq -j to write each key's value
directly to files, preserving multiline content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:36:51 +01:00
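A hedged sketch of the corrected write loop this commit describes. The Vault KV v2 JSON shape (`.data.data`), script name, and paths are illustrative assumptions; the `jq -j` behavior (raw output, no trailing newline) is the actual mechanism named in the commit.

```nix
# Sketch only, not the real vault-fetch code: iterate over keys, then let
# jq -j write each value verbatim so multiline secrets are not truncated.
pkgs.writeShellApplication {
  name = "vault-fetch-sketch";
  runtimeInputs = [ pkgs.jq ];
  text = ''
    secret_json="$1"  # response from the KV v2 API (assumed shape)
    out_dir="$2"
    jq -r '.data.data | keys[]' "$secret_json" | while IFS= read -r key; do
      # -j prints raw output without appending a newline, so values that
      # contain newlines are written to the file verbatim, not split
      jq -j --arg k "$key" '.data.data[$k]' "$secret_json" > "$out_dir/$key"
    done
  '';
}
```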
ccb1c3fe2e terraform: auto-generate backup password instead of manual
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m19s
Remove backup_helper_secret variable and switch shared/backup/password
to auto_generate. New password will be added alongside existing restic
repository key.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:58:39 +01:00
0700033c0a secrets: migrate all hosts from sops to OpenBao vault
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.

Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:43:09 +01:00
4d33018285 docs: add ha1 memory recommendation to migration plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m28s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 17:48:45 +01:00
678fd3d6de docs: add systemd-exporter findings to monitoring gaps plan
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 10:19:33 +01:00
9d74aa5c04 docs: add zigbee sensor battery monitoring findings
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 09:21:54 +01:00
fe80ec3576 docs: add monitoring gaps audit plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 20m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:19:20 +01:00
870fb3e532 docs: add plan for remote access to homelab services
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:53:27 +01:00
e602e8d70b docs: add plan for prometheus scrape target labels
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:36:41 +01:00
28b8d7c115 monitoring: increase high_cpu_load duration for nix-cache01 to 2h
nix-cache01 regularly hits high CPU during nix builds, causing flappy
alerts. Keep the 15m threshold for all other hosts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:28:48 +01:00
64f2688349 nix: configure gc to delete generations older than 14d
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m27s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:21:19 +01:00
09d9d71e2b docs: note to establish hostname naming conventions before migration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:04:58 +01:00
cc799f5929 docs: note USB passthrough requirement for ha1 migration
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:02:14 +01:00
0abdda8e8a docs: add plan for migrating existing hosts to opentofu
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m28s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:59:51 +01:00
4076361bf7 Merge pull request 'hosts: remove decommissioned media1, ns3, ns4, nixos-test1' (#18) from host-cleanup into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m36s
Reviewed-on: #18
2026-02-05 00:38:56 +00:00
0ef63ad874 hosts: remove decommissioned media1, ns3, ns4, nixos-test1
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m47s
Run nix flake check / flake-check (pull_request) Successful in 3m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:36:57 +01:00
87 changed files with 3539 additions and 1131 deletions


@@ -0,0 +1,250 @@
---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---
# Observability Troubleshooting Guide
Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.
## Available Tools
Use the `lab-monitoring` MCP server tools:
**Metrics:**
- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts
**Logs:**
- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label
---
## Logs Reference
### Label Reference
Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
- `filename` - For `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams
### Log Format
Journal logs are JSON-formatted. Key fields:
- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name
### Basic LogQL Queries
**Logs from a specific service on a host:**
```logql
{host="ns1", systemd_unit="nsd.service"}
```
**All logs from a host:**
```logql
{host="monitoring01"}
```
**Logs from a service across all hosts:**
```logql
{systemd_unit="nixos-upgrade.service"}
```
**Substring matching (case-sensitive):**
```logql
{host="ha1"} |= "error"
```
**Exclude pattern:**
```logql
{host="ns1"} != "routine"
```
**Regex matching:**
```logql
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```
**File-based logs (caddy access logs, etc):**
```logql
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```
### Time Ranges
Default lookback is 1 hour. Use `start` parameter for older logs:
- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days
### Common Services
Useful systemd units for troubleshooting:
- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection
### Extracting JSON Fields
Parse JSON and filter on fields:
```logql
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
```
---
## Metrics Reference
### Deployment & Version Status
Check which NixOS revision hosts are running:
```promql
nixos_flake_info
```
Labels:
- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)
Check if hosts are behind on updates:
```promql
nixos_flake_revision_behind == 1
```
View flake input versions:
```promql
nixos_flake_input_info
```
Labels: `input` (name), `rev` (revision), `type` (git/github)
Check flake input age:
```promql
nixos_flake_input_age_seconds / 86400
```
Returns age in days for each flake input.
### System Health
Basic host availability:
```promql
up{job="node-exporter"}
```
CPU usage by host:
```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
Memory usage:
```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```
Disk space (root filesystem):
```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
### Service-Specific Metrics
Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA
### Instance Label Format
The `instance` label uses FQDN format:
```
<hostname>.home.2rjus.net:<port>
```
Example queries filtering by host:
```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
```
---
## Troubleshooting Workflows
### Check Deployment Status Across Fleet
1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`
### Investigate Service Issues
1. Check `up{job="<service>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
### After Deploying Changes
1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check service metrics are being scraped
### Debug SSH/Access Issues
```logql
{host="<host>", systemd_unit="sshd.service"}
```
### Check Recent Upgrades
```logql
{systemd_unit="nixos-upgrade.service"}
```
With `start: "24h"` to see last 24 hours of upgrades across all hosts.
---
## Notes
- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port; use regex matching (`=~`) for hostname-only filters
- Journal log lines are JSON; the `MESSAGE` field contains the actual log content


@@ -0,0 +1,89 @@
---
name: quick-plan
description: Create a planning document for a future homelab project. Use when the user wants to document ideas for future work without implementing immediately.
argument-hint: [topic or feature to plan]
---
# Quick Plan Generator
Create a planning document for a future homelab infrastructure project. Plans are for documenting ideas and approaches that will be implemented later, not immediately.
## Input
The user provides: $ARGUMENTS
## Process
1. **Understand the topic**: Research the codebase to understand:
- Current state of related systems
- Existing patterns and conventions
- Relevant NixOS options or packages
- Any constraints or dependencies
2. **Evaluate options**: If there are multiple approaches, research and compare them with pros/cons.
3. **Draft the plan**: Create a markdown document following the structure below.
4. **Save the plan**: Write to `docs/plans/<topic-slug>.md` using a kebab-case filename derived from the topic.
## Plan Structure
Use these sections as appropriate (not all plans need every section):
```markdown
# Title
## Overview/Goal
Brief description of what this plan addresses and why.
## Current State
What exists today that's relevant to this plan.
## Options Evaluated (if multiple approaches)
For each option:
- **Option Name**
- **Pros:** bullet points
- **Cons:** bullet points
- **Verdict:** brief assessment
Or use a comparison table for structured evaluation.
## Recommendation/Decision
What approach is recommended and why. Include rationale.
## Implementation Steps
Numbered phases or steps. Be specific but not overly detailed.
Can use sub-sections for major phases.
## Open Questions
Things still to be determined. Use checkbox format:
- [ ] Question 1?
- [ ] Question 2?
## Notes (optional)
Additional context, caveats, or references.
```
## Style Guidelines
- **Concise**: Use bullet points, avoid verbose paragraphs
- **Technical but accessible**: Include NixOS config snippets when relevant
- **Future-oriented**: These are plans, not specifications
- **Acknowledge uncertainty**: Use "Open Questions" for unresolved decisions
- **Reference existing patterns**: Mention how this fits with existing infrastructure
- **Tables for comparisons**: Use markdown tables when comparing options
- **Practical focus**: Emphasize what needs to happen, not theory
## Examples of Good Plans
Reference these existing plans for style guidance:
- `docs/plans/auth-system-replacement.md` - Good option evaluation with table
- `docs/plans/truenas-migration.md` - Good decision documentation with rationale
- `docs/plans/remote-access.md` - Good multi-option comparison
- `docs/plans/prometheus-scrape-target-labels.md` - Good implementation detail level
## After Creating the Plan
1. Tell the user the plan was saved to `docs/plans/<filename>.md`
2. Summarize the key points
3. Ask if they want any adjustments before committing

.gitignore

@@ -1,5 +1,6 @@
.direnv/
result
result-*
# Terraform/OpenTofu
terraform/.terraform/


@@ -19,8 +19,20 @@
"args": ["run", "git+https://git.t-juice.club/torjus/labmcp#lab-monitoring", "--", "serve", "--enable-silences"], "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#lab-monitoring", "--", "serve", "--enable-silences"],
"env": { "env": {
"PROMETHEUS_URL": "https://prometheus.home.2rjus.net", "PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net" "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
"LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
} }
},
"homelab-deploy": {
"command": "nix",
"args": [
"run",
"git+https://git.t-juice.club/torjus/homelab-deploy",
"--",
"mcp",
"--nats-url", "nats://nats1.home.2rjus.net:4222",
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
]
} }
} }
} }

.sops.yaml

@@ -2,11 +2,7 @@ keys:
- &admin_torjus age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u - &admin_torjus age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
- &server_ns1 age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0 - &server_ns1 age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
- &server_ns2 age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um - &server_ns2 age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
- &server_ns3 age1snmhmpavqy7xddmw4nuny0u4xusqmnqxqarjmghkm5zaluff84eq5xatrd
- &server_ns4 age12a3nyvjs8jrwmpkf3tgawel3nwcklwsr35ktmytnvhpawqwzrsfqpgcy0q
- &server_ha1 age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l - &server_ha1 age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
- &server_nixos-test1 age1gcyfkxh4fq5zdp0dh484aj82ksz66wrly7qhnpv0r0p576sn9ekse8e9ju
- &server_inc1 age1g5luz2rtel3surgzuh62rkvtey7lythrvfenyq954vmeyfpxjqkqdj3wt8
- &server_http-proxy age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m - &server_http-proxy age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
- &server_ca age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk - &server_ca age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
- &server_monitoring01 age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey - &server_monitoring01 age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
@@ -14,7 +10,6 @@ keys:
- &server_nix-cache01 age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq - &server_nix-cache01 age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
- &server_pgdb1 age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv - &server_pgdb1 age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
- &server_nats1 age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga - &server_nats1 age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
- &server_auth01 age16prza00sqzuhwwcyakj6z4hvwkruwkqpmmrsn94a5ucgpkelncdq2ldctk
creation_rules: creation_rules:
- path_regex: secrets/[^/]+\.(yaml|json|env|ini) - path_regex: secrets/[^/]+\.(yaml|json|env|ini)
key_groups: key_groups:
@@ -22,11 +17,7 @@ creation_rules:
- *admin_torjus - *admin_torjus
- *server_ns1 - *server_ns1
- *server_ns2 - *server_ns2
- *server_ns3
- *server_ns4
- *server_ha1 - *server_ha1
- *server_nixos-test1
- *server_inc1
- *server_http-proxy - *server_http-proxy
- *server_ca - *server_ca
- *server_monitoring01 - *server_monitoring01
@@ -34,12 +25,6 @@ creation_rules:
- *server_nix-cache01 - *server_nix-cache01
- *server_pgdb1 - *server_pgdb1
- *server_nats1 - *server_nats1
- *server_auth01
- path_regex: secrets/ns3/[^/]+\.(yaml|json|env|ini)
key_groups:
- age:
- *admin_torjus
- *server_ns3
- path_regex: secrets/ca/[^/]+\.(yaml|json|env|ini|) - path_regex: secrets/ca/[^/]+\.(yaml|json|env|ini|)
key_groups: key_groups:
- age: - age:
@@ -65,8 +50,3 @@ creation_rules:
- age: - age:
- *admin_torjus - *admin_torjus
- *server_http-proxy - *server_http-proxy
- path_regex: secrets/auth01/[^/]+\.(yaml|json|env|ini|)
key_groups:
- age:
- *admin_torjus
- *server_auth01

CLAUDE.md

@@ -35,6 +35,21 @@ nix build .#create-host
Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.
### Testing Feature Branches on Hosts
All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
```bash
# On the target host, test a feature branch
nixos-rebuild-test boot <branch-name>
nixos-rebuild-test switch <branch-name>
# Additional arguments are passed through to nixos-rebuild
nixos-rebuild-test boot my-feature --show-trace
```
When working on a feature branch that requires testing on a live host, suggest using this command instead of the full flake URL syntax.
### Flake Management

```bash
@@ -52,12 +67,19 @@ nix develop
### Secrets Management

-Secrets are handled by sops. Do not edit any `.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.

Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
`vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
Terraform manages the secrets and AppRole policies in `terraform/vault/`.
Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
`.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
### Git Workflow

**Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.
**Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.
When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).

### Plan Management
@@ -110,6 +132,113 @@ Two MCP servers are available for searching NixOS options and packages:
This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
### Lab Monitoring Log Queries
The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
**Loki Label Reference:**
- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
**Example LogQL queries:**
```
# Logs from a specific service on a host
{host="ns2", systemd_unit="nsd.service"}
# Substring match on log content
{host="ns1", systemd_unit="nsd.service"} |= "error"
# File-based logs (e.g., caddy access logs)
{job="varlog", hostname="nix-cache01"}
```
Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
### Lab Monitoring Prometheus Queries
The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.
**Prometheus Job Names:**
- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics
- `jellyfin` - Media server metrics
- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
- `step-ca` - Internal CA metrics
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
**Example PromQL queries:**
```
# Check all targets are up
up
# CPU usage for a specific host
rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
# Memory usage across all hosts
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
# Disk space
node_filesystem_avail_bytes{mountpoint="/"}
```
### Deploying to Test Hosts
The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.
**Available Tools:**
- `deploy` - Deploy NixOS configuration to test-tier hosts
- `list_hosts` - List available deployment targets
**Deploy Parameters:**
- `hostname` - Target a specific host (e.g., `vaulttest01`)
- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
- `all` - Deploy to all test-tier hosts
- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
- `branch` - Git branch or commit to deploy (default: `master`)
**Examples:**
```
# List available hosts
list_hosts()
# Deploy to a specific host
deploy(hostname="vaulttest01", action="switch")
# Dry-run deployment
deploy(hostname="vaulttest01", action="dry-activate")
# Deploy to all hosts with a role
deploy(role="vault", action="switch")
```
**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
**Verifying Deployments:**
After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
```promql
nixos_flake_info{instance=~"vaulttest01.*"}
```
The `current_rev` label contains the git commit hash of the deployed flake configuration.
## Architecture

### Directory Structure
@@ -119,7 +248,7 @@ This ensures documentation matches the exact nixpkgs version (currently NixOS 25
  - `default.nix` - Entry point, imports configuration.nix and services
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
-  - Core modules: nix.nix, sshd.nix, sops.nix, acme.nix, autoupgrade.nix
  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
  - Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
@@ -131,13 +260,13 @@ This ensures documentation matches the exact nixpkgs version (currently NixOS 25
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
-- `/secrets/` - SOPS-encrypted secrets with age encryption
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
- `/common/` - Shared configurations (e.g., VM guest agent)
- `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management
-- `/.sops.yaml` - SOPS configuration with age keys for all servers
- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)
@@ -153,7 +282,7 @@ hosts/<hostname>/default.nix
All hosts automatically get:
- Nix binary cache (nix-cache.home.2rjus.net)
- SSH with root login enabled
-- SOPS secrets management with auto-generated age keys
- OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (ca.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
@@ -173,17 +302,15 @@ Production servers managed by `rebuild-all.sh`:
- `nix-cache01` - Binary cache server
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server
-- `auth01` - Authentication service

Template/test hosts:
- `template1` - Base template for cloning new hosts
-- `nixos-test1` - Test environment
### Flake Inputs

- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
-- `sops-nix` - Secrets management
- `sops-nix` - Secrets management (legacy, only used by ca)
- Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
  - `labmon` - Lab monitoring
@@ -199,12 +326,21 @@ Template/test hosts:
### Secrets Management

-- Uses SOPS with age encryption
-- Each server has unique age key in `.sops.yaml`
-- Keys auto-generated at `/var/lib/sops-nix/key.txt` on first boot

Most hosts use OpenBao (Vault) for secrets:
- Vault server at `vault01.home.2rjus.net:8200`
- AppRole authentication with credentials at `/var/lib/vault/approle/`
- Secrets defined in Terraform (`terraform/vault/secrets.tf`)
- AppRole policies in Terraform (`terraform/vault/approle.tf`)
- NixOS module: `system/vault-secrets.nix` with `vault.secrets.<name>` options
- `extractKey` option extracts a single key from vault JSON as a plain file
- Secrets fetched at boot by `vault-secret-<name>.service` systemd units
- Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
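A hedged sketch of what a `vault.secrets` entry might look like; `vault.secrets.<name>` and `extractKey` come from the documentation above, while the remaining attribute names and the KV path format are assumptions:

```nix
# Hypothetical vault.secrets usage; only the option names cited above
# are from the docs, the rest is illustrative.
{
  vault.enable = true;
  vault.secrets.backup-password = {
    path = "shared/backup/password";   # assumed KV path format
    extractKey = "password";           # write this single key as a plain file
  };
  # Fetched at boot by vault-secret-backup-password.service, falling back
  # to the cache under /var/lib/vault/cache/ if Vault is unreachable.
}
```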
Legacy SOPS (only used by `ca` host):
- SOPS with age encryption, keys in `.sops.yaml`
- Shared secrets: `/secrets/secrets.yaml`
- Per-host secrets: `/secrets/<hostname>/`
-- All production servers can decrypt shared secrets; host-specific secrets require specific host keys
### Auto-Upgrade System
@@ -304,13 +440,15 @@ This means:
3. Add host entry to `flake.nix` nixosConfigurations
4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
-6. User clones template host
-7. User runs `prepare-host.sh` on new host, this deletes files which should be regenerated, like ssh host keys, machine-id etc. It also creates a new age key, and prints the public key
-8. This key is then added to `.sops.yaml`
-9. Create `/secrets/<hostname>/` if needed
-10. Commit changes, and merge to master.
-11. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
-12. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
6. Add `vault.enable = true;` to the host configuration
7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
8. Run `tofu apply` in `terraform/vault/`
9. User clones template host
10. User runs `prepare-host.sh` on new host
11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
12. Commit changes, and merge to master.
13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
@@ -326,6 +464,8 @@ This means:
**Firewall**: Disabled on most hosts (trusted network). Enable selectively in host configuration if needed.
**Shell scripts**: Use `pkgs.writeShellApplication` instead of `pkgs.writeShellScript` or `pkgs.writeShellScriptBin` for creating shell scripts. `writeShellApplication` provides automatic shellcheck validation, sets strict bash options (`set -euo pipefail`), and allows declaring `runtimeInputs` for dependencies. When referencing the executable path (e.g., in `ExecStart`), use `lib.getExe myScript` to get the proper `bin/` path.
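A minimal sketch of the pattern described above; the script and unit names are illustrative, not from this repo, while `writeShellApplication` and `lib.getExe` are the real nixpkgs helpers:

```nix
# writeShellApplication gives shellcheck validation and `set -euo pipefail`
# for free; lib.getExe resolves the proper bin/ path for ExecStart.
{ pkgs, lib, ... }:
let
  cleanupTmp = pkgs.writeShellApplication {
    name = "cleanup-tmp";                # hypothetical example script
    runtimeInputs = [ pkgs.findutils ];  # declared dependencies end up on PATH
    text = ''
      find /tmp -type f -mtime +7 -delete
    '';
  };
in
{
  systemd.services.cleanup-tmp = {
    serviceConfig.Type = "oneshot";
    serviceConfig.ExecStart = lib.getExe cleanupTmp;
  };
}
```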
### Monitoring Stack

All hosts ship metrics and logs to `monitoring01`:

README.md

@@ -7,7 +7,6 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
| Host | Role |
|------|------|
| `ns1`, `ns2` | Primary/secondary authoritative DNS |
-| `ns3`, `ns4` | Additional DNS servers |
| `ca` | Internal Certificate Authority |
| `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
| `http-proxy` | Reverse proxy |
@@ -16,9 +15,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
| `nix-cache01` | Nix binary cache |
| `pgdb1` | PostgreSQL |
| `nats1` | NATS messaging |
| `auth01` | Authentication (LLDAP + Authelia) |
| `vault01` | OpenBao (Vault) secrets management |
| `media1` | Media services |
| `template1`, `template2` | VM templates for cloning new hosts |
## Directory Structure
@@ -30,7 +27,7 @@ system/ # Shared modules applied to ALL hosts
services/        # Reusable service modules, selectively imported per host
modules/         # Custom NixOS module definitions
lib/             # Nix library functions (DNS zone generation, etc.)
secrets/         # SOPS-encrypted secrets (legacy, only used by ca)
common/          # Shared configurations (e.g., VM guest agent)
terraform/       # OpenTofu configs for Proxmox VM provisioning
terraform/vault/ # OpenTofu configs for OpenBao (secrets, PKI, AppRoles)
@@ -42,7 +39,7 @@ scripts/ # Helper scripts (create-host, vault-fetch)
**Automatic DNS zone generation** - A records are derived from each host's static IP configuration. CNAME aliases are defined via `homelab.dns.cnames`. No manual zone file editing required.
**OpenBao (Vault) secrets** - Hosts authenticate via AppRole and fetch secrets at boot. Secrets and policies are managed as code in `terraform/vault/`. Legacy SOPS remains only for the `ca` host.
**Daily auto-upgrades** - All hosts pull from the master branch and automatically rebuild and reboot on a randomized schedule.

View File

@@ -0,0 +1,192 @@
# Authentication System Replacement Plan
## Overview
Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
## Goals
1. **Central user database** - Manage users across all homelab hosts from a single source
2. **Linux PAM/NSS integration** - Users can SSH into hosts using central credentials
3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)
## Options Evaluated
### OpenLDAP (raw)
- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
- **Pros:** Most widely supported, very flexible
- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
- **Verdict:** Doesn't address LDAP complexity concerns
### LLDAP + Authelia (current)
- **NixOS Support:** Both have good modules
- **Pros:** Already configured, lightweight, nice web UIs
- **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
- **Verdict:** Workable but has friction for NAS/UID goals
### FreeIPA
- **NixOS Support:** None
- **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
- **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
- **Verdict:** Overkill, no NixOS support
### Keycloak
- **NixOS Support:** None
- **Pros:** Good OIDC/SAML, nice UI
- **Cons:** Primarily an identity broker not a user directory, poor POSIX support, heavy (Java)
- **Verdict:** Wrong tool for Linux user management
### Authentik
- **NixOS Support:** None (would need Docker)
- **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
- **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
- **Verdict:** Would work but requires Docker and is heavy
### Kanidm
- **NixOS Support:** Excellent - first-class module with PAM/NSS integration
- **Pros:**
- Native PAM/NSS module (no SSSD needed)
- Built-in OIDC provider
- Optional LDAP interface for legacy services
- Declarative provisioning via NixOS (users, groups, OAuth2 clients)
- Modern, written in Rust
- Single service handles everything
- **Cons:** Newer project, smaller community than LDAP
- **Verdict:** Best fit for requirements
### Pocket-ID
- **NixOS Support:** Unknown
- **Pros:** Very lightweight, passkey-first
- **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
- **Verdict:** Doesn't solve Linux user management goal
## Recommendation: Kanidm
Kanidm is the recommended solution for the following reasons:
| Requirement | Kanidm Support |
|-------------|----------------|
| Central user database | Native |
| Linux PAM/NSS (host login) | Native NixOS module |
| UID/GID for NAS | POSIX attributes supported |
| OIDC for services | Built-in |
| Declarative config | Excellent NixOS provisioning |
| Simplicity | Modern API, LDAP optional |
| NixOS integration | First-class |
### Key NixOS Features
**Server configuration:**
```nix
services.kanidm.enableServer = true;
services.kanidm.serverSettings = {
domain = "home.2rjus.net";
origin = "https://auth.home.2rjus.net";
ldapbindaddress = "0.0.0.0:636"; # Optional LDAP interface
};
```
**Declarative user provisioning:**
```nix
services.kanidm.provision.enable = true;
services.kanidm.provision.persons.torjus = {
displayName = "Torjus";
groups = [ "admins" "nas-users" ];
};
```
**Declarative OAuth2 clients:**
```nix
services.kanidm.provision.systems.oauth2.grafana = {
displayName = "Grafana";
originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
originLanding = "https://grafana.home.2rjus.net";
};
```
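For the relying-party side of this example, the matching Grafana settings might look like the sketch below; the endpoint paths follow Kanidm's documented OIDC URL layout, and client-secret handling is omitted:
```nix
# Hedged sketch of the Grafana half of the OAuth2 example above; verify
# endpoint paths against the Kanidm docs before use.
services.grafana.settings."auth.generic_oauth" = {
  enabled = true;
  name = "Kanidm";
  client_id = "grafana";
  scopes = "openid profile email";
  auth_url = "https://auth.home.2rjus.net/ui/oauth2";
  token_url = "https://auth.home.2rjus.net/oauth2/token";
  api_url = "https://auth.home.2rjus.net/oauth2/openid/grafana/userinfo";
};
```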
**Client host configuration (add to system/):**
```nix
services.kanidm.enableClient = true;
services.kanidm.enablePam = true;
services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
```
## NAS Integration
### Current: TrueNAS CORE (FreeBSD)
TrueNAS CORE has a built-in LDAP client. Kanidm's read-only LDAP interface will work for NFS share permissions:
- **NFS shares**: Only need consistent UID/GID mapping - Kanidm's LDAP provides this
- **No SMB requirement**: SMB would need Samba schema attributes (deprecated in TrueNAS 13.0+), but we're NFS-only
Configuration approach:
1. Enable Kanidm's LDAP interface (`ldapbindaddress = "0.0.0.0:636"`)
2. Import internal CA certificate into TrueNAS
3. Configure TrueNAS LDAP client with Kanidm's Base DN and bind credentials
4. Users/groups appear in TrueNAS permission dropdowns
Note: Kanidm's LDAP is read-only and uses LDAPS only (no StartTLS). This is fine for our use case.
### Future: NixOS NAS
When the NAS is migrated to NixOS, it becomes a first-class citizen:
- Native Kanidm PAM/NSS integration (same as other hosts)
- No LDAP compatibility layer needed
- Full integration with the rest of the homelab
This future migration path is a strong argument for Kanidm over LDAP-only solutions.
## Implementation Steps
1. **Create Kanidm service module** in `services/kanidm/`
- Server configuration
- TLS via internal ACME
- Vault secrets for admin passwords
2. **Configure declarative provisioning**
- Define initial users and groups
- Set up POSIX attributes (UID/GID ranges)
3. **Add OIDC clients** for homelab services
- Grafana
- Other services as needed
4. **Create client module** in `system/` for PAM/NSS
- Enable on all hosts that need central auth
- Configure trusted CA
5. **Test NAS integration**
- Configure TrueNAS LDAP client to connect to Kanidm
- Verify UID/GID mapping works with NFS shares
6. **Migrate auth01**
- Remove LLDAP and Authelia services
- Deploy Kanidm
- Update DNS CNAMEs if needed
7. **Documentation**
- User management procedures
- Adding new OAuth2 clients
- Troubleshooting PAM/NSS issues
## Open Questions
- What UID/GID range should be reserved for Kanidm-managed users?
- Which hosts should have PAM/NSS enabled initially?
- What OAuth2 clients are needed at launch?
## References
- [Kanidm Documentation](https://kanidm.github.io/kanidm/stable/)
- [NixOS Kanidm Module](https://search.nixos.org/options?query=services.kanidm)
- [Kanidm PAM/NSS Integration](https://kanidm.github.io/kanidm/stable/pam_and_nsswitch.html)

View File

@@ -0,0 +1,128 @@
# Monitoring Gaps Audit
## Overview
Audit of services running in the homelab that lack monitoring coverage, either missing Prometheus scrape targets, alerting rules, or both.
## Services with No Monitoring
### PostgreSQL (`pgdb1`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** A database outage would go completely unnoticed by Prometheus
- **Recommendation:** Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.
### Authelia (`auth01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** The authentication gateway being down blocks access to all proxied services
- **Recommendation:** Authelia exposes Prometheus metrics natively at `/metrics`. Add a scrape target and at minimum an `authelia_down` systemd unit state alert.
### LLDAP (`auth01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
- **Recommendation:** Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
### Vault / OpenBao (`vault01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** Secrets management service failures go undetected
- **Recommendation:** OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.
### Gitea Actions Runner
- **Current state:** No scrape targets, no alert rules
- **Risk:** CI/CD failures go undetected
- **Recommendation:** Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.
## Services with Partial Monitoring
### Jellyfin (`jelly01`)
- **Current state:** Has scrape targets (port 8096), metrics are being collected, but zero alert rules
- **Metrics available:** 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
- `microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
- `microsoft_aspnetcore_hosting_current_requests` - in-flight requests
- `process_working_set_bytes` - memory usage (~256 MB currently)
- `dotnet_gc_pause_ratio` - GC pressure
- `up{job="jellyfin"}` - basic availability
- **Recommendation:** Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on sustained `failed_requests` rate increase.
### NATS (`nats1`)
- **Current state:** Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics
- **Metrics available:** NATS has a built-in `/metrics` endpoint exposing connection counts, message throughput, JetStream consumer lag, and more
- **Recommendation:** Add a scrape target for the NATS metrics endpoint. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.
### DNS - Unbound (`ns1`, `ns2`)
- **Current state:** Has `unbound_down` alert (systemd unit state), but no DNS query metrics
- **Available in nixpkgs:** `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
- **Recommendation:** Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
### DNS - NSD (`ns1`, `ns2`)
- **Current state:** Has `nsd_down` alert (systemd unit state), no NSD-specific metrics
- **Available in nixpkgs:** Nothing. No exporter package or NixOS module. Community `nsd_exporter` exists but is not packaged.
- **Recommendation:** The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.
## Existing Monitoring (for reference)
These services have adequate alerting and/or scrape targets:
| Service | Scrape Targets | Alert Rules |
|---|---|---|
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |
## Per-Service Resource Metrics (systemd-exporter)
### Current State
No per-service CPU, memory, or IO metrics are collected. The existing node-exporter systemd collector only provides unit state (active/inactive/failed), socket stats, and timer triggers. While systemd tracks per-unit resource usage via cgroups internally (visible in `systemctl status` and `systemd-cgtop`), this data is not exported to Prometheus.
### Available Solution
The `prometheus-systemd-exporter` package (v0.7.0) is available in nixpkgs with a ready-made NixOS module:
```nix
services.prometheus.exporters.systemd.enable = true;
```
**Options:** `enable`, `port`, `extraFlags`, `user`, `group`
This exporter reads cgroup data and exposes per-unit metrics including:
- CPU seconds consumed per service
- Memory usage per service
- Task/process counts per service
- Restart counts
- IO usage
### Recommendation
Enable on all hosts via the shared `system/` config (same pattern as node-exporter). Add a corresponding scrape job on monitoring01. This would give visibility into resource consumption per service across the fleet, useful for capacity planning and diagnosing noisy-neighbor issues on shared hosts.
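A sketch of the per-host half, assuming the nixpkgs module defaults; the scrape job on monitoring01 would point at the same port:
```nix
# Sketch of fleet-wide enablement in system/ (same pattern as node-exporter).
services.prometheus.exporters.systemd = {
  enable = true;
  port = 9558; # exporter default
};
```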
## Suggested Priority
1. **PostgreSQL** - Critical infrastructure, easy to add with existing nixpkgs module
2. **Authelia + LLDAP** - Auth outage affects all proxied services
3. **Unbound exporter** - Ready-to-go NixOS module, just needs enabling
4. **Jellyfin alerts** - Metrics already collected, just needs alert rules
5. **NATS metrics** - Built-in endpoint, just needs a scrape target
6. **Vault/OpenBao** - Native telemetry support
7. **Actions Runner** - Lower priority, basic systemd alert sufficient
## Node-Exporter Targets Currently Down
Noted during audit -- these node-exporter targets are failing:
- `nixos-test1.home.2rjus.net:9100` - no route to host
- `media1.home.2rjus.net:9100` - no route to host
- `ns3.home.2rjus.net:9100` - no route to host
- `ns4.home.2rjus.net:9100` - no route to host
These may be decommissioned or powered-off hosts that should be removed from the scrape config.

View File

@@ -0,0 +1,176 @@
# NixOS Prometheus Exporter
## Overview
Build a generic Prometheus exporter for NixOS-specific metrics. This exporter should be useful for any NixOS deployment, not just our homelab.
## Goal
Provide visibility into NixOS system state that standard exporters don't cover:
- Generation management (count, age, current vs booted)
- Flake input freshness
- Upgrade status
## Metrics
### Core Metrics
| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_generation_count` | Number of system generations | Count entries in `/nix/var/nix/profiles/system-*` |
| `nixos_current_generation` | Active generation number | Parse `readlink /run/current-system` |
| `nixos_booted_generation` | Generation that was booted | Parse `/run/booted-system` |
| `nixos_generation_age_seconds` | Age of current generation | File mtime of current system profile |
| `nixos_config_mismatch` | 1 if booted != current, 0 otherwise | Compare symlink targets |
### Flake Metrics (optional collector)
| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_flake_input_age_seconds` | Age of each flake.lock input | Parse `lastModified` from flake.lock |
| `nixos_flake_input_info` | Info gauge with rev label | Parse `rev` from flake.lock |
Labels: `input` (e.g., "nixpkgs", "home-manager")
### Future Metrics
| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_upgrade_pending` | 1 if remote differs from local | Compare flake refs (expensive) |
| `nixos_store_size_bytes` | Size of /nix/store | `du` or filesystem stats |
| `nixos_store_path_count` | Number of store paths | Count entries |
## Architecture
Single binary with optional collectors enabled via config or flags.
```
nixos-exporter
├── main.go
├── collector/
│ ├── generation.go # Core generation metrics
│ └── flake.go # Flake input metrics
└── config/
└── config.go
```
## Configuration
```yaml
listen_addr: ":9971"
collectors:
generation:
enabled: true
flake:
enabled: false
lock_path: "/etc/nixos/flake.lock" # or auto-detect from /run/current-system
```
Command-line alternative:
```bash
nixos-exporter --listen=:9971 --collector.flake --flake.lock-path=/etc/nixos/flake.lock
```
## NixOS Module
```nix
services.prometheus.exporters.nixos = {
enable = true;
port = 9971;
collectors = [ "generation" "flake" ];
flake.lockPath = "/etc/nixos/flake.lock";
};
```
The module should integrate with nixpkgs' existing `services.prometheus.exporters.*` pattern.
## Implementation
### Language
Go - mature prometheus client library, single static binary, easy cross-compilation.
### Phase 1: Core
1. Create git repository
2. Implement generation collector (count, current, booted, age, mismatch)
3. Basic HTTP server with `/metrics` endpoint
4. NixOS module
### Phase 2: Flake Collector
1. Parse flake.lock JSON format
2. Extract lastModified timestamps per input
3. Add input labels
### Phase 3: Packaging
1. Add to nixpkgs or publish as flake
2. Documentation
3. Example Grafana dashboard
## Example Output
```
# HELP nixos_generation_count Total number of system generations
# TYPE nixos_generation_count gauge
nixos_generation_count 47
# HELP nixos_current_generation Currently active generation number
# TYPE nixos_current_generation gauge
nixos_current_generation 47
# HELP nixos_booted_generation Generation that was booted
# TYPE nixos_booted_generation gauge
nixos_booted_generation 46
# HELP nixos_generation_age_seconds Age of current generation in seconds
# TYPE nixos_generation_age_seconds gauge
nixos_generation_age_seconds 3600
# HELP nixos_config_mismatch 1 if booted generation differs from current
# TYPE nixos_config_mismatch gauge
nixos_config_mismatch 1
# HELP nixos_flake_input_age_seconds Age of flake input in seconds
# TYPE nixos_flake_input_age_seconds gauge
nixos_flake_input_age_seconds{input="nixpkgs"} 259200
nixos_flake_input_age_seconds{input="home-manager"} 86400
```
## Alert Examples
```yaml
- alert: NixOSConfigStale
expr: nixos_generation_age_seconds > 7 * 24 * 3600
for: 1h
labels:
severity: warning
annotations:
summary: "NixOS config on {{ $labels.instance }} is over 7 days old"
- alert: NixOSRebootRequired
expr: nixos_config_mismatch == 1
for: 24h
labels:
severity: info
annotations:
summary: "{{ $labels.instance }} needs reboot to apply config"
- alert: NixpkgsInputStale
expr: nixos_flake_input_age_seconds{input="nixpkgs"} > 30 * 24 * 3600
for: 1d
labels:
severity: info
annotations:
summary: "nixpkgs input on {{ $labels.instance }} is over 30 days old"
```
## Open Questions
- [ ] How to detect flake.lock path automatically? (check /run/current-system for flake info)
- [ ] Should generation collector need root? (probably not, just reading symlinks)
- [ ] Include in nixpkgs or distribute as standalone flake?
## Notes
- Port 9971 suggested (9970 reserved for homelab-exporter)
- Keep scope focused on NixOS-specific metrics - don't duplicate node-exporter
- Consider submitting to prometheus exporter registry once stable

View File

@@ -0,0 +1,86 @@
# Sops to OpenBao Secrets Migration Plan
## Status: Complete (except ca, deferred)
## Remaining sops cleanup
The `sops-nix` flake input, `system/sops.nix`, `.sops.yaml`, and `secrets/` directory are
still present because `ca` still uses sops for its step-ca secrets (5 secrets in
`services/ca/default.nix`). The `services/authelia/` and `services/lldap/` modules also
reference sops but are only used by auth01 (decommissioned).
Once `ca` is migrated to OpenBao PKI (Phase 4c in host-migration-to-opentofu.md), remove:
- `sops-nix` input from `flake.nix`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
## Overview
Migrate all hosts from sops-nix secrets to OpenBao (vault) secrets management. Pilot with ha1, then roll out to remaining hosts in waves.
## Pre-requisites (completed)
1. Hardcoded root password hash in `system/root-user.nix` (removes sops dependency for all hosts)
2. Added `extractKey` option to `system/vault-secrets.nix` (extracts single key as file)
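As a purely illustrative sketch of how `extractKey` might be used — the real option tree lives in `system/vault-secrets.nix`, and every name below other than `extractKey` is an assumption:
```nix
# Illustrative only; actual option names are defined in system/vault-secrets.nix.
vault.secrets."backup-password" = {
  path = "shared/backup";   # assumed KV path layout
  extractKey = "password";  # write just this key to the secret file
};
```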
## Deployment Order
### Pilot: ha1
- Terraform: shared/backup/password secret, ha1 AppRole policy
- Provision AppRole credentials via `playbooks/provision-approle.yml`
- NixOS: vault.enable + backup-helper vault secret
### Wave 1: nats1, jelly01, pgdb1
- No service secrets (only root password, already handled)
- Just need AppRole policies + credential provisioning
### Wave 2: monitoring01
- 3 secrets: backup password, nats nkey, pve-exporter config
- Updates: alerttonotify.nix, pve.nix, configuration.nix
### Wave 3: ns1, then ns2 (critical - deploy ns1 first, verify, then ns2)
- DNS zone transfer key (shared/dns/xfer-key)
### Wave 4: http-proxy
- WireGuard private key
### Wave 5: nix-cache01
- Cache signing key + Gitea Actions token
### Wave 6: ca (DEFERRED - waiting for PKI migration)
### Skipped: auth01 (decommissioned)
## Terraform variables needed
User must extract from sops and add to `terraform/vault/terraform.tfvars`:
| Variable | Source |
|----------|--------|
| `backup_helper_secret` | `sops -d secrets/secrets.yaml` |
| `ns_xfer_key` | `sops -d secrets/secrets.yaml` |
| `nats_nkey` | `sops -d secrets/secrets.yaml` |
| `pve_exporter_config` | `sops -d secrets/monitoring01/pve-exporter.yaml` |
| `wireguard_private_key` | `sops -d secrets/http-proxy/wireguard.yaml` |
| `cache_signing_key` | `sops -d secrets/nix-cache01/cache-secret` |
| `actions_token_1` | `sops -d secrets/nix-cache01/actions_token_1` |
## Provisioning AppRole credentials
```bash
export BAO_ADDR='https://vault01.home.2rjus.net:8200'
export BAO_TOKEN='<root-token>'
nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>
```
## Verification (per host)
1. `systemctl status vault-secret-*` - all secret fetch services succeeded
2. Check secret files exist at expected paths with correct permissions
3. Verify dependent services are running
4. Check `/var/lib/vault/cache/` is populated (fallback ready)
5. Reboot host to verify boot-time secret fetching works

View File

@@ -0,0 +1,109 @@
# Zigbee Sensor Battery Monitoring
**Status:** Completed
**Branch:** `zigbee-battery-fix`
**Commit:** `c515a6b home-assistant: fix zigbee sensor battery reporting`
## Problem
Three Aqara Zigbee temperature sensors report `battery: 0` in their MQTT payload, making the `hass_sensor_battery_percent` Prometheus metric useless for battery monitoring on these devices.
Affected sensors:
- **Temp Living Room** (`0x54ef441000a54d3c`) — WSDCGQ12LM
- **Temp Office** (`0x54ef441000a547bd`) — WSDCGQ12LM
- **temp_server** (`0x54ef441000a564b6`) — WSDCGQ12LM
The **Temp Bedroom** sensor (`0x00124b0025495463`) is a SONOFF SNZB-02 and reports battery correctly.
## Findings
- All three sensors are actively reporting temperature, humidity, and pressure data — they are not dead.
- The Zigbee2MQTT payload includes a `voltage` field (e.g., `2707` = 2.707V), which indicates healthy battery levels (~40-60% for a CR2032 coin cell).
- CR2032 voltage reference: ~3.0V fresh, ~2.7V mid-life, ~2.1V dead.
- The `voltage` field is not exposed as a Prometheus metric — it exists only in the MQTT payload.
- This is a known firmware quirk with some Aqara WSDCGQ12LM sensors that always report 0% battery.
## Device Inventory
Full list of Zigbee devices on ha1 (12 total):
| Device | IEEE Address | Model | Type |
|--------|-------------|-------|------|
| temp_server | 0x54ef441000a564b6 | WSDCGQ12LM | Temperature sensor (battery fix applied) |
| (Temp Living Room) | 0x54ef441000a54d3c | WSDCGQ12LM | Temperature sensor (battery fix applied) |
| (Temp Office) | 0x54ef441000a547bd | WSDCGQ12LM | Temperature sensor (battery fix applied) |
| (Temp Bedroom) | 0x00124b0025495463 | SNZB-02 | Temperature sensor (battery works) |
| (Water leak) | 0x54ef4410009ac117 | SJCGQ12LM | Water leak sensor |
| btn_livingroom | 0x54ef441000a1f907 | WXKG13LM | Wireless mini switch |
| btn_bedroom | 0x54ef441000a1ee71 | WXKG13LM | Wireless mini switch |
| (Hue bulb) | 0x001788010dc35d06 | 9290024688 | Hue E27 1100lm (Router) |
| (Hue bulb) | 0x001788010dc5f003 | 9290024688 | Hue E27 1100lm (Router) |
| (Hue ceiling) | 0x001788010e371aa4 | 915005997301 | Hue Infuse medium (Router) |
| (Hue ceiling) | 0x001788010d253b99 | 915005997301 | Hue Infuse medium (Router) |
| (Hue wall) | 0x001788010d1b599a | 929003052901 | Hue Sana wall light (Router, transition=5) |
## Implementation
### Solution 1: Calculate battery from voltage in Zigbee2MQTT (Implemented)
Override the Home Assistant battery entity's `value_template` in Zigbee2MQTT device configuration to calculate battery percentage from voltage.
**Formula:** `(voltage - 2100) / 9` (maps 2100-3000mV to 0-100%)
**Changes in `services/home-assistant/default.nix`:**
- Device configuration moved from external `devices.yaml` to inline NixOS config
- Three affected sensors have `homeassistant.sensor_battery.value_template` override
- All 12 devices now declaratively managed
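A hedged sketch of one such device entry (option path per the NixOS zigbee2mqtt module; the exact template string should be checked against the committed config):
```nix
# Sketch of one inline device entry; the value_template applies the
# voltage-to-percent formula above. Template details are illustrative.
services.zigbee2mqtt.settings.devices."0x54ef441000a564b6" = {
  friendly_name = "temp_server";
  homeassistant.sensor_battery.value_template =
    "{{ ((value_json.voltage - 2100) / 9) | round(0) }}";
};
```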
**Expected battery values based on current voltages:**
| Sensor | Voltage | Expected Battery |
|--------|---------|------------------|
| Temp Living Room | 2710 mV | ~68% |
| Temp Office | 2658 mV | ~62% |
| temp_server | 2765 mV | ~74% |
### Solution 2: Alert on sensor staleness (Implemented)
Added Prometheus alert `zigbee_sensor_stale` in `services/monitoring/rules.yml` that fires when a Zigbee temperature sensor hasn't updated in over 1 hour. This provides defense-in-depth for detecting dead sensors regardless of battery reporting accuracy.
**Alert details:**
- Expression: `(time() - hass_last_updated_time_seconds{entity=~"sensor\\.(0x[0-9a-f]+|temp_server)_temperature"}) > 3600`
- Severity: warning
- For: 5m
## Pre-Deployment Verification
### Backup Verification
Before deployment, verified ha1 backup configuration and ran manual backup:
**Backup paths:**
- `/var/lib/hass`
- `/var/lib/zigbee2mqtt`
- `/var/lib/mosquitto`
**Manual backup (2026-02-05 22:45:23):**
- Snapshot ID: `59704dfa`
- Files: 77 total (0 new, 13 changed, 64 unmodified)
- Data: 62.635 MiB processed, 6.928 MiB stored (compressed)
### Other directories reviewed
- `/var/lib/vault` — Contains AppRole credentials; not backed up (can be re-provisioned via Ansible)
- `/var/lib/sops-nix` — Legacy; ha1 uses Vault now
## Post-Deployment Steps
After deploying to ha1:
1. Restart zigbee2mqtt service (automatic on NixOS rebuild)
2. In Home Assistant, the battery entities may need to be re-discovered:
- Go to Settings → Devices & Services → MQTT
- The new `value_template` should take effect after entity re-discovery
- If not, try disabling and re-enabling the battery entities
## Notes
- Device configuration is now declarative in NixOS. Future device additions via Zigbee2MQTT frontend will need to be added to the NixOS config to persist.
- The `devices.yaml` file on ha1 will be overwritten on service start but can be removed after confirming the new config works.
- The NixOS zigbee2mqtt module defaults to `devices = "devices.yaml"` but our explicit inline config overrides this.

View File

@@ -0,0 +1,179 @@
# Homelab Infrastructure Exporter
## Overview
Build a Prometheus exporter for metrics specific to our homelab infrastructure. Unlike the generic nixos-exporter, this covers services and patterns unique to our environment.
## Current State
### Existing Exporters
- **node-exporter** (all hosts): System metrics
- **systemd-exporter** (all hosts): Service restart counts, IP accounting
- **labmon** (monitoring01): TLS certificate monitoring, step-ca health
- **Service-specific**: unbound, postgres, nats, jellyfin, home-assistant, caddy, step-ca
### Gaps
- No visibility into Vault/OpenBao lease expiry
- No ACME certificate expiry from internal CA
- No Proxmox guest agent metrics from inside VMs
## Metrics
### Vault/OpenBao Metrics
| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_vault_token_expiry_seconds` | Seconds until AppRole token expires | Token metadata or lease file |
| `homelab_vault_token_renewable` | 1 if token is renewable | Token metadata |
Labels: `role` (AppRole name)
### ACME Certificate Metrics
| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_acme_cert_expiry_seconds` | Seconds until certificate expires | Parse cert from `/var/lib/acme/*/cert.pem` |
| `homelab_acme_cert_not_after` | Unix timestamp of cert expiry | Certificate NotAfter field |
Labels: `domain`, `issuer`
Note: labmon already monitors external TLS endpoints. This covers local ACME-managed certs.
### Proxmox Guest Metrics (future)
| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_proxmox_guest_info` | Info gauge with VM ID, name | QEMU guest agent |
| `homelab_proxmox_guest_agent_running` | 1 if guest agent is responsive | Agent ping |
### DNS Zone Metrics (future)
| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_dns_zone_serial` | Current zone serial number | DNS AXFR or zone file |
Labels: `zone`
## Architecture
Single binary with collectors enabled via config. Runs on hosts that need specific collectors.
```
homelab-exporter
├── main.go
├── collector/
│ ├── vault.go # Vault/OpenBao token metrics
│ ├── acme.go # ACME certificate metrics
│ └── proxmox.go # Proxmox guest agent (future)
└── config/
└── config.go
```
## Configuration
```yaml
listen_addr: ":9970"
collectors:
vault:
enabled: true
token_path: "/var/lib/vault/token"
acme:
enabled: true
cert_dirs:
- "/var/lib/acme"
proxmox:
enabled: false
```
## NixOS Module
```nix
services.homelab-exporter = {
enable = true;
port = 9970;
collectors = {
vault = {
enable = true;
tokenPath = "/var/lib/vault/token";
};
acme = {
enable = true;
certDirs = [ "/var/lib/acme" ];
};
};
};
# Auto-register scrape target
homelab.monitoring.scrapeTargets = [{
job_name = "homelab-exporter";
port = 9970;
}];
```
## Integration
### Deployment
Deploy on hosts that have relevant data:
- **All hosts with ACME certs**: acme collector
- **All hosts with Vault**: vault collector
- **Proxmox VMs**: proxmox collector (when implemented)
### Relationship with nixos-exporter
These are complementary:
- **nixos-exporter** (port 9971): Generic NixOS metrics, deploy everywhere
- **homelab-exporter** (port 9970): Infrastructure-specific, deploy selectively
Both can run on the same host if needed.
## Implementation
### Language
Go - consistent with labmon and nixos-exporter.
### Phase 1: Core + ACME
1. Create git repository (git.t-juice.club/torjus/homelab-exporter)
2. Implement ACME certificate collector
3. HTTP server with `/metrics`
4. NixOS module
### Phase 2: Vault Collector
1. Implement token expiry detection
2. Handle missing/expired tokens gracefully
### Phase 3: Dashboard
1. Create Grafana dashboard for infrastructure health
2. Add to existing monitoring service module
## Alert Examples
```yaml
- alert: VaultTokenExpiringSoon
expr: homelab_vault_token_expiry_seconds < 3600
for: 5m
labels:
severity: warning
annotations:
summary: "Vault token on {{ $labels.instance }} expires in < 1 hour"
- alert: ACMECertExpiringSoon
expr: homelab_acme_cert_expiry_seconds < 7 * 24 * 3600
for: 1h
labels:
severity: warning
annotations:
summary: "ACME cert {{ $labels.domain }} on {{ $labels.instance }} expires in < 7 days"
```
## Open Questions
- [ ] How to read Vault token expiry without re-authenticating?
- [ ] Should ACME collector also check key/cert match?
## Notes
- Port 9970 (labmon uses 9969, nixos-exporter will use 9971)
- Keep infrastructure-specific logic here, generic NixOS stuff in nixos-exporter
- Consider merging Proxmox metrics with pve-exporter if overlap is significant

View File

@@ -0,0 +1,224 @@
# Host Migration to OpenTofu
## Overview
Migrate all existing hosts (provisioned manually before the OpenTofu pipeline) into the new
OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements:
stateless hosts are simply recreated, stateful hosts require backup and restore, and some
hosts are decommissioned or deferred.
## Current State
Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
Hosts to migrate:
| Host | Category | Notes |
|------|----------|-------|
| ns1 | Stateless | Primary DNS, recreate |
| ns2 | Stateless | Secondary DNS, recreate |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| auth01 | Decommission | No longer in use |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Stateful | PostgreSQL databases |
| jump | Decommission | No longer needed |
| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
## Phase 1: Backup Preparation
Before migrating any stateful host, ensure restic backups are in place and verified.
### 1a. Expand monitoring01 Grafana Backup
The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.
### 1b. Add Jellyfin Backup to jelly01
No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` which contains:
- `config/` — server settings, library configuration
- `data/` — user watch history, playback state, library metadata
Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.
### 1c. Add PostgreSQL Backup to pgdb1
No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
all databases and roles. The dump should be piped through restic's stdin backup (similar to
the Grafana DB dump pattern on monitoring01).
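A minimal sketch assuming the nixpkgs restic module, with the dump staged to a spool file (a `restic backup --stdin` pipe is the alternative the text describes); repository and secret paths are placeholders:
```nix
# Minimal sketch; repository URL and password path are placeholders.
services.restic.backups.pgdb = {
  repository = "rest:https://backup.example/pgdb1";
  passwordFile = "/run/secrets/restic-password";
  # Dump all databases and roles before the backup runs; switch to the
  # postgres user so local peer authentication succeeds.
  backupPrepareCommand = ''
    install -d -m 0700 /var/backup
    /run/wrappers/bin/sudo -u postgres pg_dumpall > /var/backup/pg_dumpall.sql
  '';
  paths = [ "/var/backup/pg_dumpall.sql" ];
  timerConfig.OnCalendar = "daily";
};
```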
### 1d. Verify Existing ha1 Backup
ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.
### 1e. Verify All Backups
After adding/expanding backup jobs:
1. Trigger a manual backup run on each host
2. Verify backup integrity with `restic check`
3. Test a restore to a temporary location to confirm data is recoverable
## Phase 2: Declare pgdb1 Databases in Nix
Before migrating pgdb1, audit the manually-created databases and users on the running
instance, then declare them in the Nix configuration using `ensureDatabases` and
`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
Steps:
1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
3. Document any non-default PostgreSQL settings or extensions per database
After reprovisioning, the databases will be created by NixOS, and data restored from the
`pg_dumpall` backup.
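An illustrative shape for `services/postgres/postgres.nix`; the actual database and role names come out of the audit in step 1:
```nix
# Illustrative only; names are placeholders pending the audit.
services.postgresql = {
  ensureDatabases = [ "exampledb" ];
  ensureUsers = [
    {
      name = "exampledb";
      ensureDBOwnership = true; # role owns the identically named database
    }
  ];
};
```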
## Phase 3: Stateless Host Migration
These hosts have no meaningful state and can be recreated fresh. For each host:
1. Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
2. Commit and push to master
3. Run `tofu apply` to provision the new VM
4. Wait for bootstrap to complete (VM pulls config from master and reboots)
5. Verify the host is functional
6. Decommission the old VM in Proxmox
### Migration Order
Migrate stateless hosts in an order that minimizes disruption:
1. **nix-cache01** — low risk, no downstream dependencies during migration
2. **nats1** — low risk, verify no persistent JetStream streams first
3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
4. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
## Phase 4: Stateful Host Migration
For each stateful host, the procedure is:
1. Trigger a final restic backup
2. Stop services on the old host (to prevent state drift during migration)
3. Provision the new VM via `tofu apply`
4. Wait for bootstrap to complete
5. Stop the relevant services on the new host
6. Restore data from restic backup
7. Start services and verify functionality
8. Decommission the old VM
### 4a. pgdb1
1. Run final `pg_dumpall` backup via restic
2. Stop PostgreSQL on the old host
3. Provision new pgdb1 via OpenTofu
4. After bootstrap, NixOS creates the declared databases/users
5. Restore data with `psql -f dumpall.sql` (a `pg_dumpall` dump is plain SQL, so `psql` applies rather than `pg_restore`)
6. Verify database connectivity from gunter (`10.69.30.105`)
7. Decommission old VM
### 4b. monitoring01
1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
3. After bootstrap, restore `/var/lib/grafana/` from restic
4. Restart Grafana, verify dashboards and datasources are intact
5. Prometheus and Loki start fresh with empty data (acceptable)
6. Verify all scrape targets are being collected
7. Decommission old VM
### 4c. jelly01
1. Run final Jellyfin backup
2. Provision new jelly01 via OpenTofu
3. After bootstrap, restore `/var/lib/jellyfin/` from restic
4. Verify NFS mount to NAS is working
5. Start Jellyfin, verify watch history and library metadata are present
6. Decommission old VM
### 4d. ha1
1. Verify latest restic backup is current
2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
3. Provision new ha1 via OpenTofu
4. After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
5. Start services, verify Home Assistant is functional
6. Verify Zigbee devices are still paired and communicating
7. Decommission old VM
**Note:** ha1 currently has 2 GB RAM, which is consistently tight. Average memory usage has
climbed from ~57% (30-day avg) to ~70% currently, with a 30-day low of only 187 MB free.
Consider increasing to 4 GB when reprovisioning to allow headroom for additional integrations.
**Note:** ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but verify during a
non-critical time window.
**USB Passthrough:** The ha1 VM has a USB device passed through from the Proxmox hypervisor
(the Zigbee coordinator). The new VM must be configured with the same USB passthrough in
OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropriate
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.
## Phase 5: Decommission jump and auth01 Hosts
### jump
1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
2. Remove host configuration from `hosts/jump/`
3. Remove from `flake.nix`
4. Remove any secrets in `secrets/jump/`
5. Remove from `.sops.yaml`
6. Destroy the VM in Proxmox
7. Commit cleanup
### auth01
1. Remove host configuration from `hosts/auth01/`
2. Remove from `flake.nix`
3. Remove any secrets in `secrets/auth01/`
4. Remove from `.sops.yaml`
5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
6. Destroy the VM in Proxmox
7. Commit cleanup
## Phase 6: Decommission ca Host (Deferred)
Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
the same cleanup steps as the jump host.
## Phase 7: Remove sops-nix
Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
all remnants:
- `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
`hosts/template2/scripts.nix`)
See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
## Notes
- Each host migration should be done individually, not in bulk, to limit blast radius
- Keep the old VM running until the new one is verified — do not destroy prematurely
- The old VMs use IPs that the new VMs need, so the old VM must be shut down before
the new one is provisioned (or use a temporary IP and swap after verification)
- Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only host not managed by OpenTofu will be ca (deferred)
- Since many hosts are being recreated, this is a good opportunity to establish consistent
hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
(e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
convention before starting migrations — e.g. whether to always use numeric suffixes, a
consistent format like `service-NN`, role-based vs function-based names, etc.

View File

@@ -0,0 +1,122 @@
# Long-Term Metrics Storage Options
## Problem Statement
Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
## Current Configuration
Location: `services/monitoring/prometheus.nix`
- **Retention**: 30 days
- **Scrape interval**: 15s
- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
- **Storage**: Local disk on monitoring01
## Options Evaluated
### Option 1: VictoriaMetrics
VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
**NixOS Options Available:**
- `services.victoriametrics.enable`
- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
- `services.vmagent` - dedicated scraping agent
- `services.vmalert` - alerting rules evaluation
**Pros:**
- Simple migration - single service replacement
- Same PromQL query language - Grafana dashboards work unchanged
- Same scrape config format - existing auto-generated configs work as-is
- 5-10x better compression means 30 days of Prometheus data could become 180+ days
- Lightweight, single binary
**Cons:**
- No automatic downsampling (relies on compression alone)
- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
- Would need to migrate existing data or start fresh
**Migration Steps:**
1. Replace `services.prometheus` with `services.victoriametrics`
2. Move scrape configs to `prometheusConfig`
3. Set up `services.vmalert` for alerting rules
4. Update Grafana datasource to VictoriaMetrics port (8428)
5. Keep Alertmanager for notification routing
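Condensing steps 1-2 into a sketch, using the option names listed above (retention format per the earlier example; the existing auto-generated scrape configs should slot into `prometheusConfig` unchanged):
```nix
# Condensed sketch of the VictoriaMetrics swap using the options above.
services.victoriametrics = {
  enable = true;
  retentionPeriod = "6m"; # ~6 months in roughly the old 30d disk budget
  prometheusConfig = {
    scrape_configs = [{
      job_name = "node";
      static_configs = [{ targets = [ "monitoring01.home.2rjus.net:9100" ]; }];
    }];
  };
};
```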
### Option 2: Thanos
Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
**NixOS Options Available:**
- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
- `services.thanos.compact` - compacts and downsamples data
- `services.thanos.query` - unified query gateway
- `services.thanos.query-frontend` - query caching and parallelization
- `services.thanos.downsample` - dedicated downsampling service
**Downsampling Behavior:**
- Raw resolution kept for configurable period (default: indefinite)
- 5-minute resolution created after 40 hours
- 1-hour resolution created after 10 days
**Retention Configuration (in compactor):**
```nix
services.thanos.compact = {
retention.resolution-raw = "30d"; # Keep raw for 30 days
retention.resolution-5m = "180d"; # Keep 5m samples for 6 months
retention.resolution-1h = "2y"; # Keep 1h samples for 2 years
};
```
**Pros:**
- True downsampling - older data uses progressively less storage
- Keep metrics for years with minimal storage impact
- Prometheus continues running unchanged
- Existing Alertmanager integration preserved
**Cons:**
- Requires object storage (MinIO, S3, or local filesystem)
- Multiple services to manage (sidecar, compactor, query)
- More complex architecture
- Additional infrastructure (MinIO) may be needed
**Required Components:**
1. Thanos Sidecar (runs alongside Prometheus)
2. Object storage (MinIO or local filesystem)
3. Thanos Compactor (handles downsampling)
4. Thanos Query (provides unified query endpoint)
**Migration Steps:**
1. Deploy object storage (MinIO or configure filesystem backend)
2. Add Thanos sidecar pointing to Prometheus data directory
3. Add Thanos compactor with retention policies
4. Add Thanos query gateway
5. Update Grafana datasource to Thanos Query port (10902)
## Comparison
| Aspect | VictoriaMetrics | Thanos |
|--------|-----------------|--------|
| Complexity | Low (1 service) | Higher (3-4 services) |
| Downsampling | No | Yes (automatic) |
| Storage savings | 5-10x compression | Compression + downsampling |
| Object storage required | No | Yes |
| Migration effort | Minimal | Moderate |
| Grafana changes | Change port only | Change port only |
| Alerting changes | Need vmalert | Keep existing |
## Recommendation
**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
## References
- VictoriaMetrics docs: https://docs.victoriametrics.com/
- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)

View File

@@ -0,0 +1,371 @@
# NATS-Based Deployment Service
## Overview
Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.
## Goals
1. **On-demand deployment** - Trigger config updates immediately via NATS message
2. **Targeted deployment** - Deploy to specific hosts or all hosts
3. **Branch/revision support** - Test feature branches before merging to master
4. **MCP integration** - Allow Claude Code to trigger deployments during development
## Current State
- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: `nixos-rebuild-test <action> <branch>` helper exists on all hosts
- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)
## Architecture
```
┌─────────────┐ ┌─────────────┐
│ MCP Tool │ deploy.test.> │ Admin CLI │ deploy.test.> + deploy.prod.>
│ (claude) │────────────┐ ┌─────│ (torjus) │
└─────────────┘ │ │ └─────────────┘
▼ ▼
┌──────────────┐
│ nats1 │
│ (authz) │
└──────┬───────┘
┌─────────────────┼─────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ template1│ │ ns1 │ │ ha1 │
│ tier=test│ │ tier=prod│ │ tier=prod│
└──────────┘ └──────────┘ └──────────┘
```
## Repository Structure
The project lives in a **separate repository** (e.g., `homelab-deploy`) containing:
```
homelab-deploy/
├── flake.nix # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│ └── homelab-deploy/
│ └── main.go # CLI entrypoint with subcommands
├── internal/
│ ├── listener/ # Listener mode logic
│ ├── mcp/ # MCP server mode logic
│ └── deploy/ # Shared deployment logic
└── nixos/
└── module.nix # NixOS module for listener service
```
The nixos-servers repo then imports this flake as an input and uses its NixOS module.
## Single Binary with Subcommands
The `homelab-deploy` binary supports multiple modes:
```bash
# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222
# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222
# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch # single host
homelab-deploy deploy --tier test --all --action boot # all test hosts
homelab-deploy deploy --tier prod --all --action boot # all prod hosts (admin only)
homelab-deploy deploy --tier prod --role dns --action switch # all prod dns hosts
homelab-deploy status
```
## Components
### Listener Mode
A systemd service on each host that:
- Subscribes to multiple subjects for targeted and group deployments
- Validates incoming messages (revision, action)
- Executes `nixos-rebuild` with specified parameters
- Reports status back via NATS
**Subject structure:**
```
deploy.<tier>.<hostname> # specific host (e.g., deploy.prod.ns1)
deploy.<tier>.all # all hosts in tier (e.g., deploy.test.all)
deploy.<tier>.role.<role> # all hosts with role in tier (e.g., deploy.prod.role.dns)
```
**Listener subscriptions** (based on `homelab.host` config):
- `deploy.<tier>.<hostname>` - direct messages to this host
- `deploy.<tier>.all` - broadcast to all hosts in tier
- `deploy.<tier>.role.<role>` - broadcast to hosts with matching role (if role is set)
Example: ns1 with `tier=prod, role=dns` subscribes to:
- `deploy.prod.ns1`
- `deploy.prod.all`
- `deploy.prod.role.dns`
**NixOS module configuration:**
```nix
services.homelab-deploy.listener = {
enable = true;
timeout = 600; # seconds, default 10 minutes
};
```
The listener reads tier and role from `config.homelab.host` (see Host Metadata below).
**Request message format:**
```json
{
"action": "switch" | "boot" | "test" | "dry-activate",
"revision": "master" | "feature-branch" | "abc123...",
"reply_to": "deploy.responses.<request-id>"
}
```
**Response message format:**
```json
{
"status": "accepted" | "rejected" | "started" | "completed" | "failed",
"error": "invalid_revision" | "already_running" | "build_failed" | null,
"message": "human-readable details"
}
```
**Request/Reply flow:**
1. MCP/CLI sends deploy request with unique `reply_to` subject
2. Listener validates request (e.g., `git ls-remote` to check revision exists)
3. Listener sends immediate response:
- `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}`, or
- `{"status": "started", "message": "starting nixos-rebuild switch"}`
4. If started, listener runs nixos-rebuild
5. Listener sends final response:
- `{"status": "completed", "message": "successfully switched to generation 42"}`, or
- `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}`
This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.
### MCP Mode
Runs as an MCP server providing tools for Claude Code.
**Tools:**
| Tool | Description | Tier Access |
|------|-------------|-------------|
| `deploy` | Deploy to test hosts (individual, all, or by role) | test only |
| `deploy_admin` | Deploy to any host (requires `--enable-admin` flag) | test + prod |
| `deploy_status` | Check deployment status/history | n/a |
| `list_hosts` | List available deployment targets | n/a |
**CLI flags:**
```bash
# Default: only test-tier deployments available
homelab-deploy mcp --nats-url nats://nats1:4222
# Enable admin tool (requires admin NKey to be configured)
homelab-deploy mcp --nats-url nats://nats1:4222 --enable-admin --admin-nkey-file /path/to/admin.nkey
```
**Security layers:**
1. **MCP flag**: `deploy_admin` tool only exposed when `--enable-admin` is passed
2. **NATS authz**: Even if tool is exposed, NATS rejects publishes without valid admin NKey
3. **Claude Code permissions**: Can set `mcp__homelab-deploy__deploy_admin` to `ask` mode for confirmation popup
By default, the MCP only loads test-tier credentials and exposes the `deploy` tool. Claude can:
- Deploy to individual test hosts
- Deploy to all test hosts at once (`deploy.test.all`)
- Deploy to test hosts by role (`deploy.test.role.<role>`)
### Tiered Permissions
Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:
**NATS user configuration (on nats1):**
```nix
accounts = {
HOMELAB = {
users = [
# MCP/Claude - test tier only
{
nkey = "UABC..."; # mcp-deployer
permissions = {
publish = [ "deploy.test.>" ];
subscribe = [ "deploy.responses.>" ];
};
}
# Admin - full access to all tiers
{
nkey = "UXYZ..."; # admin-deployer
permissions = {
publish = [ "deploy.test.>" "deploy.prod.>" ];
subscribe = [ "deploy.responses.>" ];
};
}
# Host listeners - subscribe to their tier, publish responses
{
nkey = "UDEF..."; # host-listener (one per host)
permissions = {
subscribe = [ "deploy.*.>" ];
publish = [ "deploy.responses.>" ];
};
}
];
};
};
```
**Host tier assignments** (via `homelab.host.tier`):
| Tier | Hosts |
|------|-------|
| test | template1, nix-cache01, future test hosts |
| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. |
**Example deployment scenarios:**
| Command | Subject | MCP | Admin |
|---------|---------|-----|-------|
| Deploy to ns1 | `deploy.prod.ns1` | ❌ | ✅ |
| Deploy to template1 | `deploy.test.template1` | ✅ | ✅ |
| Deploy to all test hosts | `deploy.test.all` | ✅ | ✅ |
| Deploy to all prod hosts | `deploy.prod.all` | ❌ | ✅ |
| Deploy to all DNS servers | `deploy.prod.role.dns` | ❌ | ✅ |
All NKeys stored in Vault - MCP gets limited credentials, admin CLI gets full-access credentials.
### Host Metadata
Rather than defining `tier` in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.
**Status:** The `homelab.host` module is implemented in `modules/homelab/host.nix`.
Hosts can be filtered by tier using `config.homelab.host.tier`.
**Module definition (in `modules/homelab/host.nix`):**
```nix
options.homelab.host = {
tier = lib.mkOption {
type = lib.types.enum [ "test" "prod" ];
default = "prod";
description = "Deployment tier - controls which credentials can deploy to this host";
};
priority = lib.mkOption {
type = lib.types.enum [ "high" "low" ];
default = "high";
description = "Alerting priority - low priority hosts have relaxed thresholds";
};
role = lib.mkOption {
type = lib.types.nullOr lib.types.str;
default = null;
description = "Primary role of this host (dns, database, monitoring, etc.)";
};
labels = lib.mkOption {
type = lib.types.attrsOf lib.types.str;
default = { };
description = "Additional free-form labels";
};
};
```
**Consumers:**
- `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription
- Prometheus scrape config reads `priority`, `role`, `labels` for target labels
- Future services can consume the same metadata
**Example host config:**
```nix
# hosts/nix-cache01/configuration.nix
homelab.host = {
tier = "test"; # can be deployed by MCP
priority = "low"; # relaxed alerting thresholds
role = "build-host";
};
# hosts/ns1/configuration.nix
homelab.host = {
tier = "prod"; # requires admin credentials
priority = "high";
role = "dns";
labels.dns_role = "primary";
};
```
## Implementation Steps
### Phase 1: Core Binary + Listener
1. **Create homelab-deploy repository**
- Initialize Go module
- Set up flake.nix with Go package build
2. **Implement listener mode**
- NATS subscription logic
- nixos-rebuild execution (see the rebuild sketch after this phase's list)
- Status reporting via NATS reply
3. **Create NixOS module**
- Systemd service definition
- Configuration options (hostname, NATS URL, NKey path)
- Vault secret integration for NKeys
4. **Create `homelab.host` module** (in nixos-servers)
- Define `tier`, `priority`, `role`, `labels` options
- This module is shared with Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`)
5. **Integrate with nixos-servers**
- Add flake input for homelab-deploy
- Import listener module in `system/`
- Set `homelab.host.tier` per host (test vs prod)
6. **Configure NATS tiered permissions**
- Add deployer users to nats1 config (mcp-deployer, admin-deployer)
- Set up subject ACLs per user (test-only vs full access)
- Add deployer NKeys to Vault
- Create Terraform resources for NKey secrets
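The rebuild step itself can stay close to a plain flake-based invocation - a sketch, assuming `$REVISION` has already been validated (the repository URL is illustrative; `?rev=` takes a full commit hash, `?ref=` a branch):

```bash
# Build and activate this host's configuration at a pinned revision
nixos-rebuild switch \
  --flake "git+https://git.t-juice.club/torjus/nixos-servers?rev=${REVISION}#$(hostname)"
```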
### Phase 2: MCP + CLI
7. **Implement MCP mode**
- MCP server with deploy/status tools
- Request/reply pattern for deployment feedback
8. **Implement CLI commands**
- `deploy` command for manual deployments
- `status` command to check deployment state
9. **Configure Claude Code**
- Add MCP server to configuration (see the `.mcp.json` sketch below)
- Document usage
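Registration could be a project-level `.mcp.json` entry along these lines (a sketch; server name and NATS URL are the ones used elsewhere in this plan):

```json
{
  "mcpServers": {
    "homelab-deploy": {
      "command": "homelab-deploy",
      "args": ["mcp", "--nats-url", "nats://nats1:4222"]
    }
  }
}
```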
### Phase 3: Enhancements
10. Add deployment locking (prevent concurrent deploys)
11. Prometheus metrics for deployment status
## Security Considerations
- **Privilege escalation**: Listener runs as root to execute nixos-rebuild
- **Input validation**: Strictly validate revision format (branch name or commit hash); see the sketch after this list
- **Rate limiting**: Prevent rapid-fire deployments
- **Audit logging**: Log all deployment requests with source identity
- **Network isolation**: NATS only accessible from internal network
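For the revision check, an allowlist-style match is probably sufficient (a sketch; the exact rules are an implementation choice):

```bash
# Accept a full 40-character commit hash or a conservative branch name
is_valid_revision() {
  [[ "$1" =~ ^[0-9a-f]{40}$ ]] || [[ "$1" =~ ^[A-Za-z0-9][A-Za-z0-9._/-]{0,99}$ ]]
}

is_valid_revision "$REQUESTED_REVISION" || { echo "rejected revision" >&2; exit 1; }
```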
## Decisions
All open questions have been resolved; the Notes section below records the rationale for each decision.
## Notes
- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp)
- **Timeout decision**: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times.
- **Rollback**: Not needed as a separate feature - deploy an older commit hash to effectively rollback.
- **Offline hosts**: No message persistence - if host is offline, deploy fails. Daily auto-upgrade is the safety net. Avoids complexity of JetStream deduplication (host coming online and applying 10 queued updates instead of just the latest).
- **Deploy history**: Use existing Loki - listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- **Naming**: `homelab-deploy` - ties it to the infrastructure rather than implementation details.

docs/plans/prometheus-scrape-target-labels.md Normal file
View File

@@ -0,0 +1,173 @@
# Prometheus Scrape Target Labels
## Goal
Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
## Motivation
Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.
With per-host labels, alert rules can use semantic filters like `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`.
## Proposed Labels
### `priority`
Indicates alerting importance. Hosts with `priority = "low"` can have relaxed thresholds or longer durations in alert rules.
Values: `"high"` (default), `"low"`
### `role`
Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.
Values: free-form string, e.g. `"dns"`, `"build-host"`, `"database"`, `"monitoring"`
**Note on multiple roles:** Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:
- **Separate boolean labels:** `role_build_host = "true"`, `role_cache_server = "true"` -- flexible but verbose, and requires updating the module when new roles are added.
- **Delimited string:** `role = "build-host,cache-server"` -- works with regex matchers (`{role=~".*build-host.*"}`), but regex matching is less clean and more error-prone.
- **Pick a primary role:** `role = "build-host"` -- simplest, and probably sufficient since most hosts have one primary role.
Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.
### `dns_role`
For DNS servers specifically, distinguish between primary and secondary resolvers. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.
Values: `"primary"`, `"secondary"`
Example use case: The `unbound_low_cache_hit_ratio` alert fires on ns2 because its cache hit ratio (~62%) is lower than ns1 (~90%). This is expected behavior since ns2 gets ~100x less traffic. With a `dns_role` label, the alert can either exclude secondaries or use different thresholds:
```promql
# Only alert on primary DNS
unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"}
# Or use different thresholds
(unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"})
or
(unbound_cache_hit_ratio < 0.5 and on(instance) unbound_up{dns_role="secondary"})
```
## Implementation
This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
### 1. Create `homelab.host` module
**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
`modules/homelab/host.nix` with tier, priority, role, and labels options.
Create `modules/homelab/host.nix` with shared host metadata options:
```nix
{ lib, ... }:
{
options.homelab.host = {
tier = lib.mkOption {
type = lib.types.enum [ "test" "prod" ];
default = "prod";
description = "Deployment tier - controls which credentials can deploy to this host";
};
priority = lib.mkOption {
type = lib.types.enum [ "high" "low" ];
default = "high";
description = "Alerting priority - low priority hosts have relaxed thresholds";
};
role = lib.mkOption {
type = lib.types.nullOr lib.types.str;
default = null;
description = "Primary role of this host (dns, database, monitoring, etc.)";
};
labels = lib.mkOption {
type = lib.types.attrsOf lib.types.str;
default = { };
description = "Additional free-form labels (e.g., dns_role = 'primary')";
};
};
}
```
Import this module in `modules/homelab/default.nix`.
### 2. Update `lib/monitoring.nix`
- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:
```nix
# Combine structured options + free-form labels
effectiveLabels =
(lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
// (lib.optionalAttrs (host.role != null) { role = host.role; })
// host.labels;
```
- `generateNodeExporterTargets` returns structured `static_configs` entries, grouping targets by their label sets:
```nix
# Before (flat list):
["ns1.home.2rjus.net:9100", "ns2.home.2rjus.net:9100", ...]
# After (grouped by labels):
[
{ targets = ["ns1.home.2rjus.net:9100", "ns2.home.2rjus.net:9100", ...]; }
{ targets = ["nix-cache01.home.2rjus.net:9100"]; labels = { priority = "low"; role = "build-host"; }; }
]
```
This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior).
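One way to express that grouping, assuming `hosts` is the list of per-host attrsets that `extractHostMonitoring` produces (field names here are illustrative):

```nix
# Sketch: one static_configs entry per unique label combination
{ lib, hosts }:
let
  # Nix sorts attribute names, so toJSON gives a stable grouping key
  byLabels = lib.groupBy (h: builtins.toJSON h.effectiveLabels) hosts;
in
lib.mapAttrsToList (
  labelsJson: group:
  { targets = map (h: "${h.name}.home.2rjus.net:9100") group; }
  // lib.optionalAttrs (labelsJson != "{}") { labels = builtins.fromJSON labelsJson; }
) byLabels
```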
### 3. Update `services/monitoring/prometheus.nix`
Change the node-exporter scrape config to use the new structured output:
```nix
# Before:
static_configs = [{ targets = nodeExporterTargets; }];
# After:
static_configs = nodeExporterTargets;
```
### 4. Set metadata on hosts
Example in `hosts/nix-cache01/configuration.nix`:
```nix
homelab.host = {
tier = "test"; # can be deployed by MCP (used by homelab-deploy)
priority = "low"; # relaxed alerting thresholds
role = "build-host";
};
```
Example in `hosts/ns1/configuration.nix`:
```nix
homelab.host = {
tier = "prod";
priority = "high";
role = "dns";
labels.dns_role = "primary";
};
```
### 5. Update alert rules
After implementing labels, review and update `services/monitoring/rules.yml`:
- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.
Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
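Because scrape-target labels are attached to every series scraped from that target, the rewrite is a direct label swap (metric and threshold shown are illustrative):

```promql
# Before: brittle instance exclusion
node_load5{instance!="nix-cache01.home.2rjus.net:9100"} > 8

# After: semantic filter via the new priority label
node_load5{priority!="low"} > 8
```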
### 6. Consider labels for `generateScrapeConfigs` (service targets)
The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.

122
docs/plans/remote-access.md Normal file
View File

@@ -0,0 +1,122 @@
# Remote Access to Homelab Services
## Status: Planning
## Goal
Enable remote access to some or all homelab services from outside the internal network, without exposing anything directly to the internet.
## Current State
- All services are only accessible from the internal 10.69.13.x network
- Exception: jelly01 has a WireGuard link to an external VPS
- No services are directly exposed to the public internet
## Constraints
- Nothing should be directly accessible from the outside
- Must use VPN or overlay network (no port forwarding of services)
- Self-hosted solutions preferred over managed services
## Options
### 1. WireGuard Gateway (Internal Router)
A dedicated NixOS host on the internal network with a WireGuard tunnel out to the VPS. The VPS becomes the public entry point, and the gateway routes traffic to internal services. Firewall rules on the gateway control which services are reachable.
**Pros:**
- Simple, well-understood technology
- Already running WireGuard for jelly01
- Full control over routing and firewall rules
- Excellent NixOS module support
- No extra dependencies
**Cons:**
- Hub-and-spoke topology (all traffic goes through VPS)
- Manual peer management
- Adding a new client device means editing configs on both VPS and gateway
### 2. WireGuard Mesh (No Relay)
Each client device connects directly to a WireGuard endpoint - either the VPS, which forwards traffic to the homelab, or, if there is a routable IP at home, an internal host directly.
**Pros:**
- Simple and fast
- No extra software
**Cons:**
- Manual key and endpoint management for every peer
- Doesn't scale well
- If behind CGNAT, still needs the VPS as intermediary
### 3. Headscale (Self-Hosted Tailscale)
Run a Headscale control server (on the VPS or internally) and install the Tailscale client on homelab hosts and personal devices. Gets the Tailscale mesh networking UX without depending on Tailscale's infrastructure.
**Pros:**
- Mesh topology - devices communicate directly via NAT traversal (DERP relay as fallback)
- Easy to add/remove devices
- ACL support for granular access control
- MagicDNS for service discovery
- Good NixOS support for both headscale server and tailscale client
- Subnet routing lets you expose the entire 10.69.13.x network or specific hosts without installing tailscale on every host
**Cons:**
- More moving parts than plain WireGuard
- Headscale is a third-party reimplementation, can lag behind Tailscale features
- Need to run and maintain the control server
### 4. Tailscale (Managed)
Same as Headscale but using Tailscale's hosted control plane.
**Pros:**
- Zero infrastructure to manage on the control plane side
- Polished UX, well-maintained clients
- Free tier covers personal use
**Cons:**
- Dependency on Tailscale's service
- Less aligned with self-hosting preference
- Coordination metadata goes through their servers (data plane is still peer-to-peer)
### 5. Netbird (Self-Hosted)
Open-source alternative to Tailscale with a self-hostable management server. WireGuard-based, supports ACLs and NAT traversal.
**Pros:**
- Fully self-hostable
- Web UI for management
- ACL and peer grouping support
**Cons:**
- Heavier to self-host (needs multiple components: management server, signal server, TURN relay)
- Less mature NixOS module support compared to Tailscale/Headscale
### 6. Nebula (by Defined Networking)
Certificate-based mesh VPN. Each node gets a certificate from a CA you control. No central coordination server needed at runtime.
**Pros:**
- No always-on control plane
- Certificate-based identity
- Lightweight
**Cons:**
- Less convenient for ad-hoc device addition (need to issue certs)
- NAT traversal less mature than Tailscale's
- Smaller community/ecosystem
## Key Decision Points
- **Static public IP vs CGNAT?** Determines whether clients can connect directly to home network or need VPS relay.
- **Number of client devices?** If it's just a phone and a laptop, plain WireGuard via the VPS is fine; a larger device count favors Headscale.
- **Per-service vs per-network access?** Gateway with firewall rules gives per-service control. Headscale ACLs can also do this. Plain WireGuard gives network-level access with gateway firewall for finer control.
- **Subnet routing vs per-host agents?** With Headscale/Tailscale, can either install client on every host, or use a single subnet router that advertises the 10.69.13.x range. The latter is closer to the gateway approach and avoids touching every host.
## Leading Candidates
Based on existing WireGuard experience, self-hosting preference, and NixOS stack:
1. **Headscale with a subnet router** - Best balance of convenience and self-hosting
2. **WireGuard gateway via VPS** - Simplest, most transparent, builds on existing setup
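For candidate 1, the NixOS side could look roughly like this - a sketch using the stock `services.headscale` and `services.tailscale` modules; host placement and the advertised route are assumptions:

```nix
# On the control host (e.g. the VPS)
services.headscale = {
  enable = true;
  address = "0.0.0.0";
  port = 8080;
};

# On a single internal subnet router
services.tailscale = {
  enable = true;
  useRoutingFeatures = "server";
  extraUpFlags = [ "--advertise-routes=10.69.13.0/24" ];
};
```

Note that the advertised route still has to be approved on the headscale side before clients can reach the 10.69.13.x network.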

49
flake.lock generated
View File

@@ -21,6 +21,27 @@
       "url": "https://git.t-juice.club/torjus/alerttonotify"
     }
   },
+  "homelab-deploy": {
+    "inputs": {
+      "nixpkgs": [
+        "nixpkgs-unstable"
+      ]
+    },
+    "locked": {
+      "lastModified": 1770447502,
+      "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
+      "ref": "master",
+      "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
+      "revCount": 22,
+      "type": "git",
+      "url": "https://git.t-juice.club/torjus/homelab-deploy"
+    },
+    "original": {
+      "ref": "master",
+      "type": "git",
+      "url": "https://git.t-juice.club/torjus/homelab-deploy"
+    }
+  },
   "labmon": {
     "inputs": {
       "nixpkgs": [
@@ -42,6 +63,26 @@
       "url": "https://git.t-juice.club/torjus/labmon"
     }
   },
+  "nixos-exporter": {
+    "inputs": {
+      "nixpkgs": [
+        "nixpkgs-unstable"
+      ]
+    },
+    "locked": {
+      "lastModified": 1770422522,
+      "narHash": "sha256-WmIFnquu4u58v8S2bOVWmknRwHn4x88CRfBFTzJ1inQ=",
+      "ref": "refs/heads/master",
+      "rev": "cf0ce858997af4d8dcc2ce10393ff393e17fc911",
+      "revCount": 11,
+      "type": "git",
+      "url": "https://git.t-juice.club/torjus/nixos-exporter"
+    },
+    "original": {
+      "type": "git",
+      "url": "https://git.t-juice.club/torjus/nixos-exporter"
+    }
+  },
   "nixpkgs": {
     "locked": {
       "lastModified": 1770136044,
@@ -60,11 +101,11 @@
   },
   "nixpkgs-unstable": {
     "locked": {
-      "lastModified": 1770181073,
-      "narHash": "sha256-ksTL7P9QC1WfZasNlaAdLOzqD8x5EPyods69YBqxSfk=",
+      "lastModified": 1770197578,
+      "narHash": "sha256-AYqlWrX09+HvGs8zM6ebZ1pwUqjkfpnv8mewYwAo+iM=",
       "owner": "nixos",
       "repo": "nixpkgs",
-      "rev": "bf922a59c5c9998a6584645f7d0de689512e444c",
+      "rev": "00c21e4c93d963c50d4c0c89bfa84ed6e0694df2",
       "type": "github"
     },
     "original": {
@@ -77,7 +118,9 @@
   "root": {
     "inputs": {
       "alerttonotify": "alerttonotify",
+      "homelab-deploy": "homelab-deploy",
       "labmon": "labmon",
+      "nixos-exporter": "nixos-exporter",
       "nixpkgs": "nixpkgs",
       "nixpkgs-unstable": "nixpkgs-unstable",
       "sops-nix": "sops-nix"

248
flake.nix
View File

@@ -17,6 +17,14 @@
      url = "git+https://git.t-juice.club/torjus/labmon?ref=master";
      inputs.nixpkgs.follows = "nixpkgs-unstable";
    };
+    nixos-exporter = {
+      url = "git+https://git.t-juice.club/torjus/nixos-exporter";
+      inputs.nixpkgs.follows = "nixpkgs-unstable";
+    };
+    homelab-deploy = {
+      url = "git+https://git.t-juice.club/torjus/homelab-deploy?ref=master";
+      inputs.nixpkgs.follows = "nixpkgs-unstable";
+    };
  };
  outputs =
@@ -27,6 +35,8 @@
      sops-nix,
      alerttonotify,
      labmon,
+      nixos-exporter,
+      homelab-deploy,
      ...
    }@inputs:
    let
@@ -42,6 +52,20 @@
        alerttonotify.overlays.default
        labmon.overlays.default
      ];
+      # Common modules applied to all hosts
+      commonModules = [
+        (
+          { config, pkgs, ... }:
+          {
+            nixpkgs.overlays = commonOverlays;
+            system.configurationRevision = self.rev or self.dirtyRev or "dirty";
+          }
+        )
+        sops-nix.nixosModules.sops
+        nixos-exporter.nixosModules.default
+        homelab-deploy.nixosModules.default
+        ./modules/homelab
+      ];
      allSystems = [
        "x86_64-linux"
        "aarch64-linux"
@@ -58,15 +82,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/ns1
-          sops-nix.nixosModules.sops
        ];
      };
      ns2 = nixpkgs.lib.nixosSystem {
@@ -74,63 +91,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/ns2
-          sops-nix.nixosModules.sops
-        ];
-      };
-      ns3 = nixpkgs.lib.nixosSystem {
-        inherit system;
-        specialArgs = {
-          inherit inputs self sops-nix;
-        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
-          ./hosts/ns3
-          sops-nix.nixosModules.sops
-        ];
-      };
-      ns4 = nixpkgs.lib.nixosSystem {
-        inherit system;
-        specialArgs = {
-          inherit inputs self sops-nix;
-        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
-          ./hosts/ns4
-          sops-nix.nixosModules.sops
-        ];
-      };
-      nixos-test1 = nixpkgs.lib.nixosSystem {
-        inherit system;
-        specialArgs = {
-          inherit inputs self sops-nix;
-        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
-          ./hosts/nixos-test1
-          sops-nix.nixosModules.sops
        ];
      };
      ha1 = nixpkgs.lib.nixosSystem {
@@ -138,15 +100,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/ha1
-          sops-nix.nixosModules.sops
        ];
      };
      template1 = nixpkgs.lib.nixosSystem {
@@ -154,15 +109,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/template
-          sops-nix.nixosModules.sops
        ];
      };
      template2 = nixpkgs.lib.nixosSystem {
@@ -170,15 +118,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/template2
-          sops-nix.nixosModules.sops
        ];
      };
      http-proxy = nixpkgs.lib.nixosSystem {
@@ -186,15 +127,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/http-proxy
-          sops-nix.nixosModules.sops
        ];
      };
      ca = nixpkgs.lib.nixosSystem {
@@ -202,15 +136,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/ca
-          sops-nix.nixosModules.sops
        ];
      };
      monitoring01 = nixpkgs.lib.nixosSystem {
@@ -218,15 +145,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/monitoring01
-          sops-nix.nixosModules.sops
          labmon.nixosModules.labmon
        ];
      };
@@ -235,15 +155,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/jelly01
-          sops-nix.nixosModules.sops
        ];
      };
      nix-cache01 = nixpkgs.lib.nixosSystem {
@@ -251,31 +164,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/nix-cache01
-          sops-nix.nixosModules.sops
-        ];
-      };
-      media1 = nixpkgs.lib.nixosSystem {
-        inherit system;
-        specialArgs = {
-          inherit inputs self sops-nix;
-        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
-          ./hosts/media1
-          sops-nix.nixosModules.sops
        ];
      };
      pgdb1 = nixpkgs.lib.nixosSystem {
@@ -283,15 +173,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/pgdb1
-          sops-nix.nixosModules.sops
        ];
      };
      nats1 = nixpkgs.lib.nixosSystem {
@@ -299,31 +182,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/nats1
-          sops-nix.nixosModules.sops
-        ];
-      };
-      auth01 = nixpkgs.lib.nixosSystem {
-        inherit system;
-        specialArgs = {
-          inherit inputs self sops-nix;
-        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
-          ./hosts/auth01
-          sops-nix.nixosModules.sops
        ];
      };
      testvm01 = nixpkgs.lib.nixosSystem {
@@ -331,15 +191,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/testvm01
-          sops-nix.nixosModules.sops
        ];
      };
      vault01 = nixpkgs.lib.nixosSystem {
@@ -347,15 +200,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/vault01
-          sops-nix.nixosModules.sops
        ];
      };
      vaulttest01 = nixpkgs.lib.nixosSystem {
@@ -363,15 +209,8 @@
        specialArgs = {
          inherit inputs self sops-nix;
        };
-        modules = [
-          (
-            { config, pkgs, ... }:
-            {
-              nixpkgs.overlays = commonOverlays;
-            }
-          )
+        modules = commonModules ++ [
          ./hosts/vaulttest01
-          sops-nix.nixosModules.sops
        ];
      };
    };
@@ -386,11 +225,12 @@
      { pkgs }:
      {
        default = pkgs.mkShell {
-          packages = with pkgs; [
-            ansible
-            opentofu
-            openbao
+          packages = [
+            pkgs.ansible
+            pkgs.opentofu
+            pkgs.openbao
            (pkgs.callPackage ./scripts/create-host { })
+            homelab-deploy.packages.${pkgs.system}.default
          ];
        };
      }

View File

@@ -1,67 +0,0 @@
{
pkgs,
...
}:
{
imports = [
../template/hardware-configuration.nix
../../system
../../common/vm
];
homelab.dns.cnames = [ "ldap" ];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
enable = true;
device = "/dev/sda";
configurationLimit = 3;
};
networking.hostName = "auth01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.18/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
services.qemuGuest.enable = true;
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,8 +0,0 @@
{ ... }:
{
imports = [
./configuration.nix
../../services/lldap
../../services/authelia
];
}

View File

@@ -55,8 +55,17 @@
    git
  ];
+  # Vault secrets management
+  vault.enable = true;
+  homelab.deploy.enable = true;
+  vault.secrets.backup-helper = {
+    secretPath = "shared/backup/password";
+    extractKey = "password";
+    outputDir = "/run/secrets/backup_helper_secret";
+    services = [ "restic-backups-ha1" ];
+  };
  # Backup service dirs
-  sops.secrets."backup_helper_secret" = { };
  services.restic.backups.ha1 = {
    repository = "rest:http://10.69.12.52:8000/backup-nix";
    passwordFile = "/run/secrets/backup_helper_secret";
@@ -68,6 +77,7 @@
    timerConfig = {
      OnCalendar = "daily";
      Persistent = true;
+      RandomizedDelaySec = "2h";
    };
    pruneOpts = [
      "--keep-daily 7"

View File

@@ -21,8 +21,6 @@
    "prometheus"
    "alertmanager"
    "jelly"
-    "auth"
-    "lldap"
    "pyroscope"
    "pushgw"
  ];
@@ -62,6 +60,9 @@
    "nix-command"
    "flakes"
  ];
+  vault.enable = true;
+  homelab.deploy.enable = true;
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim

View File

@@ -1,9 +1,12 @@
 { config, ... }:
 {
-  sops.secrets.wireguard_private_key = {
-    sopsFile = ../../secrets/http-proxy/wireguard.yaml;
-    key = "wg_private_key";
+  vault.secrets.wireguard = {
+    secretPath = "hosts/http-proxy/wireguard";
+    extractKey = "private_key";
+    outputDir = "/run/secrets/wireguard_private_key";
+    services = [ "wireguard-wg0" ];
  };
  networking.wireguard = {
    enable = true;
    useNetworkd = true;
@@ -13,7 +16,7 @@
      ips = [ "10.69.222.3/24" ];
      mtu = 1384;
      listenPort = 51820;
-      privateKeyFile = config.sops.secrets.wireguard_private_key.path;
+      privateKeyFile = "/run/secrets/wireguard_private_key";
      peers = [
        {
          name = "docker2.t-juice.club";

View File

@@ -8,6 +8,9 @@
  ];
  nixpkgs.config.allowUnfree = true;
+  homelab.host.role = "bastion";
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/sda";

View File

@@ -1,76 +0,0 @@
{
pkgs,
...
}:
{
imports = [
./hardware-configuration.nix
../../system
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot = {
loader.systemd-boot = {
enable = true;
configurationLimit = 5;
memtest86.enable = true;
};
loader.efi.canTouchEfiVariables = true;
supportedFilesystems = [ "nfs" ];
};
networking.hostName = "media1";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."enp2s0" = {
matchConfig.Name = "enp2s0";
address = [
"10.69.12.82/24"
];
routes = [
{ Gateway = "10.69.12.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
# Graphics
hardware.graphics = {
enable = true;
extraPackages = with pkgs; [
libvdpau-va-gl
libva-vdpau-driver
];
};
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,7 +0,0 @@
{ ... }:
{
imports = [
./configuration.nix
./kodi.nix
];
}

View File

@@ -1,33 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[
(modulesPath + "/installer/scan/not-detected.nix")
];
boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "usb_storage" "usbhid" "sd_mod" "rtsx_usb_sdmmc" ];
boot.initrd.kernelModules = [ ];
boot.kernelModules = [ "kvm-amd" ];
boot.extraModulePackages = [ ];
fileSystems."/" =
{
device = "/dev/disk/by-uuid/3e7c311c-b1a3-4be7-b8bf-e497cba64302";
fsType = "btrfs";
};
fileSystems."/boot" =
{
device = "/dev/disk/by-uuid/F0D7-E5C1";
fsType = "vfat";
options = [ "fmask=0022" "dmask=0022" ];
};
swapDevices =
[{ device = "/dev/disk/by-uuid/1a06a36f-da61-4d36-b94e-b852836c328a"; }];
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
hardware.cpu.amd.updateMicrocode = lib.mkDefault config.hardware.enableRedistributableFirmware;
}

View File

@@ -1,29 +0,0 @@
{ pkgs, ... }:
let
kodipkg = pkgs.kodi-wayland.withPackages (
p: with p; [
jellyfin
]
);
in
{
users.users.kodi = {
isNormalUser = true;
description = "Kodi Media Center user";
};
#services.xserver = {
# enable = true;
#};
services.cage = {
enable = true;
user = "kodi";
environment = {
XKB_DEFAULT_LAYOUT = "no";
};
program = "${kodipkg}/bin/kodi";
};
environment.systemPackages = with pkgs; [
firefox
];
}

View File

@@ -56,7 +56,16 @@
  services.qemuGuest.enable = true;
-  sops.secrets."backup_helper_secret" = { };
+  # Vault secrets management
+  vault.enable = true;
+  homelab.deploy.enable = true;
+  vault.secrets.backup-helper = {
+    secretPath = "shared/backup/password";
+    extractKey = "password";
+    outputDir = "/run/secrets/backup_helper_secret";
+    services = [ "restic-backups-grafana" "restic-backups-grafana-db" ];
+  };
  services.restic.backups.grafana = {
    repository = "rest:http://10.69.12.52:8000/backup-nix";
    passwordFile = "/run/secrets/backup_helper_secret";
@@ -64,6 +73,7 @@
    timerConfig = {
      OnCalendar = "daily";
      Persistent = true;
+      RandomizedDelaySec = "2h";
    };
    pruneOpts = [
      "--keep-daily 7"
@@ -80,6 +90,7 @@
    timerConfig = {
      OnCalendar = "daily";
      Persistent = true;
+      RandomizedDelaySec = "2h";
    };
    pruneOpts = [
      "--keep-daily 7"

View File

@@ -13,6 +13,8 @@
  homelab.dns.cnames = [ "nix-cache" "actions1" ];
+  homelab.host.role = "build-host";
  fileSystems."/nix" = {
    device = "/dev/disk/by-label/nixcache";
    fsType = "xfs";
@@ -52,6 +54,9 @@
    "nix-command"
    "flakes"
  ];
+  vault.enable = true;
+  homelab.deploy.enable = true;
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim

View File

@@ -1,77 +0,0 @@
{ config, lib, pkgs, ... }:
{
imports =
[
../template/hardware-configuration.nix
../../system
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";
networking.hostName = "nixos-test1";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.10/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [ "nix-command" "flakes" ];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
# Secrets
# Backup
sops.secrets."backup_helper_secret" = { };
services.restic.backups.test = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
paths = [
"/etc/machine-id"
"/etc/os-release"
];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
};
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,5 +0,0 @@
{ ... }: {
imports = [
./configuration.nix
];
}

View File

@@ -47,6 +47,14 @@
    "nix-command"
    "flakes"
  ];
+  vault.enable = true;
+  homelab.deploy.enable = true;
+  homelab.host = {
+    role = "dns";
+    labels.dns_role = "primary";
+  };
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim

View File

@@ -47,6 +47,14 @@
    "nix-command"
    "flakes"
  ];
+  vault.enable = true;
+  homelab.deploy.enable = true;
+  homelab.host = {
+    role = "dns";
+    labels.dns_role = "secondary";
+  };
  environment.systemPackages = with pkgs; [
    vim
    wget

View File

@@ -1,56 +0,0 @@
{ config, lib, pkgs, ... }:
{
imports =
[
../template/hardware-configuration.nix
../../system
../../services/ns/master-authorative.nix
../../services/ns/resolver.nix
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";
networking.hostName = "ns3";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = false;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.7/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [ "nix-command" "flakes" ];
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,5 +0,0 @@
{ ... }: {
imports = [
./configuration.nix
];
}

View File

@@ -1,36 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[
(modulesPath + "/profiles/qemu-guest.nix")
];
boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
boot.initrd.kernelModules = [ ];
# boot.kernelModules = [ ];
# boot.extraModulePackages = [ ];
fileSystems."/" =
{
device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
fsType = "xfs";
};
fileSystems."/boot" =
{
device = "/dev/disk/by-uuid/BC07-3B7A";
fsType = "vfat";
};
swapDevices =
[{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
# (the default) this is the recommended approach. When using systemd-networkd it's
# still possible to use this option, but it's recommended to use it in conjunction
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}

View File

@@ -1,56 +0,0 @@
{ config, lib, pkgs, ... }:
{
imports =
[
../template/hardware-configuration.nix
../../system
../../services/ns/secondary-authorative.nix
../../services/ns/resolver.nix
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";
networking.hostName = "ns4";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = false;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.8/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [ "nix-command" "flakes" ];
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,5 +0,0 @@
{ ... }: {
imports = [
./configuration.nix
];
}

View File

@@ -1,36 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[
(modulesPath + "/profiles/qemu-guest.nix")
];
boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
boot.initrd.kernelModules = [ ];
# boot.kernelModules = [ ];
# boot.extraModulePackages = [ ];
fileSystems."/" =
{
device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
fsType = "xfs";
};
fileSystems."/boot" =
{
device = "/dev/disk/by-uuid/BC07-3B7A";
fsType = "vfat";
};
swapDevices =
[{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
# (the default) this is the recommended approach. When using systemd-networkd it's
# still possible to use this option, but it's recommended to use it in conjunction
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}

View File

@@ -11,6 +11,11 @@
  # Template host - exclude from DNS zone generation
  homelab.dns.enable = false;
+  homelab.host = {
+    tier = "test";
+    priority = "low";
+  };
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/sda";

View File

@@ -1,7 +1,9 @@
 { pkgs, ... }:
 let
-  prepare-host-script = pkgs.writeShellScriptBin "prepare-host.sh"
-    ''
+  prepare-host-script = pkgs.writeShellApplication {
+    name = "prepare-host.sh";
+    runtimeInputs = [ pkgs.age ];
+    text = ''
      echo "Removing machine-id"
      rm -f /etc/machine-id || true
@@ -24,8 +26,9 @@ let
      echo "Generate age key"
      rm -rf /var/lib/sops-nix || true
      mkdir -p /var/lib/sops-nix
-      ${pkgs.age}/bin/age-keygen -o /var/lib/sops-nix/key.txt
+      age-keygen -o /var/lib/sops-nix/key.txt
    '';
+  };
 in
 {
  environment.systemPackages = [ prepare-host-script ];

View File

@@ -32,6 +32,11 @@
    datasource_list = [ "ConfigDrive" "NoCloud" ];
  };
+  homelab.host = {
+    tier = "test";
+    priority = "low";
+  };
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
  networking.hostName = "nixos-template2";

View File

@@ -1,7 +1,9 @@
 { pkgs, ... }:
 let
-  prepare-host-script = pkgs.writeShellScriptBin "prepare-host.sh"
-    ''
+  prepare-host-script = pkgs.writeShellApplication {
+    name = "prepare-host.sh";
+    runtimeInputs = [ pkgs.age ];
+    text = ''
      echo "Removing machine-id"
      rm -f /etc/machine-id || true
@@ -24,8 +26,9 @@ let
      echo "Generate age key"
      rm -rf /var/lib/sops-nix || true
      mkdir -p /var/lib/sops-nix
-      ${pkgs.age}/bin/age-keygen -o /var/lib/sops-nix/key.txt
+      age-keygen -o /var/lib/sops-nix/key.txt
    '';
+  };
 in
 {
  environment.systemPackages = [ prepare-host-script ];

View File

@@ -16,6 +16,11 @@
  # Test VM - exclude from DNS zone generation
  homelab.dns.enable = false;
+  homelab.host = {
+    tier = "test";
+    priority = "low";
+  };
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

View File

@@ -16,6 +16,8 @@
  homelab.dns.cnames = [ "vault" ];
+  homelab.host.role = "vault";
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

View File

@@ -5,6 +5,32 @@
  ...
 }:
+let
+  vault-test-script = pkgs.writeShellApplication {
+    name = "vault-test";
+    text = ''
+      echo "=== Vault Secret Test ==="
+      echo "Secret path: hosts/vaulttest01/test-service"
+      if [ -f /run/secrets/test-service/password ]; then
+        echo " Password file exists"
+        echo "Password length: $(wc -c < /run/secrets/test-service/password)"
+      else
+        echo " Password file missing!"
+        exit 1
+      fi
+      if [ -d /var/lib/vault/cache/test-service ]; then
+        echo " Cache directory exists"
+      else
+        echo " Cache directory missing!"
+        exit 1
+      fi
+      echo "Test successful!"
+    '';
+  };
+in
 {
  imports = [
    ../template2/hardware-configuration.nix
@@ -13,6 +39,12 @@
    ../../common/vm
  ];
+  homelab.host = {
+    tier = "test";
+    priority = "low";
+    role = "vault";
+  };
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
@@ -49,6 +81,7 @@
    vim
    wget
    git
+    htop # test deploy verification
  ];
  # Open ports in the firewall.
@@ -60,6 +93,7 @@
  # Testing config
  # Enable Vault secrets management
  vault.enable = true;
+  homelab.deploy.enable = true;
  # Define a test secret
  vault.secrets.test-service = {
@@ -79,27 +113,7 @@
      Type = "oneshot";
      RemainAfterExit = true;
-      ExecStart = pkgs.writeShellScript "vault-test" ''
-        echo "=== Vault Secret Test ==="
-        echo "Secret path: hosts/vaulttest01/test-service"
-        if [ -f /run/secrets/test-service/password ]; then
-          echo " Password file exists"
-          echo "Password length: $(wc -c < /run/secrets/test-service/password)"
-        else
-          echo " Password file missing!"
-          exit 1
-        fi
-        if [ -d /var/lib/vault/cache/test-service ]; then
-          echo " Cache directory exists"
-        else
-          echo " Cache directory missing!"
-          exit 1
-        fi
-        echo "Test successful!"
-      '';
+      ExecStart = lib.getExe vault-test-script;
      StandardOutput = "journal+console";
    };

View File

@@ -6,10 +6,6 @@ import subprocess
 IGNORED_HOSTS = [
    "inc1",
    "inc2",
-    "media1",
-    "nixos-test1",
-    "ns3",
-    "ns4",
    "template1",
 ]

View File

@@ -86,7 +86,7 @@ let
    , retry ? 900
    , expire ? 1209600
    , minTtl ? 120
-    , nameservers ? [ "ns1" "ns2" "ns3" ]
+    , nameservers ? [ "ns1" "ns2" ]
    , adminEmail ? "admin.test.2rjus.net"
    }:
  let

View File

@@ -1,7 +1,9 @@
 { ... }:
 {
  imports = [
+    ./deploy.nix
    ./dns.nix
+    ./host.nix
    ./monitoring.nix
  ];
 }

View File

@@ -0,0 +1,16 @@
{ config, lib, ... }:
{
options.homelab.deploy = {
enable = lib.mkEnableOption "homelab-deploy listener for NATS-based deployments";
};
config = {
assertions = [
{
assertion = config.homelab.deploy.enable -> config.vault.enable;
message = "homelab.deploy.enable requires vault.enable to be true (needed for NKey secret)";
}
];
};
}

28
modules/homelab/host.nix Normal file
View File

@@ -0,0 +1,28 @@
{ lib, ... }:
{
options.homelab.host = {
tier = lib.mkOption {
type = lib.types.enum [ "test" "prod" ];
default = "prod";
description = "Deployment tier - controls which credentials can deploy to this host";
};
priority = lib.mkOption {
type = lib.types.enum [ "high" "low" ];
default = "high";
description = "Alerting priority - low priority hosts have relaxed thresholds";
};
role = lib.mkOption {
type = lib.types.nullOr lib.types.str;
default = null;
description = "Primary role of this host (dns, database, monitoring, etc.)";
};
labels = lib.mkOption {
type = lib.types.attrsOf lib.types.str;
default = { };
description = "Additional free-form labels (e.g., dns_role = 'primary')";
};
};
}

View File

@@ -0,0 +1,78 @@
---
# Provision OpenBao AppRole credentials to an existing host
# Usage: nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=ha1
# Requires: BAO_ADDR and BAO_TOKEN environment variables set
- name: Fetch AppRole credentials from OpenBao
hosts: localhost
connection: local
gather_facts: false
vars:
vault_addr: "{{ lookup('env', 'BAO_ADDR') | default('https://vault01.home.2rjus.net:8200', true) }}"
domain: "home.2rjus.net"
tasks:
- name: Validate hostname is provided
ansible.builtin.fail:
msg: "hostname variable is required. Use: -e hostname=<name>"
when: hostname is not defined
- name: Get role-id for host
ansible.builtin.command:
cmd: "bao read -field=role_id auth/approle/role/{{ hostname }}/role-id"
environment:
BAO_ADDR: "{{ vault_addr }}"
BAO_SKIP_VERIFY: "1"
register: role_id_result
changed_when: false
- name: Generate secret-id for host
ansible.builtin.command:
cmd: "bao write -field=secret_id -f auth/approle/role/{{ hostname }}/secret-id"
environment:
BAO_ADDR: "{{ vault_addr }}"
BAO_SKIP_VERIFY: "1"
register: secret_id_result
changed_when: true
- name: Add target host to inventory
ansible.builtin.add_host:
name: "{{ hostname }}.{{ domain }}"
groups: vault_target
ansible_user: root
vault_role_id: "{{ role_id_result.stdout }}"
vault_secret_id: "{{ secret_id_result.stdout }}"
- name: Deploy AppRole credentials to host
hosts: vault_target
gather_facts: false
tasks:
- name: Create AppRole directory
ansible.builtin.file:
path: /var/lib/vault/approle
state: directory
mode: "0700"
owner: root
group: root
- name: Write role-id
ansible.builtin.copy:
content: "{{ vault_role_id }}"
dest: /var/lib/vault/approle/role-id
mode: "0600"
owner: root
group: root
- name: Write secret-id
ansible.builtin.copy:
content: "{{ vault_secret_id }}"
dest: /var/lib/vault/approle/secret-id
mode: "0600"
owner: root
group: root
- name: Display success
ansible.builtin.debug:
msg: "AppRole credentials provisioned to {{ inventory_hostname }}"

View File

@@ -13,6 +13,11 @@
    ../../common/vm
  ];
+  # Host metadata (adjust as needed)
+  homelab.host = {
+    tier = "test"; # Start in test tier, move to prod after validation
+  };
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

View File

@@ -137,9 +137,9 @@ fetch_from_vault() {
  # Write each secret key to a separate file
  log "Writing secrets to $OUTPUT_DIR"
-  echo "$SECRET_DATA" | jq -r 'to_entries[] | "\(.key)\n\(.value)"' | while read -r key; read -r value; do
-    echo -n "$value" > "$OUTPUT_DIR/$key"
-    echo -n "$value" > "$CACHE_DIR/$key"
+  for key in $(echo "$SECRET_DATA" | jq -r 'keys[]'); do
+    echo "$SECRET_DATA" | jq -j --arg k "$key" '.[$k]' > "$OUTPUT_DIR/$key"
+    echo "$SECRET_DATA" | jq -j --arg k "$key" '.[$k]' > "$CACHE_DIR/$key"
    chmod 600 "$OUTPUT_DIR/$key"
    chmod 600 "$CACHE_DIR/$key"
    log " - Wrote secret key: $key"

View File

@@ -1,29 +0,0 @@
authelia_ldap_password: ENC[AES256_GCM,data:x2UDMpqQKoRVSlDSmK5XiC9x4/WWzmjk7cwtFA70waAD7xYQfXEOV+AeX1LlFfj0qHYrhyn//TLsa+tJzb7HPEAfl8vYR4MdkVFOm5vjPWWoF5Ul8ZVn8+B1VJLbiXkexv0/hfXL8NMzEcp/pF4H0Yei7xaKezu9OPtGzKufHws=,iv:88RXaOj8Zy9fGeDLAE0ItY7TKCCzxn6F0+kU5+Zy/XU=,tag:yPdCJ9d139iO6J97thVVgA==,type:str]
authelia_jwt_secret: ENC[AES256_GCM,data:9ZHkT2o5KZLmml95g8HZce8fNBmaWtRn+175Gaz0KhsndNl3zdgGq3hydRuoZuEgLVsherJImVmb5DQAZpv04lUEsDKCYeFNwAyYl4Go2jCp1fI53fdcRCKlNVZA37pMi4AYaCoe8vIl/cwPOOBDEwK5raOBnklCzVERoO0B8a0=,iv:9CTWCw0ImZR0OSrl2znbhpRHlzAxA5Cpcy98JeH9Z+Y=,tag:L+0xKqiwXTi7XiDYWA1Bcw==,type:str]
authelia_storage_encryption_key_file: ENC[AES256_GCM,data:RfbcQK8+rrW/Krd2rbDfgo7YI2YvQKqpLuDtk5DZJNNhw4giBh5nFp/8LNeo8r39/oiJLYTe6FjTLBu72TZz2wWrJFsBqjwQ/3TfATQGdLUsaXXRDr88ezHLTiYvEHIHJhUS5qsr7VMwBam5e7YGWBe5sGZCE/nX41ijyPUjtOY=,iv:sayYcAC38cApAtL+cDhgGNjWaHn+furKRowKL6AmfdU=,tag:1IZpnlpvDWGLLpZyU9iJUw==,type:str]
authelia_session_secret: ENC[AES256_GCM,data:4PaLv4RRA7/9Z8QzETXLwo3OctJ0mvzQkYmHsGGF97nq9QeB3eo0xj4FyuCbkJGGZ/huAyRgmFBTyscY3wgxoc4t+8BdlYcSbefEk1/xRFjmG8ooXLKhvGJ5c6t72KJRcqsEGTiC0l9CFJWQ2qYcjM4dPwG8z0tjUZ6j25Zfx4M=,iv:QORJkf0w6iyuRHM/xuql1s7K75Qa49ygq+lwHfrm9rk=,tag:/HZ/qI80fKjmuTRwIwmX8g==,type:str]
lldap_user_pass: ENC[AES256_GCM,data:56gF7uqVQ+/J5/lY/N904Q==,iv:qtY1XhHs4WWA4kPY56NigPvX4OslO0koZepgdv947zg=,tag:UDmJs8FPXskp7rUS2Sxinw==,type:str]
sops:
age:
- recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBlc1dxK3FKU2ZGWTNGUmxZ
aWx1NngySjVHclJTd3hXejJRTmVHRExReHcwCk55c0xMbGcyTktySkJZdHRZbzhK
bEI3RzBHQkROTU1qWXBoU1RqTXppdVkKLS0tIHkwZ0QyNTMydWRqUlBtTEdhZ05r
YVpuT1JadnlyN1hqNnJxYzVPT3pXN1UKDCeIv0xv+5pcoDdtYc+rYjwi8SLrqWth
vdWepxmV2edajZRqcwFEC9weOZ1j2lh7Z3hR6RSN/+X3sFpqkpw+Yg==
-----END AGE ENCRYPTED FILE-----
- recipient: age16prza00sqzuhwwcyakj6z4hvwkruwkqpmmrsn94a5ucgpkelncdq2ldctk
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAvbU0wNmFLelRmNmJTRlho
dTEwVXZqUVI5NHZkb1QyNUZ4R0pLVFZWVDM4CkhVc00zY2FKaVdNRXdGVk1ranpG
MlRWWGJmd2FWeFE1dXU4WHVFL0FHZ3MKLS0tIGt2ZWlaOW5wNkJnQVkrTDZWTnY0
RW5HRjA3cERCUU1CVWZhck12SGhTRUkK6k/zQ87TIETYouRBby7ujtwgpqIPKKv+
2aLJW6lSWMVzL/f3ZrIeg12tJjHs3f44EXR6j3tfLfSKog2iL8Y57w==
-----END AGE ENCRYPTED FILE-----
lastmodified: "2025-12-06T10:03:56Z"
mac: ENC[AES256_GCM,data:SRNqx5n+xg/cNGiyze3CGKufox3IuXmOKLqNRDeJhBNMBHC1iYYCjRdHEVXsl7XSiYe51dSwjV0KrJa/SG1pRVkuyT+xyPrTjT2/DyXN7A/CESSAkBIwI7lkZmIf8DkxB3CELF1PgjIr1o2isxlBnkAnhEBTxQ7t8AzpcH7I5yU=,iv:P3FGQurZrL0ed5UuBPRFk11T0VRFtL6xI4iQ4LmYTec=,tag:8gQL08ojjIMyCl5E0Qs/Ww==,type:str]
unencrypted_suffix: _unencrypted
version: 3.11.0

View File

@@ -7,146 +7,101 @@ sops:
    - recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
      enc: |
        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBnbC90WWJiRXRPZ1VUVWhO
-        azc5R2lGeDhoRmQydXBnYlltbE81ajFQNW0wClRJNC9iaFV0NDRKRkw2Mm1vOHpN
-        dVhnUm1nbElQRGQ4dmkxQ2FWdEdpdDAKLS0tIG9GNEpuZUFUQkVXbjZPREo0aEh4
-        ZVMyY0Y0Zldvd244eSt2RVZDeUZKWmcKGQ7jq50qiXPLKCHq751Y2SA79vEjbSbt
-        yhRiakVEjwf9A+/iSNvXYAr/tnKaYC+NTA7F6AKmYpBcrzlBGU68KA==
+        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuWXhzQWFmeCt1R05jREcz
+        Ui9HZFN5dkxHNVE0RVJGZUJUa3hKK2sxdkhBCktYcGpLeGZIQzZIV3ZZWGs3YzF1
+        T09sUEhPWkRkOWZFWkltQXBlM1lQV1UKLS0tIERRSlRUYW5QeW9TVjJFSmorOWNI
+        ZytmaEhzMjVhRXI1S0hielF0NlBrMmcK4I1PtSf7tSvSIJxWBjTnfBCO8GEFHbuZ
+        BkZskr5fRnWUIs72ZOGoTAVSO5ZNiBglOZ8YChl4Vz1U7bvdOCt0bw==
        -----END AGE ENCRYPTED FILE-----
    - recipient: age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
      enc: |
        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBRTWFBRVRKeXR0UUloQ3FK
-        Rmhsak45aFZBVUp4Szk5eHJhZmswV3JUcHh3Cis0N09JaCtOZE1pQUM5blg4WDY5
-        Q0ZGajJSZnJVQzdJK0dxZjJNWHZkbGsKLS0tIEVtRVJROTlWdWl0cFlNZmZkajM5
-        N3FpdU56WlFWaC9QYU5Kc1o2a1VkT0UK2Utr9mvK8If4JhjzD+l06xZxdE3nbvCO
-        NixMiYDhuQ/a55Fu0653jqd35i3CI3HukzEI9G5zLEeCcXxTKR5Bjg==
+        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQcXM0RHlGcmZrYW4yNGZs
+        S1ZqQzVaYmQ4MGhGaTFMUVIwOTk5K0tZZjB3ClN0QkhVeHRrNXZHdmZWMzFBRnJ6
+        WTFtaWZyRmx2TitkOXkrVkFiYVd3RncKLS0tIExpeGUvY1VpODNDL2NCaUhtZkp0
+        cGNVZTI3UGxlNWdFWVZMd3FlS3pDR3cKBulaMeonV++pArXOg3ilgKnW/51IyT6Z
+        vH9HOJUix+ryEwDIcjv4aWx9pYDHthPFZUDC25kLYG91WrJFQOo2oA==
        -----END AGE ENCRYPTED FILE-----
    - recipient: age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
      enc: |
        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBFQVk0aUw0aStuOWhFMk5a
-        UVJ5YWg2WjU2eVFUWDlobEIrRDlZV3dxelc0Clo0N3lvOUZNL3YrM2l3Y21VaUQz
-        MTV5djdPWTBIUXFXVDZpZitRTVhMbVEKLS0tIFluV1NFTzd0cFFaR0RwVkhlSmNm
-        VGdZNDlsUGI3cTQ1Tk9XRWtDSE1wNWMKQI226dcROyp/GprVZKtM0R57m5WbJyuR
-        UZO74NqiDr7nxKfw+tHCfDLh94rbC1iP4jRiaQjDgfDDxviafSbGBA==
+        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBabTdsZWxZQjV2TGx2YjNM
+        ZTgzWktqTjY0S0M3bFpNZXlDRDk5TSt3V2k0CjdWWTN0TlRlK1RpUm9xYW03MFFG
+        aWN4a3o4VUVnYzBDd2FrelUraWtrMTAKLS0tIE1vTGpKYkhzcWErWDRreml2QmE2
+        ZkNIWERKb1drdVR6MTBSTnVmdm51VEkKVNDYdyBSrUT7dUn6a4eF7ELQ2B2Pk6V9
+        Z5fbT75ibuyX1JO315/gl2P/FhxmlRW1K6e+04gQe2R/t/3H11Q7YQ==
        -----END AGE ENCRYPTED FILE-----
-    - recipient: age1snmhmpavqy7xddmw4nuny0u4xusqmnqxqarjmghkm5zaluff84eq5xatrd
-      enc: |
-        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA4WVBzazE3VkNDWXUwMk5x
-        NnZtL3N3THVBQytxZzdZNUhCeThURFBLdjBVClBpZjd5L3lKYjRZNVF2Z3hibW5R
-        YTdTR0NzaVp4VEZlTjlaTHVFNXNSSUEKLS0tIDBGbmhGUFNJQ21zeW1SbWtyWWh0
-        QkFXN2g5TlhBbnlmbW1aSUJQL1FOaWMKTv8OoaTxyG8XhKGZNs4aFR/9SXQ+RG6w
-        +fxiUx7xQnOIYag9YQYfuAgoGzOaj/ha+i18WkQnx9LAgrjCTd+ejA==
-        -----END AGE ENCRYPTED FILE-----
-    - recipient: age12a3nyvjs8jrwmpkf3tgawel3nwcklwsr35ktmytnvhpawqwzrsfqpgcy0q
-      enc: |
-        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAzcnVxL09JTEdsZ0FUb2VH
-        a3dSY09uRFFCYnJXQno3YUFhMlpueHJreXdFCjQ4UWdRak5yK0VIT2lYUjBVK2h5
-        RFJmMTlyVEpnS3JxdkE4ckp1UHpLM2sKLS0tIHVyZXRTSHQxL1p1dUxMKzkyV0pW
-        a2o0bG9vZUtmckdYTkhLSVZtZVRtNlUKpALeaeaH4/wFUPPGsNArTAIIJOvBWWDp
-        MUYPJjqLqBVmWzIgCexM2jsDOhtcCV26MXjzTXmZhthaGJMSp23kMQ==
-        -----END AGE ENCRYPTED FILE-----
    - recipient: age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
      enc: |
        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA5M0liYUY1UHRHUDdvN3ds
-        TVdiWDlrWFROSVdRTy9nOHFOUTdmTmlHSzE4CjBpU3gzdjdWaHQzNXRMRkxPdVps
-        TEZXbVlYenUwc3o0TXRnaXg4MmVHQmcKLS0tIDlVeWQ4V0hjbWJqRlNUL2hOWVhp
-        WEJvZWZzbWZFeWZVeWJ1c3pVOWI3MFUKN2QfuOaod5IBKkBkYzi3jvPty+8PRGMJ
-        mozL7qydsb0bAZJtAwcL7HWCr1axar/Ertce0yMqhuthJ5bciVD5xQ==
+        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBVSFhDOFRVbnZWbVlQaG5G
+        U0NWekU0NzI1SlpRN0NVS1hPN210MXY3Z244CmtFemR5OUpzdlBzMHBUV3g0SFFo
+        eUtqNThXZDJ2b01yVVVuOFdwQVo2Qm8KLS0tIHpXRWd3OEpPRkpaVDNDTEJLMWEv
+        ZlZtaFpBdzF0YXFmdjNkNUR3YkxBZU0KAub+HF/OBZQR9bx/SVadZcL6Ms+NQ7yq
+        21HCcDTWyWHbN4ymUrIYXci1A/0tTOrQL9Mkvaz7IJh4VdHLPZrwwA==
        -----END AGE ENCRYPTED FILE-----
-    - recipient: age1gcyfkxh4fq5zdp0dh484aj82ksz66wrly7qhnpv0r0p576sn9ekse8e9ju
-      enc: |
-        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA5L3NmcFMyUUpLOW9mOW9v
-        VXhMTjl5SEFsZ0pzR3lHb1VJL0IzUUxCckdzCnltZnVySkszVUtwbDdQNHAwVWxl
-        V2xJU1BqSG0yMk5sTkpKRTIvc2JORFUKLS0tIHNydWZjdGg3clNpMDhGSGR6VVVh
-        VU1Rbk9ybGRJOG1ETEh4a1orNUY2Z00KJmdp+wLHd+86RJJ/G0QbLp4BEDPXfE9o
-        VZhPPSC6qtUcFV2z6rqSHSpsHPTlgzbCRqX39iePNhfQ2o0lR2P2zQ==
-        -----END AGE ENCRYPTED FILE-----
-    - recipient: age1g5luz2rtel3surgzuh62rkvtey7lythrvfenyq954vmeyfpxjqkqdj3wt8
-      enc: |
-        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBbnhXSG5qdVJHSjNmQ3Qx
-        Yk9zSVBkVTQyb3luYXgwbFJWbG9xK2tWZUdrCkh2MktoWmFOdkRldFNlQW1EMm9t
-        ZHJRa3QrRzh0UElSNGkvSWcyYTUxZzgKLS0tIGdPT2dwWU9LbERYZGxzUTNEUHE1
-        TmlIdWJjbmFvdnVQSURqUTBwbW9EL00Kaiy5ZGgHjKgAGvzbdjbwNExLf4MGDtiE
-        NJEvnmNWkQyEhtx9YzUteY02Tl/D7zBzAWHlV3RjAWTNIwLmm7QgCw==
-        -----END AGE ENCRYPTED FILE-----
    - recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
      enc: |
        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBVSDFIa1hNZU1BNWxHckk1
-        UEdJT282Y054eVNpb3VOZ2t3S2NndTkycXdNCk1sNk5uL2xpbXk1MG95dVM1OWVD
-        TldUWmsrSmxGeHYweWhGWXpSaE0xRmcKLS0tIFlVbEp2UU1kM0hhbHlSZm96TFl2
-        TkVaK0xHN1NxNzlpUVYyY2RpdisrQVkKG+DlyZVruH64nB9UtCPMbXhmRHj+zpr6
-        CX4JOTXbUsueZIA4J/N93+d2J3V6yauoRYwCSl/JXX/gaSeSxF4z3A==
+        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBWkhBL1NTdjFDeEhQcEgv
+        Z3c3Z213L2ZhWGo0Qm5Zd1A1RTBDY3plUkh3CkNWV2ZtNWkrUjB0eWFzUlVtbHlk
+        WTdTQjN4eDIzY0c0dyt6ajVXZ0krd1UKLS0tIHB4aEJqTTRMenV3UkFkTGEySjQ2
+        YVM1a3ZPdUU4T244UU0rc3hVQ3NYczQK10wug4kTjsvv/iOPWi5WrVZMOYUq4/Mf
+        oXS4sikXeUsqH1T2LUBjVnUieSneQVn7puYZlN+cpDQ0XdK/RZ+91A==
        -----END AGE ENCRYPTED FILE-----
    - recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
      enc: |
        -----BEGIN AGE ENCRYPTED FILE-----
-        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB3YWxPRTNaVTNLb2tYSzZ5
+        YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBYcEtHbjNWRkdodUxYdHRn
ZmVMYXk2MlVXYzNtZGFJNlJLR2FIVWhKb1RFCmx5bXozeExlbEZBQzhpSHA0T1JE MDBMU08zWDlKa0Z4cHJvc28rZk5pUjhnMjE0CmdzRmVGWDlYQ052Wm1zWnlYSFV6
dFpHRm8rcFl1QjZ2anRGYjVxeGJqc0EKLS0tIGVibzRnRTA3Vk5yR3c4QVFsdy95 dURQK3JSbThxQlg3M2ZaL1hGRzVuL0UKLS0tIEI3UGZvbEpvRS9aR2J2Tnc1YmxZ
bG1tejcremFiUjZaL3hmc1gwYzJIOGMKFmXmY60vABYlpfop2F020SaOEwV4TNya aUY5Q2MrdHNQWDJNaGt5MWx6MVRrRVEKRPxyAekGHFMKs0Z6spVDayBA4EtPk18e
F0tgrIqbufU1Yw4RhxPdBb9Wv1cQu25lcqQLh1i4VH9BSaWKk6TDEA== jiFc97BGVtC5IoSu4icq3ZpKOdxymnkqKEt0YP/p/JTC+8MKvTJFQw==
-----END AGE ENCRYPTED FILE----- -----END AGE ENCRYPTED FILE-----
- recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey - recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
enc: | enc: |
-----BEGIN AGE ENCRYPTED FILE----- -----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAzRXM1VUJPNm90UUx4UEdZ YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQL3ZMUkI1dUV1T2tTSHhn
cDY5czVQaGl0MEdIMStjTnphTmR5ZkFWTDBjClhTd0xmaHNWUXo3NXR6eEUzTkg2 SjhyQ3dKTytoaDBNcit1VHpwVGUzWVNpdjBnCklYZWtBYzBpcGxZSDBvM2tIZm9H
L3BqT1N6bTNsYitmTGVpREtiWEpzdlEKLS0tIFUybTczSlRNbDkxRVZjSnFvdmtq bTFjb1ZCaDkrOU1JODVBVTBTbmxFbmcKLS0tIGtGcS9kejZPZlhHRXI5QnI5Wm9Q
MVdRU3RPSHNqUzJzQWl1VVkyczFaencK72ZmWJIcfBTXlezmefvWeCGOC1BhpkXO VjMxTDdWZEltWThKVDl0S24yWHJxZHcKgzH79zT2I7ZgyTbbbvIhLN/rEcfiomJH
bm+X+ihzNfktuOCl6ZIMo2n4aJ3hYakrMp4npO10a6s4o/ldqeiATg== oSZDFvPiXlhPgy8bRyyq3l47CVpWbUI2Y7DFXRuODpLUirt3K3TmCA==
-----END AGE ENCRYPTED FILE----- -----END AGE ENCRYPTED FILE-----
- recipient: age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq - recipient: age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
enc: | enc: |
-----BEGIN AGE ENCRYPTED FILE----- -----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBOL3F3OWRYVVdxWncwWmlk YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBPcm9zUm1XUkpLWm1Jb3Uw
SnloWFdscE02L3ZRa0JGcFlwSU9tU3JRakhnCjZyTnR3T051Tmt2NGM2dkFaNGJz RncveGozOW5SRThEM1Y4SFF5RDdxUEhZTUE4CjVESHE5R3JZK0krOXZDL0RHR0oy
WVRnNDdNN0ozYXJnK0t4ZW5JRVQ2YzQKLS0tIFk0cFBxcVFETERNTGowMThJcDNR Z3JKaEpydjRjeFFHck1ic2JTRU5yZTQKLS0tIGY2ck56eG95YnpDYlNqUDh5RVp1
UW0wUUlFeHovSS9qYU5BRkJ6dnNjcWcKh2WcrmxsqMZeQ0/2HsaHeSqGsU3ILynU U3dRYkNleUtsQU1LMWpDbitJbnRIem8K+27HRtZihG8+k7ZC33XVfuXDFjC1e8lA
SHBziWHGlFoNirCVjljh/Mw4DM8v66i0ztIQtWV5cFaFhu4kVda5jA== kffmxp9kOEShZF3IKmAjVHFBiPXRyGk3fGPyQLmSMK2UOOfCy/a/qA==
-----END AGE ENCRYPTED FILE----- -----END AGE ENCRYPTED FILE-----
- recipient: age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq - recipient: age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
enc: | enc: |
-----BEGIN AGE ENCRYPTED FILE----- -----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB6ZkovUkMzdmhOUGpZUC91 YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBTZHlldDdSOEhjTklCSXQr
d1JFZGk1T2hOS2dlVFNHRGJKVTUwdUhpQmg0CnEybzlRdjBLcjVEckNtR0xzMDVk U2pXajFwZnNqQzZOTzY5b3lkMzlyREhXRWo4CmxId2F6NkNqeHNCSWNrcUJIY0Nw
dURWbFdnTXk1alV5cjRSMkRrZ21vTjAKLS0tIEtDZlFCTGdVMU1PUWdBYTVOcTU4 cGF6NXJaQnovK1FYSXQ2TkJSTFloTUEKLS0tIHRhWk5aZ0lDVkZaZEJobm9FTDNw
ZkZHYmJiTUdJUGZhTFdLM1EzdU9wNmsK3AqFfycJfrBpvnjccN1srNiVBCv107rt a29sZE1GL2ZQSk0vUEc1ZGhkUlpNRkEK9tfe7cNOznSKgxshd5Z6TQiNKp+XW6XH
b/O5zcqKGR3Nzey7zAhlxasPCRKARyBTo292ScZ03QMU8p8HIukdzg== VvPgMqMitgiDYnUPj10bYo3kqhd0xZH2IhLXMnZnqqQ0I23zfPiNaw==
-----END AGE ENCRYPTED FILE----- -----END AGE ENCRYPTED FILE-----
- recipient: age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv - recipient: age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
enc: | enc: |
-----BEGIN AGE ENCRYPTED FILE----- -----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBlOVNVNmFzbTE2NmdiM1dP YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB5bk9NVjJNWmMxUGd3cXRx
TlhuTGYyQWlWeFlkaVU3Tml2aDNJbmxXVnlZCmJSb001OVJTaGpRcllzN2JSWDFF amZ5SWJ3dHpHcnM4UHJxdmh6NnhFVmJQdldzCm95dHN3R21qSkE4Vm9VTnVPREp3
b1MyYjdKZys4ZHRoUmFhdG1oYTA2RzQKLS0tIEhGeU9YcW9Wc0ZZK3I5UjB0RHFm dUQyS1B4MWhhdmd3dk5LQ0htZEtpTWMKLS0tIGFaa3MxVExFYk1MY2loOFBvWm1o
bW1ucjZtYXFkT1A4bGszamFxaG5IaHMKqHuaWFi/ImnbDOZ9VisIN7jqplAYV8fo L0NoRStkeW9VZVdpWlhteC8yTnRmMUkKMYjUdE1rGgVR29FnhJ5OEVjTB1Rh5Mtu
y3PeVX34LcYE0d8cxbvH8CTs/Ubirt6P1obrmAL9W9Y0ozpqdqQSjA== M/DvlhW3a7tZU8nDF3IgG2GE5xOXZMDO9QWGdB8zO2RJZAr3Q+YIlA==
-----END AGE ENCRYPTED FILE----- -----END AGE ENCRYPTED FILE-----
- recipient: age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga - recipient: age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
enc: | enc: |
-----BEGIN AGE ENCRYPTED FILE----- -----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBXbXo4UWhoMUQxc1lMcnNB YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBU0xYMnhqOE0wdXdleStF
VWc1MUJuS3NnVnh4U254TE0wSDJTMzFSM3lrCnhHbmk1N0VqTlViT2dtZndGT1pn THcrY2NBQzNoRHdYTXY3ZmM5YXRZZkQ4aUZnCm9ad0IxSWxYT1JBd2RseUdVT1pi
NmpPc01iMjk3TXZLU1htZjBvd2NBK2sKLS0tIEN3dGlRZHF5Ykgybjl6MzRBVUJ0 UXBuNzFxVlN0OWNTQU5BV2NiVEV0RUUKLS0tIGJHY0dzSDczUzcrV0RpTjE0czEy
Rm92SGdwanFHZlp6U00wMDUzL3MrMzgKtCJqy+BfDMFQMHaIVPlFyzALBsb4Ekls cWZMNUNlTzBRcEV5MjlRV1BsWGhoaUUKGhYaH8I0oPCfrbs7HbQKVOF/99rg3HXv
+r7ofZ1ZjSomBljYxVPhKE9XaZJe6bqICEhJBCpODyxavfh8HmxHDQ== RRTXUI71/ejKIuxehOvifClQc3nUW73bWkASFQ0guUvO4R+c0xOgUg==
-----END AGE ENCRYPTED FILE-----
- recipient: age16prza00sqzuhwwcyakj6z4hvwkruwkqpmmrsn94a5ucgpkelncdq2ldctk
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBiQTRXTHljd2UrbFJOWUI4
WGRYcEVrZDJGM3hpVVNmVXlSREYzc1FHRlhFCjZHa2VTTzFHR1RXRmllT1huVDNV
UkRKaEQrWjF5eHpiaUg1NExnME5veFkKLS0tIFpZY1RrOVNTTjU0N2Y1dFN6QWpX
MTM3NDJrV1JZNE5pWGNLMUg1OFFwYUUKMx0hpB3iunnCbJ/+zWetdp1NI/LsrUTe
J84+aDoe7/WJYT0FLMlC0RK80txm6ztVygoyRdN0cRKx1z3KqPmavw==
-----END AGE ENCRYPTED FILE----- -----END AGE ENCRYPTED FILE-----
lastmodified: "2025-02-11T21:18:22Z" lastmodified: "2025-02-11T21:18:22Z"
mac: ENC[AES256_GCM,data:5//boMp1awc/2XAkSASSCuobpkxa0E6IKf3GR8xHpMoCD30FJsCwV7PgX3fR8OuLEhOJ7UguqMNQdNqG37RMacreuDmI1J8oCFKp+3M2j4kCbXaEo8bw7WAtyjUez+SAXKzZWYmBibH0KOy6jdt+v0fdgy5hMBT4IFDofYRsyD0=,iv:6pD+SLwncpmal/FR4U8It2njvaQfUzzpALBCxa0NyME=,tag:4QN8ZFjdqck5ZgulF+FtbA==,type:str] mac: ENC[AES256_GCM,data:5//boMp1awc/2XAkSASSCuobpkxa0E6IKf3GR8xHpMoCD30FJsCwV7PgX3fR8OuLEhOJ7UguqMNQdNqG37RMacreuDmI1J8oCFKp+3M2j4kCbXaEo8bw7WAtyjUez+SAXKzZWYmBibH0KOy6jdt+v0fdgy5hMBT4IFDofYRsyD0=,iv:6pD+SLwncpmal/FR4U8It2njvaQfUzzpALBCxa0NyME=,tag:4QN8ZFjdqck5ZgulF+FtbA==,type:str]

View File

@@ -1,8 +1,10 @@
{ pkgs, config, ... }: { pkgs, config, ... }:
{ {
sops.secrets."actions-token-1" = { vault.secrets.actions-token = {
sopsFile = ../../secrets/nix-cache01/actions_token_1; secretPath = "hosts/nix-cache01/actions-token";
format = "binary"; extractKey = "token";
outputDir = "/run/secrets/actions-token-1";
services = [ "gitea-runner-actions1" ];
}; };
virtualisation.podman = { virtualisation.podman = {
@@ -13,7 +15,7 @@
services.gitea-actions-runner.instances = { services.gitea-actions-runner.instances = {
actions1 = { actions1 = {
enable = true; enable = true;
tokenFile = config.sops.secrets.actions-token-1.path; tokenFile = "/run/secrets/actions-token-1";
name = "actions1.home.2rjus.net"; name = "actions1.home.2rjus.net";
settings = { settings = {
log = { log = {
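
The sops-to-Vault migration above follows a pattern repeated throughout this changeset. A minimal sketch of the shape, assuming the vault.secrets module options shown later in this diff (all names here are illustrative):

    # Hypothetical migration of a single-value secret:
    vault.secrets.example-token = {
      secretPath = "hosts/examplehost/example-token"; # KV path in Vault
      extractKey = "token";                 # write just this key as a plain file
      outputDir = "/run/secrets/example-token"; # a file path when extractKey is set
      services = [ "example.service" ];     # units that consume the secret
    };

The consuming service then points at the plain path (here /run/secrets/example-token) instead of config.sops.secrets.*.path.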

View File

@@ -1,87 +0,0 @@
{ config, ... }:
{
sops.secrets.authelia_ldap_password = {
format = "yaml";
sopsFile = ../../secrets/auth01/secrets.yaml;
key = "authelia_ldap_password";
restartUnits = [ "authelia-auth.service" ];
owner = "authelia-auth";
group = "authelia-auth";
};
sops.secrets.authelia_jwt_secret = {
format = "yaml";
sopsFile = ../../secrets/auth01/secrets.yaml;
key = "authelia_jwt_secret";
restartUnits = [ "authelia-auth.service" ];
owner = "authelia-auth";
group = "authelia-auth";
};
sops.secrets.authelia_storage_encryption_key_file = {
format = "yaml";
key = "authelia_storage_encryption_key_file";
sopsFile = ../../secrets/auth01/secrets.yaml;
restartUnits = [ "authelia-auth.service" ];
owner = "authelia-auth";
group = "authelia-auth";
};
sops.secrets.authelia_session_secret = {
format = "yaml";
key = "authelia_session_secret";
sopsFile = ../../secrets/auth01/secrets.yaml;
restartUnits = [ "authelia-auth.service" ];
owner = "authelia-auth";
group = "authelia-auth";
};
services.authelia.instances."auth" = {
enable = true;
environmentVariables = {
AUTHELIA_AUTHENTICATION_BACKEND_LDAP_PASSWORD_FILE =
config.sops.secrets.authelia_ldap_password.path;
AUTHELIA_SESSION_SECRET_FILE = config.sops.secrets.authelia_session_secret.path;
};
secrets = {
jwtSecretFile = config.sops.secrets.authelia_jwt_secret.path;
storageEncryptionKeyFile = config.sops.secrets.authelia_storage_encryption_key_file.path;
};
settings = {
access_control = {
default_policy = "two_factor";
};
session = {
# secret = "{{- fileContent \"${config.sops.secrets.authelia_session_secret.path}\" }}";
cookies = [
{
domain = "home.2rjus.net";
authelia_url = "https://auth.home.2rjus.net";
default_redirection_url = "https://dashboard.home.2rjus.net";
name = "authelia_session";
same_site = "lax";
inactivity = "1h";
expiration = "24h";
remember_me = "30d";
}
];
};
notifier = {
filesystem.filename = "/var/lib/authelia-auth/notification.txt";
};
storage = {
local.path = "/var/lib/authelia-auth/db.sqlite3";
};
authentication_backend = {
password_reset = {
disable = false;
};
ldap = {
address = "ldap://127.0.0.1:3890";
implementation = "lldap";
timeout = "5s";
base_dn = "dc=home,dc=2rjus,dc=net";
user = "uid=authelia_ldap_user,ou=people,dc=home,dc=2rjus,dc=net";
# password = "{{- fileContent \"${config.sops.secrets.authelia_ldap_password.path}\" -}}";
};
};
};
};
}

View File

@@ -69,6 +69,44 @@
frontend = true; frontend = true;
permit_join = false; permit_join = false;
serial.port = "/dev/ttyUSB0"; serial.port = "/dev/ttyUSB0";
# Inline device configuration (replaces devices.yaml)
# This allows declarative management and homeassistant overrides
devices = {
# Temperature sensors with battery fix
# WSDCGQ12LM sensors report battery: 0 due to firmware quirk
# Override battery calculation using voltage (mV): (voltage - 2100) / 9
"0x54ef441000a547bd" = {
friendly_name = "0x54ef441000a547bd";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
};
"0x54ef441000a54d3c" = {
friendly_name = "0x54ef441000a54d3c";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
};
"0x54ef441000a564b6" = {
friendly_name = "temp_server";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
};
# Other sensors
"0x00124b0025495463".friendly_name = "0x00124b0025495463"; # SONOFF temp sensor (battery works)
"0x54ef4410009ac117".friendly_name = "0x54ef4410009ac117"; # Water leak sensor
# Buttons
"0x54ef441000a1f907".friendly_name = "btn_livingroom";
"0x54ef441000a1ee71".friendly_name = "btn_bedroom";
# Philips Hue lights
"0x001788010d1b599a" = {
friendly_name = "0x001788010d1b599a";
transition = 5;
};
"0x001788010d253b99".friendly_name = "0x001788010d253b99";
"0x001788010e371aa4".friendly_name = "0x001788010e371aa4";
"0x001788010dc5f003".friendly_name = "0x001788010dc5f003";
"0x001788010dc35d06".friendly_name = "0x001788010dc35d06";
};
}; };
}; };
} }
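
A quick sanity check of the battery formula, written as plain Nix for illustration (the actual template runs as Jinja inside Home Assistant):

    # (voltage_mV - 2100) / 9, clamped to 0..100
    let
      batteryPercent = mv:
        let raw = (mv - 2100) / 9; # Nix integer division
        in if raw > 100 then 100 else if raw < 0 then 0 else raw;
    in
      map batteryPercent [ 3000 2850 2100 ] # => [ 100 83 0 ]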

View File

@@ -86,22 +86,6 @@
} }
reverse_proxy http://jelly01.home.2rjus.net:8096 reverse_proxy http://jelly01.home.2rjus.net:8096
} }
lldap.home.2rjus.net {
log {
output file /var/log/caddy/auth.log {
mode 644
}
}
reverse_proxy http://auth01.home.2rjus.net:17170
}
auth.home.2rjus.net {
log {
output file /var/log/caddy/auth.log {
mode 644
}
}
reverse_proxy http://auth01.home.2rjus.net:9091
}
pyroscope.home.2rjus.net { pyroscope.home.2rjus.net {
log { log {
output file /var/log/caddy/pyroscope.log { output file /var/log/caddy/pyroscope.log {

View File

@@ -1,38 +0,0 @@
{ config, ... }:
{
sops.secrets.lldap_user_pass = {
format = "yaml";
key = "lldap_user_pass";
sopsFile = ../../secrets/auth01/secrets.yaml;
restartUnits = [ "lldap.service" ];
group = "acme";
mode = "0440";
};
services.lldap = {
enable = true;
settings = {
ldap_base_dn = "dc=home,dc=2rjus,dc=net";
ldap_user_email = "admin@home.2rjus.net";
ldap_user_dn = "admin";
ldap_user_pass_file = config.sops.secrets.lldap_user_pass.path;
ldaps_options = {
enabled = true;
port = 6360;
cert_file = "/var/lib/acme/auth01.home.2rjus.net/cert.pem";
key_file = "/var/lib/acme/auth01.home.2rjus.net/key.pem";
};
};
};
systemd.services.lldap = {
serviceConfig = {
SupplementaryGroups = [ "acme" ];
};
};
security.acme.certs."auth01.home.2rjus.net" = {
listenHTTP = ":80";
reloadServices = [ "lldap" ];
extraDomainNames = [ "ldap.home.2rjus.net" ];
enableDebugLogs = true;
};
}

View File

@@ -1,12 +1,18 @@
{ pkgs, config, ... }: { pkgs, config, ... }:
{ {
sops.secrets."nats_nkey" = { }; vault.secrets.nats-nkey = {
secretPath = "shared/nats/nkey";
extractKey = "nkey";
outputDir = "/run/secrets/nats_nkey";
services = [ "alerttonotify" ];
};
systemd.services."alerttonotify" = { systemd.services."alerttonotify" = {
enable = true; enable = true;
wants = [ "network-online.target" ]; wants = [ "network-online.target" ];
after = [ after = [
"network-online.target" "network-online.target"
"sops-nix.service" "vault-secret-nats-nkey.service"
]; ];
wantedBy = [ "multi-user.target" ]; wantedBy = [ "multi-user.target" ];
restartIfChanged = true; restartIfChanged = true;
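
Note the ordering change here: each vault.secrets.<name> entry appears to generate a oneshot unit named vault-secret-<name>.service (the same convention shows up for homelab-deploy-nkey further down), so consumers swap their after = [ "sops-nix.service" ] dependency for the matching vault-secret-*.service unit.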

View File

@@ -1,14 +1,82 @@
{ self, lib, ... }: { self, lib, pkgs, ... }:
let let
monLib = import ../../lib/monitoring.nix { inherit lib; }; monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ./external-targets.nix; externalTargets = import ./external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets; nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets; autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# Script to fetch AppRole token for Prometheus to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token";
runtimeInputs = [ pkgs.curl pkgs.jq ];
text = ''
VAULT_ADDR="https://vault01.home.2rjus.net:8200"
APPROLE_DIR="/var/lib/vault/approle"
OUTPUT_FILE="/run/secrets/prometheus/openbao-token"
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
echo "AppRole credentials not found at $APPROLE_DIR" >&2
exit 1
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
AUTH_RESPONSE=$(curl -sf -k -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login")
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "Failed to extract Vault token from response" >&2
exit 1
fi
# Write token to file
mkdir -p "$(dirname "$OUTPUT_FILE")"
echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
chown prometheus:prometheus "$OUTPUT_FILE"
chmod 0400 "$OUTPUT_FILE"
echo "Successfully fetched OpenBao token"
'';
};
in in
{ {
# Systemd service to fetch AppRole token for Prometheus OpenBao scraping
# The token is used to authenticate when scraping /v1/sys/metrics
systemd.services.prometheus-openbao-token = {
description = "Fetch OpenBao token for Prometheus metrics scraping";
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
before = [ "prometheus.service" ];
requiredBy = [ "prometheus.service" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe fetchOpenbaoToken;
};
};
# Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
systemd.timers.prometheus-openbao-token = {
description = "Refresh OpenBao token for Prometheus";
wantedBy = [ "timers.target" ];
timerConfig = {
OnBootSec = "5min";
OnUnitActiveSec = "30min";
RandomizedDelaySec = "5min";
};
};
services.prometheus = { services.prometheus = {
enable = true; enable = true;
# syntax-only check because we use external credential files (e.g., openbao-token)
checkConfig = "syntax-only";
alertmanager = { alertmanager = {
enable = true; enable = true;
configuration = { configuration = {
@@ -61,6 +129,15 @@ in
} }
]; ];
} }
# Systemd exporter on all hosts (same targets, different port)
{
job_name = "systemd-exporter";
static_configs = [
{
targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
}
];
}
# Local monitoring services (not auto-generated) # Local monitoring services (not auto-generated)
{ {
job_name = "prometheus"; job_name = "prometheus";
@@ -152,6 +229,22 @@ in
} }
]; ];
} }
# OpenBao metrics with bearer token auth
{
job_name = "openbao";
scheme = "https";
metrics_path = "/v1/sys/metrics";
params = {
format = [ "prometheus" ];
};
static_configs = [{
targets = [ "vault01.home.2rjus.net:8200" ];
}];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/prometheus/openbao-token";
};
}
] ++ autoScrapeConfigs; ] ++ autoScrapeConfigs;
pushgateway = { pushgateway = {
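
Two details worth noting in this diff: the systemd-exporter job reuses the node-exporter target list by rewriting the port, and the token timer (30min interval plus up to 5min jitter) keeps refreshes well inside the 1-hour AppRole token TTL. A quick illustration of the port rewrite, using a hypothetical target:

    # evaluates to "ns1.home.2rjus.net:9558"
    builtins.replaceStrings [ ":9100" ] [ ":9558" ] "ns1.home.2rjus.net:9100"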

View File

@@ -1,14 +1,16 @@
{ config, ... }: { config, ... }:
{ {
sops.secrets.pve_exporter = { vault.secrets.pve-exporter = {
format = "yaml"; secretPath = "hosts/monitoring01/pve-exporter";
sopsFile = ../../secrets/monitoring01/pve-exporter.yaml; extractKey = "config";
key = ""; outputDir = "/run/secrets/pve_exporter";
mode = "0444"; mode = "0444";
services = [ "prometheus-pve-exporter" ];
}; };
services.prometheus.exporters.pve = { services.prometheus.exporters.pve = {
enable = true; enable = true;
configFile = config.sops.secrets.pve_exporter.path; configFile = "/run/secrets/pve_exporter";
collectors = { collectors = {
cluster = false; cluster = false;
replication = false; replication = false;

View File

@@ -18,13 +18,21 @@ groups:
summary: "Disk space low on {{ $labels.instance }}" summary: "Disk space low on {{ $labels.instance }}"
description: "Disk space is low on {{ $labels.instance }}. Please check." description: "Disk space is low on {{ $labels.instance }}. Please check."
- alert: high_cpu_load - alert: high_cpu_load
expr: max(node_load5{}) by (instance) > (count by (instance)(node_cpu_seconds_total{mode="idle"}) * 0.7) expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
for: 15m for: 15m
labels: labels:
severity: warning severity: warning
annotations: annotations:
summary: "High CPU load on {{ $labels.instance }}" summary: "High CPU load on {{ $labels.instance }}"
description: "CPU load is high on {{ $labels.instance }}. Please check." description: "CPU load is high on {{ $labels.instance }}. Please check."
- alert: high_cpu_load
expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
for: 2h
labels:
severity: warning
annotations:
summary: "High CPU load on {{ $labels.instance }}"
description: "CPU load is high on {{ $labels.instance }}. Please check."
- alert: low_memory - alert: low_memory
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
for: 2m for: 2m
@@ -67,12 +75,12 @@ groups:
description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours." description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
- alert: systemd_not_running - alert: systemd_not_running
expr: node_systemd_system_running == 0 expr: node_systemd_system_running == 0
for: 5m for: 10m
labels: labels:
severity: critical severity: warning
annotations: annotations:
summary: "Systemd not in running state on {{ $labels.instance }}" summary: "Systemd not in running state on {{ $labels.instance }}"
description: "Systemd is not in running state on {{ $labels.instance }}. The system may be in a degraded state." description: "Systemd is not in running state on {{ $labels.instance }}. The system may be in a degraded state. Note: brief degraded states during nixos-rebuild are normal."
- alert: high_file_descriptors - alert: high_file_descriptors
expr: node_filefd_allocated / node_filefd_maximum > 0.8 expr: node_filefd_allocated / node_filefd_maximum > 0.8
for: 5m for: 5m
@@ -107,6 +115,14 @@ groups:
annotations: annotations:
summary: "NSD not running on {{ $labels.instance }}" summary: "NSD not running on {{ $labels.instance }}"
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes." description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
- alert: unbound_low_cache_hit_ratio
expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
for: 15m
labels:
severity: warning
annotations:
summary: "Low DNS cache hit ratio on {{ $labels.instance }}"
description: "Unbound cache hit ratio is below 50% on {{ $labels.instance }}."
- name: http_proxy_rules - name: http_proxy_rules
rules: rules:
- alert: caddy_down - alert: caddy_down
@@ -143,6 +159,14 @@ groups:
annotations: annotations:
summary: "NATS not running on {{ $labels.instance }}" summary: "NATS not running on {{ $labels.instance }}"
description: "NATS has been down on {{ $labels.instance }} more than 5 minutes." description: "NATS has been down on {{ $labels.instance }} more than 5 minutes."
- alert: nats_slow_consumers
expr: nats_core_slow_consumer_count > 0
for: 5m
labels:
severity: warning
annotations:
summary: "NATS has slow consumers on {{ $labels.instance }}"
description: "NATS has {{ $value }} slow consumers on {{ $labels.instance }}."
- name: nix_cache_rules - name: nix_cache_rules
rules: rules:
- alert: build_flakes_service_not_active_recently - alert: build_flakes_service_not_active_recently
@@ -202,6 +226,14 @@ groups:
annotations: annotations:
summary: "Mosquitto not running on {{ $labels.instance }}" summary: "Mosquitto not running on {{ $labels.instance }}"
description: "Mosquitto has been down on {{ $labels.instance }} more than 5 minutes." description: "Mosquitto has been down on {{ $labels.instance }} more than 5 minutes."
- alert: zigbee_sensor_stale
expr: (time() - hass_last_updated_time_seconds{entity=~"sensor\\.(0x[0-9a-f]+|temp_server)_temperature"}) > 7200
for: 5m
labels:
severity: warning
annotations:
summary: "Zigbee sensor {{ $labels.friendly_name }} is stale"
description: "Zigbee temperature sensor {{ $labels.entity }} has not reported data for over 2 hours. The sensor may have a dead battery or connectivity issues."
- name: smartctl_rules - name: smartctl_rules
rules: rules:
- alert: smart_critical_warning - alert: smart_critical_warning
@@ -356,3 +388,65 @@ groups:
annotations: annotations:
summary: "Proxmox VM {{ $labels.id }} is stopped" summary: "Proxmox VM {{ $labels.id }} is stopped"
description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped." description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."
- name: postgres_rules
rules:
- alert: postgres_down
expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "PostgreSQL not running on {{ $labels.instance }}"
description: "PostgreSQL has been down on {{ $labels.instance }} more than 5 minutes."
- alert: postgres_exporter_down
expr: up{job="postgres"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL exporter down on {{ $labels.instance }}"
description: "Cannot scrape PostgreSQL metrics from {{ $labels.instance }}."
- alert: postgres_high_connections
expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL connection pool near exhaustion on {{ $labels.instance }}"
description: "PostgreSQL is using over 80% of max_connections on {{ $labels.instance }}."
- name: jellyfin_rules
rules:
- alert: jellyfin_down
expr: up{job="jellyfin"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Jellyfin not responding on {{ $labels.instance }}"
description: "Cannot scrape Jellyfin metrics from {{ $labels.instance }} for 5 minutes."
- name: vault_rules
rules:
- alert: openbao_down
expr: node_systemd_unit_state{instance="vault01.home.2rjus.net:9100", name="openbao.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "OpenBao not running on {{ $labels.instance }}"
description: "OpenBao has been down on {{ $labels.instance }} more than 5 minutes."
- alert: openbao_sealed
expr: vault_core_unsealed == 0
for: 5m
labels:
severity: critical
annotations:
summary: "OpenBao is sealed on {{ $labels.instance }}"
description: "OpenBao has been sealed on {{ $labels.instance }} for more than 5 minutes."
- alert: openbao_scrape_down
expr: up{job="openbao"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Cannot scrape OpenBao metrics from {{ $labels.instance }}"
description: "OpenBao metrics endpoint is not responding on {{ $labels.instance }}."

View File

@@ -1,10 +1,28 @@
{ ... }: { ... }:
{ {
homelab.monitoring.scrapeTargets = [
{
job_name = "nats";
port = 7777;
}
];
services.prometheus.exporters.nats = {
enable = true;
url = "http://localhost:8222";
extraFlags = [
"-varz" # General server info
"-connz" # Connection info
"-jsz=all" # JetStream info
];
};
services.nats = { services.nats = {
enable = true; enable = true;
jetstream = true; jetstream = true;
serverName = "nats1"; serverName = "nats1";
settings = { settings = {
http_port = 8222;
accounts = { accounts = {
ADMIN = { ADMIN = {
users = [ users = [
@@ -22,6 +40,48 @@
} }
]; ];
}; };
DEPLOY = {
users = [
# Shared listener (all hosts use this)
{
nkey = "UCCZJSUGLCSLBBKHBPL4QA66TUMQUGIXGLIFTWDEH43MGWM3LDD232X4";
permissions = {
subscribe = [
"deploy.test.>"
"deploy.prod.>"
"deploy.discover"
];
publish = [
"deploy.responses.>"
"deploy.discover"
];
};
}
# Test deployer (MCP without admin)
{
nkey = "UBR66CX2ZNY5XNVQF5VBG4WFAF54LSGUYCUNNCEYRILDQ4NXDAD2THZU";
permissions = {
publish = [
"deploy.test.>"
"deploy.discover"
];
subscribe = [
"deploy.responses.>"
"deploy.discover"
];
};
}
# Admin deployer (full access)
{
nkey = "UD2BFB7DLM67P5UUVCKBUJMCHADIZLGGVUNSRLZE2ZC66FW2XT44P73Y";
permissions = {
publish = [ "deploy.>" ];
subscribe = [ "deploy.>" ];
};
}
];
};
}; };
system_account = "ADMIN"; system_account = "ADMIN";
jetstream = { jetstream = {
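
The DEPLOY account separates listeners from deployers purely via per-NKey publish/subscribe permissions. Adding another tier would follow the same shape; a sketch with a placeholder public key:

    # Hypothetical staging-tier deployer (placeholder NKey):
    {
      nkey = "U...STAGING_DEPLOYER_PUBLIC_KEY";
      permissions = {
        publish = [ "deploy.staging.>" "deploy.discover" ];
        subscribe = [ "deploy.responses.>" "deploy.discover" ];
      };
    }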

View File

@@ -1,14 +1,16 @@
{ pkgs, config, ... }: { pkgs, config, ... }:
{ {
sops.secrets."cache-secret" = { vault.secrets.cache-secret = {
sopsFile = ../../secrets/nix-cache01/cache-secret; secretPath = "hosts/nix-cache01/cache-secret";
format = "binary"; extractKey = "key";
outputDir = "/run/secrets/cache-secret";
services = [ "harmonia" ];
}; };
services.harmonia = { services.harmonia = {
enable = true; enable = true;
package = pkgs.unstable.harmonia; package = pkgs.unstable.harmonia;
signKeyPaths = [ config.sops.secrets.cache-secret.path ]; signKeyPaths = [ "/run/secrets/cache-secret" ];
}; };
systemd.services.harmonia = { systemd.services.harmonia = {
environment.RUST_LOG = "info,actix_web=debug"; environment.RUST_LOG = "info,actix_web=debug";

View File

@@ -12,8 +12,11 @@ let
}; };
in in
{ {
sops.secrets.ns_xfer_key = { vault.secrets.ns-xfer-key = {
path = "/etc/nsd/xfer.key"; secretPath = "shared/dns/xfer-key";
extractKey = "key";
outputDir = "/etc/nsd/xfer.key";
services = [ "nsd" ];
}; };
networking.firewall.allowedTCPPorts = [ 8053 ]; networking.firewall.allowedTCPPorts = [ 8053 ];

View File

@@ -1,10 +1,24 @@
{ pkgs, ... }: { { pkgs, ... }: {
homelab.monitoring.scrapeTargets = [{
job_name = "unbound";
port = 9167;
}];
networking.firewall.allowedTCPPorts = [ networking.firewall.allowedTCPPorts = [
53 53
]; ];
networking.firewall.allowedUDPPorts = [ networking.firewall.allowedUDPPorts = [
53 53
]; ];
services.prometheus.exporters.unbound = {
enable = true;
unbound.host = "unix:///run/unbound/unbound.ctl";
};
# Grant exporter access to unbound socket
systemd.services.prometheus-unbound-exporter.serviceConfig.SupplementaryGroups = [ "unbound" ];
services.unbound = { services.unbound = {
enable = true; enable = true;
@@ -23,6 +37,11 @@
do-ip6 = "no"; do-ip6 = "no";
do-udp = "yes"; do-udp = "yes";
do-tcp = "yes"; do-tcp = "yes";
extended-statistics = true;
};
remote-control = {
control-enable = true;
control-interface = "/run/unbound/unbound.ctl";
}; };
stub-zone = { stub-zone = {
name = "home.2rjus.net"; name = "home.2rjus.net";

View File

@@ -12,8 +12,11 @@ let
}; };
in in
{ {
sops.secrets.ns_xfer_key = { vault.secrets.ns-xfer-key = {
path = "/etc/nsd/xfer.key"; secretPath = "shared/dns/xfer-key";
extractKey = "key";
outputDir = "/etc/nsd/xfer.key";
services = [ "nsd" ];
}; };
networking.firewall.allowedTCPPorts = [ 8053 ]; networking.firewall.allowedTCPPorts = [ 8053 ];
networking.firewall.allowedUDPPorts = [ 8053 ]; networking.firewall.allowedUDPPorts = [ 8053 ];

View File

@@ -1,5 +1,15 @@
{ pkgs, ... }: { pkgs, ... }:
{ {
homelab.monitoring.scrapeTargets = [{
job_name = "postgres";
port = 9187;
}];
services.prometheus.exporters.postgres = {
enable = true;
runAsLocalSuperUser = true; # Use peer auth as postgres user
};
services.postgresql = { services.postgresql = {
enable = true; enable = true;
enableJIT = true; enableJIT = true;

View File

@@ -166,6 +166,11 @@ in
settings = { settings = {
ui = true; ui = true;
telemetry = {
prometheus_retention_time = "60s";
disable_hostname = true;
};
storage.file.path = "/var/lib/openbao"; storage.file.path = "/var/lib/openbao";
listener.default = { listener.default = {
type = "tcp"; type = "tcp";

View File

@@ -3,7 +3,9 @@
imports = [ imports = [
./acme.nix ./acme.nix
./autoupgrade.nix ./autoupgrade.nix
./homelab-deploy.nix
./monitoring ./monitoring
./motd.nix
./packages.nix ./packages.nix
./nix.nix ./nix.nix
./root-user.nix ./root-user.nix
@@ -11,7 +13,5 @@
./sops.nix ./sops.nix
./sshd.nix ./sshd.nix
./vault-secrets.nix ./vault-secrets.nix
../modules/homelab
]; ];
} }

system/homelab-deploy.nix Normal file
View File

@@ -0,0 +1,37 @@
{ config, lib, ... }:
let
hostCfg = config.homelab.host;
in
{
config = lib.mkIf config.homelab.deploy.enable {
# Fetch listener NKey from Vault
vault.secrets.homelab-deploy-nkey = {
secretPath = "shared/homelab-deploy/listener-nkey";
extractKey = "nkey";
};
# Enable homelab-deploy listener
services.homelab-deploy.listener = {
enable = true;
tier = hostCfg.tier;
role = hostCfg.role;
natsUrl = "nats://nats1.home.2rjus.net:4222";
nkeyFile = "/run/secrets/homelab-deploy-nkey";
flakeUrl = "git+https://git.t-juice.club/torjus/nixos-servers.git";
metrics.enable = true;
};
# Expose metrics for Prometheus scraping
homelab.monitoring.scrapeTargets = [{
job_name = "homelab-deploy";
port = 9972;
}];
# Ensure listener starts after vault secret is available
systemd.services.homelab-deploy-listener = {
after = [ "vault-secret-homelab-deploy-nkey.service" ];
requires = [ "vault-secret-homelab-deploy-nkey.service" ];
};
};
}
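
Since the whole module is gated on homelab.deploy.enable and fetches its NKey from Vault, opting a host in should presumably be just (illustrative):

    # hypothetical host configuration:
    {
      homelab.deploy.enable = true; # assumes the host already has Vault secrets set up
    }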

View File

@@ -9,4 +9,30 @@
"processes" "processes"
]; ];
}; };
services.prometheus.exporters.systemd = {
enable = true;
# Default port: 9558
extraFlags = [
"--systemd.collector.enable-restart-count"
"--systemd.collector.enable-ip-accounting"
];
};
services.prometheus.exporters.nixos = {
enable = true;
# Default port: 9971
flake = {
enable = true;
url = "git+https://git.t-juice.club/torjus/nixos-servers.git";
};
};
# Register nixos-exporter as a Prometheus scrape target
homelab.monitoring.scrapeTargets = [
{
job_name = "nixos-exporter";
port = 9971;
}
];
} }

system/motd.nix Normal file
View File

@@ -0,0 +1,28 @@
{ config, lib, self, ... }:
let
hostname = config.networking.hostName;
domain = config.networking.domain or "";
fqdn = if domain != "" then "${hostname}.${domain}" else hostname;
# Get commit hash (handles both clean and dirty trees)
shortRev = self.shortRev or self.dirtyShortRev or "unknown";
# Format timestamp from the flake's lastModifiedDate
# lastModifiedDate is in format "YYYYMMDDHHMMSS"
dateStr = self.sourceInfo.lastModifiedDate or "unknown";
formattedDate = if dateStr != "unknown" then
"${builtins.substring 0 4 dateStr}-${builtins.substring 4 2 dateStr}-${builtins.substring 6 2 dateStr} ${builtins.substring 8 2 dateStr}:${builtins.substring 10 2 dateStr} UTC"
else
"unknown";
banner = ''
####################################
${fqdn}
Commit: ${shortRev} (${formattedDate})
####################################
'';
in
{
users.motd = lib.mkDefault banner;
}
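
A worked example of the substring formatting, using a hypothetical lastModifiedDate:

    let dateStr = "20260207080423"; in
    "${builtins.substring 0 4 dateStr}-${builtins.substring 4 2 dateStr}-${builtins.substring 6 2 dateStr} ${builtins.substring 8 2 dateStr}:${builtins.substring 10 2 dateStr} UTC"
    # => "2026-02-07 08:04 UTC"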

View File

@@ -1,8 +1,29 @@
{ lib, ... }: { lib, pkgs, ... }:
let
nixos-rebuild-test = pkgs.writeShellApplication {
name = "nixos-rebuild-test";
runtimeInputs = [ pkgs.nixos-rebuild ];
text = ''
if [ $# -lt 2 ]; then
echo "Usage: nixos-rebuild-test <action> <branch>"
echo "Example: nixos-rebuild-test boot my-feature-branch"
exit 1
fi
action="$1"
branch="$2"
shift 2
exec nixos-rebuild "$action" --flake "git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$branch" "$@"
'';
};
in
{ {
environment.systemPackages = [ nixos-rebuild-test ];
nix = { nix = {
gc = { gc = {
automatic = true; automatic = true;
options = "--delete-older-than 14d";
}; };
optimise = { optimise = {
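
The wrapper forwards any remaining flags to nixos-rebuild after consuming the first two arguments, so usage looks like (branch name illustrative):

    nixos-rebuild-test boot my-feature-branch
    nixos-rebuild-test test my-feature-branch --show-trace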

View File

@@ -1,11 +1,10 @@
{ pkgs, config, ... }: { { pkgs, config, ... }:
{
programs.zsh.enable = true; programs.zsh.enable = true;
sops.secrets.root_password_hash = { };
sops.secrets.root_password_hash.neededForUsers = true;
users.users.root = { users.users.root = {
shell = pkgs.zsh; shell = pkgs.zsh;
hashedPasswordFile = config.sops.secrets.root_password_hash.path; hashedPassword = "$y$j9T$N09APWqKc4//z9BoGyzSb0$3dMUzojSmo3/10nbIfShd6/IpaYoKdI21bfbWER3jl8";
openssh.authorizedKeys.keys = [ openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAwfb2jpKrBnCw28aevnH8HbE5YbcMXpdaVv2KmueDu6 torjus@gunter" "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAwfb2jpKrBnCw28aevnH8HbE5YbcMXpdaVv2KmueDu6 torjus@gunter"
]; ];

View File

@@ -8,6 +8,48 @@ let
# Import vault-fetch package # Import vault-fetch package
vault-fetch = pkgs.callPackage ../scripts/vault-fetch { }; vault-fetch = pkgs.callPackage ../scripts/vault-fetch { };
# Helper to create fetch scripts using writeShellApplication
mkFetchScript = name: secretCfg: pkgs.writeShellApplication {
name = "fetch-${name}";
runtimeInputs = [ vault-fetch ];
text = ''
# Set Vault environment variables
export VAULT_ADDR="${cfg.vaultAddress}"
export VAULT_SKIP_VERIFY="${if cfg.skipTlsVerify then "1" else "0"}"
'' + (if secretCfg.extractKey != null then ''
# Fetch to temporary directory, then extract single key
TMPDIR=$(mktemp -d)
trap 'rm -rf $TMPDIR' EXIT
vault-fetch \
"${secretCfg.secretPath}" \
"$TMPDIR" \
"${secretCfg.cacheDir}"
# Extract the specified key and write as a single file
if [ ! -f "$TMPDIR/${secretCfg.extractKey}" ]; then
echo "ERROR: Key '${secretCfg.extractKey}' not found in secret" >&2
exit 1
fi
# Ensure parent directory exists
mkdir -p "$(dirname "${secretCfg.outputDir}")"
cp "$TMPDIR/${secretCfg.extractKey}" "${secretCfg.outputDir}"
chown ${secretCfg.owner}:${secretCfg.group} "${secretCfg.outputDir}"
chmod ${secretCfg.mode} "${secretCfg.outputDir}"
'' else ''
# Fetch secret as directory of files
vault-fetch \
"${secretCfg.secretPath}" \
"${secretCfg.outputDir}" \
"${secretCfg.cacheDir}"
# Set ownership and permissions
chown -R ${secretCfg.owner}:${secretCfg.group} "${secretCfg.outputDir}"
chmod ${secretCfg.mode} "${secretCfg.outputDir}"/*
'');
};
# Secret configuration type # Secret configuration type
secretType = types.submodule ({ name, config, ... }: { secretType = types.submodule ({ name, config, ... }: {
options = { options = {
@@ -73,6 +115,16 @@ let
''; '';
}; };
extractKey = mkOption {
type = types.nullOr types.str;
default = null;
description = ''
Extract a single key from the vault secret JSON and write it as a
plain file instead of a directory of files. When set, outputDir
becomes a file path rather than a directory path.
'';
};
services = mkOption { services = mkOption {
type = types.listOf types.str; type = types.listOf types.str;
default = []; default = [];
@@ -152,23 +204,7 @@ in
RemainAfterExit = true; RemainAfterExit = true;
# Fetch the secret # Fetch the secret
ExecStart = pkgs.writeShellScript "fetch-${name}" '' ExecStart = lib.getExe (mkFetchScript name secretCfg);
set -euo pipefail
# Set Vault environment variables
export VAULT_ADDR="${cfg.vaultAddress}"
export VAULT_SKIP_VERIFY="${if cfg.skipTlsVerify then "1" else "0"}"
# Fetch secret using vault-fetch
${vault-fetch}/bin/vault-fetch \
"${secretCfg.secretPath}" \
"${secretCfg.outputDir}" \
"${secretCfg.cacheDir}"
# Set ownership and permissions
chown -R ${secretCfg.owner}:${secretCfg.group} "${secretCfg.outputDir}"
chmod ${secretCfg.mode} "${secretCfg.outputDir}"/*
'';
# Logging # Logging
StandardOutput = "journal"; StandardOutput = "journal";
@@ -216,7 +252,10 @@ in
[ "d /run/secrets 0755 root root -" ] ++ [ "d /run/secrets 0755 root root -" ] ++
[ "d /var/lib/vault/cache 0700 root root -" ] ++ [ "d /var/lib/vault/cache 0700 root root -" ] ++
flatten (mapAttrsToList (name: secretCfg: [ flatten (mapAttrsToList (name: secretCfg: [
"d ${secretCfg.outputDir} 0755 root root -" # When extractKey is set, outputDir is a file path - create parent dir instead
(if secretCfg.extractKey != null
then "d ${dirOf secretCfg.outputDir} 0755 root root -"
else "d ${secretCfg.outputDir} 0755 root root -")
"d ${secretCfg.cacheDir} 0700 root root -" "d ${secretCfg.cacheDir} 0700 root root -"
]) cfg.secrets); ]) cfg.secrets);
}; };
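
The two fetch modes produce different layouts on disk; illustrated for a hypothetical secret with keys token and password:

    # extractKey = null (default): outputDir is a directory of files
    #   /run/secrets/example/token
    #   /run/secrets/example/password
    # extractKey = "token": outputDir is the file itself
    #   /run/secrets/example   (contains only the token value)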

View File

@@ -4,6 +4,17 @@ resource "vault_auth_backend" "approle" {
path = "approle" path = "approle"
} }
# Shared policy for homelab-deploy (all hosts need this for NATS-based deployments)
resource "vault_policy" "homelab_deploy" {
name = "homelab-deploy"
policy = <<EOT
path "secret/data/shared/homelab-deploy/*" {
capabilities = ["read", "list"]
}
EOT
}
# Define host access policies # Define host access policies
locals { locals {
host_policies = { host_policies = {
@@ -15,6 +26,7 @@ locals {
# "secret/data/services/grafana/*", # "secret/data/services/grafana/*",
# "secret/data/shared/smtp/*" # "secret/data/shared/smtp/*"
# ] # ]
# extra_policies = ["some-other-policy"] # Optional: additional policies
# } # }
# Example: ha1 host # Example: ha1 host
@@ -25,17 +37,73 @@ locals {
# ] # ]
# } # }
# TODO: actually use this policy
"ha1" = { "ha1" = {
paths = [ paths = [
"secret/data/hosts/ha1/*", "secret/data/hosts/ha1/*",
"secret/data/shared/backup/*",
] ]
} }
# TODO: actually use this policy
"monitoring01" = { "monitoring01" = {
paths = [ paths = [
"secret/data/hosts/monitoring01/*", "secret/data/hosts/monitoring01/*",
"secret/data/shared/backup/*",
"secret/data/shared/nats/*",
]
extra_policies = ["prometheus-metrics"]
}
# Wave 1: hosts with no service secrets (only need vault.enable for future use)
"nats1" = {
paths = [
"secret/data/hosts/nats1/*",
]
}
"jelly01" = {
paths = [
"secret/data/hosts/jelly01/*",
]
}
"pgdb1" = {
paths = [
"secret/data/hosts/pgdb1/*",
]
}
# Wave 3: DNS servers
"ns1" = {
paths = [
"secret/data/hosts/ns1/*",
"secret/data/shared/dns/*",
]
}
"ns2" = {
paths = [
"secret/data/hosts/ns2/*",
"secret/data/shared/dns/*",
]
}
# Wave 4: http-proxy
"http-proxy" = {
paths = [
"secret/data/hosts/http-proxy/*",
]
}
# Wave 5: nix-cache01
"nix-cache01" = {
paths = [
"secret/data/hosts/nix-cache01/*",
]
}
"vaulttest01" = {
paths = [
"secret/data/hosts/vaulttest01/*",
] ]
} }
} }
@@ -62,7 +130,10 @@ resource "vault_approle_auth_backend_role" "hosts" {
backend = vault_auth_backend.approle.path backend = vault_auth_backend.approle.path
role_name = each.key role_name = each.key
token_policies = ["${each.key}-policy"] token_policies = concat(
["${each.key}-policy", "homelab-deploy"],
lookup(each.value, "extra_policies", [])
)
# Token configuration # Token configuration
token_ttl = 3600 # 1 hour token_ttl = 3600 # 1 hour
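
With the concat/lookup change, every host token carries its own host policy plus homelab-deploy, and any extra_policies are appended: monitoring01, for example, resolves to ["monitoring01-policy", "homelab-deploy", "prometheus-metrics"], while hosts without extras get just the first two.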

View File

@@ -0,0 +1,10 @@
# Generic policies for services (not host-specific)
resource "vault_policy" "prometheus_metrics" {
name = "prometheus-metrics"
policy = <<EOT
path "sys/metrics" {
capabilities = ["read"]
}
EOT
}

View File

@@ -35,22 +35,79 @@ locals {
# } # }
# } # }
# TODO: actually use the secret
"hosts/monitoring01/grafana-admin" = { "hosts/monitoring01/grafana-admin" = {
auto_generate = true auto_generate = true
password_length = 32 password_length = 32
} }
# TODO: actually use the secret
"hosts/ha1/mqtt-password" = { "hosts/ha1/mqtt-password" = {
auto_generate = true auto_generate = true
password_length = 24 password_length = 24
} }
# TODO: Remove after testing # TODO: Remove after testing
"hosts/vaulttest01/test-service" = { "hosts/vaulttest01/test-service" = {
auto_generate = true auto_generate = true
password_length = 32 password_length = 32
} }
# Shared backup password (auto-generated, add alongside existing restic key)
"shared/backup/password" = {
auto_generate = true
password_length = 32
}
# NATS NKey for alerttonotify
"shared/nats/nkey" = {
auto_generate = false
data = { nkey = var.nats_nkey }
}
# PVE exporter config for monitoring01
"hosts/monitoring01/pve-exporter" = {
auto_generate = false
data = { config = var.pve_exporter_config }
}
# DNS zone transfer key
"shared/dns/xfer-key" = {
auto_generate = false
data = { key = var.ns_xfer_key }
}
# WireGuard private key for http-proxy
"hosts/http-proxy/wireguard" = {
auto_generate = false
data = { private_key = var.wireguard_private_key }
}
# Nix cache signing key
"hosts/nix-cache01/cache-secret" = {
auto_generate = false
data = { key = var.cache_signing_key }
}
# Gitea Actions runner token
"hosts/nix-cache01/actions-token" = {
auto_generate = false
data = { token = var.actions_token_1 }
}
# Homelab-deploy NKeys
"shared/homelab-deploy/listener-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_listener_nkey }
}
"shared/homelab-deploy/test-deployer-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_test_deployer_nkey }
}
"shared/homelab-deploy/admin-deployer-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_admin_deployer_nkey }
}
} }
} }

View File

@@ -16,11 +16,60 @@ variable "vault_skip_tls_verify" {
default = true default = true
} }
# Example variables for manual secrets variable "nats_nkey" {
# Uncomment and add to terraform.tfvars as needed description = "NATS NKey for alerttonotify"
type = string
sensitive = true
}
variable "pve_exporter_config" {
description = "PVE exporter YAML configuration"
type = string
sensitive = true
}
variable "ns_xfer_key" {
description = "DNS zone transfer TSIG key"
type = string
sensitive = true
}
variable "wireguard_private_key" {
description = "WireGuard private key for http-proxy"
type = string
sensitive = true
}
variable "cache_signing_key" {
description = "Nix binary cache signing key"
type = string
sensitive = true
}
variable "actions_token_1" {
description = "Gitea Actions runner token"
type = string
sensitive = true
}
variable "homelab_deploy_listener_nkey" {
description = "NKey seed for homelab-deploy listeners"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "homelab_deploy_test_deployer_nkey" {
description = "NKey seed for test-tier deployer"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "homelab_deploy_admin_deployer_nkey" {
description = "NKey seed for admin deployer"
type = string
default = "PLACEHOLDER"
sensitive = true
}
# variable "smtp_password" {
# description = "SMTP password for notifications"
# type = string
# sensitive = true
# }