Compare commits


89 Commits

Author SHA1 Message Date
1f5b7b13e2 monitoring: enable restart-count and ip-accounting collectors
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:30:47 +01:00
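A minimal sketch of what enabling the two collectors named above can look like in NixOS, assuming the `services.prometheus.exporters.systemd` module and the upstream exporter's `--systemd.collector.enable-*` flag names (verify against the exporter's `--help`; this is not copied from the repo):
```nix
{
  services.prometheus.exporters.systemd = {
    enable = true;
    # Collector flags named in the commit; spellings assumed from the
    # upstream systemd_exporter, not taken from this repository.
    extraFlags = [
      "--systemd.collector.enable-restart-count"
      "--systemd.collector.enable-ip-accounting"
    ];
  };
}
```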
c53e36c3f3 Revert "monitoring: enable additional systemd-exporter collectors"
This reverts commit 04a252b857.
2026-02-06 21:30:05 +01:00
04a252b857 monitoring: enable additional systemd-exporter collectors
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enables restart-count, file-descriptor-size, and ip-accounting collectors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:28:44 +01:00
5d26f52e0d Revert "monitoring: enable cpu, memory, io collectors for systemd-exporter"
This reverts commit 506a692548.
2026-02-06 21:26:20 +01:00
506a692548 monitoring: enable cpu, memory, io collectors for systemd-exporter
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:23:19 +01:00
fa8f4f0784 docs: add notes about lib.getExe and not amending master
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:41:45 +01:00
025570dea1 monitoring: fix openbao token refresh timer not triggering
RemainAfterExit=true kept the service in "active" state, which
prevented OnUnitActiveSec from scheduling new triggers since there
was no new "activation" event. Removing it allows the service to
properly go inactive, enabling the timer to reschedule correctly.

Also fix ExecStart to use lib.getExe for proper path resolution
with writeShellApplication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:41:45 +01:00
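A minimal sketch of the pattern described in the commit above, with hypothetical unit and script names; the points are the absence of `RemainAfterExit` (so the oneshot unit returns to "inactive" and `OnUnitActiveSec` can fire again) and `lib.getExe` resolving the `bin/` path of a `writeShellApplication`:
```nix
{ lib, pkgs, ... }:
let
  # Hypothetical refresh script; the real one lives in the repo's monitoring module.
  refreshTokenScript = pkgs.writeShellApplication {
    name = "openbao-token-refresh";
    text = ''
      echo "fetch a fresh token from OpenBao here"
    '';
  };
in
{
  systemd.services.openbao-token-refresh = {
    serviceConfig = {
      Type = "oneshot";
      # No RemainAfterExit = true: the unit must return to "inactive"
      # so that OnUnitActiveSec can schedule the next trigger.
      ExecStart = lib.getExe refreshTokenScript;
    };
  };
  systemd.timers.openbao-token-refresh = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "1min";        # illustrative initial delay
      OnUnitActiveSec = "30min"; # the 30-minute refresh described in PR #23 further down
    };
  };
}
```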
15c00393f1 monitoring: increase zigbee_sensor_stale threshold to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m59s
Sensors report every ~45-50 minutes on average, so 1 hour was too tight.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 19:26:56 +01:00
787c14c7a6 docs: add dns_role label to scrape target labels plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Add proposed dns_role label to distinguish primary/secondary DNS
resolvers. This addresses the unbound_low_cache_hit_ratio alert
firing on ns2, which has a cold cache due to low traffic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:23:34 +01:00
eee3dde04f restic: add randomized delay to backup timers
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Backups to the shared restic repository were all scheduled at exactly
midnight, causing lock conflicts. Adding RandomizedDelaySec spreads
them out over a 2-hour window to prevent simultaneous access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 01:09:38 +01:00
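A hedged sketch of the change above, using the native `services.restic.backups` module (which the repo migrated to in PR #14 further down); the backup name, repository URL, paths, and secret path are placeholders:
```nix
{
  services.restic.backups.system = {                       # backup name is illustrative
    repository = "rest:https://restic.home.2rjus.net/";    # placeholder shared repository URL
    passwordFile = "/run/secrets/restic-password";         # placeholder
    paths = [ "/var/lib" ];                                 # placeholder
    timerConfig = {
      OnCalendar = "daily";        # "daily" fires at 00:00
      RandomizedDelaySec = "2h";   # spread hosts over a 2-hour window to avoid repo lock conflicts
    };
  };
}
```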
682b07b977 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/bf922a59c5c9998a6584645f7d0de689512e444c?narHash=sha256-ksTL7P9QC1WfZasNlaAdLOzqD8x5EPyods69YBqxSfk%3D' (2026-02-04)
  → 'github:nixos/nixpkgs/00c21e4c93d963c50d4c0c89bfa84ed6e0694df2?narHash=sha256-AYqlWrX09%2BHvGs8zM6ebZ1pwUqjkfpnv8mewYwAo%2BiM%3D' (2026-02-04)
2026-02-06 00:01:04 +00:00
70661ac3d9 Merge pull request 'home-assistant: fix zigbee battery value_template override key' (#25) from fix-zigbee-battery-template into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Periodic flake update / flake-update (push) Successful in 1m11s
Reviewed-on: #25
2026-02-05 23:56:45 +00:00
506e93a5e2 home-assistant: fix zigbee battery value_template override key
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m39s
Run nix flake check / flake-check (pull_request) Failing after 12m37s
The homeassistant override key should match the entity type in the
MQTT discovery topic path. For battery sensors, the topic is
homeassistant/sensor/<device>/battery/config, so the key should be
"battery" not "sensor_battery".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:48:30 +01:00
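A sketch of the corrected override, assuming the devices are configured inline via `services.zigbee2mqtt.settings` (as the zigbee-battery commit further down describes); the IEEE address, friendly name, and voltage-to-percent formula are illustrative:
```nix
{
  services.zigbee2mqtt.settings.devices."0x0000000000000000" = {  # placeholder IEEE address
    friendly_name = "bedroom-climate";                            # placeholder
    homeassistant = {
      # The key must be "battery" to match the discovery topic
      # homeassistant/sensor/<device>/battery/config, not "sensor_battery".
      battery = {
        value_template = "{{ (((value_json.voltage | float) - 2500) / 5) | round(0) }}";  # illustrative voltage-based estimate
      };
    };
  };
}
```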
b6c41aa910 system: add UTC suffix to MOTD commit timestamp
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:34:24 +01:00
aa6e00a327 Merge pull request 'add-nixos-rebuild-test' (#24) from add-nixos-rebuild-test into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Reviewed-on: #24
2026-02-05 23:26:34 +00:00
258e350b89 system: add MOTD banner with hostname and commit info
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m8s
Run nix flake check / flake-check (push) Failing after 3m53s
Displays FQDN and flake commit hash with timestamp on login.
Templates can override with their own MOTD via mkDefault.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:26:01 +01:00
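A minimal sketch of the pattern, assuming the flake's `self` is passed to modules via `specialArgs`; the exact formatting in the repo may differ:
```nix
{ config, lib, self, ... }:
{
  # Templates can override this default with their own MOTD.
  users.motd = lib.mkDefault ''
    ${config.networking.fqdn}
    flake: ${self.shortRev or "dirty"} (${self.lastModifiedDate or "unknown"} UTC)
  '';
}
```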
eba195c192 docs: add nixos-rebuild-test usage to CLAUDE.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:19:49 +01:00
bbb22e588e system: replace writeShellScript with writeShellApplication
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m3s
Run nix flake check / flake-check (push) Failing after 5m57s
Convert remaining writeShellScript usages to writeShellApplication for
shellcheck validation and strict bash options.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:17:24 +01:00
879e7aba60 templates: use writeShellApplication for prepare-host script
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:14:05 +01:00
39a4ea98ab system: add nixos-rebuild-test helper script
Adds a helper script deployed to all hosts for testing feature branches.
Usage: nixos-rebuild-test <action> <branch>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 00:12:16 +01:00
1d90dc2181 Merge pull request 'monitoring: use AppRole token for OpenBao metrics scraping' (#23) from fix-prometheus-openbao-token into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m21s
Reviewed-on: #23
2026-02-05 22:52:42 +00:00
e9857afc11 monitoring: use AppRole token for OpenBao metrics scraping
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m12s
Run nix flake check / flake-check (pull_request) Successful in 2m19s
Instead of creating a long-lived Vault token in Terraform (which gets
invalidated when Terraform recreates it), monitoring01 now uses its
existing AppRole credentials to fetch a fresh token for Prometheus.

Changes:
- Add prometheus-metrics policy to monitoring01's AppRole
- Remove vault_token.prometheus_metrics resource from Terraform
- Remove openbao-token KV secret from Terraform
- Add systemd service to fetch AppRole token on boot
- Add systemd timer to refresh token every 30 minutes

This ensures Prometheus always has a valid token without depending on
Terraform state or manual intervention.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:51:11 +01:00
88e9036cb4 Merge pull request 'auth01: decommission host and remove authelia/lldap services' (#22) from decommission-auth01 into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Reviewed-on: #22
2026-02-05 22:37:38 +00:00
59e1962d75 auth01: decommission host and remove authelia/lldap services
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 18m1s
Remove auth01 host configuration and associated services in preparation
for new auth stack with different provisioning system.

Removed:
- hosts/auth01/ - host configuration
- services/authelia/ - authelia service module
- services/lldap/ - lldap service module
- secrets/auth01/ - sops secrets
- Reverse proxy entries for auth and lldap
- Monitoring alert rules for authelia and lldap
- SOPS configuration for auth01

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:35:45 +01:00
3dc4422ba0 docs: add NAS integration notes to auth plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Document TrueNAS CORE LDAP integration approach (NFS-only) and
future NixOS NAS migration path with native Kanidm PAM/NSS.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:24:37 +01:00
f0963624bc docs: add auth system replacement plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Evaluate options for replacing LLDAP+Authelia with a unified auth solution.
Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 23:18:38 +01:00
7b46f94e48 Merge pull request 'zigbee-battery-fix' (#21) from zigbee-battery-fix into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #21
2026-02-05 21:51:41 +00:00
32968147b5 docs: move zigbee battery plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m17s
Run nix flake check / flake-check (pull_request) Successful in 2m19s
Updated plan with:
- Full device inventory from ha1
- Backup verification details
- Branch and commit references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:49:49 +01:00
c515a6b4e1 home-assistant: fix zigbee sensor battery reporting
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override
battery calculation using voltage via homeassistant value_template.

Also adds zigbee_sensor_stale alert for detecting dead sensors regardless
of battery reporting accuracy (1 hour threshold).

Device configuration moved from external devices.yaml to inline NixOS
config for declarative management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:41:07 +01:00
4d8b94ce83 monitoring: add collector flags to nats exporter
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m53s
The exporter requires explicit collector flags to specify what
metrics to collect.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:23:30 +01:00
8b0a4ea33a monitoring: use nats exporter instead of direct scrape
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
NATS HTTP monitoring endpoint serves JSON, not Prometheus format.
Use the prometheus-nats-exporter which queries the NATS endpoint
and exposes proper Prometheus metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 22:22:04 +01:00
5be1f43c24 Merge pull request 'monitoring-gaps-implementation' (#20) from monitoring-gaps-implementation into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #20
2026-02-05 20:57:31 +00:00
b322b1156b monitoring: fix openbao token output path
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m17s
Run nix flake check / flake-check (push) Failing after 8m57s
The outputDir with extractKey should be the full file path, not just
the parent directory.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:56:26 +01:00
3cccfc0487 monitoring: implement monitoring gaps coverage
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m36s
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics

Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.

Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:44:13 +01:00
41d4226812 mcp: add Loki URL to lab-monitoring server config
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m8s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:18:39 +01:00
351fb6f720 docs: add lab-monitoring query reference to CLAUDE.md
Document Loki log query labels and patterns, and Prometheus job names
with example queries for the lab-monitoring MCP server.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 21:18:17 +01:00
7d92c55d37 docs: update for sops-to-openbao migration completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 18m17s
Update CLAUDE.md and README.md to reflect that secrets are now managed
by OpenBao, with sops only remaining for ca. Update migration plans
with sops cleanup checklist and auth01 decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:06:21 +01:00
6d117d68ca docs: move sops-to-openbao migration plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:45:42 +01:00
a46fbdaa70 Merge pull request 'sops-to-openbao-migration' (#19) from sops-to-openbao-migration into master
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reviewed-on: #19
2026-02-05 18:44:53 +00:00
2c9d86eaf2 vault-fetch: fix multiline secret values being truncated
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m5s
Run nix flake check / flake-check (push) Failing after 16m11s
The read-based loop split multiline values on newlines, causing only
the first line to be written. Use jq -j to write each key's value
directly to files, preserving multiline content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:36:51 +01:00
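A hedged sketch of the fix described above, not the repo's actual vault-fetch script: it shows the `jq -j` per-key write that preserves multiline values, wrapped in `writeShellApplication` as this repo prefers; the helper name and argument handling are illustrative:
```nix
{ pkgs, ... }:
let
  vaultFetchKeys = pkgs.writeShellApplication {
    name = "vault-fetch-keys";       # hypothetical helper name
    runtimeInputs = [ pkgs.jq ];
    text = ''
      resp="$1"    # JSON response from a KV v2 read
      outdir="$2"  # directory receiving one file per key
      jq -r '.data.data | keys[]' "$resp" | while read -r key; do
        # -j prints raw output without a trailing newline, so multiline
        # values (certificates, SSH keys, ...) are written verbatim.
        jq -j --arg k "$key" '.data.data[$k]' "$resp" > "$outdir/$key"
      done
    '';
  };
in
{
  environment.systemPackages = [ vaultFetchKeys ];
}
```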
ccb1c3fe2e terraform: auto-generate backup password instead of manual
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m19s
Remove backup_helper_secret variable and switch shared/backup/password
to auto_generate. New password will be added alongside existing restic
repository key.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:58:39 +01:00
0700033c0a secrets: migrate all hosts from sops to OpenBao vault
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.

Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 18:43:09 +01:00
4d33018285 docs: add ha1 memory recommendation to migration plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m28s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 17:48:45 +01:00
678fd3d6de docs: add systemd-exporter findings to monitoring gaps plan
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 10:19:33 +01:00
9d74aa5c04 docs: add zigbee sensor battery monitoring findings
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 09:21:54 +01:00
fe80ec3576 docs: add monitoring gaps audit plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 20m32s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:19:20 +01:00
870fb3e532 docs: add plan for remote access to homelab services
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:53:27 +01:00
e602e8d70b docs: add plan for prometheus scrape target labels
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:36:41 +01:00
28b8d7c115 monitoring: increase high_cpu_load duration for nix-cache01 to 2h
nix-cache01 regularly hits high CPU during nix builds, causing flappy
alerts. Keep the 15m threshold for all other hosts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:28:48 +01:00
64f2688349 nix: configure gc to delete generations older than 14d
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m27s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:21:19 +01:00
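The corresponding NixOS options, for reference (standard `nix.gc` settings, not copied from the repo):
```nix
{
  nix.gc = {
    automatic = true;
    options = "--delete-older-than 14d";  # drop generations older than 14 days
  };
}
```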
09d9d71e2b docs: note to establish hostname naming conventions before migration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:04:58 +01:00
cc799f5929 docs: note USB passthrough requirement for ha1 migration
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:02:14 +01:00
0abdda8e8a docs: add plan for migrating existing hosts to opentofu
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m28s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:59:51 +01:00
4076361bf7 Merge pull request 'hosts: remove decommissioned media1, ns3, ns4, nixos-test1' (#18) from host-cleanup into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m36s
Reviewed-on: #18
2026-02-05 00:38:56 +00:00
0ef63ad874 hosts: remove decommissioned media1, ns3, ns4, nixos-test1
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m47s
Run nix flake check / flake-check (pull_request) Successful in 3m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:36:57 +01:00
8f29141dd1 Merge pull request 'monitoring: exclude step-ca serving cert from general expiry alert' (#17) from monitoring-cleanup into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 5m30s
Reviewed-on: #17
2026-02-05 00:22:15 +00:00
3a9a47f1ad monitoring: exclude step-ca serving cert from general expiry alert
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m23s
Run nix flake check / flake-check (pull_request) Failing after 4m46s
The step-ca serving certificate is auto-renewed with a 24h lifetime,
so it always triggers the general < 86400s threshold. Exclude it and
add a dedicated step_ca_serving_cert_expiring alert at < 1h instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:12:42 +01:00
fa6380e767 monitoring: fix nix-cache_caddy scrape target TLS error
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m43s
Move nix-cache_caddy back to a manual config in prometheus.nix using the
service CNAME (nix-cache.home.2rjus.net) instead of the hostname. The
auto-generated target used nix-cache01.home.2rjus.net which doesn't
match the TLS certificate SAN.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:04:50 +01:00
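A sketch of a manual scrape config keyed on the service CNAME so the certificate SAN matches; the port and metrics path are illustrative, not taken from the repo:
```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "nix-cache_caddy";
      scheme = "https";
      metrics_path = "/metrics";                            # illustrative
      static_configs = [
        { targets = [ "nix-cache.home.2rjus.net:443" ]; }   # CNAME matches the TLS SAN; port illustrative
      ];
    }
  ];
}
```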
86a077e152 docs: add host cleanup plan for decommissioned hosts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:04:50 +01:00
9da57c6a2f flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/e6eae2ee2110f3d31110d5c222cd395303343b08?narHash=sha256-KHFT9UWOF2yRPlAnSXQJh6uVcgNcWlFqqiAZ7OVlHNc%3D' (2026-02-03)
  → 'github:nixos/nixpkgs/bf922a59c5c9998a6584645f7d0de689512e444c?narHash=sha256-ksTL7P9QC1WfZasNlaAdLOzqD8x5EPyods69YBqxSfk%3D' (2026-02-04)
2026-02-05 00:01:37 +00:00
da9dd02d10 Merge pull request 'monitoring: auto-generate Prometheus scrape targets from host configs' (#16) from monitoring-improvements into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 6m32s
Periodic flake update / flake-update (push) Successful in 1m54s
Reviewed-on: #16
2026-02-04 23:53:46 +00:00
e7980978c7 docs: document monitoring auto-generation in CLAUDE.md
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m33s
Run nix flake check / flake-check (pull_request) Successful in 6m48s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:52:39 +01:00
dd1b64de27 monitoring: auto-generate Prometheus scrape targets from host configs
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m49s
Run nix flake check / flake-check (push) Has been cancelled
Add homelab.monitoring NixOS options (enable, scrapeTargets) following
the same pattern as homelab.dns. Prometheus scrape configs are now
auto-generated from flake host configurations and external targets,
replacing hardcoded target lists.

Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo,
remove duplicate pushgateway alert, add for clauses to monitoring_rules,
remove hardcoded WireGuard public key, and add new alerts for
certificates, proxmox, caddy, smartctl temperature, filesystem
prediction, systemd state, file descriptors, and host reboots.

Fixes grafana scrape target port from 3100 to 3000.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:49:07 +01:00
4e8cc124f2 docs: add plan management workflow and lab-monitoring MCP server
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m30s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:21:08 +01:00
a2a55f3955 docs: add docs directory info and nixos options improvement plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m12s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:27:11 +01:00
c38034ba41 docs: rewrite README with current infrastructure overview
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m41s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:20:49 +01:00
d7d4b0846c docs: move dns-automation plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m17s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:13:38 +01:00
8ca7c4e402 Merge pull request 'dns-automation' (#15) from dns-automation into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m18s
Reviewed-on: #15
2026-02-04 21:02:24 +00:00
106912499b docs: add git workflow note about not committing to master
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m16s
Run nix flake check / flake-check (push) Failing after 17m2s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:57:40 +01:00
83af00458b dns: remove defunct external hosts
Remove hosts that no longer respond to ping:
- kube-blue1-10 (entire k8s cluster)
- virt-mini1, mpnzb, inc2, testing
- CNAMEs: rook, git (pointed to removed kube-blue nodes)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:50:56 +01:00
67d5de3eb8 docs: update CLAUDE.md for DNS automation
- Add /modules/ and /lib/ to directory structure
- Document homelab.dns options and zone auto-generation
- Update "Adding a New Host" workflow (no manual zone editing)
- Expand DNS Architecture section with auto-generation details

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:45:16 +01:00
cee1b264cd dns: auto-generate zone entries from host configurations
Replace static zone file with dynamically generated records:
- Add homelab.dns module with enable/cnames options
- Extract IPs from systemd.network configs (filters VPN interfaces)
- Use git commit timestamp as zone serial number
- Move external hosts to separate external-hosts.nix

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:43:44 +01:00
4ceee04308 docs: update MCP config for nixpkgs-options and add nixpkgs-packages
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m50s
Rename nixos-options to nixpkgs-options and add new nixpkgs-packages
server for package search functionality. Update CLAUDE.md to document
both MCP servers and their available tools.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 20:50:36 +01:00
e3ced5bcda flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/41e216c0ca66c83b12ab7a98cc326b5db01db646?narHash=sha256-I7Lmgj3owOTBGuauy9FL6qdpeK2umDoe07lM4V%2BPnyA%3D' (2026-01-31)
  → 'github:nixos/nixpkgs/e576e3c9cf9bad747afcddd9e34f51d18c855b4e?narHash=sha256-tlFqNG/uzz2%2B%2BaAmn4v8J0vAkV3z7XngeIIB3rM3650%3D' (2026-02-03)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/cb369ef2efd432b3cdf8622b0ffc0a97a02f3137?narHash=sha256-VKS4ZLNx4PNrABoB0L8KUpc1fE7CLpQXQs985tGfaCU%3D' (2026-02-02)
  → 'github:nixos/nixpkgs/e6eae2ee2110f3d31110d5c222cd395303343b08?narHash=sha256-KHFT9UWOF2yRPlAnSXQJh6uVcgNcWlFqqiAZ7OVlHNc%3D' (2026-02-03)
• Updated input 'sops-nix':
    'github:Mic92/sops-nix/1e89149dcfc229e7e2ae24a8030f124a31e4f24f?narHash=sha256-twBMKGQvaztZQxFxbZnkg7y/50BW9yjtCBWwdjtOZew%3D' (2026-02-01)
  → 'github:Mic92/sops-nix/17eea6f3816ba6568b8c81db8a4e6ca438b30b7c?narHash=sha256-ktjWTq%2BD5MTXQcL9N6cDZXUf9kX8JBLLBLT0ZyOTSYY%3D' (2026-02-03)
2026-02-04 00:01:04 +00:00
15459870cd Merge pull request 'backup: migrate to native services.restic.backups' (#14) from migrate-to-native-restic-backups into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 4m4s
Periodic flake update / flake-update (push) Successful in 1m10s
Reviewed-on: #14
2026-02-03 23:47:11 +00:00
d1861eefb5 docs: add clipboard note and update flake inputs
Some checks failed
Run nix flake check / flake-check (push) Successful in 4m10s
Run nix flake check / flake-check (pull_request) Failing after 18m29s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 00:45:37 +01:00
d25fc99e1d backup: migrate to native services.restic.backups
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Run nix flake check / flake-check (pull_request) Successful in 4m0s
Replace custom backup-helper flake input with NixOS native
services.restic.backups module for ha1, monitoring01, and nixos-test1.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 00:41:40 +01:00
b5da9431aa docs: add nixos-options MCP configuration
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m51s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 00:01:00 +01:00
0e5dea635e Merge pull request 'create-host: add delete feature' (#13) from create-host-delete-feature into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
Reviewed-on: #13
2026-02-03 12:06:32 +00:00
86249c466b create-host: add delete feature
Some checks failed
Run nix flake check / flake-check (push) Failing after 21m31s
Run nix flake check / flake-check (pull_request) Failing after 15m17s
2026-02-03 12:11:41 +01:00
5d560267cf Merge pull request 'pki-migration' (#12) from pki-migration into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m21s
Reviewed-on: #12
2026-02-03 05:56:53 +00:00
63662b89e0 docs: update TODO.md
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m51s
Run nix flake check / flake-check (pull_request) Successful in 2m53s
2026-02-03 06:53:59 +01:00
7ae474fd3e pki: add new vault root ca to pki 2026-02-03 06:53:59 +01:00
f0525b5c74 ns: add vaulttest01 to zone
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m19s
2026-02-03 06:42:05 +01:00
42c391b355 ns: add vault cname to zone
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m7s
2026-02-03 06:00:59 +01:00
048536ba70 docs: move dns automation from TODO.md to nixos-improvements.md
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m20s
2026-02-03 04:51:27 +01:00
cccce09406 Merge pull request 'vault: implement bootstrap integration' (#11) from vault-bootstrap-integration into master
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reviewed-on: #11
2026-02-03 03:46:25 +00:00
01d4812280 vault: implement bootstrap integration
Some checks failed
Run nix flake check / flake-check (push) Successful in 2m31s
Run nix flake check / flake-check (pull_request) Failing after 14m16s
2026-02-03 01:10:36 +01:00
b5364d2ccc flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/62c8382960464ceb98ea593cb8321a2cf8f9e3e5?narHash=sha256-kKB3bqYJU5nzYeIROI82Ef9VtTbu4uA3YydSk/Bioa8%3D' (2026-01-30)
  → 'github:nixos/nixpkgs/cb369ef2efd432b3cdf8622b0ffc0a97a02f3137?narHash=sha256-VKS4ZLNx4PNrABoB0L8KUpc1fE7CLpQXQs985tGfaCU%3D' (2026-02-02)
2026-02-03 00:01:39 +00:00
94 changed files with 3479 additions and 1303 deletions

28
.mcp.json Normal file

@@ -0,0 +1,28 @@
{
  "mcpServers": {
    "nixpkgs-options": {
      "command": "nix",
      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#nixpkgs-search", "--", "options", "serve"],
      "env": {
        "NIXPKGS_SEARCH_DATABASE": "sqlite:///run/user/1000/labmcp/nixpkgs-search.db"
      }
    },
    "nixpkgs-packages": {
      "command": "nix",
      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#nixpkgs-search", "--", "packages", "serve"],
      "env": {
        "NIXPKGS_SEARCH_DATABASE": "sqlite:///run/user/1000/labmcp/nixpkgs-search.db"
      }
    },
    "lab-monitoring": {
      "command": "nix",
      "args": ["run", "git+https://git.t-juice.club/torjus/labmcp#lab-monitoring", "--", "serve", "--enable-silences"],
      "env": {
        "PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
        "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
        "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
      }
    }
  }
}

.sops.yaml

@@ -2,11 +2,7 @@ keys:
- &admin_torjus age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
- &server_ns1 age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
- &server_ns2 age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
- &server_ns3 age1snmhmpavqy7xddmw4nuny0u4xusqmnqxqarjmghkm5zaluff84eq5xatrd
- &server_ns4 age12a3nyvjs8jrwmpkf3tgawel3nwcklwsr35ktmytnvhpawqwzrsfqpgcy0q
- &server_ha1 age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
- &server_nixos-test1 age1gcyfkxh4fq5zdp0dh484aj82ksz66wrly7qhnpv0r0p576sn9ekse8e9ju
- &server_inc1 age1g5luz2rtel3surgzuh62rkvtey7lythrvfenyq954vmeyfpxjqkqdj3wt8
- &server_http-proxy age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
- &server_ca age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
- &server_monitoring01 age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
@@ -14,7 +10,6 @@ keys:
- &server_nix-cache01 age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
- &server_pgdb1 age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
- &server_nats1 age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
- &server_auth01 age16prza00sqzuhwwcyakj6z4hvwkruwkqpmmrsn94a5ucgpkelncdq2ldctk
creation_rules:
- path_regex: secrets/[^/]+\.(yaml|json|env|ini)
key_groups:
@@ -22,11 +17,7 @@ creation_rules:
- *admin_torjus
- *server_ns1
- *server_ns2
- *server_ns3
- *server_ns4
- *server_ha1
- *server_nixos-test1
- *server_inc1
- *server_http-proxy
- *server_ca
- *server_monitoring01
@@ -34,12 +25,6 @@ creation_rules:
- *server_nix-cache01
- *server_pgdb1
- *server_nats1
- *server_auth01
- path_regex: secrets/ns3/[^/]+\.(yaml|json|env|ini)
key_groups:
- age:
- *admin_torjus
- *server_ns3
- path_regex: secrets/ca/[^/]+\.(yaml|json|env|ini|)
key_groups:
- age:
@@ -65,8 +50,3 @@ creation_rules:
- age:
- *admin_torjus
- *server_http-proxy
- path_regex: secrets/auth01/[^/]+\.(yaml|json|env|ini|)
key_groups:
- age:
- *admin_torjus
- *server_auth01

238
CLAUDE.md

@@ -35,6 +35,21 @@ nix build .#create-host
Do not automatically deploy changes. Deployments are usually done by updating the master branch, and then triggering the auto update on the specific host.
### Testing Feature Branches on Hosts
All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
```bash
# On the target host, test a feature branch
nixos-rebuild-test boot <branch-name>
nixos-rebuild-test switch <branch-name>
# Additional arguments are passed through to nixos-rebuild
nixos-rebuild-test boot my-feature --show-trace
```
When working on a feature branch that requires testing on a live host, suggest using this command instead of the full flake URL syntax.
### Flake Management
```bash
@@ -52,7 +67,27 @@ nix develop
### Secrets Management
Secrets are handled by sops. Do not edit any `.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
`vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
Terraform manages the secrets and AppRole policies in `terraform/vault/`.
Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
`.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
### Git Workflow
**Important:** Never commit directly to `master` unless the user explicitly asks for it. Always create a feature branch for changes.
**Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.
When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
### Plan Management
When creating plans for large features, follow this workflow:
1. When implementation begins, save a copy of the plan to `docs/plans/` (e.g., `docs/plans/feature-name.md`)
2. Once the feature is fully implemented, move the plan to `docs/plans/completed/`
### Git Commit Messages
@@ -63,26 +98,130 @@ Examples:
- `template2: add proxmox image configuration`
- `terraform: add VM deployment configuration`
### Clipboard
To copy text to the clipboard, pipe to `wl-copy` (Wayland):
```bash
echo "text" | wl-copy
```
### NixOS Options and Packages Lookup
Two MCP servers are available for searching NixOS options and packages:
- **nixpkgs-options** - Search and lookup NixOS configuration option documentation
- **nixpkgs-packages** - Search and lookup Nix packages from nixpkgs
**Session Setup:** At the start of each session, index the nixpkgs revision from `flake.lock` to ensure documentation matches the project's nixpkgs version:
1. Read `flake.lock` and find the `nixpkgs` node's `rev` field
2. Call `index_revision` with that git hash (both servers share the same index)
**Options Tools (nixpkgs-options):**
- `search_options` - Search for options by name or description (e.g., query "nginx" or "postgresql")
- `get_option` - Get full details for a specific option (e.g., `services.loki.configuration`)
- `get_file` - Fetch the source file from nixpkgs that declares an option
**Package Tools (nixpkgs-packages):**
- `search_packages` - Search for packages by name or description (e.g., query "nginx" or "python")
- `get_package` - Get full details for a specific package by attribute path (e.g., `firefox`, `python312Packages.requests`)
- `get_file` - Fetch the source file from nixpkgs that defines a package
This ensures documentation matches the exact nixpkgs version (currently NixOS 25.11) used by this flake.
### Lab Monitoring Log Queries
The **lab-monitoring** MCP server can query logs from Loki. All hosts ship systemd journal logs via Promtail.
**Loki Label Reference:**
- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
**Example LogQL queries:**
```
# Logs from a specific service on a host
{host="ns2", systemd_unit="nsd.service"}
# Substring match on log content
{host="ns1", systemd_unit="nsd.service"} |= "error"
# File-based logs (e.g., caddy access logs)
{job="varlog", hostname="nix-cache01"}
```
Default lookback is 1 hour. Use the `start` parameter with relative durations (e.g., `24h`, `168h`) for older logs.
### Lab Monitoring Prometheus Queries
The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `instance` label uses the FQDN format `<host>.home.2rjus.net:<port>`.
**Prometheus Job Names:**
- `node-exporter` - System metrics from all hosts (CPU, memory, disk, network)
- `caddy` - Reverse proxy metrics (http-proxy)
- `nix-cache_caddy` - Nix binary cache metrics
- `home-assistant` - Home automation metrics
- `jellyfin` - Media server metrics
- `loki` / `prometheus` / `grafana` - Monitoring stack self-metrics
- `step-ca` - Internal CA metrics
- `pve-exporter` - Proxmox hypervisor metrics
- `smartctl` - Disk SMART health (gunter)
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
**Example PromQL queries:**
```
# Check all targets are up
up
# CPU usage for a specific host
rate(node_cpu_seconds_total{instance=~"ns1.*", mode!="idle"}[5m])
# Memory usage across all hosts
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
# Disk space
node_filesystem_avail_bytes{mountpoint="/"}
```
## Architecture
### Directory Structure
- `/flake.nix` - Central flake defining all 16 NixOS configurations
- `/flake.nix` - Central flake defining all NixOS configurations
- `/hosts/<hostname>/` - Per-host configurations
- `default.nix` - Entry point, imports configuration.nix and services
- `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
- Core modules: nix.nix, sshd.nix, sops.nix, acme.nix, autoupgrade.nix
- Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
- Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
- `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
- `/lib/` - Nix library functions
- `dns-zone.nix` - DNS zone generation functions
- `monitoring.nix` - Prometheus scrape target generation functions
- `/services/` - Reusable service modules, selectively imported by hosts
- `home-assistant/` - Home automation stack
- `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
- `ns/` - DNS services (authoritative, resolver)
- `ns/` - DNS services (authoritative, resolver, zone generation)
- `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
- `/common/` - Shared configurations (e.g., VM guest agent)
- `/docs/` - Documentation and plans
- `plans/` - Future plans and proposals
- `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management
- `/.sops.yaml` - SOPS configuration with age keys for all servers
- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)
### Configuration Inheritance
@@ -98,11 +237,13 @@ hosts/<hostname>/default.nix
All hosts automatically get:
- Nix binary cache (nix-cache.home.2rjus.net)
- SSH with root login enabled
- SOPS secrets management with auto-generated age keys
- OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (ca.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
- Monitoring scrape target auto-registration via `homelab.monitoring` options
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options
### Active Hosts
@@ -116,19 +257,16 @@ Production servers managed by `rebuild-all.sh`:
- `nix-cache01` - Binary cache server
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server
- `auth01` - Authentication service
Template/test hosts:
- `template1` - Base template for cloning new hosts
- `nixos-test1` - Test environment
### Flake Inputs
- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management
- `sops-nix` - Secrets management (legacy, only used by ca)
- Custom packages from git.t-juice.club:
- `backup-helper` - Backup automation module
- `alerttonotify` - Alert routing
- `labmon` - Lab monitoring
@@ -143,12 +281,21 @@ Template/test hosts:
### Secrets Management
- Uses SOPS with age encryption
- Each server has unique age key in `.sops.yaml`
- Keys auto-generated at `/var/lib/sops-nix/key.txt` on first boot
Most hosts use OpenBao (Vault) for secrets:
- Vault server at `vault01.home.2rjus.net:8200`
- AppRole authentication with credentials at `/var/lib/vault/approle/`
- Secrets defined in Terraform (`terraform/vault/secrets.tf`)
- AppRole policies in Terraform (`terraform/vault/approle.tf`)
- NixOS module: `system/vault-secrets.nix` with `vault.secrets.<name>` options
- `extractKey` option extracts a single key from vault JSON as a plain file
- Secrets fetched at boot by `vault-secret-<name>.service` systemd units
- Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
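A hedged sketch of how a host consumes a secret through the options listed above; only `vault.enable`, `extractKey`, and the full-file-path output behaviour are confirmed by commits in this range, everything else is illustrative:
```nix
{
  vault.enable = true;
  vault.secrets."backup-password" = {                   # secret name is illustrative
    extractKey = "password";                            # write a single key from the vault JSON as a plain file
    outputDir = "/run/vault-secrets/backup-password";   # full file path, not the parent directory
  };
}
```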
Legacy SOPS (only used by `ca` host):
- SOPS with age encryption, keys in `.sops.yaml`
- Shared secrets: `/secrets/secrets.yaml`
- Per-host secrets: `/secrets/<hostname>/`
- All production servers can decrypt shared secrets; host-specific secrets require specific host keys
### Auto-Upgrade System
@@ -246,14 +393,19 @@ This means:
1. Create `/hosts/<hostname>/` directory
2. Copy structure from `template1` or similar host
3. Add host entry to `flake.nix` nixosConfigurations
4. Add hostname to dns zone files. Merge to master. Run auto-upgrade on dns servers.
5. User clones template host
6. User runs `prepare-host.sh` on new host, this deletes files which should be regenerated, like ssh host keys, machine-id etc. It also creates a new age key, and prints the public key
7. This key is then added to `.sops.yaml`
8. Create `/secrets/<hostname>/` if needed
9. Configure networking (static IP, DNS servers)
10. Commit changes, and merge to master.
11. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
6. Add `vault.enable = true;` to the host configuration
7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
8. Run `tofu apply` in `terraform/vault/`
9. User clones template host
10. User runs `prepare-host.sh` on new host
11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
12. Commit changes, and merge to master.
13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
### Important Patterns
@@ -267,6 +419,8 @@ This means:
**Firewall**: Disabled on most hosts (trusted network). Enable selectively in host configuration if needed.
**Shell scripts**: Use `pkgs.writeShellApplication` instead of `pkgs.writeShellScript` or `pkgs.writeShellScriptBin` for creating shell scripts. `writeShellApplication` provides automatic shellcheck validation, sets strict bash options (`set -euo pipefail`), and allows declaring `runtimeInputs` for dependencies. When referencing the executable path (e.g., in `ExecStart`), use `lib.getExe myScript` to get the proper `bin/` path.
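A minimal sketch of this pattern with hypothetical names, assuming nothing beyond standard nixpkgs:
```nix
{ lib, pkgs, ... }:
let
  healthCheck = pkgs.writeShellApplication {
    name = "health-check";           # hypothetical script
    runtimeInputs = [ pkgs.curl ];   # dependencies are put on PATH
    text = ''
      curl -fsS https://example.home.2rjus.net/healthz
    '';
  };
in
{
  # lib.getExe resolves to <store path of healthCheck>/bin/health-check
  systemd.services.health-check.serviceConfig.ExecStart = lib.getExe healthCheck;
}
```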
### Monitoring Stack
All hosts ship metrics and logs to `monitoring01`:
@@ -276,9 +430,45 @@ All hosts ship metrics and logs to `monitoring01`:
- **Tracing**: Tempo for distributed tracing
- **Profiling**: Pyroscope for continuous profiling
**Scrape Target Auto-Generation:**
Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
Host monitoring options (`homelab.monitoring.*`):
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
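For example, a service module's declaration might look like this (a sketch based on the option list above; the step-ca job/port pairing is taken from the text, the other fields are shown with plausible values):
```nix
{
  homelab.monitoring.scrapeTargets = [
    {
      job_name = "step-ca";
      port = 9000;
      metrics_path = "/metrics";   # optional, shown for completeness
      scheme = "https";            # optional
    }
  ];
}
```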
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
### DNS Architecture
- `ns1` (10.69.13.5) - Primary authoritative DNS + resolver
- `ns2` (10.69.13.6) - Secondary authoritative DNS (AXFR from ns1)
- Zone files managed in `/services/ns/`
- All hosts point to ns1/ns2 for DNS resolution
**Zone Auto-Generation:**
DNS zone entries are automatically generated from host configurations:
- **Flake-managed hosts**: A records extracted from `systemd.network.networks` static IPs
- **CNAMEs**: Defined via `homelab.dns.cnames` option in host configs
- **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)
Host DNS options (`homelab.dns.*`):
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
Hosts are automatically excluded from DNS if:
- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP configured (e.g., DHCP-only hosts)
- Network interface is a VPN/tunnel (wg*, tun*, tap*)
To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
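For example, giving a host a CNAME alias is a one-line host option (sketch; the alias mirrors the existing `vault` → `vault01` CNAME mentioned elsewhere in this range):
```nix
{
  # in hosts/vault01/configuration.nix (illustrative placement)
  homelab.dns.cnames = [ "vault" ];   # generates the vault.home.2rjus.net CNAME -> vault01
}
```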

126
README.md

@@ -1,11 +1,125 @@
# nixos-servers
Nixos configs for my homelab servers.
NixOS Flake-based configuration repository for a homelab infrastructure. All hosts run NixOS 25.11 and are managed declaratively through this single repository.
## Configurations in use
## Hosts
* ha1
* ns1
* ns2
* template1
| Host | Role |
|------|------|
| `ns1`, `ns2` | Primary/secondary authoritative DNS |
| `ca` | Internal Certificate Authority |
| `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
| `http-proxy` | Reverse proxy |
| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `jelly01` | Jellyfin media server |
| `nix-cache01` | Nix binary cache |
| `pgdb1` | PostgreSQL |
| `nats1` | NATS messaging |
| `vault01` | OpenBao (Vault) secrets management |
| `template1`, `template2` | VM templates for cloning new hosts |
## Directory Structure
```
flake.nix # Flake entry point, defines all host configurations
hosts/<hostname>/ # Per-host configuration
system/ # Shared modules applied to ALL hosts
services/ # Reusable service modules, selectively imported per host
modules/ # Custom NixOS module definitions
lib/ # Nix library functions (DNS zone generation, etc.)
secrets/ # SOPS-encrypted secrets (legacy, only used by ca)
common/ # Shared configurations (e.g., VM guest agent)
terraform/ # OpenTofu configs for Proxmox VM provisioning
terraform/vault/ # OpenTofu configs for OpenBao (secrets, PKI, AppRoles)
playbooks/ # Ansible playbooks for template building and fleet ops
scripts/ # Helper scripts (create-host, vault-fetch)
```
## Key Features
**Automatic DNS zone generation** - A records are derived from each host's static IP configuration. CNAME aliases are defined via `homelab.dns.cnames`. No manual zone file editing required.
**OpenBao (Vault) secrets** - Hosts authenticate via AppRole and fetch secrets at boot. Secrets and policies are managed as code in `terraform/vault/`. Legacy SOPS remains only for the `ca` host.
**Daily auto-upgrades** - All hosts pull from the master branch and automatically rebuild and reboot on a randomized schedule.
**Shared base configuration** - Every host automatically gets SSH, monitoring (node-exporter + Promtail), internal ACME certificates, and Nix binary cache access via the `system/` modules.
**Proxmox VM provisioning** - Build VM templates with Ansible and deploy VMs with OpenTofu from `terraform/`.
**OpenBao (Vault) secrets** - Centralized secrets management with AppRole authentication, PKI infrastructure, and automated bootstrap. Managed as code in `terraform/vault/`.
## Usage
```bash
# Enter dev shell (provides ansible, opentofu, openbao, create-host)
nix develop
# Build a host configuration locally
nix build .#nixosConfigurations.<hostname>.config.system.build.toplevel
# List all configurations
nix flake show
```
Deployments are done by merging to master and triggering the auto-upgrade on the target host.
## Provisioning New Hosts
The repository includes an automated pipeline for creating and deploying new hosts on Proxmox.
### 1. Generate host configuration
The `create-host` tool (available in the dev shell) generates all required files for a new host:
```bash
create-host \
--hostname myhost \
--ip 10.69.13.50/24 \
--cpu 4 \
--memory 4096 \
--disk 50G
```
This creates:
- `hosts/<hostname>/` - NixOS configuration (networking, imports, hardware)
- Entry in `flake.nix`
- VM definition in `terraform/vms.tf`
- Vault AppRole policy and wrapped bootstrap token
Omit `--ip` for DHCP. Use `--dry-run` to preview changes. Use `--force` to regenerate an existing host's config.
### 2. Build and deploy the VM template
The Proxmox VM template is built from `hosts/template2` and deployed with Ansible:
```bash
nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
```
This only needs to be re-run when the base template changes.
### 3. Deploy the VM
```bash
cd terraform && tofu apply
```
### 4. Automatic bootstrap
On first boot, the VM automatically:
1. Receives its hostname and Vault credentials via cloud-init
2. Unwraps the Vault token and stores AppRole credentials
3. Runs `nixos-rebuild boot` against the flake on the master branch
4. Reboots into the host-specific configuration
5. Services fetch their secrets from Vault at startup
No manual intervention is required after `tofu apply`.
## Network
- Domain: `home.2rjus.net`
- Infrastructure subnet: `10.69.13.0/24`
- DNS: ns1/ns2 authoritative with primary-secondary AXFR
- Internal CA for TLS certificates (migrating from step-ca to OpenBao PKI)
- Centralized monitoring at monitoring01

104
TODO.md

@@ -155,7 +155,7 @@ create-host \
### Phase 4: Secrets Management with OpenBao (Vault)
**Status:** 🚧 Phases 4a & 4b Complete, 4c & 4d In Progress
**Status:** 🚧 Phases 4a, 4b, 4c (partial), & 4d Complete
**Challenge:** Current sops-nix approach has chicken-and-egg problem with age keys
@@ -339,6 +339,8 @@ vault01.home.2rjus.net (10.69.13.19)
#### Phase 4c: PKI Migration (Replace step-ca)
**Status:** 🚧 Partially Complete - vault01 and test host migrated, remaining hosts pending
**Goal:** Migrate hosts from step-ca to OpenBao PKI for TLS certificates
**Note:** PKI infrastructure already set up in Phase 4b (root CA, intermediate CA, ACME support)
@@ -349,27 +351,33 @@ vault01.home.2rjus.net (10.69.13.19)
- [x] Intermediate CA (`pki_int/` mount, 5 year TTL, EC P-384)
- [x] Signed intermediate with root CA
- [x] Configured CRL, OCSP, and issuing certificate URLs
- [x] Enable ACME support (completed in Phase 4b)
- [x] Enable ACME support (completed in Phase 4b, fixed in Phase 4c)
- [x] Enabled ACME on intermediate CA
- [x] Created PKI role for `*.home.2rjus.net`
- [x] Set certificate TTLs (30 day max) and allowed domains
- [x] ACME directory: `https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [ ] Download and distribute root CA certificate
- [ ] Export root CA: `bao read -field=certificate pki/cert/ca > homelab-root-ca.crt`
- [ ] Add to NixOS trust store on all hosts via `security.pki.certificateFiles`
- [ ] Deploy via auto-upgrade
- [ ] Test certificate issuance
- [ ] Issue test certificate using ACME client (lego/certbot)
- [ ] Or issue static certificate via OpenBao CLI
- [ ] Verify certificate chain and trust
- [ ] Migrate vault01's own certificate
- [ ] Issue new certificate from OpenBao PKI (self-issued)
- [ ] Replace self-signed bootstrap certificate
- [ ] Update service configuration
- [x] Fixed ACME response headers (added Replay-Nonce, Link, Location to allowed_response_headers)
- [x] Configured cluster path for ACME
- [x] Download and distribute root CA certificate
- [x] Added root CA to `system/pki/root-ca.nix`
- [x] Distributed to all hosts via system imports
- [x] Test certificate issuance
- [x] Tested ACME issuance on vaulttest01 successfully
- [x] Verified certificate chain and trust
- [x] Migrate vault01's own certificate
- [x] Created `bootstrap-vault-cert` script for initial certificate issuance via bao CLI
- [x] Issued certificate with SANs (vault01.home.2rjus.net + vault.home.2rjus.net)
- [x] Updated service to read certificates from `/var/lib/acme/vault01.home.2rjus.net/`
- [x] Configured ACME for automatic renewals
- [ ] Migrate hosts from step-ca to OpenBao
- [x] Tested on vaulttest01 (non-production host)
- [ ] Standardize hostname usage across all configurations
- [ ] Use `vault.home.2rjus.net` (CNAME) consistently everywhere
- [ ] Update NixOS configurations to use CNAME instead of vault01
- [ ] Update Terraform configurations to use CNAME
- [ ] Audit and fix mixed usage of vault01.home.2rjus.net vs vault.home.2rjus.net
- [ ] Update `system/acme.nix` to use OpenBao ACME endpoint
- [ ] Change server to `https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [ ] Test on one host (non-critical service)
- [ ] Change server to `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [ ] Roll out to all hosts via auto-upgrade
- [ ] Configure SSH CA in OpenBao (optional, future work)
- [ ] Enable SSH secrets engine (`ssh/` mount)
@@ -384,7 +392,39 @@ vault01.home.2rjus.net (10.69.13.19)
- [ ] Archive step-ca configuration for backup
- [ ] Update documentation
**Deliverable:** All TLS certificates issued by OpenBao PKI, step-ca retired
**Implementation Details (2026-02-03):**
**ACME Configuration Fix:**
The key blocker was that OpenBao's PKI mount was filtering out required ACME response headers. The solution was to add `allowed_response_headers` to the Terraform mount configuration:
```hcl
allowed_response_headers = [
"Replay-Nonce", # Required for ACME nonce generation
"Link", # Required for ACME navigation
"Location" # Required for ACME resource location
]
```
**Cluster Path Configuration:**
ACME requires the cluster path to include the full API path:
```hcl
path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
aia_path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
```
**Bootstrap Process:**
Since vault01 needed a certificate from its own PKI (chicken-and-egg problem), we created a `bootstrap-vault-cert` script that:
1. Uses the Unix socket (no TLS) to issue a certificate via `bao` CLI
2. Places it in the ACME directory structure
3. Includes both vault01.home.2rjus.net and vault.home.2rjus.net as SANs
4. After restart, ACME manages renewals automatically
**Files Modified:**
- `terraform/vault/pki.tf` - Added allowed_response_headers, cluster config, ACME config
- `services/vault/default.nix` - Updated cert paths, added bootstrap script, configured ACME
- `system/pki/root-ca.nix` - Added OpenBao root CA to trust store
- `hosts/vaulttest01/configuration.nix` - Overrode ACME server for testing
**Deliverable:** ✅ vault01 and vaulttest01 using OpenBao PKI, remaining hosts still on step-ca
---
@@ -484,36 +524,6 @@ vault01.home.2rjus.net (10.69.13.19)
---
### Phase 5: DNS Automation
**Goal:** Automatically generate DNS entries from host configurations
**Approach:** Leverage Nix to generate zone file entries from flake host configurations
Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
**Tasks:**
- [ ] Add optional CNAME field to host configurations
- [ ] Add `networking.cnames = [ "alias1" "alias2" ]` or similar option
- [ ] Document in host configuration template
- [ ] Create Nix function to extract DNS records from all hosts
- [ ] Parse each host's `networking.hostName` and IP configuration
- [ ] Collect any defined CNAMEs
- [ ] Generate zone file fragment with A and CNAME records
- [ ] Integrate auto-generated records into zone files
- [ ] Keep manual entries separate (for non-flake hosts/services)
- [ ] Include generated fragment in main zone file
- [ ] Add comments showing which records are auto-generated
- [ ] Update zone file serial number automatically
- [ ] Test zone file validity after generation
- [ ] Either:
- [ ] Automatically trigger DNS server reload (Ansible)
- [ ] Or document manual step: merge to master, run upgrade on ns1/ns2
**Deliverable:** DNS A records and CNAMEs automatically generated from host configs
---
### Phase 6: Integration Script
**Goal:** Single command to create and deploy a new host

View File

@@ -0,0 +1,192 @@
# Authentication System Replacement Plan
## Overview
Replace the current auth01 setup (LLDAP + Authelia) with a modern, unified authentication solution. The current setup is not in active use, making this a good time to evaluate alternatives.
## Goals
1. **Central user database** - Manage users across all homelab hosts from a single source
2. **Linux PAM/NSS integration** - Users can SSH into hosts using central credentials
3. **UID/GID consistency** - Proper POSIX attributes for NAS share permissions
4. **OIDC provider** - Single sign-on for homelab web services (Grafana, etc.)
## Options Evaluated
### OpenLDAP (raw)
- **NixOS Support:** Good (`services.openldap` with `declarativeContents`)
- **Pros:** Most widely supported, very flexible
- **Cons:** LDIF format is painful, schema management is complex, no built-in OIDC, requires SSSD on each client
- **Verdict:** Doesn't address LDAP complexity concerns
### LLDAP + Authelia (current)
- **NixOS Support:** Both have good modules
- **Pros:** Already configured, lightweight, nice web UIs
- **Cons:** Two services to manage, limited POSIX attribute support in LLDAP, requires SSSD on every client host
- **Verdict:** Workable but has friction for NAS/UID goals
### FreeIPA
- **NixOS Support:** None
- **Pros:** Full enterprise solution (LDAP + Kerberos + DNS + CA)
- **Cons:** Extremely heavy, wants to own DNS, designed for Red Hat ecosystems, massive overkill for homelab
- **Verdict:** Overkill, no NixOS support
### Keycloak
- **NixOS Support:** None
- **Pros:** Good OIDC/SAML, nice UI
- **Cons:** Primarily an identity broker not a user directory, poor POSIX support, heavy (Java)
- **Verdict:** Wrong tool for Linux user management
### Authentik
- **NixOS Support:** None (would need Docker)
- **Pros:** All-in-one with LDAP outpost and OIDC, modern UI
- **Cons:** Heavy stack (Python + PostgreSQL + Redis), LDAP is a separate component
- **Verdict:** Would work but requires Docker and is heavy
### Kanidm
- **NixOS Support:** Excellent - first-class module with PAM/NSS integration
- **Pros:**
- Native PAM/NSS module (no SSSD needed)
- Built-in OIDC provider
- Optional LDAP interface for legacy services
- Declarative provisioning via NixOS (users, groups, OAuth2 clients)
- Modern, written in Rust
- Single service handles everything
- **Cons:** Newer project, smaller community than LDAP
- **Verdict:** Best fit for requirements
### Pocket-ID
- **NixOS Support:** Unknown
- **Pros:** Very lightweight, passkey-first
- **Cons:** No LDAP, no PAM/NSS integration - purely OIDC for web apps
- **Verdict:** Doesn't solve Linux user management goal
## Recommendation: Kanidm
Kanidm is the recommended solution for the following reasons:
| Requirement | Kanidm Support |
|-------------|----------------|
| Central user database | Native |
| Linux PAM/NSS (host login) | Native NixOS module |
| UID/GID for NAS | POSIX attributes supported |
| OIDC for services | Built-in |
| Declarative config | Excellent NixOS provisioning |
| Simplicity | Modern API, LDAP optional |
| NixOS integration | First-class |
### Key NixOS Features
**Server configuration:**
```nix
services.kanidm.enableServer = true;
services.kanidm.serverSettings = {
domain = "home.2rjus.net";
origin = "https://auth.home.2rjus.net";
ldapbindaddress = "0.0.0.0:636"; # Optional LDAP interface
};
```
**Declarative user provisioning:**
```nix
services.kanidm.provision.enable = true;
services.kanidm.provision.persons.torjus = {
displayName = "Torjus";
groups = [ "admins" "nas-users" ];
};
```
**Declarative OAuth2 clients:**
```nix
services.kanidm.provision.systems.oauth2.grafana = {
displayName = "Grafana";
originUrl = "https://grafana.home.2rjus.net/login/generic_oauth";
originLanding = "https://grafana.home.2rjus.net";
};
```
**Client host configuration (add to system/):**
```nix
services.kanidm.enableClient = true;
services.kanidm.enablePam = true;
services.kanidm.clientSettings.uri = "https://auth.home.2rjus.net";
```
## NAS Integration
### Current: TrueNAS CORE (FreeBSD)
TrueNAS CORE has a built-in LDAP client. Kanidm's read-only LDAP interface will work for NFS share permissions:
- **NFS shares**: Only need consistent UID/GID mapping - Kanidm's LDAP provides this
- **No SMB requirement**: SMB would need Samba schema attributes (deprecated in TrueNAS 13.0+), but we're NFS-only
Configuration approach:
1. Enable Kanidm's LDAP interface (`ldapbindaddress = "0.0.0.0:636"`)
2. Import internal CA certificate into TrueNAS
3. Configure TrueNAS LDAP client with Kanidm's Base DN and bind credentials
4. Users/groups appear in TrueNAS permission dropdowns
Note: Kanidm's LDAP is read-only and uses LDAPS only (no StartTLS). This is fine for our use case.
### Future: NixOS NAS
When the NAS is migrated to NixOS, it becomes a first-class citizen:
- Native Kanidm PAM/NSS integration (same as other hosts)
- No LDAP compatibility layer needed
- Full integration with the rest of the homelab
This future migration path is a strong argument for Kanidm over LDAP-only solutions.
## Implementation Steps
1. **Create Kanidm service module** in `services/kanidm/`
- Server configuration
- TLS via internal ACME
- Vault secrets for admin passwords
2. **Configure declarative provisioning**
- Define initial users and groups
- Set up POSIX attributes (UID/GID ranges)
3. **Add OIDC clients** for homelab services
- Grafana
- Other services as needed
4. **Create client module** in `system/` for PAM/NSS
- Enable on all hosts that need central auth
- Configure trusted CA
5. **Test NAS integration**
- Configure TrueNAS LDAP client to connect to Kanidm
- Verify UID/GID mapping works with NFS shares
6. **Migrate auth01**
- Remove LLDAP and Authelia services
- Deploy Kanidm
- Update DNS CNAMEs if needed
7. **Documentation**
- User management procedures
- Adding new OAuth2 clients
- Troubleshooting PAM/NSS issues
## Open Questions
- What UID/GID range should be reserved for Kanidm-managed users?
- Which hosts should have PAM/NSS enabled initially?
- What OAuth2 clients are needed at launch?
## References
- [Kanidm Documentation](https://kanidm.github.io/kanidm/stable/)
- [NixOS Kanidm Module](https://search.nixos.org/options?query=services.kanidm)
- [Kanidm PAM/NSS Integration](https://kanidm.github.io/kanidm/stable/pam_and_nsswitch.html)

View File

@@ -0,0 +1,61 @@
# DNS Automation
**Status:** Completed (2026-02-04)
**Goal:** Automatically generate DNS entries from host configurations
**Approach:** Leverage Nix to generate zone file entries from flake host configurations
Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
## Implementation
- [x] Add optional CNAME field to host configurations
- [x] Added `homelab.dns.cnames` option in `modules/homelab/dns.nix`
- [x] Added `homelab.dns.enable` to allow opting out (defaults to true)
- [x] Documented in CLAUDE.md
- [x] Create Nix function to extract DNS records from all hosts
- [x] Created `lib/dns-zone.nix` with extraction functions (sketched below)
- [x] Parses each host's `networking.hostName` and `systemd.network.networks` IP configuration
- [x] Collects CNAMEs from `homelab.dns.cnames`
- [x] Filters out VPN interfaces (wg*, tun*, tap*, vti*)
- [x] Generates complete zone file with A and CNAME records
- [x] Integrate auto-generated records into zone files
- [x] External hosts separated to `services/ns/external-hosts.nix`
- [x] Zone includes comments showing which records are auto-generated vs external
- [x] Update zone file serial number automatically
- [x] Uses `self.sourceInfo.lastModified` (git commit timestamp)
- [x] Test zone file validity after generation
- [x] NSD validates zone at build time via `nsd-checkzone`
- [x] Deploy process documented
- [x] Merge to master, run auto-upgrade on ns1/ns2
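For reference, the core of the extraction could look roughly like this; the exact function names and the interface-filtering details in the real `lib/dns-zone.nix` differ, so treat it as a sketch:
```nix
# Rough sketch, not the actual lib/dns-zone.nix.
{ lib, self }:
let
  hosts = lib.filterAttrs
    (_: host: host.config.homelab.dns.enable or true)
    self.nixosConfigurations;
  # First static address of a host, with the /prefix stripped
  # (VPN-interface filtering omitted here).
  firstAddress = host:
    let
      nets = lib.attrValues host.config.systemd.network.networks;
      addrs = lib.concatMap (net: net.address or [ ]) nets;
    in
    lib.head (lib.splitString "/" (lib.head addrs));
  aRecords = lib.mapAttrsToList
    (_: host: "${host.config.networking.hostName} IN A ${firstAddress host}")
    hosts;
  cnameRecords = lib.concatLists (lib.mapAttrsToList
    (_: host: map
      (alias: "${alias} IN CNAME ${host.config.networking.hostName}")
      (host.config.homelab.dns.cnames or [ ]))
    hosts);
in
lib.concatStringsSep "\n" (aRecords ++ cnameRecords)
```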
## Files Created/Modified
| File | Purpose |
|------|---------|
| `modules/homelab/dns.nix` | Defines `homelab.dns.*` options |
| `modules/homelab/default.nix` | Module import hub |
| `lib/dns-zone.nix` | Zone generation functions |
| `services/ns/external-hosts.nix` | Non-flake host records |
| `services/ns/master-authorative.nix` | Uses generated zone |
| `services/ns/secondary-authorative.nix` | Uses generated zone |
## Usage
View generated zone:
```bash
nix eval .#nixosConfigurations.ns1.config.services.nsd.zones.'"home.2rjus.net"'.data --raw
```
Add CNAMEs to a host:
```nix
homelab.dns.cnames = [ "alias1" "alias2" ];
```
Exclude a host from DNS:
```nix
homelab.dns.enable = false;
```
Add non-flake hosts: Edit `services/ns/external-hosts.nix`

View File

@@ -0,0 +1,23 @@
# Host Cleanup
## Overview
Remove decommissioned/unused host configurations that are no longer reachable on the network.
## Hosts to review
The following hosts return "no route to host" from Prometheus scraping and are likely no longer needed:
- `media1` (10.69.12.82)
- `ns3` (10.69.13.7)
- `ns4` (10.69.13.8)
- `nixos-test1` (10.69.13.10)
## Steps
1. Confirm each host is truly decommissioned (not just temporarily powered off)
2. Remove host directory from `hosts/`
3. Remove `nixosConfigurations` entry from `flake.nix`
4. Remove host's age key from `.sops.yaml`
5. Remove per-host secrets from `secrets/<hostname>/` if any
6. Verify DNS zone and Prometheus targets no longer include the removed hosts after rebuild

View File

@@ -0,0 +1,128 @@
# Monitoring Gaps Audit
## Overview
Audit of services running in the homelab that lack monitoring coverage, either missing Prometheus scrape targets, alerting rules, or both.
## Services with No Monitoring
### PostgreSQL (`pgdb1`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** A database outage would go completely unnoticed by Prometheus
- **Recommendation:** Enable `services.prometheus.exporters.postgres` (available in nixpkgs). This exposes connection counts, query throughput, replication lag, table/index stats, and more. Add alerts for at least `postgres_down` (systemd unit state) and connection pool exhaustion.
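A sketch of what enabling this could look like on pgdb1, reusing the repo's `homelab.monitoring.scrapeTargets` convention for the scrape side (the alert rules themselves would still go in `services/monitoring/rules.yml`):
```nix
# Sketch for pgdb1 -- 9187 is the exporter module's default port.
services.prometheus.exporters.postgres = {
  enable = true;
  runAsLocalSuperUser = true; # connect over the local socket as the postgres superuser
};
homelab.monitoring.scrapeTargets = [
  {
    job_name = "postgres";
    port = 9187;
  }
];
```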
### Authelia (`auth01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** The authentication gateway being down blocks access to all proxied services
- **Recommendation:** Authelia exposes Prometheus metrics natively at `/metrics`. Add a scrape target and at minimum an `authelia_down` systemd unit state alert.
### LLDAP (`auth01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** LLDAP is a dependency of Authelia -- if LDAP is down, authentication breaks even if Authelia is running
- **Recommendation:** Add an `lldap_down` systemd unit state alert. LLDAP does not expose Prometheus metrics natively, so systemd unit monitoring via node-exporter may be sufficient.
### Vault / OpenBao (`vault01`)
- **Current state:** No scrape targets, no alert rules
- **Risk:** Secrets management service failures go undetected
- **Recommendation:** OpenBao supports Prometheus telemetry output natively. Add a scrape target for the telemetry endpoint and alerts for `vault_down` (systemd unit) and seal status.
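A sketch of the scrape side, assuming OpenBao's `telemetry` stanza is enabled with a `prometheus_retention_time` and that read access to the metrics endpoint is arranged (it normally requires a token); a raw `services.prometheus.scrapeConfigs` entry is used because the endpoint needs a custom path and query parameter:
```nix
# Sketch only -- assumes telemetry is enabled in the OpenBao server config
# and that the metrics endpoint is readable by Prometheus.
services.prometheus.scrapeConfigs = [
  {
    job_name = "openbao";
    scheme = "https";
    metrics_path = "/v1/sys/metrics";
    params.format = [ "prometheus" ];
    static_configs = [
      { targets = [ "vault01.home.2rjus.net:8200" ]; }
    ];
  }
];
```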
### Gitea Actions Runner
- **Current state:** No scrape targets, no alert rules
- **Risk:** CI/CD failures go undetected
- **Recommendation:** Add at minimum a systemd unit state alert. The runner itself has limited metrics exposure.
## Services with Partial Monitoring
### Jellyfin (`jelly01`)
- **Current state:** Has scrape targets (port 8096), metrics are being collected, but zero alert rules
- **Metrics available:** 184 metrics, all .NET runtime / ASP.NET Core level. No Jellyfin-specific metrics (active streams, library size, transcoding sessions). Key useful metrics:
- `microsoft_aspnetcore_hosting_failed_requests` - rate of HTTP errors
- `microsoft_aspnetcore_hosting_current_requests` - in-flight requests
- `process_working_set_bytes` - memory usage (~256 MB currently)
- `dotnet_gc_pause_ratio` - GC pressure
- `up{job="jellyfin"}` - basic availability
- **Recommendation:** Add a `jellyfin_down` alert using either `up{job="jellyfin"} == 0` or systemd unit state. Consider alerting on sustained `failed_requests` rate increase.
### NATS (`nats1`)
- **Current state:** Has a `nats_down` alert (systemd unit state via node-exporter), but no NATS-specific metrics
- **Metrics available:** NATS has a built-in `/metrics` endpoint exposing connection counts, message throughput, JetStream consumer lag, and more
- **Recommendation:** Add a scrape target for the NATS metrics endpoint. Consider alerts for connection count spikes, slow consumers, and JetStream storage usage.
### DNS - Unbound (`ns1`, `ns2`)
- **Current state:** Has `unbound_down` alert (systemd unit state), but no DNS query metrics
- **Available in nixpkgs:** `services.prometheus.exporters.unbound.enable` (package: `prometheus-unbound-exporter` v0.5.0). Exposes query counts, cache hit ratios, response types (SERVFAIL, NXDOMAIN), upstream latency.
- **Recommendation:** Enable the unbound exporter on ns1/ns2. Add alerts for cache hit ratio drops and SERVFAIL rate spikes.
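Enabling it is close to a one-liner per host; a sketch, again assuming the `homelab.monitoring.scrapeTargets` convention (the exporter also needs access to unbound's remote-control socket, which is not shown here):
```nix
# Sketch for ns1/ns2 -- 9167 is the exporter module's default port.
services.prometheus.exporters.unbound.enable = true;
homelab.monitoring.scrapeTargets = [
  {
    job_name = "unbound";
    port = 9167;
  }
];
```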
### DNS - NSD (`ns1`, `ns2`)
- **Current state:** Has `nsd_down` alert (systemd unit state), no NSD-specific metrics
- **Available in nixpkgs:** Nothing. No exporter package or NixOS module. Community `nsd_exporter` exists but is not packaged.
- **Recommendation:** The existing systemd unit alert is likely sufficient. NSD is a simple authoritative-only server with limited operational metrics. Not worth packaging a custom exporter for now.
## Existing Monitoring (for reference)
These services have adequate alerting and/or scrape targets:
| Service | Scrape Targets | Alert Rules |
|---|---|---|
| Monitoring stack (Prometheus, Grafana, Loki, Tempo, Pyroscope) | Yes | 7 alerts |
| Home Assistant (+ Zigbee2MQTT, Mosquitto) | Yes (port 8123) | 3 alerts |
| HTTP Proxy (Caddy) | Yes (port 80) | 3 alerts |
| Nix Cache (Harmonia, build-flakes) | Via Caddy | 4 alerts |
| CA (step-ca) | Yes (port 9000) | 4 certificate alerts |
## Per-Service Resource Metrics (systemd-exporter)
### Current State
No per-service CPU, memory, or IO metrics are collected. The existing node-exporter systemd collector only provides unit state (active/inactive/failed), socket stats, and timer triggers. While systemd tracks per-unit resource usage via cgroups internally (visible in `systemctl status` and `systemd-cgtop`), this data is not exported to Prometheus.
### Available Solution
The `prometheus-systemd-exporter` package (v0.7.0) is available in nixpkgs with a ready-made NixOS module:
```nix
services.prometheus.exporters.systemd.enable = true;
```
**Options:** `enable`, `port`, `extraFlags`, `user`, `group`
This exporter reads cgroup data and exposes per-unit metrics including:
- CPU seconds consumed per service
- Memory usage per service
- Task/process counts per service
- Restart counts
- IO usage
### Recommendation
Enable on all hosts via the shared `system/` config (same pattern as node-exporter). Add a corresponding scrape job on monitoring01. This would give visibility into resource consumption per service across the fleet, useful for capacity planning and diagnosing noisy-neighbor issues on shared hosts.
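A sketch of the two halves, assuming the exporter's default port (9558) and a plain scrape job on monitoring01; the target list below is illustrative rather than generated:
```nix
# Shared system/ module -- enable the exporter fleet-wide (sketch).
services.prometheus.exporters.systemd.enable = true;
# monitoring01 scrape job (sketch; in practice the targets would be generated
# the same way as the node-exporter targets).
services.prometheus.scrapeConfigs = [
  {
    job_name = "systemd";
    static_configs = [
      { targets = [ "ns1.home.2rjus.net:9558" "pgdb1.home.2rjus.net:9558" ]; }
    ];
  }
];
```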
## Suggested Priority
1. **PostgreSQL** - Critical infrastructure, easy to add with existing nixpkgs module
2. **Authelia + LLDAP** - Auth outage affects all proxied services
3. **Unbound exporter** - Ready-to-go NixOS module, just needs enabling
4. **Jellyfin alerts** - Metrics already collected, just needs alert rules
5. **NATS metrics** - Built-in endpoint, just needs a scrape target
6. **Vault/OpenBao** - Native telemetry support
7. **Actions Runner** - Lower priority, basic systemd alert sufficient
## Node-Exporter Targets Currently Down
Noted during audit -- these node-exporter targets are failing:
- `nixos-test1.home.2rjus.net:9100` - no route to host
- `media1.home.2rjus.net:9100` - no route to host
- `ns3.home.2rjus.net:9100` - no route to host
- `ns4.home.2rjus.net:9100` - no route to host
These may be decommissioned or powered-off hosts that should be removed from the scrape config.

View File

@@ -0,0 +1,86 @@
# Sops to OpenBao Secrets Migration Plan
## Status: Complete (except ca, deferred)
## Remaining sops cleanup
The `sops-nix` flake input, `system/sops.nix`, `.sops.yaml`, and `secrets/` directory are
still present because `ca` still uses sops for its step-ca secrets (5 secrets in
`services/ca/default.nix`). The `services/authelia/` and `services/lldap/` modules also
reference sops but are only used by auth01 (decommissioned).
Once `ca` is migrated to OpenBao PKI (Phase 4c in host-migration-to-opentofu.md), remove:
- `sops-nix` input from `flake.nix`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
## Overview
Migrate all hosts from sops-nix secrets to OpenBao (vault) secrets management. Pilot with ha1, then roll out to remaining hosts in waves.
## Pre-requisites (completed)
1. Hardcoded root password hash in `system/root-user.nix` (removes sops dependency for all hosts)
2. Added `extractKey` option to `system/vault-secrets.nix` (extracts single key as file)
## Deployment Order
### Pilot: ha1
- Terraform: shared/backup/password secret, ha1 AppRole policy
- Provision AppRole credentials via `playbooks/provision-approle.yml`
- NixOS: vault.enable + backup-helper vault secret
### Wave 1: nats1, jelly01, pgdb1
- No service secrets (only root password, already handled)
- Just need AppRole policies + credential provisioning
### Wave 2: monitoring01
- 3 secrets: backup password, nats nkey, pve-exporter config
- Updates: alerttonotify.nix, pve.nix, configuration.nix
### Wave 3: ns1, then ns2 (critical - deploy ns1 first, verify, then ns2)
- DNS zone transfer key (shared/dns/xfer-key)
### Wave 4: http-proxy
- WireGuard private key
### Wave 5: nix-cache01
- Cache signing key + Gitea Actions token
### Wave 6: ca (DEFERRED - waiting for PKI migration)
### Skipped: auth01 (decommissioned)
## Terraform variables needed
User must extract from sops and add to `terraform/vault/terraform.tfvars`:
| Variable | Source |
|----------|--------|
| `backup_helper_secret` | `sops -d secrets/secrets.yaml` |
| `ns_xfer_key` | `sops -d secrets/secrets.yaml` |
| `nats_nkey` | `sops -d secrets/secrets.yaml` |
| `pve_exporter_config` | `sops -d secrets/monitoring01/pve-exporter.yaml` |
| `wireguard_private_key` | `sops -d secrets/http-proxy/wireguard.yaml` |
| `cache_signing_key` | `sops -d secrets/nix-cache01/cache-secret` |
| `actions_token_1` | `sops -d secrets/nix-cache01/actions_token_1` |
## Provisioning AppRole credentials
```bash
export BAO_ADDR='https://vault01.home.2rjus.net:8200'
export BAO_TOKEN='<root-token>'
nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>
```
## Verification (per host)
1. `systemctl status vault-secret-*` - all secret fetch services succeeded
2. Check secret files exist at expected paths with correct permissions
3. Verify dependent services are running
4. Check `/var/lib/vault/cache/` is populated (fallback ready)
5. Reboot host to verify boot-time secret fetching works

View File

@@ -0,0 +1,109 @@
# Zigbee Sensor Battery Monitoring
**Status:** Completed
**Branch:** `zigbee-battery-fix`
**Commit:** `c515a6b home-assistant: fix zigbee sensor battery reporting`
## Problem
Three Aqara Zigbee temperature sensors report `battery: 0` in their MQTT payload, making the `hass_sensor_battery_percent` Prometheus metric useless for battery monitoring on these devices.
Affected sensors:
- **Temp Living Room** (`0x54ef441000a54d3c`) — WSDCGQ12LM
- **Temp Office** (`0x54ef441000a547bd`) — WSDCGQ12LM
- **temp_server** (`0x54ef441000a564b6`) — WSDCGQ12LM
The **Temp Bedroom** sensor (`0x00124b0025495463`) is a SONOFF SNZB-02 and reports battery correctly.
## Findings
- All three sensors are actively reporting temperature, humidity, and pressure data — they are not dead.
- The Zigbee2MQTT payload includes a `voltage` field (e.g., `2707` = 2.707V), which indicates healthy battery levels (~40-60% for a CR2032 coin cell).
- CR2032 voltage reference: ~3.0V fresh, ~2.7V mid-life, ~2.1V dead.
- The `voltage` field is not exposed as a Prometheus metric — it exists only in the MQTT payload.
- This is a known firmware quirk with some Aqara WSDCGQ12LM sensors that always report 0% battery.
## Device Inventory
Full list of Zigbee devices on ha1 (12 total):
| Device | IEEE Address | Model | Type |
|--------|-------------|-------|------|
| temp_server | 0x54ef441000a564b6 | WSDCGQ12LM | Temperature sensor (battery fix applied) |
| (Temp Living Room) | 0x54ef441000a54d3c | WSDCGQ12LM | Temperature sensor (battery fix applied) |
| (Temp Office) | 0x54ef441000a547bd | WSDCGQ12LM | Temperature sensor (battery fix applied) |
| (Temp Bedroom) | 0x00124b0025495463 | SNZB-02 | Temperature sensor (battery works) |
| (Water leak) | 0x54ef4410009ac117 | SJCGQ12LM | Water leak sensor |
| btn_livingroom | 0x54ef441000a1f907 | WXKG13LM | Wireless mini switch |
| btn_bedroom | 0x54ef441000a1ee71 | WXKG13LM | Wireless mini switch |
| (Hue bulb) | 0x001788010dc35d06 | 9290024688 | Hue E27 1100lm (Router) |
| (Hue bulb) | 0x001788010dc5f003 | 9290024688 | Hue E27 1100lm (Router) |
| (Hue ceiling) | 0x001788010e371aa4 | 915005997301 | Hue Infuse medium (Router) |
| (Hue ceiling) | 0x001788010d253b99 | 915005997301 | Hue Infuse medium (Router) |
| (Hue wall) | 0x001788010d1b599a | 929003052901 | Hue Sana wall light (Router, transition=5) |
## Implementation
### Solution 1: Calculate battery from voltage in Zigbee2MQTT (Implemented)
Override the Home Assistant battery entity's `value_template` in Zigbee2MQTT device configuration to calculate battery percentage from voltage.
**Formula:** `(voltage - 2100) / 9` (maps 2100-3000mV to 0-100%)
**Changes in `services/home-assistant/default.nix`:**
- Device configuration moved from external `devices.yaml` to inline NixOS config
- Three affected sensors have a `homeassistant.sensor_battery.value_template` override (example below)
- All 12 devices now declaratively managed
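For one of the affected sensors, the inline device entry looks roughly like this (sketch; the committed template may round or clamp differently):
```nix
# Sketch of one device entry in services/home-assistant/default.nix.
services.zigbee2mqtt.settings.devices."0x54ef441000a564b6" = {
  friendly_name = "temp_server";
  # Map 2100-3000 mV to 0-100% using the formula above.
  homeassistant.sensor_battery.value_template =
    "{{ ((value_json.voltage - 2100) / 9) | round(0) }}";
};
```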
**Expected battery values based on current voltages:**
| Sensor | Voltage | Expected Battery |
|--------|---------|------------------|
| Temp Living Room | 2710 mV | ~68% |
| Temp Office | 2658 mV | ~62% |
| temp_server | 2765 mV | ~74% |
### Solution 2: Alert on sensor staleness (Implemented)
Added Prometheus alert `zigbee_sensor_stale` in `services/monitoring/rules.yml` that fires when a Zigbee temperature sensor hasn't updated in over 1 hour. This provides defense-in-depth for detecting dead sensors regardless of battery reporting accuracy.
**Alert details:**
- Expression: `(time() - hass_last_updated_time_seconds{entity=~"sensor\\.(0x[0-9a-f]+|temp_server)_temperature"}) > 3600`
- Severity: warning
- For: 5m
## Pre-Deployment Verification
### Backup Verification
Before deployment, verified ha1 backup configuration and ran manual backup:
**Backup paths:**
- `/var/lib/hass`
- `/var/lib/zigbee2mqtt`
- `/var/lib/mosquitto`
**Manual backup (2026-02-05 22:45:23):**
- Snapshot ID: `59704dfa`
- Files: 77 total (0 new, 13 changed, 64 unmodified)
- Data: 62.635 MiB processed, 6.928 MiB stored (compressed)
### Other directories reviewed
- `/var/lib/vault` — Contains AppRole credentials; not backed up (can be re-provisioned via Ansible)
- `/var/lib/sops-nix` — Legacy; ha1 uses Vault now
## Post-Deployment Steps
After deploying to ha1:
1. Restart zigbee2mqtt service (automatic on NixOS rebuild)
2. In Home Assistant, the battery entities may need to be re-discovered:
- Go to Settings → Devices & Services → MQTT
- The new `value_template` should take effect after entity re-discovery
- If not, try disabling and re-enabling the battery entities
## Notes
- Device configuration is now declarative in NixOS. Future device additions via Zigbee2MQTT frontend will need to be added to the NixOS config to persist.
- The `devices.yaml` file on ha1 will be overwritten on service start but can be removed after confirming the new config works.
- The NixOS zigbee2mqtt module defaults to `devices = "devices.yaml"` but our explicit inline config overrides this.

View File

@@ -0,0 +1,224 @@
# Host Migration to OpenTofu
## Overview
Migrate all existing hosts (provisioned manually before the OpenTofu pipeline) into the new
OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements:
stateless hosts are simply recreated, stateful hosts require backup and restore, and some
hosts are decommissioned or deferred.
## Current State
Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
Hosts to migrate:
| Host | Category | Notes |
|------|----------|-------|
| ns1 | Stateless | Primary DNS, recreate |
| ns2 | Stateless | Secondary DNS, recreate |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| auth01 | Decommission | No longer in use |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Stateful | PostgreSQL databases |
| jump | Decommission | No longer needed |
| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
## Phase 1: Backup Preparation
Before migrating any stateful host, ensure restic backups are in place and verified.
### 1a. Expand monitoring01 Grafana Backup
The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.
### 1b. Add Jellyfin Backup to jelly01
No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` which contains:
- `config/` — server settings, library configuration
- `data/` — user watch history, playback state, library metadata
Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.
### 1c. Add PostgreSQL Backup to pgdb1
No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
all databases and roles. The dump should be piped through restic's stdin backup (similar to
the Grafana DB dump pattern on monitoring01).
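Mirroring the grafana-db stdin job on monitoring01, this could look roughly as follows; the job name and the `sudo -u postgres` invocation are assumptions, while the repository, password file, and prune policy are copied from the existing jobs:
```nix
# Sketch -- mirrors the grafana-db stdin backup pattern.
{ config, ... }:
{
  services.restic.backups.pgdb1-dump = {
    repository = "rest:http://10.69.12.52:8000/backup-nix";
    passwordFile = "/run/secrets/backup_helper_secret";
    # Dump all databases and roles to stdin for restic to store.
    command = [
      "/run/wrappers/bin/sudo"
      "-u"
      "postgres"
      "${config.services.postgresql.package}/bin/pg_dumpall"
    ];
    timerConfig = {
      OnCalendar = "daily";
      Persistent = true;
      RandomizedDelaySec = "2h";
    };
    pruneOpts = [
      "--keep-daily 7"
      "--keep-weekly 4"
      "--keep-monthly 6"
      "--keep-within 1d"
    ];
  };
}
```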
### 1d. Verify Existing ha1 Backup
ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.
### 1e. Verify All Backups
After adding/expanding backup jobs:
1. Trigger a manual backup run on each host
2. Verify backup integrity with `restic check`
3. Test a restore to a temporary location to confirm data is recoverable
## Phase 2: Declare pgdb1 Databases in Nix
Before migrating pgdb1, audit the manually-created databases and users on the running
instance, then declare them in the Nix configuration using `ensureDatabases` and
`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
Steps:
1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
3. Document any non-default PostgreSQL settings or extensions per database
After reprovisioning, the databases will be created by NixOS, and data restored from the
`pg_dumpall` backup.
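The resulting declaration in `services/postgres/postgres.nix` would look something like this; the database and role names are placeholders for whatever the audit finds:
```nix
# Placeholder names -- fill in from the \l and \du audit.
services.postgresql = {
  ensureDatabases = [
    "exampledb"
  ];
  ensureUsers = [
    {
      name = "exampledb";
      ensureDBOwnership = true; # owns the database of the same name
    }
  ];
};
```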
## Phase 3: Stateless Host Migration
These hosts have no meaningful state and can be recreated fresh. For each host:
1. Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
2. Commit and push to master
3. Run `tofu apply` to provision the new VM
4. Wait for bootstrap to complete (VM pulls config from master and reboots)
5. Verify the host is functional
6. Decommission the old VM in Proxmox
### Migration Order
Migrate stateless hosts in an order that minimizes disruption:
1. **nix-cache01** — low risk, no downstream dependencies during migration
2. **nats1** — low risk, verify no persistent JetStream streams first
3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
4. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
## Phase 4: Stateful Host Migration
For each stateful host, the procedure is:
1. Trigger a final restic backup
2. Stop services on the old host (to prevent state drift during migration)
3. Provision the new VM via `tofu apply`
4. Wait for bootstrap to complete
5. Stop the relevant services on the new host
6. Restore data from restic backup
7. Start services and verify functionality
8. Decommission the old VM
### 4a. pgdb1
1. Run final `pg_dumpall` backup via restic
2. Stop PostgreSQL on the old host
3. Provision new pgdb1 via OpenTofu
4. After bootstrap, NixOS creates the declared databases/users
5. Restore data with `psql < dumpall.sql` (a `pg_dumpall` dump is plain SQL, so `pg_restore` does not apply)
6. Verify database connectivity from gunter (`10.69.30.105`)
7. Decommission old VM
### 4b. monitoring01
1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
3. After bootstrap, restore `/var/lib/grafana/` from restic
4. Restart Grafana, verify dashboards and datasources are intact
5. Prometheus and Loki start fresh with empty data (acceptable)
6. Verify all scrape targets are being collected
7. Decommission old VM
### 4c. jelly01
1. Run final Jellyfin backup
2. Provision new jelly01 via OpenTofu
3. After bootstrap, restore `/var/lib/jellyfin/` from restic
4. Verify NFS mount to NAS is working
5. Start Jellyfin, verify watch history and library metadata are present
6. Decommission old VM
### 4d. ha1
1. Verify latest restic backup is current
2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
3. Provision new ha1 via OpenTofu
4. After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
5. Start services, verify Home Assistant is functional
6. Verify Zigbee devices are still paired and communicating
7. Decommission old VM
**Note:** ha1 currently has 2 GB RAM, which is consistently tight. Average memory usage has
climbed from ~57% (30-day avg) to ~70% currently, with a 30-day low of only 187 MB free.
Consider increasing to 4 GB when reprovisioning to allow headroom for additional integrations.
**Note:** ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but verify during a
non-critical time window.
**USB Passthrough:** The ha1 VM has a USB device passed through from the Proxmox hypervisor
(the Zigbee coordinator). The new VM must be configured with the same USB passthrough in
OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropriate
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.
## Phase 5: Decommission jump and auth01 Hosts
### jump
1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
2. Remove host configuration from `hosts/jump/`
3. Remove from `flake.nix`
4. Remove any secrets in `secrets/jump/`
5. Remove from `.sops.yaml`
6. Destroy the VM in Proxmox
7. Commit cleanup
### auth01
1. Remove host configuration from `hosts/auth01/`
2. Remove from `flake.nix`
3. Remove any secrets in `secrets/auth01/`
4. Remove from `.sops.yaml`
5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
6. Destroy the VM in Proxmox
7. Commit cleanup
## Phase 6: Decommission ca Host (Deferred)
Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
the same cleanup steps as the jump host.
## Phase 7: Remove sops-nix
Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
all remnants:
- `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
`hosts/template2/scripts.nix`)
See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
## Notes
- Each host migration should be done individually, not in bulk, to limit blast radius
- Keep the old VM running until the new one is verified — do not destroy prematurely
- The old VMs use IPs that the new VMs need, so the old VM must be shut down before
the new one is provisioned (or use a temporary IP and swap after verification)
- Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only host not in OpenTofu will be ca (deferred)
- Since many hosts are being recreated, this is a good opportunity to establish consistent
hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
(e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
convention before starting migrations — e.g. whether to always use numeric suffixes, a
consistent format like `service-NN`, role-based vs function-based names, etc.

View File

@@ -0,0 +1,27 @@
# NixOS Infrastructure Improvements
This document contains planned improvements to the NixOS infrastructure that are not directly part of the automated deployment pipeline.
## Planned
### Custom NixOS Options for Service and System Configuration
Currently, most service configurations in `services/` and shared system configurations in `system/` are written as plain NixOS module imports without declaring custom options. This means host-specific customization is done by directly setting upstream NixOS options or by duplicating configuration across hosts.
The `homelab.dns` module (`modules/homelab/dns.nix`) is the first example of defining custom options under a `homelab.*` namespace. This pattern should be extended to more of the repository's configuration.
**Goals:**
- Define `homelab.*` options for services and shared configuration where it makes sense, following the pattern established by `homelab.dns`
- Allow hosts to enable/configure services declaratively (e.g. `homelab.monitoring.enable`, `homelab.http-proxy.virtualHosts`) rather than importing opaque module files
- Keep options simple and focused — wrap only the parts that vary between hosts or that benefit from a clearer interface. Not everything needs a custom option.
**Candidate areas:**
- `system/` modules (e.g. auto-upgrade schedule, ACME CA URL, monitoring endpoints)
- `services/` modules where multiple hosts use the same service with different parameters
- Cross-cutting concerns that are currently implicit (e.g. which Loki endpoint promtail ships to)
## Completed
- [DNS Automation](completed/dns-automation.md) - Automatically generate DNS entries from host configurations

View File

@@ -0,0 +1,119 @@
# Prometheus Scrape Target Labels
## Goal
Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
## Motivation
Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.
With per-host labels, alert rules can use semantic filters like `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`.
## Proposed Labels
### `priority`
Indicates alerting importance. Hosts with `priority = "low"` can have relaxed thresholds or longer durations in alert rules.
Values: `"high"` (default), `"low"`
### `role`
Describes the function of the host. Useful for grouping in dashboards and targeting role-specific alert rules.
Values: free-form string, e.g. `"dns"`, `"build-host"`, `"database"`, `"monitoring"`
**Note on multiple roles:** Prometheus labels are strictly string values, not lists. For hosts that serve multiple roles there are a few options:
- **Separate boolean labels:** `role_build_host = "true"`, `role_cache_server = "true"` -- flexible but verbose, and requires updating the module when new roles are added.
- **Delimited string:** `role = "build-host,cache-server"` -- works with regex matchers (`{role=~".*build-host.*"}`), but regex matching is less clean and more error-prone.
- **Pick a primary role:** `role = "build-host"` -- simplest, and probably sufficient since most hosts have one primary role.
Recommendation: start with a single primary role string. If multi-role matching becomes a real need, switch to separate boolean labels.
### `dns_role`
For DNS servers specifically, distinguish between primary and secondary resolvers. The secondary resolver (ns2) receives very little traffic and has a cold cache, making generic cache hit ratio alerts inappropriate.
Values: `"primary"`, `"secondary"`
Example use case: The `unbound_low_cache_hit_ratio` alert fires on ns2 because its cache hit ratio (~62%) is lower than ns1 (~90%). This is expected behavior since ns2 gets ~100x less traffic. With a `dns_role` label, the alert can either exclude secondaries or use different thresholds:
```promql
# Only alert on primary DNS
unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"}
# Or use different thresholds
(unbound_cache_hit_ratio < 0.7 and on(instance) unbound_up{dns_role="primary"})
or
(unbound_cache_hit_ratio < 0.5 and on(instance) unbound_up{dns_role="secondary"})
```
## Implementation
### 1. Add `labels` option to `homelab.monitoring`
In `modules/homelab/monitoring.nix`, add:
```nix
labels = lib.mkOption {
  type = lib.types.attrsOf lib.types.str;
  default = { };
  description = "Custom labels to attach to this host's scrape targets";
};
```
### 2. Update `lib/monitoring.nix`
- `extractHostMonitoring` should carry `labels` through in its return value.
- `generateNodeExporterTargets` currently returns a flat list of target strings. It needs to return structured `static_configs` entries instead, grouping targets by their label sets:
```nix
# Before (flat list):
[ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" ... ]
# After (grouped by labels):
[
  { targets = [ "ns1.home.2rjus.net:9100" "ns2.home.2rjus.net:9100" ... ]; }
  { targets = [ "nix-cache01.home.2rjus.net:9100" ]; labels = { priority = "low"; role = "build-host"; }; }
]
```
This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with no custom labels get grouped together with no extra labels (preserving current behavior).
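A sketch of that grouping step, assuming each extracted host carries `fqdn` and `labels` attributes (field names assumed; the real `lib/monitoring.nix` interface may differ):
```nix
# Sketch: one static_configs entry per unique label set.
{ lib }:
hosts: # list of { fqdn, labels } attrsets (field names assumed)
let
  grouped = builtins.groupBy (host: builtins.toJSON host.labels) hosts;
in
lib.mapAttrsToList
  (_: group:
    {
      targets = map (host: "${host.fqdn}:9100") group;
    }
    // lib.optionalAttrs ((builtins.head group).labels != { }) {
      labels = (builtins.head group).labels;
    })
  grouped
```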
### 3. Update `services/monitoring/prometheus.nix`
Change the node-exporter scrape config to use the new structured output:
```nix
# Before:
static_configs = [{ targets = nodeExporterTargets; }];
# After:
static_configs = nodeExporterTargets;
```
### 4. Set labels on hosts
Example in `hosts/nix-cache01/configuration.nix` or the relevant service module:
```nix
homelab.monitoring.labels = {
priority = "low";
role = "build-host";
};
```
### 5. Update alert rules
After implementing labels, review and update `services/monitoring/rules.yml`:
- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.
Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
### 6. Consider labels for `generateScrapeConfigs` (service targets)
The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.

122
docs/plans/remote-access.md Normal file
View File

@@ -0,0 +1,122 @@
# Remote Access to Homelab Services
## Status: Planning
## Goal
Enable remote access to some or all homelab services from outside the internal network, without exposing anything directly to the internet.
## Current State
- All services are only accessible from the internal 10.69.13.x network
- Exception: jelly01 has a WireGuard link to an external VPS
- No services are directly exposed to the public internet
## Constraints
- Nothing should be directly accessible from the outside
- Must use VPN or overlay network (no port forwarding of services)
- Self-hosted solutions preferred over managed services
## Options
### 1. WireGuard Gateway (Internal Router)
A dedicated NixOS host on the internal network with a WireGuard tunnel out to the VPS. The VPS becomes the public entry point, and the gateway routes traffic to internal services. Firewall rules on the gateway control which services are reachable.
**Pros:**
- Simple, well-understood technology
- Already running WireGuard for jelly01
- Full control over routing and firewall rules
- Excellent NixOS module support
- No extra dependencies
**Cons:**
- Hub-and-spoke topology (all traffic goes through VPS)
- Manual peer management
- Adding a new client device means editing configs on both VPS and gateway
### 2. WireGuard Mesh (No Relay)
Each client device connects directly to a WireGuard endpoint. The endpoint could be the VPS (forwarding into the homelab) or, if there is a routable IP at home, an internal host directly.
**Pros:**
- Simple and fast
- No extra software
**Cons:**
- Manual key and endpoint management for every peer
- Doesn't scale well
- If behind CGNAT, still needs the VPS as intermediary
### 3. Headscale (Self-Hosted Tailscale)
Run a Headscale control server (on the VPS or internally) and install the Tailscale client on homelab hosts and personal devices. Gets the Tailscale mesh networking UX without depending on Tailscale's infrastructure.
**Pros:**
- Mesh topology - devices communicate directly via NAT traversal (DERP relay as fallback)
- Easy to add/remove devices
- ACL support for granular access control
- MagicDNS for service discovery
- Good NixOS support for both headscale server and tailscale client
- Subnet routing lets you expose the entire 10.69.13.x network or specific hosts without installing tailscale on every host
**Cons:**
- More moving parts than plain WireGuard
- Headscale is a third-party reimplementation, can lag behind Tailscale features
- Need to run and maintain the control server
### 4. Tailscale (Managed)
Same as Headscale but using Tailscale's hosted control plane.
**Pros:**
- Zero infrastructure to manage on the control plane side
- Polished UX, well-maintained clients
- Free tier covers personal use
**Cons:**
- Dependency on Tailscale's service
- Less aligned with self-hosting preference
- Coordination metadata goes through their servers (data plane is still peer-to-peer)
### 5. Netbird (Self-Hosted)
Open-source alternative to Tailscale with a self-hostable management server. WireGuard-based, supports ACLs and NAT traversal.
**Pros:**
- Fully self-hostable
- Web UI for management
- ACL and peer grouping support
**Cons:**
- Heavier to self-host (needs multiple components: management server, signal server, TURN relay)
- Less mature NixOS module support compared to Tailscale/Headscale
### 6. Nebula (by Defined Networking)
Certificate-based mesh VPN. Each node gets a certificate from a CA you control. No central coordination server needed at runtime.
**Pros:**
- No always-on control plane
- Certificate-based identity
- Lightweight
**Cons:**
- Less convenient for ad-hoc device addition (need to issue certs)
- NAT traversal less mature than Tailscale's
- Smaller community/ecosystem
## Key Decision Points
- **Static public IP vs CGNAT?** Determines whether clients can connect directly to home network or need VPS relay.
- **Number of client devices?** If just phone and laptop, plain WireGuard via VPS is fine. More devices favors Headscale.
- **Per-service vs per-network access?** Gateway with firewall rules gives per-service control. Headscale ACLs can also do this. Plain WireGuard gives network-level access with gateway firewall for finer control.
- **Subnet routing vs per-host agents?** With Headscale/Tailscale, can either install client on every host, or use a single subnet router that advertises the 10.69.13.x range. The latter is closer to the gateway approach and avoids touching every host.
## Leading Candidates
Based on existing WireGuard experience, self-hosting preference, and NixOS stack:
1. **Headscale with a subnet router** - Best balance of convenience and self-hosting
2. **WireGuard gateway via VPS** - Simplest, most transparent, builds on existing setup
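If the Headscale route is chosen, the subnet-router side on a NixOS host would be small; a sketch assuming the Headscale server is reachable at a hypothetical `https://hs.example.net` and a pre-auth key is provisioned as a secret:
```nix
# Sketch of a subnet-router host (login-server URL and key path are assumptions).
services.tailscale = {
  enable = true;
  useRoutingFeatures = "server"; # enables IP forwarding for advertised routes
  authKeyFile = "/run/secrets/headscale_authkey";
  extraUpFlags = [
    "--login-server=https://hs.example.net"
    "--advertise-routes=10.69.13.0/24"
  ];
};
```
The advertised route must also be approved on the Headscale side before clients can reach the 10.69.13.x network through it.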

40
flake.lock generated
View File

@@ -21,27 +21,6 @@
"url": "https://git.t-juice.club/torjus/alerttonotify"
}
},
"backup-helper": {
"inputs": {
"nixpkgs": [
"nixpkgs-unstable"
]
},
"locked": {
"lastModified": 1738015166,
"narHash": "sha256-573tR4aXNjILKvYnjZUM5DZZME2H6YTHJkUKs3ZehFU=",
"ref": "master",
"rev": "f9540cc065692c7ca80735e7b08399459e0ea6d6",
"revCount": 35,
"type": "git",
"url": "https://git.t-juice.club/torjus/backup-helper"
},
"original": {
"ref": "master",
"type": "git",
"url": "https://git.t-juice.club/torjus/backup-helper"
}
},
"labmon": {
"inputs": {
"nixpkgs": [
@@ -65,11 +44,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1769900590,
"narHash": "sha256-I7Lmgj3owOTBGuauy9FL6qdpeK2umDoe07lM4V+PnyA=",
"lastModified": 1770136044,
"narHash": "sha256-tlFqNG/uzz2++aAmn4v8J0vAkV3z7XngeIIB3rM3650=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "41e216c0ca66c83b12ab7a98cc326b5db01db646",
"rev": "e576e3c9cf9bad747afcddd9e34f51d18c855b4e",
"type": "github"
},
"original": {
@@ -81,11 +60,11 @@
},
"nixpkgs-unstable": {
"locked": {
"lastModified": 1769789167,
"narHash": "sha256-kKB3bqYJU5nzYeIROI82Ef9VtTbu4uA3YydSk/Bioa8=",
"lastModified": 1770197578,
"narHash": "sha256-AYqlWrX09+HvGs8zM6ebZ1pwUqjkfpnv8mewYwAo+iM=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "62c8382960464ceb98ea593cb8321a2cf8f9e3e5",
"rev": "00c21e4c93d963c50d4c0c89bfa84ed6e0694df2",
"type": "github"
},
"original": {
@@ -98,7 +77,6 @@
"root": {
"inputs": {
"alerttonotify": "alerttonotify",
"backup-helper": "backup-helper",
"labmon": "labmon",
"nixpkgs": "nixpkgs",
"nixpkgs-unstable": "nixpkgs-unstable",
@@ -112,11 +90,11 @@
]
},
"locked": {
"lastModified": 1769921679,
"narHash": "sha256-twBMKGQvaztZQxFxbZnkg7y/50BW9yjtCBWwdjtOZew=",
"lastModified": 1770145881,
"narHash": "sha256-ktjWTq+D5MTXQcL9N6cDZXUf9kX8JBLLBLT0ZyOTSYY=",
"owner": "Mic92",
"repo": "sops-nix",
"rev": "1e89149dcfc229e7e2ae24a8030f124a31e4f24f",
"rev": "17eea6f3816ba6568b8c81db8a4e6ca438b30b7c",
"type": "github"
},
"original": {

View File

@@ -9,10 +9,6 @@
url = "github:Mic92/sops-nix";
inputs.nixpkgs.follows = "nixpkgs-unstable";
};
backup-helper = {
url = "git+https://git.t-juice.club/torjus/backup-helper?ref=master";
inputs.nixpkgs.follows = "nixpkgs-unstable";
};
alerttonotify = {
url = "git+https://git.t-juice.club/torjus/alerttonotify?ref=master";
inputs.nixpkgs.follows = "nixpkgs-unstable";
@@ -29,7 +25,6 @@
nixpkgs,
nixpkgs-unstable,
sops-nix,
backup-helper,
alerttonotify,
labmon,
...
@@ -90,55 +85,6 @@
sops-nix.nixosModules.sops
];
};
ns3 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = [
(
{ config, pkgs, ... }:
{
nixpkgs.overlays = commonOverlays;
}
)
./hosts/ns3
sops-nix.nixosModules.sops
];
};
ns4 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = [
(
{ config, pkgs, ... }:
{
nixpkgs.overlays = commonOverlays;
}
)
./hosts/ns4
sops-nix.nixosModules.sops
];
};
nixos-test1 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = [
(
{ config, pkgs, ... }:
{
nixpkgs.overlays = commonOverlays;
}
)
./hosts/nixos-test1
sops-nix.nixosModules.sops
backup-helper.nixosModules.backup-helper
];
};
ha1 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
@@ -153,7 +99,6 @@
)
./hosts/ha1
sops-nix.nixosModules.sops
backup-helper.nixosModules.backup-helper
];
};
template1 = nixpkgs.lib.nixosSystem {
@@ -234,7 +179,6 @@
)
./hosts/monitoring01
sops-nix.nixosModules.sops
backup-helper.nixosModules.backup-helper
labmon.nixosModules.labmon
];
};
@@ -270,22 +214,6 @@
sops-nix.nixosModules.sops
];
};
media1 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = [
(
{ config, pkgs, ... }:
{
nixpkgs.overlays = commonOverlays;
}
)
./hosts/media1
sops-nix.nixosModules.sops
];
};
pgdb1 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
@@ -318,22 +246,6 @@
sops-nix.nixosModules.sops
];
};
auth01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = [
(
{ config, pkgs, ... }:
{
nixpkgs.overlays = commonOverlays;
}
)
./hosts/auth01
sops-nix.nixosModules.sops
];
};
testvm01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {

View File

@@ -1,65 +0,0 @@
{
pkgs,
...
}:
{
imports = [
../template/hardware-configuration.nix
../../system
../../common/vm
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
enable = true;
device = "/dev/sda";
configurationLimit = 3;
};
networking.hostName = "auth01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.18/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
services.qemuGuest.enable = true;
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,8 +0,0 @@
{ ... }:
{
imports = [
./configuration.nix
../../services/lldap
../../services/authelia
];
}

View File

@@ -55,16 +55,35 @@
git
];
# Vault secrets management
vault.enable = true;
vault.secrets.backup-helper = {
secretPath = "shared/backup/password";
extractKey = "password";
outputDir = "/run/secrets/backup_helper_secret";
services = [ "restic-backups-ha1" ];
};
# Backup service dirs
sops.secrets."backup_helper_secret" = { };
backup-helper = {
enable = true;
password-file = "/run/secrets/backup_helper_secret";
backup-dirs = [
services.restic.backups.ha1 = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
paths = [
"/var/lib/hass"
"/var/lib/zigbee2mqtt"
"/var/lib/mosquitto"
];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
RandomizedDelaySec = "2h";
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
};
# Open ports in the firewall.

View File

@@ -11,6 +11,20 @@
../../common/vm
];
homelab.dns.cnames = [
"nzbget"
"radarr"
"sonarr"
"ha"
"z2m"
"grafana"
"prometheus"
"alertmanager"
"jelly"
"pyroscope"
"pushgw"
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
@@ -46,6 +60,8 @@
"nix-command"
"flakes"
];
vault.enable = true;
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim

View File

@@ -1,9 +1,12 @@
{ config, ... }:
{
sops.secrets.wireguard_private_key = {
sopsFile = ../../secrets/http-proxy/wireguard.yaml;
key = "wg_private_key";
vault.secrets.wireguard = {
secretPath = "hosts/http-proxy/wireguard";
extractKey = "private_key";
outputDir = "/run/secrets/wireguard_private_key";
services = [ "wireguard-wg0" ];
};
networking.wireguard = {
enable = true;
useNetworkd = true;
@@ -13,7 +16,7 @@
ips = [ "10.69.222.3/24" ];
mtu = 1384;
listenPort = 51820;
privateKeyFile = config.sops.secrets.wireguard_private_key.path;
privateKeyFile = "/run/secrets/wireguard_private_key";
peers = [
{
name = "docker2.t-juice.club";
@@ -26,7 +29,11 @@
};
};
};
# monitoring
homelab.monitoring.scrapeTargets = [{
job_name = "wireguard";
port = 9586;
}];
services.prometheus.exporters.wireguard = {
enable = true;
};

View File

@@ -1,76 +0,0 @@
{
pkgs,
...
}:
{
imports = [
./hardware-configuration.nix
../../system
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot = {
loader.systemd-boot = {
enable = true;
configurationLimit = 5;
memtest86.enable = true;
};
loader.efi.canTouchEfiVariables = true;
supportedFilesystems = [ "nfs" ];
};
networking.hostName = "media1";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."enp2s0" = {
matchConfig.Name = "enp2s0";
address = [
"10.69.12.82/24"
];
routes = [
{ Gateway = "10.69.12.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
# Graphics
hardware.graphics = {
enable = true;
extraPackages = with pkgs; [
libvdpau-va-gl
libva-vdpau-driver
];
};
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,7 +0,0 @@
{ ... }:
{
imports = [
./configuration.nix
./kodi.nix
];
}

View File

@@ -1,33 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[
(modulesPath + "/installer/scan/not-detected.nix")
];
boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "usb_storage" "usbhid" "sd_mod" "rtsx_usb_sdmmc" ];
boot.initrd.kernelModules = [ ];
boot.kernelModules = [ "kvm-amd" ];
boot.extraModulePackages = [ ];
fileSystems."/" =
{
device = "/dev/disk/by-uuid/3e7c311c-b1a3-4be7-b8bf-e497cba64302";
fsType = "btrfs";
};
fileSystems."/boot" =
{
device = "/dev/disk/by-uuid/F0D7-E5C1";
fsType = "vfat";
options = [ "fmask=0022" "dmask=0022" ];
};
swapDevices =
[{ device = "/dev/disk/by-uuid/1a06a36f-da61-4d36-b94e-b852836c328a"; }];
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
hardware.cpu.amd.updateMicrocode = lib.mkDefault config.hardware.enableRedistributableFirmware;
}

View File

@@ -1,29 +0,0 @@
{ pkgs, ... }:
let
kodipkg = pkgs.kodi-wayland.withPackages (
p: with p; [
jellyfin
]
);
in
{
users.users.kodi = {
isNormalUser = true;
description = "Kodi Media Center user";
};
#services.xserver = {
# enable = true;
#};
services.cage = {
enable = true;
user = "kodi";
environment = {
XKB_DEFAULT_LAYOUT = "no";
};
program = "${kodipkg}/bin/kodi";
};
environment.systemPackages = with pkgs; [
firefox
];
}

View File

@@ -56,16 +56,46 @@
services.qemuGuest.enable = true;
sops.secrets."backup_helper_secret" = { };
backup-helper = {
enable = true;
password-file = "/run/secrets/backup_helper_secret";
backup-dirs = [
"/var/lib/grafana/plugins"
# Vault secrets management
vault.enable = true;
vault.secrets.backup-helper = {
secretPath = "shared/backup/password";
extractKey = "password";
outputDir = "/run/secrets/backup_helper_secret";
services = [ "restic-backups-grafana" "restic-backups-grafana-db" ];
};
services.restic.backups.grafana = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
paths = [ "/var/lib/grafana/plugins" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
RandomizedDelaySec = "2h";
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
backup-commands = [
# "grafana.db:${pkgs.sqlite}/bin/sqlite /var/lib/grafana/data/grafana.db .dump"
"grafana.db:${pkgs.sqlite}/bin/sqlite3 /var/lib/grafana/data/grafana.db .dump"
};
services.restic.backups.grafana-db = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
command = [ "${pkgs.sqlite}/bin/sqlite3" "/var/lib/grafana/data/grafana.db" ".dump" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
RandomizedDelaySec = "2h";
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
};

View File

@@ -11,6 +11,8 @@
../../common/vm
];
homelab.dns.cnames = [ "nix-cache" "actions1" ];
fileSystems."/nix" = {
device = "/dev/disk/by-label/nixcache";
fsType = "xfs";
@@ -50,6 +52,8 @@
"nix-command"
"flakes"
];
vault.enable = true;
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim

View File

@@ -1,67 +0,0 @@
{ config, lib, pkgs, ... }:
{
imports =
[
../template/hardware-configuration.nix
../../system
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";
networking.hostName = "nixos-test1";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.10/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [ "nix-command" "flakes" ];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
# Secrets
# Backup helper
sops.secrets."backup_helper_secret" = { };
backup-helper = {
enable = true;
password-file = "/run/secrets/backup_helper_secret";
backup-dirs = [
"/etc/machine-id"
"/etc/os-release"
];
};
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,5 +0,0 @@
{ ... }: {
imports = [
./configuration.nix
];
}

View File

@@ -47,6 +47,8 @@
"nix-command"
"flakes"
];
vault.enable = true;
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim

View File

@@ -47,6 +47,8 @@
"nix-command"
"flakes"
];
vault.enable = true;
environment.systemPackages = with pkgs; [
vim
wget

View File

@@ -1,56 +0,0 @@
{ config, lib, pkgs, ... }:
{
imports =
[
../template/hardware-configuration.nix
../../system
../../services/ns/master-authorative.nix
../../services/ns/resolver.nix
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";
networking.hostName = "ns3";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = false;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.7/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [ "nix-command" "flakes" ];
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,5 +0,0 @@
{ ... }: {
imports = [
./configuration.nix
];
}

View File

@@ -1,36 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[
(modulesPath + "/profiles/qemu-guest.nix")
];
boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
boot.initrd.kernelModules = [ ];
# boot.kernelModules = [ ];
# boot.extraModulePackages = [ ];
fileSystems."/" =
{
device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
fsType = "xfs";
};
fileSystems."/boot" =
{
device = "/dev/disk/by-uuid/BC07-3B7A";
fsType = "vfat";
};
swapDevices =
[{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
# (the default) this is the recommended approach. When using systemd-networkd it's
# still possible to use this option, but it's recommended to use it in conjunction
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}

View File

@@ -1,56 +0,0 @@
{ config, lib, pkgs, ... }:
{
imports =
[
../template/hardware-configuration.nix
../../system
../../services/ns/secondary-authorative.nix
../../services/ns/resolver.nix
];
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";
networking.hostName = "ns4";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = false;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.8/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [ "nix-command" "flakes" ];
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -1,5 +0,0 @@
{ ... }: {
imports = [
./configuration.nix
];
}

View File

@@ -1,36 +0,0 @@
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[
(modulesPath + "/profiles/qemu-guest.nix")
];
boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "virtio_pci" "virtio_scsi" "sd_mod" "sr_mod" ];
boot.initrd.kernelModules = [ ];
# boot.kernelModules = [ ];
# boot.extraModulePackages = [ ];
fileSystems."/" =
{
device = "/dev/disk/by-uuid/6889aba9-61ed-4687-ab10-e5cf4017ac8d";
fsType = "xfs";
};
fileSystems."/boot" =
{
device = "/dev/disk/by-uuid/BC07-3B7A";
fsType = "vfat";
};
swapDevices =
[{ device = "/dev/disk/by-uuid/64e5757b-6625-4dd2-aa2a-66ca93444d23"; }];
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
# (the default) this is the recommended approach. When using systemd-networkd it's
# still possible to use this option, but it's recommended to use it in conjunction
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}

View File

@@ -8,6 +8,9 @@
../../system
];
# Template host - exclude from DNS zone generation
homelab.dns.enable = false;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";

View File

@@ -1,7 +1,9 @@
{ pkgs, ... }:
let
prepare-host-script = pkgs.writeShellScriptBin "prepare-host.sh"
''
prepare-host-script = pkgs.writeShellApplication {
name = "prepare-host.sh";
runtimeInputs = [ pkgs.age ];
text = ''
echo "Removing machine-id"
rm -f /etc/machine-id || true
@@ -24,8 +26,9 @@ let
echo "Generate age key"
rm -rf /var/lib/sops-nix || true
mkdir -p /var/lib/sops-nix
${pkgs.age}/bin/age-keygen -o /var/lib/sops-nix/key.txt
age-keygen -o /var/lib/sops-nix/key.txt
'';
};
in
{
environment.systemPackages = [ prepare-host-script ];

View File

@@ -1,7 +1,9 @@
{ pkgs, ... }:
let
prepare-host-script = pkgs.writeShellScriptBin "prepare-host.sh"
''
prepare-host-script = pkgs.writeShellApplication {
name = "prepare-host.sh";
runtimeInputs = [ pkgs.age ];
text = ''
echo "Removing machine-id"
rm -f /etc/machine-id || true
@@ -24,8 +26,9 @@ let
echo "Generate age key"
rm -rf /var/lib/sops-nix || true
mkdir -p /var/lib/sops-nix
${pkgs.age}/bin/age-keygen -o /var/lib/sops-nix/key.txt
age-keygen -o /var/lib/sops-nix/key.txt
'';
};
in
{
environment.systemPackages = [ prepare-host-script ];

View File

@@ -13,6 +13,9 @@
../../common/vm
];
# Test VM - exclude from DNS zone generation
homelab.dns.enable = false;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

View File

@@ -14,6 +14,8 @@
../../services/vault
];
homelab.dns.cnames = [ "vault" ];
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

View File

@@ -5,6 +5,32 @@
...
}:
let
vault-test-script = pkgs.writeShellApplication {
name = "vault-test";
text = ''
echo "=== Vault Secret Test ==="
echo "Secret path: hosts/vaulttest01/test-service"
if [ -f /run/secrets/test-service/password ]; then
echo " Password file exists"
echo "Password length: $(wc -c < /run/secrets/test-service/password)"
else
echo " Password file missing!"
exit 1
fi
if [ -d /var/lib/vault/cache/test-service ]; then
echo " Cache directory exists"
else
echo " Cache directory missing!"
exit 1
fi
echo "Test successful!"
'';
};
in
{
imports = [
../template2/hardware-configuration.nix
@@ -79,32 +105,23 @@
Type = "oneshot";
RemainAfterExit = true;
ExecStart = pkgs.writeShellScript "vault-test" ''
echo "=== Vault Secret Test ==="
echo "Secret path: hosts/vaulttest01/test-service"
if [ -f /run/secrets/test-service/password ]; then
echo " Password file exists"
echo "Password length: $(wc -c < /run/secrets/test-service/password)"
else
echo " Password file missing!"
exit 1
fi
if [ -d /var/lib/vault/cache/test-service ]; then
echo " Cache directory exists"
else
echo " Cache directory missing!"
exit 1
fi
echo "Test successful!"
'';
ExecStart = lib.getExe vault-test-script;
StandardOutput = "journal+console";
};
};
# Test ACME certificate issuance from OpenBao PKI
# Override the global ACME server (from system/acme.nix) to use OpenBao instead of step-ca
security.acme.defaults.server = lib.mkForce "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
# Request a certificate for this host
# Using HTTP-01 challenge with standalone listener on port 80
security.acme.certs."vaulttest01.home.2rjus.net" = {
listenHTTP = ":80";
enableDebugLogs = true;
};
system.stateVersion = "25.11"; # Did you read the comment?
}

View File

@@ -6,10 +6,6 @@ import subprocess
IGNORED_HOSTS = [
"inc1",
"inc2",
"media1",
"nixos-test1",
"ns3",
"ns4",
"template1",
]

lib/dns-zone.nix Normal file
View File

@@ -0,0 +1,160 @@
{ lib }:
let
# Pad string on the right to reach a fixed width
rightPad = width: str:
let
len = builtins.stringLength str;
padding = if len >= width then "" else lib.strings.replicate (width - len) " ";
in
str + padding;
# Extract IP address from CIDR notation (e.g., "10.69.13.5/24" -> "10.69.13.5")
extractIP = address:
let
parts = lib.splitString "/" address;
in
builtins.head parts;
# Check if a network interface name looks like a VPN/tunnel interface
isVpnInterface = ifaceName:
lib.hasPrefix "wg" ifaceName ||
lib.hasPrefix "tun" ifaceName ||
lib.hasPrefix "tap" ifaceName ||
lib.hasPrefix "vti" ifaceName;
# Extract DNS information from a single host configuration
# Returns null if host should not be included in DNS
extractHostDNS = name: hostConfig:
let
cfg = hostConfig.config;
# Handle cases where homelab module might not be imported
dnsConfig = (cfg.homelab or { }).dns or { enable = true; cnames = [ ]; };
hostname = cfg.networking.hostName;
networks = cfg.systemd.network.networks or { };
# Filter out VPN interfaces and find networks with static addresses
# Check matchConfig.Name instead of network unit name (which can have prefixes like "40-")
physicalNetworks = lib.filterAttrs
(netName: netCfg:
let
ifaceName = netCfg.matchConfig.Name or "";
in
!(isVpnInterface ifaceName) && (netCfg.address or [ ]) != [ ])
networks;
# Get addresses from physical networks only
networkAddresses = lib.flatten (
lib.mapAttrsToList
(netName: netCfg: netCfg.address or [ ])
physicalNetworks
);
# Get the first address, if any
firstAddress = if networkAddresses != [ ] then builtins.head networkAddresses else null;
# Check if host uses DHCP (no static address)
usesDHCP = firstAddress == null ||
lib.any
(netName: (networks.${netName}.networkConfig.DHCP or "no") != "no")
(lib.attrNames networks);
in
if !(dnsConfig.enable or true) || firstAddress == null then
null
else
{
inherit hostname;
ip = extractIP firstAddress;
cnames = dnsConfig.cnames or [ ];
};
# Generate A record line
generateARecord = hostname: ip:
"${rightPad 20 hostname}IN A ${ip}";
# Generate CNAME record line
generateCNAME = alias: target:
"${rightPad 20 alias}IN CNAME ${target}";
# Generate zone file from flake configurations and external hosts
generateZone =
{ self
, externalHosts
, serial
, domain ? "home.2rjus.net"
, ttl ? 1800
, refresh ? 3600
, retry ? 900
, expire ? 1209600
, minTtl ? 120
, nameservers ? [ "ns1" "ns2" ]
, adminEmail ? "admin.test.2rjus.net"
}:
let
# Extract DNS info from all flake hosts
nixosConfigs = self.nixosConfigurations or { };
hostDNSList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostDNS nixosConfigs
);
# Sort hosts by IP for consistent output
sortedHosts = lib.sort (a: b: a.ip < b.ip) hostDNSList;
# Generate A records for flake hosts
flakeARecords = lib.concatMapStringsSep "\n" (host:
generateARecord host.hostname host.ip
) sortedHosts;
# Generate CNAMEs for flake hosts
flakeCNAMEs = lib.concatMapStringsSep "\n" (host:
lib.concatMapStringsSep "\n" (cname:
generateCNAME cname host.hostname
) host.cnames
) (lib.filter (h: h.cnames != [ ]) sortedHosts);
# Generate A records for external hosts
externalARecords = lib.concatStringsSep "\n" (
lib.mapAttrsToList (name: ip:
generateARecord name ip
) (externalHosts.aRecords or { })
);
# Generate CNAMEs for external hosts
externalCNAMEs = lib.concatStringsSep "\n" (
lib.mapAttrsToList (alias: target:
generateCNAME alias target
) (externalHosts.cnames or { })
);
# NS records
nsRecords = lib.concatMapStringsSep "\n" (ns:
" IN NS ${ns}.${domain}."
) nameservers;
# SOA record
soa = ''
$ORIGIN ${domain}.
$TTL ${toString ttl}
@ IN SOA ns1.${domain}. ${adminEmail}. (
${toString serial} ; serial number
${toString refresh} ; refresh
${toString retry} ; retry
${toString expire} ; expire
${toString minTtl} ; ttl
)'';
in
lib.concatStringsSep "\n\n" (lib.filter (s: s != "") [
soa
nsRecords
"; Flake-managed hosts (auto-generated)"
flakeARecords
(if flakeCNAMEs != "" then "; Flake-managed CNAMEs\n${flakeCNAMEs}" else "")
"; External hosts (not managed by this flake)"
externalARecords
(if externalCNAMEs != "" then "; External CNAMEs\n${externalCNAMEs}" else "")
""
]);
in
{
inherit extractIP extractHostDNS generateARecord generateCNAME generateZone;
}
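
A minimal sketch of how generateZone might be wired up from the flake. The call site is not part of this diff; the serial value and the external host entries below are hypothetical, and only the argument names follow the signature above.

let
  dnsLib = import ./lib/dns-zone.nix { inherit lib; };
  zoneText = dnsLib.generateZone {
    inherit self;
    serial = 2026020601;                    # hypothetical serial
    externalHosts = {
      aRecords = { nas = "10.69.12.200"; }; # hypothetical external A record
      cnames = { backup = "nas"; };         # hypothetical external CNAME
    };
  };
in
pkgs.writeText "home.2rjus.net.zone" zoneText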

lib/monitoring.nix Normal file
View File

@@ -0,0 +1,145 @@
{ lib }:
let
# Extract IP address from CIDR notation (e.g., "10.69.13.5/24" -> "10.69.13.5")
extractIP = address:
let
parts = lib.splitString "/" address;
in
builtins.head parts;
# Check if a network interface name looks like a VPN/tunnel interface
isVpnInterface = ifaceName:
lib.hasPrefix "wg" ifaceName ||
lib.hasPrefix "tun" ifaceName ||
lib.hasPrefix "tap" ifaceName ||
lib.hasPrefix "vti" ifaceName;
# Extract monitoring info from a single host configuration
# Returns null if host should not be included
extractHostMonitoring = name: hostConfig:
let
cfg = hostConfig.config;
monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
hostname = cfg.networking.hostName;
networks = cfg.systemd.network.networks or { };
# Filter out VPN interfaces and find networks with static addresses
physicalNetworks = lib.filterAttrs
(netName: netCfg:
let
ifaceName = netCfg.matchConfig.Name or "";
in
!(isVpnInterface ifaceName) && (netCfg.address or [ ]) != [ ])
networks;
# Get addresses from physical networks only
networkAddresses = lib.flatten (
lib.mapAttrsToList
(netName: netCfg: netCfg.address or [ ])
physicalNetworks
);
firstAddress = if networkAddresses != [ ] then builtins.head networkAddresses else null;
in
if !(monConfig.enable or true) || !(dnsConfig.enable or true) || firstAddress == null then
null
else
{
inherit hostname;
ip = extractIP firstAddress;
scrapeTargets = monConfig.scrapeTargets or [ ];
};
# Generate node-exporter targets from all flake hosts
generateNodeExporterTargets = self: externalTargets:
let
nixosConfigs = self.nixosConfigurations or { };
hostList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostMonitoring nixosConfigs
);
flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
in
flakeTargets ++ (externalTargets.nodeExporter or [ ]);
# Generate scrape configs from all flake hosts and external targets
generateScrapeConfigs = self: externalTargets:
let
nixosConfigs = self.nixosConfigurations or { };
hostList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostMonitoring nixosConfigs
);
# Collect all scrapeTargets from all hosts, grouped by job_name
allTargets = lib.flatten (map
(host:
map
(target: {
inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
hostname = host.hostname;
})
host.scrapeTargets
)
hostList
);
# Group targets by job_name
grouped = lib.groupBy (t: t.job_name) allTargets;
# Generate a scrape config for each job
flakeScrapeConfigs = lib.mapAttrsToList
(jobName: targets:
let
first = builtins.head targets;
targetAddrs = map
(t:
let
portStr = toString t.port;
in
"${t.hostname}.home.2rjus.net:${portStr}")
targets;
config = {
job_name = jobName;
static_configs = [{
targets = targetAddrs;
}];
}
// (lib.optionalAttrs (first.metrics_path != "/metrics") {
metrics_path = first.metrics_path;
})
// (lib.optionalAttrs (first.scheme != "http") {
scheme = first.scheme;
})
// (lib.optionalAttrs (first.scrape_interval != null) {
scrape_interval = first.scrape_interval;
})
// (lib.optionalAttrs first.honor_labels {
honor_labels = true;
});
in
config
)
grouped;
# External scrape configs
externalScrapeConfigs = map
(ext: {
job_name = ext.job_name;
static_configs = [{
targets = ext.targets;
}];
} // (lib.optionalAttrs (ext ? metrics_path) {
metrics_path = ext.metrics_path;
}) // (lib.optionalAttrs (ext ? scheme) {
scheme = ext.scheme;
}) // (lib.optionalAttrs (ext ? scrape_interval) {
scrape_interval = ext.scrape_interval;
}))
(externalTargets.scrapeConfigs or [ ]);
in
flakeScrapeConfigs ++ externalScrapeConfigs;
in
{
inherit extractHostMonitoring generateNodeExporterTargets generateScrapeConfigs;
}
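
To make the grouping concrete: a host declaring homelab.monitoring.scrapeTargets = [{ job_name = "wireguard"; port = 9586; }] (as the http-proxy hunk earlier in this diff does) should come out of generateScrapeConfigs roughly as the job below, with metrics_path and scheme omitted because they equal the defaults. This is a sketch of the expected output, not repository code.

{
  job_name = "wireguard";
  static_configs = [
    { targets = [ "http-proxy.home.2rjus.net:9586" ]; }
  ];
}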

View File

@@ -0,0 +1,7 @@
{ ... }:
{
imports = [
./dns.nix
./monitoring.nix
];
}

modules/homelab/dns.nix Normal file
View File

@@ -0,0 +1,20 @@
{ config, lib, ... }:
let
cfg = config.homelab.dns;
in
{
options.homelab.dns = {
enable = lib.mkOption {
type = lib.types.bool;
default = true;
description = "Include this host in DNS zone generation";
};
cnames = lib.mkOption {
type = lib.types.listOf lib.types.str;
default = [ ];
description = "CNAME records pointing to this host";
example = [ "web" "api" ];
};
};
}
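
Per-host usage stays small; the two patterns used in the host hunks above boil down to the following sketch.

# ordinary host: publish extra aliases for this machine
homelab.dns.cnames = [ "grafana" "prometheus" ];

# template or test host: keep it out of the generated zone
homelab.dns.enable = false;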

View File

@@ -0,0 +1,50 @@
{ config, lib, ... }:
let
cfg = config.homelab.monitoring;
in
{
options.homelab.monitoring = {
enable = lib.mkOption {
type = lib.types.bool;
default = true;
description = "Include this host in Prometheus node-exporter scrape targets";
};
scrapeTargets = lib.mkOption {
type = lib.types.listOf (lib.types.submodule {
options = {
job_name = lib.mkOption {
type = lib.types.str;
description = "Prometheus scrape job name";
};
port = lib.mkOption {
type = lib.types.port;
description = "Port to scrape metrics from";
};
metrics_path = lib.mkOption {
type = lib.types.str;
default = "/metrics";
description = "HTTP path to scrape metrics from";
};
scheme = lib.mkOption {
type = lib.types.str;
default = "http";
description = "HTTP scheme (http or https)";
};
scrape_interval = lib.mkOption {
type = lib.types.nullOr lib.types.str;
default = null;
description = "Override the global scrape interval for this target";
};
honor_labels = lib.mkOption {
type = lib.types.bool;
default = false;
description = "Whether to honor labels from the scraped target";
};
};
});
default = [ ];
description = "Additional Prometheus scrape targets exposed by this host";
};
};
}

View File

@@ -0,0 +1,78 @@
---
# Provision OpenBao AppRole credentials to an existing host
# Usage: nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=ha1
# Requires: BAO_ADDR and BAO_TOKEN environment variables set
- name: Fetch AppRole credentials from OpenBao
hosts: localhost
connection: local
gather_facts: false
vars:
vault_addr: "{{ lookup('env', 'BAO_ADDR') | default('https://vault01.home.2rjus.net:8200', true) }}"
domain: "home.2rjus.net"
tasks:
- name: Validate hostname is provided
ansible.builtin.fail:
msg: "hostname variable is required. Use: -e hostname=<name>"
when: hostname is not defined
- name: Get role-id for host
ansible.builtin.command:
cmd: "bao read -field=role_id auth/approle/role/{{ hostname }}/role-id"
environment:
BAO_ADDR: "{{ vault_addr }}"
BAO_SKIP_VERIFY: "1"
register: role_id_result
changed_when: false
- name: Generate secret-id for host
ansible.builtin.command:
cmd: "bao write -field=secret_id -f auth/approle/role/{{ hostname }}/secret-id"
environment:
BAO_ADDR: "{{ vault_addr }}"
BAO_SKIP_VERIFY: "1"
register: secret_id_result
changed_when: true
- name: Add target host to inventory
ansible.builtin.add_host:
name: "{{ hostname }}.{{ domain }}"
groups: vault_target
ansible_user: root
vault_role_id: "{{ role_id_result.stdout }}"
vault_secret_id: "{{ secret_id_result.stdout }}"
- name: Deploy AppRole credentials to host
hosts: vault_target
gather_facts: false
tasks:
- name: Create AppRole directory
ansible.builtin.file:
path: /var/lib/vault/approle
state: directory
mode: "0700"
owner: root
group: root
- name: Write role-id
ansible.builtin.copy:
content: "{{ vault_role_id }}"
dest: /var/lib/vault/approle/role-id
mode: "0600"
owner: root
group: root
- name: Write secret-id
ansible.builtin.copy:
content: "{{ vault_secret_id }}"
dest: /var/lib/vault/approle/secret-id
mode: "0600"
owner: root
group: root
- name: Display success
ansible.builtin.debug:
msg: "AppRole credentials provisioned to {{ inventory_hostname }}"

View File

@@ -1,5 +1,6 @@
"""CLI tool for generating NixOS host configurations."""
import shutil
import sys
from pathlib import Path
from typing import Optional
@@ -10,7 +11,15 @@ from rich.panel import Panel
from rich.table import Table
from generators import generate_host_files, generate_vault_terraform
from manipulators import update_flake_nix, update_terraform_vms, add_wrapped_token_to_vm
from manipulators import (
update_flake_nix,
update_terraform_vms,
add_wrapped_token_to_vm,
remove_from_flake_nix,
remove_from_terraform_vms,
remove_from_vault_terraform,
check_entries_exist,
)
from models import HostConfig
from vault_helper import generate_wrapped_token
from validators import (
@@ -46,9 +55,10 @@ def main(
memory: int = typer.Option(2048, "--memory", help="Memory in MB"),
disk: str = typer.Option("20G", "--disk", help="Disk size (e.g., 20G, 50G, 100G)"),
dry_run: bool = typer.Option(False, "--dry-run", help="Preview changes without creating files"),
force: bool = typer.Option(False, "--force", help="Overwrite existing host configuration"),
force: bool = typer.Option(False, "--force", help="Overwrite existing host configuration / skip confirmation for removal"),
skip_vault: bool = typer.Option(False, "--skip-vault", help="Skip Vault configuration and token generation"),
regenerate_token: bool = typer.Option(False, "--regenerate-token", help="Only regenerate Vault wrapped token (no other changes)"),
remove: bool = typer.Option(False, "--remove", help="Remove host configuration and terraform entries"),
) -> None:
"""
Create a new NixOS host configuration.
@@ -64,6 +74,11 @@ def main(
# Get repository root
repo_root = get_repo_root()
# Handle removal mode
if remove:
handle_remove(hostname, repo_root, dry_run, force, ip, cpu, memory, disk, skip_vault, regenerate_token)
return
# Handle token regeneration mode
if regenerate_token:
# Validate that incompatible options aren't used
@@ -198,6 +213,166 @@ def main(
sys.exit(1)
def handle_remove(
hostname: str,
repo_root: Path,
dry_run: bool,
force: bool,
ip: Optional[str],
cpu: int,
memory: int,
disk: str,
skip_vault: bool,
regenerate_token: bool,
) -> None:
"""Handle the --remove workflow."""
# Validate --remove isn't used with create options
incompatible_options = []
if ip:
incompatible_options.append("--ip")
if cpu != 2:
incompatible_options.append("--cpu")
if memory != 2048:
incompatible_options.append("--memory")
if disk != "20G":
incompatible_options.append("--disk")
if skip_vault:
incompatible_options.append("--skip-vault")
if regenerate_token:
incompatible_options.append("--regenerate-token")
if incompatible_options:
console.print(
f"[bold red]Error:[/bold red] --remove cannot be used with: {', '.join(incompatible_options)}\n"
)
sys.exit(1)
# Validate hostname exists (host directory must exist)
host_dir = repo_root / "hosts" / hostname
if not host_dir.exists():
console.print(f"[bold red]Error:[/bold red] Host {hostname} does not exist")
console.print(f"Host directory not found: {host_dir}")
sys.exit(1)
# Check what entries exist
flake_exists, terraform_exists, vault_exists = check_entries_exist(hostname, repo_root)
# Collect all files in the host directory recursively
files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()])
# Check for secrets directory
secrets_dir = repo_root / "secrets" / hostname
secrets_exist = secrets_dir.exists()
# Display summary
if dry_run:
console.print("\n[yellow][DRY RUN - No changes will be made][/yellow]\n")
console.print(f"\n[bold blue]Removing host: {hostname}[/bold blue]\n")
# Show host directory contents
console.print("[bold]Directory to be deleted (and all contents):[/bold]")
console.print(f" • hosts/{hostname}/")
for f in files_in_host_dir:
rel_path = f.relative_to(host_dir)
console.print(f" - {rel_path}")
# Show entries to be removed
console.print("\n[bold]Entries to be removed:[/bold]")
if flake_exists:
console.print(f" • flake.nix (nixosConfigurations.{hostname})")
else:
console.print(f" • flake.nix [dim](not found)[/dim]")
if terraform_exists:
console.print(f' • terraform/vms.tf (locals.vms["{hostname}"])')
else:
console.print(f" • terraform/vms.tf [dim](not found)[/dim]")
if vault_exists:
console.print(f' • terraform/vault/hosts-generated.tf (generated_host_policies["{hostname}"])')
else:
console.print(f" • terraform/vault/hosts-generated.tf [dim](not found)[/dim]")
# Warn about secrets directory
if secrets_exist:
console.print(f"\n[yellow]⚠️ Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
console.print(f" Manually remove if no longer needed: [white]rm -rf secrets/{hostname}/[/white]")
console.print(f" Also update .sops.yaml to remove the host's age key")
# Exit if dry run
if dry_run:
console.print("\n[yellow][DRY RUN - No changes made][/yellow]\n")
return
# Prompt for confirmation unless --force
if not force:
console.print("")
confirm = typer.confirm("Proceed with removal?", default=False)
if not confirm:
console.print("\n[yellow]Removal cancelled[/yellow]\n")
sys.exit(0)
# Perform removal
console.print("\n[bold blue]Removing host configuration...[/bold blue]")
# Remove from terraform/vault/hosts-generated.tf
if vault_exists:
if remove_from_vault_terraform(hostname, repo_root):
console.print("[green]✓[/green] Removed from terraform/vault/hosts-generated.tf")
else:
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/hosts-generated.tf")
# Remove from terraform/vms.tf
if terraform_exists:
if remove_from_terraform_vms(hostname, repo_root):
console.print("[green]✓[/green] Removed from terraform/vms.tf")
else:
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vms.tf")
# Remove from flake.nix
if flake_exists:
if remove_from_flake_nix(hostname, repo_root):
console.print("[green]✓[/green] Removed from flake.nix")
else:
console.print("[yellow]⚠[/yellow] Could not remove from flake.nix")
# Delete hosts/<hostname>/ directory
shutil.rmtree(host_dir)
console.print(f"[green]✓[/green] Deleted hosts/{hostname}/")
# Success message
console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n")
# Display next steps
display_removal_next_steps(hostname, vault_exists)
def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
"""Display next steps after successful removal."""
vault_file = " terraform/vault/hosts-generated.tf" if had_vault else ""
vault_apply = ""
if had_vault:
vault_apply = f"""
3. Apply Vault changes:
[white]cd terraform/vault && tofu apply[/white]
"""
next_steps = f"""[bold cyan]Next Steps:[/bold cyan]
1. Review changes:
[white]git diff[/white]
2. If VM exists in Proxmox, destroy it first:
[white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white]
{vault_apply}
4. Commit changes:
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_file}
git commit -m "hosts: remove {hostname}"[/white]
"""
console.print(Panel(next_steps, border_style="cyan"))
def display_config_summary(config: HostConfig) -> None:
"""Display configuration summary table."""
table = Table(title="Host Configuration", show_header=False)

View File

@@ -2,10 +2,138 @@
import re
from pathlib import Path
from typing import Tuple
from models import HostConfig
def remove_from_flake_nix(hostname: str, repo_root: Path) -> bool:
"""
Remove host entry from flake.nix nixosConfigurations.
Args:
hostname: Hostname to remove
repo_root: Path to repository root
Returns:
True if found and removed, False if not found
"""
flake_path = repo_root / "flake.nix"
content = flake_path.read_text()
# Check if hostname exists
hostname_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
if not re.search(hostname_pattern, content, re.MULTILINE):
return False
# Match the entire block from "hostname = " to "};"
replace_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)
if count == 0:
return False
flake_path.write_text(new_content)
return True
def remove_from_terraform_vms(hostname: str, repo_root: Path) -> bool:
"""
Remove VM entry from terraform/vms.tf locals.vms map.
Args:
hostname: Hostname to remove
repo_root: Path to repository root
Returns:
True if found and removed, False if not found
"""
terraform_path = repo_root / "terraform" / "vms.tf"
content = terraform_path.read_text()
# Check if hostname exists
hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
if not re.search(hostname_pattern, content, re.MULTILINE):
return False
# Match the entire block from "hostname" = { to }
replace_pattern = rf'^\s+"{re.escape(hostname)}" = \{{.*?^\s+\}}\n'
new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)
if count == 0:
return False
terraform_path.write_text(new_content)
return True
def remove_from_vault_terraform(hostname: str, repo_root: Path) -> bool:
"""
Remove host policy from terraform/vault/hosts-generated.tf.
Args:
hostname: Hostname to remove
repo_root: Path to repository root
Returns:
True if found and removed, False if not found
"""
vault_tf_path = repo_root / "terraform" / "vault" / "hosts-generated.tf"
if not vault_tf_path.exists():
return False
content = vault_tf_path.read_text()
# Check if hostname exists in the policies
if f'"{hostname}"' not in content:
return False
# Match the host entry block within generated_host_policies
# Pattern matches: "hostname" = { ... } with possible trailing newlines
replace_pattern = rf'\s*"{re.escape(hostname)}" = \{{\s*paths = \[.*?\]\s*\}}\n?'
new_content, count = re.subn(replace_pattern, "", content, flags=re.DOTALL)
if count == 0:
return False
vault_tf_path.write_text(new_content)
return True
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool]:
"""
Check which entries exist for a hostname.
Args:
hostname: Hostname to check
repo_root: Path to repository root
Returns:
Tuple of (flake_exists, terraform_vms_exists, vault_exists)
"""
# Check flake.nix
flake_path = repo_root / "flake.nix"
flake_content = flake_path.read_text()
flake_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
flake_exists = bool(re.search(flake_pattern, flake_content, re.MULTILINE))
# Check terraform/vms.tf
terraform_path = repo_root / "terraform" / "vms.tf"
terraform_content = terraform_path.read_text()
terraform_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
terraform_exists = bool(re.search(terraform_pattern, terraform_content, re.MULTILINE))
# Check terraform/vault/hosts-generated.tf
vault_tf_path = repo_root / "terraform" / "vault" / "hosts-generated.tf"
vault_exists = False
if vault_tf_path.exists():
vault_content = vault_tf_path.read_text()
vault_exists = f'"{hostname}"' in vault_content
return (flake_exists, terraform_exists, vault_exists)
def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None:
"""
Add or update host entry in flake.nix nixosConfigurations.

View File

@@ -137,9 +137,9 @@ fetch_from_vault() {
# Write each secret key to a separate file
log "Writing secrets to $OUTPUT_DIR"
echo "$SECRET_DATA" | jq -r 'to_entries[] | "\(.key)\n\(.value)"' | while read -r key; read -r value; do
echo -n "$value" > "$OUTPUT_DIR/$key"
echo -n "$value" > "$CACHE_DIR/$key"
for key in $(echo "$SECRET_DATA" | jq -r 'keys[]'); do
echo "$SECRET_DATA" | jq -j --arg k "$key" '.[$k]' > "$OUTPUT_DIR/$key"
echo "$SECRET_DATA" | jq -j --arg k "$key" '.[$k]' > "$CACHE_DIR/$key"
chmod 600 "$OUTPUT_DIR/$key"
chmod 600 "$CACHE_DIR/$key"
log " - Wrote secret key: $key"

View File

@@ -1,29 +0,0 @@
authelia_ldap_password: ENC[AES256_GCM,data:x2UDMpqQKoRVSlDSmK5XiC9x4/WWzmjk7cwtFA70waAD7xYQfXEOV+AeX1LlFfj0qHYrhyn//TLsa+tJzb7HPEAfl8vYR4MdkVFOm5vjPWWoF5Ul8ZVn8+B1VJLbiXkexv0/hfXL8NMzEcp/pF4H0Yei7xaKezu9OPtGzKufHws=,iv:88RXaOj8Zy9fGeDLAE0ItY7TKCCzxn6F0+kU5+Zy/XU=,tag:yPdCJ9d139iO6J97thVVgA==,type:str]
authelia_jwt_secret: ENC[AES256_GCM,data:9ZHkT2o5KZLmml95g8HZce8fNBmaWtRn+175Gaz0KhsndNl3zdgGq3hydRuoZuEgLVsherJImVmb5DQAZpv04lUEsDKCYeFNwAyYl4Go2jCp1fI53fdcRCKlNVZA37pMi4AYaCoe8vIl/cwPOOBDEwK5raOBnklCzVERoO0B8a0=,iv:9CTWCw0ImZR0OSrl2znbhpRHlzAxA5Cpcy98JeH9Z+Y=,tag:L+0xKqiwXTi7XiDYWA1Bcw==,type:str]
authelia_storage_encryption_key_file: ENC[AES256_GCM,data:RfbcQK8+rrW/Krd2rbDfgo7YI2YvQKqpLuDtk5DZJNNhw4giBh5nFp/8LNeo8r39/oiJLYTe6FjTLBu72TZz2wWrJFsBqjwQ/3TfATQGdLUsaXXRDr88ezHLTiYvEHIHJhUS5qsr7VMwBam5e7YGWBe5sGZCE/nX41ijyPUjtOY=,iv:sayYcAC38cApAtL+cDhgGNjWaHn+furKRowKL6AmfdU=,tag:1IZpnlpvDWGLLpZyU9iJUw==,type:str]
authelia_session_secret: ENC[AES256_GCM,data:4PaLv4RRA7/9Z8QzETXLwo3OctJ0mvzQkYmHsGGF97nq9QeB3eo0xj4FyuCbkJGGZ/huAyRgmFBTyscY3wgxoc4t+8BdlYcSbefEk1/xRFjmG8ooXLKhvGJ5c6t72KJRcqsEGTiC0l9CFJWQ2qYcjM4dPwG8z0tjUZ6j25Zfx4M=,iv:QORJkf0w6iyuRHM/xuql1s7K75Qa49ygq+lwHfrm9rk=,tag:/HZ/qI80fKjmuTRwIwmX8g==,type:str]
lldap_user_pass: ENC[AES256_GCM,data:56gF7uqVQ+/J5/lY/N904Q==,iv:qtY1XhHs4WWA4kPY56NigPvX4OslO0koZepgdv947zg=,tag:UDmJs8FPXskp7rUS2Sxinw==,type:str]
sops:
age:
- recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBlc1dxK3FKU2ZGWTNGUmxZ
aWx1NngySjVHclJTd3hXejJRTmVHRExReHcwCk55c0xMbGcyTktySkJZdHRZbzhK
bEI3RzBHQkROTU1qWXBoU1RqTXppdVkKLS0tIHkwZ0QyNTMydWRqUlBtTEdhZ05r
YVpuT1JadnlyN1hqNnJxYzVPT3pXN1UKDCeIv0xv+5pcoDdtYc+rYjwi8SLrqWth
vdWepxmV2edajZRqcwFEC9weOZ1j2lh7Z3hR6RSN/+X3sFpqkpw+Yg==
-----END AGE ENCRYPTED FILE-----
- recipient: age16prza00sqzuhwwcyakj6z4hvwkruwkqpmmrsn94a5ucgpkelncdq2ldctk
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAvbU0wNmFLelRmNmJTRlho
dTEwVXZqUVI5NHZkb1QyNUZ4R0pLVFZWVDM4CkhVc00zY2FKaVdNRXdGVk1ranpG
MlRWWGJmd2FWeFE1dXU4WHVFL0FHZ3MKLS0tIGt2ZWlaOW5wNkJnQVkrTDZWTnY0
RW5HRjA3cERCUU1CVWZhck12SGhTRUkK6k/zQ87TIETYouRBby7ujtwgpqIPKKv+
2aLJW6lSWMVzL/f3ZrIeg12tJjHs3f44EXR6j3tfLfSKog2iL8Y57w==
-----END AGE ENCRYPTED FILE-----
lastmodified: "2025-12-06T10:03:56Z"
mac: ENC[AES256_GCM,data:SRNqx5n+xg/cNGiyze3CGKufox3IuXmOKLqNRDeJhBNMBHC1iYYCjRdHEVXsl7XSiYe51dSwjV0KrJa/SG1pRVkuyT+xyPrTjT2/DyXN7A/CESSAkBIwI7lkZmIf8DkxB3CELF1PgjIr1o2isxlBnkAnhEBTxQ7t8AzpcH7I5yU=,iv:P3FGQurZrL0ed5UuBPRFk11T0VRFtL6xI4iQ4LmYTec=,tag:8gQL08ojjIMyCl5E0Qs/Ww==,type:str]
unencrypted_suffix: _unencrypted
version: 3.11.0

View File

@@ -7,146 +7,101 @@ sops:
- recipient: age1lznyk4ee7e7x8n92cq2n87kz9920473ks5u9jlhd3dczfzq4wamqept56u
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBnbC90WWJiRXRPZ1VUVWhO
azc5R2lGeDhoRmQydXBnYlltbE81ajFQNW0wClRJNC9iaFV0NDRKRkw2Mm1vOHpN
dVhnUm1nbElQRGQ4dmkxQ2FWdEdpdDAKLS0tIG9GNEpuZUFUQkVXbjZPREo0aEh4
ZVMyY0Y0Zldvd244eSt2RVZDeUZKWmcKGQ7jq50qiXPLKCHq751Y2SA79vEjbSbt
yhRiakVEjwf9A+/iSNvXYAr/tnKaYC+NTA7F6AKmYpBcrzlBGU68KA==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBuWXhzQWFmeCt1R05jREcz
Ui9HZFN5dkxHNVE0RVJGZUJUa3hKK2sxdkhBCktYcGpLeGZIQzZIV3ZZWGs3YzF1
T09sUEhPWkRkOWZFWkltQXBlM1lQV1UKLS0tIERRSlRUYW5QeW9TVjJFSmorOWNI
ZytmaEhzMjVhRXI1S0hielF0NlBrMmcK4I1PtSf7tSvSIJxWBjTnfBCO8GEFHbuZ
BkZskr5fRnWUIs72ZOGoTAVSO5ZNiBglOZ8YChl4Vz1U7bvdOCt0bw==
-----END AGE ENCRYPTED FILE-----
- recipient: age1hz2lz4k050ru3shrk5j3zk3f8azxmrp54pktw5a7nzjml4saudesx6jsl0
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBRTWFBRVRKeXR0UUloQ3FK
Rmhsak45aFZBVUp4Szk5eHJhZmswV3JUcHh3Cis0N09JaCtOZE1pQUM5blg4WDY5
Q0ZGajJSZnJVQzdJK0dxZjJNWHZkbGsKLS0tIEVtRVJROTlWdWl0cFlNZmZkajM5
N3FpdU56WlFWaC9QYU5Kc1o2a1VkT0UK2Utr9mvK8If4JhjzD+l06xZxdE3nbvCO
NixMiYDhuQ/a55Fu0653jqd35i3CI3HukzEI9G5zLEeCcXxTKR5Bjg==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQcXM0RHlGcmZrYW4yNGZs
S1ZqQzVaYmQ4MGhGaTFMUVIwOTk5K0tZZjB3ClN0QkhVeHRrNXZHdmZWMzFBRnJ6
WTFtaWZyRmx2TitkOXkrVkFiYVd3RncKLS0tIExpeGUvY1VpODNDL2NCaUhtZkp0
cGNVZTI3UGxlNWdFWVZMd3FlS3pDR3cKBulaMeonV++pArXOg3ilgKnW/51IyT6Z
vH9HOJUix+ryEwDIcjv4aWx9pYDHthPFZUDC25kLYG91WrJFQOo2oA==
-----END AGE ENCRYPTED FILE-----
- recipient: age1w2q4gm2lrcgdzscq8du3ssyvk6qtzm4fcszc92z9ftclq23yyydqdga5um
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBFQVk0aUw0aStuOWhFMk5a
UVJ5YWg2WjU2eVFUWDlobEIrRDlZV3dxelc0Clo0N3lvOUZNL3YrM2l3Y21VaUQz
MTV5djdPWTBIUXFXVDZpZitRTVhMbVEKLS0tIFluV1NFTzd0cFFaR0RwVkhlSmNm
VGdZNDlsUGI3cTQ1Tk9XRWtDSE1wNWMKQI226dcROyp/GprVZKtM0R57m5WbJyuR
UZO74NqiDr7nxKfw+tHCfDLh94rbC1iP4jRiaQjDgfDDxviafSbGBA==
-----END AGE ENCRYPTED FILE-----
- recipient: age1snmhmpavqy7xddmw4nuny0u4xusqmnqxqarjmghkm5zaluff84eq5xatrd
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA4WVBzazE3VkNDWXUwMk5x
NnZtL3N3THVBQytxZzdZNUhCeThURFBLdjBVClBpZjd5L3lKYjRZNVF2Z3hibW5R
YTdTR0NzaVp4VEZlTjlaTHVFNXNSSUEKLS0tIDBGbmhGUFNJQ21zeW1SbWtyWWh0
QkFXN2g5TlhBbnlmbW1aSUJQL1FOaWMKTv8OoaTxyG8XhKGZNs4aFR/9SXQ+RG6w
+fxiUx7xQnOIYag9YQYfuAgoGzOaj/ha+i18WkQnx9LAgrjCTd+ejA==
-----END AGE ENCRYPTED FILE-----
- recipient: age12a3nyvjs8jrwmpkf3tgawel3nwcklwsr35ktmytnvhpawqwzrsfqpgcy0q
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAzcnVxL09JTEdsZ0FUb2VH
a3dSY09uRFFCYnJXQno3YUFhMlpueHJreXdFCjQ4UWdRak5yK0VIT2lYUjBVK2h5
RFJmMTlyVEpnS3JxdkE4ckp1UHpLM2sKLS0tIHVyZXRTSHQxL1p1dUxMKzkyV0pW
a2o0bG9vZUtmckdYTkhLSVZtZVRtNlUKpALeaeaH4/wFUPPGsNArTAIIJOvBWWDp
MUYPJjqLqBVmWzIgCexM2jsDOhtcCV26MXjzTXmZhthaGJMSp23kMQ==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBabTdsZWxZQjV2TGx2YjNM
ZTgzWktqTjY0S0M3bFpNZXlDRDk5TSt3V2k0CjdWWTN0TlRlK1RpUm9xYW03MFFG
aWN4a3o4VUVnYzBDd2FrelUraWtrMTAKLS0tIE1vTGpKYkhzcWErWDRreml2QmE2
ZkNIWERKb1drdVR6MTBSTnVmdm51VEkKVNDYdyBSrUT7dUn6a4eF7ELQ2B2Pk6V9
Z5fbT75ibuyX1JO315/gl2P/FhxmlRW1K6e+04gQe2R/t/3H11Q7YQ==
-----END AGE ENCRYPTED FILE-----
- recipient: age1d2w5zece9647qwyq4vas9qyqegg96xwmg6c86440a6eg4uj6dd2qrq0w3l
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA5M0liYUY1UHRHUDdvN3ds
TVdiWDlrWFROSVdRTy9nOHFOUTdmTmlHSzE4CjBpU3gzdjdWaHQzNXRMRkxPdVps
TEZXbVlYenUwc3o0TXRnaXg4MmVHQmcKLS0tIDlVeWQ4V0hjbWJqRlNUL2hOWVhp
WEJvZWZzbWZFeWZVeWJ1c3pVOWI3MFUKN2QfuOaod5IBKkBkYzi3jvPty+8PRGMJ
mozL7qydsb0bAZJtAwcL7HWCr1axar/Ertce0yMqhuthJ5bciVD5xQ==
-----END AGE ENCRYPTED FILE-----
- recipient: age1gcyfkxh4fq5zdp0dh484aj82ksz66wrly7qhnpv0r0p576sn9ekse8e9ju
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSA5L3NmcFMyUUpLOW9mOW9v
VXhMTjl5SEFsZ0pzR3lHb1VJL0IzUUxCckdzCnltZnVySkszVUtwbDdQNHAwVWxl
V2xJU1BqSG0yMk5sTkpKRTIvc2JORFUKLS0tIHNydWZjdGg3clNpMDhGSGR6VVVh
VU1Rbk9ybGRJOG1ETEh4a1orNUY2Z00KJmdp+wLHd+86RJJ/G0QbLp4BEDPXfE9o
VZhPPSC6qtUcFV2z6rqSHSpsHPTlgzbCRqX39iePNhfQ2o0lR2P2zQ==
-----END AGE ENCRYPTED FILE-----
- recipient: age1g5luz2rtel3surgzuh62rkvtey7lythrvfenyq954vmeyfpxjqkqdj3wt8
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBbnhXSG5qdVJHSjNmQ3Qx
Yk9zSVBkVTQyb3luYXgwbFJWbG9xK2tWZUdrCkh2MktoWmFOdkRldFNlQW1EMm9t
ZHJRa3QrRzh0UElSNGkvSWcyYTUxZzgKLS0tIGdPT2dwWU9LbERYZGxzUTNEUHE1
TmlIdWJjbmFvdnVQSURqUTBwbW9EL00Kaiy5ZGgHjKgAGvzbdjbwNExLf4MGDtiE
NJEvnmNWkQyEhtx9YzUteY02Tl/D7zBzAWHlV3RjAWTNIwLmm7QgCw==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBVSFhDOFRVbnZWbVlQaG5G
U0NWekU0NzI1SlpRN0NVS1hPN210MXY3Z244CmtFemR5OUpzdlBzMHBUV3g0SFFo
eUtqNThXZDJ2b01yVVVuOFdwQVo2Qm8KLS0tIHpXRWd3OEpPRkpaVDNDTEJLMWEv
ZlZtaFpBdzF0YXFmdjNkNUR3YkxBZU0KAub+HF/OBZQR9bx/SVadZcL6Ms+NQ7yq
21HCcDTWyWHbN4ymUrIYXci1A/0tTOrQL9Mkvaz7IJh4VdHLPZrwwA==
-----END AGE ENCRYPTED FILE-----
- recipient: age1gq8434ku0xekqmvnseeunv83e779cg03c06gwrusnymdsr3rpufqx6vr3m
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBVSDFIa1hNZU1BNWxHckk1
UEdJT282Y054eVNpb3VOZ2t3S2NndTkycXdNCk1sNk5uL2xpbXk1MG95dVM1OWVD
TldUWmsrSmxGeHYweWhGWXpSaE0xRmcKLS0tIFlVbEp2UU1kM0hhbHlSZm96TFl2
TkVaK0xHN1NxNzlpUVYyY2RpdisrQVkKG+DlyZVruH64nB9UtCPMbXhmRHj+zpr6
CX4JOTXbUsueZIA4J/N93+d2J3V6yauoRYwCSl/JXX/gaSeSxF4z3A==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBWkhBL1NTdjFDeEhQcEgv
Z3c3Z213L2ZhWGo0Qm5Zd1A1RTBDY3plUkh3CkNWV2ZtNWkrUjB0eWFzUlVtbHlk
WTdTQjN4eDIzY0c0dyt6ajVXZ0krd1UKLS0tIHB4aEJqTTRMenV3UkFkTGEySjQ2
YVM1a3ZPdUU4T244UU0rc3hVQ3NYczQK10wug4kTjsvv/iOPWi5WrVZMOYUq4/Mf
oXS4sikXeUsqH1T2LUBjVnUieSneQVn7puYZlN+cpDQ0XdK/RZ+91A==
-----END AGE ENCRYPTED FILE-----
- recipient: age1288993th0ge00reg4zqueyvmkrsvk829cs068eekjqfdprsrkeqql7mljk
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB3YWxPRTNaVTNLb2tYSzZ5
ZmVMYXk2MlVXYzNtZGFJNlJLR2FIVWhKb1RFCmx5bXozeExlbEZBQzhpSHA0T1JE
dFpHRm8rcFl1QjZ2anRGYjVxeGJqc0EKLS0tIGVibzRnRTA3Vk5yR3c4QVFsdy95
bG1tejcremFiUjZaL3hmc1gwYzJIOGMKFmXmY60vABYlpfop2F020SaOEwV4TNya
F0tgrIqbufU1Yw4RhxPdBb9Wv1cQu25lcqQLh1i4VH9BSaWKk6TDEA==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBYcEtHbjNWRkdodUxYdHRn
MDBMU08zWDlKa0Z4cHJvc28rZk5pUjhnMjE0CmdzRmVGWDlYQ052Wm1zWnlYSFV6
dURQK3JSbThxQlg3M2ZaL1hGRzVuL0UKLS0tIEI3UGZvbEpvRS9aR2J2Tnc1YmxZ
aUY5Q2MrdHNQWDJNaGt5MWx6MVRrRVEKRPxyAekGHFMKs0Z6spVDayBA4EtPk18e
jiFc97BGVtC5IoSu4icq3ZpKOdxymnkqKEt0YP/p/JTC+8MKvTJFQw==
-----END AGE ENCRYPTED FILE-----
- recipient: age1vpns76ykll8jgdlu3h05cur4ew2t3k7u03kxdg8y6ypfhsfhq9fqyurjey
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSAzRXM1VUJPNm90UUx4UEdZ
cDY5czVQaGl0MEdIMStjTnphTmR5ZkFWTDBjClhTd0xmaHNWUXo3NXR6eEUzTkg2
L3BqT1N6bTNsYitmTGVpREtiWEpzdlEKLS0tIFUybTczSlRNbDkxRVZjSnFvdmtq
MVdRU3RPSHNqUzJzQWl1VVkyczFaencK72ZmWJIcfBTXlezmefvWeCGOC1BhpkXO
bm+X+ihzNfktuOCl6ZIMo2n4aJ3hYakrMp4npO10a6s4o/ldqeiATg==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBQL3ZMUkI1dUV1T2tTSHhn
SjhyQ3dKTytoaDBNcit1VHpwVGUzWVNpdjBnCklYZWtBYzBpcGxZSDBvM2tIZm9H
bTFjb1ZCaDkrOU1JODVBVTBTbmxFbmcKLS0tIGtGcS9kejZPZlhHRXI5QnI5Wm9Q
VjMxTDdWZEltWThKVDl0S24yWHJxZHcKgzH79zT2I7ZgyTbbbvIhLN/rEcfiomJH
oSZDFvPiXlhPgy8bRyyq3l47CVpWbUI2Y7DFXRuODpLUirt3K3TmCA==
-----END AGE ENCRYPTED FILE-----
- recipient: age1hchvlf3apn8g8jq2743pw53sd6v6ay6xu6lqk0qufrjeccan9vzsc7hdfq
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBOL3F3OWRYVVdxWncwWmlk
SnloWFdscE02L3ZRa0JGcFlwSU9tU3JRakhnCjZyTnR3T051Tmt2NGM2dkFaNGJz
WVRnNDdNN0ozYXJnK0t4ZW5JRVQ2YzQKLS0tIFk0cFBxcVFETERNTGowMThJcDNR
UW0wUUlFeHovSS9qYU5BRkJ6dnNjcWcKh2WcrmxsqMZeQ0/2HsaHeSqGsU3ILynU
SHBziWHGlFoNirCVjljh/Mw4DM8v66i0ztIQtWV5cFaFhu4kVda5jA==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBPcm9zUm1XUkpLWm1Jb3Uw
RncveGozOW5SRThEM1Y4SFF5RDdxUEhZTUE4CjVESHE5R3JZK0krOXZDL0RHR0oy
Z3JKaEpydjRjeFFHck1ic2JTRU5yZTQKLS0tIGY2ck56eG95YnpDYlNqUDh5RVp1
U3dRYkNleUtsQU1LMWpDbitJbnRIem8K+27HRtZihG8+k7ZC33XVfuXDFjC1e8lA
kffmxp9kOEShZF3IKmAjVHFBiPXRyGk3fGPyQLmSMK2UOOfCy/a/qA==
-----END AGE ENCRYPTED FILE-----
- recipient: age1w029fksjv0edrff9p7s03tgk3axecdkppqymfpwfn2nu2gsqqefqc37sxq
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB6ZkovUkMzdmhOUGpZUC91
d1JFZGk1T2hOS2dlVFNHRGJKVTUwdUhpQmg0CnEybzlRdjBLcjVEckNtR0xzMDVk
dURWbFdnTXk1alV5cjRSMkRrZ21vTjAKLS0tIEtDZlFCTGdVMU1PUWdBYTVOcTU4
ZkZHYmJiTUdJUGZhTFdLM1EzdU9wNmsK3AqFfycJfrBpvnjccN1srNiVBCv107rt
b/O5zcqKGR3Nzey7zAhlxasPCRKARyBTo292ScZ03QMU8p8HIukdzg==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBTZHlldDdSOEhjTklCSXQr
U2pXajFwZnNqQzZOTzY5b3lkMzlyREhXRWo4CmxId2F6NkNqeHNCSWNrcUJIY0Nw
cGF6NXJaQnovK1FYSXQ2TkJSTFloTUEKLS0tIHRhWk5aZ0lDVkZaZEJobm9FTDNw
a29sZE1GL2ZQSk0vUEc1ZGhkUlpNRkEK9tfe7cNOznSKgxshd5Z6TQiNKp+XW6XH
VvPgMqMitgiDYnUPj10bYo3kqhd0xZH2IhLXMnZnqqQ0I23zfPiNaw==
-----END AGE ENCRYPTED FILE-----
- recipient: age1ha34qeksr4jeaecevqvv2afqem67eja2mvawlmrqsudch0e7fe7qtpsekv
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBlOVNVNmFzbTE2NmdiM1dP
TlhuTGYyQWlWeFlkaVU3Tml2aDNJbmxXVnlZCmJSb001OVJTaGpRcllzN2JSWDFF
b1MyYjdKZys4ZHRoUmFhdG1oYTA2RzQKLS0tIEhGeU9YcW9Wc0ZZK3I5UjB0RHFm
bW1ucjZtYXFkT1A4bGszamFxaG5IaHMKqHuaWFi/ImnbDOZ9VisIN7jqplAYV8fo
y3PeVX34LcYE0d8cxbvH8CTs/Ubirt6P1obrmAL9W9Y0ozpqdqQSjA==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB5bk9NVjJNWmMxUGd3cXRx
amZ5SWJ3dHpHcnM4UHJxdmh6NnhFVmJQdldzCm95dHN3R21qSkE4Vm9VTnVPREp3
dUQyS1B4MWhhdmd3dk5LQ0htZEtpTWMKLS0tIGFaa3MxVExFYk1MY2loOFBvWm1o
L0NoRStkeW9VZVdpWlhteC8yTnRmMUkKMYjUdE1rGgVR29FnhJ5OEVjTB1Rh5Mtu
M/DvlhW3a7tZU8nDF3IgG2GE5xOXZMDO9QWGdB8zO2RJZAr3Q+YIlA==
-----END AGE ENCRYPTED FILE-----
- recipient: age1cxt8kwqzx35yuldazcc49q88qvgy9ajkz30xu0h37uw3ts97jagqgmn2ga
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBXbXo4UWhoMUQxc1lMcnNB
VWc1MUJuS3NnVnh4U254TE0wSDJTMzFSM3lrCnhHbmk1N0VqTlViT2dtZndGT1pn
NmpPc01iMjk3TXZLU1htZjBvd2NBK2sKLS0tIEN3dGlRZHF5Ykgybjl6MzRBVUJ0
Rm92SGdwanFHZlp6U00wMDUzL3MrMzgKtCJqy+BfDMFQMHaIVPlFyzALBsb4Ekls
+r7ofZ1ZjSomBljYxVPhKE9XaZJe6bqICEhJBCpODyxavfh8HmxHDQ==
-----END AGE ENCRYPTED FILE-----
- recipient: age16prza00sqzuhwwcyakj6z4hvwkruwkqpmmrsn94a5ucgpkelncdq2ldctk
enc: |
-----BEGIN AGE ENCRYPTED FILE-----
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBiQTRXTHljd2UrbFJOWUI4
WGRYcEVrZDJGM3hpVVNmVXlSREYzc1FHRlhFCjZHa2VTTzFHR1RXRmllT1huVDNV
UkRKaEQrWjF5eHpiaUg1NExnME5veFkKLS0tIFpZY1RrOVNTTjU0N2Y1dFN6QWpX
MTM3NDJrV1JZNE5pWGNLMUg1OFFwYUUKMx0hpB3iunnCbJ/+zWetdp1NI/LsrUTe
J84+aDoe7/WJYT0FLMlC0RK80txm6ztVygoyRdN0cRKx1z3KqPmavw==
YWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBU0xYMnhqOE0wdXdleStF
THcrY2NBQzNoRHdYTXY3ZmM5YXRZZkQ4aUZnCm9ad0IxSWxYT1JBd2RseUdVT1pi
UXBuNzFxVlN0OWNTQU5BV2NiVEV0RUUKLS0tIGJHY0dzSDczUzcrV0RpTjE0czEy
cWZMNUNlTzBRcEV5MjlRV1BsWGhoaUUKGhYaH8I0oPCfrbs7HbQKVOF/99rg3HXv
RRTXUI71/ejKIuxehOvifClQc3nUW73bWkASFQ0guUvO4R+c0xOgUg==
-----END AGE ENCRYPTED FILE-----
lastmodified: "2025-02-11T21:18:22Z"
mac: ENC[AES256_GCM,data:5//boMp1awc/2XAkSASSCuobpkxa0E6IKf3GR8xHpMoCD30FJsCwV7PgX3fR8OuLEhOJ7UguqMNQdNqG37RMacreuDmI1J8oCFKp+3M2j4kCbXaEo8bw7WAtyjUez+SAXKzZWYmBibH0KOy6jdt+v0fdgy5hMBT4IFDofYRsyD0=,iv:6pD+SLwncpmal/FR4U8It2njvaQfUzzpALBCxa0NyME=,tag:4QN8ZFjdqck5ZgulF+FtbA==,type:str]

View File

@@ -1,8 +1,10 @@
{ pkgs, config, ... }:
{
sops.secrets."actions-token-1" = {
sopsFile = ../../secrets/nix-cache01/actions_token_1;
format = "binary";
vault.secrets.actions-token = {
secretPath = "hosts/nix-cache01/actions-token";
extractKey = "token";
outputDir = "/run/secrets/actions-token-1";
services = [ "gitea-runner-actions1" ];
};
virtualisation.podman = {
@@ -13,7 +15,7 @@
services.gitea-actions-runner.instances = {
actions1 = {
enable = true;
tokenFile = config.sops.secrets.actions-token-1.path;
tokenFile = "/run/secrets/actions-token-1";
name = "actions1.home.2rjus.net";
settings = {
log = {

View File

@@ -1,87 +0,0 @@
{ config, ... }:
{
sops.secrets.authelia_ldap_password = {
format = "yaml";
sopsFile = ../../secrets/auth01/secrets.yaml;
key = "authelia_ldap_password";
restartUnits = [ "authelia-auth.service" ];
owner = "authelia-auth";
group = "authelia-auth";
};
sops.secrets.authelia_jwt_secret = {
format = "yaml";
sopsFile = ../../secrets/auth01/secrets.yaml;
key = "authelia_jwt_secret";
restartUnits = [ "authelia-auth.service" ];
owner = "authelia-auth";
group = "authelia-auth";
};
sops.secrets.authelia_storage_encryption_key_file = {
format = "yaml";
key = "authelia_storage_encryption_key_file";
sopsFile = ../../secrets/auth01/secrets.yaml;
restartUnits = [ "authelia-auth.service" ];
owner = "authelia-auth";
group = "authelia-auth";
};
sops.secrets.authelia_session_secret = {
format = "yaml";
key = "authelia_session_secret";
sopsFile = ../../secrets/auth01/secrets.yaml;
restartUnits = [ "authelia-auth.service" ];
owner = "authelia-auth";
group = "authelia-auth";
};
services.authelia.instances."auth" = {
enable = true;
environmentVariables = {
AUTHELIA_AUTHENTICATION_BACKEND_LDAP_PASSWORD_FILE =
config.sops.secrets.authelia_ldap_password.path;
AUTHELIA_SESSION_SECRET_FILE = config.sops.secrets.authelia_session_secret.path;
};
secrets = {
jwtSecretFile = config.sops.secrets.authelia_jwt_secret.path;
storageEncryptionKeyFile = config.sops.secrets.authelia_storage_encryption_key_file.path;
};
settings = {
access_control = {
default_policy = "two_factor";
};
session = {
# secret = "{{- fileContent \"${config.sops.secrets.authelia_session_secret.path}\" }}";
cookies = [
{
domain = "home.2rjus.net";
authelia_url = "https://auth.home.2rjus.net";
default_redirection_url = "https://dashboard.home.2rjus.net";
name = "authelia_session";
same_site = "lax";
inactivity = "1h";
expiration = "24h";
remember_me = "30d";
}
];
};
notifier = {
filesystem.filename = "/var/lib/authelia-auth/notification.txt";
};
storage = {
local.path = "/var/lib/authelia-auth/db.sqlite3";
};
authentication_backend = {
password_reset = {
disable = false;
};
ldap = {
address = "ldap://127.0.0.1:3890";
implementation = "lldap";
timeout = "5s";
base_dn = "dc=home,dc=2rjus,dc=net";
user = "uid=authelia_ldap_user,ou=people,dc=home,dc=2rjus,dc=net";
# password = "{{- fileContent \"${config.sops.secrets.authelia_ldap_password.path}\" -}}";
};
};
};
};
}

View File

@@ -1,5 +1,9 @@
{ pkgs, unstable, ... }:
{
homelab.monitoring.scrapeTargets = [{
job_name = "step-ca";
port = 9000;
}];
sops.secrets."ca_root_pw" = {
sopsFile = ../../secrets/ca/secrets.yaml;
owner = "step-ca";

View File

@@ -1,5 +1,11 @@
{ pkgs, config, ... }:
{
homelab.monitoring.scrapeTargets = [{
job_name = "home-assistant";
port = 8123;
metrics_path = "/api/prometheus";
scrape_interval = "60s";
}];
# Enable the Home Assistant service
services.home-assistant = {
enable = true;
@@ -63,6 +69,44 @@
frontend = true;
permit_join = false;
serial.port = "/dev/ttyUSB0";
# Inline device configuration (replaces devices.yaml)
# This allows declarative management and homeassistant overrides
devices = {
# Temperature sensors with battery fix
# WSDCGQ12LM sensors report battery: 0 due to firmware quirk
# Override battery calculation using voltage (mV): (voltage - 2100) / 9
"0x54ef441000a547bd" = {
friendly_name = "0x54ef441000a547bd";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
};
"0x54ef441000a54d3c" = {
friendly_name = "0x54ef441000a54d3c";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
};
"0x54ef441000a564b6" = {
friendly_name = "temp_server";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
};
# Other sensors
"0x00124b0025495463".friendly_name = "0x00124b0025495463"; # SONOFF temp sensor (battery works)
"0x54ef4410009ac117".friendly_name = "0x54ef4410009ac117"; # Water leak sensor
# Buttons
"0x54ef441000a1f907".friendly_name = "btn_livingroom";
"0x54ef441000a1ee71".friendly_name = "btn_bedroom";
# Philips Hue lights
"0x001788010d1b599a" = {
friendly_name = "0x001788010d1b599a";
transition = 5;
};
"0x001788010d253b99".friendly_name = "0x001788010d253b99";
"0x001788010e371aa4".friendly_name = "0x001788010e371aa4";
"0x001788010dc5f003".friendly_name = "0x001788010dc5f003";
"0x001788010dc35d06".friendly_name = "0x001788010dc35d06";
};
};
};
}
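
A quick sanity check of the battery override formula from the comments above, written as a standalone Nix expression; the voltages are hypothetical examples, and the clamping mirrors the min/max filters in the template.

let
  # (voltage_mV - 2100) / 9, clamped to 0..100
  batteryPct = mv:
    let raw = (mv - 2100) / 9;
    in if raw > 100 then 100 else if raw < 0 then 0 else raw;
in
map batteryPct [ 3000 2550 2100 ]   # => [ 100 50 0 ]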

View File

@@ -3,4 +3,9 @@
imports = [
./proxy.nix
];
homelab.monitoring.scrapeTargets = [{
job_name = "caddy";
port = 80;
}];
}

View File

@@ -86,22 +86,6 @@
}
reverse_proxy http://jelly01.home.2rjus.net:8096
}
lldap.home.2rjus.net {
log {
output file /var/log/caddy/auth.log {
mode 644
}
}
reverse_proxy http://auth01.home.2rjus.net:17170
}
auth.home.2rjus.net {
log {
output file /var/log/caddy/auth.log {
mode 644
}
}
reverse_proxy http://auth01.home.2rjus.net:9091
}
pyroscope.home.2rjus.net {
log {
output file /var/log/caddy/pyroscope.log {

View File

@@ -1,5 +1,9 @@
{ pkgs, ... }:
{
homelab.monitoring.scrapeTargets = [{
job_name = "jellyfin";
port = 8096;
}];
services.jellyfin = {
enable = true;
};

View File

@@ -1,38 +0,0 @@
{ config, ... }:
{
sops.secrets.lldap_user_pass = {
format = "yaml";
key = "lldap_user_pass";
sopsFile = ../../secrets/auth01/secrets.yaml;
restartUnits = [ "lldap.service" ];
group = "acme";
mode = "0440";
};
services.lldap = {
enable = true;
settings = {
ldap_base_dn = "dc=home,dc=2rjus,dc=net";
ldap_user_email = "admin@home.2rjus.net";
ldap_user_dn = "admin";
ldap_user_pass_file = config.sops.secrets.lldap_user_pass.path;
ldaps_options = {
enabled = true;
port = 6360;
cert_file = "/var/lib/acme/auth01.home.2rjus.net/cert.pem";
key_file = "/var/lib/acme/auth01.home.2rjus.net/key.pem";
};
};
};
systemd.services.lldap = {
serviceConfig = {
SupplementaryGroups = [ "acme" ];
};
};
security.acme.certs."auth01.home.2rjus.net" = {
listenHTTP = ":80";
reloadServices = [ "lldap" ];
extraDomainNames = [ "ldap.home.2rjus.net" ];
enableDebugLogs = true;
};
}

View File

@@ -1,12 +1,18 @@
{ pkgs, config, ... }:
{
sops.secrets."nats_nkey" = { };
vault.secrets.nats-nkey = {
secretPath = "shared/nats/nkey";
extractKey = "nkey";
outputDir = "/run/secrets/nats_nkey";
services = [ "alerttonotify" ];
};
systemd.services."alerttonotify" = {
enable = true;
wants = [ "network-online.target" ];
after = [
"network-online.target"
"sops-nix.service"
"vault-secret-nats-nkey.service"
];
wantedBy = [ "multi-user.target" ];
restartIfChanged = true;

View File

@@ -0,0 +1,12 @@
# Monitoring targets for hosts not managed by this flake
# These are manually maintained and combined with auto-generated targets
{
nodeExporter = [
"gunter.home.2rjus.net:9100"
];
scrapeConfigs = [
{ job_name = "smartctl"; targets = [ "gunter.home.2rjus.net:9633" ]; }
{ job_name = "ghettoptt"; targets = [ "gunter.home.2rjus.net:8989" ]; }
{ job_name = "restic_rest"; targets = [ "10.69.12.52:8000" ]; }
];
}
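
The comment above says these records are combined with auto-generated targets; the combining helpers (lib/monitoring.nix) are not part of this diff, so the following is only a hypothetical sketch of how generateScrapeConfigs might merge per-host homelab.monitoring.scrapeTargets declarations with the manual list above. The option paths and the use of networking.fqdn are assumptions, not taken from the real library.

# Hypothetical sketch, not the real lib/monitoring.nix: collect each flake host's
# declared homelab.monitoring.scrapeTargets, turn them into Prometheus scrape_configs,
# and append the manually maintained entries from external-targets.nix.
{ lib }:
{
  generateScrapeConfigs = self: external:
    let
      hosts = lib.attrValues self.nixosConfigurations;
      fromHost = host:
        map
          (t: {
            inherit (t) job_name;
            static_configs = [{
              targets = [ "${host.config.networking.fqdn}:${toString t.port}" ];
            }];
          })
          (host.config.homelab.monitoring.scrapeTargets or [ ]);
      manual = map
        (c: { inherit (c) job_name; static_configs = [{ inherit (c) targets; }]; })
        external.scrapeConfigs;
    in
    lib.concatMap fromHost hosts ++ manual;
}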

View File

@@ -1,7 +1,82 @@
{ ... }:
{ self, lib, pkgs, ... }:
let
monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ./external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# Script to fetch AppRole token for Prometheus to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token";
runtimeInputs = [ pkgs.curl pkgs.jq ];
text = ''
VAULT_ADDR="https://vault01.home.2rjus.net:8200"
APPROLE_DIR="/var/lib/vault/approle"
OUTPUT_FILE="/run/secrets/prometheus/openbao-token"
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
echo "AppRole credentials not found at $APPROLE_DIR" >&2
exit 1
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
AUTH_RESPONSE=$(curl -sf -k -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login")
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "Failed to extract Vault token from response" >&2
exit 1
fi
# Write token to file
mkdir -p "$(dirname "$OUTPUT_FILE")"
echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
chown prometheus:prometheus "$OUTPUT_FILE"
chmod 0400 "$OUTPUT_FILE"
echo "Successfully fetched OpenBao token"
'';
};
in
{
# Systemd service to fetch AppRole token for Prometheus OpenBao scraping
# The token is used to authenticate when scraping /v1/sys/metrics
systemd.services.prometheus-openbao-token = {
description = "Fetch OpenBao token for Prometheus metrics scraping";
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
before = [ "prometheus.service" ];
requiredBy = [ "prometheus.service" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe fetchOpenbaoToken;
};
};
# Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
systemd.timers.prometheus-openbao-token = {
description = "Refresh OpenBao token for Prometheus";
wantedBy = [ "timers.target" ];
timerConfig = {
OnBootSec = "5min";
OnUnitActiveSec = "30min";
RandomizedDelaySec = "5min";
};
};
services.prometheus = {
enable = true;
# syntax-only check because we use external credential files (e.g., openbao-token)
checkConfig = "syntax-only";
alertmanager = {
enable = true;
configuration = {
@@ -45,26 +120,25 @@
];
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
{
job_name = "node-exporter";
static_configs = [
{
targets = [
"ca.home.2rjus.net:9100"
"gunter.home.2rjus.net:9100"
"ha1.home.2rjus.net:9100"
"http-proxy.home.2rjus.net:9100"
"jelly01.home.2rjus.net:9100"
"monitoring01.home.2rjus.net:9100"
"nix-cache01.home.2rjus.net:9100"
"ns1.home.2rjus.net:9100"
"ns2.home.2rjus.net:9100"
"pgdb1.home.2rjus.net:9100"
"nats1.home.2rjus.net:9100"
];
targets = nodeExporterTargets;
}
];
}
# Systemd exporter on all hosts (same targets, different port)
{
job_name = "systemd-exporter";
static_configs = [
{
targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
}
];
}
# Local monitoring services (not auto-generated)
{
job_name = "prometheus";
static_configs = [
@@ -85,7 +159,7 @@
job_name = "grafana";
static_configs = [
{
targets = [ "localhost:3100" ];
targets = [ "localhost:3000" ];
}
];
}
@@ -98,13 +172,35 @@
];
}
{
job_name = "restic_rest";
job_name = "pushgateway";
honor_labels = true;
static_configs = [
{
targets = [ "10.69.12.52:8000" ];
targets = [ "localhost:9091" ];
}
];
}
{
job_name = "labmon";
static_configs = [
{
targets = [ "monitoring01.home.2rjus.net:9969" ];
}
];
}
# TODO: nix-cache_caddy can't be auto-generated because the cert is issued
# for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
# Consider adding a target override to homelab.monitoring.scrapeTargets.
{
job_name = "nix-cache_caddy";
scheme = "https";
static_configs = [
{
targets = [ "nix-cache.home.2rjus.net" ];
}
];
}
# pve-exporter with complex relabel config
{
job_name = "pve-exporter";
static_configs = [
@@ -133,91 +229,24 @@
}
];
}
# OpenBao metrics with bearer token auth
{
job_name = "caddy";
static_configs = [
{
targets = [ "http-proxy.home.2rjus.net" ];
}
];
}
{
job_name = "jellyfin";
static_configs = [
{
targets = [ "jelly01.home.2rjus.net:8096" ];
}
];
}
{
job_name = "smartctl";
static_configs = [
{
targets = [ "gunter.home.2rjus.net:9633" ];
}
];
}
{
job_name = "wireguard";
static_configs = [
{
targets = [ "http-proxy.home.2rjus.net:9586" ];
}
];
}
{
job_name = "home-assistant";
scrape_interval = "60s";
metrics_path = "/api/prometheus";
static_configs = [
{
targets = [ "ha1.home.2rjus.net:8123" ];
}
];
}
{
job_name = "ghettoptt";
static_configs = [
{
targets = [ "gunter.home.2rjus.net:8989" ];
}
];
}
{
job_name = "step-ca";
static_configs = [
{
targets = [ "ca.home.2rjus.net:9000" ];
}
];
}
{
job_name = "labmon";
static_configs = [
{
targets = [ "monitoring01.home.2rjus.net:9969" ];
}
];
}
{
job_name = "pushgateway";
honor_labels = true;
static_configs = [
{
targets = [ "localhost:9091" ];
}
];
}
{
job_name = "nix-cache_caddy";
job_name = "openbao";
scheme = "https";
static_configs = [
{
targets = [ "nix-cache.home.2rjus.net" ];
metrics_path = "/v1/sys/metrics";
params = {
format = [ "prometheus" ];
};
static_configs = [{
targets = [ "vault01.home.2rjus.net:8200" ];
}];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/prometheus/openbao-token";
};
}
];
}
];
] ++ autoScrapeConfigs;
pushgateway = {
enable = true;
web = {

View File

@@ -1,14 +1,16 @@
{ config, ... }:
{
sops.secrets.pve_exporter = {
format = "yaml";
sopsFile = ../../secrets/monitoring01/pve-exporter.yaml;
key = "";
vault.secrets.pve-exporter = {
secretPath = "hosts/monitoring01/pve-exporter";
extractKey = "config";
outputDir = "/run/secrets/pve_exporter";
mode = "0444";
services = [ "prometheus-pve-exporter" ];
};
services.prometheus.exporters.pve = {
enable = true;
configFile = config.sops.secrets.pve_exporter.path;
configFile = "/run/secrets/pve_exporter";
collectors = {
cluster = false;
replication = false;

View File

@@ -18,13 +18,21 @@ groups:
summary: "Disk space low on {{ $labels.instance }}"
description: "Disk space is low on {{ $labels.instance }}. Please check."
- alert: high_cpu_load
expr: max(node_load5{}) by (instance) > (count by (instance)(node_cpu_seconds_total{mode="idle"}) * 0.7)
expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
for: 15m
labels:
severity: warning
annotations:
summary: "High CPU load on {{ $labels.instance }}"
description: "CPU load is high on {{ $labels.instance }}. Please check."
- alert: high_cpu_load
expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
for: 2h
labels:
severity: warning
annotations:
summary: "High CPU load on {{ $labels.instance }}"
description: "CPU load is high on {{ $labels.instance }}. Please check."
- alert: low_memory
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
for: 2m
@@ -57,6 +65,38 @@ groups:
annotations:
summary: "Promtail service not running on {{ $labels.instance }}"
description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
- alert: filesystem_filling_up
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 1h
labels:
severity: warning
annotations:
summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
- alert: systemd_not_running
expr: node_systemd_system_running == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Systemd not in running state on {{ $labels.instance }}"
description: "Systemd is not in running state on {{ $labels.instance }}. The system may be in a degraded state."
- alert: high_file_descriptors
expr: node_filefd_allocated / node_filefd_maximum > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "High file descriptor usage on {{ $labels.instance }}"
description: "More than 80% of file descriptors are in use on {{ $labels.instance }}."
- alert: host_reboot
expr: changes(node_boot_time_seconds[10m]) > 0
for: 0m
labels:
severity: info
annotations:
summary: "Host {{ $labels.instance }} has rebooted"
description: "Host {{ $labels.instance }} has rebooted."
- name: nameserver_rules
rules:
- alert: unbound_down
@@ -75,7 +115,15 @@ groups:
annotations:
summary: "NSD not running on {{ $labels.instance }}"
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
- name: http-proxy_rules
- alert: unbound_low_cache_hit_ratio
expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
for: 15m
labels:
severity: warning
annotations:
summary: "Low DNS cache hit ratio on {{ $labels.instance }}"
description: "Unbound cache hit ratio is below 50% on {{ $labels.instance }}."
- name: http_proxy_rules
rules:
- alert: caddy_down
expr: node_systemd_unit_state {instance="http-proxy.home.2rjus.net:9100", name = "caddy.service", state = "active"} == 0
@@ -85,6 +133,22 @@ groups:
annotations:
summary: "Caddy not running on {{ $labels.instance }}"
description: "Caddy has been down on {{ $labels.instance }} more than 5 minutes."
- alert: caddy_upstream_unhealthy
expr: caddy_reverse_proxy_upstreams_healthy == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Caddy upstream unhealthy for {{ $labels.upstream }}"
description: "Caddy reverse proxy upstream {{ $labels.upstream }} is unhealthy on {{ $labels.instance }}."
- alert: caddy_high_error_rate
expr: rate(caddy_http_request_errors_total[5m]) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High HTTP error rate on {{ $labels.instance }}"
description: "Caddy is experiencing a high rate of HTTP errors on {{ $labels.instance }}."
- name: nats_rules
rules:
- alert: nats_down
@@ -95,9 +159,17 @@ groups:
annotations:
summary: "NATS not running on {{ $labels.instance }}"
description: "NATS has been down on {{ $labels.instance }} more than 5 minutes."
- alert: nats_slow_consumers
expr: nats_core_slow_consumer_count > 0
for: 5m
labels:
severity: warning
annotations:
summary: "NATS has slow consumers on {{ $labels.instance }}"
description: "NATS has {{ $value }} slow consumers on {{ $labels.instance }}."
- name: nix_cache_rules
rules:
- alert: build-flakes_service_not_active_recently
- alert: build_flakes_service_not_active_recently
expr: count_over_time(node_systemd_unit_state{instance="nix-cache01.home.2rjus.net:9100", name="build-flakes.service", state="active"}[1h]) < 1
for: 0m
labels:
@@ -138,7 +210,7 @@ groups:
annotations:
summary: "Home assistant not running on {{ $labels.instance }}"
description: "Home assistant has been down on {{ $labels.instance }} more than 5 minutes."
- alert: zigbee2qmtt_down
- alert: zigbee2mqtt_down
expr: node_systemd_unit_state {instance = "ha1.home.2rjus.net:9100", name = "zigbee2mqtt.service", state = "active"} == 0
for: 5m
labels:
@@ -154,9 +226,17 @@ groups:
annotations:
summary: "Mosquitto not running on {{ $labels.instance }}"
description: "Mosquitto has been down on {{ $labels.instance }} more than 5 minutes."
- alert: zigbee_sensor_stale
expr: (time() - hass_last_updated_time_seconds{entity=~"sensor\\.(0x[0-9a-f]+|temp_server)_temperature"}) > 7200
for: 5m
labels:
severity: warning
annotations:
summary: "Zigbee sensor {{ $labels.friendly_name }} is stale"
description: "Zigbee temperature sensor {{ $labels.entity }} has not reported data for over 2 hours. The sensor may have a dead battery or connectivity issues."
- name: smartctl_rules
rules:
- alert: SmartCriticalWarning
- alert: smart_critical_warning
expr: smartctl_device_critical_warning > 0
for: 0m
labels:
@@ -164,7 +244,7 @@ groups:
annotations:
summary: SMART critical warning (instance {{ $labels.instance }})
description: "Disk controller has critical warning on {{ $labels.instance }} drive {{ $labels.device }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: SmartMediaErrors
- alert: smart_media_errors
expr: smartctl_device_media_errors > 0
for: 0m
labels:
@@ -172,7 +252,7 @@ groups:
annotations:
summary: SMART media errors (instance {{ $labels.instance }})
description: "Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: SmartWearoutIndicator
- alert: smart_wearout_indicator
expr: smartctl_device_available_spare < smartctl_device_available_spare_threshold
for: 0m
labels:
@@ -180,20 +260,29 @@ groups:
annotations:
summary: SMART Wearout Indicator (instance {{ $labels.instance }})
description: "Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: smartctl_high_temperature
expr: smartctl_device_temperature > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Disk temperature above 60C on {{ $labels.instance }}"
description: "Disk {{ $labels.device }} on {{ $labels.instance }} has temperature {{ $value }}C."
- name: wireguard_rules
rules:
- alert: WireguardHandshake
expr: (time() - wireguard_latest_handshake_seconds{instance="http-proxy.home.2rjus.net:9586",interface="wg0",public_key="32Rb13wExcy8uI92JTnFdiOfkv0mlQ6f181WA741DHs="}) > 300
- alert: wireguard_handshake_timeout
expr: (time() - wireguard_latest_handshake_seconds{interface="wg0"}) > 300
for: 1m
labels:
severity: warning
annotations:
summary: "Wireguard handshake timeout on {{ $labels.instance }}"
description: "Wireguard handshake timeout on {{ $labels.instance }} for more than 1 minutes."
description: "Wireguard handshake timeout on {{ $labels.instance }} for peer {{ $labels.public_key }}."
- name: monitoring_rules
rules:
- alert: prometheus_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="prometheus.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
@@ -201,6 +290,7 @@ groups:
description: "Prometheus service not running on {{ $labels.instance }}"
- alert: alertmanager_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
@@ -208,13 +298,7 @@ groups:
description: "Alertmanager service not running on {{ $labels.instance }}"
- alert: pushgateway_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
labels:
severity: critical
annotations:
summary: "Pushgateway service not running on {{ $labels.instance }}"
description: "Pushgateway service not running on {{ $labels.instance }}"
- alert: pushgateway_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
@@ -222,6 +306,7 @@ groups:
description: "Pushgateway service not running on {{ $labels.instance }}"
- alert: loki_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="loki.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
@@ -229,6 +314,7 @@ groups:
description: "Loki service not running on {{ $labels.instance }}"
- alert: grafana_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
@@ -236,6 +322,7 @@ groups:
description: "Grafana service not running on {{ $labels.instance }}"
- alert: tempo_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="tempo.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
@@ -243,8 +330,123 @@ groups:
description: "Tempo service not running on {{ $labels.instance }}"
- alert: pyroscope_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="podman-pyroscope.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pyroscope service not running on {{ $labels.instance }}"
description: "Pyroscope service not running on {{ $labels.instance }}"
- name: certificate_rules
rules:
- alert: certificate_expiring_soon
expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
for: 5m
labels:
severity: warning
annotations:
summary: "TLS certificate expiring soon for {{ $labels.instance }}"
description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
- alert: step_ca_serving_cert_expiring
expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
for: 5m
labels:
severity: critical
annotations:
summary: "Step-CA serving certificate expiring"
description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
- alert: certificate_check_error
expr: labmon_tlsconmon_certificate_check_error == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Error checking certificate for {{ $labels.address }}"
description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
- alert: step_ca_certificate_expiring
expr: labmon_stepmon_certificate_seconds_left < 3600
for: 5m
labels:
severity: critical
annotations:
summary: "Step-CA certificate expiring for {{ $labels.instance }}"
description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
- name: proxmox_rules
rules:
- alert: pve_node_down
expr: pve_up{id=~"node/.*"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Proxmox node {{ $labels.id }} is down"
description: "Proxmox node {{ $labels.id }} has been down for more than 5 minutes."
- alert: pve_guest_stopped
expr: pve_up{id=~"qemu/.*"} == 0 and pve_onboot_status == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Proxmox VM {{ $labels.id }} is stopped"
description: "Proxmox VM {{ $labels.id }} ({{ $labels.name }}) has onboot=1 but is stopped."
- name: postgres_rules
rules:
- alert: postgres_down
expr: node_systemd_unit_state{instance="pgdb1.home.2rjus.net:9100", name="postgresql.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "PostgreSQL not running on {{ $labels.instance }}"
description: "PostgreSQL has been down on {{ $labels.instance }} more than 5 minutes."
- alert: postgres_exporter_down
expr: up{job="postgres"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL exporter down on {{ $labels.instance }}"
description: "Cannot scrape PostgreSQL metrics from {{ $labels.instance }}."
- alert: postgres_high_connections
expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL connection pool near exhaustion on {{ $labels.instance }}"
description: "PostgreSQL is using over 80% of max_connections on {{ $labels.instance }}."
- name: jellyfin_rules
rules:
- alert: jellyfin_down
expr: up{job="jellyfin"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Jellyfin not responding on {{ $labels.instance }}"
description: "Cannot scrape Jellyfin metrics from {{ $labels.instance }} for 5 minutes."
- name: vault_rules
rules:
- alert: openbao_down
expr: node_systemd_unit_state{instance="vault01.home.2rjus.net:9100", name="openbao.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "OpenBao not running on {{ $labels.instance }}"
description: "OpenBao has been down on {{ $labels.instance }} more than 5 minutes."
- alert: openbao_sealed
expr: vault_core_unsealed == 0
for: 5m
labels:
severity: critical
annotations:
summary: "OpenBao is sealed on {{ $labels.instance }}"
description: "OpenBao has been sealed on {{ $labels.instance }} for more than 5 minutes."
- alert: openbao_scrape_down
expr: up{job="openbao"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Cannot scrape OpenBao metrics from {{ $labels.instance }}"
description: "OpenBao metrics endpoint is not responding on {{ $labels.instance }}."

View File

@@ -1,10 +1,26 @@
{ ... }:
{
homelab.monitoring.scrapeTargets = [{
job_name = "nats";
port = 7777;
}];
services.prometheus.exporters.nats = {
enable = true;
url = "http://localhost:8222";
extraFlags = [
"-varz" # General server info
"-connz" # Connection info
"-jsz=all" # JetStream info
];
};
services.nats = {
enable = true;
jetstream = true;
serverName = "nats1";
settings = {
http_port = 8222;
accounts = {
ADMIN = {
users = [

View File

@@ -6,4 +6,5 @@
./proxy.nix
./nix.nix
];
}

View File

@@ -1,14 +1,16 @@
{ pkgs, config, ... }:
{
sops.secrets."cache-secret" = {
sopsFile = ../../secrets/nix-cache01/cache-secret;
format = "binary";
vault.secrets.cache-secret = {
secretPath = "hosts/nix-cache01/cache-secret";
extractKey = "key";
outputDir = "/run/secrets/cache-secret";
services = [ "harmonia" ];
};
services.harmonia = {
enable = true;
package = pkgs.unstable.harmonia;
signKeyPaths = [ config.sops.secrets.cache-secret.path ];
signKeyPaths = [ "/run/secrets/cache-secret" ];
};
systemd.services.harmonia = {
environment.RUST_LOG = "info,actix_web=debug";

View File

@@ -0,0 +1,33 @@
# DNS records for hosts not managed by this flake
# These are manually maintained and combined with auto-generated records
{
aRecords = {
# 10
"gw" = "10.69.10.1";
# 12_CORE
"nas" = "10.69.12.50";
"nzbget-jail" = "10.69.12.51";
"restic" = "10.69.12.52";
"radarr-jail" = "10.69.12.53";
"sonarr-jail" = "10.69.12.54";
"bazarr" = "10.69.12.55";
"pve1" = "10.69.12.75";
"inc1" = "10.69.12.80";
# 22_WLAN
"unifi-ctrl" = "10.69.22.5";
# 30
"gunter" = "10.69.30.105";
# 31
"media" = "10.69.31.50";
# 99_MGMT
"sw1" = "10.69.99.2";
};
cnames = {
};
}
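
lib/dns-zone.nix is not included in this change, so the snippet below is only an illustration of the kind of output generateZone has to produce for these manual records: A records rendered in the same form as the hand-written zone file this change deletes (e.g. gunter IN A 10.69.30.105).

# Hypothetical illustration, not the real lib/dns-zone.nix.
{ lib }:
let
  externalHosts = import ./external-hosts.nix;
  renderARecords = records:
    lib.concatStringsSep "\n"
      (lib.mapAttrsToList (name: ip: "${name} IN A ${ip}") records);
in
# Produces one zone-file line per entry in aRecords.
renderARecords externalHosts.aRecords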

View File

@@ -1,7 +1,22 @@
{ ... }:
{ self, lib, ... }:
let
dnsLib = import ../../lib/dns-zone.nix { inherit lib; };
externalHosts = import ./external-hosts.nix;
# Generate zone from flake hosts + external hosts
# Use lastModified from git commit as serial number
zoneData = dnsLib.generateZone {
inherit self externalHosts;
serial = self.sourceInfo.lastModified;
domain = "home.2rjus.net";
};
in
{
sops.secrets.ns_xfer_key = {
path = "/etc/nsd/xfer.key";
vault.secrets.ns-xfer-key = {
secretPath = "shared/dns/xfer-key";
extractKey = "key";
outputDir = "/etc/nsd/xfer.key";
services = [ "nsd" ];
};
networking.firewall.allowedTCPPorts = [ 8053 ];
@@ -26,7 +41,7 @@
"home.2rjus.net" = {
provideXFR = [ "10.69.13.6 xferkey" ];
notify = [ "10.69.13.6@8053 xferkey" ];
data = builtins.readFile ./zones-home-2rjus-net.conf;
data = zoneData;
};
};
};

View File

@@ -1,10 +1,24 @@
{ pkgs, ... }: {
homelab.monitoring.scrapeTargets = [{
job_name = "unbound";
port = 9167;
}];
networking.firewall.allowedTCPPorts = [
53
];
networking.firewall.allowedUDPPorts = [
53
];
services.prometheus.exporters.unbound = {
enable = true;
unbound.host = "unix:///run/unbound/unbound.ctl";
};
# Grant exporter access to unbound socket
systemd.services.prometheus-unbound-exporter.serviceConfig.SupplementaryGroups = [ "unbound" ];
services.unbound = {
enable = true;
@@ -23,6 +37,11 @@
do-ip6 = "no";
do-udp = "yes";
do-tcp = "yes";
extended-statistics = true;
};
remote-control = {
control-enable = true;
control-interface = "/run/unbound/unbound.ctl";
};
stub-zone = {
name = "home.2rjus.net";

View File

@@ -1,7 +1,22 @@
{ ... }:
{ self, lib, ... }:
let
dnsLib = import ../../lib/dns-zone.nix { inherit lib; };
externalHosts = import ./external-hosts.nix;
# Generate zone from flake hosts + external hosts
# Used as initial zone data before first AXFR completes
zoneData = dnsLib.generateZone {
inherit self externalHosts;
serial = self.sourceInfo.lastModified;
domain = "home.2rjus.net";
};
in
{
sops.secrets.ns_xfer_key = {
path = "/etc/nsd/xfer.key";
vault.secrets.ns-xfer-key = {
secretPath = "shared/dns/xfer-key";
extractKey = "key";
outputDir = "/etc/nsd/xfer.key";
services = [ "nsd" ];
};
networking.firewall.allowedTCPPorts = [ 8053 ];
networking.firewall.allowedUDPPorts = [ 8053 ];
@@ -24,7 +39,7 @@
"home.2rjus.net" = {
allowNotify = [ "10.69.13.5 xferkey" ];
requestXFR = [ "AXFR 10.69.13.5@8053 xferkey" ];
data = builtins.readFile ./zones-home-2rjus-net.conf;
data = zoneData;
};
};
};

View File

@@ -1,97 +0,0 @@
$ORIGIN home.2rjus.net.
$TTL 1800
@ IN SOA ns1.home.2rjus.net. admin.test.2rjus.net. (
2064 ; serial number
3600 ; refresh
900 ; retry
1209600 ; expire
120 ; ttl
)
IN NS ns1.home.2rjus.net.
IN NS ns2.home.2rjus.net.
IN NS ns3.home.2rjus.net.
; 8_k8s
kube-blue1 IN A 10.69.8.150
kube-blue2 IN A 10.69.8.151
kube-blue3 IN A 10.69.8.152
kube-blue4 IN A 10.69.8.153
rook IN CNAME kube-blue4
kube-blue5 IN A 10.69.8.154
git IN CNAME kube-blue5
kube-blue6 IN A 10.69.8.155
kube-blue7 IN A 10.69.8.156
kube-blue8 IN A 10.69.8.157
kube-blue9 IN A 10.69.8.158
kube-blue10 IN A 10.69.8.159
; 10
gw IN A 10.69.10.1
; 12_CORE
virt-mini1 IN A 10.69.12.11
nas IN A 10.69.12.50
nzbget-jail IN A 10.69.12.51
restic IN A 10.69.12.52
radarr-jail IN A 10.69.12.53
sonarr-jail IN A 10.69.12.54
bazarr IN A 10.69.12.55
mpnzb IN A 10.69.12.57
pve1 IN A 10.69.12.75
inc1 IN A 10.69.12.80
inc2 IN A 10.69.12.81
media1 IN A 10.69.12.82
; 13_SVC
ns1 IN A 10.69.13.5
ns2 IN A 10.69.13.6
ns3 IN A 10.69.13.7
ns4 IN A 10.69.13.8
ha1 IN A 10.69.13.9
nixos-test1 IN A 10.69.13.10
http-proxy IN A 10.69.13.11
ca IN A 10.69.13.12
monitoring01 IN A 10.69.13.13
jelly01 IN A 10.69.13.14
nix-cache01 IN A 10.69.13.15
nix-cache IN CNAME nix-cache01
actions1 IN CNAME nix-cache01
pgdb1 IN A 10.69.13.16
nats1 IN A 10.69.13.17
auth01 IN A 10.69.13.18
vault01 IN A 10.69.13.19
; http-proxy cnames
nzbget IN CNAME http-proxy
radarr IN CNAME http-proxy
sonarr IN CNAME http-proxy
ha IN CNAME http-proxy
z2m IN CNAME http-proxy
grafana IN CNAME http-proxy
prometheus IN CNAME http-proxy
alertmanager IN CNAME http-proxy
jelly IN CNAME http-proxy
auth IN CNAME http-proxy
lldap IN CNAME http-proxy
pyroscope IN CNAME http-proxy
pushgw IN CNAME http-proxy
ldap IN CNAME auth01
; 22_WLAN
unifi-ctrl IN A 10.69.22.5
; 30
gunter IN A 10.69.30.105
; 31
media IN A 10.69.31.50
; 99_MGMT
sw1 IN A 10.69.99.2
testing IN A 10.69.33.33

View File

@@ -1,5 +1,15 @@
{ pkgs, ... }:
{
homelab.monitoring.scrapeTargets = [{
job_name = "postgres";
port = 9187;
}];
services.prometheus.exporters.postgres = {
enable = true;
runAsLocalSuperUser = true; # Use peer auth as postgres user
};
services.postgresql = {
enable = true;
enableJIT = true;

View File

@@ -77,14 +77,100 @@ let
fi
'';
};
bootstrapCertScript = pkgs.writeShellApplication {
name = "bootstrap-vault-cert";
runtimeInputs = with pkgs; [
openbao
jq
openssl
coreutils
];
text = ''
# Bootstrap vault01 with a proper certificate from its own PKI
# This solves the chicken-and-egg problem where ACME clients can't trust
# vault01's self-signed certificate.
echo "=== Bootstrapping vault01 certificate ==="
# Use Unix socket to avoid TLS issues
export BAO_ADDR='unix:///run/openbao/openbao.sock'
# ACME certificate directory
CERT_DIR="/var/lib/acme/vault01.home.2rjus.net"
# Issue certificate for vault01 with vault as SAN
echo "Issuing certificate for vault01.home.2rjus.net (with SAN: vault.home.2rjus.net)..."
OUTPUT=$(bao write -format=json pki_int/issue/homelab \
common_name="vault01.home.2rjus.net" \
alt_names="vault.home.2rjus.net" \
ttl="720h")
# Create ACME directory structure
echo "Creating ACME certificate directory..."
mkdir -p "$CERT_DIR"
# Extract certificate components to temp files
echo "$OUTPUT" | jq -r '.data.certificate' > /tmp/vault01-cert.pem
echo "$OUTPUT" | jq -r '.data.private_key' > /tmp/vault01-key.pem
echo "$OUTPUT" | jq -r '.data.issuing_ca' > /tmp/vault01-ca.pem
# Create fullchain (cert + CA)
cat /tmp/vault01-cert.pem /tmp/vault01-ca.pem > /tmp/vault01-fullchain.pem
# Backup old certificates if they exist
if [ -f "$CERT_DIR/fullchain.pem" ]; then
echo "Backing up old certificate..."
cp "$CERT_DIR/fullchain.pem" "$CERT_DIR/fullchain.pem.backup"
cp "$CERT_DIR/key.pem" "$CERT_DIR/key.pem.backup"
fi
# Install new certificates
echo "Installing new certificate..."
mv /tmp/vault01-fullchain.pem "$CERT_DIR/fullchain.pem"
mv /tmp/vault01-cert.pem "$CERT_DIR/cert.pem"
mv /tmp/vault01-ca.pem "$CERT_DIR/chain.pem"
mv /tmp/vault01-key.pem "$CERT_DIR/key.pem"
# Set proper ownership and permissions (ACME-style)
chown -R acme:acme "$CERT_DIR"
chmod 750 "$CERT_DIR"
chmod 640 "$CERT_DIR"/*.pem
echo "Certificate installed successfully!"
echo ""
echo "Certificate details:"
openssl x509 -in "$CERT_DIR/cert.pem" -noout -subject -issuer -dates
echo ""
echo "Subject Alternative Names:"
openssl x509 -in "$CERT_DIR/cert.pem" -noout -ext subjectAltName
echo ""
echo "Now restart openbao service:"
echo " systemctl restart openbao"
echo ""
echo "After restart, verify ACME endpoint is accessible:"
echo " curl https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory"
echo ""
echo "Once working, ACME will automatically manage certificate renewals."
'';
};
in
{
# Make bootstrap script available as a command
environment.systemPackages = [ bootstrapCertScript ];
services.openbao = {
enable = true;
settings = {
ui = true;
telemetry = {
prometheus_retention_time = "60s";
disable_hostname = true;
};
storage.file.path = "/var/lib/openbao";
listener.default = {
type = "tcp";
@@ -101,8 +187,8 @@ in
systemd.services.openbao.serviceConfig = {
LoadCredential = [
"key.pem:/var/lib/openbao/key.pem"
"cert.pem:/var/lib/openbao/cert.pem"
"key.pem:/var/lib/acme/vault01.home.2rjus.net/key.pem"
"cert.pem:/var/lib/acme/vault01.home.2rjus.net/fullchain.pem"
];
# TPM2-encrypted unseal key (created manually, see setup instructions)
LoadCredentialEncrypted = [
@@ -110,5 +196,16 @@ in
];
# Auto-unseal on service start
ExecStartPost = "${unsealScript}/bin/openbao-unseal";
# Add openbao user to acme group to read certificates
SupplementaryGroups = [ "acme" ];
};
# ACME certificate management
# Bootstrapped with bootstrap-vault-cert, now managed by ACME
security.acme.certs."vault01.home.2rjus.net" = {
server = "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
listenHTTP = ":80";
reloadServices = [ "openbao" ];
extraDomainNames = [ "vault.home.2rjus.net" ];
};
}
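
With ACME enabled on pki_int (see the Terraform changes further down), other hosts can in principle request their certificates from this endpoint instead of being bootstrapped by hand. A minimal, hypothetical consumer might look like the sketch below; the host name and reloaded service are placeholders, and it assumes the shared acme module already handles acceptTerms/contact details and that port 80 is free for the HTTP-01 challenge.

# Hypothetical client of the OpenBao ACME directory; host name and service are placeholders.
{ ... }:
{
  security.acme.certs."example01.home.2rjus.net" = {
    server = "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
    listenHTTP = ":80";           # answer the HTTP-01 challenge locally
    reloadServices = [ "nginx" ]; # whatever consumes the issued certificate
  };
}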

View File

@@ -4,12 +4,15 @@
./acme.nix
./autoupgrade.nix
./monitoring
./motd.nix
./packages.nix
./nix.nix
./root-user.nix
./root-ca.nix
./pki/root-ca.nix
./sops.nix
./sshd.nix
./vault-secrets.nix
../modules/homelab
];
}

View File

@@ -9,4 +9,13 @@
"processes"
];
};
services.prometheus.exporters.systemd = {
enable = true;
# Default port: 9558
extraFlags = [
"--systemd.collector.enable-restart-count"
"--systemd.collector.enable-ip-accounting"
];
};
}

system/motd.nix Normal file
View File

@@ -0,0 +1,28 @@
{ config, lib, self, ... }:
let
hostname = config.networking.hostName;
domain = config.networking.domain or "";
fqdn = if domain != "" then "${hostname}.${domain}" else hostname;
# Get commit hash (handles both clean and dirty trees)
shortRev = self.shortRev or self.dirtyShortRev or "unknown";
# Format timestamp from lastModified (Unix timestamp)
# lastModifiedDate is in format "YYYYMMDDHHMMSS"
dateStr = self.sourceInfo.lastModifiedDate or "unknown";
formattedDate = if dateStr != "unknown" then
"${builtins.substring 0 4 dateStr}-${builtins.substring 4 2 dateStr}-${builtins.substring 6 2 dateStr} ${builtins.substring 8 2 dateStr}:${builtins.substring 10 2 dateStr} UTC"
else
"unknown";
banner = ''
####################################
${fqdn}
Commit: ${shortRev} (${formattedDate})
####################################
'';
in
{
users.motd = lib.mkDefault banner;
}
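
As a concrete spot-check of the substring slicing above (example value only), a lastModifiedDate of "20260206193000" renders as "2026-02-06 19:30 UTC":

# nix repl spot-check of the date formatting; the input value is made up.
let d = "20260206193000"; sub = builtins.substring; in
"${sub 0 4 d}-${sub 4 2 d}-${sub 6 2 d} ${sub 8 2 d}:${sub 10 2 d} UTC"
# => "2026-02-06 19:30 UTC"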

View File

@@ -1,8 +1,29 @@
{ lib, ... }:
{ lib, pkgs, ... }:
let
nixos-rebuild-test = pkgs.writeShellApplication {
name = "nixos-rebuild-test";
runtimeInputs = [ pkgs.nixos-rebuild ];
text = ''
if [ $# -lt 2 ]; then
echo "Usage: nixos-rebuild-test <action> <branch>"
echo "Example: nixos-rebuild-test boot my-feature-branch"
exit 1
fi
action="$1"
branch="$2"
shift 2
exec nixos-rebuild "$action" --flake "git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$branch" "$@"
'';
};
in
{
environment.systemPackages = [ nixos-rebuild-test ];
nix = {
gc = {
automatic = true;
options = "--delete-older-than 14d";
};
optimise = {

View File

@@ -4,6 +4,7 @@
certificateFiles = [
"${pkgs.cacert}/etc/ssl/certs/ca-bundle.crt"
./root-ca.crt
./vault-root-ca.crt
];
};
}

View File

@@ -0,0 +1,14 @@
-----BEGIN CERTIFICATE-----
MIICIjCCAaigAwIBAgIUQ/Bd/4kNvkPjQjgGLUMynIVzGeAwCgYIKoZIzj0EAwMw
QDELMAkGA1UEBhMCTk8xEDAOBgNVBAoTB0hvbWVsYWIxHzAdBgNVBAMTFmhvbWUu
MnJqdXMubmV0IFJvb3QgQ0EwHhcNMjYwMjAxMjIxODA5WhcNMzYwMTMwMjIxODM5
WjBAMQswCQYDVQQGEwJOTzEQMA4GA1UEChMHSG9tZWxhYjEfMB0GA1UEAxMWaG9t
ZS4ycmp1cy5uZXQgUm9vdCBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABH8xhIOl
Nd1Yb1OFhgIJQZM+OkwoFenOQiKfuQ4oPMxaF+fnXdKc77qPDVRjeDy61oGS38X3
CjPOZAzS9kjo7FmVbzdqlYK7ut/OylF+8MJkCT8mFO1xvuzIXhufnyAD4aNjMGEw
DgYDVR0PAQH/BAQDAgEGMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFEimBeAg
3JVeF4BqdC9hMZ8MYKw2MB8GA1UdIwQYMBaAFEimBeAg3JVeF4BqdC9hMZ8MYKw2
MAoGCCqGSM49BAMDA2gAMGUCMQCvhRElHBra/XyT93SKcG6ZzIG+K+DH3J5jm6Xr
zaGj2VtdhBRVmEKaUcjU7htgSxcCMA9qHKYFcUH72W7By763M6sy8OOiGQNDSERY
VgnNv9rLCvCef1C8G2bYh/sKGZTPGQ==
-----END CERTIFICATE-----

View File

@@ -1,11 +1,10 @@
{ pkgs, config, ... }: {
{ pkgs, config, ... }:
{
programs.zsh.enable = true;
sops.secrets.root_password_hash = { };
sops.secrets.root_password_hash.neededForUsers = true;
users.users.root = {
shell = pkgs.zsh;
hashedPasswordFile = config.sops.secrets.root_password_hash.path;
hashedPassword = "$y$j9T$N09APWqKc4//z9BoGyzSb0$3dMUzojSmo3/10nbIfShd6/IpaYoKdI21bfbWER3jl8";
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAwfb2jpKrBnCw28aevnH8HbE5YbcMXpdaVv2KmueDu6 torjus@gunter"
];

View File

@@ -8,6 +8,48 @@ let
# Import vault-fetch package
vault-fetch = pkgs.callPackage ../scripts/vault-fetch { };
# Helper to create fetch scripts using writeShellApplication
mkFetchScript = name: secretCfg: pkgs.writeShellApplication {
name = "fetch-${name}";
runtimeInputs = [ vault-fetch ];
text = ''
# Set Vault environment variables
export VAULT_ADDR="${cfg.vaultAddress}"
export VAULT_SKIP_VERIFY="${if cfg.skipTlsVerify then "1" else "0"}"
'' + (if secretCfg.extractKey != null then ''
# Fetch to temporary directory, then extract single key
TMPDIR=$(mktemp -d)
trap 'rm -rf $TMPDIR' EXIT
vault-fetch \
"${secretCfg.secretPath}" \
"$TMPDIR" \
"${secretCfg.cacheDir}"
# Extract the specified key and write as a single file
if [ ! -f "$TMPDIR/${secretCfg.extractKey}" ]; then
echo "ERROR: Key '${secretCfg.extractKey}' not found in secret" >&2
exit 1
fi
# Ensure parent directory exists
mkdir -p "$(dirname "${secretCfg.outputDir}")"
cp "$TMPDIR/${secretCfg.extractKey}" "${secretCfg.outputDir}"
chown ${secretCfg.owner}:${secretCfg.group} "${secretCfg.outputDir}"
chmod ${secretCfg.mode} "${secretCfg.outputDir}"
'' else ''
# Fetch secret as directory of files
vault-fetch \
"${secretCfg.secretPath}" \
"${secretCfg.outputDir}" \
"${secretCfg.cacheDir}"
# Set ownership and permissions
chown -R ${secretCfg.owner}:${secretCfg.group} "${secretCfg.outputDir}"
chmod ${secretCfg.mode} "${secretCfg.outputDir}"/*
'');
};
# Secret configuration type
secretType = types.submodule ({ name, config, ... }: {
options = {
@@ -73,6 +115,16 @@ let
'';
};
extractKey = mkOption {
type = types.nullOr types.str;
default = null;
description = ''
Extract a single key from the vault secret JSON and write it as a
plain file instead of a directory of files. When set, outputDir
becomes a file path rather than a directory path.
'';
};
services = mkOption {
type = types.listOf types.str;
default = [];
@@ -152,23 +204,7 @@ in
RemainAfterExit = true;
# Fetch the secret
ExecStart = pkgs.writeShellScript "fetch-${name}" ''
set -euo pipefail
# Set Vault environment variables
export VAULT_ADDR="${cfg.vaultAddress}"
export VAULT_SKIP_VERIFY="${if cfg.skipTlsVerify then "1" else "0"}"
# Fetch secret using vault-fetch
${vault-fetch}/bin/vault-fetch \
"${secretCfg.secretPath}" \
"${secretCfg.outputDir}" \
"${secretCfg.cacheDir}"
# Set ownership and permissions
chown -R ${secretCfg.owner}:${secretCfg.group} "${secretCfg.outputDir}"
chmod ${secretCfg.mode} "${secretCfg.outputDir}"/*
'';
ExecStart = lib.getExe (mkFetchScript name secretCfg);
# Logging
StandardOutput = "journal";
@@ -216,7 +252,10 @@ in
[ "d /run/secrets 0755 root root -" ] ++
[ "d /var/lib/vault/cache 0700 root root -" ] ++
flatten (mapAttrsToList (name: secretCfg: [
"d ${secretCfg.outputDir} 0755 root root -"
# When extractKey is set, outputDir is a file path - create parent dir instead
(if secretCfg.extractKey != null
then "d ${dirOf secretCfg.outputDir} 0755 root root -"
else "d ${secretCfg.outputDir} 0755 root root -")
"d ${secretCfg.cacheDir} 0700 root root -"
]) cfg.secrets);
};
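
To make the two shapes of this option concrete: with extractKey set, outputDir is a single file holding just that key (as the nats-nkey and cache-secret changes in this diff do); without it, outputDir is a directory with one file per key in the secret. A short usage sketch, where the grafana-admin consumer is hypothetical:

{
  # Single-key mode (mirrors the alerttonotify change above): outputDir is a file.
  vault.secrets.nats-nkey = {
    secretPath = "shared/nats/nkey";
    extractKey = "nkey";
    outputDir = "/run/secrets/nats_nkey";
    services = [ "alerttonotify" ];
  };

  # Directory mode (hypothetical consumer): every key becomes a file under outputDir.
  vault.secrets.grafana-admin = {
    secretPath = "hosts/monitoring01/grafana-admin";
    outputDir = "/run/secrets/grafana-admin";
    services = [ "grafana" ];
  };
}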

View File

@@ -15,6 +15,7 @@ locals {
# "secret/data/services/grafana/*",
# "secret/data/shared/smtp/*"
# ]
# extra_policies = ["some-other-policy"] # Optional: additional policies
# }
# Example: ha1 host
@@ -25,17 +26,67 @@ locals {
# ]
# }
# TODO: actually use this policy
"ha1" = {
paths = [
"secret/data/hosts/ha1/*",
"secret/data/shared/backup/*",
]
}
# TODO: actually use this policy
"monitoring01" = {
paths = [
"secret/data/hosts/monitoring01/*",
"secret/data/shared/backup/*",
"secret/data/shared/nats/*",
]
extra_policies = ["prometheus-metrics"]
}
# Wave 1: hosts with no service secrets (only need vault.enable for future use)
"nats1" = {
paths = [
"secret/data/hosts/nats1/*",
]
}
"jelly01" = {
paths = [
"secret/data/hosts/jelly01/*",
]
}
"pgdb1" = {
paths = [
"secret/data/hosts/pgdb1/*",
]
}
# Wave 3: DNS servers
"ns1" = {
paths = [
"secret/data/hosts/ns1/*",
"secret/data/shared/dns/*",
]
}
"ns2" = {
paths = [
"secret/data/hosts/ns2/*",
"secret/data/shared/dns/*",
]
}
# Wave 4: http-proxy
"http-proxy" = {
paths = [
"secret/data/hosts/http-proxy/*",
]
}
# Wave 5: nix-cache01
"nix-cache01" = {
paths = [
"secret/data/hosts/nix-cache01/*",
]
}
}
@@ -62,7 +113,10 @@ resource "vault_approle_auth_backend_role" "hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["${each.key}-policy"]
token_policies = concat(
["${each.key}-policy"],
lookup(each.value, "extra_policies", [])
)
# Token configuration
token_ttl = 3600 # 1 hour

View File

@@ -62,6 +62,13 @@ resource "vault_mount" "pki_int" {
description = "Intermediate CA"
default_lease_ttl_seconds = 157680000 # 5 years
max_lease_ttl_seconds = 157680000 # 5 years
# Required for ACME support - allow ACME-specific response headers
allowed_response_headers = [
"Replay-Nonce",
"Link",
"Location"
]
}
resource "vault_pki_secret_backend_intermediate_cert_request" "intermediate" {
@@ -139,6 +146,33 @@ resource "vault_pki_secret_backend_config_urls" "config_urls" {
]
}
# Configure cluster path (required for ACME)
resource "vault_pki_secret_backend_config_cluster" "cluster" {
backend = vault_mount.pki_int.path
path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
aia_path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
}
# Enable ACME support
resource "vault_generic_endpoint" "acme_config" {
depends_on = [
vault_pki_secret_backend_config_cluster.cluster,
vault_pki_secret_backend_role.homelab
]
path = "${vault_mount.pki_int.path}/config/acme"
ignore_absent_fields = true
disable_read = true
disable_delete = true
data_json = jsonencode({
enabled = true
allowed_issuers = ["*"]
allowed_roles = ["*"]
default_directory_policy = "sign-verbatim"
})
}
# ============================================================================
# Direct Certificate Issuance (Non-ACME)
# ============================================================================

View File

@@ -0,0 +1,10 @@
# Generic policies for services (not host-specific)
resource "vault_policy" "prometheus_metrics" {
name = "prometheus-metrics"
policy = <<EOT
path "sys/metrics" {
capabilities = ["read"]
}
EOT
}

View File

@@ -35,22 +35,63 @@ locals {
# }
# }
# TODO: actually use the secret
"hosts/monitoring01/grafana-admin" = {
auto_generate = true
password_length = 32
}
# TODO: actually use the secret
"hosts/ha1/mqtt-password" = {
auto_generate = true
password_length = 24
}
# TODO: Remove after testing
"hosts/vaulttest01/test-service" = {
auto_generate = true
password_length = 32
}
# Shared backup password (auto-generated, add alongside existing restic key)
"shared/backup/password" = {
auto_generate = true
password_length = 32
}
# NATS NKey for alerttonotify
"shared/nats/nkey" = {
auto_generate = false
data = { nkey = var.nats_nkey }
}
# PVE exporter config for monitoring01
"hosts/monitoring01/pve-exporter" = {
auto_generate = false
data = { config = var.pve_exporter_config }
}
# DNS zone transfer key
"shared/dns/xfer-key" = {
auto_generate = false
data = { key = var.ns_xfer_key }
}
# WireGuard private key for http-proxy
"hosts/http-proxy/wireguard" = {
auto_generate = false
data = { private_key = var.wireguard_private_key }
}
# Nix cache signing key
"hosts/nix-cache01/cache-secret" = {
auto_generate = false
data = { key = var.cache_signing_key }
}
# Gitea Actions runner token
"hosts/nix-cache01/actions-token" = {
auto_generate = false
data = { token = var.actions_token_1 }
}
}
}

View File

@@ -16,11 +16,39 @@ variable "vault_skip_tls_verify" {
default = true
}
# Example variables for manual secrets
# Uncomment and add to terraform.tfvars as needed
variable "nats_nkey" {
description = "NATS NKey for alerttonotify"
type = string
sensitive = true
}
variable "pve_exporter_config" {
description = "PVE exporter YAML configuration"
type = string
sensitive = true
}
variable "ns_xfer_key" {
description = "DNS zone transfer TSIG key"
type = string
sensitive = true
}
variable "wireguard_private_key" {
description = "WireGuard private key for http-proxy"
type = string
sensitive = true
}
variable "cache_signing_key" {
description = "Nix binary cache signing key"
type = string
sensitive = true
}
variable "actions_token_1" {
description = "Gitea Actions runner token"
type = string
sensitive = true
}
# variable "smtp_password" {
# description = "SMTP password for notifications"
# type = string
# sensitive = true
# }

View File

@@ -50,8 +50,8 @@ locals {
cpu_cores = 2
memory = 2048
disk_size = "20G"
flake_branch = "vault-bootstrap-integration"
vault_wrapped_token = "s.HwNenAYvXBsPs8uICh4CbE11"
flake_branch = "pki-migration"
vault_wrapped_token = "s.UCpQCOp7cOKDdtGGBvfRWwAt"
}
}