diff --git a/CLAUDE.md b/CLAUDE.md
index 5871ede..5f87677 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -52,7 +52,12 @@ nix develop
 
 ### Secrets Management
 
-Secrets are handled by sops. Do not edit any `.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
+Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
+`vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
+Terraform manages the secrets and AppRole policies in `terraform/vault/`.
+
+Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
+`.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
 
 ### Git Workflow
 
@@ -119,7 +124,7 @@ This ensures documentation matches the exact nixpkgs version (currently NixOS 25
   - `default.nix` - Entry point, imports configuration.nix and services
   - `configuration.nix` - Host-specific settings (networking, hardware, users)
 - `/system/` - Shared system-level configurations applied to ALL hosts
-  - Core modules: nix.nix, sshd.nix, sops.nix, acme.nix, autoupgrade.nix
+  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
   - Monitoring: node-exporter and promtail on every host
 - `/modules/` - Custom NixOS modules
   - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
@@ -131,13 +136,13 @@ This ensures documentation matches the exact nixpkgs version (currently NixOS 25
   - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
   - `ns/` - DNS services (authoritative, resolver, zone generation)
   - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
-- `/secrets/` - SOPS-encrypted secrets with age encryption
+- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
 - `/common/` - Shared configurations (e.g., VM guest agent)
 - `/docs/` - Documentation and plans
   - `plans/` - Future plans and proposals
   - `plans/completed/` - Completed plans (moved here when done)
 - `/playbooks/` - Ansible playbooks for fleet management
-- `/.sops.yaml` - SOPS configuration with age keys for all servers
+- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)
 
 ### Configuration Inheritance
 
@@ -153,7 +158,7 @@ hosts//default.nix
 All hosts automatically get:
 - Nix binary cache (nix-cache.home.2rjus.net)
 - SSH with root login enabled
-- SOPS secrets management with auto-generated age keys
+- OpenBao (Vault) secrets management via AppRole
 - Internal ACME CA integration (ca.home.2rjus.net)
 - Daily auto-upgrades with auto-reboot
 - Prometheus node-exporter + Promtail (logs to monitoring01)
@@ -173,7 +178,6 @@ Production servers managed by `rebuild-all.sh`:
 - `nix-cache01` - Binary cache server
 - `pgdb1` - PostgreSQL database
 - `nats1` - NATS messaging server
-- `auth01` - Authentication service
 
 Template/test hosts:
 - `template1` - Base template for cloning new hosts
@@ -182,7 +186,7 @@ Template/test hosts:
 
 - `nixpkgs` - NixOS 25.11 stable (primary)
 - `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.`)
-- `sops-nix` - Secrets management
+- `sops-nix` - Secrets management (legacy, only used by ca)
 - Custom packages from git.t-juice.club:
   - `alerttonotify` - Alert routing
   - `labmon` - Lab monitoring
@@ -198,12 +202,21 @@ Template/test hosts:
 
 ### Secrets Management
 
-- Uses SOPS with age encryption
-- Each server has unique age key in `.sops.yaml`
-- Keys auto-generated at `/var/lib/sops-nix/key.txt` on first boot
+Most hosts use OpenBao (Vault) for secrets:
+- Vault server at `vault01.home.2rjus.net:8200`
+- AppRole authentication with credentials at `/var/lib/vault/approle/`
+- Secrets defined in Terraform (`terraform/vault/secrets.tf`)
+- AppRole policies in Terraform (`terraform/vault/approle.tf`)
+- NixOS module: `system/vault-secrets.nix` with `vault.secrets.` options
+- `extractKey` option extracts a single key from vault JSON as a plain file
+- Secrets fetched at boot by `vault-secret-.service` systemd units
+- Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
+- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=`
+
+Legacy SOPS (only used by `ca` host):
+- SOPS with age encryption, keys in `.sops.yaml`
 - Shared secrets: `/secrets/secrets.yaml`
 - Per-host secrets: `/secrets//`
-- All production servers can decrypt shared secrets; host-specific secrets require specific host keys
 
 ### Auto-Upgrade System
 
@@ -303,13 +316,15 @@ This means:
 3. Add host entry to `flake.nix` nixosConfigurations
 4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
 5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
-6. User clones template host
-7. User runs `prepare-host.sh` on new host, this deletes files which should be regenerated, like ssh host keys, machine-id etc. It also creates a new age key, and prints the public key
-8. This key is then added to `.sops.yaml`
-9. Create `/secrets//` if needed
-10. Commit changes, and merge to master.
-11. Deploy by running `nixos-rebuild boot --flake URL#` on the host.
-12. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
+6. Add `vault.enable = true;` to the host configuration
+7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
+8. Run `tofu apply` in `terraform/vault/`
+9. User clones template host
+10. User runs `prepare-host.sh` on new host
+11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=`
+12. Commit changes, and merge to master.
+13. Deploy by running `nixos-rebuild boot --flake URL#` on the host.
+14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
 
 **Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
 
diff --git a/README.md b/README.md
index 59fd82f..1939988 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,6 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
 | `nix-cache01` | Nix binary cache |
 | `pgdb1` | PostgreSQL |
 | `nats1` | NATS messaging |
-| `auth01` | Authentication (LLDAP + Authelia) |
 | `vault01` | OpenBao (Vault) secrets management |
 | `template1`, `template2` | VM templates for cloning new hosts |
 
@@ -28,7 +27,7 @@ system/ # Shared modules applied to ALL hosts
 services/ # Reusable service modules, selectively imported per host
 modules/ # Custom NixOS module definitions
 lib/ # Nix library functions (DNS zone generation, etc.)
-secrets/ # SOPS-encrypted secrets (age encryption)
+secrets/ # SOPS-encrypted secrets (legacy, only used by ca)
 common/ # Shared configurations (e.g., VM guest agent)
 terraform/ # OpenTofu configs for Proxmox VM provisioning
 terraform/vault/ # OpenTofu configs for OpenBao (secrets, PKI, AppRoles)
@@ -40,7 +39,7 @@ scripts/ # Helper scripts (create-host, vault-fetch)
 
 **Automatic DNS zone generation** - A records are derived from each host's static IP configuration. CNAME aliases are defined via `homelab.dns.cnames`. No manual zone file editing required.
 
-**SOPS secrets management** - Each host has a unique age key. Shared secrets live in `secrets/secrets.yaml`, per-host secrets in `secrets//`.
+**OpenBao (Vault) secrets** - Hosts authenticate via AppRole and fetch secrets at boot. Secrets and policies are managed as code in `terraform/vault/`. Legacy SOPS remains only for the `ca` host.
 
 **Daily auto-upgrades** - All hosts pull from the master branch and automatically rebuild and reboot on a randomized schedule.
 
diff --git a/docs/plans/completed/sops-to-openbao-migration.md b/docs/plans/completed/sops-to-openbao-migration.md
index 9f9e60e..32299d2 100644
--- a/docs/plans/completed/sops-to-openbao-migration.md
+++ b/docs/plans/completed/sops-to-openbao-migration.md
@@ -1,6 +1,22 @@
 # Sops to OpenBao Secrets Migration Plan
 
-## Status: In Progress
+## Status: Complete (except ca, deferred)
+
+## Remaining sops cleanup
+
+The `sops-nix` flake input, `system/sops.nix`, `.sops.yaml`, and `secrets/` directory are
+still present because `ca` still uses sops for its step-ca secrets (5 secrets in
+`services/ca/default.nix`). The `services/authelia/` and `services/lldap/` modules also
+reference sops but are only used by auth01 (decommissioned).
+
+Once `ca` is migrated to OpenBao PKI (Phase 4c in host-migration-to-opentofu.md), remove:
+- `sops-nix` input from `flake.nix`
+- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
+- `inherit sops-nix` from all specialArgs in `flake.nix`
+- `system/sops.nix` and its import in `system/default.nix`
+- `.sops.yaml`
+- `secrets/` directory
+- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
 
 ## Overview
 
diff --git a/docs/plans/host-migration-to-opentofu.md b/docs/plans/host-migration-to-opentofu.md
index 7f537fc..7019627 100644
--- a/docs/plans/host-migration-to-opentofu.md
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -20,7 +20,7 @@ Hosts to migrate:
 | nix-cache01 | Stateless | Binary cache, recreate |
 | http-proxy | Stateless | Reverse proxy, recreate |
 | nats1 | Stateless | Messaging, recreate |
-| auth01 | Stateless | Authentication, recreate |
+| auth01 | Decommission | No longer in use |
 | ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
 | monitoring01 | Stateful | Prometheus, Grafana, Loki |
 | jelly01 | Stateful | Jellyfin metadata, watch history, config |
@@ -94,8 +94,7 @@ These hosts have no meaningful state and can be recreated fresh. For each host:
 Migrate stateless hosts in an order that minimizes disruption:
 
 1. **nix-cache01** — low risk, no downstream dependencies during migration
-2. **auth01** — low risk
-3. **nats1** — low risk, verify no persistent JetStream streams first
+2. **nats1** — low risk, verify no persistent JetStream streams first
 4. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
 5. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
 
@@ -168,8 +167,9 @@ OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropr
 `usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
 through before starting Zigbee2MQTT on the new host.
 
-## Phase 5: Decommission jump Host
+## Phase 5: Decommission jump and auth01 Hosts
 
+### jump
 1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
 2. Remove host configuration from `hosts/jump/`
 3. Remove from `flake.nix`
@@ -178,12 +178,37 @@ through before starting Zigbee2MQTT on the new host.
 6. Destroy the VM in Proxmox
 7. Commit cleanup
 
+### auth01
+1. Remove host configuration from `hosts/auth01/`
+2. Remove from `flake.nix`
+3. Remove any secrets in `secrets/auth01/`
+4. Remove from `.sops.yaml`
+5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
+6. Destroy the VM in Proxmox
+7. Commit cleanup
+
 ## Phase 6: Decommission ca Host (Deferred)
 
 Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
 OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
 the same cleanup steps as the jump host.
 
+## Phase 7: Remove sops-nix
+
+Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
+all remnants:
+- `sops-nix` input from `flake.nix` and `flake.lock`
+- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
+- `inherit sops-nix` from all specialArgs in `flake.nix`
+- `system/sops.nix` and its import in `system/default.nix`
+- `.sops.yaml`
+- `secrets/` directory
+- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
+- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
+  `hosts/template2/scripts.nix`)
+
+See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
+
 ## Notes
 
 - Each host migration should be done individually, not in bulk, to limit blast radius