docs: update for sops-to-openbao migration completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 18m17s
Update CLAUDE.md and README.md to reflect that secrets are now managed by OpenBao, with sops only remaining for ca. Update migration plans with sops cleanup checklist and auth01 decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
51
CLAUDE.md
@@ -52,7 +52,12 @@ nix develop

### Secrets Management

Secrets are handled by sops. Do not edit any `.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.
Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
`vault.secrets` option defined in `system/vault-secrets.nix` to fetch secrets at boot.
Terraform manages the secrets and AppRole policies in `terraform/vault/`.

Legacy sops-nix is still present but only actively used by the `ca` host. Do not edit any
`.sops.yaml` or any file within `secrets/`. Ask the user to modify if necessary.

### Git Workflow

@@ -119,7 +124,7 @@ This ensures documentation matches the exact nixpkgs version (currently NixOS 25
- `default.nix` - Entry point, imports configuration.nix and services
- `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
- Core modules: nix.nix, sshd.nix, sops.nix, acme.nix, autoupgrade.nix
- Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
- Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
- `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
@@ -131,13 +136,13 @@ This ensures documentation matches the exact nixpkgs version (currently NixOS 25
- `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
- `ns/` - DNS services (authoritative, resolver, zone generation)
- `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
- `/common/` - Shared configurations (e.g., VM guest agent)
- `/docs/` - Documentation and plans
- `plans/` - Future plans and proposals
- `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management
- `/.sops.yaml` - SOPS configuration with age keys for all servers
- `/.sops.yaml` - SOPS configuration with age keys (legacy, only used by ca)

### Configuration Inheritance

@@ -153,7 +158,7 @@ hosts/<hostname>/default.nix
All hosts automatically get:
- Nix binary cache (nix-cache.home.2rjus.net)
- SSH with root login enabled
- SOPS secrets management with auto-generated age keys
- OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (ca.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
@@ -173,7 +178,6 @@ Production servers managed by `rebuild-all.sh`:
- `nix-cache01` - Binary cache server
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server
- `auth01` - Authentication service

Template/test hosts:
- `template1` - Base template for cloning new hosts
@@ -182,7 +186,7 @@ Template/test hosts:

- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management
- `sops-nix` - Secrets management (legacy, only used by ca)
- Custom packages from git.t-juice.club:
- `alerttonotify` - Alert routing
- `labmon` - Lab monitoring
@@ -198,12 +202,21 @@ Template/test hosts:

### Secrets Management

- Uses SOPS with age encryption
- Each server has unique age key in `.sops.yaml`
- Keys auto-generated at `/var/lib/sops-nix/key.txt` on first boot
Most hosts use OpenBao (Vault) for secrets:
- Vault server at `vault01.home.2rjus.net:8200`
- AppRole authentication with credentials at `/var/lib/vault/approle/`
- Secrets defined in Terraform (`terraform/vault/secrets.tf`)
- AppRole policies in Terraform (`terraform/vault/approle.tf`)
- NixOS module: `system/vault-secrets.nix` with `vault.secrets.<name>` options
- `extractKey` option extracts a single key from vault JSON as a plain file
- Secrets fetched at boot by `vault-secret-<name>.service` systemd units
- Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`

Legacy SOPS (only used by `ca` host):
- SOPS with age encryption, keys in `.sops.yaml`
- Shared secrets: `/secrets/secrets.yaml`
- Per-host secrets: `/secrets/<hostname>/`
- All production servers can decrypt shared secrets; host-specific secrets require specific host keys

### Auto-Upgrade System

@@ -303,13 +316,15 @@ This means:
3. Add host entry to `flake.nix` nixosConfigurations
4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
6. User clones template host
7. User runs `prepare-host.sh` on new host, this deletes files which should be regenerated, like ssh host keys, machine-id etc. It also creates a new age key, and prints the public key
8. This key is then added to `.sops.yaml`
9. Create `/secrets/<hostname>/` if needed
10. Commit changes, and merge to master.
11. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
12. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
6. Add `vault.enable = true;` to the host configuration
7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
8. Run `tofu apply` in `terraform/vault/`
9. User clones template host
10. User runs `prepare-host.sh` on new host
11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
12. Commit changes, and merge to master.
13. Deploy by running `nixos-rebuild boot --flake URL#<hostname>` on the host.
14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry

**Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.

@@ -15,7 +15,6 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
| `nix-cache01` | Nix binary cache |
| `pgdb1` | PostgreSQL |
| `nats1` | NATS messaging |
| `auth01` | Authentication (LLDAP + Authelia) |
| `vault01` | OpenBao (Vault) secrets management |
| `template1`, `template2` | VM templates for cloning new hosts |

@@ -28,7 +27,7 @@ system/ # Shared modules applied to ALL hosts
services/ # Reusable service modules, selectively imported per host
modules/ # Custom NixOS module definitions
lib/ # Nix library functions (DNS zone generation, etc.)
secrets/ # SOPS-encrypted secrets (age encryption)
secrets/ # SOPS-encrypted secrets (legacy, only used by ca)
common/ # Shared configurations (e.g., VM guest agent)
terraform/ # OpenTofu configs for Proxmox VM provisioning
terraform/vault/ # OpenTofu configs for OpenBao (secrets, PKI, AppRoles)
@@ -40,7 +39,7 @@ scripts/ # Helper scripts (create-host, vault-fetch)

**Automatic DNS zone generation** - A records are derived from each host's static IP configuration. CNAME aliases are defined via `homelab.dns.cnames`. No manual zone file editing required.

**SOPS secrets management** - Each host has a unique age key. Shared secrets live in `secrets/secrets.yaml`, per-host secrets in `secrets/<hostname>/`.
**OpenBao (Vault) secrets** - Hosts authenticate via AppRole and fetch secrets at boot. Secrets and policies are managed as code in `terraform/vault/`. Legacy SOPS remains only for the `ca` host.

**Daily auto-upgrades** - All hosts pull from the master branch and automatically rebuild and reboot on a randomized schedule.

@@ -1,6 +1,22 @@
# Sops to OpenBao Secrets Migration Plan

## Status: In Progress
## Status: Complete (except ca, deferred)

## Remaining sops cleanup

The `sops-nix` flake input, `system/sops.nix`, `.sops.yaml`, and `secrets/` directory are
still present because `ca` still uses sops for its step-ca secrets (5 secrets in
`services/ca/default.nix`). The `services/authelia/` and `services/lldap/` modules also
reference sops but are only used by auth01 (decommissioned).

Once `ca` is migrated to OpenBao PKI (Phase 4c in host-migration-to-opentofu.md), remove:
- `sops-nix` input from `flake.nix`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`

## Overview

@@ -20,7 +20,7 @@ Hosts to migrate:
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| auth01 | Stateless | Authentication, recreate |
| auth01 | Decommission | No longer in use |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
@@ -94,8 +94,7 @@ These hosts have no meaningful state and can be recreated fresh. For each host:
Migrate stateless hosts in an order that minimizes disruption:

1. **nix-cache01** — low risk, no downstream dependencies during migration
2. **auth01** — low risk
3. **nats1** — low risk, verify no persistent JetStream streams first
2. **nats1** — low risk, verify no persistent JetStream streams first
4. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
5. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
@@ -168,8 +167,9 @@ OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropr
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.

## Phase 5: Decommission jump Host
## Phase 5: Decommission jump and auth01 Hosts

### jump
1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
2. Remove host configuration from `hosts/jump/`
3. Remove from `flake.nix`
@@ -178,12 +178,37 @@ through before starting Zigbee2MQTT on the new host.
6. Destroy the VM in Proxmox
7. Commit cleanup

### auth01
1. Remove host configuration from `hosts/auth01/`
2. Remove from `flake.nix`
3. Remove any secrets in `secrets/auth01/`
4. Remove from `.sops.yaml`
5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)
6. Destroy the VM in Proxmox
7. Commit cleanup

## Phase 6: Decommission ca Host (Deferred)

Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
the same cleanup steps as the jump host.

## Phase 7: Remove sops-nix

Once `ca` is decommissioned (Phase 6), `sops-nix` is no longer used by any host. Remove
all remnants:
- `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
`hosts/template2/scripts.nix`)

See `docs/plans/completed/sops-to-openbao-migration.md` for full context.

## Notes

- Each host migration should be done individually, not in bulk, to limit blast radius