diff --git a/TODO.md b/TODO.md
index 46f314f..553ed40 100644
--- a/TODO.md
+++ b/TODO.md
@@ -153,7 +153,9 @@ create-host \
 
 ---
 
-### Phase 4: Secrets Management with HashiCorp Vault
+### Phase 4: Secrets Management with OpenBao (Vault)
+
+**Status:** 🚧 Phases 4a & 4b Complete, 4c & 4d In Progress
 
 **Challenge:** Current sops-nix approach has chicken-and-egg problem with age keys
 
@@ -164,161 +166,225 @@ create-host \
 4. User commits, pushes
 5. VM can now decrypt secrets
 
-**Selected approach:** Migrate to HashiCorp Vault for centralized secrets management
+**Selected approach:** Migrate to OpenBao (Vault fork) for centralized secrets management
+
+**Why OpenBao instead of HashiCorp Vault:**
+- HashiCorp Vault switched to BSL (Business Source License), unavailable in NixOS cache
+- OpenBao is the community fork maintaining the pre-BSL MPL 2.0 license
+- API-compatible with Vault, uses same Terraform provider
+- Maintains all Vault features we need
 
 **Benefits:**
-- Industry-standard secrets management (Vault experience transferable to work)
+- Industry-standard secrets management (Vault-compatible experience)
 - Eliminates manual age key distribution step
 - Secrets-as-code via OpenTofu (infrastructure-as-code aligned)
-- Centralized PKI management (replaces step-ca, consolidates TLS + SSH CA)
+- Centralized PKI management with ACME support (ready to replace step-ca)
 - Automatic secret rotation capabilities
-- Audit logging for all secret access
+- Audit logging for all secret access (not yet enabled)
 - AppRole authentication enables automated bootstrap
 
-**Architecture:**
+**Current Architecture:**
 ```
-vault.home.2rjus.net
-  ├─ KV Secrets Engine (replaces sops-nix)
-  ├─ PKI Engine (replaces step-ca for TLS)
-  ├─ SSH CA Engine (replaces step-ca SSH CA)
-  └─ AppRole Auth (per-host authentication)
+vault.home.2rjus.net (10.69.13.19)
+  ├─ KV Secrets Engine (ready to replace sops-nix)
+  │   ├─ secret/hosts/{hostname}/*
+  │   ├─ secret/services/{service}/*
+  │   └─ secret/shared/{category}/*
+  ├─ PKI Engine (ready to replace step-ca for TLS)
+  │   ├─ Root CA (EC P-384, 10 year)
+  │   ├─ Intermediate CA (EC P-384, 5 year)
+  │   └─ ACME endpoint enabled
+  ├─ SSH CA Engine (TODO: Phase 4c)
+  └─ AppRole Auth (per-host authentication configured)
           ↓
-  New hosts authenticate on first boot
-  Fetch secrets via Vault API
+  [Phase 4d] New hosts authenticate on first boot
+  [Phase 4d] Fetch secrets via Vault API
   No manual key distribution needed
 ```
 
+**Completed:**
+- ✅ Phase 4a: OpenBao server with TPM2 auto-unseal
+- ✅ Phase 4b: Infrastructure-as-code (secrets, policies, AppRoles, PKI)
+
+**Next Steps:**
+- Phase 4c: Migrate from step-ca to OpenBao PKI
+- Phase 4d: Bootstrap integration for automated secrets access
+
 ---
 
-#### Phase 4a: Vault Server Setup
+#### Phase 4a: Vault Server Setup ✅ COMPLETED
+
+**Status:** ✅ Fully implemented and tested
+**Completed:** 2026-02-02
 
 **Goal:** Deploy and configure Vault server with auto-unseal
 
-**Tasks:**
-- [ ] Create `hosts/vault01/` configuration
-  - [ ] Basic NixOS configuration (hostname, networking, etc.)
-  - [ ] Vault service configuration
-  - [ ] Firewall rules (8200 for API, 8201 for cluster)
-  - [ ] Add to flake.nix and terraform
-- [ ] Implement auto-unseal mechanism
-  - [ ] **Preferred:** TPM-based auto-unseal if hardware supports it
-    - [ ] Use tpm2-tools to seal/unseal Vault keys
-    - [ ] Systemd service to unseal on boot
-  - [ ] **Fallback:** Shamir secret sharing with systemd automation
-    - [ ] Generate 3 keys, threshold 2
-    - [ ] Store 2 keys on disk (encrypted), keep 1 offline
-    - [ ] Systemd service auto-unseals using 2 keys
-- [ ] Initial Vault setup
-  - [ ] Initialize Vault
-  - [ ] Configure storage backend (integrated raft or file)
-  - [ ] Set up root token management
-  - [ ] Enable audit logging
-- [ ] Deploy to infrastructure
-  - [ ] Add DNS entry for vault.home.2rjus.net
-  - [ ] Deploy VM via terraform
-  - [ ] Bootstrap and verify Vault is running
+**Implementation:**
+- Used **OpenBao** (Vault fork) instead of HashiCorp Vault due to BSL licensing concerns
+- TPM2-based auto-unseal using systemd's native `LoadCredentialEncrypted`
+- Self-signed bootstrap TLS certificates (avoiding circular dependency with step-ca)
+- File-based storage backend at `/var/lib/openbao`
+- Unix socket + TCP listener (0.0.0.0:8200) configuration
 
-**Deliverable:** Running Vault server that auto-unseals on boot
+**Tasks:**
+- [x] Create `hosts/vault01/` configuration
+  - [x] Basic NixOS configuration (hostname: vault01, IP: 10.69.13.19/24)
+  - [x] Created reusable `services/vault` module
+  - [x] Firewall not needed (trusted network)
+  - [x] Already in flake.nix, deployed via terraform
+- [x] Implement auto-unseal mechanism
+  - [x] **TPM2-based auto-unseal** (preferred option)
+    - [x] systemd `LoadCredentialEncrypted` with TPM2 binding
+    - [x] `writeShellApplication` script with proper runtime dependencies
+    - [x] Reads multiple unseal keys (one per line) until unsealed
+    - [x] Auto-unseals on service start via `ExecStartPost`
+- [x] Initial Vault setup
+  - [x] Initialized OpenBao with Shamir secret sharing (5 keys, threshold 3)
+  - [x] File storage backend
+  - [x] Self-signed TLS certificates via LoadCredential
+- [x] Deploy to infrastructure
+  - [x] DNS entry added for vault.home.2rjus.net
+  - [x] VM deployed via terraform
+  - [x] Verified OpenBao running and auto-unsealing
+
+**Changes from Original Plan:**
+- Used OpenBao instead of HashiCorp Vault (licensing)
+- Used systemd's native TPM2 support instead of tpm2-tools directly
+- Skipped audit logging (can be enabled later)
+- Used self-signed certs initially (will migrate to OpenBao PKI later)
+
+**Deliverable:** ✅ Running OpenBao server that auto-unseals on boot using TPM2
+
+**Documentation:**
+- `/services/vault/README.md` - Service module overview
+- `/docs/vault/auto-unseal.md` - Complete TPM2 auto-unseal setup guide
 
 ---
 
-#### Phase 4b: Vault-as-Code with OpenTofu
+#### Phase 4b: Vault-as-Code with OpenTofu ✅ COMPLETED
+
+**Status:** ✅ Fully implemented and tested
+**Completed:** 2026-02-02
 
 **Goal:** Manage all Vault configuration (secrets structure, policies, roles) as code
 
+**Implementation:**
+- Complete Terraform/OpenTofu configuration in `terraform/vault/`
+- Locals-based pattern (similar to `vms.tf`) for declaring secrets and policies
+- Auto-generation of secrets using `random_password` provider
+- Three-tier secrets path hierarchy: `hosts/`, `services/`, `shared/`
+- PKI infrastructure with **Elliptic Curve certificates** (P-384 for CAs, P-256 for leaf certs)
+- ACME support enabled on intermediate CA
+
 **Tasks:**
-- [ ] Set up Vault Terraform provider
-  - [ ] Create `terraform/vault/` directory
-  - [ ] Configure Vault provider (address, auth)
-  - [ ] Store Vault token securely (terraform.tfvars, gitignored)
-- [ ] Enable and configure secrets engines
-  - [ ] Enable KV v2 secrets engine at `secret/`
-  - [ ] Define secret path structure (per-service, per-host)
-  - [ ] Example: `secret/monitoring/grafana`, `secret/postgres/ha1`
-- [ ] Define policies as code
-  - [ ] Create policies for different service tiers
-  - [ ] Principle of least privilege (hosts only read their secrets)
-  - [ ] Example: monitoring-policy allows read on `secret/monitoring/*`
-- [ ] Set up AppRole authentication
-  - [ ] Enable AppRole auth backend
-  - [ ] Create role per host type (monitoring, dns, database, etc.)
-  - [ ] Bind policies to roles
-  - [ ] Configure TTL and token policies
-- [ ] Migrate existing secrets from sops-nix
-  - [ ] Create migration script/playbook
-  - [ ] Decrypt sops secrets and load into Vault KV
-  - [ ] Verify all secrets migrated successfully
-  - [ ] Keep sops as backup during transition
-- [ ] Implement secrets-as-code patterns
-  - [ ] Secret values in gitignored terraform.tfvars
-  - [ ] Or use random_password for auto-generated secrets
-  - [ ] Secret structure/paths in version-controlled .tf files
+- [x] Set up Vault Terraform provider
+  - [x] Created `terraform/vault/` directory
+  - [x] Configured Vault provider (uses HashiCorp provider, compatible with OpenBao)
+  - [x] Credentials in terraform.tfvars (gitignored)
+  - [x] terraform.tfvars.example for reference
+- [x] Enable and configure secrets engines
+  - [x] KV v2 engine at `secret/`
+  - [x] Three-tier path structure:
+    - `secret/hosts/{hostname}/*` - Host-specific secrets
+    - `secret/services/{service}/*` - Service-wide secrets
+    - `secret/shared/{category}/*` - Shared secrets (SMTP, backups, etc.)
+- [x] Define policies as code
+  - [x] Policies auto-generated from `locals.host_policies`
+  - [x] Per-host policies with read/list on designated paths
+  - [x] Principle of least privilege enforced
+- [x] Set up AppRole authentication
+  - [x] AppRole backend enabled at `approle/`
+  - [x] Roles auto-generated per host from `locals.host_policies`
+  - [x] Token TTL: 1 hour, max 24 hours
+  - [x] Policies bound to roles
+- [x] Implement secrets-as-code patterns
+  - [x] Auto-generated secrets using `random_password` provider
+  - [x] Manual secrets supported via variables in terraform.tfvars
+  - [x] Secret structure versioned in .tf files
+  - [x] Secret values excluded from git
+- [x] Set up PKI infrastructure
+  - [x] Root CA (10 year TTL, EC P-384)
+  - [x] Intermediate CA (5 year TTL, EC P-384)
+  - [x] PKI role for `*.home.2rjus.net` (30 day max TTL, EC P-256)
+  - [x] ACME enabled on intermediate CA
+  - [x] Support for static certificate issuance via Terraform
+  - [x] CRL, OCSP, and issuing certificate URLs configured
 
-**Example OpenTofu:**
-```hcl
-resource "vault_kv_secret_v2" "monitoring_grafana" {
-  mount = "secret"
-  name = "monitoring/grafana"
-  data_json = jsonencode({
-    admin_password = var.grafana_admin_password
-    smtp_password = var.smtp_password
-  })
-}
+**Changes from Original Plan:**
+- Used Elliptic Curve instead of RSA for all certificates (better performance, smaller keys)
+- Implemented PKI infrastructure in Phase 4b instead of Phase 4c (more logical grouping)
+- ACME support configured immediately (ready for migration from step-ca)
+- Did not migrate existing sops-nix secrets yet (deferred to gradual migration)
 
-resource "vault_policy" "monitoring" {
-  name = "monitoring-policy"
-  policy = < homelab-root-ca.crt`
+  - [ ] Add to NixOS trust store on all hosts via `security.pki.certificateFiles`
+  - [ ] Deploy via auto-upgrade
+- [ ] Test certificate issuance
+  - [ ] Issue test certificate using ACME client (lego/certbot)
+  - [ ] Or issue static certificate via OpenBao CLI
+  - [ ] Verify certificate chain and trust
+- [ ] Migrate vault01's own certificate
+  - [ ] Issue new certificate from OpenBao PKI (self-issued)
+  - [ ] Replace self-signed bootstrap certificate
+  - [ ] Update service configuration
+- [ ] Migrate hosts from step-ca to OpenBao
+  - [ ] Update `system/acme.nix` to use OpenBao ACME endpoint
+  - [ ] Change server to `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
+  - [ ] Test on one host (non-critical service)
+  - [ ] Roll out to all hosts via auto-upgrade
+- [ ] Configure SSH CA in OpenBao (optional, future work)
   - [ ] Enable SSH secrets engine (`ssh/` mount)
   - [ ] Generate SSH signing keys
   - [ ] Create roles for host and user certificates
   - [ ] Configure TTLs and allowed principals
-- [ ] Migrate hosts from step-ca to Vault
-  - [ ] Update system/acme.nix to use Vault ACME endpoint
-  - [ ] Change server to `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
-  - [ ] Test certificate issuance on one host
-  - [ ] Roll out to all hosts via auto-upgrade
-- [ ] Migrate SSH CA trust
-  - [ ] Distribute Vault SSH CA public key to all hosts
-  - [ ] Update sshd_config to trust Vault CA
-  - [ ] Test SSH certificate authentication
+  - [ ] Distribute SSH CA public key to all hosts
+  - [ ] Update sshd_config to trust OpenBao CA
 - [ ] Decommission step-ca
-  - [ ] Verify all services migrated
+  - [ ] Verify all ACME services migrated and working
   - [ ] Stop step-ca service on ca host
   - [ ] Archive step-ca configuration for backup
+  - [ ] Update documentation
 
-**Deliverable:** All TLS and SSH certificates issued by Vault, step-ca retired
+**Deliverable:** All TLS certificates issued by OpenBao PKI, step-ca retired
 
 ---