# TODO: Automated Host Deployment Pipeline

## Vision

Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.

**Desired workflow:**
```bash
./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go
```

**Current manual workflow (from CLAUDE.md):**
1. Create `/hosts/<hostname>/` directory structure
2. Add host to `flake.nix`
3. Add DNS entries
4. Clone template VM manually
5. Run `prepare-host.sh` on new VM
6. Add generated age key to `.sops.yaml`
7. Configure networking
8. Commit and push
9. Run `nixos-rebuild boot --flake URL#<hostname>` on host

## The Plan

### Phase 1: Parameterized OpenTofu Deployments ✅ COMPLETED

**Status:** Fully implemented and tested

**Implementation:**
- Locals-based structure using `for_each` pattern for multiple VM deployments
- All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
- Automatic DHCP vs static IP detection based on `ip` field presence
- Dynamic outputs showing deployed VM IPs and specifications
- Successfully tested deploying multiple VMs simultaneously
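
The DHCP-versus-static detection can be pictured with a small Python sketch. The real logic lives in OpenTofu (`terraform/vms.tf`), so this is an illustration only: the function name `ipconfig_for`, the `gateway` field, and the gateway address are assumptions, while the `ip=dhcp` and `ip=<cidr>,gw=<gateway>` strings follow Proxmox's cloud-init `ipconfig` format.

```python
from typing import Optional


def ipconfig_for(vm: dict, default_gateway: Optional[str] = None) -> str:
    """Return a Proxmox-style ipconfig string for one VM definition.

    A VM without an `ip` field falls back to DHCP; one with an `ip`
    field (CIDR notation) gets a static address.
    """
    ip = vm.get("ip")
    if not ip:
        return "ip=dhcp"
    gateway = vm.get("gateway", default_gateway)  # `gateway` is a hypothetical field
    return f"ip={ip},gw={gateway}" if gateway else f"ip={ip}"


# Hypothetical VM map mirroring the locals-based structure.
vms = {
    "test01": {"ip": "10.69.13.50/24", "gateway": "10.69.13.1"},  # assumed gateway
    "test02": {},  # no `ip` field, so DHCP is used
}

for name, vm in vms.items():
    print(name, ipconfig_for(vm))
```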

**Tasks:**
- [x] Create module/template structure in terraform for repeatable VM deployments
- [x] Parameterize VM configuration (hostname, CPU, memory, disk, IP)
- [x] Support both DHCP and static IP configuration via cloud-init
- [x] Test deploying multiple VMs from same template

**Deliverable:** ✅ Can deploy multiple VMs with custom parameters via OpenTofu in a single `tofu apply`

**Files:**
- `terraform/vms.tf` - VM definitions using locals map
- `terraform/outputs.tf` - Dynamic outputs for all VMs
- `terraform/variables.tf` - Configurable defaults
- `terraform/README.md` - Complete documentation

---

### Phase 2: Host Configuration Generator ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01
**Enhanced:** 2025-02-01 (added --force flag)

**Goal:** Automate creation of host configuration files

**Implementation:**
- Python CLI tool packaged as Nix derivation
- Available as `create-host` command in devShell
- Rich terminal UI with configuration previews
- Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
- Jinja2 templates for NixOS configurations
- Automatic updates to flake.nix and terraform/vms.tf
- `--force` flag for regenerating existing configurations (useful for testing)
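
A minimal sketch of what the hostname and IP checks might look like, using only the standard library. The hostname rule, the function name `validate_host`, and the way existing hosts and used IPs are passed in are assumptions; the actual rules are in `scripts/create-host/`. The subnet is the one used in this document's examples.

```python
import ipaddress
import re

# Assumed rule: RFC 1123-style label (lowercase letters, digits, hyphens).
HOSTNAME_RE = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")
SUBNET = ipaddress.ip_network("10.69.13.0/24")


def validate_host(hostname, ip_cidr, existing_hosts, used_ips):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    if not HOSTNAME_RE.match(hostname):
        errors.append(f"invalid hostname format: {hostname!r}")
    if hostname in existing_hosts:
        errors.append(f"hostname already exists: {hostname}")
    if ip_cidr is not None:  # None means DHCP, so there is nothing to check
        try:
            iface = ipaddress.ip_interface(ip_cidr)
        except ValueError:
            errors.append(f"invalid IP address: {ip_cidr!r}")
        else:
            if iface.ip not in SUBNET:
                errors.append(f"{iface.ip} is outside {SUBNET}")
            if str(iface.ip) in used_ips:
                errors.append(f"{iface.ip} is already in use")
    return errors


print(validate_host("test01", "10.69.13.50/24", {"vault01"}, {"10.69.13.19"}))
```

Collecting all errors in a list, rather than raising on the first one, lets the CLI report every problem in a single run.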

**Tasks:**
- [x] Create Python CLI with typer framework
  - [x] Takes parameters: hostname, IP, CPU cores, memory, disk size
  - [x] Generates `/hosts/<hostname>/` directory structure
  - [x] Creates `configuration.nix` with proper hostname and networking
  - [x] Generates `default.nix` with standard imports
  - [x] References shared `hardware-configuration.nix` from template
- [x] Add host entry to `flake.nix` programmatically
  - [x] Text-based manipulation (regex insertion)
  - [x] Inserts new nixosConfiguration entry
  - [x] Maintains proper formatting
- [x] Generate corresponding OpenTofu configuration
  - [x] Adds VM definition to `terraform/vms.tf`
  - [x] Uses parameters from CLI input
  - [x] Supports both static IP and DHCP modes
- [x] Package as Nix derivation with templates
- [x] Add to flake packages and devShell
- [x] Implement dry-run mode
- [x] Write comprehensive README
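
The regex-based insertion into `flake.nix` could look roughly like the sketch below. The flake layout, the entry template, and the function name `insert_host` are all assumptions for illustration; the real flake and the real manipulation code may be structured differently.

```python
import re

# Hypothetical entry layout; the real template may differ.
ENTRY_TEMPLATE = """\
      {name} = nixpkgs.lib.nixosSystem {{
        modules = [ ./hosts/{name} ];
      }};
"""


def insert_host(flake_text: str, name: str) -> str:
    """Insert a nixosConfigurations entry for `name` unless one exists."""
    present = rf"^\s*{re.escape(name)}\s*=\s*nixpkgs\.lib\.nixosSystem"
    if re.search(present, flake_text, re.M):
        raise ValueError(f"{name} already present in flake.nix")
    # Anchor on the line that opens the nixosConfigurations attribute set.
    opener = re.compile(r"^(\s*nixosConfigurations\s*=\s*\{[ \t]*\n)", re.M)
    new_text, count = opener.subn(
        lambda m: m.group(1) + ENTRY_TEMPLATE.format(name=name), flake_text, count=1
    )
    if count != 1:
        raise ValueError("could not find nixosConfigurations block")
    return new_text


flake = """\
{
  outputs = { self, nixpkgs }: {
    nixosConfigurations = {
      vault01 = nixpkgs.lib.nixosSystem {
        modules = [ ./hosts/vault01 ];
      };
    };
  };
}
"""
print(insert_host(flake, "test01"))
```

Text-based insertion is fragile if the file's layout changes, which is why the sketch fails loudly when its anchor line is missing instead of guessing.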

**Usage:**
```bash
# In nix develop shell
# --ip is optional (omit for DHCP); --cpu defaults to 2, --memory to 2048,
# and --disk to 20G; --dry-run enables the optional preview mode.
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \
  --cpu 4 \
  --memory 4096 \
  --disk 50G \
  --dry-run
```

**Files:**
- `scripts/create-host/` - Complete Python package with Nix derivation
- `scripts/create-host/README.md` - Full documentation and examples

**Deliverable:** ✅ Tool generates all config files for a new host, validated with Nix and Terraform

---

### Phase 3: Bootstrap Mechanism ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01
**Enhanced:** 2025-02-01 (added branch support for testing)

**Goal:** Get freshly deployed VM to apply its specific host configuration

**Implementation:** Systemd oneshot service that runs on first boot after cloud-init

**Approach taken:** Systemd service (variant of Option A)
- Systemd service `nixos-bootstrap.service` runs on first boot
- Depends on `cloud-config.service` to ensure hostname is set
- Reads hostname from `hostnamectl` (set by cloud-init via Terraform)
- Supports custom git branch via `NIXOS_FLAKE_BRANCH` environment variable
- Runs `nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#${hostname}`
- Reboots into new configuration on success
- Fails gracefully without reboot on errors (network issues, missing config)
- Service removes itself after a successful bootstrap, because it is not part of the new host configuration
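
The control flow of the bootstrap service can be sketched in Python (the actual service is defined in `hosts/template2/bootstrap.nix`, not in Python). The helper names and the injected `run` callable are hypothetical; the flake URL, the `NIXOS_FLAKE_BRANCH` variable, and the `master` default come from this document.

```python
import os

REPO = "git+https://git.t-juice.club/torjus/nixos-servers.git"


def flake_ref(hostname: str, env=os.environ) -> str:
    """Build the flake reference the bootstrap service rebuilds from."""
    branch = env.get("NIXOS_FLAKE_BRANCH") or "master"
    return f"{REPO}?ref={branch}#{hostname}"


def rebuild_command(hostname: str, env=os.environ) -> list:
    """Argument vector for the rebuild step (not executed here)."""
    return ["nixos-rebuild", "boot", "--flake", flake_ref(hostname, env)]


def bootstrap(hostname: str, run, env=os.environ) -> bool:
    """Rebuild, then reboot only on success.

    `run(cmd)` is a stand-in that executes a command and returns its
    exit code, so the flow can be exercised without touching a system.
    """
    if run(rebuild_command(hostname, env)) != 0:
        return False  # no reboot: the VM stays reachable for debugging
    run(["systemctl", "reboot"])
    return True


print(flake_ref("test01", {"NIXOS_FLAKE_BRANCH": "test-pipeline"}))
```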

**Tasks:**
- [x] Create bootstrap service module in template2
  - [x] systemd oneshot service with proper dependencies
  - [x] Reads hostname from hostnamectl (cloud-init sets it)
  - [x] Checks network connectivity via HTTPS (curl)
  - [x] Runs nixos-rebuild boot with flake URL
  - [x] Reboots on success, fails gracefully on error
- [x] Configure cloud-init datasource
  - [x] Use ConfigDrive datasource (Proxmox provider)
  - [x] Add cloud-init disk to Terraform VMs (disks.ide.ide2.cloudinit)
  - [x] Hostname passed via cloud-init user-data from Terraform
- [x] Test bootstrap service execution on fresh VM
- [x] Handle failure cases (flake doesn't exist, network issues)
  - [x] Clear error messages in journald
  - [x] No reboot on failure
  - [x] System remains accessible for debugging

**Files:**
- `hosts/template2/bootstrap.nix` - Bootstrap service definition
- `hosts/template2/configuration.nix` - Cloud-init ConfigDrive datasource
- `terraform/vms.tf` - Cloud-init disk configuration

**Deliverable:** ✅ VMs automatically bootstrap and reboot into host-specific configuration on first boot

---

### Phase 4: Secrets Management with OpenBao (Vault)

**Status:** 🚧 Phases 4a, 4b & 4d Complete, 4c In Progress

**Challenge:** The current sops-nix approach has a chicken-and-egg problem with age keys

**Current workflow:**
1. VM boots, generates age key at `/var/lib/sops-nix/key.txt`
2. User runs `prepare-host.sh` which prints public key
3. User manually adds public key to `.sops.yaml`
4. User commits, pushes
5. VM can now decrypt secrets

**Selected approach:** Migrate to OpenBao (Vault fork) for centralized secrets management

**Why OpenBao instead of HashiCorp Vault:**
- HashiCorp Vault switched to the BSL (Business Source License), so it is unavailable in the NixOS binary cache
- OpenBao is the community fork maintaining the pre-BSL MPL 2.0 license
- API-compatible with Vault, uses same Terraform provider
- Maintains all Vault features we need

**Benefits:**
- Industry-standard secrets management (Vault-compatible experience)
- Eliminates manual age key distribution step
- Secrets-as-code via OpenTofu (infrastructure-as-code aligned)
- Centralized PKI management with ACME support (ready to replace step-ca)
- Automatic secret rotation capabilities
- Audit logging for all secret access (not yet enabled)
- AppRole authentication enables automated bootstrap

**Current Architecture:**
```
vault01.home.2rjus.net (10.69.13.19)
├─ KV Secrets Engine (ready to replace sops-nix)
│  ├─ secret/hosts/{hostname}/*
│  ├─ secret/services/{service}/*
│  └─ secret/shared/{category}/*
├─ PKI Engine (ready to replace step-ca for TLS)
│  ├─ Root CA (EC P-384, 10 year)
│  ├─ Intermediate CA (EC P-384, 5 year)
│  └─ ACME endpoint enabled
├─ SSH CA Engine (TODO: Phase 4c)
└─ AppRole Auth (per-host authentication configured)
        ↓
   [✅ Phase 4d] New hosts authenticate on first boot
   [✅ Phase 4d] Fetch secrets via Vault API
   No manual key distribution needed
```

**Completed:**
- ✅ Phase 4a: OpenBao server with TPM2 auto-unseal
- ✅ Phase 4b: Infrastructure-as-code (secrets, policies, AppRoles, PKI)
- ✅ Phase 4d: Bootstrap integration for automated secrets access

**Next Steps:**
- Phase 4c: Migrate from step-ca to OpenBao PKI

---

#### Phase 4a: Vault Server Setup ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2026-02-02

**Goal:** Deploy and configure Vault server with auto-unseal

**Implementation:**
- Used **OpenBao** (Vault fork) instead of HashiCorp Vault due to BSL licensing concerns
- TPM2-based auto-unseal using systemd's native `LoadCredentialEncrypted`
- Self-signed bootstrap TLS certificates (avoiding circular dependency with step-ca)
- File-based storage backend at `/var/lib/openbao`
- Unix socket + TCP listener (0.0.0.0:8200) configuration

**Tasks:**
- [x] Create `hosts/vault01/` configuration
  - [x] Basic NixOS configuration (hostname: vault01, IP: 10.69.13.19/24)
  - [x] Created reusable `services/vault` module
  - [x] Firewall not needed (trusted network)
  - [x] Already in flake.nix, deployed via terraform
- [x] Implement auto-unseal mechanism
  - [x] **TPM2-based auto-unseal** (preferred option)
    - [x] systemd `LoadCredentialEncrypted` with TPM2 binding
    - [x] `writeShellApplication` script with proper runtime dependencies
    - [x] Reads multiple unseal keys (one per line) until unsealed
    - [x] Auto-unseals on service start via `ExecStartPost`
- [x] Initial Vault setup
  - [x] Initialized OpenBao with Shamir secret sharing (5 keys, threshold 3)
  - [x] File storage backend
  - [x] Self-signed TLS certificates via LoadCredential
- [x] Deploy to infrastructure
  - [x] DNS entry added for vault01.home.2rjus.net
  - [x] VM deployed via terraform
  - [x] Verified OpenBao running and auto-unsealing
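
The "read unseal keys one per line until unsealed" loop can be sketched as follows. The real implementation is a `writeShellApplication` script; here the `submit_key` callable and the `FakeServer` class are hypothetical stand-ins that only simulate the 3-of-5 Shamir threshold.

```python
def unseal(keys, submit_key):
    """Submit unseal keys one at a time until the server reports unsealed.

    `submit_key(key)` stands in for the real API or CLI call and must
    return True while the server is still sealed.
    Returns the number of keys submitted.
    """
    used = 0
    for key in keys:
        key = key.strip()
        if not key:
            continue  # tolerate blank lines in the credential file
        used += 1
        if not submit_key(key):
            return used
    raise RuntimeError(f"still sealed after {used} keys")


class FakeServer:
    """Simulates a Shamir threshold: unsealed once `threshold` distinct keys arrive."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.seen = set()

    def submit(self, key):
        self.seen.add(key)
        return len(self.seen) < self.threshold  # True means still sealed


server = FakeServer(threshold=3)
credential_file = "key1\nkey2\nkey3\nkey4\nkey5\n"
print(unseal(credential_file.splitlines(), server.submit))
```

Stopping as soon as the server reports unsealed means the remaining keys are never sent, even though all five are available to the script.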

**Changes from Original Plan:**
- Used OpenBao instead of HashiCorp Vault (licensing)
- Used systemd's native TPM2 support instead of tpm2-tools directly
- Skipped audit logging (can be enabled later)
- Used self-signed certs initially (will migrate to OpenBao PKI later)

**Deliverable:** ✅ Running OpenBao server that auto-unseals on boot using TPM2

**Documentation:**
- `/services/vault/README.md` - Service module overview
- `/docs/vault/auto-unseal.md` - Complete TPM2 auto-unseal setup guide

---

#### Phase 4b: Vault-as-Code with OpenTofu ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2026-02-02

**Goal:** Manage all Vault configuration (secrets structure, policies, roles) as code

**Implementation:**
- Complete Terraform/OpenTofu configuration in `terraform/vault/`
- Locals-based pattern (similar to `vms.tf`) for declaring secrets and policies
- Auto-generation of secrets using `random_password` provider
- Three-tier secrets path hierarchy: `hosts/`, `services/`, `shared/`
- PKI infrastructure with **Elliptic Curve certificates** (P-384 for CAs, P-256 for leaf certs)
- ACME support enabled on intermediate CA

**Tasks:**
- [x] Set up Vault Terraform provider
  - [x] Created `terraform/vault/` directory
  - [x] Configured Vault provider (uses HashiCorp provider, compatible with OpenBao)
  - [x] Credentials in terraform.tfvars (gitignored)
  - [x] terraform.tfvars.example for reference
- [x] Enable and configure secrets engines
  - [x] KV v2 engine at `secret/`
  - [x] Three-tier path structure:
    - `secret/hosts/{hostname}/*` - Host-specific secrets
    - `secret/services/{service}/*` - Service-wide secrets
    - `secret/shared/{category}/*` - Shared secrets (SMTP, backups, etc.)
- [x] Define policies as code
  - [x] Policies auto-generated from `locals.host_policies`
  - [x] Per-host policies with read/list on designated paths
  - [x] Principle of least privilege enforced
- [x] Set up AppRole authentication
  - [x] AppRole backend enabled at `approle/`
  - [x] Roles auto-generated per host from `locals.host_policies`
  - [x] Token TTL: 1 hour, max 24 hours
  - [x] Policies bound to roles
- [x] Implement secrets-as-code patterns
  - [x] Auto-generated secrets using `random_password` provider
  - [x] Manual secrets supported via variables in terraform.tfvars
  - [x] Secret structure versioned in .tf files
  - [x] Secret values excluded from git
- [x] Set up PKI infrastructure
  - [x] Root CA (10 year TTL, EC P-384)
  - [x] Intermediate CA (5 year TTL, EC P-384)
  - [x] PKI role for `*.home.2rjus.net` (30 day max TTL, EC P-256)
  - [x] ACME enabled on intermediate CA
  - [x] Support for static certificate issuance via Terraform
  - [x] CRL, OCSP, and issuing certificate URLs configured
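
To illustrate how per-host policies might be derived from a map such as `locals.host_policies`, here is a hedged Python sketch that renders policy text. The real generation happens in OpenTofu under `terraform/vault/`, so the function name, the map contents, and the exact capabilities are assumptions. The `data/` and `metadata/` prefixes reflect how KV v2 engines expose secret values and key listings.

```python
def render_policy(paths):
    """Render a read/list policy document for KV v2 paths under `secret/`."""
    stanzas = []
    for path in paths:
        # KV v2 serves secret values under data/ and key listings under metadata/.
        stanzas.append(f'path "secret/data/{path}" {{\n  capabilities = ["read"]\n}}')
        stanzas.append(f'path "secret/metadata/{path}" {{\n  capabilities = ["list"]\n}}')
    return "\n\n".join(stanzas) + "\n"


# Hypothetical equivalent of `locals.host_policies`.
host_policies = {
    "test01": ["hosts/test01/*", "shared/smtp/*"],
}

policies = {host: render_policy(paths) for host, paths in host_policies.items()}
print(policies["test01"])
```

Generating one policy per host from a single map keeps least privilege manageable: a host can only read the paths listed under its own key.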

**Changes from Original Plan:**
- Used Elliptic Curve instead of RSA for all certificates (better performance, smaller keys)
- Implemented PKI infrastructure in Phase 4b instead of Phase 4c (more logical grouping)
- ACME support configured immediately (ready for migration from step-ca)
- Did not migrate existing sops-nix secrets yet (deferred to gradual migration)

**Files:**
- `terraform/vault/main.tf` - Provider configuration
- `terraform/vault/variables.tf` - Variable definitions
- `terraform/vault/approle.tf` - AppRole authentication (locals-based pattern)
- `terraform/vault/pki.tf` - PKI infrastructure with EC certificates
- `terraform/vault/secrets.tf` - KV secrets engine (auto-generation support)
- `terraform/vault/README.md` - Complete documentation and usage examples
- `terraform/vault/terraform.tfvars.example` - Example credentials

**Deliverable:** ✅ All secrets, policies, AppRoles, and PKI managed as OpenTofu code in `terraform/vault/`

**Documentation:**
- `/terraform/vault/README.md` - Comprehensive guide covering:
  - Setup and deployment
  - AppRole usage and host access patterns
  - PKI certificate issuance (ACME, static, manual)
  - Secrets management patterns
  - ACME configuration and troubleshooting

---

#### Phase 4c: PKI Migration (Replace step-ca)

**Goal:** Migrate hosts from step-ca to OpenBao PKI for TLS certificates

**Note:** PKI infrastructure already set up in Phase 4b (root CA, intermediate CA, ACME support)

**Tasks:**
- [x] Set up OpenBao PKI engines (completed in Phase 4b)
  - [x] Root CA (`pki/` mount, 10 year TTL, EC P-384)
  - [x] Intermediate CA (`pki_int/` mount, 5 year TTL, EC P-384)
  - [x] Signed intermediate with root CA
  - [x] Configured CRL, OCSP, and issuing certificate URLs
- [x] Enable ACME support (completed in Phase 4b)
  - [x] Enabled ACME on intermediate CA
  - [x] Created PKI role for `*.home.2rjus.net`
  - [x] Set certificate TTLs (30 day max) and allowed domains
  - [x] ACME directory: `https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [ ] Download and distribute root CA certificate
  - [ ] Export root CA: `bao read -field=certificate pki/cert/ca > homelab-root-ca.crt`
  - [ ] Add to NixOS trust store on all hosts via `security.pki.certificateFiles`
  - [ ] Deploy via auto-upgrade
- [ ] Test certificate issuance
  - [ ] Issue test certificate using ACME client (lego/certbot)
  - [ ] Or issue static certificate via OpenBao CLI
  - [ ] Verify certificate chain and trust
- [ ] Migrate vault01's own certificate
  - [ ] Issue new certificate from OpenBao PKI (self-issued)
  - [ ] Replace self-signed bootstrap certificate
  - [ ] Update service configuration
- [ ] Migrate hosts from step-ca to OpenBao
  - [ ] Update `system/acme.nix` to use OpenBao ACME endpoint
  - [ ] Change server to `https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory`
  - [ ] Test on one host (non-critical service)
  - [ ] Roll out to all hosts via auto-upgrade
- [ ] Configure SSH CA in OpenBao (optional, future work)
  - [ ] Enable SSH secrets engine (`ssh/` mount)
  - [ ] Generate SSH signing keys
  - [ ] Create roles for host and user certificates
  - [ ] Configure TTLs and allowed principals
  - [ ] Distribute SSH CA public key to all hosts
  - [ ] Update sshd_config to trust OpenBao CA
- [ ] Decommission step-ca
  - [ ] Verify all ACME services migrated and working
  - [ ] Stop step-ca service on ca host
  - [ ] Archive step-ca configuration for backup
  - [ ] Update documentation

**Deliverable:** All TLS certificates issued by OpenBao PKI, step-ca retired

---

#### Phase 4d: Bootstrap Integration ✅ COMPLETED (2026-02-02)

**Goal:** New hosts automatically authenticate to Vault on first boot, no manual steps

**Tasks:**
- [x] Update create-host tool
  - [x] Generate wrapped token (24h TTL, single-use) for new host
  - [x] Add host-specific policy to Vault (via terraform/vault/hosts-generated.tf)
  - [x] Store wrapped token in terraform/vms.tf for cloud-init injection
  - [x] Add `--regenerate-token` flag to regenerate only the token without overwriting config
- [x] Update template2 for Vault authentication
  - [x] Reads wrapped token from cloud-init (/run/cloud-init-env)
  - [x] Unwraps token to get role_id + secret_id
  - [x] Stores AppRole credentials in /var/lib/vault/approle/ (persistent)
  - [x] Graceful fallback if Vault unavailable during bootstrap
- [x] Create NixOS Vault secrets module (system/vault-secrets.nix)
  - [x] Runtime secret fetching (services fetch on start, not at nixos-rebuild time)
  - [x] Secrets cached in /var/lib/vault/cache/ for fallback when Vault unreachable
  - [x] Secrets written to /run/secrets/ (tmpfs, cleared on reboot)
  - [x] Fresh authentication per service start (no token renewal needed)
  - [x] Optional periodic rotation with systemd timers
  - [x] Critical service protection (no auto-restart for DNS, CA, Vault itself)
- [x] Create vault-fetch helper script
  - [x] Standalone tool for fetching secrets from Vault
  - [x] Authenticates using AppRole credentials
  - [x] Writes individual files per secret key
  - [x] Handles caching and fallback logic
- [x] Update bootstrap service (hosts/template2/bootstrap.nix)
  - [x] Unwraps Vault token on first boot
  - [x] Stores persistent AppRole credentials
  - [x] Continues with nixos-rebuild
  - [x] Services fetch secrets when they start
- [x] Update terraform cloud-init (terraform/cloud-init.tf)
  - [x] Inject VAULT_ADDR and VAULT_WRAPPED_TOKEN via write_files
  - [x] Write to /run/cloud-init-env (tmpfs, cleaned on reboot)
  - [x] Fixed YAML indentation issues (write_files at top level)
  - [x] Support flake_branch alongside vault credentials
- [x] Test complete flow
  - [x] Created vaulttest01 test host
  - [x] Verified bootstrap with Vault integration
  - [x] Verified service secret fetching
  - [x] Tested cache fallback when Vault unreachable
  - [x] Tested wrapped token single-use (second bootstrap fails as expected)
  - [x] Confirmed zero manual steps required

**Implementation Details:**

**Wrapped Token Security:**
- Single-use tokens prevent reuse if leaked
- 24h TTL limits exposure window
- Safe to commit to git (expired/used tokens useless)
- Regenerate with `create-host --hostname X --regenerate-token`

**Secret Fetching:**
- Runtime (not build-time) keeps secrets out of Nix store
- Cache fallback enables service availability when Vault down
- Fresh authentication per service start (no renewal complexity)
- Individual files per secret key for easy consumption
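
The cache-fallback behaviour can be illustrated with a short Python sketch. This is not the `vault-fetch` implementation: the function name `fetch_secret`, the injected `fetch` callable, and the temporary directories are assumptions chosen so the example runs without a Vault server.

```python
import tempfile
from pathlib import Path


def fetch_secret(fetch, out_dir: Path, cache_dir: Path) -> str:
    """Write one file per secret key to out_dir, falling back to the cache.

    `fetch()` stands in for the Vault request and returns a dict of
    key -> value, or raises OSError when Vault is unreachable.
    Returns "vault" or "cache" depending on where the data came from.
    """
    try:
        secret = fetch()
        source = "vault"
    except OSError:
        if not cache_dir.is_dir():
            raise  # no cached copy: surface the original failure
        secret = {p.name: p.read_text() for p in cache_dir.iterdir() if p.is_file()}
        source = "cache"
    # Fresh data refreshes the cache too; cached data only fills out_dir.
    targets = [out_dir] if source == "cache" else [out_dir, cache_dir]
    for directory in targets:
        directory.mkdir(parents=True, exist_ok=True)
        for key, value in secret.items():
            path = directory / key
            path.write_text(value)
            path.chmod(0o600)  # secrets should not be world-readable
    return source


def vault_down():
    """Simulates an unreachable Vault (ConnectionError is an OSError)."""
    raise ConnectionError("vault unreachable")


root = Path(tempfile.mkdtemp())
out_dir, cache_dir = root / "run-secrets", root / "cache"
print(fetch_secret(lambda: {"password": "s3cret"}, out_dir, cache_dir))
print(fetch_secret(vault_down, out_dir, cache_dir))
```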

**Bootstrap Flow:**
```
1. create-host --hostname myhost --ip 10.69.13.x/24
   ↓ Generates wrapped token, updates terraform
2. tofu apply (deploys VM with cloud-init)
   ↓ Cloud-init writes wrapped token to /run/cloud-init-env
3. nixos-bootstrap.service runs:
   ↓ Unwraps token → gets role_id + secret_id
   ↓ Stores in /var/lib/vault/approle/ (persistent)
   ↓ Runs nixos-rebuild boot
4. Service starts → fetches secrets from Vault
   ↓ Uses stored AppRole credentials
   ↓ Caches secrets for fallback
5. Done - zero manual intervention
```
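
Step 3 of the flow above (read the wrapped token, unwrap it, extract the AppRole credentials) might be sketched like this. The `KEY=VALUE` file format, the helper names, and the shape of the unwrapped payload are assumptions; the `sys/wrapping/unwrap` endpoint and the `X-Vault-Token` header are part of the Vault-compatible HTTP API. The request is only built here, never sent.

```python
import json
import urllib.request


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


def unwrap_request(env: dict) -> urllib.request.Request:
    """Build (but do not send) the unwrap call for the wrapped token."""
    return urllib.request.Request(
        url=env["VAULT_ADDR"].rstrip("/") + "/v1/sys/wrapping/unwrap",
        method="POST",
        # The wrapped token itself authenticates the unwrap request.
        headers={"X-Vault-Token": env["VAULT_WRAPPED_TOKEN"]},
    )


def extract_approle(response_body: bytes) -> tuple:
    """Pull role_id and secret_id out of an unwrap response body.

    Assumes the wrapped payload was a JSON object holding both fields.
    """
    data = json.loads(response_body)["data"]
    return data["role_id"], data["secret_id"]


sample = 'VAULT_ADDR=https://vault01.home.2rjus.net:8200\nVAULT_WRAPPED_TOKEN="example-token"\n'
request = unwrap_request(parse_env(sample))
print(request.get_method(), request.full_url)
```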

**Files Created:**
- `scripts/vault-fetch/` - Secret fetching helper (Nix package)
- `system/vault-secrets.nix` - NixOS module for declarative Vault secrets
- `scripts/create-host/vault_helper.py` - Vault API integration
- `terraform/vault/hosts-generated.tf` - Auto-generated host policies
- `docs/vault-bootstrap-implementation.md` - Architecture documentation
- `docs/vault-bootstrap-testing.md` - Testing guide

**Configuration:**
- Vault address: `https://vault01.home.2rjus.net:8200` (configurable)
- All defaults remain configurable via environment variables or NixOS options

**Next Steps:**
- Gradually migrate existing services from sops-nix to Vault
- Add CNAME for vault.home.2rjus.net → vault01.home.2rjus.net
- Phase 4c: Migrate from step-ca to OpenBao PKI (future)

**Deliverable:** ✅ Fully automated secrets access from first boot, zero manual steps

---

### Phase 5: DNS Automation

**Goal:** Automatically generate DNS entries from host configurations

**Approach:** Leverage Nix to generate zone file entries from flake host configurations

Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.

**Tasks:**
- [ ] Add optional CNAME field to host configurations
  - [ ] Add `networking.cnames = [ "alias1" "alias2" ]` or similar option
  - [ ] Document in host configuration template
- [ ] Create Nix function to extract DNS records from all hosts
  - [ ] Parse each host's `networking.hostName` and IP configuration
  - [ ] Collect any defined CNAMEs
  - [ ] Generate zone file fragment with A and CNAME records
- [ ] Integrate auto-generated records into zone files
  - [ ] Keep manual entries separate (for non-flake hosts/services)
  - [ ] Include generated fragment in main zone file
  - [ ] Add comments showing which records are auto-generated
- [ ] Update zone file serial number automatically
- [ ] Test zone file validity after generation
- [ ] Either:
  - [ ] Automatically trigger DNS server reload (Ansible)
  - [ ] Or document manual step: merge to master, run upgrade on ns1/ns2
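
The planned generator is meant to be a Nix function, but the record-building logic can be prototyped in Python first. In this sketch the input mapping, the `cnames` key, the marker comments, and the `YYYYMMDDnn` serial convention are all assumptions.

```python
def zone_fragment(hosts: dict) -> str:
    """Render A and CNAME records for a mapping of host name -> settings."""
    lines = ["; BEGIN auto-generated records (do not edit by hand)"]
    for name in sorted(hosts):
        host = hosts[name]
        ip = host.get("ip")
        if ip is None:
            continue  # DHCP hosts have no static address to publish
        lines.append(f"{name}\tIN\tA\t{ip.split('/')[0]}")
        for alias in host.get("cnames", []):
            lines.append(f"{alias}\tIN\tCNAME\t{name}")
    lines.append("; END auto-generated records")
    return "\n".join(lines) + "\n"


def next_serial(current: int, today: str) -> int:
    """Bump a YYYYMMDDnn serial: restart at 00 on a new day, else add one."""
    base = int(today) * 100
    return base if current < base else current + 1


hosts = {
    "vault01": {"ip": "10.69.13.19/24", "cnames": ["vault"]},
    "dhcp01": {},  # hypothetical DHCP host, skipped
}
print(zone_fragment(hosts))
```

Sorting the host names keeps the generated fragment stable between runs, so the zone file only changes when a record actually changes.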

**Deliverable:** DNS A records and CNAMEs automatically generated from host configs

---

### Phase 6: Integration Script

**Goal:** Single command to create and deploy a new host

**Tasks:**
- [ ] Create `scripts/create-host.sh` master script that orchestrates:
  1. Prompts for: hostname, IP (or DHCP), CPU, memory, disk
  2. Validates inputs (IP not in use, hostname unique, etc.)
  3. Calls host config generator (Phase 2)
  4. Generates OpenTofu config (Phase 2)
  5. Handles secrets (Phase 4)
  6. Updates DNS (Phase 5)
  7. Commits all changes to git
  8. Runs `tofu apply` to deploy VM
  9. Waits for bootstrap to complete (Phase 3)
  10. Prints success message with IP and SSH command
- [ ] Add `--dry-run` flag to preview changes
- [ ] Add `--interactive` mode vs `--batch` mode
- [ ] Error handling and rollback on failures
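
One possible shape for the orchestration, with dry-run and rollback, is sketched below in Python (the plan names a shell script, so this is only an illustration of the control flow). The `(name, action, undo)` step tuples and the function name `run_pipeline` are assumptions.

```python
def run_pipeline(steps, dry_run=False, log=print):
    """Run (name, action, undo) steps in order; undo completed steps on failure.

    Returns True on success, False if a step failed and was rolled back.
    """
    done = []
    for name, action, undo in steps:
        if dry_run:
            log(f"[dry-run] would run: {name}")
            continue
        try:
            action()
        except Exception as exc:
            log(f"step failed: {name}: {exc}")
            # Undo in reverse order so later steps are reverted first.
            for prev_name, prev_undo in reversed(done):
                log(f"rolling back: {prev_name}")
                prev_undo()
            return False
        done.append((name, undo))
    return True


# Hypothetical steps; real ones would call create-host, git, and tofu.
steps = [
    ("generate host config", lambda: print("generated"), lambda: print("removed config")),
    ("commit changes", lambda: print("committed"), lambda: print("reverted commit")),
]
run_pipeline(steps, dry_run=True)
```

Pairing every action with its undo makes the rollback requirement explicit at the point where each step is defined.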

**Deliverable:** `./scripts/create-host.sh --hostname myhost --ip 10.69.13.50` creates a fully working host

---

### Phase 7: Testing & Documentation

**Status:** 🚧 In Progress (testing improvements completed)

**Testing Improvements Implemented (2025-02-01):**

The pipeline now supports efficient testing without polluting the master branch:

**1. --force Flag for create-host**
- Re-run `create-host` to regenerate existing configurations
- Updates existing entries in flake.nix and terraform/vms.tf (no duplicates)
- Skips uniqueness validation checks
- Useful for iterating on configuration templates during testing
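
The update-versus-insert behaviour behind `--force` can be sketched as an "upsert" on the text of `terraform/vms.tf`. This is a guess at the general technique, not the code in `manipulators.py`: the `vms = {` layout, the function name `upsert_vm`, and the flat-block restriction are assumptions.

```python
import re


def upsert_vm(vms_text: str, name: str, body: str) -> str:
    """Replace the `"name" = { ... }` block if present, otherwise insert it.

    Assumes each VM block is flat (no nested braces), which keeps the
    regex simple; nested blocks would need a real parser.
    """
    block = f'    "{name}" = {{\n{body}    }}\n'
    existing = re.compile(
        rf'^[ \t]*"{re.escape(name)}"\s*=\s*\{{[^{{}}]*\}}[ \t]*\n', re.M
    )
    if existing.search(vms_text):
        return existing.sub(lambda m: block, vms_text, count=1)  # update in place
    opener = re.compile(r"^([ \t]*vms\s*=\s*\{[ \t]*\n)", re.M)
    new_text, count = opener.subn(lambda m: m.group(1) + block, vms_text, count=1)
    if count != 1:
        raise ValueError("could not find the vms map")
    return new_text


vms_tf = 'locals {\n  vms = {\n    "testvm01" = {\n      ip = "10.69.13.100/24"\n    }\n  }\n}\n'
print(upsert_vm(vms_tf, "testvm01", '      ip = "10.69.13.101/24"\n'))
```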

**2. Branch Support for Bootstrap**
- Bootstrap service reads `NIXOS_FLAKE_BRANCH` environment variable
- Defaults to `master` if not set
- Allows testing pipeline changes on feature branches
- Cloud-init passes branch via `/etc/environment`

**3. Cloud-init Disk for Branch Configuration**
- Terraform generates custom cloud-init snippets for test VMs
- Set `flake_branch` field in VM definition to use non-master branch
- Production VMs omit this field and use master (default)
- Files automatically uploaded to Proxmox via SSH

**Testing Workflow:**

```bash
# 1. Create test branch
git checkout -b test-pipeline

# 2. Generate or update host config
create-host --hostname testvm01 --ip 10.69.13.100/24

# 3. Edit terraform/vms.tf to add test VM with branch
# vms = {
#   "testvm01" = {
#     ip = "10.69.13.100/24"
#     flake_branch = "test-pipeline"  # Bootstrap from this branch
#   }
# }

# 4. Commit and push test branch
git add -A && git commit -m "test: add testvm01"
git push origin test-pipeline

# 5. Deploy VM
cd terraform && tofu apply

# 6. Watch bootstrap (VM fetches from test-pipeline branch)
ssh root@10.69.13.100 journalctl -fu nixos-bootstrap.service

# 7. Iterate: modify templates and regenerate with --force
cd .. && create-host --hostname testvm01 --ip 10.69.13.100/24 --force
git commit -am "test: update config" && git push origin test-pipeline

# Redeploy to test fresh bootstrap
cd terraform
tofu destroy -target='proxmox_vm_qemu.vm["testvm01"]' && tofu apply

# 8. Clean up when done: squash commits, merge to master, remove test VM
```

**Files:**
- `scripts/create-host/create_host.py` - Added --force parameter
- `scripts/create-host/manipulators.py` - Update vs insert logic
- `hosts/template2/bootstrap.nix` - Branch support via environment variable
- `terraform/vms.tf` - flake_branch field support
- `terraform/cloud-init.tf` - Custom cloud-init disk generation
- `terraform/variables.tf` - proxmox_host variable for SSH uploads

**Remaining Tasks:**
- [ ] Test full pipeline end-to-end on feature branch
- [ ] Update CLAUDE.md with testing workflow
- [ ] Add troubleshooting section
- [ ] Create examples for common scenarios (DHCP host, static IP host, etc.)

---

## Open Questions

1. **Bootstrap method:** Cloud-init runcmd vs Terraform provisioner vs Ansible?
2. **Secrets handling:** Pre-generate keys vs post-deployment injection?
3. **DNS automation:** Auto-commit or manual merge?
4. **Git workflow:** Auto-push changes or leave for user review?
5. **Template selection:** Single template2 or multiple templates for different host types?
6. **Networking:** Always DHCP initially, or support static IP from start?
7. **Error recovery:** What happens if bootstrap fails? Manual intervention or retry?

## Implementation Order

Recommended sequence:
1. Phase 1: Parameterize OpenTofu (foundation for testing)
2. Phase 3: Bootstrap mechanism (core automation)
3. Phase 2: Config generator (automate the boilerplate)
4. Phase 4: Secrets (solves biggest chicken-and-egg)
5. Phase 5: DNS (nice-to-have automation)
6. Phase 6: Integration script (ties it all together)
7. Phase 7: Testing & docs

## Success Criteria

When complete, creating a new host should:
- Take < 5 minutes of human time
- Require minimal user input (hostname, IP, basic specs)
- Result in a fully configured, secret-enabled, DNS-registered host
- Be reproducible and documented
- Handle common errors gracefully

---

## Notes

- Keep incremental commits at each phase
- Test each phase independently before moving to next
- Maintain backward compatibility with manual workflow
- Document any manual steps that can't be automated