# TODO: Automated Host Deployment Pipeline

## Vision

Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.

**Desired workflow:**

```bash
./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go
```

**Current manual workflow (from CLAUDE.md):**

1. Create `/hosts/<hostname>/` directory structure
2. Add host to `flake.nix`
3. Add DNS entries
4. Clone template VM manually
5. Run `prepare-host.sh` on new VM
6. Add generated age key to `.sops.yaml`
7. Configure networking
8. Commit and push
9. Run `nixos-rebuild boot --flake URL#<hostname>` on host

## The Plan

### Phase 1: Parameterized OpenTofu Deployments ✅ COMPLETED

**Status:** Fully implemented and tested

**Implementation:**

- Locals-based structure using the `for_each` pattern for multiple VM deployments
- All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
- Automatic DHCP vs static IP detection based on `ip` field presence
- Dynamic outputs showing deployed VM IPs and specifications
- Successfully tested deploying multiple VMs simultaneously

**Tasks:**

- [x] Create module/template structure in terraform for repeatable VM deployments
- [x] Parameterize VM configuration (hostname, CPU, memory, disk, IP)
- [x] Support both DHCP and static IP configuration via cloud-init
- [x] Test deploying multiple VMs from the same template

**Deliverable:** ✅ Can deploy multiple VMs with custom parameters via OpenTofu in a single `tofu apply`

**Files:**

- `terraform/vms.tf` - VM definitions using locals map
- `terraform/outputs.tf` - Dynamic outputs for all VMs
- `terraform/variables.tf` - Configurable defaults
- `terraform/README.md` - Complete documentation

---

### Phase 2: Host Configuration Generator ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01
**Enhanced:** 2025-02-01 (added `--force` flag)

**Goal:** Automate creation of host
configuration files

**Implementation:**

- Python CLI tool packaged as a Nix derivation
- Available as the `create-host` command in the devShell
- Rich terminal UI with configuration previews
- Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
- Jinja2 templates for NixOS configurations
- Automatic updates to `flake.nix` and `terraform/vms.tf`
- `--force` flag for regenerating existing configurations (useful for testing)

**Tasks:**

- [x] Create Python CLI with the typer framework
  - [x] Takes parameters: hostname, IP, CPU cores, memory, disk size
- [x] Generates `/hosts/<hostname>/` directory structure
  - [x] Creates `configuration.nix` with proper hostname and networking
  - [x] Generates `default.nix` with standard imports
  - [x] References shared `hardware-configuration.nix` from template
- [x] Add host entry to `flake.nix` programmatically
  - [x] Text-based manipulation (regex insertion)
  - [x] Inserts new nixosConfiguration entry
  - [x] Maintains proper formatting
- [x] Generate corresponding OpenTofu configuration
  - [x] Adds VM definition to `terraform/vms.tf`
  - [x] Uses parameters from CLI input
  - [x] Supports both static IP and DHCP modes
- [x] Package as Nix derivation with templates
- [x] Add to flake packages and devShell
- [x] Implement dry-run mode
- [x] Write comprehensive README

**Usage:**

```bash
# In nix develop shell
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \
  --cpu 4 \
  --memory 4096 \
  --disk 50G \
  --dry-run

# --ip       optional, omit for DHCP
# --cpu      optional, default 2
# --memory   optional, default 2048
# --disk     optional, default 20G
# --dry-run  optional preview mode
```

**Files:**

- `scripts/create-host/` - Complete Python package with Nix derivation
- `scripts/create-host/README.md` - Full documentation and examples

**Deliverable:** ✅ Tool generates all config files for a new host, validated with Nix and Terraform

---

### Phase 3: Bootstrap Mechanism ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01
**Enhanced:** 2025-02-01 (added branch support for
testing)

**Goal:** Get a freshly deployed VM to apply its host-specific configuration

**Implementation:** Systemd oneshot service that runs on first boot after cloud-init

**Approach taken:** Systemd service (variant of Option A)

- Systemd service `nixos-bootstrap.service` runs on first boot
- Depends on `cloud-config.service` to ensure the hostname is set
- Reads hostname from `hostnamectl` (set by cloud-init via Terraform)
- Supports a custom git branch via the `NIXOS_FLAKE_BRANCH` environment variable
- Runs `nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#${hostname}`
- Reboots into the new configuration on success
- Fails gracefully without rebooting on errors (network issues, missing config)
- Service self-destructs after a successful bootstrap (it is not part of the new configuration)

**Tasks:**

- [x] Create bootstrap service module in template2
  - [x] systemd oneshot service with proper dependencies
  - [x] Reads hostname from hostnamectl (cloud-init sets it)
  - [x] Checks network connectivity via HTTPS (curl)
  - [x] Runs nixos-rebuild boot with flake URL
  - [x] Reboots on success, fails gracefully on error
- [x] Configure cloud-init datasource
  - [x] Use ConfigDrive datasource (Proxmox provider)
  - [x] Add cloud-init disk to Terraform VMs (disks.ide.ide2.cloudinit)
  - [x] Hostname passed via cloud-init user-data from Terraform
- [x] Test bootstrap service execution on fresh VM
- [x] Handle failure cases (flake doesn't exist, network issues)
  - [x] Clear error messages in journald
  - [x] No reboot on failure
  - [x] System remains accessible for debugging

**Files:**

- `hosts/template2/bootstrap.nix` - Bootstrap service definition
- `hosts/template2/configuration.nix` - Cloud-init ConfigDrive datasource
- `terraform/vms.tf` - Cloud-init disk configuration

**Deliverable:** ✅ VMs automatically bootstrap and reboot into their host-specific configuration on first boot

---

### Phase 4: Secrets Management with OpenBao (Vault)

**Status:** 🚧 Phases 4a & 4b Complete, 4c & 4d In
Progress

**Challenge:** The current sops-nix approach has a chicken-and-egg problem with age keys

**Current workflow:**

1. VM boots, generates age key at `/var/lib/sops-nix/key.txt`
2. User runs `prepare-host.sh`, which prints the public key
3. User manually adds the public key to `.sops.yaml`
4. User commits, pushes
5. VM can now decrypt secrets

**Selected approach:** Migrate to OpenBao (Vault fork) for centralized secrets management

**Why OpenBao instead of HashiCorp Vault:**

- HashiCorp Vault switched to the BSL (Business Source License), making it unavailable in the NixOS cache
- OpenBao is the community fork maintaining the pre-BSL MPL 2.0 license
- API-compatible with Vault, uses the same Terraform provider
- Maintains all the Vault features we need

**Benefits:**

- Industry-standard secrets management (Vault-compatible experience)
- Eliminates the manual age key distribution step
- Secrets-as-code via OpenTofu (aligned with infrastructure-as-code)
- Centralized PKI management with ACME support (ready to replace step-ca)
- Automatic secret rotation capabilities
- Audit logging for all secret access (not yet enabled)
- AppRole authentication enables automated bootstrap

**Current Architecture:**

```
vault.home.2rjus.net (10.69.13.19)
├─ KV Secrets Engine (ready to replace sops-nix)
│  ├─ secret/hosts/{hostname}/*
│  ├─ secret/services/{service}/*
│  └─ secret/shared/{category}/*
├─ PKI Engine (ready to replace step-ca for TLS)
│  ├─ Root CA (EC P-384, 10 year)
│  ├─ Intermediate CA (EC P-384, 5 year)
│  └─ ACME endpoint enabled
├─ SSH CA Engine (TODO: Phase 4c)
└─ AppRole Auth (per-host authentication configured)
   ↓
   [Phase 4d] New hosts authenticate on first boot
   [Phase 4d] Fetch secrets via Vault API
   No manual key distribution needed
```

**Completed:**

- ✅ Phase 4a: OpenBao server with TPM2 auto-unseal
- ✅ Phase 4b: Infrastructure-as-code (secrets, policies, AppRoles, PKI)

**Next Steps:**

- Phase 4c: Migrate from step-ca to OpenBao PKI
- Phase 4d: Bootstrap integration for automated secrets access

---

#### Phase 4a: Vault Server Setup ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-02

**Goal:** Deploy and configure a Vault server with auto-unseal

**Implementation:**

- Used **OpenBao** (Vault fork) instead of HashiCorp Vault due to BSL licensing concerns
- TPM2-based auto-unseal using systemd's native `LoadCredentialEncrypted`
- Self-signed bootstrap TLS certificates (avoiding a circular dependency with step-ca)
- File-based storage backend at `/var/lib/openbao`
- Unix socket + TCP listener (0.0.0.0:8200) configuration

**Tasks:**

- [x] Create `hosts/vault01/` configuration
  - [x] Basic NixOS configuration (hostname: vault01, IP: 10.69.13.19/24)
  - [x] Created reusable `services/vault` module
  - [x] Firewall not needed (trusted network)
  - [x] Already in flake.nix, deployed via terraform
- [x] Implement auto-unseal mechanism
  - [x] **TPM2-based auto-unseal** (preferred option)
  - [x] systemd `LoadCredentialEncrypted` with TPM2 binding
  - [x] `writeShellApplication` script with proper runtime dependencies
  - [x] Reads multiple unseal keys (one per line) until unsealed
  - [x] Auto-unseals on service start via `ExecStartPost`
- [x] Initial Vault setup
  - [x] Initialized OpenBao with Shamir secret sharing (5 keys, threshold 3)
  - [x] File storage backend
  - [x] Self-signed TLS certificates via LoadCredential
- [x] Deploy to infrastructure
  - [x] DNS entry added for vault.home.2rjus.net
  - [x] VM deployed via terraform
  - [x] Verified OpenBao running and auto-unsealing

**Changes from Original Plan:**

- Used OpenBao instead of HashiCorp Vault (licensing)
- Used systemd's native TPM2 support instead of tpm2-tools directly
- Skipped audit logging (can be enabled later)
- Used self-signed certs initially (will migrate to OpenBao PKI later)

**Deliverable:** ✅ Running OpenBao server that auto-unseals on boot using TPM2

**Documentation:**

- `/services/vault/README.md` - Service module overview
- `/docs/vault/auto-unseal.md` - Complete TPM2 auto-unseal setup guide

---

#### Phase 4b: Vault-as-Code with OpenTofu ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-02

**Goal:** Manage all Vault configuration (secrets structure, policies, roles) as code

**Implementation:**

- Complete Terraform/OpenTofu configuration in `terraform/vault/`
- Locals-based pattern (similar to `vms.tf`) for declaring secrets and policies
- Auto-generation of secrets using the `random_password` provider
- Three-tier secrets path hierarchy: `hosts/`, `services/`, `shared/`
- PKI infrastructure with **Elliptic Curve certificates** (P-384 for CAs, P-256 for leaf certs)
- ACME support enabled on the intermediate CA

**Tasks:**

- [x] Set up Vault Terraform provider
  - [x] Created `terraform/vault/` directory
  - [x] Configured Vault provider (uses HashiCorp provider, compatible with OpenBao)
  - [x] Credentials in terraform.tfvars (gitignored)
  - [x] terraform.tfvars.example for reference
- [x] Enable and configure secrets engines
  - [x] KV v2 engine at `secret/`
  - [x] Three-tier path structure:
    - `secret/hosts/{hostname}/*` - Host-specific secrets
    - `secret/services/{service}/*` - Service-wide secrets
    - `secret/shared/{category}/*` - Shared secrets (SMTP, backups, etc.)
- [x] Define policies as code
  - [x] Policies auto-generated from `locals.host_policies`
  - [x] Per-host policies with read/list on designated paths
  - [x] Principle of least privilege enforced
- [x] Set up AppRole authentication
  - [x] AppRole backend enabled at `approle/`
  - [x] Roles auto-generated per host from `locals.host_policies`
  - [x] Token TTL: 1 hour, max 24 hours
  - [x] Policies bound to roles
- [x] Implement secrets-as-code patterns
  - [x] Auto-generated secrets using the `random_password` provider
  - [x] Manual secrets supported via variables in terraform.tfvars
  - [x] Secret structure versioned in .tf files
  - [x] Secret values excluded from git
- [x] Set up PKI infrastructure
  - [x] Root CA (10 year TTL, EC P-384)
  - [x] Intermediate CA (5 year TTL, EC P-384)
  - [x] PKI role for `*.home.2rjus.net` (30 day max TTL, EC P-256)
  - [x] ACME enabled on intermediate CA
  - [x] Support for static certificate issuance via Terraform
  - [x] CRL, OCSP, and issuing certificate URLs configured

**Changes from Original Plan:**

- Used Elliptic Curve instead of RSA for all certificates (better performance, smaller keys)
- Implemented PKI infrastructure in Phase 4b instead of Phase 4c (more logical grouping)
- ACME support configured immediately (ready for migration from step-ca)
- Did not migrate existing sops-nix secrets yet (deferred to gradual migration)

**Files:**

- `terraform/vault/main.tf` - Provider configuration
- `terraform/vault/variables.tf` - Variable definitions
- `terraform/vault/approle.tf` - AppRole authentication (locals-based pattern)
- `terraform/vault/pki.tf` - PKI infrastructure with EC certificates
- `terraform/vault/secrets.tf` - KV secrets engine (auto-generation support)
- `terraform/vault/README.md` - Complete documentation and usage examples
- `terraform/vault/terraform.tfvars.example` - Example credentials

**Deliverable:** ✅ All secrets, policies, AppRoles, and PKI managed as OpenTofu code in `terraform/vault/`

**Documentation:**

- `/terraform/vault/README.md` - Comprehensive guide covering:
  - Setup and deployment
  - AppRole usage and host access patterns
  - PKI certificate issuance (ACME, static, manual)
  - Secrets management patterns
  - ACME configuration and troubleshooting

---

#### Phase 4c: PKI Migration (Replace step-ca)

**Goal:** Migrate hosts from step-ca to OpenBao PKI for TLS certificates

**Note:** PKI infrastructure was already set up in Phase 4b (root CA, intermediate CA, ACME support)

**Tasks:**

- [x] Set up OpenBao PKI engines (completed in Phase 4b)
  - [x] Root CA (`pki/` mount, 10 year TTL, EC P-384)
  - [x] Intermediate CA (`pki_int/` mount, 5 year TTL, EC P-384)
  - [x] Signed intermediate with root CA
  - [x] Configured CRL, OCSP, and issuing certificate URLs
- [x] Enable ACME support (completed in Phase 4b)
  - [x] Enabled ACME on intermediate CA
  - [x] Created PKI role for `*.home.2rjus.net`
  - [x] Set certificate TTLs (30 day max) and allowed domains
  - [x] ACME directory: `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
- [ ] Download and distribute root CA certificate
  - [ ] Export root CA: `bao read -field=certificate pki/cert/ca > homelab-root-ca.crt`
  - [ ] Add to NixOS trust store on all hosts via `security.pki.certificateFiles`
  - [ ] Deploy via auto-upgrade
- [ ] Test certificate issuance
  - [ ] Issue test certificate using an ACME client (lego/certbot)
  - [ ] Or issue a static certificate via the OpenBao CLI
  - [ ] Verify certificate chain and trust
- [ ] Migrate vault01's own certificate
  - [ ] Issue new certificate from OpenBao PKI (self-issued)
  - [ ] Replace self-signed bootstrap certificate
  - [ ] Update service configuration
- [ ] Migrate hosts from step-ca to OpenBao
  - [ ] Update `system/acme.nix` to use the OpenBao ACME endpoint
  - [ ] Change server to `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
  - [ ] Test on one host (non-critical service)
  - [ ] Roll out to all hosts via auto-upgrade
- [ ] Configure SSH CA in OpenBao (optional, future work)
  - [ ] Enable SSH secrets engine (`ssh/` mount)
  - [ ] Generate SSH signing keys
  - [ ] Create roles for host and user certificates
  - [ ] Configure TTLs and allowed principals
  - [ ] Distribute SSH CA public key to all hosts
  - [ ] Update sshd_config to trust the OpenBao CA
- [ ] Decommission step-ca
  - [ ] Verify all ACME services migrated and working
  - [ ] Stop step-ca service on the ca host
  - [ ] Archive step-ca configuration for backup
  - [ ] Update documentation

**Deliverable:** All TLS certificates issued by OpenBao PKI, step-ca retired

---

#### Phase 4d: Bootstrap Integration

**Goal:** New hosts automatically authenticate to Vault on first boot, with no manual steps

**Tasks:**

- [ ] Update create-host tool
  - [ ] Generate AppRole role_id + secret_id for the new host
  - [ ] Or create a wrapped token for one-time bootstrap
  - [ ] Add host-specific policy to Vault (via terraform)
  - [ ] Store bootstrap credentials for cloud-init injection
- [ ] Update template2 for Vault authentication
  - [ ] Create Vault authentication module
  - [ ] Reads bootstrap credentials from cloud-init
  - [ ] Authenticates to Vault, retrieves permanent AppRole credentials
  - [ ] Stores role_id + secret_id locally for services to use
- [ ] Create NixOS Vault secrets module
  - [ ] Replacement for sops.secrets
  - [ ] Fetches secrets from Vault at nixos-rebuild/activation time
  - [ ] Or runtime secret fetching for services
  - [ ] Handle Vault token renewal
- [ ] Update bootstrap service
  - [ ] After authenticating to Vault, fetch any bootstrap secrets
  - [ ] Run nixos-rebuild with host configuration
  - [ ] Services automatically fetch their secrets from Vault
- [ ] Update terraform cloud-init
  - [ ] Inject Vault address and bootstrap credentials
  - [ ] Pass via cloud-init user-data or write_files
  - [ ] Credentials scoped to single use or short TTL
- [ ] Test complete flow
  - [ ] Run create-host to generate a new host config
  - [ ] Deploy with terraform
  - [ ] Verify host bootstraps and authenticates to Vault
  - [ ] Verify services can fetch secrets
  - [ ] Confirm no manual steps required
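The "Update template2 for Vault authentication" task above could be prototyped as a small systemd unit in the style of the existing bootstrap service. A minimal sketch, assuming a `/etc/vault-bootstrap.json` credential file injected by cloud-init and a `/var/lib/vault/token` output path — all names here are illustrative, none of this exists yet:

```nix
# Hypothetical addition to hosts/template2 -- a sketch only, not implemented.
{ pkgs, ... }:
{
  systemd.services.vault-auth = {
    description = "Exchange one-time bootstrap credentials for a Vault token";
    wantedBy = [ "multi-user.target" ];
    wants = [ "network-online.target" ];
    after = [ "network-online.target" "cloud-config.service" ];
    serviceConfig.Type = "oneshot";
    path = [ pkgs.curl pkgs.jq ];
    script = ''
      set -euo pipefail
      addr="https://vault.home.2rjus.net:8200"
      # Bootstrap credentials written by cloud-init (assumed location)
      role_id=$(jq -r .role_id /etc/vault-bootstrap.json)
      secret_id=$(jq -r .secret_id /etc/vault-bootstrap.json)
      # AppRole login: POST role_id + secret_id, keep the client token
      token=$(curl -fsS -X POST \
        -d "{\"role_id\":\"$role_id\",\"secret_id\":\"$secret_id\"}" \
        "$addr/v1/auth/approle/login" | jq -r .auth.client_token)
      install -d -m 0700 /var/lib/vault
      (umask 077; printf '%s\n' "$token" > /var/lib/vault/token)
      # Single-use bootstrap credentials are no longer needed
      rm -f /etc/vault-bootstrap.json
    '';
  };
}
```

The `v1/auth/approle/login` endpoint is the standard Vault API, which OpenBao keeps compatible; the credential file location and token storage are design choices still to be settled in Phase 4d.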
**Bootstrap flow:**

```
1. terraform apply (deploys VM with cloud-init)
2. Cloud-init sets hostname + Vault bootstrap credentials
3. nixos-bootstrap.service runs:
   - Authenticates to Vault with bootstrap credentials
   - Retrieves permanent AppRole credentials
   - Stores locally for service use
   - Runs nixos-rebuild
4. Host services fetch secrets from Vault as needed
5. Done - no manual intervention
```

**Deliverable:** Fully automated secrets access from first boot, zero manual steps

---

### Phase 5: DNS Automation

**Goal:** Automatically generate DNS entries from host configurations

**Approach:** Leverage Nix to generate zone file entries from flake host configurations

Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.

**Tasks:**

- [ ] Add optional CNAME field to host configurations
  - [ ] Add `networking.cnames = [ "alias1" "alias2" ]` or a similar option
  - [ ] Document in host configuration template
- [ ] Create Nix function to extract DNS records from all hosts
  - [ ] Parse each host's `networking.hostName` and IP configuration
  - [ ] Collect any defined CNAMEs
  - [ ] Generate zone file fragment with A and CNAME records
- [ ] Integrate auto-generated records into zone files
  - [ ] Keep manual entries separate (for non-flake hosts/services)
  - [ ] Include generated fragment in main zone file
  - [ ] Add comments showing which records are auto-generated
- [ ] Update zone file serial number automatically
- [ ] Test zone file validity after generation
- [ ] Either:
  - [ ] Automatically trigger DNS server reload (Ansible)
  - [ ] Or document manual step: merge to master, run upgrade on ns1/ns2

**Deliverable:** DNS A records and CNAMEs automatically generated from host configs

---

### Phase 6: Integration Script

**Goal:** Single command to create and deploy a new host

**Tasks:**

- [ ] Create `scripts/create-host.sh` master script that
orchestrates:
  1. Prompts for: hostname, IP (or DHCP), CPU, memory, disk
  2. Validates inputs (IP not in use, hostname unique, etc.)
  3. Calls host config generator (Phase 2)
  4. Generates OpenTofu config (Phase 2)
  5. Handles secrets (Phase 4)
  6. Updates DNS (Phase 5)
  7. Commits all changes to git
  8. Runs `tofu apply` to deploy VM
  9. Waits for bootstrap to complete (Phase 3)
  10. Prints success message with IP and SSH command
- [ ] Add `--dry-run` flag to preview changes
- [ ] Add `--interactive` mode vs `--batch` mode
- [ ] Error handling and rollback on failures

**Deliverable:** `./scripts/create-host.sh --hostname myhost --ip 10.69.13.50` creates a fully working host

---

### Phase 7: Testing & Documentation

**Status:** 🚧 In Progress (testing improvements completed)

**Testing Improvements Implemented (2025-02-01):**

The pipeline now supports efficient testing without polluting the master branch:

**1. `--force` flag for create-host**

- Re-run `create-host` to regenerate existing configurations
- Updates existing entries in flake.nix and terraform/vms.tf (no duplicates)
- Skips uniqueness validation checks
- Useful for iterating on configuration templates during testing

**2. Branch support for bootstrap**

- Bootstrap service reads the `NIXOS_FLAKE_BRANCH` environment variable
- Defaults to `master` if not set
- Allows testing pipeline changes on feature branches
- Cloud-init passes the branch via `/etc/environment`

**3. Cloud-init disk for branch configuration**

- Terraform generates custom cloud-init snippets for test VMs
- Set the `flake_branch` field in a VM definition to use a non-master branch
- Production VMs omit this field and use master (the default)
- Files automatically uploaded to Proxmox via SSH

**Testing Workflow:**

```bash
# 1. Create test branch
git checkout -b test-pipeline

# 2. Generate or update host config
create-host --hostname testvm01 --ip 10.69.13.100/24

# 3. Edit terraform/vms.tf to add test VM with branch
# vms = {
#   "testvm01" = {
#     ip           = "10.69.13.100/24"
#     flake_branch = "test-pipeline"  # Bootstrap from this branch
#   }
# }

# 4. Commit and push test branch
git add -A && git commit -m "test: add testvm01"
git push origin test-pipeline

# 5. Deploy VM
cd terraform && tofu apply

# 6. Watch bootstrap (VM fetches from test-pipeline branch)
ssh root@10.69.13.100 journalctl -fu nixos-bootstrap.service

# 7. Iterate: modify templates and regenerate with --force
cd .. && create-host --hostname testvm01 --ip 10.69.13.100/24 --force
git commit -am "test: update config" && git push

# Redeploy to test fresh bootstrap
cd terraform
tofu destroy -target='proxmox_vm_qemu.vm["testvm01"]' && tofu apply

# 8. Clean up when done: squash commits, merge to master, remove test VM
```

**Files:**

- `scripts/create-host/create_host.py` - Added `--force` parameter
- `scripts/create-host/manipulators.py` - Update vs insert logic
- `hosts/template2/bootstrap.nix` - Branch support via environment variable
- `terraform/vms.tf` - `flake_branch` field support
- `terraform/cloud-init.tf` - Custom cloud-init disk generation
- `terraform/variables.tf` - `proxmox_host` variable for SSH uploads

**Remaining Tasks:**

- [ ] Test full pipeline end-to-end on a feature branch
- [ ] Update CLAUDE.md with the testing workflow
- [ ] Add a troubleshooting section
- [ ] Create examples for common scenarios (DHCP host, static IP host, etc.)

---

## Open Questions

1. **Bootstrap method:** Cloud-init runcmd vs Terraform provisioner vs Ansible?
2. **Secrets handling:** Pre-generate keys vs post-deployment injection?
3. **DNS automation:** Auto-commit or manual merge?
4. **Git workflow:** Auto-push changes or leave for user review?
5. **Template selection:** Single template2 or multiple templates for different host types?
6. **Networking:** Always DHCP initially, or support static IP from the start?
7. **Error recovery:** What happens if bootstrap fails? Manual intervention or retry?
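On question 7, one low-cost starting point is systemd's built-in retry machinery rather than custom retry logic: the bootstrap service already fails without rebooting, so a bounded retry can be layered on top. A sketch only — the option values are illustrative, and `Restart=` on oneshot units requires a reasonably recent systemd:

```nix
# Sketch: bounded retries for the bootstrap unit, not yet implemented.
{
  systemd.services.nixos-bootstrap = {
    serviceConfig = {
      # Re-run the bootstrap on failure (e.g. transient network issues)...
      Restart = "on-failure";
      RestartSec = "30s";
    };
    unitConfig = {
      # ...but give up after 5 attempts in 10 minutes and leave the VM
      # accessible for manual debugging, matching the current behaviour.
      StartLimitIntervalSec = "10min";
      StartLimitBurst = 5;
    };
  };
}
```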
## Implementation Order

Recommended sequence:

1. Phase 1: Parameterize OpenTofu (foundation for testing)
2. Phase 3: Bootstrap mechanism (core automation)
3. Phase 2: Config generator (automate the boilerplate)
4. Phase 4: Secrets (solves the biggest chicken-and-egg problem)
5. Phase 5: DNS (nice-to-have automation)
6. Phase 6: Integration script (ties it all together)
7. Phase 7: Testing & docs

## Success Criteria

When complete, creating a new host should:

- Take < 5 minutes of human time
- Require minimal user input (hostname, IP, basic specs)
- Result in a fully configured, secret-enabled, DNS-registered host
- Be reproducible and documented
- Handle common errors gracefully

---

## Notes

- Keep incremental commits at each phase
- Test each phase independently before moving to the next
- Maintain backward compatibility with the manual workflow
- Document any manual steps that can't be automated
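Appendix: the Phase 5 record extraction could start from a helper like this, mapping over the flake's `nixosConfigurations`. The attribute path to each host's static address is an assumption and must match how hosts actually declare their IPs — a sketch, not tested against the real configs:

```nix
# Sketch: derive zone-file A records from flake host configs.
{ nixosConfigurations }:
let
  inherit (builtins) attrValues concatMap concatStringsSep filter head mapAttrs;
  recordFor = name: host:
    let
      ifaces = attrValues host.config.networking.interfaces;
      # Collect every statically configured IPv4 address on any interface
      ips = concatMap (i: map (a: a.address) i.ipv4.addresses) ifaces;
    in
      # DHCP hosts have no static address and produce no record
      if ips == [ ] then null else "${name} IN A ${head ips}";
  records = filter (r: r != null) (attrValues (mapAttrs recordFor nixosConfigurations));
in
  concatStringsSep "\n" records
```

CNAME collection would hang off the proposed `networking.cnames` option in the same map.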