# TODO: Automated Host Deployment Pipeline

## Vision

Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.

**Desired workflow:**

```bash
./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go
```

**Current manual workflow (from CLAUDE.md):**

1. Create `/hosts/<hostname>/` directory structure
2. Add host to `flake.nix`
3. Add DNS entries
4. Clone template VM manually
5. Run `prepare-host.sh` on the new VM
6. Add generated age key to `.sops.yaml`
7. Configure networking
8. Commit and push
9. Run `nixos-rebuild boot --flake URL#<hostname>` on the host

## The Plan

### Phase 1: Parameterized OpenTofu Deployments ✅ COMPLETED

**Status:** Fully implemented and tested

**Implementation:**

- Locals-based structure using the `for_each` pattern for multiple VM deployments
- All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
- Automatic DHCP vs. static IP detection based on the presence of the `ip` field
- Dynamic outputs showing deployed VM IPs and specifications
- Successfully tested deploying multiple VMs simultaneously

**Tasks:**

- [x] Create module/template structure in terraform for repeatable VM deployments
- [x] Parameterize VM configuration (hostname, CPU, memory, disk, IP)
- [x] Support both DHCP and static IP configuration via cloud-init
- [x] Test deploying multiple VMs from the same template

**Deliverable:** ✅ Can deploy multiple VMs with custom parameters via OpenTofu in a single `tofu apply`

**Files:**

- `terraform/vms.tf` - VM definitions using locals map
- `terraform/outputs.tf` - Dynamic outputs for all VMs
- `terraform/variables.tf` - Configurable defaults
- `terraform/README.md` - Complete documentation

---

### Phase 2: Host Configuration Generator ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01
**Enhanced:** 2025-02-01 (added `--force` flag)

**Goal:** Automate creation of host configuration files

**Implementation:**

- Python CLI tool packaged as a Nix derivation
- Available as the `create-host` command in the devShell
- Rich terminal UI with configuration previews
- Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
- Jinja2 templates for NixOS configurations
- Automatic updates to `flake.nix` and `terraform/vms.tf`
- `--force` flag for regenerating existing configurations (useful for testing)

**Tasks:**

- [x] Create Python CLI with typer framework
  - [x] Takes parameters: hostname, IP, CPU cores, memory, disk size
- [x] Generates `/hosts/<hostname>/` directory structure
  - [x] Creates `configuration.nix` with proper hostname and networking
  - [x] Generates `default.nix` with standard imports
  - [x] References shared `hardware-configuration.nix` from template
- [x] Add host entry to `flake.nix` programmatically
  - [x] Text-based manipulation (regex insertion)
  - [x] Inserts new nixosConfiguration entry
  - [x] Maintains proper formatting
- [x] Generate corresponding OpenTofu configuration (see the sketch after the usage example below)
  - [x] Adds VM definition to `terraform/vms.tf`
  - [x] Uses parameters from CLI input
  - [x] Supports both static IP and DHCP modes
- [x] Package as Nix derivation with templates
- [x] Add to flake packages and devShell
- [x] Implement dry-run mode
- [x] Write comprehensive README

**Usage:**

```bash
# In the nix develop shell.
# --ip is optional (omit for DHCP); --cpu defaults to 2, --memory to 2048,
# --disk to 20G; --dry-run only previews the changes.
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \
  --cpu 4 \
  --memory 4096 \
  --disk 50G \
  --dry-run
```
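For context on how Phases 1 and 2 fit together: the tool appends an entry to the locals map in `terraform/vms.tf`, and a single `for_each` resource turns each entry into a VM. A minimal sketch; the field names and the `proxmox_vm_qemu` resource type are illustrative assumptions, not necessarily the repo's actual schema:

```hcl
# Hypothetical shape of the entry create-host appends to terraform/vms.tf.
locals {
  vms = {
    test01 = {
      ip     = "10.69.13.50/24" # omit for DHCP
      cpu    = 4
      memory = 4096
      disk   = "50G"
    }
  }
}

# One VM resource is created per map entry.
resource "proxmox_vm_qemu" "vm" {
  for_each = local.vms

  name   = each.key
  cores  = each.value.cpu
  memory = each.value.memory
  # Template clone, disk, and network arguments omitted for brevity.
}
```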
**Files:**

- `scripts/create-host/` - Complete Python package with Nix derivation
- `scripts/create-host/README.md` - Full documentation and examples

**Deliverable:** ✅ Tool generates all config files for a new host, validated with Nix and Terraform

---

### Phase 3: Bootstrap Mechanism ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01
**Enhanced:** 2025-02-01 (added branch support for testing)

**Goal:** Get a freshly deployed VM to apply its host-specific configuration

**Implementation:** Systemd oneshot service that runs on first boot after cloud-init

**Approach taken:** Systemd service (variant of Option A)

- Systemd service `nixos-bootstrap.service` runs on first boot
- Depends on `cloud-config.service` to ensure the hostname is set
- Reads the hostname from `hostnamectl` (set by cloud-init via Terraform)
- Supports a custom git branch via the `NIXOS_FLAKE_BRANCH` environment variable
- Runs `nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#${hostname}`
- Reboots into the new configuration on success
- Fails gracefully without rebooting on errors (network issues, missing config)
- Service self-destructs after a successful bootstrap (it is not part of the new configuration)

**Tasks:**

- [x] Create bootstrap service module in template2
  - [x] systemd oneshot service with proper dependencies
  - [x] Reads hostname from hostnamectl (cloud-init sets it)
  - [x] Checks network connectivity via HTTPS (curl)
  - [x] Runs nixos-rebuild boot with flake URL
  - [x] Reboots on success, fails gracefully on error
- [x] Configure cloud-init datasource
  - [x] Use ConfigDrive datasource (Proxmox provider)
  - [x] Add cloud-init disk to Terraform VMs (`disks.ide.ide2.cloudinit`, sketched below)
  - [x] Hostname passed via cloud-init user-data from Terraform
- [x] Test bootstrap service execution on a fresh VM
- [x] Handle failure cases (flake doesn't exist, network issues)
  - [x] Clear error messages in journald
  - [x] No reboot on failure
  - [x] System remains accessible for debugging

**Files:**

- `hosts/template2/bootstrap.nix` - Bootstrap service definition
- `hosts/template2/configuration.nix` - Cloud-init ConfigDrive datasource
- `terraform/vms.tf` - Cloud-init disk configuration

**Deliverable:** ✅ VMs automatically bootstrap and reboot into host-specific configuration on first boot
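For the cloud-init tasks above, the Terraform pieces are the ConfigDrive/cloud-init disk and the per-VM IP configuration. A minimal sketch, assuming the Telmate `proxmox_vm_qemu` resource and its nested `disks.ide.ide2.cloudinit` schema; node, storage pool, and addresses are placeholder values rather than the repo's actual configuration:

```hcl
resource "proxmox_vm_qemu" "bootstrap_example" {
  name        = "test01" # VM name; cloud-init carries the hostname into the guest
  target_node = "pve"    # placeholder Proxmox node

  # Cloud-init drive attached as an IDE device (disks.ide.ide2.cloudinit).
  disks {
    ide {
      ide2 {
        cloudinit {
          storage = "local-lvm" # placeholder storage pool
        }
      }
    }
  }

  # Static IP for cloud-init; use "ip=dhcp" for DHCP instead.
  ipconfig0 = "ip=10.69.13.50/24,gw=10.69.13.1"

  # Template clone, CPU/memory, and network arguments omitted for brevity.
}
```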
---

### Phase 4: Secrets Management with HashiCorp Vault

**Challenge:** The current sops-nix approach has a chicken-and-egg problem with age keys

**Current workflow:**

1. VM boots, generates age key at `/var/lib/sops-nix/key.txt`
2. User runs `prepare-host.sh`, which prints the public key
3. User manually adds the public key to `.sops.yaml`
4. User commits, pushes
5. VM can now decrypt secrets

**Selected approach:** Migrate to HashiCorp Vault for centralized secrets management

**Benefits:**

- Industry-standard secrets management (Vault experience transferable to work)
- Eliminates the manual age key distribution step
- Secrets-as-code via OpenTofu (aligned with infrastructure-as-code)
- Centralized PKI management (replaces step-ca, consolidates TLS + SSH CA)
- Automatic secret rotation capabilities
- Audit logging for all secret access
- AppRole authentication enables automated bootstrap

**Architecture:**

```
vault.home.2rjus.net
├─ KV Secrets Engine (replaces sops-nix)
├─ PKI Engine (replaces step-ca for TLS)
├─ SSH CA Engine (replaces step-ca SSH CA)
└─ AppRole Auth (per-host authentication)
        ↓
New hosts authenticate on first boot,
fetch secrets via the Vault API;
no manual key distribution needed
```

---

#### Phase 4a: Vault Server Setup

**Goal:** Deploy and configure a Vault server with auto-unseal

**Tasks:**

- [ ] Create `hosts/vault01/` configuration
  - [ ] Basic NixOS configuration (hostname, networking, etc.)
  - [ ] Vault service configuration
  - [ ] Firewall rules (8200 for API, 8201 for cluster)
  - [ ] Add to flake.nix and terraform
- [ ] Implement auto-unseal mechanism
  - [ ] **Preferred:** TPM-based auto-unseal if the hardware supports it
    - [ ] Use tpm2-tools to seal/unseal Vault keys
    - [ ] Systemd service to unseal on boot
  - [ ] **Fallback:** Shamir secret sharing with systemd automation
    - [ ] Generate 3 keys, threshold 2
    - [ ] Store 2 keys on disk (encrypted), keep 1 offline
    - [ ] Systemd service auto-unseals using 2 keys
- [ ] Initial Vault setup
  - [ ] Initialize Vault
  - [ ] Configure storage backend (integrated raft or file)
  - [ ] Set up root token management
  - [ ] Enable audit logging
- [ ] Deploy to infrastructure
  - [ ] Add DNS entry for vault.home.2rjus.net
  - [ ] Deploy VM via terraform
  - [ ] Bootstrap and verify Vault is running

**Deliverable:** Running Vault server that auto-unseals on boot

---

#### Phase 4b: Vault-as-Code with OpenTofu

**Goal:** Manage all Vault configuration (secrets structure, policies, roles) as code

**Tasks:**

- [ ] Set up Vault Terraform provider
  - [ ] Create `terraform/vault/` directory
  - [ ] Configure Vault provider (address, auth)
  - [ ] Store Vault token securely (terraform.tfvars, gitignored)
- [ ] Enable and configure secrets engines
  - [ ] Enable KV v2 secrets engine at `secret/`
  - [ ] Define secret path structure (per-service, per-host)
    - [ ] Example: `secret/monitoring/grafana`, `secret/postgres/ha1`
- [ ] Define policies as code
  - [ ] Create policies for different service tiers
  - [ ] Principle of least privilege (hosts only read their own secrets)
  - [ ] Example: monitoring-policy allows read on `secret/monitoring/*`
- [ ] Set up AppRole authentication (see the AppRole sketch after the example below)
  - [ ] Enable AppRole auth backend
  - [ ] Create role per host type (monitoring, dns, database, etc.)
  - [ ] Bind policies to roles
  - [ ] Configure TTL and token policies
- [ ] Migrate existing secrets from sops-nix
  - [ ] Create migration script/playbook
  - [ ] Decrypt sops secrets and load them into Vault KV
  - [ ] Verify all secrets migrated successfully
  - [ ] Keep sops as a backup during the transition
- [ ] Implement secrets-as-code patterns
  - [ ] Secret values in gitignored terraform.tfvars
  - [ ] Or use random_password for auto-generated secrets
  - [ ] Secret structure/paths in version-controlled .tf files

**Example OpenTofu:**

```hcl
resource "vault_kv_secret_v2" "monitoring_grafana" {
  mount = "secret"
  name  = "monitoring/grafana"
  data_json = jsonencode({
    admin_password = var.grafana_admin_password
    smtp_password  = var.smtp_password
  })
}

resource "vault_policy" "monitoring" {
  name   = "monitoring-policy"
  policy = <<-EOT
    # KV v2 read paths include the data/ prefix
    path "secret/data/monitoring/*" {
      capabilities = ["read"]
    }
  EOT
}
```
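To close the AppRole loop from the task list above, the auth backend, role, and policy binding can live in the same Terraform module. A minimal sketch using the Vault provider's AppRole resources; role name, TTL, and mount path are illustrative assumptions, not settled decisions:

```hcl
# Enable the AppRole auth backend (default mount path "approle").
resource "vault_auth_backend" "approle" {
  type = "approle"
}

# One role per host type; binds the monitoring policy defined above.
resource "vault_approle_auth_backend_role" "monitoring" {
  backend        = vault_auth_backend.approle.path
  role_name      = "monitoring"
  token_policies = ["monitoring-policy"]
  token_ttl      = 3600 # 1 hour; illustrative value
}
```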