Add systemd service that automatically bootstraps freshly deployed VMs with their host-specific NixOS configuration from the flake repository.

Changes:

- hosts/template2/bootstrap.nix: New systemd oneshot service that:
  - Runs after cloud-init completes (ensures hostname is set)
  - Reads hostname from hostnamectl (set by cloud-init from Terraform)
  - Checks network connectivity via HTTPS (curl)
  - Runs nixos-rebuild boot with flake URL
  - Reboots on success, fails gracefully with clear errors on failure
- hosts/template2/configuration.nix: Configure cloud-init datasource
  - Changed from NoCloud to ConfigDrive (used by Proxmox)
  - Allows cloud-init to receive config from Proxmox
- hosts/template2/default.nix: Import bootstrap.nix module
- terraform/vms.tf: Add cloud-init disk to VMs
  - Configure disks.ide.ide2.cloudinit block
  - Removed invalid cloudinit_cdrom_storage parameter
  - Enables Proxmox to inject cloud-init configuration
- TODO.md: Mark Phase 3 as completed

This eliminates the manual nixos-rebuild step from the deployment workflow. VMs now automatically pull and apply their configuration on first boot.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# TODO: Automated Host Deployment Pipeline

## Vision

Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.

**Desired workflow:**
```bash
./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go
```

**Current manual workflow (from CLAUDE.md):**
1. Create `/hosts/<hostname>/` directory structure
2. Add host to `flake.nix`
3. Add DNS entries
4. Clone template VM manually
5. Run `prepare-host.sh` on new VM
6. Add generated age key to `.sops.yaml`
7. Configure networking
8. Commit and push
9. Run `nixos-rebuild boot --flake URL#<hostname>` on host

## The Plan

### Phase 1: Parameterized OpenTofu Deployments ✅ COMPLETED

**Status:** Fully implemented and tested

**Implementation:**
- Locals-based structure using `for_each` pattern for multiple VM deployments
- All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
- Automatic DHCP vs static IP detection based on `ip` field presence
- Dynamic outputs showing deployed VM IPs and specifications
- Successfully tested deploying multiple VMs simultaneously

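The defaults-plus-overrides merge and the DHCP/static detection described above can be illustrated outside of HCL. The following is only a Python analogue of the idea, not the contents of `terraform/vms.tf`; the field names (`cpu_cores`, `memory`, `disk_size`, `network_mode`) and the default values are assumptions made for the example.

```python
from typing import Any

# Hypothetical defaults; the real ones are defined in terraform/variables.tf.
DEFAULTS: dict[str, Any] = {"cpu_cores": 2, "memory": 2048, "disk_size": "20G"}


def resolve_vm(name: str, overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge per-VM overrides over the defaults and derive the network mode."""
    vm = {**DEFAULTS, **overrides, "name": name}
    # A VM with an `ip` field gets a static address; without one it uses DHCP.
    vm["network_mode"] = "static" if vm.get("ip") else "dhcp"
    return vm


def resolve_all(vms: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Rough analogue of iterating a locals map with for_each."""
    return {name: resolve_vm(name, overrides) for name, overrides in vms.items()}
```

Calling `resolve_all({"test01": {"ip": "10.69.13.50/24"}, "test02": {}})` would mark `test01` as static and `test02` as DHCP, each with the remaining fields filled from the defaults.
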
**Tasks:**
- [x] Create module/template structure in terraform for repeatable VM deployments
- [x] Parameterize VM configuration (hostname, CPU, memory, disk, IP)
- [x] Support both DHCP and static IP configuration via cloud-init
- [x] Test deploying multiple VMs from same template

**Deliverable:** ✅ Can deploy multiple VMs with custom parameters via OpenTofu in a single `tofu apply`

**Files:**
- `terraform/vms.tf` - VM definitions using locals map
- `terraform/outputs.tf` - Dynamic outputs for all VMs
- `terraform/variables.tf` - Configurable defaults
- `terraform/README.md` - Complete documentation

---

### Phase 2: Host Configuration Generator ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01

**Goal:** Automate creation of host configuration files

**Implementation:**
- Python CLI tool packaged as Nix derivation
- Available as `create-host` command in devShell
- Rich terminal UI with configuration previews
- Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
- Jinja2 templates for NixOS configurations
- Automatic updates to flake.nix and terraform/vms.tf

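As an illustration of the validation step listed above, here is a minimal sketch using only the Python standard library. It is not the code in `scripts/create-host/`; the function names, the hostname pattern, and the idea of passing the existing hostnames and used IPs in as sets are assumptions for the example.

```python
import ipaddress
import re

# RFC 1123 style label: lowercase letters, digits, hyphens; no leading/trailing hyphen.
HOSTNAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def validate_hostname(hostname: str, existing: set[str]) -> None:
    """Raise ValueError if the hostname is malformed or already taken."""
    if not HOSTNAME_RE.fullmatch(hostname):
        raise ValueError(f"invalid hostname format: {hostname!r}")
    if hostname in existing:
        raise ValueError(f"hostname already exists: {hostname!r}")


def validate_ip(cidr: str, subnet: str, used: set[str]) -> str:
    """Return the bare address if it lies in the subnet and is not in use."""
    iface = ipaddress.ip_interface(cidr)  # raises ValueError on malformed input
    network = ipaddress.ip_network(subnet)
    if iface.ip not in network:
        raise ValueError(f"{iface.ip} is outside {network}")
    if str(iface.ip) in used:
        raise ValueError(f"{iface.ip} is already in use")
    return str(iface.ip)
```

For example, `validate_ip("10.69.13.50/24", "10.69.13.0/24", set())` passes, assuming `10.69.13.0/24` is the subnet in question, while an address from another network raises `ValueError`.
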
**Tasks:**
- [x] Create Python CLI with typer framework
  - [x] Takes parameters: hostname, IP, CPU cores, memory, disk size
  - [x] Generates `/hosts/<hostname>/` directory structure
  - [x] Creates `configuration.nix` with proper hostname and networking
  - [x] Generates `default.nix` with standard imports
  - [x] References shared `hardware-configuration.nix` from template
- [x] Add host entry to `flake.nix` programmatically
  - [x] Text-based manipulation (regex insertion)
  - [x] Inserts new nixosConfiguration entry
  - [x] Maintains proper formatting
- [x] Generate corresponding OpenTofu configuration
  - [x] Adds VM definition to `terraform/vms.tf`
  - [x] Uses parameters from CLI input
  - [x] Supports both static IP and DHCP modes
- [x] Package as Nix derivation with templates
- [x] Add to flake packages and devShell
- [x] Implement dry-run mode
- [x] Write comprehensive README

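The regex-based insertion into `flake.nix` mentioned in the tasks above could look roughly like the sketch below. The layout of the real `flake.nix` is not shown in this document, so the shape of the inserted entry and the assumption that `nixosConfigurations = {` and its closing `};` sit on their own lines at the same indentation are both hypothetical.

```python
import re

# Hypothetical entry shape; adjust to whatever the real flake.nix uses.
ENTRY_TEMPLATE = """\
      {hostname} = nixpkgs.lib.nixosSystem {{
        modules = [ ./hosts/{hostname} ];
      }};
"""

# Opening line, existing body (non-greedy), and the closing "};" at the same indent.
BLOCK_RE = re.compile(
    r"^(?P<indent>[ \t]*)nixosConfigurations\s*=\s*\{\n"
    r"(?P<body>.*?)"
    r"(?P<close>^(?P=indent)\};)",
    re.MULTILINE | re.DOTALL,
)


def insert_host(flake_text: str, hostname: str) -> str:
    """Insert a nixosConfigurations entry just before the block's closing brace."""
    if re.search(rf"^\s*{re.escape(hostname)}\s*=", flake_text, re.MULTILINE):
        raise ValueError(f"{hostname} is already present")
    match = BLOCK_RE.search(flake_text)
    if match is None:
        raise ValueError("nixosConfigurations block not found")
    cut = match.start("close")
    return flake_text[:cut] + ENTRY_TEMPLATE.format(hostname=hostname) + flake_text[cut:]
```

Text-based insertion keeps existing comments and formatting intact, at the cost of depending on the file's layout; the explicit "block not found" error makes a layout change fail loudly instead of silently producing a broken flake.
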
**Usage:**
```bash
# In nix develop shell
# Flag notes (kept out of the command itself, since a comment after a
# trailing backslash would break the line continuation):
#   --ip       optional, omit for DHCP
#   --cpu      optional, default 2
#   --memory   optional, default 2048
#   --disk     optional, default 20G
#   --dry-run  optional preview mode
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \
  --cpu 4 \
  --memory 4096 \
  --disk 50G \
  --dry-run
```

**Files:**
- `scripts/create-host/` - Complete Python package with Nix derivation
- `scripts/create-host/README.md` - Full documentation and examples

**Deliverable:** ✅ Tool generates all config files for a new host, validated with Nix and Terraform

---

### Phase 3: Bootstrap Mechanism ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01

**Goal:** Get freshly deployed VM to apply its specific host configuration

**Implementation:** Systemd oneshot service that runs on first boot after cloud-init

**Approach taken:** Systemd service (variant of Option A)
- Systemd service `nixos-bootstrap.service` runs on first boot
- Depends on `cloud-config.service` to ensure hostname is set
- Reads hostname from `hostnamectl` (set by cloud-init via Terraform)
- Runs `nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#${hostname}`
- Reboots into new configuration on success
- Fails gracefully without reboot on errors (network issues, missing config)
- Service self-destructs after successful bootstrap (not in new config)

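The control flow described above (check connectivity, rebuild, reboot only on success) is sketched below in Python for illustration. This is not the script in `hosts/template2/bootstrap.nix`; the `bootstrap` function, the injectable `runner` parameter, and the exact `curl` flags and target URL are assumptions for the example.

```python
import subprocess
from typing import Callable, Sequence

# Flake URL as given in the approach above.
FLAKE_URL = "git+https://git.t-juice.club/torjus/nixos-servers.git"

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def bootstrap(hostname: str, runner: Runner = run) -> bool:
    """Return True if a reboot was requested, False if bootstrap stopped early."""
    # 1. Connectivity check over HTTPS; the curl flags and URL are an assumption.
    if runner(["curl", "--fail", "--silent", "--head", "https://git.t-juice.club"]) != 0:
        print("bootstrap: network check failed, not rebuilding")
        return False
    # 2. Build the host-specific configuration and make it the next boot entry.
    if runner(["nixos-rebuild", "boot", "--flake", f"{FLAKE_URL}#{hostname}"]) != 0:
        print(f"bootstrap: nixos-rebuild failed for {hostname}, not rebooting")
        return False
    # 3. Reboot only after a successful rebuild.
    runner(["systemctl", "reboot"])
    return True
```

Passing the command runner in as a parameter makes the failure paths easy to exercise without touching a real system: a fake runner that returns a non-zero code for `nixos-rebuild` should never see a reboot command.
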
**Tasks:**
- [x] Create bootstrap service module in template2
  - [x] systemd oneshot service with proper dependencies
  - [x] Reads hostname from hostnamectl (cloud-init sets it)
  - [x] Checks network connectivity via HTTPS (curl)
  - [x] Runs nixos-rebuild boot with flake URL
  - [x] Reboots on success, fails gracefully on error
- [x] Configure cloud-init datasource
  - [x] Use ConfigDrive datasource (Proxmox provider)
  - [x] Add cloud-init disk to Terraform VMs (disks.ide.ide2.cloudinit)
  - [x] Hostname passed via cloud-init user-data from Terraform
- [x] Test bootstrap service execution on fresh VM
- [x] Handle failure cases (flake doesn't exist, network issues)
  - [x] Clear error messages in journald
  - [x] No reboot on failure
  - [x] System remains accessible for debugging

**Files:**
- `hosts/template2/bootstrap.nix` - Bootstrap service definition
- `hosts/template2/configuration.nix` - Cloud-init ConfigDrive datasource
- `terraform/vms.tf` - Cloud-init disk configuration

**Deliverable:** ✅ VMs automatically bootstrap and reboot into host-specific configuration on first boot

---

### Phase 4: Secrets Management Automation

**Challenge:** sops needs age key, but age key is generated on first boot

**Current workflow:**
1. VM boots, generates age key at `/var/lib/sops-nix/key.txt`
2. User runs `prepare-host.sh` which prints public key
3. User manually adds public key to `.sops.yaml`
4. User commits, pushes
5. VM can now decrypt secrets

**Proposed solution:**

**Option A: Pre-generate age keys**
- [ ] Generate age key pair during `create-host-config.sh`
- [ ] Add public key to `.sops.yaml` immediately
- [ ] Store private key temporarily (secure location)
- [ ] Inject private key via cloud-init write_files or Terraform file provisioner
- [ ] VM uses pre-configured key from first boot

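For Option A, the key generation step could be sketched as follows. This assumes the `age-keygen` binary is on `PATH` and relies on the `# public key:` comment it writes into the key file; the function names are hypothetical, and updating `.sops.yaml` and handling the private key securely are deliberately left out.

```python
import subprocess


def generate_age_key() -> str:
    """Return the contents of a new key file; requires age-keygen on PATH."""
    result = subprocess.run(["age-keygen"], check=True, capture_output=True, text=True)
    return result.stdout


def public_key_from_keyfile(keyfile_text: str) -> str:
    """Extract the recipient from the '# public key:' comment in a key file."""
    prefix = "# public key: "
    for line in keyfile_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    raise ValueError("no public key comment found in key file")
```

The returned public key is what would be added to `.sops.yaml`, while the full key file text is the part that has to reach the VM before first boot.
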
**Option B: Post-deployment secret injection**
- [ ] VM boots with template, generates its own key
- [ ] Fetch public key via SSH after first boot
- [ ] Automatically add to `.sops.yaml` and commit
- [ ] Trigger rebuild on VM to pick up secrets access

**Option C: Separate secrets from initial deployment**
- [ ] Initial deployment works without secrets
- [ ] After VM is running, user manually adds age key
- [ ] Subsequent auto-upgrades pick up secrets

**Decision needed:** Option A is most automated, but requires secure key handling

---

### Phase 5: DNS Automation

**Goal:** Automatically generate DNS entries from host configurations

**Approach:** Leverage Nix to generate zone file entries from flake host configurations

Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.

**Tasks:**
- [ ] Add optional CNAME field to host configurations
  - [ ] Add `networking.cnames = [ "alias1" "alias2" ]` or similar option
  - [ ] Document in host configuration template
- [ ] Create Nix function to extract DNS records from all hosts
  - [ ] Parse each host's `networking.hostName` and IP configuration
  - [ ] Collect any defined CNAMEs
  - [ ] Generate zone file fragment with A and CNAME records
- [ ] Integrate auto-generated records into zone files
  - [ ] Keep manual entries separate (for non-flake hosts/services)
  - [ ] Include generated fragment in main zone file
  - [ ] Add comments showing which records are auto-generated
- [ ] Update zone file serial number automatically
- [ ] Test zone file validity after generation
- [ ] Either:
  - [ ] Automatically trigger DNS server reload (Ansible)
  - [ ] Or document manual step: merge to master, run upgrade on ns1/ns2

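The plan is to do the record extraction in Nix, but the shape of the output can be illustrated with a small Python sketch. The input dictionary, the marker comments, and the `YYYYMMDDnn` serial convention are all assumptions made for the example; the document does not specify any of them.

```python
from datetime import date


def zone_fragment(hosts: dict[str, dict]) -> str:
    """Render A and CNAME records for each host, wrapped in marker comments."""
    lines = ["; BEGIN auto-generated records (do not edit by hand)"]
    for name in sorted(hosts):
        host = hosts[name]
        lines.append(f"{name}\tIN\tA\t{host['ip']}")
        for alias in host.get("cnames", []):
            # Relative target: resolved against the zone origin.
            lines.append(f"{alias}\tIN\tCNAME\t{name}")
    lines.append("; END auto-generated records")
    return "\n".join(lines) + "\n"


def next_serial(current: int, today: date) -> int:
    """Bump a YYYYMMDDnn serial: a new day resets nn to 00, the same day increments it."""
    base = int(today.strftime("%Y%m%d")) * 100
    return base if current < base else current + 1
```

Sorting the hosts keeps the generated fragment stable between runs, so a diff of the zone file only shows real changes.
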
**Deliverable:** DNS A records and CNAMEs automatically generated from host configs

---

### Phase 6: Integration Script

**Goal:** Single command to create and deploy a new host

**Tasks:**
- [ ] Create `scripts/create-host.sh` master script that orchestrates:
  1. Prompts for: hostname, IP (or DHCP), CPU, memory, disk
  2. Validates inputs (IP not in use, hostname unique, etc.)
  3. Calls host config generator (Phase 2)
  4. Generates OpenTofu config (Phase 2)
  5. Handles secrets (Phase 4)
  6. Updates DNS (Phase 5)
  7. Commits all changes to git
  8. Runs `tofu apply` to deploy VM
  9. Waits for bootstrap to complete (Phase 3)
  10. Prints success message with IP and SSH command
- [ ] Add `--dry-run` flag to preview changes
- [ ] Add `--interactive` mode vs `--batch` mode
- [ ] Error handling and rollback on failures

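One possible shape for the orchestration, dry-run, and rollback tasks above is a list of steps that each know how to undo themselves. The sketch below is a generic pattern rather than a design decision for `scripts/create-host.sh`; the `Step` type and `run_pipeline` function are hypothetical names.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def run_pipeline(steps: list[Step], dry_run: bool = False) -> list[str]:
    """Run steps in order; on failure, undo the completed steps in reverse order."""
    if dry_run:
        # Preview only: nothing is executed.
        return [f"would run: {step.name}" for step in steps]
    done: list[Step] = []
    log: list[str] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            for finished in reversed(done):
                finished.undo()
                log.append(f"rolled back: {finished.name}")
            return log
        done.append(step)
        log.append(f"ok: {step.name}")
    return log
```

Each of the ten numbered steps would become one `Step`, with `undo` reverting generated files or destroying the VM as appropriate; steps that cannot be undone can use a no-op `undo` and rely on the log for manual cleanup.
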
**Deliverable:** `./scripts/create-host.sh --hostname myhost --ip 10.69.13.50` creates a fully working host

---

### Phase 7: Testing & Documentation

**Tasks:**
- [ ] Test full pipeline end-to-end
- [ ] Create test host and verify all steps
- [ ] Document the new workflow in CLAUDE.md
- [ ] Add troubleshooting section
- [ ] Create examples for common scenarios (DHCP host, static IP host, etc.)

---

## Open Questions

1. **Bootstrap method:** Cloud-init runcmd vs Terraform provisioner vs Ansible?
2. **Secrets handling:** Pre-generate keys vs post-deployment injection?
3. **DNS automation:** Auto-commit or manual merge?
4. **Git workflow:** Auto-push changes or leave for user review?
5. **Template selection:** Single template2 or multiple templates for different host types?
6. **Networking:** Always DHCP initially, or support static IP from start?
7. **Error recovery:** What happens if bootstrap fails? Manual intervention or retry?

## Implementation Order

Recommended sequence:
1. Phase 1: Parameterize OpenTofu (foundation for testing)
2. Phase 3: Bootstrap mechanism (core automation)
3. Phase 2: Config generator (automate the boilerplate)
4. Phase 4: Secrets (solves biggest chicken-and-egg)
5. Phase 5: DNS (nice-to-have automation)
6. Phase 6: Integration script (ties it all together)
7. Phase 7: Testing & docs

## Success Criteria

When complete, creating a new host should:
- Take < 5 minutes of human time
- Require minimal user input (hostname, IP, basic specs)
- Result in a fully configured, secret-enabled, DNS-registered host
- Be reproducible and documented
- Handle common errors gracefully

---

## Notes

- Keep incremental commits at each phase
- Test each phase independently before moving to next
- Maintain backward compatibility with manual workflow
- Document any manual steps that can't be automated