# TODO: Automated Host Deployment Pipeline

## Vision

Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.

Desired workflow:

```sh
./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go
```
Current manual workflow (from CLAUDE.md):

- Create `/hosts/<hostname>/` directory structure
- Add host to `flake.nix`
- Add DNS entries
- Clone template VM manually
- Run `prepare-host.sh` on new VM
- Add generated age key to `.sops.yaml`
- Configure networking
- Commit and push
- Run `nixos-rebuild boot --flake URL#<hostname>` on host
## The Plan

### Phase 1: Parameterized OpenTofu Deployments ✅ COMPLETED

Status: Fully implemented and tested

Implementation:

- Locals-based structure using the `for_each` pattern for multiple VM deployments
- All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
- Automatic DHCP vs static IP detection based on `ip` field presence
- Dynamic outputs showing deployed VM IPs and specifications
- Successfully tested deploying multiple VMs simultaneously
Tasks:
- Create module/template structure in terraform for repeatable VM deployments
- Parameterize VM configuration (hostname, CPU, memory, disk, IP)
- Support both DHCP and static IP configuration via cloud-init
- Test deploying multiple VMs from same template
Deliverable: ✅ Can deploy multiple VMs with custom parameters via OpenTofu in a single `tofu apply`
Files:

- `terraform/vms.tf` - VM definitions using locals map
- `terraform/outputs.tf` - Dynamic outputs for all VMs
- `terraform/variables.tf` - Configurable defaults
- `terraform/README.md` - Complete documentation
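The locals-plus-`for_each` pattern can be sketched roughly as follows. This is illustrative only: the real definitions live in `terraform/vms.tf`, and the map entries, gateway address, and default values here are made up for the example.

```hcl
# Sketch of the locals + for_each VM pattern (values hypothetical).
# Resource name matches the proxmox_vm_qemu.vm referenced elsewhere in this doc.
locals {
  vms = {
    "myhost" = { ip = "10.69.13.50/24", cores = 4, memory = 4096 }
    "dhcp01" = { cores = 2 } # no ip field => DHCP
  }
}

resource "proxmox_vm_qemu" "vm" {
  for_each = local.vms

  name   = each.key
  cores  = try(each.value.cores, 2)   # smart defaults when a field is omitted
  memory = try(each.value.memory, 2048)

  # DHCP vs static IP detection based on presence of the ip field
  ipconfig0 = can(each.value.ip) ? "ip=${each.value.ip},gw=10.69.13.1" : "ip=dhcp"
}
```

Adding a VM is then a one-line change to the `vms` map, and `tofu apply` picks it up.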
### Phase 2: Host Configuration Generator ✅ COMPLETED

Status: ✅ Fully implemented and tested. Completed: 2025-02-01. Enhanced: 2025-02-01 (added `--force` flag).
Goal: Automate creation of host configuration files
Implementation:

- Python CLI tool packaged as Nix derivation
- Available as `create-host` command in devShell
- Rich terminal UI with configuration previews
- Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
- Jinja2 templates for NixOS configurations
- Automatic updates to flake.nix and terraform/vms.tf
- `--force` flag for regenerating existing configurations (useful for testing)
Tasks:

- Create Python CLI with typer framework
  - Takes parameters: hostname, IP, CPU cores, memory, disk size
- Generates `/hosts/<hostname>/` directory structure
  - Creates `configuration.nix` with proper hostname and networking
  - Generates `default.nix` with standard imports
  - References shared `hardware-configuration.nix` from template
- Add host entry to `flake.nix` programmatically
  - Text-based manipulation (regex insertion)
  - Inserts new nixosConfiguration entry
  - Maintains proper formatting
- Generate corresponding OpenTofu configuration
  - Adds VM definition to `terraform/vms.tf`
  - Uses parameters from CLI input
  - Supports both static IP and DHCP modes
- Package as Nix derivation with templates
- Add to flake packages and devShell
- Implement dry-run mode
- Write comprehensive README
Usage:

```sh
# In nix develop shell
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \  # optional, omit for DHCP
  --cpu 4 \              # optional, default 2
  --memory 4096 \        # optional, default 2048
  --disk 50G \           # optional, default 20G
  --dry-run              # optional preview mode
```
Files:

- `scripts/create-host/` - Complete Python package with Nix derivation
- `scripts/create-host/README.md` - Full documentation and examples
Deliverable: ✅ Tool generates all config files for a new host, validated with Nix and Terraform
### Phase 3: Bootstrap Mechanism ✅ COMPLETED

Status: ✅ Fully implemented and tested. Completed: 2025-02-01. Enhanced: 2025-02-01 (added branch support for testing).
Goal: Get freshly deployed VM to apply its specific host configuration
Implementation: Systemd oneshot service that runs on first boot after cloud-init
Approach taken: Systemd service (variant of Option A)

- Systemd service `nixos-bootstrap.service` runs on first boot
- Depends on `cloud-config.service` to ensure hostname is set
- Reads hostname from `hostnamectl` (set by cloud-init via Terraform)
- Supports custom git branch via `NIXOS_FLAKE_BRANCH` environment variable
- Runs `nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#${hostname}`
- Reboots into new configuration on success
- Fails gracefully without reboot on errors (network issues, missing config)
- Service self-destructs after successful bootstrap (not in new config)
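The shape of such a oneshot service can be sketched as a NixOS module. This is an illustrative reconstruction, not the actual contents of `hosts/template2/bootstrap.nix`; the option names are standard NixOS module options, and the script body condenses the steps listed above.

```nix
# Sketch only: the real service lives in hosts/template2/bootstrap.nix
{ config, pkgs, ... }:
{
  systemd.services.nixos-bootstrap = {
    description = "Apply host-specific flake configuration on first boot";
    wantedBy = [ "multi-user.target" ];
    # Run after cloud-init has set the hostname and the network is up
    after = [ "network-online.target" "cloud-config.service" ];
    wants = [ "network-online.target" ];
    serviceConfig.Type = "oneshot";
    path = [ pkgs.nixos-rebuild pkgs.curl pkgs.hostname pkgs.systemd ];
    script = ''
      set -eu
      hostname="$(hostname)"
      branch="''${NIXOS_FLAKE_BRANCH:-master}"
      # Fail (without rebooting) if the forge is unreachable
      curl -fsS https://git.t-juice.club > /dev/null
      nixos-rebuild boot --flake \
        "git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$branch#$hostname"
      systemctl reboot
    '';
  };
}
```

Because `set -eu` aborts the script on any failure before the `systemctl reboot`, the failure path (no reboot, errors visible in journald) falls out naturally.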
Tasks:

- Create bootstrap service module in template2
  - systemd oneshot service with proper dependencies
  - Reads hostname from hostnamectl (cloud-init sets it)
  - Checks network connectivity via HTTPS (curl)
  - Runs nixos-rebuild boot with flake URL
  - Reboots on success, fails gracefully on error
- Configure cloud-init datasource
  - Use ConfigDrive datasource (Proxmox provider)
  - Add cloud-init disk to Terraform VMs (`disks.ide.ide2.cloudinit`)
  - Hostname passed via cloud-init user-data from Terraform
- Test bootstrap service execution on fresh VM
- Handle failure cases (flake doesn't exist, network issues)
  - Clear error messages in journald
  - No reboot on failure
  - System remains accessible for debugging
Files:

- `hosts/template2/bootstrap.nix` - Bootstrap service definition
- `hosts/template2/configuration.nix` - Cloud-init ConfigDrive datasource
- `terraform/vms.tf` - Cloud-init disk configuration
Deliverable: ✅ VMs automatically bootstrap and reboot into host-specific configuration on first boot
### Phase 4: Secrets Management Automation

Challenge: sops needs the age key, but the age key is only generated on first boot

Current workflow:

- VM boots, generates age key at `/var/lib/sops-nix/key.txt`
- User runs `prepare-host.sh`, which prints the public key
- User manually adds the public key to `.sops.yaml`
- User commits and pushes
- VM can now decrypt secrets
Proposed solution:

Option A: Pre-generate age keys

- Generate age key pair during `create-host-config.sh`
- Add public key to `.sops.yaml` immediately
- Store private key temporarily (secure location)
- Inject private key via cloud-init write_files or Terraform file provisioner
- VM uses pre-configured key from first boot
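Under Option A, the generator would append an entry like the following to `.sops.yaml`. The host name and key are placeholders; `age1example...` stands in for the real generated public key.

```yaml
# Hypothetical .sops.yaml entry added at generation time
keys:
  - &host_myhost age1example...
creation_rules:
  - path_regex: hosts/myhost/.*
    key_groups:
      - age:
          - *host_myhost
```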
Option B: Post-deployment secret injection

- VM boots with template, generates its own key
- Fetch public key via SSH after first boot
- Automatically add to `.sops.yaml` and commit
- Trigger rebuild on VM to pick up secrets access
Option C: Separate secrets from initial deployment
- Initial deployment works without secrets
- After VM is running, user manually adds age key
- Subsequent auto-upgrades pick up secrets
Decision needed: Option A is most automated, but requires secure key handling
### Phase 5: DNS Automation
Goal: Automatically generate DNS entries from host configurations
Approach: Leverage Nix to generate zone file entries from flake host configurations
Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
Tasks:

- Add optional CNAME field to host configurations
  - Add `networking.cnames = [ "alias1" "alias2" ]` or similar option
  - Document in host configuration template
- Create Nix function to extract DNS records from all hosts
  - Parse each host's `networking.hostName` and IP configuration
  - Collect any defined CNAMEs
  - Generate zone file fragment with A and CNAME records
- Integrate auto-generated records into zone files
  - Keep manual entries separate (for non-flake hosts/services)
  - Include generated fragment in main zone file
  - Add comments showing which records are auto-generated
- Update zone file serial number automatically
- Test zone file validity after generation
- Either:
  - Automatically trigger DNS server reload (Ansible)
  - Or document manual step: merge to master, run upgrade on ns1/ns2
Deliverable: DNS A records and CNAMEs automatically generated from host configs
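The extraction step could be a flake-level Nix expression along these lines. This is a sketch built on assumptions: that `self` is the flake, `lib` is `nixpkgs.lib`, and every static host declares exactly one address on an interface named `ens18` (the real interface name would need to be looked up per host).

```nix
# Sketch: generate A-record lines from flake nixosConfigurations (assumptions above)
hostARecords = lib.concatStringsSep "\n" (lib.filter (r: r != null)
  (lib.mapAttrsToList
    (name: host:
      let addrs = host.config.networking.interfaces.ens18.ipv4.addresses or [ ];
      in if addrs == [ ] then null          # DHCP host: no A record generated
         else "${name}  IN  A  ${(builtins.head addrs).address}")
    self.nixosConfigurations));
```

The resulting fragment could then be spliced into the zone file next to the manually maintained records.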
### Phase 6: Integration Script

Goal: Single command to create and deploy a new host

Tasks:

- Create `scripts/create-host.sh` master script that orchestrates:
  - Prompts for: hostname, IP (or DHCP), CPU, memory, disk
  - Validates inputs (IP not in use, hostname unique, etc.)
  - Calls host config generator (Phase 2)
  - Generates OpenTofu config (Phase 2)
  - Handles secrets (Phase 4)
  - Updates DNS (Phase 5)
  - Commits all changes to git
  - Runs `tofu apply` to deploy VM
  - Waits for bootstrap to complete (Phase 3)
  - Prints success message with IP and SSH command
- Add `--dry-run` flag to preview changes
- Add `--interactive` mode vs `--batch` mode
- Error handling and rollback on failures

Deliverable: `./scripts/create-host.sh --hostname myhost --ip 10.69.13.50` creates a fully working host
### Phase 7: Testing & Documentation

Status: 🚧 In Progress (testing improvements completed)

Testing Improvements Implemented (2025-02-01):

The pipeline now supports efficient testing without polluting the master branch:
1. `--force` Flag for create-host

- Re-run `create-host` to regenerate existing configurations
- Updates existing entries in flake.nix and terraform/vms.tf (no duplicates)
- Skips uniqueness validation checks
- Useful for iterating on configuration templates during testing
2. Branch Support for Bootstrap

- Bootstrap service reads the `NIXOS_FLAKE_BRANCH` environment variable
- Defaults to `master` if not set
- Allows testing pipeline changes on feature branches
- Cloud-init passes the branch via `/etc/environment`
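The branch-selection logic amounts to a shell parameter default. A minimal sketch of what the bootstrap script does with the variable (variable names here are illustrative):

```shell
# NIXOS_FLAKE_BRANCH comes from /etc/environment; unset or empty means master
BRANCH="${NIXOS_FLAKE_BRANCH:-master}"
FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=${BRANCH}"
echo "$FLAKE_URL"
```

Production VMs never set the variable, so they always resolve to `?ref=master`.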
3. Cloud-init Disk for Branch Configuration

- Terraform generates custom cloud-init snippets for test VMs
- Set the `flake_branch` field in a VM definition to use a non-master branch
- Production VMs omit this field and use master (default)
- Files automatically uploaded to Proxmox via SSH
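The generated snippet is plausibly a small cloud-config document like the one below. This is a sketch of what `terraform/cloud-init.tf` might emit for a test VM, not its verified output:

```yaml
#cloud-config
# Hypothetical snippet generated for a VM with flake_branch = "test-pipeline"
write_files:
  - path: /etc/environment
    append: true
    content: |
      NIXOS_FLAKE_BRANCH=test-pipeline
```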
Testing Workflow:

```sh
# 1. Create test branch
git checkout -b test-pipeline

# 2. Generate or update host config
create-host --hostname testvm01 --ip 10.69.13.100/24

# 3. Edit terraform/vms.tf to add test VM with branch
# vms = {
#   "testvm01" = {
#     ip           = "10.69.13.100/24"
#     flake_branch = "test-pipeline"  # Bootstrap from this branch
#   }
# }

# 4. Commit and push test branch
git add -A && git commit -m "test: add testvm01"
git push origin test-pipeline

# 5. Deploy VM
cd terraform && tofu apply

# 6. Watch bootstrap (VM fetches from test-pipeline branch)
ssh root@10.69.13.100
journalctl -fu nixos-bootstrap.service

# 7. Iterate: modify templates and regenerate with --force
cd .. && create-host --hostname testvm01 --ip 10.69.13.100/24 --force
git commit -am "test: update config" && git push

# Redeploy to test fresh bootstrap (quote the target so the shell
# doesn't treat the brackets as a glob)
cd terraform
tofu destroy -target='proxmox_vm_qemu.vm["testvm01"]' && tofu apply

# 8. Clean up when done: squash commits, merge to master, remove test VM
```
Files:

- `scripts/create-host/create_host.py` - Added --force parameter
- `scripts/create-host/manipulators.py` - Update vs insert logic
- `hosts/template2/bootstrap.nix` - Branch support via environment variable
- `terraform/vms.tf` - flake_branch field support
- `terraform/cloud-init.tf` - Custom cloud-init disk generation
- `terraform/variables.tf` - proxmox_host variable for SSH uploads
Remaining Tasks:
- Test full pipeline end-to-end on feature branch
- Update CLAUDE.md with testing workflow
- Add troubleshooting section
- Create examples for common scenarios (DHCP host, static IP host, etc.)
## Open Questions
- Bootstrap method: Cloud-init runcmd vs Terraform provisioner vs Ansible?
- Secrets handling: Pre-generate keys vs post-deployment injection?
- DNS automation: Auto-commit or manual merge?
- Git workflow: Auto-push changes or leave for user review?
- Template selection: Single template2 or multiple templates for different host types?
- Networking: Always DHCP initially, or support static IP from start?
- Error recovery: What happens if bootstrap fails? Manual intervention or retry?
## Implementation Order
Recommended sequence:
- Phase 1: Parameterize OpenTofu (foundation for testing)
- Phase 3: Bootstrap mechanism (core automation)
- Phase 2: Config generator (automate the boilerplate)
- Phase 4: Secrets (solves biggest chicken-and-egg)
- Phase 5: DNS (nice-to-have automation)
- Phase 6: Integration script (ties it all together)
- Phase 7: Testing & docs
## Success Criteria
When complete, creating a new host should:
- Take < 5 minutes of human time
- Require minimal user input (hostname, IP, basic specs)
- Result in a fully configured, secret-enabled, DNS-registered host
- Be reproducible and documented
- Handle common errors gracefully
## Notes
- Keep incremental commits at each phase
- Test each phase independently before moving to next
- Maintain backward compatibility with manual workflow
- Document any manual steps that can't be automated