TODO: Automated Host Deployment Pipeline
Vision
Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.
Desired workflow:
./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go
Current manual workflow (from CLAUDE.md):
- Create /hosts/<hostname>/ directory structure
- Add host to flake.nix
- Add DNS entries
- Clone template VM manually
- Run prepare-host.sh on new VM
- Add generated age key to .sops.yaml
- Configure networking
- Commit and push
- Run nixos-rebuild boot --flake URL#<hostname> on host
The Plan
Phase 1: Parameterized OpenTofu Deployments ✅ COMPLETED
Status: Fully implemented and tested
Implementation:
- Locals-based structure using for_each pattern for multiple VM deployments
- All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
- Automatic DHCP vs static IP detection based on ip field presence
- Dynamic outputs showing deployed VM IPs and specifications
- Successfully tested deploying multiple VMs simultaneously
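The defaults-plus-detection logic described above lives in OpenTofu locals, but the idea can be illustrated in a few lines of Python. This is a minimal sketch, not the actual Terraform code: the `DEFAULTS` values are taken from the create-host defaults listed in Phase 2, the gateway address is a hypothetical placeholder, and `resolve_vm` is an illustrative name. The `ipconfig0` strings follow the Proxmox cloud-init format.

```python
# Illustrative defaults; the real values live in terraform/variables.tf.
# The gateway below is a hypothetical placeholder, not taken from the repo.
DEFAULTS = {"cpu_cores": 2, "memory": 2048, "disk_size": "20G", "gateway": "10.69.13.1"}


def resolve_vm(name, overrides):
    """Merge per-VM overrides onto the defaults and derive the ipconfig string."""
    vm = {**DEFAULTS, **overrides, "name": name}
    ip = vm.get("ip")
    if ip:
        # An ip field is present: configure a static address with a gateway.
        vm["ipconfig0"] = f"ip={ip},gw={vm['gateway']}"
    else:
        # No ip field: fall back to DHCP.
        vm["ipconfig0"] = "ip=dhcp"
    return vm
```

Calling `resolve_vm("test01", {"ip": "10.69.13.50/24"})` yields a static configuration, while omitting `ip` yields DHCP, mirroring the detection rule described in the list above.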
Tasks:
- Create module/template structure in terraform for repeatable VM deployments
- Parameterize VM configuration (hostname, CPU, memory, disk, IP)
- Support both DHCP and static IP configuration via cloud-init
- Test deploying multiple VMs from same template
Deliverable: ✅ Can deploy multiple VMs with custom parameters via OpenTofu in a single tofu apply
Files:
- terraform/vms.tf - VM definitions using locals map
- terraform/outputs.tf - Dynamic outputs for all VMs
- terraform/variables.tf - Configurable defaults
- terraform/README.md - Complete documentation
Phase 2: Host Configuration Generator ✅ COMPLETED
Status: ✅ Fully implemented and tested
Completed: 2025-02-01
Goal: Automate creation of host configuration files
Implementation:
- Python CLI tool packaged as Nix derivation
- Available as create-host command in devShell
- Rich terminal UI with configuration previews
- Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
- Jinja2 templates for NixOS configurations
- Automatic updates to flake.nix and terraform/vms.tf
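The validation step listed above (hostname format and uniqueness, IP subnet and uniqueness) might look roughly like the following. This is a sketch under assumptions, not the code in scripts/create-host/: the function name, the error messages, and the default subnet (inferred from the example address 10.69.13.50/24) are all illustrative.

```python
import ipaddress
import re

# RFC 1123 style label: lowercase letters, digits, hyphens; no leading or
# trailing hyphen; at most 63 characters.
HOSTNAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def validate_host(hostname, ip_cidr, existing_hosts, existing_ips, subnet="10.69.13.0/24"):
    """Return a list of problems; an empty list means the input is acceptable."""
    errors = []
    if not HOSTNAME_RE.fullmatch(hostname):
        errors.append(f"invalid hostname: {hostname!r}")
    if hostname in existing_hosts:
        errors.append(f"hostname already exists: {hostname}")
    if ip_cidr is not None:  # None means the host will use DHCP
        try:
            iface = ipaddress.ip_interface(ip_cidr)
        except ValueError:
            errors.append(f"invalid IP address: {ip_cidr!r}")
        else:
            if iface.ip not in ipaddress.ip_network(subnet):
                errors.append(f"{iface.ip} is outside subnet {subnet}")
            if str(iface.ip) in existing_ips:
                errors.append(f"IP already in use: {iface.ip}")
    return errors
```

Returning a list of all problems, rather than raising on the first one, lets a CLI show every issue in a single preview.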
Tasks:
- Create Python CLI with typer framework
  - Takes parameters: hostname, IP, CPU cores, memory, disk size
  - Generates /hosts/<hostname>/ directory structure
  - Creates configuration.nix with proper hostname and networking
  - Generates default.nix with standard imports
  - References shared hardware-configuration.nix from template
- Add host entry to flake.nix programmatically
  - Text-based manipulation (regex insertion)
  - Inserts new nixosConfiguration entry
  - Maintains proper formatting
- Generate corresponding OpenTofu configuration
  - Adds VM definition to terraform/vms.tf
  - Uses parameters from CLI input
  - Supports both static IP and DHCP modes
- Package as Nix derivation with templates
- Add to flake packages and devShell
- Implement dry-run mode
- Write comprehensive README
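The regex-based insertion into flake.nix mentioned in the tasks can be sketched as follows. The layout of the real flake.nix is not shown in this document, so both the entry template and the assumption that the file contains a line of the form `nixosConfigurations = {` are hypothetical; `add_host_to_flake` is an illustrative name.

```python
import re

# Hypothetical entry shape; the real template in scripts/create-host/ may differ.
ENTRY_TEMPLATE = """\
      {hostname} = nixpkgs.lib.nixosSystem {{
        modules = [ ./hosts/{hostname} ];
      }};
"""


def add_host_to_flake(flake_text, hostname):
    """Insert an entry for hostname right after the nixosConfigurations opener."""
    already = re.search(
        rf"^\s*{re.escape(hostname)}\s*=\s*nixpkgs\.lib\.nixosSystem",
        flake_text,
        re.M,
    )
    if already:
        return flake_text  # idempotent: do not add the host twice
    opener = re.compile(r"^(\s*nixosConfigurations\s*=\s*\{[ \t]*\n)", re.M)
    entry = ENTRY_TEMPLATE.format(hostname=hostname)
    # A function replacement avoids backslash-escape surprises in the entry text.
    new_text, count = opener.subn(lambda m: m.group(1) + entry, flake_text, count=1)
    if count == 0:
        raise ValueError("nixosConfigurations block not found")
    return new_text
```

Text-based insertion is fragile if the file is reformatted, so failing loudly when the anchor line is missing is safer than silently writing nothing.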
Usage:
# In nix develop shell
# All flags except --hostname are optional. Omit --ip for DHCP.
# Defaults: --cpu 2, --memory 2048, --disk 20G. --dry-run is a preview mode.
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \
  --cpu 4 \
  --memory 4096 \
  --disk 50G \
  --dry-run
Files:
- scripts/create-host/ - Complete Python package with Nix derivation
- scripts/create-host/README.md - Full documentation and examples
Deliverable: ✅ Tool generates all config files for a new host, validated with Nix and Terraform
Phase 3: Bootstrap Mechanism ✅ COMPLETED
Status: ✅ Fully implemented and tested
Completed: 2025-02-01
Goal: Get freshly deployed VM to apply its specific host configuration
Implementation: Systemd oneshot service that runs on first boot after cloud-init
Approach taken: Systemd service (variant of Option A)
- Systemd service nixos-bootstrap.service runs on first boot
- Depends on cloud-config.service to ensure hostname is set
- Reads hostname from hostnamectl (set by cloud-init via Terraform)
- Runs nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#${hostname}
- Reboots into new configuration on success
- Fails gracefully without reboot on errors (network issues, missing config)
- Service self-destructs after successful bootstrap (not in new config)
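The control flow described above (build the flake reference from the hostname, rebuild, reboot only on success) can be illustrated with a short Python sketch. The real service is defined in hosts/template2/bootstrap.nix and is not reproduced here; the function names and the injectable `run` parameter are assumptions made so the logic can be exercised without touching a real system.

```python
import subprocess

FLAKE_URL = "git+https://git.t-juice.club/torjus/nixos-servers.git"


def rebuild_command(hostname):
    """Build the nixos-rebuild invocation for the given hostname."""
    if not hostname:
        raise ValueError("hostname is empty; cloud-init may not have run yet")
    return ["nixos-rebuild", "boot", "--flake", f"{FLAKE_URL}#{hostname}"]


def bootstrap(hostname, run=subprocess.run):
    """Rebuild for hostname; reboot only if the rebuild succeeded."""
    result = run(rebuild_command(hostname))
    if result.returncode != 0:
        # Leave the machine up so the failure can be debugged over SSH.
        print(f"bootstrap failed for {hostname}; not rebooting")
        return False
    run(["systemctl", "reboot"])
    return True
```

Keeping the reboot behind an explicit success check is what makes the failure mode safe: a missing flake entry or a network error leaves the template system running and reachable.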
Tasks:
- Create bootstrap service module in template2
  - systemd oneshot service with proper dependencies
  - Reads hostname from hostnamectl (cloud-init sets it)
  - Checks network connectivity via HTTPS (curl)
  - Runs nixos-rebuild boot with flake URL
  - Reboots on success, fails gracefully on error
- Configure cloud-init datasource
  - Use ConfigDrive datasource (Proxmox provider)
  - Add cloud-init disk to Terraform VMs (disks.ide.ide2.cloudinit)
  - Hostname passed via cloud-init user-data from Terraform
- Test bootstrap service execution on fresh VM
- Handle failure cases (flake doesn't exist, network issues)
  - Clear error messages in journald
  - No reboot on failure
  - System remains accessible for debugging
Files:
- hosts/template2/bootstrap.nix - Bootstrap service definition
- hosts/template2/configuration.nix - Cloud-init ConfigDrive datasource
- terraform/vms.tf - Cloud-init disk configuration
Deliverable: ✅ VMs automatically bootstrap and reboot into host-specific configuration on first boot
Phase 4: Secrets Management Automation
Challenge: sops needs age key, but age key is generated on first boot
Current workflow:
- VM boots, generates age key at /var/lib/sops-nix/key.txt
- User runs prepare-host.sh which prints public key
- User manually adds public key to .sops.yaml
- User commits, pushes
- VM can now decrypt secrets
Proposed solution:
Option A: Pre-generate age keys
- Generate age key pair during host config generation (create-host, Phase 2)
- Add public key to .sops.yaml immediately
- Store private key temporarily (secure location)
- Inject private key via cloud-init write_files or Terraform file provisioner
- VM uses pre-configured key from first boot
Option B: Post-deployment secret injection
- VM boots with template, generates its own key
- Fetch public key via SSH after first boot
- Automatically add to .sops.yaml and commit
- Trigger rebuild on VM to pick up secrets access
Option C: Separate secrets from initial deployment
- Initial deployment works without secrets
- After VM is running, user manually adds age key
- Subsequent auto-upgrades pick up secrets
Decision needed: Option A is most automated, but requires secure key handling
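To make Option A more concrete, the cloud-init write_files injection could be rendered along these lines. This is a sketch only: `age_key_cloud_config` is a hypothetical helper, and whether custom user-data can be delivered this way depends on how the Proxmox cloud-init disk is configured, which this document does not settle. The private key also ends up readable on the cloud-init disk, which is part of the secure-handling concern noted above.

```python
def age_key_cloud_config(private_key, path="/var/lib/sops-nix/key.txt"):
    """Render cloud-config user-data that writes an age private key to path."""
    key = private_key.strip()
    # age-keygen output may include comment lines; require one real key line.
    if not any(line.startswith("AGE-SECRET-KEY-") for line in key.splitlines()):
        raise ValueError("input does not contain an age private key")
    # Indent the key so it sits inside the YAML literal block scalar.
    indented = "\n".join("      " + line for line in key.splitlines())
    return (
        "#cloud-config\n"
        "write_files:\n"
        f"  - path: {path}\n"
        "    permissions: '0600'\n"
        "    owner: root:root\n"
        "    content: |\n"
        f"{indented}\n"
    )
```

The default path matches the location where the VM currently generates its own key, so sops-nix would find the injected key without further configuration changes.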
Phase 5: DNS Automation
Goal: Automatically generate DNS entries from host configurations
Approach: Leverage Nix to generate zone file entries from flake host configurations
Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
Tasks:
- Add optional CNAME field to host configurations
  - Add networking.cnames = [ "alias1" "alias2" ] or similar option
  - Document in host configuration template
- Create Nix function to extract DNS records from all hosts
  - Parse each host's networking.hostName and IP configuration
  - Collect any defined CNAMEs
  - Generate zone file fragment with A and CNAME records
- Integrate auto-generated records into zone files
  - Keep manual entries separate (for non-flake hosts/services)
  - Include generated fragment in main zone file
  - Add comments showing which records are auto-generated
- Update zone file serial number automatically
- Test zone file validity after generation
- Either:
  - Automatically trigger DNS server reload (Ansible)
  - Or document manual step: merge to master, run upgrade on ns1/ns2
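The plan is to do the extraction in Nix, but the record generation and serial bump are easy to illustrate in Python. In this sketch the input mapping (hostname to IP and optional CNAME list) stands in for whatever the Nix function would extract, the marker comments are illustrative, and the YYYYMMDDNN serial format is an assumed convention rather than something the document specifies.

```python
import datetime


def zone_fragment(hosts):
    """Render A and CNAME records from {hostname: {"ip": ..., "cnames": [...]}}."""
    lines = ["; BEGIN auto-generated records (do not edit by hand)"]
    for name in sorted(hosts):
        host = hosts[name]
        lines.append(f"{name}\tIN\tA\t{host['ip']}")
        for alias in sorted(host.get("cnames", [])):
            # Relative names resolve against the zone origin.
            lines.append(f"{alias}\tIN\tCNAME\t{name}")
    lines.append("; END auto-generated records")
    return "\n".join(lines) + "\n"


def next_serial(current: int, today: datetime.date) -> int:
    """Bump a YYYYMMDDNN serial: restart at 00 on a new day, else increment."""
    base = int(today.strftime("%Y%m%d")) * 100
    return base if current < base else current + 1
```

Sorting hosts and aliases keeps the generated fragment stable between runs, so a diff of the zone file only shows real changes.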
Deliverable: DNS A records and CNAMEs automatically generated from host configs
Phase 6: Integration Script
Goal: Single command to create and deploy a new host
Tasks:
- Create scripts/create-host.sh master script that orchestrates:
  - Prompts for: hostname, IP (or DHCP), CPU, memory, disk
  - Validates inputs (IP not in use, hostname unique, etc.)
  - Calls host config generator (Phase 2)
  - Generates OpenTofu config (Phase 2)
  - Handles secrets (Phase 4)
  - Updates DNS (Phase 5)
  - Commits all changes to git
  - Runs tofu apply to deploy VM
  - Waits for bootstrap to complete (Phase 3)
  - Prints success message with IP and SSH command
- Add --dry-run flag to preview changes
- Add --interactive mode vs --batch mode
- Error handling and rollback on failures
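One way to structure the "error handling and rollback" task is to model each orchestration step as a do/undo pair and unwind completed steps when a later one fails. The sketch below shows only that pattern; `run_pipeline` and the step tuple shape are hypothetical, and the planned script is a shell script rather than Python.

```python
def run_pipeline(steps):
    """Run (name, do, undo) steps in order; on failure, undo completed steps in reverse."""
    done = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            print(f"step {name!r} failed: {exc}; rolling back")
            for _prev_name, prev_undo in reversed(done):
                if prev_undo is not None:  # some steps have nothing to undo
                    prev_undo()
            return False
        done.append((name, undo))
    return True
```

A step such as "generate host config" would pair with an undo that removes the generated files, while read-only steps such as input validation can pass `None` as their undo.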
Deliverable: ./scripts/create-host.sh --hostname myhost --ip 10.69.13.50 creates a fully working host
Phase 7: Testing & Documentation
Tasks:
- Test full pipeline end-to-end
- Create test host and verify all steps
- Document the new workflow in CLAUDE.md
- Add troubleshooting section
- Create examples for common scenarios (DHCP host, static IP host, etc.)
Open Questions
- Bootstrap method: Cloud-init runcmd vs Terraform provisioner vs Ansible? (Resolved in Phase 3: systemd oneshot service that runs after cloud-init)
- Secrets handling: Pre-generate keys vs post-deployment injection?
- DNS automation: Auto-commit or manual merge?
- Git workflow: Auto-push changes or leave for user review?
- Template selection: Single template2 or multiple templates for different host types?
- Networking: Always DHCP initially, or support static IP from start?
- Error recovery: What happens if bootstrap fails? Manual intervention or retry?
Implementation Order
Recommended sequence:
- Phase 1: Parameterize OpenTofu (foundation for testing)
- Phase 3: Bootstrap mechanism (core automation)
- Phase 2: Config generator (automate the boilerplate)
- Phase 4: Secrets (solves biggest chicken-and-egg)
- Phase 5: DNS (nice-to-have automation)
- Phase 6: Integration script (ties it all together)
- Phase 7: Testing & docs
Success Criteria
When complete, creating a new host should:
- Take < 5 minutes of human time
- Require minimal user input (hostname, IP, basic specs)
- Result in a fully configured, secret-enabled, DNS-registered host
- Be reproducible and documented
- Handle common errors gracefully
Notes
- Keep incremental commits at each phase
- Test each phase independently before moving to next
- Maintain backward compatibility with manual workflow
- Document any manual steps that can't be automated