# TODO: Automated Host Deployment Pipeline

## Vision

Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.

Desired workflow:

```bash
./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go
```
Current manual workflow (from CLAUDE.md):

- Create the `/hosts/<hostname>/` directory structure
- Add the host to `flake.nix`
- Add DNS entries
- Clone the template VM manually
- Run `prepare-host.sh` on the new VM
- Add the generated age key to `.sops.yaml`
- Configure networking
- Commit and push
- Run `nixos-rebuild boot --flake URL#<hostname>` on the host
## The Plan
### Phase 1: Parameterized OpenTofu Deployments ✅ COMPLETED

**Status:** Fully implemented and tested

**Implementation:**

- Locals-based structure using the `for_each` pattern for multiple VM deployments
- All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
- Automatic DHCP vs. static IP detection based on the presence of the `ip` field
- Dynamic outputs showing deployed VM IPs and specifications
- Successfully tested deploying multiple VMs simultaneously

Tasks:

- Create module/template structure in terraform for repeatable VM deployments
- Parameterize VM configuration (hostname, CPU, memory, disk, IP)
- Support both DHCP and static IP configuration via cloud-init
- Test deploying multiple VMs from the same template

**Deliverable:** ✅ Can deploy multiple VMs with custom parameters via OpenTofu in a single `tofu apply`

Files:

- `terraform/vms.tf` - VM definitions using the locals map
- `terraform/outputs.tf` - Dynamic outputs for all VMs
- `terraform/variables.tf` - Configurable defaults
- `terraform/README.md` - Complete documentation
### Phase 2: Host Configuration Generator ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01
**Enhanced:** 2025-02-01 (added `--force` flag)

**Goal:** Automate creation of host configuration files

**Implementation:**

- Python CLI tool packaged as a Nix derivation (see the sketch after this list)
- Available as the `create-host` command in the devShell
- Rich terminal UI with configuration previews
- Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
- Jinja2 templates for NixOS configurations
- Automatic updates to flake.nix and terraform/vms.tf
- `--force` flag for regenerating existing configurations (useful for testing)
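The packaging detail is worth a sketch. Roughly, the derivation wraps the typer application with `buildPythonApplication` and ships the Jinja2 templates next to it. This is illustrative only; the actual derivation lives in `scripts/create-host/`, and the version, layout, and template path here are assumptions:

```nix
# Hypothetical sketch of packaging the create-host CLI as a Nix derivation.
{ python3Packages }:

python3Packages.buildPythonApplication {
  pname = "create-host";
  version = "0.1.0"; # assumed version
  src = ./.;
  format = "pyproject";

  nativeBuildInputs = [ python3Packages.setuptools ];

  # Runtime dependencies named in this document: typer (CLI framework),
  # rich (terminal UI), jinja2 (configuration templates).
  propagatedBuildInputs = with python3Packages; [ typer rich jinja2 ];

  # Ship the templates alongside the package so the tool can locate them
  # at a stable store path (hypothetical layout).
  postInstall = ''
    mkdir -p $out/share/create-host
    cp -r templates $out/share/create-host/
  '';
}
```

Adding the resulting package to the devShell's package list is what makes `create-host` available inside `nix develop`.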
Tasks:

- Create a Python CLI with the typer framework
  - Takes parameters: hostname, IP, CPU cores, memory, disk size
  - Generates the `/hosts/<hostname>/` directory structure
  - Creates `configuration.nix` with the proper hostname and networking
  - Generates `default.nix` with standard imports
  - References the shared `hardware-configuration.nix` from the template
- Add the host entry to `flake.nix` programmatically
  - Text-based manipulation (regex insertion)
  - Inserts the new nixosConfiguration entry
  - Maintains proper formatting
- Generate the corresponding OpenTofu configuration
  - Adds the VM definition to `terraform/vms.tf`
  - Uses parameters from CLI input
  - Supports both static IP and DHCP modes
- Package as a Nix derivation with templates
- Add to flake packages and devShell
- Implement a dry-run mode
- Write a comprehensive README
Usage:

```bash
# In the nix develop shell. All flags after --hostname are optional:
# --ip (omit for DHCP), --cpu (default 2), --memory (default 2048),
# --disk (default 20G), --dry-run (preview mode).
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \
  --cpu 4 \
  --memory 4096 \
  --disk 50G \
  --dry-run
```
Files:

- `scripts/create-host/` - Complete Python package with Nix derivation
- `scripts/create-host/README.md` - Full documentation and examples

**Deliverable:** ✅ Tool generates all config files for a new host, validated with Nix and Terraform
### Phase 3: Bootstrap Mechanism ✅ COMPLETED

**Status:** ✅ Fully implemented and tested
**Completed:** 2025-02-01
**Enhanced:** 2025-02-01 (added branch support for testing)

**Goal:** Get a freshly deployed VM to apply its specific host configuration

**Implementation:** Systemd oneshot service that runs on first boot after cloud-init

Approach taken: systemd service (a variant of Option A), sketched after this list:

- The systemd service `nixos-bootstrap.service` runs on first boot
- Depends on `cloud-config.service` to ensure the hostname is set
- Reads the hostname from `hostnamectl` (set by cloud-init via Terraform)
- Supports a custom git branch via the `NIXOS_FLAKE_BRANCH` environment variable
- Runs `nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#${hostname}`
- Reboots into the new configuration on success
- Fails gracefully without rebooting on errors (network issues, missing config)
- The service self-destructs after a successful bootstrap (it is not part of the new configuration)
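Condensed, the service has roughly the following shape. This is a hedged sketch, not the actual `hosts/template2/bootstrap.nix`; ordering dependencies and error handling are simplified:

```nix
{ config, lib, pkgs, ... }:

{
  systemd.services.nixos-bootstrap = {
    description = "Apply host-specific NixOS configuration on first boot";
    # The hostname must be set by cloud-init before the flake attribute
    # can be resolved, hence the dependency on cloud-config.service.
    after = [ "network-online.target" "cloud-config.service" ];
    wants = [ "network-online.target" ];
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      # Cloud-init writes NIXOS_FLAKE_BRANCH to /etc/environment.
      EnvironmentFile = "-/etc/environment";
    };
    path = with pkgs; [ curl git nixos-rebuild systemd ];
    script = ''
      branch="''${NIXOS_FLAKE_BRANCH:-master}"
      host="$(hostnamectl hostname)"

      # Connectivity check: fail (without rebooting) if the forge is down.
      curl -fsS https://git.t-juice.club > /dev/null

      nixos-rebuild boot \
        --flake "git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$branch#$host"

      # Only reached on success; a failure above leaves the system up for
      # debugging. The new configuration does not include this service,
      # so it effectively self-destructs after the reboot.
      systemctl reboot
    '';
  };
}
```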
Tasks:

- Create the bootstrap service module in template2
  - systemd oneshot service with proper dependencies
  - Reads the hostname from hostnamectl (cloud-init sets it)
  - Checks network connectivity via HTTPS (curl)
  - Runs nixos-rebuild boot with the flake URL
  - Reboots on success, fails gracefully on error
- Configure the cloud-init datasource (see the sketch after this list)
  - Use the ConfigDrive datasource (Proxmox provider)
  - Add a cloud-init disk to the Terraform VMs (disks.ide.ide2.cloudinit)
  - Hostname passed via cloud-init user-data from Terraform
- Test bootstrap service execution on a fresh VM
- Handle failure cases (flake doesn't exist, network issues)
  - Clear error messages in journald
  - No reboot on failure
  - System remains accessible for debugging
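On the template side, restricting cloud-init to the ConfigDrive datasource might look like the following sketch. Option paths follow the nixpkgs cloud-init module but can vary between releases; the actual settings live in `hosts/template2/configuration.nix`:

```nix
{
  services.cloud-init = {
    enable = true;
    # Only consult the ConfigDrive that the Proxmox provider attaches as
    # an extra disk (disks.ide.ide2.cloudinit in the Terraform VM spec).
    settings.datasource_list = [ "ConfigDrive" ];
  };
}
```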
Files:

- `hosts/template2/bootstrap.nix` - Bootstrap service definition
- `hosts/template2/configuration.nix` - Cloud-init ConfigDrive datasource
- `terraform/vms.tf` - Cloud-init disk configuration

**Deliverable:** ✅ VMs automatically bootstrap and reboot into their host-specific configuration on first boot
### Phase 4: Secrets Management with HashiCorp Vault

**Challenge:** The current sops-nix approach has a chicken-and-egg problem with age keys.

Current workflow:

- VM boots and generates an age key at `/var/lib/sops-nix/key.txt`
- User runs `prepare-host.sh`, which prints the public key
- User manually adds the public key to `.sops.yaml`
- User commits and pushes
- VM can now decrypt secrets
**Selected approach:** Migrate to HashiCorp Vault for centralized secrets management

Benefits:

- Industry-standard secrets management (Vault experience transfers to work)
- Eliminates the manual age-key distribution step
- Secrets-as-code via OpenTofu (aligned with infrastructure-as-code)
- Centralized PKI management (replaces step-ca, consolidating TLS and SSH CAs)
- Automatic secret rotation capabilities
- Audit logging for all secret access
- AppRole authentication enables automated bootstrap
Architecture:

```
vault.home.2rjus.net
├─ KV Secrets Engine (replaces sops-nix)
├─ PKI Engine (replaces step-ca for TLS)
├─ SSH CA Engine (replaces step-ca SSH CA)
└─ AppRole Auth (per-host authentication)
          ↓
New hosts authenticate on first boot,
fetch secrets via the Vault API -
no manual key distribution needed
```
#### Phase 4a: Vault Server Setup

**Goal:** Deploy and configure a Vault server with auto-unseal

Tasks:

- Create the `hosts/vault01/` configuration
  - Basic NixOS configuration (hostname, networking, etc.)
  - Vault service configuration
  - Firewall rules (8200 for API, 8201 for cluster)
  - Add to flake.nix and terraform
- Implement an auto-unseal mechanism
  - Preferred: TPM-based auto-unseal if the hardware supports it
    - Use tpm2-tools to seal/unseal Vault keys
    - Systemd service to unseal on boot
  - Fallback: Shamir secret sharing with systemd automation
    - Generate 3 keys, threshold 2
    - Store 2 keys on disk (encrypted), keep 1 offline
    - Systemd service auto-unseals using the 2 keys
- Initial Vault setup
  - Initialize Vault
  - Configure the storage backend (integrated raft or file)
  - Set up root token management
  - Enable audit logging
- Deploy to infrastructure
  - Add a DNS entry for vault.home.2rjus.net
  - Deploy the VM via terraform
  - Bootstrap and verify Vault is running
**Deliverable:** Running Vault server that auto-unseals on boot
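As a rough sketch of the service side, assuming the stock nixpkgs `services.vault` module (TLS, auto-unseal wiring, and raft-specific storage stanzas omitted):

```nix
{
  services.vault = {
    enable = true;
    address = "0.0.0.0:8200";
    # "file" is the simplest of the two backends named above; integrated
    # raft is the alternative and needs its own storage configuration.
    storageBackend = "file";
    storagePath = "/var/lib/vault";
    extraConfig = ''
      ui           = true
      api_addr     = "https://vault.home.2rjus.net:8200"
      cluster_addr = "https://vault.home.2rjus.net:8201"
    '';
  };

  # 8200 for the API, 8201 for cluster traffic.
  networking.firewall.allowedTCPPorts = [ 8200 8201 ];
}
```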
#### Phase 4b: Vault-as-Code with OpenTofu

**Goal:** Manage all Vault configuration (secrets structure, policies, roles) as code

Tasks:

- Set up the Vault Terraform provider
  - Create the `terraform/vault/` directory
  - Configure the Vault provider (address, auth)
  - Store the Vault token securely (terraform.tfvars, gitignored)
- Enable and configure secrets engines
  - Enable the KV v2 secrets engine at `secret/`
  - Define the secret path structure (per-service, per-host)
  - Example: `secret/monitoring/grafana`, `secret/postgres/ha1`
- Define policies as code
  - Create policies for different service tiers
  - Principle of least privilege (hosts only read their own secrets)
  - Example: monitoring-policy allows read on `secret/monitoring/*`
- Set up AppRole authentication
  - Enable the AppRole auth backend
  - Create a role per host type (monitoring, dns, database, etc.)
  - Bind policies to roles
  - Configure TTL and token policies
- Migrate existing secrets from sops-nix
  - Create a migration script/playbook
  - Decrypt sops secrets and load them into the Vault KV store
  - Verify all secrets migrated successfully
  - Keep sops as a backup during the transition
- Implement secrets-as-code patterns
  - Secret values in gitignored terraform.tfvars
  - Or use random_password for auto-generated secrets
  - Secret structure/paths in version-controlled .tf files
Example OpenTofu:

```hcl
resource "vault_kv_secret_v2" "monitoring_grafana" {
  mount = "secret"
  name  = "monitoring/grafana"
  data_json = jsonencode({
    admin_password = var.grafana_admin_password
    smtp_password  = var.smtp_password
  })
}

resource "vault_policy" "monitoring" {
  name   = "monitoring-policy"
  policy = <<EOT
path "secret/data/monitoring/*" {
  capabilities = ["read"]
}
EOT
}

resource "vault_approle_auth_backend_role" "monitoring01" {
  backend        = "approle"
  role_name      = "monitoring01"
  token_policies = ["monitoring-policy"]
}
```
**Deliverable:** All secrets and policies managed as OpenTofu code in `terraform/vault/`
#### Phase 4c: PKI Migration (Replace step-ca)

**Goal:** Consolidate PKI infrastructure into Vault

Tasks:

- Set up Vault PKI engines
  - Create a root CA in Vault (`pki/` mount, 10-year TTL)
  - Create an intermediate CA (`pki_int/` mount, 5-year TTL)
  - Sign the intermediate with the root CA
  - Configure CRL and OCSP
- Enable ACME support
  - Enable ACME on the intermediate CA (Vault 1.14+)
  - Create a PKI role for the homelab domain
  - Set certificate TTLs and allowed domains
- Configure the SSH CA in Vault
  - Enable the SSH secrets engine (`ssh/` mount)
  - Generate SSH signing keys
  - Create roles for host and user certificates
  - Configure TTLs and allowed principals
- Migrate hosts from step-ca to Vault (see the sketch below)
  - Update system/acme.nix to use the Vault ACME endpoint
  - Change server to `https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory`
  - Test certificate issuance on one host
  - Roll out to all hosts via auto-upgrade
- Migrate SSH CA trust
  - Distribute the Vault SSH CA public key to all hosts
  - Update sshd_config to trust the Vault CA
  - Test SSH certificate authentication
- Decommission step-ca
  - Verify all services have migrated
  - Stop the step-ca service on the ca host
  - Archive the step-ca configuration as a backup
**Deliverable:** All TLS and SSH certificates issued by Vault, step-ca retired
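The client-side change in system/acme.nix is small. A sketch using the standard `security.acme` options; the email value is a placeholder, not taken from this document:

```nix
{
  security.acme = {
    acceptTerms = true;
    defaults = {
      email = "admin@2rjus.net"; # placeholder address
      # Point the ACME client at Vault's intermediate-CA ACME directory
      # instead of step-ca.
      server = "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";
    };
  };
}
```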
#### Phase 4d: Bootstrap Integration

**Goal:** New hosts automatically authenticate to Vault on first boot, with no manual steps

Tasks:

- Update the create-host tool
  - Generate an AppRole role_id + secret_id for the new host
  - Or create a wrapped token for one-time bootstrap
  - Add a host-specific policy to Vault (via terraform)
  - Store bootstrap credentials for cloud-init injection
- Update template2 for Vault authentication
  - Create a Vault authentication module that:
    - Reads bootstrap credentials from cloud-init
    - Authenticates to Vault and retrieves permanent AppRole credentials
    - Stores role_id + secret_id locally for services to use
- Create a NixOS Vault secrets module
  - Replacement for sops.secrets
  - Fetches secrets from Vault at nixos-rebuild/activation time
  - Or runtime secret fetching for services
  - Handle Vault token renewal
- Update the bootstrap service
  - After authenticating to Vault, fetch any bootstrap secrets
  - Run nixos-rebuild with the host configuration
  - Services automatically fetch their secrets from Vault
- Update terraform cloud-init
  - Inject the Vault address and bootstrap credentials
  - Pass via cloud-init user-data or write_files
  - Scope credentials to single use or a short TTL
- Test the complete flow
  - Run create-host to generate a new host config
  - Deploy with terraform
  - Verify the host bootstraps and authenticates to Vault
  - Verify services can fetch secrets
  - Confirm no manual steps are required
Bootstrap flow:

1. terraform apply (deploys the VM with cloud-init)
2. Cloud-init sets the hostname + Vault bootstrap credentials
3. nixos-bootstrap.service runs:
   - Authenticates to Vault with the bootstrap credentials
   - Retrieves permanent AppRole credentials
   - Stores them locally for service use
   - Runs nixos-rebuild
4. Host services fetch secrets from Vault as needed
5. Done - no manual intervention
**Deliverable:** Fully automated secrets access from first boot, zero manual steps
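No such module exists in NixOS today, so the secret plumbing in step 3 would need something like the sketch below: a oneshot unit that logs in with the stored AppRole credentials and materializes secrets under /run/secrets. All file paths and the Grafana consumer are illustrative:

```nix
{ config, pkgs, ... }:

{
  systemd.services.vault-secrets = {
    description = "Fetch service secrets from Vault";
    wantedBy = [ "multi-user.target" ];
    before = [ "grafana.service" ]; # example consumer
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    serviceConfig.Type = "oneshot";
    path = [ pkgs.vault ];
    script = ''
      export VAULT_ADDR=https://vault.home.2rjus.net:8200

      # Log in with the AppRole credentials stored by the bootstrap
      # service (step 3 above); the @file syntax reads values from disk.
      export VAULT_TOKEN="$(vault write -field=token auth/approle/login \
        role_id=@/var/lib/vault-bootstrap/role_id \
        secret_id=@/var/lib/vault-bootstrap/secret_id)"

      # Materialize a secret for a consuming service (illustrative path,
      # matching the KV layout from Phase 4b).
      install -d -m 0750 /run/secrets
      umask 0077
      vault kv get -field=admin_password secret/monitoring/grafana \
        > /run/secrets/grafana-admin-password
    '';
  };
}
```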
### Phase 5: DNS Automation

**Goal:** Automatically generate DNS entries from host configurations

**Approach:** Leverage Nix to generate zone-file entries from the flake's host configurations.

Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.

Tasks:

- Add an optional CNAME field to host configurations
  - Add `networking.cnames = [ "alias1" "alias2" ]` or a similar option
  - Document it in the host configuration template
- Create a Nix function to extract DNS records from all hosts (sketched below)
  - Parse each host's `networking.hostName` and IP configuration
  - Collect any defined CNAMEs
  - Generate a zone-file fragment with A and CNAME records
- Integrate auto-generated records into zone files
  - Keep manual entries separate (for non-flake hosts/services)
  - Include the generated fragment in the main zone file
  - Add comments showing which records are auto-generated
- Update the zone file serial number automatically
- Test zone file validity after generation
- Either:
  - Automatically trigger a DNS server reload (Ansible)
  - Or document the manual step: merge to master, run upgrade on ns1/ns2
**Deliverable:** DNS A records and CNAMEs automatically generated from host configs
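A sketch of the extraction function. It assumes the flake's `self.nixosConfigurations` and `lib` are in scope, that the proposed `networking.cnames` option exists, and that each host's static address sits on a known interface (here `ens18`, a typical Proxmox virtio NIC name):

```nix
let
  # Render "name IN A addr" plus any CNAME aliases for one evaluated host.
  recordsFor = _name: host:
    let
      net = host.config.networking;
      # Assumption: one static address per host on a fixed interface.
      addr = (builtins.head net.interfaces.ens18.ipv4.addresses).address;
      aRecord = "${net.hostName} IN A ${addr}";
      cnames = map (alias: "${alias} IN CNAME ${net.hostName}")
        (net.cnames or [ ]);
    in
    builtins.concatStringsSep "\n" ([ aRecord ] ++ cnames);
in
# Zone-file fragment covering every host defined in the flake.
builtins.concatStringsSep "\n"
  (lib.mapAttrsToList recordsFor self.nixosConfigurations)
```

DHCP hosts have no static address to extract, so they would be filtered out before the `mapAttrsToList` call.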
### Phase 6: Integration Script

**Goal:** Single command to create and deploy a new host

Tasks:

- Create a `scripts/create-host.sh` master script that orchestrates:
  - Prompts for: hostname, IP (or DHCP), CPU, memory, disk
  - Validates inputs (IP not in use, hostname unique, etc.)
  - Calls the host config generator (Phase 2)
  - Generates the OpenTofu config (Phase 2)
  - Handles secrets (Phase 4)
  - Updates DNS (Phase 5)
  - Commits all changes to git
  - Runs `tofu apply` to deploy the VM
  - Waits for bootstrap to complete (Phase 3)
  - Prints a success message with the IP and an SSH command
- Add a `--dry-run` flag to preview changes
- Add an `--interactive` mode vs. `--batch` mode
- Error handling and rollback on failures

**Deliverable:** `./scripts/create-host.sh --hostname myhost --ip 10.69.13.50` creates a fully working host
### Phase 7: Testing & Documentation

**Status:** 🚧 In Progress (testing improvements completed)

**Testing improvements implemented (2025-02-01):**

The pipeline now supports efficient testing without polluting the master branch:

1. `--force` flag for create-host
   - Re-run `create-host` to regenerate existing configurations
   - Updates existing entries in flake.nix and terraform/vms.tf (no duplicates)
   - Skips uniqueness validation checks
   - Useful for iterating on configuration templates during testing
2. Branch support for bootstrap
   - The bootstrap service reads the `NIXOS_FLAKE_BRANCH` environment variable
   - Defaults to `master` if not set
   - Allows testing pipeline changes on feature branches
   - Cloud-init passes the branch via `/etc/environment`
3. Cloud-init disk for branch configuration
   - Terraform generates custom cloud-init snippets for test VMs
   - Set the `flake_branch` field in a VM definition to use a non-master branch
   - Production VMs omit this field and use master (the default)
   - Files are automatically uploaded to Proxmox via SSH
Testing workflow:

```bash
# 1. Create test branch
git checkout -b test-pipeline

# 2. Generate or update host config
create-host --hostname testvm01 --ip 10.69.13.100/24

# 3. Edit terraform/vms.tf to add test VM with branch
# vms = {
#   "testvm01" = {
#     ip           = "10.69.13.100/24"
#     flake_branch = "test-pipeline"  # Bootstrap from this branch
#   }
# }

# 4. Commit and push test branch
git add -A && git commit -m "test: add testvm01"
git push origin test-pipeline

# 5. Deploy VM
cd terraform && tofu apply

# 6. Watch bootstrap (VM fetches from test-pipeline branch)
ssh root@10.69.13.100
journalctl -fu nixos-bootstrap.service

# 7. Iterate: modify templates and regenerate with --force
cd .. && create-host --hostname testvm01 --ip 10.69.13.100/24 --force
git commit -am "test: update config" && git push

# Redeploy to test fresh bootstrap
cd terraform
tofu destroy -target=proxmox_vm_qemu.vm[\"testvm01\"] && tofu apply

# 8. Clean up when done: squash commits, merge to master, remove test VM
```
Files:

- `scripts/create-host/create_host.py` - Added the --force parameter
- `scripts/create-host/manipulators.py` - Update vs. insert logic
- `hosts/template2/bootstrap.nix` - Branch support via environment variable
- `terraform/vms.tf` - flake_branch field support
- `terraform/cloud-init.tf` - Custom cloud-init disk generation
- `terraform/variables.tf` - proxmox_host variable for SSH uploads
Remaining tasks:

- Test the full pipeline end-to-end on a feature branch
- Update CLAUDE.md with the testing workflow
- Add a troubleshooting section
- Create examples for common scenarios (DHCP host, static IP host, etc.)
## Open Questions

- Bootstrap method: cloud-init runcmd vs. Terraform provisioner vs. Ansible?
- Secrets handling: pre-generate keys vs. post-deployment injection?
- DNS automation: auto-commit or manual merge?
- Git workflow: auto-push changes or leave them for user review?
- Template selection: a single template2 or multiple templates for different host types?
- Networking: always DHCP initially, or support static IP from the start?
- Error recovery: what happens if bootstrap fails? Manual intervention or retry?
## Implementation Order

Recommended sequence:

1. Phase 1: Parameterize OpenTofu (foundation for testing)
2. Phase 3: Bootstrap mechanism (core automation)
3. Phase 2: Config generator (automate the boilerplate)
4. Phase 4: Secrets (solves the biggest chicken-and-egg problem)
5. Phase 5: DNS (nice-to-have automation)
6. Phase 6: Integration script (ties it all together)
7. Phase 7: Testing & docs
## Success Criteria

When complete, creating a new host should:

- Take < 5 minutes of human time
- Require minimal user input (hostname, IP, basic specs)
- Result in a fully configured, secret-enabled, DNS-registered host
- Be reproducible and documented
- Handle common errors gracefully
## Notes

- Keep commits incremental at each phase
- Test each phase independently before moving to the next
- Maintain backward compatibility with the manual workflow
- Document any manual steps that can't be automated