
TODO: Automated Host Deployment Pipeline

Vision

Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.

Desired workflow:

./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go

Current manual workflow (from CLAUDE.md):

  1. Create /hosts/<hostname>/ directory structure
  2. Add host to flake.nix
  3. Add DNS entries
  4. Clone template VM manually
  5. Run prepare-host.sh on new VM
  6. Add generated age key to .sops.yaml
  7. Configure networking
  8. Commit and push
  9. Run nixos-rebuild boot --flake URL#<hostname> on host

The Plan

Phase 1: Parameterized OpenTofu Deployments COMPLETED

Status: Fully implemented and tested

Implementation:

  • Locals-based structure using for_each pattern for multiple VM deployments
  • All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
  • Automatic DHCP vs static IP detection based on ip field presence
  • Dynamic outputs showing deployed VM IPs and specifications
  • Successfully tested deploying multiple VMs simultaneously

Tasks:

  • Create module/template structure in terraform for repeatable VM deployments
  • Parameterize VM configuration (hostname, CPU, memory, disk, IP)
  • Support both DHCP and static IP configuration via cloud-init
  • Test deploying multiple VMs from same template

Deliverable: Can deploy multiple VMs with custom parameters via OpenTofu in a single tofu apply
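
With the locals map in terraform/vms.tf, that deliverable is the normal plan/apply cycle, for example:

cd terraform
tofu plan      # preview every VM declared in the locals map
tofu apply     # create or update all declared VMs in one run
tofu output    # dynamic outputs: per-VM IPs and specifications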

Files:

  • terraform/vms.tf - VM definitions using locals map
  • terraform/outputs.tf - Dynamic outputs for all VMs
  • terraform/variables.tf - Configurable defaults
  • terraform/README.md - Complete documentation

Phase 2: Host Configuration Generator COMPLETED

Status: Fully implemented and tested. Completed: 2025-02-01. Enhanced: 2025-02-01 (added --force flag).

Goal: Automate creation of host configuration files

Implementation:

  • Python CLI tool packaged as Nix derivation
  • Available as create-host command in devShell
  • Rich terminal UI with configuration previews
  • Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
  • Jinja2 templates for NixOS configurations
  • Automatic updates to flake.nix and terraform/vms.tf
  • --force flag for regenerating existing configurations (useful for testing)

Tasks:

  • Create Python CLI with typer framework
    • Takes parameters: hostname, IP, CPU cores, memory, disk size
    • Generates /hosts/<hostname>/ directory structure
    • Creates configuration.nix with proper hostname and networking
    • Generates default.nix with standard imports
    • References shared hardware-configuration.nix from template
  • Add host entry to flake.nix programmatically
    • Text-based manipulation (regex insertion)
    • Inserts new nixosConfiguration entry
    • Maintains proper formatting
  • Generate corresponding OpenTofu configuration
    • Adds VM definition to terraform/vms.tf
    • Uses parameters from CLI input
    • Supports both static IP and DHCP modes
  • Package as Nix derivation with templates
  • Add to flake packages and devShell
  • Implement dry-run mode
  • Write comprehensive README

Usage:

# In nix develop shell. All flags except --hostname are optional:
# --ip (omit for DHCP), --cpu (default 2), --memory (default 2048),
# --disk (default 20G), --dry-run (preview mode, writes nothing).
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \
  --cpu 4 \
  --memory 4096 \
  --disk 50G \
  --dry-run

Files:

  • scripts/create-host/ - Complete Python package with Nix derivation
  • scripts/create-host/README.md - Full documentation and examples

Deliverable: Tool generates all config files for a new host, validated with Nix and Terraform


Phase 3: Bootstrap Mechanism COMPLETED

Status: Fully implemented and tested. Completed: 2025-02-01. Enhanced: 2025-02-01 (added branch support for testing).

Goal: Get freshly deployed VM to apply its specific host configuration

Implementation: Systemd oneshot service that runs on first boot after cloud-init

Approach taken: Systemd service (variant of Option A)

  • Systemd service nixos-bootstrap.service runs on first boot
  • Depends on cloud-config.service to ensure hostname is set
  • Reads hostname from hostnamectl (set by cloud-init via Terraform)
  • Supports custom git branch via NIXOS_FLAKE_BRANCH environment variable
  • Runs nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#${hostname}
  • Reboots into new configuration on success
  • Fails gracefully without reboot on errors (network issues, missing config)
  • Service self-destructs after successful bootstrap (not in new config)
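
In shell terms, the service's core logic is roughly the following (a condensed sketch; journald error reporting and the self-removal step are omitted):

#!/usr/bin/env bash
set -euo pipefail

branch="${NIXOS_FLAKE_BRANCH:-master}"   # branch override for testing
hostname="$(hostnamectl --static)"       # set by cloud-init via Terraform

# Connectivity check: abort (without rebooting) if the git server is unreachable
curl --fail --silent --head https://git.t-juice.club >/dev/null

nixos-rebuild boot \
  --flake "git+https://git.t-juice.club/torjus/nixos-servers.git?ref=${branch}#${hostname}"
systemctl reboot   # only reached on success thanks to set -e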

Tasks:

  • Create bootstrap service module in template2
    • systemd oneshot service with proper dependencies
    • Reads hostname from hostnamectl (cloud-init sets it)
    • Checks network connectivity via HTTPS (curl)
    • Runs nixos-rebuild boot with flake URL
    • Reboots on success, fails gracefully on error
  • Configure cloud-init datasource
    • Use ConfigDrive datasource (Proxmox provider)
    • Add cloud-init disk to Terraform VMs (disks.ide.ide2.cloudinit)
    • Hostname passed via cloud-init user-data from Terraform
  • Test bootstrap service execution on fresh VM
  • Handle failure cases (flake doesn't exist, network issues)
    • Clear error messages in journald
    • No reboot on failure
    • System remains accessible for debugging

Files:

  • hosts/template2/bootstrap.nix - Bootstrap service definition
  • hosts/template2/configuration.nix - Cloud-init ConfigDrive datasource
  • terraform/vms.tf - Cloud-init disk configuration

Deliverable: VMs automatically bootstrap and reboot into host-specific configuration on first boot


Phase 4: Secrets Management with OpenBao (Vault)

Status: 🚧 Phases 4a, 4b, and 4d complete; Phase 4c partially complete

Challenge: The current sops-nix approach has a chicken-and-egg problem with age keys

Current workflow:

  1. VM boots, generates age key at /var/lib/sops-nix/key.txt
  2. User runs prepare-host.sh which prints public key
  3. User manually adds public key to .sops.yaml
  4. User commits, pushes
  5. VM can now decrypt secrets

Selected approach: Migrate to OpenBao (Vault fork) for centralized secrets management

Why OpenBao instead of HashiCorp Vault:

  • HashiCorp Vault switched to the BSL (Business Source License), so it is no longer available in the NixOS binary cache
  • OpenBao is the community fork maintaining the pre-BSL MPL 2.0 license
  • API-compatible with Vault, uses same Terraform provider
  • Maintains all Vault features we need

Benefits:

  • Industry-standard secrets management (Vault-compatible experience)
  • Eliminates manual age key distribution step
  • Secrets-as-code via OpenTofu (infrastructure-as-code aligned)
  • Centralized PKI management with ACME support (ready to replace step-ca)
  • Automatic secret rotation capabilities
  • Audit logging for all secret access (not yet enabled)
  • AppRole authentication enables automated bootstrap

Current Architecture:

vault01.home.2rjus.net (10.69.13.19)
  ├─ KV Secrets Engine (ready to replace sops-nix)
  │   ├─ secret/hosts/{hostname}/*
  │   ├─ secret/services/{service}/*
  │   └─ secret/shared/{category}/*
  ├─ PKI Engine (ready to replace step-ca for TLS)
  │   ├─ Root CA (EC P-384, 10 year)
  │   ├─ Intermediate CA (EC P-384, 5 year)
  │   └─ ACME endpoint enabled
  ├─ SSH CA Engine (TODO: Phase 4c)
  └─ AppRole Auth (per-host authentication configured)
       ↓
   [✅ Phase 4d] New hosts authenticate on first boot
   [✅ Phase 4d] Fetch secrets via Vault API
   No manual key distribution needed

Completed:

  • Phase 4a: OpenBao server with TPM2 auto-unseal
  • Phase 4b: Infrastructure-as-code (secrets, policies, AppRoles, PKI)
  • Phase 4d: Bootstrap integration for automated secrets access

Next Steps:

  • Phase 4c: Migrate from step-ca to OpenBao PKI

Phase 4a: Vault Server Setup COMPLETED

Status: Fully implemented and tested. Completed: 2026-02-02.

Goal: Deploy and configure Vault server with auto-unseal

Implementation:

  • Used OpenBao (Vault fork) instead of HashiCorp Vault due to BSL licensing concerns
  • TPM2-based auto-unseal using systemd's native LoadCredentialEncrypted
  • Self-signed bootstrap TLS certificates (avoiding circular dependency with step-ca)
  • File-based storage backend at /var/lib/openbao
  • Unix socket + TCP listener (0.0.0.0:8200) configuration

Tasks:

  • Create hosts/vault01/ configuration
    • Basic NixOS configuration (hostname: vault01, IP: 10.69.13.19/24)
    • Created reusable services/vault module
    • Firewall not needed (trusted network)
    • Already in flake.nix, deployed via terraform
  • Implement auto-unseal mechanism
    • TPM2-based auto-unseal (preferred option)
      • systemd LoadCredentialEncrypted with TPM2 binding
      • writeShellApplication script with proper runtime dependencies
      • Reads multiple unseal keys (one per line) until unsealed
      • Auto-unseals on service start via ExecStartPost
  • Initial Vault setup
    • Initialized OpenBao with Shamir secret sharing (5 keys, threshold 3)
    • File storage backend
    • Self-signed TLS certificates via LoadCredential
  • Deploy to infrastructure
    • DNS entry added for vault01.home.2rjus.net
    • VM deployed via terraform
    • Verified OpenBao running and auto-unsealing

Changes from Original Plan:

  • Used OpenBao instead of HashiCorp Vault (licensing)
  • Used systemd's native TPM2 support instead of tpm2-tools directly
  • Skipped audit logging (can be enabled later)
  • Used self-signed certs initially (will migrate to OpenBao PKI later)

Deliverable: Running OpenBao server that auto-unseals on boot using TPM2
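
The unseal helper itself is small. A minimal sketch, assuming the systemd-decrypted credential holds one unseal key per line (socket path and credential name are illustrative; writeShellApplication supplies bao and jq):

#!/usr/bin/env bash
set -euo pipefail
export BAO_ADDR="unix:///run/openbao/openbao.sock"   # Unix socket listener, no TLS

# $CREDENTIALS_DIRECTORY is provided by systemd (LoadCredentialEncrypted)
while IFS= read -r key; do
  bao operator unseal "$key" >/dev/null
  if [ "$(bao status -format=json | jq -r '.sealed')" = "false" ]; then
    break   # threshold reached (3 of 5 keys)
  fi
done < "${CREDENTIALS_DIRECTORY}/unseal-keys"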

Documentation:

  • /services/vault/README.md - Service module overview
  • /docs/vault/auto-unseal.md - Complete TPM2 auto-unseal setup guide

Phase 4b: Vault-as-Code with OpenTofu COMPLETED

Status: Fully implemented and tested. Completed: 2026-02-02.

Goal: Manage all Vault configuration (secrets structure, policies, roles) as code

Implementation:

  • Complete Terraform/OpenTofu configuration in terraform/vault/
  • Locals-based pattern (similar to vms.tf) for declaring secrets and policies
  • Auto-generation of secrets using random_password provider
  • Three-tier secrets path hierarchy: hosts/, services/, shared/
  • PKI infrastructure with Elliptic Curve certificates (P-384 for CAs, P-256 for leaf certs)
  • ACME support enabled on intermediate CA

Tasks:

  • Set up Vault Terraform provider
    • Created terraform/vault/ directory
    • Configured Vault provider (uses HashiCorp provider, compatible with OpenBao)
    • Credentials in terraform.tfvars (gitignored)
    • terraform.tfvars.example for reference
  • Enable and configure secrets engines
    • KV v2 engine at secret/
    • Three-tier path structure:
      • secret/hosts/{hostname}/* - Host-specific secrets
      • secret/services/{service}/* - Service-wide secrets
      • secret/shared/{category}/* - Shared secrets (SMTP, backups, etc.)
  • Define policies as code
    • Policies auto-generated from locals.host_policies
    • Per-host policies with read/list on designated paths
    • Principle of least privilege enforced
  • Set up AppRole authentication
    • AppRole backend enabled at approle/
    • Roles auto-generated per host from locals.host_policies
    • Token TTL: 1 hour, max 24 hours
    • Policies bound to roles
  • Implement secrets-as-code patterns
    • Auto-generated secrets using random_password provider
    • Manual secrets supported via variables in terraform.tfvars
    • Secret structure versioned in .tf files
    • Secret values excluded from git
  • Set up PKI infrastructure
    • Root CA (10 year TTL, EC P-384)
    • Intermediate CA (5 year TTL, EC P-384)
    • PKI role for *.home.2rjus.net (30 day max TTL, EC P-256)
    • ACME enabled on intermediate CA
    • Support for static certificate issuance via Terraform
    • CRL, OCSP, and issuing certificate URLs configured

Changes from Original Plan:

  • Used Elliptic Curve instead of RSA for all certificates (better performance, smaller keys)
  • Implemented PKI infrastructure in Phase 4b instead of Phase 4c (more logical grouping)
  • ACME support configured immediately (ready for migration from step-ca)
  • Did not migrate existing sops-nix secrets yet (deferred to gradual migration)

Files:

  • terraform/vault/main.tf - Provider configuration
  • terraform/vault/variables.tf - Variable definitions
  • terraform/vault/approle.tf - AppRole authentication (locals-based pattern)
  • terraform/vault/pki.tf - PKI infrastructure with EC certificates
  • terraform/vault/secrets.tf - KV secrets engine (auto-generation support)
  • terraform/vault/README.md - Complete documentation and usage examples
  • terraform/vault/terraform.tfvars.example - Example credentials

Deliverable: All secrets, policies, AppRoles, and PKI managed as OpenTofu code in terraform/vault/
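
For orientation, the rough CLI equivalent of what the OpenTofu code provisions for one host (hostname myhost and the policy paths are illustrative; OpenBao's bao CLI mirrors Vault's):

bao secrets enable -path=secret kv-v2
bao policy write host-myhost - <<'EOF'
# KV v2 reads go through the data/ prefix
path "secret/data/hosts/myhost/*" { capabilities = ["read", "list"] }
path "secret/data/shared/*"       { capabilities = ["read", "list"] }
EOF
bao auth enable approle
bao write auth/approle/role/myhost \
  token_policies="host-myhost" token_ttl=1h token_max_ttl=24h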

Documentation:

  • /terraform/vault/README.md - Comprehensive guide covering:
    • Setup and deployment
    • AppRole usage and host access patterns
    • PKI certificate issuance (ACME, static, manual)
    • Secrets management patterns
    • ACME configuration and troubleshooting

Phase 4c: PKI Migration (Replace step-ca)

Status: 🚧 Partially complete: vault01 and a test host migrated; remaining hosts pending

Goal: Migrate hosts from step-ca to OpenBao PKI for TLS certificates

Note: PKI infrastructure already set up in Phase 4b (root CA, intermediate CA, ACME support)

Tasks:

  • Set up OpenBao PKI engines (completed in Phase 4b)
    • Root CA (pki/ mount, 10 year TTL, EC P-384)
    • Intermediate CA (pki_int/ mount, 5 year TTL, EC P-384)
    • Signed intermediate with root CA
    • Configured CRL, OCSP, and issuing certificate URLs
  • Enable ACME support (completed in Phase 4b, fixed in Phase 4c)
    • Enabled ACME on intermediate CA
    • Created PKI role for *.home.2rjus.net
    • Set certificate TTLs (30 day max) and allowed domains
    • ACME directory: https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory
    • Fixed ACME response headers (added Replay-Nonce, Link, Location to allowed_response_headers)
    • Configured cluster path for ACME
  • Download and distribute root CA certificate
    • Added root CA to system/pki/root-ca.nix
    • Distributed to all hosts via system imports
  • Test certificate issuance
    • Tested ACME issuance on vaulttest01 successfully
    • Verified certificate chain and trust
  • Migrate vault01's own certificate
    • Created bootstrap-vault-cert script for initial certificate issuance via bao CLI
    • Issued certificate with SANs (vault01.home.2rjus.net + vault.home.2rjus.net)
    • Updated service to read certificates from /var/lib/acme/vault01.home.2rjus.net/
    • Configured ACME for automatic renewals
  • Migrate hosts from step-ca to OpenBao
    • Tested on vaulttest01 (non-production host)
    • Standardize hostname usage across all configurations
      • Use vault.home.2rjus.net (CNAME) consistently everywhere
      • Update NixOS configurations to use CNAME instead of vault01
      • Update Terraform configurations to use CNAME
      • Audit and fix mixed usage of vault01.home.2rjus.net vs vault.home.2rjus.net
    • Update system/acme.nix to use OpenBao ACME endpoint
    • Change server to https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
    • Roll out to all hosts via auto-upgrade
  • Configure SSH CA in OpenBao (optional, future work)
    • Enable SSH secrets engine (ssh/ mount)
    • Generate SSH signing keys
    • Create roles for host and user certificates
    • Configure TTLs and allowed principals
    • Distribute SSH CA public key to all hosts
    • Update sshd_config to trust OpenBao CA
  • Decommission step-ca
    • Verify all ACME services migrated and working
    • Stop step-ca service on ca host
    • Archive step-ca configuration for backup
    • Update documentation

Implementation Details (2026-02-03):

ACME Configuration Fix: The key blocker was that OpenBao's PKI mount was filtering out required ACME response headers. The solution was to add allowed_response_headers to the Terraform mount configuration:

allowed_response_headers = [
  "Replay-Nonce",  # Required for ACME nonce generation
  "Link",          # Required for ACME navigation
  "Location"       # Required for ACME resource location
]

Cluster Path Configuration: ACME requires the cluster path to include the full API path:

path     = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
aia_path = "${var.vault_address}/v1/${vault_mount.pki_int.path}"
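
With both fixes applied, the endpoint can be verified from any host; a HEAD request against the standard ACME new-nonce resource should now carry the nonce header:

curl -sI https://vault01.home.2rjus.net:8200/v1/pki_int/acme/new-nonce \
  | grep -i 'replay-nonce'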

Bootstrap Process: Since vault01 needed a certificate from its own PKI (chicken-and-egg problem), we created a bootstrap-vault-cert script that:

  1. Uses the Unix socket (no TLS) to issue a certificate via bao CLI
  2. Places it in the ACME directory structure
  3. Includes both vault01.home.2rjus.net and vault.home.2rjus.net as SANs
  4. After restart, ACME manages renewals automatically
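
A minimal sketch of that script, assuming the bao CLI accepts the unix:// address and a PKI role exists (role name home-2rjus-net and socket path are hypothetical; file names follow the NixOS ACME layout):

#!/usr/bin/env bash
set -euo pipefail
export BAO_ADDR="unix:///run/openbao/openbao.sock"   # no TLS on the socket

dir=/var/lib/acme/vault01.home.2rjus.net
resp="$(bao write -format=json pki_int/issue/home-2rjus-net \
          common_name=vault01.home.2rjus.net \
          alt_names=vault.home.2rjus.net)"
jq -r '.data.private_key'                   <<<"$resp" > "$dir/key.pem"
jq -r '.data.certificate, .data.ca_chain[]' <<<"$resp" > "$dir/fullchain.pem"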

Files Modified:

  • terraform/vault/pki.tf - Added allowed_response_headers, cluster config, ACME config
  • services/vault/default.nix - Updated cert paths, added bootstrap script, configured ACME
  • system/pki/root-ca.nix - Added OpenBao root CA to trust store
  • hosts/vaulttest01/configuration.nix - Overrode ACME server for testing

Deliverable: vault01 and vaulttest01 using OpenBao PKI, remaining hosts still on step-ca


Phase 4d: Bootstrap Integration COMPLETED (2026-02-02)

Goal: New hosts automatically authenticate to Vault on first boot, no manual steps

Tasks:

  • Update create-host tool
    • Generate wrapped token (24h TTL, single-use) for new host
    • Add host-specific policy to Vault (via terraform/vault/hosts-generated.tf)
    • Store wrapped token in terraform/vms.tf for cloud-init injection
    • Add --regenerate-token flag to regenerate only the token without overwriting config
  • Update template2 for Vault authentication
    • Reads wrapped token from cloud-init (/run/cloud-init-env)
    • Unwraps token to get role_id + secret_id
    • Stores AppRole credentials in /var/lib/vault/approle/ (persistent)
    • Graceful fallback if Vault unavailable during bootstrap
  • Create NixOS Vault secrets module (system/vault-secrets.nix)
    • Runtime secret fetching (services fetch on start, not at nixos-rebuild time)
    • Secrets cached in /var/lib/vault/cache/ for fallback when Vault unreachable
    • Secrets written to /run/secrets/ (tmpfs, cleared on reboot)
    • Fresh authentication per service start (no token renewal needed)
    • Optional periodic rotation with systemd timers
    • Critical service protection (no auto-restart for DNS, CA, Vault itself)
  • Create vault-fetch helper script
    • Standalone tool for fetching secrets from Vault
    • Authenticates using AppRole credentials
    • Writes individual files per secret key
    • Handles caching and fallback logic
  • Update bootstrap service (hosts/template2/bootstrap.nix)
    • Unwraps Vault token on first boot
    • Stores persistent AppRole credentials
    • Continues with nixos-rebuild
    • Services fetch secrets when they start
  • Update terraform cloud-init (terraform/cloud-init.tf)
    • Inject VAULT_ADDR and VAULT_WRAPPED_TOKEN via write_files
    • Write to /run/cloud-init-env (tmpfs, cleaned on reboot)
    • Fixed YAML indentation issues (write_files at top level)
    • Support flake_branch alongside vault credentials
  • Test complete flow
    • Created vaulttest01 test host
    • Verified bootstrap with Vault integration
    • Verified service secret fetching
    • Tested cache fallback when Vault unreachable
    • Tested wrapped token single-use (second bootstrap fails as expected)
    • Confirmed zero manual steps required

Implementation Details:

Wrapped Token Security:

  • Single-use tokens prevent reuse if leaked
  • 24h TTL limits exposure window
  • Safe to commit to git (expired/used tokens useless)
  • Regenerate with create-host --hostname X --regenerate-token

Secret Fetching:

  • Runtime (not build-time) keeps secrets out of Nix store
  • Cache fallback enables service availability when Vault down
  • Fresh authentication per service start (no renewal complexity)
  • Individual files per secret key for easy consumption
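
Put together, the first-boot unwrap and a later service-start fetch look roughly like this (file names under /var/lib/vault/approle/ and the secret path hosts/myhost/db are hypothetical; /run/cloud-init-env is assumed to hold KEY=value lines):

# First boot (nixos-bootstrap.service): single-use unwrap of the injected token
source /run/cloud-init-env   # provides VAULT_ADDR and VAULT_WRAPPED_TOKEN
export BAO_ADDR="$VAULT_ADDR"
resp="$(bao unwrap -format=json "$VAULT_WRAPPED_TOKEN")"
install -d -m 0700 /var/lib/vault/approle
jq -r '.data.role_id'   <<<"$resp" > /var/lib/vault/approle/role_id
jq -r '.data.secret_id' <<<"$resp" > /var/lib/vault/approle/secret_id

# Service start (vault-fetch): fresh AppRole login, then read one secret key
token="$(bao write -field=token auth/approle/login \
           role_id="$(cat /var/lib/vault/approle/role_id)" \
           secret_id="$(cat /var/lib/vault/approle/secret_id)")"
BAO_TOKEN="$token" bao kv get -mount=secret -field=password hosts/myhost/db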

Bootstrap Flow:

1. create-host --hostname myhost --ip 10.69.13.x/24
   ↓ Generates wrapped token, updates terraform
2. tofu apply (deploys VM with cloud-init)
   ↓ Cloud-init writes wrapped token to /run/cloud-init-env
3. nixos-bootstrap.service runs:
   ↓ Unwraps token → gets role_id + secret_id
   ↓ Stores in /var/lib/vault/approle/ (persistent)
   ↓ Runs nixos-rebuild boot
4. Service starts → fetches secrets from Vault
   ↓ Uses stored AppRole credentials
   ↓ Caches secrets for fallback
5. Done - zero manual intervention

Files Created:

  • scripts/vault-fetch/ - Secret fetching helper (Nix package)
  • system/vault-secrets.nix - NixOS module for declarative Vault secrets
  • scripts/create-host/vault_helper.py - Vault API integration
  • terraform/vault/hosts-generated.tf - Auto-generated host policies
  • docs/vault-bootstrap-implementation.md - Architecture documentation
  • docs/vault-bootstrap-testing.md - Testing guide

Configuration:

  • Vault address: https://vault01.home.2rjus.net:8200 (configurable)
  • All defaults remain configurable via environment variables or NixOS options

Next Steps:

  • Gradually migrate existing services from sops-nix to Vault
  • Add CNAME for vault.home.2rjus.net → vault01.home.2rjus.net
  • Phase 4c: Migrate remaining hosts from step-ca to OpenBao PKI (in progress; see Phase 4c above)

Deliverable: Fully automated secrets access from first boot, zero manual steps


Phase 6: Integration Script

Goal: Single command to create and deploy a new host

Tasks:

  • Create scripts/create-host.sh master script that orchestrates:
    1. Prompts for: hostname, IP (or DHCP), CPU, memory, disk
    2. Validates inputs (IP not in use, hostname unique, etc.)
    3. Calls host config generator (Phase 2)
    4. Generates OpenTofu config (Phase 2)
    5. Handles secrets (Phase 4)
    6. Updates DNS (Phase 5)
    7. Commits all changes to git
    8. Runs tofu apply to deploy VM
    9. Waits for bootstrap to complete (Phase 3)
    10. Prints success message with IP and SSH command
  • Add --dry-run flag to preview changes
  • Add --interactive mode vs --batch mode
  • Error handling and rollback on failures
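
A skeleton of the orchestration, with step numbers from the list above (validation, secrets, DNS, and rollback omitted; the bootstrap-wait probe is a naive placeholder):

#!/usr/bin/env bash
set -euo pipefail
hostname="$1" ip="$2"                                # (1) inputs

create-host --hostname "$hostname" --ip "$ip"        # (3)+(4) generate configs
git add -A && git commit -m "hosts: add $hostname"   # (7) commit...
git push                                             #     ...and push
(cd terraform && tofu apply)                         # (8) deploy the VM

addr="${ip%%/*}"
until ssh -o ConnectTimeout=5 "root@$addr" true 2>/dev/null; do
  sleep 15                                           # (9) wait for bootstrap + reboot
done
echo "Host ready: ssh root@$addr"                    # (10)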

Deliverable: ./scripts/create-host.sh --hostname myhost --ip 10.69.13.50 creates a fully working host


Phase 7: Testing & Documentation

Status: 🚧 In Progress (testing improvements completed)

Testing Improvements Implemented (2025-02-01):

The pipeline now supports efficient testing without polluting the master branch:

1. --force Flag for create-host

  • Re-run create-host to regenerate existing configurations
  • Updates existing entries in flake.nix and terraform/vms.tf (no duplicates)
  • Skip uniqueness validation checks
  • Useful for iterating on configuration templates during testing

2. Branch Support for Bootstrap

  • Bootstrap service reads NIXOS_FLAKE_BRANCH environment variable
  • Defaults to master if not set
  • Allows testing pipeline changes on feature branches
  • Cloud-init passes branch via /etc/environment

3. Cloud-init Disk for Branch Configuration

  • Terraform generates custom cloud-init snippets for test VMs
  • Set flake_branch field in VM definition to use non-master branch
  • Production VMs omit this field and use master (default)
  • Files automatically uploaded to Proxmox via SSH

Testing Workflow:

# 1. Create test branch
git checkout -b test-pipeline

# 2. Generate or update host config
create-host --hostname testvm01 --ip 10.69.13.100/24

# 3. Edit terraform/vms.tf to add test VM with branch
# vms = {
#   "testvm01" = {
#     ip = "10.69.13.100/24"
#     flake_branch = "test-pipeline"  # Bootstrap from this branch
#   }
# }

# 4. Commit and push test branch
git add -A && git commit -m "test: add testvm01"
git push origin test-pipeline

# 5. Deploy VM
cd terraform && tofu apply

# 6. Watch bootstrap (VM fetches from test-pipeline branch)
ssh root@10.69.13.100
journalctl -fu nixos-bootstrap.service

# 7. Iterate: modify templates and regenerate with --force
cd .. && create-host --hostname testvm01 --ip 10.69.13.100/24 --force
git commit -am "test: update config" && git push

# Redeploy to test fresh bootstrap
cd terraform
tofu destroy -target=proxmox_vm_qemu.vm[\"testvm01\"] && tofu apply

# 8. Clean up when done: squash commits, merge to master, remove test VM

Files:

  • scripts/create-host/create_host.py - Added --force parameter
  • scripts/create-host/manipulators.py - Update vs insert logic
  • hosts/template2/bootstrap.nix - Branch support via environment variable
  • terraform/vms.tf - flake_branch field support
  • terraform/cloud-init.tf - Custom cloud-init disk generation
  • terraform/variables.tf - proxmox_host variable for SSH uploads

Remaining Tasks:

  • Test full pipeline end-to-end on feature branch
  • Update CLAUDE.md with testing workflow
  • Add troubleshooting section
  • Create examples for common scenarios (DHCP host, static IP host, etc.)

Open Questions

  1. Bootstrap method: Cloud-init runcmd vs Terraform provisioner vs Ansible?
  2. Secrets handling: Pre-generate keys vs post-deployment injection?
  3. DNS automation: Auto-commit or manual merge?
  4. Git workflow: Auto-push changes or leave for user review?
  5. Template selection: Single template2 or multiple templates for different host types?
  6. Networking: Always DHCP initially, or support static IP from start?
  7. Error recovery: What happens if bootstrap fails? Manual intervention or retry?

Implementation Order

Recommended sequence:

  1. Phase 1: Parameterize OpenTofu (foundation for testing)
  2. Phase 3: Bootstrap mechanism (core automation)
  3. Phase 2: Config generator (automate the boilerplate)
  4. Phase 4: Secrets (solves biggest chicken-and-egg)
  5. Phase 5: DNS (nice-to-have automation)
  6. Phase 6: Integration script (ties it all together)
  7. Phase 7: Testing & docs

Success Criteria

When complete, creating a new host should:

  • Take < 5 minutes of human time
  • Require minimal user input (hostname, IP, basic specs)
  • Result in a fully configured, secret-enabled, DNS-registered host
  • Be reproducible and documented
  • Handle common errors gracefully

Notes

  • Keep incremental commits at each phase
  • Test each phase independently before moving to next
  • Maintain backward compatibility with manual workflow
  • Document any manual steps that can't be automated