TODO: Automated Host Deployment Pipeline

Vision

Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.

Desired workflow:

./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go

Current manual workflow (from CLAUDE.md):

  1. Create /hosts/<hostname>/ directory structure
  2. Add host to flake.nix
  3. Add DNS entries
  4. Clone template VM manually
  5. Run prepare-host.sh on new VM
  6. Add generated age key to .sops.yaml
  7. Configure networking
  8. Commit and push
  9. Run nixos-rebuild boot --flake URL#<hostname> on host

The Plan

Phase 1: Parameterized OpenTofu Deployments COMPLETED

Status: Fully implemented and tested

Implementation:

  • Locals-based structure using for_each pattern for multiple VM deployments
  • All VM parameters configurable with smart defaults (CPU, memory, disk, IP, storage, etc.)
  • Automatic DHCP vs static IP detection based on ip field presence
  • Dynamic outputs showing deployed VM IPs and specifications
  • Successfully tested deploying multiple VMs simultaneously

Tasks:

  • Create module/template structure in terraform for repeatable VM deployments
  • Parameterize VM configuration (hostname, CPU, memory, disk, IP)
  • Support both DHCP and static IP configuration via cloud-init
  • Test deploying multiple VMs from same template

Deliverable: Can deploy multiple VMs with custom parameters via OpenTofu in a single tofu apply

Files:

  • terraform/vms.tf - VM definitions using locals map
  • terraform/outputs.tf - Dynamic outputs for all VMs
  • terraform/variables.tf - Configurable defaults
  • terraform/README.md - Complete documentation

Phase 2: Host Configuration Generator COMPLETED

Status: Fully implemented and tested. Completed: 2025-02-01. Enhanced: 2025-02-01 (added --force flag)

Goal: Automate creation of host configuration files

Implementation:

  • Python CLI tool packaged as Nix derivation
  • Available as create-host command in devShell
  • Rich terminal UI with configuration previews
  • Comprehensive validation (hostname format/uniqueness, IP subnet/uniqueness)
  • Jinja2 templates for NixOS configurations
  • Automatic updates to flake.nix and terraform/vms.tf
  • --force flag for regenerating existing configurations (useful for testing)

Tasks:

  • Create Python CLI with typer framework
    • Takes parameters: hostname, IP, CPU cores, memory, disk size
    • Generates /hosts/<hostname>/ directory structure
    • Creates configuration.nix with proper hostname and networking
    • Generates default.nix with standard imports
    • References shared hardware-configuration.nix from template
  • Add host entry to flake.nix programmatically
    • Text-based manipulation (regex insertion)
    • Inserts new nixosConfiguration entry
    • Maintains proper formatting
  • Generate corresponding OpenTofu configuration
    • Adds VM definition to terraform/vms.tf
    • Uses parameters from CLI input
    • Supports both static IP and DHCP modes
  • Package as Nix derivation with templates
  • Add to flake packages and devShell
  • Implement dry-run mode
  • Write comprehensive README

Usage:

# In nix develop shell. --ip is optional (omit for DHCP);
# --cpu, --memory and --disk default to 2, 2048 and 20G.
create-host \
  --hostname test01 \
  --ip 10.69.13.50/24 \
  --cpu 4 \
  --memory 4096 \
  --disk 50G \
  --dry-run   # optional: preview without writing

Files:

  • scripts/create-host/ - Complete Python package with Nix derivation
  • scripts/create-host/README.md - Full documentation and examples

Deliverable: Tool generates all config files for a new host, validated with Nix and Terraform


Phase 3: Bootstrap Mechanism COMPLETED

Status: Fully implemented and tested. Completed: 2025-02-01. Enhanced: 2025-02-01 (added branch support for testing)

Goal: Get freshly deployed VM to apply its specific host configuration

Implementation: Systemd oneshot service that runs on first boot after cloud-init

Approach taken: Systemd service (variant of Option A)

  • Systemd service nixos-bootstrap.service runs on first boot
  • Depends on cloud-config.service to ensure hostname is set
  • Reads hostname from hostnamectl (set by cloud-init via Terraform)
  • Supports custom git branch via NIXOS_FLAKE_BRANCH environment variable
  • Runs nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#${hostname}
  • Reboots into new configuration on success
  • Fails gracefully without reboot on errors (network issues, missing config)
  • Service self-destructs after successful bootstrap (not in new config)
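
A condensed Nix sketch of the service described above; the unit wiring follows the bullets, while the connectivity check and the EnvironmentFile line (how NIXOS_FLAKE_BRANCH reaches the unit) are assumptions:

{ pkgs, ... }:
{
  systemd.services.nixos-bootstrap = {
    description = "Apply host-specific flake configuration on first boot";
    wantedBy = [ "multi-user.target" ];
    # cloud-init must have set the hostname before we read it
    after = [ "network-online.target" "cloud-config.service" ];
    wants = [ "network-online.target" ];
    path = [ pkgs.curl pkgs.nixos-rebuild pkgs.systemd ];
    serviceConfig = {
      Type = "oneshot";
      # cloud-init writes NIXOS_FLAKE_BRANCH to /etc/environment (see Phase 7)
      EnvironmentFile = "-/etc/environment";
    };
    script = ''
      branch="''${NIXOS_FLAKE_BRANCH:-master}"
      hostname="$(hostnamectl hostname)"
      # Fail early (no reboot, system stays up for debugging) if the forge
      # is unreachable.
      curl -sf https://git.t-juice.club >/dev/null
      nixos-rebuild boot \
        --flake "git+https://git.t-juice.club/torjus/nixos-servers.git?ref=''${branch}#''${hostname}"
      systemctl reboot
    '';
  };
  # Target host configs do not import this module, so the unit disappears
  # after the first successful rebuild.
}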

Tasks:

  • Create bootstrap service module in template2
    • systemd oneshot service with proper dependencies
    • Reads hostname from hostnamectl (cloud-init sets it)
    • Checks network connectivity via HTTPS (curl)
    • Runs nixos-rebuild boot with flake URL
    • Reboots on success, fails gracefully on error
  • Configure cloud-init datasource (see the sketch after this list)
    • Use ConfigDrive datasource (Proxmox provider)
    • Add cloud-init disk to Terraform VMs (disks.ide.ide2.cloudinit)
    • Hostname passed via cloud-init user-data from Terraform
  • Test bootstrap service execution on fresh VM
  • Handle failure cases (flake doesn't exist, network issues)
    • Clear error messages in journald
    • No reboot on failure
    • System remains accessible for debugging
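
On the datasource side, hosts/template2/configuration.nix needs something along these lines; services.cloud-init.settings is an assumption to verify against the NixOS module version in use:

{ ... }:
{
  services.cloud-init = {
    enable = true;
    network.enable = true;
    # Only probe the ConfigDrive ISO that Terraform attaches (disks.ide.ide2)
    settings.datasource_list = [ "ConfigDrive" ];
  };
}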

Files:

  • hosts/template2/bootstrap.nix - Bootstrap service definition
  • hosts/template2/configuration.nix - Cloud-init ConfigDrive datasource
  • terraform/vms.tf - Cloud-init disk configuration

Deliverable: VMs automatically bootstrap and reboot into host-specific configuration on first boot


Phase 4: Secrets Management with HashiCorp Vault

Challenge: The current sops-nix approach has a chicken-and-egg problem with age keys

Current workflow:

  1. VM boots, generates age key at /var/lib/sops-nix/key.txt
  2. User runs prepare-host.sh which prints public key
  3. User manually adds public key to .sops.yaml
  4. User commits, pushes
  5. VM can now decrypt secrets

Selected approach: Migrate to HashiCorp Vault for centralized secrets management

Benefits:

  • Industry-standard secrets management (Vault experience transferable to work)
  • Eliminates manual age key distribution step
  • Secrets-as-code via OpenTofu (infrastructure-as-code aligned)
  • Centralized PKI management (replaces step-ca, consolidates TLS + SSH CA)
  • Automatic secret rotation capabilities
  • Audit logging for all secret access
  • AppRole authentication enables automated bootstrap

Architecture:

vault.home.2rjus.net
  ├─ KV Secrets Engine (replaces sops-nix)
  ├─ PKI Engine (replaces step-ca for TLS)
  ├─ SSH CA Engine (replaces step-ca SSH CA)
  └─ AppRole Auth (per-host authentication)
       ↓
   New hosts authenticate on first boot
   Fetch secrets via Vault API
   No manual key distribution needed

Phase 4a: Vault Server Setup

Goal: Deploy and configure Vault server with auto-unseal

Tasks:

  • Create hosts/vault01/ configuration (sketched after this list)
    • Basic NixOS configuration (hostname, networking, etc.)
    • Vault service configuration
    • Firewall rules (8200 for API, 8201 for cluster)
    • Add to flake.nix and terraform
  • Implement auto-unseal mechanism
    • Preferred: TPM-based auto-unseal if hardware supports it
      • Use tpm2-tools to seal/unseal Vault keys
      • Systemd service to unseal on boot
    • Fallback: Shamir secret sharing with systemd automation
      • Generate 3 keys, threshold 2
      • Store 2 keys on disk (encrypted), keep 1 offline
      • Systemd service auto-unseals using 2 keys
  • Initial Vault setup
    • Initialize Vault
    • Configure storage backend (integrated raft or file)
    • Set up root token management
    • Enable audit logging
  • Deploy to infrastructure
    • Add DNS entry for vault.home.2rjus.net
    • Deploy VM via terraform
    • Bootstrap and verify Vault is running
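
A possible starting point for hosts/vault01, using the stock NixOS services.vault module with the Shamir fallback as a oneshot unit; the raft stanza, the plain-HTTP loopback address, and the unseal-share paths are illustrative assumptions:

{ pkgs, ... }:
{
  services.vault = {
    enable = true;
    address = "0.0.0.0:8200";
    storageBackend = "raft";
    storagePath = "/var/lib/vault";
    storageConfig = ''
      node_id = "vault01"
    '';
    extraConfig = ''
      ui = true
      api_addr     = "https://vault.home.2rjus.net:8200"
      cluster_addr = "https://vault.home.2rjus.net:8201"
    '';
  };
  networking.firewall.allowedTCPPorts = [ 8200 8201 ];

  # Fallback auto-unseal: submit the two on-disk Shamir shares after start.
  # A TPM variant would instead recover the shares via tpm2-tools here.
  systemd.services.vault-unseal = {
    wantedBy = [ "multi-user.target" ];
    after = [ "vault.service" ];
    serviceConfig.Type = "oneshot";
    path = [ pkgs.vault ];
    script = ''
      export VAULT_ADDR=http://127.0.0.1:8200  # switch to https once TLS is up
      for share in /var/lib/vault-unseal/key1 /var/lib/vault-unseal/key2; do
        vault operator unseal "$(cat "$share")"
      done
    '';
  };
}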

Deliverable: Running Vault server that auto-unseals on boot


Phase 4b: Vault-as-Code with OpenTofu

Goal: Manage all Vault configuration (secrets structure, policies, roles) as code

Tasks:

  • Set up Vault Terraform provider
    • Create terraform/vault/ directory
    • Configure Vault provider (address, auth)
    • Store Vault token securely (terraform.tfvars, gitignored)
  • Enable and configure secrets engines
    • Enable KV v2 secrets engine at secret/
    • Define secret path structure (per-service, per-host)
    • Example: secret/monitoring/grafana, secret/postgres/ha1
  • Define policies as code
    • Create policies for different service tiers
    • Principle of least privilege (hosts only read their secrets)
    • Example: monitoring-policy allows read on secret/monitoring/*
  • Set up AppRole authentication
    • Enable AppRole auth backend
    • Create role per host type (monitoring, dns, database, etc.)
    • Bind policies to roles
    • Configure TTL and token policies
  • Migrate existing secrets from sops-nix
    • Create migration script/playbook
    • Decrypt sops secrets and load into Vault KV
    • Verify all secrets migrated successfully
    • Keep sops as backup during transition
  • Implement secrets-as-code patterns
    • Secret values in gitignored terraform.tfvars
    • Or use random_password for auto-generated secrets
    • Secret structure/paths in version-controlled .tf files

Example OpenTofu:

resource "vault_kv_secret_v2" "monitoring_grafana" {
  mount = "secret"
  name  = "monitoring/grafana"
  data_json = jsonencode({
    admin_password = var.grafana_admin_password
    smtp_password  = var.smtp_password
  })
}

resource "vault_policy" "monitoring" {
  name = "monitoring-policy"
  policy = <<EOT
path "secret/data/monitoring/*" {
  capabilities = ["read"]
}
EOT
}

resource "vault_approle_auth_backend_role" "monitoring01" {
  backend   = "approle"
  role_name = "monitoring01"
  token_policies = ["monitoring-policy"]
}

Deliverable: All secrets and policies managed as OpenTofu code in terraform/vault/


Phase 4c: PKI Migration (Replace step-ca)

Goal: Consolidate PKI infrastructure into Vault

Tasks:

  • Set up Vault PKI engines
    • Create root CA in Vault (pki/ mount, 10 year TTL)
    • Create intermediate CA (pki_int/ mount, 5 year TTL)
    • Sign intermediate with root CA
    • Configure CRL and OCSP
  • Enable ACME support
    • Enable ACME on intermediate CA (Vault 1.14+)
    • Create PKI role for homelab domain
    • Set certificate TTLs and allowed domains
  • Configure SSH CA in Vault
    • Enable SSH secrets engine (ssh/ mount)
    • Generate SSH signing keys
    • Create roles for host and user certificates
    • Configure TTLs and allowed principals
  • Migrate hosts from step-ca to Vault (host-side changes sketched after this list)
    • Update system/acme.nix to use Vault ACME endpoint
    • Change server to https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
    • Test certificate issuance on one host
    • Roll out to all hosts via auto-upgrade
  • Migrate SSH CA trust
    • Distribute Vault SSH CA public key to all hosts
    • Update sshd_config to trust Vault CA
    • Test SSH certificate authentication
  • Decommission step-ca
    • Verify all services migrated
    • Stop step-ca service on ca host
    • Archive step-ca configuration for backup
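
Host-side, the migration largely reduces to two settings. A minimal sketch, assuming the SSH CA public key has been exported from the ssh/ mount and committed next to the module:

{ ... }:
{
  # TLS: point the stock ACME module at Vault's ACME directory
  security.acme.defaults.server =
    "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";

  # SSH: accept user certificates signed by Vault's SSH CA
  environment.etc."ssh/vault-ssh-ca.pub".text =
    builtins.readFile ./vault-ssh-ca.pub;
  services.openssh.extraConfig = ''
    TrustedUserCAKeys /etc/ssh/vault-ssh-ca.pub
  '';
}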

Deliverable: All TLS and SSH certificates issued by Vault, step-ca retired


Phase 4d: Bootstrap Integration

Goal: New hosts automatically authenticate to Vault on first boot, no manual steps

Tasks:

  • Update create-host tool
    • Generate AppRole role_id + secret_id for new host
    • Or create wrapped token for one-time bootstrap
    • Add host-specific policy to Vault (via terraform)
    • Store bootstrap credentials for cloud-init injection
  • Update template2 for Vault authentication
    • Create Vault authentication module
    • Reads bootstrap credentials from cloud-init
    • Authenticates to Vault, retrieves permanent AppRole credentials
    • Stores role_id + secret_id locally for services to use
  • Create NixOS Vault secrets module
    • Replacement for sops.secrets
    • Fetches secrets from Vault at nixos-rebuild/activation time
    • Or runtime secret fetching for services
    • Handle Vault token renewal
  • Update bootstrap service
    • After authenticating to Vault, fetch any bootstrap secrets
    • Run nixos-rebuild with host configuration
    • Services automatically fetch their secrets from Vault
  • Update terraform cloud-init
    • Inject Vault address and bootstrap credentials
    • Pass via cloud-init user-data or write_files
    • Credentials scoped to single use or short TTL
  • Test complete flow
    • Run create-host to generate new host config
    • Deploy with terraform
    • Verify host bootstraps and authenticates to Vault
    • Verify services can fetch secrets
    • Confirm no manual steps required

Bootstrap flow:

1. terraform apply (deploys VM with cloud-init)
2. Cloud-init sets hostname + Vault bootstrap credentials
3. nixos-bootstrap.service runs:
   - Authenticates to Vault with bootstrap credentials
   - Retrieves permanent AppRole credentials
   - Stores locally for service use
   - Runs nixos-rebuild
4. Host services fetch secrets from Vault as needed
5. Done - no manual intervention
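
Step 3's credential exchange, sketched as a oneshot against Vault's AppRole HTTP API; the unit name, credential paths, and token location are illustrative assumptions:

{ pkgs, ... }:
{
  systemd.services.vault-bootstrap-auth = {
    description = "Exchange bootstrap AppRole credentials for a Vault token";
    wantedBy = [ "multi-user.target" ];
    before = [ "nixos-bootstrap.service" ];
    serviceConfig.Type = "oneshot";
    path = [ pkgs.curl pkgs.jq ];
    script = ''
      # role_id/secret_id are written by cloud-init write_files (paths assumed)
      role_id="$(cat /etc/vault-bootstrap/role_id)"
      secret_id="$(cat /etc/vault-bootstrap/secret_id)"
      curl -sf https://vault.home.2rjus.net:8200/v1/auth/approle/login \
        -d "{\"role_id\":\"''${role_id}\",\"secret_id\":\"''${secret_id}\"}" \
        | jq -r .auth.client_token > /run/vault-token
      chmod 600 /run/vault-token
    '';
  };
}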

Deliverable: Fully automated secrets access from first boot, zero manual steps


Phase 5: DNS Automation

Goal: Automatically generate DNS entries from host configurations

Approach: Leverage Nix to generate zone file entries from flake host configurations

Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
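
A rough sketch of the extraction function, assuming static addresses live under networking.interfaces and using the hypothetical networking.cnames option from the first task below:

{ lib, nixosConfigurations }:
let
  recordsFor = _name: host:
    let
      cfg = host.config;
      addrs = lib.concatMap (iface: iface.ipv4.addresses)
        (lib.attrValues cfg.networking.interfaces);
      aRecord = lib.optional (addrs != [ ])
        "${cfg.networking.hostName} IN A ${(lib.head addrs).address}";
      cnames = map (alias: "${alias} IN CNAME ${cfg.networking.hostName}")
        (cfg.networking.cnames or [ ]);
    in aRecord ++ cnames;
in
lib.concatStringsSep "\n"
  (lib.concatLists (lib.mapAttrsToList recordsFor nixosConfigurations))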

Tasks:

  • Add optional CNAME field to host configurations
    • Add networking.cnames = [ "alias1" "alias2" ] or similar option
    • Document in host configuration template
  • Create Nix function to extract DNS records from all hosts
    • Parse each host's networking.hostName and IP configuration
    • Collect any defined CNAMEs
    • Generate zone file fragment with A and CNAME records
  • Integrate auto-generated records into zone files
    • Keep manual entries separate (for non-flake hosts/services)
    • Include generated fragment in main zone file
    • Add comments showing which records are auto-generated
  • Update zone file serial number automatically
  • Test zone file validity after generation
  • Either:
    • Automatically trigger DNS server reload (Ansible)
    • Or document manual step: merge to master, run upgrade on ns1/ns2

Deliverable: DNS A records and CNAMEs automatically generated from host configs


Phase 6: Integration Script

Goal: Single command to create and deploy a new host

Tasks:

  • Create scripts/create-host.sh master script that orchestrates:
    1. Prompts for: hostname, IP (or DHCP), CPU, memory, disk
    2. Validates inputs (IP not in use, hostname unique, etc.)
    3. Calls host config generator (Phase 2)
    4. Generates OpenTofu config (Phase 2)
    5. Handles secrets (Phase 4)
    6. Updates DNS (Phase 5)
    7. Commits all changes to git
    8. Runs tofu apply to deploy VM
    9. Waits for bootstrap to complete (Phase 3)
    10. Prints success message with IP and SSH command
  • Add --dry-run flag to preview changes
  • Add --interactive mode vs --batch mode
  • Error handling and rollback on failures

Deliverable: ./scripts/create-host.sh --hostname myhost --ip 10.69.13.50 creates a fully working host


Phase 7: Testing & Documentation

Status: 🚧 In Progress (testing improvements completed)

Testing Improvements Implemented (2025-02-01):

The pipeline now supports efficient testing without polluting the master branch:

1. --force Flag for create-host

  • Re-run create-host to regenerate existing configurations
  • Updates existing entries in flake.nix and terraform/vms.tf (no duplicates)
  • Skips uniqueness validation checks
  • Useful for iterating on configuration templates during testing

2. Branch Support for Bootstrap

  • Bootstrap service reads NIXOS_FLAKE_BRANCH environment variable
  • Defaults to master if not set
  • Allows testing pipeline changes on feature branches
  • Cloud-init passes branch via /etc/environment

3. Cloud-init Disk for Branch Configuration

  • Terraform generates custom cloud-init snippets for test VMs
  • Set flake_branch field in VM definition to use non-master branch
  • Production VMs omit this field and use master (default)
  • Files automatically uploaded to Proxmox via SSH

Testing Workflow:

# 1. Create test branch
git checkout -b test-pipeline

# 2. Generate or update host config
create-host --hostname testvm01 --ip 10.69.13.100/24

# 3. Edit terraform/vms.tf to add test VM with branch
# vms = {
#   "testvm01" = {
#     ip = "10.69.13.100/24"
#     flake_branch = "test-pipeline"  # Bootstrap from this branch
#   }
# }

# 4. Commit and push test branch
git add -A && git commit -m "test: add testvm01"
git push origin test-pipeline

# 5. Deploy VM
cd terraform && tofu apply

# 6. Watch bootstrap (VM fetches from test-pipeline branch)
ssh root@10.69.13.100
journalctl -fu nixos-bootstrap.service

# 7. Iterate: modify templates and regenerate with --force
cd .. && create-host --hostname testvm01 --ip 10.69.13.100/24 --force
git commit -am "test: update config" && git push

# Redeploy to test fresh bootstrap
cd terraform
tofu destroy -target='proxmox_vm_qemu.vm["testvm01"]' && tofu apply

# 8. Clean up when done: squash commits, merge to master, remove test VM

Files:

  • scripts/create-host/create_host.py - Added --force parameter
  • scripts/create-host/manipulators.py - Update vs insert logic
  • hosts/template2/bootstrap.nix - Branch support via environment variable
  • terraform/vms.tf - flake_branch field support
  • terraform/cloud-init.tf - Custom cloud-init disk generation
  • terraform/variables.tf - proxmox_host variable for SSH uploads

Remaining Tasks:

  • Test full pipeline end-to-end on feature branch
  • Update CLAUDE.md with testing workflow
  • Add troubleshooting section
  • Create examples for common scenarios (DHCP host, static IP host, etc.)

Open Questions

  1. Bootstrap method: resolved by Phase 3 (systemd oneshot service, not cloud-init runcmd, Terraform provisioner, or Ansible)
  2. Secrets handling: resolved by the Phase 4 plan (Vault AppRole bootstrap credentials injected via cloud-init)
  3. DNS automation: Auto-commit or manual merge?
  4. Git workflow: Auto-push changes or leave for user review?
  5. Template selection: Single template2 or multiple templates for different host types?
  6. Networking: resolved by Phase 1 (both DHCP and static IP supported via cloud-init from the start)
  7. Error recovery: What happens if bootstrap fails? Manual intervention or retry?

Implementation Order

Recommended sequence:

  1. Phase 1: Parameterize OpenTofu (foundation for testing)
  2. Phase 3: Bootstrap mechanism (core automation)
  3. Phase 2: Config generator (automate the boilerplate)
  4. Phase 4: Secrets (solves biggest chicken-and-egg)
  5. Phase 5: DNS (nice-to-have automation)
  6. Phase 6: Integration script (ties it all together)
  7. Phase 7: Testing & docs

Success Criteria

When complete, creating a new host should:

  • Take < 5 minutes of human time
  • Require minimal user input (hostname, IP, basic specs)
  • Result in a fully configured, secret-enabled, DNS-registered host
  • Be reproducible and documented
  • Handle common errors gracefully

Notes

  • Keep incremental commits at each phase
  • Test each phase independently before moving to next
  • Maintain backward compatibility with manual workflow
  • Document any manual steps that can't be automated