nixos-servers/TODO.md
Torjus Håkestad ce6d2b1d33
docs: add TODO.md for automated deployment pipeline
Document multi-phase plan for automating NixOS host creation, deployment, and configuration on Proxmox including OpenTofu parameterization, config generation, bootstrap mechanism, secrets management, and Nix-based DNS automation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-31 22:22:19 +01:00


# TODO: Automated Host Deployment Pipeline
## Vision
Automate the entire process of creating, configuring, and deploying new NixOS hosts on Proxmox from a single command or script.
**Desired workflow:**
```bash
./scripts/create-host.sh --hostname myhost --ip 10.69.13.50
# Script creates config, deploys VM, bootstraps NixOS, and you're ready to go
```
**Current manual workflow (from CLAUDE.md):**
1. Create `/hosts/<hostname>/` directory structure
2. Add host to `flake.nix`
3. Add DNS entries
4. Clone template VM manually
5. Run `prepare-host.sh` on new VM
6. Add generated age key to `.sops.yaml`
7. Configure networking
8. Commit and push
9. Run `nixos-rebuild boot --flake URL#<hostname>` on host
## The Plan
### Phase 1: Parameterized OpenTofu Deployments ✓ (Partially Complete)
**Status:** Template building and single-VM deployment work; the deployment still needs to be parameterized
**Tasks:**
- [ ] Create module/template structure in terraform for repeatable VM deployments
- [ ] Parameterize VM configuration (hostname, CPU, memory, disk, IP)
- [ ] Support both DHCP and static IP configuration via cloud-init
- [ ] Test deploying multiple VMs from same template
**Deliverable:** Can deploy a VM with custom parameters via OpenTofu
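
The DHCP/static task above could be sketched as a small helper that builds the value for Proxmox's cloud-init `ipconfig0` option, which accepts either `ip=dhcp` or `ip=<address>/<prefix>,gw=<gateway>`. The function name `ipconfig_for`, the default `/24` prefix, and the gateway `10.69.13.1` used in the usage comment are assumptions for illustration. How the resulting string reaches the VM depends on the OpenTofu provider in use, which is not assumed here.

```bash
#!/usr/bin/env bash
# Sketch: build the value for Proxmox's cloud-init `ipconfig0` option.
# ipconfig_for is a hypothetical helper; the prefix length and gateway are
# illustrative and must match the real network.
ipconfig_for() {
  local ip="$1" prefix="${2:-24}" gateway="${3:-}"
  if [ "$ip" = "dhcp" ]; then
    printf 'ip=dhcp\n'
    return 0
  fi
  if [ -z "$gateway" ]; then
    echo "static IP requires a gateway" >&2
    return 1
  fi
  printf 'ip=%s/%s,gw=%s\n' "$ip" "$prefix" "$gateway"
}

# Usage (gateway address is an assumption):
#   ipconfig_for dhcp
#   ipconfig_for 10.69.13.50 24 10.69.13.1
```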
---
### Phase 2: Host Configuration Generator
**Goal:** Automate creation of host configuration files
This doesn't have to be a plain shell script. Something like Python would probably make templating easier.
**Tasks:**
- [ ] Create script `scripts/create-host-config.sh`
  - [ ] Takes parameters: hostname, IP, CPU cores, memory, disk size
  - [ ] Generates `/hosts/<hostname>/` directory structure from template
  - [ ] Creates `configuration.nix` with proper hostname and networking
  - [ ] Generates `default.nix` with standard imports
  - [ ] Copies/links `hardware-configuration.nix` from template
- [ ] Add host entry to `flake.nix` programmatically
  - [ ] Parse `flake.nix`
  - [ ] Insert new nixosConfiguration entry
  - [ ] Maintain formatting
- [ ] Generate corresponding OpenTofu configuration
  - [ ] Create `terraform/hosts/<hostname>.tf` with VM definition
  - [ ] Use parameters from script input
**Deliverable:** Script generates all config files for a new host
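
As a rough illustration of the generator tasks above, the following sketch renders a `configuration.nix` from a heredoc and writes it into `hosts/<hostname>/`. The function names, the interface name `ens18`, and the `/24` prefix are assumptions; the repository's real template and module layout may look quite different, and `default.nix`, `hardware-configuration.nix`, and the `flake.nix` entry are not handled here.

```bash
#!/usr/bin/env bash
# Sketch: render a host's configuration.nix from a heredoc template.
# render_configuration, the interface name "ens18", and the /24 prefix are
# assumptions for illustration only.
render_configuration() {
  local hostname="$1" ip="$2" prefix="${3:-24}" iface="${4:-ens18}"
  cat <<EOF
{ ... }:
{
  networking.hostName = "${hostname}";
  networking.useDHCP = false;
  networking.interfaces.${iface}.ipv4.addresses = [
    { address = "${ip}"; prefixLength = ${prefix}; }
  ];
}
EOF
}

# Write <root>/hosts/<hostname>/configuration.nix, refusing to overwrite
# an existing host directory.
create_host_dir() {
  local root="$1" hostname="$2" ip="$3"
  local dir="${root}/hosts/${hostname}"
  if [ -e "$dir" ]; then
    echo "host directory already exists: $dir" >&2
    return 1
  fi
  mkdir -p "$dir"
  render_configuration "$hostname" "$ip" > "${dir}/configuration.nix"
}
```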
---
### Phase 3: Bootstrap Mechanism
**Goal:** Get freshly deployed VM to apply its specific host configuration
**Challenge:** Chicken-and-egg problem - VM needs to know its hostname and pull the right config
**Option A: Cloud-init bootstrap script**
- [ ] Add cloud-init `runcmd` to template2 that:
  - [ ] Reads hostname from cloud-init metadata
  - [ ] Runs `nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#${hostname}`
  - [ ] Reboots into the new configuration
- [ ] Test cloud-init script execution on fresh VM
- [ ] Handle failure cases (flake doesn't exist, network issues)
**Option B: Terraform provisioner**
- [ ] Use OpenTofu's `remote-exec` provisioner
- [ ] SSH into new VM after creation
- [ ] Run `nixos-rebuild boot --flake <url>#<hostname>`
- [ ] Trigger reboot via SSH
**Option C: Two-stage deployment**
- [ ] Deploy VM with template2 (minimal config)
- [ ] Run Ansible playbook to bootstrap specific config
- [ ] Similar to existing `run-upgrade.yml` pattern
**Decision needed:** Which approach fits best? (Recommend Option A for automation)
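
Whichever option is chosen, the bootstrap step needs to tolerate transient failures such as the network not being up yet. A minimal sketch of that idea follows, assuming a hypothetical `bootstrap` entry point that receives the hostname as an argument (how it is read from cloud-init metadata is left open) and arbitrary retry settings of 5 attempts, 30 seconds apart. A flake output that does not exist will still fail after the retries are exhausted, so that case needs separate reporting.

```bash
#!/usr/bin/env bash
# Sketch of a bootstrap step with basic retry handling.
FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git"

# Build the flake reference for a host, e.g. <url>#myhost.
flake_ref_for() {
  printf '%s#%s\n' "$FLAKE_URL" "$1"
}

# Run a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after ${n} attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Hypothetical entry point: rebuild for the given host, reboot only on success.
bootstrap() {
  local hostname="$1"
  retry 5 30 nixos-rebuild boot --flake "$(flake_ref_for "$hostname")" && reboot
}
```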
---
### Phase 4: Secrets Management Automation
**Challenge:** sops needs age key, but age key is generated on first boot
**Current workflow:**
1. VM boots, generates age key at `/var/lib/sops-nix/key.txt`
2. User runs `prepare-host.sh` which prints public key
3. User manually adds public key to `.sops.yaml`
4. User commits, pushes
5. VM can now decrypt secrets
**Proposed solutions:**
**Option A: Pre-generate age keys**
- [ ] Generate age key pair during `create-host-config.sh`
- [ ] Add public key to `.sops.yaml` immediately
- [ ] Store private key temporarily (secure location)
- [ ] Inject private key via cloud-init write_files or Terraform file provisioner
- [ ] VM uses pre-configured key from first boot
**Option B: Post-deployment secret injection**
- [ ] VM boots with template, generates its own key
- [ ] Fetch public key via SSH after first boot
- [ ] Automatically add to `.sops.yaml` and commit
- [ ] Trigger rebuild on VM to pick up secrets access
**Option C: Separate secrets from initial deployment**
- [ ] Initial deployment works without secrets
- [ ] After VM is running, user manually adds age key
- [ ] Subsequent auto-upgrades pick up secrets
**Decision needed:** Option A is most automated, but requires secure key handling
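
For Option A, key generation could look roughly like the sketch below. It relies on `age-keygen -o <file>`, which writes the private key together with a `# public key: age1...` comment line, and reads the public key back from that comment. The helper names and the `<dir>/<hostname>.txt` layout are assumptions; where the private key is stored and how it is injected into the VM are exactly the open security questions noted above and are not addressed here.

```bash
#!/usr/bin/env bash
# Sketch for Option A: pre-generate a host age key and read back its public key.

# Extract the public key from the "# public key: age1..." comment that
# age-keygen writes into the key file.
public_key_from_keyfile() {
  sed -n 's/^# public key: //p' "$1"
}

# Generate a key pair into a private directory (hypothetical layout:
# <dir>/<hostname>.txt) and print the public key for .sops.yaml.
# age-keygen refuses to overwrite an existing key file.
generate_host_key() {
  local dir="$1" hostname="$2"
  local keyfile="${dir}/${hostname}.txt"
  ( umask 077 && mkdir -p "$dir" && age-keygen -o "$keyfile" 2>/dev/null ) || return 1
  public_key_from_keyfile "$keyfile"
}
```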
---
### Phase 5: DNS Automation
**Goal:** Automatically generate DNS entries from host configurations
**Approach:** Leverage Nix to generate zone file entries from flake host configurations
Since most hosts use static IPs defined in their NixOS configurations, we can extract this information and automatically generate A records. This keeps DNS in sync with the actual host configs.
**Tasks:**
- [ ] Add optional CNAME field to host configurations
  - [ ] Add `networking.cnames = [ "alias1" "alias2" ]` or similar option
  - [ ] Document in host configuration template
- [ ] Create Nix function to extract DNS records from all hosts
  - [ ] Parse each host's `networking.hostName` and IP configuration
  - [ ] Collect any defined CNAMEs
  - [ ] Generate zone file fragment with A and CNAME records
- [ ] Integrate auto-generated records into zone files
  - [ ] Keep manual entries separate (for non-flake hosts/services)
  - [ ] Include generated fragment in main zone file
  - [ ] Add comments showing which records are auto-generated
- [ ] Update zone file serial number automatically
- [ ] Test zone file validity after generation
- [ ] Either:
  - [ ] Automatically trigger DNS server reload (Ansible)
  - [ ] Or document manual step: merge to master, run upgrade on ns1/ns2
**Deliverable:** DNS A records and CNAMEs automatically generated from host configs
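
The plan is to do the extraction in Nix, but the shape of the output can be illustrated with a small shell sketch. It assumes a hypothetical intermediate format of `hostname ip [alias...]` lines on stdin (which the Nix function would have to produce) and the common `YYYYMMDDNN` serial convention; neither is confirmed by the existing zone files.

```bash
#!/usr/bin/env bash
# Sketch: turn "hostname ip [alias...]" lines on stdin into zone file records.
# The input format is hypothetical; in the plan it would come from a Nix
# function evaluating the flake's host configurations.
render_dns_records() {
  local host ip aliases alias
  echo "; --- auto-generated records, do not edit by hand ---"
  while read -r host ip aliases; do
    if [ -z "$host" ]; then
      continue
    fi
    printf '%s IN A %s\n' "$host" "$ip"
    for alias in $aliases; do
      printf '%s IN CNAME %s\n' "$alias" "$host"
    done
  done
}

# Next serial, assuming the YYYYMMDDNN convention. The result is always
# larger than the current serial, even if the current one is ahead of today.
next_serial() {
  local current="$1" today="${2:-$(date +%Y%m%d)}"
  local candidate="${today}01"
  if [ "$current" -ge "$candidate" ]; then
    echo $((current + 1))
  else
    echo "$candidate"
  fi
}
```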
---
### Phase 6: Integration Script
**Goal:** Single command to create and deploy a new host
**Tasks:**
- [ ] Create `scripts/create-host.sh` master script that orchestrates:
  1. Prompts for: hostname, IP (or DHCP), CPU, memory, disk
  2. Validates inputs (IP not in use, hostname unique, etc.)
  3. Calls host config generator (Phase 2)
  4. Generates OpenTofu config (Phase 2)
  5. Handles secrets (Phase 4)
  6. Updates DNS (Phase 5)
  7. Commits all changes to git
  8. Runs `tofu apply` to deploy VM
  9. Waits for bootstrap to complete (Phase 3)
  10. Prints success message with IP and SSH command
- [ ] Add `--dry-run` flag to preview changes
- [ ] Add `--interactive` mode vs `--batch` mode
- [ ] Error handling and rollback on failures
**Deliverable:** `./scripts/create-host.sh --hostname myhost --ip 10.69.13.50` creates a fully working host
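
Two pieces of the integration script are easy to sketch in isolation: syntactic input validation and a `--dry-run` wrapper. The helper names and the `DRY_RUN` variable below are assumptions, and the checks are purely syntactic; verifying that an IP is not already in use or that a hostname is unique would need lookups against the repository and the network that are not shown.

```bash
#!/usr/bin/env bash
# Sketch of input validation and a dry-run wrapper for create-host.sh.

# Accept single lowercase labels: letters, digits, inner hyphens, max 63 chars.
valid_hostname() {
  [[ "$1" =~ ^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$ ]]
}

# Accept dotted-quad IPv4 addresses with each octet in 0..255.
valid_ipv4() {
  local IFS=. octet
  local -a octets
  [[ "$1" =~ ^[0-9]{1,3}(\.[0-9]{1,3}){3}$ ]] || return 1
  read -r -a octets <<< "$1"
  for octet in "${octets[@]}"; do
    # 10# forces base 10 so octets like "08" are not read as octal.
    [ "$((10#$octet))" -le 255 ] || return 1
  done
}

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
  else
    "$@"
  fi
}

# Usage inside the script, once --dry-run has set DRY_RUN=1:
#   run tofu apply
```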
---
### Phase 7: Testing & Documentation
**Tasks:**
- [ ] Test full pipeline end-to-end
- [ ] Create test host and verify all steps
- [ ] Document the new workflow in CLAUDE.md
- [ ] Add troubleshooting section
- [ ] Create examples for common scenarios (DHCP host, static IP host, etc.)
---
## Open Questions
1. **Bootstrap method:** Cloud-init runcmd vs Terraform provisioner vs Ansible?
2. **Secrets handling:** Pre-generate keys vs post-deployment injection?
3. **DNS automation:** Auto-commit or manual merge?
4. **Git workflow:** Auto-push changes or leave for user review?
5. **Template selection:** Single template2 or multiple templates for different host types?
6. **Networking:** Always DHCP initially, or support static IP from start?
7. **Error recovery:** What happens if bootstrap fails? Manual intervention or retry?
## Implementation Order
Recommended sequence:
1. Phase 1: Parameterize OpenTofu (foundation for testing)
2. Phase 3: Bootstrap mechanism (core automation)
3. Phase 2: Config generator (automate the boilerplate)
4. Phase 4: Secrets (solves biggest chicken-and-egg)
5. Phase 5: DNS (nice-to-have automation)
6. Phase 6: Integration script (ties it all together)
7. Phase 7: Testing & docs
## Success Criteria
When complete, creating a new host should:
- Take < 5 minutes of human time
- Require minimal user input (hostname, IP, basic specs)
- Result in a fully configured, secret-enabled, DNS-registered host
- Be reproducible and documented
- Handle common errors gracefully
---
## Notes
- Keep incremental commits at each phase
- Test each phase independently before moving to next
- Maintain backward compatibility with manual workflow
- Document any manual steps that can't be automated