Instead of creating a long-lived Vault token in Terraform (which gets
invalidated when Terraform recreates it), monitoring01 now uses its
existing AppRole credentials to fetch a fresh token for Prometheus.
Changes:
- Add prometheus-metrics policy to monitoring01's AppRole
- Remove vault_token.prometheus_metrics resource from Terraform
- Remove openbao-token KV secret from Terraform
- Add systemd service to fetch AppRole token on boot (sketched below)
- Add systemd timer to refresh token every 30 minutes
This ensures Prometheus always has a valid token without depending on
Terraform state or manual intervention.
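A minimal sketch of what the two units could look like as a NixOS module; the unit names, credential paths, vault address, and output location are illustrative assumptions, not the actual configuration:

```nix
{ pkgs, ... }:
{
  systemd.services.prometheus-vault-token = {
    description = "Fetch a Prometheus metrics token via AppRole";
    wantedBy = [ "multi-user.target" ];
    serviceConfig.Type = "oneshot";
    # Hypothetical vault address; real config points at the OpenBao server.
    environment.VAULT_ADDR = "https://vault.example.internal:8200";
    script = ''
      # Log in with the host's existing AppRole credentials and write the
      # resulting client token where Prometheus can read it (permissions
      # handling omitted in this sketch).
      ${pkgs.openbao}/bin/bao write -field=token auth/approle/login \
        role_id="$(cat /var/lib/vault-secrets/role-id)" \
        secret_id="$(cat /var/lib/vault-secrets/secret-id)" \
        > /run/prometheus/vault-token
    '';
  };

  systemd.timers.prometheus-vault-token = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "2min";
      OnUnitActiveSec = "30min";  # refresh every 30 minutes
    };
  };
}
```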
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add exporters and scrape targets for services lacking monitoring:
- PostgreSQL: postgres-exporter on pgdb1
- Authelia: native telemetry metrics on auth01
- Unbound: unbound-exporter with remote-control on ns1/ns2
- NATS: HTTP monitoring endpoint on nats1
- OpenBao: telemetry config and Prometheus scrape with token auth
- Systemd: systemd-exporter on all hosts for per-service metrics
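As an illustration, the NixOS side of a couple of these exporters and their scrape jobs might look roughly like this (host names, ports, and option values are assumptions, not the actual config):

```nix
{
  # On pgdb1: expose PostgreSQL metrics (exporter default port 9187).
  services.prometheus.exporters.postgres = {
    enable = true;
    runAsLocalSuperUser = true;
  };

  # On every host: per-service metrics via the systemd exporter (port 9558).
  services.prometheus.exporters.systemd.enable = true;

  # On monitoring01: scrape the new targets.
  services.prometheus.scrapeConfigs = [
    {
      job_name = "postgres";
      static_configs = [ { targets = [ "pgdb1:9187" ]; } ];
    }
    {
      job_name = "systemd";
      static_configs = [ { targets = [ "pgdb1:9558" "auth01:9558" "ns1:9558" ]; } ];
    }
  ];
}
```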
Add alert rules for postgres, auth (authelia + lldap), jellyfin,
vault (openbao), plus extend existing nats and unbound rules.
Add Terraform config for Prometheus metrics policy and token. The
token is created via vault_token resource and stored in KV, so no
manual token creation is needed.
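The Terraform side could look roughly like this (mount names, the KV path, and the token lifetime are illustrative):

```hcl
resource "vault_policy" "prometheus_metrics" {
  name   = "prometheus-metrics"
  policy = <<-EOT
    path "sys/metrics" {
      capabilities = ["read"]
    }
  EOT
}

resource "vault_token" "prometheus_metrics" {
  policies  = [vault_policy.prometheus_metrics.name]
  no_parent = true
  renewable = true
  period    = "768h"
}

# Stash the token in KV so monitoring01 can pull it with its AppRole.
resource "vault_kv_secret_v2" "openbao_token" {
  mount     = "secret"
  name      = "monitoring01/openbao-token"
  data_json = jsonencode({
    token = vault_token.prometheus_metrics.client_token
  })
}
```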
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove backup_helper_secret variable and switch shared/backup/password
to auto_generate. New password will be added alongside existing restic
repository key.
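One possible shape for the auto-generated secret, sketched with generic resources; the auto_generate switch itself belongs to the project's own secrets module, so this only illustrates the underlying idea:

```hcl
resource "random_password" "backup" {
  length  = 32
  special = false
}

resource "vault_kv_secret_v2" "backup_password" {
  mount     = "secret"
  name      = "shared/backup/password"
  data_json = jsonencode({
    password = random_password.backup.result
  })
}
```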
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace sops-nix secrets with OpenBao vault secrets across all hosts.
Hardcode root password hash, add extractKey option to vault-secrets
module, update Terraform with secrets/policies for all hosts, and
create AppRole provisioning playbook.
Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01
Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only.
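For a single migrated host, the Terraform policy and AppRole binding might look roughly like this (KV paths and naming are assumptions):

```hcl
resource "vault_policy" "ns1" {
  name   = "ns1"
  policy = <<-EOT
    path "secret/data/ns1/*" {
      capabilities = ["read"]
    }
  EOT
}

resource "vault_approle_auth_backend_role" "ns1" {
  backend        = "approle"
  role_name      = "ns1"
  token_policies = [vault_policy.ns1.name]
}
```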
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fix error "500 can't upload to storage type 'zfspool'" by using "local"
storage pool for cloud-init disks instead of the VM's storage pool.
Cloud-init disks require storage that supports ISO/snippet content types,
which zfspool does not. The "local" storage pool (directory-based) supports
these content types.
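Illustrative fragment of the change (resource name, node, and user-data payload are placeholder values):

```hcl
resource "proxmox_cloud_init_disk" "ci" {
  name     = "testvm-cloudinit"
  pve_node = "pve1"

  # Directory-backed storage that accepts ISO/snippet content; the VM's
  # zfspool-backed pool cannot, which triggered the 500 error.
  storage = "local"

  user_data = <<-EOT
    #cloud-config
    hostname: testvm
  EOT
}
```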
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fix an OpenTofu type error caused by the static IP and DHCP branches having
different object structures in the subnets array.
Error was: "The true and false result expressions must have consistent types"
Solution: Make network_config itself conditional rather than the subnets
array, so that both branches return the same type (a string from yamlencode()).
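A sketch of the resulting expression (the each.value attribute names are illustrative):

```hcl
# Each arm builds its own complete yamlencode() string, so the conditional's
# two results have the same type.
network_config = (
  each.value.ip == "dhcp"
  ? yamlencode({
      version = 1
      config = [{
        type    = "physical"
        name    = "eth0"
        subnets = [{ type = "dhcp" }]
      }]
    })
  : yamlencode({
      version = 1
      config = [{
        type    = "physical"
        name    = "eth0"
        subnets = [{
          type    = "static"
          address = each.value.ip
          gateway = each.value.gateway
        }]
      }]
    })
)
```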
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove mention of .generated/ directory and clarify that cloud-init.tf
manages all cloud-init disks, not just branch-specific ones.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace SSH upload approach with native proxmox_cloud_init_disk resource
for cleaner, more maintainable cloud-init management.
Changes:
- Use proxmox_cloud_init_disk for all VMs (not just branch-specific ones)
- Include SSH keys, network config, and metadata in cloud-init disk
- Conditionally include NIXOS_FLAKE_BRANCH for VMs with flake_branch set
- Replace ide2 cloudinit disk with a cdrom reference to the cloud-init disk (see the sketch below)
- Remove built-in cloud-init parameters (ciuser, sshkeys, etc.)
- Remove cicustom parameter (no longer needed)
- Remove proxmox_host variable (no SSH uploads required)
- Remove .gitignore entry for .generated/ directory
Benefits:
- No SSH access to Proxmox required
- All cloud-init config managed in Terraform
- Consistent approach for all VMs
- Cleaner state management
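A sketch of the new resource and the cdrom reference; variable names (e.g. var.ssh_public_key), node names, and the user-data payload are illustrative:

```hcl
resource "proxmox_cloud_init_disk" "vm" {
  for_each = local.vms

  name     = each.key
  pve_node = each.value.node
  storage  = "local"

  meta_data = yamlencode({
    instance_id      = sha1(each.key)
    "local-hostname" = each.key
  })

  user_data = <<-EOT
    #cloud-config
    ssh_authorized_keys:
      - ${var.ssh_public_key}
  EOT
}

resource "proxmox_vm_qemu" "vm" {
  for_each = local.vms

  name        = each.key
  target_node = each.value.node
  # ... other VM settings ...

  disks {
    ide {
      # Former ide2 "cloudinit" disk, now a plain CD-ROM pointing at the
      # generated cloud-init ISO.
      ide2 {
        cdrom {
          iso = proxmox_cloud_init_disk.vm[each.key].id
        }
      }
    }
  }
}
```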
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implement several improvements that enable efficient testing of pipeline
changes without polluting the master branch:
1. Add --force flag to create-host script
   - Skip hostname/IP uniqueness validation
   - Overwrite existing host configurations
   - Update entries in flake.nix and terraform/vms.tf (no duplicates)
   - Useful for iterating on configurations during testing
2. Add branch support to bootstrap mechanism (sketched below)
   - Bootstrap service reads NIXOS_FLAKE_BRANCH environment variable
   - Defaults to master if not set
   - Uses branch in git URL via ?ref= parameter
   - Service loads environment from /etc/environment
3. Add cloud-init disk support for branch configuration
   - VMs can specify flake_branch field in terraform/vms.tf
   - Automatically generates cloud-init snippet setting NIXOS_FLAKE_BRANCH
   - Uploads snippet to Proxmox via SSH
   - Production VMs omit flake_branch and use master
4. Update documentation
   - Document --force flag usage in create-host README
   - Add branch testing examples in terraform README
   - Update TODO.md with testing workflow
   - Add .generated/ to gitignore
Testing workflow: Create feature branch, set flake_branch in VM definition,
deploy with terraform, iterate with --force flag, clean up before merging.
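The branch handling inside the bootstrap unit could look roughly like this fragment (repository URL and unit name are illustrative):

```nix
systemd.services.nixos-bootstrap = {
  # Pick up NIXOS_FLAKE_BRANCH written by cloud-init; "-" makes a missing
  # file non-fatal so production VMs simply fall back to master.
  serviceConfig.EnvironmentFile = "-/etc/environment";
  script = ''
    branch="''${NIXOS_FLAKE_BRANCH:-master}"
    hostname="$(hostnamectl hostname)"
    nixos-rebuild boot \
      --flake "git+https://git.example.com/infra.git?ref=$branch#$hostname"
  '';
};
```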
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add systemd service that automatically bootstraps freshly deployed VMs
with their host-specific NixOS configuration from the flake repository.
Changes:
- hosts/template2/bootstrap.nix: New systemd oneshot service (sketched below) that:
  - Runs after cloud-init completes (ensures hostname is set)
  - Reads hostname from hostnamectl (set by cloud-init from Terraform)
  - Checks network connectivity via HTTPS (curl)
  - Runs nixos-rebuild boot with flake URL
  - Reboots on success, fails gracefully with clear errors on failure
- hosts/template2/configuration.nix: Configure cloud-init datasource
  - Changed from NoCloud to ConfigDrive (used by Proxmox)
  - Allows cloud-init to receive config from Proxmox
- hosts/template2/default.nix: Import bootstrap.nix module
- terraform/vms.tf: Add cloud-init disk to VMs
  - Configure disks.ide.ide2.cloudinit block
  - Removed invalid cloudinit_cdrom_storage parameter
  - Enables Proxmox to inject cloud-init configuration
- TODO.md: Mark Phase 3 as completed
This eliminates the manual nixos-rebuild step from the deployment workflow.
VMs now automatically pull and apply their configuration on first boot.
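A condensed sketch of the unit (flake URL, connectivity-check target, and ordering details are illustrative; the real module has fuller error handling):

```nix
{ pkgs, ... }:
{
  systemd.services.nixos-bootstrap = {
    description = "Apply the host-specific flake configuration on first boot";
    after = [ "cloud-final.service" "network-online.target" ];
    wants = [ "network-online.target" ];
    wantedBy = [ "multi-user.target" ];
    path = [ pkgs.curl pkgs.git pkgs.nixos-rebuild pkgs.systemd ];
    serviceConfig.Type = "oneshot";
    script = ''
      # Hostname was set by cloud-init from the Terraform VM name.
      hostname="$(hostnamectl hostname)"

      # Fail early with a clear error if the network is not up yet.
      curl --fail --silent --head https://git.example.com > /dev/null

      nixos-rebuild boot --flake "git+https://git.example.com/infra.git#$hostname"
      systemctl reboot
    '';
  };
}
```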
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements Phase 1 of the OpenTofu deployment plan:
- Replace single-VM configuration with locals-based for_each pattern (sketched below)
- Support multiple VMs in single deployment
- Automatic DHCP vs static IP detection
- Configurable defaults with per-VM overrides
- Dynamic outputs for VM IPs and specifications
New files:
- outputs.tf: Dynamic outputs for deployed VMs
- vms.tf: VM definitions using locals.vms map
Updated files:
- variables.tf: Added default variables for VM configuration
- README.md: Comprehensive documentation and examples
Removed files:
- vm.tf: Replaced by new vms.tf (archived as vm.tf.old, then removed)
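A sketch of the locals/for_each pattern with per-VM overrides (attribute names and values are examples, not the real vms.tf):

```hcl
locals {
  vm_defaults = {
    cores  = 2
    memory = 2048
  }

  vms = {
    ns1       = { node = "pve1", ip = "192.168.1.53/24", gateway = "192.168.1.1" }
    dhcp-test = { node = "pve1", ip = "dhcp", gateway = "" }
  }
}

resource "proxmox_vm_qemu" "vm" {
  for_each = local.vms

  name        = each.key
  target_node = each.value.node
  cores       = try(each.value.cores, local.vm_defaults.cores)
  memory      = try(each.value.memory, local.vm_defaults.memory)

  # Automatic DHCP vs static IP detection.
  ipconfig0 = (
    each.value.ip == "dhcp"
    ? "ip=dhcp"
    : "ip=${each.value.ip},gw=${each.value.gateway}"
  )
}
```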
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add an automated workflow for building and deploying NixOS VMs on Proxmox, including the template2 host configuration, an Ansible playbook for image building and deployment, and OpenTofu configuration for VM provisioning with cloud-init.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>