# Host Migration to OpenTofu

## Overview
Migrate all existing hosts (provisioned manually before the OpenTofu pipeline) into the new OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements: stateless hosts are simply recreated, stateful hosts require backup and restore, and some hosts are decommissioned or deferred.
## Current State
Hosts already managed by OpenTofu: vault01, testvm01, vaulttest01
Hosts to migrate:
| Host | Category | Notes |
|---|---|---|
| ns1 | Stateless | Primary DNS, recreate |
| ns2 | Stateless | Secondary DNS, recreate |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| auth01 | Decommission | No longer in use |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Stateful | PostgreSQL databases |
| jump | Decommission | No longer needed |
| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
## Phase 1: Backup Preparation
Before migrating any stateful host, ensure restic backups are in place and verified.
### 1a. Expand monitoring01 Grafana Backup
The existing backup only covers `/var/lib/grafana/plugins` and a SQLite dump of `grafana.db`.
Expand it to back up all of `/var/lib/grafana/` to capture the config directory and any other state.
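A sketch of the expanded job, assuming the host uses the NixOS `services.restic.backups` module; the job name, repository URL, and password file below are placeholders, not the repo's actual values:

```nix
# Sketch only — repository, passwordFile, and job name are assumed placeholders.
services.restic.backups.grafana = {
  # Back up the whole state directory instead of just plugins/; the existing
  # grafana.db SQLite dump can remain as a pre-hook for a consistent DB copy.
  paths = [ "/var/lib/grafana" ];
  repository = "rest:https://backup.example/monitoring01";  # placeholder
  passwordFile = "/run/secrets/restic-password";            # placeholder
  timerConfig.OnCalendar = "daily";
};
```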
### 1b. Add Jellyfin Backup to jelly01
No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/`, which contains:
- `config/` — server settings, library configuration
- `data/` — user watch history, playback state, library metadata
Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.
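Under the same assumptions as above (NixOS restic module; repository and password file are placeholders), the new job might look like:

```nix
# Sketch only — repository and passwordFile are assumed placeholders.
services.restic.backups.jellyfin = {
  # config/ (server settings) and data/ (watch history, metadata) live here;
  # /var/cache/jellyfin is outside this path and is deliberately not backed up.
  paths = [ "/var/lib/jellyfin" ];
  repository = "rest:https://backup.example/jelly01";  # placeholder
  passwordFile = "/run/secrets/restic-password";       # placeholder
  timerConfig.OnCalendar = "daily";
};
```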
### 1c. Add PostgreSQL Backup to pgdb1
No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
all databases and roles. The dump should be piped through restic's stdin backup (similar to
the Grafana DB dump pattern on monitoring01).
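One possible shape for this job, as a hedged sketch: restic itself supports `restic backup --stdin --stdin-filename pgdb1.sql` for piped input, but since the NixOS module wraps the restic invocation, this sketch stages the dump to a file with `backupPrepareCommand` instead. All names and paths are placeholders:

```nix
# Sketch only — paths, repository, and credentials are assumed placeholders.
services.restic.backups.postgres = {
  # Dump every database plus roles before restic runs; pg_dumpall must run
  # as a role with sufficient privileges (e.g. via `sudo -u postgres`).
  backupPrepareCommand = ''
    mkdir -p /var/backup
    /run/current-system/sw/bin/pg_dumpall --username=postgres > /var/backup/pgdb1-dumpall.sql
  '';
  paths = [ "/var/backup/pgdb1-dumpall.sql" ];
  repository = "rest:https://backup.example/pgdb1";  # placeholder
  passwordFile = "/run/secrets/restic-password";     # placeholder
  timerConfig.OnCalendar = "daily";
};
```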
### 1d. Verify Existing ha1 Backup
ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, and `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.
### 1e. Verify All Backups
After adding/expanding backup jobs:
- Trigger a manual backup run on each host
- Verify backup integrity with `restic check`
- Test a restore to a temporary location to confirm data is recoverable
## Phase 2: Declare pgdb1 Databases in Nix
Before migrating pgdb1, audit the manually created databases and users on the running
instance, then declare them in the Nix configuration using `ensureDatabases` and
`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
Steps:
- SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
- Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
- Document any non-default PostgreSQL settings or extensions per database
After reprovisioning, the databases will be created by NixOS, and data restored from the
`pg_dumpall` backup.
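The declarations might look like this sketch; the database and role names are examples only and should come from the `\l`/`\du` audit. (`ensureDBOwnership` is the current option name; older NixOS releases used a different permissions option, so check the release notes for the pinned nixpkgs.)

```nix
# Sketch only — names are examples; fill in from the audit of the running instance.
services.postgresql = {
  ensureDatabases = [ "exampledb" ];
  ensureUsers = [
    {
      name = "exampledb";         # role matching the database name
      ensureDBOwnership = true;   # grants ownership of the same-named database
    }
  ];
};
```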
## Phase 3: Stateless Host Migration
These hosts have no meaningful state and can be recreated fresh. For each host:
- Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
- Commit and push to master
- Run `tofu apply` to provision the new VM
- Wait for bootstrap to complete (VM pulls config from master and reboots)
- Verify the host is functional
- Decommission the old VM in Proxmox
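The host-definition step might look roughly like this; the module name, source path, and values are assumptions about the layout of `terraform/vms.tf`, not its actual contents:

```hcl
# Hypothetical sketch — module layout, names, and addresses are assumed.
module "nix_cache01" {
  source   = "./modules/proxmox-vm"  # assumed module path
  hostname = "nix-cache01"
  cores    = 4
  memory   = 4096
  ip       = "10.69.30.20/24"        # placeholder address
}
```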
### Migration Order
Migrate stateless hosts in an order that minimizes disruption:
- nix-cache01 — low risk, no downstream dependencies during migration
- nats1 — low risk, verify no persistent JetStream streams first
- http-proxy — brief disruption to proxied services, migrate during low-traffic window
- ns1, ns2 — migrate one at a time, verify DNS resolution between each
For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
## Phase 4: Stateful Host Migration
For each stateful host, the procedure is:
- Trigger a final restic backup
- Stop services on the old host (to prevent state drift during migration)
- Provision the new VM via `tofu apply`
- Wait for bootstrap to complete
- Stop the relevant services on the new host
- Restore data from restic backup
- Start services and verify functionality
- Decommission the old VM
### 4a. pgdb1
- Run final `pg_dumpall` backup via restic
- Stop PostgreSQL on the old host
- Provision new pgdb1 via OpenTofu
- After bootstrap, NixOS creates the declared databases/users
- Restore data with `psql -f dumpall.sql` (`pg_dumpall` emits plain SQL, so restore goes through psql rather than `pg_restore`, which only reads custom-format dumps)
- Verify database connectivity from gunter (`10.69.30.105`)
- Decommission old VM
### 4b. monitoring01
- Run final Grafana backup
- Provision new monitoring01 via OpenTofu
- After bootstrap, restore `/var/lib/grafana/` from restic
- Restart Grafana, verify dashboards and datasources are intact
- Prometheus and Loki start fresh with empty data (acceptable)
- Verify all scrape targets are being collected
- Decommission old VM
### 4c. jelly01
- Run final Jellyfin backup
- Provision new jelly01 via OpenTofu
- After bootstrap, restore `/var/lib/jellyfin/` from restic
- Verify NFS mount to NAS is working
- Start Jellyfin, verify watch history and library metadata are present
- Decommission old VM
### 4d. ha1
- Verify latest restic backup is current
- Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
- Provision new ha1 via OpenTofu
- After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
- Start services, verify Home Assistant is functional
- Verify Zigbee devices are still paired and communicating
- Decommission old VM
Note: ha1 currently has 2 GB RAM, which is consistently tight. Average memory usage has climbed from ~57% (30-day avg) to ~70% currently, with a 30-day low of only 187 MB free. Consider increasing to 4 GB when reprovisioning to allow headroom for additional integrations.
Note: ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but verify during a
low-risk time window.
USB Passthrough: The ha1 VM has a USB device passed through from the Proxmox hypervisor
(the Zigbee coordinator). The new VM must be configured with the same USB passthrough in
OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropriate
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.
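As a rough illustration only: the exact block name and attributes depend on which Proxmox provider the repo uses (the shape below follows the Telmate `proxmox_vm_qemu` convention), and the device ID is a made-up example:

```hcl
# Sketch only — provider syntax and the vendor:product ID are assumptions.
resource "proxmox_vm_qemu" "ha1" {
  # ...rest of the VM definition as declared in terraform/vms.tf...

  # Pass the Zigbee coordinator through; find the ID with `lsusb` on the hypervisor.
  usb {
    host = "0451:16a8"  # example vendor:product ID, replace with the real one
  }
}
```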
## Phase 5: Decommission jump and auth01 Hosts
### jump
- Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
- Remove host configuration from `hosts/jump/`
- Remove from `flake.nix`
- Remove any secrets in `secrets/jump/`
- Remove from `.sops.yaml`
- Destroy the VM in Proxmox
- Commit cleanup
### auth01
- Remove host configuration from `hosts/auth01/`
- Remove from `flake.nix`
- Remove any secrets in `secrets/auth01/`
- Remove from `.sops.yaml`
- Remove `services/authelia/` and `services/lldap/` (only used by auth01)
- Destroy the VM in Proxmox
- Commit cleanup
## Phase 6: Decommission ca Host (Deferred)
Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following the same cleanup steps as the jump host.
## Phase 7: Remove sops-nix
Once ca is decommissioned (Phase 6), sops-nix is no longer used by any host. Remove
all remnants:
- The `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- The `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`, `hosts/template2/scripts.nix`)

See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
## Notes
- Each host migration should be done individually, not in bulk, to limit blast radius
- Keep the old VM running until the new one is verified — do not destroy prematurely
- The old VMs use IPs that the new VMs need, so the old VM must be shut down before the new one is provisioned (or use a temporary IP and swap after verification)
- Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only host not in OpenTofu will be ca (deferred)
- Since many hosts are being recreated, this is a good opportunity to establish consistent
hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
(e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
convention before starting migrations — e.g. whether to always use numeric suffixes, a
consistent format like `service-NN`, role-based vs function-based names, etc.