From 0abdda8e8a45253c9c1b6d3dcf620b82b6cb581c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Thu, 5 Feb 2026 01:59:51 +0100
Subject: [PATCH] docs: add plan for migrating existing hosts to opentofu

Co-Authored-By: Claude Opus 4.5
---
 docs/plans/host-migration-to-opentofu.md | 184 +++++++++++++++++++++++
 1 file changed, 184 insertions(+)
 create mode 100644 docs/plans/host-migration-to-opentofu.md

diff --git a/docs/plans/host-migration-to-opentofu.md b/docs/plans/host-migration-to-opentofu.md
new file mode 100644
index 0000000..ddcdb15
--- /dev/null
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -0,0 +1,184 @@
+# Host Migration to OpenTofu
+
+## Overview
+
+Migrate all existing hosts (provisioned manually before the OpenTofu pipeline) into the new
+OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements:
+stateless hosts are simply recreated, stateful hosts require backup and restore, and some
+hosts are decommissioned or deferred.
+
+## Current State
+
+Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
+
+Hosts to migrate:
+
+| Host | Category | Notes |
+|------|----------|-------|
+| ns1 | Stateless | Primary DNS, recreate |
+| ns2 | Stateless | Secondary DNS, recreate |
+| nix-cache01 | Stateless | Binary cache, recreate |
+| http-proxy | Stateless | Reverse proxy, recreate |
+| nats1 | Stateless | Messaging, recreate |
+| auth01 | Stateless | Authentication, recreate |
+| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
+| monitoring01 | Stateful | Prometheus, Grafana, Loki |
+| jelly01 | Stateful | Jellyfin metadata, watch history, config |
+| pgdb1 | Stateful | PostgreSQL databases |
+| jump | Decommission | No longer needed |
+| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
+
+## Phase 1: Backup Preparation
+
+Before migrating any stateful host, ensure restic backups are in place and verified.
+
+### 1a. Expand monitoring01 Grafana Backup
+
+The existing backup only covers `/var/lib/grafana/plugins` and a SQLite dump of `grafana.db`.
+Expand to back up all of `/var/lib/grafana/` to capture the config directory and any other state.
+
+### 1b. Add Jellyfin Backup to jelly01
+
+No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/`, which contains:
+- `config/`: server settings, library configuration
+- `data/`: user watch history, playback state, library metadata
+
+Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
+The cache directory (`/var/cache/jellyfin/`) does not need backup because it regenerates.
+
+### 1c. Add PostgreSQL Backup to pgdb1
+
+No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
+all databases and roles. The dump should be piped through restic's stdin backup (similar to
+the Grafana DB dump pattern on monitoring01).
+
+### 1d. Verify Existing ha1 Backup
+
+ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
+these backups are current and restorable before proceeding with migration.
+
+### 1e. Verify All Backups
+
+After adding/expanding backup jobs:
+1. Trigger a manual backup run on each host
+2. Verify backup integrity with `restic check`
+3. Test a restore to a temporary location to confirm data is recoverable
+
+## Phase 2: Declare pgdb1 Databases in Nix
+
+Before migrating pgdb1, audit the manually created databases and users on the running
+instance, then declare them in the Nix configuration using `ensureDatabases` and
+`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
+
+Steps:
+1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
+2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
+3. Document any non-default PostgreSQL settings or extensions per database
+
+After reprovisioning, the databases will be created by NixOS, and data restored from the
+`pg_dumpall` backup.
+
+## Phase 3: Stateless Host Migration
+
+These hosts have no meaningful state and can be recreated fresh. For each host:
+
+1. Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
+2. Commit and push to master
+3. Run `tofu apply` to provision the new VM
+4. Wait for bootstrap to complete (VM pulls config from master and reboots)
+5. Verify the host is functional
+6. Decommission the old VM in Proxmox
+
+### Migration Order
+
+Migrate stateless hosts in an order that minimizes disruption:
+
+1. **nix-cache01**: low risk, no downstream dependencies during migration
+2. **auth01**: low risk
+3. **nats1**: low risk, verify no persistent JetStream streams first
+4. **http-proxy**: brief disruption to proxied services, migrate during low-traffic window
+5. **ns1, ns2**: migrate one at a time, verify DNS resolution between each
+
+For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
+use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
+
+## Phase 4: Stateful Host Migration
+
+For each stateful host, the procedure is:
+
+1. Trigger a final restic backup
+2. Stop services on the old host (to prevent state drift during migration)
+3. Provision the new VM via `tofu apply`
+4. Wait for bootstrap to complete
+5. Stop the relevant services on the new host
+6. Restore data from restic backup
+7. Start services and verify functionality
+8. Decommission the old VM
+
+### 4a. pgdb1
+
+1. Run final `pg_dumpall` backup via restic
+2. Stop PostgreSQL on the old host
+3. Provision new pgdb1 via OpenTofu
+4. After bootstrap, NixOS creates the declared databases/users
+5. Restore the plain-SQL dump with `psql < dumpall.sql` (`pg_restore` cannot read `pg_dumpall` output)
+6. Verify database connectivity from gunter (`10.69.30.105`)
+7. Decommission old VM
+
+### 4b. monitoring01
+
+1. Run final Grafana backup
+2. Provision new monitoring01 via OpenTofu
+3. After bootstrap, restore `/var/lib/grafana/` from restic
+4. Restart Grafana, verify dashboards and datasources are intact
+5. Prometheus and Loki start fresh with empty data (acceptable)
+6. Verify all scrape targets are being collected
+7. Decommission old VM
+
+### 4c. jelly01
+
+1. Run final Jellyfin backup
+2. Provision new jelly01 via OpenTofu
+3. After bootstrap, restore `/var/lib/jellyfin/` from restic
+4. Verify NFS mount to NAS is working
+5. Start Jellyfin, verify watch history and library metadata are present
+6. Decommission old VM
+
+### 4d. ha1
+
+1. Verify latest restic backup is current
+2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
+3. Provision new ha1 via OpenTofu
+4. After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
+5. Start services, verify Home Assistant is functional
+6. Verify Zigbee devices are still paired and communicating
+7. Decommission old VM
+
+**Note:** ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
+coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but verify during a
+non-critical time window.
+
+## Phase 5: Decommission jump Host
+
+1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
+2. Remove host configuration from `hosts/jump/`
+3. Remove from `flake.nix`
+4. Remove any secrets in `secrets/jump/`
+5. Remove from `.sops.yaml`
+6. Destroy the VM in Proxmox
+7. Commit cleanup
+
+## Phase 6: Decommission ca Host (Deferred)
+
+Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
+OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
+the same cleanup steps as the jump host.
+
+## Notes
+
+- Each host migration should be done individually, not in bulk, to limit blast radius
+- Keep the old VM available for rollback until the new one is verified; do not destroy prematurely
+- The old VMs use IPs that the new VMs need, so the old VM must be shut down before
+  the new one is provisioned (or use a temporary IP and swap after verification)
+- Stateful migrations should be done during low-usage windows
+- After all migrations are complete, the only host not in OpenTofu will be ca (deferred)
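
The three verification steps in Phase 1e (manual backup run, `restic check`, test restore) could be scripted roughly as follows. This is a minimal sketch and not part of the patch. It assumes the backup jobs are defined with the NixOS restic module, so that each job has a `restic-backups-<job>.service` unit, and that the restic repository and password are already configured in the environment. The job name `grafana` and the `run` and `verify_backup` helpers are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the Phase 1e verification steps for one backup job.
# Assumptions (not stated in the plan):
#   - jobs are NixOS restic backups, i.e. units named restic-backups-<job>.service
#   - RESTIC_REPOSITORY and the repository password are already set up
set -euo pipefail

# Print the command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# verify_backup JOB [RESTORE_DIR]
verify_backup() {
  local job="$1"
  local restore_dir="${2:-/tmp/restore-test-$job}"

  # 1. Trigger a manual backup run
  run systemctl start "restic-backups-${job}.service"
  # 2. Verify repository integrity
  run restic check
  # 3. Restore the latest snapshot to a temporary location
  run restic restore latest --target "$restore_dir"
}
```

Calling `verify_backup grafana` prints the three commands, and `DRY_RUN=0 verify_backup grafana` runs them. If several hosts share one repository, the restore step would likely need a `--host` filter so that `latest` resolves to the intended host's snapshot.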