From 0abdda8e8a45253c9c1b6d3dcf620b82b6cb581c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= <torjus@usit.uio.no>
Date: Thu, 5 Feb 2026 01:59:51 +0100
Subject: [PATCH] docs: add plan for migrating existing hosts to opentofu

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 docs/plans/host-migration-to-opentofu.md | 184 +++++++++++++++++++++++
 1 file changed, 184 insertions(+)
 create mode 100644 docs/plans/host-migration-to-opentofu.md

diff --git a/docs/plans/host-migration-to-opentofu.md b/docs/plans/host-migration-to-opentofu.md
new file mode 100644
index 0000000..ddcdb15
--- /dev/null
+++ b/docs/plans/host-migration-to-opentofu.md
@@ -0,0 +1,184 @@
+# Host Migration to OpenTofu
+
+## Overview
+
+Migrate all existing hosts (provisioned manually before the OpenTofu pipeline existed) into the new
+OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements:
+stateless hosts are simply recreated, stateful hosts require backup and restore, and some
+hosts are decommissioned or deferred.
+
+## Current State
+
+Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`
+
+Hosts to migrate:
+
+| Host | Category | Notes |
+|------|----------|-------|
+| ns1 | Stateless | Primary DNS, recreate |
+| ns2 | Stateless | Secondary DNS, recreate |
+| nix-cache01 | Stateless | Binary cache, recreate |
+| http-proxy | Stateless | Reverse proxy, recreate |
+| nats1 | Stateless | Messaging, recreate |
+| auth01 | Stateless | Authentication, recreate |
+| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
+| monitoring01 | Stateful | Prometheus, Grafana, Loki |
+| jelly01 | Stateful | Jellyfin metadata, watch history, config |
+| pgdb1 | Stateful | PostgreSQL databases |
+| jump | Decommission | No longer needed |
+| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
+
+## Phase 1: Backup Preparation
+
+Before migrating any stateful host, ensure restic backups are in place and verified.
+
+### 1a. Expand monitoring01 Grafana Backup
+
+The existing backup only covers `/var/lib/grafana/plugins` and an SQLite dump of `grafana.db`.
+Expand it to cover all of `/var/lib/grafana/`, capturing the config directory and any other state.
+
+### 1b. Add Jellyfin Backup to jelly01
+
+No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` which contains:
+- `config/` — server settings, library configuration
+- `data/` — user watch history, playback state, library metadata
+
+Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
+The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.
+
+### 1c. Add PostgreSQL Backup to pgdb1
+
+No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
+all databases and roles. The dump should be piped through restic's stdin backup (similar to
+the Grafana DB dump pattern on monitoring01).
+
+### 1d. Verify Existing ha1 Backup
+
+ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
+these backups are current and restorable before proceeding with migration.
+
+### 1e. Verify All Backups
+
+After adding/expanding backup jobs:
+1. Trigger a manual backup run on each host
+2. Verify backup integrity with `restic check`
+3. Test a restore to a temporary location to confirm data is recoverable
+
+## Phase 2: Declare pgdb1 Databases in Nix
+
+Before migrating pgdb1, audit the manually-created databases and users on the running
+instance, then declare them in the Nix configuration using `ensureDatabases` and
+`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
+
+Steps:
+1. SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
+2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
+3. Document any non-default PostgreSQL settings or extensions per database
+
+After reprovisioning, the databases will be created by NixOS, and data restored from the
+`pg_dumpall` backup.
+
+## Phase 3: Stateless Host Migration
+
+These hosts have no meaningful state and can be recreated fresh. For each host:
+
+1. Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
+2. Commit and push to master
+3. Run `tofu apply` to provision the new VM (shut down the old VM first if its IP is reused; see Notes)
+4. Wait for bootstrap to complete (VM pulls config from master and reboots)
+5. Verify the host is functional
+6. Decommission the old VM in Proxmox
+
+### Migration Order
+
+Migrate stateless hosts in an order that minimizes disruption:
+
+1. **nix-cache01** — low risk, no downstream dependencies during migration
+2. **auth01** — low risk
+3. **nats1** — low risk, verify no persistent JetStream streams first
+4. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
+5. **ns1, ns2** — migrate one at a time, verify DNS resolution between each
+
+For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts
+use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
+
+## Phase 4: Stateful Host Migration
+
+For each stateful host, the procedure is:
+
+1. Trigger a final restic backup
+2. Stop services on the old host (to prevent state drift during migration)
+3. Provision the new VM via `tofu apply`
+4. Wait for bootstrap to complete
+5. Stop the relevant services on the new host
+6. Restore data from restic backup
+7. Start services and verify functionality
+8. Decommission the old VM
+
+### 4a. pgdb1
+
+1. Run final `pg_dumpall` backup via restic
+2. Stop PostgreSQL on the old host
+3. Provision new pgdb1 via OpenTofu
+4. After bootstrap, NixOS creates the declared databases/users
+5. Restore data with `psql < dumpall.sql` (`pg_restore` cannot read plain-SQL `pg_dumpall` output)
+6. Verify database connectivity from gunter (`10.69.30.105`)
+7. Decommission old VM
+
+### 4b. monitoring01
+
+1. Run final Grafana backup
+2. Provision new monitoring01 via OpenTofu
+3. After bootstrap, stop Grafana and restore `/var/lib/grafana/` from restic
+4. Restart Grafana, verify dashboards and datasources are intact
+5. Prometheus and Loki start fresh with empty data (acceptable)
+6. Verify all scrape targets are being collected
+7. Decommission old VM
+
+### 4c. jelly01
+
+1. Run final Jellyfin backup
+2. Provision new jelly01 via OpenTofu
+3. After bootstrap, stop Jellyfin and restore `/var/lib/jellyfin/` from restic
+4. Verify NFS mount to NAS is working
+5. Start Jellyfin, verify watch history and library metadata are present
+6. Decommission old VM
+
+### 4d. ha1
+
+1. Verify latest restic backup is current
+2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
+3. Provision new ha1 via OpenTofu
+4. After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
+5. Start services, verify Home Assistant is functional
+6. Verify Zigbee devices are still paired and communicating
+7. Decommission old VM
+
+**Note:** ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
+coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but verify this during
+a non-critical time window.
+
+## Phase 5: Decommission jump Host
+
+1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
+2. Remove host configuration from `hosts/jump/`
+3. Remove from `flake.nix`
+4. Remove any secrets in `secrets/jump/`
+5. Remove from `.sops.yaml`
+6. Destroy the VM in Proxmox
+7. Commit cleanup
+
+## Phase 6: Decommission ca Host (Deferred)
+
+Deferred until the PKI migration to OpenBao (referred to as Phase 4c; unrelated to section 4c
+of this plan) is complete. Once all hosts use the OpenBao ACME endpoint for certificates, the
+step-ca host can be decommissioned following the same cleanup steps as the jump host.
+
+## Notes
+
+- Each host migration should be done individually, not in bulk, to limit blast radius
+- Keep the old VM intact (stopped if needed, never destroyed) until the new one is verified
+- The old VMs use IPs that the new VMs need, so the old VM must be shut down before
+  the new one is provisioned (or use a temporary IP and swap after verification)
+- Stateful migrations should be done during low-usage windows
+- After all migrations are complete, the only host not in OpenTofu will be ca (deferred)