# Host Migration to OpenTofu

## Overview

Migrate all existing hosts (provisioned manually before the OpenTofu pipeline) into the new
OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements:
stateless hosts are simply recreated, stateful hosts require backup and restore, and some
hosts are decommissioned or deferred.

## Current State

Hosts already managed by OpenTofu: `vault01`, `testvm01`, `vaulttest01`

Hosts to migrate:

| Host | Category | Notes |
|------|----------|-------|
| ns1 | Stateless | Primary DNS, recreate |
| ns2 | Stateless | Secondary DNS, recreate |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| auth01 | Stateless | Authentication, recreate |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Stateful | PostgreSQL databases |
| jump | Decommission | No longer needed |
| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |

## Phase 1: Backup Preparation

Before migrating any stateful host, ensure restic backups are in place and verified.

### 1a. Expand monitoring01 Grafana Backup

The existing backup only covers `/var/lib/grafana/plugins` and a SQLite dump of `grafana.db`.
Expand it to back up all of `/var/lib/grafana/` so that the configuration directory and any
other state are captured.

### 1b. Add Jellyfin Backup to jelly01

No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/`, which contains:
- `config/`: server settings, library configuration
- `data/`: user watch history, playback state, library metadata

Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup because it regenerates.

### 1c. Add PostgreSQL Backup to pgdb1

No backup currently exists. Add a restic backup job that runs `pg_dumpall` to capture
all databases and roles, piping the dump into restic's stdin backup mode
(`restic backup --stdin`), similar to the Grafana DB dump pattern on monitoring01.

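A minimal sketch of the stdin pattern for pgdb1, assuming the restic repository and password are supplied through the environment (for example `RESTIC_REPOSITORY` and `RESTIC_PASSWORD_FILE`) and that the script runs as a user allowed to run `pg_dumpall`. The function names and the `pgdb1-dumpall.sql` filename are hypothetical and not taken from the existing configuration.

```python
import subprocess


def build_dump_pipeline(stdin_filename="pgdb1-dumpall.sql"):
    """Return the argv lists for `pg_dumpall | restic backup --stdin`."""
    dump_cmd = ["pg_dumpall"]
    backup_cmd = ["restic", "backup", "--stdin", "--stdin-filename", stdin_filename]
    return dump_cmd, backup_cmd


def run_dump_backup(stdin_filename="pgdb1-dumpall.sql"):
    """Stream the dump straight into restic without writing it to local disk."""
    dump_cmd, backup_cmd = build_dump_pipeline(stdin_filename)
    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    backup = subprocess.Popen(backup_cmd, stdin=dump.stdout)
    dump.stdout.close()  # lets pg_dumpall see a broken pipe if restic exits early
    backup_rc = backup.wait()
    dump_rc = dump.wait()
    # Check both exit codes: restic can store a truncated dump "successfully".
    if dump_rc != 0 or backup_rc != 0:
        raise RuntimeError(f"backup failed: pg_dumpall={dump_rc}, restic={backup_rc}")
```

Checking both exit codes matters here because a failed `pg_dumpall` would otherwise leave a snapshot that looks valid but contains an incomplete dump.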
### 1d. Verify Existing ha1 Backup

ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, and `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.

### 1e. Verify All Backups

After adding/expanding backup jobs:
1. Trigger a manual backup run on each host
2. Verify backup integrity with `restic check`
3. Test a restore to a temporary location to confirm data is recoverable

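Steps 2 and 3 can be sketched as follows, assuming the restic repository and credentials come from the environment. Step 1 is left out because the names of the backup units on each host are not given here. The function names are hypothetical.

```python
import subprocess
import tempfile


def verification_commands(restore_target):
    """Commands for the integrity check and the test restore."""
    return [
        ["restic", "check"],
        ["restic", "restore", "latest", "--target", restore_target],
    ]


def verify_backup(run=subprocess.run):
    """Run both commands, restoring into a throwaway directory."""
    with tempfile.TemporaryDirectory(prefix="restic-verify-") as target:
        for cmd in verification_commands(target):
            run(cmd, check=True)  # subprocess.run raises CalledProcessError on failure
        # The directory is removed on exit; inspect the restored files here
        # if the check should go beyond "restore completed without error".
```

The restore target is temporary on purpose, so a test restore never overwrites live data on the host being verified.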
## Phase 2: Declare pgdb1 Databases in Nix

Before migrating pgdb1, audit the manually-created databases and users on the running
instance, then declare them in the Nix configuration using `ensureDatabases` and
`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.

Steps:
1. SSH to pgdb1 and run `\l` and `\du` in psql to list databases and roles
2. Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
3. Document any non-default PostgreSQL settings or extensions per database

After reprovisioning, NixOS creates the declared databases and users, and the data is
restored from the `pg_dumpall` backup.

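As a rough aid for step 2, the audited names can be turned into a Nix fragment mechanically. This is a sketch only: the database and role names in the usage note are made up, the helper name is hypothetical, and it only emits the `name` attribute of `ensureUsers`, so ownership and role clauses still need to be added by hand.

```python
import re

# Only plain identifiers are rendered; anything else needs manual quoting.
_SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def render_nix(databases, users):
    """Render a services.postgresql fragment from audited names."""
    for name in list(databases) + list(users):
        if not _SAFE_NAME.match(name):
            raise ValueError(f"name needs manual handling: {name!r}")
    db_list = " ".join(f'"{d}"' for d in databases)
    user_items = "\n".join(f'    {{ name = "{u}"; }}' for u in users)
    return (
        "services.postgresql = {\n"
        f"  ensureDatabases = [ {db_list} ];\n"
        "  ensureUsers = [\n"
        f"{user_items}\n"
        "  ];\n"
        "};\n"
    )
```

Calling `print(render_nix(["app"], ["app"]))` with the placeholder name `app` produces a fragment that can be pasted into the PostgreSQL module and then reviewed against the `\l` and `\du` output.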
## Phase 3: Stateless Host Migration

These hosts have no meaningful state and can be recreated fresh. For each host:

1. Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
2. Commit and push to master
3. Run `tofu apply` to provision the new VM
4. Wait for bootstrap to complete (VM pulls config from master and reboots)
5. Verify the host is functional
6. Decommission the old VM in Proxmox

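The wait in step 4 can be automated with a simple poll. The sketch below assumes that an answering SSH port (22) is an acceptable signal; since the VM reboots during bootstrap, an open port is only a rough indicator and does not replace the functional check in step 5. The function names and default timings are illustrative.

```python
import socket
import time


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port=22, deadline_s=900, interval_s=10, sleep=time.sleep):
    """Poll until the port answers or the deadline passes."""
    waited = 0
    while waited <= deadline_s:
        if port_open(host, port):
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

The same `port_open` check can be pointed at a service port (for example the reverse proxy or the binary cache) once the host is up.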
### Migration Order

Migrate stateless hosts in an order that minimizes disruption:

1. **nix-cache01**: low risk, no downstream dependencies during migration
2. **auth01**: low risk
3. **nats1**: low risk, verify no persistent JetStream streams first
4. **http-proxy**: brief disruption to proxied services, migrate during low-traffic window
5. **ns1, ns2**: migrate one at a time, verify DNS resolution between each

For ns1/ns2: migrate ns2 first (secondary), verify that AXFR (zone transfer from ns1) works,
then migrate ns1. All hosts use both ns1 and ns2 as resolvers, so one being down briefly is
tolerable.

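One indirect way to confirm that zone transfers reach the rebuilt secondary is to compare SOA serials on both servers. The sketch below shells out to `dig`; the zone name passed in is an assumption (the actual zone is not stated here), and the function names are hypothetical.

```python
import subprocess


def parse_soa_serial(dig_short_output):
    """Extract the serial (third field) from `dig +short SOA` output."""
    fields = dig_short_output.split()
    if len(fields) < 7:
        raise ValueError(f"unexpected SOA answer: {dig_short_output!r}")
    return int(fields[2])


def soa_serial(zone, server):
    """Ask one specific name server for the zone's SOA serial."""
    result = subprocess.run(
        ["dig", "+short", "SOA", zone, f"@{server}"],
        capture_output=True, text=True, check=True,
    )
    return parse_soa_serial(result.stdout)


def serials_match(zone, primary="ns1", secondary="ns2"):
    """True when the secondary serves the same serial as the primary."""
    return soa_serial(zone, primary) == soa_serial(zone, secondary)
```

Matching serials show that the secondary holds the current zone, but they do not prove that the transfer mechanism itself is healthy, so a direct `dig AXFR` test from ns2 is still worthwhile.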
## Phase 4: Stateful Host Migration

For each stateful host, the procedure is:

1. Trigger a final restic backup
2. Stop services on the old host (to prevent state drift during migration)
3. Provision the new VM via `tofu apply`
4. Wait for bootstrap to complete
5. Stop the relevant services on the new host
6. Restore data from restic backup
7. Start services and verify functionality
8. Decommission the old VM

### 4a. pgdb1

1. Run final `pg_dumpall` backup via restic
2. Stop PostgreSQL on the old host
3. Provision new pgdb1 via OpenTofu
4. After bootstrap, NixOS creates the declared databases/users
5. Restore data with `psql -f dumpall.sql postgres` (`pg_dumpall` produces a plain SQL script, which `pg_restore` cannot read; "already exists" errors for roles and databases that NixOS has already created are expected)
6. Verify database connectivity from gunter (`10.69.30.105`)
7. Decommission old VM

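The restore can stream the dump from restic into `psql` without an intermediate file. This is a sketch under assumptions: the dump was stored with the hypothetical name `pgdb1-dumpall.sql`, restic credentials come from the environment, and the script runs as a PostgreSQL superuser. Confirm the file path inside the snapshot with `restic ls latest` before relying on it.

```python
import subprocess


def build_restore_pipeline(dump_name="pgdb1-dumpall.sql", snapshot="latest"):
    """Return the argv lists for `restic dump <snapshot> <file> | psql`."""
    dump_cmd = ["restic", "dump", snapshot, dump_name]
    # pg_dumpall output is a plain SQL script, so psql replays it from stdin.
    psql_cmd = ["psql", "--dbname", "postgres"]
    return dump_cmd, psql_cmd


def run_restore(dump_name="pgdb1-dumpall.sql", snapshot="latest"):
    """Pipe the stored dump into psql and fail if either process fails."""
    dump_cmd, psql_cmd = build_restore_pipeline(dump_name, snapshot)
    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    psql = subprocess.Popen(psql_cmd, stdin=dump.stdout)
    dump.stdout.close()
    psql_rc = psql.wait()
    dump_rc = dump.wait()
    if dump_rc != 0 or psql_rc != 0:
        raise RuntimeError(f"restore failed: restic={dump_rc}, psql={psql_rc}")
```

By default `psql` continues past individual SQL errors and still exits with status 0, so its output should be reviewed rather than trusting the exit code alone.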
### 4b. monitoring01

1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
3. After bootstrap, restore `/var/lib/grafana/` from restic
4. Restart Grafana, verify dashboards and datasources are intact
5. Prometheus and Loki start fresh with empty data (acceptable)
6. Verify all scrape targets are being collected
7. Decommission old VM

### 4c. jelly01

1. Run final Jellyfin backup
2. Provision new jelly01 via OpenTofu
3. After bootstrap, restore `/var/lib/jellyfin/` from restic
4. Verify NFS mount to NAS is working
5. Start Jellyfin, verify watch history and library metadata are present
6. Decommission old VM

### 4d. ha1

1. Verify latest restic backup is current
2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
3. Provision new ha1 via OpenTofu
4. After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
5. Start services, verify Home Assistant is functional
6. Verify Zigbee devices are still paired and communicating
7. Decommission old VM

**Note:** ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but migrate during a
non-critical time window and verify the pairings afterwards.

**USB Passthrough:** The ha1 VM has a USB device passed through from the Proxmox hypervisor
(the Zigbee coordinator). The new VM must be configured with the same USB passthrough in
OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropriate
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.

## Phase 5: Decommission jump Host

1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
2. Remove host configuration from `hosts/jump/`
3. Remove from `flake.nix`
4. Remove any secrets in `secrets/jump/`
5. Remove from `.sops.yaml`
6. Destroy the VM in Proxmox
7. Commit cleanup

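For step 1 and the final cleanup, a quick scan of the repository for leftover references can help. The sketch below is a plain substring search over a few file types; the suffix list and function name are assumptions, and a match on `jump` may also hit unrelated words, so the results need manual review.

```python
from pathlib import Path


def find_references(root, needle="jump", suffixes=(".nix", ".yaml", ".tf")):
    """Yield (path, line_number, line) for every line mentioning the host."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), 1):
            if needle in line:
                yield path, number, line.strip()
```

Running it from the repository root before and after the cleanup commit gives a simple before/after comparison; SSH client configs on other machines are outside its reach and still need a separate check.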
## Phase 6: Decommission ca Host (Deferred)

Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
the same cleanup steps as the jump host.

## Notes

- Each host migration should be done individually, not in bulk, to limit blast radius
- Do not destroy the old VM until the new one is verified; keep it (shut down if necessary)
  as a rollback path
- The new VMs reuse the IPs of the old VMs, so shut down the old VM before provisioning
  the new one (or provision with a temporary IP and swap after verification)
- Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only host not in OpenTofu will be ca (deferred)