# Host Migration to OpenTofu

## Overview

Migrate all existing hosts (provisioned manually before the OpenTofu pipeline) into the new
OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements:
stateless hosts are simply recreated, stateful hosts require backup and restore, and some
hosts are decommissioned or deferred.

## Current State

Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`

Hosts to migrate:

| Host | Category | Notes |
|------|----------|-------|
| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
| ~~ca~~ | ~~Deferred~~ | ✓ Complete |

## Phase 1: Backup Preparation

Before migrating any stateful host, ensure restic backups are in place and verified.

### 1a. Expand monitoring01 Grafana Backup

The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.

### 1b. Add Jellyfin Backup to jelly01

No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` which contains:
- `config/`: server settings, library configuration
- `data/`: user watch history, playback state, library metadata

Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup because it regenerates.

### 1c. Verify Existing ha1 Backup

ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.

### 1d. Verify All Backups

After adding/expanding backup jobs:
1. Trigger a manual backup run on each host
2. Verify backup integrity with `restic check`
3. Test a restore to a temporary location to confirm data is recoverable
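
The three verification steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual backup tooling: it assumes `RESTIC_REPOSITORY` and a password source are already exported in the environment, that backups can be triggered by calling `restic` directly, and the `run` and `verify_backup` helpers and the `/tmp/restore-test` scratch path are hypothetical names. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only. Assumes RESTIC_REPOSITORY and a password source are already
# exported (hypothetical setup). DRY_RUN=1 (the default) prints each command
# instead of executing it.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

verify_backup() {
    path="$1"                          # state directory, e.g. /var/lib/grafana
    target="${2:-/tmp/restore-test}"   # scratch location for the test restore

    run restic backup "$path"          # 1. manual backup run
    run restic check                   # 2. repository integrity check
    # 3. test restore; "latest" is the newest snapshot in the repository, so
    #    add --host/--path filters if the repository is shared between hosts.
    run restic restore latest --target "$target" --include "$path"
    run ls -la "$target$path"          # inspect the restored tree by hand
}

verify_backup /var/lib/grafana
```

Setting `DRY_RUN=0` executes the commands instead of printing them. Inspecting the restored tree by hand is deliberate here, since what counts as "recoverable" differs per service.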

## Phase 2: Stateless Host Migration

These hosts have no meaningful state and can be recreated fresh. For each host:

1. Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
2. Commit and push to master
3. Run `tofu apply` to provision the new VM
4. Wait for bootstrap to complete (VM pulls config from master and reboots)
5. Verify the host is functional
6. Decommission the old VM in Proxmox
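
As a rough sketch of steps 2 to 5 for a single host, assuming the OpenTofu configuration lives in `terraform/` (inferred from `terraform/vms.tf`) and that the new VM is reachable over SSH under its hostname. The `run` and `provision_host` helpers are hypothetical, and SSH reachability is only a crude stand-in for "bootstrap complete". The script prints commands by default rather than running them.

```shell
#!/bin/sh
# Sketch of steps 2-5 for one stateless host. DRY_RUN=1 (default) only prints.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

provision_host() {
    host="$1"
    run git push origin master             # step 2: config must be on master
    run tofu -chdir=terraform plan         # review the planned changes first
    run tofu -chdir=terraform apply        # step 3: create the VM
    # Step 4 (crude proxy): poll until SSH answers. Bootstrap includes a
    # reboot, so the first success may predate it; re-check before step 6.
    until run ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; do
        sleep 10
    done
    run ssh "$host" systemctl is-system-running   # step 5: coarse health check
}

provision_host nix-cache01
```

Step 6 is left out on purpose: decommissioning the old VM stays a manual decision once the new host has been verified.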

### Migration Order

Migrate stateless hosts in an order that minimizes disruption:

1. **nix-cache01**: low risk, no downstream dependencies during migration
2. **nats1**: low risk, verify no persistent JetStream streams first
3. **http-proxy**: brief disruption to proxied services, migrate during low-traffic window
4. ~~**ns1**: ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete

~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ Both ns1
and ns2 migration complete. Zone transfer (AXFR) verified working between ns1 (primary) and
ns2 (secondary).
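
One way to re-check zone replication later is to compare the SOA serial reported by both servers. The sketch below shows only the comparison logic on sample answers; the `soa_serial` and `serials_match` helpers, the `$ZONE` variable, and the sample SOA line are illustrative assumptions, and the live `dig` queries appear as comments because they need network access.

```shell
#!/bin/sh
# Sketch: compare SOA serials from primary and secondary.

# Extract the serial (third field) from one `dig +short SOA` answer line.
soa_serial() {
    printf '%s\n' "$1" | awk '{print $3}'
}

# Succeeds only if the first answer is non-empty and both serials are equal.
serials_match() {
    [ -n "$1" ] && [ "$(soa_serial "$1")" = "$(soa_serial "$2")" ]
}

# Live usage (not executed here; $ZONE is a placeholder for the real zone):
#   primary="$(dig +short SOA "$ZONE" @ns1)"
#   secondary="$(dig +short SOA "$ZONE" @ns2)"
#   dig AXFR "$ZONE" @ns1      # full transfer, if the client is allowed

# Sample answers for illustration only.
primary='ns1.example.net. hostmaster.example.net. 2024010101 3600 900 604800 86400'
secondary='ns1.example.net. hostmaster.example.net. 2024010101 3600 900 604800 86400'

if serials_match "$primary" "$secondary"; then
    echo "zone in sync (serial $(soa_serial "$primary"))"
else
    echo "serial mismatch: check zone transfer"
fi
```

Equal serials show that the secondary has caught up with the primary; they do not prove that a fresh AXFR would succeed, which is why the explicit transfer is listed separately.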

## Phase 3: Stateful Host Migration

For each stateful host, the procedure is:

1. Trigger a final restic backup
2. Stop services on the old host (to prevent state drift during migration)
3. Provision the new VM via `tofu apply`
4. Wait for bootstrap to complete
5. Stop the relevant services on the new host
6. Restore data from restic backup
7. Start services and verify functionality
8. Decommission the old VM
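
Steps 5 to 7, run on the new host, might look roughly like this. It is a sketch under assumptions: the systemd unit names are guesses rather than values taken from this repository, restic is assumed to be configured through environment variables, and `run` and `restore_state` are hypothetical helpers. Commands are printed, not executed, unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch of steps 5-7 on the new host. DRY_RUN=1 (default) only prints.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: restore_state "unit1 unit2 ..." /path/one /path/two ...
restore_state() {
    services="$1"    # space-separated systemd unit names (assumed names)
    shift

    for unit in $services; do
        run systemctl stop "$unit"           # step 5: nothing writes during restore
    done
    for path in "$@"; do
        # step 6: restore in place; filter with --host/--path on shared repos
        run restic restore latest --target / --include "$path"
    done
    for unit in $services; do
        run systemctl start "$unit"          # step 7: bring services back
    done
    for unit in $services; do
        run systemctl is-active "$unit"      # coarse check before manual testing
    done
}

restore_state "home-assistant zigbee2mqtt mosquitto" \
    /var/lib/hass /var/lib/zigbee2mqtt /var/lib/mosquitto
```

`systemctl is-active` only confirms that the units started; the functional checks listed per host below still have to be done by hand.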

### 3a. monitoring01

1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
3. After bootstrap, restore `/var/lib/grafana/` from restic
4. Restart Grafana, verify dashboards and datasources are intact
5. Prometheus and Loki start fresh with empty data (acceptable)
6. Verify all scrape targets are being collected
7. Decommission old VM

### 3b. jelly01

1. Run final Jellyfin backup
2. Provision new jelly01 via OpenTofu
3. After bootstrap, restore `/var/lib/jellyfin/` from restic
4. Verify NFS mount to NAS is working
5. Start Jellyfin, verify watch history and library metadata are present
6. Decommission old VM

### 3c. ha1

1. Verify latest restic backup is current
2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
3. Provision new ha1 via OpenTofu
4. After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
5. Start services, verify Home Assistant is functional
6. Verify Zigbee devices are still paired and communicating
7. Decommission old VM

**Note:** ha1 currently has 2 GB RAM, which is consistently tight. Average memory usage has
climbed from ~57% (30-day avg) to ~70% currently, with a 30-day low of only 187 MB free.
Consider increasing to 4 GB when reprovisioning to allow headroom for additional integrations.

**Note:** ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but verify this during
a non-critical time window.

**USB Passthrough:** The ha1 VM has a USB device passed through from the Proxmox hypervisor
(the Zigbee coordinator). The new VM must be configured with the same USB passthrough in
OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropriate
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.
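
To look up the USB device ID on the hypervisor, the `vendor:product` pair can be read from `lsusb` output. The sketch below parses a sample line; the `usb_id` helper is hypothetical and the sample adapter is purely illustrative, not necessarily the coordinator in use. How the ID is then written in the `usb` block depends on the Proxmox provider being used, which this document does not specify.

```shell
#!/bin/sh
# Sketch: extract the vendor:product ID of a USB device from lsusb output.

# Reads lsusb-formatted lines on stdin and prints the ID (6th field) of every
# line matching the given case-insensitive pattern.
usb_id() {
    grep -i -- "$1" | awk '{print $6}'
}

# On the hypervisor this would be:   lsusb | usb_id '<part of the device name>'
# Sample line for illustration only (a common USB serial bridge):
sample='Bus 001 Device 004: ID 10c4:ea60 Silicon Labs CP210x UART Bridge'
printf '%s\n' "$sample" | usb_id 'cp210x'   # prints 10c4:ea60
```

If several attached devices share the same vendor and product ID, the ID alone is ambiguous and the bus and port position has to be used to tell them apart.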

## Phase 4: Decommission Hosts

### jump ✓ COMPLETE

~~1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)~~
~~2. Remove host configuration from `hosts/jump/`~~
~~3. Remove from `flake.nix`~~
~~4. Remove any secrets in `secrets/jump/`~~
~~5. Remove from `.sops.yaml`~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~

Host was already removed from flake.nix and VM destroyed. Configuration cleaned up in ba9f47f.

### auth01 ✓ COMPLETE

~~1. Remove host configuration from `hosts/auth01/`~~
~~2. Remove from `flake.nix`~~
~~3. Remove any secrets in `secrets/auth01/`~~
~~4. Remove from `.sops.yaml`~~
~~5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~

Host configuration, services, and VM already removed.

### pgdb1 (in progress)

Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.

1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
4. ~~Remove from `flake.nix`~~ ✓
5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
6. Destroy the VM in Proxmox
7. ~~Commit cleanup~~ ✓

See `docs/plans/pgdb1-decommission.md` for detailed plan.

## Phase 5: Decommission ca Host ✓ COMPLETE

~~Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
the same cleanup steps as the jump host.~~

PKI migration to OpenBao complete. Host configuration, `services/ca/`, and VM removed.

## Phase 6: Remove sops-nix ✓ COMPLETE

~~Once `ca` is decommissioned (Phase 5), `sops-nix` is no longer used by any host. Remove
all remnants:~~
~~- `sops-nix` input from `flake.nix` and `flake.lock`~~
~~- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`~~
~~- `inherit sops-nix` from all specialArgs in `flake.nix`~~
~~- `system/sops.nix` and its import in `system/default.nix`~~
~~- `.sops.yaml`~~
~~- `secrets/` directory~~
~~- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`~~
~~- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
  `hosts/template2/scripts.nix`)~~

All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migration.md` for context.

## Notes

- Each host migration should be done individually, not in bulk, to limit blast radius
- Do not destroy the old VM until the new one is verified; keep it available (powered off if
  necessary) until then
- The old VMs use IPs that the new VMs need, so the old VM must be shut down before
  the new one is provisioned (or use a temporary IP and swap after verification)
- Stateful migrations should be done during low-usage windows
- The decommissioned hosts `jump`, `auth01`, and `ca` have already been removed; only the
  `pgdb1` VM still has to be destroyed
- Since many hosts are being recreated, this is a good opportunity to establish consistent
  hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
  (e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
  convention before starting migrations, e.g. whether to always use numeric suffixes, a
  consistent format like `service-NN`, role-based vs function-based names, etc.
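
Once a convention is chosen, hostnames can be checked against it mechanically. The sketch below uses the `service-NN` format mentioned above purely as an example, not as a recommendation; the pattern and the `conforms` helper are hypothetical.

```shell
#!/bin/sh
# Sketch: check hostnames against a naming pattern. The default pattern is
# one possible reading of `service-NN`: lowercase words joined by hyphens,
# followed by a hyphen and exactly two digits.
DEFAULT_PATTERN='^[a-z]+(-[a-z]+)*-[0-9][0-9]$'
PATTERN="${PATTERN:-$DEFAULT_PATTERN}"

# Succeeds if the given hostname matches the pattern.
conforms() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

for host in ns1 nix-cache01 ha1 http-proxy ns-01 nix-cache-01; do
    if conforms "$host"; then
        echo "ok   $host"
    else
        echo "FAIL $host"
    fi
done
```

With this example pattern none of the current names pass, and only the last two illustrative names do. A different convention only needs a different `PATTERN`.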