nixos-servers/docs/plans/host-migration-to-opentofu.md

# Host Migration to OpenTofu

## Overview

Migrate all existing hosts (provisioned manually before the OpenTofu pipeline) into the new
OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements:
stateless hosts are simply recreated, stateful hosts require backup and restore, and some
hosts are decommissioned or deferred.

## Current State

Hosts already managed by OpenTofu: `vault01`, `testvm01`, `testvm02`, `testvm03`, `ns2`, `ns1`

Hosts to migrate:

| Host | Category | Notes |
|------|----------|-------|
| ~~ns1~~ | ~~Stateless~~ | ✓ Complete |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Decommission | Only used by Open WebUI on gunter, which now uses local PostgreSQL; VM destruction pending |
| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
| ~~ca~~ | ~~Deferred~~ | ✓ Complete |

## Phase 1: Backup Preparation

Before migrating any stateful host, ensure restic backups are in place and verified.

### 1a. Expand monitoring01 Grafana Backup

The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
Expand the job to back up all of `/var/lib/grafana/` so that the config directory and any
other state are captured.

### 1b. Add Jellyfin Backup to jelly01

No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/` which contains:
- `config/` — server settings, library configuration
- `data/` — user watch history, playback state, library metadata

Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.

### 1c. Verify Existing ha1 Backup

ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.

### 1d. Verify All Backups

After adding/expanding backup jobs:
1. Trigger a manual backup run on each host
2. Verify backup integrity with `restic check`
3. Test a restore to a temporary location to confirm data is recoverable
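
Steps 2 and 3 above could be scripted roughly as follows. This is a minimal sketch, not the repository's tooling: it assumes `restic` is on `PATH` and that the repository location and password come from the environment (for example `RESTIC_REPOSITORY` and `RESTIC_PASSWORD_FILE`). The function names and the `dry_run` flag are hypothetical. Step 1 is left out because the name of the backup job depends on how each host defines it.

```python
import subprocess
import tempfile


def check_cmd():
    """Command that verifies repository integrity."""
    return ["restic", "check"]


def restore_cmd(target, host=None):
    """Command that restores the latest snapshot into `target`.

    `host` optionally limits the snapshot search to one hostname.
    """
    cmd = ["restic", "restore", "latest", "--target", target]
    if host:
        cmd += ["--host", host]
    return cmd


def verify_backup(host=None, dry_run=True):
    """Run `restic check`, then test-restore into a temporary directory.

    With dry_run=True (the default) the commands are only returned,
    not executed. The restored files are discarded when the temporary
    directory is cleaned up, so inspect them inside the with-block
    if a closer look is needed.
    """
    with tempfile.TemporaryDirectory(prefix="restic-verify-") as tmp:
        cmds = [check_cmd(), restore_cmd(tmp, host)]
        if not dry_run:
            for cmd in cmds:
                # check=True raises CalledProcessError on a non-zero exit.
                subprocess.run(cmd, check=True)
        return cmds
```

Calling `verify_backup(host="jelly01")` only returns the two commands; pass `dry_run=False` on a machine with repository access to actually run them.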

## Phase 2: Stateless Host Migration

These hosts have no meaningful state and can be recreated fresh. For each host:

1. Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
2. Commit and push to master
3. Run `tofu apply` to provision the new VM
4. Wait for bootstrap to complete (VM pulls config from master and reboots)
5. Verify the host is functional
6. Decommission the old VM in Proxmox
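
For steps 4 and 5, a simple first gate is to wait until the new VM accepts TCP connections on its SSH port. The sketch below assumes SSH listens on port 22; the function name and defaults are hypothetical. A reachable port does not prove that bootstrap has finished, because the VM reboots during bootstrap, so treat this as a precondition for the real functional checks rather than a replacement for them.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=600.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if `timeout` seconds pass without a successful
    connection. `interval` is used both as the per-attempt connect
    timeout and as the pause between attempts.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_port("nats1")` would block for up to ten minutes, assuming the short hostname resolves from wherever the script runs.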

### Migration Order

Migrate stateless hosts in an order that minimizes disruption:

1. **nix-cache01** — low risk, no downstream dependencies during migration
2. **nats1** — low risk, verify no persistent JetStream streams first
3. **http-proxy** — brief disruption to proxied services, migrate during low-traffic window
4. ~~**ns1** — ns2 already migrated, verify AXFR works after ns1 migration~~ ✓ Complete

~~For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1.~~ The ns1
and ns2 migrations are both complete. Zone transfer (AXFR) has been verified between ns1
(primary) and ns2 (secondary).

## Phase 3: Stateful Host Migration

For each stateful host, the procedure is:

1. Trigger a final restic backup
2. Stop services on the old host (to prevent state drift during migration)
3. Provision the new VM via `tofu apply`
4. Wait for bootstrap to complete
5. Stop the relevant services on the new host
6. Restore data from restic backup
7. Start services and verify functionality
8. Decommission the old VM
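
Steps 5 to 7 on the new host follow the same stop, restore, start pattern for every stateful service. The sketch below only builds the command sequence, so it can be reviewed before anything runs. It assumes that a single restic snapshot for the host contains all listed paths and that data is restored in place with `--target /`. The function name is hypothetical, and the systemd unit names in the example are assumptions that should be checked against the actual host configuration.

```python
def restore_plan(host, services, paths):
    """Return the ordered commands to run on the NEW host.

    Covers stopping the services, restoring each path from the latest
    snapshot of `host`, and starting the services again.
    """
    plan = [["systemctl", "stop", *services]]

    restore = ["restic", "restore", "latest", "--host", host, "--target", "/"]
    for path in paths:
        # --include may be repeated to restore several paths in one run.
        restore += ["--include", path]
    plan.append(restore)

    plan.append(["systemctl", "start", *services])
    return plan


if __name__ == "__main__":
    # Unit names below are assumptions, not taken from the repository.
    for cmd in restore_plan(
        "ha1",
        ["home-assistant", "zigbee2mqtt", "mosquitto"],
        ["/var/lib/hass", "/var/lib/zigbee2mqtt", "/var/lib/mosquitto"],
    ):
        print(" ".join(cmd))
```

If the paths are backed up by separate jobs and therefore land in separate snapshots, `latest` would select only one of them; in that case build one restore command per path.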

### 3a. monitoring01

1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
3. After bootstrap, restore `/var/lib/grafana/` from restic
4. Restart Grafana, verify dashboards and datasources are intact
5. Prometheus and Loki start fresh with empty data (acceptable)
6. Verify all scrape targets are being collected
7. Decommission old VM

### 3b. jelly01

1. Run final Jellyfin backup
2. Provision new jelly01 via OpenTofu
3. After bootstrap, restore `/var/lib/jellyfin/` from restic
4. Verify NFS mount to NAS is working
5. Start Jellyfin, verify watch history and library metadata are present
6. Decommission old VM

### 3c. ha1

1. Verify latest restic backup is current
2. Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
3. Provision new ha1 via OpenTofu
4. After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
5. Start services, verify Home Assistant is functional
6. Verify Zigbee devices are still paired and communicating
7. Decommission old VM

**Note:** ha1 currently has 2 GB RAM, which is consistently tight. Memory usage is now at
~70%, up from a 30-day average of ~57%, and free memory dropped as low as 187 MB within the
last 30 days. Consider increasing to 4 GB when reprovisioning to allow headroom for
additional integrations.
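
As a rough check on the headroom argument, the figures in the note above can be turned into peak-usage percentages. The sketch assumes that "2 GB" and "4 GB" mean 2048 MB and 4096 MB, and that the 187 MB figure is comparable to total memory (free versus available memory may differ on the real host).

```python
def used_percent(total_mb, free_mb):
    """Percentage of memory in use, given total and free memory in MB."""
    return 100.0 * (total_mb - free_mb) / total_mb


TOTAL_NOW_MB = 2048      # assumption: "2 GB" means 2048 MB
TOTAL_PLANNED_MB = 4096  # assumption: "4 GB" means 4096 MB
WORST_FREE_MB = 187      # 30-day low quoted in the note

# Absolute memory in use at the worst point of the last 30 days.
peak_used_mb = TOTAL_NOW_MB - WORST_FREE_MB
peak_now = used_percent(TOTAL_NOW_MB, WORST_FREE_MB)
# The same absolute usage expressed against the planned 4 GB.
peak_planned = 100.0 * peak_used_mb / TOTAL_PLANNED_MB

print(f"peak used: {peak_used_mb} MB")
print(f"peak usage at 2 GB: {peak_now:.1f}%")
print(f"same peak at 4 GB: {peak_planned:.1f}%")
```

Under these assumptions the 30-day peak is about 91% of the current allocation and would be about 45% of a 4 GB allocation.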

**Note:** ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but schedule the
migration for a non-critical time window and verify the pairings afterwards.

**USB Passthrough:** The ha1 VM has a USB device passed through from the Proxmox hypervisor
(the Zigbee coordinator). The new VM must be configured with the same USB passthrough in
OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropriate
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.
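
To find the USB device ID on the hypervisor, the output of `lsusb` can be parsed into `vendor:product` pairs. The sketch below assumes the standard `lsusb` line format. The sample output is purely illustrative and does not describe the actual coordinator. How the ID is then written into the `usb` block depends on the Proxmox provider in use, which this sketch does not cover.

```python
import re

# One line of `lsusb` output, e.g.:
#   Bus 001 Device 004: ID 10c4:ea60 Silicon Labs CP210x UART Bridge
LSUSB_LINE = re.compile(
    r"^Bus (?P<bus>\d+) Device (?P<dev>\d+): "
    r"ID (?P<vendor>[0-9a-fA-F]{4}):(?P<product>[0-9a-fA-F]{4})\s*(?P<name>.*)$"
)


def parse_lsusb(output):
    """Return a list of {"id": "vvvv:pppp", "name": ...} dicts."""
    devices = []
    for line in output.splitlines():
        match = LSUSB_LINE.match(line.strip())
        if match:
            usb_id = f"{match['vendor']}:{match['product']}".lower()
            devices.append({"id": usb_id, "name": match["name"]})
    return devices


if __name__ == "__main__":
    # Illustrative sample only; run `lsusb` on the hypervisor for real data.
    sample = (
        "Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub\n"
        "Bus 001 Device 004: ID 10c4:ea60 Silicon Labs CP210x UART Bridge\n"
    )
    for device in parse_lsusb(sample):
        print(device["id"], device["name"])
```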

## Phase 4: Decommission Hosts

### jump ✓ COMPLETE

~~1. Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)~~
~~2. Remove host configuration from `hosts/jump/`~~
~~3. Remove from `flake.nix`~~
~~4. Remove any secrets in `secrets/jump/`~~
~~5. Remove from `.sops.yaml`~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~

Host was already removed from flake.nix and VM destroyed. Configuration cleaned up in ba9f47f.

### auth01 ✓ COMPLETE

~~1. Remove host configuration from `hosts/auth01/`~~
~~2. Remove from `flake.nix`~~
~~3. Remove any secrets in `secrets/auth01/`~~
~~4. Remove from `.sops.yaml`~~
~~5. Remove `services/authelia/` and `services/lldap/` (only used by auth01)~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~

Host configuration, services, and VM already removed.

### pgdb1 (in progress)

Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.

1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
4. ~~Remove from `flake.nix`~~ ✓
5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
6. Destroy the VM in Proxmox
7. ~~Commit cleanup~~ ✓

See `docs/plans/pgdb1-decommission.md` for detailed plan.

## Phase 5: Decommission ca Host ✓ COMPLETE

~~Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the
OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following
the same cleanup steps as the jump host.~~

PKI migration to OpenBao complete. Host configuration, `services/ca/`, and VM removed.

## Phase 6: Remove sops-nix ✓ COMPLETE

~~Once `ca` is decommissioned (Phase 5), `sops-nix` is no longer used by any host. Remove
all remnants:~~
~~- `sops-nix` input from `flake.nix` and `flake.lock`~~
~~- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`~~
~~- `inherit sops-nix` from all specialArgs in `flake.nix`~~
~~- `system/sops.nix` and its import in `system/default.nix`~~
~~- `.sops.yaml`~~
~~- `secrets/` directory~~
~~- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`~~
~~- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`,
  `hosts/template2/scripts.nix`)~~

All sops-nix remnants removed. See `docs/plans/completed/sops-to-openbao-migration.md` for context.

## Notes

- Each host migration should be done individually, not in bulk, to limit blast radius
- Do not destroy the old VM until the new one is verified; keep it available for rollback
- The new VMs reuse the IPs of the old VMs, so shut down (but do not destroy) the old VM before
  provisioning the new one, or give the new VM a temporary IP and swap after verification
- Stateful migrations should be done during low-usage windows
- The decommissioned hosts jump, auth01, and ca have already been removed; pgdb1 still needs its VM destroyed in Proxmox
- Since many hosts are being recreated, this is a good opportunity to establish consistent
  hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
  (e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
  convention before starting migrations — e.g. whether to always use numeric suffixes, a
  consistent format like `service-NN`, role-based vs function-based names, etc.
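
Once a convention is chosen, a short check can list which hostnames would need renaming. The pattern below is only one hypothetical candidate (lowercase words joined by hyphens, ending in a two-digit number); it is not a decision made by this plan. The host list is taken from the hosts named in this document, excluding decommissioned ones.

```python
import re

# Hypothetical candidate convention: lowercase words joined by hyphens,
# ending in a two-digit number (for example "nix-cache01").
CANDIDATE = re.compile(r"[a-z]+(?:-[a-z]+)*\d{2}")

HOSTS = [
    "ns1", "ns2", "nix-cache01", "http-proxy", "nats1", "ha1",
    "monitoring01", "jelly01", "vault01", "testvm01", "testvm02", "testvm03",
]


def nonconforming(hosts, pattern=CANDIDATE):
    """Return the hostnames that do not fully match `pattern`."""
    return [name for name in hosts if not pattern.fullmatch(name)]


if __name__ == "__main__":
    for name in nonconforming(HOSTS):
        print(name)
```

With this candidate pattern, `ns1`, `ns2`, `http-proxy`, `nats1`, and `ha1` would be flagged; swapping in a different regular expression shows the impact of other conventions.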