# Host Migration to OpenTofu

## Overview
Migrate all existing hosts (provisioned manually before the OpenTofu pipeline) into the new OpenTofu-managed provisioning workflow. Hosts are categorized by their state requirements: stateless hosts are simply recreated, stateful hosts require backup and restore, and some hosts are decommissioned or deferred.
## Current State
Hosts already managed by OpenTofu: vault01, testvm01, vaulttest01
Hosts to migrate:
| Host | Category | Notes |
|---|---|---|
| ns1 | Stateless | Primary DNS, recreate |
| ns2 | Stateless | Secondary DNS, recreate |
| nix-cache01 | Stateless | Binary cache, recreate |
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| auth01 | Decommission | No longer in use |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Stateful | PostgreSQL databases |
| jump | Decommission | No longer needed |
| ca | Deferred | Pending Phase 4c PKI migration to OpenBao |
## Phase 1: Backup Preparation
Before migrating any stateful host, ensure restic backups are in place and verified.
### 1a. Expand monitoring01 Grafana Backup
The existing backup only covers `/var/lib/grafana/plugins` and a SQLite dump of `grafana.db`.
Expand it to back up all of `/var/lib/grafana/` to capture the config directory and any other state.
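A sketch of the expanded job, assuming the host uses the NixOS `services.restic.backups` module; the job name, repository URL, and password file below are placeholders, not the repo's actual values:

```nix
# Sketch only — repository, passwordFile, and job name are assumed placeholders.
services.restic.backups.grafana = {
  # Back up the whole state directory instead of just plugins/; the existing
  # grafana.db SQLite dump can remain as a pre-hook for a consistent DB copy.
  paths = [ "/var/lib/grafana" ];
  repository = "rest:https://backup.example/monitoring01";  # placeholder
  passwordFile = "/run/secrets/restic-password";            # placeholder
  timerConfig.OnCalendar = "daily";
};
```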
### 1b. Add Jellyfin Backup to jelly01
No backup currently exists. Add a restic backup job for `/var/lib/jellyfin/`, which contains:
- `config/` — server settings, library configuration
- `data/` — user watch history, playback state, library metadata
Media files are on the NAS (`nas.home.2rjus.net:/mnt/hdd-pool/media`) and do not need backup.
The cache directory (`/var/cache/jellyfin/`) does not need backup — it regenerates.
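Under the same assumptions as above (NixOS restic module; repository and password file are placeholders), the new job might look like:

```nix
# Sketch only — repository and passwordFile are assumed placeholders.
services.restic.backups.jellyfin = {
  # config/ (server settings) and data/ (watch history, metadata) live here;
  # /var/cache/jellyfin is outside this path and is deliberately not backed up.
  paths = [ "/var/lib/jellyfin" ];
  repository = "rest:https://backup.example/jelly01";  # placeholder
  passwordFile = "/run/secrets/restic-password";       # placeholder
  timerConfig.OnCalendar = "daily";
};
```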
### 1c. Add PostgreSQL Backup to pgdb1
No backup currently exists. Add a restic backup job with a `pg_dumpall` pre-hook to capture
all databases and roles. The dump should be piped through restic's stdin backup (similar to
the Grafana DB dump pattern on monitoring01).
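One possible shape for this job, as a hedged sketch: restic itself supports `restic backup --stdin --stdin-filename pgdb1.sql` for piped input, but since the NixOS module wraps the restic invocation, this sketch stages the dump to a file with `backupPrepareCommand` instead. All names and paths are placeholders:

```nix
# Sketch only — paths, repository, and credentials are assumed placeholders.
services.restic.backups.postgres = {
  # Dump every database plus roles before restic runs; pg_dumpall must run
  # as a role with sufficient privileges (e.g. via `sudo -u postgres`).
  backupPrepareCommand = ''
    mkdir -p /var/backup
    /run/current-system/sw/bin/pg_dumpall --username=postgres > /var/backup/pgdb1-dumpall.sql
  '';
  paths = [ "/var/backup/pgdb1-dumpall.sql" ];
  repository = "rest:https://backup.example/pgdb1";  # placeholder
  passwordFile = "/run/secrets/restic-password";     # placeholder
  timerConfig.OnCalendar = "daily";
};
```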
### 1d. Verify Existing ha1 Backup
ha1 already backs up `/var/lib/hass`, `/var/lib/zigbee2mqtt`, and `/var/lib/mosquitto`. Verify
these backups are current and restorable before proceeding with migration.
### 1e. Verify All Backups
After adding/expanding backup jobs:
- Trigger a manual backup run on each host
- Verify backup integrity with `restic check`
- Test a restore to a temporary location to confirm data is recoverable
## Phase 2: Declare pgdb1 Databases in Nix
Before migrating pgdb1, audit the manually created databases and users on the running
instance, then declare them in the Nix configuration using `ensureDatabases` and
`ensureUsers`. This makes the PostgreSQL setup reproducible on the new host.
Steps:
- SSH to pgdb1, run `\l` and `\du` in psql to list databases and roles
- Add `ensureDatabases` and `ensureUsers` to `services/postgres/postgres.nix`
- Document any non-default PostgreSQL settings or extensions per database
After reprovisioning, the databases will be created by NixOS, and data restored from the
`pg_dumpall` backup.
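The declarations might look like this sketch; the database and role names are examples only and should come from the `\l`/`\du` audit. (`ensureDBOwnership` is the current option name; older NixOS releases used a different permissions option, so check the release notes for the pinned nixpkgs.)

```nix
# Sketch only — names are examples; fill in from the audit of the running instance.
services.postgresql = {
  ensureDatabases = [ "exampledb" ];
  ensureUsers = [
    {
      name = "exampledb";         # role matching the database name
      ensureDBOwnership = true;   # grants ownership of the same-named database
    }
  ];
};
```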
## Phase 3: Stateless Host Migration
These hosts have no meaningful state and can be recreated fresh. For each host:
- Add the host definition to `terraform/vms.tf` (using `create-host` or manually)
- Commit and push to master
- Run `tofu apply` to provision the new VM
- Wait for bootstrap to complete (VM pulls config from master and reboots)
- Verify the host is functional
- Decommission the old VM in Proxmox
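The host-definition step might look roughly like this; the module name, source path, and values are assumptions about the layout of `terraform/vms.tf`, not its actual contents:

```hcl
# Hypothetical sketch — module layout, names, and addresses are assumed.
module "nix_cache01" {
  source   = "./modules/proxmox-vm"  # assumed module path
  hostname = "nix-cache01"
  cores    = 4
  memory   = 4096
  ip       = "10.69.30.20/24"        # placeholder address
}
```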
### Migration Order
Migrate stateless hosts in an order that minimizes disruption:
- nix-cache01 — low risk, no downstream dependencies during migration
- nats1 — low risk, verify no persistent JetStream streams first
- http-proxy — brief disruption to proxied services, migrate during low-traffic window
- ns1, ns2 — migrate one at a time, verify DNS resolution between each
For ns1/ns2: migrate ns2 first (secondary), verify AXFR works, then migrate ns1. All hosts use both ns1 and ns2 as resolvers, so one being down briefly is tolerable.
## Phase 4: Stateful Host Migration
For each stateful host, the procedure is:
- Trigger a final restic backup
- Stop services on the old host (to prevent state drift during migration)
- Provision the new VM via `tofu apply`
- Wait for bootstrap to complete
- Stop the relevant services on the new host
- Restore data from restic backup
- Start services and verify functionality
- Decommission the old VM
### 4a. pgdb1
- Run final `pg_dumpall` backup via restic
- Stop PostgreSQL on the old host
- Provision new pgdb1 via OpenTofu
- After bootstrap, NixOS creates the declared databases/users
- Restore data with `psql -f dumpall.sql` (`pg_dumpall` emits plain SQL, so restore goes through psql rather than `pg_restore`, which only reads custom-format dumps)
- Verify database connectivity from gunter (`10.69.30.105`)
- Decommission old VM
### 4b. monitoring01
- Run final Grafana backup
- Provision new monitoring01 via OpenTofu
- After bootstrap, restore `/var/lib/grafana/` from restic
- Restart Grafana, verify dashboards and datasources are intact
- Prometheus and Loki start fresh with empty data (acceptable)
- Verify all scrape targets are being collected
- Decommission old VM
### 4c. jelly01
- Run final Jellyfin backup
- Provision new jelly01 via OpenTofu
- After bootstrap, restore `/var/lib/jellyfin/` from restic
- Verify NFS mount to NAS is working
- Start Jellyfin, verify watch history and library metadata are present
- Decommission old VM
### 4d. ha1
- Verify latest restic backup is current
- Stop Home Assistant, Zigbee2MQTT, and Mosquitto on old host
- Provision new ha1 via OpenTofu
- After bootstrap, restore `/var/lib/hass`, `/var/lib/zigbee2mqtt`, `/var/lib/mosquitto`
- Start services, verify Home Assistant is functional
- Verify Zigbee devices are still paired and communicating
- Decommission old VM
Note: ha1 currently has 2 GB RAM, which is consistently tight. Average memory usage has climbed from ~57% (30-day avg) to ~70% currently, with a 30-day low of only 187 MB free. Consider increasing to 4 GB when reprovisioning to allow headroom for additional integrations.
Note: ha1 is the highest-risk migration due to Zigbee device pairings. The Zigbee
coordinator state in `/var/lib/zigbee2mqtt` should preserve pairings, but verify during a
low-risk time window.
USB Passthrough: The ha1 VM has a USB device passed through from the Proxmox hypervisor
(the Zigbee coordinator). The new VM must be configured with the same USB passthrough in
OpenTofu/Proxmox. Verify the USB device ID on the hypervisor and add the appropriate
`usb` block to the VM definition in `terraform/vms.tf`. The USB device must be passed
through before starting Zigbee2MQTT on the new host.
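As a rough illustration only: the exact block name and attributes depend on which Proxmox provider the repo uses (the shape below follows the Telmate `proxmox_vm_qemu` convention), and the device ID is a made-up example:

```hcl
# Sketch only — provider syntax and the vendor:product ID are assumptions.
resource "proxmox_vm_qemu" "ha1" {
  # ...rest of the VM definition as declared in terraform/vms.tf...

  # Pass the Zigbee coordinator through; find the ID with `lsusb` on the hypervisor.
  usb {
    host = "0451:16a8"  # example vendor:product ID, replace with the real one
  }
}
```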
## Phase 5: Decommission jump and auth01 Hosts
### jump
- Verify nothing depends on the jump host (no SSH proxy configs pointing to it, etc.)
- Remove host configuration from `hosts/jump/`
- Remove from `flake.nix`
- Remove any secrets in `secrets/jump/`
- Remove from `.sops.yaml`
- Destroy the VM in Proxmox
- Commit cleanup
### auth01
- Remove host configuration from `hosts/auth01/`
- Remove from `flake.nix`
- Remove any secrets in `secrets/auth01/`
- Remove from `.sops.yaml`
- Remove `services/authelia/` and `services/lldap/` (only used by auth01)
- Destroy the VM in Proxmox
- Commit cleanup
## Phase 6: Decommission ca Host (Deferred)
Deferred until Phase 4c (PKI migration to OpenBao) is complete. Once all hosts use the OpenBao ACME endpoint for certificates, the step-ca host can be decommissioned following the same cleanup steps as the jump host.
## Phase 7: Remove sops-nix
Once ca is decommissioned (Phase 6), sops-nix is no longer used by any host. Remove
all remnants:
- The `sops-nix` input from `flake.nix` and `flake.lock`
- `sops-nix.nixosModules.sops` from all host module lists in `flake.nix`
- `inherit sops-nix` from all specialArgs in `flake.nix`
- `system/sops.nix` and its import in `system/default.nix`
- `.sops.yaml`
- The `secrets/` directory
- All `sops.secrets.*` declarations in `services/ca/`, `services/authelia/`, `services/lldap/`
- Template scripts that generate age keys for sops (`hosts/template/scripts.nix`, `hosts/template2/scripts.nix`)

See `docs/plans/completed/sops-to-openbao-migration.md` for full context.
## Notes
- Each host migration should be done individually, not in bulk, to limit blast radius
- Keep the old VM running until the new one is verified — do not destroy prematurely
- The old VMs use IPs that the new VMs need, so the old VM must be shut down before the new one is provisioned (or use a temporary IP and swap after verification)
- Stateful migrations should be done during low-usage windows
- After all migrations are complete, the only host not in OpenTofu will be ca (deferred)
- Since many hosts are being recreated, this is a good opportunity to establish consistent
hostname naming conventions before provisioning the new VMs. Current naming is inconsistent
(e.g. `ns1` vs `nix-cache01`, `ha1` vs `auth01`, `pgdb1` vs `http-proxy`). Decide on a
convention before starting migrations — e.g. whether to always use numeric suffixes, a
consistent format like `service-NN`, role-based vs function-based names, etc.