From bb53b922fad1d2f27f94409e94d604adf48239e0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sun, 22 Feb 2026 00:47:09 +0100
Subject: [PATCH] plans: add NixOS hypervisor plan (Incus on PN51s)

Co-Authored-By: Claude Opus 4.6
---
 docs/plans/nixos-hypervisor.md | 232 +++++++++++++++++++++++++++++++++
 1 file changed, 232 insertions(+)
 create mode 100644 docs/plans/nixos-hypervisor.md

diff --git a/docs/plans/nixos-hypervisor.md b/docs/plans/nixos-hypervisor.md
new file mode 100644
index 0000000..a941138
--- /dev/null
+++ b/docs/plans/nixos-hypervisor.md
@@ -0,0 +1,232 @@
# NixOS Hypervisor

## Overview

Experiment with running a NixOS-based hypervisor as an alternative/complement to the current Proxmox setup. The goal is better homelab integration — declarative config, monitoring, auto-updates — while retaining the ability to run VMs with a Terraform-like workflow.

## Motivation

- Proxmox works but doesn't integrate with the NixOS-managed homelab (no monitoring, no auto-updates, no vault, no declarative config)
- The PN51 units (once stable) are good candidates for experimentation — test-tier, plenty of RAM (32-64GB), 8C/16T
- Long-term: could reduce reliance on Proxmox or provide a secondary hypervisor pool
- **VM migration**: Currently all VMs (including both nameservers) run on a single Proxmox host. Being able to migrate VMs between hypervisors would allow rebooting a host for kernel updates without downtime for critical services like DNS.

## Hardware Candidates

|  | pn01 | pn02 |
|---|---|---|
| **CPU** | Ryzen 7 5700U (8C/16T) | Ryzen 7 5700U (8C/16T) |
| **RAM** | 64GB (2x32GB) | 32GB (1x32GB, second slot available) |
| **Storage** | 1TB NVMe | 1TB SATA SSD (NVMe planned) |
| **Status** | Stability testing | Stability testing |

## Options

### Option 1: Incus

Community fork of LXD, created after Canonical took the project in-house and changed its licensing. Supports both containers (LXC) and VMs (QEMU/KVM).
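
On the NixOS side, enabling Incus is a small amount of module configuration. A minimal sketch, assuming illustrative values for the bridge subnet, pool path, and interface name (not tested settings for this homelab):

```nix
# Sketch: enable Incus on a NixOS host (e.g. pn01). Values are examples.
{ config, lib, pkgs, ... }:
{
  virtualisation.incus = {
    enable = true;
    # Declarative equivalent of the `incus admin init` dialog.
    preseed = {
      networks = [{
        name = "incusbr0";
        type = "bridge";
        config."ipv4.address" = "10.0.100.1/24";
        config."ipv4.nat" = "true";
      }];
      storage_pools = [{
        name = "default";
        driver = "dir";
        config.source = "/var/lib/incus/storage-pools/default";
      }];
      profiles = [{
        name = "default";
        devices = {
          eth0 = { name = "eth0"; network = "incusbr0"; type = "nic"; };
          root = { path = "/"; pool = "default"; type = "disk"; };
        };
      }];
    };
  };
  # Incus manages its own firewall rules; the NixOS firewall
  # must trust the bridge interface.
  networking.firewall.trustedInterfaces = [ "incusbr0" ];
}
```

The `preseed` option mirrors the YAML that `incus admin init --preseed` accepts, so the same config could later seed the cluster nodes.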

**NixOS integration:**
- `virtualisation.incus.enable` module in nixpkgs
- Manages storage pools, networks, and instances
- REST API for automation
- CLI tool (`incus`) for management

**Terraform integration:**
- `lxd` provider works with Incus (API-compatible)
- Dedicated `incus` Terraform provider also exists
- Can define VMs/containers in OpenTofu, similar to current Proxmox workflow

**Migration:**
- Built-in live and offline migration via `incus move <instance> --target <member>`
- Clustering makes hosts aware of each other — migration is a first-class operation
- Works with shared storage (NFS, Ceph), or Incus can transfer the instance's storage as part of the migration
- Stateful stop-and-move also supported for offline migration

**Pros:**
- Supports both containers and VMs
- REST API + CLI for automation
- Built-in clustering and migration — closest to Proxmox experience
- Good NixOS module support
- Image-based workflow (can build NixOS images and import them)
- Active development and community

**Cons:**
- Another abstraction layer on top of QEMU/KVM
- Less mature Terraform provider than libvirt's
- Container networking can be complex
- NixOS guests in Incus VMs need some setup

### Option 2: libvirt/QEMU

Standard Linux virtualization stack. Thin wrapper around QEMU/KVM.
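
For comparison, the libvirt route is similarly compact on NixOS. A minimal sketch; the `admin` user name and the UEFI/TPM toggles are assumptions to adjust:

```nix
# Sketch: enable libvirt/QEMU on a NixOS host.
{ config, pkgs, ... }:
{
  virtualisation.libvirtd = {
    enable = true;
    qemu = {
      runAsRoot = false;
      swtpm.enable = true; # TPM emulation for guests that want it
      ovmf.enable = true;  # UEFI firmware for VMs
    };
  };
  programs.virt-manager.enable = true; # optional GUI client
  # Allow a management user to talk to the libvirt socket.
  users.users.admin.extraGroups = [ "libvirtd" ];
}
```

From there, `virsh` and the `dmacvicar/libvirt` Terraform provider both talk to the same `libvirtd` socket.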

**NixOS integration:**
- `virtualisation.libvirtd.enable` module in nixpkgs
- Mature and well-tested
- `virsh` CLI for management

**Terraform integration:**
- `dmacvicar/libvirt` provider — mature, well-maintained
- Supports cloud-init, volume management, network config
- Very similar workflow to current Proxmox+OpenTofu setup
- Can reuse cloud-init patterns from existing `terraform/` config

**Migration:**
- Supports live and offline migration via `virsh migrate`
- Requires shared storage (NFS, Ceph, or similar) for live migration
- Requires matching CPU models between hosts (or CPU model masking)
- Works but is manual — no cluster awareness, must specify target URI
- No built-in orchestration for multi-host scenarios

**Pros:**
- Closest to current Proxmox+Terraform workflow
- Most mature Terraform provider
- Minimal abstraction — direct QEMU/KVM management
- Well-understood, massive community
- Cloud-init works identically to the Proxmox workflow
- Can reuse existing template-building patterns

**Cons:**
- VMs only (no containers without adding LXC separately)
- No built-in REST API (would need to expose the libvirt socket)
- No web UI without adding Cockpit or virt-manager
- Migration works but requires manual setup — no clustering, no orchestration
- Less feature-rich than Incus for multi-host scenarios

### Option 3: microvm.nix

NixOS-native microVM framework. VMs are defined as NixOS modules in the host's flake.
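
To make the mental model concrete, a host flake fragment in this style might look as follows. This is a sketch based on microvm.nix's documented options; the VM name, resource figures, and state version are placeholders:

```nix
# Sketch: a microvm.nix guest defined inside the host's flake (untested).
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";
  inputs.microvm.url = "github:astro/microvm.nix";

  outputs = { self, nixpkgs, microvm, ... }: {
    nixosConfigurations.pn01 = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        microvm.nixosModules.host
        {
          # The guest is just another NixOS module tree on the host.
          microvm.vms.vm-test.config = {
            microvm = {
              hypervisor = "cloud-hypervisor";
              vcpu = 2;
              mem = 2048; # MiB
              # Share the host's /nix/store read-only instead of duplicating it.
              shares = [{
                proto = "virtiofs";
                tag = "ro-store";
                source = "/nix/store";
                mountPoint = "/nix/.ro-store";
              }];
            };
            networking.hostName = "vm-test";
            system.stateVersion = "24.05";
          };
        }
      ];
    };
  };
}
```

Note how the guest config lives in the same evaluation as the host's — which is exactly why there is no migration story: the VM only exists as part of that host's closure.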

**NixOS integration:**
- VMs are NixOS configurations in the same flake
- Supports multiple backends: cloud-hypervisor, QEMU, firecracker, kvmtool
- Lightweight — shares the host's nix store with guests via virtiofs
- Declarative network, storage, and resource allocation

**Terraform integration:**
- None — everything is defined in Nix
- Fundamentally different workflow from current Proxmox+Terraform approach

**Migration:**
- No migration support — VMs are tied to the host's NixOS config
- Moving a VM means rebuilding it on another host

**Pros:**
- Most NixOS-native approach
- VMs defined right alongside host configs in this repo
- Very lightweight — fast boot, minimal overhead
- Shares nix store with host (no duplicate packages)
- No cloud-init needed — guest config is part of the flake

**Cons:**
- Very niche, smaller community
- Different mental model from current workflow
- Only NixOS guests (no Ubuntu, FreeBSD, etc.)
- No Terraform integration
- No migration support
- Less isolation than full QEMU VMs
- Would need to learn a new deployment pattern

## Comparison

| Criteria | Incus | libvirt | microvm.nix |
|----------|-------|---------|-------------|
| **Workflow similarity** | Medium | High | Low |
| **Terraform support** | Yes (lxd/incus provider) | Yes (mature provider) | No |
| **NixOS module** | Yes | Yes | Yes |
| **Containers + VMs** | Both | VMs only | VMs only |
| **Non-NixOS guests** | Yes | Yes | No |
| **Live migration** | Built-in (first-class) | Yes (manual setup) | No |
| **Offline migration** | Built-in | Yes (manual setup) | No (rebuild) |
| **Clustering** | Built-in | Manual | No |
| **Learning curve** | Medium | Low | Medium |
| **Community/maturity** | Growing | Very mature | Niche |
| **Overhead** | Low | Minimal | Minimal |

## Recommendation

Start with **Incus**.
Migration and clustering are key requirements:
- Built-in clustering makes two PN51s a proper hypervisor pool
- Live and offline migration are first-class operations, similar to Proxmox
- Can move VMs between hosts for maintenance (kernel updates, hardware work) without downtime
- Supports both containers and VMs — flexibility for future use
- Terraform provider exists (less mature than libvirt's, but functional)
- REST API enables automation beyond what Terraform covers

libvirt could achieve similar results but requires significantly more manual setup for migration and has no clustering awareness. For a two-node setup where migration is a priority, Incus provides much more out of the box.

**microvm.nix** is off the table given the migration requirement.

## Implementation Plan

### Phase 1: Single-Node Setup (on one PN51)

1. Enable `virtualisation.incus` on pn01 (or whichever is stable)
2. Initialize Incus (`incus admin init`) — configure storage pool (local NVMe) and network bridge
3. Configure bridge networking for VM traffic on VLAN 12
4. Build a NixOS VM image and import it into Incus
5. Create a test VM manually with `incus launch` to validate the setup

### Phase 2: Two-Node Cluster (PN51s only)

1. Enable Incus on the second PN51
2. Form a cluster between both nodes
3. Configure shared storage (NFS from NAS, or Ceph if warranted)
4. Test offline migration: `incus move <instance> --target <member>`
5. Test live migration with shared storage
6. CPU compatibility is not an issue here — both nodes have identical Ryzen 7 5700U CPUs

### Phase 3: Terraform Integration

1. Add the Incus Terraform provider to `terraform/`
2. Define a test VM in OpenTofu (cloud-init, static IP, vault provisioning)
3. Verify the full pipeline: `tofu apply` -> VM boots -> cloud-init -> vault credentials -> NixOS rebuild
4. Compare the workflow with the existing Proxmox pipeline

### Phase 4: Evaluate and Expand

- Is the workflow comparable to Proxmox?
- Migration reliability — does live migration work cleanly?
- Is the performance overhead acceptable on the Ryzen 7 5700U?
- Worth migrating some test-tier VMs from Proxmox?
- Could ns1/ns2 run on separate Incus nodes instead of the single Proxmox host?

### Phase 5: Proxmox Replacement (optional)

If Incus works well on the PN51s, consider replacing Proxmox entirely, ending up with a three-node cluster.

**CPU compatibility for a mixed cluster:**

| Node | CPU | Architecture | x86-64-v3 |
|------|-----|-------------|-----------|
| Proxmox host | AMD Ryzen 9 3900X (12C/24T) | Zen 2 | Yes |
| pn01 | AMD Ryzen 7 5700U (8C/16T) | Zen 3 | Yes |
| pn02 | AMD Ryzen 7 5700U (8C/16T) | Zen 3 | Yes |

All three CPUs are AMD and support `x86-64-v3`. The 3900X (Zen 2) is the oldest, so it defines the feature ceiling — but `x86-64-v3` is well within its capabilities. VMs configured with `x86-64-v3` can migrate freely between all three nodes.

Being all-AMD also avoids the trickier Intel/AMD cross-vendor migration edge cases (different CPUID layouts, virtualization extensions).

The 3900X (12C/24T) would be the most powerful node, making it the natural home for heavier workloads, with the PN51s (8C/16T each) handling lighter VMs or serving as migration targets during maintenance.

Steps:
1. Install NixOS + Incus on the Proxmox host (or a replacement machine)
2. Join it to the existing Incus cluster with an `x86-64-v3` CPU baseline
3. Migrate VMs from Proxmox to the Incus cluster
4. Decommission Proxmox

## Prerequisites

- [ ] PN51 units pass stability testing (see `pn51-stability.md`)
- [ ] Decide which unit to use first (pn01 preferred — 64GB RAM, NVMe, currently more stable)

## Open Questions

- How to handle VM storage? Local NVMe, NFS from NAS, or Ceph between the two nodes?
- Network topology: bridge on VLAN 12, or trunk multiple VLANs to the PN51?
- Should VMs be on the same VLAN as the hypervisor host, or separate?
- Incus clustering with only two nodes — any quorum issues? Three nodes (with the Proxmox replacement) would solve this.
- How to handle NixOS guest images? Build with nixos-generators, or use the Incus image builder?
- ~~What CPU does the current Proxmox host have?~~ AMD Ryzen 9 3900X (Zen 2) — `x86-64-v3` confirmed, all-AMD cluster
- If replacing Proxmox: migrate VMs first, or fresh start and rebuild?
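
If the plan goes ahead, the Phase 3 OpenTofu definition could start from a sketch like this. Provider source and attribute names should be verified against the Incus provider's documentation; the image alias and cloud-init path are assumptions:

```hcl
# Sketch: a test VM on the Incus cluster, defined in OpenTofu (untested).
terraform {
  required_providers {
    incus = {
      source = "lxc/incus"
    }
  }
}

provider "incus" {
  # Remote/endpoint settings are environment-specific.
}

resource "incus_instance" "test" {
  name  = "incus-test-1"
  type  = "virtual-machine"
  image = "nixos/test" # assumed local image alias, imported in Phase 1

  config = {
    "limits.cpu"    = "2"
    "limits.memory" = "4GiB"
    # cloud-init user-data would carry the vault bootstrap,
    # as in the existing Proxmox workflow.
    "cloud-init.user-data" = file("${path.module}/cloud-init/test.yaml")
  }
}
```

The `limits.*` and `cloud-init.*` keys are ordinary Incus instance config options, so whatever works via `incus config set` should translate directly into this block.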