# NixOS Hypervisor

## Overview

Experiment with running a NixOS-based hypervisor as an alternative/complement to the current Proxmox setup. Goal is better homelab integration — declarative config, monitoring, auto-updates — while retaining the ability to run VMs with a Terraform-like workflow.

## Motivation

- Proxmox works but doesn't integrate with the NixOS-managed homelab (no monitoring, no auto-updates, no vault, no declarative config)
- The PN51 units (once stable) are good candidates for experimentation — test-tier, plenty of RAM (32-64GB), 8C/16T
- Long-term: could reduce reliance on Proxmox or provide a secondary hypervisor pool
- **VM migration**: Currently all VMs (including both nameservers) run on a single Proxmox host. Being able to migrate VMs between hypervisors would allow rebooting a host for kernel updates without downtime for critical services like DNS.

## Hardware Candidates

| | pn01 | pn02 |
|---|---|---|
| **CPU** | Ryzen 7 5700U (8C/16T) | Ryzen 7 5700U (8C/16T) |
| **RAM** | 64GB (2x32GB) | 32GB (1x32GB, second slot available) |
| **Storage** | 1TB NVMe | 1TB SATA SSD (NVMe planned) |
| **Status** | Stability testing | Stability testing |

## Options

### Option 1: Incus

Community fork of LXD (created after Canonical took the project back in-house and changed its licensing terms). Supports both containers (LXC) and VMs (QEMU/KVM).

**NixOS integration:**

- `virtualisation.incus.enable` module in nixpkgs
- Manages storage pools, networks, and instances
- REST API for automation
- CLI tool (`incus`) for management

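A minimal host config sketch — option names are from the nixpkgs Incus module; the `admin` user is a placeholder:

```nix
{ config, pkgs, ... }:
{
  # Enable the Incus daemon (nixpkgs module option)
  virtualisation.incus.enable = true;

  # Hypothetical admin user; membership in incus-admin grants socket access
  users.users.admin.extraGroups = [ "incus-admin" ];

  # Incus manages its own firewall rules via nftables
  networking.nftables.enable = true;
}
```
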
**Terraform integration:**

- `lxd` provider works with Incus (API-compatible)
- Dedicated `incus` Terraform provider also exists
- Can define VMs/containers in OpenTofu, similar to current Proxmox workflow

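A sketch of what the OpenTofu side could look like. Resource and attribute names follow the `lxc/incus` provider and should be verified against its docs; the image alias and instance name are placeholders:

```hcl
terraform {
  required_providers {
    incus = {
      source = "lxc/incus"
    }
  }
}

resource "incus_instance" "test" {
  name  = "nixos-test"
  type  = "virtual-machine"  # omit for a container
  image = "images:debian/12" # placeholder; a custom NixOS image alias in practice

  # Raw Incus instance config keys
  config = {
    "limits.cpu"    = "2"
    "limits.memory" = "4GiB"
  }
}
```
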
**Migration:**

- Built-in live and offline migration via `incus move <instance> --target <host>`
- Clustering makes hosts aware of each other — migration is a first-class operation
- Can use shared storage (NFS, Ceph), or Incus can transfer the instance's storage during migration
- Stateful stop-and-move also supported for offline migration

**Pros:**

- Supports both containers and VMs
- REST API + CLI for automation
- Built-in clustering and migration — closest to the Proxmox experience
- Good NixOS module support
- Image-based workflow (can build NixOS images and import them)
- Active development and community

**Cons:**

- Another abstraction layer on top of QEMU/KVM
- Less mature Terraform provider than libvirt's
- Container networking can be complex
- NixOS guests in Incus VMs need some setup

### Option 2: libvirt/QEMU

Standard Linux virtualization stack. Thin management layer over QEMU/KVM.

**NixOS integration:**

- `virtualisation.libvirtd.enable` module in nixpkgs
- Mature and well-tested
- `virsh` CLI for management

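The host side can be as small as this — a sketch; the `admin` user name is a placeholder:

```nix
{ config, pkgs, ... }:
{
  virtualisation.libvirtd.enable = true;

  # Hypothetical admin user; libvirtd group membership allows
  # access to the qemu:///system connection
  users.users.admin.extraGroups = [ "libvirtd" ];
}
```
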
**Terraform integration:**

- `dmacvicar/libvirt` provider — mature, well-maintained
- Supports cloud-init, volume management, network config
- Very similar workflow to current Proxmox+OpenTofu setup
- Can reuse cloud-init patterns from existing `terraform/` config

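A minimal sketch with the `dmacvicar/libvirt` provider — the connection URI and VM name are placeholders (pn01 stands in for the hypervisor host):

```hcl
terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
    }
  }
}

provider "libvirt" {
  # Hypothetical remote connection over SSH
  uri = "qemu+ssh://root@pn01/system"
}

resource "libvirt_domain" "test" {
  name   = "nixos-test"
  vcpu   = 2
  memory = 4096 # MiB
}
```

A real config would add a `libvirt_volume` for the disk and a cloud-init disk, mirroring the existing Proxmox templates.
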
**Migration:**

- Supports live and offline migration via `virsh migrate`
- Requires shared storage (NFS, Ceph, or similar) for live migration
- Requires matching CPU models between hosts (or CPU model masking)
- Works but is manual — no cluster awareness, must specify target URI
- No built-in orchestration for multi-host scenarios

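The manual flow described above, sketched with hypothetical host/VM names (assumes shared storage mounted at the same path on both hosts and SSH trust between them):

```shell
# Live-migrate 'nixos-test' to pn02, keeping the domain defined
# on the target and removing its definition from the source
virsh migrate --live --persistent --undefinesource \
  nixos-test qemu+ssh://root@pn02/system
```
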
**Pros:**

- Closest to current Proxmox+Terraform workflow
- Most mature Terraform provider
- Minimal abstraction — direct QEMU/KVM management
- Well-understood, massive community
- Cloud-init works identically to Proxmox workflow
- Can reuse existing template-building patterns

**Cons:**

- VMs only (no containers without adding LXC separately)
- No built-in REST API (would need to expose the libvirt socket)
- No web UI without adding Cockpit or virt-manager
- Migration works but requires manual setup — no clustering, no orchestration
- Less feature-rich than Incus for multi-host scenarios

### Option 3: microvm.nix

NixOS-native microVM framework. VMs are defined as NixOS modules in the host's flake.

**NixOS integration:**

- VMs are NixOS configurations in the same flake
- Supports multiple backends: cloud-hypervisor, QEMU, firecracker, kvmtool
- Lightweight — shares the host's nix store with guests via virtiofs
- Declarative network, storage, and resource allocation

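For contrast with the other options, a guest declared on the host might look like this — an abbreviated sketch (not a complete flake), assuming the microvm.nix host module is imported and using option names from its documentation:

```nix
{
  microvm.vms.test = {
    config = {
      microvm = {
        hypervisor = "cloud-hypervisor";
        vcpu = 2;
        mem = 2048; # MiB
        # Share the host's /nix/store read-only via virtiofs
        shares = [{
          proto = "virtiofs";
          tag = "ro-store";
          source = "/nix/store";
          mountPoint = "/nix/.ro-store";
        }];
      };
      # ...regular NixOS guest configuration goes here...
    };
  };
}
```
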
**Terraform integration:**

- None — everything is defined in Nix
- Fundamentally different workflow from current Proxmox+Terraform approach

**Migration:**

- No migration support — VMs are tied to the host's NixOS config
- Moving a VM means rebuilding it on another host

**Pros:**

- Most NixOS-native approach
- VMs defined right alongside host configs in this repo
- Very lightweight — fast boot, minimal overhead
- Shares nix store with host (no duplicate packages)
- No cloud-init needed — guest config is part of the flake

**Cons:**

- Very niche, smaller community
- Different mental model from current workflow
- Only NixOS guests (no Ubuntu, FreeBSD, etc.)
- No Terraform integration
- No migration support
- Less isolation than full QEMU VMs
- Would need to learn a new deployment pattern

## Comparison

| Criteria | Incus | libvirt | microvm.nix |
|----------|-------|---------|-------------|
| **Workflow similarity** | Medium | High | Low |
| **Terraform support** | Yes (lxd/incus provider) | Yes (mature provider) | No |
| **NixOS module** | Yes | Yes | Yes |
| **Containers + VMs** | Both | VMs only | VMs only |
| **Non-NixOS guests** | Yes | Yes | No |
| **Live migration** | Built-in (first-class) | Yes (manual setup) | No |
| **Offline migration** | Built-in | Yes (manual setup) | No (rebuild) |
| **Clustering** | Built-in | Manual | No |
| **Learning curve** | Medium | Low | Medium |
| **Community/maturity** | Growing | Very mature | Niche |
| **Overhead** | Low | Minimal | Minimal |

## Recommendation

Start with **Incus**. Migration and clustering are key requirements:

- Built-in clustering makes the two PN51s a proper hypervisor pool
- Live and offline migration are first-class operations, similar to Proxmox
- Can move VMs between hosts for maintenance (kernel updates, hardware work) without downtime
- Supports both containers and VMs — flexibility for future use
- Terraform provider exists (less mature than libvirt's, but functional)
- REST API enables automation beyond what Terraform covers

libvirt could achieve similar results but requires significantly more manual setup for migration and has no clustering awareness. For a two-node setup where migration is a priority, Incus provides much more out of the box.

**microvm.nix** is off the table given the migration requirement.

## Implementation Plan

### Phase 1: Single-Node Setup (on one PN51)

1. Enable `virtualisation.incus` on pn01 (or whichever is stable)
2. Initialize Incus (`incus admin init`) — configure a storage pool (local NVMe) and network bridge
3. Configure bridge networking for VM traffic on VLAN 12
4. Build a NixOS VM image and import it into Incus
5. Create a test VM manually with `incus launch` to validate the setup

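The steps above, sketched as commands — pool, image, and VM names are placeholders, and `incus admin init` is interactive:

```shell
incus admin init    # creates the default storage pool and network bridge

# Import a locally built NixOS image (metadata + root filesystem tarballs)
incus image import metadata.tar.xz rootfs.tar.xz --alias nixos-base

# Launch a test VM from it and check the result
incus launch nixos-base vm1 --vm -c limits.cpu=2 -c limits.memory=4GiB
incus list
```
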
### Phase 2: Two-Node Cluster (PN51s only)

1. Enable Incus on the second PN51
2. Form a cluster between both nodes
3. Configure shared storage (NFS from NAS, or Ceph if warranted)
4. Test offline migration: `incus move <vm> --target <other-node>`
5. Test live migration with shared storage

CPU compatibility is not an issue here — both nodes have identical Ryzen 7 5700U CPUs.

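A sketch of the clustering commands, using the node names from the hardware table (exact prompts may differ by version):

```shell
# On pn01: turn the standalone server into a one-node cluster
incus cluster enable pn01

# Still on pn01: generate a join token for the second node
incus cluster add pn02

# On pn02: re-run init and join the cluster, pasting the token when prompted
incus admin init

# From either node: verify membership, then try a migration
incus cluster list
incus move vm1 --target pn02
```
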
### Phase 3: Terraform Integration

1. Add the Incus Terraform provider to `terraform/`
2. Define a test VM in OpenTofu (cloud-init, static IP, vault provisioning)
3. Verify the full pipeline: `tofu apply` -> VM boots -> cloud-init -> vault credentials -> NixOS rebuild
4. Compare the workflow with the existing Proxmox pipeline

### Phase 4: Evaluate and Expand

- Is the workflow comparable to Proxmox?
- Migration reliability — does live migration work cleanly?
- Performance overhead acceptable on the Ryzen 5700U?
- Worth migrating some test-tier VMs from Proxmox?
- Could ns1/ns2 run on separate Incus nodes instead of the single Proxmox host?

### Phase 5: Proxmox Replacement (optional)

If Incus works well on the PN51s, consider replacing Proxmox entirely for a three-node cluster.

**CPU compatibility for mixed cluster:**

| Node | CPU | Architecture | x86-64-v3 |
|------|-----|--------------|-----------|
| Proxmox host | AMD Ryzen 9 3900X (12C/24T) | Zen 2 | Yes |
| pn01 | AMD Ryzen 7 5700U (8C/16T) | Zen 3 | Yes |
| pn02 | AMD Ryzen 7 5700U (8C/16T) | Zen 3 | Yes |

All three CPUs are AMD and support `x86-64-v3`. The 3900X (Zen 2) is the oldest, so it defines the feature ceiling — but `x86-64-v3` is well within its capabilities. VMs configured with `x86-64-v3` can migrate freely between all three nodes.

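A quick per-node sanity check: `x86-64-v3` requires AVX2, BMI1/2, FMA, and MOVBE among other features, so grepping the CPU flags is a reasonable heuristic (not an exhaustive check):

```shell
for f in avx2 bmi1 bmi2 fma movbe; do
  grep -qw "$f" /proc/cpuinfo && echo "$f: ok" || echo "$f: MISSING"
done
```
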
Being all-AMD also avoids the trickier Intel/AMD cross-vendor migration edge cases (different CPUID layouts, virtualization extensions).

The 3900X (12C/24T) would be the most powerful node, making it the natural home for heavier workloads, with the PN51s (8C/16T each) handling lighter VMs or serving as migration targets during maintenance.

Steps:

1. Install NixOS + Incus on the Proxmox host (or a replacement machine)
2. Join it to the existing Incus cluster with the `x86-64-v3` CPU baseline
3. Migrate VMs from Proxmox to the Incus cluster
4. Decommission Proxmox

## Prerequisites

- [ ] PN51 units pass stability testing (see `pn51-stability.md`)
- [ ] Decide which unit to use first (pn01 preferred — 64GB RAM, NVMe, currently more stable)

## Open Questions

- How to handle VM storage? Local NVMe, NFS from NAS, or Ceph between the two nodes?
- Network topology: bridge on VLAN 12, or trunk multiple VLANs to the PN51?
- Should VMs be on the same VLAN as the hypervisor host, or separate?
- Incus clustering with only two nodes — any quorum issues? Three nodes (with Proxmox replacement) would solve this
- How to handle NixOS guest images? Build with nixos-generators, or use the Incus image builder?
- ~~What CPU does the current Proxmox host have?~~ AMD Ryzen 9 3900X (Zen 2) — `x86-64-v3` confirmed, all-AMD cluster
- If replacing Proxmox: migrate VMs first, or fresh start and rebuild?