Memory Issues Follow-up
Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
Background
On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
Fix Applied
Commit: 1674b6a - system: enable zram swap for all hosts
Merged: 2026-02-08 ~12:15 UTC
Change: Added zramSwap.enable = true to system/zram.nix, providing ~2GB compressed swap on all hosts.
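As a rough sanity check on the expected swap size, the arithmetic can be sketched as below. It assumes the zram device is sized as a percentage of physical RAM (NixOS exposes this as zramSwap.memoryPercent). The percentages used here are illustrative inputs, not values read from system/zram.nix, so the result should be compared against node_memory_SwapTotal_bytes on a real host.

```python
def zram_size_mib(ram_mib: int, memory_percent: int) -> int:
    """Return the uncompressed capacity of a zram device sized as a percentage of RAM."""
    if ram_mib <= 0 or memory_percent <= 0:
        raise ValueError("ram_mib and memory_percent must be positive")
    return ram_mib * memory_percent // 100


if __name__ == "__main__":
    # Hypothetical inputs: a 2GB host at two candidate percentages.
    for percent in (50, 100):
        print(f"2048 MiB RAM at {percent}%: {zram_size_mib(2048, percent)} MiB zram")
```

Under this model, ~2GB of swap on a 2GB host corresponds to a setting of 100 percent, while 50 percent would yield ~1GB.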
Timeline
| Time (UTC) | Event |
|---|---|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
| 05:01:47 | nixos_upgrade_failed alert fired |
| 12:15 | zram commit merged to master |
| 12:19 | ns2 rebooted with zram enabled |
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
Hosts Affected
All 2GB VMs that run nixos-upgrade:
- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01
Metrics to Monitor
Check these in Grafana or via PromQL to verify the fix:
Swap availability (should be ~2GB after upgrade)
node_memory_SwapTotal_bytes / 1024 / 1024
Swap usage during upgrades
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
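Zswap compressed bytes (note: this reports the kernel's zswap pool, a separate feature from zram, so it typically reads 0 on hosts that only use zram)
node_memory_Zswap_bytes / 1024 / 1024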
Upgrade failures (should be 0)
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
Memory available during upgrades
node_memory_MemAvailable_bytes / 1024 / 1024
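The queries above can also be run outside Grafana through the Prometheus HTTP API. The following is a minimal sketch, not the tooling used for this incident: the server address is a placeholder, the helper names are hypothetical, and it assumes series carry the usual instance label.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder address; replace with the real Prometheus server.
PROMETHEUS_URL = "http://prometheus.example:9090"


def parse_instant_vector(payload: dict, label: str = "instance") -> dict:
    """Map each series' label value to its sample value from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    values = {}
    for series in data["result"]:
        name = series["metric"].get(label, "<unknown>")
        # Each sample is [unix_timestamp, "value-as-string"].
        values[name] = float(series["value"][1])
    return values


def query_prometheus(expr: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Run an instant query against /api/v1/query and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Canned response in the shape the API returns, so the demo needs no network.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"instance": "ns2:9100"}, "value": [1770552000, "2048"]},
            ],
        },
    }
    for host, mib in sorted(parse_instant_vector(sample).items()):
        print(f"{host}: {mib:.0f} MiB swap")
```

Against a live server, the same parser would be fed with `query_prometheus("node_memory_SwapTotal_bytes / 1024 / 1024")`.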
Verification Steps
After a few days (allow auto-upgrades to run on all hosts):
- Check all hosts have swap enabled:
  node_memory_SwapTotal_bytes > 0
- Check for any upgrade failures since the fix:
  count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
- Review if any hosts used swap during upgrades (check historical graphs)
Success Criteria
- No nixos_upgrade_failed alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs
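The swap criterion can be checked mechanically once per-host swap totals are available. This is a small sketch under stated assumptions: the function name, the 2048 MiB target, and the 25 percent tolerance are illustrative choices for interpreting "~2GB", and the readings in the demo are made up.

```python
def hosts_missing_swap(swap_mib_by_host: dict, expected_mib: float = 2048.0,
                       tolerance: float = 0.25) -> list:
    """Return, sorted, the hosts whose swap total falls more than `tolerance` below expected_mib."""
    minimum = expected_mib * (1.0 - tolerance)
    return sorted(host for host, mib in swap_mib_by_host.items() if mib < minimum)


if __name__ == "__main__":
    # Hypothetical readings of node_memory_SwapTotal_bytes / 1024 / 1024.
    readings = {"ns1": 2047.0, "ns2": 2047.0, "vault01": 0.0}
    failing = hosts_missing_swap(readings)
    if failing:
        print("Hosts below the expected swap size:", ", ".join(failing))
    else:
        print("All hosts meet the swap criterion")
```

A host that reports 0 here has not picked up the zram change yet, which usually means it has not upgraded or rebooted since the merge.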
Fallback Options
If zram is insufficient:
- Increase VM memory - Update terraform/vms.tf to 4GB for affected hosts
- Enable memory ballooning - Configure VMs with dynamic memory allocation (see below)
- Use remote builds - Configure nix.buildMachines to offload builds (evaluation still runs locally, so this only helps if the memory pressure comes from building rather than evaluating)
- Reduce flake size - Split configurations to reduce evaluation memory
Memory Ballooning
Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.
Configuration in terraform/vms.tf:
memory = 4096 # maximum memory
balloon = 2048 # minimum memory (Proxmox may shrink the guest toward this when the host is short on memory)
Pros:
- VMs get memory on-demand without reboots
- Better host memory utilization
- Solves upgrade OOM without permanently allocating 4GB
Cons:
- Requires the virtio balloon driver (virtio_balloon) in the guest; the QEMU guest agent is a separate component
- Guest can experience memory pressure if host is overcommitted
Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.