nixos-servers/docs/plans/memory-issues-follow-up.md
Torjus Håkestad 3abe5e83a7
docs: add memory ballooning as fallback option
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:29:42 +01:00


Memory Issues Follow-up

Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.

Background

On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.

Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.

Fix Applied

Commit: 1674b6a - system: enable zram swap for all hosts

Merged: 2026-02-08 ~12:15 UTC

Change: Added zramSwap.enable = true to system/zram.nix, providing ~2GB compressed swap on all hosts.

Timeline

Time (UTC)  Event
----------  -----
05:00:46    ns2 nixos-upgrade OOM killed
05:01:47    nixos_upgrade_failed alert fired
12:15       zram commit merged to master
12:19       ns2 rebooted with zram enabled
12:20       ns1 rebooted (memory reduced to 2GB via tofu)

Hosts Affected

All 2GB VMs that run nixos-upgrade:

  • ns1, ns2 (DNS)
  • vault01
  • testvm01, testvm02, testvm03
  • kanidm01

Metrics to Monitor

Check these in Grafana or via PromQL to verify the fix:

Swap availability in MiB (should be ~2048, i.e. ~2GB, after upgrade)

node_memory_SwapTotal_bytes / 1024 / 1024

Swap usage during upgrades

(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024

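Zswap compressed bytes (note: this field reports the kernel's zswap cache, not zram, so with zramSwap it is expected to stay at 0 unless zswap is also enabled; rely on the swap usage query above for zram activity)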

node_memory_Zswap_bytes / 1024 / 1024
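
zram does not report through the Zswap field of /proc/meminfo; its counters are exposed per device in /sys/block/zramN/mm_stat. The following is a minimal sketch for reading the compression ratio directly on a host. It assumes the device is zram0 and that the first three mm_stat fields are orig_data_size, compr_data_size, and mem_used_total, as described in the kernel zram documentation. The function names are illustrative and not part of this repository.

```python
from pathlib import Path

# Leading fields of /sys/block/zramN/mm_stat, per the kernel zram docs.
MM_STAT_FIELDS = ("orig_data_size", "compr_data_size", "mem_used_total")


def parse_mm_stat(text: str) -> dict[str, int]:
    """Map the first three whitespace-separated mm_stat values to their names."""
    values = [int(v) for v in text.split()]
    if len(values) < len(MM_STAT_FIELDS):
        raise ValueError(
            f"expected at least {len(MM_STAT_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(MM_STAT_FIELDS, values))


def compression_ratio(stats: dict[str, int]) -> float:
    """Uncompressed bytes stored divided by compressed bytes; 0.0 if empty."""
    if stats["compr_data_size"] == 0:
        return 0.0
    return stats["orig_data_size"] / stats["compr_data_size"]


def read_zram_stats(device: str = "zram0") -> dict[str, int]:
    """Read mm_stat for one device; the device name is an assumption."""
    return parse_mm_stat(Path(f"/sys/block/{device}/mm_stat").read_text())
```

On a host with zram enabled, `compression_ratio(read_zram_stats())` gives a rough indication of how much the swapped pages are being compressed during an upgrade.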

Upgrade failures (should be 0)

node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}

Memory available during upgrades

node_memory_MemAvailable_bytes / 1024 / 1024

Verification Steps

After a few days (allow auto-upgrades to run on all hosts):

  1. Check all hosts have swap enabled:

    node_memory_SwapTotal_bytes > 0
    
  2. Check for any upgrade failures since the fix:

    count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
    
  3. Review if any hosts used swap during upgrades (check historical graphs)
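
The first two checks can also be scripted against the Prometheus HTTP API. This is a sketch only: the base URL is a placeholder, the helper names are hypothetical, and the swap query is inverted to `== 0` so that it returns the hosts that are missing swap rather than the ones that have it. It also assumes the relevant series carry an `instance` label.

```python
import json
import urllib.parse
import urllib.request

# Placeholder; replace with the actual Prometheus address.
PROMETHEUS_URL = "http://prometheus.example:9090"

# Hosts returned by this query report no swap at all.
NO_SWAP_QUERY = "node_memory_SwapTotal_bytes == 0"
# Any result here means the alert was active at some point in the last 7 days.
UPGRADE_ALERT_QUERY = 'count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])'


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for /api/v1/query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def instances(response: dict) -> list[str]:
    """Extract sorted, de-duplicated `instance` labels from an instant-query response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    results = response["data"]["result"]
    return sorted({r["metric"].get("instance", "<unknown>") for r in results})


def run_query(promql: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Execute one instant query and return the decoded JSON body."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=10) as resp:
        return json.load(resp)
```

If the fix is working, both `instances(run_query(NO_SWAP_QUERY))` and `instances(run_query(UPGRADE_ALERT_QUERY))` would be expected to return an empty list.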

Success Criteria

  • No nixos_upgrade_failed alerts due to OOM after 2026-02-08
  • All hosts show ~2GB swap available
  • Upgrades complete successfully on 2GB VMs

Fallback Options

If zram is insufficient:

  1. Increase VM memory - Update terraform/vms.tf to 4GB for affected hosts
  2. Enable memory ballooning - Configure VMs with dynamic memory allocation (see below)
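  3. Use remote builds - Configure nix.buildMachines to offload builds (evaluation still runs locally, so this only helps if builds rather than evaluation are the memory bottleneck)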
  4. Reduce flake size - Split configurations to reduce evaluation memory

Memory Ballooning

Proxmox supports memory ballooning, which allows VMs to dynamically grow/shrink memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.

Configuration in terraform/vms.tf:

memory  = 4096  # maximum memory
balloon = 2048  # minimum memory (shrinks to this when idle)

Pros:

  • VMs get memory on-demand without reboots
  • Better host memory utilization
  • Solves upgrade OOM without permanently allocating 4GB

Cons:

  • Requires the virtio balloon driver (virtio_balloon) to be loaded in the guest
  • Guest can experience memory pressure if host is overcommitted

Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.