# Memory Issues Follow-up

Tracking the zram change to verify that it resolves OOM issues during nixos-upgrade on low-memory hosts.

## Background

On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.

Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.

## Fix Applied

**Commit:** `1674b6a` - system: enable zram swap for all hosts

**Merged:** 2026-02-08 ~12:15 UTC

**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.

## Timeline

All times are on 2026-02-08.

| Time (UTC) | Event |
|------------|-------|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
| 05:01:47 | `nixos_upgrade_failed` alert fired |
| 12:15 | zram commit merged to master |
| 12:19 | ns2 rebooted with zram enabled |
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |

## Hosts Affected

All 2GB VMs that run nixos-upgrade:

- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01

## Metrics to Monitor

Check these in Grafana or via PromQL to verify the fix. All queries below convert bytes to MiB.

### Swap availability (should be ~2GB after upgrade)

```promql
node_memory_SwapTotal_bytes / 1024 / 1024
```

### Swap usage during upgrades

```promql
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
```

### Zswap compressed bytes (active compression)

```promql
node_memory_Zswap_bytes / 1024 / 1024
```

Note: this metric reports the kernel's zswap feature, which is separate from zram. On hosts that use only zram it is expected to read 0 or be absent, so rely on the swap usage query above to see zram activity.

### Upgrade failures (should be 0)

```promql
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
```

### Memory available during upgrades

```promql
node_memory_MemAvailable_bytes / 1024 / 1024
```

## Verification Steps

After a few days (allow auto-upgrades to run on all hosts):

1. Check that all hosts have swap enabled:
   ```promql
   node_memory_SwapTotal_bytes > 0
   ```

2. Check for any upgrade failures since the fix:
   ```promql
   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
   ```

3. Review whether any hosts used swap during upgrades (check historical graphs)

## Success Criteria

- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs

## Fallback Options

If zram is insufficient:

1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
2. **Enable memory ballooning** - Configure VMs with dynamic memory allocation (see below)
3. **Use remote builds** - Configure `nix.buildMachines` to offload evaluation
4. **Reduce flake size** - Split configurations to reduce evaluation memory

### Memory Ballooning

Proxmox supports memory ballooning, which allows VMs to dynamically grow or shrink their memory allocation based on demand. The balloon driver inside the guest communicates with the hypervisor to release or reclaim memory pages.

Configuration in `terraform/vms.tf`:

```hcl
memory  = 4096  # maximum memory
balloon = 2048  # minimum memory (shrinks to this when idle)
```

Pros:

- VMs get memory on demand without reboots
- Better host memory utilization
- Solves upgrade OOM without permanently allocating 4GB

Cons:

- Requires the virtio balloon driver in the guest (the QEMU guest agent alone does not provide ballooning)
- Guest can experience memory pressure if the host is overcommitted

Ballooning and zram are complementary - ballooning provides headroom from the host, zram provides overflow within the guest.
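
For the zram side of that pairing, the following is a minimal sketch of what `system/zram.nix` might look like. The actual file contents are not shown in this document beyond `zramSwap.enable = true`, so everything else here is an assumption. In particular, `memoryPercent = 100` is chosen only to match the ~2GB swap figure on a 2GB host; the NixOS default for this option is 50, which would give roughly 1GB on a 2GB host.

```nix
# system/zram.nix (sketch only; the real file may differ)
{ ... }:
{
  zramSwap = {
    # The setting named in the "Fix Applied" section.
    enable = true;

    # Assumption: size the zram device at 100% of RAM so a 2GB host
    # reports ~2GB of swap. The NixOS default is 50.
    memoryPercent = 100;
  };
}
```

If the hosts report about 1GB from the swap availability query rather than ~2GB, the default `memoryPercent` is a likely explanation. On a host, `swapon --show` and `zramctl` should list the zram device and its configured size.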