From 67c27555f3d14fc3aab6527daf630c075a16bf20 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sun, 8 Feb 2026 13:26:31 +0100
Subject: [PATCH] docs: add memory issues follow-up plan

Track zram change effectiveness for OOM prevention during upgrades.

Co-Authored-By: Claude Opus 4.5
---
 docs/plans/memory-issues-follow-up.md | 94 +++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 docs/plans/memory-issues-follow-up.md

diff --git a/docs/plans/memory-issues-follow-up.md b/docs/plans/memory-issues-follow-up.md
new file mode 100644
index 0000000..4173063
--- /dev/null
+++ b/docs/plans/memory-issues-follow-up.md
@@ -0,0 +1,94 @@
+# Memory Issues Follow-up
+
+Tracking the zram change to verify it resolves OOM issues during nixos-upgrade on low-memory hosts.
+
+## Background
+
+On 2026-02-08, ns2 (2GB RAM) experienced an OOM kill during nixos-upgrade. The Nix evaluation process consumed ~1.6GB before being killed by the kernel. ns1 (manually increased to 4GB) succeeded with the same upgrade.
+
+Root cause: 2GB RAM is insufficient for Nix flake evaluation without swap.
+
+## Fix Applied
+
+**Commit:** `1674b6a` - system: enable zram swap for all hosts
+
+**Merged:** 2026-02-08 ~12:15 UTC
+
+**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.
+
+## Timeline
+
+| Time (UTC) | Event |
+|------------|-------|
+| 05:00:46 | ns2 nixos-upgrade OOM killed |
+| 05:01:47 | `nixos_upgrade_failed` alert fired |
+| 12:15 | zram commit merged to master |
+| 12:19 | ns2 rebooted with zram enabled |
+| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |
+
+## Hosts Affected
+
+All 2GB VMs that run nixos-upgrade:
+- ns1, ns2 (DNS)
+- vault01
+- testvm01, testvm02, testvm03
+- kanidm01
+
+## Metrics to Monitor
+
+Check these in Grafana or via PromQL to verify the fix (queries return MiB unless noted):
+
+### Swap availability (should be ~2GB after upgrade)
+```promql
+node_memory_SwapTotal_bytes / 1024 / 1024
+```
+
+### Swap usage during upgrades
+```promql
+(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
+```
+
+### Zswap pool size (reports zswap, not zram; expect 0 with zram-only swap)
+```promql
+node_memory_Zswap_bytes / 1024 / 1024
+```
+
+### Upgrade failures (should be 0)
+```promql
+node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
+```
+
+### Memory available during upgrades
+```promql
+node_memory_MemAvailable_bytes / 1024 / 1024
+```
+
+## Verification Steps
+
+After a few days (allow auto-upgrades to run on all hosts):
+
+1. Check that all hosts have swap enabled:
+   ```promql
+   node_memory_SwapTotal_bytes > 0
+   ```
+
+2. Check for any upgrade failures since the fix:
+   ```promql
+   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
+   ```
+
+3. Review whether any hosts used swap during upgrades (check historical graphs)
+
+## Success Criteria
+
+- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
+- All hosts show ~2GB swap available
+- Upgrades complete successfully on 2GB VMs
+
+## Fallback Options
+
+If zram is insufficient:
+
+1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
+2. **Use remote builds** - Configure `nix.buildMachines` to offload builds (evaluation still runs locally)
+3. **Reduce flake size** - Split configurations to reduce evaluation memory
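
Verification step 3 in the plan relies on reading historical graphs by eye. As a rough complement, the queries below sketch how the same check could be expressed in PromQL. They are not part of the commit. They assume a Prometheus version with subquery support and that node_exporter's vmstat collector exposes `node_vmstat_oom_kill` on these hosts (an unverified assumption). The 7d window and 1m subquery resolution are illustrative choices, and each query is meant to be run separately.

```promql
# Peak swap usage per host over the last 7 days, in MiB.
# Uses a subquery because the difference is an expression, not a raw series.
max_over_time(
  (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)[7d:1m]
) / 1024 / 1024

# Lowest MemAvailable per host over the last 7 days, in MiB.
min_over_time(node_memory_MemAvailable_bytes[7d]) / 1024 / 1024

# Hosts where the kernel OOM killer ran in the last 7 days
# (assumes node_vmstat_oom_kill is exported; an empty result means none).
increase(node_vmstat_oom_kill[7d]) > 0
```

A non-zero peak from the first query together with an empty result from the third would suggest that zram absorbed memory pressure that previously ended in an OOM kill.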