docs: add memory issues follow-up plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m2s
Track zram change effectiveness for OOM prevention during upgrades.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
94
docs/plans/memory-issues-follow-up.md
Normal file
@@ -0,0 +1,94 @@
# Memory Issues Follow-up

This plan tracks whether the zram change resolves the OOM kills seen during `nixos-upgrade` on low-memory hosts.

## Background

On 2026-02-08, ns2 (2GB RAM) hit an OOM kill during `nixos-upgrade`. The Nix evaluation process consumed ~1.6GB before the kernel killed it. ns1, which had been manually increased to 4GB, completed the same upgrade successfully.

Root cause: 2GB of RAM is insufficient for Nix flake evaluation without swap.

## Fix Applied

**Commit:** `1674b6a` - system: enable zram swap for all hosts

**Merged:** 2026-02-08 ~12:15 UTC

**Change:** Added `zramSwap.enable = true` to `system/zram.nix`, providing ~2GB compressed swap on all hosts.

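Because zram keeps its compressed pages in RAM, a ~2GB zram device does not add a full 2GB of headroom. The following is a rough sizing sketch, not a measurement: the 3:1 compression ratio is an assumption, and the real ratio depends on the workload.

```python
GIB = 1024 ** 3


def zram_net_gain_bytes(device_bytes: int, compression_ratio: float) -> float:
    """Extra capacity gained when the zram device is completely full.

    A full device holds `device_bytes` of uncompressed pages but occupies
    roughly `device_bytes / compression_ratio` of physical RAM.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    ram_cost = device_bytes / compression_ratio
    return device_bytes - ram_cost


# Assumed values: 2 GiB device (from the change above), 3:1 compression (a guess).
gain = zram_net_gain_bytes(2 * GIB, 3.0)
print(f"net gain: {gain / GIB:.2f} GiB")  # prints "net gain: 1.33 GiB"
```

If observed swap usage during upgrades approaches the device size, the real compression ratio is what decides whether the host still fits in memory.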
## Timeline

| Time (UTC) | Event |
|------------|-------|
| 05:00:46 | ns2 nixos-upgrade OOM killed |
| 05:01:47 | `nixos_upgrade_failed` alert fired |
| 12:15 | zram commit merged to master |
| 12:19 | ns2 rebooted with zram enabled |
| 12:20 | ns1 rebooted (memory reduced to 2GB via tofu) |

## Hosts Affected

All 2GB VMs that run nixos-upgrade:
- ns1, ns2 (DNS)
- vault01
- testvm01, testvm02, testvm03
- kanidm01

## Metrics to Monitor

Check these in Grafana or via PromQL to verify the fix. The memory queries divide by 1024 twice, so their values are in MiB.

### Swap availability (should be ~2GB, i.e. ~2048 MiB, after upgrade)
```promql
node_memory_SwapTotal_bytes / 1024 / 1024
```

### Swap usage during upgrades
```promql
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / 1024 / 1024
```

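### Zswap compressed bytes
```promql
node_memory_Zswap_bytes / 1024 / 1024
```

Note: `node_memory_Zswap_bytes` comes from the `Zswap` field of `/proc/meminfo` and reports zswap, a kernel feature separate from zram. It is expected to stay at 0 unless zswap is also enabled, so rely on the swap usage query above to see zram activity.
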
### Upgrade failures (should be 0)
```promql
node_systemd_unit_state{name="nixos-upgrade.service", state="failed"}
```

### Memory available during upgrades
```promql
node_memory_MemAvailable_bytes / 1024 / 1024
```

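For checking these metrics outside Grafana, the sketch below runs an instant query through the Prometheus HTTP API (`/api/v1/query`) and converts byte values to MiB per instance. The helper names are hypothetical, not part of this repository. Pass raw byte metrics such as `node_memory_SwapTotal_bytes` without the `/ 1024 / 1024` suffix, since the helper does the conversion itself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MIB = 1024 * 1024


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector_mib(payload: dict) -> dict:
    """Map each series' `instance` label to its value converted to MiB."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    out = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "<unknown>")
        _timestamp, raw_value = series["value"]  # value is [unix_time, "string"]
        out[instance] = float(raw_value) / MIB
    return out


def query_mib(base_url: str, promql: str) -> dict:
    """Run an instant query and return per-instance values in MiB."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return parse_vector_mib(json.load(resp))
```

Against a reachable server, `query_mib("http://prometheus.example:9090", "node_memory_SwapTotal_bytes")` would return a dict keyed by instance. The address is a placeholder.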
## Verification Steps

After a few days (allow auto-upgrades to run on all hosts):

1. Check that all hosts have swap enabled:
   ```promql
   node_memory_SwapTotal_bytes > 0
   ```
   Hosts missing from the result have no swap.

2. Check for any upgrade failures since the fix:
   ```promql
   count_over_time(ALERTS{alertname="nixos_upgrade_failed"}[7d])
   ```
   An empty result means the alert was never pending or firing in that window.

3. Review whether any hosts used swap during upgrades (check historical graphs).

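Step 1 only returns hosts that have swap, so a host without swap is easy to miss. The sketch below compares the query result against the hosts listed under "Hosts Affected". The shape of the `instance` label (`host:port` or `host.domain:port`) is an assumption about this setup, and the function names are hypothetical.

```python
# Host names come from the "Hosts Affected" list in this plan.
EXPECTED_HOSTS = {
    "ns1", "ns2", "vault01", "testvm01", "testvm02", "testvm03", "kanidm01",
}


def host_from_instance(instance: str) -> str:
    """Strip port and domain from a label such as 'ns2.example:9100'."""
    return instance.split(":", 1)[0].split(".", 1)[0]


def hosts_missing_swap(swap_total_by_instance: dict) -> set:
    """Return expected hosts that report zero swap or do not report at all."""
    with_swap = {
        host_from_instance(instance)
        for instance, swap_bytes in swap_total_by_instance.items()
        if swap_bytes > 0
    }
    return EXPECTED_HOSTS - with_swap
```

An empty set from `hosts_missing_swap` means every listed host reported non-zero `node_memory_SwapTotal_bytes`.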
## Success Criteria

- No `nixos_upgrade_failed` alerts due to OOM after 2026-02-08
- All hosts show ~2GB swap available
- Upgrades complete successfully on 2GB VMs

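The criteria above could be checked per host as sketched below. The 10% tolerance around 2 GiB is an assumption standing in for "~2GB". The check also cannot tell OOM failures from other causes, so any recorded `nixos_upgrade_failed` sample counts as a failure and needs manual review.

```python
GIB = 1024 ** 3


def meets_success_criteria(swap_total_bytes, failed_alert_samples, tolerance=0.1):
    """Check one host against the swap-size and no-failed-upgrade criteria.

    swap_total_bytes: value of node_memory_SwapTotal_bytes for the host.
    failed_alert_samples: result of the count_over_time query (0 if empty).
    tolerance: assumed fractional slack around the 2 GiB target.
    """
    target = 2 * GIB
    swap_ok = abs(swap_total_bytes - target) <= tolerance * target
    return swap_ok and failed_alert_samples == 0
```

For example, `meets_success_criteria(2 * GIB, 0)` passes, while a host with 1 GiB of swap or any recorded alert sample does not.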
## Fallback Options

If zram is insufficient:

1. **Increase VM memory** - Update `terraform/vms.tf` to 4GB for affected hosts
2. **Use remote builds** - Configure `nix.buildMachines` to offload builds. Evaluation still runs on the host, so to move it elsewhere, evaluate and build on another machine and deploy the result (for example with `nixos-rebuild --target-host`)
3. **Reduce flake size** - Split configurations to reduce evaluation memory