pn02: disable sched_ext and document memtest results
Memtest86 ran 38 passes (109 hours) with zero errors, ruling out RAM. Disable the sched_ext scheduler to test whether the kernel scheduler crashes stop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -169,6 +169,14 @@ These appear on both units and can be ignored:
- Journal corruption confirmed on next boot
- No pstore data captured

+### 2026-03-12: pn02 Memtest86 — 38 Passes, Zero Errors
+
+- Ran memtest86 for ~109 hours (4.5 days), completing 38 full passes
+- **Zero errors found** — RAM appears healthy
+- Makes hardware-induced memory corruption less likely as the sole cause of crashes
+- Memtest cannot rule out CPU cache errors, PCIe/IOMMU issues, or kernel bugs triggered by platform quirks
+- **Next step**: Boot back into NixOS with sched_ext disabled to test the kernel scheduler hypothesis
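
For reference, the memtest86 entry used for this run can be kept in the systemd-boot menu declaratively. A minimal sketch, assuming the stock NixOS systemd-boot module option is what provides the entry (MemTest86 is unfree, so `allowUnfree` may be needed):

```nix
{
  # Adds a MemTest86 entry to the systemd-boot menu, so long soak tests
  # can be rerun without preparing separate boot media.
  boot.loader.systemd-boot.memtest86.enable = true;
}
```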
### 2026-03-07: pn01 Status
- pn01 has had **zero crashes** since initial setup on Feb 21
@@ -193,18 +201,18 @@ These appear on both units and can be ignored:
**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).
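
Of the mitigations listed above, fTPM, the BIOS update, and ErP are firmware-side; the rest are expressible in NixOS. A hedged sketch of how the software-side mitigations might look in pn02's configuration (the option names are standard NixOS options; the values mirror the list above, not the actual config):

```nix
{
  # amdgpu blacklisted as a crash mitigation.
  boot.blacklistedKernelModules = [ "amdgpu" ];

  # Limit C-states and enable the NMI watchdog.
  boot.kernelParams = [ "processor.max_cstate=1" "nmi_watchdog=1" ];

  # rasdaemon logs correctable/uncorrectable hardware errors.
  hardware.rasdaemon.enable = true;
}
```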
-The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. It's unclear whether this is:
+The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. Memtest86 ran 38 passes (109 hours) with zero errors, making option 2 less likely. Remaining possibilities:
1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
-2. Hardware-induced memory corruption that happens to hit scheduler data structures
+2. ~~Hardware-induced memory corruption that happens to hit scheduler data structures~~ — unlikely after clean memtest
3. A pure software bug in the 6.12.74 kernel's sched_ext implementation

**pn01 is stable** — zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board.
## Next Steps
-- **pn02 memtest**: Run memtest86 for 24h+ (available in systemd-boot menu). The crash signatures (userspace segfaults before kernel panics, corrupted pointers in scheduler structures) are consistent with intermittent RAM errors that a quick pass wouldn't catch. If memtest finds errors, swap the DIMM.
-- **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working)
-- **pn02 investigation**: Could try disabling sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) to test whether the crashes stop — would help distinguish kernel bug from hardware defect
+- **~~pn02 memtest~~**: ~~Run memtest86 for 24h+~~ — Done (2026-03-12): 38 passes over 109 hours, zero errors. RAM is not the issue.
+- **pn02 sched_ext test**: Disable sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) and run for 1-2 weeks to test whether the crashes stop — would help distinguish a kernel bug from a hardware defect (see the sketch after this list)
+- **pn02**: If disabling sched_ext doesn't help, consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via the hardware watchdog is working; config sketch after this list)
- **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads
- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
- For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration
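
On the sched_ext test: `sched_ext.enabled=0` may not exist as a kernel parameter (hence the "or equivalent" above). sched_ext only governs scheduling while a BPF scheduler is attached, so not starting one is the equivalent. A minimal sketch, assuming the scheduler was enabled via the NixOS `services.scx` module:

```nix
{ lib, ... }:
{
  # With no BPF scheduler loaded, the kernel falls back to the stock
  # EEVDF scheduler, which is exactly the condition this test needs.
  services.scx.enable = lib.mkForce false;
}
```

After the next boot, `/sys/kernel/sched_ext/state` should read `disabled` if the kernel exposes that file.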
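
And on the watchdog-based auto-recovery mentioned above, one way it is commonly wired up on NixOS (the 30s value is an assumption, not pn02's actual setting):

```nix
{
  # systemd pets the hardware watchdog while the system is up; if the
  # kernel hangs for longer than this window, the board resets itself.
  systemd.watchdog.runtimeTime = "30s";
}
```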