From 20875fb03fbb4da0c5f7c45f3b542a53a64fb106 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Thu, 12 Mar 2026 12:16:55 +0100
Subject: [PATCH] pn02: disable sched_ext and document memtest results

Memtest86 ran 38 passes (109 hours) with zero errors, ruling out RAM.
Disable sched_ext scheduler to test whether kernel scheduler crashes
stop.

Co-Authored-By: Claude Opus 4.6
---
 docs/plans/pn51-stability.md | 18 +++++++++++++-----
 hosts/pn02/configuration.nix |  2 +-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/docs/plans/pn51-stability.md b/docs/plans/pn51-stability.md
index 1c70841..263d990 100644
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -169,6 +169,14 @@ These appear on both units and can be ignored:
 - Journal corruption confirmed on next boot
 - No pstore data captured
 
+### 2026-03-12: pn02 Memtest86 — 38 Passes, Zero Errors
+
+- Ran memtest86 for ~109 hours (4.5 days), completing 38 full passes
+- **Zero errors found** — RAM appears healthy
+- Makes hardware-induced memory corruption less likely as the sole cause of crashes
+- Memtest cannot rule out CPU cache errors, PCIe/IOMMU issues, or kernel bugs triggered by platform quirks
+- **Next step**: Boot back into NixOS with sched_ext disabled to test the kernel scheduler hypothesis
+
 ### 2026-03-07: pn01 Status
 
 - pn01 has had **zero crashes** since initial setup on Feb 21
@@ -193,18 +201,18 @@ These appear on both units and can be ignored:
 **pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).
 
-The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. It's unclear whether this is:
+The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. Memtest86 ran 38 passes (109 hours) with zero errors, making option 2 less likely. Remaining possibilities:
 
 1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
-2. Hardware-induced memory corruption that happens to hit scheduler data structures
+2. ~~Hardware-induced memory corruption that happens to hit scheduler data structures~~ — unlikely after clean memtest
 3. A pure software bug in the 6.12.74 kernel's sched_ext implementation
 
 **pn01 is stable** — zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board.
 
 ## Next Steps
 
-- **pn02 memtest**: Run memtest86 for 24h+ (available in systemd-boot menu). The crash signatures (userspace segfaults before kernel panics, corrupted pointers in scheduler structures) are consistent with intermittent RAM errors that a quick pass wouldn't catch. If memtest finds errors, swap the DIMM.
-- **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working)
-- **pn02 investigation**: Could try disabling sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) to test whether the crashes stop — would help distinguish kernel bug from hardware defect
+- **~~pn02 memtest~~**: ~~Run memtest86 for 24h+~~ — Done (2026-03-12): 38 passes over 109 hours, zero errors. RAM is not the issue.
+- **pn02 sched_ext test**: Disable sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) and run for 1-2 weeks to test whether the crashes stop — would help distinguish kernel bug from hardware defect
+- **pn02**: If sched_ext disable doesn't help, consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is working)
 - **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads
 - If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
 - For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration
diff --git a/hosts/pn02/configuration.nix b/hosts/pn02/configuration.nix
index 5a52001..5d82961 100644
--- a/hosts/pn02/configuration.nix
+++ b/hosts/pn02/configuration.nix
@@ -15,7 +15,7 @@
   boot.loader.systemd-boot.memtest86.enable = true;
   boot.loader.efi.canTouchEfiVariables = true;
   boot.blacklistedKernelModules = [ "amdgpu" ];
-  boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
+  boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" "sched_ext.enabled=0" ];
   boot.kernel.sysctl."kernel.softlockup_panic" = 1;
   boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
 
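
Note (outside the patch, ignored by `git am`): the plan hedges on the exact disable mechanism ("`sched_ext.enabled=0` or equivalent"). Independent of which kernel parameter ends up working, kernels built with `CONFIG_SCHED_CLASS_EXT` expose a sysfs interface that reports whether a sched_ext BPF scheduler is currently active, which can confirm the state after the rebuild reboot. A minimal check, assuming that sysfs path is present:

```shell
# Verify whether a sched_ext BPF scheduler is active on the rebuilt host.
# /sys/kernel/sched_ext only exists on kernels with CONFIG_SCHED_CLASS_EXT.
state_file=/sys/kernel/sched_ext/state
if [ -r "$state_file" ]; then
  # Reports "disabled" when no BPF scheduler is loaded, "enabled" otherwise
  cat "$state_file"
else
  echo "sched_ext sysfs interface not present on this kernel"
fi
```

Seeing "disabled" here for the duration of the 1-2 week test window is the cheap way to confirm the experiment is actually running under the stock EEVDF/CFS scheduler rather than an scx scheduler.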