From 20875fb03fbb4da0c5f7c45f3b542a53a64fb106 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Thu, 12 Mar 2026 12:16:55 +0100
Subject: [PATCH] pn02: disable sched_ext and document memtest results

Memtest86 ran 38 passes (109 hours) with zero errors, ruling out RAM.
Disable sched_ext scheduler to test whether kernel scheduler crashes
stop.

Co-Authored-By: Claude Opus 4.6
---
 docs/plans/pn51-stability.md | 18 +++++++++++++-----
 hosts/pn02/configuration.nix |  2 +-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/docs/plans/pn51-stability.md b/docs/plans/pn51-stability.md
index 1c70841..263d990 100644
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -169,6 +169,14 @@ These appear on both units and can be ignored:
 - Journal corruption confirmed on next boot
 - No pstore data captured
 
+### 2026-03-12: pn02 Memtest86 — 38 Passes, Zero Errors
+
+- Ran memtest86 for ~109 hours (4.5 days), completing 38 full passes
+- **Zero errors found** — RAM appears healthy
+- Makes hardware-induced memory corruption less likely as the sole cause of crashes
+- Memtest cannot rule out CPU cache errors, PCIe/IOMMU issues, or kernel bugs triggered by platform quirks
+- **Next step**: Boot back into NixOS with sched_ext disabled to test the kernel scheduler hypothesis
+
 ### 2026-03-07: pn01 Status
 
 - pn01 has had **zero crashes** since initial setup on Feb 21
@@ -193,18 +201,18 @@ These appear on both units and can be ignored:
 **pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).
 
-The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. It's unclear whether this is:
+The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. Memtest86 ran 38 passes (109 hours) with zero errors, making option 2 less likely. Remaining possibilities:
 
 1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
-2. Hardware-induced memory corruption that happens to hit scheduler data structures
+2. ~~Hardware-induced memory corruption that happens to hit scheduler data structures~~ — unlikely after clean memtest
 3. A pure software bug in the 6.12.74 kernel's sched_ext implementation
 
 **pn01 is stable** — zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board.
 
 ## Next Steps
 
-- **pn02 memtest**: Run memtest86 for 24h+ (available in systemd-boot menu). The crash signatures (userspace segfaults before kernel panics, corrupted pointers in scheduler structures) are consistent with intermittent RAM errors that a quick pass wouldn't catch. If memtest finds errors, swap the DIMM.
-- **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working)
-- **pn02 investigation**: Could try disabling sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) to test whether the crashes stop — would help distinguish kernel bug from hardware defect
+- **~~pn02 memtest~~**: ~~Run memtest86 for 24h+~~ — Done (2026-03-12): 38 passes over 109 hours, zero errors. RAM is not the issue.
+- **pn02 sched_ext test**: Disable sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) and run for 1-2 weeks to test whether the crashes stop — would help distinguish kernel bug from hardware defect
+- **pn02**: If sched_ext disable doesn't help, consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is working)
 - **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads
 - If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
 - For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration
diff --git a/hosts/pn02/configuration.nix b/hosts/pn02/configuration.nix
index 5a52001..5d82961 100644
--- a/hosts/pn02/configuration.nix
+++ b/hosts/pn02/configuration.nix
@@ -15,7 +15,7 @@
   boot.loader.systemd-boot.memtest86.enable = true;
   boot.loader.efi.canTouchEfiVariables = true;
   boot.blacklistedKernelModules = [ "amdgpu" ];
-  boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
+  boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" "sched_ext.enabled=0" ];
   boot.kernel.sysctl."kernel.softlockup_panic" = 1;
   boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
 
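
Note (outside the patch, ignored by `git am`): the plan hedges on the exact disable mechanism ("`sched_ext.enabled=0` or equivalent"). Independent of which kernel parameter ends up working, kernels built with `CONFIG_SCHED_CLASS_EXT` expose a sysfs interface that reports whether a sched_ext BPF scheduler is currently active, which can confirm the state after the rebuild reboot. A minimal check, assuming that sysfs path is present:

```shell
# Verify whether a sched_ext BPF scheduler is active on the rebuilt host.
# /sys/kernel/sched_ext only exists on kernels with CONFIG_SCHED_CLASS_EXT.
state_file=/sys/kernel/sched_ext/state
if [ -r "$state_file" ]; then
  # Reports "disabled" when no BPF scheduler is loaded, "enabled" otherwise
  cat "$state_file"
else
  echo "sched_ext sysfs interface not present on this kernel"
fi
```

Seeing "disabled" here for the duration of the 1-2 week test window is the cheap way to confirm the experiment is actually running under the stock EEVDF/CFS scheduler rather than an scx scheduler.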