pn02: disable sched_ext and document memtest results

Memtest86 ran 38 passes (109 hours) with zero errors, ruling out RAM.
Disable sched_ext scheduler to test whether kernel scheduler crashes stop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-03-12 12:16:55 +01:00
parent 5c111c8d78
commit 20875fb03f
2 changed files with 14 additions and 6 deletions

View File

@@ -169,6 +169,14 @@ These appear on both units and can be ignored:
- Journal corruption confirmed on next boot - Journal corruption confirmed on next boot
- No pstore data captured - No pstore data captured
### 2026-03-12: pn02 Memtest86 — 38 Passes, Zero Errors
- Ran memtest86 for ~109 hours (4.5 days), completing 38 full passes
- **Zero errors found** — RAM appears healthy
- Makes hardware-induced memory corruption less likely as the sole cause of crashes
- Memtest cannot rule out CPU cache errors, PCIe/IOMMU issues, or kernel bugs triggered by platform quirks
- **Next step**: Boot back into NixOS with sched_ext disabled to test the kernel scheduler hypothesis
### 2026-03-07: pn01 Status ### 2026-03-07: pn01 Status
- pn01 has had **zero crashes** since initial setup on Feb 21 - pn01 has had **zero crashes** since initial setup on Feb 21
@@ -193,18 +201,18 @@ These appear on both units and can be ignored:
**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots). **pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).
The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. It's unclear whether this is: The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. Memtest86 ran 38 passes (109 hours) with zero errors, making option 2 less likely. Remaining possibilities:
1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior) 1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
2. Hardware-induced memory corruption that happens to hit scheduler data structures 2. ~~Hardware-induced memory corruption that happens to hit scheduler data structures~~ — unlikely after clean memtest
3. A pure software bug in the 6.12.74 kernel's sched_ext implementation 3. A pure software bug in the 6.12.74 kernel's sched_ext implementation
**pn01 is stable** — zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board. **pn01 is stable** — zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board.
## Next Steps ## Next Steps
- **pn02 memtest**: Run memtest86 for 24h+ (available in systemd-boot menu). The crash signatures (userspace segfaults before kernel panics, corrupted pointers in scheduler structures) are consistent with intermittent RAM errors that a quick pass wouldn't catch. If memtest finds errors, swap the DIMM. - **~~pn02 memtest~~**: ~~Run memtest86 for 24h+~~ — Done (2026-03-12): 38 passes over 109 hours, zero errors. RAM is not the issue.
- **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working) - **pn02 sched_ext test**: Disable sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) and run for 1-2 weeks to test whether the crashes stop — would help distinguish kernel bug from hardware defect
- **pn02 investigation**: Could try disabling sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) to test whether the crashes stop — would help distinguish kernel bug from hardware defect - **pn02**: If sched_ext disable doesn't help, consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is working)
- **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads - **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads
- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help - If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
- For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration - For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration

View File

@@ -15,7 +15,7 @@
boot.loader.systemd-boot.memtest86.enable = true; boot.loader.systemd-boot.memtest86.enable = true;
boot.loader.efi.canTouchEfiVariables = true; boot.loader.efi.canTouchEfiVariables = true;
boot.blacklistedKernelModules = [ "amdgpu" ]; boot.blacklistedKernelModules = [ "amdgpu" ];
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ]; boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" "sched_ext.enabled=0" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1; boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1; boot.kernel.sysctl."kernel.hardlockup_panic" = 1;