pn02: disable sched_ext and document memtest results
Memtest86 ran 38 passes (109 hours) with zero errors, ruling out RAM. Disable the sched_ext scheduler to test whether the kernel scheduler crashes stop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -169,6 +169,14 @@ These appear on both units and can be ignored:
- Journal corruption confirmed on next boot
- No pstore data captured

+### 2026-03-12: pn02 Memtest86 — 38 Passes, Zero Errors
+
+- Ran memtest86 for ~109 hours (4.5 days), completing 38 full passes
+- **Zero errors found** — RAM appears healthy
+- Makes hardware-induced memory corruption less likely as the sole cause of crashes
+- Memtest cannot rule out CPU cache errors, PCIe/IOMMU issues, or kernel bugs triggered by platform quirks
+- **Next step**: Boot back into NixOS with sched_ext disabled to test the kernel scheduler hypothesis
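
For reference, the memtest86 entry used for this run can be kept in the systemd-boot menu declaratively. A minimal sketch, assuming the stock NixOS systemd-boot module option is what provides the entry (MemTest86 is unfree, so `allowUnfree` may be needed):

```nix
{
  # Adds a MemTest86 entry to the systemd-boot menu, so long soak tests
  # can be rerun without preparing separate boot media.
  boot.loader.systemd-boot.memtest86.enable = true;
}
```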
### 2026-03-07: pn01 Status
- pn01 has had **zero crashes** since initial setup on Feb 21
@@ -193,18 +201,18 @@ These appear on both units and can be ignored:
**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).
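
Of the mitigations listed above, fTPM, the BIOS update, and ErP are firmware-side; the rest are expressible in NixOS. A hedged sketch of how the software-side mitigations might look in pn02's configuration (the option names are standard NixOS options; the values mirror the list above, not the actual config):

```nix
{
  # amdgpu blacklisted as a crash mitigation.
  boot.blacklistedKernelModules = [ "amdgpu" ];

  # Limit C-states and enable the NMI watchdog.
  boot.kernelParams = [ "processor.max_cstate=1" "nmi_watchdog=1" ];

  # rasdaemon logs correctable/uncorrectable hardware errors.
  hardware.rasdaemon.enable = true;
}
```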
-The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. It's unclear whether this is:
+The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. Memtest86 ran 38 passes (109 hours) with zero errors, making option 2 less likely. Remaining possibilities:
1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
-2. Hardware-induced memory corruption that happens to hit scheduler data structures
+2. ~~Hardware-induced memory corruption that happens to hit scheduler data structures~~ — unlikely after clean memtest
3. A pure software bug in the 6.12.74 kernel's sched_ext implementation

**pn01 is stable** — zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board.
## Next Steps
-- **pn02 memtest**: Run memtest86 for 24h+ (available in systemd-boot menu). The crash signatures (userspace segfaults before kernel panics, corrupted pointers in scheduler structures) are consistent with intermittent RAM errors that a quick pass wouldn't catch. If memtest finds errors, swap the DIMM.
-- **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working)
-- **pn02 investigation**: Could try disabling sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) to test whether the crashes stop — would help distinguish kernel bug from hardware defect
+- **~~pn02 memtest~~**: ~~Run memtest86 for 24h+~~ — Done (2026-03-12): 38 passes over 109 hours, zero errors. RAM is not the issue.
+- **pn02 sched_ext test**: Disable sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) and run for 1-2 weeks to test whether the crashes stop — would help distinguish a kernel bug from a hardware defect (see the sketch after this list)
+- **pn02**: If disabling sched_ext doesn't help, consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via the hardware watchdog is working; config sketch after this list)
- **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads
- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
- For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration
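
On the sched_ext test: `sched_ext.enabled=0` may not exist as a kernel parameter (hence the "or equivalent" above). sched_ext only governs scheduling while a BPF scheduler is attached, so not starting one is the equivalent. A minimal sketch, assuming the scheduler was enabled via the NixOS `services.scx` module:

```nix
{ lib, ... }:
{
  # With no BPF scheduler loaded, the kernel falls back to the stock
  # EEVDF scheduler, which is exactly the condition this test needs.
  services.scx.enable = lib.mkForce false;
}
```

After the next boot, `/sys/kernel/sched_ext/state` should read `disabled` if the kernel exposes that file.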
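
And on the watchdog-based auto-recovery mentioned above, one way it is commonly wired up on NixOS (the 30s value is an assumption, not pn02's actual setting):

```nix
{
  # systemd pets the hardware watchdog while the system is up; if the
  # kernel hangs for longer than this window, the board resets itself.
  systemd.watchdog.runtimeTime = "30s";
}
```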