pn01, pn02: enable memtest86 and update stability docs

Enable memtest86 in the systemd-boot menu on both PN51 units to allow extended memory testing. Update the stability document with March crash data from pstore/Loki: two crashes are now traced to kernel oopses in scheduler code (sched_ext and the fair scheduler), suggesting possible memory corruption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -134,16 +134,78 @@ These appear on both units and can be ignored:
 - **Positive**: The system auto-rebooted this time (likely hardware watchdog), unlike previous freezes that required manual power cycle
 - `processor.max_cstate=1` may have extended uptime (2d21h vs previous 1h and 5.5h) but did not prevent the freeze
 
+### 2026-02-27 to 2026-03-03: Relative Stability
+
+- pn02 ran without crashes for approximately one week after the third freeze
+- pn01 continued to be completely stable throughout this period
+- Auto-upgrade reboots continued daily (~4am) on both units; these are planned and healthy
+
+### 2026-03-04: pn02 Fourth Crash (sched_ext Kernel Oops, pstore captured)
+
+- pn02 crashed after ~5.8 days uptime (504566s)
+- **First crash captured by pstore**: kernel oops and panic stack traces preserved across reboot
+- Journal corruption confirmed: `system.journal corrupted or uncleanly shut down`
+- **Crash location**: `RIP: 0010:set_next_task_scx+0x6e/0x210`, inside the **sched_ext (SCX) scheduler** subsystem
+- **Call trace**: `sysvec_apic_timer_interrupt` → `cpuidle_enter_state`; the crash happened during CPU idle, triggered by an APIC timer interrupt
+- **CR2**: `ffffffffffffff89`, an obviously invalid kernel pointer (as a signed value it is -0x77, a small negative offset rather than a real address)
+- **Kernel**: 6.12.74 (NixOS 25.11)
+- **Significance**: This is the first crash with actual diagnostic output. Previous crashes were silent sub-NMI freezes. The sched_ext scheduler path is a new finding; earlier crashes were assumed to be hardware-level.
+
+### 2026-03-06: pn02 Fifth Crash
+
+- pn02 crashed again; journal corruption was found on the next boot
+- No pstore data captured for this crash
+
+### 2026-03-07: pn02 Sixth and Seventh Crashes (Two in One Day)
+
+**First crash (~11:06 UTC):**
+- ~26.6 hours uptime (95994s)
+- **pstore captured both Oops and Panic**
+- **Crash location**: scheduler code path, `pick_next_task_fair` → `__pick_next_task` (the fair scheduling class)
+- **CR2**: `000000c000726000`, an invalid pointer dereference
+- **Notable**: `dbus-daemon` segfaulted ~50 minutes before the kernel crash (`segfault at 0` in `libdbus-1.so.3.32.4` on CPU 0), which may indicate memory corruption preceding the kernel crash
+
+**Second crash (~21:15 UTC):**
+- Journal corruption confirmed on next boot
+- No pstore data captured
+
+### 2026-03-07: pn01 Status
+
+- pn01 has had **zero crashes** since initial setup on Feb 21
+- Zero journal corruptions, zero pstore dumps in 30 days
+- The BOOT_ID stayed the same between daily auto-upgrade reboots (no unplanned reboots in between), and every shutdown/reboot cycle was clean
+- All 8 reboots in 30 days are planned auto-upgrade reboots
+- **pn01 is fully stable**
+
+## Crash Summary
+
+| Date | Uptime Before Crash | Crash Type | Diagnostic Data |
+|------|---------------------|------------|-----------------|
+| Feb 21 | ~1h | Silent freeze | None (sub-NMI) |
+| Feb 22 | ~5.5h | Silent freeze | None (sub-NMI) |
+| Feb 27 | ~2d 21h | Silent freeze | None (sub-NMI, rasdaemon empty) |
+| Mar 4 | ~5.8d | **Kernel oops** | pstore: `set_next_task_scx` (sched_ext) |
+| Mar 6 | Unknown | Crash | Journal corruption only |
+| Mar 7 | ~26.6h | **Kernel oops + panic** | pstore: `pick_next_task_fair` (scheduler) + dbus segfault |
+| Mar 7 | Unknown | Crash | Journal corruption only |
+
 ## Conclusion
 
-**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still hard-freezes every few days. The freeze occurs below NMI level with zero diagnostic output, pointing to a board-level hardware defect. The unit is not suitable for hypervisor use or any workload requiring reliability.
+**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).
 
-**pn01 remains stable** so far but has historically crashed as well, just less frequently. Continuing to monitor.
+The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel oopses in scheduler code** (one in sched_ext, one in the fair scheduler), not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. It's unclear whether this is:
+1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
+2. Hardware-induced memory corruption that happens to hit scheduler data structures
+3. A pure software bug in the 6.12.74 kernel's sched_ext implementation
+
+**pn01 is stable**: zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board.
 
 ## Next Steps
 
+- **pn02 memtest**: Run memtest86 for 24h+ (available in the systemd-boot menu). The crash signatures (userspace segfaults before kernel panics, corrupted pointers in scheduler structures) are consistent with intermittent RAM errors that a quick pass wouldn't catch. If memtest finds errors, swap the DIMM.
 - **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working)
-- **pn01**: Continue monitoring. If it remains stable long-term, may still be viable for light workloads
+- **pn02 investigation**: Could try disabling sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) to test whether the crashes stop; this would help distinguish a kernel bug from a hardware defect
+- **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads
 - If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
 - For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration
 
@@ -160,11 +222,10 @@ hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
```
 
-**On next freeze, one of two things happens:**
-1. **NMI watchdog catches it** -> kernel panic with stack trace in logs -> auto-reboot after 10s -> we get diagnostic info
-2. **Hard lockup below NMI level** -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> confirms board-level defect
+**Crash recovery is working**: pstore now captures kernel oops/panic data, and the system auto-reboots via `panic=10` or SP5100 TCO hardware watchdog.
 
 **After reboot, check:**
 - `ras-mc-ctl --summary`: overview of hardware errors
 - `ras-mc-ctl --errors`: detailed error list
 - `journalctl -b -1 -p err`: kernel logs from crashed boot (if panic was logged)
+- pstore data is automatically archived by `systemd-pstore.service` and forwarded to Loki via promtail
@@ -12,6 +12,7 @@
 ];
 
 boot.loader.systemd-boot.enable = true;
+boot.loader.systemd-boot.memtest86.enable = true;
 boot.loader.efi.canTouchEfiVariables = true;
 
 networking.hostName = "pn01";
@@ -12,6 +12,7 @@
 ];
 
 boot.loader.systemd-boot.enable = true;
+boot.loader.systemd-boot.memtest86.enable = true;
 boot.loader.efi.canTouchEfiVariables = true;
 boot.blacklistedKernelModules = [ "amdgpu" ];
 boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
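
Note on the uptime figures: the stability document quotes raw uptimes in seconds next to rounded values (504566s as ~5.8 days, 95994s as ~26.6 hours). A minimal shell sketch for cross-checking them is below. `fmt_uptime` is a hypothetical helper name used only for illustration; it is not part of this commit or of either host's configuration.

```shell
# Hypothetical helper (not in the repo): format an uptime given in
# seconds as whole days, hours and minutes using POSIX shell arithmetic.
fmt_uptime() {
  s=$1
  printf '%dd %dh %dm\n' \
    "$((s / 86400))" \
    "$((s % 86400 / 3600))" \
    "$((s % 3600 / 60))"
}

fmt_uptime 504566   # uptime before the Mar 4 crash
fmt_uptime 95994    # uptime before the first Mar 7 crash
```

The two calls print `5d 20h 9m` and `1d 2h 39m`, which agree with the rounded figures in the crash summary table.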