From c8cadd09c54bc42639895f0f2ccecd69f4a2712d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sun, 22 Feb 2026 18:52:34 +0100
Subject: [PATCH] pn51: document diagnostic config (rasdaemon, NMI watchdog, panic)

Co-Authored-By: Claude Opus 4.6
---
 docs/plans/pn51-stability.md | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/docs/plans/pn51-stability.md b/docs/plans/pn51-stability.md
index 8a55c05..5e59685 100644
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -86,6 +86,11 @@ Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to
 - Survived the 1h stress test earlier but froze at idle later — not thermal
 - pn01 remains stable throughout
 - **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output but managed via SSH.
+- **Action**: Added diagnostic/recovery config to pn02:
+  - `panic=10` + `nmi_watchdog=1` kernel params — auto-reboot after 10s on panic
+  - `softlockup_panic` + `hardlockup_panic` sysctls — convert lockups to panics with stack traces
+  - `hardware.rasdaemon` with recording — logs hardware errors (MCE, PCIe AER, memory) to sqlite database, survives reboots
+  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
 
 ## Benign Kernel Errors (Both Units)
 
@@ -111,14 +116,23 @@ These appear on both units and can be ignored:
 - Once stable: add second RAM stick back to pn02, reinstall with NVMe
 - Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
 
-## Fallback: Auto-Recovery from Freezes
+## Diagnostics and Auto-Recovery (pn02)
 
-If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
+Currently deployed on pn02:
 
 ```nix
 boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
 boot.kernel.sysctl."kernel.softlockup_panic" = 1;
 boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+hardware.rasdaemon.enable = true;
+hardware.rasdaemon.record = true;
 ```
 
-This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.
+**On next freeze, one of two things happens:**
+1. **NMI watchdog catches it** -> kernel panic with stack trace in logs -> auto-reboot after 10s -> we get diagnostic info
+2. **Hard lockup below NMI level** -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> confirms board-level defect
+
+**After reboot, check:**
+- `ras-mc-ctl --summary` — overview of hardware errors
+- `ras-mc-ctl --errors` — detailed error list
+- `journalctl -b -1 -p err` — kernel logs from crashed boot (if panic was logged)