pn51: document diagnostic config (rasdaemon, NMI watchdog, panic)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:52:34 +01:00
parent 72acaa872b
commit c8cadd09c5
1 changed files with 17 additions and 3 deletions
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -86,6 +86,11 @@ Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to
 - Survived the 1h stress test earlier but froze at idle later — not thermal
 - pn01 remains stable throughout
 - **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output but managed via SSH.
+- **Action**: Added diagnostic/recovery config to pn02:
+  - `panic=10` + `nmi_watchdog=1` kernel params — auto-reboot after 10s on panic
+  - `softlockup_panic` + `hardlockup_panic` sysctls — convert lockups to panics with stack traces
+  - `hardware.rasdaemon` with recording — logs hardware errors (MCE, PCIe AER, memory) to sqlite database, survives reboots
+  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`

 ## Benign Kernel Errors (Both Units)

@@ -111,14 +116,23 @@ These appear on both units and can be ignored:
 - Once stable: add second RAM stick back to pn02, reinstall with NVMe
 - Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)

-## Fallback: Auto-Recovery from Freezes
+## Diagnostics and Auto-Recovery (pn02)

-If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
+Currently deployed on pn02:

 ```nix
 boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
 boot.kernel.sysctl."kernel.softlockup_panic" = 1;
 boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+hardware.rasdaemon.enable = true;
+hardware.rasdaemon.record = true;
 ```

-This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.
+**On next freeze, one of two things happens:**
+1. **Kernel lockup detector (NMI watchdog) catches it** -> kernel panic with stack trace -> auto-reboot after 10s -> we get diagnostic info if the trace was logged
+2. **Hard lockup below NMI level** -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> points to a board-level defect rather than a kernel/software fault
+
+**After reboot, check:**
+- `ras-mc-ctl --summary` — overview of hardware errors
+- `ras-mc-ctl --errors` — detailed error list
+- `journalctl -b -1 -p err`: error-priority logs from the crashed boot (only if the panic reached the journal; add `-k` to limit output to kernel messages)
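
The post-reboot checks listed above could be bundled into one helper so nothing is forgotten after a freeze. This is a minimal sketch, not part of the commit: the function name `collect_diagnostics` is hypothetical, and it assumes `ras-mc-ctl` and `journalctl` are on `PATH` on pn02 (it skips either one if missing).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): gather post-freeze diagnostics.
set -u

collect_diagnostics() {
    # Hardware errors recorded by rasdaemon (overview, then detailed list).
    if command -v ras-mc-ctl >/dev/null 2>&1; then
        echo "== rasdaemon summary =="
        ras-mc-ctl --summary || echo "ras-mc-ctl --summary failed"
        echo "== rasdaemon errors =="
        ras-mc-ctl --errors || echo "ras-mc-ctl --errors failed"
    else
        echo "ras-mc-ctl not found; skipping rasdaemon checks"
    fi

    # Error-priority messages from the previous (crashed) boot.
    if command -v journalctl >/dev/null 2>&1; then
        echo "== journal: previous boot, priority err and above =="
        journalctl -b -1 -p err --no-pager || echo "no journal available for previous boot"
    else
        echo "journalctl not found; skipping journal check"
    fi
}

collect_diagnostics
```

Run it as root right after pn02 comes back up. An empty journal section together with a roughly 10-minute outage would be consistent with case 2 (TCO watchdog reset) rather than a logged panic.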