From c8cadd09c54bc42639895f0f2ccecd69f4a2712d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= <torjus@usit.uio.no>
Date: Sun, 22 Feb 2026 18:52:34 +0100
Subject: [PATCH] pn51: document diagnostic config (rasdaemon, NMI watchdog,
 panic)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 docs/plans/pn51-stability.md | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/docs/plans/pn51-stability.md b/docs/plans/pn51-stability.md
index 8a55c05..5e59685 100644
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -86,6 +86,11 @@ Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to
 - Survived the 1h stress test earlier but froze at idle later — not thermal
 - pn01 remains stable throughout
 - **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output but managed via SSH.
+- **Action**: Added diagnostic/recovery config to pn02:
+  - `panic=10` + `nmi_watchdog=1` kernel params: reboot 10s after a panic; enable the NMI hard-lockup detector
+  - `softlockup_panic` + `hardlockup_panic` sysctls: convert soft and hard lockups into panics with stack traces
+  - `hardware.rasdaemon` with recording: logs hardware errors (MCE, PCIe AER, memory) to an SQLite database that persists across reboots
+  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
 
 ## Benign Kernel Errors (Both Units)
 
@@ -111,14 +116,23 @@ These appear on both units and can be ignored:
 - Once stable: add second RAM stick back to pn02, reinstall with NVMe
 - Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
 
-## Fallback: Auto-Recovery from Freezes
+## Diagnostics and Auto-Recovery (pn02)
 
-If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
+Currently deployed on pn02:
 
 ```nix
 boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
 boot.kernel.sysctl."kernel.softlockup_panic" = 1;
 boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+hardware.rasdaemon.enable = true;
+hardware.rasdaemon.record = true;
 ```
 
-This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.
+**On next freeze, one of two things happens:**
+1. **NMI watchdog catches it** -> kernel panic with stack trace in logs -> auto-reboot after 10s -> we get diagnostic info
+2. **Hard lockup below NMI level** -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> points to a board-level defect
+
+**After reboot, check:**
+- `ras-mc-ctl --summary`: overview of recorded hardware errors
+- `ras-mc-ctl --errors`: detailed error list
+- `journalctl -b -1 -p err`: error-priority journal entries from the crashed boot (only if the panic was written to disk)