From a7c1ce932d81e1e23ad1f2e3acf2bd5ec0c2d4c8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= <torjus@usit.uio.no>
Date: Sun, 22 Feb 2026 18:38:17 +0100
Subject: [PATCH] pn51: add remaining debug steps and auto-recovery fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 docs/plans/pn51-stability.md | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/docs/plans/pn51-stability.md b/docs/plans/pn51-stability.md
index 132a4fb..8a55c05 100644
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -101,7 +101,24 @@ These appear on both units and can be ignored:
 ## Next Steps
 
 - Monitor pn02 with amdgpu blacklisted — if stable, try the less impactful `amdgpu.runpm=0 amdgpu.dpm=0` kernel params instead
-- If pn02 still freezes without amdgpu, likely a hardware defect on this unit
+- If pn02 still freezes without amdgpu:
+  - Swap RAM sticks between units to check whether the instability follows the RAM
+  - Run memtest86 for 24h+ on each stick individually
+  - Try a different kernel (LTS 6.6 or latest unstable)
+  - Try disabling SMT (`nosmt` kernel param)
+- If none of these steps help, the likely cause is a hardware defect on pn02's board
 - pn01 continues to be stable — keep monitoring
 - Once stable: add second RAM stick back to pn02, reinstall with NVMe
 - Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
+
+## Fallback: Auto-Recovery from Freezes
+
+If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
+
+```nix
+boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
+boot.kernel.sysctl."kernel.softlockup_panic" = 1;
+boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+```
+
+With these settings, a detected hard or soft lockup triggers a kernel panic, and `panic=10` reboots the machine 10 seconds later. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. This doesn't fix the root cause, but it avoids requiring physical intervention.