pn51: add remaining debug steps and auto-recovery fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:38:17 +01:00
parent 2b42145d94
commit a7c1ce932d
1 changed file with 18 additions and 1 deletion
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -101,7 +101,24 @@ These appear on both units and can be ignored:
 ## Next Steps

 - Monitor pn02 with amdgpu blacklisted — if stable, try the less impactful `amdgpu.runpm=0 amdgpu.dpm=0` kernel params instead
- If pn02 still freezes without amdgpu, likely a hardware defect on this unit
+- If pn02 still freezes without amdgpu:
+  - Swap RAM sticks between units to check if instability follows the RAM
+  - Run memtest86 for 24h+ on each stick individually
+  - Try a different kernel (LTS 6.6 or latest unstable)
+  - Try disabling SMT (`nosmt` kernel param)
+- If none of these steps help, the likely cause is a hardware defect on pn02's board
 - pn01 continues to be stable — keep monitoring
 - Once stable: add second RAM stick back to pn02, reinstall with NVMe
 - Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
+
+## Fallback: Auto-Recovery from Freezes
+
+If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
+
+```nix
+boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
+boot.kernel.sysctl."kernel.softlockup_panic" = 1;
+boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+```
+
+With these settings, a detected hard or soft lockup triggers a kernel panic, and the kernel reboots automatically 10 seconds later. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. This doesn't fix the root cause, but it avoids requiring physical intervention.