From a7c1ce932d81e1e23ad1f2e3acf2bd5ec0c2d4c8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sun, 22 Feb 2026 18:38:17 +0100
Subject: [PATCH] pn51: add remaining debug steps and auto-recovery fallback

Co-Authored-By: Claude Opus 4.6
---
 docs/plans/pn51-stability.md | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/docs/plans/pn51-stability.md b/docs/plans/pn51-stability.md
index 132a4fb..8a55c05 100644
--- a/docs/plans/pn51-stability.md
+++ b/docs/plans/pn51-stability.md
@@ -101,7 +101,24 @@ These appear on both units and can be ignored:
 ## Next Steps
 
 - Monitor pn02 with amdgpu blacklisted — if stable, try the less impactful `amdgpu.runpm=0 amdgpu.dpm=0` kernel params instead
-- If pn02 still freezes without amdgpu, likely a hardware defect on this unit
+- If pn02 still freezes without amdgpu:
+  - Swap RAM sticks between units to check if instability follows the RAM
+  - Run memtest86 for 24h+ on each stick individually
+  - Try a different kernel (LTS 6.6 or latest unstable)
+  - Try disabling SMT (`nosmt` kernel param)
+- If nothing helps, likely a hardware defect on pn02's board
 - pn01 continues to be stable — keep monitoring
 - Once stable: add second RAM stick back to pn02, reinstall with NVMe
 - Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
+
+## Fallback: Auto-Recovery from Freezes
+
+If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
+
+```nix
+boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
+boot.kernel.sysctl."kernel.softlockup_panic" = 1;
+boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+```
+
+This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.