pn51: add remaining debug steps and auto-recovery fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:38:17 +01:00
parent 2b42145d94
commit a7c1ce932d

@@ -101,7 +101,24 @@ These appear on both units and can be ignored:
## Next Steps
- Monitor pn02 with amdgpu blacklisted — if stable, try the less impactful `amdgpu.runpm=0 amdgpu.dpm=0` kernel params instead
- If pn02 still freezes without amdgpu, likely a hardware defect on this unit
- If pn02 still freezes without amdgpu:
  - Swap RAM sticks between units to check if instability follows the RAM
  - Run memtest86 for 24h+ on each stick individually
  - Try a different kernel (LTS 6.6 or latest unstable)
  - Try disabling SMT (`nosmt` kernel param)
  - If nothing helps, likely a hardware defect on pn02's board
- pn01 continues to be stable — keep monitoring
- Once stable: add second RAM stick back to pn02, reinstall with NVMe
- Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
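
The kernel-level steps above could be expressed in pn02's NixOS configuration roughly as follows. This is a sketch, not the actual host config: it assumes the current blacklist is done via `boot.blacklistedKernelModules`, and that `pkgs.linuxPackages_6_6` and `pkgs.linuxPackages_latest` are available in the pinned nixpkgs (the latter may not match what "latest unstable" means here). Only one experiment should be uncommented at a time.

```nix
# Sketch only: enable one experiment at a time so a change in stability
# can be attributed to a single setting.
{ pkgs, ... }:
{
  # Current test: amdgpu blacklisted (assumed to be configured like this).
  boot.blacklistedKernelModules = [ "amdgpu" ];

  # If stable: drop the blacklist above and try the less impactful params.
  # boot.kernelParams = [ "amdgpu.runpm=0" "amdgpu.dpm=0" ];

  # If freezes persist: disable SMT.
  # boot.kernelParams = [ "nosmt" ];

  # If freezes persist: try a different kernel.
  # boot.kernelPackages = pkgs.linuxPackages_6_6;    # LTS 6.6
  # boot.kernelPackages = pkgs.linuxPackages_latest; # newest kernel in the pinned nixpkgs
}
```

Keeping the alternatives commented out in one place makes it easier to record which combination was running when a freeze occurred.
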
## Fallback: Auto-Recovery from Freezes
If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
```nix
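# panic=10: reboot 10 seconds after a kernel panic; nmi_watchdog=1: enable the hard-lockup detector
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
# Turn detected soft lockups into a panic instead of only logging a warning
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
# Turn detected hard lockups into a panic as well
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;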
```
With these settings, a detected hard or soft lockup triggers a kernel panic, and the kernel reboots automatically 10 seconds later. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net for freezes the kernel detectors miss. This doesn't fix the root cause, but it avoids requiring physical intervention.
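
For a rough idea of the timing, the lockup detectors are governed by `kernel.watchdog_thresh`. Assuming pn02 uses the kernel default of 10 (not verified on the unit), a hard lockup is detected after about 10 seconds and a soft lockup after about 20 seconds (twice the threshold), so with `panic=10` the reboot would start roughly 20 to 30 seconds after the freeze begins. A sketch that only writes the default out explicitly:

```nix
{
  # Kernel default, stated explicitly only to make the timing visible.
  # Hard lockup: detected after ~watchdog_thresh seconds.
  # Soft lockup: detected after ~2 * watchdog_thresh seconds.
  boot.kernel.sysctl."kernel.watchdog_thresh" = 10;
}
```

This only applies to lockups the kernel can still detect; a freeze that also stops the detectors would fall through to the 10min hardware watchdog.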