pn51: add remaining debug steps and auto-recovery fallback
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m4s
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m4s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -101,7 +101,24 @@ These appear on both units and can be ignored:
|
||||
## Next Steps
|
||||
|
||||
- Monitor pn02 with amdgpu blacklisted — if stable, try the less impactful `amdgpu.runpm=0 amdgpu.dpm=0` kernel params instead
|
||||
- If pn02 still freezes without amdgpu, likely a hardware defect on this unit
|
||||
- If pn02 still freezes without amdgpu:
|
||||
- Swap RAM sticks between units to check if instability follows the RAM
|
||||
- Run memtest86 for 24h+ on each stick individually
|
||||
- Try a different kernel (LTS 6.6 or latest unstable)
|
||||
- Try disabling SMT (`nosmt` kernel param)
|
||||
- If nothing helps, likely a hardware defect on pn02's board
|
||||
- pn01 continues to be stable — keep monitoring
|
||||
- Once stable: add second RAM stick back to pn02, reinstall with NVMe
|
||||
- Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
|
||||
|
||||
## Fallback: Auto-Recovery from Freezes
|
||||
|
||||
If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
|
||||
|
||||
```nix
|
||||
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
|
||||
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
|
||||
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
|
||||
```
|
||||
|
||||
This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.
|
||||
|
||||
Reference in New Issue
Block a user