docs: update pn51 stability with third freeze and conclusion
pn02 crashed again after ~2d21h uptime despite all mitigations (amdgpu blacklist, max_cstate=1, NMI watchdog, rasdaemon). NMI watchdog didn't fire and rasdaemon recorded nothing, confirming hard lockup below NMI level. Unit is unreliable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -103,25 +103,57 @@ These appear on both units and can be ignored:
 - `snd_hda_intel: no codecs found`: no audio device connected, headless server
 - `ata2.00: supports DRM functions and may not be fully accessible`: Samsung SSD DRM quirk (pn02 only)
 
+### 2026-02-23: processor.max_cstate=1 and Proxmox Forums
+
+- Found a thread on the Proxmox forums about PN51 units with similar freeze issues
+- Many users report identical symptoms: random hard freezes with no log evidence
+- No conclusive fix. Some users see frequent freezes, others only a few per month
+- Some reported that BIOS updates helped, but results were inconsistent
+- Added the `processor.max_cstate=1` kernel parameter to pn02. It limits the CPU to the C1 halt state, preventing the deep C-state sleep transitions that may trigger freezes on AMD mobile chips
+- Also applied: amdgpu blacklist, `panic=10`, `nmi_watchdog=1`, softlockup/hardlockup panic, rasdaemon
+
+### 2026-02-23: logind D-Bus Deadlock (pn02)
+
+- node-exporter alert fired, but the host was NOT frozen
+- logind was running (PID 871) but deadlocked on D-Bus and did not respond to `org.freedesktop.login1` requests
+- Every node-exporter scrape blocked for 25s waiting for logind, causing scrape timeouts
+- Likely related to the amdgpu blacklist: with no DRM device there is no graphical seat, and logind may have deadlocked during seat enumeration at boot
+- Fix: `systemctl restart systemd-logind` followed by `systemctl restart prometheus-node-exporter`
+- After the restart, logind responded normally and reported seat0
+
+### 2026-02-27: pn02 Third Freeze
+
+- pn02 crashed again after ~2 days 21 hours of uptime (longest run so far)
+- Evidence of crash:
+  - Journal file corrupted: `system.journal corrupted or uncleanly shut down`
+  - Boot partition fsck: `Dirty bit is set. Fs was not properly unmounted`
+  - No orderly shutdown logs from previous boot
+  - No auto-upgrade triggered
+- **NMI watchdog did NOT fire**: no kernel panic was logged. This is a true hard lockup below NMI level
+- **rasdaemon recorded nothing**: no MCE, AER, or memory errors in the sqlite database
+- **Positive**: The system auto-rebooted this time (likely hardware watchdog), unlike previous freezes that required a manual power cycle
+- `processor.max_cstate=1` may have extended uptime (2d21h vs previous 1h and 5.5h) but did not prevent the freeze
+
+## Conclusion
+
+**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still hard-freezes every few days. The freeze occurs below NMI level with zero diagnostic output, pointing to a board-level hardware defect. The unit is not suitable for hypervisor use or any workload requiring reliability.
+
+**pn01 remains stable** so far but has historically crashed as well, just less frequently. Continuing to monitor.
+
 ## Next Steps
 
-- Monitor pn02 with amdgpu blacklisted. If stable, try the less impactful `amdgpu.runpm=0 amdgpu.dpm=0` kernel params instead
-- If pn02 still freezes without amdgpu:
-  - Swap RAM sticks between units to check if instability follows the RAM
-  - Run memtest86 for 24h+ on each stick individually
-  - Try a different kernel (LTS 6.6 or latest unstable)
-  - Try disabling SMT (`nosmt` kernel param)
-  - If nothing helps, likely a hardware defect on pn02's board
-- pn01 continues to be stable; keep monitoring
-- Once stable: add second RAM stick back to pn02, reinstall with NVMe
-- Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
+- **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working)
+- **pn01**: Continue monitoring. If it remains stable long-term, it may still be viable for light workloads
+- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
+- For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: a mixed Intel/AMD cluster complicates live migration
 
 ## Diagnostics and Auto-Recovery (pn02)
 
 Currently deployed on pn02:
 
 ```nix
-boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
+boot.blacklistedKernelModules = [ "amdgpu" ];
+boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
 boot.kernel.sysctl."kernel.softlockup_panic" = 1;
 boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
 hardware.rasdaemon.enable = true;