# ASUS PN51 Stability Testing

## Overview

Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) were purchased years ago but shelved due to stability issues. Revisiting them as potential additions to the homelab.
## Hardware

| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
|---|---|---|
| CPU | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
| RAM | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
| Storage | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
| BIOS | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |
## Original Issues

- pn01: Boots, but freezes randomly after some uptime. No console errors, completely unresponsive. memtest86 passed.
- pn02: Had trouble booting: it would start loading the kernel from the installer USB, then instantly reboot. When it did boot, it would also freeze randomly.
## Debugging Steps

### 2026-02-21: Initial Setup

- Disabled fTPM (labeled "Security Device" in the ASUS BIOS) on both units
  - AMD Ryzen 5000 series has a known fTPM bug causing random hard freezes with no console output
- Both units booted the NixOS installer successfully after this change
- Installed NixOS on both, added to the repo as `pn01` and `pn02` on VLAN 12
- Configured monitoring (node-exporter, promtail, nixos-exporter)
### 2026-02-21: pn02 First Freeze

- pn02 froze approximately 1 hour after boot
- All three Prometheus targets went down simultaneously: a hard freeze, not a graceful shutdown
- Journal on next boot: `system.journal corrupted or uncleanly shut down`
- Kernel warnings from the boot log before the freeze:
  - TSC clocksource unstable: `Marking clocksource 'tsc' as unstable because the skew is too large` (TSC skewing ~3.8ms over 500ms relative to the HPET watchdog)
  - AMD PSP error: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` (Platform Security Processor failing to load a trusted application)
- pn01 did not show these warnings on this particular boot, but has shown them historically (see below)
### 2026-02-21: pn02 BIOS Update

- Updated the pn02 BIOS to the latest version from the ASUS website
- TSC still unstable after the update: same ~3.8ms skew
- PSP LOAD_TA still failing after the update
- Monitoring is back up; letting it run to see whether the freeze recurs
### 2026-02-22: TSC/PSP Confirmed on Both Units

- Checked kernel logs after ~9 hours of uptime; both units still running
- pn01 now shows the TSC-unstable warning and the PSP LOAD_TA failure on this boot (same ~3.8ms TSC skew, same PSP error)
- pn01 had these same issues when tested years ago; the earlier clean boot was just lucky TSC calibration timing
- Conclusion: TSC instability and the PSP LOAD_TA failure are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
- The kernel handles TSC instability gracefully (falls back to HPET), and the PSP LOAD_TA failure is non-fatal
- Neither issue is likely the cause of the hard freezes; the fTPM bug remains the primary suspect
### 2026-02-22: Stress Test (1 hour)

- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
- Both survived the full hour with no freezes, no MCE errors, no kernel issues
- No concerning log entries during or after the test
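For future runs, temperatures can be sampled alongside stress-ng straight from sysfs. A minimal sketch; the helper name `read_cpu_temp` is mine, and it assumes `thermal_zone0` is the CPU package on this board (check each zone's `type` file first):

```shell
# Read the CPU temperature from sysfs, in millidegrees Celsius.
# Assumption: thermal_zone0 is the CPU package -- verify via its "type" file.
read_cpu_temp() {
  cat /sys/class/thermal/thermal_zone0/temp 2>/dev/null || echo "unavailable"
}

# Sample once a minute while the stress test runs:
# while sleep 60; do echo "$(date -Is) temp=$(read_cpu_temp)"; done
```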
### 2026-02-22: TSC Runtime Switch Test

- Attempted to switch the clocksource back to TSC at runtime on pn01: `echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource`
- The kernel watchdog immediately reverted to HPET: the TSC skew is ongoing, not just a boot-time issue
- Conclusion: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
- For virtualization (Incus), this means guest VMs will use HPET-backed timing. The performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.
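The easiest way to confirm which clocksource is actually in effect (and that the watchdog's revert stuck) is to read it back from sysfs. A sketch; the wrapper name `current_clocksource` is mine:

```shell
# Report the clocksource in use; on the PN51 this should read "hpet" once the
# tsc watchdog has fired. Falls back to "unknown" if the sysfs node is absent.
current_clocksource() {
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource 2>/dev/null \
    || echo "unknown"
}
```

The sibling file `available_clocksource` in the same directory lists the alternatives the kernel will accept.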
### 2026-02-22: BIOS Tweaks (Both Units)

- Disabled ErP Ready on both (an EU power-efficiency mode that aggressively cuts power at idle)
- Disabled WiFi and Bluetooth in the BIOS on both
- TSC still unstable after these changes: same ~3.8ms skew on both units
- ErP/power states are not the cause of the TSC issue
### 2026-02-22: pn02 Second Freeze

- pn02 froze again ~5.5 hours after boot (at idle, not under load)
- All Prometheus targets down simultaneously: the same hard-freeze pattern
- The last log entry was normal nix-daemon activity; zero warning/error logs before the crash
- Survived the 1h stress test earlier but froze at idle later, so it is not thermal
- pn01 remains stable throughout
- Action: Blacklisted the `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output, but the unit is managed via SSH.
- Action: Added diagnostic/recovery config to pn02:
  - `panic=10` + `nmi_watchdog=1` kernel params: auto-reboot after 10s on panic
  - `softlockup_panic` + `hardlockup_panic` sysctls: convert lockups to panics with stack traces
  - `hardware.rasdaemon` with recording: logs hardware errors (MCE, PCIe AER, memory) to a sqlite database that survives reboots
  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
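After the rebuild it is worth verifying that the lockup-to-panic sysctls actually took effect. A small sketch; the helper name `check_sysctl` is mine:

```shell
# Compare a sysctl's live value against the expected one; prints "ok" or "MISMATCH".
check_sysctl() {
  actual=$(sysctl -n "$1" 2>/dev/null || echo "unset")
  if [ "$actual" = "$2" ]; then
    echo "$1 ok"
  else
    echo "$1 MISMATCH (got $actual, want $2)"
  fi
}

# On pn02, both should report ok after the rebuild:
# check_sysctl kernel.softlockup_panic 1
# check_sysctl kernel.hardlockup_panic 1
```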
### Benign Kernel Errors (Both Units)

These appear on both units and can be ignored:

- `clocksource: Marking clocksource 'tsc' as unstable`: TSC skew vs HPET; the kernel falls back gracefully. Platform-level quirk on the PN51-E1, not always reproducible on every boot.
- `psp gfx command LOAD_TA(0x1) failed`: AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95`: AMD Sensor Fusion Hub, no sensors connected
- `Bluetooth: hci0: Reading supported features failed`: Bluetooth init quirk
- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO`: unused serial bus device
- `snd_hda_intel: no codecs found`: no audio device connected, headless server
- `ata2.00: supports DRM functions and may not be fully accessible`: Samsung SSD DRM quirk (pn02 only)
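The benign list is now long enough to bury real problems when scanning kernel logs, so a filter helps. A sketch; the function name `filter_benign` is mine, and the patterns are taken from the list above:

```shell
# Drop known-benign PN51 kernel messages from stdin, leaving only lines worth a look.
filter_benign() {
  grep -vE "Marking clocksource 'tsc' as unstable|psp gfx command LOAD_TA|amd_sfh_hid_client_init failed|Reading supported features failed|INT3515|no codecs found|supports DRM functions"
}

# Usage: journalctl -k -b -p warning | filter_benign
```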
### 2026-02-23: processor.max_cstate=1 and Proxmox Forums

- Found a thread on the Proxmox forums about PN51 units with similar freeze issues
- Many users report identical symptoms: random hard freezes with no log evidence
- No conclusive fix. Some have frequent freezes, others only a few times a month
- Some reported that BIOS updates helped, but results were inconsistent
- Added the `processor.max_cstate=1` kernel parameter to pn02, which limits the CPU to the C1 halt state, preventing the deep C-state sleep transitions that may trigger freezes on AMD mobile chips
- Also applied: amdgpu blacklist, panic=10, nmi_watchdog=1, softlockup/hardlockup panic, rasdaemon
### 2026-02-23: logind D-Bus Deadlock (pn02)

- A node-exporter alert fired, but the host was NOT frozen
- logind was running (PID 871) but deadlocked on D-Bus, not responding to `org.freedesktop.login1` requests
- Every node-exporter scrape blocked for 25s waiting for logind, causing scrape timeouts
- Likely related to the amdgpu blacklist: with no DRM device there is no graphical seat, and logind may have deadlocked during seat enumeration at boot
- Fix: `systemctl restart systemd-logind` + `systemctl restart prometheus-node-exporter`
- After the restart, logind responded normally and reported seat0
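To catch this state before node-exporter does, logind can be probed directly over D-Bus with a short timeout. A sketch, assuming `busctl` (systemd) is available; the wrapper name `probe_logind` is mine:

```shell
# A healthy logind answers ListSeats immediately; the deadlocked one
# blocked every caller for ~25s, which this 5s timeout would flag.
probe_logind() {
  timeout 5 busctl call org.freedesktop.login1 /org/freedesktop/login1 \
    org.freedesktop.login1.Manager ListSeats 2>/dev/null \
    || echo "logind unresponsive -- consider: systemctl restart systemd-logind"
}
```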
### 2026-02-27: pn02 Third Freeze

- pn02 crashed again after ~2 days 21 hours of uptime (the longest run so far)
- Evidence of the crash:
  - Journal file corrupted: `system.journal corrupted or uncleanly shut down`
  - Boot partition fsck: `Dirty bit is set. Fs was not properly unmounted`
  - No orderly shutdown logs from the previous boot
  - No auto-upgrade triggered
- The NMI watchdog did NOT fire and no kernel panic was logged: this is a true hard lockup below NMI level
- rasdaemon recorded nothing: no MCE, AER, or memory errors in the sqlite database
- Positive: the system auto-rebooted this time (likely the hardware watchdog), unlike previous freezes that required a manual power cycle
- `processor.max_cstate=1` may have extended the uptime (2d21h vs the previous 1h and 5.5h) but did not prevent the freeze
## Conclusion

pn02 is unreliable. After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still hard-freezes every few days. The freezes occur below NMI level with zero diagnostic output, pointing to a board-level hardware defect. The unit is not suitable for hypervisor use or any workload requiring reliability.

pn01 remains stable so far but has historically crashed as well, just less frequently. Continuing to monitor.
## Next Steps

- pn02: Consider scrapping it, or repurposing it for non-critical workloads that tolerate random reboots (auto-recovery via the hardware watchdog is now working)
- pn01: Continue monitoring. If it remains stable long-term, it may still be viable for light workloads
- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see whether they help
- For the Incus hypervisor plan: likely need different hardware. Evaluating the GMKtec G3 (Intel) as an alternative. Note: a mixed Intel/AMD cluster complicates live migration
## Diagnostics and Auto-Recovery (pn02)

Currently deployed on pn02:

```nix
boot.blacklistedKernelModules = [ "amdgpu" ];
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
```
On the next freeze, one of two things happens:

- The NMI watchdog catches it -> kernel panic with a stack trace in the logs -> auto-reboot after 10s -> we get diagnostic info
- Hard lockup below NMI level -> the SP5100 TCO hardware watchdog (10min timeout) reboots it -> confirms a board-level defect
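The hardware-watchdog leg of this can be confirmed from userspace via sysfs. A sketch; the wrapper name `watchdog_identity` is mine, and on this board the identity is expected to mention the SP5100 TCO driver:

```shell
# Print the active hardware watchdog driver's identity, or note that none is registered.
watchdog_identity() {
  cat /sys/class/watchdog/watchdog0/identity 2>/dev/null || echo "no watchdog device"
}
```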
After reboot, check:

- `ras-mc-ctl --summary`: overview of hardware errors
- `ras-mc-ctl --errors`: detailed error list
- `journalctl -b -1 -p err`: kernel logs from the crashed boot (if a panic was logged)