Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ASUS PN51 Stability Testing
Overview
Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to stability issues. Revisiting them to potentially add to the homelab.
Hardware
| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
|---|---|---|
| CPU | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
| RAM | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
| Storage | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
| BIOS | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |
Original Issues
- pn01: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
- pn02: Had trouble booting — would start loading kernel from installer USB then instantly reboot. When it did boot, would also freeze randomly.
Debugging Steps
2026-02-21: Initial Setup
- Disabled fTPM (labeled "Security Device" in ASUS BIOS) on both units
- AMD Ryzen 5000 series had a known fTPM bug causing random hard freezes with no console output
- Both units booted the NixOS installer successfully after this change
- Installed NixOS on both, added to repo as `pn01` and `pn02` on VLAN 12
- Configured monitoring (node-exporter, promtail, nixos-exporter)
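One way to confirm the BIOS change took effect is to check whether the kernel still registers a TPM device. This is a minimal sketch, assuming the standard `/sys/class/tpm` sysfs layout; the function name `tpm_status` and its optional directory argument are illustrative and not part of the repo.

```sh
# Report whether the kernel has registered any TPM device.
# The directory argument is a hypothetical hook for testing; it defaults
# to the standard sysfs location.
tpm_status() (
    dir="${1:-/sys/class/tpm}"
    for dev in "$dir"/tpm*; do
        # The glob stays literal when nothing matches, so test existence.
        if [ -e "$dev" ]; then
            echo "tpm present: ${dev##*/}"
            exit 0
        fi
    done
    echo "no tpm device"
)
```

With fTPM disabled, running `tpm_status` on either unit would be expected to print `no tpm device`.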
2026-02-21: pn02 First Freeze
- pn02 froze approximately 1 hour after boot
- All three Prometheus targets went down simultaneously — hard freeze, not graceful shutdown
- Journal on next boot: `system.journal corrupted or uncleanly shut down`
- Kernel warnings from boot log before freeze:
  - TSC clocksource unstable: `Marking clocksource 'tsc' as unstable because the skew is too large` (TSC skewing ~3.8ms over 500ms relative to HPET watchdog)
  - AMD PSP error: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` (Platform Security Processor failing to load trusted application)
- pn01 did not show these warnings on this particular boot, but has shown them historically (see below)
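The log signatures noted above can be counted in one pass, which makes it easier to compare boots and units. This is a sketch only: `scan_boot_log` is a hypothetical helper, and the patterns are taken from the messages quoted in this section.

```sh
# Count the three signatures from this section in journal text on stdin.
scan_boot_log() {
    awk '
        /Marking clocksource .tsc. as unstable/   { tsc++ }
        /psp gfx command LOAD_TA\(0x1\) failed/   { psp++ }
        /corrupted or uncleanly shut down/        { unclean++ }
        END {
            printf "tsc_unstable=%d psp_load_ta_failed=%d unclean_journal=%d\n",
                   tsc, psp, unclean
        }
    '
}
```

Typical use would be `journalctl -b | scan_boot_log` on each unit after boot. The full journal is used rather than `journalctl -k` because the unclean-shutdown message comes from journald, not the kernel.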
2026-02-21: pn02 BIOS Update
- Updated pn02 BIOS to latest version from ASUS website
- TSC still unstable after BIOS update — same ~3.8ms skew
- PSP LOAD_TA still failing after BIOS update
- Monitoring back up, letting it run to see if freeze recurs
2026-02-22: TSC/PSP Confirmed on Both Units
- Checked kernel logs after ~9 hours uptime — both units still running
- pn01 now shows TSC unstable and PSP LOAD_TA failure on this boot (same ~3.8ms TSC skew, same PSP error)
- pn01 had these same issues when tested years ago, so the earlier clean boot was most likely just lucky TSC calibration timing
- Conclusion: TSC instability and PSP LOAD_TA are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
- The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
- Neither issue is likely the cause of the hard freezes — the fTPM bug remains the primary suspect
2026-02-22: Stress Test (1 hour)
- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
- Both survived the full hour with no freezes, no MCE errors, no kernel issues
- No concerning log entries during or after the test
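For repeat runs, peak temperature can be captured from sysfs rather than read off by eye. The sketch below assumes a hwmon `temp*_input` file, which reports millidegrees Celsius; the helper names are hypothetical, and the sensor path differs per machine, so it has to be passed in.

```sh
# Print the highest reading, in degrees C, from millidegree values on stdin
# (the unit used by hwmon temp*_input files).
peak_celsius() {
    awk 'NR == 1 || $1 + 0 > max { max = $1 + 0 }
         END { if (NR) printf "%.1f\n", max / 1000 }'
}

# Print COUNT samples of one sensor file, INTERVAL seconds apart.
# The sensor path is machine-specific and must be supplied by the caller.
sample_temp() (
    file="$1"; count="$2"; interval="${3:-5}"
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$file"
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    exit 0
)
```

Run alongside the stress test, for example `sample_temp /sys/class/hwmon/hwmon0/temp1_input 720 5 | peak_celsius` for one hour of 5-second samples. The `hwmon0` index is an assumption; check each `hwmon*/name` file to find the CPU sensor.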
2026-02-22: TSC Runtime Switch Test
- Attempted to switch clocksource back to TSC at runtime on pn01: `echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource`
- Kernel watchdog immediately reverted to HPET, so the TSC skew is ongoing, not just a boot-time issue
- Conclusion: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
- For virtualization (Incus), this means guest VMs will use HPET-backed timing. Performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.
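The active clocksource can be checked without attempting a switch. This sketch reads the two standard sysfs files under `clocksource0`; `clocksource_status` and its optional directory argument are hypothetical. Whether `tsc` still appears in the available list after being marked unstable may depend on the kernel, so treat that line as informational.

```sh
# Show the active clocksource and whether tsc is still offered.
# The directory argument is a hypothetical hook for testing.
clocksource_status() (
    dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    current=$(cat "$dir/current_clocksource")
    available=$(cat "$dir/available_clocksource")
    echo "current=$current"
    # Pad with spaces so "tsc" only matches as a whole word.
    case " $available " in
        *" tsc "*) echo "tsc_available=yes" ;;
        *)         echo "tsc_available=no" ;;
    esac
)
```

On these units the first line would be expected to read `current=hpet` once the watchdog has fallen back.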
2026-02-22: BIOS Tweaks (Both Units)
- Disabled ErP Ready on both (EU energy-efficiency mode; it mainly limits standby power in the S4/S5 soft-off states)
- Disabled WiFi and Bluetooth in BIOS on both
- TSC still unstable after these changes — same ~3.8ms skew on both units
- ErP/power states are not the cause of the TSC issue
2026-02-22: pn02 Second Freeze
- pn02 froze again ~5.5 hours after boot (at idle, not under load)
- All Prometheus targets down simultaneously — same hard freeze pattern
- Last log entry was normal nix-daemon activity — zero warning/error logs before crash
- Survived the 1h stress test earlier but froze at idle later — not thermal
- pn01 remains stable throughout
- Action: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. This leaves pn02 without console output, but it is managed via SSH.
- Action: Added diagnostic/recovery config to pn02:
  - `panic=10` + `nmi_watchdog=1` kernel params: auto-reboot after 10s on panic
  - `softlockup_panic` + `hardlockup_panic` sysctls: convert lockups to panics with stack traces
  - `hardware.rasdaemon` with recording: logs hardware errors (MCE, PCIe AER, memory) to sqlite database, survives reboots
  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
Benign Kernel Errors (Both Units)
These appear on both units and can be ignored:
- `clocksource: Marking clocksource 'tsc' as unstable`: TSC skew vs HPET, kernel falls back gracefully. Platform-level quirk on PN51-E1 that does not appear on every boot.
- `psp gfx command LOAD_TA(0x1) failed`: AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95`: AMD Sensor Fusion Hub, no sensors connected
- `Bluetooth: hci0: Reading supported features failed`: Bluetooth init quirk
- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO`: unused serial bus device
- `snd_hda_intel: no codecs found`: no audio device connected, headless server
- `ata2.00: supports DRM functions and may not be fully accessible`: Samsung SSD DRM quirk (pn02 only)
Next Steps
- Monitor pn02 with amdgpu blacklisted. If stable, try the less impactful `amdgpu.runpm=0 amdgpu.dpm=0` kernel params instead
- If pn02 still freezes without amdgpu:
  - Swap RAM sticks between units to check if instability follows the RAM
  - Run memtest86 for 24h+ on each stick individually
  - Try a different kernel (LTS 6.6 or latest unstable)
  - Try disabling SMT (`nosmt` kernel param)
- If nothing helps, likely a hardware defect on pn02's board
- pn01 continues to be stable; keep monitoring
- Once stable: add second RAM stick back to pn02, reinstall with NVMe
- Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
Diagnostics and Auto-Recovery (pn02)
Currently deployed on pn02:
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
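After deploying, it may be worth confirming that the kernel parameters actually reached the running kernel. The helper below is a sketch; `has_kernel_param` is a hypothetical name, and its optional second argument exists only so the check can be exercised against a sample string instead of `/proc/cmdline`.

```sh
# Succeed if PARAM appears as a whole word in the kernel command line.
# CMDLINE defaults to the running kernel's /proc/cmdline.
has_kernel_param() (
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    # Pad with spaces so panic=1 does not match panic=10.
    case " $cmdline " in
        *" $param "*) exit 0 ;;
        *)            exit 1 ;;
    esac
)
```

For example, `has_kernel_param panic=10 && has_kernel_param nmi_watchdog=1 && echo ok` on pn02. The sysctls can be read back with `sysctl kernel.softlockup_panic kernel.hardlockup_panic`.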
On next freeze, one of two things happens:
- NMI watchdog catches it -> kernel panic with stack trace in logs -> auto-reboot after 10s -> we get diagnostic info
- Hard lockup below NMI level -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> confirms board-level defect
After reboot, check:
- `ras-mc-ctl --summary`: overview of hardware errors
- `ras-mc-ctl --errors`: detailed error list
- `journalctl -b -1 -p err`: kernel logs from crashed boot (if panic was logged)
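The two outcomes described above can be told apart roughly by scanning the previous boot's kernel log for lockup or panic messages. This is a sketch under assumptions: `classify_crash` is a hypothetical helper, and the patterns are substrings of the usual kernel watchdog and panic messages, not an exhaustive list.

```sh
# Classify kernel log text from the previous boot (read from stdin).
classify_crash() {
    awk '
        /soft lockup/            { kind = "soft lockup" }
        /hard LOCKUP/            { kind = "hard lockup" }
        /Kernel panic/ && !kind  { kind = "panic" }
        END {
            if (kind) print "caught by kernel: " kind
            else      print "no trace logged"
        }
    '
}
```

Usage would be `journalctl -k -b -1 | classify_crash`. A result of `no trace logged` is consistent with a freeze below NMI level, but it is not proof: a panic trace may never reach the on-disk journal, since userspace stops running at the moment of the panic.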