# ASUS PN51 Stability Testing

## Overview

Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to stability issues. Revisiting them to potentially add to the homelab.

## Hardware

| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
|---|---|---|
| **CPU** | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
| **RAM** | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
| **Storage** | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
| **BIOS** | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |

## Original Issues

- **pn01**: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
- **pn02**: Had trouble booting: would start loading kernel from installer USB then instantly reboot. When it did boot, would also freeze randomly.

## Debugging Steps

### 2026-02-21: Initial Setup

1. **Disabled fTPM** (labeled "Security Device" in ASUS BIOS) on both units
   - AMD Ryzen 5000 series had a known fTPM bug causing random hard freezes with no console output
   - Both units booted the NixOS installer successfully after this change
2. Installed NixOS on both, added to repo as `pn01` and `pn02` on VLAN 12
3. Configured monitoring (node-exporter, promtail, nixos-exporter)

### 2026-02-21: pn02 First Freeze

- pn02 froze approximately 1 hour after boot
- All three Prometheus targets went down simultaneously: hard freeze, not graceful shutdown
- Journal on next boot: `system.journal corrupted or uncleanly shut down`
- Kernel warnings from boot log before freeze:
  - **TSC clocksource unstable**: `Marking clocksource 'tsc' as unstable because the skew is too large` (TSC skewing ~3.8ms over 500ms relative to HPET watchdog)
  - **AMD PSP error**: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` (Platform Security Processor failing to load trusted application)
- pn01 did not show these warnings on this particular boot, but has shown them historically (see below)

### 2026-02-21: pn02 BIOS Update

- Updated pn02 BIOS to latest version from ASUS website
- **TSC still unstable** after BIOS update, same ~3.8ms skew
- **PSP LOAD_TA still failing** after BIOS update
- Monitoring back up, letting it run to see if freeze recurs

### 2026-02-22: TSC/PSP Confirmed on Both Units

- Checked kernel logs after ~9 hours uptime; both units still running
- **pn01 now shows TSC unstable and PSP LOAD_TA failure** on this boot (same ~3.8ms TSC skew, same PSP error)
- pn01 had these same issues historically when tested years ago; the earlier clean boot was just lucky TSC calibration timing
- **Conclusion**: TSC instability and PSP LOAD_TA are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
- The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
- Neither issue is likely the cause of the hard freezes; the fTPM bug remains the primary suspect

### 2026-02-22: Stress Test (1 hour)

- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
- Both survived the full hour with no freezes, no MCE errors, no kernel issues
- No concerning log entries during or after the test

### 2026-02-22: TSC Runtime Switch Test

- Attempted to switch clocksource back to TSC at runtime on pn01:

  ```
  echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
  ```

- Kernel watchdog immediately reverted to HPET, so the TSC skew is ongoing, not just a boot-time issue
- **Conclusion**: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
- For virtualization (Incus), this means guest VMs will use HPET-backed timing. Performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.

### 2026-02-22: BIOS Tweaks (Both Units)

- Disabled ErP Ready on both (EU power efficiency mode, aggressively cuts power in idle)
- Disabled WiFi and Bluetooth in BIOS on both
- **TSC still unstable** after these changes, same ~3.8ms skew on both units
- ErP/power states are not the cause of the TSC issue

### 2026-02-22: pn02 Second Freeze

- pn02 froze again ~5.5 hours after boot (at idle, not under load)
- All Prometheus targets down simultaneously, same hard freeze pattern
- Last log entry was normal nix-daemon activity; zero warning/error logs before crash
- Survived the 1h stress test earlier but froze at idle later, so not thermal
- pn01 remains stable throughout
- **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output but managed via SSH.

## Benign Kernel Errors (Both Units)

These appear on both units and can be ignored:

- `clocksource: Marking clocksource 'tsc' as unstable`: TSC skew vs HPET, kernel falls back gracefully. Platform-level quirk on PN51-E1, not always reproducible on every boot.
- `psp gfx command LOAD_TA(0x1) failed`: AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95`: AMD Sensor Fusion Hub, no sensors connected
- `Bluetooth: hci0: Reading supported features failed`: Bluetooth init quirk
- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO`: unused serial bus device
- `snd_hda_intel: no codecs found`: no audio device connected, headless server
- `ata2.00: supports DRM functions and may not be fully accessible`: Samsung SSD DRM quirk (pn02 only)

## Next Steps

- Monitor pn02 with amdgpu blacklisted; if stable, try the less impactful `amdgpu.runpm=0 amdgpu.dpm=0` kernel params instead
- If pn02 still freezes without amdgpu:
  - Swap RAM sticks between units to check if instability follows the RAM
  - Run memtest86 for 24h+ on each stick individually
  - Try a different kernel (LTS 6.6 or latest unstable)
  - Try disabling SMT (`nosmt` kernel param)
  - If nothing helps, likely a hardware defect on pn02's board
- pn01 continues to be stable; keep monitoring
- Once stable: add second RAM stick back to pn02, reinstall with NVMe
- Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)

## Fallback: Auto-Recovery from Freezes

If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:

```nix
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
```

This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.
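
After applying the fallback config and rebooting, it may be worth confirming that the lockup-to-reboot chain is actually armed. The sketch below is one possible check, not something taken from the repo: the function name `check_lockup_recovery` and its optional argument (an alternate root in place of `/proc/sys`, there only so the function can be tried against a fake tree) are hypothetical. It reads the `/proc/sys/kernel` files that correspond to the `kernel.softlockup_panic`, `kernel.hardlockup_panic`, and `kernel.panic` sysctls. On a kernel built without the hard lockup detector, `hardlockup_panic` may simply not exist and is reported as missing.

```shell
#!/bin/sh
# Hypothetical helper: report whether lockup -> panic -> reboot is armed.
# $1 (optional) overrides the sysctl root; defaults to /proc/sys.
check_lockup_recovery() {
    proc_sys="${1:-/proc/sys}"
    rc=0

    # Both lockup detectors must be set to panic (value 1).
    for key in softlockup_panic hardlockup_panic; do
        file="$proc_sys/kernel/$key"
        if [ ! -r "$file" ]; then
            echo "$key: missing"
            rc=1
        elif [ "$(cat "$file")" = "1" ]; then
            echo "$key: armed"
        else
            echo "$key: disabled"
            rc=1
        fi
    done

    # kernel.panic is the reboot delay in seconds; 0 means no automatic reboot.
    delay="$(cat "$proc_sys/kernel/panic" 2>/dev/null || echo 0)"
    if [ "$delay" != "0" ]; then
        echo "panic: reboot delay ${delay}s"
    else
        echo "panic: no auto-reboot"
        rc=1
    fi

    return "$rc"
}
```

Sourcing the file and running `check_lockup_recovery` with no arguments on pn02 checks the live system; a non-zero exit status means at least one link in the chain is not set.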