# ASUS PN51 Stability Testing

## Overview

Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) were purchased years ago but shelved due to stability issues. Revisiting them to potentially add to the homelab.

## Hardware

| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
|---|---|---|
| **CPU** | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
| **RAM** | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
| **Storage** | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
| **BIOS** | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |

## Original Issues

- **pn01**: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
- **pn02**: Had trouble booting — would start loading the kernel from the installer USB, then instantly reboot. When it did boot, it would also freeze randomly.

## Debugging Steps

### 2026-02-21: Initial Setup

1. **Disabled fTPM** (labeled "Security Device" in the ASUS BIOS) on both units
   - AMD Ryzen 5000 series had a known fTPM bug causing random hard freezes with no console output
   - Both units booted the NixOS installer successfully after this change
2. Installed NixOS on both, added to the repo as `pn01` and `pn02` on VLAN 12
3. Configured monitoring (node-exporter, promtail, nixos-exporter)

### 2026-02-21: pn02 First Freeze

- pn02 froze approximately 1 hour after boot
- All three Prometheus targets went down simultaneously — a hard freeze, not a graceful shutdown
- Journal on next boot: `system.journal corrupted or uncleanly shut down`
- Kernel warnings from the boot log before the freeze:
  - **TSC clocksource unstable**: `Marking clocksource 'tsc' as unstable because the skew is too large` — TSC skewing ~3.8ms over 500ms relative to the HPET watchdog
  - **AMD PSP error**: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` — Platform Security Processor failing to load a trusted application
- pn01 did not show these warnings on this particular boot, but has shown them historically (see below)

### 2026-02-21: pn02 BIOS Update

- Updated pn02 BIOS to the latest version from the ASUS website
- **TSC still unstable** after the BIOS update — same ~3.8ms skew
- **PSP LOAD_TA still failing** after the BIOS update
- Monitoring back up; letting it run to see if the freeze recurs

### 2026-02-22: TSC/PSP Confirmed on Both Units

- Checked kernel logs after ~9 hours uptime — both units still running
- **pn01 now shows TSC unstable and PSP LOAD_TA failure** on this boot (same ~3.8ms TSC skew, same PSP error)
- pn01 had these same issues when tested years ago — the earlier clean boot was just lucky TSC calibration timing
- **Conclusion**: TSC instability and the PSP LOAD_TA failure are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
- The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
- Neither issue is likely the cause of the hard freezes — the fTPM bug remains the primary suspect

### 2026-02-22: Stress Test (1 hour)

- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
- Both survived the full hour with no freezes, no MCE errors, no kernel issues
- No concerning log entries during or after the test

### 2026-02-22: TSC Runtime Switch Test

- Attempted to switch the clocksource back to TSC at runtime on pn01:

  ```
  echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
  ```

- The kernel watchdog immediately reverted to HPET — the TSC skew is ongoing, not just a boot-time issue
- **Conclusion**: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
- For virtualization (Incus), this means guest VMs will use HPET-backed timing. The performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.

### 2026-02-22: BIOS Tweaks (Both Units)

- Disabled ErP Ready on both (EU power efficiency mode — aggressively cuts power at idle)
- Disabled WiFi and Bluetooth in the BIOS on both
- **TSC still unstable** after these changes — same ~3.8ms skew on both units
- ErP/power states are not the cause of the TSC issue

### 2026-02-22: pn02 Second Freeze

- pn02 froze again ~5.5 hours after boot (at idle, not under load)
- All Prometheus targets down simultaneously — same hard freeze pattern
- Last log entry was normal nix-daemon activity — zero warning/error logs before the crash
- Survived the 1h stress test earlier but froze at idle later — not thermal
- pn01 remains stable throughout
- **Action**: Blacklisted the `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. pn02 now has no console output but remains manageable via SSH.
- **Action**: Added diagnostic/recovery config to pn02:
  - `panic=10` + `nmi_watchdog=1` kernel params — auto-reboot 10s after a panic
  - `softlockup_panic` + `hardlockup_panic` sysctls — convert lockups to panics with stack traces
  - `hardware.rasdaemon` with recording — logs hardware errors (MCE, PCIe AER, memory) to a sqlite database that survives reboots
  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`

## Benign Kernel Errors (Both Units)

These appear on both units and can be ignored:

- `clocksource: Marking clocksource 'tsc' as unstable` — TSC skew vs HPET; the kernel falls back gracefully. Platform-level quirk on the PN51-E1, not always reproducible on every boot.
- `psp gfx command LOAD_TA(0x1) failed` — AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95` — AMD Sensor Fusion Hub, no sensors connected
- `Bluetooth: hci0: Reading supported features failed` — Bluetooth init quirk
- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO` — unused serial bus device
- `snd_hda_intel: no codecs found` — no audio device connected, headless server
- `ata2.00: supports DRM functions and may not be fully accessible` — Samsung SSD DRM quirk (pn02 only)

### 2026-02-23: processor.max_cstate=1 and Proxmox Forums

- Found a thread on the Proxmox forums about PN51 units with similar freeze issues
- Many users report identical symptoms — random hard freezes, no log evidence
- No conclusive fix. Some have frequent freezes, others only a few times a month
- Some reported that BIOS updates helped, but results are inconsistent
- Added the `processor.max_cstate=1` kernel parameter to pn02 — limits the CPU to the C1 halt state, preventing deep C-state sleep transitions that may trigger freezes on AMD mobile chips
- Also applied: amdgpu blacklist, panic=10, nmi_watchdog=1, softlockup/hardlockup panic, rasdaemon

### 2026-02-23: logind D-Bus Deadlock (pn02)

- A node-exporter alert fired — but the host was NOT frozen
- logind was running (PID 871) but deadlocked on D-Bus — not responding to `org.freedesktop.login1` requests
- Every node-exporter scrape blocked for 25s waiting for logind, causing scrape timeouts
- Likely related to the amdgpu blacklist — with no DRM device there is no graphical seat, and logind may have deadlocked during seat enumeration at boot
- Fix: `systemctl restart systemd-logind` + `systemctl restart prometheus-node-exporter`
- After the restart, logind responded normally and reported seat0

### 2026-02-27: pn02 Third Freeze

- pn02 crashed again after ~2 days 21 hours uptime (longest run so far)
- Evidence of the crash:
  - Journal file corrupted: `system.journal corrupted or uncleanly shut down`
  - Boot partition fsck: `Dirty bit is set. Fs was not properly unmounted`
  - No orderly shutdown logs from the previous boot
  - No auto-upgrade triggered
- **The NMI watchdog did NOT fire** — no kernel panic logged. This is a true hard lockup below the NMI level
- **rasdaemon recorded nothing** — no MCE, AER, or memory errors in the sqlite database
- **Positive**: The system auto-rebooted this time (likely the hardware watchdog), unlike previous freezes that required a manual power cycle
- `processor.max_cstate=1` may have extended uptime (2d21h vs the previous 1h and 5.5h) but did not prevent the freeze

### 2026-02-27 to 2026-03-03: Relative Stability

- pn02 ran without crashes for approximately one week after the third freeze
- pn01 continued to be completely stable throughout this period
- Auto-upgrade reboots continued daily (~4am) on both units — these are planned and healthy

### 2026-03-04: pn02 Fourth Crash — sched_ext Kernel Oops (pstore captured)

- pn02 crashed after ~5.8 days uptime (504566s)
- **First crash captured by pstore** — kernel oops and panic stack traces preserved across the reboot
- Journal corruption confirmed: `system.journal corrupted or uncleanly shut down`
- **Crash location**: `RIP: 0010:set_next_task_scx+0x6e/0x210` — a crash in the **sched_ext (SCX) scheduler** subsystem
- **Call trace**: `sysvec_apic_timer_interrupt` → `cpuidle_enter_state` — crashed during CPU idle, triggered by an APIC timer interrupt
- **CR2**: `ffffffffffffff89` — dereferencing an obviously invalid kernel pointer
- **Kernel**: 6.12.74 (NixOS 25.11)
- **Significance**: This is the first crash with actual diagnostic output. Previous crashes were silent sub-NMI freezes. The sched_ext scheduler path is a new finding — earlier crashes were assumed to be hardware-level.
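Pulling the crash site and faulting address out of a pstore dmesg dump can be scripted. A minimal sketch, assuming x86-64 oops formatting; the `sample` text is a hypothetical excerpt modeled on the lines quoted above, not a real dump (real dumps live under `/sys/fs/pstore/` until `systemd-pstore.service` archives them):

```python
import re

# Hypothetical excerpt modeled on the Mar 4 oops; a real dump would be
# read from /sys/fs/pstore/ or the systemd-pstore archive directory.
sample = """\
BUG: unable to handle page fault for address: ffffffffffffff89
RIP: 0010:set_next_task_scx+0x6e/0x210
CR2: ffffffffffffff89
"""

def extract_crash_info(dump: str) -> dict:
    """Extract the crash site (RIP symbol+offset) and faulting address (CR2)."""
    info = {}
    # RIP lines look like "RIP: <segment>:<symbol>+<offset>/<size>"
    m = re.search(r"RIP: \S+:(\S+)", dump)
    if m:
        info["rip"] = m.group(1)
    m = re.search(r"CR2: (\S+)", dump)
    if m:
        info["cr2"] = m.group(1)
    return info

print(extract_crash_info(sample))
```

This is just a convenience for grepping many archived dumps at once; the raw files remain the source of truth.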
### 2026-03-06: pn02 Fifth Crash

- pn02 crashed again — journal corruption on the next boot
- No pstore data captured for this crash

### 2026-03-07: pn02 Sixth and Seventh Crashes — Two in One Day

**First crash (~11:06 UTC):**

- ~26.6 hours uptime (95994s)
- **pstore captured both the Oops and the Panic**
- **Crash location**: scheduler code path — `pick_next_task_fair` → `__pick_next_task`
- **CR2**: `000000c000726000` — invalid pointer dereference
- **Notable**: `dbus-daemon` segfaulted ~50 minutes before the kernel crash (`segfault at 0` in `libdbus-1.so.3.32.4` on CPU 0) — may indicate memory corruption preceding the kernel crash

**Second crash (~21:15 UTC):**

- Journal corruption confirmed on the next boot
- No pstore data captured

### 2026-03-12: pn02 Memtest86 — 38 Passes, Zero Errors

- Ran memtest86 for ~109 hours (4.5 days), completing 38 full passes
- **Zero errors found** — the RAM appears healthy
- This makes hardware-induced memory corruption less likely as the sole cause of the crashes
- Memtest cannot rule out CPU cache errors, PCIe/IOMMU issues, or kernel bugs triggered by platform quirks
- **Next step**: Boot back into NixOS with sched_ext disabled to test the kernel-scheduler hypothesis

### 2026-03-07: pn01 Status

- pn01 has had **zero crashes** since the initial setup on Feb 21
- Zero journal corruptions, zero pstore dumps in 30 days
- The same BOOT_ID is maintained between daily auto-upgrade reboots — consistently clean shutdown/reboot cycles
- All 8 reboots in 30 days were planned auto-upgrade reboots
- **pn01 is fully stable**

## Crash Summary

| Date | Uptime Before Crash | Crash Type | Diagnostic Data |
|------|---------------------|------------|-----------------|
| Feb 21 | ~1h | Silent freeze | None — sub-NMI |
| Feb 22 | ~5.5h | Silent freeze | None — sub-NMI |
| Feb 27 | ~2d 21h | Silent freeze | None — sub-NMI, rasdaemon empty |
| Mar 4 | ~5.8d | **Kernel oops** | pstore: `set_next_task_scx` (sched_ext) |
| Mar 6 | Unknown | Crash | Journal corruption only |
| Mar 7 | ~26.6h | **Kernel oops + panic** | pstore: `pick_next_task_fair` (scheduler) + dbus segfault |
| Mar 7 | Unknown | Crash | Journal corruption only |

## Conclusion

**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days: 26 reboots in 30 days (7 unclean crashes plus daily auto-upgrade reboots).

The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel scheduler bugs in sched_ext**, not just silent hardware-level freezes. The `set_next_task_scx` and `pick_next_task_fair` crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. Remaining possibilities:

1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
2. ~~Hardware-induced memory corruption that happens to hit scheduler data structures~~ — unlikely after memtest86 completed 38 clean passes (109 hours)
3. A pure software bug in the 6.12.74 kernel's sched_ext implementation

**pn01 is stable** — zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board.

## Next Steps

- **~~pn02 memtest~~**: ~~Run memtest86 for 24h+~~ — Done (2026-03-12): 38 passes over 109 hours, zero errors. RAM is not the issue.
- **pn02 sched_ext test**: Disable sched_ext (`boot.kernelParams = [ "sched_ext.enabled=0" ]` or equivalent) and run for 1-2 weeks to test whether the crashes stop — this would help distinguish a kernel bug from a hardware defect
- **pn02**: If disabling sched_ext doesn't help, consider scrapping it or repurposing it for non-critical workloads that tolerate random reboots (auto-recovery via the hardware watchdog is working)
- **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads.
- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
- For the Incus hypervisor plan: likely need different hardware. Evaluating the GMKtec G3 (Intel) as an alternative. Note: a mixed Intel/AMD cluster complicates live migration.

## Diagnostics and Auto-Recovery (pn02)

Currently deployed on pn02:

```nix
boot.blacklistedKernelModules = [ "amdgpu" ];
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
```

**Crash recovery is working**: pstore now captures kernel oops/panic data, and the system auto-reboots via `panic=10` or the SP5100 TCO hardware watchdog.

**After reboot, check:**

- `ras-mc-ctl --summary` — overview of hardware errors
- `ras-mc-ctl --errors` — detailed error list
- `journalctl -b -1 -p err` — kernel logs from the crashed boot (if the panic was logged)
- pstore data is automatically archived by `systemd-pstore.service` and forwarded to Loki via promtail
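The post-reboot checks above can be rolled into one triage function. A sketch, assuming systemd-pstore's default archive directory (`/var/lib/systemd/pstore`); it degrades gracefully when a tool is missing:

```shell
# Sketch: run all post-crash checks in one place after an unclean reboot.
triage() {
  echo "== pstore archives =="
  ls /var/lib/systemd/pstore 2>/dev/null || echo "(none)"

  echo "== hardware errors (rasdaemon) =="
  if command -v ras-mc-ctl >/dev/null 2>&1; then
    ras-mc-ctl --summary
  else
    echo "(ras-mc-ctl not available)"
  fi

  echo "== errors from the crashed boot =="
  if command -v journalctl >/dev/null 2>&1; then
    journalctl -b -1 -p err --no-pager
  else
    echo "(journalctl not available)"
  fi
}
triage
```

Since promtail already ships the archived pstore data to Loki, this is only a quick on-host summary, not a replacement for the centralized logs.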