nixos-servers/docs/plans/pn51-stability.md
Torjus Håkestad 20875fb03f pn02: disable sched_ext and document memtest results
Memtest86 ran 38 passes (109 hours) with zero errors, ruling out RAM.
Disable sched_ext scheduler to test whether kernel scheduler crashes stop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 12:16:55 +01:00


ASUS PN51 Stability Testing

Overview

Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to stability issues. Revisiting them to potentially add to the homelab.

Hardware

|         | pn01 (10.69.12.60)          | pn02 (10.69.12.61)                    |
|---------|-----------------------------|---------------------------------------|
| CPU     | AMD Ryzen 7 5700U (8C/16T)  | AMD Ryzen 7 5700U (8C/16T)            |
| RAM     | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB)           |
| Storage | 1TB NVMe                    | 1TB Samsung 870 EVO (SATA SSD)        |
| BIOS    | 0508 (2023-11-08)           | Updated 2026-02-21 (latest from ASUS) |

Original Issues

  • pn01: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
  • pn02: Had trouble booting — would start loading kernel from installer USB then instantly reboot. When it did boot, would also freeze randomly.

Debugging Steps

2026-02-21: Initial Setup

  1. Disabled fTPM (labeled "Security Device" in ASUS BIOS) on both units
    • AMD Ryzen 5000 series had a known fTPM bug causing random hard freezes with no console output
    • Both units booted the NixOS installer successfully after this change
  2. Installed NixOS on both, added to repo as pn01 and pn02 on VLAN 12
  3. Configured monitoring (node-exporter, promtail, nixos-exporter)
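The monitoring stack in step 3 corresponds roughly to the following host config (a hedged sketch — the exact module layout in the repo is an assumption, and the nixos-exporter module is repo-specific, so it is shown only as a comment):

```nix
# Sketch of the monitoring config added to pn01/pn02. The node-exporter and
# promtail options are standard NixOS options; nixos-exporter is a
# repo-specific module and omitted here.
{
  services.prometheus.exporters.node = {
    enable = true;
    enabledCollectors = [ "systemd" ];  # expose unit states for alerting
  };
  services.promtail.enable = true;  # ships journald logs to Loki
}
```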

2026-02-21: pn02 First Freeze

  • pn02 froze approximately 1 hour after boot
  • All three Prometheus targets went down simultaneously — hard freeze, not graceful shutdown
  • Journal on next boot: system.journal corrupted or uncleanly shut down
  • Kernel warnings from boot log before freeze:
    • TSC clocksource unstable: Marking clocksource 'tsc' as unstable because the skew is too large — TSC skewing ~3.8ms over 500ms relative to HPET watchdog
    • AMD PSP error: psp gfx command LOAD_TA(0x1) failed and response status is (0x7) — Platform Security Processor failing to load trusted application
  • pn01 did not show these warnings on this particular boot, but has shown them historically (see below)

2026-02-21: pn02 BIOS Update

  • Updated pn02 BIOS to latest version from ASUS website
  • TSC still unstable after BIOS update — same ~3.8ms skew
  • PSP LOAD_TA still failing after BIOS update
  • Monitoring back up, letting it run to see if freeze recurs

2026-02-22: TSC/PSP Confirmed on Both Units

  • Checked kernel logs after ~9 hours uptime — both units still running
  • pn01 now shows TSC unstable and PSP LOAD_TA failure on this boot (same ~3.8ms TSC skew, same PSP error)
  • pn01 had these same issues historically when tested years ago — the earlier clean boot was just lucky TSC calibration timing
  • Conclusion: TSC instability and PSP LOAD_TA are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
  • The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
  • Neither issue is likely the cause of the hard freezes — the fTPM bug remains the primary suspect

2026-02-22: Stress Test (1 hour)

  • Ran stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h on both units
  • CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
  • Both survived the full hour with no freezes, no MCE errors, no kernel issues
  • No concerning log entries during or after the test

2026-02-22: TSC Runtime Switch Test

  • Attempted to switch clocksource back to TSC at runtime on pn01:
    echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
    
  • Kernel watchdog immediately reverted to HPET — TSC skew is ongoing, not just a boot-time issue
  • Conclusion: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
  • For virtualization (Incus), this means guest VMs will use HPET-backed timing. Performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.
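For scale, the reported skew is enormous by clock-tolerance standards (plain arithmetic, nothing platform-specific assumed):

```shell
# 3.8 ms of drift over the 500 ms watchdog interval, expressed in ppm.
# Crystal oscillators are typically specified at roughly 50-100 ppm, so the
# watchdog's decision to drop TSC is unambiguous.
awk 'BEGIN { printf "%.0f ppm\n", 0.0038 / 0.5 * 1e6 }'
# prints: 7600 ppm
```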

2026-02-22: BIOS Tweaks (Both Units)

  • Disabled ErP Ready on both (EU power-efficiency mode that aggressively cuts power at idle)
  • Disabled WiFi and Bluetooth in BIOS on both
  • TSC still unstable after these changes — same ~3.8ms skew on both units
  • ErP/power states are not the cause of the TSC issue

2026-02-22: pn02 Second Freeze

  • pn02 froze again ~5.5 hours after boot (at idle, not under load)
  • All Prometheus targets down simultaneously — same hard freeze pattern
  • Last log entry was normal nix-daemon activity — zero warning/error logs before crash
  • Survived the 1h stress test earlier but froze at idle later — not thermal
  • pn01 remains stable throughout
  • Action: Blacklisted amdgpu kernel module on pn02 (boot.blacklistedKernelModules = [ "amdgpu" ]) to eliminate GPU/PSP firmware interactions as a cause. No console output but managed via SSH.
  • Action: Added diagnostic/recovery config to pn02:
    • panic=10 + nmi_watchdog=1 kernel params — auto-reboot after 10s on panic
    • softlockup_panic + hardlockup_panic sysctls — convert lockups to panics with stack traces
    • hardware.rasdaemon with recording — logs hardware errors (MCE, PCIe AER, memory) to sqlite database, survives reboots
    • Check recorded errors: ras-mc-ctl --summary, ras-mc-ctl --errors

Benign Kernel Errors (Both Units)

These appear on both units and can be ignored:

  • clocksource: Marking clocksource 'tsc' as unstable — TSC skew vs HPET, kernel falls back gracefully. Platform-level quirk on PN51-E1, not always reproducible on every boot.
  • psp gfx command LOAD_TA(0x1) failed — AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
  • pcie_mp2_amd: amd_sfh_hid_client_init failed err -95 — AMD Sensor Fusion Hub, no sensors connected
  • Bluetooth: hci0: Reading supported features failed — Bluetooth init quirk
  • Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO — unused serial bus device
  • snd_hda_intel: no codecs found — no audio device connected, headless server
  • ata2.00: supports DRM functions and may not be fully accessible — Samsung SSD DRM quirk (pn02 only)

2026-02-23: processor.max_cstate=1 and Proxmox Forums

  • Found a thread on the Proxmox forums about PN51 units with similar freeze issues
    • Many users reporting identical symptoms — random hard freezes, no log evidence
    • No conclusive fix. Some have frequent freezes, others only a few times a month
    • Some reported BIOS updates helped, but results inconsistent
  • Added processor.max_cstate=1 kernel parameter to pn02 — limits CPU to C1 halt state, preventing deep C-state sleep transitions that may trigger freezes on AMD mobile chips
  • Also applied: amdgpu blacklist, panic=10, nmi_watchdog=1, softlockup/hardlockup panic, rasdaemon

2026-02-23: logind D-Bus Deadlock (pn02)

  • node-exporter alert fired — but host was NOT frozen
  • logind was running (PID 871) but deadlocked on D-Bus — not responding to org.freedesktop.login1 requests
  • Every node-exporter scrape blocked for 25s waiting for logind, causing scrape timeouts
  • Likely related to amdgpu blacklist — no DRM device means no graphical seat, logind may have deadlocked during seat enumeration at boot
  • Fix: systemctl restart systemd-logind + systemctl restart prometheus-node-exporter
  • After restart, logind responded normally and reported seat0

2026-02-27: pn02 Third Freeze

  • pn02 crashed again after ~2 days 21 hours uptime (longest run so far)
  • Evidence of crash:
    • Journal file corrupted: system.journal corrupted or uncleanly shut down
    • Boot partition fsck: Dirty bit is set. Fs was not properly unmounted
    • No orderly shutdown logs from previous boot
    • No auto-upgrade triggered
  • NMI watchdog did NOT fire — no kernel panic logged. This is a true hard lockup below NMI level
  • rasdaemon recorded nothing — no MCE, AER, or memory errors in the sqlite database
  • Positive: The system auto-rebooted this time (likely hardware watchdog), unlike previous freezes that required manual power cycle
  • processor.max_cstate=1 may have extended uptime (2d21h vs previous 1h and 5.5h) but did not prevent the freeze
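Since this recovery apparently came from the hardware watchdog rather than the panic path, arming the SP5100 TCO watchdog explicitly via systemd would make auto-reboot deterministic for future sub-NMI freezes. A sketch using standard NixOS options; the timeout values are arbitrary choices, not tested on this hardware:

```nix
# Have PID 1 pet the hardware watchdog (SP5100 TCO on this platform).
# If a hard lockup stops systemd from petting it, the board resets itself
# instead of waiting for a manual power cycle.
{
  systemd.watchdog.runtimeTime = "30s";  # reset if not petted for 30s
  systemd.watchdog.rebootTime = "2m";    # cap how long a reboot may hang
}
```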

2026-02-27 to 2026-03-03: Relative Stability

  • pn02 ran without crashes for approximately one week after the third freeze
  • pn01 continued to be completely stable throughout this period
  • Auto-upgrade reboots continued daily (~4am) on both units — these are planned and healthy

2026-03-04: pn02 Fourth Crash — sched_ext Kernel Oops (pstore captured)

  • pn02 crashed after ~5.8 days uptime (504566s)
  • First crash captured by pstore — kernel oops and panic stack traces preserved across reboot
  • Journal corruption confirmed: system.journal corrupted or uncleanly shut down
  • Crash location: RIP: 0010:set_next_task_scx+0x6e/0x210 — crash in the sched_ext (SCX) scheduler subsystem
    • Call trace: sysvec_apic_timer_interrupt / cpuidle_enter_state — crashed during CPU idle, triggered by APIC timer interrupt
  • CR2: ffffffffffffff89 — dereferencing an obviously invalid kernel pointer
  • Kernel: 6.12.74 (NixOS 25.11)
  • Significance: This is the first crash with actual diagnostic output. Previous crashes were silent sub-NMI freezes. The sched_ext scheduler path is a new finding — earlier crashes were assumed to be hardware-level.
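As a sanity check, the raw uptime counter matches the quoted figure (plain arithmetic):

```shell
# 504566 seconds of uptime before the Mar 4 crash, converted to days.
awk 'BEGIN { printf "%.1f days\n", 504566 / 86400 }'
# prints: 5.8 days
```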

2026-03-06: pn02 Fifth Crash

  • pn02 crashed again — journal corruption on next boot
  • No pstore data captured for this crash

2026-03-07: pn02 Sixth and Seventh Crashes — Two in One Day

First crash (~11:06 UTC):

  • ~26.6 hours uptime (95994s)
  • pstore captured both Oops and Panic
  • Crash location: Scheduler code path — pick_next_task_fair / __pick_next_task
  • CR2: 000000c000726000 — invalid pointer dereference
  • Notable: dbus-daemon segfaulted ~50 minutes before the kernel crash (segfault at 0 in libdbus-1.so.3.32.4 on CPU 0) — may indicate memory corruption preceding the kernel crash

Second crash (~21:15 UTC):

  • Journal corruption confirmed on next boot
  • No pstore data captured

2026-03-12: pn02 Memtest86 — 38 Passes, Zero Errors

  • Ran memtest86 for ~109 hours (4.5 days), completing 38 full passes
  • Zero errors found — RAM appears healthy
  • Makes hardware-induced memory corruption less likely as the sole cause of crashes
  • Memtest cannot rule out CPU cache errors, PCIe/IOMMU issues, or kernel bugs triggered by platform quirks
  • Next step: Boot back into NixOS with sched_ext disabled to test the kernel scheduler hypothesis

2026-03-07: pn01 Status

  • pn01 has had zero crashes since initial setup on Feb 21
  • Zero journal corruptions, zero pstore dumps in 30 days
  • Same BOOT_ID maintained between daily auto-upgrade reboots — consistently clean shutdown/reboot cycles
  • All 8 reboots in 30 days are planned auto-upgrade reboots
  • pn01 is fully stable

Crash Summary

| Date   | Uptime Before Crash | Crash Type          | Diagnostic Data                                           |
|--------|---------------------|---------------------|-----------------------------------------------------------|
| Feb 21 | ~1h                 | Silent freeze       | None — sub-NMI                                            |
| Feb 22 | ~5.5h               | Silent freeze       | None — sub-NMI                                            |
| Feb 27 | ~2d 21h             | Silent freeze       | None — sub-NMI, rasdaemon empty                           |
| Mar 4  | ~5.8d               | Kernel oops         | pstore: set_next_task_scx (sched_ext)                     |
| Mar 6  | Unknown             | Crash               | Journal corruption only                                   |
| Mar 7  | ~26.6h              | Kernel oops + panic | pstore: pick_next_task_fair (scheduler) + dbus segfault   |
| Mar 7  | Unknown             | Crash               | Journal corruption only                                   |

Conclusion

pn02 is unreliable. After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).

The pstore crash dumps from March reveal a new dimension: at least some crashes are kernel scheduler bugs in sched_ext, not just silent hardware-level freezes. The set_next_task_scx and pick_next_task_fair crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. Memtest86 ran 38 passes (109 hours) with zero errors, making possibility 2 below less likely. Remaining possibilities:

  1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
  2. Hardware-induced memory corruption that happens to hit scheduler data structures — unlikely after clean memtest
  3. A pure software bug in the 6.12.74 kernel's sched_ext implementation

pn01 is stable — zero crashes in 30 days of continuous operation. Both units have identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 board.

Next Steps

  • pn02 memtest: Run memtest86 for 24h+ — Done (2026-03-12): 38 passes over 109 hours, zero errors. RAM is not the issue.
  • pn02 sched_ext test: Disable sched_ext (boot.kernelParams = [ "sched_ext.enabled=0" ] or equivalent) and run for 1-2 weeks to test whether the crashes stop — would help distinguish kernel bug from hardware defect
  • pn02: If sched_ext disable doesn't help, consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is working)
  • pn01: Continue monitoring. If it remains stable long-term, it is viable for light workloads
  • If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
  • For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration
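One concrete way to run the sched_ext experiment, if no runtime toggle turns out to exist: compile the scheduler class out of the pn02 kernel entirely. A sketch using `boot.kernelPatches` with structured config — note this forces a local kernel build, since the binary-cache kernel ships with the class enabled:

```nix
# Sketch: remove the sched_ext scheduler class from the pn02 kernel so
# set_next_task_scx cannot be reached at all.
{ lib, ... }: {
  boot.kernelPatches = [{
    name = "disable-sched-ext";
    patch = null;  # config-only change, no source patch
    extraStructuredConfig.SCHED_CLASS_EXT = lib.kernel.no;
  }];
}
```

If the crashes stop over a 1-2 week run, that points at the sched_ext path; if they continue, the hardware-defect theory gets stronger.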

Diagnostics and Auto-Recovery (pn02)

Currently deployed on pn02:

boot.blacklistedKernelModules = [ "amdgpu" ];
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;

Crash recovery is working: pstore now captures kernel oops/panic data, and the system auto-reboots via panic=10 or SP5100 TCO hardware watchdog.

After reboot, check:

  • ras-mc-ctl --summary — overview of hardware errors
  • ras-mc-ctl --errors — detailed error list
  • journalctl -b -1 -p err — kernel logs from crashed boot (if panic was logged)
  • pstore data is automatically archived by systemd-pstore.service and forwarded to Loki via promtail