nixos-servers/docs/plans/pn51-stability.md
Torjus Håkestad 2b42145d94
pn51: document BIOS tweaks, second pn02 freeze, amdgpu blacklist
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:28:19 +01:00

ASUS PN51 Stability Testing

Overview

Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) were purchased years ago but shelved because of stability issues. These notes track the effort to get them stable enough to add to the homelab.

Hardware

| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
|---|---|---|
| CPU | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
| RAM | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
| Storage | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
| BIOS | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |

Original Issues

  • pn01: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
  • pn02: Had trouble booting — would start loading kernel from installer USB then instantly reboot. When it did boot, would also freeze randomly.

Debugging Steps

2026-02-21: Initial Setup

  1. Disabled fTPM (labeled "Security Device" in ASUS BIOS) on both units
    • AMD Ryzen 5000 series had a known fTPM bug causing random hard freezes with no console output
    • Both units booted the NixOS installer successfully after this change
  2. Installed NixOS on both, added to repo as pn01 and pn02 on VLAN 12
  3. Configured monitoring (node-exporter, promtail, nixos-exporter)
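
To confirm from the running system that the fTPM really is off, one option is to check whether the kernel registered any TPM device. The sketch below assumes the usual `/sys/class/tpm` sysfs layout; the function name `tpm_status` and the overridable directory argument are illustrative and not part of the repo.

```shell
#!/bin/sh
# Report whether the kernel has registered any TPM device.
# $1: sysfs TPM class directory (defaults to the real one; overridable for testing).
tpm_status() {
    tpm_dir="${1:-/sys/class/tpm}"
    for dev in "$tpm_dir"/tpm*; do
        # The glob stays unexpanded when nothing matches, so test for existence.
        if [ -e "$dev" ]; then
            echo "present"
            return 0
        fi
    done
    echo "absent"
}
```

Running `tpm_status` with no arguments on pn01 or pn02 should print `absent` if the "Security Device" setting took effect.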

2026-02-21: pn02 First Freeze

  • pn02 froze approximately 1 hour after boot
  • All three Prometheus targets went down simultaneously — hard freeze, not graceful shutdown
  • Journal on next boot: system.journal corrupted or uncleanly shut down
  • Kernel warnings from boot log before freeze:
    • TSC clocksource unstable: Marking clocksource 'tsc' as unstable because the skew is too large — TSC skewing ~3.8ms over 500ms relative to HPET watchdog
    • AMD PSP error: psp gfx command LOAD_TA(0x1) failed and response status is (0x7) — Platform Security Processor failing to load trusted application
  • pn01 did not show these warnings on this particular boot, but has shown them historically (see below)
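
A quick way to check a boot for the two warnings above is to count matching lines in the kernel log. This is a minimal sketch: the helper name `count_platform_warnings` is made up for illustration, and the patterns are taken from the messages quoted in these notes.

```shell
#!/bin/sh
# Count kernel log lines (read from stdin) that mention the TSC or PSP warnings.
count_platform_warnings() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the exit status clean.
    grep -c -E "Marking clocksource 'tsc' as unstable|psp gfx command LOAD_TA\(0x1\) failed" || true
}
```

Usage, for the current boot: `journalctl -k -b --no-pager | count_platform_warnings`. Replacing `-b` with `-b -1` reads the previous boot instead, although after a hard freeze that journal may be incomplete.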

2026-02-21: pn02 BIOS Update

  • Updated pn02 BIOS to latest version from ASUS website
  • TSC still unstable after BIOS update — same ~3.8ms skew
  • PSP LOAD_TA still failing after BIOS update
  • Monitoring back up, letting it run to see if freeze recurs

2026-02-22: TSC/PSP Confirmed on Both Units

  • Checked kernel logs after ~9 hours uptime — both units still running
  • pn01 now shows TSC unstable and PSP LOAD_TA failure on this boot (same ~3.8ms TSC skew, same PSP error)
  • pn01 had these same issues when tested years ago, so the earlier clean boot was most likely just lucky TSC calibration timing
  • Conclusion: TSC instability and PSP LOAD_TA are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
  • The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
  • Neither issue is likely the cause of the hard freezes — the fTPM bug remains the primary suspect

2026-02-22: Stress Test (1 hour)

  • Ran stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h on both units
  • CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
  • Both survived the full hour with no freezes, no MCE errors, no kernel issues
  • No concerning log entries during or after the test
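
The thermal margin during such a run can be worked out from hwmon readings, which the kernel reports in millidegrees Celsius. The helper names below are hypothetical, and the default limit of 105 is simply the throttle limit quoted above.

```shell
#!/bin/sh
# Convert a hwmon reading in millidegrees Celsius to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print the margin in degrees C between a reading and the throttle limit.
# $1: reading in millidegrees, $2: limit in degrees C (defaults to 105).
thermal_headroom() {
    limit="${2:-105}"
    echo $(( limit - $1 / 1000 ))
}
```

For the ~85°C peak seen here, `thermal_headroom 85000` prints `20`. On a live system the reading would come from a file such as `/sys/class/hwmon/hwmon*/temp1_input`; which hwmon index belongs to the CPU sensor varies per machine, so check the adjacent `name` file first.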

2026-02-22: TSC Runtime Switch Test

  • Attempted to switch clocksource back to TSC at runtime on pn01:
    echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
    
  • Kernel watchdog immediately reverted to HPET — TSC skew is ongoing, not just a boot-time issue
  • Conclusion: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
  • For virtualization (Incus), this means guest VMs will use HPET-backed timing. Performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.
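
The result of the switch attempt can be verified by reading the clocksource back from sysfs. The sketch below uses the same sysfs directory as the command above; the function names and the optional directory argument are illustrative only.

```shell
#!/bin/sh
# Print the active and available clocksources.
# $1: clocksource sysfs dir (defaults to the real one; overridable for testing).
clocksource_report() {
    cs_dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    current=$(cat "$cs_dir/current_clocksource")
    available=$(cat "$cs_dir/available_clocksource")
    echo "current=$current available=$available"
}

# Succeed only if the kernel is currently running on the TSC.
tsc_active() {
    cs_dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    [ "$(cat "$cs_dir/current_clocksource")" = "tsc" ]
}
```

After the watchdog reverts the switch, `tsc_active || echo "not on TSC"` should print the message, and `clocksource_report` should show `current=hpet`.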

2026-02-22: BIOS Tweaks (Both Units)

  • Disabled ErP Ready on both (EU power efficiency mode — aggressively cuts power in idle)
  • Disabled WiFi and Bluetooth in BIOS on both
  • TSC still unstable after these changes — same ~3.8ms skew on both units
  • ErP/power states are not the cause of the TSC issue

2026-02-22: pn02 Second Freeze

  • pn02 froze again ~5.5 hours after boot (at idle, not under load)
  • All Prometheus targets down simultaneously — same hard freeze pattern
  • Last log entry was normal nix-daemon activity — zero warning/error logs before crash
  • Survived the 1h stress test earlier but froze at idle later — not thermal
  • pn01 remains stable throughout
  • Action: Blacklisted the amdgpu kernel module on pn02 (boot.blacklistedKernelModules = [ "amdgpu" ]) to rule out GPU/PSP firmware interactions as a cause. As a side effect the unit has no console output, but it is still managed via SSH.
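
Since a blacklist only helps if the module really stays unloaded, it is worth confirming over SSH after the rebuild and reboot. A minimal sketch, assuming the standard `/proc/modules` format (module name followed by a space); `module_loaded` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the named module appears in a /proc/modules-style listing.
# $1: module name, $2: listing file (defaults to /proc/modules).
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

On pn02, `module_loaded amdgpu && echo "amdgpu still loaded"` should print nothing once the blacklist is in effect.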

Benign Kernel Errors (Both Units)

These appear on both units and can be ignored:

  • clocksource: Marking clocksource 'tsc' as unstable — TSC skew vs HPET, kernel falls back gracefully. Platform-level quirk on PN51-E1; it does not appear on every boot.
  • psp gfx command LOAD_TA(0x1) failed — AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
  • pcie_mp2_amd: amd_sfh_hid_client_init failed err -95 — AMD Sensor Fusion Hub, no sensors connected
  • Bluetooth: hci0: Reading supported features failed — Bluetooth init quirk
  • Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO — unused serial bus device
  • snd_hda_intel: no codecs found — no audio device connected, headless server
  • ata2.00: supports DRM functions and may not be fully accessible — Samsung SSD DRM quirk (pn02 only)
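
When reviewing logs after a freeze, it can help to strip the known-benign messages so that anything new stands out. The filter below is a sketch built from the message fragments listed above; the function name `filter_benign` is illustrative, and the fixed-string patterns may need adjusting if a kernel update changes the exact wording.

```shell
#!/bin/sh
# Drop kernel log lines (stdin) that match the known-benign messages; print the rest.
filter_benign() {
    # -F matches fixed strings, so the parentheses and colons need no escaping.
    grep -v -F \
        -e "Marking clocksource 'tsc' as unstable" \
        -e "psp gfx command LOAD_TA(0x1) failed" \
        -e "amd_sfh_hid_client_init failed err -95" \
        -e "Reading supported features failed" \
        -e "INT3515:00: error -ENXIO" \
        -e "no codecs found" \
        -e "supports DRM functions and may not be fully accessible" \
        || true
}
```

Usage: `journalctl -k -b -p warning --no-pager | filter_benign` shows only warnings and errors that are not already on the list.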

Next Steps

  • Monitor pn02 with amdgpu blacklisted — if stable, try the less impactful amdgpu.runpm=0 amdgpu.dpm=0 kernel params instead
  • If pn02 still freezes without amdgpu, likely a hardware defect on this unit
  • pn01 continues to be stable — keep monitoring
  • Once stable: add second RAM stick back to pn02, reinstall with NVMe
  • Evaluate for Incus hypervisor use (see nixos-hypervisor.md)
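
If the amdgpu.runpm=0 amdgpu.dpm=0 route is tried later, the running kernel's command line can be checked to confirm the parameters were applied. A minimal sketch, assuming space-separated parameters as in `/proc/cmdline`; `cmdline_has` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if a kernel parameter is present on the command line (exact match).
# $1: parameter (e.g. amdgpu.runpm=0), $2: cmdline file (defaults to /proc/cmdline).
cmdline_has() {
    # Split the single-line command line into one parameter per line, then match whole lines.
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -q -x -F -- "$1"
}
```

Usage: `cmdline_has amdgpu.runpm=0 && cmdline_has amdgpu.dpm=0 && echo "both params active"`.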