# ASUS PN51 Stability Testing

## Overview

Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) were purchased years ago but shelved due to stability issues. Revisiting them as potential additions to the homelab.
## Hardware

| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
|---|---|---|
| CPU | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
| RAM | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
| Storage | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
| BIOS | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |
## Original Issues

- pn01: Boots, but freezes randomly after some uptime. No console errors, completely unresponsive. memtest86 passed.
- pn02: Had trouble booting: it would start loading the kernel from the installer USB, then instantly reboot. When it did boot, it would also freeze randomly.
## Debugging Steps

### 2026-02-21: Initial Setup

- Disabled fTPM (labeled "Security Device" in the ASUS BIOS) on both units
  - AMD Ryzen 5000 series has a known fTPM bug causing random hard freezes with no console output
- Both units booted the NixOS installer successfully after this change
- Installed NixOS on both, added to the repo as `pn01` and `pn02` on VLAN 12
- Configured monitoring (node-exporter, promtail, nixos-exporter)
### 2026-02-21: pn02 First Freeze

- pn02 froze approximately 1 hour after boot
- All three Prometheus targets went down simultaneously: a hard freeze, not a graceful shutdown
- Journal on next boot: `system.journal corrupted or uncleanly shut down`
- Kernel warnings from the boot log before the freeze:
  - TSC clocksource unstable: `Marking clocksource 'tsc' as unstable because the skew is too large` (TSC skewing ~3.8ms over 500ms relative to the HPET watchdog)
  - AMD PSP error: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` (Platform Security Processor failing to load a trusted application)
- pn01 did not show these warnings on this particular boot, but has shown them historically (see below)
### 2026-02-21: pn02 BIOS Update

- Updated the pn02 BIOS to the latest version from the ASUS website
- TSC still unstable after the update: same ~3.8ms skew
- PSP LOAD_TA still failing after the update
- Monitoring is back up; letting it run to see whether the freeze recurs
### 2026-02-22: TSC/PSP Confirmed on Both Units

- Checked kernel logs after ~9 hours of uptime; both units still running
- pn01 now shows the TSC-unstable warning and the PSP LOAD_TA failure on this boot (same ~3.8ms TSC skew, same PSP error)
- pn01 had these same issues when tested years ago; the earlier clean boot was just lucky TSC calibration timing
- Conclusion: TSC instability and the PSP LOAD_TA failure are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
- The kernel handles TSC instability gracefully (falls back to HPET), and the PSP LOAD_TA failure is non-fatal
- Neither issue is likely the cause of the hard freezes; the fTPM bug remains the primary suspect
### 2026-02-22: Stress Test (1 hour)

- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
- Both survived the full hour with no freezes, no MCE errors, no kernel issues
- No concerning log entries during or after the test
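For future runs, temperatures can be sampled alongside stress-ng straight from sysfs. A minimal sketch; the helper name `read_cpu_temp` is mine, and it assumes `thermal_zone0` is the CPU package on this board (check each zone's `type` file first):

```shell
# Read the CPU temperature from sysfs, in millidegrees Celsius.
# Assumption: thermal_zone0 is the CPU package -- verify via its "type" file.
read_cpu_temp() {
  cat /sys/class/thermal/thermal_zone0/temp 2>/dev/null || echo "unavailable"
}

# Sample once a minute while the stress test runs:
# while sleep 60; do echo "$(date -Is) temp=$(read_cpu_temp)"; done
```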
### 2026-02-22: TSC Runtime Switch Test

- Attempted to switch the clocksource back to TSC at runtime on pn01: `echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource`
- The kernel watchdog immediately reverted to HPET: the TSC skew is ongoing, not just a boot-time issue
- Conclusion: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
- For virtualization (Incus), this means guest VMs will use HPET-backed timing. The performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.
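The easiest way to confirm which clocksource is actually in effect (and that the watchdog's revert stuck) is to read it back from sysfs. A sketch; the wrapper name `current_clocksource` is mine:

```shell
# Report the clocksource in use; on the PN51 this should read "hpet" once the
# tsc watchdog has fired. Falls back to "unknown" if the sysfs node is absent.
current_clocksource() {
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource 2>/dev/null \
    || echo "unknown"
}
```

The sibling file `available_clocksource` in the same directory lists the alternatives the kernel will accept.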
### 2026-02-22: BIOS Tweaks (Both Units)

- Disabled ErP Ready on both (an EU power-efficiency mode that aggressively cuts power at idle)
- Disabled WiFi and Bluetooth in the BIOS on both
- TSC still unstable after these changes: same ~3.8ms skew on both units
- ErP/power states are not the cause of the TSC issue
### 2026-02-22: pn02 Second Freeze

- pn02 froze again ~5.5 hours after boot (at idle, not under load)
- All Prometheus targets down simultaneously: the same hard-freeze pattern
- The last log entry was normal nix-daemon activity; zero warning/error logs before the crash
- Survived the 1h stress test earlier but froze at idle later, so it is not thermal
- pn01 remains stable throughout
- Action: Blacklisted the `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output, but the unit is managed via SSH.
- Action: Added diagnostic/recovery config to pn02:
  - `panic=10` + `nmi_watchdog=1` kernel params: auto-reboot after 10s on panic
  - `softlockup_panic` + `hardlockup_panic` sysctls: convert lockups to panics with stack traces
  - `hardware.rasdaemon` with recording: logs hardware errors (MCE, PCIe AER, memory) to a sqlite database that survives reboots
  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
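After the rebuild it is worth verifying that the lockup-to-panic sysctls actually took effect. A small sketch; the helper name `check_sysctl` is mine:

```shell
# Compare a sysctl's live value against the expected one; prints "ok" or "MISMATCH".
check_sysctl() {
  actual=$(sysctl -n "$1" 2>/dev/null || echo "unset")
  if [ "$actual" = "$2" ]; then
    echo "$1 ok"
  else
    echo "$1 MISMATCH (got $actual, want $2)"
  fi
}

# On pn02, both should report ok after the rebuild:
# check_sysctl kernel.softlockup_panic 1
# check_sysctl kernel.hardlockup_panic 1
```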
### Benign Kernel Errors (Both Units)

These appear on both units and can be ignored:

- `clocksource: Marking clocksource 'tsc' as unstable`: TSC skew vs HPET; the kernel falls back gracefully. Platform-level quirk on the PN51-E1, not always reproducible on every boot.
- `psp gfx command LOAD_TA(0x1) failed`: AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95`: AMD Sensor Fusion Hub, no sensors connected
- `Bluetooth: hci0: Reading supported features failed`: Bluetooth init quirk
- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO`: unused serial bus device
- `snd_hda_intel: no codecs found`: no audio device connected, headless server
- `ata2.00: supports DRM functions and may not be fully accessible`: Samsung SSD DRM quirk (pn02 only)
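The benign list is now long enough to bury real problems when scanning kernel logs, so a filter helps. A sketch; the function name `filter_benign` is mine, and the patterns are taken from the list above:

```shell
# Drop known-benign PN51 kernel messages from stdin, leaving only lines worth a look.
filter_benign() {
  grep -vE "Marking clocksource 'tsc' as unstable|psp gfx command LOAD_TA|amd_sfh_hid_client_init failed|Reading supported features failed|INT3515|no codecs found|supports DRM functions"
}

# Usage: journalctl -k -b -p warning | filter_benign
```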
### 2026-02-23: processor.max_cstate=1 and Proxmox Forums

- Found a thread on the Proxmox forums about PN51 units with similar freeze issues
- Many users report identical symptoms: random hard freezes with no log evidence
- No conclusive fix. Some have frequent freezes, others only a few times a month
- Some reported that BIOS updates helped, but results were inconsistent
- Added the `processor.max_cstate=1` kernel parameter to pn02, which limits the CPU to the C1 halt state, preventing the deep C-state sleep transitions that may trigger freezes on AMD mobile chips
- Also applied: amdgpu blacklist, panic=10, nmi_watchdog=1, softlockup/hardlockup panic, rasdaemon
### 2026-02-23: logind D-Bus Deadlock (pn02)

- A node-exporter alert fired, but the host was NOT frozen
- logind was running (PID 871) but deadlocked on D-Bus, not responding to `org.freedesktop.login1` requests
- Every node-exporter scrape blocked for 25s waiting for logind, causing scrape timeouts
- Likely related to the amdgpu blacklist: with no DRM device there is no graphical seat, and logind may have deadlocked during seat enumeration at boot
- Fix: `systemctl restart systemd-logind` + `systemctl restart prometheus-node-exporter`
- After the restart, logind responded normally and reported seat0
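To catch this state before node-exporter does, logind can be probed directly over D-Bus with a short timeout. A sketch, assuming `busctl` (systemd) is available; the wrapper name `probe_logind` is mine:

```shell
# A healthy logind answers ListSeats immediately; the deadlocked one
# blocked every caller for ~25s, which this 5s timeout would flag.
probe_logind() {
  timeout 5 busctl call org.freedesktop.login1 /org/freedesktop/login1 \
    org.freedesktop.login1.Manager ListSeats 2>/dev/null \
    || echo "logind unresponsive -- consider: systemctl restart systemd-logind"
}
```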
### 2026-02-27: pn02 Third Freeze

- pn02 crashed again after ~2 days 21 hours of uptime (the longest run so far)
- Evidence of the crash:
  - Journal file corrupted: `system.journal corrupted or uncleanly shut down`
  - Boot partition fsck: `Dirty bit is set. Fs was not properly unmounted`
  - No orderly shutdown logs from the previous boot
  - No auto-upgrade triggered
- The NMI watchdog did NOT fire and no kernel panic was logged: this is a true hard lockup below NMI level
- rasdaemon recorded nothing: no MCE, AER, or memory errors in the sqlite database
- Positive: the system auto-rebooted this time (likely the hardware watchdog), unlike previous freezes that required a manual power cycle
- `processor.max_cstate=1` may have extended the uptime (2d21h vs the previous 1h and 5.5h) but did not prevent the freeze
## Conclusion

pn02 is unreliable. After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still hard-freezes every few days. The freezes occur below NMI level with zero diagnostic output, pointing to a board-level hardware defect. The unit is not suitable for hypervisor use or any workload requiring reliability.

pn01 remains stable so far but has historically crashed as well, just less frequently. Continuing to monitor.
## Next Steps

- pn02: Consider scrapping it, or repurposing it for non-critical workloads that tolerate random reboots (auto-recovery via the hardware watchdog is now working)
- pn01: Continue monitoring. If it remains stable long-term, it may still be viable for light workloads
- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see whether they help
- For the Incus hypervisor plan: likely need different hardware. Evaluating the GMKtec G3 (Intel) as an alternative. Note: a mixed Intel/AMD cluster complicates live migration
## Diagnostics and Auto-Recovery (pn02)

Currently deployed on pn02:

```nix
boot.blacklistedKernelModules = [ "amdgpu" ];
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
```
On the next freeze, one of two things happens:

- The NMI watchdog catches it -> kernel panic with a stack trace in the logs -> auto-reboot after 10s -> we get diagnostic info
- Hard lockup below NMI level -> the SP5100 TCO hardware watchdog (10min timeout) reboots it -> confirms a board-level defect
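The hardware-watchdog leg of this can be confirmed from userspace via sysfs. A sketch; the wrapper name `watchdog_identity` is mine, and on this board the identity is expected to mention the SP5100 TCO driver:

```shell
# Print the active hardware watchdog driver's identity, or note that none is registered.
watchdog_identity() {
  cat /sys/class/watchdog/watchdog0/identity 2>/dev/null || echo "no watchdog device"
}
```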
After reboot, check:

- `ras-mc-ctl --summary`: overview of hardware errors
- `ras-mc-ctl --errors`: detailed error list
- `journalctl -b -1 -p err`: kernel logs from the crashed boot (if a panic was logged)