# ASUS PN51 Stability Testing

## Overview

Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U), purchased years ago but shelved due to stability issues. Revisiting them to potentially add to the homelab.

## Hardware

| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
|---|---|---|
| **CPU** | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
| **RAM** | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
| **Storage** | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
| **BIOS** | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |

## Original Issues

- **pn01**: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
- **pn02**: Had trouble booting — would start loading the kernel from the installer USB, then instantly reboot. When it did boot, it would also freeze randomly.

## Debugging Steps

### 2026-02-21: Initial Setup

1. **Disabled fTPM** (labeled "Security Device" in the ASUS BIOS) on both units
   - AMD Ryzen 5000 series has a known fTPM bug causing random hard freezes with no console output
   - Both units booted the NixOS installer successfully after this change
2. Installed NixOS on both, added to the repo as `pn01` and `pn02` on VLAN 12
3. Configured monitoring (node-exporter, promtail, nixos-exporter)

### 2026-02-21: pn02 First Freeze

- pn02 froze approximately 1 hour after boot
- All three Prometheus targets went down simultaneously — hard freeze, not graceful shutdown
- Journal on next boot: `system.journal corrupted or uncleanly shut down`
- Kernel warnings from the boot log before the freeze:
  - **TSC clocksource unstable**: `Marking clocksource 'tsc' as unstable because the skew is too large` — TSC skewing ~3.8ms over 500ms relative to the HPET watchdog
  - **AMD PSP error**: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` — Platform Security Processor failing to load a trusted application
- pn01 did not show these warnings on this particular boot, but has shown them historically (see below)

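For scale, the reported skew can be sanity-checked with quick arithmetic (figures from the log message above; the ~100 ppm crystal-tolerance comparison is a general rule of thumb, not from the logs):

```shell
# ~3.8 ms of TSC-vs-HPET skew over the watchdog's ~500 ms comparison window
# is thousands of ppm of drift, versus roughly 100 ppm expected from a sane
# oscillator, so the kernel is right to demote the TSC.
skew_us=3800          # observed skew, microseconds
interval_us=500000    # comparison interval, microseconds
echo "$(( skew_us * 1000000 / interval_us )) ppm"   # prints: 7600 ppm
```
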
### 2026-02-21: pn02 BIOS Update

- Updated pn02 BIOS to the latest version from the ASUS website
- **TSC still unstable** after the BIOS update — same ~3.8ms skew
- **PSP LOAD_TA still failing** after the BIOS update
- Monitoring back up; letting it run to see if the freeze recurs

### 2026-02-22: TSC/PSP Confirmed on Both Units

- Checked kernel logs after ~9 hours of uptime — both units still running
- **pn01 now shows the TSC unstable and PSP LOAD_TA failures** on this boot (same ~3.8ms TSC skew, same PSP error)
- pn01 had these same issues when tested years ago — the earlier clean boot was just lucky TSC calibration timing
- **Conclusion**: TSC instability and the PSP LOAD_TA failure are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
  - The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
  - Neither issue is likely the cause of the hard freezes — the fTPM bug remains the primary suspect

### 2026-02-22: Stress Test (1 hour)

- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
- Both survived the full hour with no freezes, no MCE errors, no kernel issues
- No concerning log entries during or after the test

### 2026-02-22: TSC Runtime Switch Test

- Attempted to switch the clocksource back to TSC at runtime on pn01:

```
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
```

- The kernel watchdog immediately reverted to HPET — the TSC skew is ongoing, not just a boot-time issue
- **Conclusion**: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
- For virtualization (Incus), this means guest VMs will use HPET-backed timing. The performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.

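What the kernel actually settled on can be read back from sysfs (the standard clocksource interface; shown as a sketch, output varies per machine):

```shell
# After the watchdog demotes the TSC, current_clocksource should read hpet
# on these units; falls back gracefully where the interface is absent.
cs=/sys/devices/system/clocksource/clocksource0
cat "$cs/current_clocksource" 2>/dev/null || echo "clocksource sysfs not available"
cat "$cs/available_clocksource" 2>/dev/null || true
```
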
### 2026-02-22: BIOS Tweaks (Both Units)

- Disabled ErP Ready on both (EU power-efficiency mode — aggressively cuts power at idle)
- Disabled WiFi and Bluetooth in the BIOS on both
- **TSC still unstable** after these changes — same ~3.8ms skew on both units
- ErP/power states are not the cause of the TSC issue

### 2026-02-22: pn02 Second Freeze

- pn02 froze again ~5.5 hours after boot (at idle, not under load)
- All Prometheus targets down simultaneously — same hard freeze pattern
- Last log entry was normal nix-daemon activity — zero warning/error logs before the crash
- Survived the 1h stress test earlier but froze at idle later — not thermal
- pn01 remains stable throughout
- **Action**: Blacklisted the `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output without it, but the unit is managed via SSH.
- **Action**: Added diagnostic/recovery config to pn02:
  - `panic=10` + `nmi_watchdog=1` kernel params — auto-reboot 10s after a panic
  - `softlockup_panic` + `hardlockup_panic` sysctls — convert lockups to panics with stack traces
  - `hardware.rasdaemon` with recording — logs hardware errors (MCE, PCIe AER, memory) to a sqlite database that survives reboots
  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`

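After deploying, the knobs can be spot-checked at runtime. A minimal sketch using the standard procfs paths (output is host-dependent, not captured from pn02):

```shell
# The kernel cmdline should now carry panic=10 and nmi_watchdog=1 ...
grep -o 'panic=[0-9]*' /proc/cmdline || echo "panic= not set on this host"
# ... and both lockup sysctls should read 1
cat /proc/sys/kernel/softlockup_panic /proc/sys/kernel/hardlockup_panic 2>/dev/null \
  || echo "lockup sysctls not exposed"
```
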
## Benign Kernel Errors (Both Units)

These appear on both units and can be ignored:

- `clocksource: Marking clocksource 'tsc' as unstable` — TSC skew vs HPET, kernel falls back gracefully. Platform-level quirk on the PN51-E1, not always reproducible on every boot.
- `psp gfx command LOAD_TA(0x1) failed` — AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95` — AMD Sensor Fusion Hub, no sensors connected
- `Bluetooth: hci0: Reading supported features failed` — Bluetooth init quirk
- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO` — unused serial bus device
- `snd_hda_intel: no codecs found` — no audio device connected, headless server
- `ata2.00: supports DRM functions and may not be fully accessible` — Samsung SSD DRM quirk (pn02 only)

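One practical use of this list is as a grep filter, so only unexpected kernel errors stand out. A minimal sketch (the pattern mirrors the bullets above; the two input lines are made-up samples, not real pn01/pn02 log output):

```shell
# Drop known-benign PN51 noise; anything that survives deserves a look.
benign="Marking clocksource 'tsc' as unstable|LOAD_TA|amd_sfh_hid_client_init|Reading supported features failed|INT3515|no codecs found|supports DRM functions"
printf '%s\n' \
  "clocksource: Marking clocksource 'tsc' as unstable because the skew is too large" \
  "mce: [Hardware Error] CPU 0: Machine Check" \
  | grep -Ev "$benign"    # leaves only the mce line
```

In practice the input would be `journalctl -k` output piped through the same `grep -Ev`.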
### 2026-02-23: processor.max_cstate=1 and Proxmox Forums

- Found a thread on the Proxmox forums about PN51 units with similar freeze issues
  - Many users report identical symptoms — random hard freezes, no log evidence
  - No conclusive fix. Some see frequent freezes, others only a few per month
  - Some reported that BIOS updates helped, but results are inconsistent
- Added the `processor.max_cstate=1` kernel parameter to pn02 — limits the CPU to the C1 halt state, preventing the deep C-state sleep transitions that may trigger freezes on AMD mobile chips
- Also still applied: amdgpu blacklist, panic=10, nmi_watchdog=1, softlockup/hardlockup panic, rasdaemon

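Whether the cap actually applied can be read back at runtime. A sketch, assuming the kernel exposes the `processor` module's parameters in the usual sysfs location (not every kernel does):

```shell
# Should print 1 on pn02 after the rebuild; falls back gracefully elsewhere.
cat /sys/module/processor/parameters/max_cstate 2>/dev/null \
  || echo "processor.max_cstate parameter not exposed on this kernel"
```
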
### 2026-02-23: logind D-Bus Deadlock (pn02)

- A node-exporter alert fired — but the host was NOT frozen
- logind was running (PID 871) but deadlocked on D-Bus — not responding to `org.freedesktop.login1` requests
- Every node-exporter scrape blocked for 25s waiting for logind, causing scrape timeouts
- Likely related to the amdgpu blacklist — with no DRM device there is no graphical seat, and logind may have deadlocked during seat enumeration at boot
- Fix: `systemctl restart systemd-logind` + `systemctl restart prometheus-node-exporter`
- After the restart, logind responded normally and reported seat0

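A cheap liveness probe for this failure mode (a sketch using standard systemd tooling; the 5s timeout is an arbitrary choice):

```shell
# A healthy logind answers immediately; a D-Bus-deadlocked one hangs until
# the timeout, which is the tell to restart systemd-logind.
timeout 5 loginctl list-seats || echo "logind unresponsive (or loginctl unavailable)"
```
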
### 2026-02-27: pn02 Third Freeze

- pn02 crashed again after ~2 days 21 hours of uptime (longest run so far)
- Evidence of the crash:
  - Journal file corrupted: `system.journal corrupted or uncleanly shut down`
  - Boot partition fsck: `Dirty bit is set. Fs was not properly unmounted`
  - No orderly shutdown logs from the previous boot
  - No auto-upgrade triggered
- **NMI watchdog did NOT fire** — no kernel panic logged. This is a true hard lockup below NMI level
- **rasdaemon recorded nothing** — no MCE, AER, or memory errors in the sqlite database
- **Positive**: The system auto-rebooted this time (likely the hardware watchdog), unlike previous freezes that required a manual power cycle
- `processor.max_cstate=1` may have extended uptime (2d21h vs the previous 1h and 5.5h) but did not prevent the freeze

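Condensing the freeze history so far (simple arithmetic on figures from the entries above):

```shell
# Uptime before each pn02 freeze; the post-max_cstate run is longer, but a
# single longer run is weak evidence that the parameter helped.
first=1
second=5.5
third=$(( 2 * 24 + 21 ))   # 2d21h expressed in hours
echo "freeze intervals: ${first}h, ${second}h, ${third}h"   # prints: freeze intervals: 1h, 5.5h, 69h
```
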
## Conclusion

**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still hard-freezes every few days. The freeze occurs below NMI level with zero diagnostic output, pointing to a board-level hardware defect. The unit is not suitable for hypervisor use or any workload requiring reliability.

**pn01 remains stable** so far, but it has historically crashed as well, just less frequently. Continuing to monitor.

## Next Steps

- **pn02**: Consider scrapping it, or repurposing it for non-critical workloads that tolerate random reboots (auto-recovery via the hardware watchdog is now working)
- **pn01**: Continue monitoring. If it remains stable long-term, it may still be viable for light workloads
  - If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
- For the Incus hypervisor plan: likely need different hardware. Evaluating the GMKtec G3 (Intel) as an alternative. Note: a mixed Intel/AMD cluster complicates live migration

## Diagnostics and Auto-Recovery (pn02)

Currently deployed on pn02:

```nix
boot.blacklistedKernelModules = [ "amdgpu" ];
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
```

**On next freeze, one of two things happens:**

1. **NMI watchdog catches it** -> kernel panic with a stack trace in the logs -> auto-reboot after 10s -> we get diagnostic info
2. **Hard lockup below NMI level** -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> confirms a board-level defect

**After reboot, check:**

- `ras-mc-ctl --summary` — overview of hardware errors
- `ras-mc-ctl --errors` — detailed error list
- `journalctl -b -1 -p err` — kernel logs from the crashed boot (if a panic was logged)