diff --git a/docs/plans/pn51-stability.md b/docs/plans/pn51-stability.md
new file mode 100644
index 0000000..307681c
--- /dev/null
+++ b/docs/plans/pn51-stability.md
@@ -0,0 +1,62 @@
+# ASUS PN51 Stability Testing
+
+## Overview
+
+Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to stability issues. Revisiting them to potentially add to the homelab.
+
+## Hardware
+
+| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
+|---|---|---|
+| **CPU** | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
+| **RAM** | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
+| **Storage** | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
+| **BIOS** | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |
+
+## Original Issues
+
+- **pn01**: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
+- **pn02**: Had trouble booting — would start loading kernel from installer USB then instantly reboot. When it did boot, would also freeze randomly.
+
+## Debugging Steps
+
+### 2026-02-21: Initial Setup
+
+1. **Disabled fTPM** (labeled "Security Device" in ASUS BIOS) on both units
+   - AMD Ryzen 5000 series had a known fTPM bug causing random hard freezes with no console output
+   - Both units booted the NixOS installer successfully after this change
+2. Installed NixOS on both, added to repo as `pn01` and `pn02` on VLAN 12
+3. Configured monitoring (node-exporter, promtail, nixos-exporter)
+
+### 2026-02-21: pn02 First Freeze
+
+- pn02 froze approximately 1 hour after boot
+- All three Prometheus targets went down simultaneously — hard freeze, not graceful shutdown
+- Journal on next boot: `system.journal corrupted or uncleanly shut down`
+- Kernel warnings from boot log before freeze:
+  - **TSC clocksource unstable**: `Marking clocksource 'tsc' as unstable because the skew is too large` — TSC skewing ~3.8ms over 500ms relative to HPET watchdog
+  - **AMD PSP error**: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` — Platform Security Processor failing to load trusted application
+- pn01 showed neither of these warnings and remained stable
+
+### 2026-02-21: pn02 BIOS Update
+
+- Updated pn02 BIOS to latest version from ASUS website
+- **TSC still unstable** after BIOS update — same ~3.8ms skew
+- **PSP LOAD_TA still failing** after BIOS update
+- Monitoring back up, letting it run to see if freeze recurs
+
+## Benign Kernel Errors (Both Units)
+
+These appear on both units and can be ignored:
+- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95` — AMD Sensor Fusion Hub, no sensors connected
+- `Bluetooth: hci0: Reading supported features failed` — Bluetooth init quirk
+- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO` — unused serial bus device
+- `ata2.00: supports DRM functions and may not be fully accessible` — Samsung SSD DRM quirk (pn02 only)
+
+## Next Steps
+
+- Monitor pn01 stability (fTPM disabled, no other changes needed)
+- Monitor pn02 stability after BIOS update
+- If pn02 continues to freeze, try adding `tsc=unstable` kernel parameter
+- If pn02 still freezes, may be a hardware defect on that specific unit
+- Once stable: add second RAM stick back to pn02, reinstall with NVMe
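
Review note (not part of the patch): the plan separates kernel messages it treats as benign from the two pn02 warnings that preceded the freeze. While the units are being monitored, that triage could be scripted rather than done by eye. The sketch below is only an illustration. The function names `classify` and `summarize` are hypothetical, the match strings are copied from the plan's own log excerpts, and it assumes the real log lines contain those substrings verbatim, which may not hold across kernel versions.

```python
# Substrings copied from the plan's "Benign Kernel Errors" list.
BENIGN = [
    "amd_sfh_hid_client_init failed err -95",
    "Bluetooth: hci0: Reading supported features failed",
    "INT3515:00: error -ENXIO",
    "supports DRM functions and may not be fully accessible",
]

# Substrings copied from the pn02 "First Freeze" notes.
SUSPECT = [
    "Marking clocksource 'tsc' as unstable",
    "psp gfx command LOAD_TA(0x1) failed",
]


def classify(line):
    """Return 'suspect', 'benign', or 'other' for one kernel log line."""
    # Suspect patterns are checked first so they are never masked.
    if any(s in line for s in SUSPECT):
        return "suspect"
    if any(s in line for s in BENIGN):
        return "benign"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {"suspect": 0, "benign": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts
```

One possible use is to save the output of `journalctl -k -b` to a file on each unit and pass the open file to `summarize`. A non-zero `suspect` count on pn01 would contradict the observation that only pn02 shows the TSC and PSP warnings.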