pn51: document stress test pass and TSC runtime test failure
Some checks failed
Run nix flake check / flake-check (push) Failing after 17m0s
Both units survived 1h stress test at 80-85C. TSC clocksource is genuinely unstable at runtime (not just boot), HPET is the correct fallback for this platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -54,6 +54,23 @@ Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to
 - The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
 - Neither issue is likely the cause of the hard freezes — the fTPM bug remains the primary suspect
 
+### 2026-02-22: Stress Test (1 hour)
+
+- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
+- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
+- Both survived the full hour with no freezes, no MCE errors, no kernel issues
+- No concerning log entries during or after the test
+
+### 2026-02-22: TSC Runtime Switch Test
+
+- Attempted to switch clocksource back to TSC at runtime on pn01:
+```
+echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
+```
+- Kernel watchdog immediately reverted to HPET — TSC skew is ongoing, not just a boot-time issue
+- **Conclusion**: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
+- For virtualization (Incus), this means guest VMs will use HPET-backed timing. Performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.
+
 ## Benign Kernel Errors (Both Units)
 
 These appear on both units and can be ignored:
@@ -62,19 +79,13 @@ These appear on both units and can be ignored:
 - `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95` — AMD Sensor Fusion Hub, no sensors connected
 - `Bluetooth: hci0: Reading supported features failed` — Bluetooth init quirk
 - `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO` — unused serial bus device
+- `snd_hda_intel: no codecs found` — no audio device connected, headless server
 - `ata2.00: supports DRM functions and may not be fully accessible` — Samsung SSD DRM quirk (pn02 only)
 
 ## Next Steps
 
-- Monitor both units for stability (fTPM disabled on both, BIOS updated on pn02)
-- **Test TSC stability after boot**: The TSC skew may only occur during early boot (power state transitions, frequency scaling) and stabilize later. Test by switching clocksource back to TSC at runtime:
-```bash
-# Check current clocksource
-cat /sys/devices/system/clocksource/clocksource0/current_clocksource
-# Switch back to TSC (kernel watchdog will revert if skew persists)
-echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
-```
-If TSC stays stable after boot, this is just an early-boot calibration issue. This matters for virtualization performance — HPET is 50-100x slower than TSC per timing call, and KVM guests rely on the host clocksource.
+- Monitor both units for stability over the next few days
 - If either freezes again, try disabling unused hardware in BIOS (GPU, WiFi, Bluetooth, audio)
 - If still freezing, may be a hardware defect
 - Once stable: add second RAM stick back to pn02, reinstall with NVMe
+- Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
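
The runtime switch test recorded in this diff could be scripted so that the outcome is captured rather than read off by hand. What follows is a minimal sketch and is not part of the commit. The function name `try_clocksource`, the `CLOCKSOURCE_DIR` override, and the one-second settle delay are assumptions made for illustration; only the two sysfs files, `current_clocksource` and `available_clocksource`, are taken from the kernel's clocksource interface. Writing to the real sysfs node requires root.

```shell
#!/usr/bin/env bash
# Sketch: request a clocksource and report whether the kernel kept it.
# CLOCKSOURCE_DIR is a hypothetical override so the logic can be tried against
# a scratch directory; by default it points at the real sysfs node.
CLOCKSOURCE_DIR="${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}"

try_clocksource() {
    local want="$1"
    local settle="${2:-1}"   # seconds to wait before re-reading (assumed value)
    local before after

    before="$(cat "$CLOCKSOURCE_DIR/current_clocksource")" || return 2

    # Refuse early if the requested source is not listed as available.
    # A source the kernel has already demoted may not appear in this list.
    if ! tr ' ' '\n' < "$CLOCKSOURCE_DIR/available_clocksource" | grep -qx -- "$want"; then
        echo "unavailable: $want (current: $before)"
        return 2
    fi

    # Ask for the switch, give the kernel a moment, then read back the result.
    echo "$want" > "$CLOCKSOURCE_DIR/current_clocksource" || return 2
    sleep "$settle"
    after="$(cat "$CLOCKSOURCE_DIR/current_clocksource")"

    if [ "$after" = "$want" ]; then
        echo "kept: $want (was: $before)"
        return 0
    fi
    echo "reverted: requested $want, now $after"
    return 1
}
```

Calling `try_clocksource tsc` returns 0 only if the requested source is still current after the delay. On a unit where TSC is rejected, the expectation is a `reverted` or `unavailable` line and a non-zero exit status, which makes the check usable from a monitoring job.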