pn51: document diagnostic config (rasdaemon, NMI watchdog, panic)
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -86,6 +86,11 @@ Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to
 - Survived the 1h stress test earlier but froze at idle later — not thermal
 - pn01 remains stable throughout
 - **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output but managed via SSH.
+- **Action**: Added diagnostic/recovery config to pn02:
+  - `panic=10` + `nmi_watchdog=1` kernel params — auto-reboot after 10s on panic
+  - `softlockup_panic` + `hardlockup_panic` sysctls — convert lockups to panics with stack traces
+  - `hardware.rasdaemon` with recording — logs hardware errors (MCE, PCIe AER, memory) to sqlite database, survives reboots
+  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
 
 ## Benign Kernel Errors (Both Units)
 
@@ -111,14 +116,23 @@ These appear on both units and can be ignored:
 - Once stable: add second RAM stick back to pn02, reinstall with NVMe
 - Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
 
-## Fallback: Auto-Recovery from Freezes
+## Diagnostics and Auto-Recovery (pn02)
 
-If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
+Currently deployed on pn02:
 
 ```nix
 boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
 boot.kernel.sysctl."kernel.softlockup_panic" = 1;
 boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
+hardware.rasdaemon.enable = true;
+hardware.rasdaemon.record = true;
 ```
 
-This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.
+**On next freeze, one of two things happens:**
+1. **NMI watchdog catches it** -> kernel panic with stack trace in logs -> auto-reboot after 10s -> we get diagnostic info
+2. **Hard lockup below NMI level** -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> confirms board-level defect
+
+**After reboot, check:**
+- `ras-mc-ctl --summary` — overview of hardware errors
+- `ras-mc-ctl --errors` — detailed error list
+- `journalctl -b -1 -p err` — kernel logs from crashed boot (if panic was logged)
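The "After reboot, check" steps added in this commit could be bundled into one script for convenience. The following is a minimal sketch, not part of the commit. The function names `run_check` and `pn02_triage` are hypothetical, and the `/proc/cmdline` and `sysctl` reads are an added assumption, included to confirm that the kernel params and sysctls are active. With `panic=10` set, `kernel.panic` should report 10.

```shell
#!/bin/sh
# Post-reboot triage sketch for pn02 (hypothetical helper, not in the repo).
# ras-mc-ctl and journalctl are invoked with the flags listed in the notes;
# the /proc/cmdline and sysctl reads are additions for verifying the config.

# Print a header, then run the given command if it is installed.
# A missing or failing command is reported without aborting the script.
run_check() {
    label=$1
    shift
    printf '== %s ==\n' "$label"
    if ! command -v "$1" >/dev/null 2>&1; then
        printf 'skipped: %s not found\n' "$1"
        return 0
    fi
    "$@" || printf '(%s exited with status %s)\n' "$1" "$?"
}

# Run every check in order: active settings first, then recorded errors.
pn02_triage() {
    run_check 'kernel command line' cat /proc/cmdline
    run_check 'panic and watchdog sysctls' sysctl \
        kernel.panic kernel.nmi_watchdog \
        kernel.softlockup_panic kernel.hardlockup_panic
    run_check 'rasdaemon summary' ras-mc-ctl --summary
    run_check 'rasdaemon error list' ras-mc-ctl --errors
    run_check 'previous boot, priority err and worse' \
        journalctl -b -1 -p err --no-pager
}

pn02_triage
```

Running it as root is probably necessary for `ras-mc-ctl` to read its database and for `journalctl` to see the previous boot. If the previous boot ended in a hard lockup that only the hardware watchdog cleared, the `journalctl` section may show nothing useful, which matches the second scenario described in the notes.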