pn51: document diagnostic config (rasdaemon, NMI watchdog, panic)
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -86,6 +86,11 @@ Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to
|
||||
- Survived the 1h stress test earlier but froze at idle later — not thermal
|
||||
- pn01 remains stable throughout
|
||||
- **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. No console output but managed via SSH.
|
||||
- **Action**: Added diagnostic/recovery config to pn02:
|
||||
- `panic=10` + `nmi_watchdog=1` kernel params — auto-reboot after 10s on panic
|
||||
- `softlockup_panic` + `hardlockup_panic` sysctls — convert lockups to panics with stack traces
|
||||
- `hardware.rasdaemon` with recording — logs hardware errors (MCE, PCIe AER, memory) to sqlite database, survives reboots
|
||||
- Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
|
||||
|
||||
## Benign Kernel Errors (Both Units)
|
||||
|
||||
@@ -111,14 +116,23 @@ These appear on both units and can be ignored:
|
||||
- Once stable: add second RAM stick back to pn02, reinstall with NVMe
|
||||
- Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
|
||||
|
||||
## Fallback: Auto-Recovery from Freezes
|
||||
## Diagnostics and Auto-Recovery (pn02)
|
||||
|
||||
If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
|
||||
Currently deployed on pn02:
|
||||
|
||||
```nix
|
||||
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
|
||||
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
|
||||
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
|
||||
hardware.rasdaemon.enable = true;
|
||||
hardware.rasdaemon.record = true;
|
||||
```
|
||||
|
||||
This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.
|
||||
**On next freeze, one of two things happens:**
|
||||
1. **NMI watchdog catches it** -> kernel panic with stack trace in logs -> auto-reboot after 10s -> we get diagnostic info
|
||||
2. **Hard lockup below NMI level** -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> confirms board-level defect
|
||||
|
||||
**After reboot, check:**
|
||||
- `ras-mc-ctl --summary` — overview of hardware errors
|
||||
- `ras-mc-ctl --errors` — detailed error list
|
||||
- `journalctl -b -1 -p err` — kernel logs from crashed boot (if panic was logged)
|
||||
|
||||
Reference in New Issue
Block a user