pn51: document diagnostic config (rasdaemon, NMI watchdog, panic)
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:52:34 +01:00
parent 72acaa872b
commit c8cadd09c5

@@ -86,6 +86,11 @@ Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to
- Survived the 1h stress test earlier but froze at idle later, so the cause is not thermal
- pn01 remains stable throughout
- **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. This leaves pn02 without console output, but it is still managed via SSH.
- **Action**: Added diagnostic/recovery config to pn02:
  - `panic=10` and `nmi_watchdog=1` kernel params: enable the NMI watchdog and auto-reboot 10s after a panic
  - `softlockup_panic` and `hardlockup_panic` sysctls: convert lockups to panics with stack traces
  - `hardware.rasdaemon` with recording: logs hardware errors (MCE, PCIe AER, memory) to an SQLite database that survives reboots
  - Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
## Benign Kernel Errors (Both Units)
@@ -111,14 +116,23 @@ These appear on both units and can be ignored:
- Once stable: add second RAM stick back to pn02, reinstall with NVMe
- Evaluate for Incus hypervisor use (see `nixos-hypervisor.md`)
-## Fallback: Auto-Recovery from Freezes
+## Diagnostics and Auto-Recovery (pn02)
-If pn02 remains unstable but we want to keep it running for non-critical workloads, configure the kernel to auto-reboot on lockups instead of hanging forever:
+Currently deployed on pn02:
```nix
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
```
-This would: detect hard/soft lockup -> kernel panic -> auto-reboot after 10 seconds. The SP5100 TCO hardware watchdog (already enabled, 10min timeout) serves as a secondary safety net. Doesn't fix the root cause but avoids requiring physical intervention.
+**On next freeze, one of two things happens:**
1. **NMI watchdog catches it** -> kernel panic with stack trace in logs -> auto-reboot after 10s -> we get diagnostic info
2. **Hard lockup below NMI level** -> SP5100 TCO hardware watchdog (10min timeout) reboots it -> no panic is logged, which points to a board-level defect
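
Both outcomes depend on the settings actually being live on the running kernel, so it may be worth confirming that before waiting for the next freeze. A minimal sketch, assuming a small helper is acceptable on pn02; the command name `pn02-lockup-status` is hypothetical and not part of the existing config. It only reads `/proc/cmdline` and the matching files under `/proc/sys/kernel`.

```nix
{ pkgs, ... }:
{
  environment.systemPackages = [
    # "pn02-lockup-status" is a hypothetical name chosen for this sketch.
    (pkgs.writeShellScriptBin "pn02-lockup-status" ''
      # Kernel command line: expect panic=10 and nmi_watchdog=1 here.
      echo "cmdline: $(cat /proc/cmdline)"

      # Panic timeout, NMI watchdog state, and the lockup-to-panic switches.
      for name in panic nmi_watchdog softlockup_panic hardlockup_panic; do
        f="/proc/sys/kernel/$name"
        if [ -r "$f" ]; then
          echo "kernel.$name = $(cat "$f")"
        else
          # Some sysctls only exist when the kernel has the matching detector built in.
          echo "kernel.$name: not available on this kernel"
        fi
      done
    '')
  ];
}
```

With the deployed config, `kernel.panic` should read 10 and the other three should read 1.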
**After reboot, check:**
- `ras-mc-ctl --summary` — overview of hardware errors
- `ras-mc-ctl --errors` — detailed error list
- `journalctl -b -1 -p err` — kernel logs from crashed boot (if panic was logged)
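
The three checks above could be bundled into one command so they are run the same way after every freeze. This is a sketch under assumptions, not part of the deployed config: the name `pn02-postmortem` is hypothetical, `ras-mc-ctl` is assumed to be on `PATH` once `hardware.rasdaemon.enable` is set, and the script is assumed to run as root so it can read the rasdaemon database and the system journal.

```nix
{ pkgs, ... }:
{
  environment.systemPackages = [
    # "pn02-postmortem" is a hypothetical name chosen for this sketch.
    (pkgs.writeShellScriptBin "pn02-postmortem" ''
      echo "== rasdaemon summary =="
      ras-mc-ctl --summary || echo "ras-mc-ctl --summary failed"

      echo "== rasdaemon errors =="
      ras-mc-ctl --errors || echo "ras-mc-ctl --errors failed"

      # -b -1 selects the previous boot, -p err limits output to priority err and above.
      echo "== errors from the previous boot =="
      journalctl -b -1 -p err --no-pager || echo "no journal available for the previous boot"
    '')
  ];
}
```

Each step falls through to a message instead of aborting, so a missing journal for the crashed boot does not hide the rasdaemon output.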