docs: add monitoring02 reboot alert investigation
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m41s
Document findings from a false-positive host_reboot alert caused by an NTP clock adjustment affecting the node_boot_time_seconds metric.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/plans/completed/monitoring02-reboot-alert-investigation.md (new file, 135 lines)
# monitoring02 Reboot Alert Investigation

**Date:** 2026-02-10

**Status:** Completed - False positive identified

## Summary

A `host_reboot` alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a **false positive** caused by NTP clock adjustments, not an actual reboot.

## Alert Details

- **Alert:** `host_reboot`
- **Rule:** `changes(node_boot_time_seconds[10m]) > 0`
- **Host:** monitoring02
- **Time:** 2026-02-10T16:27:36Z

## Investigation Findings

### Evidence Against Actual Reboot

1. **Uptime:** The system had been up for ~40 hours (143,751 seconds) at the time of the alert
2. **Consistent BOOT_ID:** All logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time
3. **No log gaps:** Logs were continuous - no shutdown/restart cycle visible
4. **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal
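The uptime figure in the evidence above can be sanity-checked with a quick conversion (a trivial sketch using the value from the findings):

```python
# Sanity check: convert the reported uptime to hours.
# 143,751 seconds of continuous uptime rules out a recent reboot.
uptime_seconds = 143_751
uptime_hours = uptime_seconds / 3600
print(round(uptime_hours, 1))  # 39.9, i.e. the "~40 hours" above
```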
### Root Cause: NTP Clock Adjustment

The `node_boot_time_seconds` metric fluctuated by 1 second due to how Linux calculates boot time:

```
btime = current_wall_clock_time - monotonic_uptime
```
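The effect of a wall-clock step on the derived boot time can be sketched in a few lines (the epoch values are hypothetical; only the 1-second step matters):

```python
# Linux derives boot time as wall_clock - monotonic_uptime.
# The monotonic clock is immune to NTP, so a stepped wall clock
# shifts the computed btime by exactly the step size.

monotonic_uptime = 143_751                 # seconds since boot, NTP-immune
wall_clock_before = 1_770_740_856          # hypothetical epoch seconds
wall_clock_after = wall_clock_before + 1   # NTP steps the clock by 1 s

btime_before = wall_clock_before - monotonic_uptime
btime_after = wall_clock_after - monotonic_uptime

print(btime_after - btime_before)  # 1 - the "reboot" the alert saw
```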

When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:

| Metric | Value |
|--------|-------|
| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
| `node_timex_sync_status` | 1 (synced) |
| Current `node_timex_offset_seconds` | ~9 ms (normal) |

The kernel's estimated maximum clock error spiked to over 1 second, causing the boot time calculation to drift momentarily.

Additionally, `systemd-resolved` logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.

## Current Time Sync Configuration

### NixOS Guests

- **NTP client:** systemd-timesyncd (NixOS default)
- **No explicit configuration** in the codebase
- Uses the default NixOS NTP server pool

### Proxmox VMs

- **Clocksource:** `kvm-clock` (optimal for KVM VMs)
- **QEMU guest agent:** Enabled
- **No additional QEMU timing args** configured

## Potential Improvements

### 1. Improve Alert Rule (Recommended)

Add tolerance to filter out small NTP adjustments:

```yaml
# Current rule (triggers on any change)
expr: changes(node_boot_time_seconds[10m]) > 0

# Improved rule (requires >60 second shift)
expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
```
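The intent of the combined expression can be illustrated outside Prometheus (a rough sketch: PromQL's `delta()` extrapolates over the range, but last-minus-first captures the idea, and the sample values are hypothetical):

```python
# Approximate the improved rule's logic over a window of
# node_boot_time_seconds samples.

def fires(samples, min_shift=60):
    """True when boot time both changed and shifted by more than min_shift."""
    changed = any(a != b for a, b in zip(samples, samples[1:]))
    net_shift = samples[-1] - samples[0]   # rough stand-in for delta()
    return changed and abs(net_shift) > min_shift

ntp_blip = [1_770_597_105, 1_770_597_106, 1_770_597_105]     # 1 s wobble
real_reboot = [1_770_597_105, 1_770_741_105, 1_770_741_105]  # big jump

print(fires(ntp_blip))     # False - filtered out
print(fires(real_reboot))  # True - genuine reboot still alerts
```

The extra `abs(delta(...))` term is what suppresses the 1-second NTP wobble while leaving real reboots (which shift boot time by hours or days) alertable.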

### 2. Switch to Chrony (Optional)

Chrony handles time adjustments more gracefully than systemd-timesyncd:

```nix
# In common/vm/qemu-guest.nix
{
  services.qemuGuest.enable = true;

  services.timesyncd.enable = false;
  services.chrony = {
    enable = true;
    extraConfig = ''
      makestep 1 3
      rtcsync
    '';
  };
}
```

### 3. Add QEMU Timing Args (Optional)

In `terraform/vms.tf`:

```hcl
args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
```

### 4. Local NTP Server (Optional)

Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.

## Monitoring NTP Health

The `node_timex_*` metrics from node_exporter provide visibility into NTP health:

```promql
# Clock offset from reference
node_timex_offset_seconds

# Sync status (1 = synced)
node_timex_sync_status

# Maximum estimated error - useful for alerting
node_timex_maxerror_seconds
```

A potential alert for NTP issues:

```yaml
- alert: ntp_clock_drift
  expr: node_timex_maxerror_seconds > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High clock drift on {{ $labels.hostname }}"
    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
```

## Conclusion

No action required for the alert itself - the system was healthy. Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.