# monitoring02 Reboot Alert Investigation

**Date:** 2026-02-10
**Status:** Completed - false positive identified

## Summary
A host_reboot alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a false positive caused by NTP clock adjustments, not an actual reboot.
## Alert Details

- Alert: `host_reboot`
- Rule: `changes(node_boot_time_seconds[10m]) > 0`
- Host: monitoring02
- Time: 2026-02-10T16:27:36Z
## Investigation Findings

### Evidence Against Actual Reboot

- **Uptime:** the system had been up for ~40 hours (143,751 seconds) at the time of the alert
- **Consistent BOOT_ID:** all logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time
- **No log gaps:** logs were continuous, with no shutdown/restart cycle visible
- **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal
### Root Cause: NTP Clock Adjustment

The `node_boot_time_seconds` metric fluctuated by 1 second due to how Linux calculates boot time:

```
btime = current_wall_clock_time - monotonic_uptime
```

When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:
| Metric | Value |
|---|---|
| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
| `node_timex_sync_status` | 1 (synced) |
| Current `node_timex_offset_seconds` | ~9 ms (normal) |
The kernel's estimated maximum clock error spiked to over 1 second, causing the boot time calculation to drift momentarily.
Additionally, systemd-resolved logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.
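The mechanism can be sketched with a toy calculation (all timestamp values below are hypothetical; only the ~40-hour uptime comes from the investigation):

```python
def boot_time(wall_clock: float, monotonic_uptime: float) -> float:
    """Linux derives boot time (btime) as wall-clock time minus monotonic uptime."""
    return wall_clock - monotonic_uptime

uptime = 143_751.0       # ~40 hours of uptime at alert time (from the log evidence)
wall = 1_770_740_856.0   # hypothetical wall-clock timestamp, in epoch seconds

before = boot_time(wall, uptime)
# NTP steps the wall clock back 1 second; the monotonic clock is unaffected,
# so the derived boot time shifts by the same -1 second even though no reboot occurred.
after = boot_time(wall - 1.0, uptime)

print(after - before)  # -1.0
```

Once NTP converges and the wall clock settles, the derived boot time settles too, which matches the observed blip-and-return in the metric.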
## Current Time Sync Configuration

### NixOS Guests

- NTP client: systemd-timesyncd (the NixOS default)
- No explicit configuration in the codebase
- Uses the default NixOS NTP server pool

### Proxmox VMs

- Clocksource: `kvm-clock` (optimal for KVM VMs)
- QEMU guest agent: enabled
- No additional QEMU timing args configured
## Potential Improvements

### 1. Improve Alert Rule (Recommended)

Add tolerance to filter out small NTP adjustments:

```yaml
# Current rule (triggers on any change)
expr: changes(node_boot_time_seconds[10m]) > 0

# Improved rule (requires a >60 second shift)
expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
```
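A toy model shows why the tolerance helps (simplified semantics: `changes()` counts value changes in the window, `delta()` is roughly last-minus-first, ignoring PromQL's extrapolation; the sample values are hypothetical):

```python
def changes(samples: list[int]) -> int:
    # Count how many adjacent sample pairs differ, like PromQL changes().
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

def delta(samples: list[int]) -> int:
    # Last minus first sample; real PromQL delta() additionally extrapolates.
    return samples[-1] - samples[0]

# node_boot_time_seconds during a 1-second NTP blip that reverts:
ntp_blip = [1770597105, 1770597104, 1770597105, 1770597105]
# ...and during a genuine reboot after ~40 hours of uptime:
reboot = [1770597105, 1770597105, 1770740856, 1770740856]

for samples in (ntp_blip, reboot):
    current = changes(samples) > 0
    improved = changes(samples) > 0 and abs(delta(samples)) > 60
    print(current, improved)
# NTP blip: True False  (only the current rule fires)
# Reboot:   True True   (both rules fire)
```

A real reboot moves boot time by the full previous uptime (here ~143,751 seconds), so the 60-second threshold filters NTP noise without masking reboots.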
### 2. Switch to Chrony (Optional)

Chrony handles time adjustments more gracefully than systemd-timesyncd:

```nix
# In common/vm/qemu-guest.nix
{
  services.qemuGuest.enable = true;
  services.timesyncd.enable = false;
  services.chrony = {
    enable = true;
    extraConfig = ''
      makestep 1 3
      rtcsync
    '';
  };
}
```
### 3. Add QEMU Timing Args (Optional)

In `terraform/vms.tf`:

```hcl
args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
```
### 4. Local NTP Server (Optional)
Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.
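A minimal sketch of what that could look like on ns1/ns2, assuming they are NixOS hosts using the chrony module (the subnet and stratum values are placeholders, not taken from this codebase):

```nix
# Hypothetical: serve NTP to the local network from ns1/ns2 via chrony.
{
  services.chrony = {
    enable = true;
    extraConfig = ''
      # Allow NTP clients on the local subnet (placeholder range).
      allow 10.0.0.0/24
      # Keep serving local time if upstream servers become unreachable.
      local stratum 10
    '';
  };
  # NTP listens on UDP port 123.
  networking.firewall.allowedUDPPorts = [ 123 ];
}
```

Other hosts would then point their NTP client at ns1/ns2 instead of the public pool.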
## Monitoring NTP Health

The `node_timex_*` metrics from node_exporter provide visibility into NTP health:

```promql
# Clock offset from the reference clock
node_timex_offset_seconds

# Sync status (1 = synced)
node_timex_sync_status

# Maximum estimated error - useful for alerting
node_timex_maxerror_seconds
```
A potential alert for NTP issues:

```yaml
- alert: ntp_clock_drift
  expr: node_timex_maxerror_seconds > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High clock drift on {{ $labels.hostname }}"
    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
```
## Conclusion

No action is required for the alert itself - the system was healthy. Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.