# monitoring02 Reboot Alert Investigation

**Date:** 2026-02-10
**Status:** Completed - false positive identified

## Summary

A `host_reboot` alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a **false positive** caused by NTP clock adjustments, not an actual reboot.

## Alert Details

- **Alert:** `host_reboot`
- **Rule:** `changes(node_boot_time_seconds[10m]) > 0`
- **Host:** monitoring02
- **Time:** 2026-02-10T16:27:36Z

## Investigation Findings

### Evidence Against an Actual Reboot

1. **Uptime:** The system had been up for ~40 hours (143,751 seconds) at the time of the alert.
2. **Consistent BOOT_ID:** All logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time.
3. **No log gaps:** Logs were continuous; no shutdown/restart cycle was visible.
4. **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal.

### Root Cause: NTP Clock Adjustment

The `node_boot_time_seconds` metric fluctuated by 1 second due to how Linux calculates boot time:

```
btime = current_wall_clock_time - monotonic_uptime
```

When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:

| Metric | Value |
|--------|-------|
| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
| `node_timex_sync_status` | 1 (synced) |
| Current `node_timex_offset_seconds` | ~9 ms (normal) |

The kernel's estimated maximum clock error spiked to over 1 second, causing the boot-time calculation to drift momentarily. Additionally, `systemd-resolved` logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.
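The failure mode above can be illustrated with a minimal sketch of the `btime` calculation. The numeric values below are hypothetical, chosen only to mirror the incident (~40 h uptime, a 1-second NTP step):

```python
def boot_time(wall_clock: float, monotonic_uptime: float) -> float:
    """Linux derives btime as wall-clock time minus monotonic uptime."""
    return wall_clock - monotonic_uptime

# Hypothetical values mirroring the incident: ~40h of uptime.
uptime = 143_751.0
wall_before = 1_770_740_856.0   # wall clock just before the NTP step
wall_after = wall_before + 1.0  # NTP steps the clock forward by 1 second

btime_before = boot_time(wall_before, uptime)
btime_after = boot_time(wall_after, uptime)

# The host never rebooted, but the reported boot time shifts by the
# same 1 second as the wall clock -- which is what tripped the alert.
assert btime_after - btime_before == 1.0
```

Monotonic uptime is unaffected by NTP, so only the wall-clock term moves; the derived boot time inherits every clock step.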
## Current Time Sync Configuration

### NixOS Guests

- **NTP client:** systemd-timesyncd (NixOS default)
- **No explicit configuration** in the codebase
- Uses the default NixOS NTP server pool

### Proxmox VMs

- **Clocksource:** `kvm-clock` (optimal for KVM VMs)
- **QEMU guest agent:** Enabled
- **No additional QEMU timing args** configured

## Potential Improvements

### 1. Improve the Alert Rule (Recommended)

Add tolerance to filter out small NTP adjustments:

```yaml
# Current rule (triggers on any change)
expr: changes(node_boot_time_seconds[10m]) > 0

# Improved rule (requires a >60-second shift)
expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
```

### 2. Switch to Chrony (Optional)

Chrony handles time adjustments more gracefully than systemd-timesyncd:

```nix
# In common/vm/qemu-guest.nix
{
  services.qemuGuest.enable = true;
  services.timesyncd.enable = false;
  services.chrony = {
    enable = true;
    extraConfig = ''
      makestep 1 3
      rtcsync
    '';
  };
}
```

### 3. Add QEMU Timing Args (Optional)

In `terraform/vms.tf`:

```hcl
args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
```

### 4. Local NTP Server (Optional)

Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.

## Monitoring NTP Health

The `node_timex_*` metrics from node_exporter provide visibility into NTP health:

```promql
# Clock offset from the reference clock
node_timex_offset_seconds

# Sync status (1 = synced)
node_timex_sync_status

# Maximum estimated error - useful for alerting
node_timex_maxerror_seconds
```

A potential alert for NTP issues:

```yaml
- alert: ntp_clock_drift
  expr: node_timex_maxerror_seconds > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High clock drift on {{ $labels.hostname }}"
    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
```

## Conclusion

No action required for the alert itself - the system was healthy.
Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.
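The improved rule's filtering behavior can be sanity-checked with a small Python sketch of the PromQL semantics. This is a simplification (real PromQL `delta()` extrapolates over the range; here it is just last-minus-first), and the sample values are hypothetical:

```python
def changes(samples: list[float]) -> int:
    """Count value changes across a sample window (simplified PromQL changes())."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

def delta(samples: list[float]) -> float:
    """Last sample minus first sample (simplified PromQL delta())."""
    return samples[-1] - samples[0]

def reboot_alert(samples: list[float], tolerance: float = 60.0) -> bool:
    """Improved rule: fire only when boot time shifts by more than `tolerance`."""
    return changes(samples) > 0 and abs(delta(samples)) > tolerance

# NTP jitter: btime wobbles by 1 second, then returns to its old value.
ntp_jitter = [1_770_000_000.0, 1_770_000_001.0, 1_770_000_000.0]
# Real reboot: btime jumps forward by roughly the previous uptime.
real_reboot = [1_770_000_000.0, 1_770_143_751.0]

assert not reboot_alert(ntp_jitter)  # filtered out by the tolerance
assert reboot_alert(real_reboot)     # still fires on a genuine reboot
```

A real reboot moves `node_boot_time_seconds` by roughly the previous uptime (hours or days), so a 60-second tolerance leaves plenty of margin on both sides.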