nixos-servers/docs/plans/completed/monitoring02-reboot-alert-investigation.md

Commit 98ea679ef2 by Torjus Håkestad, 2026-02-10 17:59:53 +01:00:
docs: add monitoring02 reboot alert investigation

Document findings from false positive host_reboot alert caused by
NTP clock adjustment affecting node_boot_time_seconds metric.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

monitoring02 Reboot Alert Investigation

Date: 2026-02-10
Status: Completed - false positive identified

Summary

A host_reboot alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a false positive caused by NTP clock adjustments, not an actual reboot.

Alert Details

  • Alert: host_reboot
  • Rule: changes(node_boot_time_seconds[10m]) > 0
  • Host: monitoring02
  • Time: 2026-02-10T16:27:36Z

Investigation Findings

Evidence Against Actual Reboot

  1. Uptime: The system had been up for ~40 hours (143,751 seconds) at the time of the alert
  2. Consistent BOOT_ID: All logs showed the same systemd BOOT_ID (fd26e7f3d86f4cd688d1b1d7af62f2ad) from Feb 9 through the alert time
  3. No log gaps: Logs were continuous - no shutdown/restart cycle visible
  4. Prometheus metrics: node_boot_time_seconds showed a 1-second fluctuation, then returned to normal
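Assuming shell access to the host, the same evidence can be re-checked with standard commands (the journalctl line depends on the local journal being available, so it is guarded):

```shell
# Kernel's computed boot time in epoch seconds; this is the value
# node_exporter's stat collector exports as node_boot_time_seconds
awk '/^btime/ {print $2}' /proc/stat

# Monotonic uptime in seconds; unaffected by wall-clock adjustments
cut -d' ' -f1 /proc/uptime

# Recent boots; a single entry spanning the alert window rules out a reboot
command -v journalctl >/dev/null && journalctl --list-boots | tail -n 3 || true
```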

Root Cause: NTP Clock Adjustment

The node_boot_time_seconds metric fluctuated by 1 second due to how Linux calculates boot time:

btime = current_wall_clock_time - monotonic_uptime

When NTP adjusts the wall clock, btime shifts by the same amount. The node_timex_* metrics confirmed this:

Metric                                        Value
node_timex_maxerror_seconds (max over 3h)     1.02 seconds
node_timex_maxerror_seconds (max over 24h)    2.05 seconds
node_timex_sync_status                        1 (synced)
node_timex_offset_seconds (current)           ~9 ms (normal)

The kernel's estimated maximum clock error spiked to over 1 second, causing the boot time calculation to drift momentarily.
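The effect is plain arithmetic. A sketch with made-up numbers (the uptime matches the ~143,751 s from the findings; the wall-clock value is hypothetical):

```shell
wall_clock=1770740856   # hypothetical epoch seconds at scrape time
uptime=143751           # monotonic uptime (~40 hours), unaffected by NTP

btime_before=$((wall_clock - uptime))
# NTP steps the wall clock back by 1 second; monotonic uptime does not
# move, so the computed boot time shifts by the same 1 second
btime_after=$((wall_clock - 1 - uptime))

echo $((btime_before - btime_after))   # prints 1: a "change" in boot time
```

That 1-second shift is exactly what `changes(node_boot_time_seconds[10m]) > 0` fires on.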

Additionally, systemd-resolved logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.

Current Time Sync Configuration

NixOS Guests

  • NTP client: systemd-timesyncd (NixOS default)
  • No explicit configuration in the codebase
  • Uses default NixOS NTP server pool

Proxmox VMs

  • Clocksource: kvm-clock (optimal for KVM VMs)
  • QEMU guest agent: Enabled
  • No additional QEMU timing args configured

Potential Improvements

1. Add Tolerance to the Alert Rule

Filter out small NTP adjustments by also requiring a meaningful boot-time shift:

# Current rule (triggers on any change)
expr: changes(node_boot_time_seconds[10m]) > 0

# Improved rule (requires >60 second shift)
expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
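The rewritten expression can be syntax-checked before deployment with promtool, which ships with Prometheus (the file path and group name below are hypothetical):

```shell
cat > /tmp/host_reboot.rules.yml <<'EOF'
groups:
  - name: host-reboot
    rules:
      - alert: host_reboot
        expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
EOF

# Guarded so the sketch degrades gracefully where promtool is not installed
command -v promtool >/dev/null && promtool check rules /tmp/host_reboot.rules.yml || true
```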

2. Switch to Chrony (Optional)

Chrony handles time adjustments more gracefully than systemd-timesyncd. The makestep 1 3 directive below allows stepping the clock only when the offset exceeds 1 second and only during the first 3 updates after startup; after that, corrections are slewed gradually, which keeps btime stable:

# In common/vm/qemu-guest.nix
{
  services.qemuGuest.enable = true;

  services.timesyncd.enable = false;
  services.chrony = {
    enable = true;
    extraConfig = ''
      makestep 1 3
      rtcsync
    '';
  };
}

3. Add QEMU Timing Args (Optional)

In terraform/vms.tf:

args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"

4. Local NTP Server (Optional)

Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.

Monitoring NTP Health

The node_timex_* metrics from node_exporter provide visibility into NTP health:

# Clock offset from reference
node_timex_offset_seconds

# Sync status (1 = synced)
node_timex_sync_status

# Maximum estimated error - useful for alerting
node_timex_maxerror_seconds

A potential alert for NTP issues:

- alert: ntp_clock_drift
  expr: node_timex_maxerror_seconds > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High clock drift on {{ $labels.hostname }}"
    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
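The alert can also be unit-tested with promtool test rules before rollout. A sketch (file paths and series values are illustrative; annotations are trimmed from the rule copy to keep the expected-alert comparison short, and promtool is guarded as it may not be installed):

```shell
cat > /tmp/ntp.rules.yml <<'EOF'
groups:
  - name: ntp
    rules:
      - alert: ntp_clock_drift
        expr: node_timex_maxerror_seconds > 1
        for: 5m
        labels:
          severity: warning
EOF

cat > /tmp/ntp_test.yml <<'EOF'
rule_files:
  - /tmp/ntp.rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # max error held at 2 s, well above the 1 s threshold
      - series: 'node_timex_maxerror_seconds{hostname="monitoring02"}'
        values: '2 2 2 2 2 2 2 2'
    alert_rule_test:
      # "for: 5m" means the alert fires at 5m; check it at 6m
      - eval_time: 6m
        alertname: ntp_clock_drift
        exp_alerts:
          - exp_labels:
              severity: warning
              hostname: monitoring02
EOF

command -v promtool >/dev/null && promtool test rules /tmp/ntp_test.yml || true
```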

Conclusion

No action required for the alert itself - the system was healthy. Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.