docs: add monitoring02 reboot alert investigation

Document findings from false positive host_reboot alert caused by NTP clock adjustment affecting node_boot_time_seconds metric.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 17:59:53 +01:00
parent b709c0b703
commit 98ea679ef2
1 changed files with 135 additions and 0 deletions
--- a/docs/plans/completed/monitoring02-reboot-alert-investigation.md
+++ b/docs/plans/completed/monitoring02-reboot-alert-investigation.md
@@ -0,0 +1,135 @@
+# monitoring02 Reboot Alert Investigation
+
+**Date:** 2026-02-10
+**Status:** Completed - False positive identified
+
+## Summary
+
+A `host_reboot` alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a **false positive** caused by NTP clock adjustments, not an actual reboot.
+
+## Alert Details
+
+- **Alert:** `host_reboot`
+- **Rule:** `changes(node_boot_time_seconds[10m]) > 0`
+- **Host:** monitoring02
+- **Time:** 2026-02-10T16:27:36Z
+
+## Investigation Findings
+
+### Evidence Against Actual Reboot
+
+1. **Uptime:** System had been up for ~40 hours (143,751 seconds) at time of alert
+2. **Consistent BOOT_ID:** All logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time
+3. **No log gaps:** Logs were continuous - no shutdown/restart cycle visible
+4. **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal
+
+### Root Cause: NTP Clock Adjustment
+
+The `node_boot_time_seconds` metric fluctuated by 1 second due to how Linux calculates boot time:
+
+```
+btime = current_wall_clock_time - monotonic_uptime
+```
+
+When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:
+
+| Metric | Value |
+|--------|-------|
+| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
+| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
+| `node_timex_sync_status` | 1 (synced) |
+| Current `node_timex_offset_seconds` | ~9ms (normal) |
+
+The kernel's estimated maximum clock error spiked to over 1 second. That is consistent with a clock correction of roughly that size, which would shift the computed boot time by the same amount until the clock settled.
+
+Additionally, `systemd-resolved` logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.
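+
+The figures above can be re-checked with queries along these lines. This is a sketch, not the exact queries used during the investigation; the `hostname` label is assumed from the alert annotations later in this document and may differ in other setups:
+
+```promql
+# Peak kernel-estimated clock error over the last 3h and 24h
+max_over_time(node_timex_maxerror_seconds{hostname="monitoring02"}[3h])
+max_over_time(node_timex_maxerror_seconds{hostname="monitoring02"}[24h])
+
+# How often the reported boot time changed within the alert window
+changes(node_boot_time_seconds{hostname="monitoring02"}[10m])
+```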
+
+## Current Time Sync Configuration
+
+### NixOS Guests
+- **NTP client:** systemd-timesyncd (NixOS default)
+- **No explicit configuration** in the codebase
+- Uses default NixOS NTP server pool
+
+### Proxmox VMs
+- **Clocksource:** `kvm-clock` (optimal for KVM VMs)
+- **QEMU guest agent:** Enabled
+- **No additional QEMU timing args** configured
+
+## Potential Improvements
+
+### 1. Improve Alert Rule (Recommended)
+
+Add tolerance to filter out small NTP adjustments:
+
+```yaml
+# Current rule (triggers on any change)
+expr: changes(node_boot_time_seconds[10m]) > 0
+
+# Improved rule (requires >60 second shift)
+expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
+```
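+
+Before settling on the 60-second threshold, it may help to check how large past boot-time shifts actually were. A possible query, using a subquery; the 7-day range and 1-minute resolution are illustrative assumptions, not values from the investigation:
+
+```promql
+# Largest boot-time shift seen in any 10m window over the past 7 days
+max_over_time(abs(delta(node_boot_time_seconds[10m]))[7d:1m])
+```
+
+NTP-induced jitter should show up here as values of a second or two, while a real reboot moves the boot time by the full uptime of the previous boot.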
+
+### 2. Switch to Chrony (Optional)
+
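+Chrony normally corrects the clock by slewing it gradually rather than stepping it. With `makestep 1 3`, it steps only when the offset exceeds 1 second, and only during the first three clock updates: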
+
+```nix
+# In common/vm/qemu-guest.nix
+{
+  services.qemuGuest.enable = true;
+
+  services.timesyncd.enable = false;
+  services.chrony = {
+    enable = true;
+    extraConfig = ''
+      makestep 1 3
+      rtcsync
+    '';
+  };
+}
+```
+
+### 3. Add QEMU Timing Args (Optional)
+
+In `terraform/vms.tf`:
+
+```hcl
+args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
+```
+
+### 4. Local NTP Server (Optional)
+
+Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.
+
+## Monitoring NTP Health
+
+The `node_timex_*` metrics from node_exporter provide visibility into NTP health:
+
+```promql
+# Clock offset from reference
+node_timex_offset_seconds
+
+# Sync status (1 = synced)
+node_timex_sync_status
+
+# Maximum estimated error - useful for alerting
+node_timex_maxerror_seconds
+```
+
+A potential alert for NTP issues:
+
+```yaml
+- alert: ntp_clock_drift
+  expr: node_timex_maxerror_seconds > 1
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High clock drift on {{ $labels.hostname }}"
+    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
+```
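+
+The other two `node_timex_*` metrics could back complementary conditions. The expressions below are a sketch only; the 100 ms tolerance is an assumed example value and would need tuning against the normal offset observed on these hosts:
+
+```promql
+# Kernel reports the clock as unsynchronised
+node_timex_sync_status == 0
+
+# Offset from the reference exceeds an example tolerance of 100 ms
+abs(node_timex_offset_seconds) > 0.1
+```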
+
+## Conclusion
+
+No action required for the alert itself - the system was healthy. Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.