From 98ea679ef263b02e6b25fb6d08c6ee4422c60cc0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Tue, 10 Feb 2026 17:59:53 +0100
Subject: [PATCH] docs: add monitoring02 reboot alert investigation

Document findings from a false positive host_reboot alert caused by an
NTP clock adjustment affecting the node_boot_time_seconds metric.

Co-Authored-By: Claude Opus 4.5
---
 ...monitoring02-reboot-alert-investigation.md | 135 ++++++++++++++++++
 1 file changed, 135 insertions(+)
 create mode 100644 docs/plans/completed/monitoring02-reboot-alert-investigation.md

diff --git a/docs/plans/completed/monitoring02-reboot-alert-investigation.md b/docs/plans/completed/monitoring02-reboot-alert-investigation.md
new file mode 100644
index 0000000..aa5a839
--- /dev/null
+++ b/docs/plans/completed/monitoring02-reboot-alert-investigation.md
@@ -0,0 +1,135 @@
+# monitoring02 Reboot Alert Investigation
+
+**Date:** 2026-02-10
+**Status:** Completed - False positive identified
+
+## Summary
+
+A `host_reboot` alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a **false positive** caused by an NTP clock adjustment, not an actual reboot.
+
+## Alert Details
+
+- **Alert:** `host_reboot`
+- **Rule:** `changes(node_boot_time_seconds[10m]) > 0`
+- **Host:** monitoring02
+- **Time:** 2026-02-10T16:27:36Z
+
+## Investigation Findings
+
+### Evidence Against an Actual Reboot
+
+1. **Uptime:** The system had been up for ~40 hours (143,751 seconds) at the time of the alert
+2. **Consistent BOOT_ID:** All logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time
+3. **No log gaps:** Logs were continuous, with no shutdown/restart cycle visible
+4. **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal
+
+### Root Cause: NTP Clock Adjustment
+
+The `node_boot_time_seconds` metric fluctuated by 1 second because of how Linux calculates boot time:
+
+```
+btime = current_wall_clock_time - monotonic_uptime
+```
+
+When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:
+
+| Metric | Value |
+|--------|-------|
+| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
+| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
+| `node_timex_sync_status` | 1 (synced) |
+| Current `node_timex_offset_seconds` | ~9 ms (normal) |
+
+The kernel's estimated maximum clock error spiked to over 1 second, causing the derived boot time to drift momentarily.
+
+Additionally, `systemd-resolved` logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.
+
+## Current Time Sync Configuration
+
+### NixOS Guests
+- **NTP client:** systemd-timesyncd (NixOS default)
+- **No explicit configuration** in the codebase
+- Uses the default NixOS NTP server pool
+
+### Proxmox VMs
+- **Clocksource:** `kvm-clock` (optimal for KVM VMs)
+- **QEMU guest agent:** Enabled
+- **No additional QEMU timing args** configured
+
+## Potential Improvements
+
+### 1. Improve the Alert Rule (Recommended)
+
+Add tolerance to filter out small NTP adjustments:
+
+```yaml
+# Current rule (triggers on any change)
+expr: changes(node_boot_time_seconds[10m]) > 0
+
+# Improved rule (requires a >60 second shift)
+expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
+```
+
+### 2. Switch to Chrony (Optional)
+
+Chrony handles time adjustments more gracefully than systemd-timesyncd:
+
+```nix
+# In common/vm/qemu-guest.nix
+{
+  services.qemuGuest.enable = true;
+
+  services.timesyncd.enable = false;
+  services.chrony = {
+    enable = true;
+    extraConfig = ''
+      makestep 1 3
+      rtcsync
+    '';
+  };
+}
+```
+
+### 3. Add QEMU Timing Args (Optional)
+
+In `terraform/vms.tf`:
+
+```hcl
+args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
+```
+
+### 4. Local NTP Server (Optional)
+
+Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.
+
+## Monitoring NTP Health
+
+The `node_timex_*` metrics from node_exporter provide visibility into NTP health:
+
+```promql
+# Clock offset from the reference clock
+node_timex_offset_seconds
+
+# Sync status (1 = synced)
+node_timex_sync_status
+
+# Maximum estimated error - useful for alerting
+node_timex_maxerror_seconds
+```
+
+A potential alert for NTP issues:
+
+```yaml
+- alert: ntp_clock_drift
+  expr: node_timex_maxerror_seconds > 1
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High clock drift on {{ $labels.hostname }}"
+    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
+```
+
+## Conclusion
+
+No action is required for the alert itself - the system was healthy. Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.
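
Note: the boot-time derivation described in the root-cause section can be sketched in a few lines of Python. The timestamps below are invented for illustration (only loosely matching the incident's ~40 h uptime); what matters is the relationship between them:

```python
# Sketch of the kernel's boot-time derivation described above:
#   btime = current_wall_clock_time - monotonic_uptime
# All numbers here are hypothetical, chosen for illustration only.

uptime = 143_751.0            # monotonic uptime in seconds (unaffected by NTP)
wall_clock = 1_770_740_856.0  # hypothetical wall-clock epoch timestamp

btime_before = wall_clock - uptime

# An NTP step of +1 s moves only the wall clock, so the derived
# boot time shifts by exactly the size of the step.
ntp_step = 1.0
btime_after = (wall_clock + ntp_step) - uptime

print(btime_after - btime_before)  # 1.0
```

Because the uptime term comes from the monotonic clock, any step in the wall clock shows up one-for-one in the derived boot time, which is exactly what `changes(node_boot_time_seconds[10m]) > 0` reacts to.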