# monitoring02 Reboot Alert Investigation

**Date:** 2026-02-10
**Status:** Completed - false positive identified

## Summary

A `host_reboot` alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a **false positive** caused by NTP clock adjustments, not an actual reboot.

## Alert Details

- **Alert:** `host_reboot`
- **Rule:** `changes(node_boot_time_seconds[10m]) > 0`
- **Host:** monitoring02
- **Time:** 2026-02-10T16:27:36Z

## Investigation Findings

### Evidence Against an Actual Reboot

1. **Uptime:** The system had been up for ~40 hours (143,751 seconds) at the time of the alert.
2. **Consistent BOOT_ID:** All logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time.
3. **No log gaps:** Logs were continuous; no shutdown/restart cycle was visible.
4. **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal.

### Root Cause: NTP Clock Adjustment

The `node_boot_time_seconds` metric fluctuated by 1 second due to how Linux calculates boot time:

```
btime = current_wall_clock_time - monotonic_uptime
```

When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:

| Metric | Value |
|--------|-------|
| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
| `node_timex_sync_status` | 1 (synced) |
| Current `node_timex_offset_seconds` | ~9 ms (normal) |

The kernel's estimated maximum clock error spiked to over 1 second, causing the boot-time calculation to drift momentarily. Additionally, `systemd-resolved` logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.
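The failure mode above can be illustrated with a minimal sketch of the `btime` calculation. The numeric values below are hypothetical, chosen only to mirror the incident (~40 h uptime, a 1-second NTP step):

```python
def boot_time(wall_clock: float, monotonic_uptime: float) -> float:
    """Linux derives btime as wall-clock time minus monotonic uptime."""
    return wall_clock - monotonic_uptime

# Hypothetical values mirroring the incident: ~40h of uptime.
uptime = 143_751.0
wall_before = 1_770_740_856.0   # wall clock just before the NTP step
wall_after = wall_before + 1.0  # NTP steps the clock forward by 1 second

btime_before = boot_time(wall_before, uptime)
btime_after = boot_time(wall_after, uptime)

# The host never rebooted, but the reported boot time shifts by the
# same 1 second as the wall clock -- which is what tripped the alert.
assert btime_after - btime_before == 1.0
```

Monotonic uptime is unaffected by NTP, so only the wall-clock term moves; the derived boot time inherits every clock step.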
## Current Time Sync Configuration

### NixOS Guests

- **NTP client:** systemd-timesyncd (NixOS default)
- **No explicit configuration** in the codebase
- Uses the default NixOS NTP server pool

### Proxmox VMs

- **Clocksource:** `kvm-clock` (optimal for KVM VMs)
- **QEMU guest agent:** Enabled
- **No additional QEMU timing args** configured

## Potential Improvements

### 1. Improve the Alert Rule (Recommended)

Add tolerance to filter out small NTP adjustments:

```yaml
# Current rule (triggers on any change)
expr: changes(node_boot_time_seconds[10m]) > 0

# Improved rule (requires a >60-second shift)
expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
```

### 2. Switch to Chrony (Optional)

Chrony handles time adjustments more gracefully than systemd-timesyncd:

```nix
# In common/vm/qemu-guest.nix
{
  services.qemuGuest.enable = true;
  services.timesyncd.enable = false;
  services.chrony = {
    enable = true;
    extraConfig = ''
      makestep 1 3
      rtcsync
    '';
  };
}
```

### 3. Add QEMU Timing Args (Optional)

In `terraform/vms.tf`:

```hcl
args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
```

### 4. Local NTP Server (Optional)

Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.

## Monitoring NTP Health

The `node_timex_*` metrics from node_exporter provide visibility into NTP health:

```promql
# Clock offset from the reference clock
node_timex_offset_seconds

# Sync status (1 = synced)
node_timex_sync_status

# Maximum estimated error - useful for alerting
node_timex_maxerror_seconds
```

A potential alert for NTP issues:

```yaml
- alert: ntp_clock_drift
  expr: node_timex_maxerror_seconds > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High clock drift on {{ $labels.hostname }}"
    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
```

## Conclusion

No action required for the alert itself - the system was healthy.
Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.
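The improved rule's filtering behavior can be sanity-checked with a small Python sketch of the PromQL semantics. This is a simplification (real PromQL `delta()` extrapolates over the range; here it is just last-minus-first), and the sample values are hypothetical:

```python
def changes(samples: list[float]) -> int:
    """Count value changes across a sample window (simplified PromQL changes())."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

def delta(samples: list[float]) -> float:
    """Last sample minus first sample (simplified PromQL delta())."""
    return samples[-1] - samples[0]

def reboot_alert(samples: list[float], tolerance: float = 60.0) -> bool:
    """Improved rule: fire only when boot time shifts by more than `tolerance`."""
    return changes(samples) > 0 and abs(delta(samples)) > tolerance

# NTP jitter: btime wobbles by 1 second, then returns to its old value.
ntp_jitter = [1_770_000_000.0, 1_770_000_001.0, 1_770_000_000.0]
# Real reboot: btime jumps forward by roughly the previous uptime.
real_reboot = [1_770_000_000.0, 1_770_143_751.0]

assert not reboot_alert(ntp_jitter)  # filtered out by the tolerance
assert reboot_alert(real_reboot)     # still fires on a genuine reboot
```

A real reboot moves `node_boot_time_seconds` by roughly the previous uptime (hours or days), so a 60-second tolerance leaves plenty of margin on both sides.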