# Local NTP with Chrony

## Overview/Goal

Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This eliminates clock drift issues that cause false `host_reboot` alerts.

## Current State

- All NixOS hosts use `systemd-timesyncd` with the default NixOS pool servers (`0.nixos.pool.ntp.org` etc.)
- No NTP/timesyncd configuration exists in the repo — all defaults
- pve1 (Proxmox, bare metal) already runs chrony, but only as a client
- VMs drift noticeably — ns1 (~19 ms) and jelly01 (~39 ms) are the worst offenders
- Clock-step corrections from timesyncd trigger false `host_reboot` alerts via `changes(node_boot_time_seconds[10m]) > 0`
- pve1 itself stays at ~0 ms offset thanks to chrony

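The per-host offsets above can be watched continuously via node_exporter's `timex` collector (enabled by default), e.g. in a Grafana panel:

```
# Kernel-reported clock offset per host, converted to milliseconds
node_timex_offset_seconds * 1000
```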
## Why systemd-timesyncd is Insufficient

- A minimal SNTP client with no real clock discipline or frequency tracking
- Backs off its polling interval when it thinks the clock is stable, missing drift
- Corrects via step adjustments rather than gradual slewing, causing metric jumps
- Each VM resolves to different pool servers with varying accuracy

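Chrony avoids the stepping problem by design: it slews the clock continuously and only steps when explicitly told to. The standard chrony directive for this (shown as an illustration of the default behavior, not something this plan needs to add) restricts stepping to the first measurements after boot:

```
# Step the clock only if the offset exceeds 1 second, and only during
# the first 3 measurements after chronyd starts; every later
# correction is slewed gradually, so node_boot_time_seconds stays flat
makestep 1.0 3
```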
## Implementation Steps

### 1. Configure pve1 as NTP Server

Add to pve1's `/etc/chrony/chrony.conf`:

```
# Allow NTP clients from the infrastructure subnet
allow 10.69.13.0/24
```

Restart chrony on pve1 (`systemctl restart chrony`) so the `allow` directive takes effect.

### 2. Add Chrony to NixOS System Config

Create `system/chrony.nix` (applied to all hosts via system imports):

```nix
{
  # Disable systemd-timesyncd (chrony takes over)
  services.timesyncd.enable = false;

  # Enable chrony pointing at pve1
  services.chrony = {
    enable = true;
    servers = [ "pve1.home.2rjus.net" ];
    serverOption = "iburst";
  };
}
```
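For reference, the NixOS module should render each entry in `servers` with `serverOption` appended, so the generated `/etc/chrony.conf` on a VM would contain something like:

```
server pve1.home.2rjus.net iburst
```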

### 3. Optional: Add Chrony Exporter

For better visibility into NTP sync quality:

```nix
services.prometheus.exporters.chrony.enable = true;
```

Add chrony exporter scrape targets via `homelab.monitoring.scrapeTargets` and create a Grafana dashboard for NTP offset across all hosts.
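The exact shape of `homelab.monitoring.scrapeTargets` is repo-specific; assuming it takes a job name plus a list of `host:port` targets, and that the exporter listens on its default port (9123, per `services.prometheus.exporters.chrony.port`), the addition might look roughly like:

```nix
# Hypothetical sketch: adjust to the actual scrapeTargets option shape
# and to the real set of hosts.
homelab.monitoring.scrapeTargets = [
  {
    job_name = "chrony";
    targets = [
      "ns1.home.2rjus.net:9123"
      "jelly01.home.2rjus.net:9123"
    ];
  }
];
```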

### 4. Roll Out

- Deploy to a test-tier host first to verify
- Then deploy to all hosts via auto-upgrade

## Open Questions

- [ ] Does pve1's chrony config need `local stratum 10` as a fallback if upstream is unreachable?
- [ ] Should we also enable `enableRTCTrimming` for the VMs?
- [ ] Worth adding a chrony exporter on pve1 as well (manual install, like node-exporter)?
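If the answer to the first question is yes, the fallback is a one-line addition to pve1's chrony.conf (a standard chrony directive):

```
# Keep serving local time at stratum 10 if upstream sources become
# unreachable, so VMs stay mutually consistent on a free-running clock
# instead of losing sync entirely
local stratum 10
```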

## Notes

- No fallback NTP servers needed on VMs — if pve1 is down, all VMs are down too
- The `host_reboot` alert rule (`changes(node_boot_time_seconds[10m]) > 0`) should stop false-firing once clock corrections are slewed instead of stepped
- pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency