docs: add plan for local NTP with chrony
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
79
docs/plans/local-ntp-chrony.md
Normal file
@@ -0,0 +1,79 @@
# Local NTP with Chrony

## Overview/Goal

Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This eliminates clock drift issues that cause false `host_reboot` alerts.

## Current State

- All NixOS hosts use `systemd-timesyncd` with the default NixOS pool servers (`0.nixos.pool.ntp.org` etc.)
- No NTP/timesyncd configuration exists in the repo; every host runs on defaults
- pve1 (Proxmox, bare metal) already runs chrony, but only as a client
- VMs drift noticeably: ns1 (~19 ms) and jelly01 (~39 ms) are the worst offenders
- Clock step corrections from timesyncd trigger false `host_reboot` alerts via `changes(node_boot_time_seconds[10m]) > 0`
- pve1 itself stays at 0 ms offset thanks to chrony

## Why systemd-timesyncd is Insufficient

- Minimal SNTP client with no proper clock discipline or frequency tracking
- Backs off its polling interval when it thinks the clock is stable, so it misses drift
- Corrects via step adjustments rather than gradual slewing, causing metric jumps
- Each VM resolves to different pool servers with varying accuracy

## Implementation Steps

### 1. Configure pve1 as NTP Server

Add to pve1's `/etc/chrony/chrony.conf`:

```
# Allow NTP clients from the infrastructure subnet
allow 10.69.13.0/24
```

Restart chrony on pve1 so the new `allow` directive takes effect.

### 2. Add Chrony to NixOS System Config

Create `system/chrony.nix` (applied to all hosts via system imports):

```nix
{
  # Disable systemd-timesyncd (chrony takes over)
  services.timesyncd.enable = false;

  # Enable chrony pointing at pve1
  services.chrony = {
    enable = true;
    servers = [ "pve1.home.2rjus.net" ];
    serverOption = "iburst";
  };
}
```

### 3. Optional: Add Chrony Exporter

For better visibility into NTP sync quality:

```nix
services.prometheus.exporters.chrony.enable = true;
```

Add chrony exporter scrape targets via `homelab.monitoring.scrapeTargets` and create a Grafana dashboard for NTP offset across all hosts.
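
As a rough sketch of what the exporter side could look like on each NixOS host, the fragment below sets the listen port explicitly and opens it in the firewall. The port value and the decision to open the firewall are assumptions for illustration, not settings taken from this repo; `port` and `openFirewall` are the generic options shared by the NixOS Prometheus exporter modules. The matching `homelab.monitoring.scrapeTargets` entry is left out because its schema is not shown in this plan.

```nix
{
  services.prometheus.exporters.chrony = {
    enable = true;
    # Assumption: 9123 is used here for illustration; use whatever port
    # the scrape configuration expects.
    port = 9123;
    # Assumption: Prometheus scrapes this host over the network, so the
    # port has to be reachable through the host firewall.
    openFirewall = true;
  };
}
```

Pinning the port explicitly keeps the scrape target definition independent of the module's default.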

### 4. Roll Out

- Deploy to a test-tier host first to verify
- Then deploy to all hosts via auto-upgrade

## Open Questions

- [ ] Does pve1's chrony config need `local stratum 10` as fallback if upstream is unreachable?
- [ ] Should we also enable `enableRTCTrimming` for the VMs?
- [ ] Worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?

## Notes

- No fallback NTP servers needed on VMs: if pve1 is down, all VMs are down too
- The `host_reboot` alert rule (`changes(node_boot_time_seconds[10m]) > 0`) should stop false-firing once clock corrections are slewed instead of stepped
- pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency
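
To make the slew-versus-step behaviour in the second note explicit rather than implied, one option is to pin chrony's `makestep` policy in `system/chrony.nix`. The fragment below is a sketch and not part of the plan as written: the `1.0 3` values are an assumed, commonly used setting, and it presumes the generated configuration does not already contain a conflicting `makestep` line.

```nix
{
  services.chrony.extraConfig = ''
    # Assumption: step only if the offset is larger than 1 second, and only
    # during the first 3 clock updates after chronyd starts; slew afterwards.
    makestep 1.0 3
  '';
}
```

With a limit like this, any stepping is confined to the period right after chronyd starts, so later corrections should not move `node_boot_time_seconds`.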