docs: add plan for local NTP with chrony
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs/plans/local-ntp-chrony.md (new file, 79 lines)

# Local NTP with Chrony
## Overview/Goal
Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This eliminates clock drift issues that cause false `host_reboot` alerts.
## Current State
- All NixOS hosts use `systemd-timesyncd` with default NixOS pool servers (`0.nixos.pool.ntp.org` etc.)
- No NTP/timesyncd configuration exists in the repo — all defaults
- pve1 (Proxmox, bare metal) already runs chrony but only as a client
- VMs drift noticeably; ns1 (~19 ms) and jelly01 (~39 ms) are the worst offenders
- Clock step corrections from timesyncd trigger false `host_reboot` alerts via `changes(node_boot_time_seconds[10m]) > 0`
- pve1 itself stays at ~0 ms offset thanks to chrony
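
For reference, the per-host offsets above can be read straight from node-exporter's timex collector (enabled by default), so drift is easy to compare before and after the switch:

```
# Current kernel clock offset per host, in seconds (node_timex collector)
node_timex_offset_seconds
```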
## Why systemd-timesyncd is Insufficient
- Minimal SNTP client with no real clock discipline or frequency tracking
- Backs off its polling interval once it considers the clock stable, letting drift accumulate between polls
- Corrects via step adjustments rather than gradual slewing, causing metric jumps
- Each VM resolves to different pool servers with varying accuracy
## Implementation Steps
### 1. Configure pve1 as NTP Server
Add to pve1's `/etc/chrony/chrony.conf`:
```
# Allow NTP clients from the infrastructure subnet
allow 10.69.13.0/24
```

Restart chrony on pve1 (`systemctl restart chrony` on the Debian-based Proxmox host) so the new `allow` directive takes effect.

### 2. Add Chrony to NixOS System Config
Create `system/chrony.nix` (applied to all hosts via system imports):
```nix
{
  # Disable systemd-timesyncd (chrony takes over)
  services.timesyncd.enable = false;

  # Enable chrony pointing at pve1
  services.chrony = {
    enable = true;
    servers = [ "pve1.home.2rjus.net" ];
    serverOption = "iburst";
  };
}
```
### 3. Optional: Add Chrony Exporter
For better visibility into NTP sync quality:
```nix
services.prometheus.exporters.chrony.enable = true;
```
Add chrony exporter scrape targets via `homelab.monitoring.scrapeTargets` and create a Grafana dashboard for NTP offset across all hosts.
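
The wiring might look roughly like this. The exact shape of `homelab.monitoring.scrapeTargets` is repo-specific, and the exporter port is an assumption (confirm against `services.prometheus.exporters.chrony.port`), so treat this as a sketch rather than a drop-in snippet:

```nix
# Hypothetical shape: adjust to whatever homelab.monitoring.scrapeTargets
# actually expects in this repo.
homelab.monitoring.scrapeTargets = [
  {
    job_name = "chrony";
    # Port 9123 is assumed here; check the exporter module's default.
    targets = [ "ns1.home.2rjus.net:9123" "jelly01.home.2rjus.net:9123" ];
  }
];
```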
### 4. Roll Out
- Deploy to a test-tier host first; verify with `chronyc sources` (pve1 should be the sole, selected source) and `chronyc tracking`
- Then deploy to all hosts via auto-upgrade
## Open Questions
- [ ] Does pve1's chrony config need `local stratum 10` as fallback if upstream is unreachable?
- [ ] Should we also enable `enableRTCTrimming` for the VMs?
- [ ] Worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?
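
If the answer to the first question turns out to be yes, the pve1 fragment would be small (a sketch of the standard chrony directive, not a decision):

```
# Serve time from the local clock at a high stratum when all upstream
# sources are unreachable, so VMs stay mutually consistent even if
# absolute time slowly drifts
local stratum 10
```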
## Notes
- No fallback NTP servers needed on VMs — if pve1 is down, all VMs are down too
- The `host_reboot` alert rule (`changes(node_boot_time_seconds[10m]) > 0`) should stop false-firing once clock corrections are slewed instead of stepped
- pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency