Local NTP with Chrony
Overview/Goal
Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This eliminates clock drift issues that cause false host_reboot alerts.
Current State
- All NixOS hosts use systemd-timesyncd with default NixOS pool servers (0.nixos.pool.ntp.org etc.)
- No NTP/timesyncd configuration exists in the repo; everything runs on defaults
- pve1 (Proxmox, bare metal) already runs chrony but only as a client
- VMs drift noticeably: ns1 (~19ms) and jelly01 (~39ms) are the worst offenders
- Clock step corrections from timesyncd trigger false host_reboot alerts via changes(node_boot_time_seconds[10m]) > 0
- pve1 itself stays at 0ms offset thanks to chrony
Why systemd-timesyncd is Insufficient
- Minimal SNTP client, no proper clock discipline or frequency tracking
- Backs off polling interval when it thinks clock is stable, missing drift
- Corrects via step adjustments rather than gradual slewing, causing metric jumps
- Each VM resolves to different pool servers with varying accuracy
Implementation Steps
1. Configure pve1 as NTP Server
Add to pve1's /etc/chrony/chrony.conf:
# Allow NTP clients from the infrastructure subnet
allow 10.69.13.0/24
Restart chrony on pve1.
2. Add Chrony to NixOS System Config
Create system/chrony.nix (applied to all hosts via system imports):
{
  # Disable systemd-timesyncd (chrony takes over)
  services.timesyncd.enable = false;

  # Enable chrony pointing at pve1
  services.chrony = {
    enable = true;
    servers = [ "pve1.home.2rjus.net" ];
    serverOption = "iburst";
  };
}
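A minimal sketch of wiring the module into the shared system imports. The file name system/default.nix is an assumption for illustration, since the repo layout is not shown here; use whichever file already collects the system-wide modules.

```nix
# system/default.nix (hypothetical file name; adjust to the repo's actual layout)
{
  imports = [
    # ...existing shared modules stay as they are...
    ./chrony.nix # the module created in this step
  ];
}
```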
3. Optional: Add Chrony Exporter
For better visibility into NTP sync quality:
services.prometheus.exporters.chrony.enable = true;
Add chrony exporter scrape targets via homelab.monitoring.scrapeTargets and create a Grafana dashboard for NTP offset across all hosts.
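If Prometheus scrapes the hosts over the network, the exporter port probably also needs to be reachable. A sketch using the generic openFirewall option that NixOS exporter modules provide; whether it is needed depends on how the firewall is configured in this repo, which is an assumption here.

```nix
{
  services.prometheus.exporters.chrony = {
    enable = true;
    # Assumption: the host firewall is enabled and Prometheus scrapes remotely.
    # Drop this line if the exporter port is already opened elsewhere.
    openFirewall = true;
  };
}
```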
4. Roll Out
- Deploy to a test-tier host first to verify
- Then deploy to all hosts via auto-upgrade
Open Questions
- Does pve1's chrony config need local stratum 10 as fallback if upstream is unreachable?
- Should we also enable enableRTCTrimming for the VMs?
- Worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?
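For the enableRTCTrimming question, the setting would live next to the other chrony options, as sketched here. This assumes the services.chrony.enableRTCTrimming option exists in the NixOS release in use. Recent releases may already default it to true, which would make the line redundant, so check the option documentation first.

```nix
{
  services.chrony = {
    enable = true;
    # Assumption: set explicitly only if the release default is not already true.
    enableRTCTrimming = true;
  };
}
```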
Notes
- No fallback NTP servers needed on VMs: if pve1 is down, all VMs are down too
- The host_reboot alert rule (changes(node_boot_time_seconds[10m]) > 0) should stop false-firing once clock corrections are slewed instead of stepped
- pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency