From 55da459108545174fecc922492c370ac8f557320 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Tue, 3 Mar 2026 19:33:28 +0100
Subject: [PATCH] docs: add plan for local NTP with chrony

Co-Authored-By: Claude Opus 4.6
---
 docs/plans/local-ntp-chrony.md | 79 ++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 docs/plans/local-ntp-chrony.md

diff --git a/docs/plans/local-ntp-chrony.md b/docs/plans/local-ntp-chrony.md
new file mode 100644
index 0000000..17171a1
--- /dev/null
+++ b/docs/plans/local-ntp-chrony.md
@@ -0,0 +1,79 @@
+# Local NTP with Chrony
+
+## Overview/Goal
+
+Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This eliminates clock drift issues that cause false `host_reboot` alerts.
+
+## Current State
+
+- All NixOS hosts use `systemd-timesyncd` with default NixOS pool servers (`0.nixos.pool.ntp.org` etc.)
+- No NTP/timesyncd configuration exists in the repo (all defaults)
+- pve1 (Proxmox, bare metal) already runs chrony but only as a client
+- VMs drift noticeably: ns1 (~19ms) and jelly01 (~39ms) are worst offenders
+- Clock step corrections from timesyncd trigger false `host_reboot` alerts via `changes(node_boot_time_seconds[10m]) > 0`
+- pve1 itself stays at 0ms offset thanks to chrony
+
+## Why systemd-timesyncd is Insufficient
+
+- Minimal SNTP client, no proper clock discipline or frequency tracking
+- Backs off polling interval when it thinks clock is stable, missing drift
+- Corrects via step adjustments rather than gradual slewing, causing metric jumps
+- Each VM resolves to different pool servers with varying accuracy
+
+## Implementation Steps
+
+### 1. Configure pve1 as NTP Server
+
+Add to pve1's `/etc/chrony/chrony.conf`:
+
+```
+# Allow NTP clients from the infrastructure subnet
+allow 10.69.13.0/24
+```
+
+Restart chrony on pve1.
+
+### 2. Add Chrony to NixOS System Config
+
+Create `system/chrony.nix` (applied to all hosts via system imports):
+
+```nix
+{
+  # Disable systemd-timesyncd (chrony takes over)
+  services.timesyncd.enable = false;
+
+  # Enable chrony pointing at pve1
+  services.chrony = {
+    enable = true;
+    servers = [ "pve1.home.2rjus.net" ];
+    serverOption = "iburst";
+  };
+}
+```
+
+### 3. Optional: Add Chrony Exporter
+
+For better visibility into NTP sync quality:
+
+```nix
+services.prometheus.exporters.chrony.enable = true;
+```
+
+Add chrony exporter scrape targets via `homelab.monitoring.scrapeTargets` and create a Grafana dashboard for NTP offset across all hosts.
+
+### 4. Roll Out
+
+- Deploy to a test-tier host first to verify
+- Then deploy to all hosts via auto-upgrade
+
+## Open Questions
+
+- [ ] Does pve1's chrony config need `local stratum 10` as fallback if upstream is unreachable?
+- [ ] Should we also enable `enableRTCTrimming` for the VMs?
+- [ ] Worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?
+
+## Notes
+
+- No fallback NTP servers needed on VMs: if pve1 is down, all VMs are down too
+- The `host_reboot` alert rule (`changes(node_boot_time_seconds[10m]) > 0`) should stop false-firing once clock corrections are slewed instead of stepped
+- pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency
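
On the first open question in the plan, the fragment below is a minimal sketch of what the additions to pve1's `/etc/chrony/chrony.conf` could look like if the fallback is wanted. The `allow` line is taken from step 1. The `local stratum 10` line is a standard chrony directive, but whether it is appropriate for this setup is an assumption here, not something the plan has decided.

```
# Allow NTP clients from the infrastructure subnet (from step 1 of the plan)
allow 10.69.13.0/24

# Assumption: keep serving time from pve1's own clock when no upstream
# source is reachable, advertised at a deliberately poor stratum
local stratum 10
```

Because the VMs use pve1 as their sole source, this would mainly keep them consistent with each other during an upstream outage, not accurate in absolute terms. After restarting chrony, running `chronyc tracking` on a test-tier VM is one way to check that pve1 is reported as the reference.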