docs: add plan for local NTP with chrony

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 19:33:28 +01:00
parent 813c5c0f29
commit 55da459108
1 changed files with 79 additions and 0 deletions
--- a/docs/plans/local-ntp-chrony.md
+++ b/docs/plans/local-ntp-chrony.md
@@ -0,0 +1,79 @@
 # Local NTP with Chrony
 ## Overview/Goal
 Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, with pve1 as the sole time source. This should eliminate the clock step corrections that currently cause false `host_reboot` alerts.
 ## Current State
 - All NixOS hosts use `systemd-timesyncd` with default NixOS pool servers (`0.nixos.pool.ntp.org` etc.)
 - No NTP/timesyncd configuration exists in the repo; every host runs the defaults
 - pve1 (Proxmox, bare metal) already runs chrony but only as a client
 - VMs drift noticeably: ns1 (~19 ms) and jelly01 (~39 ms) are the worst offenders
 - Clock step corrections from timesyncd shift the reported boot time, which triggers false `host_reboot` alerts via `changes(node_boot_time_seconds[10m]) > 0`
 - pve1 itself stays at 0ms offset thanks to chrony
 ## Why systemd-timesyncd is Insufficient
 - Minimal SNTP client, no proper clock discipline or frequency tracking
 - Backs off its polling interval when it considers the clock stable, so drift between polls goes uncorrected
 - Corrects via step adjustments rather than gradual slewing, causing metric jumps
 - Each VM resolves to different pool servers with varying accuracy
 ## Implementation Steps
 ### 1. Configure pve1 as NTP Server
 Add to pve1's `/etc/chrony/chrony.conf`:
 ```
 # Allow NTP clients from the infrastructure subnet
 allow 10.69.13.0/24
 ```
 Restart chrony on pve1 so the new `allow` directive takes effect.
 ### 2. Add Chrony to NixOS System Config
 Create `system/chrony.nix` (applied to all hosts via system imports):
 ```nix
 {
  # Disable systemd-timesyncd (chrony takes over)
  services.timesyncd.enable = false;
  # Enable chrony pointing at pve1
  services.chrony = {
    enable = true;
    servers = [ "pve1.home.2rjus.net" ];
    serverOption = "iburst";
  };
 }
 ```
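 The plan depends on chrony slewing the clock instead of stepping it. A minimal sketch of how stepping could be limited to startup, assuming the `makestep` directive is passed through `services.chrony.extraConfig` and that the illustrative thresholds below suit these hosts:
 ```nix
 {
   services.chrony.extraConfig = ''
     # Step only if the offset exceeds 1 second, and only during the
     # first 3 clock updates after chronyd starts; slew afterwards
     makestep 1.0 3
   '';
 }
 ```
 Check the generated `chrony.conf` on a test host first, since the NixOS module may already configure startup stepping and this fragment could be redundant.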
 ### 3. Optional: Add Chrony Exporter
 For better visibility into NTP sync quality:
 ```nix
 services.prometheus.exporters.chrony.enable = true;
 ```
 Add chrony exporter scrape targets via `homelab.monitoring.scrapeTargets` and create a Grafana dashboard for NTP offset across all hosts.
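 A sketch of the exporter with its port opened for scraping, assuming the generic `openFirewall` option shared by NixOS Prometheus exporter modules is also available for the chrony exporter:
 ```nix
 {
   services.prometheus.exporters.chrony = {
     enable = true;
     # Assumption: open the exporter port in the host firewall so
     # Prometheus can scrape it from the monitoring host
     openFirewall = true;
   };
 }
 ```
 The scrape target entry depends on the `homelab.monitoring.scrapeTargets` schema and is not sketched here.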
 ### 4. Roll Out
 - Deploy to a test-tier host first and verify that it syncs to pve1 (for example with `chronyc sources` and `chronyc tracking`)
 - Then deploy to all hosts via auto-upgrade
 ## Open Questions
 - [ ] Does pve1's chrony config need `local stratum 10` as fallback if upstream is unreachable?
 - [ ] Should we also enable `services.chrony.enableRTCTrimming` for the VMs?
 - [ ] Worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?
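 For the first question, a sketch of what the fallback could look like in pve1's `chrony.conf`, assuming stratum 10 is an acceptable value for this network:
 ```
 # Allow NTP clients from the infrastructure subnet
 allow 10.69.13.0/24
 # Keep serving time from the local clock when no upstream source is reachable
 local stratum 10
 ```
 With `local` set, chronyd keeps answering clients while it is unsynchronised, so the VMs would stay consistent with each other even if pve1 slowly drifts from true time.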
 ## Notes
 - No fallback NTP servers needed on VMs — if pve1 is down, all VMs are down too
 - The `host_reboot` alert rule (`changes(node_boot_time_seconds[10m]) > 0`) should stop false-firing once clock corrections are slewed instead of stepped
 - pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency