From 55da459108545174fecc922492c370ac8f557320 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= <torjus@usit.uio.no>
Date: Tue, 3 Mar 2026 19:33:28 +0100
Subject: [PATCH] docs: add plan for local NTP with chrony

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 docs/plans/local-ntp-chrony.md | 79 ++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 docs/plans/local-ntp-chrony.md

diff --git a/docs/plans/local-ntp-chrony.md b/docs/plans/local-ntp-chrony.md
new file mode 100644
index 0000000..17171a1
--- /dev/null
+++ b/docs/plans/local-ntp-chrony.md
@@ -0,0 +1,79 @@
+# Local NTP with Chrony
+
+## Overview/Goal
+
+Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This should eliminate the clock step corrections that cause false `host_reboot` alerts.
+
+## Current State
+
+- All NixOS hosts use `systemd-timesyncd` with default NixOS pool servers (`0.nixos.pool.ntp.org` etc.)
+- No NTP/timesyncd configuration exists in the repo — all defaults
+- pve1 (Proxmox, bare metal) already runs chrony but only as a client
+- VMs drift noticeably: ns1 (~19 ms offset) and jelly01 (~39 ms offset) are the worst offenders
+- Clock step corrections from timesyncd trigger false `host_reboot` alerts via `changes(node_boot_time_seconds[10m]) > 0`
+- pve1 itself stays at 0ms offset thanks to chrony
+
+## Why systemd-timesyncd is Insufficient
+
+- Minimal SNTP client without proper clock discipline or frequency tracking
+- Backs off its polling interval when it considers the clock stable, so drift accumulates between polls
+- Corrects via step adjustments rather than gradual slewing, which causes metric jumps
+- Each VM resolves the pool names to different servers of varying accuracy
+
+## Implementation Steps
+
+### 1. Configure pve1 as NTP Server
+
+Add to pve1's `/etc/chrony/chrony.conf`:
+
+```
+# Allow NTP clients from the infrastructure subnet
+allow 10.69.13.0/24
+```
+
+Restart chrony on pve1 (`systemctl restart chrony`) so the `allow` directive takes effect.
+
+### 2. Add Chrony to NixOS System Config
+
+Create `system/chrony.nix` (applied to all hosts via system imports):
+
+```nix
+{
+  # Disable systemd-timesyncd (chrony takes over)
+  services.timesyncd.enable = false;
+
+  # Enable chrony pointing at pve1
+  services.chrony = {
+    enable = true;
+    servers = [ "pve1.home.2rjus.net" ];
+    serverOption = "iburst";
+  };
+}
+```
+
+### 3. Optional: Add Chrony Exporter
+
+For better visibility into NTP sync quality:
+
+```nix
+services.prometheus.exporters.chrony.enable = true;
+```
+
+Add chrony exporter scrape targets via `homelab.monitoring.scrapeTargets` and create a Grafana dashboard for NTP offset across all hosts.
+
+### 4. Roll Out
+
+- Deploy to a test-tier host first and verify sync with `chronyc tracking` and `chronyc sources`
+- Then deploy to all remaining hosts via auto-upgrade
+
+## Open Questions
+
+- [ ] Does pve1's chrony config need `local stratum 10` as fallback if upstream is unreachable?
+- [ ] Should we also enable `services.chrony.enableRTCTrimming` for the VMs?
+- [ ] Worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?
+
+## Notes
+
+- No fallback NTP servers needed on VMs — if pve1 is down, all VMs are down too
+- The `host_reboot` alert rule (`changes(node_boot_time_seconds[10m]) > 0`) should stop false-firing once clock corrections are slewed instead of stepped
+- pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency