torjus/nixos-servers

Torjus Håkestad 55da459108

docs: add plan for local NTP with chrony

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-03 19:33:28 +01:00

Local NTP with Chrony

Overview/Goal

Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This eliminates clock drift issues that cause false host_reboot alerts.

Current State

All NixOS hosts use systemd-timesyncd with default NixOS pool servers (0.nixos.pool.ntp.org etc.)
No NTP or timesyncd configuration exists in the repo; every host runs the defaults
pve1 (Proxmox, bare metal) already runs chrony but only as a client
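VMs drift noticeably: ns1 (~19 ms) and jelly01 (~39 ms) are the worst offenders
Clock step corrections from timesyncd trigger false host_reboot alerts via changes(node_boot_time_seconds[10m]) > 0, because the kernel derives boot time from the wall clock and a step shifts the reported value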
pve1 itself stays at 0ms offset thanks to chrony

Why systemd-timesyncd is Insufficient

Minimal SNTP client, no proper clock discipline or frequency tracking
Backs off polling interval when it thinks clock is stable, missing drift
Corrects via step adjustments rather than gradual slewing, causing metric jumps
Each VM resolves to different pool servers with varying accuracy

Implementation Steps

1. Configure pve1 as NTP Server

Add to pve1's /etc/chrony/chrony.conf:

# Allow NTP clients from the infrastructure subnet
allow 10.69.13.0/24

Restart chrony on pve1.

2. Add Chrony to NixOS System Config

Create system/chrony.nix (applied to all hosts via system imports):

{
  # Disable systemd-timesyncd (chrony takes over)
  services.timesyncd.enable = false;

  # Enable chrony pointing at pve1
  services.chrony = {
    enable = true;
    servers = [ "pve1.home.2rjus.net" ];
    serverOption = "iburst";
  };
}
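
The slewing behaviour this plan relies on can be made explicit in the same module. The sketch below is a variant of the module above and is not part of the original plan: it assumes the NixOS `services.chrony.extraConfig` option is used to pass chrony's `makestep` directive, so that a step is only permitted during the first few clock updates after chronyd starts. The threshold and update count are illustrative values, not tested ones.

```nix
{
  # Disable systemd-timesyncd (chrony takes over)
  services.timesyncd.enable = false;

  services.chrony = {
    enable = true;
    servers = [ "pve1.home.2rjus.net" ];
    serverOption = "iburst";

    # Assumption (not in the original plan): allow a step only if the
    # offset exceeds 1 second, and only during the first 3 clock updates.
    # Later corrections are slewed.
    extraConfig = ''
      makestep 1.0 3
    '';
  };
}
```

Whether a one-off step at startup is acceptable here is a judgment call; the line can be dropped if it is not.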

3. Optional: Add Chrony Exporter

For better visibility into NTP sync quality:

services.prometheus.exporters.chrony.enable = true;

Add chrony exporter scrape targets via homelab.monitoring.scrapeTargets and create a Grafana dashboard for NTP offset across all hosts.
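
As a rough sketch of the exporter side, the fragment below spells out two of the generic exporter options next to `enable`. The port value is an assumption (9123 is believed to be the exporter's default), and `openFirewall` is only needed if the Prometheus host cannot otherwise reach that port. The schema of `homelab.monitoring.scrapeTargets` is specific to this repo and is deliberately not shown.

```nix
{
  services.prometheus.exporters.chrony = {
    enable = true;

    # Assumption: 9123 is the default port; it is stated explicitly here
    # so the scrape target can be written to match.
    port = 9123;

    # Assumption: the host firewall would otherwise block the Prometheus
    # server from reaching the exporter.
    openFirewall = true;
  };
}
```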

4. Roll Out

Deploy to a test-tier host first to verify
Then deploy to all hosts via auto-upgrade

Open Questions

Does pve1's chrony config need local stratum 10 as fallback if upstream is unreachable?
Should we also set enableRTCTrimming for the VMs?
Is it worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?
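
For the enableRTCTrimming question, the change to the NixOS module would be small. This is only a sketch of what it might look like: the threshold value is illustrative, and the option may already default to enabled in the nixpkgs revision in use, so it is worth checking before adding it.

```nix
{
  services.chrony = {
    # Track the RTC's offset from the system clock and trim it automatically.
    enableRTCTrimming = true;

    # Illustrative value (assumption): trim the RTC once its estimated
    # error exceeds this many seconds.
    autotrimThreshold = 30;
  };
}
```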

Notes

VMs need no fallback NTP servers: if pve1 is down, all VMs are down too
The host_reboot alert rule (changes(node_boot_time_seconds[10m]) > 0) should stop false-firing once clock corrections are slewed instead of stepped
pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency
