83 Commits

Author SHA1 Message Date
6a3e78a479 nrec-nixos01: enable Git LFS and hide explore page
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m47s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 15:10:59 +01:00
cfc0c6f6cb nrec-nixos01: add Forgejo with Caddy reverse proxy
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m6s
Run nix flake check / flake-check (pull_request) Failing after 4m31s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 14:49:48 +01:00
822380695e nrec-nixos01: import qemu-guest profile for virtio modules
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m6s
The initrd was missing virtio drivers, preventing the root
filesystem from being detected during boot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 14:31:09 +01:00
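As a rough illustration of what 822380695e describes (not the commit's actual diff), importing the upstream qemu-guest profile from a host module could look like the sketch below. `modulesPath` is the standard NixOS module argument; the surrounding file layout is an assumption, since the log does not show it.

```nix
# Hypothetical host module; the real file in this repository is not shown in the log.
{ modulesPath, ... }:
{
  imports = [
    # Upstream nixpkgs profile that makes the virtio drivers available in the initrd.
    (modulesPath + "/profiles/qemu-guest.nix")
  ];
}
```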
0941bd52f5 nrec-nixos01: fix root filesystem device to use label
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m22s
The OpenStack image labels the root partition "nixos", so use
/dev/disk/by-label/nixos instead of /dev/vda1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 14:22:24 +01:00
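A minimal sketch of the label-based root mount described in 0941bd52f5. The device path comes from the commit message; the `fsType` value is an assumption, since the log does not name the filesystem.

```nix
{
  fileSystems."/" = {
    # Stable by-label path instead of /dev/vda1, as stated in the commit message.
    device = "/dev/disk/by-label/nixos";
    # Assumption: the filesystem type is not given in the log.
    fsType = "ext4";
  };
}
```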
9ebdd94773 Merge pull request 'nrec-nixos01' (#44) from nrec-nixos01 into master
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reviewed-on: #44
2026-03-08 13:12:24 +00:00
adc267bd95 nrec-nixos01: add host configuration with Caddy web server
Some checks failed
Run nix flake check / flake-check (push) Failing after 9m20s
Run nix flake check / flake-check (pull_request) Failing after 3m58s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 14:10:05 +01:00
7ffe2d71d6 openstack-template: add minimal NixOS image for OpenStack
Adds a new host configuration for building qcow2 images targeting
OpenStack (NREC). Uses a nixos user with SSH key and sudo instead
of root login, firewall enabled, and no internal services.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 13:56:55 +01:00
dd9ba42eb5 devshell: add openstack cli client
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m16s
2026-03-08 13:31:54 +01:00
3ee0433a6f flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/fabb8c9deee281e50b1065002c9828f2cf7b2239?narHash=sha256-YaHht/C35INEX3DeJQNWjNaTcPjYmBwwjFJ2jdtr%2B5U%3D' (2026-03-04)
  → 'github:nixos/nixpkgs/71caefce12ba78d84fe618cf61644dce01cf3a96?narHash=sha256-yf3iYLGbGVlIthlQIk5/4/EQDZNNEmuqKZkQssMljuw%3D' (2026-03-06)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/80bdc1e5ce51f56b19791b52b2901187931f5353?narHash=sha256-QKyJ0QGWBn6r0invrMAK8dmJoBYWoOWy7lN%2BUHzW1jc%3D' (2026-03-04)
  → 'github:nixos/nixpkgs/aca4d95fce4914b3892661bcb80b8087293536c6?narHash=sha256-E1bxHxNKfDoQUuvriG71%2Bf%2Bs/NT0qWkImXsYZNFFfCs%3D' (2026-03-06)
2026-03-08 00:02:42 +00:00
73d804105b pn01, pn02: enable memtest86 and update stability docs
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m15s
Periodic flake update / flake-update (push) Successful in 2m50s
Enable memtest86 in systemd-boot menu on both PN51 units to allow
extended memory testing. Update stability document with March crash
data from pstore/Loki — crashes now traced to sched_ext scheduler
kernel oops, suggesting possible memory corruption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:02:28 +01:00
d2a4e4a0a1 grafana: add storage query performance panels to apiary dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m23s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 22:47:30 +01:00
28eba49d68 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/8c809a146a140c5c8806f13399592dbcb1bb5dc4?narHash=sha256-WGV2hy%2BVIeQsYXpsLjdr4GvHv5eECMISX1zKLTedhdg%3D' (2026-03-03)
  → 'github:nixos/nixpkgs/80bdc1e5ce51f56b19791b52b2901187931f5353?narHash=sha256-QKyJ0QGWBn6r0invrMAK8dmJoBYWoOWy7lN%2BUHzW1jc%3D' (2026-03-04)
2026-03-06 00:07:07 +00:00
4bf726a674 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/c581273b8d5bdf1c6ce7e0a54da9841e6a763913?narHash=sha256-ywy9troNEfpgh0Ee%2BzaV1UTgU8kYBVKtvPSxh6clYGU%3D' (2026-03-02)
  → 'github:nixos/nixpkgs/fabb8c9deee281e50b1065002c9828f2cf7b2239?narHash=sha256-YaHht/C35INEX3DeJQNWjNaTcPjYmBwwjFJ2jdtr%2B5U%3D' (2026-03-04)
2026-03-05 00:07:31 +00:00
774fd92524 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/1267bb4920d0fc06ea916734c11b0bf004bbe17e?narHash=sha256-7DaQVv4R97cii/Qdfy4tmDZMB2xxtyIvNGSwXBBhSmo%3D' (2026-02-25)
  → 'github:nixos/nixpkgs/c581273b8d5bdf1c6ce7e0a54da9841e6a763913?narHash=sha256-ywy9troNEfpgh0Ee%2BzaV1UTgU8kYBVKtvPSxh6clYGU%3D' (2026-03-02)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/cf59864ef8aa2e178cccedbe2c178185b0365705?narHash=sha256-izhTDFKsg6KeVBxJS9EblGeQ8y%2BO8eCa6RcW874vxEc%3D' (2026-03-02)
  → 'github:nixos/nixpkgs/8c809a146a140c5c8806f13399592dbcb1bb5dc4?narHash=sha256-WGV2hy%2BVIeQsYXpsLjdr4GvHv5eECMISX1zKLTedhdg%3D' (2026-03-03)
2026-03-04 00:06:56 +00:00
55da459108 docs: add plan for local NTP with chrony
Some checks failed
Run nix flake check / flake-check (push) Failing after 9m52s
Periodic flake update / flake-update (push) Successful in 5m19s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 19:33:28 +01:00
813c5c0f29 monitoring: separate node-exporter-only external targets
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m7s
Add nodeExporterOnly list to external-targets.nix for hosts that
have node-exporter but not systemd-exporter (e.g. pve1). This
prevents a down target in the systemd-exporter scrape job.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 19:17:39 +01:00
013ab8f621 monitoring: add pve1 node-exporter scrape target
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m6s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 19:10:54 +01:00
f75b773485 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/dd9b079222d43e1943b6ebd802f04fd959dc8e61?narHash=sha256-I45esRSssFtJ8p/gLHUZ1OUaaTaVLluNkABkk6arQwE%3D' (2026-02-27)
  → 'github:nixos/nixpkgs/cf59864ef8aa2e178cccedbe2c178185b0365705?narHash=sha256-izhTDFKsg6KeVBxJS9EblGeQ8y%2BO8eCa6RcW874vxEc%3D' (2026-03-02)
2026-03-03 00:07:07 +00:00
58c3844950 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/2fc6539b481e1d2569f25f8799236694180c0993?narHash=sha256-0MAd%2B0mun3K/Ns8JATeHT1sX28faLII5hVLq0L3BdZU%3D' (2026-02-23)
  → 'github:nixos/nixpkgs/dd9b079222d43e1943b6ebd802f04fd959dc8e61?narHash=sha256-I45esRSssFtJ8p/gLHUZ1OUaaTaVLluNkABkk6arQwE%3D' (2026-02-27)
2026-03-01 00:01:26 +00:00
80e5fa08fa flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/e764fc9a405871f1f6ca3d1394fb422e0a0c3951?narHash=sha256-sdaqdnsQCv3iifzxwB22tUwN/fSHoN7j2myFW5EIkGk%3D' (2026-02-24)
  → 'github:nixos/nixpkgs/1267bb4920d0fc06ea916734c11b0bf004bbe17e?narHash=sha256-7DaQVv4R97cii/Qdfy4tmDZMB2xxtyIvNGSwXBBhSmo%3D' (2026-02-25)
2026-02-28 00:07:22 +00:00
cf55d07ce5 docs: update pn51 stability with third freeze and conclusion
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m1s
Periodic flake update / flake-update (push) Successful in 5m37s
pn02 crashed again after ~2d21h uptime despite all mitigations
(amdgpu blacklist, max_cstate=1, NMI watchdog, rasdaemon).
NMI watchdog didn't fire and rasdaemon recorded nothing,
confirming hard lockup below NMI level. Unit is unreliable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 18:25:52 +01:00
4941e38dac flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/afbbf774e2087c3d734266c22f96fca2e78d3620?narHash=sha256-nhZJPnBavtu40/L2aqpljrfUNb2rxmWTmSjK2c9UKds%3D' (2026-02-21)
  → 'github:nixos/nixpkgs/e764fc9a405871f1f6ca3d1394fb422e0a0c3951?narHash=sha256-sdaqdnsQCv3iifzxwB22tUwN/fSHoN7j2myFW5EIkGk%3D' (2026-02-24)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/0182a361324364ae3f436a63005877674cf45efb?narHash=sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ%3D' (2026-02-17)
  → 'github:nixos/nixpkgs/2fc6539b481e1d2569f25f8799236694180c0993?narHash=sha256-0MAd%2B0mun3K/Ns8JATeHT1sX28faLII5hVLq0L3BdZU%3D' (2026-02-23)
2026-02-25 00:07:00 +00:00
03ffcc1ad0 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/c217913993d6c6f6805c3b1a3bda5e639adfde6d?narHash=sha256-D1PA3xQv/s4W3lnR9yJFSld8UOLr0a/cBWMQMXS%2B1Qg%3D' (2026-02-20)
  → 'github:nixos/nixpkgs/afbbf774e2087c3d734266c22f96fca2e78d3620?narHash=sha256-nhZJPnBavtu40/L2aqpljrfUNb2rxmWTmSjK2c9UKds%3D' (2026-02-21)
2026-02-24 00:01:35 +00:00
5e92eb3220 docs: add plan for NixOS OpenStack image
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m1s
Periodic flake update / flake-update (push) Successful in 2m23s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:42:19 +01:00
2321e191a2 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47?narHash=sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU%3D' (2026-02-18)
  → 'github:nixos/nixpkgs/c217913993d6c6f6805c3b1a3bda5e639adfde6d?narHash=sha256-D1PA3xQv/s4W3lnR9yJFSld8UOLr0a/cBWMQMXS%2B1Qg%3D' (2026-02-20)
2026-02-23 00:01:30 +00:00
136116ab33 pn02: limit CPU to C1 power state for stability
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m36s
Periodic flake update / flake-update (push) Successful in 2m18s
Known PN51 platform issue with deep C-states causing freezes.
Limit to C1 to prevent deeper sleep states.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:58:41 +01:00
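One plausible way to express the C1 limit from 136116ab33 is a kernel command-line parameter, sketched below. A later commit message mentions `max_cstate=1`; the `processor.` prefix is an assumption about which driver parameter was used.

```nix
{
  # Assumption: the ACPI processor driver parameter; the log only says "max_cstate=1".
  boot.kernelParams = [ "processor.max_cstate=1" ];
}
```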
c8cadd09c5 pn51: document diagnostic config (rasdaemon, NMI watchdog, panic)
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:52:34 +01:00
72acaa872b pn02: add panic on lockup, NMI watchdog, and rasdaemon
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enable kernel panic on soft/hard lockups with auto-reboot after
10s, and rasdaemon for hardware error logging. Should give us
diagnostic data on the next freeze.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:48:21 +01:00
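The diagnostics named in 72acaa872b could be sketched with standard NixOS options and kernel sysctls as follows. This is an illustration under assumptions, not the commit's contents; only the 10-second reboot delay is taken from the message.

```nix
{
  boot.kernel.sysctl = {
    # Panic instead of hanging when a soft or hard lockup is detected.
    "kernel.softlockup_panic" = 1;
    "kernel.hardlockup_panic" = 1;
    # Enable the NMI watchdog that detects hard lockups.
    "kernel.nmi_watchdog" = 1;
    # Reboot 10 seconds after a panic, matching the commit message.
    "kernel.panic" = 10;
  };

  # Log hardware errors (MCE, EDAC) for later inspection.
  hardware.rasdaemon.enable = true;
}
```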
a7c1ce932d pn51: add remaining debug steps and auto-recovery fallback
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m4s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:38:17 +01:00
2b42145d94 pn51: document BIOS tweaks, second pn02 freeze, amdgpu blacklist
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:28:19 +01:00
05e8556bda pn02: blacklist amdgpu kernel module for stability testing
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
pn02 continues to hard freeze with no log evidence. Blacklisting
the GPU driver to eliminate GPU/PSP firmware interactions as a
possible cause. Console output will be lost but the host is
managed over SSH.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:27:05 +01:00
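A blacklist like the one described in 05e8556bda is commonly written with the `boot.blacklistedKernelModules` option. The sketch below assumes that option was used; the log does not show the actual change.

```nix
{
  # Keep the GPU driver from loading; console output is lost, and the host is managed over SSH.
  boot.blacklistedKernelModules = [ "amdgpu" ];
}
```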
75fdd7ae40 pn51: document stress test pass and TSC runtime test failure
Some checks failed
Run nix flake check / flake-check (push) Failing after 17m0s
Both units survived 1h stress test at 80-85C. TSC clocksource
is genuinely unstable at runtime (not just boot), HPET is the
correct fallback for this platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 11:52:34 +01:00
5346889b73 pn51: add TSC runtime switch test to next steps
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 11:50:30 +01:00
7e19f51dfa nix: move experimental-features to system/nix.nix
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
All hosts had identical nix-command/flakes settings in their
configuration.nix. Centralize in system/nix.nix so new hosts
(like pn01/pn02) get it automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 10:27:53 +01:00
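A shared module such as `system/nix.nix` might centralize the setting from 7e19f51dfa roughly as below. The feature names come from the commit message; the rest of the module is an assumption.

```nix
# Hypothetical contents of system/nix.nix; only the two feature names are taken from the log.
{
  nix.settings.experimental-features = [
    "nix-command"
    "flakes"
  ];
}
```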
9f7aab86a0 pn51: update stability notes, TSC/PSP issues affect both units
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 09:25:28 +01:00
bb53b922fa plans: add NixOS hypervisor plan (Incus on PN51s)
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m40s
Periodic flake update / flake-update (push) Failing after 4s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 00:47:09 +01:00
75cd7c6c2d docs: add PN51 stability testing notes
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m3s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 00:24:28 +01:00
72c3a938b0 hosts: enable vault on pn01 and pn02
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m12s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 23:56:05 +01:00
2f89d564f7 vault: add approles for pn01/pn02, fix provision playbook
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add pn01 and pn02 to hosts-generated.tf for Vault AppRole access.

Fix provision-approle.yml: the localhost play was skipped when using
-l filter, since localhost didn't match the target. Merged into a
single play using delegate_to: localhost for the bao commands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 23:51:56 +01:00
4a83363ee5 hosts: add pn01 and pn02 (ASUS PN51 mini PCs)
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m33s
Add two ASUS PN51 hosts on VLAN 12 for stability testing.
pn01 at 10.69.12.60, pn02 at 10.69.12.61, both test-tier compute role.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 23:37:14 +01:00
b578520905 media-pc: add JellyCon, display server, and HDR decisions
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m45s
Periodic flake update / flake-update (push) Successful in 2m16s
Decided on Kodi + JellyCon with NFS direct path for media playback,
Sway/Hyprland for display server with workspace-based browser switching,
and noted HDR status for future reference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 00:08:19 +01:00
8a5aa1c4f5 plans: add media PC replacement plan, update router hardware candidates
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m30s
New plan for replacing the media PC (i7-4770K/Ubuntu) with a NixOS mini PC
running Kodi. Router plan updated with specific AliExpress hardware options
and IDS/IPS considerations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:54:29 +01:00
0f8c4783a8 truenas-migration: drive trays ordered, resolve open question
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m18s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 19:29:12 +01:00
2ca2509083 monitoring: increase filesystem_filling_up prediction window to 24h
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m55s
Reduces false positives from transient Nix store growth by basing the
linear prediction on a 24h trend instead of 6h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 09:36:27 +01:00
58702bd10b truenas-migration: note subnet issue for 10GbE traffic
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m10s
NAS and Proxmox are on the same 10GbE switch but different subnets,
forcing traffic through the router. Need to fix during migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
c9f47acb01 truenas-migration: mdadm boot mirror, clean zfs export step
Use TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root to keep
the boot path ZFS-independent. Added zfs export step before shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
09ce018fb2 truenas-migration: switch from BTRFS to keeping ZFS, update plan
BTRFS RAID5/6 write hole is still unresolved, and RAID1 wastes
capacity with mixed disk sizes. Keep existing ZFS pool and import
directly on NixOS instead. Updated migration strategy, disk purchase
decision (2x 24TB ordered), SMART health notes, and vdev rebalancing
guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
3042803c4d flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/fa56d7d6de78f5a7f997b0ea2bc6efd5868ad9e8?narHash=sha256-X01Q3DgSpjeBpapoGA4rzKOn25qdKxbPnxHeMLNoHTU%3D' (2026-02-16)
  → 'github:nixos/nixpkgs/6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47?narHash=sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU%3D' (2026-02-18)
2026-02-20 00:07:01 +00:00
1e7200b494 quick-plan: add mermaid diagram guideline
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m7s
Periodic flake update / flake-update (push) Successful in 5m26s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:35:53 +01:00
eec1e374b2 docs: simplify mermaid diagram labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m0s
Use <br/> for line breaks and shorter node labels so the diagram
renders cleanly in Gitea.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:29:52 +01:00
fcc410afad docs: replace ASCII diagram with mermaid in remote-access plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:28:57 +01:00
59f0c7ceda flake.lock: update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m10s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 09:04:03 +01:00
d713f06c6e flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/a82ccc39b39b621151d6732718e3e250109076fa?narHash=sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb%2BZnAo5RzSxJg%3D' (2026-02-13)
  → 'github:nixos/nixpkgs/0182a361324364ae3f436a63005877674cf45efb?narHash=sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ%3D' (2026-02-17)
2026-02-19 00:01:44 +00:00
7374d1ff7f nix-cache02: increase builder timeout to 4 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m4s
Periodic flake update / flake-update (push) Successful in 2m32s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 23:53:33 +01:00
e912c75b6c flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14)
  → 'github:nixos/nixpkgs/fa56d7d6de78f5a7f997b0ea2bc6efd5868ad9e8?narHash=sha256-X01Q3DgSpjeBpapoGA4rzKOn25qdKxbPnxHeMLNoHTU%3D' (2026-02-16)
2026-02-18 00:01:34 +00:00
b218b4f8bc docs: update migration plan for monitoring01 and pgdb1 completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m37s
Periodic flake update / flake-update (push) Successful in 2m21s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:26:23 +01:00
65acf13e6f grafana: fix datasource UIDs for VictoriaMetrics migration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all dashboard datasource references from "prometheus" to
"victoriametrics" to match the declared datasource UID. Enable
prune and deleteDatasources to clean up the old Prometheus
(monitoring01) datasource from Grafana's database.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:23:04 +01:00
95a96b2192 Merge pull request 'monitoring01: remove host and migrate services to monitoring02' (#43) from cleanup-monitoring01 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m2s
Reviewed-on: #43
2026-02-17 21:08:00 +00:00
4f593126c0 monitoring01: remove host and migrate services to monitoring02
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m15s
Run nix flake check / flake-check (pull_request) Failing after 3m8s
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:50:20 +01:00
1bba6f106a Merge pull request 'monitoring02: enable alerting and migrate CNAMEs from http-proxy' (#42) from monitoring02-enable-alerting into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m5s
Reviewed-on: #42
2026-02-17 20:24:16 +00:00
a6013d3950 monitoring02: enable alerting and migrate CNAMEs from http-proxy
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m25s
Run nix flake check / flake-check (pull_request) Failing after 3m52s
- Switch vmalert from blackhole mode to sending alerts to local
  Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
  http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
  extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
  only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
  basic auth from Vault secret
- Move migration plan to completed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:23:21 +01:00
7f69c0738a Merge pull request 'loki-monitoring02' (#41) from loki-monitoring02 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m20s
Reviewed-on: #41
2026-02-17 19:40:33 +00:00
35924c7b01 mcp: move config to .mcp.json.example, gitignore real config
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m57s
Run nix flake check / flake-check (pull_request) Failing after 16m45s
The real .mcp.json now contains Loki credentials for basic auth,
so it should not be committed. The example file has placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:35:14 +01:00
87d8571d62 promtail: fix vault secret ownership for loki auth
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m24s
The secret file needs to be owned by promtail since Promtail runs
as a dedicated user and can't read root-owned files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:17:02 +01:00
43c81f6688 terraform: fix loki-push policy for generated hosts
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Revert ns1/ns2 from approle.tf (they're in hosts-generated.tf) and add
loki-push policy to generated AppRoles instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:13:22 +01:00
58f901ad3e terraform: add ns1 and ns2 to AppRole policies
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
They were missing from the host_policies map, so they didn't get
shared policies like loki-push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:10:37 +01:00
c13921d302 loki: add basic auth for log push and dual-ship promtail
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m36s
- Loki bound to localhost, Caddy reverse proxy with basic_auth
- Vault secret (shared/loki/push-auth) for password, bcrypt hash
  generated at boot for Caddy environment
- Promtail dual-ships to monitoring01 (direct) and loki.home.2rjus.net
  (with basic auth), conditional on vault.enable
- Terraform: new shared loki-push policy added to all AppRoles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:00:08 +01:00
2903873d52 monitoring02: add loki CNAME and Caddy reverse proxy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:48:06 +01:00
74e7c9faa4 monitoring02: add Loki service
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m19s
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:42:19 +01:00
471f536f1f Merge pull request 'victoriametrics-monitoring02' (#40) from victoriametrics-monitoring02 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Periodic flake update / flake-update (push) Successful in 3m29s
Reviewed-on: #40
2026-02-16 23:56:04 +00:00
a013e80f1a terraform: grant monitoring02 access to apiary-token secret
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m59s
Run nix flake check / flake-check (pull_request) Failing after 4m20s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
4cbaa33475 monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.

Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
e329f87b0b monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
c151f31011 grafana: fix apiary dashboard panels empty on short time ranges
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m54s
Set interval=60s on rate() panels to match the actual Prometheus scrape
interval, so Grafana calculates $__rate_interval correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 20:03:26 +01:00
f5362d6936 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)
  → 'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)
  → 'github:nixos/nixpkgs/a82ccc39b39b621151d6732718e3e250109076fa?narHash=sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb%2BZnAo5RzSxJg%3D' (2026-02-13)
2026-02-16 00:07:10 +00:00
3e7aabc73a grafana: fix apiary geomap and make it full-width
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m6s
Periodic flake update / flake-update (push) Successful in 5m25s
Add gazetteer reference for country code lookup resolution.
Remove unnecessary reduce transformation. Make geomap panel
full-width (24 cols) and taller (h=10) on its own row.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:36:24 +01:00
361e7f2a1b grafana: add apiary honeypot dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:31:06 +01:00
1942591d2e monitoring: add apiary metrics scraping with bearer token auth
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m52s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:36:26 +01:00
4d614d8716 docs: add new service candidates and NixOS router plans
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m22s
Periodic flake update / flake-update (push) Failing after 1s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 13:21:34 +01:00
fd7caf7f00 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08)
  → 'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)
2026-02-14 00:01:24 +00:00
af8e385b6e docs: finalize remote access plan with WireGuard gateway design
Some checks failed
Run nix flake check / flake-check (push) Failing after 21m7s
Periodic flake update / flake-update (push) Successful in 2m16s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:31:52 +01:00
0db9fc6802 docs: update Loki improvements plan with implementation status
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m55s
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:04:16 +01:00
5d68662035 loki: add 30-day retention policy and ingestion limits
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enable compactor-based retention with 30-day period to prevent
unbounded disk growth. Add basic rate limits and stream guards
to protect against runaway log generators.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:55:27 +01:00
87 changed files with 3220 additions and 1442 deletions

@@ -130,7 +130,7 @@ get_commit_info(<hash>) # Get full details of a specific change
 ```
 **Example workflow for a service-related alert:**
-1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
+1. Query `nixos_flake_info{hostname="monitoring02"}` → `current_rev: 8959829`
 2. `resolve_ref("master")` → `4633421`
 3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
 4. `commits_between("8959829", "4633421")` → 7 commits missing

@@ -30,7 +30,7 @@ Use the `lab-monitoring` MCP server tools:
 ### Label Reference
 Available labels for log queries:
-- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
+- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
 - `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path
@@ -54,7 +54,7 @@ Journal logs are JSON-formatted. Key fields:
**All logs from a host:**
```logql
{hostname="monitoring01"}
{hostname="monitoring02"}
```
**Logs from a service across all hosts:**
@@ -74,7 +74,7 @@ Journal logs are JSON-formatted. Key fields:
**Regex matching:**
```logql
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
```
**Filter by level (journal scrape only):**
@@ -109,7 +109,7 @@ Default lookback is 1 hour. Use `start` parameter for older logs:
Useful systemd units for troubleshooting:
- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `victoriametrics.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
@@ -152,7 +152,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl
Parse JSON and filter on fields:
```logql
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
```
---
@@ -242,12 +242,11 @@ All available Prometheus job names:
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)
**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
**Monitoring stack (localhost on monitoring02):**
- `victoriametrics` - VictoriaMetrics self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway
**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
@@ -262,7 +261,7 @@ All scrape targets have these labels:
**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering
**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
@@ -275,7 +274,7 @@ Use the `hostname` label for easy host filtering across all jobs:
```promql
{hostname="ns1"} # All metrics from ns1
node_load1{hostname="monitoring01"} # Specific metric by hostname
node_load1{hostname="monitoring02"} # Specific metric by hostname
up{hostname="ha1"} # Check if ha1 is up
```
@@ -283,10 +282,10 @@ This is simpler than wildcarding the `instance` label:
```promql
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}
up{instance=~"monitoring02.*"}
# New way (preferred)
up{hostname="monitoring01"}
up{hostname="monitoring02"}
```
### Filtering by Role/Tier


@@ -73,6 +73,7 @@ Additional context, caveats, or references.
- **Reference existing patterns**: Mention how this fits with existing infrastructure
- **Tables for comparisons**: Use markdown tables when comparing options
- **Practical focus**: Emphasize what needs to happen, not theory
- **Mermaid diagrams**: Use mermaid code blocks for architecture diagrams, flow charts, or other graphs when relevant to the plan. Keep node labels short and use `<br/>` for line breaks
## Examples of Good Plans

3
.gitignore vendored

@@ -2,6 +2,9 @@
result
result-*
# MCP config (contains secrets)
.mcp.json
# Terraform/OpenTofu
terraform/.terraform/
terraform/.terraform.lock.hcl


@@ -20,7 +20,9 @@
"env": {
"PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
"LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
"LOKI_URL": "https://loki.home.2rjus.net",
"LOKI_USERNAME": "promtail",
"LOKI_PASSWORD": "<password from: bao kv get -field=password secret/shared/loki/push-auth>"
}
},
"homelab-deploy": {
@@ -44,4 +46,3 @@
}
}
}


@@ -247,7 +247,7 @@ nix develop -c homelab-deploy -- deploy \
deploy.prod.<hostname>
```
Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring02`, `deploy.test.testvm01`)
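As a rough illustration of the subject format, the Python sketch below builds and parses `deploy.<tier>.<hostname>` subjects. The helper names and the tier set are assumptions made for this example and are not part of `homelab-deploy`; the tiers listed are only the ones that appear in the examples above.

```python
# Hypothetical helpers; homelab-deploy itself may validate subjects differently.
KNOWN_TIERS = {"prod", "test"}  # assumption: tiers taken from the examples only


def deploy_subject(tier: str, hostname: str) -> str:
    """Build a subject of the form deploy.<tier>.<hostname>."""
    if tier not in KNOWN_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    # NATS uses "." as the token separator, so a short hostname must not contain one.
    if not hostname or "." in hostname:
        raise ValueError(f"expected a short hostname, got: {hostname!r}")
    return f"deploy.{tier}.{hostname}"


def parse_deploy_subject(subject: str) -> tuple:
    """Split a subject back into (tier, hostname)."""
    parts = subject.split(".")
    if len(parts) != 3 or parts[0] != "deploy" or not all(parts):
        raise ValueError(f"not a deploy subject: {subject!r}")
    return parts[1], parts[2]
```

Passing a fully qualified name such as `monitoring02.home.2rjus.net` is rejected here on purpose, since the extra dots would add subject tokens.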
**Verifying Deployments:**
@@ -309,7 +309,7 @@ All hosts automatically get:
- OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring01)
- Prometheus node-exporter + Promtail (logs to monitoring02)
- Monitoring scrape target auto-registration via `homelab.monitoring` options
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options
@@ -335,7 +335,7 @@ Use `nix flake show` or `nix develop -c ansible-inventory --graph` to list all h
- Infrastructure subnet: `10.69.13.x`
- DNS: ns1/ns2 provide authoritative DNS with primary-secondary setup
- Internal CA for ACME certificates (no Let's Encrypt)
- Centralized monitoring at monitoring01
- Centralized monitoring at monitoring02
- Static networking via systemd-networkd
### Secrets Management
@@ -480,23 +480,21 @@ See [docs/host-creation.md](docs/host-creation.md) for the complete host creatio
### Monitoring Stack
All hosts ship metrics and logs to `monitoring01`:
- **Metrics**: Prometheus scrapes node-exporter from all hosts
- **Logs**: Promtail ships logs to Loki on monitoring01
- **Access**: Grafana at monitoring01 for visualization
- **Tracing**: Tempo for distributed tracing
- **Profiling**: Pyroscope for continuous profiling
All hosts ship metrics and logs to `monitoring02`:
- **Metrics**: VictoriaMetrics scrapes node-exporter from all hosts
- **Logs**: Promtail ships logs to Loki on monitoring02
- **Access**: Grafana at monitoring02 for visualization
**Scrape Target Auto-Generation:**
Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
VictoriaMetrics scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The VictoriaMetrics config on monitoring02 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
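
The auto-generation idea can be sketched in Python as follows. This is an illustration only: the real logic lives in `lib/monitoring.nix` and is written in Nix, and the input shape (`static_ip` key), the function name, and the use of node-exporter's default port 9100 are assumptions for this example.

```python
# Illustrative only; not a port of lib/monitoring.nix.
DOMAIN = "home.2rjus.net"
NODE_EXPORTER_PORT = 9100  # assumption: node-exporter's default port


def generate_node_exporter_targets(hosts: dict) -> list:
    """Return one scrape target per host that has a static IP."""
    targets = []
    for hostname in sorted(hosts):
        if not hosts[hostname].get("static_ip"):
            continue  # hosts without a static IP are not registered
        targets.append({
            "targets": [f"{hostname}.{DOMAIN}:{NODE_EXPORTER_PORT}"],
            "labels": {"hostname": hostname},
        })
    return targets


# "dhcp-host" is a made-up entry to show the filtering behaviour.
example_hosts = {
    "monitoring02": {"static_ip": "10.69.13.24"},
    "dhcp-host": {},
}
example_targets = generate_node_exporter_targets(example_hosts)
```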


@@ -10,7 +10,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
| `ca` | Internal Certificate Authority |
| `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
| `http-proxy` | Reverse proxy |
| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `monitoring02` | VictoriaMetrics, Grafana, Loki, Alertmanager |
| `jelly01` | Jellyfin media server |
| `nix-cache02` | Nix binary cache + NATS-based build service |
| `nats1` | NATS messaging |
@@ -121,4 +121,4 @@ No manual intervention is required after `tofu apply`.
- Infrastructure subnet: `10.69.13.0/24`
- DNS: ns1/ns2 authoritative with primary-secondary AXFR
- Internal CA for TLS certificates (migrating from step-ca to OpenBao PKI)
- Centralized monitoring at monitoring01
- Centralized monitoring at monitoring02


@@ -23,14 +23,12 @@
when: ansible_play_hosts | length != 1
run_once: true
- name: Fetch AppRole credentials from OpenBao
hosts: localhost
connection: local
- name: Provision AppRole credentials
hosts: all
gather_facts: false
vars:
target_host: "{{ groups['all'] | first }}"
target_hostname: "{{ hostvars[target_host]['short_hostname'] | default(target_host.split('.')[0]) }}"
target_hostname: "{{ inventory_hostname.split('.')[0] }}"
tasks:
- name: Display target host
@@ -45,6 +43,7 @@
BAO_SKIP_VERIFY: "1"
register: role_id_result
changed_when: false
delegate_to: localhost
- name: Generate secret-id for host
ansible.builtin.command:
@@ -54,21 +53,8 @@
BAO_SKIP_VERIFY: "1"
register: secret_id_result
changed_when: true
delegate_to: localhost
- name: Store credentials for next play
ansible.builtin.set_fact:
vault_role_id: "{{ role_id_result.stdout }}"
vault_secret_id: "{{ secret_id_result.stdout }}"
- name: Deploy AppRole credentials to host
hosts: all
gather_facts: false
vars:
vault_role_id: "{{ hostvars['localhost']['vault_role_id'] }}"
vault_secret_id: "{{ hostvars['localhost']['vault_secret_id'] }}"
tasks:
- name: Create AppRole directory
ansible.builtin.file:
path: /var/lib/vault/approle
@@ -79,7 +65,7 @@
- name: Write role-id
ansible.builtin.copy:
content: "{{ vault_role_id }}"
content: "{{ role_id_result.stdout }}"
dest: /var/lib/vault/approle/role-id
mode: "0600"
owner: root
@@ -87,7 +73,7 @@
- name: Write secret-id
ansible.builtin.copy:
content: "{{ vault_secret_id }}"
content: "{{ secret_id_result.stdout }}"
dest: /var/lib/vault/approle/secret-id
mode: "0600"
owner: root


@@ -0,0 +1,156 @@
# Monitoring Stack Migration to VictoriaMetrics
## Overview
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for seamless transition.
## Current State
**monitoring02** (10.69.13.24) - **PRIMARY**:
- 4 CPU cores, 8GB RAM, 60GB disk
- VictoriaMetrics with 3-month retention
- vmalert with alerting enabled (routes to local Alertmanager)
- Alertmanager -> alerttonotify -> NATS notification pipeline
- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
- Loki (log aggregation)
- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki
**monitoring01** (10.69.13.13) - **SHUT DOWN**:
- No longer running, pending decommission
## Decision: VictoriaMetrics
Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
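
The retention estimate above is simple arithmetic, sketched here in Python. It assumes the ingest rate stays the same and that disk usage scales linearly with retention; the function name is made up for this example.

```python
def retention_days(current_days: float, compression_gain: float) -> float:
    """Days of data that fit in the same disk space for a given compression gain."""
    return current_days * compression_gain


low_estimate = retention_days(30, 5)    # 30 days at a 5x gain
high_estimate = retention_days(30, 10)  # 30 days at a 10x gain
required_gain = 180 / 30                # gain needed to reach 180 days
```

Under these assumptions the "180+ days" figure corresponds to a gain of about 6x, which sits inside the quoted 5-10x range.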
## Architecture
```
                ┌─────────────────┐
                │  monitoring02   │
                │  VictoriaMetrics│
                │  + Grafana      │
monitoring      │  + Loki         │
CNAME ──────────│  + Alertmanager │
                │  (vmalert)      │
                └─────────────────┘
                         │ scrapes
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
    │   ns1   │    │   ha1    │    │   ...    │
    │  :9100  │    │  :9100   │    │  :9100   │
    └─────────┘    └──────────┘    └──────────┘
```
## Implementation Plan
### Phase 1: Create monitoring02 Host [COMPLETE]
Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]
New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.
1. **VictoriaMetrics** (port 8428):
- `services.victoriametrics.enable = true`
- `retentionPeriod = "3"` (3 months)
- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
- Static user override (DynamicUser disabled) for credential file access
- OpenBao token fetch service + 30min refresh timer
- Apiary bearer token via vault.secrets
2. **vmalert** for alerting rules:
- Points to VictoriaMetrics datasource at localhost:8428
- Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
- Notifier sends to local Alertmanager at localhost:9093
3. **Alertmanager** (port 9093):
- Same configuration as monitoring01 (alerttonotify webhook routing)
- alerttonotify imported on monitoring02, routes alerts via NATS
4. **Grafana** (port 3000):
- VictoriaMetrics datasource (localhost:8428) as default
- Loki datasource pointing to localhost:3100
5. **Loki** (port 3100):
- Same configuration as monitoring01 in standalone `services/loki/` module
- Grafana datasource updated to localhost:3100
**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
native push support.
### Phase 3: Parallel Operation [COMPLETE]
Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.
### Phase 4: Add monitoring CNAME [COMPLETE]
Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.
### Phase 5: Update References [COMPLETE]
- Moved alertmanager, grafana, prometheus CNAMEs from http-proxy to monitoring02
- Removed corresponding Caddy reverse proxy entries from http-proxy
- monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly
### Phase 6: Enable Alerting [COMPLETE]
- Switched vmalert from blackhole mode to local Alertmanager
- alerttonotify service running on monitoring02 (NATS nkey from Vault)
- prometheus-metrics Vault policy added for OpenBao scraping
- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS
### Phase 7: Cutover and Decommission [IN PROGRESS]
- monitoring01 shut down (2026-02-17)
- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support
**Remaining cleanup (separate branch):**
- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
- [ ] Update `hosts/template2/bootstrap.nix` - Bootstrap Loki URL still points to monitoring01
- [ ] Remove monitoring01 from flake.nix and host configuration
- [ ] Destroy monitoring01 VM in Proxmox
- [ ] Remove monitoring01 from terraform state
- [ ] Remove or archive `services/monitoring/` (Prometheus config)
## Completed
- 2026-02-08: Phase 1 - monitoring02 host created
- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
- 2026-02-17: Phase 6 - Alerting enabled, CNAMEs migrated, monitoring01 shut down
## VictoriaMetrics Service Configuration
Implemented in `services/victoriametrics/default.nix`. Key design decisions:
- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
`victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
`services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
## Notes
- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent: the existing alert rules in `services/monitoring/rules.yml` are reused by vmalert without changes
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)


@@ -20,9 +20,9 @@ Hosts to migrate:
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| ~~monitoring01~~ | ~~Decommission~~ | ✓ Complete — replaced by monitoring02 (VictoriaMetrics) |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
| ~~pgdb1~~ | ~~Decommission~~ | ✓ Complete |
| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
| ~~ca~~ | ~~Deferred~~ | ✓ Complete |
@@ -31,10 +31,12 @@ Hosts to migrate:
Before migrating any stateful host, ensure restic backups are in place and verified.
### 1a. Expand monitoring01 Grafana Backup
### ~~1a. Expand monitoring01 Grafana Backup~~ ✓ N/A
The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.
~~The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.~~
No longer needed — monitoring01 decommissioned, replaced by monitoring02 with declarative Grafana dashboards.
### 1b. Add Jellyfin Backup to jelly01
@@ -94,15 +96,17 @@ For each stateful host, the procedure is:
7. Start services and verify functionality
8. Decommission the old VM
### 3a. monitoring01
### 3a. monitoring01 ✓ COMPLETE
1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
3. After bootstrap, restore `/var/lib/grafana/` from restic
4. Restart Grafana, verify dashboards and datasources are intact
5. Prometheus and Loki start fresh with empty data (acceptable)
6. Verify all scrape targets are being collected
7. Decommission old VM
~~1. Run final Grafana backup~~
~~2. Provision new monitoring01 via OpenTofu~~
~~3. After bootstrap, restore `/var/lib/grafana/` from restic~~
~~4. Restart Grafana, verify dashboards and datasources are intact~~
~~5. Prometheus and Loki start fresh with empty data (acceptable)~~
~~6. Verify all scrape targets are being collected~~
~~7. Decommission old VM~~
Replaced by monitoring02 with VictoriaMetrics, standalone Loki and Grafana modules. Host configuration, old service modules, and terraform resources removed.
### 3b. jelly01
@@ -163,19 +167,19 @@ Host was already removed from flake.nix and VM destroyed. Configuration cleaned
Host configuration, services, and VM already removed.
### pgdb1 (in progress)
### pgdb1 ✓ COMPLETE
Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
~~Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.~~
1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
2. ~~Remove host configuration from `hosts/pgdb1/`~~
3. ~~Remove `services/postgres/` (only used by pgdb1)~~
4. ~~Remove from `flake.nix`~~
5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~
6. Destroy the VM in Proxmox
7. ~~Commit cleanup~~
~~1. Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
~~2. Remove host configuration from `hosts/pgdb1/`~~
~~3. Remove `services/postgres/` (only used by pgdb1)~~
~~4. Remove from `flake.nix`~~
~~5. Remove Vault AppRole from `terraform/vault/approle.tf`~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~
See `docs/plans/pgdb1-decommission.md` for detailed plan.
Host configuration, services, terraform resources, and VM removed. See `docs/plans/pgdb1-decommission.md` for detailed plan.
## Phase 5: Decommission ca Host ✓ COMPLETE


@@ -0,0 +1,79 @@
# Local NTP with Chrony
## Overview/Goal
Set up pve1 as a local NTP server and switch all NixOS VMs from systemd-timesyncd to chrony, pointing at pve1 as the sole time source. This eliminates clock drift issues that cause false `host_reboot` alerts.
## Current State
- All NixOS hosts use `systemd-timesyncd` with default NixOS pool servers (`0.nixos.pool.ntp.org` etc.)
- No NTP/timesyncd configuration exists in the repo — all defaults
- pve1 (Proxmox, bare metal) already runs chrony but only as a client
- VMs drift noticeably — ns1 (~19ms) and jelly01 (~39ms) are worst offenders
- Clock step corrections from timesyncd trigger false `host_reboot` alerts via `changes(node_boot_time_seconds[10m]) > 0`
- pve1 itself stays at 0ms offset thanks to chrony
## Why systemd-timesyncd is Insufficient
- Minimal SNTP client, no proper clock discipline or frequency tracking
- Backs off polling interval when it thinks clock is stable, missing drift
- Corrects via step adjustments rather than gradual slewing, causing metric jumps
- Each VM resolves to different pool servers with varying accuracy
## Implementation Steps
### 1. Configure pve1 as NTP Server
Add to pve1's `/etc/chrony/chrony.conf`:
```
# Allow NTP clients from the infrastructure subnet
allow 10.69.13.0/24
```
Restart chrony on pve1.
### 2. Add Chrony to NixOS System Config
Create `system/chrony.nix` (applied to all hosts via system imports):
```nix
{
  # Disable systemd-timesyncd (chrony takes over)
  services.timesyncd.enable = false;
  # Enable chrony pointing at pve1
  services.chrony = {
    enable = true;
    servers = [ "pve1.home.2rjus.net" ];
    serverOption = "iburst";
  };
}
```
### 3. Optional: Add Chrony Exporter
For better visibility into NTP sync quality:
```nix
services.prometheus.exporters.chrony.enable = true;
```
Add chrony exporter scrape targets via `homelab.monitoring.scrapeTargets` and create a Grafana dashboard for NTP offset across all hosts.
### 4. Roll Out
- Deploy to a test-tier host first to verify
- Then deploy to all hosts via auto-upgrade
## Open Questions
- [ ] Does pve1's chrony config need `local stratum 10` as fallback if upstream is unreachable?
- [ ] Should we also enable `enableRTCTrimming` for the VMs?
- [ ] Worth adding a chrony exporter on pve1 as well (manual install like node-exporter)?
## Notes
- No fallback NTP servers needed on VMs — if pve1 is down, all VMs are down too
- The `host_reboot` alert rule (`changes(node_boot_time_seconds[10m]) > 0`) should stop false-firing once clock corrections are slewed instead of stepped
- pn01/pn02 are bare metal but still benefit from syncing to pve1 for consistency
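
To make the false-alert mechanism concrete, here is a small Python sketch of how `changes(...) > 0` reacts to a stepped clock. It assumes the reported boot time shifts when the wall clock is stepped; the sample values are invented, and `changes` is a simplified stand-in for the PromQL function of the same name.

```python
def changes(samples: list) -> int:
    """Count how often a value differs from the previous sample in the window."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def host_reboot_fires(boot_time_samples: list) -> bool:
    """Mirror the alert expression changes(node_boot_time_seconds[10m]) > 0."""
    return changes(boot_time_samples) > 0


# Invented samples: a steady boot time, and one shifted by a clock step.
stable = [1_700_000_000.0] * 5
stepped = [1_700_000_000.0, 1_700_000_000.0, 1_700_000_001.0, 1_700_000_001.0]
```

With slewing, the boot time series would look like `stable` and the rule stays quiet; a single step produces one change, which is enough to fire it.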


@@ -8,16 +8,16 @@ The current Loki deployment on monitoring01 is functional but minimal. It lacks
**Loki** on monitoring01 (`services/monitoring/loki.nix`):
- Single-node deployment, no HA
- Filesystem storage at `/var/lib/loki/chunks`
- Filesystem storage at `/var/lib/loki/chunks` (~6.8 GB as of 2026-02-13)
- TSDB index (v13 schema, 24h period)
- No retention policy configured (logs grow indefinitely)
- No `limits_config` (no rate limiting, stream limits, or query guards)
- 30-day compactor-based retention with basic rate limits
- No caching layer
- Auth disabled (trusted network)
**Promtail** on all 16 hosts (`system/monitoring/logs.nix`):
- Ships systemd journal (JSON) + `/var/log/**/*.log`
- Labels: `host`, `job` (systemd-journal/varlog), `systemd_unit`
- Labels: `hostname`, `tier`, `role`, `level`, `job` (systemd-journal/varlog), `systemd_unit`
- `level` label mapped from journal PRIORITY (critical/error/warning/notice/info/debug)
- Hardcoded to `http://monitoring01.home.2rjus.net:3100`
**Additional log sources:**
@@ -30,16 +30,7 @@ The current Loki deployment on monitoring01 is functional but minimal. It lacks
### 1. Retention Policy
**Problem:** No retention configured. Logs accumulate until disk fills up.
**Options:**
| Approach | Config Location | How It Works |
|----------|----------------|--------------|
| **Compactor retention** | `compactor` + `limits_config` | Compactor runs periodic retention sweeps, deleting chunks older than threshold |
| **Table manager** | `table_manager` | Legacy approach, not recommended for TSDB |
**Recommendation:** Use compactor-based retention (the modern approach for TSDB/filesystem):
**Implemented.** Compactor-based retention with 30-day period. Note: Loki 3.6.3 requires `delete_request_store = "filesystem"` when retention is enabled (not documented in older guides).
```nix
compactor = {
@@ -48,24 +39,21 @@ compactor = {
retention_enabled = true;
retention_delete_delay = "2h";
retention_delete_worker_count = 150;
delete_request_store = "filesystem";
};
limits_config = {
retention_period = "30d"; # Default retention for all tenants
retention_period = "30d";
};
```
30 days aligns with the Prometheus retention and is reasonable for a homelab. Older logs are rarely useful, and anything important can be found in journal archives on the hosts themselves.
### 2. Storage Backend
**Decision:** Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.
### 3. Limits Configuration
**Problem:** No rate limiting or stream cardinality protection. A misbehaving service could generate excessive logs and overwhelm Loki.
**Recommendation:** Add basic guardrails:
**Implemented.** Basic guardrails added alongside retention in `limits_config`:
```nix
limits_config = {
@@ -78,8 +66,6 @@ limits_config = {
};
```
These are generous limits that shouldn't affect normal operation but protect against runaway log generators.
### 4. Promtail Label Improvements
**Problem:** Label inconsistencies and missing useful metadata:
@@ -101,11 +87,7 @@ This enables queries like:
### 5. Journal Priority → Level Label
**Problem:** Loki 3.6.3 auto-detects a `detected_level` label by parsing log message text for keywords like "INFO", "ERROR", etc. This works for applications that embed level strings in messages (Go apps, Loki itself), but **fails for traditional Unix services** that use the journal `PRIORITY` field without level text in the message.
Example: NSD logs `"signal received, shutting down..."` with `PRIORITY="4"` (warning), but Loki sets `detected_level="unknown"` because the message has no level keyword. Querying `{detected_level="warn"}` misses these entirely.
**Recommendation:** Add a Promtail pipeline stage to the journal scrape config that maps the `PRIORITY` field to a `level` label:
**Implemented.** Promtail pipeline stages map journal `PRIORITY` to a `level` label:
| PRIORITY | level |
|----------|-------|
@@ -116,11 +98,9 @@ Example: NSD logs `"signal received, shutting down..."` with `PRIORITY="4"` (war
| 6 | info |
| 7 | debug |
This can be done with a `json` stage to extract PRIORITY, then a `template` + `labels` stage to map and attach it. The journal `PRIORITY` field is always present, so this gives reliable level filtering for all journal logs.
Uses a `json` stage to extract PRIORITY, `template` to map to level name, and `labels` to attach it. This gives reliable level filtering for all journal logs, unlike Loki's `detected_level` which only works for apps that embed level keywords in message text.
**Cardinality impact:** Moderate. Adds up to ~6 label values per host+unit combination. In practice most services log at 1-2 levels, so the stream count increase is manageable for 16 hosts. The filtering benefit (e.g., `{level="error"}` to find all errors across the fleet) outweighs the cost.
This enables queries like:
Example queries:
- `{level="error"}` - all errors across the fleet
- `{level=~"critical|error", tier="prod"}` - prod errors and criticals
- `{level="warning", role="dns"}` - warnings from DNS servers
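
The mapping logic can be sketched in Python as below. This is not the Promtail pipeline itself, only the same lookup expressed as code. The visible part of the table confirms 6 → info and 7 → debug; mapping 0-2 to `critical` is an assumption, since those rows are not shown here.

```python
import json
from typing import Optional

# Assumption: priorities 0-2 all collapse into "critical".
LEVEL_BY_PRIORITY = {
    "0": "critical",
    "1": "critical",
    "2": "critical",
    "3": "error",
    "4": "warning",
    "5": "notice",
    "6": "info",
    "7": "debug",
}


def level_for_entry(raw_line: str) -> Optional[str]:
    """Extract PRIORITY from a JSON journal entry and map it to a level name."""
    entry = json.loads(raw_line)
    return LEVEL_BY_PRIORITY.get(str(entry.get("PRIORITY")))


nsd_entry = '{"PRIORITY": "4", "MESSAGE": "signal received, shutting down..."}'
nsd_level = level_for_entry(nsd_entry)
```

Entries without a recognised `PRIORITY` return `None` in this sketch, so no `level` label would be attached.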
@@ -129,19 +109,39 @@ This enables queries like:
**Problem:** Many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki - `| json` cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching.
**Recommendation:** Audit all configured services and enable JSON logging where supported. Candidates to check include:
- Caddy (already JSON by default)
- Prometheus / Alertmanager / Loki / Tempo
- Grafana
- NSD / Unbound
- Home Assistant
- NATS
- Jellyfin
- OpenBao (Vault)
- Kanidm
- Garage
**Audit results (2026-02-13):**
For each service, check whether it supports a JSON log format option and whether enabling it would break anything (e.g., log volume increase from verbose JSON, or dashboards that parse text format).
**Already logging JSON:**
- Caddy (all instances) - JSON by default for access logs
- homelab-deploy (listener/builder) - Go app, logs structured JSON
**Supports JSON, not configured (high value):**
| Service | How to enable | Config file |
|---------|--------------|-------------|
| Prometheus | `--log.format=json` | `services/monitoring/prometheus.nix` |
| Alertmanager | `--log.format=json` | `services/monitoring/prometheus.nix` |
| Loki | `--log.format=json` | `services/monitoring/loki.nix` |
| Grafana | `log.console.format = "json"` | `services/monitoring/grafana.nix` |
| Tempo | `log_format: json` in config | `services/monitoring/tempo.nix` |
| OpenBao | `log_format = "json"` | `services/vault/default.nix` |
**Supports JSON, not configured (lower value - minimal log output):**
| Service | How to enable |
|---------|--------------|
| Pyroscope | `--log.format=json` (OCI container) |
| Blackbox Exporter | `--log.format=json` |
| Node Exporter | `--log.format=json` (all 16 hosts) |
| Systemd Exporter | `--log.format=json` (all 16 hosts) |
**No JSON support (syslog/text only):**
- NSD, Unbound, OpenSSH, Mosquitto
**Needs verification:**
- Kanidm, Jellyfin, Home Assistant, Harmonia, Zigbee2MQTT, NATS
**Recommendation:** Start with the monitoring stack (Prometheus, Alertmanager, Loki, Grafana, Tempo) since they're all Go apps with the same `--log.format=json` flag. Then OpenBao. The exporters are lower priority since they produce minimal log output.
### 7. Monitoring CNAME for Promtail Target
@@ -151,40 +151,46 @@ For each service, check whether it supports a JSON log format option and whether
## Priority Ranking
| # | Improvement | Effort | Impact | Recommendation |
|---|-------------|--------|--------|----------------|
| 1 | **Retention policy** | Low | High | Do first - prevents disk exhaustion |
| 2 | **Limits config** | Low | Medium | Do with retention - minimal additional effort |
| 3 | **Promtail label fix** | Trivial | Low | Quick fix, do with other label changes |
| 4 | **Journal priority → level** | Low-medium | Medium | Reliable level filtering across the fleet |
| 5 | **JSON logging audit** | Low-medium | Medium | Audit services, enable JSON where supported |
| # | Improvement | Effort | Impact | Status |
|---|-------------|--------|--------|--------|
| 1 | **Retention policy** | Low | High | Done (30d compactor retention) |
| 2 | **Limits config** | Low | Medium | Done (rate limits + stream guards) |
| 3 | **Promtail labels** | Trivial | Low | Done (hostname/tier/role/level) |
| 4 | **Journal priority → level** | Low-medium | Medium | Done (pipeline stages) |
| 5 | **JSON logging audit** | Low-medium | Medium | Audited, not yet enabled |
| 6 | **Monitoring CNAME** | Low | Medium | Part of monitoring02 migration |
## Implementation Steps
### Phase 1: Retention + Limits (quick win)
### Phase 1: Retention + Labels (done 2026-02-13)
1. Add `compactor` section to `services/monitoring/loki.nix`
2. Add `limits_config` with 30-day retention and basic rate limits
3. Update `system/monitoring/logs.nix`:
- ~~Fix `hostname` → `host` label in varlog scrape config~~ Done: standardized on `hostname` (matching Prometheus)
- ~~Add `tier` static label from `config.homelab.host.tier` to both scrape configs~~ Done
- ~~Add `role` static label from `config.homelab.host.role` (conditionally, only when set) to both scrape configs~~ Done
- ~~Add pipeline stages to journal scrape config: `json` to extract PRIORITY, `template` to map to level name, `labels` to attach as `level`~~ Done
4. Deploy to monitoring01, verify compactor runs and old data gets cleaned
5. Verify `level` label works: `{level="error"}` should return results, and match cases where `detected_level="unknown"`
1. ~~Add `compactor` section to `services/monitoring/loki.nix`~~ Done
2. ~~Add `limits_config` with 30-day retention and basic rate limits~~ Done
3. ~~Update `system/monitoring/logs.nix`~~ Done:
- Standardized on `hostname` label (matching Prometheus) for both scrape configs
- Added `tier` and `role` static labels from `homelab.host` options
- Added pipeline stages for journal PRIORITY → `level` label mapping
4. ~~Update `pipe-to-loki` and bootstrap scripts to use `hostname`~~ Done
5. ~~Deploy and verify labels~~ Done - all 15 hosts reporting with correct labels
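The journal PRIORITY → `level` mapping from step 3 could look roughly like the sketch below. It is not copied from `system/monitoring/logs.nix`: the job name, the `job` label value, and the exact priority-to-name mapping are assumptions, and it relies on `json = true` so that `PRIORITY` reaches the `json` stage as a string.

```nix
{ config, ... }:
{
  services.promtail.configuration.scrape_configs = [
    {
      job_name = "journal"; # assumed job name
      journal = {
        json = true; # emit entries as JSON so PRIORITY can be extracted
        labels = {
          job = "systemd-journal"; # assumed label value
          hostname = config.networking.hostName;
        };
      };
      pipeline_stages = [
        # Pull the syslog priority ("0".."7") out of the JSON journal entry.
        { json.expressions.priority = "PRIORITY"; }
        # Map the priority to a level name; 0-2 (and anything unexpected)
        # fall through to "critical".
        {
          template = {
            source = "level";
            template = ''{{ if eq .priority "7" }}debug{{ else if eq .priority "6" }}info{{ else if eq .priority "5" }}notice{{ else if eq .priority "4" }}warning{{ else if eq .priority "3" }}error{{ else }}critical{{ end }}'';
          };
        }
        # Attach the extracted value as the `level` label.
        { labels.level = "level"; }
      ];
    }
  ];
}
```

Because only a handful of level names are possible, the extra label adds a bounded number of streams per host.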
### Phase 2 (future): S3 Storage Migration
### Phase 2: JSON Logging (not started)
Enable JSON logging on services that support it, starting with the monitoring stack:
1. Prometheus, Alertmanager, Loki, Tempo (`--log.format=json`); Grafana via its `[log.console]` `format = json` setting
2. OpenBao (`log_format = "json"`)
3. Lower priority: exporters (node-exporter, systemd-exporter, blackbox)
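A minimal sketch of what enabling JSON logging for part of the monitoring stack might look like in NixOS. It assumes the stock nixpkgs modules for these services are in use. Tempo and OpenBao are left out because their module wiring in this repo is not shown here.

```nix
{
  # Prometheus and Alertmanager share the same flag.
  services.prometheus.extraFlags = [ "--log.format=json" ];
  services.prometheus.alertmanager.extraFlags = [ "--log.format=json" ];

  # Loki accepts the equivalent flag on its command line.
  services.loki.extraFlags = [ "-log.format=json" ];

  # Grafana is configured through grafana.ini rather than a CLI flag.
  services.grafana.settings."log.console".format = "json";
}
```

Once a service logs JSON, a Promtail `json` stage can extract fields from it instead of relying only on the journal priority.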
### Phase 3 (future): S3 Storage Migration
Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with `object_store = "s3"` - the old filesystem period will continue serving historical data until it ages out past retention.
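As a sketch of the "add a new schema period" approach, the list below keeps the existing filesystem period and appends an S3 period. The dates, the `tsdb`/`v13` values, and the index settings are placeholders, not the values in `services/monitoring/loki.nix`; the real change would append one entry to the existing list and add matching S3 credentials under `storage_config`.

```nix
{
  services.loki.configuration.schema_config.configs = [
    # Existing period: left untouched so historical chunks stay readable.
    {
      from = "2024-01-01"; # placeholder for the current period's start date
      store = "tsdb";
      object_store = "filesystem";
      schema = "v13";
      index = {
        prefix = "index_";
        period = "24h";
      };
    }
    # New period: data written from this date onward goes to S3.
    {
      from = "2027-01-01"; # placeholder; must be in the future at deploy time
      store = "tsdb";
      object_store = "s3";
      schema = "v13";
      index = {
        prefix = "index_";
        period = "24h";
      };
    }
  ];
}
```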
## Open Questions
- [ ] What retention period makes sense? 30 days suggested, but could be 14d or 60d depending on disk/storage budget
- [ ] Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?
## Notes
- Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
- The compactor is already part of single-process Loki in recent versions - it just needs to be configured.
- Loki 3.6.3 requires `delete_request_store = "filesystem"` in the compactor config when retention is enabled.
- S3 storage deferred until post-NAS migration when a proper solution is available.
- As of 2026-02-13, Loki uses ~6.8 GB for ~30 days of logs from 16 hosts. Prometheus uses ~7.6 GB on the same disk (33 GB total, ~8 GB free).
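
For reference, a compactor and limits configuration along the lines described in these notes might look like the following sketch. The 30-day retention and `delete_request_store` come from this document; the working directory, intervals, and rate-limit numbers are illustrative assumptions.

```nix
{
  services.loki.configuration = {
    compactor = {
      working_directory = "/var/lib/loki/compactor"; # assumed path
      retention_enabled = true;
      # Required when retention is enabled (see the note above).
      delete_request_store = "filesystem";
      compaction_interval = "10m"; # illustrative
      retention_delete_delay = "2h"; # illustrative
    };
    limits_config = {
      retention_period = "30d";
      # Illustrative rate limits and stream guard, not the deployed values.
      ingestion_rate_mb = 8;
      ingestion_burst_size_mb = 16;
      max_global_streams_per_user = 5000;
    };
  };
}
```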


@@ -0,0 +1,244 @@
# Media PC Replacement
## Overview
Replace the aging Linux+Kodi media PC connected to the TV with a modern, compact solution. Primary use cases are Jellyfin/Kodi playback and watching Twitch/YouTube. The current machine (`media`, 10.69.31.50) is on VLAN 31.
## Current State
### Hardware
- **CPU**: Intel Core i7-4770K @ 3.50GHz (Haswell, 4C/8T, 2013)
- **GPU**: Nvidia GeForce GT 710 (Kepler, GK208B)
- **OS**: Ubuntu 22.04.5 LTS (Jammy)
- **Software**: Kodi
- **Network**: `media.home.2rjus.net` at `10.69.31.50` (VLAN 31)
### Control & Display
- **Input**: Wireless keyboard (works well, useful for browser)
- **TV**: 1080p (no 4K/HDR currently, but may upgrade TV later)
- **Audio**: Surround system connected via HDMI ARC from TV (PC → HDMI → TV → ARC → surround)
### Notes on Current Hardware
- The i7-4770K is massively overpowered for media playback — it's a full desktop CPU from 2013
- The GT 710 is a low-end passive GPU; its Kepler-era video engine hardware-decodes H.264 but not H.265/HEVC, and output is limited to 4K@30Hz over HDMI 1.4
- Ubuntu 22.04 is approaching EOL (April 2027) and is not managed by this repo
- The whole system is likely in a full-size or mid-tower case — not ideal for a TV setup
### Integration
- **Media source**: Jellyfin on `jelly01` (10.69.13.14) serves media from NAS via NFS
- **DNS**: A record in `services/ns/external-hosts.nix`
- **Not managed**: Not a NixOS host in this repo, no monitoring/auto-updates
## Options
### Option 1: Dedicated Streaming Device (Apple TV / Nvidia Shield)
| Aspect | Apple TV 4K | Nvidia Shield Pro |
|--------|-------------|-------------------|
| **Price** | ~$130-180 | ~$200 |
| **Jellyfin** | Swiftfin app (good) | Jellyfin Android TV (good) |
| **Kodi** | Not available (tvOS) | Full Kodi support |
| **Twitch** | Native app | Native app |
| **YouTube** | Native app | Native app |
| **HDR/DV** | Dolby Vision + HDR10 | Dolby Vision + HDR10 |
| **4K** | Yes | Yes |
| **Form factor** | Tiny, silent | Small, silent |
| **Remote** | Excellent Siri remote | Decent, supports CEC |
| **Homelab integration** | None | Minimal (Plex/Kodi only) |
**Pros:**
- Zero maintenance - appliance experience
- Excellent app ecosystem (native Twitch, YouTube, streaming services)
- Silent, tiny form factor
- Great remote control / CEC support
- Hardware-accelerated codec support out of the box
**Cons:**
- No NixOS management, monitoring, or auto-updates
- Can't run arbitrary software
- Jellyfin clients are decent but not as mature as Kodi
- Vendor lock-in (Apple ecosystem / Google ecosystem)
- No SSH access for troubleshooting
### Option 2: NixOS Mini PC (Kodi Appliance)
A small form factor PC (Intel NUC, Beelink, MinisForum, etc.) running NixOS with Kodi as the desktop environment.
**NixOS has built-in support:**
- `services.xserver.desktopManager.kodi.enable` - boots directly into Kodi
- `kodi-gbm` package - Kodi with direct DRM/KMS rendering (no X11/Wayland needed)
- `kodiPackages.jellycon` - Jellyfin integration for Kodi
- `kodiPackages.sendtokodi` - plays streams via yt-dlp (Twitch, YouTube)
- `kodiPackages.inputstream-adaptive` - adaptive streaming support
**Example NixOS config sketch:**
```nix
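{ pkgs, ... }:
{
  # The Kodi desktop session is an X11 session, so the X server must be enabled
  services.xserver.enable = true;
  services.xserver.desktopManager.kodi = {
    enable = true;
    package = pkgs.kodi.withPackages (p: [
      p.jellycon
      p.sendtokodi
      p.inputstream-adaptive
    ]);
  };

  # Auto-login to Kodi session
  services.displayManager.autoLogin = {
    enable = true;
    user = "kodi";
  };

  # The auto-login user has to exist
  users.users.kodi.isNormalUser = true;
}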
```
**Pros:**
- Full NixOS management (monitoring, auto-updates, vault, promtail)
- Kodi is a proven TV interface with excellent remote/CEC support
- JellyCon integrates Jellyfin library directly into Kodi
- Twitch/YouTube via sendtokodi + yt-dlp or Kodi browser addons
- Can run arbitrary services (e.g., Home Assistant dashboard)
- Declarative, reproducible config in this repo
**Cons:**
- More maintenance than an appliance
- NixOS + Kodi on bare metal needs GPU driver setup (Intel iGPU is usually fine)
- Kodi YouTube/Twitch addons are less polished than native apps
- Need to buy hardware (~$150-400 for a decent mini PC)
- Power consumption higher than a streaming device
### Option 3: NixOS Mini PC (Wayland Desktop)
A mini PC running NixOS with a lightweight Wayland compositor, launching Kodi for media and a browser for Twitch/YouTube.
**Pros:**
- Best of both worlds: Kodi for media, Firefox/Chromium for Twitch/YouTube
- Full NixOS management
- Can switch between Kodi and browser easily
- Native web experience for streaming sites
**Cons:**
- More complex setup (compositor + Kodi + browser)
- Harder to get a good "10-foot UI" experience
- Keyboard/mouse may be needed alongside remote
- Significantly more maintenance
## Comparison
| Criteria | Dedicated Device | NixOS Kodi | NixOS Desktop |
|----------|-----------------|------------|---------------|
| **Maintenance** | None | Low | Medium |
| **Media experience** | Excellent | Excellent | Good |
| **Twitch/YouTube** | Excellent (native apps) | Good (addons/yt-dlp) | Excellent (browser) |
| **Homelab integration** | None | Full | Full |
| **Form factor** | Tiny | Small | Small |
| **Cost** | $130-200 | $150-400 | $150-400 |
| **Silent operation** | Yes | Likely (fanless options) | Likely |
| **CEC remote** | Yes | Yes (Kodi) | Partial |
## Decision: NixOS Mini PC with Kodi (Option 2)
**Rationale:**
- Already comfortable with Kodi + wireless keyboard workflow
- Browser access for Twitch/YouTube is important — Kodi can launch a browser when needed
- Homelab integration comes for free (monitoring, auto-updates, vault)
- Natural fit alongside the other 16 NixOS hosts in this repo
- Dedicated devices lose the browser/keyboard workflow
### Display Server: Sway/Hyprland
Options evaluated:
| Approach | Pros | Cons |
|----------|------|------|
| Cage (kiosk) | Simplest, single-app | No browser without TTY switching |
| kodi-gbm (no compositor) | Best HDR support | No browser at all, ALSA-only audio |
| **Sway/Hyprland** | **Workspace switching, VA-API in browser** | **Slightly more config** |
| Full DE (GNOME/KDE) | Everything works | Overkill, heavy |
**Decision: Sway or Hyprland** (Hyprland preferred — same as desktop)
- Kodi fullscreen on workspace 1, Firefox on workspace 2
- Switch via keybinding on wireless keyboard
- Auto-start both on login via greetd
- Minimal config — no bar, no decorations, just workspaces
- VA-API hardware decode works in Firefox on Wayland (important for YouTube/Twitch)
- Can revisit kodi-gbm later if HDR becomes a priority (just a config change)
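A rough sketch of the two-workspace setup, written for Sway because its config syntax is stable; the same idea should carry over to Hyprland. Everything here is an assumption rather than the final host config: the `kodi` user (which must be defined elsewhere), the config path, the `Mod4` keybindings, and the Firefox `app_id`, which can differ between builds.

```nix
{ pkgs, ... }:
{
  programs.sway.enable = true;

  # Minimal compositor config: no bar, no borders, Kodi on 1, Firefox on 2.
  environment.etc."sway/media.conf".text = ''
    default_border none
    assign [app_id="firefox"] workspace number 2
    workspace number 1
    exec ${pkgs.kodi-wayland}/bin/kodi
    exec ${pkgs.firefox}/bin/firefox
    bindsym Mod4+1 workspace number 1
    bindsym Mod4+2 workspace number 2
  '';

  # greetd starts the compositor directly as the media user (no greeter).
  services.greetd = {
    enable = true;
    settings.default_session = {
      command = "${pkgs.sway}/bin/sway --config /etc/sway/media.conf";
      user = "kodi";
    };
  };
}
```

Using a dedicated config file instead of the default one avoids inheriting the default bar and keybindings.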
### Twitch/YouTube
Firefox on workspace 2, switched to via keyboard. Kodi addons (sendtokodi, YouTube plugin) available as secondary options but a real browser is the primary approach.
### Media Playback: Kodi + JellyCon + NFS Direct Path
Three options were evaluated for media playback:
| Approach | Transcoding | Library management | Watch state sync |
|----------|-------------|-------------------|-----------------|
| Jellyfin only (browser) | Yes — browsers lack codec support for DTS, PGS subs, etc. | Jellyfin | Jellyfin |
| Kodi + NFS only | No — Kodi plays everything natively | Kodi local DB | None |
| **Kodi + JellyCon + NFS** | **No — Kodi's native player, direct path via NFS** | **Jellyfin** | **Jellyfin** |
**Decision: Kodi + JellyCon with NFS direct path**
- JellyCon presents the Jellyfin library inside Kodi's UI (browse, search, metadata, artwork)
- Playback uses Kodi's native player — direct play, no transcoding, full codec support including surround passthrough
- JellyCon's "direct path" mode maps Jellyfin paths to local NFS mounts, so playback goes straight over NFS without streaming through Jellyfin's HTTP layer
- Watch state, resume position, etc. sync back to Jellyfin — accessible from other devices too
- NFS mount follows the same pattern as jelly01 (`nas.home.2rjus.net:/mnt/hdd-pool/media`)
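A minimal sketch of the NFS mount, assuming a local mount point of `/mnt/media` (hypothetical) and read-only access, which is enough for playback. The mount options are a common automount pattern, not necessarily what jelly01 uses.

```nix
{
  fileSystems."/mnt/media" = {
    device = "nas.home.2rjus.net:/mnt/hdd-pool/media";
    fsType = "nfs";
    options = [
      "ro" # playback only needs read access
      "noauto" # do not block boot if the NAS is unreachable
      "x-systemd.automount" # mount on first access
      "x-systemd.idle-timeout=600" # unmount after 10 minutes idle
    ];
  };
}
```

JellyCon's path substitution would then map the media path as Jellyfin sees it to `/mnt/media` on the media PC.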
### Audio Passthrough
Kodi on NixOS supports HDMI audio passthrough for surround formats (AC3, DTS, etc.). The ARC chain (PC → HDMI → TV → ARC → surround) works transparently — Kodi just needs to be configured for passthrough rather than decoding audio locally.
## Hardware
### Leading Candidate: GMKtec G3
- **CPU**: Intel N100 (Alder Lake-N, 4C/4T)
- **RAM**: 16GB
- **Storage**: 512GB NVMe
- **Price**: ~NOK 2800 (~$250 USD)
- **Source**: AliExpress
The N100 supports hardware decode for all relevant 4K codecs:
| Codec | Support | Used by |
|-------|---------|---------|
| H.264/AVC | Yes (Quick Sync) | Older media |
| H.265/HEVC 10-bit | Yes (Quick Sync) | Most 4K media, HDR |
| VP9 | Yes (Quick Sync) | YouTube 4K |
| AV1 | Yes (Quick Sync) | YouTube, Twitch, newer encodes |
16GB RAM is comfortable for Kodi + browser + NixOS system services (node-exporter, promtail, etc.) with plenty of headroom.
### Key Requirements
- HDMI 2.0+ for 4K future-proofing (current TV is 1080p)
- Hardware video decode via VA-API / Intel Quick Sync
- HDR support (for future TV upgrade)
- Fanless or near-silent operation
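The VA-API requirement could be covered on an Intel N100 with something like the sketch below. It assumes a nixpkgs release that has `hardware.graphics` (older releases call it `hardware.opengl`) and that the iHD media driver is the right one for this iGPU generation.

```nix
{ pkgs, ... }:
{
  hardware.graphics = {
    enable = true;
    # VA-API driver for recent Intel iGPUs.
    extraPackages = [ pkgs.intel-media-driver ];
  };

  # Select the iHD driver explicitly.
  environment.sessionVariables.LIBVA_DRIVER_NAME = "iHD";

  # Provides `vainfo` for checking which codecs are exposed.
  environment.systemPackages = [ pkgs.libva-utils ];
}
```

Running `vainfo` on the installed system lists the decode profiles the driver exposes, which is a quick way to check the codec table above against real hardware.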
## Implementation Steps
1. **Choose and order hardware**
2. **Create host configuration** (`hosts/media1/`)
- Kodi desktop manager with Jellyfin + streaming addons
- Intel/AMD iGPU driver and VA-API hardware decode
- HDMI audio passthrough for surround
- NFS mount for media (same pattern as jelly01)
- Browser package (Firefox/Chromium) for Twitch/YouTube fallback
- Standard system modules (monitoring, promtail, vault, auto-upgrade)
3. **Install NixOS** on the mini PC
4. **Configure Kodi** (Jellyfin server, addons, audio passthrough)
5. **Update DNS** - point `media.home.2rjus.net` to new IP (or keep on VLAN 31)
6. **Retire old media PC**
## Open Questions
- [x] What are the current media PC specs? — i7-4770K, GT 710, Ubuntu 22.04. Overkill CPU, weak GPU, large form factor. Not worth reusing if goal is compact/silent.
- [x] VLAN? — Keep on VLAN 31 for now, same as current media PC. Can revisit later.
- [x] Is CEC needed? — No, not using it currently. Can add later if desired.
- [x] Is 4K HDR output needed? — TV is 1080p now, but want 4K/HDR capability for future TV upgrade
- [x] Audio setup? — Surround system via HDMI ARC from TV. Media PC outputs HDMI to TV, TV passes audio to surround via ARC. Kodi/any player just needs HDMI audio output with surround passthrough.
- [x] Are there streaming service apps needed? — No. Only Twitch/YouTube, which work fine in any browser.
- [x] Budget? — ~NOK 2800 for GMKtec G3 (N100, 16GB, 512GB NVMe)


@@ -1,241 +0,0 @@
# Monitoring Stack Migration to VictoriaMetrics
## Overview
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for seamless transition.
## Current State
**monitoring01** (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)
**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
**Auto-generated:**
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)
## Decision: VictoriaMetrics
Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
## Architecture
```
                ┌─────────────────┐
                │  monitoring02   │
                │ VictoriaMetrics │
                │ + Grafana       │
monitoring      │ + Loki          │
CNAME ──────────│ + Tempo         │
                │ + Pyroscope     │
                │ + Alertmanager  │
                │   (vmalert)     │
                └─────────────────┘
                         │ scrapes
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────┴────┐    ┌─────┴────┐    ┌─────┴────┐
    │   ns1   │    │   ha1    │    │   ...    │
    │  :9100  │    │  :9100   │    │  :9100   │
    └─────────┘    └──────────┘    └──────────┘
```
## Implementation Plan
### Phase 1: Create monitoring02 Host
Use `create-host` script which handles flake.nix and terraform/vms.tf automatically.
1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
2. **Update VM resources** in `terraform/vms.tf`:
- 4 cores (same as monitoring01)
- 8GB RAM (double, for VictoriaMetrics headroom)
- 100GB disk (for 3+ months retention with compression)
3. **Update host configuration**: Import monitoring services
4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
### Phase 2: Set Up VictoriaMetrics Stack
Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
Prometheus config. Once validated, this can replace the Prometheus module.
1. **VictoriaMetrics** (port 8428):
- `services.victoriametrics.enable = true`
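- `services.victoriametrics.retentionPeriod = "3"` (3 months; an unsuffixed value is counted in months, while an `m` suffix is rejected as ambiguous with minutes. Increase later based on disk usage)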
- Migrate scrape configs via `prometheusConfig`
- Use native push support (replaces Pushgateway)
2. **vmalert** for alerting rules:
- `services.vmalert.enable = true`
- Point to VictoriaMetrics for metrics evaluation
- Keep rules in separate `rules.yml` file (same format as Prometheus)
- No receiver configured during parallel operation (prevents duplicate alerts)
3. **Alertmanager** (port 9093):
- Keep existing configuration (alerttonotify webhook routing)
- Only enable receiver after cutover from monitoring01
4. **Loki** (port 3100):
- Same configuration as current
5. **Grafana** (port 3000):
- Define dashboards declaratively via NixOS options (not imported from monitoring01)
- Reference existing dashboards on monitoring01 for content inspiration
- Configure VictoriaMetrics datasource (port 8428)
- Configure Loki datasource
6. **Tempo** (ports 3200, 3201):
- Same configuration
7. **Pyroscope** (port 4040):
- Same Docker-based deployment
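For the Grafana datasources in step 5, declarative provisioning might look like this sketch. The datasource names are illustrative; the ports are the ones listed above, and VictoriaMetrics is registered as a `prometheus` type datasource because it serves the Prometheus query API.

```nix
{
  services.grafana.provision = {
    enable = true;
    datasources.settings = {
      apiVersion = 1;
      datasources = [
        {
          name = "VictoriaMetrics"; # illustrative name
          type = "prometheus";
          url = "http://localhost:8428";
          isDefault = true;
        }
        {
          name = "Loki"; # illustrative name
          type = "loki";
          url = "http://localhost:3100";
        }
      ];
    };
  };
}
```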
### Phase 3: Parallel Operation
Run both monitoring01 and monitoring02 simultaneously:
1. **Dual scraping**: Both hosts scrape the same targets
- Validates VictoriaMetrics is collecting data correctly
2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
- Add second client in `system/monitoring/logs.nix` pointing to monitoring02
3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
5. **Compare resource usage**: Monitor disk/memory consumption between hosts
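The dual log shipping in step 2 amounts to a second entry in Promtail's `clients` list. A sketch, assuming monitoring02 resolves as `monitoring02.home.2rjus.net` and exposes Loki's standard push endpoint on port 3100:

```nix
{
  services.promtail.configuration.clients = [
    # Existing target.
    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
    # Added for the parallel-operation phase; removed again after cutover.
    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```

Promtail sends each batch to every configured client independently, so an outage on one Loki instance should not stop delivery to the other.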
### Phase 4: Add monitoring CNAME
Add CNAME to monitoring02 once validated:
```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```
This creates `monitoring.home.2rjus.net` pointing to monitoring02.
### Phase 5: Update References
Update hardcoded references to use the CNAME:
1. **system/monitoring/logs.nix**:
- Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
- prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
- alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
- grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
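The backend changes in `services/http-proxy/proxy.nix` could be expressed along these lines. This is a sketch assuming the proxy is built on the nixpkgs `services.caddy.virtualHosts` option; the real module may structure its hosts differently and add TLS or auth directives.

```nix
{
  services.caddy.virtualHosts = {
    "prometheus.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:8428
    '';
    "alertmanager.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:9093
    '';
    "grafana.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:3000
    '';
    "pyroscope.home.2rjus.net".extraConfig = ''
      reverse_proxy http://monitoring.home.2rjus.net:4040
    '';
  };
}
```

Pointing every backend at the CNAME means a later host move only needs a DNS change, not another proxy edit.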
### Phase 6: Enable Alerting
Once ready to cut over:
1. Enable Alertmanager receiver on monitoring02
2. Verify test alerts route correctly
### Phase 7: Cutover and Decommission
1. **Stop monitoring01**: Prevent duplicate alerts during transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
- Remove from flake.nix
- Remove host configuration
- Destroy VM in Proxmox
- Remove from terraform state
## Current Progress
### monitoring02 Host Created (2026-02-08)
Host deployed at 10.69.13.24 (test tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
### Grafana with Kanidm OIDC (2026-02-08)
Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):
- Kanidm OIDC authentication (PKCE enabled)
- Role mapping: `admins` → Admin, others → Viewer
- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
- Local Caddy for TLS termination via internal ACME CA
This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing
`services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/`
module once monitoring02 becomes the primary monitoring host.
## Open Questions
- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.
## VictoriaMetrics Service Configuration
Example NixOS configuration for monitoring02:
```nix
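# VictoriaMetrics replaces Prometheus
services.victoriametrics = {
  enable = true;
  # 3 months: an unsuffixed value is counted in months (an "m" suffix is
  # rejected as ambiguous with minutes). Increase based on disk usage.
  retentionPeriod = "3";
  prometheusConfig = {
    global.scrape_interval = "15s";
    scrape_configs = [
      # Auto-generated node-exporter targets
      # Service-specific scrape targets
      # External targets
    ];
  };
};

# vmalert for alerting rules (no receiver during parallel operation)
services.vmalert = {
  enable = true;
  # The nixpkgs module passes vmalert flags through `settings`
  settings = {
    "datasource.url" = "http://localhost:8428";
    # "notifier.url" = [ "http://localhost:9093" ]; # Enable after cutover
    # vmalert expects some notifier setting; "notifier.blackhole" = true is one
    # way to evaluate rules without sending anything until then
    rule = [ ./rules.yml ];
  };
};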
```
## Rollback Plan
If issues arise after cutover:
1. Move `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert Promtail config to point only to monitoring01
4. Revert http-proxy backends
## Notes
- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state

docs/plans/new-services.md Normal file

@@ -0,0 +1,145 @@
# New Service Candidates
Ideas for additional services to deploy in the homelab. These lean more enterprise/obscure
than the typical self-hosted fare.
## Litestream
Continuous SQLite replication to S3-compatible storage. Streams WAL changes in near-real-time,
providing point-in-time recovery without scheduled backup jobs.
**Why:** Several services use SQLite (Home Assistant, potentially others). Litestream would
give continuous backup to Garage S3 with minimal resource overhead and near-zero configuration.
Replaces cron-based backup scripts with a small daemon per database.
**Integration points:**
- Garage S3 as replication target (already deployed)
- Home Assistant SQLite database is the primary candidate
- Could also cover any future SQLite-backed services
**Complexity:** Low. Single Go binary, minimal config (source DB path + S3 endpoint).
**NixOS packaging:** Available in nixpkgs as `litestream`.
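
A minimal sketch of what this might look like with the nixpkgs `services.litestream` module. The database path assumes Home Assistant's default state directory on NixOS; the bucket name, endpoint, and secrets file are hypothetical placeholders. Garage may need extra S3 settings such as a region, and the `litestream` user needs read access to the database.

```nix
{
  services.litestream = {
    enable = true;
    # Hypothetical secrets file providing LITESTREAM_ACCESS_KEY_ID and
    # LITESTREAM_SECRET_ACCESS_KEY.
    environmentFile = "/run/secrets/litestream-env";
    settings.dbs = [
      {
        path = "/var/lib/hass/home-assistant_v2.db"; # assumed HA default location
        replicas = [
          {
            type = "s3";
            bucket = "litestream"; # hypothetical bucket
            path = "home-assistant";
            endpoint = "https://s3.example.internal"; # placeholder for the Garage endpoint
          }
        ];
      }
    ];
  };
}
```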
---
## ntopng
Deep network traffic analysis and flow monitoring. Provides real-time visibility into bandwidth
usage, protocol distribution, top talkers, and anomaly detection via a web UI.
**Why:** We have host-level metrics (node-exporter) and logs (Loki) but no network-level
visibility. ntopng would show traffic patterns across the infrastructure — NFS throughput to
the NAS, DNS query volume, inter-host traffic, and bandwidth anomalies. Useful for capacity
planning and debugging network issues.
**Integration points:**
- Could export metrics to Prometheus via its built-in exporter
- Web UI behind http-proxy with Kanidm OIDC (if supported) or Pomerium
- NetFlow/sFlow from managed switches (if available)
- Passive traffic capture on a mirror port or the monitoring host itself
**Complexity:** Medium. Needs network tap or mirror port for full visibility, or can run
in host-local mode. May need a dedicated interface or VLAN mirror.
**NixOS packaging:** Available in nixpkgs as `ntopng`.
---
## Renovate
Automated dependency update bot that understands Nix flakes natively. Creates branches/PRs
to bump flake inputs on a configurable schedule.
**Why:** Currently `nix flake update` is manual. Renovate can automatically propose updates
to individual flake inputs (nixpkgs, homelab-deploy, nixos-exporter, etc.), group related
updates, and respect schedules. More granular than updating everything at once — can bump
nixpkgs weekly but hold back other inputs, auto-merge patch-level changes, etc.
**Integration points:**
- Runs against git.t-juice.club repositories
- Understands `flake.lock` format natively
- Could target both `nixos-servers` and `nixos` repos
- Update branches would be validated by homelab-deploy builder
**Complexity:** Medium. Needs git forge integration (Gitea/Forgejo API). Self-hosted runner
mode available. Configuration via `renovate.json` in each repo.
**NixOS packaging:** Available in nixpkgs as `renovate`.
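
nixpkgs also ships a `services.renovate` module, so a self-hosted run might be sketched as below. The option names are written from memory of that module and should be checked against the pinned nixpkgs revision. The repository owner, token path, and weekly schedule are hypothetical, and the `gitea` platform is assumed to work against the Forgejo API.

```nix
{
  services.renovate = {
    enable = true;
    schedule = "weekly"; # systemd calendar expression
    # Hypothetical path to a file holding the forge API token.
    credentials.RENOVATE_TOKEN = "/run/secrets/renovate-token";
    settings = {
      platform = "gitea";
      endpoint = "https://git.t-juice.club";
      repositories = [ "example-owner/nixos-servers" ]; # owner is a placeholder
      # Let Renovate manage flake.lock inputs.
      nix.enabled = true;
    };
  };
}
```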
---
## Pomerium
Identity-aware reverse proxy implementing zero-trust access. Every request is authenticated
and authorized based on identity, device, and context — not just network location.
**Why:** Currently Caddy terminates TLS but doesn't enforce authentication on most services.
Pomerium would put Kanidm OIDC authentication in front of every internal service, with
per-route authorization policies (e.g., "only admins can access Prometheus," "require re-auth
for Vault UI"). Directly addresses the security hardening plan's goals.
**Integration points:**
- Kanidm as OIDC identity provider (already deployed)
- Could replace or sit in front of Caddy for internal services
- Per-route policies based on Kanidm groups (admins, users, ssh-users)
- Centralizes access logging and audit trail
**Complexity:** Medium-high. Needs careful integration with existing Caddy reverse proxy.
Decision needed on whether Pomerium replaces Caddy or works alongside it (Pomerium for
auth, Caddy for TLS termination and routing, or Pomerium handles everything).
**NixOS packaging:** Available in nixpkgs as `pomerium`.
---
## Apache Guacamole
Clientless remote desktop and SSH gateway. Provides browser-based access to hosts via
RDP, VNC, SSH, and Telnet with no client software required. Supports session recording
and playback.
**Why:** Provides an alternative remote access path that doesn't require VPN software or
SSH keys on the client device. Useful for accessing hosts from untrusted machines (phone,
borrowed laptop) or providing temporary access to others. Session recording gives an audit
trail. Could complement the WireGuard remote access plan rather than replace it.
**Integration points:**
- Kanidm for authentication (OIDC or LDAP)
- Behind http-proxy or Pomerium for TLS
- SSH access to all hosts in the fleet
- Session recordings could be stored on Garage S3
- Could serve as the "emergency access" path when VPN is unavailable
**Complexity:** Medium. Two components: guacd (a native C daemon) and a Java web app. Typically needs PostgreSQL for
connection/user storage (already available). Docker is the common deployment method but
native packaging exists.
**NixOS packaging:** Available in nixpkgs as `guacamole-server` and `guacamole-client`.
---
## CrowdSec
Collaborative intrusion prevention system with crowd-sourced threat intelligence.
Parses logs to detect attack patterns, applies remediation (firewall bans, CAPTCHA),
and shares/receives threat signals from a global community network.
**Why:** Goes beyond fail2ban with behavioral detection, crowd-sourced IP reputation,
and a scenario-based engine. Fits the security hardening plan. The community blocklist
means we benefit from threat intelligence gathered across thousands of deployments.
Could parse SSH logs, HTTP access logs, and other service logs to detect and block
malicious activity.
**Integration points:**
- Could consume logs from Loki or directly from journald/log files
- Firewall bouncer for iptables/nftables remediation
- Caddy bouncer for HTTP-level blocking
- Prometheus metrics exporter for alert integration
- Scenarios available for SSH brute force, HTTP scanning, and more
- Feeds into existing alerting pipeline (Alertmanager -> alerttonotify)
**Complexity:** Medium. Agent (log parser + decision engine) on each host or centralized.
Bouncers (enforcement) on edge hosts. Free community tier includes threat intel access.
**NixOS packaging:** Available in nixpkgs as `crowdsec`.


@@ -0,0 +1,232 @@
# NixOS Hypervisor
## Overview
Experiment with running a NixOS-based hypervisor as an alternative/complement to the current Proxmox setup. Goal is better homelab integration — declarative config, monitoring, auto-updates — while retaining the ability to run VMs with a Terraform-like workflow.
## Motivation
- Proxmox works but doesn't integrate with the NixOS-managed homelab (no monitoring, no auto-updates, no vault, no declarative config)
- The PN51 units (once stable) are good candidates for experimentation — test-tier, plenty of RAM (32-64GB), 8C/16T
- Long-term: could reduce reliance on Proxmox or provide a secondary hypervisor pool
- **VM migration**: Currently all VMs (including both nameservers) run on a single Proxmox host. Being able to migrate VMs between hypervisors would allow rebooting a host for kernel updates without downtime for critical services like DNS.
## Hardware Candidates
| | pn01 | pn02 |
|---|---|---|
| **CPU** | Ryzen 7 5700U (8C/16T) | Ryzen 7 5700U (8C/16T) |
| **RAM** | 64GB (2x32GB) | 32GB (1x32GB, second slot available) |
| **Storage** | 1TB NVMe | 1TB SATA SSD (NVMe planned) |
| **Status** | Stability testing | Stability testing |
## Options
### Option 1: Incus
Community fork of LXD, started after Canonical took sole control of the LXD project (LXD was later relicensed to AGPLv3 with a CLA). Supports both containers (LXC) and VMs (QEMU/KVM).
**NixOS integration:**
- `virtualisation.incus.enable` module in nixpkgs
- Manages storage pools, networks, and instances
- REST API for automation
- CLI tool (`incus`) for management
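
A single-node starting point might look like the sketch below, based on the `virtualisation.incus` module and its `preseed` option. The bridge name, subnet, and `dir` storage driver are illustrative assumptions, not decisions made in this plan.

```nix
{
  virtualisation.incus = {
    enable = true;
    # Declarative equivalent of `incus admin init --preseed`.
    preseed = {
      networks = [
        {
          name = "incusbr0";
          type = "bridge";
          config = {
            "ipv4.address" = "10.0.100.1/24"; # illustrative subnet
            "ipv4.nat" = "true";
          };
        }
      ];
      storage_pools = [
        {
          name = "default";
          driver = "dir"; # simplest driver; ZFS or btrfs are alternatives
        }
      ];
      profiles = [
        {
          name = "default";
          devices = {
            eth0 = {
              name = "eth0";
              network = "incusbr0";
              type = "nic";
            };
            root = {
              path = "/";
              pool = "default";
              type = "disk";
            };
          };
        }
      ];
    };
  };

  # Incus on NixOS expects the nftables firewall backend.
  networking.nftables.enable = true;
  networking.firewall.trustedInterfaces = [ "incusbr0" ];
}
```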
**Terraform integration:**
- Dedicated `incus` Terraform provider exists and is the better long-term fit
- The older `lxd` provider may still work for basic use, but the two APIs have been diverging since the fork
- Can define VMs/containers in OpenTofu, similar to current Proxmox workflow
**Migration:**
- Built-in live and offline migration via `incus move <instance> --target <host>`
- Clustering makes hosts aware of each other — migration is a first-class operation
- Shared storage (NFS, Ceph) or Incus can transfer storage during migration
- Stateful stop-and-move also supported for offline migration
**Pros:**
- Supports both containers and VMs
- REST API + CLI for automation
- Built-in clustering and migration — closest to Proxmox experience
- Good NixOS module support
- Image-based workflow (can build NixOS images and import)
- Active development and community
**Cons:**
- Another abstraction layer on top of QEMU/KVM
- Less mature Terraform provider than libvirt
- Container networking can be complex
- NixOS guests in Incus VMs need some setup
### Option 2: libvirt/QEMU
Standard Linux virtualization stack. Thin wrapper around QEMU/KVM.
**NixOS integration:**
- `virtualisation.libvirtd.enable` module in nixpkgs
- Mature and well-tested
- virsh CLI for management
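
For comparison, the libvirt host side is only a few lines. In this sketch the `admin` user is hypothetical, and the boot/shutdown behaviour is one reasonable choice rather than a requirement.

```nix
{
  virtualisation.libvirtd = {
    enable = true;
    qemu.runAsRoot = false; # run guests as an unprivileged user
    onBoot = "start"; # start guests marked autostart
    onShutdown = "shutdown"; # shut guests down cleanly with the host
  };

  # Hypothetical admin user allowed to manage VMs via virsh or Terraform.
  users.users.admin.extraGroups = [ "libvirtd" ];
}
```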
**Terraform integration:**
- `dmacvicar/libvirt` provider — mature, well-maintained
- Supports cloud-init, volume management, network config
- Very similar workflow to current Proxmox+OpenTofu setup
- Can reuse cloud-init patterns from existing `terraform/` config
**Migration:**
- Supports live and offline migration via `virsh migrate`
- Live migration is simplest with shared storage (NFS, Ceph, or similar); without it, `virsh migrate --copy-storage-all` has to copy the disks during migration
- Requires matching CPU models between hosts (or CPU model masking)
- Works but is manual — no cluster awareness, must specify target URI
- No built-in orchestration for multi-host scenarios
**Pros:**
- Closest to current Proxmox+Terraform workflow
- Most mature Terraform provider
- Minimal abstraction — direct QEMU/KVM management
- Well-understood, massive community
- Cloud-init works identically to Proxmox workflow
- Can reuse existing template-building patterns
**Cons:**
- VMs only (no containers without adding LXC separately)
- No built-in REST API (would need to expose libvirt socket)
- No web UI without adding cockpit or virt-manager
- Migration works but requires manual setup — no clustering, no orchestration
- Less feature-rich than Incus for multi-host scenarios
### Option 3: microvm.nix
NixOS-native microVM framework. VMs defined as NixOS modules in the host's flake.
**NixOS integration:**
- VMs are NixOS configurations in the same flake
- Supports multiple backends: cloud-hypervisor, QEMU, firecracker, kvmtool
- Lightweight — shares host's nix store with guests via virtiofs
- Declarative network, storage, and resource allocation
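
To illustrate the model, a guest definition on the host might look like the following sketch. It assumes the host already imports the microvm.nix host module from its flake; the VM name and resource sizes are made up.

```nix
{
  microvm.vms.demo = {
    # The guest is an ordinary NixOS configuration evaluated by the host.
    config = {
      networking.hostName = "demo";
      microvm = {
        hypervisor = "cloud-hypervisor";
        vcpu = 2;
        mem = 2048; # MiB
        # Share the host's Nix store read-only instead of copying packages.
        shares = [
          {
            proto = "virtiofs";
            tag = "ro-store";
            source = "/nix/store";
            mountPoint = "/nix/.ro-store";
          }
        ];
      };
    };
  };
}
```

Because the guest lives in the host's configuration, moving it to another machine means moving this block and rebuilding there.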
**Terraform integration:**
- None — everything is defined in Nix
- Fundamentally different workflow from current Proxmox+Terraform approach
**Pros:**
- Most NixOS-native approach
- VMs defined right alongside host configs in this repo
- Very lightweight — fast boot, minimal overhead
- Shares nix store with host (no duplicate packages)
- No cloud-init needed — guest config is part of the flake
**Migration:**
- No migration support — VMs are tied to the host's NixOS config
- Moving a VM means rebuilding it on another host
**Cons:**
- Very niche, smaller community
- Different mental model from current workflow
- Only NixOS guests (no Ubuntu, FreeBSD, etc.)
- No Terraform integration
- No migration support
- Less isolation than full QEMU VMs
- Would need to learn a new deployment pattern
## Comparison
| Criteria | Incus | libvirt | microvm.nix |
|----------|-------|---------|-------------|
| **Workflow similarity** | Medium | High | Low |
| **Terraform support** | Yes (lxd/incus provider) | Yes (mature provider) | No |
| **NixOS module** | Yes | Yes | Yes |
| **Containers + VMs** | Both | VMs only | VMs only |
| **Non-NixOS guests** | Yes | Yes | No |
| **Live migration** | Built-in (first-class) | Yes (manual setup) | No |
| **Offline migration** | Built-in | Yes (manual setup) | No (rebuild) |
| **Clustering** | Built-in | Manual | No |
| **Learning curve** | Medium | Low | Medium |
| **Community/maturity** | Growing | Very mature | Niche |
| **Overhead** | Low | Minimal | Minimal |
## Recommendation
Start with **Incus**. Migration and clustering are key requirements:
- Built-in clustering makes two PN51s a proper hypervisor pool
- Live and offline migration are first-class operations, similar to Proxmox
- Can move VMs between hosts for maintenance (kernel updates, hardware work) without downtime
- Supports both containers and VMs — flexibility for future use
- Terraform provider exists (less mature than libvirt's, but functional)
- REST API enables automation beyond what Terraform covers
libvirt could achieve similar results but requires significantly more manual setup for migration and has no clustering awareness. For a two-node setup where migration is a priority, Incus provides much more out of the box.
**microvm.nix** is off the table given the migration requirement.
## Implementation Plan
### Phase 1: Single-Node Setup (on one PN51)
1. Set `virtualisation.incus.enable = true` on pn01 (or whichever unit proves stable)
2. Initialize Incus (`incus admin init`) — configure storage pool (local NVMe) and network bridge
3. Configure bridge networking for VM traffic on VLAN 12
4. Build a NixOS VM image and import it into Incus
5. Create a test VM manually with `incus launch` to validate the setup
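The host-side part of Phase 1 could look roughly like the sketch below. It uses the `preseed` option as a declarative alternative to the interactive `incus admin init`. The bridge name `br-vlan12`, the `dir` storage driver, the pool path, and the `admin` user are assumptions made for illustration, not decisions taken in this plan.

```nix
{
  # The NixOS Incus module expects the nftables firewall backend.
  networking.nftables.enable = true;

  virtualisation.incus = {
    enable = true;
    # Applied at service start; same schema as `incus admin init --preseed`.
    preseed = {
      storage_pools = [
        {
          name = "default";
          driver = "dir";          # assumption: simplest driver on local NVMe
          config.source = "/var/lib/incus/storage-pools/default";
        }
      ];
      profiles = [
        {
          name = "default";
          devices = {
            root = { type = "disk"; path = "/"; pool = "default"; };
            # Attach guests to an existing host bridge on VLAN 12 (hypothetical name).
            eth0 = { type = "nic"; nictype = "bridged"; parent = "br-vlan12"; name = "eth0"; };
          };
        }
      ];
    };
  };

  # Members of incus-admin can talk to the daemon without root (hypothetical user).
  users.users.admin.extraGroups = [ "incus-admin" ];
}
```

The host bridge itself would still have to be defined in the host's network configuration; it is left out here to keep the sketch short.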
### Phase 2: Two-Node Cluster (PN51s only)
1. Enable Incus on the second PN51
2. Form a cluster between both nodes
3. Configure shared storage (NFS from NAS, or Ceph if warranted)
4. Test offline migration: `incus move <vm> --target <other-node>`
5. Test live migration with shared storage
6. CPU compatibility is not an issue here — both nodes have identical Ryzen 7 5700U CPUs
### Phase 3: Terraform Integration
1. Add Incus Terraform provider to `terraform/`
2. Define a test VM in OpenTofu (cloud-init, static IP, vault provisioning)
3. Verify the full pipeline: tofu apply -> VM boots -> cloud-init -> vault credentials -> NixOS rebuild
4. Compare workflow with existing Proxmox pipeline
### Phase 4: Evaluate and Expand
- Is the workflow comparable to Proxmox?
- Migration reliability — does live migration work cleanly?
- Performance overhead acceptable on Ryzen 5700U?
- Worth migrating some test-tier VMs from Proxmox?
- Could ns1/ns2 run on separate Incus nodes instead of the single Proxmox host?
### Phase 5: Proxmox Replacement (optional)
If Incus works well on the PN51s, consider replacing Proxmox entirely for a three-node cluster.
**CPU compatibility for mixed cluster:**
| Node | CPU | Architecture | x86-64-v3 |
|------|-----|-------------|-----------|
| Proxmox host | AMD Ryzen 9 3900X (12C/24T) | Zen 2 | Yes |
| pn01 | AMD Ryzen 7 5700U (8C/16T) | Zen 2 | Yes |
| pn02 | AMD Ryzen 7 5700U (8C/16T) | Zen 2 | Yes |
All three CPUs are AMD Zen 2 parts and support `x86-64-v3` (the 5700U is a "Lucienne" APU, which uses Zen 2 cores despite its 5000-series name). With the same microarchitecture on every node the feature sets are closely matched, and `x86-64-v3` is well within what each of them offers. VMs configured with an `x86-64-v3` baseline can migrate freely between all three nodes.
Being all-AMD also avoids the trickier Intel/AMD cross-vendor migration edge cases (different CPUID layouts, virtualization extensions).
The 3900X (12C/24T) would be the most powerful node, making it the natural home for heavier workloads, with the PN51s (8C/16T each) handling lighter VMs or serving as migration targets during maintenance.
Steps:
1. Install NixOS + Incus on the Proxmox host (or a replacement machine)
2. Join it to the existing Incus cluster with `x86-64-v3` CPU baseline
3. Migrate VMs from Proxmox to the Incus cluster
4. Decommission Proxmox
## Prerequisites
- [ ] PN51 units pass stability testing (see `pn51-stability.md`)
- [ ] Decide which unit to use first (pn01 preferred — 64GB RAM, NVMe, currently more stable)
## Open Questions
- How to handle VM storage? Local NVMe, NFS from NAS, or Ceph between the two nodes?
- Network topology: bridge on VLAN 12, or trunk multiple VLANs to the PN51?
- Should VMs be on the same VLAN as the hypervisor host, or separate?
- Incus clustering with only two nodes — any quorum issues? Three nodes (with Proxmox replacement) would solve this
- How to handle NixOS guest images? Build with nixos-generators, or use Incus image builder?
- ~~What CPU does the current Proxmox host have?~~ AMD Ryzen 9 3900X (Zen 2) — `x86-64-v3` confirmed, all-AMD cluster
- If replacing Proxmox: migrate VMs first, or fresh start and rebuild?

docs/plans/nixos-router.md Normal file
@@ -0,0 +1,182 @@
# NixOS Router — Replace EdgeRouter
Replace the aging Ubiquiti EdgeRouter (gw, 10.69.10.1) with a NixOS-based router.
The EdgeRouter is suspected to be a throughput bottleneck. A NixOS router integrates
naturally with the existing fleet: same config management, same monitoring pipeline,
same deployment workflow.
## Goals
- Eliminate the EdgeRouter throughput bottleneck
- Full integration with existing monitoring (node-exporter, promtail, Prometheus, Loki)
- Declarative firewall and routing config managed in the flake
- Inter-VLAN routing for all existing subnets
- DHCP server for client subnets
- NetFlow/traffic accounting for future ntopng integration
- Foundation for WireGuard remote access (see remote-access.md)
## Current Network Topology
**Subnets (known VLANs):**
| VLAN/Subnet | Purpose | Notable hosts |
|----------------|------------------|----------------------------------------|
| 10.69.10.0/24 | Gateway | gw (10.69.10.1) |
| 10.69.12.0/24 | Core services | nas, pve1, arr jails, restic |
| 10.69.13.0/24 | Infrastructure | All NixOS servers (static IPs) |
| 10.69.22.0/24 | WLAN | unifi-ctrl |
| 10.69.30.0/24 | Workstations | gunter |
| 10.69.31.0/24 | Media | media |
| 10.69.99.0/24 | Management | sw1 (MikroTik CRS326-24G-2S+) |
**DNS:** ns1 (10.69.13.5) and ns2 (10.69.13.6) handle all resolution. Upstream is
Cloudflare/Google over DoT via Unbound.
**Switch:** MikroTik CRS326-24G-2S+ — L2 switching with VLAN trunking. Capable of
L3 routing via RouterOS but not ideal for sustained routing throughput.
## Hardware
Needs a small x86 box with:
- At least 2 NICs (WAN + LAN trunk). Dual 2.5GbE preferred.
- Enough CPU for nftables NAT at line rate (any modern x86 is fine)
- 4-8 GB RAM (plenty for routing + DHCP + NetFlow accounting)
- Low power consumption, fanless preferred for always-on use
**Leading candidate:** [Topton Solid Mini PC](https://www.aliexpress.com/item/1005008981218625.html)
with Intel i3-N300 (8 E-cores), 2x10GbE SFP+ + 3x2.5GbE (~NOK 3000 barebones). The N300
gives headroom for ntopng DPI and potential Suricata IDS without being overkill.
### Hardware Alternatives
Domestic availability for firewall mini PCs is limited — likely ordering from AliExpress.
Key things to verify:
- NIC chipset: Intel i225-V/i226-V preferred over Realtek for Linux driver support
- RAM/storage: some listings are barebones, check what's included
- Import duties: factor in ~25% on top of listing price
| Option | NICs | Notes | Price |
|--------|------|-------|-------|
| [Topton Solid Firewall Router](https://www.aliexpress.com/item/1005008059819023.html) | 2x10GbE SFP+, 4x2.5GbE | No RAM/SSD, only Intel N150 available currently | ~NOK 2500 |
| [Topton Solid Mini PC](https://www.aliexpress.com/item/1005008981218625.html) | 2x10GbE SFP+, 3x2.5GbE | No RAM/SSD, only Intel i3-N300 available currently | ~NOK 3000 |
| [MINISFORUM MS-01](https://www.aliexpress.com/item/1005007308262492.html) | 2x10GbE SFP+, 2x2.5GbE | No RAM/SSD, i5-12600H | ~NOK 4500 |
The LAN port would carry a VLAN trunk to the MikroTik switch, with sub-interfaces
for each VLAN. WAN port connects to the ISP uplink.
## NixOS Configuration
### Stability Policy
The router is treated differently from the rest of the fleet:
- **No auto-upgrade** — `system.autoUpgrade.enable = false`
- **No homelab-deploy listener** — `homelab.deploy.enable = false`
- **Manual updates only** — update every few months, test-build first
- **Use `nixos-rebuild boot`** — changes take effect on next deliberate reboot
- **Tier: prod, priority: high** — alerts treated with highest priority
### Core Services
**Routing & NAT:**
- `systemd-networkd` for all interface config (consistent with rest of fleet)
- VLAN sub-interfaces on the LAN trunk (one per subnet)
- `networking.nftables` for stateful firewall and NAT
- IP forwarding enabled (`net.ipv4.ip_forward = 1`)
- Masquerade outbound traffic on WAN interface
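A minimal sketch of the VLAN sub-interface setup with `systemd-networkd`, shown for one subnet only. The trunk interface name `lan0`, the VLAN ID 13, and the gateway address `10.69.13.1` are assumptions (the router's per-VLAN addresses are still an open question); WAN configuration is omitted because the uplink type is not yet known.

```nix
{
  systemd.network.enable = true;

  # VLAN device for the infrastructure subnet (VLAN ID assumed to match the third octet).
  systemd.network.netdevs."20-vlan13" = {
    netdevConfig = { Kind = "vlan"; Name = "vlan13"; };
    vlanConfig.Id = 13;
  };

  # The physical trunk port carries only tagged traffic and gets no address itself.
  systemd.network.networks."10-lan-trunk" = {
    matchConfig.Name = "lan0";     # hypothetical interface name
    vlan = [ "vlan13" ];
    networkConfig.LinkLocalAddressing = "no";
  };

  # The router's address on this VLAN, used by hosts as their default gateway.
  systemd.network.networks."20-vlan13" = {
    matchConfig.Name = "vlan13";
    address = [ "10.69.13.1/24" ]; # assumed gateway address
  };

  # Allow the kernel to route packets between interfaces.
  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;
}
```

Each additional subnet would repeat the `netdevs`/`networks` pair and add its VLAN name to the trunk's `vlan` list.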
**DHCP:**
- Kea or dnsmasq for DHCP on client subnets (WLAN, workstations, media)
- Infrastructure subnet (10.69.13.0/24) stays static — no DHCP needed
- Static leases for known devices
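If Kea is chosen, the DHCP side might look like the following sketch for the workstation subnet. The interface name `vlan30`, the pool range, the lease lifetime, the gateway address, and the reservation (which uses a documentation-range MAC address) are all placeholders.

```nix
{
  services.kea.dhcp4 = {
    enable = true;
    settings = {
      # Serve DHCP only on client-facing VLAN interfaces (hypothetical name).
      interfaces-config.interfaces = [ "vlan30" ];
      lease-database = {
        type = "memfile";
        persist = true;
        name = "/var/lib/kea/dhcp4.leases";
      };
      valid-lifetime = 3600;
      subnet4 = [
        {
          id = 30;
          subnet = "10.69.30.0/24";
          pools = [ { pool = "10.69.30.100 - 10.69.30.200"; } ];  # illustrative range
          option-data = [
            { name = "routers"; data = "10.69.30.1"; }            # assumed gateway address
            { name = "domain-name-servers"; data = "10.69.13.5, 10.69.13.6"; }
          ];
          # Static lease for a known device (placeholder MAC and address).
          reservations = [
            { hw-address = "00:00:5e:00:53:01"; ip-address = "10.69.30.10"; }
          ];
        }
      ];
    };
  };
}
```

The infrastructure subnet is deliberately absent, since it stays on static addressing.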
**Firewall (nftables):**
- Default deny between VLANs
- Explicit allow rules for known cross-VLAN traffic:
  - All subnets → ns1/ns2 (DNS)
  - All subnets → monitoring01 (metrics/logs)
  - Infrastructure → all (management access)
  - Workstations → media, core services
- NAT masquerade on WAN
- Rate limiting on WAN-facing services
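The forwarding policy above could be expressed roughly as follows. This is a sketch, not a complete ruleset: interface names (`wan0`, `vlan12`, `vlan13`, `vlan30`, `vlan31`) are hypothetical, the monitoring01 rule and rate limiting are left out because the relevant addresses and ports are not stated here, and filtering of traffic to the router itself is left to the standard NixOS firewall.

```nix
{
  networking.nftables.enable = true;

  # Extra tables loaded alongside whatever the NixOS firewall module generates.
  networking.nftables.ruleset = ''
    table inet router_filter {
      chain forward {
        type filter hook forward priority filter; policy drop;

        # Replies to connections that were already allowed.
        ct state established,related accept

        # All subnets -> ns1/ns2 (DNS)
        ip daddr { 10.69.13.5, 10.69.13.6 } udp dport 53 accept
        ip daddr { 10.69.13.5, 10.69.13.6 } tcp dport 53 accept

        # Infrastructure -> all (management access)
        iifname "vlan13" accept

        # Workstations -> media, core services
        iifname "vlan30" oifname { "vlan31", "vlan12" } accept

        # Internal VLANs -> internet (assumed; pairs with the masquerade rule below)
        iifname "vlan*" oifname "wan0" accept
      }
    }

    table ip router_nat {
      chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        oifname "wan0" masquerade
      }
    }
  '';
}
```

Keeping the policy in a dedicated table makes it easy to diff against the documented EdgeRouter rules during the cutover preparation.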
**Traffic Accounting:**
- nftables flow accounting or softflowd for NetFlow export
- Export to future ntopng instance (see new-services.md)
**IDS/IPS (future consideration):**
- Suricata for inline intrusion detection/prevention on the WAN interface
- Signature-based threat detection, protocol anomaly detection
- CPU-intensive — feasible at typical home internet speeds (500Mbps-1Gbps) on the N300
- Not a day-one requirement, but the hardware should support it
### Monitoring Integration
Since this is a NixOS host in the flake, it gets the standard monitoring stack for free:
- node-exporter for system metrics (CPU, memory, NIC throughput per interface)
- promtail shipping logs to Loki
- Prometheus scrape target auto-registration
- Alertmanager alerts for host-down, high CPU, etc.
Additional router-specific monitoring:
- Per-VLAN interface traffic metrics via node-exporter (automatic for all interfaces)
- NAT connection tracking table size
- WAN uplink status and throughput
- DHCP lease metrics (if Kea, it has a Prometheus exporter)
This is a significant advantage over the EdgeRouter — full observability through
the existing Grafana dashboards and Loki log search, debuggable via the monitoring
MCP tools.
### WireGuard Integration
The remote access plan (remote-access.md) currently proposes a separate `extgw01`
gateway host. With a NixOS router, there's a decision to make:
**Option A:** WireGuard terminates on the router itself. Simplest topology — the
router is already the gateway, so VPN traffic doesn't need extra hops or firewall
rules. But adds complexity to the router, which should stay simple.
**Option B:** Keep extgw01 as a separate host (original plan). Router just routes
traffic to it. Better separation of concerns, router stays minimal.
Recommendation: Start with option B (keep it separate). The router should do routing
and nothing else. WireGuard can move to the router later if extgw01 feels redundant.
## Migration Plan
### Phase 1: Build and lab test
- Acquire hardware
- Create host config in the flake (routing, NAT, DHCP, firewall)
- Test-build on workstation: `nix build .#nixosConfigurations.router01.config.system.build.toplevel`
- Lab test with a temporary setup if possible (two NICs, isolated VLAN)
### Phase 2: Prepare cutover
- Pre-configure the MikroTik switch trunk port for the new router
- Document current EdgeRouter config (port forwarding, NAT rules, DHCP leases)
- Replicate all rules in the NixOS config
- Verify DNS, DHCP, and inter-VLAN routing work in test
### Phase 3: Cutover
- Schedule a maintenance window (brief downtime expected)
- Swap WAN cable from EdgeRouter to new router
- Swap LAN trunk from EdgeRouter to new router
- Verify connectivity from each VLAN
- Verify internet access, DNS resolution, inter-VLAN routing
- Monitor via Prometheus/Loki (immediately available since it's a fleet host)
### Phase 4: Decommission EdgeRouter
- Keep EdgeRouter available as fallback for a few weeks
- Remove `gw` entry from external-hosts.nix, replace with flake-managed host
- Update any references to 10.69.10.1 if the router IP changes
## Open Questions
- **Router IP:** Keep 10.69.10.1 or move to a different address? Each VLAN
sub-interface needs an IP (the gateway address for that subnet).
- **ISP uplink:** What type of WAN connection? PPPoE, DHCP, static IP?
- **Port forwarding:** What ports are currently forwarded on the EdgeRouter?
These need to be replicated in nftables.
- **DHCP scope:** Which subnets currently get DHCP from the EdgeRouter vs
other sources (UniFi controller for WLAN?)?
- **UPnP/NAT-PMP:** Needed for any devices? (gaming consoles, etc.)
- **Hardware preference:** Fanless mini PC budget and preferred vendor?

@@ -0,0 +1,104 @@
# NixOS OpenStack Image
## Overview
Build and upload a NixOS base image to the OpenStack cluster at work, enabling NixOS-based VPS instances to replace the current Debian+Podman setup. This image will serve as the foundation for multiple external services:
- **Forgejo** (replacing Gitea on docker2)
- **WireGuard gateway** (replacing docker2's tunnel role, feeding into the remote-access plan)
- Any future externally-hosted services
## Current State
- VPS hosting runs on an OpenStack cluster with a personal quota
- Current VPS (`docker2.t-juice.club`) runs Debian with Podman containers
- Homelab already has a working Proxmox image pipeline: `template2` builds via `nixos-rebuild build-image --image-variant proxmox`, deployed via Ansible
- nixpkgs has a built-in `openstack` image variant in the same `image.modules` system used for Proxmox
## Decisions
- **No cloud-init dependency** - SSH key baked into the image, no need for metadata service
- **No bootstrap script** - VPS deployments are infrequent; manual `nixos-rebuild` after first boot is fine
- **No Vault access** - secrets handled manually until WireGuard access is set up (see remote-access plan)
- **Separate from homelab services** - no logging/metrics integration initially; revisit after remote-access WireGuard is in place
- **Repo placement TBD** - keep in this flake for now for convenience, but external hosts may move to a separate flake later since they can't use most shared `system/` modules (no Vault, no internal DNS, no Promtail)
- **OpenStack CLI in devshell** - add `openstackclient` package; credentials (`clouds.yaml`) stay outside the repo
- **Parallel deployment** - new Forgejo instance runs alongside docker2 initially, then CNAME moves over
## Approach
Follow the same pattern as the Proxmox template (`hosts/template2`), but targeting OpenStack's qcow2 format.
### What nixpkgs provides
The `image.modules.openstack` module produces a qcow2 image with:
- `openstack-config.nix`: EC2 metadata fetcher, SSH enabled, GRUB bootloader, serial console, auto-growing root partition
- `qemu-guest.nix` profile (virtio drivers)
- ext4 root filesystem with `autoResize`
### What we need to customize
The stock OpenStack image pulls SSH keys and hostname from EC2-style metadata. Since we're baking the SSH key into the image, we need a simpler configuration:
- SSH authorized keys baked into the image
- Base packages (age, vim, wget, git)
- Nix substituters (`cache.nixos.org` only - internal cache not reachable)
- systemd-networkd with DHCP
- GRUB bootloader
- Firewall enabled (public-facing host)
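Put together, the customizations listed above might look like this sketch of a `configuration.nix`. The user name `nixos` and the key string are placeholders, and the bootloader settings are assumed to come from the stock OpenStack image module, so they are not repeated.

```nix
{ pkgs, ... }:
{
  # Login user with a baked-in key instead of metadata-provided keys (placeholders).
  users.users.nixos = {
    isNormalUser = true;
    extraGroups = [ "wheel" ];
    openssh.authorizedKeys.keys = [ "ssh-ed25519 AAAA...placeholder user@host" ];
  };
  services.openssh.enable = true;

  # Base packages
  environment.systemPackages = with pkgs; [ age vim wget git ];

  # Only the public cache is reachable from the VPS.
  nix.settings.substituters = [ "https://cache.nixos.org" ];

  # systemd-networkd with DHCP on any ethernet interface.
  networking.useDHCP = false;
  systemd.network.enable = true;
  systemd.network.networks."10-dhcp" = {
    matchConfig.Type = "ether";
    networkConfig.DHCP = "yes";
  };

  # Public-facing host: keep the firewall on (SSH is opened by the openssh module).
  networking.firewall.enable = true;
}
```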
### Differences from template2
| Aspect | template2 (Proxmox) | openstack-template (OpenStack) |
|--------|---------------------|-------------------------------|
| Image format | VMA (`.vma.zst`) | qcow2 (`.qcow2`) |
| Image variant | `proxmox` | `openstack` |
| Cloud-init | ConfigDrive + NoCloud | Not used (SSH key baked in) |
| Nix cache | Internal + nixos.org | `cache.nixos.org` only |
| Vault | AppRole via wrapped token | None |
| Bootstrap | Automatic nixos-rebuild on first boot | Manual |
| Network | Internal DHCP | OpenStack DHCP |
| DNS | Internal ns1/ns2 | Public DNS |
| Firewall | Disabled (trusted network) | Enabled |
| System modules | Full `../../system` import | Minimal (sshd, packages only) |
## Implementation Steps
### Phase 1: Build the image
1. Create `hosts/openstack-template/` with minimal configuration
   - `default.nix` - imports (only sshd and packages from `system/`, not the full set)
   - `configuration.nix` - base config: SSH key, DHCP, GRUB, base packages, firewall on
   - `hardware-configuration.nix` - qemu-guest profile with virtio drivers
   - Exclude from DNS and monitoring (`homelab.dns.enable = false`, `homelab.monitoring.enable = false`)
   - May need to override parts of `image.modules.openstack` to disable the EC2 metadata fetcher if it causes boot delays
2. Build with `nixos-rebuild build-image --image-variant openstack --flake .#openstack-template`
3. Verify the qcow2 image is produced in `result/`
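For step 1, the hardware and exclusion pieces are small enough to sketch in one fragment. The `homelab.*` options are the ones named above; combining them with the profile import in a single file is only for illustration, and the real split across `default.nix` and `hardware-configuration.nix` may differ.

```nix
{ modulesPath, ... }:
{
  # virtio drivers in the initrd so the root disk is found at boot.
  imports = [ (modulesPath + "/profiles/qemu-guest.nix") ];

  # External host: keep it out of internal DNS and monitoring.
  homelab.dns.enable = false;
  homelab.monitoring.enable = false;
}
```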
### Phase 2: Upload and test
1. Add `openstackclient` to the devshell
2. Upload image: `openstack image create --disk-format qcow2 --file result/<image>.qcow2 nixos-template`
3. Boot a test instance from the image
4. Verify: SSH access works, DHCP networking, Nix builds work
5. Test manual `nixos-rebuild switch --flake` against the instance
### Phase 3: Automation (optional, later)
Consider an Ansible playbook similar to `build-and-deploy-template.yml` for image builds + uploads. Low priority since this will be done rarely.
## Open Questions
- [ ] Should external VPS hosts eventually move to a separate flake? (Depends on how different they end up being from homelab hosts)
- [ ] Will the stock `openstack-config.nix` metadata fetcher cause boot delays/errors if the metadata service isn't reachable? May need to disable it.
- [ ] **Flavor selection** - investigate what flavors are available in the quota. The standard small flavors likely have insufficient root disk for a NixOS host (Nix store grows fast). Options:
  - Use a larger flavor with adequate root disk
  - Create a custom flavor (if permissions allow)
  - Cinder block storage is an option in theory, but was very slow last time it was tested - avoid if possible
- [ ] Consolidation opportunity - currently running multiple smaller VMs on OpenStack. Could a single larger NixOS VM replace several of them?
## Notes
- `nixos-rebuild build-image --image-variant openstack` uses the same `image.modules` system as Proxmox
- nixpkgs also has an `openstack-zfs` variant if ZFS root is ever wanted
- The stock OpenStack module imports `ec2-data.nix` and `amazon-init.nix` - these may need to be disabled or overridden if they cause issues without a metadata service

@@ -0,0 +1,231 @@
# ASUS PN51 Stability Testing
## Overview
Two ASUS PN51-E1 mini PCs (Ryzen 7 5700U) purchased years ago but shelved due to stability issues. Revisiting them to potentially add to the homelab.
## Hardware
| | pn01 (10.69.12.60) | pn02 (10.69.12.61) |
|---|---|---|
| **CPU** | AMD Ryzen 7 5700U (8C/16T) | AMD Ryzen 7 5700U (8C/16T) |
| **RAM** | 2x 32GB DDR4 SO-DIMM (64GB) | 1x 32GB DDR4 SO-DIMM (32GB) |
| **Storage** | 1TB NVMe | 1TB Samsung 870 EVO (SATA SSD) |
| **BIOS** | 0508 (2023-11-08) | Updated 2026-02-21 (latest from ASUS) |
## Original Issues
- **pn01**: Would boot but freeze randomly after some time. No console errors, completely unresponsive. memtest86 passed.
- **pn02**: Had trouble booting — would start loading kernel from installer USB then instantly reboot. When it did boot, would also freeze randomly.
## Debugging Steps
### 2026-02-21: Initial Setup
1. **Disabled fTPM** (labeled "Security Device" in ASUS BIOS) on both units
   - AMD Ryzen 5000 series had a known fTPM bug causing random hard freezes with no console output
   - Both units booted the NixOS installer successfully after this change
2. Installed NixOS on both, added to repo as `pn01` and `pn02` on VLAN 12
3. Configured monitoring (node-exporter, promtail, nixos-exporter)
### 2026-02-21: pn02 First Freeze
- pn02 froze approximately 1 hour after boot
- All three Prometheus targets went down simultaneously — hard freeze, not graceful shutdown
- Journal on next boot: `system.journal corrupted or uncleanly shut down`
- Kernel warnings from boot log before freeze:
- **TSC clocksource unstable**: `Marking clocksource 'tsc' as unstable because the skew is too large` — TSC skewing ~3.8ms over 500ms relative to HPET watchdog
- **AMD PSP error**: `psp gfx command LOAD_TA(0x1) failed and response status is (0x7)` — Platform Security Processor failing to load trusted application
- pn01 did not show these warnings on this particular boot, but has shown them historically (see below)
### 2026-02-21: pn02 BIOS Update
- Updated pn02 BIOS to latest version from ASUS website
- **TSC still unstable** after BIOS update — same ~3.8ms skew
- **PSP LOAD_TA still failing** after BIOS update
- Monitoring back up, letting it run to see if freeze recurs
### 2026-02-22: TSC/PSP Confirmed on Both Units
- Checked kernel logs after ~9 hours uptime — both units still running
- **pn01 now shows TSC unstable and PSP LOAD_TA failure** on this boot (same ~3.8ms TSC skew, same PSP error)
- pn01 had these same issues historically when tested years ago — the earlier clean boot was just lucky TSC calibration timing
- **Conclusion**: TSC instability and PSP LOAD_TA are platform-level quirks of the PN51-E1 / Ryzen 5700U, present on both units
- The kernel handles TSC instability gracefully (falls back to HPET), and PSP LOAD_TA is non-fatal
- Neither issue is likely the cause of the hard freezes — the fTPM bug remains the primary suspect
### 2026-02-22: Stress Test (1 hour)
- Ran `stress-ng --cpu 16 --vm 2 --vm-bytes 8G --timeout 1h` on both units
- CPU temps peaked at ~85°C, settled to ~80°C sustained (throttle limit is 105°C)
- Both survived the full hour with no freezes, no MCE errors, no kernel issues
- No concerning log entries during or after the test
### 2026-02-22: TSC Runtime Switch Test
- Attempted to switch clocksource back to TSC at runtime on pn01:
```
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
```
- Kernel watchdog immediately reverted to HPET — TSC skew is ongoing, not just a boot-time issue
- **Conclusion**: TSC is genuinely unstable on the PN51-E1 platform. HPET is the correct clocksource.
- For virtualization (Incus), this means guest VMs will use HPET-backed timing. Performance impact is minimal for typical server workloads (DNS, monitoring, light services) but would matter for latency-sensitive applications.
### 2026-02-22: BIOS Tweaks (Both Units)
- Disabled ErP Ready on both (EU power efficiency mode — aggressively cuts power in idle)
- Disabled WiFi and Bluetooth in BIOS on both
- **TSC still unstable** after these changes — same ~3.8ms skew on both units
- ErP/power states are not the cause of the TSC issue
### 2026-02-22: pn02 Second Freeze
- pn02 froze again ~5.5 hours after boot (at idle, not under load)
- All Prometheus targets down simultaneously — same hard freeze pattern
- Last log entry was normal nix-daemon activity — zero warning/error logs before crash
- Survived the 1h stress test earlier but froze at idle later — not thermal
- pn01 remains stable throughout
- **Action**: Blacklisted `amdgpu` kernel module on pn02 (`boot.blacklistedKernelModules = [ "amdgpu" ]`) to eliminate GPU/PSP firmware interactions as a cause. This leaves the unit without local console output, which is acceptable because it is managed via SSH.
- **Action**: Added diagnostic/recovery config to pn02:
- `panic=10` + `nmi_watchdog=1` kernel params — auto-reboot after 10s on panic
- `softlockup_panic` + `hardlockup_panic` sysctls — convert lockups to panics with stack traces
- `hardware.rasdaemon` with recording — logs hardware errors (MCE, PCIe AER, memory) to sqlite database, survives reboots
- Check recorded errors: `ras-mc-ctl --summary`, `ras-mc-ctl --errors`
## Benign Kernel Errors (Both Units)
These appear on both units and can be ignored:
- `clocksource: Marking clocksource 'tsc' as unstable` — TSC skew vs HPET, kernel falls back gracefully. Platform-level quirk on PN51-E1, not always reproducible on every boot.
- `psp gfx command LOAD_TA(0x1) failed` — AMD PSP firmware error, non-fatal. Present on both units across all BIOS versions.
- `pcie_mp2_amd: amd_sfh_hid_client_init failed err -95` — AMD Sensor Fusion Hub, no sensors connected
- `Bluetooth: hci0: Reading supported features failed` — Bluetooth init quirk
- `Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO` — unused serial bus device
- `snd_hda_intel: no codecs found` — no audio device connected, headless server
- `ata2.00: supports DRM functions and may not be fully accessible` — Samsung SSD DRM quirk (pn02 only)
### 2026-02-23: processor.max_cstate=1 and Proxmox Forums
- Found a thread on the Proxmox forums about PN51 units with similar freeze issues
- Many users reporting identical symptoms — random hard freezes, no log evidence
- No conclusive fix. Some have frequent freezes, others only a few times a month
- Some reported BIOS updates helped, but results inconsistent
- Added `processor.max_cstate=1` kernel parameter to pn02 — limits CPU to C1 halt state, preventing deep C-state sleep transitions that may trigger freezes on AMD mobile chips
- Also applied: amdgpu blacklist, panic=10, nmi_watchdog=1, softlockup/hardlockup panic, rasdaemon
### 2026-02-23: logind D-Bus Deadlock (pn02)
- node-exporter alert fired — but host was NOT frozen
- logind was running (PID 871) but deadlocked on D-Bus — not responding to `org.freedesktop.login1` requests
- Every node-exporter scrape blocked for 25s waiting for logind, causing scrape timeouts
- Likely related to amdgpu blacklist — no DRM device means no graphical seat, logind may have deadlocked during seat enumeration at boot
- Fix: `systemctl restart systemd-logind` + `systemctl restart prometheus-node-exporter`
- After restart, logind responded normally and reported seat0
### 2026-02-27: pn02 Third Freeze
- pn02 crashed again after ~2 days 21 hours uptime (longest run so far)
- Evidence of crash:
  - Journal file corrupted: `system.journal corrupted or uncleanly shut down`
  - Boot partition fsck: `Dirty bit is set. Fs was not properly unmounted`
  - No orderly shutdown logs from previous boot
  - No auto-upgrade triggered
- **NMI watchdog did NOT fire** — no kernel panic logged. This is a true hard lockup below NMI level
- **rasdaemon recorded nothing** — no MCE, AER, or memory errors in the sqlite database
- **Positive**: The system auto-rebooted this time (likely hardware watchdog), unlike previous freezes that required manual power cycle
- `processor.max_cstate=1` may have extended uptime (2d21h vs previous 1h and 5.5h) but did not prevent the freeze
### 2026-02-27 to 2026-03-03: Relative Stability
- pn02 ran without crashes for approximately one week after the third freeze
- pn01 continued to be completely stable throughout this period
- Auto-upgrade reboots continued daily (~4am) on both units — these are planned and healthy
### 2026-03-04: pn02 Fourth Crash — sched_ext Kernel Oops (pstore captured)
- pn02 crashed after ~5.8 days uptime (504566s)
- **First crash captured by pstore** — kernel oops and panic stack traces preserved across reboot
- Journal corruption confirmed: `system.journal corrupted or uncleanly shut down`
- **Crash location**: `RIP: 0010:set_next_task_scx+0x6e/0x210` — crash in the **sched_ext (SCX) scheduler** subsystem
- **Call trace**: `sysvec_apic_timer_interrupt` → `cpuidle_enter_state` — crashed during CPU idle, triggered by APIC timer interrupt
- **CR2**: `ffffffffffffff89` — dereferencing an obviously invalid kernel pointer
- **Kernel**: 6.12.74 (NixOS 25.11)
- **Significance**: This is the first crash with actual diagnostic output. Previous crashes were silent sub-NMI freezes. The sched_ext scheduler path is a new finding — earlier crashes were assumed to be hardware-level.
### 2026-03-06: pn02 Fifth Crash
- pn02 crashed again — journal corruption on next boot
- No pstore data captured for this crash
### 2026-03-07: pn02 Sixth and Seventh Crashes — Two in One Day
**First crash (~11:06 UTC):**
- ~26.6 hours uptime (95994s)
- **pstore captured both Oops and Panic**
- **Crash location**: Scheduler code path — `pick_next_task_fair` → `__pick_next_task`
- **CR2**: `000000c000726000` — invalid pointer dereference
- **Notable**: `dbus-daemon` segfaulted ~50 minutes before the kernel crash (`segfault at 0` in `libdbus-1.so.3.32.4` on CPU 0) — may indicate memory corruption preceding the kernel crash
**Second crash (~21:15 UTC):**
- Journal corruption confirmed on next boot
- No pstore data captured
### 2026-03-07: pn01 Status
- pn01 has had **zero crashes** since initial setup on Feb 21
- Zero journal corruptions, zero pstore dumps in 30 days
- Same BOOT_ID maintained between daily auto-upgrade reboots — consistently clean shutdown/reboot cycles
- All 8 reboots in 30 days are planned auto-upgrade reboots
- **pn01 is fully stable**
## Crash Summary
| Date | Uptime Before Crash | Crash Type | Diagnostic Data |
|------|---------------------|------------|-----------------|
| Feb 21 | ~1h | Silent freeze | None — sub-NMI |
| Feb 22 | ~5.5h | Silent freeze | None — sub-NMI |
| Feb 27 | ~2d 21h | Silent freeze | None — sub-NMI, rasdaemon empty |
| Mar 4 | ~5.8d | **Kernel oops** | pstore: `set_next_task_scx` (sched_ext) |
| Mar 6 | Unknown | Crash | Journal corruption only |
| Mar 7 | ~26.6h | **Kernel oops + panic** | pstore: `pick_next_task_fair` (scheduler) + dbus segfault |
| Mar 7 | Unknown | Crash | Journal corruption only |
## Conclusion
**pn02 is unreliable.** After exhausting mitigations (fTPM disabled, BIOS updated, WiFi/BT disabled, ErP disabled, amdgpu blacklisted, processor.max_cstate=1, NMI watchdog, rasdaemon), the unit still crashes every few days. 26 reboots in 30 days (7 unclean crashes + daily auto-upgrade reboots).
The pstore crash dumps from March reveal a new dimension: at least some crashes are **kernel oopses in scheduler code paths**, not just silent hardware-level freezes. One dump points at sched_ext (`set_next_task_scx`), the other at the fair scheduler (`pick_next_task_fair`). These crash sites, combined with the dbus-daemon segfault before one crash, suggest possible memory corruption that manifests in the scheduler. It's unclear whether this is:
1. A sched_ext kernel bug exposed by the PN51's hardware quirks (unstable TSC, C-state behavior)
2. Hardware-induced memory corruption that happens to hit scheduler data structures
3. A pure software bug in the 6.12.74 kernel's sched_ext implementation
**pn01 is stable**, with zero crashes since initial setup on Feb 21. Both units run an identical kernel and NixOS configuration (minus pn02's diagnostic mitigations), so the difference points toward a hardware defect specific to the pn02 unit.
## Next Steps
- **pn02 memtest**: Run memtest86 for 24h+ (available in systemd-boot menu). The crash signatures (userspace segfaults before kernel panics, corrupted pointers in scheduler structures) are consistent with intermittent RAM errors that a quick pass wouldn't catch. If memtest finds errors, swap the DIMM.
- **pn02**: Consider scrapping or repurposing for non-critical workloads that tolerate random reboots (auto-recovery via hardware watchdog is now working)
- **pn02 investigation**: Could try disabling sched_ext to test whether the crashes stop, which would help distinguish a kernel bug from a hardware defect. The exact switch still needs verifying: `boot.kernelParams = [ "sched_ext.enabled=0" ]` is a placeholder, not a confirmed kernel parameter
- **pn01**: Continue monitoring. If it remains stable long-term, it is viable for light workloads
- If pn01 eventually crashes, apply the same mitigations (amdgpu blacklist, max_cstate=1) to see if they help
- For the Incus hypervisor plan: likely need different hardware. Evaluating GMKtec G3 (Intel) as an alternative. Note: mixed Intel/AMD cluster complicates live migration
## Diagnostics and Auto-Recovery (pn02)
Currently deployed on pn02:
```nix
boot.blacklistedKernelModules = [ "amdgpu" ];
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
```
**Crash recovery is working**: pstore now captures kernel oops/panic data, and the system auto-reboots via `panic=10` or SP5100 TCO hardware watchdog.
**After reboot, check:**
- `ras-mc-ctl --summary` — overview of hardware errors
- `ras-mc-ctl --errors` — detailed error list
- `journalctl -b -1 -p err` — kernel logs from crashed boot (if panic was logged)
- pstore data is automatically archived by `systemd-pstore.service` and forwarded to Loki via promtail


@@ -4,119 +4,118 @@
## Goal
Enable remote access to some or all homelab services from outside the internal network, without exposing anything directly to the internet.
Enable personal remote access to selected homelab services from outside the internal network, without exposing anything directly to the internet.
## Current State
- All services are only accessible from the internal 10.69.13.x network
- Exception: jelly01 has a WireGuard link to an external VPS
- No services are directly exposed to the public internet
- http-proxy has a WireGuard tunnel (`wg0`, `10.69.222.0/24`) to a VPS (`docker2.t-juice.club`) on an OpenStack cluster
- VPS runs Traefik which proxies selected services (including Jellyfin) back through the tunnel to http-proxy's Caddy
- No other services are directly exposed to the public internet
## Constraints
## Decision: WireGuard Gateway
- Nothing should be directly accessible from the outside
- Must use VPN or overlay network (no port forwarding of services)
- Self-hosted solutions preferred over managed services
After evaluating WireGuard gateway vs Headscale (self-hosted Tailscale), the **WireGuard gateway** approach was chosen:
## Options
- Only 2 client devices (laptop + phone), so Headscale's device management UX isn't needed
- Split DNS works fine on Linux laptop via systemd-resolved; all-or-nothing DNS on phone is acceptable for occasional use
- Simpler infrastructure - no control server to maintain
- Builds on existing WireGuard experience and setup
### 1. WireGuard Gateway (Internal Router)
## Architecture
A dedicated NixOS host on the internal network with a WireGuard tunnel out to the VPS. The VPS becomes the public entry point, and the gateway routes traffic to internal services. Firewall rules on the gateway control which services are reachable.
```mermaid
graph TD
clients["Laptop / Phone"]
vps["VPS<br/>(WireGuard endpoint)"]
extgw["extgw01<br/>(gateway + bastion)"]
grafana["Grafana<br/>monitoring01:3000"]
jellyfin["Jellyfin<br/>jelly01:8096"]
arr["arr stack<br/>*-jail hosts"]
**Pros:**
- Simple, well-understood technology
- Already running WireGuard for jelly01
- Full control over routing and firewall rules
- Excellent NixOS module support
- No extra dependencies
clients -->|WireGuard| vps
vps -->|WireGuard tunnel| extgw
extgw -->|allowed traffic| grafana
extgw -->|allowed traffic| jellyfin
extgw -->|allowed traffic| arr
```
**Cons:**
- Hub-and-spoke topology (all traffic goes through VPS)
- Manual peer management
- Adding a new client device means editing configs on both VPS and gateway
### Existing path (unchanged)
### 2. WireGuard Mesh (No Relay)
The current public access path stays as-is:
Each client device connects directly to a WireGuard endpoint. Could be on the VPS which forwards to the homelab, or if there is a routable IP at home, directly to an internal host.
```
Internet → VPS (Traefik) → WireGuard → http-proxy (Caddy) → internal services
```
**Pros:**
- Simple and fast
- No extra software
This handles public Jellyfin access and any other publicly-exposed services.
**Cons:**
- Manual key and endpoint management for every peer
- Doesn't scale well
- If behind CGNAT, still needs the VPS as intermediary
### New path (personal VPN)
### 3. Headscale (Self-Hosted Tailscale)
A separate WireGuard tunnel for personal remote access with restricted firewall rules:
Run a Headscale control server (on the VPS or internally) and install the Tailscale client on homelab hosts and personal devices. Gets the Tailscale mesh networking UX without depending on Tailscale's infrastructure.
```
Laptop/Phone → VPS (WireGuard peers) → tunnel → extgw01 (firewall) → allowed services
```
**Pros:**
- Mesh topology - devices communicate directly via NAT traversal (DERP relay as fallback)
- Easy to add/remove devices
- ACL support for granular access control
- MagicDNS for service discovery
- Good NixOS support for both headscale server and tailscale client
- Subnet routing lets you expose the entire 10.69.13.x network or specific hosts without installing tailscale on every host
### Access tiers
**Cons:**
- More moving parts than plain WireGuard
- Headscale is a third-party reimplementation, can lag behind Tailscale features
- Need to run and maintain the control server
1. **VPN (default)**: Laptop/phone connect to VPS WireGuard endpoint, traffic routed through extgw01 firewall. Only whitelisted services are reachable.
2. **SSH + 2FA (escalated)**: SSH into extgw01 for full network access when needed.
### 4. Tailscale (Managed)
## New Host: extgw01
Same as Headscale but using Tailscale's hosted control plane.
A NixOS host on the internal network acting as both WireGuard gateway and SSH bastion.
**Pros:**
- Zero infrastructure to manage on the control plane side
- Polished UX, well-maintained clients
- Free tier covers personal use
### Responsibilities
**Cons:**
- Dependency on Tailscale's service
- Less aligned with self-hosting preference
- Coordination metadata goes through their servers (data plane is still peer-to-peer)
- **WireGuard tunnel** to the VPS for client traffic
- **Firewall** with allowlist controlling which internal services are reachable through the VPN
- **SSH bastion** with 2FA for full network access when needed
- **DNS**: Clients get split DNS config (laptop via systemd-resolved routing domain, phone uses internal DNS for all queries)
### 5. Netbird (Self-Hosted)
### Firewall allowlist (initial)
Open-source alternative to Tailscale with a self-hostable management server. WireGuard-based, supports ACLs and NAT traversal.
| Service | Destination | Port |
|------------|------------------------------|-------|
| Grafana | monitoring01.home.2rjus.net | 3000 |
| Jellyfin | jelly01.home.2rjus.net | 8096 |
| Sonarr | sonarr-jail.home.2rjus.net | 8989 |
| Radarr | radarr-jail.home.2rjus.net | 7878 |
| NZBget | nzbget-jail.home.2rjus.net | 6789 |
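The allowlist above could be expressed on extgw01 roughly as follows. This is a minimal sketch, not the deployed configuration: the interface name `wg-clients`, the table name `vpn_allowlist`, and every destination address are placeholders. nftables rules match on IP addresses, so the hostnames in the table would have to be replaced with each host's real address. Only the ports come from the table.

```nix
{ ... }:
{
  # Forward packets between the WireGuard tunnel and the internal network.
  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;

  networking.nftables.enable = true;
  networking.nftables.tables.vpn_allowlist = {
    family = "inet";
    content = ''
      chain forward {
        type filter hook forward priority 0; policy drop;

        # Replies to connections that were already allowed.
        ct state established,related accept

        # Placeholder addresses: substitute the real host IPs.
        iifname "wg-clients" ip daddr 10.69.13.101 tcp dport 3000 accept  # Grafana
        iifname "wg-clients" ip daddr 10.69.13.102 tcp dport 8096 accept  # Jellyfin
        iifname "wg-clients" ip daddr 10.69.13.103 tcp dport 8989 accept  # Sonarr
        iifname "wg-clients" ip daddr 10.69.13.104 tcp dport 7878 accept  # Radarr
        iifname "wg-clients" ip daddr 10.69.13.105 tcp dport 6789 accept  # NZBget
      }
    '';
  };
}
```

With a default-drop forward policy, anything not listed is blocked, which matches the "only whitelisted services are reachable" tier. Note that the policy applies to all forwarded traffic on the host, not only to VPN clients.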
**Pros:**
- Fully self-hostable
- Web UI for management
- ACL and peer grouping support
### SSH 2FA options (to be decided)
**Cons:**
- Heavier to self-host (needs multiple components: management server, signal server, TURN relay)
- Less mature NixOS module support compared to Tailscale/Headscale
- **Kanidm**: Already deployed on kanidm01, supports RADIUS/OAuth2 for PAM integration
- **SSH certificates via OpenBao**: Fits existing Vault infrastructure, short-lived certs
- **TOTP via PAM**: Simplest fallback, Google Authenticator / similar
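For the TOTP fallback, a NixOS configuration might look like the sketch below. It assumes the google-authenticator PAM module and a key-plus-code policy; neither choice is confirmed by the plan, which leaves the 2FA method open.

```nix
{ ... }:
{
  # Assumption: TOTP is provided by the google-authenticator PAM module.
  security.pam.services.sshd.googleAuthenticator.enable = true;

  services.openssh.settings = {
    # The one-time code is prompted for via keyboard-interactive (PAM).
    KbdInteractiveAuthentication = true;
    # Require an SSH key first, then the PAM prompt.
    AuthenticationMethods = "publickey,keyboard-interactive";
  };
}
```

Depending on the rest of the PAM stack, sshd may also ask for the account password during the keyboard-interactive step, so this would need testing before it is relied on.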
### 6. Nebula (by Defined Networking)
## VPS Configuration
Certificate-based mesh VPN. Each node gets a certificate from a CA you control. No central coordination server needed at runtime.
The VPS needs a new WireGuard interface (separate from the existing http-proxy tunnel):
**Pros:**
- No always-on control plane
- Certificate-based identity
- Lightweight
- WireGuard endpoint listening on a public UDP port
- 2 peers: laptop, phone
- Routes client traffic through tunnel to extgw01
- Minimal config - just routing, no firewall policy (that lives on extgw01)
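If the VPS were managed with NixOS (an assumption; the plan does not say what the VPS runs), the client-facing interface could be sketched as below. The interface name, UDP port, `10.69.223.0/24` client subnet, key path, and public keys are all hypothetical. The tunnel to extgw01 and its routes are omitted.

```nix
{ ... }:
{
  # Hypothetical names and addresses throughout; none come from the plan.
  networking.wireguard.interfaces.wg-clients = {
    ips = [ "10.69.223.1/24" ];
    listenPort = 51821;
    privateKeyFile = "/run/secrets/wg-clients.key";
    peers = [
      {
        # Laptop
        publicKey = "<laptop-public-key>";
        allowedIPs = [ "10.69.223.10/32" ];
      }
      {
        # Phone
        publicKey = "<phone-public-key>";
        allowedIPs = [ "10.69.223.11/32" ];
      }
    ];
  };

  # Public UDP port for the WireGuard endpoint.
  networking.firewall.allowedUDPPorts = [ 51821 ];

  # The VPS only routes; access policy stays on extgw01.
  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;
}
```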
**Cons:**
- Less convenient for ad-hoc device addition (need to issue certs)
- NAT traversal less mature than Tailscale's
- Smaller community/ecosystem
## Implementation Steps
## Key Decision Points
- **Static public IP vs CGNAT?** Determines whether clients can connect directly to home network or need VPS relay.
- **Number of client devices?** If just phone and laptop, plain WireGuard via VPS is fine. More devices favors Headscale.
- **Per-service vs per-network access?** Gateway with firewall rules gives per-service control. Headscale ACLs can also do this. Plain WireGuard gives network-level access with gateway firewall for finer control.
- **Subnet routing vs per-host agents?** With Headscale/Tailscale, can either install client on every host, or use a single subnet router that advertises the 10.69.13.x range. The latter is closer to the gateway approach and avoids touching every host.
## Leading Candidates
Based on existing WireGuard experience, self-hosting preference, and NixOS stack:
1. **Headscale with a subnet router** - Best balance of convenience and self-hosting
2. **WireGuard gateway via VPS** - Simplest, most transparent, builds on existing setup
1. **Create extgw01 host configuration** in this repo
- VM provisioned via OpenTofu (same as other hosts)
- WireGuard interface for VPS tunnel
- nftables/iptables firewall with service allowlist
- IP forwarding enabled
2. **Configure VPS WireGuard** for client peers
- New WireGuard interface with laptop + phone peers
- Routing for 10.69.13.0/24 through extgw01 tunnel
3. **Set up client configs**
- Laptop: WireGuard config + systemd-resolved split DNS for `home.2rjus.net`
- Phone: WireGuard app config with DNS pointing at internal nameservers
4. **Set up SSH 2FA** on extgw01
- Evaluate Kanidm integration vs OpenBao SSH certs vs TOTP
5. **Test and verify**
- VPN access to allowed services only
- Firewall blocks everything else
- SSH + 2FA grants full access
- Existing public access path unaffected
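The laptop side of step 3 could look roughly like this, assuming the laptop runs NixOS and the WireGuard interface is managed by systemd-networkd. The interface name `wg-home` is a placeholder; the nameservers and domain are the ones used elsewhere in this repo.

```nix
{ ... }:
{
  services.resolved.enable = true;
  systemd.network.enable = true;

  systemd.network.networks."50-wg-home" = {
    matchConfig.Name = "wg-home"; # hypothetical interface name
    # Internal nameservers, reachable through the tunnel.
    dns = [ "10.69.13.5" "10.69.13.6" ];
    # "~" marks a routing-only domain: only home.2rjus.net lookups use this link.
    domains = [ "~home.2rjus.net" ];
  };
}
```

All other lookups keep using the laptop's normal resolvers, which is the split-DNS behaviour the plan relies on.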


@@ -39,23 +39,17 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
- nzbget: NixOS service or OCI container
- NFS exports: `services.nfs.server`
### Filesystem: BTRFS RAID1
### Filesystem: Keep ZFS
**Decision**: Migrate from ZFS to BTRFS with RAID1
**Decision**: Keep existing ZFS pool, import on NixOS
**Rationale**:
- **In-kernel**: No out-of-tree module issues like ZFS
- **Flexible expansion**: Add individual disks, not required to buy pairs
- **Mixed disk sizes**: Better handling than ZFS multi-vdev approach
- **RAID level conversion**: Can convert between RAID levels in place
- Built-in checksumming, snapshots, compression (zstd)
- NixOS has good BTRFS support
**BTRFS RAID1 notes**:
- "RAID1" means 2 copies of all data
- Distributes across all available devices
- With 6+ disks, provides redundancy + capacity scaling
- RAID5/6 avoided (known issues), RAID1/10 are stable
- **No data migration needed**: Existing ZFS pool can be imported directly on NixOS
- **Proven reliability**: Pool has been running reliably on TrueNAS
- **NixOS ZFS support**: Well-supported, declarative configuration via `boot.zfs` and `services.zfs`
- **BTRFS RAID5/6 unreliable**: Research showed BTRFS RAID5/6 write hole is still unresolved
- **BTRFS RAID1 wasteful**: With mixed disk sizes, RAID1 wastes significant capacity vs ZFS mirrors
- Checksumming, snapshots, compression (lz4/zstd) all available
### Hardware: Keep Existing + Add Disks
@@ -69,83 +63,94 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
**Storage architecture**:
**Bulk storage** (BTRFS RAID1 on HDDs):
- Current: 6x HDDs (2x16TB + 2x8TB + 2x8TB)
- Add: 2x new HDDs (size TBD)
**hdd-pool** (ZFS mirrors):
- Current: 3 mirror vdevs (2x16TB + 2x8TB + 2x8TB) = 32TB usable
- Add: mirror-3 with 2x 24TB = +24TB usable
- Total after expansion: ~56TB usable
- Use: Media, downloads, backups, non-critical data
- Risk tolerance: High (data mostly replaceable)
**Critical data** (small volume):
- Use 2x 240GB SSDs in mirror (BTRFS or ZFS)
- Or use 2TB NVMe for critical data
- Risk tolerance: Low (data important but small)
### Disk Purchase Decision
**Options under consideration**:
**Option A: 2x 16TB drives**
- Matches largest current drives
- Enables potential future RAID5 if desired (6x 16TB array)
- More conservative capacity increase
**Option B: 2x 20-24TB drives**
- Larger capacity headroom
- Better $/TB ratio typically
- Future-proofs better
**Initial purchase**: 2 drives (chassis has space for 2 more without modifications)
**Decision**: 2x 24TB drives (ordered, arriving 2026-02-21)
## Migration Strategy
### High-Level Plan
1. **Preparation**:
- Purchase 2x new HDDs (16TB or 20-24TB)
- Create NixOS configuration for new storage host
- Set up bare metal NixOS installation
1. **Expand ZFS pool** (on TrueNAS):
- Install 2x 24TB drives (may need new drive trays - order from abroad if needed)
- If chassis space is limited, temporarily replace the two oldest 8TB drives (da0/ada4)
- Add as mirror-3 vdev to hdd-pool
- Verify pool health and resilver completes
- Check SMART data on old 8TB drives (all healthy as of 2026-02-20, no reallocated sectors)
- Burn-in: at minimum short + long SMART test before adding to pool
2. **Initial BTRFS pool**:
- Install 2 new disks
- Create BTRFS filesystem in RAID1
- Mount and test NFS exports
2. **Prepare NixOS configuration**:
- Create host configuration (`hosts/nas1/` or similar)
- Configure ZFS pool import (`boot.zfs.extraPools`)
- Set up services: radarr, sonarr, nzbget, restic-rest, NFS
- Configure monitoring (node-exporter, promtail, smartctl-exporter)
3. **Data migration**:
- Copy data from TrueNAS ZFS pool to new BTRFS pool over 10GbE
- Verify data integrity
3. **Install NixOS**:
- `zfs export hdd-pool` on TrueNAS before shutdown (clean export)
- Wipe TrueNAS boot-pool SSDs, set up as mdadm RAID1 for NixOS root
- Install NixOS on mdadm mirror (keeps boot path ZFS-independent)
- Import hdd-pool via `boot.zfs.extraPools`
- Verify all datasets mount correctly
4. **Expand pool**:
- As old ZFS pool is emptied, wipe drives and add to BTRFS pool
- Pool grows incrementally: 2 → 4 → 6 → 8 disks
- BTRFS rebalances data across new devices
4. **Service migration**:
- Configure NixOS services to use ZFS dataset paths
- Update NFS exports
- Test from consuming hosts
5. **Service migration**:
- Set up radarr/sonarr/nzbget/restic as NixOS services
- Update NFS client mounts on consuming hosts
6. **Cutover**:
- Point consumers to new NAS host
5. **Cutover**:
- Update DNS/client mounts if IP changes
- Verify monitoring integration
- Decommission TrueNAS
- Repurpose hardware or keep as spare
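Steps 2 to 4 of the plan could start from a configuration like the sketch below. The pool name `hdd-pool` and the `boot.zfs.extraPools` option are from the plan; the `hostId` value, the dataset path, and the client subnet in the export are placeholders.

```nix
{ ... }:
{
  boot.supportedFilesystems = [ "zfs" ];
  # ZFS on NixOS requires a host ID; this value is a placeholder.
  networking.hostId = "8425e349";

  # Import the existing pool at boot (root stays on the mdadm mirror).
  boot.zfs.extraPools = [ "hdd-pool" ];
  services.zfs.autoScrub.enable = true;

  services.nfs.server = {
    enable = true;
    # Hypothetical dataset path and client subnet.
    exports = ''
      /hdd-pool/media 10.69.13.0/24(rw,no_subtree_check)
    '';
  };
}
```

Because the pool is exported cleanly on TrueNAS before the swap, the import should not need to be forced.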
### Post-Expansion: Vdev Rebalancing
ZFS has no built-in rebalance command. After adding the new 24TB vdev, ZFS will
write new data preferentially to it (most free space), leaving old vdevs packed
at ~97%. This is suboptimal but not urgent once overall pool usage drops to ~50%.
To gradually rebalance, rewrite files in place so ZFS redistributes blocks across
all vdevs proportional to free space:
```bash
# Rewrite files individually (spreads blocks across all vdevs).
# cp -a preserves ownership, permissions, and timestamps on the rewritten copy.
find /pool/dataset -type f -exec sh -c '
for f; do cp -a "$f" "$f.rebal" && mv "$f.rebal" "$f"; done
' _ {} +
```
Avoid `zfs send/recv` for large datasets (e.g. 20TB) as this would concentrate
data on the emptiest vdev rather than spreading it evenly.
**Recommendation**: Do this after NixOS migration is stable. Not urgent - the pool
will function fine with uneven distribution, just slightly suboptimal for performance.
### Migration Advantages
- **Low risk**: New pool created independently, old data remains intact during migration
- **Incremental**: Can add old disks one at a time as space allows
- **Flexible**: BTRFS handles mixed disk sizes gracefully
- **Reversible**: Keep TrueNAS running until fully validated
- **No data migration**: ZFS pool imported directly, no copying terabytes of data
- **Low risk**: Pool expansion done on stable TrueNAS before OS swap
- **Reversible**: Can boot back to TrueNAS if NixOS has issues (ZFS pool is OS-independent)
- **Quick cutover**: Once NixOS config is ready, the OS swap is fast
## Next Steps
1. Decide on disk size (16TB vs 20-24TB)
2. Purchase disks
3. Design NixOS host configuration (`hosts/nas1/`)
4. Plan detailed migration timeline
5. Document NFS export mapping (current → new)
1. ~~Decide on disk size~~ - 2x 24TB ordered
2. Install drives and add mirror vdev to ZFS pool
3. Check SMART data on 8TB drives - decide whether to keep or retire
4. Design NixOS host configuration (`hosts/nas1/`)
5. Document NFS export mapping (current -> new)
6. Plan NixOS installation and cutover
## Open Questions
- [ ] Final decision on disk size?
- [ ] Hostname for new NAS host? (nas1? storage1?)
- [ ] IP address allocation (keep 10.69.12.50 or new IP?)
- [ ] Timeline/maintenance window for migration?
- [ ] IP address/subnet: NAS and Proxmox are both on 10GbE to the same switch but different subnets, forcing traffic through the router (bottleneck). Move to same subnet during migration.
- [x] Boot drive: Reuse TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root (no ZFS on boot path)
- [ ] Retire old 8TB drives? (SMART looks healthy, keep unless chassis space is needed)
- [x] Drive trays: ordered domestically (expected 2026-02-25 to 2026-03-03)
- [ ] Timeline/maintenance window for NixOS swap?

flake.lock generated

@@ -28,11 +28,11 @@
]
},
"locked": {
"lastModified": 1771004123,
"narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
"lastModified": 1771488195,
"narHash": "sha256-2kMxqdDyPluRQRoES22Y0oSjp7pc5fj2nRterfmSIyc=",
"ref": "master",
"rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
"revCount": 36,
"rev": "2d26de50559d8acb82ea803764e138325d95572c",
"revCount": 37,
"type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy"
},
@@ -64,11 +64,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1770770419,
"narHash": "sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs=",
"lastModified": 1772822230,
"narHash": "sha256-yf3iYLGbGVlIthlQIk5/4/EQDZNNEmuqKZkQssMljuw=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "6c5e707c6b5339359a9a9e215c5e66d6d802fd7a",
"rev": "71caefce12ba78d84fe618cf61644dce01cf3a96",
"type": "github"
},
"original": {
@@ -80,11 +80,11 @@
},
"nixpkgs-unstable": {
"locked": {
"lastModified": 1770562336,
"narHash": "sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84=",
"lastModified": 1772773019,
"narHash": "sha256-E1bxHxNKfDoQUuvriG71+f+s/NT0qWkImXsYZNFFfCs=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "d6c71932130818840fc8fe9509cf50be8c64634f",
"rev": "aca4d95fce4914b3892661bcb80b8087293536c6",
"type": "github"
},
"original": {


@@ -92,15 +92,6 @@
./hosts/http-proxy
];
};
monitoring01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/monitoring01
];
};
jelly01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
@@ -209,6 +200,42 @@
./hosts/garage01
];
};
pn01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/pn01
];
};
pn02 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/pn02
];
};
nrec-nixos01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/nrec-nixos01
];
};
openstack-template = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/openstack-template
];
};
};
packages = forAllSystems (
{ pkgs }:
@@ -227,6 +254,7 @@
pkgs.openbao
pkgs.kanidm_1_8
pkgs.nkeys
pkgs.openstackclient
(pkgs.callPackage ./scripts/create-host { })
homelab-deploy.packages.${pkgs.system}.default
];


@@ -54,10 +54,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -46,10 +46,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -18,12 +18,7 @@
"sonarr"
"ha"
"z2m"
"grafana"
"prometheus"
"alertmanager"
"jelly"
"pyroscope"
"pushgw"
];
nixpkgs.config.allowUnfree = true;
@@ -57,10 +52,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;


@@ -44,10 +44,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -55,10 +55,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -1,114 +0,0 @@
{
pkgs,
...
}:
{
imports = [
./hardware-configuration.nix
../../system
../../common/vm
];
homelab.host.role = "monitoring";
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
enable = true;
device = "/dev/sda";
configurationLimit = 3;
};
networking.hostName = "monitoring01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.13/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
sqlite
];
services.qemuGuest.enable = true;
# Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
vault.secrets.backup-helper = {
secretPath = "shared/backup/password";
extractKey = "password";
outputDir = "/run/secrets/backup_helper_secret";
services = [ "restic-backups-grafana" "restic-backups-grafana-db" ];
};
services.restic.backups.grafana = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
paths = [ "/var/lib/grafana/plugins" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
RandomizedDelaySec = "2h";
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
services.restic.backups.grafana-db = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
command = [ "${pkgs.sqlite}/bin/sqlite3" "/var/lib/grafana/data/grafana.db" ".dump" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
RandomizedDelaySec = "2h";
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}


@@ -1,42 +0,0 @@
{
config,
lib,
pkgs,
modulesPath,
...
}:
{
imports = [
(modulesPath + "/profiles/qemu-guest.nix")
];
boot.initrd.availableKernelModules = [
"ata_piix"
"uhci_hcd"
"virtio_pci"
"virtio_scsi"
"sd_mod"
"sr_mod"
];
boot.initrd.kernelModules = [ "dm-snapshot" ];
boot.kernelModules = [
"ptp_kvm"
];
boot.extraModulePackages = [ ];
fileSystems."/" = {
device = "/dev/disk/by-label/root";
fsType = "xfs";
};
swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
# (the default) this is the recommended approach. When using systemd-networkd it's
# still possible to use this option, but it's recommended to use it in conjunction
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
networking.useDHCP = lib.mkDefault true;
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}


@@ -18,8 +18,7 @@
role = "monitoring";
};
# DNS CNAME for Grafana test instance
homelab.dns.cnames = [ "grafana-test" ];
homelab.dns.cnames = [ "monitoring" "alertmanager" "grafana" "grafana-test" "metrics" "vmalert" "loki" ];
# Enable Vault integration
vault.enable = true;
@@ -54,10 +53,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -2,5 +2,11 @@
imports = [
./configuration.nix
../../services/grafana
../../services/victoriametrics
../../services/loki
../../services/monitoring/alerttonotify.nix
../../services/monitoring/blackbox.nix
../../services/monitoring/exportarr.nix
../../services/monitoring/pve.nix
];
}


@@ -44,10 +44,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -25,7 +25,7 @@
};
};
timeout = 7200;
timeout = 14400;
metrics.enable = true;
};


@@ -53,10 +53,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -0,0 +1,78 @@
{
lib,
pkgs,
...
}:
{
services.openssh = {
enable = true;
settings = {
PermitRootLogin = lib.mkForce "no";
PasswordAuthentication = false;
};
};
users.users.nixos = {
isNormalUser = true;
extraGroups = [ "wheel" ];
shell = pkgs.zsh;
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAwfb2jpKrBnCw28aevnH8HbE5YbcMXpdaVv2KmueDu6 torjus@gunter"
];
};
security.sudo.wheelNeedsPassword = false;
programs.zsh.enable = true;
homelab.dns.enable = false;
homelab.monitoring.enable = false;
homelab.host.labels.ansible = "false";
fileSystems."/" = {
device = "/dev/disk/by-label/nixos";
fsType = "ext4";
autoResize = true;
};
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "nrec-nixos01";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
systemd.network.enable = true;
systemd.network.networks."ens3" = {
matchConfig.Name = "ens3";
networkConfig.DHCP = "ipv4";
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
networking.firewall.enable = true;
networking.firewall.allowedTCPPorts = [
22
80
443
];
nix.settings.substituters = [
"https://cache.nixos.org"
];
nix.settings.trusted-public-keys = [
"cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
];
services.caddy = {
enable = true;
virtualHosts."nrec-nixos01.t-juice.club" = {
extraConfig = ''
reverse_proxy 127.0.0.1:3000
'';
};
};
zramSwap.enable = true;
system.stateVersion = "25.11";
}


@@ -0,0 +1,9 @@
{ modulesPath, ... }:
{
imports = [
./configuration.nix
../../system/packages.nix
../../services/forgejo
(modulesPath + "/profiles/qemu-guest.nix")
];
}


@@ -58,10 +58,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -58,10 +58,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -0,0 +1,72 @@
{
lib,
pkgs,
...
}:
{
services.openssh = {
enable = true;
settings = {
PermitRootLogin = lib.mkForce "no";
PasswordAuthentication = false;
};
};
users.users.nixos = {
isNormalUser = true;
extraGroups = [ "wheel" ];
shell = pkgs.zsh;
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAwfb2jpKrBnCw28aevnH8HbE5YbcMXpdaVv2KmueDu6 torjus@gunter"
];
};
security.sudo.wheelNeedsPassword = false;
programs.zsh.enable = true;
homelab.dns.enable = false;
homelab.monitoring.enable = false;
homelab.host.labels.ansible = "false";
# Minimal fileSystems for evaluation; openstack-config.nix overrides this at image build time
fileSystems."/" = {
device = lib.mkDefault "/dev/vda1";
fsType = lib.mkDefault "ext4";
};
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "nixos-openstack-template";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
systemd.network.enable = true;
systemd.network.networks."ens3" = {
matchConfig.Name = "ens3";
networkConfig.DHCP = "ipv4";
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
networking.firewall.enable = true;
networking.firewall.allowedTCPPorts = [ 22 ];
nix.settings.substituters = [
"https://cache.nixos.org"
];
nix.settings.trusted-public-keys = [
"cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
];
environment.systemPackages = with pkgs; [
age
vim
wget
git
];
zramSwap.enable = true;
system.stateVersion = "25.11";
}


@@ -2,6 +2,6 @@
{
imports = [
./configuration.nix
../../services/monitoring
../../system/packages.nix
];
}


@@ -0,0 +1,54 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
./hardware-configuration.nix
../../system
];
boot.loader.systemd-boot.enable = true;
boot.loader.systemd-boot.memtest86.enable = true;
boot.loader.efi.canTouchEfiVariables = true;
networking.hostName = "pn01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
networking.firewall.enable = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."enp2s0" = {
matchConfig.Name = "enp2s0";
address = [
"10.69.12.60/24"
];
routes = [
{ Gateway = "10.69.12.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
homelab.host = {
tier = "test";
priority = "low";
role = "compute";
};
vault.enable = true;
nixpkgs.config.allowUnfree = true;
system.stateVersion = "25.11";
}

hosts/pn01/default.nix

@@ -0,0 +1,5 @@
{ ... }: {
imports = [
./configuration.nix
];
}


@@ -0,0 +1,33 @@
# Do not modify this file! It was generated by nixos-generate-config
# and may be overwritten by future invocations. Please make changes
# to /etc/nixos/configuration.nix instead.
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[ (modulesPath + "/installer/scan/not-detected.nix")
];
boot.initrd.availableKernelModules = [ "xhci_pci" "nvme" "ahci" "usb_storage" "usbhid" "sd_mod" "rtsx_usb_sdmmc" ];
boot.initrd.kernelModules = [ ];
boot.kernelModules = [ "kvm-amd" ];
boot.extraModulePackages = [ ];
fileSystems."/" =
{ device = "/dev/disk/by-uuid/9444cf54-80e0-4315-adca-8ddd5037217c";
fsType = "ext4";
};
fileSystems."/boot" =
{ device = "/dev/disk/by-uuid/D897-146F";
fsType = "vfat";
options = [ "fmask=0022" "dmask=0022" ];
};
swapDevices =
[ { device = "/dev/disk/by-uuid/6c1e775f-342e-463a-a7f9-d7ce6593a482"; }
];
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
hardware.cpu.amd.updateMicrocode = lib.mkDefault config.hardware.enableRedistributableFirmware;
}


@@ -0,0 +1,61 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
./hardware-configuration.nix
../../system
];
boot.loader.systemd-boot.enable = true;
boot.loader.systemd-boot.memtest86.enable = true;
boot.loader.efi.canTouchEfiVariables = true;
boot.blacklistedKernelModules = [ "amdgpu" ];
boot.kernelParams = [ "panic=10" "nmi_watchdog=1" "processor.max_cstate=1" ];
boot.kernel.sysctl."kernel.softlockup_panic" = 1;
boot.kernel.sysctl."kernel.hardlockup_panic" = 1;
hardware.rasdaemon.enable = true;
hardware.rasdaemon.record = true;
networking.hostName = "pn02";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
networking.firewall.enable = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."enp2s0" = {
matchConfig.Name = "enp2s0";
address = [
"10.69.12.61/24"
];
routes = [
{ Gateway = "10.69.12.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
homelab.host = {
tier = "test";
priority = "low";
role = "compute";
};
vault.enable = true;
nixpkgs.config.allowUnfree = true;
system.stateVersion = "25.11";
}

hosts/pn02/default.nix Normal file

@@ -0,0 +1,5 @@
{ ... }: {
imports = [
./configuration.nix
];
}


@@ -0,0 +1,33 @@
# Do not modify this file! It was generated by nixos-generate-config
# and may be overwritten by future invocations. Please make changes
# to /etc/nixos/configuration.nix instead.
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[ (modulesPath + "/installer/scan/not-detected.nix")
];
boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "usb_storage" "usbhid" "sd_mod" "rtsx_usb_sdmmc" ];
boot.initrd.kernelModules = [ ];
boot.kernelModules = [ "kvm-amd" ];
boot.extraModulePackages = [ ];
fileSystems."/" =
{ device = "/dev/disk/by-uuid/1d28b629-51ae-4f0e-b440-9388c2e48413";
fsType = "ext4";
};
fileSystems."/boot" =
{ device = "/dev/disk/by-uuid/A5A7-C7B2";
fsType = "vfat";
options = [ "fmask=0022" "dmask=0022" ];
};
swapDevices =
[ { device = "/dev/disk/by-uuid/f2570894-0922-4746-84c7-2b2fe7601ea1"; }
];
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
hardware.cpu.amd.updateMicrocode = lib.mkDefault config.hardware.enableRedistributableFirmware;
}


@@ -6,7 +6,8 @@ let
text = ''
set -euo pipefail
-LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
+LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
# Send a log entry to Loki with bootstrap status
# Usage: log_to_loki <stage> <message>
@@ -36,8 +37,14 @@ let
}]
}')
+local auth_args=()
+if [[ -f "$LOKI_AUTH_FILE" ]]; then
+auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
+fi
curl -s --connect-timeout 2 --max-time 5 \
-X POST \
+"''${auth_args[@]}" \
-H "Content-Type: application/json" \
-d "$payload" \
"$LOKI_URL" >/dev/null 2>&1 || true


@@ -54,10 +54,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
nix.settings.substituters = [
"https://nix-cache.home.2rjus.net"


@@ -55,10 +55,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -55,10 +55,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -55,10 +55,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -45,10 +45,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -94,7 +94,15 @@ let
})
(externalTargets.nodeExporter or [ ]);
-allEntries = flakeEntries ++ externalEntries;
+# Node-exporter-only external targets (no systemd-exporter)
+externalOnlyEntries = map
+(target: {
+inherit target;
+labels = { hostname = extractHostnameFromTarget target; };
+})
+(externalTargets.nodeExporterOnly or [ ]);
+allEntries = flakeEntries ++ externalEntries ++ externalOnlyEntries;
# Group entries by their label set for efficient static_configs
# Convert labels attrset to a string key for grouping
@@ -203,7 +211,18 @@ let
in
flakeScrapeConfigs ++ externalScrapeConfigs;
+# Generate systemd-exporter targets (excludes nodeExporterOnly hosts)
+generateSystemdExporterTargets = self: externalTargets:
+let
+nodeTargets = generateNodeExporterTargets self (externalTargets // { nodeExporterOnly = [ ]; });
+in
+map
+(cfg: cfg // {
+targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+})
+nodeTargets;
in
{
-inherit extractHostMonitoring generateNodeExporterTargets generateScrapeConfigs;
+inherit extractHostMonitoring generateNodeExporterTargets generateScrapeConfigs generateSystemdExporterTargets;
}


@@ -56,10 +56,7 @@
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim


@@ -20,10 +20,10 @@ vault-fetch <secret-path> <output-directory> [cache-directory]
```bash
# Fetch Grafana admin secrets
-vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
+vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana
# Use default cache location
-vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana
+vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana
```
## How It Works
@@ -53,13 +53,13 @@ If Vault is unreachable or authentication fails:
This tool is designed to be called from systemd service `ExecStartPre` hooks via the `vault.secrets` NixOS module:
```nix
-vault.secrets.grafana-admin = {
-secretPath = "hosts/monitoring01/grafana-admin";
+vault.secrets.mqtt-password = {
+secretPath = "hosts/ha1/mqtt-password";
};
# Service automatically gets secrets fetched before start
-systemd.services.grafana.serviceConfig = {
-EnvironmentFile = "/run/secrets/grafana-admin/password";
+systemd.services.mosquitto.serviceConfig = {
+EnvironmentFile = "/run/secrets/mqtt-password/password";
};
```


@@ -5,7 +5,7 @@ set -euo pipefail
#
# Usage: vault-fetch <secret-path> <output-directory> [cache-directory]
#
-# Example: vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
+# Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana
#
# This script:
# 1. Authenticates to Vault using AppRole credentials from /var/lib/vault/approle/
@@ -17,7 +17,7 @@ set -euo pipefail
# Parse arguments
if [ $# -lt 2 ]; then
echo "Usage: vault-fetch <secret-path> <output-directory> [cache-directory]" >&2
-echo "Example: vault-fetch hosts/monitoring01/grafana /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
+echo "Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
exit 1
fi


@@ -0,0 +1,19 @@
{ ... }:
{
services.forgejo = {
enable = true;
database.type = "sqlite3";
settings = {
server = {
DOMAIN = "nrec-nixos01.t-juice.club";
ROOT_URL = "https://nrec-nixos01.t-juice.club/";
HTTP_ADDR = "127.0.0.1";
HTTP_PORT = 3000;
};
server.LFS_START_SERVER = true;
service.DISABLE_REGISTRATION = true;
"service.explore".REQUIRE_SIGNIN_VIEW = true;
session.COOKIE_SECURE = true;
};
};
}


@@ -0,0 +1,492 @@
{
"uid": "apiary-homelab",
"title": "Apiary - Honeypot",
"tags": ["apiary", "honeypot", "prometheus", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "1m",
"time": {
"from": "now-24h",
"to": "now"
},
"templating": {
"list": []
},
"panels": [
{
"id": 1,
"title": "SSH Connections",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})",
"legendFormat": "Total",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Total SSH connections across all outcomes"
},
{
"id": 2,
"title": "Active Sessions",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "oubliette_sessions_active{job=\"apiary\"}",
"legendFormat": "Active",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 5},
{"color": "red", "value": 20}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Currently active honeypot sessions"
},
{
"id": 3,
"title": "Unique IPs",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "oubliette_storage_unique_ips{job=\"apiary\"}",
"legendFormat": "IPs",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "purple", "value": null}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Total unique source IPs observed"
},
{
"id": 4,
"title": "Total Login Attempts",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "oubliette_storage_login_attempts_total{job=\"apiary\"}",
"legendFormat": "Attempts",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "orange", "value": null}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Total login attempts stored"
},
{
"id": 5,
"title": "SSH Connections Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_ssh_connections_total{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{outcome}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "cps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 20,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "none"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "SSH connection rate by outcome"
},
{
"id": 6,
"title": "Auth Attempts Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_auth_attempts_total{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{reason}} - {{result}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "cps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 20,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "none"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Authentication attempt rate by reason and result"
},
{
"id": 7,
"title": "Sessions by Shell",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_sessions_total{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{shell}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "cps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 20,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "normal"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Session creation rate by shell type"
},
{
"id": 8,
"title": "Attempts by Country",
"type": "geomap",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 12},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "oubliette_auth_attempts_by_country_total{job=\"apiary\"}",
"legendFormat": "{{country}}",
"refId": "A",
"instant": true,
"format": "table"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 10},
{"color": "orange", "value": 50},
{"color": "red", "value": 200}
]
}
}
},
"options": {
"view": {
"id": "zero",
"lat": 30,
"lon": 10,
"zoom": 2
},
"basemap": {
"type": "default"
},
"layers": [
{
"type": "markers",
"name": "Auth Attempts",
"config": {
"showLegend": true,
"style": {
"size": {
"field": "Value",
"min": 3,
"max": 20
},
"color": {
"field": "Value"
},
"symbol": {
"mode": "fixed",
"fixed": "img/icons/marker/circle.svg"
}
}
},
"location": {
"mode": "lookup",
"lookup": "country",
"gazetteer": "public/gazetteer/countries.json"
}
}
]
},
"description": "Authentication attempts by country (geo lookup from country code)"
},
{
"id": 9,
"title": "Session Duration Distribution",
"type": "heatmap",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_session_duration_seconds_bucket{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{le}}",
"refId": "A",
"format": "heatmap"
}
],
"fieldConfig": {
"defaults": {
"custom": {
"scaleDistribution": {
"type": "log",
"log": 2
}
}
}
},
"options": {
"calculate": false,
"yAxis": {
"unit": "s"
},
"color": {
"scheme": "Oranges",
"mode": "scheme"
},
"cellGap": 1,
"tooltip": {
"show": true
}
},
"description": "Distribution of session durations"
},
{
"id": 10,
"title": "Commands Executed by Shell",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_commands_executed_total{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{shell}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "cps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 20,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "normal"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Rate of commands executed in honeypot shells"
},
{
"id": 11,
"title": "Storage Query Duration by Method",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 38},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_storage_query_duration_seconds_sum{job=\"apiary\"}[$__rate_interval]) / rate(oubliette_storage_query_duration_seconds_count{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{method}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 10,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "none"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Average query duration per storage method over time"
},
{
"id": 12,
"title": "Storage Query Rate by Method",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 38},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s",
"targets": [
{
"expr": "rate(oubliette_storage_query_duration_seconds_count{job=\"apiary\"}[$__rate_interval])",
"legendFormat": "{{method}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 10,
"pointSize": 5,
"showPoints": "auto",
"stacking": {"mode": "none"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Query execution rate per storage method"
},
{
"id": 13,
"title": "Storage Query Errors",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 46},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(oubliette_storage_query_errors_total{job=\"apiary\"})",
"legendFormat": "Errors",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 10}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Total storage query errors"
}
]
}


@@ -16,7 +16,7 @@
"title": "Endpoints Monitored",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})",
@@ -48,7 +48,7 @@
"title": "Probe Failures",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)",
@@ -82,7 +82,7 @@
"title": "Expiring Soon (< 7d)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)",
@@ -116,7 +116,7 @@
"title": "Expiring Critical (< 24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)",
@@ -150,7 +150,7 @@
"title": "Minimum Days Remaining",
"type": "gauge",
"gridPos": {"h": 4, "w": 8, "x": 16, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)",
@@ -187,7 +187,7 @@
"title": "Certificate Expiry by Endpoint",
"type": "table",
"gridPos": {"h": 12, "w": 12, "x": 0, "y": 4},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -253,7 +253,7 @@
"title": "Probe Status",
"type": "table",
"gridPos": {"h": 12, "w": 12, "x": 12, "y": 4},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "probe_success{job=\"blackbox_tls\"}",
@@ -340,7 +340,7 @@
"title": "Certificate Expiry Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -378,7 +378,7 @@
"title": "Probe Success Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100",
@@ -418,7 +418,7 @@
"title": "Probe Duration",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "probe_duration_seconds{job=\"blackbox_tls\"}",


@@ -15,7 +15,7 @@
{
"name": "tier",
"type": "query",
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"query": "label_values(nixos_flake_info, tier)",
"refresh": 2,
"includeAll": true,
@@ -30,7 +30,7 @@
"title": "Hosts Behind Remote",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 1)",
@@ -65,7 +65,7 @@
"title": "Hosts Needing Reboot",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(nixos_config_mismatch{tier=~\"$tier\"} == 1)",
@@ -100,7 +100,7 @@
"title": "Total Hosts",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 8, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(nixos_flake_info{tier=~\"$tier\"})",
@@ -128,7 +128,7 @@
"title": "Nixpkgs Age",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 11, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "max(nixos_flake_input_age_seconds{input=\"nixpkgs\", tier=~\"$tier\"})",
@@ -163,7 +163,7 @@
"title": "Hosts Up-to-date",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 14, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 0)",
@@ -192,7 +192,7 @@
"title": "Deployments (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 17, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_deployments_total{status=\"completed\"}[24h]))",
@@ -222,7 +222,7 @@
"title": "Avg Deploy Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_deployment_duration_seconds_sum{success=\"true\"}[24h])) / sum(increase(homelab_deploy_deployment_duration_seconds_count{success=\"true\"}[24h]))",
@@ -256,7 +256,7 @@
"title": "Fleet Status",
"type": "table",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "nixos_flake_info{tier=~\"$tier\"}",
@@ -430,7 +430,7 @@
"title": "Generation Age by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sort_desc(nixos_generation_age_seconds{tier=~\"$tier\"})",
@@ -467,7 +467,7 @@
"title": "Generations per Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sort_desc(nixos_generation_count{tier=~\"$tier\"})",
@@ -501,7 +501,7 @@
"title": "Deployment Activity (Generation Age Over Time)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 22},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
@@ -534,7 +534,7 @@
"title": "Flake Input Ages",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "max by (input) (nixos_flake_input_age_seconds)",
@@ -577,7 +577,7 @@
"title": "Hosts by Revision",
"type": "piechart",
"gridPos": {"h": 6, "w": 6, "x": 12, "y": 30},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count by (current_rev) (nixos_flake_info{tier=~\"$tier\"})",
@@ -601,7 +601,7 @@
"title": "Hosts by Tier",
"type": "piechart",
"gridPos": {"h": 6, "w": 6, "x": 18, "y": 30},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count by (tier) (nixos_flake_info)",
@@ -641,7 +641,7 @@
"title": "Builds (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 37},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[24h]))",
@@ -671,7 +671,7 @@
"title": "Failed Builds (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 37},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[24h])) or vector(0)",
@@ -705,7 +705,7 @@
"title": "Last Build",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 37},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "time() - max(homelab_deploy_build_last_timestamp)",
@@ -739,7 +739,7 @@
"title": "Avg Build Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 37},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_duration_seconds_sum[24h])) / sum(increase(homelab_deploy_build_duration_seconds_count[24h]))",
@@ -773,7 +773,7 @@
"title": "Total Hosts Built",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 37},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(homelab_deploy_build_duration_seconds_count)",
@@ -802,7 +802,7 @@
"title": "Build Jobs (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 37},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_builds_total[24h]))",
@@ -832,7 +832,7 @@
"title": "Build Time by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 41},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sort_desc(homelab_deploy_build_duration_seconds_sum / homelab_deploy_build_duration_seconds_count)",
@@ -869,7 +869,7 @@
"title": "Build Count by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 41},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sort_desc(sum by (host) (homelab_deploy_build_host_total))",
@@ -903,7 +903,7 @@
"title": "Build Activity",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 49},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[1h]))",


@@ -11,7 +11,7 @@
{
"name": "instance",
"type": "query",
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"query": "label_values(node_uname_info, instance)",
"refresh": 2,
"includeAll": false,
@@ -26,7 +26,7 @@
"title": "CPU Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
@@ -55,7 +55,7 @@
"title": "Memory Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
@@ -84,7 +84,7 @@
"title": "Disk Usage",
"type": "gauge",
"gridPos": {"h": 8, "w": 8, "x": 0, "y": 8},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
@@ -113,7 +113,7 @@
"title": "System Load",
"type": "timeseries",
"gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "node_load1{instance=~\"$instance\"}",
@@ -142,7 +142,7 @@
"title": "Uptime",
"type": "stat",
"gridPos": {"h": 8, "w": 8, "x": 16, "y": 8},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "time() - node_boot_time_seconds{instance=~\"$instance\"}",
@@ -161,7 +161,7 @@
"title": "Network Traffic",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
@@ -185,7 +185,7 @@
"title": "Disk I/O",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",


@@ -15,7 +15,7 @@
{
"name": "vm",
"type": "query",
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"query": "label_values(pve_guest_info{template=\"0\"}, name)",
"refresh": 2,
"includeAll": true,
@@ -30,7 +30,7 @@
"title": "VMs Running",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 1)",
@@ -56,7 +56,7 @@
"title": "VMs Stopped",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 0)",
@@ -87,7 +87,7 @@
"title": "Node CPU",
"type": "gauge",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "pve_cpu_usage_ratio{id=~\"node/.*\"} * 100",
@@ -120,7 +120,7 @@
"title": "Node Memory",
"type": "gauge",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
@@ -153,7 +153,7 @@
"title": "Node Uptime",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "pve_uptime_seconds{id=~\"node/.*\"}",
@@ -180,7 +180,7 @@
"title": "Templates",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(pve_guest_info{template=\"1\"})",
@@ -206,7 +206,7 @@
"title": "VM Status",
"type": "table",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -362,7 +362,7 @@
"title": "VM CPU Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
@@ -391,7 +391,7 @@
"title": "VM Memory Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -420,7 +420,7 @@
"title": "VM Network Traffic",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -453,7 +453,7 @@
"title": "VM Disk I/O",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -486,7 +486,7 @@
"title": "Storage Usage",
"type": "bargauge",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
@@ -531,7 +531,7 @@
"title": "Storage Capacity",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",


@@ -15,7 +15,7 @@
{
"name": "hostname",
"type": "query",
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"query": "label_values(systemd_unit_state, hostname)",
"refresh": 2,
"includeAll": true,
@@ -30,7 +30,7 @@
"title": "Failed Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
@@ -60,7 +60,7 @@
"title": "Active Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
@@ -86,7 +86,7 @@
"title": "Hosts Monitored",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
@@ -112,7 +112,7 @@
"title": "Total Service Restarts",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
@@ -143,7 +143,7 @@
"title": "Inactive Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
@@ -169,7 +169,7 @@
"title": "Timers",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
@@ -195,7 +195,7 @@
"title": "Failed Units",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
@@ -251,7 +251,7 @@
"title": "Service Restarts (Top 15)",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
@@ -309,7 +309,7 @@
"title": "Active Units per Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
@@ -339,7 +339,7 @@
"title": "NixOS Upgrade Timers",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
@@ -429,7 +429,7 @@
"title": "Backup Timers",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
@@ -524,7 +524,7 @@
"title": "Service Restarts Over Time",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",


@@ -19,7 +19,7 @@
"title": "Current Temperatures",
"type": "stat",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -71,7 +71,7 @@
"title": "Average Home Temperature",
"type": "gauge",
"gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
@@ -108,7 +108,7 @@
"title": "Current Humidity",
"type": "stat",
"gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
@@ -154,7 +154,7 @@
"title": "Temperature History (30 Days)",
"type": "timeseries",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -207,7 +207,7 @@
"title": "Temperature Trend (1h rate of change)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
@@ -268,7 +268,7 @@
"title": "24h Min / Max / Avg",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
@@ -346,7 +346,7 @@
"title": "Humidity History (30 Days)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [
{
"expr": "hass_sensor_humidity_percent",


@@ -34,21 +34,25 @@
};
};
# Declarative datasources pointing to monitoring01
# Declarative datasources
provision.datasources.settings = {
apiVersion = 1;
prune = true;
deleteDatasources = [
{ name = "Prometheus (monitoring01)"; orgId = 1; }
];
datasources = [
{
name = "Prometheus";
name = "VictoriaMetrics";
type = "prometheus";
url = "http://monitoring01.home.2rjus.net:9090";
url = "http://localhost:8428";
isDefault = true;
uid = "prometheus";
uid = "victoriametrics";
}
{
name = "Loki";
type = "loki";
url = "http://monitoring01.home.2rjus.net:3100";
url = "http://localhost:3100";
uid = "loki";
}
];
@@ -81,22 +85,28 @@
services.caddy = {
enable = true;
package = pkgs.unstable.caddy;
configFile = pkgs.writeText "Caddyfile" ''
{
globalConfig = ''
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
metrics
}
grafana-test.home.2rjus.net {
'';
virtualHosts."grafana.home.2rjus.net".extraConfig = ''
log {
output file /var/log/caddy/grafana.log {
mode 644
}
}
reverse_proxy http://127.0.0.1:3000
'';
virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
log {
output file /var/log/caddy/grafana.log {
mode 644
}
}
reverse_proxy http://127.0.0.1:3000
'';
# Metrics endpoint on plain HTTP for Prometheus scraping
extraConfig = ''
http://${config.networking.hostName}.home.2rjus.net/metrics {
metrics
}


@@ -54,30 +54,7 @@
}
reverse_proxy http://ha1.home.2rjus.net:8080
}
prometheus.home.2rjus.net {
log {
output file /var/log/caddy/prometheus.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:9090
}
alertmanager.home.2rjus.net {
log {
output file /var/log/caddy/alertmanager.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:9093
}
grafana.home.2rjus.net {
log {
output file /var/log/caddy/grafana.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:3000
}
jelly.home.2rjus.net {
log {
output file /var/log/caddy/jelly.log {
@@ -86,22 +63,6 @@
}
reverse_proxy http://jelly01.home.2rjus.net:8096
}
pyroscope.home.2rjus.net {
log {
output file /var/log/caddy/pyroscope.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:4040
}
pushgw.home.2rjus.net {
log {
output file /var/log/caddy/pushgw.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:9091
}
http://http-proxy.home.2rjus.net/metrics {
log {
output file /var/log/caddy/caddy-metrics.log {

services/loki/default.nix (normal file, 104 lines changed)

@@ -0,0 +1,104 @@
{ config, lib, pkgs, ... }:
let
# Script to generate bcrypt hash from Vault password for Caddy basic_auth
generateCaddyAuth = pkgs.writeShellApplication {
name = "generate-caddy-loki-auth";
runtimeInputs = [ config.services.caddy.package ];
text = ''
PASSWORD=$(cat /run/secrets/loki-push-auth)
HASH=$(caddy hash-password --plaintext "$PASSWORD")
echo "LOKI_PUSH_HASH=$HASH" > /run/secrets/caddy-loki-auth.env
chmod 0400 /run/secrets/caddy-loki-auth.env
'';
};
in
{
# Fetch Loki push password from Vault
vault.secrets.loki-push-auth = {
secretPath = "shared/loki/push-auth";
extractKey = "password";
services = [ "caddy" ];
};
# Generate bcrypt hash for Caddy before it starts
systemd.services.caddy-loki-auth = {
description = "Generate Caddy basic auth hash for Loki";
after = [ "vault-secret-loki-push-auth.service" ];
requires = [ "vault-secret-loki-push-auth.service" ];
before = [ "caddy.service" ];
requiredBy = [ "caddy.service" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = lib.getExe generateCaddyAuth;
};
};
# Load the bcrypt hash as environment variable for Caddy
services.caddy.environmentFile = "/run/secrets/caddy-loki-auth.env";
# Caddy reverse proxy for Loki with basic auth
services.caddy.virtualHosts."loki.home.2rjus.net".extraConfig = ''
basic_auth {
promtail {env.LOKI_PUSH_HASH}
}
reverse_proxy http://127.0.0.1:3100
'';
services.loki = {
enable = true;
configuration = {
auth_enabled = false;
server = {
http_listen_address = "127.0.0.1";
http_listen_port = 3100;
};
common = {
ring = {
instance_addr = "127.0.0.1";
kvstore = {
store = "inmemory";
};
};
replication_factor = 1;
path_prefix = "/var/lib/loki";
};
schema_config = {
configs = [
{
from = "2024-01-01";
store = "tsdb";
object_store = "filesystem";
schema = "v13";
index = {
prefix = "loki_index_";
period = "24h";
};
}
];
};
storage_config = {
filesystem = {
directory = "/var/lib/loki/chunks";
};
};
compactor = {
working_directory = "/var/lib/loki/compactor";
compaction_interval = "10m";
retention_enabled = true;
retention_delete_delay = "2h";
retention_delete_worker_count = 150;
delete_request_store = "filesystem";
};
limits_config = {
retention_period = "30d";
ingestion_rate_mb = 10;
ingestion_burst_size_mb = 20;
max_streams_per_user = 10000;
max_query_series = 500;
max_query_parallelism = 8;
};
};
};
}
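
Review note: the Caddy block above puts basic auth in front of Loki's push API, so log shippers now need credentials. A minimal sketch of what a Promtail client might look like, assuming the client host also fetches `shared/loki/push-auth` to `/run/secrets/loki-push-auth` (that path and the fragment itself are assumptions, not part of this change):

```nix
{
  # Hypothetical client-side fragment; not taken from this diff.
  services.promtail.configuration.clients = [
    {
      # Loki's standard push endpoint, reached through the Caddy virtual host.
      url = "https://loki.home.2rjus.net/loki/api/v1/push";
      basic_auth = {
        # Matches the user name in the Caddy basic_auth block.
        username = "promtail";
        # Assumed location of the plaintext password on the client host.
        password_file = "/run/secrets/loki-push-auth";
      };
    }
  ];
}
```

Only the bcrypt hash is handed to Caddy; the plaintext stays in the Vault-provisioned file on each side.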


@@ -1,33 +1,4 @@
{ pkgs, ... }:
let
# TLS endpoints to monitor for certificate expiration
# These are all services using ACME certificates from OpenBao PKI
tlsTargets = [
# Direct ACME certs (security.acme.certs)
"https://vault.home.2rjus.net:8200"
"https://auth.home.2rjus.net"
"https://testvm01.home.2rjus.net"
# Caddy auto-TLS on http-proxy
"https://nzbget.home.2rjus.net"
"https://radarr.home.2rjus.net"
"https://sonarr.home.2rjus.net"
"https://ha.home.2rjus.net"
"https://z2m.home.2rjus.net"
"https://prometheus.home.2rjus.net"
"https://alertmanager.home.2rjus.net"
"https://grafana.home.2rjus.net"
"https://jelly.home.2rjus.net"
"https://pyroscope.home.2rjus.net"
"https://pushgw.home.2rjus.net"
# Caddy auto-TLS on nix-cache02
"https://nix-cache.home.2rjus.net"
# Caddy auto-TLS on grafana01
"https://grafana-test.home.2rjus.net"
];
in
{
services.prometheus.exporters.blackbox = {
enable = true;
@@ -57,36 +28,4 @@ in
- 503
'';
};
# Add blackbox scrape config to Prometheus
# Alert rules are in rules.yml (certificate_rules group)
services.prometheus.scrapeConfigs = [
{
job_name = "blackbox_tls";
metrics_path = "/probe";
params = {
module = [ "https_cert" ];
};
static_configs = [{
targets = tlsTargets;
}];
relabel_configs = [
# Pass the target URL to blackbox as a parameter
{
source_labels = [ "__address__" ];
target_label = "__param_target";
}
# Use the target URL as the instance label
{
source_labels = [ "__param_target" ];
target_label = "instance";
}
# Point the actual scrape at the local blackbox exporter
{
target_label = "__address__";
replacement = "127.0.0.1:9115";
}
];
}
];
}
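
Review note: the three relabel steps removed here (and carried over into the VictoriaMetrics scrape config) turn each TLS target into a query parameter for the local blackbox exporter. A rough illustration of the request each scrape ends up making, written as a standalone Nix expression; the helper name `probeUrl` is made up for this sketch and the two targets are just a subset of `tlsTargets`:

```nix
# Illustrative only: approximates the URL requested per relabelled target.
# The real scraper URL-encodes the target parameter; this sketch does not.
let
  tlsTargets = [
    "https://vault.home.2rjus.net:8200"
    "https://grafana.home.2rjus.net"
  ];
  # Value written to __address__ by the last relabel step.
  blackbox = "127.0.0.1:9115";
  # metrics_path = "/probe", params.module = "https_cert",
  # and __param_target copied from the original __address__.
  probeUrl = target: "http://${blackbox}/probe?module=https_cert&target=${target}";
in
map probeUrl tlsTargets
```

The `instance` label keeps the original target URL, so alerts still name the monitored endpoint rather than the exporter.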


@@ -1,14 +0,0 @@
{ ... }:
{
imports = [
./loki.nix
./grafana.nix
./prometheus.nix
./blackbox.nix
./exportarr.nix
./pve.nix
./alerttonotify.nix
./pyroscope.nix
./tempo.nix
];
}


@@ -14,14 +14,4 @@
apiKeyFile = config.vault.secrets.sonarr-api-key.outputDir;
port = 9709;
};
# Scrape config
services.prometheus.scrapeConfigs = [
{
job_name = "sonarr";
static_configs = [{
targets = [ "localhost:9709" ];
}];
}
];
}


@@ -4,6 +4,10 @@
nodeExporter = [
"gunter.home.2rjus.net:9100"
];
# Hosts with node-exporter but no systemd-exporter
nodeExporterOnly = [
"pve1.home.2rjus.net:9100"
];
scrapeConfigs = [
{ job_name = "smartctl"; targets = [ "gunter.home.2rjus.net:9633" ]; }
{ job_name = "ghettoptt"; targets = [ "gunter.home.2rjus.net:8989" ]; }
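
Review note: `lib/monitoring.nix` is not part of this view, so how `generateSystemdExporterTargets` consumes `nodeExporterOnly` is not visible here. One plausible shape, sketched as a standalone expression under the assumption that the node-exporter job scrapes both lists while the systemd-exporter job only rewrites the port for `nodeExporter` hosts:

```nix
# Hypothetical sketch; the real helper in lib/monitoring.nix may differ.
let
  external = {
    nodeExporter = [ "gunter.home.2rjus.net:9100" ];
    nodeExporterOnly = [ "pve1.home.2rjus.net:9100" ];
  };
  # Same port rewrite the old Prometheus config applied inline.
  toSystemdPort = builtins.replaceStrings [ ":9100" ] [ ":9558" ];
in
{
  # Assumption: hosts without systemd-exporter are still scraped by node-exporter.
  nodeExporter = external.nodeExporter ++ external.nodeExporterOnly;
  # Only hosts that actually run systemd-exporter get a :9558 target.
  systemdExporter = map toSystemdPort external.nodeExporter;
}
```

Under that assumption `pve1` never appears as a `:9558` target, which avoids a permanently failing scrape.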


@@ -1,11 +0,0 @@
{ pkgs, ... }:
{
services.grafana = {
enable = true;
settings = {
server = {
http_addr = "";
};
};
};
}


@@ -1,42 +0,0 @@
{ ... }:
{
services.loki = {
enable = true;
configuration = {
auth_enabled = false;
server = {
http_listen_port = 3100;
};
common = {
ring = {
instance_addr = "127.0.0.1";
kvstore = {
store = "inmemory";
};
};
replication_factor = 1;
path_prefix = "/var/lib/loki";
};
schema_config = {
configs = [
{
from = "2024-01-01";
store = "tsdb";
object_store = "filesystem";
schema = "v13";
index = {
prefix = "loki_index_";
period = "24h";
};
}
];
};
storage_config = {
filesystem = {
directory = "/var/lib/loki/chunks";
};
};
};
};
}


@@ -1,245 +0,0 @@
{ self, lib, pkgs, ... }:
let
monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ./external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# Script to fetch AppRole token for Prometheus to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token";
runtimeInputs = [ pkgs.curl pkgs.jq ];
text = ''
VAULT_ADDR="https://vault01.home.2rjus.net:8200"
APPROLE_DIR="/var/lib/vault/approle"
OUTPUT_FILE="/run/secrets/prometheus/openbao-token"
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
echo "AppRole credentials not found at $APPROLE_DIR" >&2
exit 1
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
AUTH_RESPONSE=$(curl -sf -k -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login")
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "Failed to extract Vault token from response" >&2
exit 1
fi
# Write token to file
mkdir -p "$(dirname "$OUTPUT_FILE")"
echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
chown prometheus:prometheus "$OUTPUT_FILE"
chmod 0400 "$OUTPUT_FILE"
echo "Successfully fetched OpenBao token"
'';
};
in
{
# Systemd service to fetch AppRole token for Prometheus OpenBao scraping
# The token is used to authenticate when scraping /v1/sys/metrics
systemd.services.prometheus-openbao-token = {
description = "Fetch OpenBao token for Prometheus metrics scraping";
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
before = [ "prometheus.service" ];
requiredBy = [ "prometheus.service" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe fetchOpenbaoToken;
};
};
# Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
systemd.timers.prometheus-openbao-token = {
description = "Refresh OpenBao token for Prometheus";
wantedBy = [ "timers.target" ];
timerConfig = {
OnBootSec = "5min";
OnUnitActiveSec = "30min";
RandomizedDelaySec = "5min";
};
};
services.prometheus = {
enable = true;
# syntax-only check because we use external credential files (e.g., openbao-token)
checkConfig = "syntax-only";
alertmanager = {
enable = true;
configuration = {
global = {
};
route = {
receiver = "webhook_natstonotify";
group_wait = "30s";
group_interval = "5m";
repeat_interval = "1h";
group_by = [ "alertname" ];
};
receivers = [
{
name = "webhook_natstonotify";
webhook_configs = [
{
url = "http://localhost:5001/alert";
}
];
}
];
};
};
alertmanagers = [
{
static_configs = [
{
targets = [ "localhost:9093" ];
}
];
}
];
retentionTime = "30d";
globalConfig = {
scrape_interval = "15s";
};
rules = [
(builtins.readFile ./rules.yml)
];
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
# Each static_config entry may have labels from homelab.host metadata
{
job_name = "node-exporter";
static_configs = nodeExporterTargets;
}
# Systemd exporter on all hosts (same targets, different port)
# Preserves the same label grouping as node-exporter
{
job_name = "systemd-exporter";
static_configs = map
(cfg: cfg // {
targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
})
nodeExporterTargets;
}
# Local monitoring services (not auto-generated)
{
job_name = "prometheus";
static_configs = [
{
targets = [ "localhost:9090" ];
}
];
}
{
job_name = "loki";
static_configs = [
{
targets = [ "localhost:3100" ];
}
];
}
{
job_name = "grafana";
static_configs = [
{
targets = [ "localhost:3000" ];
}
];
}
{
job_name = "alertmanager";
static_configs = [
{
targets = [ "localhost:9093" ];
}
];
}
{
job_name = "pushgateway";
honor_labels = true;
static_configs = [
{
targets = [ "localhost:9091" ];
}
];
}
# Caddy metrics from nix-cache02 (serves nix-cache.home.2rjus.net)
{
job_name = "nix-cache_caddy";
scheme = "https";
static_configs = [
{
targets = [ "nix-cache.home.2rjus.net" ];
}
];
}
# pve-exporter with complex relabel config
{
job_name = "pve-exporter";
static_configs = [
{
targets = [ "10.69.12.75" ];
}
];
metrics_path = "/pve";
params = {
module = [ "default" ];
cluster = [ "1" ];
node = [ "1" ];
};
relabel_configs = [
{
source_labels = [ "__address__" ];
target_label = "__param_target";
}
{
source_labels = [ "__param_target" ];
target_label = "instance";
}
{
target_label = "__address__";
replacement = "127.0.0.1:9221";
}
];
}
# OpenBao metrics with bearer token auth
{
job_name = "openbao";
scheme = "https";
metrics_path = "/v1/sys/metrics";
params = {
format = [ "prometheus" ];
};
static_configs = [{
targets = [ "vault01.home.2rjus.net:8200" ];
}];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/prometheus/openbao-token";
};
}
] ++ autoScrapeConfigs;
pushgateway = {
enable = true;
web = {
external-url = "https://pushgw.home.2rjus.net";
};
};
};
}


@@ -1,7 +1,7 @@
{ config, ... }:
{
vault.secrets.pve-exporter = {
secretPath = "hosts/monitoring01/pve-exporter";
secretPath = "hosts/monitoring02/pve-exporter";
extractKey = "config";
outputDir = "/run/secrets/pve_exporter";
mode = "0444";


@@ -1,8 +0,0 @@
{ ... }:
{
virtualisation.oci-containers.containers.pyroscope = {
pull = "missing";
image = "grafana/pyroscope:latest";
ports = [ "4040:4040" ];
};
}


@@ -67,13 +67,13 @@ groups:
summary: "Promtail service not running on {{ $labels.instance }}"
description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
- alert: filesystem_filling_up
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[24h], 24*3600) < 0
for: 1h
labels:
severity: warning
annotations:
summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
description: "Based on the last 24h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
- alert: systemd_not_running
expr: node_systemd_system_running == 0
for: 10m
@@ -259,32 +259,32 @@ groups:
description: "Wireguard handshake timeout on {{ $labels.instance }} for peer {{ $labels.public_key }}."
- name: monitoring_rules
rules:
- alert: prometheus_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="prometheus.service", state="active"} == 0
- alert: victoriametrics_not_running
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="victoriametrics.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus service not running on {{ $labels.instance }}"
description: "Prometheus service not running on {{ $labels.instance }}"
summary: "VictoriaMetrics service not running on {{ $labels.instance }}"
description: "VictoriaMetrics service not running on {{ $labels.instance }}"
- alert: vmalert_not_running
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="vmalert.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "vmalert service not running on {{ $labels.instance }}"
description: "vmalert service not running on {{ $labels.instance }}"
- alert: alertmanager_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Alertmanager service not running on {{ $labels.instance }}"
description: "Alertmanager service not running on {{ $labels.instance }}"
- alert: pushgateway_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pushgateway service not running on {{ $labels.instance }}"
description: "Pushgateway service not running on {{ $labels.instance }}"
- alert: loki_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="loki.service", state="active"} == 0
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="loki.service", state="active"} == 0
for: 5m
labels:
severity: critical
@@ -292,29 +292,13 @@ groups:
summary: "Loki service not running on {{ $labels.instance }}"
description: "Loki service not running on {{ $labels.instance }}"
- alert: grafana_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Grafana service not running on {{ $labels.instance }}"
description: "Grafana service not running on {{ $labels.instance }}"
- alert: tempo_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="tempo.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Tempo service not running on {{ $labels.instance }}"
description: "Tempo service not running on {{ $labels.instance }}"
- alert: pyroscope_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="podman-pyroscope.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pyroscope service not running on {{ $labels.instance }}"
description: "Pyroscope service not running on {{ $labels.instance }}"
- name: proxmox_rules
rules:
- alert: pve_node_down


@@ -1,37 +0,0 @@
{ ... }:
{
services.tempo = {
enable = true;
settings = {
server = {
http_listen_port = 3200;
grpc_listen_port = 3201;
};
distributor = {
receivers = {
otlp = {
protocols = {
http = {
endpoint = ":4318";
cors = {
allowed_origins = [ "*.home.2rjus.net" ];
};
};
};
};
};
};
storage = {
trace = {
backend = "local";
local = {
path = "/var/lib/tempo";
};
wal = {
path = "/var/lib/tempo/wal";
};
};
};
};
};
}


@@ -0,0 +1,267 @@
{ self, config, lib, pkgs, ... }:
let
monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ../monitoring/external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
systemdExporterTargets = monLib.generateSystemdExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# TLS endpoints to monitor for certificate expiration via blackbox exporter
tlsTargets = [
"https://vault.home.2rjus.net:8200"
"https://auth.home.2rjus.net"
"https://testvm01.home.2rjus.net"
"https://nzbget.home.2rjus.net"
"https://radarr.home.2rjus.net"
"https://sonarr.home.2rjus.net"
"https://ha.home.2rjus.net"
"https://z2m.home.2rjus.net"
"https://metrics.home.2rjus.net"
"https://alertmanager.home.2rjus.net"
"https://grafana.home.2rjus.net"
"https://jelly.home.2rjus.net"
"https://nix-cache.home.2rjus.net"
"https://grafana-test.home.2rjus.net"
];
# Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token-vm";
runtimeInputs = [ pkgs.curl pkgs.jq ];
text = ''
VAULT_ADDR="https://vault01.home.2rjus.net:8200"
APPROLE_DIR="/var/lib/vault/approle"
OUTPUT_FILE="/run/secrets/victoriametrics/openbao-token"
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
echo "AppRole credentials not found at $APPROLE_DIR" >&2
exit 1
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
AUTH_RESPONSE=$(curl -sf -k -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login")
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "Failed to extract Vault token from response" >&2
exit 1
fi
# Write token to file
mkdir -p "$(dirname "$OUTPUT_FILE")"
echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
chown victoriametrics:victoriametrics "$OUTPUT_FILE"
chmod 0400 "$OUTPUT_FILE"
echo "Successfully fetched OpenBao token"
'';
};
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
{
job_name = "node-exporter";
static_configs = nodeExporterTargets;
}
# Systemd exporter on hosts that have it (excludes nodeExporterOnly hosts)
{
job_name = "systemd-exporter";
static_configs = systemdExporterTargets;
}
# Local monitoring services
{
job_name = "victoriametrics";
static_configs = [{ targets = [ "localhost:8428" ]; }];
}
{
job_name = "loki";
static_configs = [{ targets = [ "localhost:3100" ]; }];
}
{
job_name = "grafana";
static_configs = [{ targets = [ "localhost:3000" ]; }];
}
{
job_name = "alertmanager";
static_configs = [{ targets = [ "localhost:9093" ]; }];
}
# Caddy metrics from nix-cache02
{
job_name = "nix-cache_caddy";
scheme = "https";
static_configs = [{ targets = [ "nix-cache.home.2rjus.net" ]; }];
}
# OpenBao metrics with bearer token auth
{
job_name = "openbao";
scheme = "https";
metrics_path = "/v1/sys/metrics";
params = { format = [ "prometheus" ]; };
static_configs = [{ targets = [ "vault01.home.2rjus.net:8200" ]; }];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/victoriametrics/openbao-token";
};
}
# Apiary external service
{
job_name = "apiary";
scheme = "https";
scrape_interval = "60s";
static_configs = [{ targets = [ "apiary.t-juice.club" ]; }];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/victoriametrics-apiary-token";
};
}
# Blackbox TLS certificate monitoring
{
job_name = "blackbox_tls";
metrics_path = "/probe";
params = {
module = [ "https_cert" ];
};
static_configs = [{ targets = tlsTargets; }];
relabel_configs = [
{
source_labels = [ "__address__" ];
target_label = "__param_target";
}
{
source_labels = [ "__param_target" ];
target_label = "instance";
}
{
target_label = "__address__";
replacement = "127.0.0.1:9115";
}
];
}
# Sonarr exporter
{
job_name = "sonarr";
static_configs = [{ targets = [ "localhost:9709" ]; }];
}
# Proxmox VE exporter
{
job_name = "pve";
static_configs = [{ targets = [ "localhost:9221" ]; }];
}
] ++ autoScrapeConfigs;
in
{
# Static user for VictoriaMetrics (overrides DynamicUser) so vault.secrets
# and credential files can be owned by this user
users.users.victoriametrics = {
isSystemUser = true;
group = "victoriametrics";
};
users.groups.victoriametrics = { };
# Override DynamicUser since we need a static user for credential file access
systemd.services.victoriametrics.serviceConfig = {
DynamicUser = lib.mkForce false;
User = "victoriametrics";
Group = "victoriametrics";
};
# Systemd service to fetch AppRole token for OpenBao scraping
systemd.services.victoriametrics-openbao-token = {
description = "Fetch OpenBao token for VictoriaMetrics metrics scraping";
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
before = [ "victoriametrics.service" ];
requiredBy = [ "victoriametrics.service" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe fetchOpenbaoToken;
};
};
# Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
systemd.timers.victoriametrics-openbao-token = {
description = "Refresh OpenBao token for VictoriaMetrics";
wantedBy = [ "timers.target" ];
timerConfig = {
OnBootSec = "5min";
OnUnitActiveSec = "30min";
RandomizedDelaySec = "5min";
};
};
# Fetch apiary bearer token from Vault
vault.secrets.victoriametrics-apiary-token = {
secretPath = "hosts/monitoring02/apiary-token";
extractKey = "password";
owner = "victoriametrics";
group = "victoriametrics";
services = [ "victoriametrics" ];
};
services.victoriametrics = {
enable = true;
retentionPeriod = "3"; # 3 months
# Disable config check since we reference external credential files
checkConfig = false;
prometheusConfig = {
global.scrape_interval = "15s";
scrape_configs = scrapeConfigs;
};
};
# vmalert for alerting rules
services.vmalert.instances.default = {
enable = true;
settings = {
"datasource.url" = "http://localhost:8428";
"notifier.url" = [ "http://localhost:9093" ];
"rule" = [ ../monitoring/rules.yml ];
};
};
# Caddy reverse proxy for VictoriaMetrics and vmalert
services.caddy.virtualHosts."metrics.home.2rjus.net".extraConfig = ''
reverse_proxy http://127.0.0.1:8428
'';
services.caddy.virtualHosts."vmalert.home.2rjus.net".extraConfig = ''
reverse_proxy http://127.0.0.1:8880
'';
# Alertmanager
services.caddy.virtualHosts."alertmanager.home.2rjus.net".extraConfig = ''
reverse_proxy http://127.0.0.1:9093
'';
services.prometheus.alertmanager = {
enable = true;
configuration = {
global = { };
route = {
receiver = "webhook_natstonotify";
group_wait = "30s";
group_interval = "5m";
repeat_interval = "1h";
group_by = [ "alertname" ];
};
receivers = [
{
name = "webhook_natstonotify";
webhook_configs = [
{
url = "http://localhost:5001/alert";
}
];
}
];
};
};
}
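
Note on the token write in `fetch-openbao-token-vm`: the script writes the token first and tightens ownership and mode afterwards, so the file briefly exists with default permissions, and a reader could observe a half-written token during a refresh. A minimal sketch of one way to close that window, assuming a hypothetical helper name `write_secret_atomically` (not part of this change) and leaving out the `chown victoriametrics:victoriametrics` step, which needs root and would go just before the `mv`:

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not part of the change above): write a secret to the
# path given as $2 without exposing it with default permissions.
write_secret_atomically() {
  local secret="$1"
  local output_file="$2"
  local output_dir
  output_dir=$(dirname "$output_file")

  mkdir -p "$output_dir"
  # Run in a subshell so the restrictive umask does not leak to the caller.
  (
    umask 077
    # Create the temp file in the target directory so the final rename
    # stays on one filesystem.
    tmp_file=$(mktemp "$output_dir/.token.XXXXXX")
    printf '%s' "$secret" > "$tmp_file"
    chmod 0400 "$tmp_file"
    # A rename within one directory is atomic: readers see either the old
    # token or the new one, never a partial file.
    mv -f "$tmp_file" "$output_file"
  )
}
```

In the service script this would replace the `echo -n ... > "$OUTPUT_FILE"` line, for example `write_secret_atomically "$VAULT_TOKEN" "$OUTPUT_FILE"`.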


@@ -16,6 +16,16 @@ in
SystemKeepFree=1G
'';
};
+# Fetch Loki push password from Vault (only on hosts with Vault enabled)
+vault.secrets.promtail-loki-auth = lib.mkIf config.vault.enable {
+secretPath = "shared/loki/push-auth";
+extractKey = "password";
+owner = "promtail";
+group = "promtail";
+services = [ "promtail" ];
+};
# Configure promtail
services.promtail = {
enable = true;
@@ -29,7 +39,11 @@ in
clients = [
{
-url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
+url = "https://loki.home.2rjus.net/loki/api/v1/push";
+basic_auth = {
+username = "promtail";
+password_file = "/run/secrets/promtail-loki-auth";
+};
}
];


@@ -31,6 +31,10 @@ in
};
settings = {
+experimental-features = [
+"nix-command"
+"flakes"
+];
trusted-substituters = [
"https://nix-cache.home.2rjus.net"
"https://cache.nixos.org"


@@ -16,7 +16,8 @@ let
text = ''
set -euo pipefail
-LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
+LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
HOSTNAME=$(hostname)
SESSION_ID=""
RECORD_MODE=false
@@ -69,7 +70,13 @@ let
}]
}')
+local auth_args=()
+if [[ -f "$LOKI_AUTH_FILE" ]]; then
+auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
+fi
if curl -s -X POST "$LOKI_URL" \
+"''${auth_args[@]}" \
-H "Content-Type: application/json" \
-d "$payload" > /dev/null; then
return 0
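
Note on the optional-auth pattern in this hunk: collecting the extra curl flags in an array keeps the password as a single argument even if it contains spaces, and lets the same `curl` call work on hosts without the secret file. A minimal sketch of the pattern in isolation, assuming hypothetical helper names (`build_auth_args`, `show_push_command`) and a placeholder URL; it prints the command line instead of contacting a server:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: populate the global AUTH_ARGS array with curl
# basic-auth arguments when the password file exists; leave it empty otherwise.
build_auth_args() {
  local auth_file="$1"
  AUTH_ARGS=()
  if [[ -f "$auth_file" ]]; then
    AUTH_ARGS=(-u "promtail:$(cat "$auth_file")")
  fi
}

# Print the curl command line that would be used, one argument per line,
# instead of sending a request.
show_push_command() {
  local url="$1" auth_file="$2"
  build_auth_args "$auth_file"
  # The ${arr[@]+...} form expands to nothing for an empty array and also
  # avoids an "unbound variable" error under `set -u` on bash older than 4.4.
  printf '%s\n' curl -s -X POST "$url" ${AUTH_ARGS[@]+"${AUTH_ARGS[@]}"}
}
```

With a recent bash the plain `"${auth_args[@]}"` expansion used in the diff behaves the same way; the guarded form only matters if the script may run under an older bash with `set -u`.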


@@ -57,7 +57,7 @@ let
type = types.str;
description = ''
Path to the secret in Vault (without /v1/secret/data/ prefix).
-Example: "hosts/monitoring01/grafana-admin"
+Example: "hosts/ha1/mqtt-password"
'';
};
@@ -152,13 +152,11 @@ in
'';
example = literalExpression ''
{
-grafana-admin = {
-secretPath = "hosts/monitoring01/grafana-admin";
-owner = "grafana";
-group = "grafana";
-restartTrigger = true;
-restartInterval = "daily";
-services = [ "grafana" ];
+mqtt-password = {
+secretPath = "hosts/ha1/mqtt-password";
+owner = "mosquitto";
+group = "mosquitto";
+services = [ "mosquitto" ];
};
}
'';


@@ -26,26 +26,27 @@ path "secret/data/shared/nixos-exporter/*" {
EOT
}
+# Shared policy for Loki push authentication (all hosts push logs)
+resource "vault_policy" "loki_push" {
+name = "loki-push"
+policy = <<EOT
+path "secret/data/shared/loki/*" {
+capabilities = ["read", "list"]
+}
+EOT
+}
# Define host access policies
locals {
host_policies = {
-# Example: monitoring01 host
-# "monitoring01" = {
-# paths = [
-# "secret/data/hosts/monitoring01/*",
-# "secret/data/services/prometheus/*",
-# "secret/data/services/grafana/*",
-# "secret/data/shared/smtp/*"
-# ]
-# extra_policies = ["some-other-policy"] # Optional: additional policies
-# }
-# Example: ha1 host
+# Example:
# "ha1" = {
# paths = [
# "secret/data/hosts/ha1/*",
# "secret/data/shared/mqtt/*"
# ]
+# extra_policies = ["some-other-policy"] # Optional: additional policies
# }
"ha1" = {
@@ -55,16 +56,6 @@ locals {
]
}
-"monitoring01" = {
-paths = [
-"secret/data/hosts/monitoring01/*",
-"secret/data/shared/backup/*",
-"secret/data/shared/nats/*",
-"secret/data/services/exportarr/*",
-]
-extra_policies = ["prometheus-metrics"]
-}
}
# Wave 1: hosts with no service secrets (only need vault.enable for future use)
"nats1" = {
paths = [
@@ -78,7 +69,7 @@ locals {
]
}
-# Wave 3: DNS servers
+# Wave 3: DNS servers (managed in hosts-generated.tf)
# Wave 4: http-proxy
"http-proxy" = {
@@ -104,14 +95,6 @@ locals {
]
}
-# monitoring02: Grafana test instance
-"monitoring02" = {
-paths = [
-"secret/data/hosts/monitoring02/*",
-"secret/data/services/grafana/*",
-]
-}
}
}
@@ -137,7 +120,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = concat(
-["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
+["${each.key}-policy", "homelab-deploy", "nixos-exporter", "loki-push"],
lookup(each.value, "extra_policies", [])
)


@@ -44,6 +44,25 @@ locals {
"secret/data/hosts/garage01/*",
]
}
+"monitoring02" = {
+paths = [
+"secret/data/hosts/monitoring02/*",
+"secret/data/services/grafana/*",
+"secret/data/services/exportarr/*",
+"secret/data/shared/nats/nkey",
+]
+extra_policies = ["prometheus-metrics"]
+}
+"pn01" = {
+paths = [
+"secret/data/hosts/pn01/*",
+]
+}
+"pn02" = {
+paths = [
+"secret/data/hosts/pn02/*",
+]
+}
}
@@ -74,7 +93,10 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
-token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
+token_policies = concat(
+["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"],
+lookup(each.value, "extra_policies", [])
+)
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600
token_max_ttl = 3600


@@ -10,10 +10,6 @@ resource "vault_mount" "kv" {
locals {
secrets = {
# Example host-specific secrets
-# "hosts/monitoring01/grafana-admin" = {
-# auto_generate = true
-# password_length = 32
-# }
# "hosts/ha1/mqtt-password" = {
# auto_generate = true
# password_length = 24
@@ -35,11 +31,6 @@ locals {
# }
# }
-"hosts/monitoring01/grafana-admin" = {
-auto_generate = true
-password_length = 32
-}
"hosts/ha1/mqtt-password" = {
auto_generate = true
password_length = 24
@@ -57,8 +48,8 @@ locals {
data = { nkey = var.nats_nkey }
}
-# PVE exporter config for monitoring01
-"hosts/monitoring01/pve-exporter" = {
+# PVE exporter config for monitoring02
+"hosts/monitoring02/pve-exporter" = {
auto_generate = false
data = { config = var.pve_exporter_config }
}
@@ -147,6 +138,18 @@ locals {
auto_generate = false
data = { api_key = var.sonarr_api_key }
}
+# Bearer token for scraping apiary metrics
+"hosts/monitoring02/apiary-token" = {
+auto_generate = true
+password_length = 64
+}
+# Loki push authentication (used by Promtail on all hosts)
+"shared/loki/push-auth" = {
+auto_generate = true
+password_length = 32
+}
}
}