Commit Graph

1005 Commits

Author SHA1 Message Date
73d804105b pn01, pn02: enable memtest86 and update stability docs
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m15s
Periodic flake update / flake-update (push) Successful in 2m50s
Enable memtest86 in systemd-boot menu on both PN51 units to allow
extended memory testing. Update stability document with March crash
data from pstore/Loki — crashes now traced to sched_ext scheduler
kernel oops, suggesting possible memory corruption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:02:28 +01:00
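For illustration, the memtest86 change in 73d804105b could look roughly like the NixOS fragment below. This is a sketch, not the repository's actual diff. It assumes the stock option boot.loader.systemd-boot.memtest86.enable is what was used, and the module layout shown is hypothetical.

```nix
# Hypothetical host module fragment for pn01/pn02 (file layout assumed).
{ ... }:
{
  # Adds a memory-test entry to the systemd-boot menu so extended
  # memory testing can be started from the boot screen.
  boot.loader.systemd-boot.memtest86.enable = true;
}
```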
d2a4e4a0a1 grafana: add storage query performance panels to apiary dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m23s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 22:47:30 +01:00
28eba49d68 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/8c809a146a140c5c8806f13399592dbcb1bb5dc4?narHash=sha256-WGV2hy%2BVIeQsYXpsLjdr4GvHv5eECMISX1zKLTedhdg%3D' (2026-03-03)
  → 'github:nixos/nixpkgs/80bdc1e5ce51f56b19791b52b2901187931f5353?narHash=sha256-QKyJ0QGWBn6r0invrMAK8dmJoBYWoOWy7lN%2BUHzW1jc%3D' (2026-03-04)
2026-03-06 00:07:07 +00:00
4bf726a674 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/c581273b8d5bdf1c6ce7e0a54da9841e6a763913?narHash=sha256-ywy9troNEfpgh0Ee%2BzaV1UTgU8kYBVKtvPSxh6clYGU%3D' (2026-03-02)
  → 'github:nixos/nixpkgs/fabb8c9deee281e50b1065002c9828f2cf7b2239?narHash=sha256-YaHht/C35INEX3DeJQNWjNaTcPjYmBwwjFJ2jdtr%2B5U%3D' (2026-03-04)
2026-03-05 00:07:31 +00:00
774fd92524 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/1267bb4920d0fc06ea916734c11b0bf004bbe17e?narHash=sha256-7DaQVv4R97cii/Qdfy4tmDZMB2xxtyIvNGSwXBBhSmo%3D' (2026-02-25)
  → 'github:nixos/nixpkgs/c581273b8d5bdf1c6ce7e0a54da9841e6a763913?narHash=sha256-ywy9troNEfpgh0Ee%2BzaV1UTgU8kYBVKtvPSxh6clYGU%3D' (2026-03-02)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/cf59864ef8aa2e178cccedbe2c178185b0365705?narHash=sha256-izhTDFKsg6KeVBxJS9EblGeQ8y%2BO8eCa6RcW874vxEc%3D' (2026-03-02)
  → 'github:nixos/nixpkgs/8c809a146a140c5c8806f13399592dbcb1bb5dc4?narHash=sha256-WGV2hy%2BVIeQsYXpsLjdr4GvHv5eECMISX1zKLTedhdg%3D' (2026-03-03)
2026-03-04 00:06:56 +00:00
55da459108 docs: add plan for local NTP with chrony
Some checks failed
Run nix flake check / flake-check (push) Failing after 9m52s
Periodic flake update / flake-update (push) Successful in 5m19s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 19:33:28 +01:00
813c5c0f29 monitoring: separate node-exporter-only external targets
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m7s
Add nodeExporterOnly list to external-targets.nix for hosts that
have node-exporter but not systemd-exporter (e.g. pve1). This
prevents a down target in the systemd-exporter scrape job.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 19:17:39 +01:00
013ab8f621 monitoring: add pve1 node-exporter scrape target
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m6s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 19:10:54 +01:00
f75b773485 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/dd9b079222d43e1943b6ebd802f04fd959dc8e61?narHash=sha256-I45esRSssFtJ8p/gLHUZ1OUaaTaVLluNkABkk6arQwE%3D' (2026-02-27)
  → 'github:nixos/nixpkgs/cf59864ef8aa2e178cccedbe2c178185b0365705?narHash=sha256-izhTDFKsg6KeVBxJS9EblGeQ8y%2BO8eCa6RcW874vxEc%3D' (2026-03-02)
2026-03-03 00:07:07 +00:00
58c3844950 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/2fc6539b481e1d2569f25f8799236694180c0993?narHash=sha256-0MAd%2B0mun3K/Ns8JATeHT1sX28faLII5hVLq0L3BdZU%3D' (2026-02-23)
  → 'github:nixos/nixpkgs/dd9b079222d43e1943b6ebd802f04fd959dc8e61?narHash=sha256-I45esRSssFtJ8p/gLHUZ1OUaaTaVLluNkABkk6arQwE%3D' (2026-02-27)
2026-03-01 00:01:26 +00:00
80e5fa08fa flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/e764fc9a405871f1f6ca3d1394fb422e0a0c3951?narHash=sha256-sdaqdnsQCv3iifzxwB22tUwN/fSHoN7j2myFW5EIkGk%3D' (2026-02-24)
  → 'github:nixos/nixpkgs/1267bb4920d0fc06ea916734c11b0bf004bbe17e?narHash=sha256-7DaQVv4R97cii/Qdfy4tmDZMB2xxtyIvNGSwXBBhSmo%3D' (2026-02-25)
2026-02-28 00:07:22 +00:00
cf55d07ce5 docs: update pn51 stability with third freeze and conclusion
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m1s
Periodic flake update / flake-update (push) Successful in 5m37s
pn02 crashed again after ~2d21h uptime despite all mitigations
(amdgpu blacklist, max_cstate=1, NMI watchdog, rasdaemon).
NMI watchdog didn't fire and rasdaemon recorded nothing,
confirming hard lockup below NMI level. Unit is unreliable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 18:25:52 +01:00
4941e38dac flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/afbbf774e2087c3d734266c22f96fca2e78d3620?narHash=sha256-nhZJPnBavtu40/L2aqpljrfUNb2rxmWTmSjK2c9UKds%3D' (2026-02-21)
  → 'github:nixos/nixpkgs/e764fc9a405871f1f6ca3d1394fb422e0a0c3951?narHash=sha256-sdaqdnsQCv3iifzxwB22tUwN/fSHoN7j2myFW5EIkGk%3D' (2026-02-24)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/0182a361324364ae3f436a63005877674cf45efb?narHash=sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ%3D' (2026-02-17)
  → 'github:nixos/nixpkgs/2fc6539b481e1d2569f25f8799236694180c0993?narHash=sha256-0MAd%2B0mun3K/Ns8JATeHT1sX28faLII5hVLq0L3BdZU%3D' (2026-02-23)
2026-02-25 00:07:00 +00:00
03ffcc1ad0 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/c217913993d6c6f6805c3b1a3bda5e639adfde6d?narHash=sha256-D1PA3xQv/s4W3lnR9yJFSld8UOLr0a/cBWMQMXS%2B1Qg%3D' (2026-02-20)
  → 'github:nixos/nixpkgs/afbbf774e2087c3d734266c22f96fca2e78d3620?narHash=sha256-nhZJPnBavtu40/L2aqpljrfUNb2rxmWTmSjK2c9UKds%3D' (2026-02-21)
2026-02-24 00:01:35 +00:00
5e92eb3220 docs: add plan for NixOS OpenStack image
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m1s
Periodic flake update / flake-update (push) Successful in 2m23s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:42:19 +01:00
2321e191a2 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47?narHash=sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU%3D' (2026-02-18)
  → 'github:nixos/nixpkgs/c217913993d6c6f6805c3b1a3bda5e639adfde6d?narHash=sha256-D1PA3xQv/s4W3lnR9yJFSld8UOLr0a/cBWMQMXS%2B1Qg%3D' (2026-02-20)
2026-02-23 00:01:30 +00:00
136116ab33 pn02: limit CPU to C1 power state for stability
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m36s
Periodic flake update / flake-update (push) Successful in 2m18s
Known PN51 platform issue with deep C-states causing freezes.
Limit to C1 to prevent deeper sleep states.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:58:41 +01:00
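A C-state limit like the one in 136116ab33 is usually set with a kernel command-line parameter. The fragment below is a minimal sketch. It assumes the standard processor.max_cstate parameter (a later commit message mentions max_cstate=1); the exact parameter and file used in the repository are not shown in this log.

```nix
# Hypothetical fragment for the pn02 host configuration.
{ ... }:
{
  # Cap the deepest ACPI C-state the kernel may enter at C1.
  boot.kernelParams = [ "processor.max_cstate=1" ];
}
```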
c8cadd09c5 pn51: document diagnostic config (rasdaemon, NMI watchdog, panic)
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:52:34 +01:00
72acaa872b pn02: add panic on lockup, NMI watchdog, and rasdaemon
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enable kernel panic on soft/hard lockups with auto-reboot after
10s, and rasdaemon for hardware error logging. Should give us
diagnostic data on the next freeze.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:48:21 +01:00
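The diagnostics in 72acaa872b (panic on lockup, reboot after 10s, NMI watchdog, rasdaemon) can be sketched with standard kernel sysctls and the NixOS rasdaemon module. This is an assumed shape only; the repository may set these through kernel parameters or a different module.

```nix
# Hypothetical fragment for the pn02 host configuration.
{ ... }:
{
  boot.kernel.sysctl = {
    "kernel.nmi_watchdog" = 1;     # enable the NMI watchdog (hard lockup detector)
    "kernel.softlockup_panic" = 1; # panic when a soft lockup is detected
    "kernel.hardlockup_panic" = 1; # panic when a hard lockup is detected
    "kernel.panic" = 10;           # reboot automatically 10 seconds after a panic
  };

  # Log hardware errors (MCE, memory, PCIe) reported by the kernel.
  hardware.rasdaemon.enable = true;
}
```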
a7c1ce932d pn51: add remaining debug steps and auto-recovery fallback
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m4s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:38:17 +01:00
2b42145d94 pn51: document BIOS tweaks, second pn02 freeze, amdgpu blacklist
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:28:19 +01:00
05e8556bda pn02: blacklist amdgpu kernel module for stability testing
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
pn02 continues to hard freeze with no log evidence. Blacklisting
the GPU driver to eliminate GPU/PSP firmware interactions as a
possible cause. Console output will be lost but the host is
managed over SSH.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:27:05 +01:00
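Blacklisting a kernel module as in 05e8556bda is normally a one-line change in NixOS. A minimal sketch, assuming the standard boot.blacklistedKernelModules option was used:

```nix
# Hypothetical fragment for the pn02 host configuration.
{ ... }:
{
  # Prevent the amdgpu driver from loading. Local console output is lost,
  # so the host has to be managed over SSH.
  boot.blacklistedKernelModules = [ "amdgpu" ];
}
```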
75fdd7ae40 pn51: document stress test pass and TSC runtime test failure
Some checks failed
Run nix flake check / flake-check (push) Failing after 17m0s
Both units survived a 1h stress test at 80-85 °C. The TSC clocksource
is genuinely unstable at runtime (not just at boot), so HPET is the
correct fallback for this platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 11:52:34 +01:00
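75fdd7ae40 is a documentation commit, so no configuration change is shown here. If one wanted to pin the clocksource to HPET rather than rely on the kernel's fallback, a sketch using the standard clocksource= kernel parameter might look like this (an assumption, not something this log confirms was applied):

```nix
# Hypothetical fragment; whether the repository pins the clocksource is not shown.
{ ... }:
{
  # Override the default clocksource selection and use HPET instead of TSC.
  boot.kernelParams = [ "clocksource=hpet" ];
}
```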
5346889b73 pn51: add TSC runtime switch test to next steps
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 11:50:30 +01:00
7e19f51dfa nix: move experimental-features to system/nix.nix
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
All hosts had identical nix-command/flakes settings in their
configuration.nix. Centralize in system/nix.nix so new hosts
(like pn01/pn02) get it automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 10:27:53 +01:00
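The centralized setting described in 7e19f51dfa could look like the fragment below. It assumes system/nix.nix is an ordinary NixOS module imported by every host; the rest of that file's contents are unknown.

```nix
# Sketch of system/nix.nix (only the setting named in the commit message).
{ ... }:
{
  # Enable the `nix` command-line interface and flakes on all hosts.
  nix.settings.experimental-features = [ "nix-command" "flakes" ];
}
```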
9f7aab86a0 pn51: update stability notes, TSC/PSP issues affect both units
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 09:25:28 +01:00
bb53b922fa plans: add NixOS hypervisor plan (Incus on PN51s)
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m40s
Periodic flake update / flake-update (push) Failing after 4s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 00:47:09 +01:00
75cd7c6c2d docs: add PN51 stability testing notes
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m3s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 00:24:28 +01:00
72c3a938b0 hosts: enable vault on pn01 and pn02
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m12s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 23:56:05 +01:00
2f89d564f7 vault: add approles for pn01/pn02, fix provision playbook
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add pn01 and pn02 to hosts-generated.tf for Vault AppRole access.

Fix provision-approle.yml: the localhost play was skipped when using
-l filter, since localhost didn't match the target. Merged into a
single play using delegate_to: localhost for the bao commands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 23:51:56 +01:00
4a83363ee5 hosts: add pn01 and pn02 (ASUS PN51 mini PCs)
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m33s
Add two ASUS PN51 hosts on VLAN 12 for stability testing.
pn01 at 10.69.12.60, pn02 at 10.69.12.61, both test-tier compute role.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 23:37:14 +01:00
b578520905 media-pc: add JellyCon, display server, and HDR decisions
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m45s
Periodic flake update / flake-update (push) Successful in 2m16s
Decided on Kodi + JellyCon with NFS direct path for media playback,
Sway/Hyprland for display server with workspace-based browser switching,
and noted HDR status for future reference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 00:08:19 +01:00
8a5aa1c4f5 plans: add media PC replacement plan, update router hardware candidates
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m30s
New plan for replacing the media PC (i7-4770K/Ubuntu) with a NixOS mini PC
running Kodi. Router plan updated with specific AliExpress hardware options
and IDS/IPS considerations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:54:29 +01:00
0f8c4783a8 truenas-migration: drive trays ordered, resolve open question
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m18s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 19:29:12 +01:00
2ca2509083 monitoring: increase filesystem_filling_up prediction window to 24h
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m55s
Reduces false positives from transient Nix store growth by basing the
linear prediction on a 24h trend instead of 6h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 09:36:27 +01:00
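The window change in 2ca2509083 affects the range passed to PromQL's predict_linear. The rule below is a sketch only. The metric name is the standard node-exporter one, while the prediction horizon, the "for" duration, and the attribute-set layout are assumptions rather than the repository's actual rule.

```nix
# Hypothetical alert rule expressed as a Nix attribute set.
{
  alert = "filesystem_filling_up";
  # Fit a linear trend to the last 24h of free space (previously 6h) and
  # fire if the extrapolation reaches zero within the assumed horizon.
  expr = "predict_linear(node_filesystem_avail_bytes[24h], 24 * 3600) < 0";
  for = "1h"; # assumed value
}
```

A longer range smooths over short bursts of Nix store growth, at the cost of reacting more slowly to a real sustained fill.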
58702bd10b truenas-migration: note subnet issue for 10GbE traffic
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m10s
NAS and Proxmox are on the same 10GbE switch but different subnets,
forcing traffic through the router. Need to fix during migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
c9f47acb01 truenas-migration: mdadm boot mirror, clean zfs export step
Use TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root to keep
the boot path ZFS-independent. Added zfs export step before shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
09ce018fb2 truenas-migration: switch from BTRFS to keeping ZFS, update plan
BTRFS RAID5/6 write hole is still unresolved, and RAID1 wastes
capacity with mixed disk sizes. Keep existing ZFS pool and import
directly on NixOS instead. Updated migration strategy, disk purchase
decision (2x 24TB ordered), SMART health notes, and vdev rebalancing
guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
3042803c4d flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/fa56d7d6de78f5a7f997b0ea2bc6efd5868ad9e8?narHash=sha256-X01Q3DgSpjeBpapoGA4rzKOn25qdKxbPnxHeMLNoHTU%3D' (2026-02-16)
  → 'github:nixos/nixpkgs/6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47?narHash=sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU%3D' (2026-02-18)
2026-02-20 00:07:01 +00:00
1e7200b494 quick-plan: add mermaid diagram guideline
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m7s
Periodic flake update / flake-update (push) Successful in 5m26s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:35:53 +01:00
eec1e374b2 docs: simplify mermaid diagram labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m0s
Use <br/> for line breaks and shorter node labels so the diagram
renders cleanly in Gitea.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:29:52 +01:00
fcc410afad docs: replace ASCII diagram with mermaid in remote-access plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:28:57 +01:00
59f0c7ceda flake.lock: update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m10s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 09:04:03 +01:00
d713f06c6e flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/a82ccc39b39b621151d6732718e3e250109076fa?narHash=sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb%2BZnAo5RzSxJg%3D' (2026-02-13)
  → 'github:nixos/nixpkgs/0182a361324364ae3f436a63005877674cf45efb?narHash=sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ%3D' (2026-02-17)
2026-02-19 00:01:44 +00:00
7374d1ff7f nix-cache02: increase builder timeout to 4 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m4s
Periodic flake update / flake-update (push) Successful in 2m32s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 23:53:33 +01:00
e912c75b6c flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14)
  → 'github:nixos/nixpkgs/fa56d7d6de78f5a7f997b0ea2bc6efd5868ad9e8?narHash=sha256-X01Q3DgSpjeBpapoGA4rzKOn25qdKxbPnxHeMLNoHTU%3D' (2026-02-16)
2026-02-18 00:01:34 +00:00
b218b4f8bc docs: update migration plan for monitoring01 and pgdb1 completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m37s
Periodic flake update / flake-update (push) Successful in 2m21s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:26:23 +01:00
65acf13e6f grafana: fix datasource UIDs for VictoriaMetrics migration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all dashboard datasource references from "prometheus" to
"victoriametrics" to match the declared datasource UID. Enable
prune and deleteDatasources to clean up the old Prometheus
(monitoring01) datasource from Grafana's database.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:23:04 +01:00
95a96b2192 Merge pull request 'monitoring01: remove host and migrate services to monitoring02' (#43) from cleanup-monitoring01 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m2s
Reviewed-on: #43
2026-02-17 21:08:00 +00:00
4f593126c0 monitoring01: remove host and migrate services to monitoring02
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m15s
Run nix flake check / flake-check (pull_request) Failing after 3m8s
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:50:20 +01:00