nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	20875fb03f	pn02: disable sched_ext and document memtest results Memtest86 ran 38 passes (109 hours) with zero errors, ruling out RAM. Disable sched_ext scheduler to test whether kernel scheduler crashes stop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 12:16:55 +01:00
Torjus Håkestad	07e86acbaa	docs: add plan for bare metal actions runner on nix-cache02 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 01:01:14 +01:00
Torjus Håkestad	73d804105b	pn01, pn02: enable memtest86 and update stability docs Enable memtest86 in systemd-boot menu on both PN51 units to allow extended memory testing. Update stability document with March crash data from pstore/Loki; crashes now traced to sched_ext scheduler kernel oops, suggesting possible memory corruption. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 23:02:28 +01:00
Torjus Håkestad	55da459108	docs: add plan for local NTP with chrony Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 19:33:28 +01:00
Torjus Håkestad	cf55d07ce5	docs: update pn51 stability with third freeze and conclusion pn02 crashed again after ~2d21h uptime despite all mitigations (amdgpu blacklist, max_cstate=1, NMI watchdog, rasdaemon). NMI watchdog didn't fire and rasdaemon recorded nothing, confirming hard lockup below NMI level. Unit is unreliable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 18:25:52 +01:00
Torjus Håkestad	5e92eb3220	docs: add plan for NixOS OpenStack image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 00:42:19 +01:00
Torjus Håkestad	c8cadd09c5	pn51: document diagnostic config (rasdaemon, NMI watchdog, panic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 18:52:34 +01:00
Torjus Håkestad	a7c1ce932d	pn51: add remaining debug steps and auto-recovery fallback Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 18:38:17 +01:00
Torjus Håkestad	2b42145d94	pn51: document BIOS tweaks, second pn02 freeze, amdgpu blacklist Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 18:28:19 +01:00
Torjus Håkestad	75fdd7ae40	pn51: document stress test pass and TSC runtime test failure Both units survived 1h stress test at 80-85C. TSC clocksource is genuinely unstable at runtime (not just boot), HPET is the correct fallback for this platform. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 11:52:34 +01:00
Torjus Håkestad	5346889b73	pn51: add TSC runtime switch test to next steps Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 11:50:30 +01:00
Torjus Håkestad	9f7aab86a0	pn51: update stability notes, TSC/PSP issues affect both units Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 09:25:28 +01:00
Torjus Håkestad	bb53b922fa	plans: add NixOS hypervisor plan (Incus on PN51s) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 00:47:09 +01:00
Torjus Håkestad	75cd7c6c2d	docs: add PN51 stability testing notes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 00:24:28 +01:00
Torjus Håkestad	b578520905	media-pc: add JellyCon, display server, and HDR decisions Decided on Kodi + JellyCon with NFS direct path for media playback, Sway/Hyprland for display server with workspace-based browser switching, and noted HDR status for future reference. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-21 00:08:19 +01:00
Torjus Håkestad	8a5aa1c4f5	plans: add media PC replacement plan, update router hardware candidates New plan for replacing the media PC (i7-4770K/Ubuntu) with a NixOS mini PC running Kodi. Router plan updated with specific AliExpress hardware options and IDS/IPS considerations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 23:54:29 +01:00
Torjus Håkestad	0f8c4783a8	truenas-migration: drive trays ordered, resolve open question Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 19:29:12 +01:00
Torjus Håkestad	58702bd10b	truenas-migration: note subnet issue for 10GbE traffic NAS and Proxmox are on the same 10GbE switch but different subnets, forcing traffic through the router. Need to fix during migration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 01:34:46 +01:00
Torjus Håkestad	c9f47acb01	truenas-migration: mdadm boot mirror, clean zfs export step Use TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root to keep the boot path ZFS-independent. Added zfs export step before shutdown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 01:34:46 +01:00
Torjus Håkestad	09ce018fb2	truenas-migration: switch from BTRFS to keeping ZFS, update plan BTRFS RAID5/6 write hole is still unresolved, and RAID1 wastes capacity with mixed disk sizes. Keep existing ZFS pool and import directly on NixOS instead. Updated migration strategy, disk purchase decision (2x 24TB ordered), SMART health notes, and vdev rebalancing guidance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 01:34:46 +01:00
Torjus Håkestad	eec1e374b2	docs: simplify mermaid diagram labels Use <br/> for line breaks and shorter node labels so the diagram renders cleanly in Gitea. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 16:29:52 +01:00
Torjus Håkestad	fcc410afad	docs: replace ASCII diagram with mermaid in remote-access plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 16:28:57 +01:00
Torjus Håkestad	b218b4f8bc	docs: update migration plan for monitoring01 and pgdb1 completion Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 22:26:23 +01:00
Torjus Håkestad	a6013d3950	monitoring02: enable alerting and migrate CNAMEs from http-proxy - Switch vmalert from blackhole mode to sending alerts to local Alertmanager - Import alerttonotify service so alerts route to NATS notifications - Move alertmanager and grafana CNAMEs from http-proxy to monitoring02 - Add monitoring CNAME to monitoring02 - Add Caddy reverse proxy entries for alertmanager and grafana - Remove prometheus, alertmanager, and grafana Caddy entries from http-proxy (now served directly by monitoring02) - Move monitoring02 Vault AppRole to hosts-generated.tf with extra_policies support and prometheus-metrics policy - Update Promtail to use authenticated loki.home.2rjus.net endpoint only (remove unauthenticated monitoring01 client) - Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with basic auth from Vault secret - Move migration plan to completed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:23:21 +01:00
Torjus Håkestad	74e7c9faa4	monitoring02: add Loki service Add standalone Loki service module (services/loki/) with same config as monitoring01 and import it on monitoring02. Update Grafana Loki datasource to localhost. Defer Tempo and Pyroscope migration (not actively used). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 19:42:19 +01:00
Torjus Håkestad	e329f87b0b	monitoring02: add VictoriaMetrics, vmalert, and Alertmanager Set up the core metrics stack on monitoring02 as Phase 2 of the monitoring migration. VictoriaMetrics replaces Prometheus with identical scrape configs (22 jobs including auto-generated targets). - VictoriaMetrics with 3-month retention and all scrape configs - vmalert evaluating existing rules.yml (notifier disabled) - Alertmanager with same routing config (no alerts during parallel op) - Grafana datasources updated: local VictoriaMetrics as default - Static user override for credential file access (OpenBao, Apiary) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:55:08 +01:00
Torjus Håkestad	4d614d8716	docs: add new service candidates and NixOS router plans Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 13:21:34 +01:00
Torjus Håkestad	af8e385b6e	docs: finalize remote access plan with WireGuard gateway design Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 00:31:52 +01:00
Torjus Håkestad	0db9fc6802	docs: update Loki improvements plan with implementation status Mark retention, limits, labels, and level mapping as done. Add JSON logging audit results with per-service details. Update current state and disk usage notes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 00:04:16 +01:00
Torjus Håkestad	d485948df0	docs: update Loki queries from host to hostname label Update all LogQL examples, agent instructions, and scripts to use the hostname label instead of host, matching the Prometheus label naming convention. Also update pipe-to-loki and bootstrap scripts to push hostname instead of host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:43:47 +01:00
Torjus Håkestad	2f0dad1acc	docs: add JSON logging audit to Loki improvements plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:44:05 +01:00
Torjus Håkestad	1544415ef3	docs: add Loki improvements plan Covers retention policy, limits config, Promtail label improvements (tier/role/level), and journal PRIORITY extraction. Also adds Alloy consideration to VictoriaMetrics migration plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:39:16 +01:00
Torjus Håkestad	5babd7f507	docs: move garage S3 storage plan to completed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:54:23 +01:00
Torjus Håkestad	5d3d93b280	docs: move completed plans to completed folder Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:08:17 +01:00
Torjus Håkestad	08d9e1ec3f	docs: add garage S3 storage plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:06:53 +01:00
Torjus Håkestad	ed1821b073	nix-cache02: add scheduled builds timer Add a systemd timer that triggers builds for all hosts every 2 hours via NATS, keeping the binary cache warm. - Add scheduler.nix with timer (every 2h) and oneshot service - Add scheduler NATS user to DEPLOY account - Add Vault secret and variable for scheduler NKey - Increase nix-cache02 memory from 16GB to 20GB Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 00:50:09 +01:00
Torjus Håkestad	ddcbc30665	docs: mark nix-cache01 decommission complete Phase 4 fully complete. nix-cache01 has been: - Removed from repo (host config, build scripts, flake entry) - Vault resources cleaned up - VM deleted from Proxmox nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:43:12 +01:00
Torjus Håkestad	ade0538717	docs: mark nix-cache DNS cutover complete nix-cache.home.2rjus.net now served by nix-cache02. nix-cache01 ready for decommission. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:34:04 +01:00
Torjus Håkestad	afff3f28ca	docs: update nix-cache-reprovision plan with Harmonia progress - Phase 4 now in progress - Harmonia configured on nix-cache02 with new signing key - Trusted public key deployed to all hosts - Cache tested successfully from testvm01 - Actions runner removed from scope Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:17:51 +01:00
Torjus Håkestad	751edfc11d	nix-cache02: add Harmonia binary cache service - Parameterize harmonia.nix to use hostname-based Vault paths - Add nix-cache services to nix-cache02 - Add Vault secret and variable for nix-cache02 signing key - Add nix-cache02 public key to trusted-public-keys on all hosts - Update plan doc to remove actions runner references Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:08:48 +01:00
Torjus Håkestad	5bfb51a497	docs: add observability phase to nix-cache plan - Add Phase 6 for alerting and Grafana dashboards - Document available Prometheus metrics - Include example alerting rules for build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:46:38 +01:00
Torjus Håkestad	f83145d97a	docs: update nix-cache-reprovision plan with progress - Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete - Document nix-cache02 configuration and tested build times - Add remaining work for Harmonia, Actions runner, and DNS cutover - Enable --enable-builds flag in MCP config Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:43:48 +01:00
Torjus Håkestad	98ea679ef2	docs: add monitoring02 reboot alert investigation Document findings from false positive host_reboot alert caused by NTP clock adjustment affecting node_boot_time_seconds metric. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 17:59:53 +01:00
Torjus Håkestad	75e4fb61a5	monitoring: add blackbox exporter for TLS certificate monitoring Add blackbox exporter to monitoring01 to probe TLS endpoints and alert on expiring certificates. Monitors all ACME-managed certificates from OpenBao PKI including Caddy auto-TLS services. Alerts: - tls_certificate_expiring_soon (< 7 days, warning) - tls_certificate_expiring_critical (< 24h, critical) - tls_probe_failed (connectivity issues) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:21:42 +01:00
Torjus Håkestad	7ff3d2a09b	docs: move openbao-kanidm-oidc plan to completed	2026-02-09 19:44:06 +01:00
Torjus Håkestad	02270a0e4a	docs: update plans with Grafana OIDC progress - auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed, document key findings (PKCE, attribute paths, user requirements) - monitoring-migration-victoriametrics.md: Note Grafana deployment on monitoring02 with Kanidm OIDC as test instance Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:28:10 +01:00
Torjus Håkestad	8786113f8f	docs: add OpenBao + Kanidm OIDC integration plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:45:44 +01:00
Torjus Håkestad	9ed09c9a9c	docs: add user-management documentation - CLI workflows for creating users and groups - Troubleshooting guide (nscd, cache invalidation) - Home directory behavior (UUID-based with symlinks) - Update auth-system-replacement plan with progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:21 +01:00
Torjus Håkestad	b845a8bb8b	system: add kanidm PAM/NSS client module Add homelab.kanidm.enable option for central authentication via Kanidm. The module configures: - PAM/NSS integration with kanidm-unixd - Client connection to auth.home.2rjus.net - Login authorization for ssh-users group Enable on testvm01-03 for testing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	3abe5e83a7	docs: add memory ballooning as fallback option Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:29:42 +01:00

107 Commits