Commit Graph

103 Commits

Author SHA1 Message Date
cf55d07ce5 docs: update pn51 stability with third freeze and conclusion
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m1s
Periodic flake update / flake-update (push) Successful in 5m37s
pn02 crashed again after ~2d21h uptime despite all mitigations
(amdgpu blacklist, max_cstate=1, NMI watchdog, rasdaemon).
NMI watchdog didn't fire and rasdaemon recorded nothing,
confirming hard lockup below NMI level. Unit is unreliable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 18:25:52 +01:00
5e92eb3220 docs: add plan for NixOS OpenStack image
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m1s
Periodic flake update / flake-update (push) Successful in 2m23s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:42:19 +01:00
c8cadd09c5 pn51: document diagnostic config (rasdaemon, NMI watchdog, panic)
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:52:34 +01:00
a7c1ce932d pn51: add remaining debug steps and auto-recovery fallback
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m4s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:38:17 +01:00
2b42145d94 pn51: document BIOS tweaks, second pn02 freeze, amdgpu blacklist
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:28:19 +01:00
75fdd7ae40 pn51: document stress test pass and TSC runtime test failure
Some checks failed
Run nix flake check / flake-check (push) Failing after 17m0s
Both units survived 1h stress test at 80-85C. TSC clocksource
is genuinely unstable at runtime (not just boot), HPET is the
correct fallback for this platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 11:52:34 +01:00
5346889b73 pn51: add TSC runtime switch test to next steps
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 11:50:30 +01:00
9f7aab86a0 pn51: update stability notes, TSC/PSP issues affect both units
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 09:25:28 +01:00
bb53b922fa plans: add NixOS hypervisor plan (Incus on PN51s)
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m40s
Periodic flake update / flake-update (push) Failing after 4s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 00:47:09 +01:00
75cd7c6c2d docs: add PN51 stability testing notes
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m3s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 00:24:28 +01:00
b578520905 media-pc: add JellyCon, display server, and HDR decisions
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m45s
Periodic flake update / flake-update (push) Successful in 2m16s
Decided on Kodi + JellyCon with NFS direct path for media playback,
Sway/Hyprland for display server with workspace-based browser switching,
and noted HDR status for future reference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 00:08:19 +01:00
8a5aa1c4f5 plans: add media PC replacement plan, update router hardware candidates
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m30s
New plan for replacing the media PC (i7-4770K/Ubuntu) with a NixOS mini PC
running Kodi. Router plan updated with specific AliExpress hardware options
and IDS/IPS considerations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:54:29 +01:00
0f8c4783a8 truenas-migration: drive trays ordered, resolve open question
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m18s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 19:29:12 +01:00
58702bd10b truenas-migration: note subnet issue for 10GbE traffic
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m10s
NAS and Proxmox are on the same 10GbE switch but different subnets,
forcing traffic through the router. Need to fix during migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
c9f47acb01 truenas-migration: mdadm boot mirror, clean zfs export step
Use TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root to keep
the boot path ZFS-independent. Added zfs export step before shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
09ce018fb2 truenas-migration: switch from BTRFS to keeping ZFS, update plan
BTRFS RAID5/6 write hole is still unresolved, and RAID1 wastes
capacity with mixed disk sizes. Keep existing ZFS pool and import
directly on NixOS instead. Updated migration strategy, disk purchase
decision (2x 24TB ordered), SMART health notes, and vdev rebalancing
guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
eec1e374b2 docs: simplify mermaid diagram labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m0s
Use <br/> for line breaks and shorter node labels so the diagram
renders cleanly in Gitea.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:29:52 +01:00
fcc410afad docs: replace ASCII diagram with mermaid in remote-access plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:28:57 +01:00
b218b4f8bc docs: update migration plan for monitoring01 and pgdb1 completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m37s
Periodic flake update / flake-update (push) Successful in 2m21s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:26:23 +01:00
a6013d3950 monitoring02: enable alerting and migrate CNAMEs from http-proxy
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m25s
Run nix flake check / flake-check (pull_request) Failing after 3m52s
- Switch vmalert from blackhole mode to sending alerts to local
  Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
  http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
  extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
  only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
  basic auth from Vault secret
- Move migration plan to completed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:23:21 +01:00
74e7c9faa4 monitoring02: add Loki service
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m19s
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:42:19 +01:00
e329f87b0b monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
4d614d8716 docs: add new service candidates and NixOS router plans
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m22s
Periodic flake update / flake-update (push) Failing after 1s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 13:21:34 +01:00
af8e385b6e docs: finalize remote access plan with WireGuard gateway design
Some checks failed
Run nix flake check / flake-check (push) Failing after 21m7s
Periodic flake update / flake-update (push) Successful in 2m16s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:31:52 +01:00
0db9fc6802 docs: update Loki improvements plan with implementation status
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m55s
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:04:16 +01:00
d485948df0 docs: update Loki queries from host to hostname label
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:43:47 +01:00
2f0dad1acc docs: add JSON logging audit to Loki improvements plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m38s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:44:05 +01:00
1544415ef3 docs: add Loki improvements plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Covers retention policy, limits config, Promtail label improvements
(tier/role/level), and journal PRIORITY extraction. Also adds Alloy
consideration to VictoriaMetrics migration plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:39:16 +01:00
5babd7f507 docs: move garage S3 storage plan to completed
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m36s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:54:23 +01:00
5d3d93b280 docs: move completed plans to completed folder
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m22s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:08:17 +01:00
08d9e1ec3f docs: add garage S3 storage plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m26s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:06:53 +01:00
ed1821b073 nix-cache02: add scheduled builds timer
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m7s
Periodic flake update / flake-update (push) Successful in 2m18s
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.

- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 00:50:09 +01:00
ddcbc30665 docs: mark nix-cache01 decommission complete
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m38s
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:43:12 +01:00
ade0538717 docs: mark nix-cache DNS cutover complete
Some checks are pending
Run nix flake check / flake-check (push) Has started running
nix-cache.home.2rjus.net now served by nix-cache02.
nix-cache01 ready for decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:34:04 +01:00
afff3f28ca docs: update nix-cache-reprovision plan with Harmonia progress
Some checks failed
Run nix flake check / flake-check (push) Failing after 52s
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:17:51 +01:00
751edfc11d nix-cache02: add Harmonia binary cache service
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:08:48 +01:00
5bfb51a497 docs: add observability phase to nix-cache plan
Some checks failed
Run nix flake check / flake-check (push) Successful in 2m35s
Run nix flake check / flake-check (pull_request) Failing after 16m1s
- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:46:38 +01:00
f83145d97a docs: update nix-cache-reprovision plan with progress
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:43:48 +01:00
98ea679ef2 docs: add monitoring02 reboot alert investigation
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m41s
Document findings from false positive host_reboot alert caused by
NTP clock adjustment affecting node_boot_time_seconds metric.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 17:59:53 +01:00
75e4fb61a5 monitoring: add blackbox exporter for TLS certificate monitoring
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Add blackbox exporter to monitoring01 to probe TLS endpoints and alert
on expiring certificates. Monitors all ACME-managed certificates from
OpenBao PKI including Caddy auto-TLS services.

Alerts:
- tls_certificate_expiring_soon (< 7 days, warning)
- tls_certificate_expiring_critical (< 24h, critical)
- tls_probe_failed (connectivity issues)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:21:42 +01:00
7ff3d2a09b docs: move openbao-kanidm-oidc plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
2026-02-09 19:44:06 +01:00
02270a0e4a docs: update plans with Grafana OIDC progress
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m7s
Run nix flake check / flake-check (push) Failing after 16m31s
- auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed,
  document key findings (PKCE, attribute paths, user requirements)
- monitoring-migration-victoriametrics.md: Note Grafana deployment on
  monitoring02 with Kanidm OIDC as test instance

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:28:10 +01:00
8786113f8f docs: add OpenBao + Kanidm OIDC integration plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m10s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:45:44 +01:00
9ed09c9a9c docs: add user-management documentation
All checks were successful
Run nix flake check / flake-check (pull_request) Successful in 3m33s
Run nix flake check / flake-check (push) Successful in 2m0s
- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:14:21 +01:00
b845a8bb8b system: add kanidm PAM/NSS client module
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group

Enable on testvm01-03 for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:12:19 +01:00
3abe5e83a7 docs: add memory ballooning as fallback option
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:29:42 +01:00
67c27555f3 docs: add memory issues follow-up plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m2s
Track zram change effectiveness for OOM prevention during upgrades.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 13:26:31 +01:00
311be282b6 docs: add security hardening plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 2s
Based on security review findings, covering SSH hardening, firewall
enablement, log transport TLS, security alerting, and secrets management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 05:26:15 +01:00
8fbf1224fa docs: add host creation pipeline documentation
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Document the end-to-end host creation workflow including:
- Prerequisites and step-by-step process
- Tier specification (test vs prod)
- Bootstrap observability via Loki
- Verification steps
- Troubleshooting guide
- Related files reference

Update CLAUDE.md to reference the new document.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 02:05:21 +01:00
8959829f77 docs: add monitoring migration to VictoriaMetrics plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02
host with parallel operation, declarative Grafana dashboards, and CNAME-based
cutover.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 01:11:07 +01:00