Memtest86 ran 38 passes (109 hours) with zero errors, ruling out RAM.
Disable sched_ext scheduler to test whether kernel scheduler crashes stop.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable memtest86 in systemd-boot menu on both PN51 units to allow
extended memory testing. Update stability document with March crash
data from pstore/Loki — crashes now traced to sched_ext scheduler
kernel oops, suggesting possible memory corruption.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pn02 crashed again after ~2d21h uptime despite all mitigations
(amdgpu blacklist, max_cstate=1, NMI watchdog, rasdaemon).
NMI watchdog didn't fire and rasdaemon recorded nothing,
confirming hard lockup below NMI level. Unit is unreliable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both units survived 1h stress test at 80-85C. TSC clocksource
is genuinely unstable at runtime (not just boot), HPET is the
correct fallback for this platform.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Decided on Kodi + JellyCon with NFS direct path for media playback,
Sway/Hyprland for display server with workspace-based browser switching,
and noted HDR status for future reference.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New plan for replacing the media PC (i7-4770K/Ubuntu) with a NixOS mini PC
running Kodi. Router plan updated with specific AliExpress hardware options
and IDS/IPS considerations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NAS and Proxmox are on the same 10GbE switch but different subnets,
forcing traffic through the router. Need to fix during migration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root to keep
the boot path ZFS-independent. Added zfs export step before shutdown.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BTRFS RAID5/6 write hole is still unresolved, and RAID1 wastes
capacity with mixed disk sizes. Keep existing ZFS pool and import
directly on NixOS instead. Updated migration strategy, disk purchase
decision (2x 24TB ordered), SMART health notes, and vdev rebalancing
guidance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch vmalert from blackhole mode to sending alerts to local
Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
basic auth from Vault secret
- Move migration plan to completed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).
- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.
- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox
nix-cache02 is now the sole binary cache host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document findings from false positive host_reboot alert caused by
NTP clock adjustment affecting node_boot_time_seconds metric.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed,
document key findings (PKCE, attribute paths, user requirements)
- monitoring-migration-victoriametrics.md: Note Grafana deployment on
monitoring02 with Kanidm OIDC as test instance
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CLI workflows for creating users and groups
- Troubleshooting guide (nscd, cache invalidation)
- Home directory behavior (UUID-based with symlinks)
- Update auth-system-replacement plan with progress
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add homelab.kanidm.enable option for central authentication via Kanidm.
The module configures:
- PAM/NSS integration with kanidm-unixd
- Client connection to auth.home.2rjus.net
- Login authorization for ssh-users group
Enable on testvm01-03 for testing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>