Add nodeExporterOnly list to external-targets.nix for hosts that
have node-exporter but not systemd-exporter (e.g. pve1). This
prevents a down target in the systemd-exporter scrape job.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pn02 crashed again after ~2d21h uptime despite all mitigations
(amdgpu blacklist, max_cstate=1, NMI watchdog, rasdaemon).
The NMI watchdog didn't fire and rasdaemon recorded nothing,
confirming a hard lockup below the NMI level. The unit is unreliable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Known PN51 platform issue with deep C-states causing freezes.
Limit to C1 to prevent deeper sleep states.
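
A minimal sketch of how the C-state limit might be expressed in a NixOS
module. The exact kernel parameter is an assumption: the log only says
max_cstate=1, and processor.max_cstate is the ACPI idle driver form
that would apply on an AMD platform such as the PN51.

```nix
{ ... }:
{
  # Assumed parameter name; caps the ACPI idle driver at C1 so the CPU
  # never enters the deeper sleep states suspected of causing freezes.
  boot.kernelParams = [ "processor.max_cstate=1" ];
}
```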
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable kernel panic on soft/hard lockups with auto-reboot after
10s, and rasdaemon for hardware error logging. Should give us
diagnostic data on the next freeze.
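
Roughly, the settings described above could look like the following.
This is a sketch under assumptions, not the committed module: the
sysctl names are the standard kernel ones, and which of them the
actual change sets is not stated in this message.

```nix
{ ... }:
{
  boot.kernel.sysctl = {
    # Turn detected soft and hard lockups into kernel panics.
    "kernel.softlockup_panic" = 1;
    "kernel.hardlockup_panic" = 1;
    # Reboot automatically 10 seconds after a panic.
    "kernel.panic" = 10;
  };

  # Log hardware errors (MCE, EDAC, and similar) for later inspection.
  hardware.rasdaemon.enable = true;
}
```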
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pn02 continues to hard freeze with no log evidence. Blacklisting
the GPU driver to eliminate GPU/PSP firmware interactions as a
possible cause. Console output will be lost but the host is
managed over SSH.
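
For reference, a blacklist like this is typically a one-line NixOS
option. The sketch below assumes the module name amdgpu, which matches
the "amdgpu blacklist" mentioned elsewhere in this log.

```nix
{ ... }:
{
  # Keep the GPU driver from loading; local console output is lost,
  # but the host stays reachable over SSH.
  boot.blacklistedKernelModules = [ "amdgpu" ];
}
```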
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both units survived a 1h stress test at 80-85 °C. The TSC clocksource
is genuinely unstable at runtime (not just at boot); HPET is the
correct fallback for this platform.
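
One way to pin the clocksource, sketched here as an assumption since
this message does not say how the fallback is selected, is the
clocksource= kernel parameter:

```nix
{ ... }:
{
  # Assumed mechanism: select HPET at boot instead of relying on the
  # kernel to fall back after it marks TSC unstable.
  boot.kernelParams = [ "clocksource=hpet" ];
}
```

The active source can be checked at runtime in
/sys/devices/system/clocksource/clocksource0/current_clocksource.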
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All hosts had identical nix-command/flakes settings in their
configuration.nix. Centralize in system/nix.nix so new hosts
(like pn01/pn02) get it automatically.
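
A shared module along these lines would cover the setting. This is a
sketch only: the actual contents of system/nix.nix are not shown in
this log and may include more than this.

```nix
{ ... }:
{
  # Enable the nix CLI and flakes on every host that imports this module.
  nix.settings.experimental-features = [ "nix-command" "flakes" ];
}
```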
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add pn01 and pn02 to hosts-generated.tf for Vault AppRole access.
Fix provision-approle.yml: the localhost play was skipped when using
-l filter, since localhost didn't match the target. Merged into a
single play using delegate_to: localhost for the bao commands.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add two ASUS PN51 hosts on VLAN 12 for stability testing.
pn01 at 10.69.12.60, pn02 at 10.69.12.61, both test-tier compute role.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Decided on Kodi + JellyCon with NFS direct path for media playback,
Sway/Hyprland for display server with workspace-based browser switching,
and noted HDR status for future reference.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New plan for replacing the media PC (i7-4770K/Ubuntu) with a NixOS mini PC
running Kodi. Router plan updated with specific AliExpress hardware options
and IDS/IPS considerations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduces false positives from transient Nix store growth by basing the
linear prediction on a 24h trend instead of 6h.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NAS and Proxmox are on the same 10GbE switch but on different subnets,
forcing traffic through the router. This needs fixing during migration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root to keep
the boot path ZFS-independent. Added zfs export step before shutdown.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BTRFS RAID5/6 write hole is still unresolved, and RAID1 wastes
capacity with mixed disk sizes. Keep existing ZFS pool and import
directly on NixOS instead. Updated migration strategy, disk purchase
decision (2x 24TB ordered), SMART health notes, and vdev rebalancing
guidance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all dashboard datasource references from "prometheus" to
"victoriametrics" to match the declared datasource UID. Enable
prune and deleteDatasources to clean up the old Prometheus
(monitoring01) datasource from Grafana's database.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch vmalert from blackhole mode to sending alerts to local
Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
basic auth from Vault secret
- Move migration plan to completed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The real .mcp.json now contains Loki credentials for basic auth,
so it should not be committed. The example file has placeholders.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The secret file needs to be owned by promtail since Promtail runs
as a dedicated user and can't read root-owned files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>