All hosts had identical nix-command/flakes settings in their
configuration.nix. Centralize them in system/nix.nix so new hosts
(like pn01/pn02) get them automatically.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add pn01 and pn02 to hosts-generated.tf for Vault AppRole access.
Fix provision-approle.yml: the localhost play was skipped when using
the -l filter, since localhost didn't match the target. Merged both
plays into a single play that uses delegate_to: localhost for the bao
commands.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add two ASUS PN51 hosts on VLAN 12 for stability testing.
pn01 at 10.69.12.60, pn02 at 10.69.12.61, both test-tier compute role.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Decided on Kodi + JellyCon with NFS direct path for media playback,
Sway/Hyprland for display server with workspace-based browser switching,
and noted HDR status for future reference.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New plan for replacing the media PC (i7-4770K/Ubuntu) with a NixOS mini PC
running Kodi. Router plan updated with specific AliExpress hardware options
and IDS/IPS considerations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduces false positives from transient Nix store growth by basing the
linear prediction on a 24h trend instead of 6h.
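
The effect of the window length can be sketched in Python. This is a
minimal illustration, not the alert rule itself: it assumes the
prediction is an ordinary least-squares line fitted over the window and
extrapolated forward, and the function name, sample data, and 24h
horizon are all hypothetical.

```python
def predict_linear(samples, horizon_s):
    """Fit a least-squares line to (t_seconds, value) samples and
    extrapolate horizon_s seconds past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_s)


HOUR = 3600

# Hypothetical free space in GiB, sampled hourly over 24h: flat at 100
# until a rebuild in the last 3 hours consumes 10 GiB per hour.
free = [
    (h * HOUR, 100.0 if h <= 21 else 100.0 - 10.0 * (h - 21))
    for h in range(25)
]

# Predicted free space 24h from now, using a 6h and a 24h window.
short = predict_linear(free[-7:], 24 * HOUR)  # last 7 hourly samples
long_ = predict_linear(free, 24 * HOUR)       # all 25 hourly samples
```

With this made-up data the 6h fit extrapolates the burst to a negative
value (disk predicted full), while the 24h fit stays well above zero.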
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NAS and Proxmox are on the same 10GbE switch but different subnets,
forcing traffic through the router. Need to fix during migration.
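
Why the switch does not help here can be sketched with Python's
ipaddress module: a host only sends traffic directly when the
destination is inside its own subnet. The addresses and the helper name
below are hypothetical, not the real NAS or Proxmox addresses.

```python
import ipaddress


def needs_router(src_iface, dst_ip):
    """Return True when dst_ip lies outside the subnet of src_iface,
    meaning traffic is sent to the gateway instead of directly."""
    src = ipaddress.ip_interface(src_iface)
    return ipaddress.ip_address(dst_ip) not in src.network


# Hypothetical addresses: different /24s, so traffic is routed.
nas_to_pve = needs_router("10.0.1.10/24", "10.0.2.20")

# Same /24: traffic can stay on the switch.
same_subnet = needs_router("10.0.1.10/24", "10.0.1.20")
```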
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root to keep
the boot path ZFS-independent. Added zfs export step before shutdown.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BTRFS RAID5/6 write hole is still unresolved, and RAID1 wastes
capacity with mixed disk sizes. Keep existing ZFS pool and import
directly on NixOS instead. Updated migration strategy, disk purchase
decision (2x 24TB ordered), SMART health notes, and vdev rebalancing
guidance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all dashboard datasource references from "prometheus" to
"victoriametrics" to match the declared datasource UID. Enable
prune and deleteDatasources to clean up the old Prometheus
(monitoring01) datasource from Grafana's database.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch vmalert from blackhole mode to sending alerts to local
  Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
  http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
  extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
  only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
  basic auth from Vault secret
- Move migration plan to completed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The real .mcp.json now contains Loki credentials for basic auth,
so it should not be committed. The example file has placeholders.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The secret file needs to be owned by promtail since Promtail runs
as a dedicated user and can't read root-owned files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert ns1/ns2 from approle.tf (they're in hosts-generated.tf) and add
loki-push policy to generated AppRoles instead.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
They were missing from the host_policies map, so they didn't get
shared policies like loki-push.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Loki bound to localhost, Caddy reverse proxy with basic_auth
- Vault secret (shared/loki/push-auth) for password, bcrypt hash
  generated at boot for Caddy environment
- Promtail dual-ships to monitoring01 (direct) and loki.home.2rjus.net
  (with basic auth), conditional on vault.enable
- Terraform: new shared loki-push policy added to all AppRoles
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.
Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).
- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set interval=60s on rate() panels to match the actual Prometheus scrape
interval, so Grafana calculates $__rate_interval correctly.
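
For reference, a small Python sketch of how the value is derived,
assuming the formula from Grafana's documentation,
max($__interval + scrape interval, 4 * scrape interval), and assuming
the datasource was otherwise using a 15s scrape interval. The function
name is hypothetical.

```python
def rate_interval(interval_s, scrape_interval_s):
    """Approximate Grafana's $__rate_interval, in seconds, from the
    panel's $__interval and the configured scrape interval."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Assumed 15s scrape interval: the window is only 60s wide, which
# rarely holds the two samples rate() needs when scraping every 60s.
before = rate_interval(15, 15)

# With interval=60s set on the panel, the window covers four scrapes.
after = rate_interval(60, 60)
```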
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add gazetteer reference for country code lookup resolution.
Remove unnecessary reduce transformation. Make geomap panel
full-width (24 cols) and taller (h=10) on its own row.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable compactor-based retention with 30-day period to prevent
unbounded disk growth. Add basic rate limits and stream guards
to protect against runaway log generators.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>