Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable compactor-based retention with 30-day period to prevent
unbounded disk growth. Add basic rate limits and stream guards
to protect against runaway log generators.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align Promtail labels with Prometheus by adding hostname, tier, and role
static labels to both journal and varlog scrape configs. Add pipeline
stages to map journal PRIORITY field to a level label for reliable
severity filtering across the fleet.
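For illustration only, the PRIORITY-to-level mapping can be sketched in
Python rather than in Promtail's own pipeline syntax. The level names
below are the standard syslog severity keywords; the names actually used
in the Promtail config are not shown in this message and may differ, and
`level_for` is a hypothetical helper.

```python
from typing import Optional

# Journal PRIORITY values are the syslog severities 0-7, carried as strings.
# The level names are an assumption (standard syslog keywords); the real
# pipeline stages may rename or collapse some of them.
PRIORITY_TO_LEVEL = {
    "0": "emerg",
    "1": "alert",
    "2": "crit",
    "3": "err",
    "4": "warning",
    "5": "notice",
    "6": "info",
    "7": "debug",
}


def level_for(priority: Optional[str]) -> Optional[str]:
    """Return the level label for a journal PRIORITY value.

    Returns None for a missing or unknown priority, meaning no level
    label would be attached to that log line.
    """
    if priority is None:
        return None
    return PRIORITY_TO_LEVEL.get(priority)
```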
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a handle directive instead of a path in the site address for the
metrics endpoint, as the latter is deprecated in Caddy 2.10.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Configure Garage object storage on garage01 with S3 API, Vault secrets
for RPC secret and admin token, and Caddy reverse proxy for HTTPS access
at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM
definition, and Vault policy for the host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improve builder logging: build failure output is now logged line by
line instead of as a single JSON blob, so errors are readable in
Loki/Grafana.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.
- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents lock conflicts when multiple backup jobs targeting the same
repository run concurrently. Jobs will now retry acquiring the lock
every 10 seconds for up to 5 minutes before failing.
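The retry behaviour described above can be sketched roughly as follows.
This is a simplified model in Python, not the backup tool's own
implementation: `acquire_with_retry` and `try_lock` are hypothetical
names, and the clock and sleep functions are injectable only so the
loop can be exercised without waiting.

```python
import time
from typing import Callable


def acquire_with_retry(
    try_lock: Callable[[], bool],
    interval: float = 10.0,   # seconds between attempts
    timeout: float = 300.0,   # give up after 5 minutes
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Try to take the repository lock, retrying until the timeout.

    try_lock is a hypothetical callable that returns True once the lock
    has been acquired. Returns False if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        if try_lock():
            return True
        if clock() >= deadline:
            # Out of time: the job fails instead of waiting forever.
            return False
        sleep(interval)
```

With the defaults, a job that never gets the lock makes its first
attempt immediately, retries every 10 seconds, and fails after the
attempt at the 5-minute mark.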
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox
nix-cache02 is now the sole binary cache host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed:
- hosts/nix-cache01/ directory
- services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder)
- Vault secret and AppRole for nix-cache01
- Old signing key variable from terraform
- Old trusted public key from system/nix.nix
Updated:
- flake.nix: removed nixosConfiguration
- README.md: nix-cache01 -> nix-cache02
- Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02
- Simplified proxy.nix (no longer needs hostname conditional)
nix-cache02 is now the sole binary cache host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move nix-cache CNAME from nix-cache01 to nix-cache02
- Remove actions1 CNAME (service removed)
- Update proxy.nix to serve canonical domain on nix-cache02
- Promote nix-cache02 to prod tier with build-host role
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
nix-cache01 serves nix-cache.home.2rjus.net (canonical)
nix-cache02 serves nix-cache02.home.2rjus.net (for testing)
This allows testing nix-cache02 independently before DNS cutover.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The actions runner on nix-cache01 was never actively used.
Removing it before migrating to nix-cache02.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Configure builder to build nixos-servers and nixos (gunter) repos
- Add builder NKey to Vault secrets
- Update NATS permissions for builder, test-deployer, and admin-deployer
- Grant nix-cache02 access to shared homelab-deploy secrets
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New build host to replace nix-cache01 with:
- 8 CPU cores, 16GB RAM, 200GB disk
- Static IP 10.69.13.25
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document findings from a false-positive host_reboot alert caused by an
NTP clock adjustment that shifted the node_boot_time_seconds metric.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The Radarr instance in the TrueNAS jail is too old: exportarr fails on
the /api/v3/wanted/cutoff endpoint (404). Keep Sonarr, which works.
The Vault secret is kept for when Radarr is updated.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Prometheus exportarr exporters for the Radarr and Sonarr media
services. They run on monitoring01 and query the remote APIs.
- Radarr exporter on port 9708
- Sonarr exporter on port 9709
- API keys fetched from Vault
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Vault secrets for the Radarr and Sonarr API keys to enable
exportarr metrics collection on monitoring01.
- services/exportarr/radarr - Radarr API key
- services/exportarr/sonarr - Sonarr API key
- Grant monitoring01 access to services/exportarr/*
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Empty valid_status_codes defaults to 2xx only, not "any".
Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so
services returning 400/401 like ha and nzbget pass the probe.
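The pitfall can be modelled in a few lines of Python. This is a
simplified sketch of the behaviour described above, not the exporter's
code: `status_is_valid` is a hypothetical helper, and any explicit list
passed to it stands in for whatever codes the probe config lists.

```python
from typing import Sequence


def status_is_valid(status: int, valid_status_codes: Sequence[int] = ()) -> bool:
    """Model of the probe's status check.

    An empty valid_status_codes list does not mean "accept anything";
    it falls back to accepting 2xx responses only.
    """
    if not valid_status_codes:
        return 200 <= status <= 299
    return status in valid_status_codes
```

Under this model a 401 response fails the probe when the list is empty
and passes only once 401 is listed explicitly.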
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use joinByField transformation instead of merge to properly align
rows by instance. Also exclude duplicate Time/job columns from join.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard includes:
- Stat panels for endpoints monitored, probe failures, expiring certs
- Gauge showing minimum days until any cert expires
- Table of all endpoints sorted by expiry (color-coded)
- Probe status table with HTTP status and duration
- Time series graphs for expiry trends and probe success rate
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Certificate monitoring only needs the TLS handshake to succeed.
Services like nzbget (401) and ha (400) return non-2xx responses but
have valid certificates.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The rules were already added to rules.yml but the blackbox.nix file
still had them, causing duplicate 'groups' key errors.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move certificate alert rules to rules.yml instead of adding them as a
separate rules string in blackbox.nix. The previous approach caused a
YAML parse error due to duplicate 'groups' keys.
Also add policy to CLAUDE.md: never force push to master.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>