- Switch vmalert from blackhole mode to sending alerts to local
Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
http-proxy (now served directly by monitoring02)
- Add shared/nats/nkey to monitoring02 Vault AppRole policy
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The real .mcp.json now contains Loki credentials for basic auth,
so it should not be committed. The example file has placeholders.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The secret file needs to be owned by promtail since Promtail runs
as a dedicated user and can't read root-owned files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert ns1/ns2 from approle.tf (they're in hosts-generated.tf) and add
loki-push policy to generated AppRoles instead.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
They were missing from the host_policies map, so they didn't get
shared policies like loki-push.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Loki bound to localhost, Caddy reverse proxy with basic_auth
- Vault secret (shared/loki/push-auth) for password, bcrypt hash
generated at boot for Caddy environment
- Promtail dual-ships to monitoring01 (direct) and loki.home.2rjus.net
(with basic auth), conditional on vault.enable
- Terraform: new shared loki-push policy added to all AppRoles
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.
Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).
- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set interval=60s on rate() panels to match the actual Prometheus scrape
interval, so Grafana calculates $__rate_interval correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add gazetteer reference for country code lookup resolution.
Remove unnecessary reduce transformation. Make geomap panel
full-width (24 cols) and taller (h=10) on its own row.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable compactor-based retention with 30-day period to prevent
unbounded disk growth. Add basic rate limits and stream guards
to protect against runaway log generators.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align Promtail labels with Prometheus by adding hostname, tier, and role
static labels to both journal and varlog scrape configs. Add pipeline
stages to map journal PRIORITY field to a level label for reliable
severity filtering across the fleet.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use handle directive instead of path in site address for the metrics
endpoint, as the latter is deprecated in Caddy 2.10.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Configure Garage object storage on garage01 with S3 API, Vault secrets
for RPC secret and admin token, and Caddy reverse proxy for HTTPS access
at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM
definition, and Vault policy for the host.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improves builder logging: build failure output is now logged as
individual lines instead of a single JSON blob, making errors
readable in Loki/Grafana.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.
- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents lock conflicts when multiple backup jobs targeting the same
repository run concurrently. Jobs will now retry acquiring the lock
every 10 seconds for up to 5 minutes before failing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox
nix-cache02 is now the sole binary cache host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed:
- hosts/nix-cache01/ directory
- services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder)
- Vault secret and AppRole for nix-cache01
- Old signing key variable from terraform
- Old trusted public key from system/nix.nix
Updated:
- flake.nix: removed nixosConfiguration
- README.md: nix-cache01 -> nix-cache02
- Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02
- Simplified proxy.nix (no longer needs hostname conditional)
nix-cache02 is now the sole binary cache host.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move nix-cache CNAME from nix-cache01 to nix-cache02
- Remove actions1 CNAME (service removed)
- Update proxy.nix to serve canonical domain on nix-cache02
- Promote nix-cache02 to prod tier with build-host role
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
nix-cache01 serves nix-cache.home.2rjus.net (canonical)
nix-cache02 serves nix-cache02.home.2rjus.net (for testing)
This allows testing nix-cache02 independently before DNS cutover.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>