Commit Graph

996 Commits

Author SHA1 Message Date
74e7c9faa4 monitoring02: add Loki service
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m19s
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:42:19 +01:00
471f536f1f Merge pull request 'victoriametrics-monitoring02' (#40) from victoriametrics-monitoring02 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Periodic flake update / flake-update (push) Successful in 3m29s
Reviewed-on: #40
2026-02-16 23:56:04 +00:00
a013e80f1a terraform: grant monitoring02 access to apiary-token secret
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m59s
Run nix flake check / flake-check (pull_request) Failing after 4m20s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
4cbaa33475 monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.

Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
e329f87b0b monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
c151f31011 grafana: fix apiary dashboard panels empty on short time ranges
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m54s
Set interval=60s on rate() panels to match the actual Prometheus scrape
interval, so Grafana calculates $__rate_interval correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 20:03:26 +01:00
f5362d6936 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)
  → 'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)
  → 'github:nixos/nixpkgs/a82ccc39b39b621151d6732718e3e250109076fa?narHash=sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb%2BZnAo5RzSxJg%3D' (2026-02-13)
2026-02-16 00:07:10 +00:00
3e7aabc73a grafana: fix apiary geomap and make it full-width
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m6s
Periodic flake update / flake-update (push) Successful in 5m25s
Add gazetteer reference for country code lookup resolution.
Remove unnecessary reduce transformation. Make geomap panel
full-width (24 cols) and taller (h=10) on its own row.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:36:24 +01:00
361e7f2a1b grafana: add apiary honeypot dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:31:06 +01:00
1942591d2e monitoring: add apiary metrics scraping with bearer token auth
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m52s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:36:26 +01:00
4d614d8716 docs: add new service candidates and NixOS router plans
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m22s
Periodic flake update / flake-update (push) Failing after 1s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 13:21:34 +01:00
fd7caf7f00 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08)
  → 'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)
2026-02-14 00:01:24 +00:00
af8e385b6e docs: finalize remote access plan with WireGuard gateway design
Some checks failed
Run nix flake check / flake-check (push) Failing after 21m7s
Periodic flake update / flake-update (push) Successful in 2m16s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:31:52 +01:00
0db9fc6802 docs: update Loki improvements plan with implementation status
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m55s
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:04:16 +01:00
5d68662035 loki: add 30-day retention policy and ingestion limits
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enable compactor-based retention with 30-day period to prevent
unbounded disk growth. Add basic rate limits and stream guards
to protect against runaway log generators.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:55:27 +01:00
d485948df0 docs: update Loki queries from host to hostname label
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:43:47 +01:00
7b804450a3 promtail: add hostname/tier/role labels and journal priority level mapping
Align Promtail labels with Prometheus by adding hostname, tier, and role
static labels to both journal and varlog scrape configs. Add pipeline
stages to map journal PRIORITY field to a level label for reliable
severity filtering across the fleet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:40:14 +01:00
2f0dad1acc docs: add JSON logging audit to Loki improvements plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m38s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:44:05 +01:00
1544415ef3 docs: add Loki improvements plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Covers retention policy, limits config, Promtail label improvements
(tier/role/level), and journal PRIORITY extraction. Also adds Alloy
consideration to VictoriaMetrics migration plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:39:16 +01:00
5babd7f507 docs: move garage S3 storage plan to completed
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m36s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:54:23 +01:00
7e0c5fbf0f garage01: fix Caddy metrics deprecation warning
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Use handle directive instead of path in site address for the metrics
endpoint, as the latter is deprecated in Caddy 2.10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:53:48 +01:00
ffaf95d109 terraform: add Vault secret for garage01 environment
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m13s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:27:43 +01:00
b2b6ab4799 garage01: add Garage S3 service with Caddy HTTPS proxy
Configure Garage object storage on garage01 with S3 API, Vault secrets
for RPC secret and admin token, and Caddy reverse proxy for HTTPS access
at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM
definition, and Vault policy for the host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:24:25 +01:00
5d3d93b280 docs: move completed plans to completed folder
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m22s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:08:17 +01:00
ae823e439d monitoring: lower unbound cache hit ratio alert threshold to 20%
Some checks failed
Run nix flake check / flake-check (push) Failing after 9m2s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:55:03 +01:00
0d9f49a3b4 flake.lock: Update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m25s
Improves builder logging: build failure output is now logged as
individual lines instead of a single JSON blob, making errors
readable in Loki/Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:36:18 +01:00
08d9e1ec3f docs: add garage S3 storage plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m26s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:06:53 +01:00
fa8d65b612 nix-cache02: increase builder timeout to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m21s
Periodic flake update / flake-update (push) Successful in 5m17s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:44:55 +01:00
6726f111e3 flake.lock: Update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:42:23 +01:00
3a083285cb flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)
  → 'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)
2026-02-12 00:01:27 +00:00
ed1821b073 nix-cache02: add scheduled builds timer
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m7s
Periodic flake update / flake-update (push) Successful in 2m18s
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.

- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 00:50:09 +01:00
fa4a418007 restic: add --retry-lock=5m to all backup jobs
Some checks failed
Run nix flake check / flake-check (push) Failing after 23m42s
Prevents lock conflicts when multiple backup jobs targeting the same
repository run concurrently. Jobs will now retry acquiring the lock
every 10 seconds for up to 5 minutes before failing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 01:22:00 +01:00
963e5f6d3c flake.lock: Update
Flake lock file updates:

• Updated input 'homelab-deploy':
    'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=a8aab16d0e7400aaa00500d08c12734da3b638e0' (2026-02-10)
  → 'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=c13914bf5acdcda33de63ad5ed9d661e4dc3118c' (2026-02-10)
• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07)
  → 'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)
2026-02-11 00:01:28 +00:00
0bc10cb1fe grafana: add build service panels to nixos-fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m48s
Periodic flake update / flake-update (push) Successful in 2m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:49:50 +01:00
b03e2e8ee4 monitoring: add alerts for homelab-deploy build failures
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:45:07 +01:00
ddcbc30665 docs: mark nix-cache01 decommission complete
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m38s
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:43:12 +01:00
75210805d5 nix-cache01: decommission and remove all references
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Removed:
- hosts/nix-cache01/ directory
- services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder)
- Vault secret and AppRole for nix-cache01
- Old signing key variable from terraform
- Old trusted public key from system/nix.nix

Updated:
- flake.nix: removed nixosConfiguration
- README.md: nix-cache01 -> nix-cache02
- Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02
- Simplified proxy.nix (no longer needs hostname conditional)

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:40:51 +01:00
ade0538717 docs: mark nix-cache DNS cutover complete
Some checks are pending
Run nix flake check / flake-check (push) Has started running
nix-cache.home.2rjus.net now served by nix-cache02.
nix-cache01 ready for decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:34:04 +01:00
83fce5f927 nix-cache: switch DNS to nix-cache02
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Move nix-cache CNAME from nix-cache01 to nix-cache02
- Remove actions1 CNAME (service removed)
- Update proxy.nix to serve canonical domain on nix-cache02
- Promote nix-cache02 to prod tier with build-host role

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:22:23 +01:00
afff3f28ca docs: update nix-cache-reprovision plan with Harmonia progress
Some checks failed
Run nix flake check / flake-check (push) Failing after 52s
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:17:51 +01:00
49f7e3ae2e nix-cache: use hostname-based domain for Caddy proxy
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m18s
nix-cache01 serves nix-cache.home.2rjus.net (canonical)
nix-cache02 serves nix-cache02.home.2rjus.net (for testing)

This allows testing nix-cache02 independently before DNS cutover.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:14:14 +01:00
751edfc11d nix-cache02: add Harmonia binary cache service
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:08:48 +01:00
98a7301985 nix-cache: remove unused Gitea Actions runner
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m23s
The actions runner on nix-cache01 was never actively used.
Removing it before migrating to nix-cache02.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:57:08 +01:00
34efa58cfe Merge pull request 'nix-cache02-builder' (#39) from nix-cache02-builder into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m27s
Reviewed-on: #39
2026-02-10 21:47:58 +00:00
5bfb51a497 docs: add observability phase to nix-cache plan
Some checks failed
Run nix flake check / flake-check (push) Successful in 2m35s
Run nix flake check / flake-check (pull_request) Failing after 16m1s
- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:46:38 +01:00
f83145d97a docs: update nix-cache-reprovision plan with progress
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:43:48 +01:00
47747329c4 nix-cache02: add homelab-deploy builder service
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m51s
- Configure builder to build nixos-servers and nixos (gunter) repos
- Add builder NKey to Vault secrets
- Update NATS permissions for builder, test-deployer, and admin-deployer
- Grant nix-cache02 access to shared homelab-deploy secrets

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:26:40 +01:00
2d9ca2a73f hosts: add nix-cache02 build host
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m26s
New build host to replace nix-cache01 with:
- 8 CPU cores, 16GB RAM, 200GB disk
- Static IP 10.69.13.25

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 21:53:29 +01:00
98ea679ef2 docs: add monitoring02 reboot alert investigation
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m41s
Document findings from false positive host_reboot alert caused by
NTP clock adjustment affecting node_boot_time_seconds metric.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 17:59:53 +01:00
b709c0b703 monitoring: disable radarr exporter (version mismatch)
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m20s
Periodic flake update / flake-update (push) Successful in 2m23s
Radarr on TrueNAS jail is too old - exportarr fails on
/api/v3/wanted/cutoff endpoint (404). Keep sonarr which works.

Vault secret kept for when Radarr is updated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:59:45 +01:00