Commit Graph

960 Commits

Author SHA1 Message Date
e912c75b6c flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14)
  → 'github:nixos/nixpkgs/fa56d7d6de78f5a7f997b0ea2bc6efd5868ad9e8?narHash=sha256-X01Q3DgSpjeBpapoGA4rzKOn25qdKxbPnxHeMLNoHTU%3D' (2026-02-16)
2026-02-18 00:01:34 +00:00
b218b4f8bc docs: update migration plan for monitoring01 and pgdb1 completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m37s
Periodic flake update / flake-update (push) Successful in 2m21s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:26:23 +01:00
65acf13e6f grafana: fix datasource UIDs for VictoriaMetrics migration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all dashboard datasource references from "prometheus" to
"victoriametrics" to match the declared datasource UID. Enable
prune and deleteDatasources to clean up the old Prometheus
(monitoring01) datasource from Grafana's database.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:23:04 +01:00
95a96b2192 Merge pull request 'monitoring01: remove host and migrate services to monitoring02' (#43) from cleanup-monitoring01 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m2s
Reviewed-on: #43
2026-02-17 21:08:00 +00:00
4f593126c0 monitoring01: remove host and migrate services to monitoring02
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m15s
Run nix flake check / flake-check (pull_request) Failing after 3m8s
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:50:20 +01:00
1bba6f106a Merge pull request 'monitoring02: enable alerting and migrate CNAMEs from http-proxy' (#42) from monitoring02-enable-alerting into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m5s
Reviewed-on: #42
2026-02-17 20:24:16 +00:00
a6013d3950 monitoring02: enable alerting and migrate CNAMEs from http-proxy
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m25s
Run nix flake check / flake-check (pull_request) Failing after 3m52s
- Switch vmalert from blackhole mode to sending alerts to local
  Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
  http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
  extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
  only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
  basic auth from Vault secret
- Move migration plan to completed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:23:21 +01:00
7f69c0738a Merge pull request 'loki-monitoring02' (#41) from loki-monitoring02 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m20s
Reviewed-on: #41
2026-02-17 19:40:33 +00:00
35924c7b01 mcp: move config to .mcp.json.example, gitignore real config
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m57s
Run nix flake check / flake-check (pull_request) Failing after 16m45s
The real .mcp.json now contains Loki credentials for basic auth,
so it should not be committed. The example file has placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:35:14 +01:00
87d8571d62 promtail: fix vault secret ownership for loki auth
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m24s
The secret file needs to be owned by promtail since Promtail runs
as a dedicated user and can't read root-owned files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:17:02 +01:00
43c81f6688 terraform: fix loki-push policy for generated hosts
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Revert ns1/ns2 from approle.tf (they're in hosts-generated.tf) and add
loki-push policy to generated AppRoles instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:13:22 +01:00
58f901ad3e terraform: add ns1 and ns2 to AppRole policies
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
They were missing from the host_policies map, so they didn't get
shared policies like loki-push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:10:37 +01:00
c13921d302 loki: add basic auth for log push and dual-ship promtail
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m36s
- Loki bound to localhost, Caddy reverse proxy with basic_auth
- Vault secret (shared/loki/push-auth) for password, bcrypt hash
  generated at boot for Caddy environment
- Promtail dual-ships to monitoring01 (direct) and loki.home.2rjus.net
  (with basic auth), conditional on vault.enable
- Terraform: new shared loki-push policy added to all AppRoles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:00:08 +01:00
2903873d52 monitoring02: add loki CNAME and Caddy reverse proxy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:48:06 +01:00
74e7c9faa4 monitoring02: add Loki service
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m19s
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:42:19 +01:00
471f536f1f Merge pull request 'victoriametrics-monitoring02' (#40) from victoriametrics-monitoring02 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Periodic flake update / flake-update (push) Successful in 3m29s
Reviewed-on: #40
2026-02-16 23:56:04 +00:00
a013e80f1a terraform: grant monitoring02 access to apiary-token secret
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m59s
Run nix flake check / flake-check (pull_request) Failing after 4m20s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
4cbaa33475 monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.

Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
e329f87b0b monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
c151f31011 grafana: fix apiary dashboard panels empty on short time ranges
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m54s
Set interval=60s on rate() panels to match the actual Prometheus scrape
interval, so Grafana calculates $__rate_interval correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 20:03:26 +01:00
f5362d6936 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)
  → 'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)
  → 'github:nixos/nixpkgs/a82ccc39b39b621151d6732718e3e250109076fa?narHash=sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb%2BZnAo5RzSxJg%3D' (2026-02-13)
2026-02-16 00:07:10 +00:00
3e7aabc73a grafana: fix apiary geomap and make it full-width
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m6s
Periodic flake update / flake-update (push) Successful in 5m25s
Add gazetteer reference for country code lookup resolution.
Remove unnecessary reduce transformation. Make geomap panel
full-width (24 cols) and taller (h=10) on its own row.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:36:24 +01:00
361e7f2a1b grafana: add apiary honeypot dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:31:06 +01:00
1942591d2e monitoring: add apiary metrics scraping with bearer token auth
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m52s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:36:26 +01:00
4d614d8716 docs: add new service candidates and NixOS router plans
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m22s
Periodic flake update / flake-update (push) Failing after 1s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 13:21:34 +01:00
fd7caf7f00 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08)
  → 'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)
2026-02-14 00:01:24 +00:00
af8e385b6e docs: finalize remote access plan with WireGuard gateway design
Some checks failed
Run nix flake check / flake-check (push) Failing after 21m7s
Periodic flake update / flake-update (push) Successful in 2m16s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:31:52 +01:00
0db9fc6802 docs: update Loki improvements plan with implementation status
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m55s
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:04:16 +01:00
5d68662035 loki: add 30-day retention policy and ingestion limits
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enable compactor-based retention with 30-day period to prevent
unbounded disk growth. Add basic rate limits and stream guards
to protect against runaway log generators.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:55:27 +01:00
d485948df0 docs: update Loki queries from host to hostname label
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:43:47 +01:00
7b804450a3 promtail: add hostname/tier/role labels and journal priority level mapping
Align Promtail labels with Prometheus by adding hostname, tier, and role
static labels to both journal and varlog scrape configs. Add pipeline
stages to map journal PRIORITY field to a level label for reliable
severity filtering across the fleet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 23:40:14 +01:00
2f0dad1acc docs: add JSON logging audit to Loki improvements plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m38s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:44:05 +01:00
1544415ef3 docs: add Loki improvements plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Covers retention policy, limits config, Promtail label improvements
(tier/role/level), and journal PRIORITY extraction. Also adds Alloy
consideration to VictoriaMetrics migration plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:39:16 +01:00
5babd7f507 docs: move garage S3 storage plan to completed
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m36s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:54:23 +01:00
7e0c5fbf0f garage01: fix Caddy metrics deprecation warning
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Use handle directive instead of path in site address for the metrics
endpoint, as the latter is deprecated in Caddy 2.10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:53:48 +01:00
ffaf95d109 terraform: add Vault secret for garage01 environment
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m13s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:27:43 +01:00
b2b6ab4799 garage01: add Garage S3 service with Caddy HTTPS proxy
Configure Garage object storage on garage01 with S3 API, Vault secrets
for RPC secret and admin token, and Caddy reverse proxy for HTTPS access
at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM
definition, and Vault policy for the host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:24:25 +01:00
5d3d93b280 docs: move completed plans to completed folder
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m22s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:08:17 +01:00
ae823e439d monitoring: lower unbound cache hit ratio alert threshold to 20%
Some checks failed
Run nix flake check / flake-check (push) Failing after 9m2s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:55:03 +01:00
0d9f49a3b4 flake.lock: Update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m25s
Improves builder logging: build failure output is now logged as
individual lines instead of a single JSON blob, making errors
readable in Loki/Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:36:18 +01:00
08d9e1ec3f docs: add garage S3 storage plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m26s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:06:53 +01:00
fa8d65b612 nix-cache02: increase builder timeout to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m21s
Periodic flake update / flake-update (push) Successful in 5m17s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:44:55 +01:00
6726f111e3 flake.lock: Update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:42:23 +01:00
3a083285cb flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)
  → 'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)
2026-02-12 00:01:27 +00:00
ed1821b073 nix-cache02: add scheduled builds timer
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m7s
Periodic flake update / flake-update (push) Successful in 2m18s
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.

- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 00:50:09 +01:00
fa4a418007 restic: add --retry-lock=5m to all backup jobs
Some checks failed
Run nix flake check / flake-check (push) Failing after 23m42s
Prevents lock conflicts when multiple backup jobs targeting the same
repository run concurrently. Jobs will now keep retrying to acquire the
lock for up to 5 minutes before failing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 01:22:00 +01:00
963e5f6d3c flake.lock: Update
Flake lock file updates:

• Updated input 'homelab-deploy':
    'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=a8aab16d0e7400aaa00500d08c12734da3b638e0' (2026-02-10)
  → 'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=c13914bf5acdcda33de63ad5ed9d661e4dc3118c' (2026-02-10)
• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07)
  → 'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)
2026-02-11 00:01:28 +00:00
0bc10cb1fe grafana: add build service panels to nixos-fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m48s
Periodic flake update / flake-update (push) Successful in 2m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:49:50 +01:00
b03e2e8ee4 monitoring: add alerts for homelab-deploy build failures
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:45:07 +01:00
ddcbc30665 docs: mark nix-cache01 decommission complete
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m38s
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:43:12 +01:00