nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	4087083926	monitoring02: enable alerting and migrate CNAMEs from http-proxy - Switch vmalert from blackhole mode to sending alerts to local Alertmanager - Import alerttonotify service so alerts route to NATS notifications - Move alertmanager and grafana CNAMEs from http-proxy to monitoring02 - Add monitoring CNAME to monitoring02 - Add Caddy reverse proxy entries for alertmanager and grafana - Remove prometheus, alertmanager, and grafana Caddy entries from http-proxy (now served directly by monitoring02) - Add shared/nats/nkey to monitoring02 Vault AppRole policy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:07:15 +01:00
Torjus Håkestad	7f69c0738a	Merge pull request 'loki-monitoring02' (#41) from loki-monitoring02 into master Reviewed-on: #41	2026-02-17 19:40:33 +00:00
Torjus Håkestad	35924c7b01	mcp: move config to .mcp.json.example, gitignore real config The real .mcp.json now contains Loki credentials for basic auth, so it should not be committed. The example file has placeholders. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 20:35:14 +01:00
Torjus Håkestad	87d8571d62	promtail: fix vault secret ownership for loki auth The secret file needs to be owned by promtail since Promtail runs as a dedicated user and can't read root-owned files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 20:17:02 +01:00
Torjus Håkestad	43c81f6688	terraform: fix loki-push policy for generated hosts Revert ns1/ns2 from approle.tf (they're in hosts-generated.tf) and add loki-push policy to generated AppRoles instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 20:13:22 +01:00
Torjus Håkestad	58f901ad3e	terraform: add ns1 and ns2 to AppRole policies They were missing from the host_policies map, so they didn't get shared policies like loki-push. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 20:10:37 +01:00
Torjus Håkestad	c13921d302	loki: add basic auth for log push and dual-ship promtail - Loki bound to localhost, Caddy reverse proxy with basic_auth - Vault secret (shared/loki/push-auth) for password, bcrypt hash generated at boot for Caddy environment - Promtail dual-ships to monitoring01 (direct) and loki.home.2rjus.net (with basic auth), conditional on vault.enable - Terraform: new shared loki-push policy added to all AppRoles Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 20:00:08 +01:00
Torjus Håkestad	2903873d52	monitoring02: add loki CNAME and Caddy reverse proxy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 19:48:06 +01:00
Torjus Håkestad	74e7c9faa4	monitoring02: add Loki service Add standalone Loki service module (services/loki/) with same config as monitoring01 and import it on monitoring02. Update Grafana Loki datasource to localhost. Defer Tempo and Pyroscope migration (not actively used). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 19:42:19 +01:00
Torjus Håkestad	471f536f1f	Merge pull request 'victoriametrics-monitoring02' (#40) from victoriametrics-monitoring02 into master Reviewed-on: #40	2026-02-16 23:56:04 +00:00
Torjus Håkestad	a013e80f1a	terraform: grant monitoring02 access to apiary-token secret Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:55:08 +01:00
Torjus Håkestad	4cbaa33475	monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with Caddy TLS termination via internal ACME CA. Refactors Grafana's Caddy config from configFile to globalConfig + virtualHosts so both modules can contribute routes to the same Caddy instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:55:08 +01:00
Torjus Håkestad	e329f87b0b	monitoring02: add VictoriaMetrics, vmalert, and Alertmanager Set up the core metrics stack on monitoring02 as Phase 2 of the monitoring migration. VictoriaMetrics replaces Prometheus with identical scrape configs (22 jobs including auto-generated targets). - VictoriaMetrics with 3-month retention and all scrape configs - vmalert evaluating existing rules.yml (notifier disabled) - Alertmanager with same routing config (no alerts during parallel op) - Grafana datasources updated: local VictoriaMetrics as default - Static user override for credential file access (OpenBao, Apiary) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:55:08 +01:00
Torjus Håkestad	c151f31011	grafana: fix apiary dashboard panels empty on short time ranges Set interval=60s on rate() panels to match the actual Prometheus scrape interval, so Grafana calculates $__rate_interval correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 20:03:26 +01:00
torjus-bot	f5362d6936	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11) → 'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14) • Updated input 'nixpkgs-unstable': 'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11) → 'github:nixos/nixpkgs/a82ccc39b39b621151d6732718e3e250109076fa?narHash=sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb%2BZnAo5RzSxJg%3D' (2026-02-13)	2026-02-16 00:07:10 +00:00
Torjus Håkestad	3e7aabc73a	grafana: fix apiary geomap and make it full-width Add gazetteer reference for country code lookup resolution. Remove unnecessary reduce transformation. Make geomap panel full-width (24 cols) and taller (h=10) on its own row. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 21:36:24 +01:00
Torjus Håkestad	361e7f2a1b	grafana: add apiary honeypot dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 21:31:06 +01:00
Torjus Håkestad	1942591d2e	monitoring: add apiary metrics scraping with bearer token auth Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 16:36:26 +01:00
Torjus Håkestad	4d614d8716	docs: add new service candidates and NixOS router plans Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 13:21:34 +01:00
torjus-bot	fd7caf7f00	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs-unstable': 'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08) → 'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)	2026-02-14 00:01:24 +00:00
Torjus Håkestad	af8e385b6e	docs: finalize remote access plan with WireGuard gateway design Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 00:31:52 +01:00
Torjus Håkestad	0db9fc6802	docs: update Loki improvements plan with implementation status Mark retention, limits, labels, and level mapping as done. Add JSON logging audit results with per-service details. Update current state and disk usage notes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 00:04:16 +01:00
Torjus Håkestad	5d68662035	loki: add 30-day retention policy and ingestion limits Enable compactor-based retention with 30-day period to prevent unbounded disk growth. Add basic rate limits and stream guards to protect against runaway log generators. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:55:27 +01:00
Torjus Håkestad	d485948df0	docs: update Loki queries from host to hostname label Update all LogQL examples, agent instructions, and scripts to use the hostname label instead of host, matching the Prometheus label naming convention. Also update pipe-to-loki and bootstrap scripts to push hostname instead of host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:43:47 +01:00
Torjus Håkestad	7b804450a3	promtail: add hostname/tier/role labels and journal priority level mapping Align Promtail labels with Prometheus by adding hostname, tier, and role static labels to both journal and varlog scrape configs. Add pipeline stages to map journal PRIORITY field to a level label for reliable severity filtering across the fleet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:40:14 +01:00
Torjus Håkestad	2f0dad1acc	docs: add JSON logging audit to Loki improvements plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:44:05 +01:00
Torjus Håkestad	1544415ef3	docs: add Loki improvements plan Covers retention policy, limits config, Promtail label improvements (tier/role/level), and journal PRIORITY extraction. Also adds Alloy consideration to VictoriaMetrics migration plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:39:16 +01:00
Torjus Håkestad	5babd7f507	docs: move garage S3 storage plan to completed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:54:23 +01:00
Torjus Håkestad	7e0c5fbf0f	garage01: fix Caddy metrics deprecation warning Use handle directive instead of path in site address for the metrics endpoint, as the latter is deprecated in Caddy 2.10. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:53:48 +01:00
Torjus Håkestad	ffaf95d109	terraform: add Vault secret for garage01 environment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:27:43 +01:00
Torjus Håkestad	b2b6ab4799	garage01: add Garage S3 service with Caddy HTTPS proxy Configure Garage object storage on garage01 with S3 API, Vault secrets for RPC secret and admin token, and Caddy reverse proxy for HTTPS access at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM definition, and Vault policy for the host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:24:25 +01:00
Torjus Håkestad	5d3d93b280	docs: move completed plans to completed folder Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:08:17 +01:00
Torjus Håkestad	ae823e439d	monitoring: lower unbound cache hit ratio alert threshold to 20% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:55:03 +01:00
Torjus Håkestad	0d9f49a3b4	flake.lock: Update homelab-deploy Improves builder logging: build failure output is now logged as individual lines instead of a single JSON blob, making errors readable in Loki/Grafana. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:36:18 +01:00
Torjus Håkestad	08d9e1ec3f	docs: add garage S3 storage plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:06:53 +01:00
Torjus Håkestad	fa8d65b612	nix-cache02: increase builder timeout to 2 hours Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 14:44:55 +01:00
Torjus Håkestad	6726f111e3	flake.lock: Update homelab-deploy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 14:42:23 +01:00
torjus-bot	3a083285cb	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09) → 'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)	2026-02-12 00:01:27 +00:00
Torjus Håkestad	ed1821b073	nix-cache02: add scheduled builds timer Add a systemd timer that triggers builds for all hosts every 2 hours via NATS, keeping the binary cache warm. - Add scheduler.nix with timer (every 2h) and oneshot service - Add scheduler NATS user to DEPLOY account - Add Vault secret and variable for scheduler NKey - Increase nix-cache02 memory from 16GB to 20GB Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 00:50:09 +01:00
Torjus Håkestad	fa4a418007	restic: add --retry-lock=5m to all backup jobs Prevents lock conflicts when multiple backup jobs targeting the same repository run concurrently. Jobs will now retry acquiring the lock every 10 seconds for up to 5 minutes before failing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 01:22:00 +01:00
torjus-bot	963e5f6d3c	flake.lock: Update Flake lock file updates: • Updated input 'homelab-deploy': 'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=a8aab16d0e7400aaa00500d08c12734da3b638e0' (2026-02-10) → 'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=c13914bf5acdcda33de63ad5ed9d661e4dc3118c' (2026-02-10) • Updated input 'nixpkgs': 'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07) → 'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)	2026-02-11 00:01:28 +00:00
Torjus Håkestad	0bc10cb1fe	grafana: add build service panels to nixos-fleet dashboard Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:49:50 +01:00
Torjus Håkestad	b03e2e8ee4	monitoring: add alerts for homelab-deploy build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:45:07 +01:00
Torjus Håkestad	ddcbc30665	docs: mark nix-cache01 decommission complete Phase 4 fully complete. nix-cache01 has been: - Removed from repo (host config, build scripts, flake entry) - Vault resources cleaned up - VM deleted from Proxmox nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:43:12 +01:00
Torjus Håkestad	75210805d5	nix-cache01: decommission and remove all references Removed: - hosts/nix-cache01/ directory - services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder) - Vault secret and AppRole for nix-cache01 - Old signing key variable from terraform - Old trusted public key from system/nix.nix Updated: - flake.nix: removed nixosConfiguration - README.md: nix-cache01 -> nix-cache02 - Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02 - Simplified proxy.nix (no longer needs hostname conditional) nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:40:51 +01:00
Torjus Håkestad	ade0538717	docs: mark nix-cache DNS cutover complete nix-cache.home.2rjus.net now served by nix-cache02. nix-cache01 ready for decommission. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:34:04 +01:00
Torjus Håkestad	83fce5f927	nix-cache: switch DNS to nix-cache02 - Move nix-cache CNAME from nix-cache01 to nix-cache02 - Remove actions1 CNAME (service removed) - Update proxy.nix to serve canonical domain on nix-cache02 - Promote nix-cache02 to prod tier with build-host role Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:22:23 +01:00
Torjus Håkestad	afff3f28ca	docs: update nix-cache-reprovision plan with Harmonia progress - Phase 4 now in progress - Harmonia configured on nix-cache02 with new signing key - Trusted public key deployed to all hosts - Cache tested successfully from testvm01 - Actions runner removed from scope Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:17:51 +01:00
Torjus Håkestad	49f7e3ae2e	nix-cache: use hostname-based domain for Caddy proxy nix-cache01 serves nix-cache.home.2rjus.net (canonical) nix-cache02 serves nix-cache02.home.2rjus.net (for testing) This allows testing nix-cache02 independently before DNS cutover. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:14:14 +01:00
Torjus Håkestad	751edfc11d	nix-cache02: add Harmonia binary cache service - Parameterize harmonia.nix to use hostname-based Vault paths - Add nix-cache services to nix-cache02 - Add Vault secret and variable for nix-cache02 signing key - Add nix-cache02 public key to trusted-public-keys on all hosts - Update plan doc to remove actions runner references Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:08:48 +01:00


954 Commits