nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	1942591d2e	monitoring: add apiary metrics scraping with bearer token auth Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 16:36:26 +01:00
Torjus Håkestad	4d614d8716	docs: add new service candidates and NixOS router plans Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 13:21:34 +01:00
torjus-bot	fd7caf7f00	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs-unstable': 'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08) → 'github:nixos/nixpkgs/ec7c70d12ce2fc37cb92aff673dcdca89d187bae?narHash=sha256-9xejG0KoqsoKEGp2kVbXRlEYtFFcDTHjidiuX8hGO44%3D' (2026-02-11)	2026-02-14 00:01:24 +00:00
Torjus Håkestad	af8e385b6e	docs: finalize remote access plan with WireGuard gateway design Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 00:31:52 +01:00
Torjus Håkestad	0db9fc6802	docs: update Loki improvements plan with implementation status Mark retention, limits, labels, and level mapping as done. Add JSON logging audit results with per-service details. Update current state and disk usage notes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 00:04:16 +01:00
Torjus Håkestad	5d68662035	loki: add 30-day retention policy and ingestion limits Enable compactor-based retention with 30-day period to prevent unbounded disk growth. Add basic rate limits and stream guards to protect against runaway log generators. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:55:27 +01:00
Torjus Håkestad	d485948df0	docs: update Loki queries from host to hostname label Update all LogQL examples, agent instructions, and scripts to use the hostname label instead of host, matching the Prometheus label naming convention. Also update pipe-to-loki and bootstrap scripts to push hostname instead of host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:43:47 +01:00
Torjus Håkestad	7b804450a3	promtail: add hostname/tier/role labels and journal priority level mapping Align Promtail labels with Prometheus by adding hostname, tier, and role static labels to both journal and varlog scrape configs. Add pipeline stages to map journal PRIORITY field to a level label for reliable severity filtering across the fleet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:40:14 +01:00
Torjus Håkestad	2f0dad1acc	docs: add JSON logging audit to Loki improvements plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:44:05 +01:00
Torjus Håkestad	1544415ef3	docs: add Loki improvements plan Covers retention policy, limits config, Promtail label improvements (tier/role/level), and journal PRIORITY extraction. Also adds Alloy consideration to VictoriaMetrics migration plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:39:16 +01:00
Torjus Håkestad	5babd7f507	docs: move garage S3 storage plan to completed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:54:23 +01:00
Torjus Håkestad	7e0c5fbf0f	garage01: fix Caddy metrics deprecation warning Use handle directive instead of path in site address for the metrics endpoint, as the latter is deprecated in Caddy 2.10. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:53:48 +01:00
Torjus Håkestad	ffaf95d109	terraform: add Vault secret for garage01 environment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:27:43 +01:00
Torjus Håkestad	b2b6ab4799	garage01: add Garage S3 service with Caddy HTTPS proxy Configure Garage object storage on garage01 with S3 API, Vault secrets for RPC secret and admin token, and Caddy reverse proxy for HTTPS access at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM definition, and Vault policy for the host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:24:25 +01:00
Torjus Håkestad	5d3d93b280	docs: move completed plans to completed folder Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:08:17 +01:00
Torjus Håkestad	ae823e439d	monitoring: lower unbound cache hit ratio alert threshold to 20% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:55:03 +01:00
Torjus Håkestad	0d9f49a3b4	flake.lock: Update homelab-deploy Improves builder logging: build failure output is now logged as individual lines instead of a single JSON blob, making errors readable in Loki/Grafana. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:36:18 +01:00
Torjus Håkestad	08d9e1ec3f	docs: add garage S3 storage plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:06:53 +01:00
Torjus Håkestad	fa8d65b612	nix-cache02: increase builder timeout to 2 hours Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 14:44:55 +01:00
Torjus Håkestad	6726f111e3	flake.lock: Update homelab-deploy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 14:42:23 +01:00
torjus-bot	3a083285cb	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09) → 'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)	2026-02-12 00:01:27 +00:00
Torjus Håkestad	ed1821b073	nix-cache02: add scheduled builds timer Add a systemd timer that triggers builds for all hosts every 2 hours via NATS, keeping the binary cache warm. - Add scheduler.nix with timer (every 2h) and oneshot service - Add scheduler NATS user to DEPLOY account - Add Vault secret and variable for scheduler NKey - Increase nix-cache02 memory from 16GB to 20GB Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 00:50:09 +01:00
Torjus Håkestad	fa4a418007	restic: add --retry-lock=5m to all backup jobs Prevents lock conflicts when multiple backup jobs targeting the same repository run concurrently. Jobs will now retry acquiring the lock every 10 seconds for up to 5 minutes before failing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 01:22:00 +01:00
torjus-bot	963e5f6d3c	flake.lock: Update Flake lock file updates: • Updated input 'homelab-deploy': 'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=a8aab16d0e7400aaa00500d08c12734da3b638e0' (2026-02-10) → 'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=c13914bf5acdcda33de63ad5ed9d661e4dc3118c' (2026-02-10) • Updated input 'nixpkgs': 'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07) → 'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)	2026-02-11 00:01:28 +00:00
Torjus Håkestad	0bc10cb1fe	grafana: add build service panels to nixos-fleet dashboard Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:49:50 +01:00
Torjus Håkestad	b03e2e8ee4	monitoring: add alerts for homelab-deploy build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:45:07 +01:00
Torjus Håkestad	ddcbc30665	docs: mark nix-cache01 decommission complete Phase 4 fully complete. nix-cache01 has been: - Removed from repo (host config, build scripts, flake entry) - Vault resources cleaned up - VM deleted from Proxmox nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:43:12 +01:00
Torjus Håkestad	75210805d5	nix-cache01: decommission and remove all references Removed: - hosts/nix-cache01/ directory - services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder) - Vault secret and AppRole for nix-cache01 - Old signing key variable from terraform - Old trusted public key from system/nix.nix Updated: - flake.nix: removed nixosConfiguration - README.md: nix-cache01 -> nix-cache02 - Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02 - Simplified proxy.nix (no longer needs hostname conditional) nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:40:51 +01:00
Torjus Håkestad	ade0538717	docs: mark nix-cache DNS cutover complete nix-cache.home.2rjus.net now served by nix-cache02. nix-cache01 ready for decommission. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:34:04 +01:00
Torjus Håkestad	83fce5f927	nix-cache: switch DNS to nix-cache02 - Move nix-cache CNAME from nix-cache01 to nix-cache02 - Remove actions1 CNAME (service removed) - Update proxy.nix to serve canonical domain on nix-cache02 - Promote nix-cache02 to prod tier with build-host role Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:22:23 +01:00
Torjus Håkestad	afff3f28ca	docs: update nix-cache-reprovision plan with Harmonia progress - Phase 4 now in progress - Harmonia configured on nix-cache02 with new signing key - Trusted public key deployed to all hosts - Cache tested successfully from testvm01 - Actions runner removed from scope Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:17:51 +01:00
Torjus Håkestad	49f7e3ae2e	nix-cache: use hostname-based domain for Caddy proxy nix-cache01 serves nix-cache.home.2rjus.net (canonical) nix-cache02 serves nix-cache02.home.2rjus.net (for testing) This allows testing nix-cache02 independently before DNS cutover. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:14:14 +01:00
Torjus Håkestad	751edfc11d	nix-cache02: add Harmonia binary cache service - Parameterize harmonia.nix to use hostname-based Vault paths - Add nix-cache services to nix-cache02 - Add Vault secret and variable for nix-cache02 signing key - Add nix-cache02 public key to trusted-public-keys on all hosts - Update plan doc to remove actions runner references Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:08:48 +01:00
Torjus Håkestad	98a7301985	nix-cache: remove unused Gitea Actions runner The actions runner on nix-cache01 was never actively used. Removing it before migrating to nix-cache02. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:57:08 +01:00
Torjus Håkestad	34efa58cfe	Merge pull request 'nix-cache02-builder' (#39) from nix-cache02-builder into master Reviewed-on: #39	2026-02-10 21:47:58 +00:00
Torjus Håkestad	5bfb51a497	docs: add observability phase to nix-cache plan - Add Phase 6 for alerting and Grafana dashboards - Document available Prometheus metrics - Include example alerting rules for build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:46:38 +01:00
Torjus Håkestad	f83145d97a	docs: update nix-cache-reprovision plan with progress - Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete - Document nix-cache02 configuration and tested build times - Add remaining work for Harmonia, Actions runner, and DNS cutover - Enable --enable-builds flag in MCP config Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:43:48 +01:00
Torjus Håkestad	47747329c4	nix-cache02: add homelab-deploy builder service - Configure builder to build nixos-servers and nixos (gunter) repos - Add builder NKey to Vault secrets - Update NATS permissions for builder, test-deployer, and admin-deployer - Grant nix-cache02 access to shared homelab-deploy secrets Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:26:40 +01:00
Torjus Håkestad	2d9ca2a73f	hosts: add nix-cache02 build host New build host to replace nix-cache01 with: - 8 CPU cores, 16GB RAM, 200GB disk - Static IP 10.69.13.25 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 21:53:29 +01:00
Torjus Håkestad	98ea679ef2	docs: add monitoring02 reboot alert investigation Document findings from false positive host_reboot alert caused by NTP clock adjustment affecting node_boot_time_seconds metric. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 17:59:53 +01:00
Torjus Håkestad	b709c0b703	monitoring: disable radarr exporter (version mismatch) Radarr on TrueNAS jail is too old - exportarr fails on /api/v3/wanted/cutoff endpoint (404). Keep sonarr which works. Vault secret kept for when Radarr is updated. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:59:45 +01:00
Torjus Håkestad	33c5d5b3f0	monitoring: add exportarr for radarr/sonarr metrics Add prometheus exportarr exporters for Radarr and Sonarr media services. Runs on monitoring01, queries remote APIs. - Radarr exporter on port 9708 - Sonarr exporter on port 9709 - API keys fetched from Vault Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:56:03 +01:00
Torjus Håkestad	0a28c5f495	terraform: add radarr/sonarr API keys for exportarr Add vault secrets for Radarr and Sonarr API keys to enable exportarr metrics collection on monitoring01. - services/exportarr/radarr - Radarr API key - services/exportarr/sonarr - Sonarr API key - Grant monitoring01 access to services/exportarr/* Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:52:34 +01:00
Torjus Håkestad	9bd48e0808	monitoring: explicitly list valid HTTP status codes Empty valid_status_codes defaults to 2xx only, not "any". Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so services returning 400/401 like ha and nzbget pass the probe. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:41:47 +01:00
Torjus Håkestad	1460eea700	grafana: fix probe status table join Use joinByField transformation instead of merge to properly align rows by instance. Also exclude duplicate Time/job columns from join. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:38:02 +01:00
Torjus Håkestad	98c4f54f94	grafana: add TLS certificates dashboard Dashboard includes: - Stat panels for endpoints monitored, probe failures, expiring certs - Gauge showing minimum days until any cert expires - Table of all endpoints sorted by expiry (color-coded) - Probe status table with HTTP status and duration - Time series graphs for expiry trends and probe success rate Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:35:44 +01:00
Torjus Håkestad	d1b0a5dc20	monitoring: accept any HTTP status in TLS probe Only care about TLS handshake success for certificate monitoring. Services like nzbget (401) and ha (400) return non-2xx but have valid certificates. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:33:45 +01:00
Torjus Håkestad	4d32707130	monitoring: remove duplicate rules from blackbox.nix The rules were already added to rules.yml but the blackbox.nix file still had them, causing duplicate 'groups' key errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:28:42 +01:00
Torjus Håkestad	8e1753c2c8	monitoring: fix blackbox rules and add force-push policy Move certificate alert rules to rules.yml instead of adding them as a separate rules string in blackbox.nix. The previous approach caused a YAML parse error due to duplicate 'groups' keys. Also add policy to CLAUDE.md: never force push to master. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:26:05 +01:00
Torjus Håkestad	75e4fb61a5	monitoring: add blackbox exporter for TLS certificate monitoring Add blackbox exporter to monitoring01 to probe TLS endpoints and alert on expiring certificates. Monitors all ACME-managed certificates from OpenBao PKI including Caddy auto-TLS services. Alerts: - tls_certificate_expiring_soon (< 7 days, warning) - tls_certificate_expiring_critical (< 24h, critical) - tls_probe_failed (connectivity issues) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:21:42 +01:00

987 Commits