nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	f7b1a18579	dns: remove old media PC entry The old Ubuntu media PC (10.69.31.50) is retired, replaced by media1 which auto-registers via its NixOS static IP config. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 19:55:02 +01:00
Torjus Håkestad	5c111c8d78	unbound: tune timeouts for faster recovery after network outages Lower infra-host-ttl (900s → 120s) and tcp-reuse-timeout (60s → 15s) so unbound recovers faster from upstream TLS forwarder failures instead of staying stuck after ISP outages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 01:53:11 +01:00
Torjus Håkestad	d1516ddd66	forgejo: upgrade from LTS to stable (11.0.10 → 14.0.2) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 23:03:51 +01:00
Torjus Håkestad	117e54a849	actions-runner: add Forgejo runner to nix-cache02 with Vault token Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 00:41:24 +01:00
Torjus Håkestad	ff5f166855	actions-runner: trust podman interfaces in firewall Allow containers to reach the runner's cache service by trusting podman network interfaces. Uses "podman+" wildcard to match any podman-prefixed interface regardless of name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 00:05:27 +01:00
Torjus Håkestad	456a0703a9	actions-runner: use custom golang runner image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 21:29:30 +01:00
Torjus Håkestad	ad408c2981	actions-runner: add golang runner image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 21:02:13 +01:00
Torjus Håkestad	cb7a25fef5	actions-runner: use custom nix runner image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 20:30:37 +01:00
Torjus Håkestad	d2373b5e37	actions-runner: fix cache dir for DynamicUser Move cache directory under the managed state directory since the service runs with DynamicUser and cannot create /var/cache paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 23:27:00 +01:00
Torjus Håkestad	93aa91f307	nrec-nixos02: add Forgejo Actions runner with Podman Adds a container-based Forgejo Actions runner on nrec-nixos02 connecting to code.t-juice.club, using Podman for sandboxed job execution with nix, node-bookworm, and alpine labels. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 23:17:27 +01:00
Torjus Håkestad	00f46af628	nrec-nixos01: use code.t-juice.club for Forgejo Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 18:50:54 +01:00
Torjus Håkestad	01906e81f9	nrec-nixos01: use lfs.enable instead of raw setting The NixOS module's lfs.enable option properly handles LFS JWT secret generation via forgejo-secrets.service, fixing the permission denied error on app.ini. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 15:15:35 +01:00
Torjus Håkestad	09ec4f9e8c	nrec-nixos01: enable Git LFS and hide explore page Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 15:12:26 +01:00
Torjus Håkestad	cfc0c6f6cb	nrec-nixos01: add Forgejo with Caddy reverse proxy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 14:49:48 +01:00
Torjus Håkestad	d2a4e4a0a1	grafana: add storage query performance panels to apiary dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 22:47:30 +01:00
Torjus Håkestad	813c5c0f29	monitoring: separate node-exporter-only external targets Add nodeExporterOnly list to external-targets.nix for hosts that have node-exporter but not systemd-exporter (e.g. pve1). This prevents a down target in the systemd-exporter scrape job. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 19:17:39 +01:00
Torjus Håkestad	013ab8f621	monitoring: add pve1 node-exporter scrape target Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 19:10:54 +01:00
Torjus Håkestad	2ca2509083	monitoring: increase filesystem_filling_up prediction window to 24h Reduces false positives from transient Nix store growth by basing the linear prediction on a 24h trend instead of 6h. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 09:36:27 +01:00
Torjus Håkestad	65acf13e6f	grafana: fix datasource UIDs for VictoriaMetrics migration Update all dashboard datasource references from "prometheus" to "victoriametrics" to match the declared datasource UID. Enable prune and deleteDatasources to clean up the old Prometheus (monitoring01) datasource from Grafana's database. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 22:23:04 +01:00
Torjus Håkestad	4f593126c0	monitoring01: remove host and migrate services to monitoring02 Remove monitoring01 host configuration and unused service modules (prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox, exportarr, and pve exporters to monitoring02 with scrape configs moved to VictoriaMetrics. Update alert rules, terraform vault policies/secrets, http-proxy entries, and documentation to reflect the monitoring02 migration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:50:20 +01:00
Torjus Håkestad	a6013d3950	monitoring02: enable alerting and migrate CNAMEs from http-proxy - Switch vmalert from blackhole mode to sending alerts to local Alertmanager - Import alerttonotify service so alerts route to NATS notifications - Move alertmanager and grafana CNAMEs from http-proxy to monitoring02 - Add monitoring CNAME to monitoring02 - Add Caddy reverse proxy entries for alertmanager and grafana - Remove prometheus, alertmanager, and grafana Caddy entries from http-proxy (now served directly by monitoring02) - Move monitoring02 Vault AppRole to hosts-generated.tf with extra_policies support and prometheus-metrics policy - Update Promtail to use authenticated loki.home.2rjus.net endpoint only (remove unauthenticated monitoring01 client) - Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with basic auth from Vault secret - Move migration plan to completed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:23:21 +01:00
Torjus Håkestad	c13921d302	loki: add basic auth for log push and dual-ship promtail - Loki bound to localhost, Caddy reverse proxy with basic_auth - Vault secret (shared/loki/push-auth) for password, bcrypt hash generated at boot for Caddy environment - Promtail dual-ships to monitoring01 (direct) and loki.home.2rjus.net (with basic auth), conditional on vault.enable - Terraform: new shared loki-push policy added to all AppRoles Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 20:00:08 +01:00
Torjus Håkestad	2903873d52	monitoring02: add loki CNAME and Caddy reverse proxy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 19:48:06 +01:00
Torjus Håkestad	74e7c9faa4	monitoring02: add Loki service Add standalone Loki service module (services/loki/) with same config as monitoring01 and import it on monitoring02. Update Grafana Loki datasource to localhost. Defer Tempo and Pyroscope migration (not actively used). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 19:42:19 +01:00
Torjus Håkestad	4cbaa33475	monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with Caddy TLS termination via internal ACME CA. Refactors Grafana's Caddy config from configFile to globalConfig + virtualHosts so both modules can contribute routes to the same Caddy instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:55:08 +01:00
Torjus Håkestad	e329f87b0b	monitoring02: add VictoriaMetrics, vmalert, and Alertmanager Set up the core metrics stack on monitoring02 as Phase 2 of the monitoring migration. VictoriaMetrics replaces Prometheus with identical scrape configs (22 jobs including auto-generated targets). - VictoriaMetrics with 3-month retention and all scrape configs - vmalert evaluating existing rules.yml (notifier disabled) - Alertmanager with same routing config (no alerts during parallel op) - Grafana datasources updated: local VictoriaMetrics as default - Static user override for credential file access (OpenBao, Apiary) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:55:08 +01:00
Torjus Håkestad	c151f31011	grafana: fix apiary dashboard panels empty on short time ranges Set interval=60s on rate() panels to match the actual Prometheus scrape interval, so Grafana calculates $__rate_interval correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 20:03:26 +01:00
Torjus Håkestad	3e7aabc73a	grafana: fix apiary geomap and make it full-width Add gazetteer reference for country code lookup resolution. Remove unnecessary reduce transformation. Make geomap panel full-width (24 cols) and taller (h=10) on its own row. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 21:36:24 +01:00
Torjus Håkestad	361e7f2a1b	grafana: add apiary honeypot dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 21:31:06 +01:00
Torjus Håkestad	1942591d2e	monitoring: add apiary metrics scraping with bearer token auth Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 16:36:26 +01:00
Torjus Håkestad	5d68662035	loki: add 30-day retention policy and ingestion limits Enable compactor-based retention with 30-day period to prevent unbounded disk growth. Add basic rate limits and stream guards to protect against runaway log generators. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:55:27 +01:00
Torjus Håkestad	7e0c5fbf0f	garage01: fix Caddy metrics deprecation warning Use handle directive instead of path in site address for the metrics endpoint, as the latter is deprecated in Caddy 2.10. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:53:48 +01:00
Torjus Håkestad	b2b6ab4799	garage01: add Garage S3 service with Caddy HTTPS proxy Configure Garage object storage on garage01 with S3 API, Vault secrets for RPC secret and admin token, and Caddy reverse proxy for HTTPS access at s3.home.2rjus.net via internal ACME CA. Includes flake entry, VM definition, and Vault policy for the host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:24:25 +01:00
Torjus Håkestad	ae823e439d	monitoring: lower unbound cache hit ratio alert threshold to 20% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:55:03 +01:00
Torjus Håkestad	ed1821b073	nix-cache02: add scheduled builds timer Add a systemd timer that triggers builds for all hosts every 2 hours via NATS, keeping the binary cache warm. - Add scheduler.nix with timer (every 2h) and oneshot service - Add scheduler NATS user to DEPLOY account - Add Vault secret and variable for scheduler NKey - Increase nix-cache02 memory from 16GB to 20GB Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 00:50:09 +01:00
Torjus Håkestad	0bc10cb1fe	grafana: add build service panels to nixos-fleet dashboard Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:49:50 +01:00
Torjus Håkestad	b03e2e8ee4	monitoring: add alerts for homelab-deploy build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:45:07 +01:00
Torjus Håkestad	75210805d5	nix-cache01: decommission and remove all references Removed: - hosts/nix-cache01/ directory - services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder) - Vault secret and AppRole for nix-cache01 - Old signing key variable from terraform - Old trusted public key from system/nix.nix Updated: - flake.nix: removed nixosConfiguration - README.md: nix-cache01 -> nix-cache02 - Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02 - Simplified proxy.nix (no longer needs hostname conditional) nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:40:51 +01:00
Torjus Håkestad	83fce5f927	nix-cache: switch DNS to nix-cache02 - Move nix-cache CNAME from nix-cache01 to nix-cache02 - Remove actions1 CNAME (service removed) - Update proxy.nix to serve canonical domain on nix-cache02 - Promote nix-cache02 to prod tier with build-host role Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:22:23 +01:00
Torjus Håkestad	49f7e3ae2e	nix-cache: use hostname-based domain for Caddy proxy nix-cache01 serves nix-cache.home.2rjus.net (canonical) nix-cache02 serves nix-cache02.home.2rjus.net (for testing) This allows testing nix-cache02 independently before DNS cutover. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:14:14 +01:00
Torjus Håkestad	751edfc11d	nix-cache02: add Harmonia binary cache service - Parameterize harmonia.nix to use hostname-based Vault paths - Add nix-cache services to nix-cache02 - Add Vault secret and variable for nix-cache02 signing key - Add nix-cache02 public key to trusted-public-keys on all hosts - Update plan doc to remove actions runner references Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:08:48 +01:00
Torjus Håkestad	98a7301985	nix-cache: remove unused Gitea Actions runner The actions runner on nix-cache01 was never actively used. Removing it before migrating to nix-cache02. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:57:08 +01:00
Torjus Håkestad	47747329c4	nix-cache02: add homelab-deploy builder service - Configure builder to build nixos-servers and nixos (gunter) repos - Add builder NKey to Vault secrets - Update NATS permissions for builder, test-deployer, and admin-deployer - Grant nix-cache02 access to shared homelab-deploy secrets Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:26:40 +01:00
Torjus Håkestad	b709c0b703	monitoring: disable radarr exporter (version mismatch) Radarr on TrueNAS jail is too old - exportarr fails on /api/v3/wanted/cutoff endpoint (404). Keep sonarr which works. Vault secret kept for when Radarr is updated. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:59:45 +01:00
Torjus Håkestad	33c5d5b3f0	monitoring: add exportarr for radarr/sonarr metrics Add prometheus exportarr exporters for Radarr and Sonarr media services. Runs on monitoring01, queries remote APIs. - Radarr exporter on port 9708 - Sonarr exporter on port 9709 - API keys fetched from Vault Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:56:03 +01:00
Torjus Håkestad	9bd48e0808	monitoring: explicitly list valid HTTP status codes Empty valid_status_codes defaults to 2xx only, not "any". Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so services returning 400/401 like ha and nzbget pass the probe. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:41:47 +01:00
Torjus Håkestad	1460eea700	grafana: fix probe status table join Use joinByField transformation instead of merge to properly align rows by instance. Also exclude duplicate Time/job columns from join. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:38:02 +01:00
Torjus Håkestad	98c4f54f94	grafana: add TLS certificates dashboard Dashboard includes: - Stat panels for endpoints monitored, probe failures, expiring certs - Gauge showing minimum days until any cert expires - Table of all endpoints sorted by expiry (color-coded) - Probe status table with HTTP status and duration - Time series graphs for expiry trends and probe success rate Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:35:44 +01:00
Torjus Håkestad	d1b0a5dc20	monitoring: accept any HTTP status in TLS probe Only care about TLS handshake success for certificate monitoring. Services like nzbget (401) and ha (400) return non-2xx but have valid certificates. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:33:45 +01:00
Torjus Håkestad	4d32707130	monitoring: remove duplicate rules from blackbox.nix The rules were already added to rules.yml but the blackbox.nix file still had them, causing duplicate 'groups' key errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:28:42 +01:00

284 Commits