nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	4f593126c0	monitoring01: remove host and migrate services to monitoring02 Remove monitoring01 host configuration and unused service modules (prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox, exportarr, and pve exporters to monitoring02 with scrape configs moved to VictoriaMetrics. Update alert rules, terraform vault policies/secrets, http-proxy entries, and documentation to reflect the monitoring02 migration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:50:20 +01:00
Torjus Håkestad	1942591d2e	monitoring: add apiary metrics scraping with bearer token auth Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 16:36:26 +01:00
Torjus Håkestad	5d68662035	loki: add 30-day retention policy and ingestion limits Enable compactor-based retention with 30-day period to prevent unbounded disk growth. Add basic rate limits and stream guards to protect against runaway log generators. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 23:55:27 +01:00
Torjus Håkestad	ae823e439d	monitoring: lower unbound cache hit ratio alert threshold to 20% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:55:03 +01:00
Torjus Håkestad	b03e2e8ee4	monitoring: add alerts for homelab-deploy build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:45:07 +01:00
Torjus Håkestad	75210805d5	nix-cache01: decommission and remove all references Removed: - hosts/nix-cache01/ directory - services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder) - Vault secret and AppRole for nix-cache01 - Old signing key variable from terraform - Old trusted public key from system/nix.nix Updated: - flake.nix: removed nixosConfiguration - README.md: nix-cache01 -> nix-cache02 - Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02 - Simplified proxy.nix (no longer needs hostname conditional) nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:40:51 +01:00
Torjus Håkestad	b709c0b703	monitoring: disable radarr exporter (version mismatch) Radarr on TrueNAS jail is too old - exportarr fails on /api/v3/wanted/cutoff endpoint (404). Keep sonarr which works. Vault secret kept for when Radarr is updated. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:59:45 +01:00
Torjus Håkestad	33c5d5b3f0	monitoring: add exportarr for radarr/sonarr metrics Add prometheus exportarr exporters for Radarr and Sonarr media services. Runs on monitoring01, queries remote APIs. - Radarr exporter on port 9708 - Sonarr exporter on port 9709 - API keys fetched from Vault Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:56:03 +01:00
Torjus Håkestad	9bd48e0808	monitoring: explicitly list valid HTTP status codes Empty valid_status_codes defaults to 2xx only, not "any". Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so services returning 400/401 like ha and nzbget pass the probe. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:41:47 +01:00
Torjus Håkestad	d1b0a5dc20	monitoring: accept any HTTP status in TLS probe Only care about TLS handshake success for certificate monitoring. Services like nzbget (401) and ha (400) return non-2xx but have valid certificates. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:33:45 +01:00
Torjus Håkestad	4d32707130	monitoring: remove duplicate rules from blackbox.nix The rules were already added to rules.yml but the blackbox.nix file still had them, causing duplicate 'groups' key errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:28:42 +01:00
Torjus Håkestad	8e1753c2c8	monitoring: fix blackbox rules and add force-push policy Move certificate alert rules to rules.yml instead of adding them as a separate rules string in blackbox.nix. The previous approach caused a YAML parse error due to duplicate 'groups' keys. Also add policy to CLAUDE.md: never force push to master. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:26:05 +01:00
Torjus Håkestad	75e4fb61a5	monitoring: add blackbox exporter for TLS certificate monitoring Add blackbox exporter to monitoring01 to probe TLS endpoints and alert on expiring certificates. Monitors all ACME-managed certificates from OpenBao PKI including Caddy auto-TLS services. Alerts: - tls_certificate_expiring_soon (< 7 days, warning) - tls_certificate_expiring_critical (< 24h, critical) - tls_probe_failed (connectivity issues) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:21:42 +01:00
Torjus Håkestad	ffad2dd205	monitoring: increase zigbee_sensor_stale threshold to 4 hours The 2-hour threshold was too aggressive for temperature sensors in stable environments. Historical data shows gaps up to 2.75 hours when temperature hasn't changed (Home Assistant only updates last_updated when values change). Increasing to 4 hours avoids false positives while still catching genuine failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 16:10:54 +01:00
Torjus Håkestad	8ec2a083bd	pgdb1: decommission postgresql host Remove pgdb1 host configuration and postgres service module. The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL. Removed: - hosts/pgdb1/ - host configuration - services/postgres/ - service module (only used by pgdb1) - postgres_rules from monitoring rules - rebuild-all.sh (obsolete script) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:54:50 +01:00
Torjus Håkestad	21db7e9573	acme: migrate from step-ca to OpenBao PKI Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory). - Update default ACME server in system/acme.nix - Update Caddy acme_ca in http-proxy and nix-cache services - Remove labmon service from monitoring01 (step-ca monitoring) - Remove labmon scrape target and certificate_rules alerts - Remove alloy.nix (only used for labmon profiling) - Add docs/plans/cert-monitoring.md for future cert monitoring needs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:20:10 +01:00
Torjus Håkestad	7d291f85bf	monitoring: propagate host labels to Prometheus scrape targets Extract homelab.host metadata (tier, priority, role, labels) from host configurations and propagate them to Prometheus scrape targets. This enables semantic alert filtering using labels instead of hardcoded instance names. Changes: - lib/monitoring.nix: Extract host metadata, group targets by labels - prometheus.nix: Use structured static_configs with labels - rules.yml: Replace instance filters with role-based filters Example labels in Prometheus: - ns1/ns2: role=dns, dns_role=primary/secondary - nix-cache01: role=build-host - testvm*: tier=test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:04:50 +01:00
Torjus Håkestad	881e70df27	monitoring: relax systemd_not_running alert threshold Increase duration from 5m to 10m and demote severity from critical to warning. Brief degraded states during nixos-rebuild are normal and were causing false positive alerts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 01:22:29 +01:00
Torjus Håkestad	025570dea1	monitoring: fix openbao token refresh timer not triggering RemainAfterExit=true kept the service in "active" state, which prevented OnUnitActiveSec from scheduling new triggers since there was no new "activation" event. Removing it allows the service to properly go inactive, enabling the timer to reschedule correctly. Also fix ExecStart to use lib.getExe for proper path resolution with writeShellApplication. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 19:41:45 +01:00
Torjus Håkestad	15c00393f1	monitoring: increase zigbee_sensor_stale threshold to 2 hours Sensors report every ~45-50 minutes on average, so 1 hour was too tight. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 19:26:56 +01:00
Torjus Håkestad	bbb22e588e	system: replace writeShellScript with writeShellApplication Convert remaining writeShellScript usages to writeShellApplication for shellcheck validation and strict bash options. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 00:17:24 +01:00
Torjus Håkestad	e9857afc11	monitoring: use AppRole token for OpenBao metrics scraping Instead of creating a long-lived Vault token in Terraform (which gets invalidated when Terraform recreates it), monitoring01 now uses its existing AppRole credentials to fetch a fresh token for Prometheus. Changes: - Add prometheus-metrics policy to monitoring01's AppRole - Remove vault_token.prometheus_metrics resource from Terraform - Remove openbao-token KV secret from Terraform - Add systemd service to fetch AppRole token on boot - Add systemd timer to refresh token every 30 minutes This ensures Prometheus always has a valid token without depending on Terraform state or manual intervention. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:51:11 +01:00
Torjus Håkestad	59e1962d75	auth01: decommission host and remove authelia/lldap services Remove auth01 host configuration and associated services in preparation for new auth stack with different provisioning system. Removed: - hosts/auth01/ - host configuration - services/authelia/ - authelia service module - services/lldap/ - lldap service module - secrets/auth01/ - sops secrets - Reverse proxy entries for auth and lldap - Monitoring alert rules for authelia and lldap - SOPS configuration for auth01 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:35:45 +01:00
Torjus Håkestad	c515a6b4e1	home-assistant: fix zigbee sensor battery reporting WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override battery calculation using voltage via homeassistant value_template. Also adds zigbee_sensor_stale alert for detecting dead sensors regardless of battery reporting accuracy (1 hour threshold). Device configuration moved from external devices.yaml to inline NixOS config for declarative management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:41:07 +01:00
Torjus Håkestad	b322b1156b	monitoring: fix openbao token output path The outputDir with extractKey should be the full file path, not just the parent directory. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:56:26 +01:00
Torjus Håkestad	3cccfc0487	monitoring: implement monitoring gaps coverage Add exporters and scrape targets for services lacking monitoring: - PostgreSQL: postgres-exporter on pgdb1 - Authelia: native telemetry metrics on auth01 - Unbound: unbound-exporter with remote-control on ns1/ns2 - NATS: HTTP monitoring endpoint on nats1 - OpenBao: telemetry config and Prometheus scrape with token auth - Systemd: systemd-exporter on all hosts for per-service metrics Add alert rules for postgres, auth (authelia + lldap), jellyfin, vault (openbao), plus extend existing nats and unbound rules. Add Terraform config for Prometheus metrics policy and token. The token is created via vault_token resource and stored in KV, so no manual token creation is needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:44:13 +01:00
Torjus Håkestad	0700033c0a	secrets: migrate all hosts from sops to OpenBao vault Replace sops-nix secrets with OpenBao vault secrets across all hosts. Hardcode root password hash, add extractKey option to vault-secrets module, update Terraform with secrets/policies for all hosts, and create AppRole provisioning playbook. Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01 Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 18:43:09 +01:00
Torjus Håkestad	28b8d7c115	monitoring: increase high_cpu_load duration for nix-cache01 to 2h nix-cache01 regularly hits high CPU during nix builds, causing flappy alerts. Keep the 15m threshold for all other hosts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:28:48 +01:00
Torjus Håkestad	3a9a47f1ad	monitoring: exclude step-ca serving cert from general expiry alert The step-ca serving certificate is auto-renewed with a 24h lifetime, so it always triggers the general < 86400s threshold. Exclude it and add a dedicated step_ca_serving_cert_expiring alert at < 1h instead. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:12:42 +01:00
Torjus Håkestad	fa6380e767	monitoring: fix nix-cache_caddy scrape target TLS error Move nix-cache_caddy back to a manual config in prometheus.nix using the service CNAME (nix-cache.home.2rjus.net) instead of the hostname. The auto-generated target used nix-cache01.home.2rjus.net which doesn't match the TLS certificate SAN. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:04:50 +01:00
Torjus Håkestad	dd1b64de27	monitoring: auto-generate Prometheus scrape targets from host configs Add homelab.monitoring NixOS options (enable, scrapeTargets) following the same pattern as homelab.dns. Prometheus scrape configs are now auto-generated from flake host configurations and external targets, replacing hardcoded target lists. Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo, remove duplicate pushgateway alert, add for clauses to monitoring_rules, remove hardcoded WireGuard public key, and add new alerts for certificates, proxmox, caddy, smartctl temperature, filesystem prediction, systemd state, file descriptors, and host reboots. Fixes grafana scrape target port from 3100 to 3000. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:49:07 +01:00
Torjus Håkestad	adf70999b9	Fix scrape config	2025-06-01 02:41:54 +02:00
Torjus Håkestad	acb9e59775	Scrape nix-cache caddy	2025-06-01 02:40:41 +02:00
Torjus Håkestad	14aa3a9340	Remove non-working timer rule	2025-05-29 10:15:40 +02:00
Torjus Håkestad	797f915939	Add monitoring rules for monitoring services	2025-05-29 10:09:27 +02:00
Torjus Håkestad	3785b8047a	Fix alert name for build-flakes alert	2025-05-28 21:28:04 +02:00
Torjus Håkestad	fb1a36a846	Rework build-flakes alert rules	2025-05-28 21:26:04 +02:00
Torjus Håkestad	77d1782f36	Set honor_labels for pushgw scrape	2025-05-28 20:34:17 +02:00
Torjus Håkestad	5b06a95222	Add prometheus pushgateway	2025-05-28 17:10:50 +02:00
Torjus Håkestad	5ce8f46394	Configure tempo otlp reciever endpoint	2025-05-24 22:10:01 +02:00
Torjus Håkestad	feff1d06eb	Configure tempo otlp reciever	2025-05-24 22:08:36 +02:00
Torjus Håkestad	b75df7578f	Configure tempo wal storage	2025-05-24 22:03:56 +02:00
Torjus Håkestad	4d88644417	Configure tempo storage	2025-05-24 21:55:08 +02:00
Torjus Håkestad	d4137f79aa	Change tempo settings	2025-05-24 21:32:19 +02:00
Torjus Håkestad	486320b0ec	Add tempo to monitoring	2025-05-24 21:29:05 +02:00
Torjus Håkestad	6fc4d42d16	Fix alloy config	2025-05-24 12:42:40 +02:00
Torjus Håkestad	ebcdefd0ca	Add alloy	2025-05-24 12:40:39 +02:00
Torjus Håkestad	2dae23560d	Fix pyroscope ports attribute	2025-05-24 12:01:30 +02:00
Torjus Håkestad	1988b36f03	Add pyroscope container to monitoring	2025-05-24 12:00:02 +02:00
Torjus Håkestad	2a46da3761	Add labmon to scrape config	2025-05-24 03:37:52 +02:00

92 Commits