nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	59e1962d75	auth01: decommission host and remove authelia/lldap services Remove auth01 host configuration and associated services in preparation for new auth stack with different provisioning system. Removed: - hosts/auth01/ - host configuration - services/authelia/ - authelia service module - services/lldap/ - lldap service module - secrets/auth01/ - sops secrets - Reverse proxy entries for auth and lldap - Monitoring alert rules for authelia and lldap - SOPS configuration for auth01 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:35:45 +01:00
Torjus Håkestad	c515a6b4e1	home-assistant: fix zigbee sensor battery reporting WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override battery calculation using voltage via homeassistant value_template. Also adds zigbee_sensor_stale alert for detecting dead sensors regardless of battery reporting accuracy (1 hour threshold). Device configuration moved from external devices.yaml to inline NixOS config for declarative management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:41:07 +01:00
Torjus Håkestad	b322b1156b	monitoring: fix openbao token output path The outputDir with extractKey should be the full file path, not just the parent directory. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:56:26 +01:00
Torjus Håkestad	3cccfc0487	monitoring: implement monitoring gaps coverage Add exporters and scrape targets for services lacking monitoring: - PostgreSQL: postgres-exporter on pgdb1 - Authelia: native telemetry metrics on auth01 - Unbound: unbound-exporter with remote-control on ns1/ns2 - NATS: HTTP monitoring endpoint on nats1 - OpenBao: telemetry config and Prometheus scrape with token auth - Systemd: systemd-exporter on all hosts for per-service metrics Add alert rules for postgres, auth (authelia + lldap), jellyfin, vault (openbao), plus extend existing nats and unbound rules. Add Terraform config for Prometheus metrics policy and token. The token is created via vault_token resource and stored in KV, so no manual token creation is needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:44:13 +01:00
Torjus Håkestad	0700033c0a	secrets: migrate all hosts from sops to OpenBao vault Replace sops-nix secrets with OpenBao vault secrets across all hosts. Hardcode root password hash, add extractKey option to vault-secrets module, update Terraform with secrets/policies for all hosts, and create AppRole provisioning playbook. Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01 Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 18:43:09 +01:00
Torjus Håkestad	28b8d7c115	monitoring: increase high_cpu_load duration for nix-cache01 to 2h nix-cache01 regularly hits high CPU during nix builds, causing flappy alerts. Keep the 15m threshold for all other hosts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:28:48 +01:00
Torjus Håkestad	3a9a47f1ad	monitoring: exclude step-ca serving cert from general expiry alert The step-ca serving certificate is auto-renewed with a 24h lifetime, so it always triggers the general < 86400s threshold. Exclude it and add a dedicated step_ca_serving_cert_expiring alert at < 1h instead. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:12:42 +01:00
Torjus Håkestad	fa6380e767	monitoring: fix nix-cache_caddy scrape target TLS error Move nix-cache_caddy back to a manual config in prometheus.nix using the service CNAME (nix-cache.home.2rjus.net) instead of the hostname. The auto-generated target used nix-cache01.home.2rjus.net which doesn't match the TLS certificate SAN. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:04:50 +01:00
Torjus Håkestad	dd1b64de27	monitoring: auto-generate Prometheus scrape targets from host configs Add homelab.monitoring NixOS options (enable, scrapeTargets) following the same pattern as homelab.dns. Prometheus scrape configs are now auto-generated from flake host configurations and external targets, replacing hardcoded target lists. Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo, remove duplicate pushgateway alert, add for clauses to monitoring_rules, remove hardcoded WireGuard public key, and add new alerts for certificates, proxmox, caddy, smartctl temperature, filesystem prediction, systemd state, file descriptors, and host reboots. Fixes grafana scrape target port from 3100 to 3000. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:49:07 +01:00
Torjus Håkestad	adf70999b9	Fix scrape config	2025-06-01 02:41:54 +02:00
Torjus Håkestad	acb9e59775	Scrape nix-cache caddy	2025-06-01 02:40:41 +02:00
Torjus Håkestad	14aa3a9340	Remove non-working timer rule	2025-05-29 10:15:40 +02:00
Torjus Håkestad	797f915939	Add monitoring rules for monitoring services	2025-05-29 10:09:27 +02:00
Torjus Håkestad	3785b8047a	Fix alert name for build-flakes alert	2025-05-28 21:28:04 +02:00
Torjus Håkestad	fb1a36a846	Rework build-flakes alert rules	2025-05-28 21:26:04 +02:00
Torjus Håkestad	77d1782f36	Set honor_labels for pushgw scrape	2025-05-28 20:34:17 +02:00
Torjus Håkestad	5b06a95222	Add prometheus pushgateway	2025-05-28 17:10:50 +02:00
Torjus Håkestad	5ce8f46394	Configure tempo otlp reciever endpoint	2025-05-24 22:10:01 +02:00
Torjus Håkestad	feff1d06eb	Configure tempo otlp reciever	2025-05-24 22:08:36 +02:00
Torjus Håkestad	b75df7578f	Configure tempo wal storage	2025-05-24 22:03:56 +02:00
Torjus Håkestad	4d88644417	Configure tempo storage	2025-05-24 21:55:08 +02:00
Torjus Håkestad	d4137f79aa	Change tempo settings	2025-05-24 21:32:19 +02:00
Torjus Håkestad	486320b0ec	Add tempo to monitoring	2025-05-24 21:29:05 +02:00
Torjus Håkestad	6fc4d42d16	Fix alloy config	2025-05-24 12:42:40 +02:00
Torjus Håkestad	ebcdefd0ca	Add alloy	2025-05-24 12:40:39 +02:00
Torjus Håkestad	2dae23560d	Fix pyroscope ports attribute	2025-05-24 12:01:30 +02:00
Torjus Håkestad	1988b36f03	Add pyroscope container to monitoring	2025-05-24 12:00:02 +02:00
Torjus Håkestad	2a46da3761	Add labmon to scrape config	2025-05-24 03:37:52 +02:00
Torjus Håkestad	4e870cda44	Scrape step-ca metrics	2025-05-23 09:28:52 +02:00
Torjus Håkestad	6e6d5098c5	Collect ghettoptt stats	2025-05-22 14:55:32 +02:00
Torjus Håkestad	aa2cbcda60	Add home assistant to prometheus	2025-05-19 11:21:46 +02:00
Torjus Håkestad	78efb084ec	Alertonotify hardening part 3	2025-05-18 15:24:58 +02:00
Torjus Håkestad	16042b08c0	Alertonotify hardening part 2	2025-05-18 15:20:00 +02:00
Torjus Håkestad	8e0b97c9e0	Alertonotify hardening part 1	2025-05-18 15:08:26 +02:00
Torjus Håkestad	fe2e87658a	Move prometheus roles to external file	2025-05-18 14:54:09 +02:00
Torjus Håkestad	c07d96bbab	Add alert for wireguard handshake	2025-05-18 01:12:04 +02:00
Torjus Håkestad	bd58d07001	Monitor wireguard	2025-05-18 00:59:55 +02:00
Torjus Håkestad	3797526000	Add some alerting rules for smartctl	2025-05-18 00:51:02 +02:00
Torjus Håkestad	afa3cc3a57	Collect smartctl metrics from gunter	2025-05-18 00:43:15 +02:00
Torjus Håkestad	08a0ddaf30	Increase prometheus retention to 30d	2025-05-12 23:22:31 +02:00
Torjus Håkestad	518e3a3ded	Fix flapping build-flakes alarm	2025-04-07 10:41:35 +02:00
Torjus Håkestad	0dbdee65c5	Add harmonia alerting rule	2025-02-24 18:29:41 +01:00
Torjus Håkestad	b468e9d533	Improve alerttonotify service	2025-02-23 20:51:39 +01:00
Torjus Håkestad	874e30fb28	Tune cpu alarm	2025-02-23 20:46:25 +01:00
Torjus Håkestad	db9bf38ab6	Fix alerttonotify service	2025-02-23 18:16:13 +01:00
Torjus Håkestad	15e5ccb0ec	Change alertmanager repeat time	2025-02-23 18:10:14 +01:00
Torjus Håkestad	b8d058d23e	Add alerting rules	2025-02-12 20:34:22 +01:00
Torjus Håkestad	a5448c5fc1	Remove whitespace	2025-02-12 00:26:14 +01:00
Torjus Håkestad	f1ca20a387	Add some alerting rules	2025-02-11 23:24:35 +01:00
Torjus Håkestad	f0bc29ac5e	Add nats host to monitoring	2025-02-11 23:12:55 +01:00

70 Commits