nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	2ca2509083	monitoring: increase filesystem_filling_up prediction window to 24h Reduces false positives from transient Nix store growth by basing the linear prediction on a 24h trend instead of 6h. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 09:36:27 +01:00
Torjus Håkestad	4f593126c0	monitoring01: remove host and migrate services to monitoring02 Remove monitoring01 host configuration and unused service modules (prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox, exportarr, and pve exporters to monitoring02 with scrape configs moved to VictoriaMetrics. Update alert rules, terraform vault policies/secrets, http-proxy entries, and documentation to reflect the monitoring02 migration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:50:20 +01:00
Torjus Håkestad	ae823e439d	monitoring: lower unbound cache hit ratio alert threshold to 20% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:55:03 +01:00
Torjus Håkestad	b03e2e8ee4	monitoring: add alerts for homelab-deploy build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:45:07 +01:00
Torjus Håkestad	75210805d5	nix-cache01: decommission and remove all references Removed: - hosts/nix-cache01/ directory - services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder) - Vault secret and AppRole for nix-cache01 - Old signing key variable from terraform - Old trusted public key from system/nix.nix Updated: - flake.nix: removed nixosConfiguration - README.md: nix-cache01 -> nix-cache02 - Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02 - Simplified proxy.nix (no longer needs hostname conditional) nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:40:51 +01:00
Torjus Håkestad	8e1753c2c8	monitoring: fix blackbox rules and add force-push policy Move certificate alert rules to rules.yml instead of adding them as a separate rules string in blackbox.nix. The previous approach caused a YAML parse error due to duplicate 'groups' keys. Also add policy to CLAUDE.md: never force push to master. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:26:05 +01:00
Torjus Håkestad	ffad2dd205	monitoring: increase zigbee_sensor_stale threshold to 4 hours The 2-hour threshold was too aggressive for temperature sensors in stable environments. Historical data shows gaps up to 2.75 hours when temperature hasn't changed (Home Assistant only updates last_updated when values change). Increasing to 4 hours avoids false positives while still catching genuine failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 16:10:54 +01:00
Torjus Håkestad	8ec2a083bd	pgdb1: decommission postgresql host Remove pgdb1 host configuration and postgres service module. The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL. Removed: - hosts/pgdb1/ - host configuration - services/postgres/ - service module (only used by pgdb1) - postgres_rules from monitoring rules - rebuild-all.sh (obsolete script) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:54:50 +01:00
Torjus Håkestad	21db7e9573	acme: migrate from step-ca to OpenBao PKI Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory). - Update default ACME server in system/acme.nix - Update Caddy acme_ca in http-proxy and nix-cache services - Remove labmon service from monitoring01 (step-ca monitoring) - Remove labmon scrape target and certificate_rules alerts - Remove alloy.nix (only used for labmon profiling) - Add docs/plans/cert-monitoring.md for future cert monitoring needs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:20:10 +01:00
Torjus Håkestad	7d291f85bf	monitoring: propagate host labels to Prometheus scrape targets Extract homelab.host metadata (tier, priority, role, labels) from host configurations and propagate them to Prometheus scrape targets. This enables semantic alert filtering using labels instead of hardcoded instance names. Changes: - lib/monitoring.nix: Extract host metadata, group targets by labels - prometheus.nix: Use structured static_configs with labels - rules.yml: Replace instance filters with role-based filters Example labels in Prometheus: - ns1/ns2: role=dns, dns_role=primary/secondary - nix-cache01: role=build-host - testvm*: tier=test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:04:50 +01:00
Torjus Håkestad	881e70df27	monitoring: relax systemd_not_running alert threshold Increase duration from 5m to 10m and demote severity from critical to warning. Brief degraded states during nixos-rebuild are normal and were causing false positive alerts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 01:22:29 +01:00
Torjus Håkestad	15c00393f1	monitoring: increase zigbee_sensor_stale threshold to 2 hours Sensors report every ~45-50 minutes on average, so 1 hour was too tight. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 19:26:56 +01:00
Torjus Håkestad	59e1962d75	auth01: decommission host and remove authelia/lldap services Remove auth01 host configuration and associated services in preparation for new auth stack with different provisioning system. Removed: - hosts/auth01/ - host configuration - services/authelia/ - authelia service module - services/lldap/ - lldap service module - secrets/auth01/ - sops secrets - Reverse proxy entries for auth and lldap - Monitoring alert rules for authelia and lldap - SOPS configuration for auth01 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:35:45 +01:00
Torjus Håkestad	c515a6b4e1	home-assistant: fix zigbee sensor battery reporting WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override battery calculation using voltage via homeassistant value_template. Also adds zigbee_sensor_stale alert for detecting dead sensors regardless of battery reporting accuracy (1 hour threshold). Device configuration moved from external devices.yaml to inline NixOS config for declarative management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:41:07 +01:00
Torjus Håkestad	3cccfc0487	monitoring: implement monitoring gaps coverage Add exporters and scrape targets for services lacking monitoring: - PostgreSQL: postgres-exporter on pgdb1 - Authelia: native telemetry metrics on auth01 - Unbound: unbound-exporter with remote-control on ns1/ns2 - NATS: HTTP monitoring endpoint on nats1 - OpenBao: telemetry config and Prometheus scrape with token auth - Systemd: systemd-exporter on all hosts for per-service metrics Add alert rules for postgres, auth (authelia + lldap), jellyfin, vault (openbao), plus extend existing nats and unbound rules. Add Terraform config for Prometheus metrics policy and token. The token is created via vault_token resource and stored in KV, so no manual token creation is needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:44:13 +01:00
Torjus Håkestad	28b8d7c115	monitoring: increase high_cpu_load duration for nix-cache01 to 2h nix-cache01 regularly hits high CPU during nix builds, causing flappy alerts. Keep the 15m threshold for all other hosts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:28:48 +01:00
Torjus Håkestad	3a9a47f1ad	monitoring: exclude step-ca serving cert from general expiry alert The step-ca serving certificate is auto-renewed with a 24h lifetime, so it always triggers the general < 86400s threshold. Exclude it and add a dedicated step_ca_serving_cert_expiring alert at < 1h instead. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:12:42 +01:00
Torjus Håkestad	dd1b64de27	monitoring: auto-generate Prometheus scrape targets from host configs Add homelab.monitoring NixOS options (enable, scrapeTargets) following the same pattern as homelab.dns. Prometheus scrape configs are now auto-generated from flake host configurations and external targets, replacing hardcoded target lists. Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo, remove duplicate pushgateway alert, add for clauses to monitoring_rules, remove hardcoded WireGuard public key, and add new alerts for certificates, proxmox, caddy, smartctl temperature, filesystem prediction, systemd state, file descriptors, and host reboots. Fixes grafana scrape target port from 3100 to 3000. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:49:07 +01:00
Torjus Håkestad	14aa3a9340	Remove non-working timer rule	2025-05-29 10:15:40 +02:00
Torjus Håkestad	797f915939	Add monitoring rules for monitoring services	2025-05-29 10:09:27 +02:00
Torjus Håkestad	3785b8047a	Fix alert name for build-flakes alert	2025-05-28 21:28:04 +02:00
Torjus Håkestad	fb1a36a846	Rework build-flakes alert rules	2025-05-28 21:26:04 +02:00
Torjus Håkestad	fe2e87658a	Move prometheus roles to external file	2025-05-18 14:54:09 +02:00

23 Commits
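
The reasoning in commit 2ca2509083 (a 24h trend is less sensitive to transient Nix store growth than a 6h trend) can be illustrated with a small sketch. This is not the repository's alert rule, which is not shown in this log. It is a plain least-squares extrapolation in the spirit of PromQL's predict_linear, run on synthetic data. The helper names (predict_linear, window), the sample values, and the 24h look-ahead horizon are all assumptions made for illustration.

```python
# Illustrative sketch only: synthetic data, hypothetical helper names.
HOUR = 3600


def predict_linear(samples, horizon_s):
    """Fit a least-squares line through (t_seconds, value) samples and
    return the value it predicts horizon_s seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    t_end = samples[-1][0]
    return intercept + slope * (t_end + horizon_s)


def window(samples, seconds):
    """Keep only the samples from the trailing `seconds` of the series."""
    t_end = samples[-1][0]
    return [(t, v) for t, v in samples if t >= t_end - seconds]


# Hourly free-space samples in GB over 24h (assumed values): flat at 100 GB
# for 20 hours, then a transient build burst consumes 5 GB per hour for 4 hours.
samples = [
    (h * HOUR, 100.0 if h <= 20 else 100.0 - 5.0 * (h - 20))
    for h in range(25)
]

horizon = 24 * HOUR  # assumed look-ahead; the real rule's horizon is not given here

predicted_6h = predict_linear(window(samples, 6 * HOUR), horizon)
predicted_24h = predict_linear(window(samples, 24 * HOUR), horizon)

# The short window is dominated by the burst and extrapolates below zero
# (an alert would fire); the long window still predicts free space remaining.
print("6h window predicts exhaustion:", predicted_6h < 0)
print("24h window predicts exhaustion:", predicted_24h < 0)
```

With this synthetic series the 6h window extrapolates the burst into a full disk, while the 24h window averages it against the preceding flat period. The trade-off is that a longer window also reacts more slowly to genuine sustained growth.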