nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	b709c0b703	monitoring: disable radarr exporter (version mismatch) Some checks failed Run nix flake check / flake-check (push) Failing after 15m20s Details Periodic flake update / flake-update (push) Successful in 2m23s Details Radarr on TrueNAS jail is too old - exportarr fails on /api/v3/wanted/cutoff endpoint (404). Keep sonarr which works. Vault secret kept for when Radarr is updated. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:59:45 +01:00
Torjus Håkestad	33c5d5b3f0	monitoring: add exportarr for radarr/sonarr metrics All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Add prometheus exportarr exporters for Radarr and Sonarr media services. Runs on monitoring01, queries remote APIs. - Radarr exporter on port 9708 - Sonarr exporter on port 9709 - API keys fetched from Vault Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:56:03 +01:00
Torjus Håkestad	9bd48e0808	monitoring: explicitly list valid HTTP status codes All checks were successful Run nix flake check / flake-check (push) Successful in 2m6s Details Empty valid_status_codes defaults to 2xx only, not "any". Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so services returning 400/401 like ha and nzbget pass the probe. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:41:47 +01:00
Torjus Håkestad	1460eea700	grafana: fix probe status table join All checks were successful Run nix flake check / flake-check (push) Successful in 2m9s Details Use joinByField transformation instead of merge to properly align rows by instance. Also exclude duplicate Time/job columns from join. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:38:02 +01:00
Torjus Håkestad	98c4f54f94	grafana: add TLS certificates dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard includes: - Stat panels for endpoints monitored, probe failures, expiring certs - Gauge showing minimum days until any cert expires - Table of all endpoints sorted by expiry (color-coded) - Probe status table with HTTP status and duration - Time series graphs for expiry trends and probe success rate Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:35:44 +01:00
Torjus Håkestad	d1b0a5dc20	monitoring: accept any HTTP status in TLS probe Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Only care about TLS handshake success for certificate monitoring. Services like nzbget (401) and ha (400) return non-2xx but have valid certificates. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:33:45 +01:00
Torjus Håkestad	4d32707130	monitoring: remove duplicate rules from blackbox.nix All checks were successful Run nix flake check / flake-check (push) Successful in 2m7s Details The rules were already added to rules.yml but the blackbox.nix file still had them, causing duplicate 'groups' key errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:28:42 +01:00
Torjus Håkestad	8e1753c2c8	monitoring: fix blackbox rules and add force-push policy Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Move certificate alert rules to rules.yml instead of adding them as a separate rules string in blackbox.nix. The previous approach caused a YAML parse error due to duplicate 'groups' keys. Also add policy to CLAUDE.md: never force push to master. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:26:05 +01:00
Torjus Håkestad	75e4fb61a5	monitoring: add blackbox exporter for TLS certificate monitoring All checks were successful Run nix flake check / flake-check (push) Successful in 2m6s Details Add blackbox exporter to monitoring01 to probe TLS endpoints and alert on expiring certificates. Monitors all ACME-managed certificates from OpenBao PKI including Caddy auto-TLS services. Alerts: - tls_certificate_expiring_soon (< 7 days, warning) - tls_certificate_expiring_critical (< 24h, critical) - tls_probe_failed (connectivity issues) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:21:42 +01:00
Torjus Håkestad	e85f15b73d	vault: add OpenBao OIDC integration with Kanidm All checks were successful Run nix flake check / flake-check (push) Successful in 2m9s Details Enable Kanidm users to authenticate to OpenBao via OIDC for Web UI access. Members of the admins group get full read/write access to secrets. Changes: - Add OIDC auth backend in Terraform (oidc.tf) - Add oidc-admin and oidc-default policies - Add openbao OAuth2 client to Kanidm - Enable legacy crypto (RS256) for OpenBao compatibility - Allow imperative group membership management in Kanidm Limitations: - CLI login not supported (Kanidm requires HTTPS for confidential client redirects) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 19:42:26 +01:00
Torjus Håkestad	2f5a2a4bf1	grafana: use instant queries for fleet dashboard stat panels All checks were successful Run nix flake check / flake-check (push) Successful in 2m6s Details Prevents stat panels from being affected by dashboard time range selection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 19:00:33 +01:00
Torjus Håkestad	9ed11b712f	home-assistant: fix Jinja2 battery template syntax All checks were successful Run nix flake check / flake-check (push) Successful in 2m13s Details The template used | min(100) | max(0) which is invalid Jinja2 syntax. These filters expect iterables (lists), not scalar arguments. This caused TypeError warnings on every MQTT message and left battery sensors unavailable. Fixed by using proper list-based min/max: [[value, 100] | min, 0] | max Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 16:12:59 +01:00
Torjus Håkestad	ffad2dd205	monitoring: increase zigbee_sensor_stale threshold to 4 hours The 2-hour threshold was too aggressive for temperature sensors in stable environments. Historical data shows gaps up to 2.75 hours when temperature hasn't changed (Home Assistant only updates last_updated when values change). Increasing to 4 hours avoids false positives while still catching genuine failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 16:10:54 +01:00
Torjus Håkestad	ed7d2aa727	grafana: add deployment metrics to nixos-fleet dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 15:58:28 +01:00
Torjus Håkestad	60c04a2052	nixos-exporter: enable NATS cache sharing Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m17s Details Run nix flake check / flake-check (push) Failing after 5m16s Details When one host fetches the latest flake revision, it publishes to NATS and all other hosts receive the update immediately. This reduces redundant nix flake metadata calls across the fleet. - Add nkeys to devshell for key generation - Add nixos-exporter user to NATS HOMELAB account - Add Vault secret for NKey storage - Configure all hosts to use NATS for revision sharing - Update nixos-exporter input to version with NATS support Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 23:57:28 +01:00
Torjus Håkestad	f66dfc753c	grafana: add NixOS operations dashboard All checks were successful Run nix flake check / flake-check (push) Successful in 3m24s Details Run nix flake check / flake-check (pull_request) Successful in 4m5s Details Loki-based dashboard for tracking NixOS operations including: - Upgrade activity and success/failure stats - Build activity during upgrades - Bootstrap logs for new VM deployments - ACME certificate renewal activity Log panels use LogQL json parsing with | keep host to show clean messages with host labels. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 22:03:28 +01:00
Torjus Håkestad	89d0a6f358	grafana: add systemd services dashboard Some checks failed Run nix flake check / flake-check (push) Failing after 8m30s Details Run nix flake check / flake-check (pull_request) Failing after 16m49s Details Dashboard for monitoring systemd across the fleet: - Summary stats: failed/active/inactive units, restarts, timers - Failed units table (shows any units in failed state) - Service restarts table (top 15 services by restart count) - Active units per host bar chart - NixOS upgrade timer table with last trigger time - Backup timers table (restic jobs) - Service restarts over time chart - Hostname filter to focus on specific hosts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:06:59 +01:00
Torjus Håkestad	03ebee4d82	grafana: fix proxmox table __name__ column All checks were successful Run nix flake check / flake-check (push) Successful in 2m9s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:04:41 +01:00
Torjus Håkestad	05630eb4d4	grafana: add Proxmox dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard for monitoring Proxmox VMs: - Summary stats: VMs running/stopped, node CPU/memory, uptime - VM status table with name, status, CPU%, memory%, uptime - VM CPU usage over time - VM memory usage over time - Network traffic (RX/TX) per VM - Disk I/O (read/write) per VM - Storage usage gauges and capacity table - VM filter to focus on specific VMs Filters out template VMs, shows only actual guests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:02:28 +01:00
Torjus Håkestad	d333aa0164	grafana: fix fleet table __name__ columns All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Exclude the __name__ columns that were leaking through the table transformations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:52:39 +01:00
Torjus Håkestad	a5d5827dcc	grafana: add NixOS fleet dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard for monitoring NixOS deployments across the homelab: - Hosts behind remote / needing reboot stat panels - Fleet status table with revision, behind status, reboot needed, age - Generation age bar chart (shows stale configs) - Generations per host bar chart - Deployment activity time series (see when hosts were updated) - Flake input ages table - Pie charts for hosts by revision and tier - Tier filter variable Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:50:08 +01:00
Torjus Håkestad	1c13ec12a4	grafana: add temperature dashboard All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Dashboard includes: - Current temperatures per room (stat panel) - Average home temperature (gauge) - Current humidity (stat panel) - 30-day temperature history with mean/min/max in legend - Temperature trend (rate of change per hour) - 24h min/max/avg table per room - 30-day humidity history Filters out device_temperature (internal sensor) metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:45:52 +01:00
Torjus Håkestad	4bf0eeeadb	grafana: add dashboards and fix permissions All checks were successful Run nix flake check / flake-check (push) Successful in 2m3s Details - Change default OIDC role from Viewer to Editor for Explore access - Add declarative dashboard provisioning - Add node-exporter dashboard (CPU, memory, disk, load, network, I/O) - Add Loki logs dashboard with host/job filters Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:39:21 +01:00
Torjus Håkestad	030e8518c5	grafana: add Grafana on monitoring02 with Kanidm OIDC Some checks failed Run nix flake check / flake-check (push) Failing after 4m3s Details Deploy Grafana test instance on monitoring02 with: - Kanidm OIDC authentication (admins -> Admin role, others -> Viewer) - PKCE enabled for secure OAuth2 flow (required by Kanidm) - Declarative datasources for Prometheus and Loki on monitoring01 - Local Caddy for TLS termination via internal ACME CA - DNS CNAME grafana-test.home.2rjus.net Terraform changes add OAuth2 client secret and AppRole policies for kanidm01 and monitoring02. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:23:26 +01:00
Torjus Håkestad	b31c64f1b9	kanidm: remove declarative user provisioning Keep base groups (admins, users, ssh-users) provisioned declaratively but manage regular users via the kanidm CLI. This allows setting POSIX attributes and passwords in a single workflow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:03 +01:00
Torjus Håkestad	463342133e	kanidm: remove non-functional metrics scrape target All checks were successful Run nix flake check / flake-check (push) Successful in 1m56s Details Kanidm does not expose a Prometheus /metrics endpoint. The scrape target was causing 404 errors after the TLS certificate issue was fixed. Also add SSH command restriction to CLAUDE.md. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:34:12 +01:00
Torjus Håkestad	de36b9d016	kanidm: add hostname SAN to ACME certificate Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Include both auth.home.2rjus.net (CNAME) and kanidm01.home.2rjus.net (A record) as SANs in the TLS certificate. This fixes Prometheus scraping which connects via the hostname, not the CNAME. Fixes: x509: certificate is valid for auth.home.2rjus.net, not kanidm01.home.2rjus.net Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:29:54 +01:00
Torjus Håkestad	538c2ad097	kanidm: fix secret file permissions for provisioning Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Set owner/group to kanidm so the post-start provisioning script can read the idm_admin password. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:24:41 +01:00
Torjus Håkestad	d99c82c74c	kanidm: fix service ordering for vault secret Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Ensure vault-secret-kanidm-idm-admin runs before kanidm.service by adding services dependency. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:21:11 +01:00
Torjus Håkestad	ca0e3fd629	kanidm01: add kanidm authentication server Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details - New test-tier VM at 10.69.13.23 with role=auth - Kanidm 1.8 server with HTTPS (443) and LDAPS (636) - ACME certificate from internal CA (auth.home.2rjus.net) - Provisioned groups: admins, users, ssh-users - Provisioned user: torjus - Daily backups at 22:00 (7 versions) - Prometheus monitoring scrape target Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:13:59 +01:00
Torjus Håkestad	8ec2a083bd	pgdb1: decommission postgresql host Remove pgdb1 host configuration and postgres service module. The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL. Removed: - hosts/pgdb1/ - host configuration - services/postgres/ - service module (only used by pgdb1) - postgres_rules from monitoring rules - rebuild-all.sh (obsolete script) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:54:50 +01:00
Torjus Håkestad	bf199bd7c6	ns/resolver: add redundant stub-zone addresses Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Configure Unbound to query both ns1 and ns2 for the home.2rjus.net zone, in addition to local NSD. This provides redundancy during bootstrap or if local NSD is temporarily unavailable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:10:17 +01:00
Torjus Håkestad	bdc6057689	hosts: decommission ca host and remove labmon Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Remove the step-ca host and labmon flake input now that ACME has been migrated to OpenBao PKI. Removed: - hosts/ca/ - step-ca host configuration - services/ca/ - step-ca service module - labmon flake input and module (no longer used) Updated: - flake.nix - removed ca host and labmon references - flake.lock - removed labmon input - rebuild-all.sh - removed ca from host list - CLAUDE.md - updated documentation Note: secrets/ca/ should be manually removed by the user. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:41:49 +01:00
Torjus Håkestad	21db7e9573	acme: migrate from step-ca to OpenBao PKI Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory). - Update default ACME server in system/acme.nix - Update Caddy acme_ca in http-proxy and nix-cache services - Remove labmon service from monitoring01 (step-ca monitoring) - Remove labmon scrape target and certificate_rules alerts - Remove alloy.nix (only used for labmon profiling) - Add docs/plans/cert-monitoring.md for future cert monitoring needs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:20:10 +01:00
Torjus Håkestad	7d291f85bf	monitoring: propagate host labels to Prometheus scrape targets Extract homelab.host metadata (tier, priority, role, labels) from host configurations and propagate them to Prometheus scrape targets. This enables semantic alert filtering using labels instead of hardcoded instance names. Changes: - lib/monitoring.nix: Extract host metadata, group targets by labels - prometheus.nix: Use structured static_configs with labels - rules.yml: Replace instance filters with role-based filters Example labels in Prometheus: - ns1/ns2: role=dns, dns_role=primary/secondary - nix-cache01: role=build-host - testvm*: tier=test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:04:50 +01:00
Torjus Håkestad	ad8570f8db	homelab-deploy: add NATS-based deployment system Some checks failed Run nix flake check / flake-check (push) Failing after 3m45s Details Add homelab-deploy flake input and NixOS module for message-based deployments across the fleet. Configure DEPLOY account in NATS with tiered access control (listener, test-deployer, admin-deployer). Enable listener on vaulttest01 as initial test host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 05:22:06 +01:00
Torjus Håkestad	881e70df27	monitoring: relax systemd_not_running alert threshold All checks were successful Run nix flake check / flake-check (push) Successful in 2m4s Details Increase duration from 5m to 10m and demote severity from critical to warning. Brief degraded states during nixos-rebuild are normal and were causing false positive alerts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 01:22:29 +01:00
Torjus Håkestad	025570dea1	monitoring: fix openbao token refresh timer not triggering RemainAfterExit=true kept the service in "active" state, which prevented OnUnitActiveSec from scheduling new triggers since there was no new "activation" event. Removing it allows the service to properly go inactive, enabling the timer to reschedule correctly. Also fix ExecStart to use lib.getExe for proper path resolution with writeShellApplication. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 19:41:45 +01:00
Torjus Håkestad	15c00393f1	monitoring: increase zigbee_sensor_stale threshold to 2 hours Some checks failed Run nix flake check / flake-check (push) Failing after 6m59s Details Sensors report every ~45-50 minutes on average, so 1 hour was too tight. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 19:26:56 +01:00
Torjus Håkestad	506e93a5e2	home-assistant: fix zigbee battery value_template override key Some checks failed Run nix flake check / flake-check (push) Failing after 5m39s Details Run nix flake check / flake-check (pull_request) Failing after 12m37s Details The homeassistant override key should match the entity type in the MQTT discovery topic path. For battery sensors, the topic is homeassistant/sensor/<device>/battery/config, so the key should be "battery" not "sensor_battery". Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 00:48:30 +01:00
Torjus Håkestad	bbb22e588e	system: replace writeShellScript with writeShellApplication Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m3s Details Run nix flake check / flake-check (push) Failing after 5m57s Details Convert remaining writeShellScript usages to writeShellApplication for shellcheck validation and strict bash options. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 00:17:24 +01:00
Torjus Håkestad	e9857afc11	monitoring: use AppRole token for OpenBao metrics scraping All checks were successful Run nix flake check / flake-check (push) Successful in 2m12s Details Run nix flake check / flake-check (pull_request) Successful in 2m19s Details Instead of creating a long-lived Vault token in Terraform (which gets invalidated when Terraform recreates it), monitoring01 now uses its existing AppRole credentials to fetch a fresh token for Prometheus. Changes: - Add prometheus-metrics policy to monitoring01's AppRole - Remove vault_token.prometheus_metrics resource from Terraform - Remove openbao-token KV secret from Terraform - Add systemd service to fetch AppRole token on boot - Add systemd timer to refresh token every 30 minutes This ensures Prometheus always has a valid token without depending on Terraform state or manual intervention. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:51:11 +01:00
Torjus Håkestad	59e1962d75	auth01: decommission host and remove authelia/lldap services Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m5s Details Run nix flake check / flake-check (push) Failing after 18m1s Details Remove auth01 host configuration and associated services in preparation for new auth stack with different provisioning system. Removed: - hosts/auth01/ - host configuration - services/authelia/ - authelia service module - services/lldap/ - lldap service module - secrets/auth01/ - sops secrets - Reverse proxy entries for auth and lldap - Monitoring alert rules for authelia and lldap - SOPS configuration for auth01 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:35:45 +01:00
Torjus Håkestad	c515a6b4e1	home-assistant: fix zigbee sensor battery reporting Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override battery calculation using voltage via homeassistant value_template. Also adds zigbee_sensor_stale alert for detecting dead sensors regardless of battery reporting accuracy (1 hour threshold). Device configuration moved from external devices.yaml to inline NixOS config for declarative management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:41:07 +01:00
Torjus Håkestad	4d8b94ce83	monitoring: add collector flags to nats exporter Some checks failed Run nix flake check / flake-check (push) Failing after 8m53s Details The exporter requires explicit collector flags to specify what metrics to collect. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:23:30 +01:00
Torjus Håkestad	8b0a4ea33a	monitoring: use nats exporter instead of direct scrape Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details NATS HTTP monitoring endpoint serves JSON, not Prometheus format. Use the prometheus-nats-exporter which queries the NATS endpoint and exposes proper Prometheus metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:22:04 +01:00
Torjus Håkestad	b322b1156b	monitoring: fix openbao token output path Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m17s Details Run nix flake check / flake-check (push) Failing after 8m57s Details The outputDir with extractKey should be the full file path, not just the parent directory. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:56:26 +01:00
Torjus Håkestad	3cccfc0487	monitoring: implement monitoring gaps coverage Some checks failed Run nix flake check / flake-check (push) Failing after 7m36s Details Add exporters and scrape targets for services lacking monitoring: - PostgreSQL: postgres-exporter on pgdb1 - Authelia: native telemetry metrics on auth01 - Unbound: unbound-exporter with remote-control on ns1/ns2 - NATS: HTTP monitoring endpoint on nats1 - OpenBao: telemetry config and Prometheus scrape with token auth - Systemd: systemd-exporter on all hosts for per-service metrics Add alert rules for postgres, auth (authelia + lldap), jellyfin, vault (openbao), plus extend existing nats and unbound rules. Add Terraform config for Prometheus metrics policy and token. The token is created via vault_token resource and stored in KV, so no manual token creation is needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:44:13 +01:00
Torjus Håkestad	0700033c0a	secrets: migrate all hosts from sops to OpenBao vault Replace sops-nix secrets with OpenBao vault secrets across all hosts. Hardcode root password hash, add extractKey option to vault-secrets module, update Terraform with secrets/policies for all hosts, and create AppRole provisioning playbook. Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01 Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 18:43:09 +01:00
Torjus Håkestad	28b8d7c115	monitoring: increase high_cpu_load duration for nix-cache01 to 2h nix-cache01 regularly hits high CPU during nix builds, causing flappy alerts. Keep the 15m threshold for all other hosts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:28:48 +01:00
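
Note on commit 9ed11b712f: the list-based clamp it describes (cap the battery value at 100, then floor it at 0) can be sketched in plain Python. This is only an illustration of the technique, not code from the repository; the function name `clamp_percent` is hypothetical. Python's built-in `min` fails in a similar way to the filter misuse the commit mentions when it is handed a lone scalar instead of a list.

```python
def clamp_percent(value):
    """Clamp a battery reading to the 0..100 range (hypothetical helper)."""
    # The inner min caps the value at 100 and the outer max floors it at 0,
    # mirroring the list-based form: [[value, 100] | min, 0] | max
    return max(min([value, 100]), 0)


def scalar_min_fails(value):
    """Return True if calling min() on a bare scalar raises TypeError."""
    try:
        min(value)  # a lone number is not iterable
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    for reading in (150, 42, -5):
        print(reading, "->", clamp_percent(reading))
    print("min() on a scalar raises TypeError:", scalar_min_fails(150))
```

Nesting min inside max keeps the order of operations explicit: the upper bound is applied first, then the lower bound.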


241 Commits