nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	03ebee4d82	grafana: fix proxmox table __name__ column All checks were successful Run nix flake check / flake-check (push) Successful in 2m9s Details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:04:41 +01:00
Torjus Håkestad	05630eb4d4	grafana: add Proxmox dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard for monitoring Proxmox VMs: - Summary stats: VMs running/stopped, node CPU/memory, uptime - VM status table with name, status, CPU%, memory%, uptime - VM CPU usage over time - VM memory usage over time - Network traffic (RX/TX) per VM - Disk I/O (read/write) per VM - Storage usage gauges and capacity table - VM filter to focus on specific VMs Filters out template VMs, shows only actual guests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:02:28 +01:00
Torjus Håkestad	d333aa0164	grafana: fix fleet table __name__ columns All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Exclude the __name__ columns that were leaking through the table transformations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:52:39 +01:00
Torjus Håkestad	a5d5827dcc	grafana: add NixOS fleet dashboard Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details Dashboard for monitoring NixOS deployments across the homelab: - Hosts behind remote / needing reboot stat panels - Fleet status table with revision, behind status, reboot needed, age - Generation age bar chart (shows stale configs) - Generations per host bar chart - Deployment activity time series (see when hosts were updated) - Flake input ages table - Pie charts for hosts by revision and tier - Tier filter variable Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:50:08 +01:00
Torjus Håkestad	1c13ec12a4	grafana: add temperature dashboard All checks were successful Run nix flake check / flake-check (push) Successful in 2m5s Details Dashboard includes: - Current temperatures per room (stat panel) - Average home temperature (gauge) - Current humidity (stat panel) - 30-day temperature history with mean/min/max in legend - Temperature trend (rate of change per hour) - 24h min/max/avg table per room - 30-day humidity history Filters out device_temperature (internal sensor) metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:45:52 +01:00
Torjus Håkestad	4bf0eeeadb	grafana: add dashboards and fix permissions All checks were successful Run nix flake check / flake-check (push) Successful in 2m3s Details - Change default OIDC role from Viewer to Editor for Explore access - Add declarative dashboard provisioning - Add node-exporter dashboard (CPU, memory, disk, load, network, I/O) - Add Loki logs dashboard with host/job filters Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:39:21 +01:00
Torjus Håkestad	030e8518c5	grafana: add Grafana on monitoring02 with Kanidm OIDC Some checks failed Run nix flake check / flake-check (push) Failing after 4m3s Details Deploy Grafana test instance on monitoring02 with: - Kanidm OIDC authentication (admins -> Admin role, others -> Viewer) - PKCE enabled for secure OAuth2 flow (required by Kanidm) - Declarative datasources for Prometheus and Loki on monitoring01 - Local Caddy for TLS termination via internal ACME CA - DNS CNAME grafana-test.home.2rjus.net Terraform changes add OAuth2 client secret and AppRole policies for kanidm01 and monitoring02. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:23:26 +01:00
Torjus Håkestad	b31c64f1b9	kanidm: remove declarative user provisioning Keep base groups (admins, users, ssh-users) provisioned declaratively but manage regular users via the kanidm CLI. This allows setting POSIX attributes and passwords in a single workflow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:03 +01:00
Torjus Håkestad	463342133e	kanidm: remove non-functional metrics scrape target All checks were successful Run nix flake check / flake-check (push) Successful in 1m56s Details Kanidm does not expose a Prometheus /metrics endpoint. The scrape target was causing 404 errors after the TLS certificate issue was fixed. Also add SSH command restriction to CLAUDE.md. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:34:12 +01:00
Torjus Håkestad	de36b9d016	kanidm: add hostname SAN to ACME certificate Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Include both auth.home.2rjus.net (CNAME) and kanidm01.home.2rjus.net (A record) as SANs in the TLS certificate. This fixes Prometheus scraping which connects via the hostname, not the CNAME. Fixes: x509: certificate is valid for auth.home.2rjus.net, not kanidm01.home.2rjus.net Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:29:54 +01:00
Torjus Håkestad	538c2ad097	kanidm: fix secret file permissions for provisioning Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Set owner/group to kanidm so the post-start provisioning script can read the idm_admin password. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:24:41 +01:00
Torjus Håkestad	d99c82c74c	kanidm: fix service ordering for vault secret Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Ensure vault-secret-kanidm-idm-admin runs before kanidm.service by adding services dependency. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:21:11 +01:00
Torjus Håkestad	ca0e3fd629	kanidm01: add kanidm authentication server Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details - New test-tier VM at 10.69.13.23 with role=auth - Kanidm 1.8 server with HTTPS (443) and LDAPS (636) - ACME certificate from internal CA (auth.home.2rjus.net) - Provisioned groups: admins, users, ssh-users - Provisioned user: torjus - Daily backups at 22:00 (7 versions) - Prometheus monitoring scrape target Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:13:59 +01:00
Torjus Håkestad	8ec2a083bd	pgdb1: decommission postgresql host Remove pgdb1 host configuration and postgres service module. The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL. Removed: - hosts/pgdb1/ - host configuration - services/postgres/ - service module (only used by pgdb1) - postgres_rules from monitoring rules - rebuild-all.sh (obsolete script) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:54:50 +01:00
Torjus Håkestad	bf199bd7c6	ns/resolver: add redundant stub-zone addresses Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Configure Unbound to query both ns1 and ns2 for the home.2rjus.net zone, in addition to local NSD. This provides redundancy during bootstrap or if local NSD is temporarily unavailable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:10:17 +01:00
Torjus Håkestad	bdc6057689	hosts: decommission ca host and remove labmon Some checks failed Run nix flake check / flake-check (push) Failing after 1s Details Remove the step-ca host and labmon flake input now that ACME has been migrated to OpenBao PKI. Removed: - hosts/ca/ - step-ca host configuration - services/ca/ - step-ca service module - labmon flake input and module (no longer used) Updated: - flake.nix - removed ca host and labmon references - flake.lock - removed labmon input - rebuild-all.sh - removed ca from host list - CLAUDE.md - updated documentation Note: secrets/ca/ should be manually removed by the user. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:41:49 +01:00
Torjus Håkestad	21db7e9573	acme: migrate from step-ca to OpenBao PKI Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory). - Update default ACME server in system/acme.nix - Update Caddy acme_ca in http-proxy and nix-cache services - Remove labmon service from monitoring01 (step-ca monitoring) - Remove labmon scrape target and certificate_rules alerts - Remove alloy.nix (only used for labmon profiling) - Add docs/plans/cert-monitoring.md for future cert monitoring needs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:20:10 +01:00
Torjus Håkestad	7d291f85bf	monitoring: propagate host labels to Prometheus scrape targets Extract homelab.host metadata (tier, priority, role, labels) from host configurations and propagate them to Prometheus scrape targets. This enables semantic alert filtering using labels instead of hardcoded instance names. Changes: - lib/monitoring.nix: Extract host metadata, group targets by labels - prometheus.nix: Use structured static_configs with labels - rules.yml: Replace instance filters with role-based filters Example labels in Prometheus: - ns1/ns2: role=dns, dns_role=primary/secondary - nix-cache01: role=build-host - testvm*: tier=test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:04:50 +01:00
Torjus Håkestad	ad8570f8db	homelab-deploy: add NATS-based deployment system Some checks failed Run nix flake check / flake-check (push) Failing after 3m45s Details Add homelab-deploy flake input and NixOS module for message-based deployments across the fleet. Configure DEPLOY account in NATS with tiered access control (listener, test-deployer, admin-deployer). Enable listener on vaulttest01 as initial test host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 05:22:06 +01:00
Torjus Håkestad	881e70df27	monitoring: relax systemd_not_running alert threshold All checks were successful Run nix flake check / flake-check (push) Successful in 2m4s Details Increase duration from 5m to 10m and demote severity from critical to warning. Brief degraded states during nixos-rebuild are normal and were causing false positive alerts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 01:22:29 +01:00
Torjus Håkestad	025570dea1	monitoring: fix openbao token refresh timer not triggering RemainAfterExit=true kept the service in "active" state, which prevented OnUnitActiveSec from scheduling new triggers since there was no new "activation" event. Removing it allows the service to properly go inactive, enabling the timer to reschedule correctly. Also fix ExecStart to use lib.getExe for proper path resolution with writeShellApplication. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 19:41:45 +01:00
Torjus Håkestad	15c00393f1	monitoring: increase zigbee_sensor_stale threshold to 2 hours Some checks failed Run nix flake check / flake-check (push) Failing after 6m59s Details Sensors report every ~45-50 minutes on average, so 1 hour was too tight. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 19:26:56 +01:00
Torjus Håkestad	506e93a5e2	home-assistant: fix zigbee battery value_template override key Some checks failed Run nix flake check / flake-check (push) Failing after 5m39s Details Run nix flake check / flake-check (pull_request) Failing after 12m37s Details The homeassistant override key should match the entity type in the MQTT discovery topic path. For battery sensors, the topic is homeassistant/sensor/<device>/battery/config, so the key should be "battery" not "sensor_battery". Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 00:48:30 +01:00
Torjus Håkestad	bbb22e588e	system: replace writeShellScript with writeShellApplication Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m3s Details Run nix flake check / flake-check (push) Failing after 5m57s Details Convert remaining writeShellScript usages to writeShellApplication for shellcheck validation and strict bash options. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 00:17:24 +01:00
Torjus Håkestad	e9857afc11	monitoring: use AppRole token for OpenBao metrics scraping All checks were successful Run nix flake check / flake-check (push) Successful in 2m12s Details Run nix flake check / flake-check (pull_request) Successful in 2m19s Details Instead of creating a long-lived Vault token in Terraform (which gets invalidated when Terraform recreates it), monitoring01 now uses its existing AppRole credentials to fetch a fresh token for Prometheus. Changes: - Add prometheus-metrics policy to monitoring01's AppRole - Remove vault_token.prometheus_metrics resource from Terraform - Remove openbao-token KV secret from Terraform - Add systemd service to fetch AppRole token on boot - Add systemd timer to refresh token every 30 minutes This ensures Prometheus always has a valid token without depending on Terraform state or manual intervention. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:51:11 +01:00
Torjus Håkestad	59e1962d75	auth01: decommission host and remove authelia/lldap services Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m5s Details Run nix flake check / flake-check (push) Failing after 18m1s Details Remove auth01 host configuration and associated services in preparation for new auth stack with different provisioning system. Removed: - hosts/auth01/ - host configuration - services/authelia/ - authelia service module - services/lldap/ - lldap service module - secrets/auth01/ - sops secrets - Reverse proxy entries for auth and lldap - Monitoring alert rules for authelia and lldap - SOPS configuration for auth01 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:35:45 +01:00
Torjus Håkestad	c515a6b4e1	home-assistant: fix zigbee sensor battery reporting Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override battery calculation using voltage via homeassistant value_template. Also adds zigbee_sensor_stale alert for detecting dead sensors regardless of battery reporting accuracy (1 hour threshold). Device configuration moved from external devices.yaml to inline NixOS config for declarative management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:41:07 +01:00
Torjus Håkestad	4d8b94ce83	monitoring: add collector flags to nats exporter Some checks failed Run nix flake check / flake-check (push) Failing after 8m53s Details The exporter requires explicit collector flags to specify what metrics to collect. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:23:30 +01:00
Torjus Håkestad	8b0a4ea33a	monitoring: use nats exporter instead of direct scrape Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details NATS HTTP monitoring endpoint serves JSON, not Prometheus format. Use the prometheus-nats-exporter which queries the NATS endpoint and exposes proper Prometheus metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:22:04 +01:00
Torjus Håkestad	b322b1156b	monitoring: fix openbao token output path Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m17s Details Run nix flake check / flake-check (push) Failing after 8m57s Details The outputDir with extractKey should be the full file path, not just the parent directory. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:56:26 +01:00
Torjus Håkestad	3cccfc0487	monitoring: implement monitoring gaps coverage Some checks failed Run nix flake check / flake-check (push) Failing after 7m36s Details Add exporters and scrape targets for services lacking monitoring: - PostgreSQL: postgres-exporter on pgdb1 - Authelia: native telemetry metrics on auth01 - Unbound: unbound-exporter with remote-control on ns1/ns2 - NATS: HTTP monitoring endpoint on nats1 - OpenBao: telemetry config and Prometheus scrape with token auth - Systemd: systemd-exporter on all hosts for per-service metrics Add alert rules for postgres, auth (authelia + lldap), jellyfin, vault (openbao), plus extend existing nats and unbound rules. Add Terraform config for Prometheus metrics policy and token. The token is created via vault_token resource and stored in KV, so no manual token creation is needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:44:13 +01:00
Torjus Håkestad	0700033c0a	secrets: migrate all hosts from sops to OpenBao vault Replace sops-nix secrets with OpenBao vault secrets across all hosts. Hardcode root password hash, add extractKey option to vault-secrets module, update Terraform with secrets/policies for all hosts, and create AppRole provisioning playbook. Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01 Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 18:43:09 +01:00
Torjus Håkestad	28b8d7c115	monitoring: increase high_cpu_load duration for nix-cache01 to 2h nix-cache01 regularly hits high CPU during nix builds, causing flappy alerts. Keep the 15m threshold for all other hosts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:28:48 +01:00
Torjus Håkestad	3a9a47f1ad	monitoring: exclude step-ca serving cert from general expiry alert Some checks failed Run nix flake check / flake-check (push) Failing after 6m23s Details Run nix flake check / flake-check (pull_request) Failing after 4m46s Details The step-ca serving certificate is auto-renewed with a 24h lifetime, so it always triggers the general < 86400s threshold. Exclude it and add a dedicated step_ca_serving_cert_expiring alert at < 1h instead. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:12:42 +01:00
Torjus Håkestad	fa6380e767	monitoring: fix nix-cache_caddy scrape target TLS error All checks were successful Run nix flake check / flake-check (push) Successful in 2m43s Details Move nix-cache_caddy back to a manual config in prometheus.nix using the service CNAME (nix-cache.home.2rjus.net) instead of the hostname. The auto-generated target used nix-cache01.home.2rjus.net which doesn't match the TLS certificate SAN. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:04:50 +01:00
Torjus Håkestad	dd1b64de27	monitoring: auto-generate Prometheus scrape targets from host configs Some checks failed Run nix flake check / flake-check (pull_request) Successful in 2m49s Details Run nix flake check / flake-check (push) Has been cancelled Details Add homelab.monitoring NixOS options (enable, scrapeTargets) following the same pattern as homelab.dns. Prometheus scrape configs are now auto-generated from flake host configurations and external targets, replacing hardcoded target lists. Also cleans up alert rules: snake_case naming, fix zigbee2mqtt typo, remove duplicate pushgateway alert, add for clauses to monitoring_rules, remove hardcoded WireGuard public key, and add new alerts for certificates, proxmox, caddy, smartctl temperature, filesystem prediction, systemd state, file descriptors, and host reboots. Fixes grafana scrape target port from 3100 to 3000. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:49:07 +01:00
Torjus Håkestad	83af00458b	dns: remove defunct external hosts Remove hosts that no longer respond to ping: - kube-blue1-10 (entire k8s cluster) - virt-mini1, mpnzb, inc2, testing - CNAMEs: rook, git (pointed to removed kube-blue nodes) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 21:50:56 +01:00
Torjus Håkestad	cee1b264cd	dns: auto-generate zone entries from host configurations Replace static zone file with dynamically generated records: - Add homelab.dns module with enable/cnames options - Extract IPs from systemd.network configs (filters VPN interfaces) - Use git commit timestamp as zone serial number - Move external hosts to separate external-hosts.nix Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 21:43:44 +01:00
Torjus Håkestad	7ae474fd3e	pki: add new vault root ca to pki	2026-02-03 06:53:59 +01:00
Torjus Håkestad	f0525b5c74	ns: add vaulttest01 to zone All checks were successful Run nix flake check / flake-check (push) Successful in 2m19s Details	2026-02-03 06:42:05 +01:00
Torjus Håkestad	42c391b355	ns: add vault cname to zone Some checks failed Run nix flake check / flake-check (push) Failing after 4m7s Details	2026-02-03 06:00:59 +01:00
Torjus Håkestad	c694b9889a	vault: add auto-unseal All checks were successful Run nix flake check / flake-check (push) Successful in 2m16s Details	2026-02-02 00:28:24 +01:00
Torjus Håkestad	ace848b29c	vault: replace vault with openbao	2026-02-01 22:16:52 +01:00
Torjus Håkestad	b012df9f34	ns: add vault01 host to zone Some checks failed Run nix flake check / flake-check (push) Failing after 15m40s Details Periodic flake update / flake-update (push) Successful in 1m7s Details	2026-02-01 20:54:22 +01:00
Torjus Håkestad	a2c798bc30	vault: add minimal vault config Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details	2026-02-01 20:27:02 +01:00
Torjus Håkestad	bb9de5b4ca	auth01: fix secret mode Some checks failed Run nix flake check / flake-check (push) Failing after 2m4s Details	2025-12-06 11:37:11 +01:00
Torjus Håkestad	8eefe38d5e	auth01: fix secret group Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details	2025-12-06 11:34:34 +01:00
Torjus Håkestad	78efc4f592	auth01: fix secret path Some checks failed Run nix flake check / flake-check (push) Failing after 1m54s Details	2025-12-06 11:07:53 +01:00
Torjus Håkestad	25b786915c	auth01: add lldap password to secrets Some checks failed Run nix flake check / flake-check (push) Has been cancelled Details	2025-12-06 11:02:43 +01:00
Torjus Håkestad	3219b8da4b	nix-cache01: re-add homelab label Some checks failed Run nix flake check / flake-check (push) Failing after 4m15s Details Periodic flake update / flake-update (push) Successful in 2m32s Details	2025-08-27 23:00:47 +02:00

274 Commits