nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	94feae82a0	ns1: recreate with OpenTofu workflow Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs that didn't match actual disk layout, causing boot failure (emergency mode). Recreated using template2-based configuration for OpenTofu provisioning. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:18:08 +01:00
Torjus Håkestad	3f94f7ee95	docs: update pgdb1 decommission progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:55:55 +01:00
Torjus Håkestad	8ec2a083bd	pgdb1: decommission postgresql host Remove pgdb1 host configuration and postgres service module. The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL. Removed: - hosts/pgdb1/ - host configuration - services/postgres/ - service module (only used by pgdb1) - postgres_rules from monitoring rules - rebuild-all.sh (obsolete script) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:54:50 +01:00
Torjus Håkestad	ec4ac1477e	docs: mark pgdb1 for decommissioning instead of migration Only consumer was Open WebUI on gunter, which will migrate to local PostgreSQL. Removed pgdb1 backup/migration phases and added to decommission list. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:49:53 +01:00
Torjus Håkestad	e937c68965	docs: mark auth01, ca, and sops-nix removal as complete - auth01 host and services (authelia, lldap) already removed - ca host and services already removed (PKI migrated to OpenBao) - sops-nix fully removed (secrets/, .sops.yaml gone) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:33:18 +01:00
Torjus Håkestad	98e808cd6c	docs: mark jump host decommissioning as complete Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:31:14 +01:00
Torjus Håkestad	1066e81ba8	docs: update opentofu migration plan with current state - ns2 migrated to OpenTofu - testvm02, testvm03 added to managed hosts - Remove vaulttest01 (no longer exists) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:26:49 +01:00
Torjus Håkestad	f0950b33de	docs: add plan for nix-cache01 reprovision Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:34:52 +01:00
Torjus Håkestad	38c104ea8c	docs: add plan for configuring template2 with nix cache Bootstrap times can be improved by configuring the base template to use the local nix cache during initial builds. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:06:55 +01:00
Torjus Håkestad	f36457ee0d	cleanup: remove legacy secrets directory and move TODO.md to completed plans - Remove secrets/ directory (sops-nix no longer in use, all hosts use Vault) - Move TODO.md to docs/plans/completed/automated-host-deployment-pipeline.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:49:31 +01:00
Torjus Håkestad	21db7e9573	acme: migrate from step-ca to OpenBao PKI Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory). - Update default ACME server in system/acme.nix - Update Caddy acme_ca in http-proxy and nix-cache services - Remove labmon service from monitoring01 (step-ca monitoring) - Remove labmon scrape target and certificate_rules alerts - Remove alloy.nix (only used for labmon profiling) - Add docs/plans/cert-monitoring.md for future cert monitoring needs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:20:10 +01:00
Torjus Håkestad	c518093578	docs: move prometheus-scrape-target-labels plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:29:31 +01:00
Torjus Håkestad	50a85daa44	docs: update plan with hostname label documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:09:46 +01:00
Torjus Håkestad	7d291f85bf	monitoring: propagate host labels to Prometheus scrape targets Extract homelab.host metadata (tier, priority, role, labels) from host configurations and propagate them to Prometheus scrape targets. This enables semantic alert filtering using labels instead of hardcoded instance names. Changes: - lib/monitoring.nix: Extract host metadata, group targets by labels - prometheus.nix: Use structured static_configs with labels - rules.yml: Replace instance filters with role-based filters Example labels in Prometheus: - ns1/ns2: role=dns, dns_role=primary/secondary - nix-cache01: role=build-host - testvm*: tier=test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:04:50 +01:00
Torjus Håkestad	2a842c655a	docs: update plan status and move completed nats-deploy plan - Move nats-deploy-service.md to completed/ folder - Update prometheus-scrape-target-labels.md with implementation status - Add status table showing which steps are complete/partial/not started - Update cross-references to point to new location Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 16:44:00 +01:00
Torjus Håkestad	b03a9b3b64	docs: add long-term metrics storage plan Compare VictoriaMetrics and Thanos as options for extending metrics retention beyond 30 days while managing disk usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 07:56:10 +01:00
Torjus Håkestad	12bf0683f5	modules: add homelab.host for host metadata Add a shared `homelab.host` module that provides host metadata for multiple consumers: - tier: deployment tier (test/prod) for future homelab-deploy service - priority: alerting priority (high/low) for Prometheus label filtering - role: primary role of the host (dns, database, monitoring, etc.) - labels: free-form labels for additional metadata Host configurations updated with appropriate values: - ns1, ns2: role=dns with dns_role labels - nix-cache01: priority=low, role=build-host - vault01: role=vault - jump: role=bastion - template, template2, testvm01, vaulttest01: tier=test, priority=low The module is now imported via commonModules in flake.nix, making it available to all hosts including minimal configurations like template2. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:49:58 +01:00
Torjus Håkestad	e8a43c6715	docs: add deploy_admin tool with opt-in flag to homelab-deploy plan MCP exposes two tools: - deploy: test-tier only, always available - deploy_admin: all tiers, requires --enable-admin flag Three security layers: CLI flag, NATS authz, Claude Code permissions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:29:13 +01:00
Torjus Håkestad	eef52bb8c5	docs: add group deployment support to homelab-deploy plan Support deploying to all hosts in a tier or all hosts with a role: - deploy.<tier>.all - broadcast to all hosts in tier - deploy.<tier>.role.<role> - broadcast to hosts with matching role MCP can deploy to all test hosts at once, admin can deploy to any group. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:22:17 +01:00
Torjus Håkestad	c6cdbc6799	docs: move nixos-exporter plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:13:14 +01:00
Torjus Håkestad	4d724329a6	docs: add homelab-deploy plan, unify host metadata Add plan for NATS-based deployment service (homelab-deploy) that enables on-demand NixOS configuration updates via messaging. Features tiered permissions (test/prod) enforced at NATS layer. Update prometheus-scrape-target-labels plan to share the homelab.host module for host metadata (tier, priority, role, labels) - single source of truth for both deployment tiers and prometheus labels. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:10:54 +01:00
Torjus Håkestad	7e80d2e0bc	docs: add plans for nixos and homelab prometheus exporters Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 21:56:55 +01:00
Torjus Håkestad	787c14c7a6	docs: add dns_role label to scrape target labels plan Add proposed dns_role label to distinguish primary/secondary DNS resolvers. This addresses the unbound_low_cache_hit_ratio alert firing on ns2, which has a cold cache due to low traffic. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 01:23:34 +01:00
Torjus Håkestad	3dc4422ba0	docs: add NAS integration notes to auth plan Document TrueNAS CORE LDAP integration approach (NFS-only) and future NixOS NAS migration path with native Kanidm PAM/NSS. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:24:37 +01:00
Torjus Håkestad	f0963624bc	docs: add auth system replacement plan Evaluate options for replacing LLDAP+Authelia with a unified auth solution. Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:18:38 +01:00
Torjus Håkestad	32968147b5	docs: move zigbee battery plan to completed Updated plan with: - Full device inventory from ha1 - Backup verification details - Branch and commit references Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:49:49 +01:00
Torjus Håkestad	c515a6b4e1	home-assistant: fix zigbee sensor battery reporting WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override battery calculation using voltage via homeassistant value_template. Also adds zigbee_sensor_stale alert for detecting dead sensors regardless of battery reporting accuracy (1 hour threshold). Device configuration moved from external devices.yaml to inline NixOS config for declarative management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:41:07 +01:00
Torjus Håkestad	3cccfc0487	monitoring: implement monitoring gaps coverage Add exporters and scrape targets for services lacking monitoring: - PostgreSQL: postgres-exporter on pgdb1 - Authelia: native telemetry metrics on auth01 - Unbound: unbound-exporter with remote-control on ns1/ns2 - NATS: HTTP monitoring endpoint on nats1 - OpenBao: telemetry config and Prometheus scrape with token auth - Systemd: systemd-exporter on all hosts for per-service metrics Add alert rules for postgres, auth (authelia + lldap), jellyfin, vault (openbao), plus extend existing nats and unbound rules. Add Terraform config for Prometheus metrics policy and token. The token is created via vault_token resource and stored in KV, so no manual token creation is needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:44:13 +01:00
Torjus Håkestad	7d92c55d37	docs: update for sops-to-openbao migration completion Update CLAUDE.md and README.md to reflect that secrets are now managed by OpenBao, with sops only remaining for ca. Update migration plans with sops cleanup checklist and auth01 decommission. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 20:06:21 +01:00
Torjus Håkestad	6d117d68ca	docs: move sops-to-openbao migration plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 19:45:42 +01:00
Torjus Håkestad	0700033c0a	secrets: migrate all hosts from sops to OpenBao vault Replace sops-nix secrets with OpenBao vault secrets across all hosts. Hardcode root password hash, add extractKey option to vault-secrets module, update Terraform with secrets/policies for all hosts, and create AppRole provisioning playbook. Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01 Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 18:43:09 +01:00
Torjus Håkestad	4d33018285	docs: add ha1 memory recommendation to migration plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 17:48:45 +01:00
Torjus Håkestad	678fd3d6de	docs: add systemd-exporter findings to monitoring gaps plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 10:19:33 +01:00
Torjus Håkestad	9d74aa5c04	docs: add zigbee sensor battery monitoring findings Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 09:21:54 +01:00
Torjus Håkestad	fe80ec3576	docs: add monitoring gaps audit plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 03:19:20 +01:00
Torjus Håkestad	870fb3e532	docs: add plan for remote access to homelab services Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:53:27 +01:00
Torjus Håkestad	e602e8d70b	docs: add plan for prometheus scrape target labels Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:36:41 +01:00
Torjus Håkestad	09d9d71e2b	docs: note to establish hostname naming conventions before migration Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:04:58 +01:00
Torjus Håkestad	cc799f5929	docs: note USB passthrough requirement for ha1 migration Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:02:14 +01:00
Torjus Håkestad	0abdda8e8a	docs: add plan for migrating existing hosts to opentofu Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:59:51 +01:00
Torjus Håkestad	0ef63ad874	hosts: remove decommissioned media1, ns3, ns4, nixos-test1 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:36:57 +01:00
Torjus Håkestad	86a077e152	docs: add host cleanup plan for decommissioned hosts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:04:50 +01:00
Torjus Håkestad	a2a55f3955	docs: add docs directory info and nixos options improvement plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 22:27:11 +01:00
Torjus Håkestad	d7d4b0846c	docs: move dns-automation plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 22:13:38 +01:00
Torjus Håkestad	048536ba70	docs: move dns automation from TODO.md to nixos-improvements.md	2026-02-03 04:51:27 +01:00
Torjus Håkestad	01d4812280	vault: implement bootstrap integration	2026-02-03 01:10:36 +01:00
Torjus Håkestad	7fc69c40a6	docs: add truenas-migration plan	2026-02-02 18:29:11 +01:00
Torjus Håkestad	34a2f2ab50	docs: add infrastructure documentation	2026-02-02 17:36:55 +01:00
Torjus Håkestad	c694b9889a	vault: add auto-unseal	2026-02-02 00:28:24 +01:00

49 Commits