nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	0d45e9f9d6	docs: switch to imperative user/group management Replace declarative NixOS provisioning examples with full CLI workflows. POSIX users and groups are now managed entirely via kanidm CLI, which allows setting all attributes (including UNIX passwords) in one step. Declarative provisioning may still be used for OIDC clients later. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	cae1663526	docs: add home directory and enabled hosts info - Document UUID-based home directories with symlinks - List currently enabled hosts (testvm01-03) - Add cache-invalidate command to troubleshooting Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	8bc4eee38e	docs: update kanidm troubleshooting with nscd restart Add troubleshooting tips discovered during testing: - kanidm-unix status command for checking connectivity - nscd restart required after config changes - Direct PAM auth test with kanidm-unix auth-test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	b845a8bb8b	system: add kanidm PAM/NSS client module Add homelab.kanidm.enable option for central authentication via Kanidm. The module configures: - PAM/NSS integration with kanidm-unixd - Client connection to auth.home.2rjus.net - Login authorization for ssh-users group Enable on testvm01-03 for testing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	3abe5e83a7	docs: add memory ballooning as fallback option Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:29:42 +01:00
Torjus Håkestad	67c27555f3	docs: add memory issues follow-up plan Track zram change effectiveness for OOM prevention during upgrades. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:26:31 +01:00
Torjus Håkestad	311be282b6	docs: add security hardening plan Based on security review findings, covering SSH hardening, firewall enablement, log transport TLS, security alerting, and secrets management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:26:15 +01:00
Torjus Håkestad	8fbf1224fa	docs: add host creation pipeline documentation Document the end-to-end host creation workflow including: - Prerequisites and step-by-step process - Tier specification (test vs prod) - Bootstrap observability via Loki - Verification steps - Troubleshooting guide - Related files reference Update CLAUDE.md to reference the new document. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 02:05:21 +01:00
Torjus Håkestad	8959829f77	docs: add monitoring migration to VictoriaMetrics plan Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 01:11:07 +01:00
Torjus Håkestad	93dbb45802	docs: update auth-system-replacement plan with progress - Mark completed implementation steps - Document deployed kanidm01 configuration - Record UID/GID range decision (65,536-69,999) - Add verified working items (WebUI, LDAP, certs) - Update next steps and resolved questions Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:50:36 +01:00
Torjus Håkestad	732e9b8c22	docs: move bootstrap-cache plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:41:05 +01:00
Torjus Håkestad	f9a3961457	docs: move ns1-recreation plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:35:04 +01:00
Torjus Håkestad	003d4ccf03	docs: mark ns1 migration to OpenTofu as complete Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:34:44 +01:00
Torjus Håkestad	94feae82a0	ns1: recreate with OpenTofu workflow Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs that didn't match actual disk layout, causing boot failure (emergency mode). Recreated using template2-based configuration for OpenTofu provisioning. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:18:08 +01:00
Torjus Håkestad	3f94f7ee95	docs: update pgdb1 decommission progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:55:55 +01:00
Torjus Håkestad	8ec2a083bd	pgdb1: decommission postgresql host Remove pgdb1 host configuration and postgres service module. The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL. Removed: - hosts/pgdb1/ - host configuration - services/postgres/ - service module (only used by pgdb1) - postgres_rules from monitoring rules - rebuild-all.sh (obsolete script) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:54:50 +01:00
Torjus Håkestad	ec4ac1477e	docs: mark pgdb1 for decommissioning instead of migration Only consumer was Open WebUI on gunter, which will migrate to local PostgreSQL. Removed pgdb1 backup/migration phases and added to decommission list. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:49:53 +01:00
Torjus Håkestad	e937c68965	docs: mark auth01, ca, and sops-nix removal as complete - auth01 host and services (authelia, lldap) already removed - ca host and services already removed (PKI migrated to OpenBao) - sops-nix fully removed (secrets/, .sops.yaml gone) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:33:18 +01:00
Torjus Håkestad	98e808cd6c	docs: mark jump host decommissioning as complete Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:31:14 +01:00
Torjus Håkestad	1066e81ba8	docs: update opentofu migration plan with current state - ns2 migrated to OpenTofu - testvm02, testvm03 added to managed hosts - Remove vaulttest01 (no longer exists) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:26:49 +01:00
Torjus Håkestad	f0950b33de	docs: add plan for nix-cache01 reprovision Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:34:52 +01:00
Torjus Håkestad	38c104ea8c	docs: add plan for configuring template2 with nix cache Bootstrap times can be improved by configuring the base template to use the local nix cache during initial builds. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:06:55 +01:00
Torjus Håkestad	f36457ee0d	cleanup: remove legacy secrets directory and move TODO.md to completed plans - Remove secrets/ directory (sops-nix no longer in use, all hosts use Vault) - Move TODO.md to docs/plans/completed/automated-host-deployment-pipeline.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:49:31 +01:00
Torjus Håkestad	21db7e9573	acme: migrate from step-ca to OpenBao PKI Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory). - Update default ACME server in system/acme.nix - Update Caddy acme_ca in http-proxy and nix-cache services - Remove labmon service from monitoring01 (step-ca monitoring) - Remove labmon scrape target and certificate_rules alerts - Remove alloy.nix (only used for labmon profiling) - Add docs/plans/cert-monitoring.md for future cert monitoring needs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:20:10 +01:00
Torjus Håkestad	c518093578	docs: move prometheus-scrape-target-labels plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:29:31 +01:00
Torjus Håkestad	50a85daa44	docs: update plan with hostname label documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:09:46 +01:00
Torjus Håkestad	7d291f85bf	monitoring: propagate host labels to Prometheus scrape targets Extract homelab.host metadata (tier, priority, role, labels) from host configurations and propagate them to Prometheus scrape targets. This enables semantic alert filtering using labels instead of hardcoded instance names. Changes: - lib/monitoring.nix: Extract host metadata, group targets by labels - prometheus.nix: Use structured static_configs with labels - rules.yml: Replace instance filters with role-based filters Example labels in Prometheus: - ns1/ns2: role=dns, dns_role=primary/secondary - nix-cache01: role=build-host - testvm*: tier=test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:04:50 +01:00
Torjus Håkestad	2a842c655a	docs: update plan status and move completed nats-deploy plan - Move nats-deploy-service.md to completed/ folder - Update prometheus-scrape-target-labels.md with implementation status - Add status table showing which steps are complete/partial/not started - Update cross-references to point to new location Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 16:44:00 +01:00
Torjus Håkestad	b03a9b3b64	docs: add long-term metrics storage plan Compare VictoriaMetrics and Thanos as options for extending metrics retention beyond 30 days while managing disk usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 07:56:10 +01:00
Torjus Håkestad	12bf0683f5	modules: add homelab.host for host metadata Add a shared `homelab.host` module that provides host metadata for multiple consumers: - tier: deployment tier (test/prod) for future homelab-deploy service - priority: alerting priority (high/low) for Prometheus label filtering - role: primary role of the host (dns, database, monitoring, etc.) - labels: free-form labels for additional metadata Host configurations updated with appropriate values: - ns1, ns2: role=dns with dns_role labels - nix-cache01: priority=low, role=build-host - vault01: role=vault - jump: role=bastion - template, template2, testvm01, vaulttest01: tier=test, priority=low The module is now imported via commonModules in flake.nix, making it available to all hosts including minimal configurations like template2. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:49:58 +01:00
Torjus Håkestad	e8a43c6715	docs: add deploy_admin tool with opt-in flag to homelab-deploy plan MCP exposes two tools: - deploy: test-tier only, always available - deploy_admin: all tiers, requires --enable-admin flag Three security layers: CLI flag, NATS authz, Claude Code permissions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:29:13 +01:00
Torjus Håkestad	eef52bb8c5	docs: add group deployment support to homelab-deploy plan Support deploying to all hosts in a tier or all hosts with a role: - deploy.<tier>.all - broadcast to all hosts in tier - deploy.<tier>.role.<role> - broadcast to hosts with matching role MCP can deploy to all test hosts at once, admin can deploy to any group. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:22:17 +01:00
Torjus Håkestad	c6cdbc6799	docs: move nixos-exporter plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:13:14 +01:00
Torjus Håkestad	4d724329a6	docs: add homelab-deploy plan, unify host metadata Add plan for NATS-based deployment service (homelab-deploy) that enables on-demand NixOS configuration updates via messaging. Features tiered permissions (test/prod) enforced at NATS layer. Update prometheus-scrape-target-labels plan to share the homelab.host module for host metadata (tier, priority, role, labels) - single source of truth for both deployment tiers and prometheus labels. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:10:54 +01:00
Torjus Håkestad	7e80d2e0bc	docs: add plans for nixos and homelab prometheus exporters Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 21:56:55 +01:00
Torjus Håkestad	787c14c7a6	docs: add dns_role label to scrape target labels plan Add proposed dns_role label to distinguish primary/secondary DNS resolvers. This addresses the unbound_low_cache_hit_ratio alert firing on ns2, which has a cold cache due to low traffic. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 01:23:34 +01:00
Torjus Håkestad	3dc4422ba0	docs: add NAS integration notes to auth plan Document TrueNAS CORE LDAP integration approach (NFS-only) and future NixOS NAS migration path with native Kanidm PAM/NSS. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:24:37 +01:00
Torjus Håkestad	f0963624bc	docs: add auth system replacement plan Evaluate options for replacing LLDAP+Authelia with a unified auth solution. Recommends Kanidm for its native NixOS PAM/NSS integration and built-in OIDC. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 23:18:38 +01:00
Torjus Håkestad	32968147b5	docs: move zigbee battery plan to completed Updated plan with: - Full device inventory from ha1 - Backup verification details - Branch and commit references Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:49:49 +01:00
Torjus Håkestad	c515a6b4e1	home-assistant: fix zigbee sensor battery reporting WSDCGQ12LM sensors report battery: 0 due to firmware quirk. Override battery calculation using voltage via homeassistant value_template. Also adds zigbee_sensor_stale alert for detecting dead sensors regardless of battery reporting accuracy (1 hour threshold). Device configuration moved from external devices.yaml to inline NixOS config for declarative management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 22:41:07 +01:00
Torjus Håkestad	3cccfc0487	monitoring: implement monitoring gaps coverage Add exporters and scrape targets for services lacking monitoring: - PostgreSQL: postgres-exporter on pgdb1 - Authelia: native telemetry metrics on auth01 - Unbound: unbound-exporter with remote-control on ns1/ns2 - NATS: HTTP monitoring endpoint on nats1 - OpenBao: telemetry config and Prometheus scrape with token auth - Systemd: systemd-exporter on all hosts for per-service metrics Add alert rules for postgres, auth (authelia + lldap), jellyfin, vault (openbao), plus extend existing nats and unbound rules. Add Terraform config for Prometheus metrics policy and token. The token is created via vault_token resource and stored in KV, so no manual token creation is needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 21:44:13 +01:00
Torjus Håkestad	7d92c55d37	docs: update for sops-to-openbao migration completion Update CLAUDE.md and README.md to reflect that secrets are now managed by OpenBao, with sops only remaining for ca. Update migration plans with sops cleanup checklist and auth01 decommission. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 20:06:21 +01:00
Torjus Håkestad	6d117d68ca	docs: move sops-to-openbao migration plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 19:45:42 +01:00
Torjus Håkestad	0700033c0a	secrets: migrate all hosts from sops to OpenBao vault Replace sops-nix secrets with OpenBao vault secrets across all hosts. Hardcode root password hash, add extractKey option to vault-secrets module, update Terraform with secrets/policies for all hosts, and create AppRole provisioning playbook. Hosts migrated: ha1, monitoring01, ns1, ns2, http-proxy, nix-cache01 Wave 1 hosts (nats1, jelly01, pgdb1) get AppRole policies only. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 18:43:09 +01:00
Torjus Håkestad	4d33018285	docs: add ha1 memory recommendation to migration plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 17:48:45 +01:00
Torjus Håkestad	678fd3d6de	docs: add systemd-exporter findings to monitoring gaps plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 10:19:33 +01:00
Torjus Håkestad	9d74aa5c04	docs: add zigbee sensor battery monitoring findings Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 09:21:54 +01:00
Torjus Håkestad	fe80ec3576	docs: add monitoring gaps audit plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 03:19:20 +01:00
Torjus Håkestad	870fb3e532	docs: add plan for remote access to homelab services Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:53:27 +01:00
Torjus Håkestad	e602e8d70b	docs: add plan for prometheus scrape target labels Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:36:41 +01:00

62 Commits