nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	2f0dad1acc	docs: add JSON logging audit to Loki improvements plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:44:05 +01:00
Torjus Håkestad	1544415ef3	docs: add Loki improvements plan Covers retention policy, limits config, Promtail label improvements (tier/role/level), and journal PRIORITY extraction. Also adds Alloy consideration to VictoriaMetrics migration plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:39:16 +01:00
Torjus Håkestad	5babd7f507	docs: move garage S3 storage plan to completed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:54:23 +01:00
Torjus Håkestad	5d3d93b280	docs: move completed plans to completed folder Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:08:17 +01:00
Torjus Håkestad	08d9e1ec3f	docs: add garage S3 storage plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 18:06:53 +01:00
Torjus Håkestad	ed1821b073	nix-cache02: add scheduled builds timer Add a systemd timer that triggers builds for all hosts every 2 hours via NATS, keeping the binary cache warm. - Add scheduler.nix with timer (every 2h) and oneshot service - Add scheduler NATS user to DEPLOY account - Add Vault secret and variable for scheduler NKey - Increase nix-cache02 memory from 16GB to 20GB Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 00:50:09 +01:00
Torjus Håkestad	ddcbc30665	docs: mark nix-cache01 decommission complete Phase 4 fully complete. nix-cache01 has been: - Removed from repo (host config, build scripts, flake entry) - Vault resources cleaned up - VM deleted from Proxmox nix-cache02 is now the sole binary cache host. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:43:12 +01:00
Torjus Håkestad	ade0538717	docs: mark nix-cache DNS cutover complete nix-cache.home.2rjus.net now served by nix-cache02. nix-cache01 ready for decommission. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:34:04 +01:00
Torjus Håkestad	afff3f28ca	docs: update nix-cache-reprovision plan with Harmonia progress - Phase 4 now in progress - Harmonia configured on nix-cache02 with new signing key - Trusted public key deployed to all hosts - Cache tested successfully from testvm01 - Actions runner removed from scope Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:17:51 +01:00
Torjus Håkestad	751edfc11d	nix-cache02: add Harmonia binary cache service - Parameterize harmonia.nix to use hostname-based Vault paths - Add nix-cache services to nix-cache02 - Add Vault secret and variable for nix-cache02 signing key - Add nix-cache02 public key to trusted-public-keys on all hosts - Update plan doc to remove actions runner references Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 23:08:48 +01:00
Torjus Håkestad	5bfb51a497	docs: add observability phase to nix-cache plan - Add Phase 6 for alerting and Grafana dashboards - Document available Prometheus metrics - Include example alerting rules for build failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:46:38 +01:00
Torjus Håkestad	f83145d97a	docs: update nix-cache-reprovision plan with progress - Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete - Document nix-cache02 configuration and tested build times - Add remaining work for Harmonia, Actions runner, and DNS cutover - Enable --enable-builds flag in MCP config Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 22:43:48 +01:00
Torjus Håkestad	98ea679ef2	docs: add monitoring02 reboot alert investigation Document findings from false positive host_reboot alert caused by NTP clock adjustment affecting node_boot_time_seconds metric. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-10 17:59:53 +01:00
Torjus Håkestad	75e4fb61a5	monitoring: add blackbox exporter for TLS certificate monitoring Add blackbox exporter to monitoring01 to probe TLS endpoints and alert on expiring certificates. Monitors all ACME-managed certificates from OpenBao PKI including Caddy auto-TLS services. Alerts: - tls_certificate_expiring_soon (< 7 days, warning) - tls_certificate_expiring_critical (< 24h, critical) - tls_probe_failed (connectivity issues) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:21:42 +01:00
Torjus Håkestad	7ff3d2a09b	docs: move openbao-kanidm-oidc plan to completed	2026-02-09 19:44:06 +01:00
Torjus Håkestad	02270a0e4a	docs: update plans with Grafana OIDC progress - auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed, document key findings (PKCE, attribute paths, user requirements) - monitoring-migration-victoriametrics.md: Note Grafana deployment on monitoring02 with Kanidm OIDC as test instance Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:28:10 +01:00
Torjus Håkestad	8786113f8f	docs: add OpenBao + Kanidm OIDC integration plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:45:44 +01:00
Torjus Håkestad	9ed09c9a9c	docs: add user-management documentation - CLI workflows for creating users and groups - Troubleshooting guide (nscd, cache invalidation) - Home directory behavior (UUID-based with symlinks) - Update auth-system-replacement plan with progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:14:21 +01:00
Torjus Håkestad	b845a8bb8b	system: add kanidm PAM/NSS client module Add homelab.kanidm.enable option for central authentication via Kanidm. The module configures: - PAM/NSS integration with kanidm-unixd - Client connection to auth.home.2rjus.net - Login authorization for ssh-users group Enable on testvm01-03 for testing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:12:19 +01:00
Torjus Håkestad	3abe5e83a7	docs: add memory ballooning as fallback option Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:29:42 +01:00
Torjus Håkestad	67c27555f3	docs: add memory issues follow-up plan Track zram change effectiveness for OOM prevention during upgrades. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 13:26:31 +01:00
Torjus Håkestad	311be282b6	docs: add security hardening plan Based on security review findings, covering SSH hardening, firewall enablement, log transport TLS, security alerting, and secrets management. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:26:15 +01:00
Torjus Håkestad	8fbf1224fa	docs: add host creation pipeline documentation Document the end-to-end host creation workflow including: - Prerequisites and step-by-step process - Tier specification (test vs prod) - Bootstrap observability via Loki - Verification steps - Troubleshooting guide - Related files reference Update CLAUDE.md to reference the new document. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 02:05:21 +01:00
Torjus Håkestad	8959829f77	docs: add monitoring migration to VictoriaMetrics plan Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 01:11:07 +01:00
Torjus Håkestad	93dbb45802	docs: update auth-system-replacement plan with progress - Mark completed implementation steps - Document deployed kanidm01 configuration - Record UID/GID range decision (65,536-69,999) - Add verified working items (WebUI, LDAP, certs) - Update next steps and resolved questions Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:50:36 +01:00
Torjus Håkestad	732e9b8c22	docs: move bootstrap-cache plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:41:05 +01:00
Torjus Håkestad	f9a3961457	docs: move ns1-recreation plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:35:04 +01:00
Torjus Håkestad	003d4ccf03	docs: mark ns1 migration to OpenTofu as complete Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:34:44 +01:00
Torjus Håkestad	94feae82a0	ns1: recreate with OpenTofu workflow Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs that didn't match actual disk layout, causing boot failure (emergency mode). Recreated using template2-based configuration for OpenTofu provisioning. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:18:08 +01:00
Torjus Håkestad	3f94f7ee95	docs: update pgdb1 decommission progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:55:55 +01:00
Torjus Håkestad	8ec2a083bd	pgdb1: decommission postgresql host Remove pgdb1 host configuration and postgres service module. The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL. Removed: - hosts/pgdb1/ - host configuration - services/postgres/ - service module (only used by pgdb1) - postgres_rules from monitoring rules - rebuild-all.sh (obsolete script) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:54:50 +01:00
Torjus Håkestad	ec4ac1477e	docs: mark pgdb1 for decommissioning instead of migration Only consumer was Open WebUI on gunter, which will migrate to local PostgreSQL. Removed pgdb1 backup/migration phases and added to decommission list. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:49:53 +01:00
Torjus Håkestad	e937c68965	docs: mark auth01, ca, and sops-nix removal as complete - auth01 host and services (authelia, lldap) already removed - ca host and services already removed (PKI migrated to OpenBao) - sops-nix fully removed (secrets/, .sops.yaml gone) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:33:18 +01:00
Torjus Håkestad	98e808cd6c	docs: mark jump host decommissioning as complete Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:31:14 +01:00
Torjus Håkestad	1066e81ba8	docs: update opentofu migration plan with current state - ns2 migrated to OpenTofu - testvm02, testvm03 added to managed hosts - Remove vaulttest01 (no longer exists) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:26:49 +01:00
Torjus Håkestad	f0950b33de	docs: add plan for nix-cache01 reprovision Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:34:52 +01:00
Torjus Håkestad	38c104ea8c	docs: add plan for configuring template2 with nix cache Bootstrap times can be improved by configuring the base template to use the local nix cache during initial builds. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:06:55 +01:00
Torjus Håkestad	f36457ee0d	cleanup: remove legacy secrets directory and move TODO.md to completed plans - Remove secrets/ directory (sops-nix no longer in use, all hosts use Vault) - Move TODO.md to docs/plans/completed/automated-host-deployment-pipeline.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:49:31 +01:00
Torjus Håkestad	21db7e9573	acme: migrate from step-ca to OpenBao PKI Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory). - Update default ACME server in system/acme.nix - Update Caddy acme_ca in http-proxy and nix-cache services - Remove labmon service from monitoring01 (step-ca monitoring) - Remove labmon scrape target and certificate_rules alerts - Remove alloy.nix (only used for labmon profiling) - Add docs/plans/cert-monitoring.md for future cert monitoring needs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:20:10 +01:00
Torjus Håkestad	c518093578	docs: move prometheus-scrape-target-labels plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:29:31 +01:00
Torjus Håkestad	50a85daa44	docs: update plan with hostname label documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:09:46 +01:00
Torjus Håkestad	7d291f85bf	monitoring: propagate host labels to Prometheus scrape targets Extract homelab.host metadata (tier, priority, role, labels) from host configurations and propagate them to Prometheus scrape targets. This enables semantic alert filtering using labels instead of hardcoded instance names. Changes: - lib/monitoring.nix: Extract host metadata, group targets by labels - prometheus.nix: Use structured static_configs with labels - rules.yml: Replace instance filters with role-based filters Example labels in Prometheus: - ns1/ns2: role=dns, dns_role=primary/secondary - nix-cache01: role=build-host - testvm*: tier=test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:04:50 +01:00
Torjus Håkestad	2a842c655a	docs: update plan status and move completed nats-deploy plan - Move nats-deploy-service.md to completed/ folder - Update prometheus-scrape-target-labels.md with implementation status - Add status table showing which steps are complete/partial/not started - Update cross-references to point to new location Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 16:44:00 +01:00
Torjus Håkestad	b03a9b3b64	docs: add long-term metrics storage plan Compare VictoriaMetrics and Thanos as options for extending metrics retention beyond 30 days while managing disk usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 07:56:10 +01:00
Torjus Håkestad	12bf0683f5	modules: add homelab.host for host metadata Add a shared `homelab.host` module that provides host metadata for multiple consumers: - tier: deployment tier (test/prod) for future homelab-deploy service - priority: alerting priority (high/low) for Prometheus label filtering - role: primary role of the host (dns, database, monitoring, etc.) - labels: free-form labels for additional metadata Host configurations updated with appropriate values: - ns1, ns2: role=dns with dns_role labels - nix-cache01: priority=low, role=build-host - vault01: role=vault - jump: role=bastion - template, template2, testvm01, vaulttest01: tier=test, priority=low The module is now imported via commonModules in flake.nix, making it available to all hosts including minimal configurations like template2. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:49:58 +01:00
Torjus Håkestad	e8a43c6715	docs: add deploy_admin tool with opt-in flag to homelab-deploy plan MCP exposes two tools: - deploy: test-tier only, always available - deploy_admin: all tiers, requires --enable-admin flag Three security layers: CLI flag, NATS authz, Claude Code permissions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:29:13 +01:00
Torjus Håkestad	eef52bb8c5	docs: add group deployment support to homelab-deploy plan Support deploying to all hosts in a tier or all hosts with a role: - deploy.<tier>.all - broadcast to all hosts in tier - deploy.<tier>.role.<role> - broadcast to hosts with matching role MCP can deploy to all test hosts at once, admin can deploy to any group. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:22:17 +01:00
Torjus Håkestad	c6cdbc6799	docs: move nixos-exporter plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:13:14 +01:00
Torjus Håkestad	4d724329a6	docs: add homelab-deploy plan, unify host metadata Add plan for NATS-based deployment service (homelab-deploy) that enables on-demand NixOS configuration updates via messaging. Features tiered permissions (test/prod) enforced at NATS layer. Update prometheus-scrape-target-labels plan to share the homelab.host module for host metadata (tier, priority, role, labels) - single source of truth for both deployment tiers and prometheus labels. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 02:10:54 +01:00
Torjus Håkestad	7e80d2e0bc	docs: add plans for nixos and homelab prometheus exporters Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 21:56:55 +01:00

77 Commits