nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	11cbb64097	claude: make auditor delegation explicit in investigate-alarm - Changed section 4 from "if needed" to always spawn auditor - Added explicit "Do NOT query audit logs yourself" guidance - Listed specific scenarios requiring auditor (service stopped, etc.) - Added manual intervention as first common cause - Updated guidelines to emphasize mandatory delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 05:11:09 +01:00
Torjus Håkestad	e2dd21c994	claude: add auditor agent and git-explorer MCP Add new auditor agent for security-focused audit log analysis: - SSH session tracking, command execution, sudo usage - Suspicious activity detection patterns - Can be used standalone or as sub-agent by investigate-alarm Update investigate-alarm to delegate audit analysis to auditor and add git-explorer MCP for configuration drift detection. Add git-explorer to .mcp.json for repository inspection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 04:48:55 +01:00
Torjus Håkestad	463342133e	kanidm: remove non-functional metrics scrape target Kanidm does not expose a Prometheus /metrics endpoint. The scrape target was causing 404 errors after the TLS certificate issue was fixed. Also add SSH command restriction to CLAUDE.md. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:34:12 +01:00
Torjus Håkestad	de36b9d016	kanidm: add hostname SAN to ACME certificate Include both auth.home.2rjus.net (CNAME) and kanidm01.home.2rjus.net (A record) as SANs in the TLS certificate. This fixes Prometheus scraping which connects via the hostname, not the CNAME. Fixes: x509: certificate is valid for auth.home.2rjus.net, not kanidm01.home.2rjus.net Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:29:54 +01:00
Torjus Håkestad	3f1d966919	claude: improve investigate-alarm log query guidelines Add best practices for querying Loki to avoid overwhelming responses: - Start with narrow filters and small limits - Filter audit logs to EXECVE only - Exclude verbose noise (PATH, PROCTITLE, SYSCALL, BPF) - Expand queries incrementally if needed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:14:54 +01:00
Torjus Håkestad	7fcc043a4d	testvm: add SSH session command auditing Enable Linux audit to log execve syscalls from interactive SSH sessions. Uses auid filter to exclude system services and nix builds. Logs forwarded to journald for Loki ingestion. Query with: {host="testvmXX"} |= "EXECVE" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:07:10 +01:00
Torjus Håkestad	70ec5f8109	claude: add investigate-alarm agent Sub-agent for investigating system alarms using Prometheus metrics and Loki logs. Provides root cause analysis with timeline of events. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 03:07:03 +01:00
Torjus Håkestad	c2ec34cab9	docs: consolidate monitoring docs into observability skill - Move detailed Prometheus/Loki reference from CLAUDE.md to the observability skill - Add complete list of Prometheus jobs organized by category - Add bootstrap log documentation with stages table - Add kanidm01 to host labels table - CLAUDE.md now references the skill instead of duplicating info Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 02:15:02 +01:00
Torjus Håkestad	8fbf1224fa	docs: add host creation pipeline documentation Document the end-to-end host creation workflow including: - Prerequisites and step-by-step process - Tier specification (test vs prod) - Bootstrap observability via Loki - Verification steps - Troubleshooting guide - Related files reference Update CLAUDE.md to reference the new document. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 02:05:21 +01:00
Torjus Håkestad	8959829f77	docs: add monitoring migration to VictoriaMetrics plan Plan for migrating from Prometheus to VictoriaMetrics on new monitoring02 host with parallel operation, declarative Grafana dashboards, and CNAME-based cutover. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 01:11:07 +01:00
Torjus Håkestad	93dbb45802	docs: update auth-system-replacement plan with progress - Mark completed implementation steps - Document deployed kanidm01 configuration - Record UID/GID range decision (65,536-69,999) - Add verified working items (WebUI, LDAP, certs) - Update next steps and resolved questions Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:50:36 +01:00
Torjus Håkestad	538c2ad097	kanidm: fix secret file permissions for provisioning Set owner/group to kanidm so the post-start provisioning script can read the idm_admin password. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:24:41 +01:00
Torjus Håkestad	d99c82c74c	kanidm: fix service ordering for vault secret Ensure vault-secret-kanidm-idm-admin runs before kanidm.service by adding services dependency. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:21:11 +01:00
Torjus Håkestad	ca0e3fd629	kanidm01: add kanidm authentication server - New test-tier VM at 10.69.13.23 with role=auth - Kanidm 1.8 server with HTTPS (443) and LDAPS (636) - ACME certificate from internal CA (auth.home.2rjus.net) - Provisioned groups: admins, users, ssh-users - Provisioned user: torjus - Daily backups at 22:00 (7 versions) - Prometheus monitoring scrape target Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 00:13:59 +01:00
Torjus Håkestad	732e9b8c22	docs: move bootstrap-cache plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:41:05 +01:00
Torjus Håkestad	3a14ffd6b5	template2: add nix cache configuration New VMs bootstrapped from template2 will now use the local nix cache during initial nixos-rebuild, speeding up bootstrap times. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:40:53 +01:00
Torjus Håkestad	f9a3961457	docs: move ns1-recreation plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:35:04 +01:00
Torjus Håkestad	003d4ccf03	docs: mark ns1 migration to OpenTofu as complete Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:34:44 +01:00
Torjus Håkestad	735b8a9ee3	terraform: add dns and homelab-deploy secrets to ns1 policy ns1 needs access to shared/dns/* for zone transfer key and shared/homelab-deploy/* for the NATS listener. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:33:36 +01:00
Torjus Håkestad	94feae82a0	ns1: recreate with OpenTofu workflow Old VM had incorrect hardware-configuration.nix with hardcoded UUIDs that didn't match actual disk layout, causing boot failure (emergency mode). Recreated using template2-based configuration for OpenTofu provisioning. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 23:18:08 +01:00
Torjus Håkestad	3f94f7ee95	docs: update pgdb1 decommission progress Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:55:55 +01:00
Torjus Håkestad	b7e398c9a7	terraform: remove pgdb1 vault approle Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:55:39 +01:00
Torjus Håkestad	8ec2a083bd	pgdb1: decommission postgresql host Remove pgdb1 host configuration and postgres service module. The only consumer (Open WebUI on gunter) has migrated to local PostgreSQL. Removed: - hosts/pgdb1/ - host configuration - services/postgres/ - service module (only used by pgdb1) - postgres_rules from monitoring rules - rebuild-all.sh (obsolete script) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:54:50 +01:00
Torjus Håkestad	ec4ac1477e	docs: mark pgdb1 for decommissioning instead of migration Only consumer was Open WebUI on gunter, which will migrate to local PostgreSQL. Removed pgdb1 backup/migration phases and added to decommission list. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:49:53 +01:00
Torjus Håkestad	e937c68965	docs: mark auth01, ca, and sops-nix removal as complete - auth01 host and services (authelia, lldap) already removed - ca host and services already removed (PKI migrated to OpenBao) - sops-nix fully removed (secrets/, .sops.yaml gone) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:33:18 +01:00
Torjus Håkestad	98e808cd6c	docs: mark jump host decommissioning as complete Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:31:14 +01:00
Torjus Håkestad	ba9f47f914	jump: remove unused host configuration Host was decommissioned and not in flake.nix. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:30:37 +01:00
Torjus Håkestad	1066e81ba8	docs: update opentofu migration plan with current state - ns2 migrated to OpenTofu - testvm02, testvm03 added to managed hosts - Remove vaulttest01 (no longer exists) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 22:26:49 +01:00
Torjus Håkestad	f0950b33de	docs: add plan for nix-cache01 reprovision Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:34:52 +01:00
Torjus Håkestad	bf199bd7c6	ns/resolver: add redundant stub-zone addresses Configure Unbound to query both ns1 and ns2 for the home.2rjus.net zone, in addition to local NSD. This provides redundancy during bootstrap or if local NSD is temporarily unavailable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:10:17 +01:00
Torjus Håkestad	4e8ecb8a99	Merge pull request 'migrate-ns2-opentofu' (#33) from migrate-ns2-opentofu into master Reviewed-on: #33	2026-02-07 19:07:32 +00:00
Torjus Håkestad	38c104ea8c	docs: add plan for configuring template2 with nix cache Bootstrap times can be improved by configuring the base template to use the local nix cache during initial builds. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 20:06:55 +01:00
Torjus Håkestad	536daee4c7	ns2: migrate to OpenTofu management - Remove hosts/template/ (legacy template1) and give each legacy host its own hardware-configuration.nix copy - Recreate ns2 using create-host with template2 base - Add secondary DNS services (NSD + Unbound resolver) - Configure Vault policy for shared DNS secrets - Fix create-host IP uniqueness validator to check CIDR notation (prevents false positives from DNS resolver entries) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 19:28:35 +01:00
Torjus Håkestad	4c1debf0a3	Merge pull request 'decommission-ca-host' (#32) from decommission-ca-host into master Reviewed-on: #32	2026-02-07 17:50:44 +00:00
Torjus Håkestad	f36457ee0d	cleanup: remove legacy secrets directory and move TODO.md to completed plans - Remove secrets/ directory (sops-nix no longer in use, all hosts use Vault) - Move TODO.md to docs/plans/completed/automated-host-deployment-pipeline.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:49:31 +01:00
Torjus Håkestad	aedccbd9a0	flake: remove sops-nix (no longer used) All secrets are now managed by OpenBao (Vault). Remove the legacy sops-nix infrastructure that is no longer in use. Removed: - sops-nix flake input - system/sops.nix module - .sops.yaml configuration file - Age key generation from template prepare-host scripts Updated: - flake.nix - removed sops-nix references from all hosts - flake.lock - removed sops-nix input - scripts/create-host/ - removed sops references - CLAUDE.md - removed SOPS documentation Note: secrets/ directory should be manually removed by the user. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:46:24 +01:00
Torjus Håkestad	bdc6057689	hosts: decommission ca host and remove labmon Remove the step-ca host and labmon flake input now that ACME has been migrated to OpenBao PKI. Removed: - hosts/ca/ - step-ca host configuration - services/ca/ - step-ca service module - labmon flake input and module (no longer used) Updated: - flake.nix - removed ca host and labmon references - flake.lock - removed labmon input - rebuild-all.sh - removed ca from host list - CLAUDE.md - updated documentation Note: secrets/ca/ should be manually removed by the user. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:41:49 +01:00
Torjus Håkestad	3a25e3f7bc	Merge pull request 'migrate-to-openbao-pki' (#31) from migrate-to-openbao-pki into master Reviewed-on: #31	2026-02-07 17:33:46 +00:00
Torjus Håkestad	46f03871f1	docs: update CLAUDE.md for PR creation and labmon removal - Add note that gh pr create is not supported - Remove labmon from Prometheus job names list - Remove labmon from flake inputs list Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:32:27 +01:00
Torjus Håkestad	9d019f2b9a	testvm01: add nginx with ACME certificate for PKI testing Set up a simple nginx server with an ACME certificate from the new OpenBao PKI infrastructure. This allows testing the ACME migration before deploying to production hosts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:22:28 +01:00
Torjus Håkestad	21db7e9573	acme: migrate from step-ca to OpenBao PKI Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net) to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory). - Update default ACME server in system/acme.nix - Update Caddy acme_ca in http-proxy and nix-cache services - Remove labmon service from monitoring01 (step-ca monitoring) - Remove labmon scrape target and certificate_rules alerts - Remove alloy.nix (only used for labmon profiling) - Add docs/plans/cert-monitoring.md for future cert monitoring needs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 18:20:10 +01:00
Torjus Håkestad	979040aaf7	vault01: enable homelab-deploy listener Enable vault.enable and homelab.deploy.enable on vault01 so it can receive NATS-based remote deployments. Vault fetches secrets from itself using AppRole after auto-unseal. Add systemd ordering to ensure vault-secret services wait for openbao to be unsealed before attempting to fetch secrets. Also adds vault01 AppRole entry to Terraform. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:55:09 +01:00
Torjus Håkestad	8791c29402	hosts: enable homelab-deploy listener on pgdb1, nats1, jelly01 Enable vault.enable and homelab.deploy.enable for these hosts to allow NATS-based remote deployments and expose metrics on port 9972. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:43:06 +01:00
Torjus Håkestad	c7a067d7b3	flake: update homelab-deploy input Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:31:24 +01:00
Torjus Håkestad	c518093578	docs: move prometheus-scrape-target-labels plan to completed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:29:31 +01:00
Torjus Håkestad	0b462f0a96	Merge pull request 'prometheus-scrape-target-labels' (#30) from prometheus-scrape-target-labels into master Reviewed-on: #30	2026-02-07 16:27:38 +00:00
Torjus Håkestad	116abf3bec	CLAUDE.md: document homelab-deploy CLI for prod hosts Add instructions for deploying to prod hosts using the CLI directly, since the MCP server only handles test-tier deployments. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:23:10 +01:00
Torjus Håkestad	b794aa89db	skills: update observability with new target labels Document the new hostname and host metadata labels available on all Prometheus scrape targets: - hostname: short hostname for easy filtering - role: host role (dns, build-host, vault) - tier: deployment tier (test for test VMs) - dns_role: primary/secondary for DNS servers Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:12:17 +01:00
Torjus Håkestad	50a85daa44	docs: update plan with hostname label documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:09:46 +01:00
Torjus Håkestad	23e561cf49	monitoring: add hostname label to all scrape targets Add a `hostname` label to all Prometheus scrape targets, making it easy to query all metrics for a host without wildcarding the instance label. Example queries: - {hostname="ns1"} - all metrics from ns1 - node_cpu_seconds_total{hostname="monitoring01"} - specific metric For external targets (like gunter), the hostname is extracted from the target string. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 17:09:19 +01:00


937 Commits