Commit Graph

801 Commits

Author SHA1 Message Date
bdc6057689 hosts: decommission ca host and remove labmon
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Remove the step-ca host and labmon flake input now that ACME has been
migrated to OpenBao PKI.

Removed:
- hosts/ca/ - step-ca host configuration
- services/ca/ - step-ca service module
- labmon flake input and module (no longer used)

Updated:
- flake.nix - removed ca host and labmon references
- flake.lock - removed labmon input
- rebuild-all.sh - removed ca from host list
- CLAUDE.md - updated documentation

Note: secrets/ca/ should be manually removed by the user.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:41:49 +01:00
3a25e3f7bc Merge pull request 'migrate-to-openbao-pki' (#31) from migrate-to-openbao-pki into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Reviewed-on: #31
2026-02-07 17:33:46 +00:00
46f03871f1 docs: update CLAUDE.md for PR creation and labmon removal
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Run nix flake check / flake-check (pull_request) Failing after 1s
- Add note that gh pr create is not supported
- Remove labmon from Prometheus job names list
- Remove labmon from flake inputs list

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:32:27 +01:00
9d019f2b9a testvm01: add nginx with ACME certificate for PKI testing
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Set up a simple nginx server with an ACME certificate from the new
OpenBao PKI infrastructure. This allows testing the ACME migration
before deploying to production hosts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:22:28 +01:00
21db7e9573 acme: migrate from step-ca to OpenBao PKI
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).

- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:20:10 +01:00
979040aaf7 vault01: enable homelab-deploy listener
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Enable vault.enable and homelab.deploy.enable on vault01 so it can
receive NATS-based remote deployments. Vault fetches secrets from
itself using AppRole after auto-unseal.

Add systemd ordering to ensure vault-secret services wait for openbao
to be unsealed before attempting to fetch secrets.

Also adds vault01 AppRole entry to Terraform.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:55:09 +01:00
8791c29402 hosts: enable homelab-deploy listener on pgdb1, nats1, jelly01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Enable vault.enable and homelab.deploy.enable for these hosts to
allow NATS-based remote deployments and expose metrics on port 9972.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:43:06 +01:00
c7a067d7b3 flake: update homelab-deploy input
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:31:24 +01:00
c518093578 docs: move prometheus-scrape-target-labels plan to completed
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:29:31 +01:00
0b462f0a96 Merge pull request 'prometheus-scrape-target-labels' (#30) from prometheus-scrape-target-labels into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Reviewed-on: #30
2026-02-07 16:27:38 +00:00
116abf3bec CLAUDE.md: document homelab-deploy CLI for prod hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Run nix flake check / flake-check (pull_request) Failing after 1s
Add instructions for deploying to prod hosts using the CLI directly,
since the MCP server only handles test-tier deployments.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:23:10 +01:00
b794aa89db skills: update observability with new target labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Document the new hostname and host metadata labels available on all
Prometheus scrape targets:
- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:12:17 +01:00
50a85daa44 docs: update plan with hostname label documentation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:09:46 +01:00
23e561cf49 monitoring: add hostname label to all scrape targets
Add a `hostname` label to all Prometheus scrape targets, making it easy
to query all metrics for a host without wildcarding the instance label.

Example queries:
- {hostname="ns1"} - all metrics from ns1
- node_cpu_seconds_total{hostname="monitoring01"} - specific metric

For external targets (like gunter), the hostname is extracted from the
target string.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:09:19 +01:00
7d291f85bf monitoring: propagate host labels to Prometheus scrape targets
Extract homelab.host metadata (tier, priority, role, labels) from host
configurations and propagate them to Prometheus scrape targets. This
enables semantic alert filtering using labels instead of hardcoded
instance names.

Changes:
- lib/monitoring.nix: Extract host metadata, group targets by labels
- prometheus.nix: Use structured static_configs with labels
- rules.yml: Replace instance filters with role-based filters

Example labels in Prometheus:
- ns1/ns2: role=dns, dns_role=primary/secondary
- nix-cache01: role=build-host
- testvm*: tier=test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:04:50 +01:00
2a842c655a docs: update plan status and move completed nats-deploy plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Move nats-deploy-service.md to completed/ folder
- Update prometheus-scrape-target-labels.md with implementation status
- Add status table showing which steps are complete/partial/not started
- Update cross-references to point to new location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 16:44:00 +01:00
1f4a5571dc CLAUDE.md: update documentation from audit
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Fix OpenBao CLI name (bao, not vault)
- Add vault01, testvm01-03 to hosts list
- Document nixos-exporter and homelab-deploy flake inputs
- Add vault/ and actions-runner/ services
- Document homelab.host and homelab.deploy options
- Document automatic Vault credential provisioning via wrapped tokens
- Consolidate homelab module options into dedicated section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 16:37:38 +01:00
13d6d0ea3a Merge pull request 'improve-bootstrap-visibility' (#29) from improve-bootstrap-visibility into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Reviewed-on: #29
2026-02-07 15:00:09 +00:00
eea000b337 CLAUDE.md: document bootstrap logs in Loki
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Run nix flake check / flake-check (pull_request) Failing after 4s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:57:51 +01:00
f19ba2f4b6 CLAUDE.md: use tofu -chdir instead of cd
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:41:59 +01:00
a90d9c33d5 CLAUDE.md: prefer nix develop -c for devshell commands
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:39:56 +01:00
09c9df1bbe terraform: regenerate wrapped token for testvm01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:36:25 +01:00
ae3039af19 template2: send bootstrap status to Loki for remote monitoring
Adds log_to_loki function that pushes structured log entries to Loki
at key bootstrap stages (starting, network_ok, vault_*, building,
success, failed). Enables querying bootstrap state via LogQL without
console access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:34:47 +01:00
11261c4636 template2: revert to journal+console output for bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
TTY output was causing nixos-rebuild to fail. Keep the custom
greeting line to indicate bootstrap image, but use journal+console
for reliable logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:24:39 +01:00
4ca3c8890f terraform: add flake_branch and token for testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:14:57 +01:00
78e8d7a600 template2: add ncurses for clear command in bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:10:25 +01:00
0cf72ec191 terraform: update template to nixos-25.11.20260203.e576e3c
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:02:16 +01:00
6a3a51407e playbooks: auto-update terraform template name after deploy
Add a third play to build-and-deploy-template.yml that updates
terraform/variables.tf with the new template name after deploying
to Proxmox. Only updates if the template name has changed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:59:13 +01:00
a1ae766eb8 template2: show bootstrap progress on tty1
- Display bootstrap banner and live progress on tty1 instead of login prompt
- Add custom getty greeting on other ttys indicating this is a bootstrap image
- Disable getty on tty1 during bootstrap so output is visible

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:49:58 +01:00
11999b37f3 flake: update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Fixes false "Some deployments failed" warning in MCP server when
deployments are still in progress.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:24:41 +01:00
29b2b7db52 Merge branch 'deploy-test-hosts'
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Add three permanent test hosts (testvm01, testvm02, testvm03) with:
- Static IPs: 10.69.13.20-22
- Vault AppRole integration with homelab-deploy policy
- Remote deployment via NATS (homelab.deploy.enable)
- Test tier configuration

Also updates create-host template to include vault.enable and
homelab.deploy.enable by default.
2026-02-07 14:09:40 +01:00
b046a1b862 terraform: remove flake_branch from test VMs
VMs are now bootstrapped and running. Remove temporary flake_branch
and vault_wrapped_token settings so they use master going forward.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:09:30 +01:00
38348c5980 vault: add homelab-deploy policy to generated hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
The homelab-deploy listener requires access to shared/homelab-deploy/*
secrets. Update hosts-generated.tf and the generator script to include
this policy automatically.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:05:42 +01:00
370cf2b03a hosts: enable vault and deploy listener on test VMs
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Add vault.enable = true to testvm01, testvm02, testvm03
- Add homelab.deploy.enable = true for remote deployment via NATS
- Update create-host template to include these by default

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:55:33 +01:00
7bc465b414 hosts: add testvm01, testvm02, testvm03 test hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Three permanent test hosts for validating deployment and bootstrapping
workflow. Each host configured with:
- Static IP (10.69.13.20-22/24)
- Vault AppRole integration
- Bootstrap from deploy-test-hosts branch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:34:16 +01:00
8d7bc50108 hosts: remove testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:58:24 +01:00
03e70ac094 hosts: remove vaulttest01
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:55:38 +01:00
3b32c9479f create-host: add approle removal and secrets detection
- Remove host entries from terraform/vault/approle.tf on --remove
- Detect and warn about secrets in terraform/vault/secrets.tf
- Include vault kv delete commands in removal instructions
- Update check_entries_exist to return approle status

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:54:42 +01:00
b0d35f9a99 create-host: fix flake.nix indentation patterns
The regex patterns expected 6 spaces of indentation but flake.nix uses
8 spaces for host entries. Also updated generated entry template to
match current flake.nix style (using commonModules ++).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:48:29 +01:00
26ca6817f0 homelab-deploy: enable prometheus metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m57s
- Update homelab-deploy input to get metrics support
- Enable metrics endpoint on port 9972
- Add scrape target for prometheus auto-discovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 08:04:23 +01:00
b03a9b3b64 docs: add long-term metrics storage plan
Compare VictoriaMetrics and Thanos as options for extending
metrics retention beyond 30 days while managing disk usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:56:10 +01:00
f805b9f629 mcp: add homelab-deploy MCP server
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:27:12 +01:00
f3adf7e77f CLAUDE.md: add homelab-deploy MCP documentation
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:25:44 +01:00
f6eca9decc vaulttest01: add htop for deploy verification test
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:23:22 +01:00
6e93b8eae3 Merge pull request 'add-deploy-homelab' (#28) from add-deploy-homelab into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Reviewed-on: #28
2026-02-07 05:56:51 +00:00
c214f8543c homelab: add deploy.enable option with assertion
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Run nix flake check / flake-check (pull_request) Successful in 2m7s
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
7933127d77 system: enable homelab-deploy listener for all vault hosts
Add system/homelab-deploy.nix module that automatically enables the
listener on all hosts with vault.enable=true. Uses homelab.host.tier
and homelab.host.role for NATS subject subscriptions.

- Add homelab-deploy access to all host AppRole policies
- Remove manual listener config from vaulttest01 (now handled by system module)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
13c3897e86 flake: update homelab-deploy, add to devShell
Update homelab-deploy to include bugfix. Add CLI to devShell for
easier testing and deployment operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
0643f23281 vaulttest01: add vault secret dependency to listener
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m32s
Ensure homelab-deploy-listener waits for the NKey secret to be
fetched from Vault before starting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:29:29 +01:00
ad8570f8db homelab-deploy: add NATS-based deployment system
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m45s
Add homelab-deploy flake input and NixOS module for message-based
deployments across the fleet. Configure DEPLOY account in NATS with
tiered access control (listener, test-deployer, admin-deployer).
Enable listener on vaulttest01 as initial test host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:22:06 +01:00