779 Commits

Author SHA1 Message Date
ae3039af19 template2: send bootstrap status to Loki for remote monitoring
Adds a log_to_loki function that pushes structured log entries to Loki
at key bootstrap stages (starting, network_ok, vault_*, building,
success, failed). This makes bootstrap state queryable via LogQL without
console access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:34:47 +01:00
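
To illustrate the kind of entry commit ae3039af19 describes, here is a minimal Python sketch that builds a request body in the JSON shape Loki's push endpoint (/loki/api/v1/push) accepts. The stage names come from the commit message. The function name build_loki_entry, the label names (job, host, stage) and the host value are assumptions made for this example; the actual log_to_loki function lives in the bootstrap script and is not shown in this log.

```python
import time

# Stage names are taken from the commit message; vault_* is handled as a prefix.
BOOTSTRAP_STAGES = {"starting", "network_ok", "building", "success", "failed"}


def build_loki_entry(host, stage, message, timestamp_ns=None):
    """Build a JSON-serialisable body for Loki's /loki/api/v1/push endpoint.

    The label names below are illustrative assumptions, not the repository's.
    """
    if stage not in BOOTSTRAP_STAGES and not stage.startswith("vault_"):
        raise ValueError(f"unknown bootstrap stage: {stage}")
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    return {
        "streams": [
            {
                # Labels identify the stream and can be filtered on in LogQL.
                "stream": {"job": "bootstrap", "host": host, "stage": stage},
                # Each value is [timestamp in nanoseconds as a string, log line].
                "values": [[str(timestamp_ns), message]],
            }
        ]
    }
```

With labels like these, a LogQL selector such as {job="bootstrap", stage="failed"} would list failed bootstraps, assuming the same label names were used on the sending side.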
11261c4636 template2: revert to journal+console output for bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
TTY output was causing nixos-rebuild to fail. Keep the custom
greeting line to indicate a bootstrap image, but use journal+console
for reliable logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:24:39 +01:00
4ca3c8890f terraform: add flake_branch and token for testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:14:57 +01:00
78e8d7a600 template2: add ncurses for clear command in bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:10:25 +01:00
0cf72ec191 terraform: update template to nixos-25.11.20260203.e576e3c
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:02:16 +01:00
6a3a51407e playbooks: auto-update terraform template name after deploy
Add a third play to build-and-deploy-template.yml that updates
terraform/variables.tf with the new template name after deploying
to Proxmox. Only updates if the template name has changed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:59:13 +01:00
a1ae766eb8 template2: show bootstrap progress on tty1
- Display bootstrap banner and live progress on tty1 instead of login prompt
- Add custom getty greeting on other ttys indicating this is a bootstrap image
- Disable getty on tty1 during bootstrap so output is visible

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:49:58 +01:00
11999b37f3 flake: update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Fixes false "Some deployments failed" warning in MCP server when
deployments are still in progress.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:24:41 +01:00
29b2b7db52 Merge branch 'deploy-test-hosts'
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Add three permanent test hosts (testvm01, testvm02, testvm03) with:
- Static IPs: 10.69.13.20-22
- Vault AppRole integration with homelab-deploy policy
- Remote deployment via NATS (homelab.deploy.enable)
- Test tier configuration

Also updates create-host template to include vault.enable and
homelab.deploy.enable by default.
2026-02-07 14:09:40 +01:00
b046a1b862 terraform: remove flake_branch from test VMs
VMs are now bootstrapped and running. Remove temporary flake_branch
and vault_wrapped_token settings so they use master going forward.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:09:30 +01:00
38348c5980 vault: add homelab-deploy policy to generated hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
The homelab-deploy listener requires access to shared/homelab-deploy/*
secrets. Update hosts-generated.tf and the generator script to include
this policy automatically.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:05:42 +01:00
370cf2b03a hosts: enable vault and deploy listener on test VMs
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Add vault.enable = true to testvm01, testvm02, testvm03
- Add homelab.deploy.enable = true for remote deployment via NATS
- Update create-host template to include these by default

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:55:33 +01:00
7bc465b414 hosts: add testvm01, testvm02, testvm03 test hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Three permanent test hosts for validating deployment and bootstrapping
workflow. Each host configured with:
- Static IP (10.69.13.20-22/24)
- Vault AppRole integration
- Bootstrap from deploy-test-hosts branch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:34:16 +01:00
8d7bc50108 hosts: remove testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:58:24 +01:00
03e70ac094 hosts: remove vaulttest01
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:55:38 +01:00
3b32c9479f create-host: add approle removal and secrets detection
- Remove host entries from terraform/vault/approle.tf on --remove
- Detect and warn about secrets in terraform/vault/secrets.tf
- Include vault kv delete commands in removal instructions
- Update check_entries_exist to return approle status

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:54:42 +01:00
b0d35f9a99 create-host: fix flake.nix indentation patterns
The regex patterns expected 6 spaces of indentation, but flake.nix uses
8 spaces for host entries. Also update the generated entry template to
match the current flake.nix style (using commonModules ++).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:48:29 +01:00
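
The indentation mismatch fixed in commit b0d35f9a99 can be reproduced with a short Python sketch. The flake.nix excerpt and the helper find_host_entry are hypothetical and only show why a pattern anchored on 6 spaces misses an entry indented by 8; the real patterns in create-host are not shown in this log.

```python
import re

# Hypothetical flake.nix excerpt: the host entry is indented by 8 spaces.
FLAKE_SNIPPET = """\
      nixosConfigurations = {
        testvm01 = nixpkgs.lib.nixosSystem {
        };
      };
"""


def find_host_entry(text, hostname, indent=8):
    """Return True if a line starts with exactly `indent` spaces then 'hostname = '."""
    pattern = re.compile(
        rf"^ {{{indent}}}{re.escape(hostname)} = ",
        re.MULTILINE,
    )
    return pattern.search(text) is not None


# A 6-space pattern fails because the 7th character is a space, not the host name.
print(find_host_entry(FLAKE_SNIPPET, "testvm01", indent=6))
# An 8-space pattern matches the entry.
print(find_host_entry(FLAKE_SNIPPET, "testvm01", indent=8))
```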
26ca6817f0 homelab-deploy: enable prometheus metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m57s
- Update homelab-deploy input to get metrics support
- Enable metrics endpoint on port 9972
- Add scrape target for prometheus auto-discovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 08:04:23 +01:00
b03a9b3b64 docs: add long-term metrics storage plan
Compare VictoriaMetrics and Thanos as options for extending
metrics retention beyond 30 days while managing disk usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:56:10 +01:00
f805b9f629 mcp: add homelab-deploy MCP server
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:27:12 +01:00
f3adf7e77f CLAUDE.md: add homelab-deploy MCP documentation
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:25:44 +01:00
f6eca9decc vaulttest01: add htop for deploy verification test
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:23:22 +01:00
6e93b8eae3 Merge pull request 'add-deploy-homelab' (#28) from add-deploy-homelab into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Reviewed-on: #28
2026-02-07 05:56:51 +00:00
c214f8543c homelab: add deploy.enable option with assertion
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Run nix flake check / flake-check (pull_request) Successful in 2m7s
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
7933127d77 system: enable homelab-deploy listener for all vault hosts
Add system/homelab-deploy.nix module that automatically enables the
listener on all hosts with vault.enable=true. Uses homelab.host.tier
and homelab.host.role for NATS subject subscriptions.

- Add homelab-deploy access to all host AppRole policies
- Remove manual listener config from vaulttest01 (now handled by system module)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
13c3897e86 flake: update homelab-deploy, add to devShell
Update homelab-deploy to include bugfix. Add CLI to devShell for
easier testing and deployment operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
0643f23281 vaulttest01: add vault secret dependency to listener
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m32s
Ensure homelab-deploy-listener waits for the NKey secret to be
fetched from Vault before starting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:29:29 +01:00
ad8570f8db homelab-deploy: add NATS-based deployment system
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m45s
Add homelab-deploy flake input and NixOS module for message-based
deployments across the fleet. Configure DEPLOY account in NATS with
tiered access control (listener, test-deployer, admin-deployer).
Enable listener on vaulttest01 as initial test host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:22:06 +01:00
2f195d26d3 Merge pull request 'homelab-host-module' (#27) from homelab-host-module into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m8s
Reviewed-on: #27
2026-02-07 01:56:38 +00:00
a926d34287 nix-cache01: set priority to high
All checks were successful
Run nix flake check / flake-check (pull_request) Successful in 2m14s
Run nix flake check / flake-check (push) Successful in 2m17s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:54:32 +01:00
be2421746e gitignore: add result-* for parallel nix builds
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m4s
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:51:27 +01:00
12bf0683f5 modules: add homelab.host for host metadata
Add a shared `homelab.host` module that provides host metadata for
multiple consumers:
- tier: deployment tier (test/prod) for future homelab-deploy service
- priority: alerting priority (high/low) for Prometheus label filtering
- role: primary role of the host (dns, database, monitoring, etc.)
- labels: free-form labels for additional metadata

Host configurations updated with appropriate values:
- ns1, ns2: role=dns with dns_role labels
- nix-cache01: priority=low, role=build-host
- vault01: role=vault
- jump: role=bastion
- template, template2, testvm01, vaulttest01: tier=test, priority=low

The module is now imported via commonModules in flake.nix, making it
available to all hosts including minimal configurations like template2.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:49:58 +01:00
e8a43c6715 docs: add deploy_admin tool with opt-in flag to homelab-deploy plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
MCP exposes two tools:
- deploy: test-tier only, always available
- deploy_admin: all tiers, requires --enable-admin flag

Three security layers: CLI flag, NATS authz, Claude Code permissions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:29:13 +01:00
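
The two-tool split described in commit e8a43c6715 can be sketched as follows. This Python example models only the first of the three layers (the CLI flag). The function names are hypothetical, and the NATS authorization and Claude Code permission layers are not represented.

```python
def exposed_tools(enable_admin=False):
    """List the tools the MCP server would expose for a given flag setting."""
    tools = ["deploy"]  # always available
    if enable_admin:  # stands in for the --enable-admin flag
        tools.append("deploy_admin")
    return tools


def is_allowed(tool, tier, enable_admin=False):
    """Decide whether a tool may target a tier, per the plan in the commit."""
    if tool == "deploy":
        return tier == "test"  # test tier only
    if tool == "deploy_admin":
        return enable_admin  # all tiers, but only when opted in
    return False
```

For example, is_allowed("deploy", "prod") is rejected regardless of the flag, while is_allowed("deploy_admin", "prod", enable_admin=True) is accepted.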
eef52bb8c5 docs: add group deployment support to homelab-deploy plan
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Support deploying to all hosts in a tier or all hosts with a role:
- deploy.<tier>.all - broadcast to all hosts in tier
- deploy.<tier>.role.<role> - broadcast to hosts with matching role

MCP can deploy to all test hosts at once, admin can deploy to any group.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:22:17 +01:00
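
The group subjects named in commit eef52bb8c5 follow a simple pattern, sketched here in Python. The helper group_subjects and its token check are assumptions for illustration. Only the two subject shapes, deploy.<tier>.all and deploy.<tier>.role.<role>, come from the commit message.

```python
def group_subjects(tier, role=None):
    """Return the group subjects a host with this tier and role would match."""
    for token in (tier, role):
        # NATS subject tokens must not contain dots, wildcards or whitespace.
        if token is not None and (
            not token or any(c in token for c in ".*> \t")
        ):
            raise ValueError(f"invalid subject token: {token!r}")
    subjects = [f"deploy.{tier}.all"]
    if role is not None:
        subjects.append(f"deploy.{tier}.role.{role}")
    return subjects
```

Under these assumptions, a test-tier DNS host would match deploy.test.all and deploy.test.role.dns, so a single publish to either subject reaches every host in that group.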
c6cdbc6799 docs: move nixos-exporter plan to completed
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:13:14 +01:00
4d724329a6 docs: add homelab-deploy plan, unify host metadata
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add plan for NATS-based deployment service (homelab-deploy) that enables
on-demand NixOS configuration updates via messaging. Features tiered
permissions (test/prod) enforced at NATS layer.

Update prometheus-scrape-target-labels plan to share the homelab.host
module for host metadata (tier, priority, role, labels) - single source
of truth for both deployment tiers and prometheus labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 02:10:54 +01:00
881e70df27 monitoring: relax systemd_not_running alert threshold
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Increase duration from 5m to 10m and demote severity from critical to
warning. Brief degraded states during nixos-rebuild are normal and were
causing false positive alerts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:22:29 +01:00
b9a269d280 chore: rename metrics skill to observability, add logs reference
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Merge Prometheus metrics and Loki logs into a unified troubleshooting
skill. Adds LogQL query patterns, label reference, and common service
units for log searching.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:17:41 +01:00
fcf1a66103 chore: add metrics troubleshooting skill
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Reference guide for exploring Prometheus metrics when troubleshooting
homelab issues, including the new nixos_flake_info metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:11:40 +01:00
2034004280 flake: update nixos-exporter and set configurationRevision
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m33s
- Update nixos-exporter to 0.2.3
- Set system.configurationRevision for all hosts so the exporter
  can report the flake's git revision

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 01:06:47 +01:00
af43f88394 flake.lock: Update
Flake lock file updates:

• Updated input 'nixos-exporter':
    'git+https://git.t-juice.club/torjus/nixos-exporter?ref=refs/heads/master&rev=9c29505814954352b2af99b97910ee12a736b8dd' (2026-02-06)
  → 'git+https://git.t-juice.club/torjus/nixos-exporter?ref=refs/heads/master&rev=04eba77ac028033b6dfed604eb1b5664b46acc77' (2026-02-06)
2026-02-07 00:01:02 +00:00
a834497fe8 flake: update nixos-exporter input
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m27s
Periodic flake update / flake-update (push) Successful in 1m7s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 00:17:54 +01:00
d3de2a1511 Merge pull request 'monitoring: add nixos-exporter to all hosts' (#26) from nixos-exporter into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m6s
Reviewed-on: #26
2026-02-06 22:56:04 +00:00
97ff774d3f monitoring: add nixos-exporter to all hosts
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m16s
Run nix flake check / flake-check (pull_request) Successful in 3m14s
Add nixos-exporter prometheus exporter to track NixOS generation metrics
and flake revision status across all hosts.

Changes:
- Add nixos-exporter flake input
- Add commonModules list in flake.nix for modules shared by all hosts
- Enable nixos-exporter in system/monitoring/metrics.nix
- Configure Prometheus to scrape nixos-exporter on all hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 23:55:29 +01:00
f2c30cc24f chore: give claude the quick-plan skill
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m57s
2026-02-06 21:58:30 +01:00
7e80d2e0bc docs: add plans for nixos and homelab prometheus exporters
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:56:55 +01:00
1f5b7b13e2 monitoring: enable restart-count and ip-accounting collectors
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m11s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:30:47 +01:00
c53e36c3f3 Revert "monitoring: enable additional systemd-exporter collectors"
This reverts commit 04a252b857.
2026-02-06 21:30:05 +01:00
04a252b857 monitoring: enable additional systemd-exporter collectors
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Enables restart-count, file-descriptor-size, and ip-accounting collectors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 21:28:44 +01:00
5d26f52e0d Revert "monitoring: enable cpu, memory, io collectors for systemd-exporter"
This reverts commit 506a692548.
2026-02-06 21:26:20 +01:00