From 8fbf1224fa4083f5d3fce4f602b0597af546f0b2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sun, 8 Feb 2026 02:05:21 +0100
Subject: [PATCH] docs: add host creation pipeline documentation

Document the end-to-end host creation workflow including:
- Prerequisites and step-by-step process
- Tier specification (test vs prod)
- Bootstrap observability via Loki
- Verification steps
- Troubleshooting guide
- Related files reference

Update CLAUDE.md to reference the new document.

Co-Authored-By: Claude Opus 4.5
---
 CLAUDE.md             |  19 +---
 docs/host-creation.md | 217 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 222 insertions(+), 14 deletions(-)
 create mode 100644 docs/host-creation.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 5a9bf69..d1e42df 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -496,20 +496,11 @@ This means:
 
 ### Adding a New Host
 
-1. Create `/hosts//` directory
-2. Copy structure from `template1` or similar host
-3. Add host entry to `flake.nix` nixosConfigurations
-4. Configure networking in `configuration.nix` (static IP via `systemd.network.networks`, DNS servers)
-5. (Optional) Add `homelab.dns.cnames` if the host needs CNAME aliases
-6. Add `vault.enable = true;` to the host configuration
-7. Add AppRole policy in `terraform/vault/approle.tf` and any secrets in `secrets.tf`
-8. Run `tofu apply` in `terraform/vault/`
-9. User clones template host
-10. User runs `prepare-host.sh` on new host
-11. Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=`
-12. Commit changes, and merge to master.
-13. Deploy by running `nixos-rebuild boot --flake URL#` on the host.
-14. Run auto-upgrade on DNS servers (ns1, ns2) to pick up the new host's DNS entry
+See [docs/host-creation.md](docs/host-creation.md) for the complete host creation pipeline, including:
+- Using the `create-host` script to generate host configurations
+- Deploying VMs and secrets with OpenTofu
+- Monitoring the bootstrap process via Loki
+- Verification and troubleshooting steps
 
 **Note:** DNS A records and Prometheus node-exporter scrape targets are auto-generated from the host's `systemd.network.networks` static IP configuration. No manual zone file or Prometheus config editing is required.
 
diff --git a/docs/host-creation.md b/docs/host-creation.md
new file mode 100644
index 0000000..af3bf44
--- /dev/null
+++ b/docs/host-creation.md
@@ -0,0 +1,217 @@
+# Host Creation Pipeline
+
+This document describes the process for creating new hosts in the homelab infrastructure.
+
+## Overview
+
+We use the `create-host` script to create new hosts, which generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot using a template image (built from `hosts/template2`), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.
+
+## Prerequisites
+
+All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.
+
+```bash
+nix develop
+```
+
+## Steps
+
+Steps marked with **USER** must be performed by the user due to credential requirements.
+
+1. **USER**: Run `create-host --hostname --ip `
+2. Edit the auto-generated configurations in `hosts//` to import whatever modules are needed for its purpose
+3. Add any secrets needed to `terraform/vault/`
+4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = ""` to the VM definition
+5. Push configuration to master (or the branch specified by `flake_branch`)
+6. **USER**: Apply terraform:
+   ```bash
+   nix develop -c tofu -chdir=terraform/vault apply
+   nix develop -c tofu -chdir=terraform apply
+   ```
+7. Once terraform completes, a VM boots in Proxmox using the template image
+8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
+9. After reboot, the host should be operational
+10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
+11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets
+
+## Tier Specification
+
+New hosts should set `homelab.host.tier` in their configuration:
+
+```nix
+homelab.host.tier = "test"; # or "prod"
+```
+
+- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
+- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
+
+## Observability
+
+During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
+
+```
+{job="bootstrap", host=""}
+```
+
+### Bootstrap Stages
+
+The bootstrap process reports these stages via the `stage` label:
+
+| Stage | Message | Meaning |
+|-------|---------|---------|
+| `starting` | Bootstrap starting for \ (branch: \) | Bootstrap service has started |
+| `network_ok` | Network connectivity confirmed | Can reach git server |
+| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
+| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
+| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
+| `building` | Starting nixos-rebuild boot | NixOS build starting |
+| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
+| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
+
+### Useful Queries
+
+```
+# All bootstrap activity for a host
+{job="bootstrap", host="myhost"}
+
+# Track all failures
+{job="bootstrap", stage="failed"}
+
+# Monitor builds in progress
+{job="bootstrap", stage=~"building|success"}
+```
+
+Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
+
+## Verification
+
+1. Check bootstrap completed successfully:
+   ```
+   {job="bootstrap", host="", stage="success"}
+   ```
+
+2. Verify the host is up and reporting metrics:
+   ```promql
+   up{instance=~".*"}
+   ```
+
+3. Verify the correct flake revision is deployed:
+   ```promql
+   nixos_flake_info{instance=~".*"}
+   ```
+
+4. Check logs are flowing:
+   ```
+   {host=""}
+   ```
+
+5. Confirm expected services are running and producing logs
+
+## Troubleshooting
+
+### Bootstrap Failed
+
+#### Common Issues
+
+* VM has trouble running the initial nixos-rebuild. This usually happens when it has to compile packages from scratch because they are not available in our local nix-cache.
+
+#### Troubleshooting
+
+1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
+   ```
+   {job="bootstrap", host=""}
+   ```
+
+2. **USER**: SSH into the host and check the bootstrap service:
+   ```bash
+   ssh root@
+   journalctl -u nixos-bootstrap.service
+   ```
+
+3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
+   ```bash
+   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#
+   ```
+
+4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
+
+### Vault Credentials Not Working
+
+Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or already been used.
+
+#### Troubleshooting
+
+1. Check if credentials exist on the host:
+   ```bash
+   ssh root@
+   ls -la /var/lib/vault/approle/
+   ```
+
+2. Check bootstrap logs for vault-related stages:
+   ```
+   {job="bootstrap", host="", stage=~"vault.*"}
+   ```
+
+3. **USER**: Regenerate and provision credentials manually:
+   ```bash
+   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=
+   ```
+
+### Host Not Appearing in DNS
+
+Usually caused by not having deployed the commit with the new host to ns1/ns2.
+
+#### Troubleshooting
+
+1. Verify the host config has a static IP configured in `systemd.network.networks`
+
+2. Check that `homelab.dns.enable` is not set to `false`
+
+3. **USER**: Trigger auto-upgrade on DNS servers:
+   ```bash
+   ssh root@ns1 systemctl start nixos-upgrade.service
+   ssh root@ns2 systemctl start nixos-upgrade.service
+   ```
+
+4. Verify DNS resolution after upgrade completes:
+   ```bash
+   dig @ns1.home.2rjus.net .home.2rjus.net
+   ```
+
+### Host Not Being Scraped by Prometheus
+
+Usually caused by not having deployed the commit with the new host to the monitoring host.
+
+#### Troubleshooting
+
+1. Check that `homelab.monitoring.enable` is not set to `false`
+
+2. **USER**: Trigger auto-upgrade on monitoring01:
+   ```bash
+   ssh root@monitoring01 systemctl start nixos-upgrade.service
+   ```
+
+3. Verify the target appears in Prometheus:
+   ```promql
+   up{instance=~".*"}
+   ```
+
+4. If the target is down, check that node-exporter is running on the host:
+   ```bash
+   ssh root@ systemctl status prometheus-node-exporter.service
+   ```
+
+## Related Files
+
+| Path | Description |
+|------|-------------|
+| `scripts/create-host/` | The `create-host` script that generates host configurations |
+| `hosts/template2/` | Template VM configuration (base image for new VMs) |
+| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
+| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
+| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
+| `terraform/vault/approle.tf` | AppRole policies for each host |
+| `terraform/vault/secrets.tf` | Secret definitions in Vault |
+| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
+| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
+| `flake.nix` | Flake with all host configurations (add new hosts here) |
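
As a companion to the Loki queries documented in this patch, here is a minimal sketch of how the bootstrap selectors could be built and sent from a shell. It is not part of the commit. The `LOKI_URL` default and the function names `bootstrap_selector` and `bootstrap_logs` are hypothetical; only the `job`, `host` and `stage` labels and Loki's `/loki/api/v1/query_range` endpoint come from the document or from Loki itself.

```bash
#!/usr/bin/env bash
# Sketch only: LOKI_URL is a hypothetical address, not taken from the repository.
LOKI_URL="${LOKI_URL:-http://loki.example:3100}"

# Print the LogQL selector for one host's bootstrap logs,
# optionally narrowed to a single stage (e.g. "failed").
bootstrap_selector() {
  local host="$1" stage="${2:-}"
  if [ -n "$stage" ]; then
    printf '{job="bootstrap", host="%s", stage="%s"}' "$host" "$stage"
  else
    printf '{job="bootstrap", host="%s"}' "$host"
  fi
}

# Send the selector to Loki's query_range endpoint and print the raw JSON reply.
# Takes the same arguments as bootstrap_selector.
bootstrap_logs() {
  curl -sG "${LOKI_URL}/loki/api/v1/query_range" \
    --data-urlencode "query=$(bootstrap_selector "$@")" \
    --data-urlencode "limit=50"
}
```

After sourcing the file, `bootstrap_logs myhost failed` would request only the `failed` stage for `myhost`. Splitting selector construction from the HTTP call keeps the query string easy to inspect without a reachable Loki instance.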