# Host Creation Pipeline

This document describes the process for creating new hosts in the homelab infrastructure.

## Overview

We use the `create-host` script to create new hosts, which generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot using a template image (built from `hosts/template2`), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.

## Prerequisites

All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.

```bash
nix develop
```

## Steps

Steps marked with **USER** must be performed by the user due to credential requirements.

1. **USER**: Run `create-host --hostname <hostname> --ip <ip>`
2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
3. Add any secrets needed to `terraform/vault/`
4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
5. Push the configuration to master (or the branch specified by `flake_branch`)
6. **USER**: Apply terraform:
   ```bash
   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply
   ```
7. Once terraform completes, a VM boots in Proxmox using the template image
8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
9. After the reboot, the host should be operational
10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets

## Tier Specification

New hosts should set `homelab.host.tier` in their configuration:

```nix
homelab.host.tier = "test"; # or "prod"
```

- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.

## Observability

During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:

```
{job="bootstrap", hostname="<hostname>"}
```

### Bootstrap Stages

The bootstrap process reports these stages via the `stage` label:

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<hostname\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

### Useful Queries

```
# All bootstrap activity for a host
{job="bootstrap", hostname="myhost"}

# Track all failures
{job="bootstrap", stage="failed"}

# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}
```

Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.

## Verification

1. Check that the bootstrap completed successfully:
   ```
   {job="bootstrap", hostname="<hostname>", stage="success"}
   ```
2. Verify the host is up and reporting metrics:
   ```promql
   up{instance=~"<hostname>.*"}
   ```
3. Verify the correct flake revision is deployed:
   ```promql
   nixos_flake_info{instance=~"<hostname>.*"}
   ```
4. Check that logs are flowing:
   ```
   {hostname="<hostname>"}
   ```
5. Confirm expected services are running and producing logs

## Troubleshooting

### Bootstrap Failed

#### Common Issues

* The VM has trouble running the initial nixos-rebuild. This is usually caused by having to compile packages from scratch when they are not available in our local nix cache.

#### Troubleshooting

1. Check the bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
   ```
   {job="bootstrap", hostname="<hostname>"}
   ```
2. **USER**: SSH into the host and check the bootstrap service:
   ```bash
   ssh root@<hostname> journalctl -u nixos-bootstrap.service
   ```
3. If the build failed due to resource constraints, increase the VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
   ```bash
   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
   ```
4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).

### Vault Credentials Not Working

Usually caused by running the `create-host` script without proper credentials, or because the wrapped token has expired or has already been used.

#### Troubleshooting

1. Check whether credentials exist on the host:
   ```bash
   ssh root@<hostname> ls -la /var/lib/vault/approle/
   ```
2. Check the bootstrap logs for vault-related stages:
   ```
   {job="bootstrap", hostname="<hostname>", stage=~"vault.*"}
   ```
3. **USER**: Regenerate and provision credentials manually:
   ```bash
   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
   ```

### Host Not Appearing in DNS

Usually caused by the commit containing the new host not yet being deployed to ns1/ns2.

#### Troubleshooting

1. Verify the host config has a static IP configured in `systemd.network.networks`
2. Check that `homelab.dns.enable` is not set to `false`
3. **USER**: Trigger auto-upgrade on the DNS servers:
   ```bash
   ssh root@ns1 systemctl start nixos-upgrade.service
   ssh root@ns2 systemctl start nixos-upgrade.service
   ```
4. Verify DNS resolution after the upgrade completes:
   ```bash
   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
   ```

### Host Not Being Scraped by Prometheus

Usually caused by the commit containing the new host not yet being deployed to the monitoring host.

#### Troubleshooting

1. Check that `homelab.monitoring.enable` is not set to `false`
2. **USER**: Trigger auto-upgrade on monitoring01:
   ```bash
   ssh root@monitoring01 systemctl start nixos-upgrade.service
   ```
3. Verify the target appears in Prometheus:
   ```promql
   up{instance=~"<hostname>.*"}
   ```
4. If the target is down, check that node-exporter is running on the host:
   ```bash
   ssh root@<hostname> systemctl status prometheus-node-exporter.service
   ```

## Related Files

| Path | Description |
|------|-------------|
| `scripts/create-host/` | The `create-host` script that generates host configurations |
| `hosts/template2/` | Template VM configuration (base image for new VMs) |
| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
| `terraform/vault/approle.tf` | AppRole policies for each host |
| `terraform/vault/secrets.tf` | Secret definitions in Vault |
| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
| `flake.nix` | Flake with all host configurations (add new hosts here) |
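## Appendix: Post-Bootstrap Sanity Check

Parts of the verification and troubleshooting steps above can be scripted. Below is a minimal sketch, intended to be run on the new host itself. It assumes a POSIX shell; the `role_id`/`secret_id` filenames under `/var/lib/vault/approle/` are an assumption (this document only checks that the directory is populated), and the LogQL selector simply reproduces the labels shown in the Observability section.

```shell
#!/bin/sh
# Sketch: post-bootstrap sanity check for a newly created host.
# Assumptions (not confirmed elsewhere in this document): AppRole credentials
# are stored as role_id/secret_id files under /var/lib/vault/approle/.

# Build the LogQL selector for a host's bootstrap logs, ready to paste into
# Grafana or logcli.
bootstrap_selector() {
    printf '{job="bootstrap", hostname="%s"}' "$1"
}

# Return success if the AppRole credential files exist and are non-empty.
approle_ok() {
    dir="${1:-/var/lib/vault/approle}"
    [ -s "$dir/role_id" ] && [ -s "$dir/secret_id" ]
}

host="${1:-$(hostname)}"
echo "bootstrap log query: $(bootstrap_selector "$host")"

if approle_ok; then
    echo "vault: AppRole credentials present"
else
    echo "vault: AppRole credentials missing - see 'Vault Credentials Not Working'" >&2
fi

# If the bootstrap service is still active, the build may be in progress.
if systemctl is-active nixos-bootstrap.service >/dev/null 2>&1; then
    echo "bootstrap: service still active (build may be in progress)"
else
    echo "bootstrap: service not active (finished or failed - check journalctl)"
fi
```

The selector helper is only a convenience for copy-pasting queries; the authoritative checks remain the Loki `stage="success"` query and the Prometheus `up` metric described under Verification.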