# Host Creation Pipeline

This document describes the process for creating new hosts in the homelab infrastructure.

## Overview

We use the `create-host` script to create new hosts; it generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot from a template image (built from `hosts/template2`), which starts a bootstrap process. The bootstrap process applies the host's NixOS configuration and then reboots into the new config.

## Prerequisites

All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.

```bash
nix develop
```

## Steps

Steps marked with **USER** must be performed by the user due to credential requirements.

1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
3. Add any secrets needed to `terraform/vault/`
4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
5. Push configuration to master (or the branch specified by `flake_branch`)
6. **USER**: Apply terraform:
   ```bash
   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply
   ```
7. Once terraform completes, a VM boots in Proxmox using the template image
8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
9. After reboot, the host should be operational
10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets

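
The two `tofu` applies in step 6 and the upgrade triggers in steps 10 and 11 can be collected into a small wrapper. This is only a sketch: the function names and the `DRY_RUN` variable are hypothetical, and the `ssh` commands assume root SSH access to `ns1`, `ns2` and `monitoring01`. Both functions still have to be run by the user, since they need credentials.

```bash
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 6: apply terraform/vault, then the VM definitions, in the documented order.
apply_host() {
  run nix develop -c tofu -chdir=terraform/vault apply &&
    run nix develop -c tofu -chdir=terraform apply
}

# Steps 10-11: trigger auto-upgrade so DNS and Prometheus pick up the new host.
propagate_host() {
  for server in ns1 ns2 monitoring01; do
    run ssh "root@$server" systemctl start nixos-upgrade.service || return 1
  done
}
```

Setting `DRY_RUN=1` before calling `apply_host` or `propagate_host` prints the commands without executing them, which is a cheap way to review what will run.
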
## Tier Specification

New hosts should set `homelab.host.tier` in their configuration:

```nix
homelab.host.tier = "test"; # or "prod"
```

- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.

## Observability

During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:

```
{job="bootstrap", host="<hostname>"}
```

### Bootstrap Stages

The bootstrap process reports these stages via the `stage` label:

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

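
When scripting around these stages, it can help to reduce them to a coarse status. The sketch below is one possible grouping of the stages in the table. The status names (`in-progress`, `warning`, `done`, `error`) and the function name are illustrative and not part of the bootstrap service.

```bash
# Map a bootstrap stage to a coarse status (illustrative grouping).
stage_status() {
  case "$1" in
    starting|network_ok|vault_ok|vault_skip|building) echo "in-progress" ;;
    vault_warn) echo "warning" ;;   # bootstrap continues, but without secrets
    success) echo "done" ;;         # build finished, host is rebooting
    failed) echo "error" ;;         # manual intervention required
    *) echo "unknown" ;;
  esac
}
```

Only `success` and `failed` are terminal. A host that stays at any other stage is either still working or stuck.
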
### Useful Queries

```
# All bootstrap activity for a host
{job="bootstrap", host="myhost"}

# Track all failures
{job="bootstrap", stage="failed"}

# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}
```

Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.

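
The bootstrap queries can also be run from a terminal against Loki's HTTP API. The following is a minimal sketch: `LOKI_URL` is an assumption (this document does not state where Loki is reachable), and the function names are hypothetical. It uses the standard `/loki/api/v1/query_range` endpoint with Loki's default time range.

```bash
# Build the LogQL selector for a host, optionally narrowed to one stage.
bootstrap_selector() {
  if [ -n "${2:-}" ]; then
    printf '{job="bootstrap", host="%s", stage="%s"}\n' "$1" "$2"
  else
    printf '{job="bootstrap", host="%s"}\n' "$1"
  fi
}

# Query Loki over HTTP. LOKI_URL is an assumption; point it at your Loki instance.
bootstrap_logs() {
  curl -sG "${LOKI_URL:?set LOKI_URL}/loki/api/v1/query_range" \
    --data-urlencode "query=$(bootstrap_selector "$@")" \
    --data-urlencode "limit=100"
}
```

For example, `bootstrap_logs myhost failed` would return the JSON response for that host's `failed` stage entries.
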
## Verification

1. Check bootstrap completed successfully:
   ```
   {job="bootstrap", host="<hostname>", stage="success"}
   ```

2. Verify the host is up and reporting metrics:
   ```promql
   up{instance=~"<hostname>.*"}
   ```

3. Verify the correct flake revision is deployed:
   ```promql
   nixos_flake_info{instance=~"<hostname>.*"}
   ```

4. Check logs are flowing:
   ```
   {host="<hostname>"}
   ```

5. Confirm expected services are running and producing logs

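
The two PromQL checks can be scripted against the Prometheus HTTP API. This is a sketch under assumptions: `PROM_URL` is not defined anywhere in this document and must be set to the Prometheus instance, and the function names are hypothetical. It uses the standard `/api/v1/query` instant-query endpoint.

```bash
# Build the PromQL selector used in the verification steps for a metric and host.
host_selector() {
  printf '%s{instance=~"%s.*"}\n' "$1" "$2"
}

# Run an instant query against Prometheus. PROM_URL is an assumption.
prom_query() {
  curl -sG "${PROM_URL:?set PROM_URL}/api/v1/query" \
    --data-urlencode "query=$(host_selector "$1" "$2")"
}
```

`prom_query up myhost` and `prom_query nixos_flake_info myhost` then correspond to steps 2 and 3.
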
## Troubleshooting

### Bootstrap Failed

#### Common Issues

* The VM has trouble running the initial `nixos-rebuild`. This usually happens when packages are not available in our local nix-cache and have to be compiled from scratch.

#### Troubleshooting

1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
   ```
   {job="bootstrap", host="<hostname>"}
   ```

2. **USER**: SSH into the host and check the bootstrap service:
   ```bash
   ssh root@<hostname>
   journalctl -u nixos-bootstrap.service
   ```

3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
   ```bash
   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
   ```

4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).

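
If the host was deployed with `flake_branch` set, a manual rebuild presumably needs to point at the same branch. The sketch below builds the flake reference using the `?ref=` parameter of Nix's `git+https` flake URLs. The function name is hypothetical, and that the bootstrap service selects the branch in the same way is an assumption.

```bash
# Build the flake reference for a host. The optional second argument selects a
# branch via ?ref=; without it, the repository's default branch is used.
flake_ref() {
  repo="git+https://git.t-juice.club/torjus/nixos-servers.git"
  if [ -n "${2:-}" ]; then
    printf '%s?ref=%s#%s\n' "$repo" "$2" "$1"
  else
    printf '%s#%s\n' "$repo" "$1"
  fi
}
```

On the host this would be used as `nixos-rebuild boot --flake "$(flake_ref <hostname> <branch>)"`. The quotes keep the shell from interpreting `?` and `#`.
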
### Vault Credentials Not Working

Usually caused by running the `create-host` script without proper credentials, or by a wrapped token that has expired or already been used.

#### Troubleshooting

1. Check if credentials exist on the host:
   ```bash
   ssh root@<hostname>
   ls -la /var/lib/vault/approle/
   ```

2. Check bootstrap logs for vault-related stages:
   ```
   {job="bootstrap", host="<hostname>", stage=~"vault.*"}
   ```

3. **USER**: Regenerate and provision credentials manually:
   ```bash
   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
   ```

### Host Not Appearing in DNS

Usually caused by the commit that adds the new host not yet being deployed to ns1/ns2.

#### Troubleshooting

1. Verify the host config has a static IP configured in `systemd.network.networks`

2. Check that `homelab.dns.enable` is not set to `false`

3. **USER**: Trigger auto-upgrade on DNS servers:
   ```bash
   ssh root@ns1 systemctl start nixos-upgrade.service
   ssh root@ns2 systemctl start nixos-upgrade.service
   ```

4. Verify DNS resolution after upgrade completes:
   ```bash
   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
   ```

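
Since records have to reach both DNS servers, it can be useful to query each one in turn. This is a sketch only: the function names are hypothetical, and `ns2.home.2rjus.net` is assumed to follow the same naming pattern as `ns1.home.2rjus.net`.

```bash
# Fully qualified name of a host in the home.2rjus.net zone.
host_fqdn() {
  printf '%s.home.2rjus.net\n' "$1"
}

# Ask each DNS server for the host's A record and print one line per server.
check_dns() {
  for server in ns1 ns2; do
    answer=$(dig +short "@$(host_fqdn "$server")" "$(host_fqdn "$1")" A)
    printf '%s: %s\n' "$server" "${answer:-no answer}"
  done
}
```

If `check_dns <hostname>` prints `no answer` for only one server, that server has most likely not been upgraded yet.
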
### Host Not Being Scraped by Prometheus

Usually caused by the commit that adds the new host not yet being deployed to the monitoring host.

#### Troubleshooting

1. Check that `homelab.monitoring.enable` is not set to `false`

2. **USER**: Trigger auto-upgrade on monitoring01:
   ```bash
   ssh root@monitoring01 systemctl start nixos-upgrade.service
   ```

3. Verify the target appears in Prometheus:
   ```promql
   up{instance=~"<hostname>.*"}
   ```

4. If the target is down, check that node-exporter is running on the host:
   ```bash
   ssh root@<hostname> systemctl status prometheus-node-exporter.service
   ```

## Related Files

| Path | Description |
|------|-------------|
| `scripts/create-host/` | The `create-host` script that generates host configurations |
| `hosts/template2/` | Template VM configuration (base image for new VMs) |
| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
| `terraform/vault/approle.tf` | AppRole policies for each host |
| `terraform/vault/secrets.tf` | Secret definitions in Vault |
| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
| `flake.nix` | Flake with all host configurations (add new hosts here) |