nixos-servers/docs/host-creation.md

# Host Creation Pipeline

This document describes the process for creating new hosts in the homelab infrastructure.

## Overview

We use the `create-host` script to create new hosts, which generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot using a template image (built from `hosts/template2`), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.

## Prerequisites

All tools are available in the devshell: `create-host`, `bao` (OpenBao CLI), `tofu`.

```bash
nix develop
```

## Steps

Steps marked with **USER** must be performed by the user due to credential requirements.

1. **USER**: Run `create-host --hostname <name> --ip <ip/prefix>`
2. Edit the auto-generated configurations in `hosts/<hostname>/` to import whatever modules are needed for its purpose
3. Add any secrets needed to `terraform/vault/`
4. Edit the VM specs in `terraform/vms.tf` if needed. To deploy from a branch other than master, add `flake_branch = "<branch>"` to the VM definition
5. Push configuration to master (or the branch specified by `flake_branch`)
6. **USER**: Apply terraform:
   ```bash
   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply
   ```
7. Once terraform completes, a VM boots in Proxmox using the template image
8. The VM runs the `nixos-bootstrap` service, which applies the host config and reboots
9. After reboot, the host should be operational
10. Trigger auto-upgrade on `ns1` and `ns2` to propagate DNS records for the new host
11. Trigger auto-upgrade on `monitoring01` to add the host to Prometheus scrape targets

## Tier Specification

New hosts should set `homelab.host.tier` in their configuration:

```nix
homelab.host.tier = "test";  # or "prod"
```

- **test** - Test-tier hosts can receive remote deployments via the `homelab-deploy` MCP server and have different credential access. Use for staging/testing.
- **prod** - Production hosts. Deployments require direct access or the CLI with appropriate credentials.

## Observability

During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:

```
{job="bootstrap", host="<hostname>"}
```

### Bootstrap Stages

The bootstrap process reports these stages via the `stage` label:

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for \<host\> (branch: \<branch\>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |

### Useful Queries

```
# All bootstrap activity for a host
{job="bootstrap", host="myhost"}

# Track all failures
{job="bootstrap", stage="failed"}

# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}
```

Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.

## Verification

1. Check bootstrap completed successfully:
   ```
   {job="bootstrap", host="<hostname>", stage="success"}
   ```

2. Verify the host is up and reporting metrics:
   ```promql
   up{instance=~"<hostname>.*"}
   ```

3. Verify the correct flake revision is deployed:
   ```promql
   nixos_flake_info{instance=~"<hostname>.*"}
   ```

4. Check logs are flowing:
   ```
   {host="<hostname>"}
   ```

5. Confirm expected services are running and producing logs

## Troubleshooting

### Bootstrap Failed

#### Common Issues

* VM has trouble running initial nixos-rebuild. Usually caused if it needs to compile packages from scratch if they are not available in our local nix-cache.

#### Troubleshooting

1. Check bootstrap logs in Loki - if they never progress past `building`, the rebuild likely consumed all resources:
   ```
   {job="bootstrap", host="<hostname>"}
   ```

2. **USER**: SSH into the host and check the bootstrap service:
   ```bash
   ssh root@<hostname>
   journalctl -u nixos-bootstrap.service
   ```

3. If the build failed due to resource constraints, increase VM specs in `terraform/vms.tf` and redeploy, or manually run the rebuild:
   ```bash
   nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
   ```

4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).

### Vault Credentials Not Working

Usually caused by running the `create-host` script without proper credentials, or the wrapped token has expired/already been used.

#### Troubleshooting

1. Check if credentials exist on the host:
   ```bash
   ssh root@<hostname>
   ls -la /var/lib/vault/approle/
   ```

2. Check bootstrap logs for vault-related stages:
   ```
   {job="bootstrap", host="<hostname>", stage=~"vault.*"}
   ```

3. **USER**: Regenerate and provision credentials manually:
   ```bash
   nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
   ```

### Host Not Appearing in DNS

Usually caused by not having deployed the commit with the new host to ns1/ns2.

#### Troubleshooting

1. Verify the host config has a static IP configured in `systemd.network.networks`

2. Check that `homelab.dns.enable` is not set to `false`

3. **USER**: Trigger auto-upgrade on DNS servers:
   ```bash
   ssh root@ns1 systemctl start nixos-upgrade.service
   ssh root@ns2 systemctl start nixos-upgrade.service
   ```

4. Verify DNS resolution after upgrade completes:
   ```bash
   dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
   ```

### Host Not Being Scraped by Prometheus

Usually caused by not having deployed the commit with the new host to the monitoring host.

#### Troubleshooting

1. Check that `homelab.monitoring.enable` is not set to `false`

2. **USER**: Trigger auto-upgrade on monitoring01:
   ```bash
   ssh root@monitoring01 systemctl start nixos-upgrade.service
   ```

3. Verify the target appears in Prometheus:
   ```promql
   up{instance=~"<hostname>.*"}
   ```

4. If the target is down, check that node-exporter is running on the host:
   ```bash
   ssh root@<hostname> systemctl status prometheus-node-exporter.service
   ```

## Related Files

| Path | Description |
|------|-------------|
| `scripts/create-host/` | The `create-host` script that generates host configurations |
| `hosts/template2/` | Template VM configuration (base image for new VMs) |
| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
| `terraform/vault/approle.tf` | AppRole policies for each host |
| `terraform/vault/secrets.tf` | Secret definitions in Vault |
| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
| `flake.nix` | Flake with all host configurations (add new hosts here) |