Host Creation Pipeline
This document describes the process for creating new hosts in the homelab infrastructure.
Overview
We use the create-host script to create new hosts; it generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot from a template image (built from hosts/template2), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.
Prerequisites
All tools are available in the devshell: create-host, bao (OpenBao CLI), tofu.
nix develop
Steps
Steps marked with USER must be performed by the user due to credential requirements.
1. USER: Run create-host --hostname <name> --ip <ip/prefix>
2. Edit the auto-generated configurations in hosts/<hostname>/ to import whatever modules are needed for its purpose
3. Add any secrets needed to terraform/vault/
4. Edit the VM specs in terraform/vms.tf if needed. To deploy from a branch other than master, add flake_branch = "<branch>" to the VM definition
5. Push configuration to master (or the branch specified by flake_branch)
6. USER: Apply terraform:
   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply
7. Once terraform completes, a VM boots in Proxmox using the template image
8. The VM runs the nixos-bootstrap service, which applies the host config and reboots
9. After reboot, the host should be operational
10. Trigger auto-upgrade on ns1 and ns2 to propagate DNS records for the new host
11. Trigger auto-upgrade on monitoring01 to add the host to Prometheus scrape targets
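The two apply commands in the USER step above can be wrapped in a small helper so that the vault apply always runs before the VM apply. This is only a sketch: the function name apply_new_host and the DRY_RUN variable are hypothetical and not part of the repository; the commands themselves are the ones listed in the steps.

```shell
# Hypothetical helper: runs the two applies from the steps above, in order.
# Set DRY_RUN=1 to print the commands instead of executing them.
apply_new_host() {
    local dir
    for dir in terraform/vault terraform; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "nix develop -c tofu -chdir=$dir apply"
        else
            # Stop at the first failure so the VM apply never follows a failed vault apply.
            nix develop -c tofu "-chdir=$dir" apply || return 1
        fi
    done
}
```

Running DRY_RUN=1 apply_new_host from the repository root prints both commands without touching any infrastructure, which is a cheap way to check the order before a real apply.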
Tier Specification
New hosts should set homelab.host.tier in their configuration:
homelab.host.tier = "test"; # or "prod"
- test - Test-tier hosts can receive remote deployments via the homelab-deploy MCP server and have different credential access. Use for staging/testing.
- prod - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
Observability
During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:
{job="bootstrap", hostname="<hostname>"}
Bootstrap Stages
The bootstrap process reports these stages via the stage label:
| Stage | Message | Meaning |
|---|---|---|
| starting | Bootstrap starting for <host> (branch: <branch>) | Bootstrap service has started |
| network_ok | Network connectivity confirmed | Can reach git server |
| vault_ok | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| vault_skip | No Vault token provided - skipping credential setup | No wrapped token was provided |
| vault_warn | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| building | Starting nixos-rebuild boot | NixOS build starting |
| success | Build successful - rebooting into new configuration | Build complete, rebooting |
| failed | nixos-rebuild failed - manual intervention required | Build failed |
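When watching a bootstrap from a script, it can help to reduce the stage label to a coarse status. The sketch below is one possible reading of the table above, not something the bootstrap service provides: the function name stage_status and the status strings are hypothetical.

```shell
# Hypothetical helper: map a stage label from the table above to a coarse status.
stage_status() {
    case "$1" in
        success) echo "done" ;;
        failed) echo "needs-attention" ;;
        # Bootstrap continues after these, but the host will have no Vault credentials.
        vault_skip|vault_warn) echo "in-progress-without-secrets" ;;
        starting|network_ok|vault_ok|building) echo "in-progress" ;;
        *) echo "unknown" ;;
    esac
}
```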
Useful Queries
# All bootstrap activity for a host
{job="bootstrap", hostname="myhost"}
# Track all failures
{job="bootstrap", stage="failed"}
# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}
Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
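The queries above can also be run from a terminal against Loki's HTTP API. This is a minimal sketch under assumptions: the LOKI_URL default is a placeholder, not the address of the actual Loki instance, and the function names are hypothetical. The request covers Loki's default time range and returns raw JSON.

```shell
# LOKI_URL is an assumption; point it at your Loki instance.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Build the LogQL selector for one host's bootstrap logs.
bootstrap_selector() {
    printf '{job="bootstrap", hostname="%s"}' "$1"
}

# Fetch up to 100 bootstrap log lines via Loki's query_range endpoint.
# curl -G sends the url-encoded fields as query parameters on a GET request.
bootstrap_logs() {
    curl -sS -G "$LOKI_URL/loki/api/v1/query_range" \
        --data-urlencode "query=$(bootstrap_selector "$1")" \
        --data-urlencode "limit=100"
}
```

For example, bootstrap_logs myhost prints the JSON response for {job="bootstrap", hostname="myhost"}.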
Verification
- Check bootstrap completed successfully:
  {job="bootstrap", hostname="<hostname>", stage="success"}
- Verify the host is up and reporting metrics:
  up{instance=~"<hostname>.*"}
- Verify the correct flake revision is deployed:
  nixos_flake_info{instance=~"<hostname>.*"}
- Check logs are flowing:
  {hostname="<hostname>"}
- Confirm expected services are running and producing logs
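The metric checks above can be scripted against the Prometheus HTTP API. The following is a sketch, not part of the repository: PROM_URL is a placeholder for the real Prometheus address, and the function names are hypothetical.

```shell
# PROM_URL is an assumption; point it at your Prometheus instance.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build the PromQL expression used in the verification steps above.
up_expr() {
    printf 'up{instance=~"%s.*"}' "$1"
}

# Run an instant query for the host's up metric and print the JSON response.
host_up() {
    curl -sS -G "$PROM_URL/api/v1/query" \
        --data-urlencode "query=$(up_expr "$1")"
}
```

Swapping up for nixos_flake_info in the expression gives the flake revision check from the same list.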
Troubleshooting
Bootstrap Failed
Common Issues
- The VM has trouble running the initial nixos-rebuild. This usually happens when it has to compile packages from scratch because they are not available in our local nix-cache.
Troubleshooting
- Check bootstrap logs in Loki - if they never progress past building, the rebuild likely consumed all resources:
  {job="bootstrap", hostname="<hostname>"}
- USER: SSH into the host and check the bootstrap service:
  ssh root@<hostname> journalctl -u nixos-bootstrap.service
- If the build failed due to resource constraints, increase VM specs in terraform/vms.tf and redeploy, or manually run the rebuild:
  nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
- If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
Vault Credentials Not Working
Usually caused by running the create-host script without proper credentials, or by a wrapped token that has expired or already been used.
Troubleshooting
- Check if credentials exist on the host:
  ssh root@<hostname> ls -la /var/lib/vault/approle/
- Check bootstrap logs for vault-related stages:
  {job="bootstrap", hostname="<hostname>", stage=~"vault.*"}
- USER: Regenerate and provision credentials manually:
  nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
Host Not Appearing in DNS
Usually caused by not having deployed the commit with the new host to ns1/ns2.
Troubleshooting
- Verify the host config has a static IP configured in systemd.network.networks
- Check that homelab.dns.enable is not set to false
- USER: Trigger auto-upgrade on DNS servers:
  ssh root@ns1 systemctl start nixos-upgrade.service
  ssh root@ns2 systemctl start nixos-upgrade.service
- Verify DNS resolution after upgrade completes:
  dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
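Since both ns1 and ns2 are upgraded, it may be worth checking resolution against each of them. The sketch below assumes ns2 is reachable as ns2.home.2rjus.net, by analogy with the ns1 name in the dig example above; the function names are hypothetical.

```shell
# Domain taken from the dig example above; the ns2 FQDN is an assumption.
DOMAIN="home.2rjus.net"

# Build the fully qualified name for a host.
host_fqdn() {
    printf '%s.%s' "$1" "$DOMAIN"
}

# Hypothetical helper: report what each DNS server answers for the host.
check_dns() {
    local ns answer
    for ns in ns1 ns2; do
        # dig +short prints only the answer records, or nothing if there are none.
        answer="$(dig +short "@$ns.$DOMAIN" "$(host_fqdn "$1")")"
        if [ -n "$answer" ]; then
            echo "$ns: $answer"
        else
            echo "$ns: no answer"
        fi
    done
}
```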
Host Not Being Scraped by Prometheus
Usually caused by not having deployed the commit with the new host to the monitoring host.
Troubleshooting
- Check that homelab.monitoring.enable is not set to false
- USER: Trigger auto-upgrade on monitoring01:
  ssh root@monitoring01 systemctl start nixos-upgrade.service
- Verify the target appears in Prometheus:
  up{instance=~"<hostname>.*"}
- If the target is down, check that node-exporter is running on the host:
  ssh root@<hostname> systemctl status prometheus-node-exporter.service
Related Files
| Path | Description |
|---|---|
| scripts/create-host/ | The create-host script that generates host configurations |
| hosts/template2/ | Template VM configuration (base image for new VMs) |
| hosts/template2/bootstrap.nix | Bootstrap service that applies NixOS config on first boot |
| terraform/vms.tf | VM definitions (specs, IPs, branch overrides) |
| terraform/cloud-init.tf | Cloud-init configuration (passes hostname, branch, vault token) |
| terraform/vault/approle.tf | AppRole policies for each host |
| terraform/vault/secrets.tf | Secret definitions in Vault |
| terraform/vault/hosts-generated.tf | Auto-generated wrapped tokens for VM bootstrap |
| playbooks/provision-approle.yml | Ansible playbook for manual credential provisioning |
| flake.nix | Flake with all host configurations (add new hosts here) |