Host Creation Pipeline

This document describes the process for creating new hosts in the homelab infrastructure.

Overview

We use the create-host script to create new hosts; it generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. Each VM boots from a template image (built from hosts/template2) and runs a bootstrap process that applies the host's NixOS configuration and then reboots into the new config.

Prerequisites

All tools are available in the devshell: create-host, bao (OpenBao CLI), tofu.

nix develop

Steps

Steps marked with USER must be performed by the user due to credential requirements.

  1. USER: Run create-host --hostname <name> --ip <ip/prefix>
  2. Edit the auto-generated configurations in hosts/<hostname>/ to import whatever modules are needed for its purpose
  3. Add any secrets needed to terraform/vault/
  4. Edit the VM specs in terraform/vms.tf if needed. To deploy from a branch other than master, add flake_branch = "<branch>" to the VM definition
  5. Push configuration to master (or the branch specified by flake_branch)
  6. USER: Apply with OpenTofu:
    nix develop -c tofu -chdir=terraform/vault apply
    nix develop -c tofu -chdir=terraform apply
    
  7. Once terraform completes, a VM boots in Proxmox using the template image
  8. The VM runs the nixos-bootstrap service, which applies the host config and reboots
  9. After reboot, the host should be operational
  10. Trigger auto-upgrade on ns1 and ns2 to propagate DNS records for the new host
  11. Trigger auto-upgrade on monitoring01 to add the host to Prometheus scrape targets
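For step 4, a VM definition with a branch override might look like the sketch below. The resource layout and attribute names are illustrative (only flake_branch is documented above); check terraform/vms.tf for the real schema.

```hcl
# Hypothetical entry in terraform/vms.tf. The module source, hostname, and
# spec attributes are placeholders; only flake_branch is taken from this doc.
module "vm_myhost" {
  source       = "./modules/vm"      # assumed module layout
  hostname     = "myhost"
  ip           = "10.0.0.50/24"
  cores        = 2
  memory_mb    = 4096
  flake_branch = "feature/myhost"    # omit to deploy from master
}
```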

Tier Specification

New hosts should set homelab.host.tier in their configuration:

homelab.host.tier = "test";  # or "prod"
  • test - Test-tier hosts can receive remote deployments via the homelab-deploy MCP server and have different credential access. Use for staging/testing.
  • prod - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
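In a host's configuration, the tier sits alongside the other module options. A minimal sketch (file path and surrounding options are illustrative):

```nix
# hosts/myhost/configuration.nix (sketch; surrounding options illustrative)
{ ... }:
{
  networking.hostName = "myhost";
  homelab.host.tier = "test"; # switch to "prod" once the host is promoted
}
```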

Observability

During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:

{job="bootstrap", hostname="<hostname>"}

Bootstrap Stages

The bootstrap process reports these stages via the stage label:

| Stage | Message | Meaning |
| --- | --- | --- |
| starting | Bootstrap starting for <host> (branch: <branch>) | Bootstrap service has started |
| network_ok | Network connectivity confirmed | Can reach git server |
| vault_ok | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| vault_skip | No Vault token provided - skipping credential setup | No wrapped token was provided |
| vault_warn | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| building | Starting nixos-rebuild boot | NixOS build starting |
| success | Build successful - rebooting into new configuration | Build complete, rebooting |
| failed | nixos-rebuild failed - manual intervention required | Build failed |
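The same stage messages appear in the journal on the host itself, so the last line of a saved bootstrap log tells you how far it got. A small sketch using sample lines taken from the table above (the file path is arbitrary):

```shell
# Write a sample bootstrap log using stage messages from the table above,
# then extract the last reported stage. The log contents are illustrative.
log=$(mktemp)
cat > "$log" <<'EOF'
Bootstrap starting for myhost (branch: master)
Network connectivity confirmed
Starting nixos-rebuild boot
nixos-rebuild failed - manual intervention required
EOF
last_stage=$(tail -n 1 "$log")
echo "$last_stage"   # -> nixos-rebuild failed - manual intervention required
rm -f "$log"
```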

Useful Queries

# All bootstrap activity for a host
{job="bootstrap", hostname="myhost"}

# Track all failures
{job="bootstrap", stage="failed"}

# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}

Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.
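These queries can also be run outside Grafana against Loki's HTTP API. The sketch below builds the request URL; the Loki address is an assumption, so substitute the real monitoring endpoint.

```shell
# Build a query_range request for Loki's HTTP API.
# LOKI_URL is an assumed address; the selector is the bootstrap query above.
LOKI_URL="http://monitoring01:3100"
SELECTOR='{job="bootstrap", hostname="myhost"}'

# LogQL selectors contain {, ", and = characters and must be URL-encoded
# before being passed as a query parameter.
ENCODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$SELECTOR")
echo "$LOKI_URL/loki/api/v1/query_range?query=$ENCODED"

# Or let curl do the encoding:
# curl -sG "$LOKI_URL/loki/api/v1/query_range" --data-urlencode "query=$SELECTOR"
```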

Verification

  1. Check bootstrap completed successfully:

    {job="bootstrap", hostname="<hostname>", stage="success"}
    
  2. Verify the host is up and reporting metrics:

    up{instance=~"<hostname>.*"}
    
  3. Verify the correct flake revision is deployed:

    nixos_flake_info{instance=~"<hostname>.*"}
    
  4. Check logs are flowing:

    {hostname="<hostname>"}
    
  5. Confirm expected services are running and producing logs
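The queries in the steps above can be generated for any host with a small helper, handy for pasting into Grafana. The function name is ours; the query strings are the ones listed:

```shell
# Print the verification queries from the steps above for a given host.
# The function name is illustrative; the query strings come from this doc.
verification_queries() {
  host=$1
  printf '%s\n' \
    "{job=\"bootstrap\", hostname=\"$host\", stage=\"success\"}" \
    "up{instance=~\"$host.*\"}" \
    "nixos_flake_info{instance=~\"$host.*\"}" \
    "{hostname=\"$host\"}"
}
verification_queries myhost
```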

Troubleshooting

Bootstrap Failed

Common Issues

  • The VM struggles to complete the initial nixos-rebuild. This usually happens when packages are not available in our local nix-cache and must be compiled from scratch.

Troubleshooting

  1. Check bootstrap logs in Loki - if they never progress past the building stage, the rebuild has likely exhausted the VM's resources:

    {job="bootstrap", hostname="<hostname>"}
    
  2. USER: SSH into the host and check the bootstrap service:

    ssh root@<hostname>
    journalctl -u nixos-bootstrap.service
    
  3. If the build failed due to resource constraints, increase VM specs in terraform/vms.tf and redeploy, or manually run the rebuild:

    nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
    
  4. If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).

Vault Credentials Not Working

Usually caused by running the create-host script without proper credentials, or by a wrapped token that has expired or has already been used.

Troubleshooting

  1. Check if credentials exist on the host:

    ssh root@<hostname>
    ls -la /var/lib/vault/approle/
    
  2. Check bootstrap logs for vault-related stages:

    {job="bootstrap", hostname="<hostname>", stage=~"vault.*"}
    
  3. USER: Regenerate and provision credentials manually:

    nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
    

Host Not Appearing in DNS

Usually caused by ns1/ns2 not yet running the commit that adds the new host.

Troubleshooting

  1. Verify the host config has a static IP configured in systemd.network.networks

  2. Check that homelab.dns.enable is not set to false

  3. USER: Trigger auto-upgrade on DNS servers:

    ssh root@ns1 systemctl start nixos-upgrade.service
    ssh root@ns2 systemctl start nixos-upgrade.service
    
  4. Verify DNS resolution after upgrade completes:

    dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
    

Host Not Being Scraped by Prometheus

Usually caused by the monitoring host not yet running the commit that adds the new host.

Troubleshooting

  1. Check that homelab.monitoring.enable is not set to false

  2. USER: Trigger auto-upgrade on monitoring01:

    ssh root@monitoring01 systemctl start nixos-upgrade.service
    
  3. Verify the target appears in Prometheus:

    up{instance=~"<hostname>.*"}
    
  4. If the target is down, check that node-exporter is running on the host:

    ssh root@<hostname> systemctl status prometheus-node-exporter.service
    
Relevant Files

| Path | Description |
| --- | --- |
| scripts/create-host/ | The create-host script that generates host configurations |
| hosts/template2/ | Template VM configuration (base image for new VMs) |
| hosts/template2/bootstrap.nix | Bootstrap service that applies NixOS config on first boot |
| terraform/vms.tf | VM definitions (specs, IPs, branch overrides) |
| terraform/cloud-init.tf | Cloud-init configuration (passes hostname, branch, vault token) |
| terraform/vault/approle.tf | AppRole policies for each host |
| terraform/vault/secrets.tf | Secret definitions in Vault |
| terraform/vault/hosts-generated.tf | Auto-generated wrapped tokens for VM bootstrap |
| playbooks/provision-approle.yml | Ansible playbook for manual credential provisioning |
| flake.nix | Flake with all host configurations (add new hosts here) |