docs: update Loki queries from host to hostname label

Update all LogQL examples, agent instructions, and scripts to use
the hostname label instead of host, matching the Prometheus label
naming convention. Also update pipe-to-loki and bootstrap scripts
to push hostname instead of host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-13 23:43:47 +01:00


Host Creation Pipeline

This document describes the process for creating new hosts in the homelab infrastructure.

Overview

We use the create-host script to create new hosts, which generates default configurations from a template. We then use OpenTofu to deploy both secrets and VMs. The VMs boot using a template image (built from hosts/template2), which starts a bootstrap process. This bootstrap process applies the host's NixOS configuration and then reboots into the new config.

Prerequisites

All tools are available in the devshell: create-host, bao (OpenBao CLI), tofu.

nix develop
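Before starting, it can help to confirm the devshell actually exposes the three tools. The following is a minimal sketch; `missing_tools` is a hypothetical helper, not something the repository provides.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Inside `nix develop`, running `missing_tools create-host bao tofu` should print nothing if the devshell is set up as described.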

Steps

Steps marked with USER must be performed by the user due to credential requirements.

1. USER: Run create-host --hostname <name> --ip <ip/prefix>
2. Edit the auto-generated configurations in hosts/<hostname>/ to import whatever modules are needed for its purpose
3. Add any secrets needed to terraform/vault/
4. Edit the VM specs in terraform/vms.tf if needed. To deploy from a branch other than master, add flake_branch = "<branch>" to the VM definition
5. Push configuration to master (or the branch specified by flake_branch)
6. USER: Apply terraform:

   nix develop -c tofu -chdir=terraform/vault apply
   nix develop -c tofu -chdir=terraform apply

7. Once terraform completes, a VM boots in Proxmox using the template image
8. The VM runs the nixos-bootstrap service, which applies the host config and reboots
9. After reboot, the host should be operational
10. Trigger auto-upgrade on ns1 and ns2 to propagate DNS records for the new host
11. Trigger auto-upgrade on monitoring01 to add the host to Prometheus scrape targets
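The final two steps (triggering auto-upgrade on the DNS and monitoring hosts) can be sketched as one helper. `post_deploy_cmds` is a hypothetical name; it only prints the commands, so they can be reviewed before anything is run. The ssh invocations are the same ones used in the troubleshooting sections of this document.

```shell
# Print the ssh commands that trigger auto-upgrade on the hosts that
# need to learn about a new host: both DNS servers and the monitoring host.
post_deploy_cmds() {
  for h in ns1 ns2 monitoring01; do
    printf 'ssh root@%s systemctl start nixos-upgrade.service\n' "$h"
  done
}
```

To execute rather than preview, the output can be piped to a shell with `post_deploy_cmds | sh`. This needs SSH access to all three hosts.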

Tier Specification

New hosts should set homelab.host.tier in their configuration:

homelab.host.tier = "test";  # or "prod"

test - Test-tier hosts can receive remote deployments via the homelab-deploy MCP server and have different credential access. Use for staging/testing.
prod - Production hosts. Deployments require direct access or the CLI with appropriate credentials.
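A quick lint for the tier setting could look like the sketch below. `has_valid_tier` is a hypothetical helper, and it only recognises the flat one-line form shown above; a tier set through a nested attribute set would not be detected.

```shell
# Return 0 if the given file sets homelab.host.tier to "test" or "prod"
# using the flat attribute form on a single line; return non-zero otherwise.
has_valid_tier() {
  grep -Eq 'homelab\.host\.tier[[:space:]]*=[[:space:]]*"(test|prod)"[[:space:]]*;' "$1"
}
```

Run it against whichever file in hosts/<hostname>/ holds the setting, for example `has_valid_tier <file> || echo "tier missing or invalid"`.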

Observability

During the bootstrap process, status updates are sent to Loki. Query bootstrap logs with:

{job="bootstrap", hostname="<hostname>"}

Bootstrap Stages

The bootstrap process reports these stages via the stage label:

| Stage | Message | Meaning |
|-------|---------|---------|
| `starting` | Bootstrap starting for <host> (branch: <branch>) | Bootstrap service has started |
| `network_ok` | Network connectivity confirmed | Can reach git server |
| `vault_ok` | Vault credentials unwrapped and stored | AppRole credentials provisioned |
| `vault_skip` | No Vault token provided - skipping credential setup | No wrapped token was provided |
| `vault_warn` | Failed to unwrap Vault token - continuing without secrets | Token unwrap failed (expired/used) |
| `building` | Starting nixos-rebuild boot | NixOS build starting |
| `success` | Build successful - rebooting into new configuration | Build complete, rebooting |
| `failed` | nixos-rebuild failed - manual intervention required | Build failed |
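When scripting around these stages, it can be useful to group them by what they imply for an operator. The sketch below does that. The four categories (`in_progress`, `warning`, `done`, `failed`) are an illustrative grouping chosen for this example, not something the bootstrap service itself reports.

```shell
# Classify a bootstrap stage label.
# Prints one of: in_progress, warning, done, failed, unknown.
stage_status() {
  case "$1" in
    starting|network_ok|vault_ok|building) echo in_progress ;;
    vault_skip|vault_warn)                 echo warning ;;  # bootstrap continues without secrets
    success)                               echo done ;;
    failed)                                echo failed ;;   # manual intervention required
    *)                                     echo unknown ;;
  esac
}
```

For example, `stage_status vault_warn` prints `warning`, which signals that the host will come up but may lack Vault credentials.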

Useful Queries

# All bootstrap activity for a host
{job="bootstrap", hostname="myhost"}

# Track all failures
{job="bootstrap", stage="failed"}

# Monitor builds in progress
{job="bootstrap", stage=~"building|success"}
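The same selectors can be issued from a terminal through Loki's HTTP API. This is a sketch under assumptions: `LOKI_URL` and both function names are hypothetical, and the default address below is only a placeholder for wherever Loki is reachable in your setup.

```shell
# Build a LogQL stream selector for bootstrap logs.
# $1 = hostname (required), $2 = stage (optional, exact match)
bootstrap_selector() {
  if [ -n "${2:-}" ]; then
    printf '{job="bootstrap", hostname="%s", stage="%s"}\n' "$1" "$2"
  else
    printf '{job="bootstrap", hostname="%s"}\n' "$1"
  fi
}

# Send the selector to Loki's query_range endpoint and print the raw JSON.
# No start/end is given, so Loki applies its default time range.
# LOKI_URL is an assumption; set it to your Loki instance.
query_bootstrap_logs() {
  curl -sG "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$(bootstrap_selector "$@")"
}
```

For example, `query_bootstrap_logs myhost failed` fetches only the failure stage for `myhost`.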

Once the VM reboots with its full configuration, it will start publishing metrics to Prometheus and logs to Loki via Promtail.

Verification

Check bootstrap completed successfully:

{job="bootstrap", hostname="<hostname>", stage="success"}

Verify the host is up and reporting metrics:
```
up{instance=~"<hostname>.*"}
```
Verify the correct flake revision is deployed:
```
nixos_flake_info{instance=~"<hostname>.*"}
```
Check logs are flowing:
```
{hostname="<hostname>"}
```
Confirm expected services are running and producing logs
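The two metric checks above share one query shape, so they can be generated and sent to the Prometheus HTTP API from a script. A minimal sketch follows; `PROM_URL` and the function names are assumptions, and the default address is a placeholder.

```shell
# Build the PromQL expression for a metric restricted to one host.
# $1 = metric name, $2 = hostname
host_metric_query() {
  printf '%s{instance=~"%s.*"}\n' "$1" "$2"
}

# Run an instant query against the Prometheus HTTP API and print the raw JSON.
# PROM_URL is an assumption; set it to your Prometheus instance.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `prom_query "$(host_metric_query nixos_flake_info myhost)"` returns the deployed flake revision for `myhost` as JSON.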

Troubleshooting

Bootstrap Failed

Common Issues

The VM has trouble running the initial nixos-rebuild. This usually happens when packages are not available in our local nix-cache and must be compiled from scratch.

Troubleshooting

Check bootstrap logs in Loki - if they never progress past building, the rebuild likely consumed all resources:
```
{job="bootstrap", hostname="<hostname>"}
```

USER: SSH into the host and check the bootstrap service:

ssh root@<hostname>
journalctl -u nixos-bootstrap.service

If the build failed due to resource constraints, increase VM specs in terraform/vms.tf and redeploy, or manually run the rebuild:
```
nixos-rebuild boot --flake git+https://git.t-juice.club/torjus/nixos-servers.git#<hostname>
```
If the host config doesn't exist in the flake, ensure step 5 was completed (config pushed to the correct branch).
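If the host is deployed from a branch other than master, the manual rebuild needs to point at that branch too. The sketch below builds the flake reference using Nix's `?ref=` parameter for git flakes. `flake_ref` is a hypothetical helper, and whether the bootstrap service builds its reference the same way is an assumption.

```shell
# Repository URL as used in the manual rebuild command above.
REPO_URL="git+https://git.t-juice.club/torjus/nixos-servers.git"

# Build the flake reference for a host.
# $1 = hostname, $2 = branch (optional; omitted means the repository default)
flake_ref() {
  if [ -n "${2:-}" ]; then
    printf '%s?ref=%s#%s\n' "$REPO_URL" "$2" "$1"
  else
    printf '%s#%s\n' "$REPO_URL" "$1"
  fi
}
```

Usage on the host would be `nixos-rebuild boot --flake "$(flake_ref <hostname> <branch>)"`. The quotes matter, since `?` and `#` are special to the shell.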

Vault Credentials Not Working

Usually caused by running the create-host script without proper credentials, or by a wrapped token that has expired or already been used.

Troubleshooting

Check if credentials exist on the host:

ssh root@<hostname>
ls -la /var/lib/vault/approle/

Check bootstrap logs for vault-related stages:

{job="bootstrap", hostname="<hostname>", stage=~"vault.*"}

USER: Regenerate and provision credentials manually:

nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<hostname>
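The credential check can be reduced to a yes/no test that is easy to run over SSH or from a script. This is a sketch only: `has_approle_creds` is a hypothetical helper, and it checks that the directory is non-empty, not that the credentials inside it are valid.

```shell
# Return 0 if the AppRole directory exists and contains at least one entry.
# $1 = directory to check (defaults to the path used on hosts)
has_approle_creds() {
  dir="${1:-/var/lib/vault/approle}"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

On the host, `has_approle_creds || echo "no AppRole credentials found"` gives a quick signal on whether manual provisioning is needed.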

Host Not Appearing in DNS

Usually occurs when the commit that adds the new host has not yet been deployed to ns1/ns2.

Troubleshooting

Verify the host config has a static IP configured in systemd.network.networks
Check that homelab.dns.enable is not set to false

USER: Trigger auto-upgrade on DNS servers:

ssh root@ns1 systemctl start nixos-upgrade.service
ssh root@ns2 systemctl start nixos-upgrade.service

Verify DNS resolution after upgrade completes:

dig @ns1.home.2rjus.net <hostname>.home.2rjus.net
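Since records must propagate to both DNS servers, it is worth asking each one. The sketch below loops over them. `fqdn` and `check_dns` are hypothetical helpers, and it assumes ns2 is reachable as `ns2.home.2rjus.net`, following the naming of ns1.

```shell
# Domain used for homelab hosts, taken from the dig example above.
DOMAIN="home.2rjus.net"

# Expand a short hostname to its fully qualified name.
fqdn() {
  printf '%s.%s\n' "$1" "$DOMAIN"
}

# Query each DNS server for the host and print one line per server.
# $1 = short hostname of the new host
check_dns() {
  for ns in ns1 ns2; do
    answer=$(dig +short "@$(fqdn "$ns")" "$(fqdn "$1")")
    printf '%s: %s\n' "$ns" "${answer:-<no answer>}"
  done
}
```

If one server answers and the other prints `<no answer>`, the auto-upgrade has likely only completed on one of them.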

Host Not Being Scraped by Prometheus

Usually occurs when the commit that adds the new host has not yet been deployed to the monitoring host.

Troubleshooting

Check that homelab.monitoring.enable is not set to false

USER: Trigger auto-upgrade on monitoring01:

ssh root@monitoring01 systemctl start nixos-upgrade.service

Verify the target appears in Prometheus:
```
up{instance=~"<hostname>.*"}
```

If the target is down, check that node-exporter is running on the host:

ssh root@<hostname> systemctl status prometheus-node-exporter.service

Related Files

| Path | Description |
|------|-------------|
| `scripts/create-host/` | The `create-host` script that generates host configurations |
| `hosts/template2/` | Template VM configuration (base image for new VMs) |
| `hosts/template2/bootstrap.nix` | Bootstrap service that applies NixOS config on first boot |
| `terraform/vms.tf` | VM definitions (specs, IPs, branch overrides) |
| `terraform/cloud-init.tf` | Cloud-init configuration (passes hostname, branch, vault token) |
| `terraform/vault/approle.tf` | AppRole policies for each host |
| `terraform/vault/secrets.tf` | Secret definitions in Vault |
| `terraform/vault/hosts-generated.tf` | Auto-generated wrapped tokens for VM bootstrap |
| `playbooks/provision-approle.yml` | Ansible playbook for manual credential provisioning |
| `flake.nix` | Flake with all host configurations (add new hosts here) |
