17 Commits

2a842c655a docs: update plan status and move completed nats-deploy plan
- Move nats-deploy-service.md to completed/ folder
- Update prometheus-scrape-target-labels.md with implementation status
- Add status table showing which steps are complete/partial/not started
- Update cross-references to point to new location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 16:44:00 +01:00
1f4a5571dc CLAUDE.md: update documentation from audit
- Fix OpenBao CLI name (bao, not vault)
- Add vault01, testvm01-03 to hosts list
- Document nixos-exporter and homelab-deploy flake inputs
- Add vault/ and actions-runner/ services
- Document homelab.host and homelab.deploy options
- Document automatic Vault credential provisioning via wrapped tokens
- Consolidate homelab module options into dedicated section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 16:37:38 +01:00
13d6d0ea3a Merge pull request 'improve-bootstrap-visibility' (#29) from improve-bootstrap-visibility into master
Reviewed-on: #29
2026-02-07 15:00:09 +00:00
eea000b337 CLAUDE.md: document bootstrap logs in Loki
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:57:51 +01:00
f19ba2f4b6 CLAUDE.md: use tofu -chdir instead of cd
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:41:59 +01:00
a90d9c33d5 CLAUDE.md: prefer nix develop -c for devshell commands
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:39:56 +01:00
09c9df1bbe terraform: regenerate wrapped token for testvm01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:36:25 +01:00
ae3039af19 template2: send bootstrap status to Loki for remote monitoring
Adds log_to_loki function that pushes structured log entries to Loki
at key bootstrap stages (starting, network_ok, vault_*, building,
success, failed). Enables querying bootstrap state via LogQL without
console access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:34:47 +01:00
11261c4636 template2: revert to journal+console output for bootstrap
TTY output was causing nixos-rebuild to fail. Keep the custom
greeting line to indicate bootstrap image, but use journal+console
for reliable logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:24:39 +01:00
4ca3c8890f terraform: add flake_branch and token for testvm01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:14:57 +01:00
78e8d7a600 template2: add ncurses for clear command in bootstrap
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:10:25 +01:00
0cf72ec191 terraform: update template to nixos-25.11.20260203.e576e3c
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:02:16 +01:00
6a3a51407e playbooks: auto-update terraform template name after deploy
Add a third play to build-and-deploy-template.yml that updates
terraform/variables.tf with the new template name after deploying
to Proxmox. Only updates if the template name has changed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:59:13 +01:00
a1ae766eb8 template2: show bootstrap progress on tty1
- Display bootstrap banner and live progress on tty1 instead of login prompt
- Add custom getty greeting on other ttys indicating this is a bootstrap image
- Disable getty on tty1 during bootstrap so output is visible

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:49:58 +01:00
11999b37f3 flake: update homelab-deploy
Fixes false "Some deployments failed" warning in MCP server when
deployments are still in progress.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:24:41 +01:00
29b2b7db52 Merge branch 'deploy-test-hosts'
Add three permanent test hosts (testvm01, testvm02, testvm03) with:
- Static IPs: 10.69.13.20-22
- Vault AppRole integration with homelab-deploy policy
- Remote deployment via NATS (homelab.deploy.enable)
- Test tier configuration

Also updates create-host template to include vault.enable and
homelab.deploy.enable by default.
2026-02-07 14:09:40 +01:00
b046a1b862 terraform: remove flake_branch from test VMs
VMs are now bootstrapped and running. Remove temporary flake_branch
and vault_wrapped_token settings so they use master going forward.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:09:30 +01:00
8 changed files with 255 additions and 48 deletions

CLAUDE.md

@@ -61,10 +61,31 @@ Do not run `nix flake update`. Should only be done manually by user.
 ### Development Environment
 
 ```bash
-# Enter development shell (provides ansible, python3)
+# Enter development shell
 nix develop
 ```
 
+The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.
+
+**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
+
+```bash
+# Good - works regardless of current shell
+nix develop -c tofu plan
+
+# Avoid - requires user to be in devshell
+tofu plan
+```
+
+**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
+
+```bash
+# Good - uses -chdir option
+nix develop -c tofu -chdir=terraform plan
+nix develop -c tofu -chdir=terraform/vault apply
+
+# Avoid - changing directories
+cd terraform && tofu plan
+```
+
 ### Secrets Management
 
 Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
@@ -140,11 +161,27 @@ The **lab-monitoring** MCP server can query logs from Loki. All hosts ship syste
 - `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
+- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
 
 Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
 
+**Bootstrap Logs:**
+
+VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
+
+- `host` - Target hostname
+- `branch` - Git branch being deployed
+- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`
+
+Query bootstrap status:
+
+```
+{job="bootstrap"}                            # All bootstrap logs
+{job="bootstrap", host="testvm01"}           # Specific host
+{job="bootstrap", stage="failed"}            # All failures
+{job="bootstrap", stage=~"building|success"} # Track build progress
+```
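
Outside the MCP server, the same bootstrap labels can be checked with a plain HTTP query against Loki's `query_range` API. A minimal sketch (it assumes the Loki endpoint on monitoring01 that the bootstrap script pushes to, and that `jq` is available):

```bash
#!/usr/bin/env bash
# Fetch the last hour of bootstrap log lines for one host from Loki.
# The endpoint host/port matches the LOKI_URL used by the template2 bootstrap
# script; the query_range path is Loki's standard HTTP query API.
set -euo pipefail

LOKI="http://monitoring01.home.2rjus.net:3100/loki/api/v1/query_range"
HOST="${1:-testvm01}"   # pass another host as the first argument

curl -sG "$LOKI" \
  --data-urlencode "query={job=\"bootstrap\", host=\"$HOST\"}" \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" |
  jq -r '.data.result[].values[][1]'   # print only the log messages
```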
 
 **Example LogQL queries:**
 ```
 # Logs from a specific service on a host
@@ -249,9 +286,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
 - `configuration.nix` - Host-specific settings (networking, hardware, users)
 - `/system/` - Shared system-level configurations applied to ALL hosts
   - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
+  - Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
   - Monitoring: node-exporter and promtail on every host
 - `/modules/` - Custom NixOS modules
-  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
+  - `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
 - `/lib/` - Nix library functions
   - `dns-zone.nix` - DNS zone generation functions
   - `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,6 +297,8 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
   - `home-assistant/` - Home automation stack
   - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
   - `ns/` - DNS services (authoritative, resolver, zone generation)
+  - `vault/` - OpenBao (Vault) secrets server
+  - `actions-runner/` - GitHub Actions runner
   - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
 - `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
 - `/common/` - Shared configurations (e.g., VM guest agent)
@@ -292,25 +332,31 @@ All hosts automatically get:
 ### Active Hosts
 
-Production servers managed by `rebuild-all.sh`:
+Production servers:
 - `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
 - `ca` - Internal Certificate Authority
+- `vault01` - OpenBao (Vault) secrets server
 - `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
 - `http-proxy` - Reverse proxy
 - `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
 - `jelly01` - Jellyfin media server
-- `nix-cache01` - Binary cache server
+- `nix-cache01` - Binary cache server + GitHub Actions runner
 - `pgdb1` - PostgreSQL database
 - `nats1` - NATS messaging server
 
-Template/test hosts:
-- `template1` - Base template for cloning new hosts
+Test/staging hosts:
+- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
+
+Template hosts:
+- `template1`, `template2` - Base templates for cloning new hosts
 
 ### Flake Inputs
 - `nixpkgs` - NixOS 25.11 stable (primary)
 - `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
 - `sops-nix` - Secrets management (legacy, only used by ca)
+- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
+- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
 - Custom packages from git.t-juice.club:
   - `alerttonotify` - Alert routing
   - `labmon` - Lab monitoring
@@ -402,9 +448,21 @@ Example VM deployment includes:
 - Custom CPU/memory/disk sizing
 - VLAN tagging
 - QEMU guest agent
+- Automatic Vault credential provisioning via `vault_wrapped_token`
 
 OpenTofu outputs the VM's IP address after deployment for easy SSH access.
 
+**Automatic Vault Credential Provisioning:**
+
+VMs can receive Vault (OpenBao) credentials automatically during bootstrap:
+
+1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
+2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
+3. The bootstrap script unwraps the token to obtain AppRole credentials
+4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild
+
+This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.
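
The unwrap in step 3 boils down to a single call against the standard Vault/OpenBao response-wrapping endpoint. A rough sketch of the idea; the Vault address and the `role_id`/`secret_id` field names are assumptions for illustration, not taken from the actual bootstrap script:

```bash
#!/usr/bin/env bash
# Exchange a single-use wrapped token for AppRole credentials.
# VAULT_ADDR default and the role-id filename are assumptions; only the
# secret-id path and permissions are confirmed by the diff below.
set -euo pipefail

VAULT_ADDR="${VAULT_ADDR:-https://vault01.home.2rjus.net:8200}"  # assumed address

# sys/wrapping/unwrap accepts the wrapped token itself as the auth token.
UNWRAP_RESPONSE=$(curl -s --fail -X POST \
  -H "X-Vault-Token: $VAULT_WRAPPED_TOKEN" \
  "$VAULT_ADDR/v1/sys/wrapping/unwrap")

install -d -m 700 /var/lib/vault/approle
jq -r '.data.role_id'   <<<"$UNWRAP_RESPONSE" >/var/lib/vault/approle/role-id
jq -r '.data.secret_id' <<<"$UNWRAP_RESPONSE" >/var/lib/vault/approle/secret-id
chmod 600 /var/lib/vault/approle/secret-id
```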
 #### Template Rebuilding and Terraform State
 
 When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
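
A template bump can be sanity-checked before applying; per the command conventions above, the plan output should show only the template-name variable change and no VM replacements:

```bash
# Review the effect of the new template name; expect no VM recreation.
nix develop -c tofu -chdir=terraform plan
```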
@@ -484,11 +542,7 @@ Prometheus scrape targets are automatically generated from host configurations,
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
 
-Host monitoring options (`homelab.monitoring.*`):
-- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
-- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
-
-Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
+Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets` (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
 
 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
@@ -507,13 +561,30 @@ DNS zone entries are automatically generated from host configurations:
 - **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
 - **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)
 
-Host DNS options (`homelab.dns.*`):
-- `enable` (default: `true`) - Include host in DNS zone generation
-- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
-
 Hosts are automatically excluded from DNS if:
 - `homelab.dns.enable = false` (e.g., template hosts)
 - No static IP configured (e.g., DHCP-only hosts)
 - Network interface is a VPN/tunnel (wg*, tun*, tap*)
 
 To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
 
+### Homelab Module Options
+
+The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.
+
+**Host options (`homelab.host.*`):**
+- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
+- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
+- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
+- `labels` - Free-form key-value metadata for host categorization
+
+**DNS options (`homelab.dns.*`):**
+- `enable` (default: `true`) - Include host in DNS zone generation
+- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
+
+**Monitoring options (`homelab.monitoring.*`):**
+- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
+- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host
+
+**Deploy options (`homelab.deploy.*`):**
+- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
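
Option values for a specific host can be inspected without a deploy by evaluating the flake. A small sketch, assuming the standard `nixosConfigurations` flake layout:

```bash
# Inspect homelab metadata for a host straight from the flake.
nix eval .#nixosConfigurations.testvm01.config.homelab.host.tier
nix eval .#nixosConfigurations.testvm01.config.homelab.deploy.enable
```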

docs/plans/prometheus-scrape-target-labels.md

@@ -1,10 +1,32 @@
 # Prometheus Scrape Target Labels
 
+## Implementation Status
+
+| Step | Status | Notes |
+|------|--------|-------|
+| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
+| 2. Update `lib/monitoring.nix` | ❌ Not started | Labels not extracted or propagated |
+| 3. Update Prometheus config | ❌ Not started | Still uses flat target list |
+| 4. Set metadata on hosts | ⚠️ Partial | Some hosts configured, see below |
+| 5. Update alert rules | ❌ Not started | |
+| 6. Labels for service targets | ❌ Not started | Optional |
+
+**Hosts with metadata configured:**
+- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
+- `nix-cache01`: `role = "build-host"` (missing `priority = "low"` from plan)
+- `vault01`: `role = "vault"`
+- `jump`: `role = "bastion"`
+- `template`, `template2`, `testvm*`: `tier` and `priority` set
+
+**Key gap:** The `homelab.host` module exists and some hosts use it, but `lib/monitoring.nix` does not extract these values—they are not propagated to Prometheus scrape targets.
+
+---
+
 ## Goal
 
 Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
 
-**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
+**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
 
 ## Motivation
@@ -54,12 +76,11 @@ or
 ## Implementation
 
-This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
+This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.
 
 ### 1. Create `homelab.host` module
 
-**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
-`modules/homelab/host.nix` with tier, priority, role, and labels options.
+**Complete.** The module is in `modules/homelab/host.nix`.
 
 Create `modules/homelab/host.nix` with shared host metadata options:
@@ -98,6 +119,8 @@ Import this module in `modules/homelab/default.nix`.
 ### 2. Update `lib/monitoring.nix`
 
+**Not started.** The current implementation does not extract `homelab.host` values.
+
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:
@@ -126,6 +149,8 @@ This requires grouping hosts by their label attrset and producing one `static_co
 ### 3. Update `services/monitoring/prometheus.nix`
 
+**Not started.** Still uses flat target list (`static_configs = [{ targets = nodeExporterTargets; }]`).
+
 Change the node-exporter scrape config to use the new structured output:
 
 ```nix
@@ -138,29 +163,34 @@ static_configs = nodeExporterTargets;
 ### 4. Set metadata on hosts
 
+⚠️ **Partial.** Some hosts configured (see status table above). Current `nix-cache01` only has `role`, missing the `priority = "low"` suggested below.
+
 Example in `hosts/nix-cache01/configuration.nix`:
 
 ```nix
 homelab.host = {
-  tier = "test";    # can be deployed by MCP (used by homelab-deploy)
   priority = "low"; # relaxed alerting thresholds
   role = "build-host";
 };
 ```
 
+**Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.
+
 Example in `hosts/ns1/configuration.nix`:
 
 ```nix
 homelab.host = {
-  tier = "prod";
-  priority = "high";
   role = "dns";
   labels.dns_role = "primary";
 };
 ```
+
+**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
 ### 5. Update alert rules
 
+**Not started.** Requires steps 2-3 to be completed first.
+
 After implementing labels, review and update `services/monitoring/rules.yml`:
 
 - Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
@@ -170,4 +200,6 @@ Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion tha
 ### 6. Consider labels for `generateScrapeConfigs` (service targets)
 
+**Not started.** Optional enhancement.
+
 The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.

flake.lock (generated)

@@ -28,11 +28,11 @@
       ]
     },
     "locked": {
-      "lastModified": 1770447502,
-      "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
+      "lastModified": 1770470613,
+      "narHash": "sha256-FOqOBdsQnLzA7+Y3hXuN9PcOHmpEDvNw64Ju8op9J2w=",
       "ref": "master",
-      "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
-      "revCount": 22,
+      "rev": "36a74b8cf9a873202e28eecfb90f9af8650cca8b",
+      "revCount": 23,
       "type": "git",
       "url": "https://git.t-juice.club/torjus/homelab-deploy"
     },

template2 bootstrap module

@@ -6,22 +6,72 @@ let
 text = ''
 set -euo pipefail
+
+LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+
+# Send a log entry to Loki with bootstrap status
+# Usage: log_to_loki <stage> <message>
+# Fails silently if Loki is unreachable
+log_to_loki() {
+  local stage="$1"
+  local message="$2"
+  local timestamp_ns
+  timestamp_ns="$(date +%s)000000000"
+  local payload
+  payload=$(jq -n \
+    --arg host "$HOSTNAME" \
+    --arg stage "$stage" \
+    --arg branch "''${BRANCH:-master}" \
+    --arg ts "$timestamp_ns" \
+    --arg msg "$message" \
+    '{
+      streams: [{
+        stream: {
+          job: "bootstrap",
+          host: $host,
+          stage: $stage,
+          branch: $branch
+        },
+        values: [[$ts, $msg]]
+      }]
+    }')
+  curl -s --connect-timeout 2 --max-time 5 \
+    -X POST \
+    -H "Content-Type: application/json" \
+    -d "$payload" \
+    "$LOKI_URL" >/dev/null 2>&1 || true
+}
+
+echo "================================================================================"
+echo "                        NIXOS BOOTSTRAP IN PROGRESS"
+echo "================================================================================"
+echo ""
+
 # Read hostname set by cloud-init (from Terraform VM name via user-data)
 # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
 HOSTNAME=$(hostnamectl hostname)
-echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"
+
+# Read git branch from environment, default to master
+BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
+
+echo "Hostname: $HOSTNAME"
+echo ""
 echo "Starting NixOS bootstrap for host: $HOSTNAME"
+log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"
 
 echo "Waiting for network connectivity..."
 # Verify we can reach the git server via HTTPS (doesn't respond to ping)
 if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
   echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
   echo "Check network configuration and DNS settings"
+  log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
   exit 1
 fi
 echo "Network connectivity confirmed"
+log_to_loki "network_ok" "Network connectivity confirmed"
 
 # Unwrap Vault token and store AppRole credentials (if provided)
 if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
     chmod 600 /var/lib/vault/approle/secret-id
 
     echo "Vault credentials unwrapped and stored successfully"
+    log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
   else
     echo "WARNING: Failed to unwrap Vault token"
     if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -63,17 +114,17 @@ let
echo "To regenerate token, run: create-host --hostname $HOSTNAME --force" echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
echo "" echo ""
echo "Vault secrets will not be available, but continuing bootstrap..." echo "Vault secrets will not be available, but continuing bootstrap..."
log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
fi fi
else else
echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)" echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
echo "Skipping Vault credential setup" echo "Skipping Vault credential setup"
log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
fi fi
echo "Fetching and building NixOS configuration from flake..." echo "Fetching and building NixOS configuration from flake..."
# Read git branch from environment, default to master
BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
echo "Using git branch: $BRANCH" echo "Using git branch: $BRANCH"
log_to_loki "building" "Starting nixos-rebuild boot"
# Build and activate the host-specific configuration # Build and activate the host-specific configuration
FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}" FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
@@ -81,18 +132,30 @@ let
 if nixos-rebuild boot --flake "$FLAKE_URL"; then
   echo "Successfully built configuration for $HOSTNAME"
   echo "Rebooting into new configuration..."
+  log_to_loki "success" "Build successful - rebooting into new configuration"
   sleep 2
   systemctl reboot
 else
   echo "ERROR: nixos-rebuild failed for $HOSTNAME"
   echo "Check that flake has configuration for this hostname"
   echo "Manual intervention required - system will not reboot"
+  log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
   exit 1
 fi
 '';
 };
 in
 {
+  # Custom greeting line to indicate this is a bootstrap image
+  services.getty.greetingLine = lib.mkForce ''
+    ================================================================================
+                         BOOTSTRAP IMAGE - NixOS \V (\l)
+    ================================================================================
+
+    Bootstrap service is running. Logs are displayed on tty1.
+    Check status: journalctl -fu nixos-bootstrap
+  '';
+
   systemd.services."nixos-bootstrap" = {
     description = "Bootstrap NixOS configuration from flake on first boot";
@@ -107,12 +170,12 @@ in
     serviceConfig = {
       Type = "oneshot";
       RemainAfterExit = true;
-      ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
+      ExecStart = lib.getExe bootstrap-script;
       # Read environment variables from cloud-init (set by cloud-init write_files)
       EnvironmentFile = "-/run/cloud-init-env";
-      # Logging to journald
+      # Log to journal and console
       StandardOutput = "journal+console";
       StandardError = "journal+console";
     };

playbooks/build-and-deploy-template.yml

@@ -99,3 +99,48 @@
     - name: Display success message
       ansible.builtin.debug:
         msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
+
+- name: Update Terraform template name
+  hosts: localhost
+  gather_facts: false
+  vars:
+    terraform_dir: "{{ playbook_dir }}/../terraform"
+  tasks:
+    - name: Get image filename from earlier play
+      ansible.builtin.set_fact:
+        image_filename: "{{ hostvars['localhost']['image_filename'] }}"
+
+    - name: Extract template name from image filename
+      ansible.builtin.set_fact:
+        new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"
+
+    - name: Read current Terraform variables file
+      ansible.builtin.slurp:
+        src: "{{ terraform_dir }}/variables.tf"
+      register: variables_tf_content
+
+    - name: Extract current template name from variables.tf
+      ansible.builtin.set_fact:
+        current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"
+
+    - name: Check if template name has changed
+      ansible.builtin.set_fact:
+        template_name_changed: "{{ current_template_name != new_template_name }}"
+
+    - name: Display template name status
+      ansible.builtin.debug:
+        msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"
+
+    - name: Update default_template_name in variables.tf
+      ansible.builtin.replace:
+        path: "{{ terraform_dir }}/variables.tf"
+        regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
+        replace: '\1"{{ new_template_name }}"'
+      when: template_name_changed
+
+    - name: Display update result
+      ansible.builtin.debug:
+        msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
+      when: template_name_changed
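
With this third play in place, a full template refresh is a single playbook run. A sketch, assuming the playbook is invoked from the repo root through the devshell as CLAUDE.md recommends (any required inventory flags are omitted):

```bash
# Build the template image, deploy it to Proxmox, and update
# terraform/variables.tf with the new template name in one pass.
nix develop -c ansible-playbook playbooks/build-and-deploy-template.yml
```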

terraform/variables.tf

@@ -33,7 +33,7 @@ variable "default_target_node" {
variable "default_template_name" { variable "default_template_name" {
description = "Default template VM name to clone from" description = "Default template VM name to clone from"
type = string type = string
default = "nixos-25.11.20260131.41e216c" default = "nixos-25.11.20260203.e576e3c"
} }
variable "default_ssh_public_key" { variable "default_ssh_public_key" {

terraform test VM definitions (testvm01-03)

@@ -43,24 +43,20 @@ locals {
       cpu_cores = 2
       memory    = 2048
       disk_size = "20G"
-      flake_branch        = "deploy-test-hosts"
-      vault_wrapped_token = "s.YRGRpAZVVtSYEa3wOYOqFmjt"
+      flake_branch        = "improve-bootstrap-visibility"
+      vault_wrapped_token = "s.l5q88wzXfEcr5SMDHmO6o96b"
     }
     "testvm02" = {
       ip        = "10.69.13.21/24"
       cpu_cores = 2
       memory    = 2048
       disk_size = "20G"
-      flake_branch        = "deploy-test-hosts"
-      vault_wrapped_token = "s.tvs8yhJOkLjBs548STs6DBw7"
     }
     "testvm03" = {
       ip        = "10.69.13.22/24"
       cpu_cores = 2
       memory    = 2048
       disk_size = "20G"
-      flake_branch        = "deploy-test-hosts"
-      vault_wrapped_token = "s.sQ80FZGeG3z6jgrsuh74IopC"
     }
   }
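
Wrapped tokens are single-use, so re-bootstrapping a VM (as done for testvm01 here) needs a freshly minted token. The bootstrap script's own hint suggests the flow; a sketch, with `create-host` flags beyond those quoted in the script unverified:

```bash
# Mint a new wrapped token for the host, then roll it out with OpenTofu.
nix develop -c create-host --hostname testvm01 --force
nix develop -c tofu -chdir=terraform apply
```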