30 Commits

Author SHA1 Message Date
0b462f0a96 Merge pull request 'prometheus-scrape-target-labels' (#30) from prometheus-scrape-target-labels into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Reviewed-on: #30
2026-02-07 16:27:38 +00:00
116abf3bec CLAUDE.md: document homelab-deploy CLI for prod hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Run nix flake check / flake-check (pull_request) Failing after 1s
Add instructions for deploying to prod hosts using the CLI directly,
since the MCP server only handles test-tier deployments.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:23:10 +01:00
b794aa89db skills: update observability with new target labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Document the new hostname and host metadata labels available on all
Prometheus scrape targets:
- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:12:17 +01:00
50a85daa44 docs: update plan with hostname label documentation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:09:46 +01:00
23e561cf49 monitoring: add hostname label to all scrape targets
Add a `hostname` label to all Prometheus scrape targets, making it easy
to query all metrics for a host without wildcarding the instance label.

Example queries:
- {hostname="ns1"} - all metrics from ns1
- node_cpu_seconds_total{hostname="monitoring01"} - specific metric

For external targets (like gunter), the hostname is extracted from the
target string.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:09:19 +01:00
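The hostname extraction mentioned in the commit above can be sketched in shell. This is an illustration only, not the repository's `lib/monitoring.nix` code: `short_hostname` is a hypothetical helper, and it assumes targets are written as `<hostname>.<domain>:<port>` or `<hostname>:<port>` (IP-address and IPv6 targets are not handled).

```bash
# Hypothetical helper: derive the short hostname from a scrape target string.
# Assumes "<hostname>.<domain>:<port>" or "<hostname>:<port>" targets.
short_hostname() {
  local target="$1"
  local host="${target%%:*}"   # drop everything from the first ":" (the port)
  printf '%s\n' "${host%%.*}"  # keep only the first DNS label
}

short_hostname "nix-cache01.home.2rjus.net:9100"
```

The resulting value is what a `hostname` label would carry, so a query can match on it exactly instead of using a regex on `instance`.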
7d291f85bf monitoring: propagate host labels to Prometheus scrape targets
Extract homelab.host metadata (tier, priority, role, labels) from host
configurations and propagate them to Prometheus scrape targets. This
enables semantic alert filtering using labels instead of hardcoded
instance names.

Changes:
- lib/monitoring.nix: Extract host metadata, group targets by labels
- prometheus.nix: Use structured static_configs with labels
- rules.yml: Replace instance filters with role-based filters

Example labels in Prometheus:
- ns1/ns2: role=dns, dns_role=primary/secondary
- nix-cache01: role=build-host
- testvm*: tier=test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:04:50 +01:00
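Grouping targets by label set, as the commit above describes, means that every distinct set of labels becomes one group of targets (one `static_configs` entry in Prometheus terms). A minimal shell sketch of that grouping step follows. The input format (one `<target> <labels>` pair per line, with labels as a single comma-separated token), the function name `group_targets_by_labels`, and the example ports are assumptions for illustration; the real implementation is written in Nix.

```bash
# Hypothetical sketch: group scrape targets that share an identical label set.
# Input (stdin): one "<target> <labels>" pair per line, separated by one space.
# Output: one line per label set, "<labels>: <target> <target> ...".
group_targets_by_labels() {
  LC_ALL=C sort -t ' ' -k2,2 -k1,1 | awk '
    $2 != prev { if (NR > 1) printf "\n"; printf "%s:", $2; prev = $2 }
    { printf " %s", $1 }
    END { if (NR > 0) printf "\n" }
  '
}

group_targets_by_labels <<'EOF'
ns1.home.2rjus.net:9100 role=dns,dns_role=primary
testvm02.home.2rjus.net:9100 tier=test
ns2.home.2rjus.net:9100 role=dns,dns_role=secondary
testvm01.home.2rjus.net:9100 tier=test
EOF
```

Sorting by the label column first lets a single pass of `awk` detect where one group ends and the next begins.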
2a842c655a docs: update plan status and move completed nats-deploy plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Move nats-deploy-service.md to completed/ folder
- Update prometheus-scrape-target-labels.md with implementation status
- Add status table showing which steps are complete/partial/not started
- Update cross-references to point to new location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 16:44:00 +01:00
1f4a5571dc CLAUDE.md: update documentation from audit
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Fix OpenBao CLI name (bao, not vault)
- Add vault01, testvm01-03 to hosts list
- Document nixos-exporter and homelab-deploy flake inputs
- Add vault/ and actions-runner/ services
- Document homelab.host and homelab.deploy options
- Document automatic Vault credential provisioning via wrapped tokens
- Consolidate homelab module options into dedicated section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 16:37:38 +01:00
13d6d0ea3a Merge pull request 'improve-bootstrap-visibility' (#29) from improve-bootstrap-visibility into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Reviewed-on: #29
2026-02-07 15:00:09 +00:00
eea000b337 CLAUDE.md: document bootstrap logs in Loki
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Run nix flake check / flake-check (pull_request) Failing after 4s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:57:51 +01:00
f19ba2f4b6 CLAUDE.md: use tofu -chdir instead of cd
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:41:59 +01:00
a90d9c33d5 CLAUDE.md: prefer nix develop -c for devshell commands
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:39:56 +01:00
09c9df1bbe terraform: regenerate wrapped token for testvm01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:36:25 +01:00
ae3039af19 template2: send bootstrap status to Loki for remote monitoring
Adds log_to_loki function that pushes structured log entries to Loki
at key bootstrap stages (starting, network_ok, vault_*, building,
success, failed). Enables querying bootstrap state via LogQL without
console access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:34:47 +01:00
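A function like the `log_to_loki` described in the commit above could look roughly like the sketch below. It is not the template2 implementation: the body, the `bootstrap_payload` helper, and the `LOKI_URL` variable are assumptions. The sketch also assumes that none of the values contain characters that need JSON escaping. The payload shape and the `/loki/api/v1/push` path follow Loki's HTTP push API, and the labels match those listed in the CLAUDE.md changes (`job`, `host`, `branch`, `stage`).

```bash
# Hypothetical sketch: build a Loki push payload for one bootstrap log line.
# Assumes the arguments contain no characters that require JSON escaping.
bootstrap_payload() {
  local host="$1" branch="$2" stage="$3" message="$4" ts_ns="$5"
  printf '{"streams":[{"stream":{"job":"bootstrap","host":"%s","branch":"%s","stage":"%s"},"values":[["%s","%s"]]}]}\n' \
    "$host" "$branch" "$stage" "$ts_ns" "$message"
}

# Hypothetical sender. LOKI_URL is a placeholder, not the repository's endpoint.
# Failures are ignored so that logging can never abort the bootstrap itself.
log_to_loki() {
  local ts_ns
  ts_ns="$(date +%s)000000000"   # Loki expects a nanosecond timestamp string
  bootstrap_payload "$1" "$2" "$3" "$4" "$ts_ns" |
    curl -s -X POST -H 'Content-Type: application/json' --data-binary @- \
      "${LOKI_URL}/loki/api/v1/push" || true
}

# Print an example payload without sending it anywhere.
bootstrap_payload testvm01 master building "starting nixos-rebuild" 1770000000000000000
```

Keeping the payload construction separate from the `curl` call makes the format easy to inspect without network access.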
11261c4636 template2: revert to journal+console output for bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
TTY output was causing nixos-rebuild to fail. Keep the custom
greeting line to indicate bootstrap image, but use journal+console
for reliable logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:24:39 +01:00
4ca3c8890f terraform: add flake_branch and token for testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:14:57 +01:00
78e8d7a600 template2: add ncurses for clear command in bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:10:25 +01:00
0cf72ec191 terraform: update template to nixos-25.11.20260203.e576e3c
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:02:16 +01:00
6a3a51407e playbooks: auto-update terraform template name after deploy
Add a third play to build-and-deploy-template.yml that updates
terraform/variables.tf with the new template name after deploying
to Proxmox. Only updates if the template name has changed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:59:13 +01:00
a1ae766eb8 template2: show bootstrap progress on tty1
- Display bootstrap banner and live progress on tty1 instead of login prompt
- Add custom getty greeting on other ttys indicating this is a bootstrap image
- Disable getty on tty1 during bootstrap so output is visible

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:49:58 +01:00
11999b37f3 flake: update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Fixes false "Some deployments failed" warning in MCP server when
deployments are still in progress.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:24:41 +01:00
29b2b7db52 Merge branch 'deploy-test-hosts'
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Add three permanent test hosts (testvm01, testvm02, testvm03) with:
- Static IPs: 10.69.13.20-22
- Vault AppRole integration with homelab-deploy policy
- Remote deployment via NATS (homelab.deploy.enable)
- Test tier configuration

Also updates create-host template to include vault.enable and
homelab.deploy.enable by default.
2026-02-07 14:09:40 +01:00
b046a1b862 terraform: remove flake_branch from test VMs
VMs are now bootstrapped and running. Remove temporary flake_branch
and vault_wrapped_token settings so they use master going forward.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:09:30 +01:00
38348c5980 vault: add homelab-deploy policy to generated hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
The homelab-deploy listener requires access to shared/homelab-deploy/*
secrets. Update hosts-generated.tf and the generator script to include
this policy automatically.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:05:42 +01:00
370cf2b03a hosts: enable vault and deploy listener on test VMs
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Add vault.enable = true to testvm01, testvm02, testvm03
- Add homelab.deploy.enable = true for remote deployment via NATS
- Update create-host template to include these by default

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:55:33 +01:00
7bc465b414 hosts: add testvm01, testvm02, testvm03 test hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Three permanent test hosts for validating deployment and bootstrapping
workflow. Each host configured with:
- Static IP (10.69.13.20-22/24)
- Vault AppRole integration
- Bootstrap from deploy-test-hosts branch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:34:16 +01:00
8d7bc50108 hosts: remove testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:58:24 +01:00
03e70ac094 hosts: remove vaulttest01
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:55:38 +01:00
3b32c9479f create-host: add approle removal and secrets detection
- Remove host entries from terraform/vault/approle.tf on --remove
- Detect and warn about secrets in terraform/vault/secrets.tf
- Include vault kv delete commands in removal instructions
- Update check_entries_exist to return approle status

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:54:42 +01:00
b0d35f9a99 create-host: fix flake.nix indentation patterns
The regex patterns expected 6 spaces of indentation but flake.nix uses
8 spaces for host entries. Also updated generated entry template to
match current flake.nix style (using commonModules ++).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:48:29 +01:00
26 changed files with 757 additions and 285 deletions


@@ -185,21 +185,60 @@ Common job names:
 - `home-assistant` - Home automation
 - `step-ca` - Internal CA
-### Instance Label Format
-The `instance` label uses FQDN format:
-```
-<hostname>.home.2rjus.net:<port>
-```
-Example queries filtering by host:
+### Target Labels
+All scrape targets have these labels:
+**Standard labels:**
+- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
+- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
+- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
+**Host metadata labels** (when configured in `homelab.host`):
+- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
+- `tier` - Deployment tier (`test` for test VMs, absent for prod)
+- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
+### Filtering by Host
+Use the `hostname` label for easy host filtering across all jobs:
 ```promql
-up{instance=~"monitoring01.*"}
-node_load1{instance=~"ns1.*"}
+{hostname="ns1"} # All metrics from ns1
+node_load1{hostname="monitoring01"} # Specific metric by hostname
+up{hostname="ha1"} # Check if ha1 is up
 ```
+This is simpler than wildcarding the `instance` label:
+```promql
+# Old way (still works but verbose)
+up{instance=~"monitoring01.*"}
+# New way (preferred)
+up{hostname="monitoring01"}
+```
+### Filtering by Role/Tier
+Filter hosts by their role or tier:
+```promql
+up{role="dns"} # All DNS servers (ns1, ns2)
+node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01)
+up{tier="test"} # All test-tier VMs
+up{dns_role="primary"} # Primary DNS only (ns1)
+```
+Current host labels:
+| Host | Labels |
+|------|--------|
+| ns1 | `role=dns`, `dns_role=primary` |
+| ns2 | `role=dns`, `dns_role=secondary` |
+| nix-cache01 | `role=build-host` |
+| vault01 | `role=vault` |
+| testvm01/02/03 | `tier=test` |
 ---
 ## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
 ### Investigate Service Issues
-1. Check `up{job="<service>"}` for scrape failures
+1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
 2. Use `list_targets` to see target health details
 3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
 4. Search for errors: `{host="<host>"} |= "error"`
 5. Check `list_alerts` for related alerts
+6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
 ### After Deploying Changes
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
 - Default scrape interval is 15s for most metrics targets
 - Default log lookback is 1h - use `start` parameter for older logs
 - Use `rate()` for counter metrics, direct queries for gauges
-- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
+- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
+- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
 - Log `MESSAGE` field contains the actual log content in JSON format

CLAUDE.md

@@ -61,10 +61,31 @@ Do not run `nix flake update`. Should only be done manually by user.
 ### Development Environment
 ```bash
-# Enter development shell (provides ansible, python3)
+# Enter development shell
 nix develop
 ```
+The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.
+**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
+```bash
+# Good - works regardless of current shell
+nix develop -c tofu plan
+# Avoid - requires user to be in devshell
+tofu plan
+```
+**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
+```bash
+# Good - uses -chdir option
+nix develop -c tofu -chdir=terraform plan
+nix develop -c tofu -chdir=terraform/vault apply
+# Avoid - changing directories
+cd terraform && tofu plan
+```
 ### Secrets Management
 Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
@@ -140,11 +161,27 @@ The **lab-monitoring** MCP server can query logs from Loki. All hosts ship syste
 - `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
+- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
 Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
+**Bootstrap Logs:**
+VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
+- `host` - Target hostname
+- `branch` - Git branch being deployed
+- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`
+Query bootstrap status:
+```
+{job="bootstrap"} # All bootstrap logs
+{job="bootstrap", host="testvm01"} # Specific host
+{job="bootstrap", stage="failed"} # All failures
+{job="bootstrap", stage=~"building|success"} # Track build progress
+```
 **Example LogQL queries:**
 ```
 # Logs from a specific service on a host
@@ -229,6 +266,21 @@ deploy(role="vault", action="switch")
 **Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
+**Deploying to Prod Hosts:**
+The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:
+```bash
+nix develop -c homelab-deploy -- deploy \
+--nats-url nats://nats1.home.2rjus.net:4222 \
+--nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
+--branch <branch-name> \
+--action switch \
+deploy.prod.<hostname>
+```
+Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
 **Verifying Deployments:**
 After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
@@ -249,9 +301,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
 - `configuration.nix` - Host-specific settings (networking, hardware, users)
 - `/system/` - Shared system-level configurations applied to ALL hosts
 - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
+- Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
 - Monitoring: node-exporter and promtail on every host
 - `/modules/` - Custom NixOS modules
-- `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
+- `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
 - `/lib/` - Nix library functions
 - `dns-zone.nix` - DNS zone generation functions
 - `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,6 +312,8 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
 - `home-assistant/` - Home automation stack
 - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
 - `ns/` - DNS services (authoritative, resolver, zone generation)
+- `vault/` - OpenBao (Vault) secrets server
+- `actions-runner/` - GitHub Actions runner
 - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
 - `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
 - `/common/` - Shared configurations (e.g., VM guest agent)
@@ -292,25 +347,31 @@ All hosts automatically get:
 ### Active Hosts
-Production servers managed by `rebuild-all.sh`:
+Production servers:
 - `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
 - `ca` - Internal Certificate Authority
+- `vault01` - OpenBao (Vault) secrets server
 - `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
 - `http-proxy` - Reverse proxy
 - `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
 - `jelly01` - Jellyfin media server
-- `nix-cache01` - Binary cache server
+- `nix-cache01` - Binary cache server + GitHub Actions runner
 - `pgdb1` - PostgreSQL database
 - `nats1` - NATS messaging server
-Template/test hosts:
-- `template1` - Base template for cloning new hosts
+Test/staging hosts:
+- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
+Template hosts:
+- `template1`, `template2` - Base templates for cloning new hosts
 ### Flake Inputs
 - `nixpkgs` - NixOS 25.11 stable (primary)
 - `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
 - `sops-nix` - Secrets management (legacy, only used by ca)
+- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
+- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
 - Custom packages from git.t-juice.club:
 - `alerttonotify` - Alert routing
 - `labmon` - Lab monitoring
@@ -402,9 +463,21 @@ Example VM deployment includes:
 - Custom CPU/memory/disk sizing
 - VLAN tagging
 - QEMU guest agent
+- Automatic Vault credential provisioning via `vault_wrapped_token`
 OpenTofu outputs the VM's IP address after deployment for easy SSH access.
+**Automatic Vault Credential Provisioning:**
+VMs can receive Vault (OpenBao) credentials automatically during bootstrap:
+1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
+2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
+3. The bootstrap script unwraps the token to obtain AppRole credentials
+4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild
+This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.
 #### Template Rebuilding and Terraform State
 When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
@@ -484,11 +557,7 @@ Prometheus scrape targets are automatically generated from host configurations,
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
-Host monitoring options (`homelab.monitoring.*`):
-- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
-- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
-Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
+Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets` (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
@@ -507,13 +576,30 @@ DNS zone entries are automatically generated from host configurations:
 - **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
 - **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)
-Host DNS options (`homelab.dns.*`):
-- `enable` (default: `true`) - Include host in DNS zone generation
-- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
 Hosts are automatically excluded from DNS if:
 - `homelab.dns.enable = false` (e.g., template hosts)
 - No static IP configured (e.g., DHCP-only hosts)
 - Network interface is a VPN/tunnel (wg*, tun*, tap*)
 To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
+### Homelab Module Options
+The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.
+**Host options (`homelab.host.*`):**
+- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
+- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
+- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
+- `labels` - Free-form key-value metadata for host categorization
+**DNS options (`homelab.dns.*`):**
+- `enable` (default: `true`) - Include host in DNS zone generation
+- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
+**Monitoring options (`homelab.monitoring.*`):**
+- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
+- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host
+**Deploy options (`homelab.deploy.*`):**
+- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
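The subject format `deploy.<tier>.<hostname>` documented in the CLAUDE.md changes above can be built with a small shell helper. This is a sketch only: `deploy_subject` is a hypothetical function, not part of the `homelab-deploy` CLI, and it assumes the only valid tiers are `test` and `prod`, as listed for `homelab.host.tier`.

```bash
# Hypothetical helper: build a NATS deploy subject from a tier and a hostname.
# Assumes "test" and "prod" are the only valid tiers.
deploy_subject() {
  local tier="$1" hostname="$2"
  case "$tier" in
    test|prod) ;;
    *) echo "unknown tier: $tier" >&2; return 1 ;;
  esac
  printf 'deploy.%s.%s\n' "$tier" "$hostname"
}

deploy_subject prod monitoring01
```

Rejecting unknown tiers early avoids publishing to a subject that no listener is subscribed to.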


@@ -1,10 +1,38 @@
 # Prometheus Scrape Target Labels
+## Implementation Status
+| Step | Status | Notes |
+|------|--------|-------|
+| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
+| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
+| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
+| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
+| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
+| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
+| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |
+**Hosts with metadata configured:**
+- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
+- `nix-cache01`: `role = "build-host"`
+- `vault01`: `role = "vault"`
+- `testvm01/02/03`: `tier = "test"`
+**Implementation complete.** Branch: `prometheus-scrape-target-labels`
+**Query examples:**
+- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
+- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
+- `up{role="dns"}` - all DNS servers
+- `up{tier="test"}` - all test-tier hosts
+---
 ## Goal
 Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
-**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
+**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
 ## Motivation
@@ -54,12 +82,11 @@ or
 ## Implementation
-This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
+This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.
 ### 1. Create `homelab.host` module
-**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
-`modules/homelab/host.nix` with tier, priority, role, and labels options.
+**Complete.** The module is in `modules/homelab/host.nix`.
 Create `modules/homelab/host.nix` with shared host metadata options:
@@ -98,6 +125,8 @@ Import this module in `modules/homelab/default.nix`.
 ### 2. Update `lib/monitoring.nix`
+**Complete.** Labels are now extracted and propagated.
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:
@@ -126,6 +155,8 @@ This requires grouping hosts by their label attrset and producing one `static_co
 ### 3. Update `services/monitoring/prometheus.nix`
+**Complete.** Now uses structured static_configs output.
 Change the node-exporter scrape config to use the new structured output:
 ```nix
@@ -138,36 +169,37 @@ static_configs = nodeExporterTargets;
### 4. Set metadata on hosts ### 4. Set metadata on hosts
**Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
Example in `hosts/nix-cache01/configuration.nix`: Example in `hosts/nix-cache01/configuration.nix`:
```nix ```nix
homelab.host = { homelab.host = {
tier = "test"; # can be deployed by MCP (used by homelab-deploy)
priority = "low"; # relaxed alerting thresholds priority = "low"; # relaxed alerting thresholds
role = "build-host"; role = "build-host";
}; };
``` ```
**Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.
Example in `hosts/ns1/configuration.nix`: Example in `hosts/ns1/configuration.nix`:
```nix ```nix
homelab.host = { homelab.host = {
tier = "prod";
priority = "high";
role = "dns"; role = "dns";
labels.dns_role = "primary"; labels.dns_role = "primary";
}; };
``` ```
**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
### 5. Update alert rules ### 5. Update alert rules
After implementing labels, review and update `services/monitoring/rules.yml`: **Complete.** Updated `services/monitoring/rules.yml`:
- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`). - `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
- Consider whether any other rules should differentiate by priority or role. - `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).
Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter. ### 6. Labels for `generateScrapeConfigs` (service targets)
### 6. Consider labels for `generateScrapeConfigs` (service targets) **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
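
The label handling described in steps 2 and 6 can be sketched outside Nix as well. The following Python is only an illustration of the idea, not the code in `lib/monitoring.nix`: the function names `build_effective_labels` and `group_static_configs`, the dictionary shape of a host, and the example host values are assumptions made for this sketch. It mirrors the stated rules: `hostname` is always emitted, `tier`, `priority`, and `role` only when they differ from the defaults, and targets sharing an identical label set end up in one `static_configs` entry.

```python
import json
from collections import defaultdict

DOMAIN = "home.2rjus.net"  # domain used by the scrape targets in this repository


def build_effective_labels(host: dict) -> dict:
    """Return the labels for one host, omitting default tier/priority values."""
    labels = {"hostname": host["hostname"]}
    if host.get("tier", "prod") != "prod":
        labels["tier"] = host["tier"]
    if host.get("priority", "high") != "high":
        labels["priority"] = host["priority"]
    if host.get("role") is not None:
        labels["role"] = host["role"]
    # Free-form labels (for example dns_role) are merged last.
    labels.update(host.get("labels", {}))
    return labels


def group_static_configs(hosts: list, port: int = 9100) -> list:
    """Group targets by their serialized label set, one entry per distinct set."""
    groups = defaultdict(list)
    for host in hosts:
        key = json.dumps(build_effective_labels(host), sort_keys=True)
        groups[key].append(f"{host['hostname']}.{DOMAIN}:{port}")
    return [
        {"targets": targets, "labels": json.loads(key)}
        for key, targets in sorted(groups.items())
    ]


# Hypothetical host metadata, loosely based on the examples in step 4.
EXAMPLE_HOSTS = [
    {"hostname": "ns1", "role": "dns", "labels": {"dns_role": "primary"}},
    {"hostname": "nix-cache01", "priority": "low", "role": "build-host"},
]

if __name__ == "__main__":
    print(json.dumps(group_static_configs(EXAMPLE_HOSTS), indent=2))
```

Because every host carries its own `hostname` label, each group in practice contains a single target. The grouping step only merges targets when their label sets are fully identical.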

flake.lock generated

@@ -28,11 +28,11 @@
] ]
}, },
"locked": { "locked": {
"lastModified": 1770447502, "lastModified": 1770470613,
"narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=", "narHash": "sha256-FOqOBdsQnLzA7+Y3hXuN9PcOHmpEDvNw64Ju8op9J2w=",
"ref": "master", "ref": "master",
"rev": "79db119d1ca6630023947ef0a65896cc3307c2ff", "rev": "36a74b8cf9a873202e28eecfb90f9af8650cca8b",
"revCount": 22, "revCount": 23,
"type": "git", "type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy" "url": "https://git.t-juice.club/torjus/homelab-deploy"
}, },


@@ -186,15 +186,6 @@
./hosts/nats1 ./hosts/nats1
]; ];
}; };
testvm01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = commonModules ++ [
./hosts/testvm01
];
};
vault01 = nixpkgs.lib.nixosSystem { vault01 = nixpkgs.lib.nixosSystem {
inherit system; inherit system;
specialArgs = { specialArgs = {
@@ -204,13 +195,31 @@
./hosts/vault01 ./hosts/vault01
]; ];
}; };
vaulttest01 = nixpkgs.lib.nixosSystem { testvm01 = nixpkgs.lib.nixosSystem {
inherit system; inherit system;
specialArgs = { specialArgs = {
inherit inputs self sops-nix; inherit inputs self sops-nix;
}; };
modules = commonModules ++ [ modules = commonModules ++ [
./hosts/vaulttest01 ./hosts/testvm01
];
};
testvm02 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = commonModules ++ [
./hosts/testvm02
];
};
testvm03 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = commonModules ++ [
./hosts/testvm03
]; ];
}; };
}; };


@@ -6,22 +6,72 @@ let
text = '' text = ''
set -euo pipefail set -euo pipefail
LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
# Send a log entry to Loki with bootstrap status
# Usage: log_to_loki <stage> <message>
# Fails silently if Loki is unreachable
log_to_loki() {
local stage="$1"
local message="$2"
local timestamp_ns
timestamp_ns="$(date +%s)000000000"
local payload
payload=$(jq -n \
--arg host "$HOSTNAME" \
--arg stage "$stage" \
--arg branch "''${BRANCH:-master}" \
--arg ts "$timestamp_ns" \
--arg msg "$message" \
'{
streams: [{
stream: {
job: "bootstrap",
host: $host,
stage: $stage,
branch: $branch
},
values: [[$ts, $msg]]
}]
}')
curl -s --connect-timeout 2 --max-time 5 \
-X POST \
-H "Content-Type: application/json" \
-d "$payload" \
"$LOKI_URL" >/dev/null 2>&1 || true
}
echo "================================================================================"
echo " NIXOS BOOTSTRAP IN PROGRESS"
echo "================================================================================"
echo ""
# Read hostname set by cloud-init (from Terraform VM name via user-data) # Read hostname set by cloud-init (from Terraform VM name via user-data)
# Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
HOSTNAME=$(hostnamectl hostname) HOSTNAME=$(hostnamectl hostname)
echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'" # Read git branch from environment, default to master
BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
echo "Hostname: $HOSTNAME"
echo ""
echo "Starting NixOS bootstrap for host: $HOSTNAME" echo "Starting NixOS bootstrap for host: $HOSTNAME"
log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"
echo "Waiting for network connectivity..." echo "Waiting for network connectivity..."
# Verify we can reach the git server via HTTPS (doesn't respond to ping) # Verify we can reach the git server via HTTPS (doesn't respond to ping)
if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
echo "ERROR: Cannot reach git.t-juice.club via HTTPS" echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
echo "Check network configuration and DNS settings" echo "Check network configuration and DNS settings"
log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
exit 1 exit 1
fi fi
echo "Network connectivity confirmed" echo "Network connectivity confirmed"
log_to_loki "network_ok" "Network connectivity confirmed"
# Unwrap Vault token and store AppRole credentials (if provided) # Unwrap Vault token and store AppRole credentials (if provided)
if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
chmod 600 /var/lib/vault/approle/secret-id chmod 600 /var/lib/vault/approle/secret-id
echo "Vault credentials unwrapped and stored successfully" echo "Vault credentials unwrapped and stored successfully"
log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
else else
echo "WARNING: Failed to unwrap Vault token" echo "WARNING: Failed to unwrap Vault token"
if [ -n "$UNWRAP_RESPONSE" ]; then if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -63,17 +114,17 @@ let
echo "To regenerate token, run: create-host --hostname $HOSTNAME --force" echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
echo "" echo ""
echo "Vault secrets will not be available, but continuing bootstrap..." echo "Vault secrets will not be available, but continuing bootstrap..."
log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
fi fi
else else
echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)" echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
echo "Skipping Vault credential setup" echo "Skipping Vault credential setup"
log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
fi fi
echo "Fetching and building NixOS configuration from flake..." echo "Fetching and building NixOS configuration from flake..."
# Read git branch from environment, default to master
BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
echo "Using git branch: $BRANCH" echo "Using git branch: $BRANCH"
log_to_loki "building" "Starting nixos-rebuild boot"
# Build and activate the host-specific configuration # Build and activate the host-specific configuration
FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}" FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
@@ -81,18 +132,30 @@ let
if nixos-rebuild boot --flake "$FLAKE_URL"; then if nixos-rebuild boot --flake "$FLAKE_URL"; then
echo "Successfully built configuration for $HOSTNAME" echo "Successfully built configuration for $HOSTNAME"
echo "Rebooting into new configuration..." echo "Rebooting into new configuration..."
log_to_loki "success" "Build successful - rebooting into new configuration"
sleep 2 sleep 2
systemctl reboot systemctl reboot
else else
echo "ERROR: nixos-rebuild failed for $HOSTNAME" echo "ERROR: nixos-rebuild failed for $HOSTNAME"
echo "Check that flake has configuration for this hostname" echo "Check that flake has configuration for this hostname"
echo "Manual intervention required - system will not reboot" echo "Manual intervention required - system will not reboot"
log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
exit 1 exit 1
fi fi
''; '';
}; };
in in
{ {
# Custom greeting line to indicate this is a bootstrap image
services.getty.greetingLine = lib.mkForce ''
================================================================================
BOOTSTRAP IMAGE - NixOS \V (\l)
================================================================================
Bootstrap service is running. Logs are displayed on tty1.
Check status: journalctl -fu nixos-bootstrap
'';
systemd.services."nixos-bootstrap" = { systemd.services."nixos-bootstrap" = {
description = "Bootstrap NixOS configuration from flake on first boot"; description = "Bootstrap NixOS configuration from flake on first boot";
@@ -107,12 +170,12 @@ in
serviceConfig = { serviceConfig = {
Type = "oneshot"; Type = "oneshot";
RemainAfterExit = true; RemainAfterExit = true;
ExecStart = "${bootstrap-script}/bin/nixos-bootstrap"; ExecStart = lib.getExe bootstrap-script;
# Read environment variables from cloud-init (set by cloud-init write_files) # Read environment variables from cloud-init (set by cloud-init write_files)
EnvironmentFile = "-/run/cloud-init-env"; EnvironmentFile = "-/run/cloud-init-env";
# Logging to journald # Log to journal and console
StandardOutput = "journal+console"; StandardOutput = "journal+console";
StandardError = "journal+console"; StandardError = "journal+console";
}; };
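
The `log_to_loki` helper in the bootstrap script above builds a Loki push payload with `jq` and posts it with `curl`, ignoring any failure. As a rough sketch of the same request shape, here is a Python equivalent. It is not part of the repository: the function names `build_payload` and `log_to_loki` and the 5 second timeout are choices made for this illustration, while the URL, the `job: "bootstrap"` stream labels, and the nanosecond string timestamp follow the script.

```python
import json
import time
import urllib.request

LOKI_URL = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"


def build_payload(host, stage, message, branch="master", timestamp_ns=None):
    """Build the JSON body for Loki's push endpoint (one stream, one log line)."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    return {
        "streams": [
            {
                "stream": {
                    "job": "bootstrap",
                    "host": host,
                    "stage": stage,
                    "branch": branch,
                },
                # Loki expects the timestamp as a string of nanoseconds.
                "values": [[str(timestamp_ns), message]],
            }
        ]
    }


def log_to_loki(host, stage, message, branch="master", url=LOKI_URL):
    """Send one log line; return False instead of raising if Loki is unreachable."""
    body = json.dumps(build_payload(host, stage, message, branch)).encode()
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=5):
            return True
    except Exception:
        # Mirrors the `|| true` in the shell version: logging must never
        # abort the bootstrap.
        return False
```

Swallowing every error is deliberate here, as in the shell function: a missing monitoring host should not stop a new machine from bootstrapping.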


@@ -13,14 +13,17 @@
../../common/vm ../../common/vm
]; ];
# Test VM - exclude from DNS zone generation # Host metadata (adjust as needed)
homelab.dns.enable = false;
homelab.host = { homelab.host = {
tier = "test"; tier = "test"; # Start in test tier, move to prod after validation
priority = "low";
}; };
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true; nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true; boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda"; boot.loader.grub.device = "/dev/vda";
@@ -29,7 +32,7 @@
networking.domain = "home.2rjus.net"; networking.domain = "home.2rjus.net";
networking.useNetworkd = true; networking.useNetworkd = true;
networking.useDHCP = false; networking.useDHCP = false;
services.resolved.enable = false; services.resolved.enable = true;
networking.nameservers = [ networking.nameservers = [
"10.69.13.5" "10.69.13.5"
"10.69.13.6" "10.69.13.6"
@@ -39,7 +42,7 @@
systemd.network.networks."ens18" = { systemd.network.networks."ens18" = {
matchConfig.Name = "ens18"; matchConfig.Name = "ens18";
address = [ address = [
"10.69.13.101/24" "10.69.13.20/24"
]; ];
routes = [ routes = [
{ Gateway = "10.69.13.1"; } { Gateway = "10.69.13.1"; }


@@ -0,0 +1,72 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
../template2/hardware-configuration.nix
../../system
../../common/vm
];
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
};
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "testvm02";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.21/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "25.11"; # Did you read the comment?
}


@@ -0,0 +1,72 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
../template2/hardware-configuration.nix
../../system
../../common/vm
];
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
};
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "testvm03";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.22/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "25.11"; # Did you read the comment?
}


@@ -0,0 +1,5 @@
{ ... }: {
imports = [
./configuration.nix
];
}


@@ -1,135 +0,0 @@
{
config,
lib,
pkgs,
...
}:
let
vault-test-script = pkgs.writeShellApplication {
name = "vault-test";
text = ''
echo "=== Vault Secret Test ==="
echo "Secret path: hosts/vaulttest01/test-service"
if [ -f /run/secrets/test-service/password ]; then
echo " Password file exists"
echo "Password length: $(wc -c < /run/secrets/test-service/password)"
else
echo " Password file missing!"
exit 1
fi
if [ -d /var/lib/vault/cache/test-service ]; then
echo " Cache directory exists"
else
echo " Cache directory missing!"
exit 1
fi
echo "Test successful!"
'';
};
in
{
imports = [
../template2/hardware-configuration.nix
../../system
../../common/vm
];
homelab.host = {
tier = "test";
priority = "low";
role = "vault";
};
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "vaulttest01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.150/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
htop # test deploy verification
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
# Testing config
# Enable Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
# Define a test secret
vault.secrets.test-service = {
secretPath = "hosts/vaulttest01/test-service";
restartTrigger = true;
restartInterval = "daily";
services = [ "vault-test" ];
};
# Create a test service that uses the secret
systemd.services.vault-test = {
description = "Test Vault secret fetching";
wantedBy = [ "multi-user.target" ];
after = [ "vault-secret-test-service.service" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = lib.getExe vault-test-script;
StandardOutput = "journal+console";
};
};
# Test ACME certificate issuance from OpenBao PKI
# Override the global ACME server (from system/acme.nix) to use OpenBao instead of step-ca
security.acme.defaults.server = lib.mkForce "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
# Request a certificate for this host
# Using HTTP-01 challenge with standalone listener on port 80
security.acme.certs."vaulttest01.home.2rjus.net" = {
listenHTTP = ":80";
enableDebugLogs = true;
};
system.stateVersion = "25.11"; # Did you read the comment?
}


@@ -21,6 +21,7 @@ let
cfg = hostConfig.config; cfg = hostConfig.config;
monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; }; monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
dnsConfig = (cfg.homelab or { }).dns or { enable = true; }; dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
hostConfig' = (cfg.homelab or { }).host or { };
hostname = cfg.networking.hostName; hostname = cfg.networking.hostName;
networks = cfg.systemd.network.networks or { }; networks = cfg.systemd.network.networks or { };
@@ -49,20 +50,73 @@ let
inherit hostname; inherit hostname;
ip = extractIP firstAddress; ip = extractIP firstAddress;
scrapeTargets = monConfig.scrapeTargets or [ ]; scrapeTargets = monConfig.scrapeTargets or [ ];
# Host metadata for label propagation
tier = hostConfig'.tier or "prod";
priority = hostConfig'.priority or "high";
role = hostConfig'.role or null;
labels = hostConfig'.labels or { };
}; };
# Build effective labels for a host
# Always includes hostname; only includes tier/priority/role if non-default
buildEffectiveLabels = host:
{ hostname = host.hostname; }
// (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
// (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
// (lib.optionalAttrs (host.role != null) { role = host.role; })
// host.labels;
# Generate node-exporter targets from all flake hosts # Generate node-exporter targets from all flake hosts
# Returns a list of static_configs entries with labels
generateNodeExporterTargets = self: externalTargets: generateNodeExporterTargets = self: externalTargets:
let let
nixosConfigs = self.nixosConfigurations or { }; nixosConfigs = self.nixosConfigurations or { };
hostList = lib.filter (x: x != null) ( hostList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostMonitoring nixosConfigs lib.mapAttrsToList extractHostMonitoring nixosConfigs
); );
flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
# Extract hostname from a target string like "gunter.home.2rjus.net:9100"
extractHostnameFromTarget = target:
builtins.head (lib.splitString "." target);
# Build target entries with labels for each host
flakeEntries = map
(host: {
target = "${host.hostname}.home.2rjus.net:9100";
labels = buildEffectiveLabels host;
})
hostList;
# External targets get hostname extracted from the target string
externalEntries = map
(target: {
inherit target;
labels = { hostname = extractHostnameFromTarget target; };
})
(externalTargets.nodeExporter or [ ]);
allEntries = flakeEntries ++ externalEntries;
# Group entries by their label set for efficient static_configs
# Convert labels attrset to a string key for grouping
labelKey = entry: builtins.toJSON entry.labels;
grouped = lib.groupBy labelKey allEntries;
# Convert groups to static_configs format
# Every flake host now has at least a hostname label
staticConfigs = lib.mapAttrsToList
(key: entries:
let
labels = (builtins.head entries).labels;
in
{ targets = map (e: e.target) entries; labels = labels; }
)
grouped;
in in
flakeTargets ++ (externalTargets.nodeExporter or [ ]); staticConfigs;
# Generate scrape configs from all flake hosts and external targets # Generate scrape configs from all flake hosts and external targets
# Host labels are propagated to service targets for semantic alert filtering
generateScrapeConfigs = self: externalTargets: generateScrapeConfigs = self: externalTargets:
let let
nixosConfigs = self.nixosConfigurations or { }; nixosConfigs = self.nixosConfigurations or { };
@@ -70,13 +124,14 @@ let
lib.mapAttrsToList extractHostMonitoring nixosConfigs lib.mapAttrsToList extractHostMonitoring nixosConfigs
); );
# Collect all scrapeTargets from all hosts, grouped by job_name # Collect all scrapeTargets from all hosts, including host labels
allTargets = lib.flatten (map allTargets = lib.flatten (map
(host: (host:
map map
(target: { (target: {
inherit (target) job_name port metrics_path scheme scrape_interval honor_labels; inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
hostname = host.hostname; hostname = host.hostname;
hostLabels = buildEffectiveLabels host;
}) })
host.scrapeTargets host.scrapeTargets
) )
@@ -87,22 +142,32 @@ let
grouped = lib.groupBy (t: t.job_name) allTargets; grouped = lib.groupBy (t: t.job_name) allTargets;
# Generate a scrape config for each job # Generate a scrape config for each job
# Within each job, group targets by their host labels for efficient static_configs
flakeScrapeConfigs = lib.mapAttrsToList flakeScrapeConfigs = lib.mapAttrsToList
(jobName: targets: (jobName: targets:
let let
first = builtins.head targets; first = builtins.head targets;
targetAddrs = map
(t: # Group targets within this job by their host labels
labelKey = t: builtins.toJSON t.hostLabels;
groupedByLabels = lib.groupBy labelKey targets;
# Every flake host now has at least a hostname label
staticConfigs = lib.mapAttrsToList
(key: labelTargets:
let let
portStr = toString t.port; labels = (builtins.head labelTargets).hostLabels;
targetAddrs = map
(t: "${t.hostname}.home.2rjus.net:${toString t.port}")
labelTargets;
in in
"${t.hostname}.home.2rjus.net:${portStr}") { targets = targetAddrs; labels = labels; }
targets; )
groupedByLabels;
config = { config = {
job_name = jobName; job_name = jobName;
static_configs = [{ static_configs = staticConfigs;
targets = targetAddrs;
}];
} }
// (lib.optionalAttrs (first.metrics_path != "/metrics") { // (lib.optionalAttrs (first.metrics_path != "/metrics") {
metrics_path = first.metrics_path; metrics_path = first.metrics_path;


@@ -99,3 +99,48 @@
- name: Display success message - name: Display success message
ansible.builtin.debug: ansible.builtin.debug:
msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}" msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
- name: Update Terraform template name
hosts: localhost
gather_facts: false
vars:
terraform_dir: "{{ playbook_dir }}/../terraform"
tasks:
- name: Get image filename from earlier play
ansible.builtin.set_fact:
image_filename: "{{ hostvars['localhost']['image_filename'] }}"
- name: Extract template name from image filename
ansible.builtin.set_fact:
new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"
- name: Read current Terraform variables file
ansible.builtin.slurp:
src: "{{ terraform_dir }}/variables.tf"
register: variables_tf_content
- name: Extract current template name from variables.tf
ansible.builtin.set_fact:
current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"
- name: Check if template name has changed
ansible.builtin.set_fact:
template_name_changed: "{{ current_template_name != new_template_name }}"
- name: Display template name status
ansible.builtin.debug:
msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"
- name: Update default_template_name in variables.tf
ansible.builtin.replace:
path: "{{ terraform_dir }}/variables.tf"
regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
replace: '\1"{{ new_template_name }}"'
when: template_name_changed
- name: Display update result
ansible.builtin.debug:
msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
when: template_name_changed
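
The play above derives a template name from the image filename and rewrites the `default` of `default_template_name` in `variables.tf` only when it changed. The regular expressions involved can be tried in isolation with a small Python sketch. The function names, the sample `variables.tf` content, and the example filename are all hypothetical; only the patterns are taken from the playbook.

```python
import re

# Same pattern as the playbook: capture everything up to the default value.
VARIABLE_RE = re.compile(
    r'(variable "default_template_name"[^}]+default\s*=\s*)"([^"]+)"'
)


def template_name_from_image(image_filename: str) -> str:
    """Strip the `.vma.zst` suffix and the `vzdump-qemu-` prefix."""
    name = re.sub(r"\.vma\.zst$", "", image_filename)
    return re.sub(r"^vzdump-qemu-", "", name)


def update_template_name(variables_tf: str, new_name: str):
    """Return (new_content, changed) for the given variables.tf text."""
    match = VARIABLE_RE.search(variables_tf)
    if match is None:
        raise ValueError("default_template_name variable not found")
    if match.group(2) == new_name:
        return variables_tf, False
    # A function replacement avoids escape handling in the new name.
    updated = VARIABLE_RE.sub(
        lambda m: f'{m.group(1)}"{new_name}"', variables_tf, count=1
    )
    return updated, True


# Hypothetical file content for illustration only.
SAMPLE_VARIABLES_TF = '''variable "default_template_name" {
  description = "Template VM to clone"
  type        = string
  default     = "nixos-old"
}
'''

if __name__ == "__main__":
    new_name = template_name_from_image("vzdump-qemu-nixos-new.vma.zst")
    content, changed = update_template_name(SAMPLE_VARIABLES_TF, new_name)
    print(changed)
    print(content)
```

Note that the pattern relies on `[^}]+` to stay inside the variable block, so it assumes no `}` appears before the `default` line, for instance inside the description string.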


@@ -18,6 +18,8 @@ from manipulators import (
remove_from_flake_nix, remove_from_flake_nix,
remove_from_terraform_vms, remove_from_terraform_vms,
remove_from_vault_terraform, remove_from_vault_terraform,
remove_from_approle_tf,
find_host_secrets,
check_entries_exist, check_entries_exist,
) )
from models import HostConfig from models import HostConfig
@@ -255,7 +257,10 @@ def handle_remove(
sys.exit(1) sys.exit(1)
# Check what entries exist # Check what entries exist
flake_exists, terraform_exists, vault_exists = check_entries_exist(hostname, repo_root) flake_exists, terraform_exists, vault_exists, approle_exists = check_entries_exist(hostname, repo_root)
# Check for secrets in secrets.tf
host_secrets = find_host_secrets(hostname, repo_root)
# Collect all files in the host directory recursively # Collect all files in the host directory recursively
files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()]) files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()])
@@ -294,6 +299,21 @@ def handle_remove(
else: else:
console.print(f" • terraform/vault/hosts-generated.tf [dim](not found)[/dim]") console.print(f" • terraform/vault/hosts-generated.tf [dim](not found)[/dim]")
if approle_exists:
console.print(f' • terraform/vault/approle.tf (host_policies["{hostname}"])')
else:
console.print(f" • terraform/vault/approle.tf [dim](not found)[/dim]")
# Warn about secrets in secrets.tf
if host_secrets:
console.print(f"\n[yellow]⚠️ Warning: Found {len(host_secrets)} secret(s) in terraform/vault/secrets.tf:[/yellow]")
for secret_path in host_secrets:
console.print(f'"{secret_path}"')
console.print(f"\n [yellow]These will NOT be removed automatically.[/yellow]")
console.print(f" After removal, manually edit secrets.tf and run:")
for secret_path in host_secrets:
console.print(f" [white]vault kv delete secret/{secret_path}[/white]")
# Warn about secrets directory # Warn about secrets directory
if secrets_exist: if secrets_exist:
console.print(f"\n[yellow]⚠️ Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]") console.print(f"\n[yellow]⚠️ Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
@@ -323,6 +343,13 @@ def handle_remove(
else: else:
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/hosts-generated.tf") console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/hosts-generated.tf")
# Remove from terraform/vault/approle.tf
if approle_exists:
if remove_from_approle_tf(hostname, repo_root):
console.print("[green]✓[/green] Removed from terraform/vault/approle.tf")
else:
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/approle.tf")
# Remove from terraform/vms.tf # Remove from terraform/vms.tf
if terraform_exists: if terraform_exists:
if remove_from_terraform_vms(hostname, repo_root): if remove_from_terraform_vms(hostname, repo_root):
@@ -345,19 +372,34 @@ def handle_remove(
console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n") console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n")
# Display next steps # Display next steps
display_removal_next_steps(hostname, vault_exists) display_removal_next_steps(hostname, vault_exists, approle_exists, host_secrets)
def display_removal_next_steps(hostname: str, had_vault: bool) -> None: def display_removal_next_steps(hostname: str, had_vault: bool, had_approle: bool, host_secrets: list) -> None:
"""Display next steps after successful removal.""" """Display next steps after successful removal."""
vault_file = " terraform/vault/hosts-generated.tf" if had_vault else "" vault_files = ""
vault_apply = ""
if had_vault: if had_vault:
vault_files += " terraform/vault/hosts-generated.tf"
if had_approle:
vault_files += " terraform/vault/approle.tf"
vault_apply = ""
if had_vault or had_approle:
vault_apply = f""" vault_apply = f"""
3. Apply Vault changes: 3. Apply Vault changes:
[white]cd terraform/vault && tofu apply[/white] [white]cd terraform/vault && tofu apply[/white]
""" """
secrets_cleanup = ""
if host_secrets:
secrets_cleanup = f"""
5. Clean up secrets (manual):
Edit terraform/vault/secrets.tf to remove entries for {hostname}
Then delete from Vault:"""
for secret_path in host_secrets:
secrets_cleanup += f"\n [white]vault kv delete secret/{secret_path}[/white]"
secrets_cleanup += "\n"
next_steps = f"""[bold cyan]Next Steps:[/bold cyan] next_steps = f"""[bold cyan]Next Steps:[/bold cyan]
1. Review changes: 1. Review changes:
@@ -367,9 +409,9 @@ def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
[white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white] [white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white]
{vault_apply} {vault_apply}
4. Commit changes: 4. Commit changes:
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_file} [white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_files}
git commit -m "hosts: remove {hostname}"[/white] git commit -m "hosts: remove {hostname}"[/white]
""" {secrets_cleanup}"""
console.print(Panel(next_steps, border_style="cyan")) console.print(Panel(next_steps, border_style="cyan"))


@@ -144,7 +144,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
backend = vault_auth_backend.approle.path backend = vault_auth_backend.approle.path
role_name = each.key role_name = each.key
token_policies = ["host-\${each.key}"] token_policies = ["host-\${each.key}", "homelab-deploy"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit) secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600 token_ttl = 3600
token_max_ttl = 3600 token_max_ttl = 3600


@@ -22,12 +22,12 @@ def remove_from_flake_nix(hostname: str, repo_root: Path) -> bool:
content = flake_path.read_text() content = flake_path.read_text()
# Check if hostname exists # Check if hostname exists
hostname_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem" hostname_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
if not re.search(hostname_pattern, content, re.MULTILINE): if not re.search(hostname_pattern, content, re.MULTILINE):
return False return False
# Match the entire block from "hostname = " to "};" # Match the entire block from "hostname = " to "};"
replace_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n" replace_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL) new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)
if count == 0: if count == 0:
@@ -101,7 +101,68 @@ def remove_from_vault_terraform(hostname: str, repo_root: Path) -> bool:
return True return True
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool]: def remove_from_approle_tf(hostname: str, repo_root: Path) -> bool:
"""
Remove host entry from terraform/vault/approle.tf locals.host_policies.
Args:
hostname: Hostname to remove
repo_root: Path to repository root
Returns:
True if found and removed, False if not found
"""
approle_path = repo_root / "terraform" / "vault" / "approle.tf"
if not approle_path.exists():
return False
content = approle_path.read_text()
# Check if hostname exists in host_policies
hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
if not re.search(hostname_pattern, content, re.MULTILINE):
return False
# Match the entire block from "hostname" = { to closing }
# The block contains paths = [ ... ] and possibly extra_policies = [...]
replace_pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
new_content, count = re.subn(replace_pattern, "\n", content, flags=re.DOTALL)
if count == 0:
return False
approle_path.write_text(new_content)
return True
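Review note: the block-removal regex in `remove_from_approle_tf` can be exercised on an in-memory string without touching `approle.tf`. The sketch below is a minimal illustration, not the script's actual code: `remove_host_block` and the `sample` text are hypothetical, and only the pattern is copied from the diff. Because `[^}]*` stops at the first closing brace, the approach assumes a host block never contains nested braces.

```python
import re
from typing import Tuple


def remove_host_block(content: str, hostname: str) -> Tuple[str, int]:
    """Remove a `"hostname" = { ... }` block from HCL-like text.

    Hypothetical helper using the same pattern as the diff. `[^}]*` ends at
    the first `}`, so a block with nested braces would be cut short.
    """
    pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
    return re.subn(pattern, "\n", content, flags=re.DOTALL)


# Illustrative input shaped like the locals block in the diff.
sample = '''locals {
  host_policies = {
    "ns1" = {
      paths = [
        "secret/data/hosts/ns1/*",
      ]
    }
    "vaulttest01" = {
      paths = [
        "secret/data/hosts/vaulttest01/*",
      ]
    }
  }
}
'''

new_content, count = remove_host_block(sample, "vaulttest01")
print(count)
print(new_content)
```

If blocks with nested maps are ever added to `host_policies`, a brace-counting scan or a real HCL parser would be a safer choice than this regex.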
def find_host_secrets(hostname: str, repo_root: Path) -> list:
"""
Find secrets in terraform/vault/secrets.tf that belong to a host.
Args:
hostname: Hostname to search for
repo_root: Path to repository root
Returns:
List of secret paths found (e.g., ["hosts/hostname/test-service"])
"""
secrets_path = repo_root / "terraform" / "vault" / "secrets.tf"
if not secrets_path.exists():
return []
content = secrets_path.read_text()
# Find all secret paths matching hosts/{hostname}/
pattern = rf'"(hosts/{re.escape(hostname)}/[^"]+)"'
matches = re.findall(pattern, content)
# Return unique paths, preserving order
return list(dict.fromkeys(matches))
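Review note: the lookup in `find_host_secrets` relies on two details, the trailing `/` after the escaped hostname (so `vaulttest01` does not match `vaulttest010`) and `dict.fromkeys` for order-preserving de-duplication. A minimal sketch on an in-memory string follows; `find_secret_paths` and the `sample` text are hypothetical stand-ins, not part of the change.

```python
import re
from typing import List


def find_secret_paths(content: str, hostname: str) -> List[str]:
    """Return unique quoted paths under hosts/<hostname>/, in first-seen order."""
    pattern = rf'"(hosts/{re.escape(hostname)}/[^"]+)"'
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    return list(dict.fromkeys(re.findall(pattern, content)))


# Illustrative input only; the real file is terraform/vault/secrets.tf.
sample = '''
"hosts/vaulttest01/test-service" = { auto_generate = true }
"hosts/vaulttest01/db" = { auto_generate = true }
"hosts/vaulttest010/other" = { auto_generate = true }
# duplicate mention: "hosts/vaulttest01/test-service"
"shared/backup/password" = { auto_generate = true }
'''

paths = find_secret_paths(sample, "vaulttest01")
print(paths)
```

Note that the pattern matches any quoted string of that shape, including ones inside comments, so the result is a list of candidates to review rather than a definitive inventory.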
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool, bool]:
""" """
Check which entries exist for a hostname. Check which entries exist for a hostname.
@@ -110,12 +171,12 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
repo_root: Path to repository root repo_root: Path to repository root
Returns: Returns:
Tuple of (flake_exists, terraform_vms_exists, vault_exists) Tuple of (flake_exists, terraform_vms_exists, vault_generated_exists, approle_exists)
""" """
# Check flake.nix # Check flake.nix
flake_path = repo_root / "flake.nix" flake_path = repo_root / "flake.nix"
flake_content = flake_path.read_text() flake_content = flake_path.read_text()
flake_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem" flake_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
flake_exists = bool(re.search(flake_pattern, flake_content, re.MULTILINE)) flake_exists = bool(re.search(flake_pattern, flake_content, re.MULTILINE))
# Check terraform/vms.tf # Check terraform/vms.tf
@@ -131,7 +192,15 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
vault_content = vault_tf_path.read_text() vault_content = vault_tf_path.read_text()
vault_exists = f'"{hostname}"' in vault_content vault_exists = f'"{hostname}"' in vault_content
return (flake_exists, terraform_exists, vault_exists) # Check terraform/vault/approle.tf
approle_path = repo_root / "terraform" / "vault" / "approle.tf"
approle_exists = False
if approle_path.exists():
approle_content = approle_path.read_text()
approle_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
approle_exists = bool(re.search(approle_pattern, approle_content, re.MULTILINE))
return (flake_exists, terraform_exists, vault_exists, approle_exists)
def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None: def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None:
@@ -147,32 +216,25 @@ def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -
content = flake_path.read_text() content = flake_path.read_text()
# Create new entry # Create new entry
new_entry = f""" {config.hostname} = nixpkgs.lib.nixosSystem {{ new_entry = f""" {config.hostname} = nixpkgs.lib.nixosSystem {{
inherit system; inherit system;
specialArgs = {{ specialArgs = {{
inherit inputs self sops-nix; inherit inputs self sops-nix;
}};
modules = commonModules ++ [
./hosts/{config.hostname}
];
}}; }};
modules = [
(
{{ config, pkgs, ... }}:
{{
nixpkgs.overlays = commonOverlays;
}}
)
./hosts/{config.hostname}
sops-nix.nixosModules.sops
];
}};
""" """
# Check if hostname already exists # Check if hostname already exists
hostname_pattern = rf"^ {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem" hostname_pattern = rf"^ {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem"
existing_match = re.search(hostname_pattern, content, re.MULTILINE) existing_match = re.search(hostname_pattern, content, re.MULTILINE)
if existing_match and force: if existing_match and force:
# Replace existing entry # Replace existing entry
# Match the entire block from "hostname = " to "};" # Match the entire block from "hostname = " to "};"
replace_pattern = rf"^ {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n" replace_pattern = rf"^ {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
new_content, count = re.subn(replace_pattern, new_entry, content, flags=re.MULTILINE | re.DOTALL) new_content, count = re.subn(replace_pattern, new_entry, content, flags=re.MULTILINE | re.DOTALL)
if count == 0: if count == 0:


@@ -18,6 +18,12 @@
tier = "test"; # Start in test tier, move to prod after validation tier = "test"; # Start in test tier, move to prod after validation
}; };
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true; nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true; boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda"; boot.loader.grub.device = "/dev/vda";


@@ -121,22 +121,20 @@ in
scrapeConfigs = [ scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external # Auto-generated node-exporter targets from flake hosts + external
# Each static_config entry may have labels from homelab.host metadata
{ {
job_name = "node-exporter"; job_name = "node-exporter";
static_configs = [ static_configs = nodeExporterTargets;
{
targets = nodeExporterTargets;
}
];
} }
# Systemd exporter on all hosts (same targets, different port) # Systemd exporter on all hosts (same targets, different port)
# Preserves the same label grouping as node-exporter
{ {
job_name = "systemd-exporter"; job_name = "systemd-exporter";
static_configs = [ static_configs = map
{ (cfg: cfg // {
targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets; targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
} })
]; nodeExporterTargets;
} }
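Review note: the new `systemd-exporter` job maps over each static config and rewrites only the port, so the per-group labels carry over. The same transformation can be sketched in Python for readers less familiar with Nix. The function name, the shape of each entry (`targets` plus `labels`), and the `ns1.home.2rjus.net` target are assumptions made for illustration; the diff does not show the contents of `nodeExporterTargets`.

```python
from typing import Any, Dict, List

StaticConfig = Dict[str, Any]


def to_systemd_targets(static_configs: List[StaticConfig]) -> List[StaticConfig]:
    """Copy each static config, swapping the node-exporter port for systemd-exporter's.

    Roughly equivalent to the Nix `cfg // { targets = ...; }`: every key other
    than `targets` is kept as-is (shallow copy, so label dicts are shared).
    """
    return [
        {**cfg, "targets": [t.replace(":9100", ":9558") for t in cfg["targets"]]}
        for cfg in static_configs
    ]


# Hypothetical input; label names follow the commit messages above.
node_exporter_targets = [
    {
        "targets": ["ns1.home.2rjus.net:9100"],
        "labels": {"hostname": "ns1", "role": "dns", "dns_role": "primary"},
    },
    {
        "targets": ["nix-cache01.home.2rjus.net:9100"],
        "labels": {"hostname": "nix-cache01", "role": "build-host"},
    },
]

systemd_targets = to_systemd_targets(node_exporter_targets)
for cfg in systemd_targets:
    print(cfg["targets"], cfg["labels"])
```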
# Local monitoring services (not auto-generated) # Local monitoring services (not auto-generated)
{ {


@@ -17,8 +17,9 @@ groups:
annotations: annotations:
summary: "Disk space low on {{ $labels.instance }}" summary: "Disk space low on {{ $labels.instance }}"
description: "Disk space is low on {{ $labels.instance }}. Please check." description: "Disk space is low on {{ $labels.instance }}. Please check."
# Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
- alert: high_cpu_load - alert: high_cpu_load
expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7) expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
for: 15m for: 15m
labels: labels:
severity: warning severity: warning
@@ -26,7 +27,7 @@ groups:
summary: "High CPU load on {{ $labels.instance }}" summary: "High CPU load on {{ $labels.instance }}"
description: "CPU load is high on {{ $labels.instance }}. Please check." description: "CPU load is high on {{ $labels.instance }}. Please check."
- alert: high_cpu_load - alert: high_cpu_load
expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7) expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
for: 2h for: 2h
labels: labels:
severity: warning severity: warning
@@ -115,8 +116,9 @@ groups:
annotations: annotations:
summary: "NSD not running on {{ $labels.instance }}" summary: "NSD not running on {{ $labels.instance }}"
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes." description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
# Only alert on primary DNS (secondary has cold cache after failover)
- alert: unbound_low_cache_hit_ratio - alert: unbound_low_cache_hit_ratio
expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5 expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
for: 15m for: 15m
labels: labels:
severity: warning severity: warning
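Review note: the condition behind `unbound_low_cache_hit_ratio` after this change can be restated as a small Python sketch, assuming per-second hit and miss rates are already available as plain numbers. The function names are hypothetical and the `for: 15m` hold period is not modelled. When both rates are zero the PromQL division yields NaN, which never satisfies `< 0.5`; the sketch mirrors that by returning `None` and not alerting.

```python
from typing import Optional


def cache_hit_ratio(hit_rate: float, miss_rate: float) -> Optional[float]:
    """hits / (hits + misses), or None when there is no traffic at all."""
    total = hit_rate + miss_rate
    if total == 0:
        return None
    return hit_rate / total


def should_alert(dns_role: str, hit_rate: float, miss_rate: float,
                 threshold: float = 0.5) -> bool:
    """Alert only for the primary resolver, and only when the ratio is below threshold."""
    if dns_role != "primary":
        # Secondary resolvers are excluded by the dns_role="primary" selector.
        return False
    ratio = cache_hit_ratio(hit_rate, miss_rate)
    return ratio is not None and ratio < threshold


print(should_alert("primary", hit_rate=20.0, miss_rate=80.0))
print(should_alert("secondary", hit_rate=20.0, miss_rate=80.0))
print(should_alert("primary", hit_rate=0.0, miss_rate=0.0))
```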


@@ -33,7 +33,7 @@ variable "default_target_node" {
variable "default_template_name" { variable "default_template_name" {
description = "Default template VM name to clone from" description = "Default template VM name to clone from"
type = string type = string
default = "nixos-25.11.20260131.41e216c" default = "nixos-25.11.20260203.e576e3c"
} }
variable "default_ssh_public_key" { variable "default_ssh_public_key" {


@@ -101,11 +101,6 @@ locals {
] ]
} }
"vaulttest01" = {
paths = [
"secret/data/hosts/vaulttest01/*",
]
}
} }
} }


@@ -5,12 +5,22 @@
# Each host gets access to its own secrets under hosts/<hostname>/* # Each host gets access to its own secrets under hosts/<hostname>/*
locals { locals {
generated_host_policies = { generated_host_policies = {
"vaulttest01" = { "testvm01" = {
paths = [ paths = [
"secret/data/hosts/vaulttest01/*", "secret/data/hosts/testvm01/*",
] ]
} }
"testvm02" = {
paths = [
"secret/data/hosts/testvm02/*",
]
}
"testvm03" = {
paths = [
"secret/data/hosts/testvm03/*",
]
}
} }
# Placeholder secrets - user should add actual secrets manually or via tofu # Placeholder secrets - user should add actual secrets manually or via tofu
@@ -40,7 +50,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
backend = vault_auth_backend.approle.path backend = vault_auth_backend.approle.path
role_name = each.key role_name = each.key
token_policies = ["host-${each.key}"] token_policies = ["host-${each.key}", "homelab-deploy"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit) secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600 token_ttl = 3600
token_max_ttl = 3600 token_max_ttl = 3600


@@ -45,12 +45,6 @@ locals {
password_length = 24 password_length = 24
} }
# TODO: Remove after testing
"hosts/vaulttest01/test-service" = {
auto_generate = true
password_length = 32
}
# Shared backup password (auto-generated, add alongside existing restic key) # Shared backup password (auto-generated, add alongside existing restic key)
"shared/backup/password" = { "shared/backup/password" = {
auto_generate = true auto_generate = true


@@ -31,13 +31,6 @@ locals {
# Example Minimal VM using all defaults (uncomment to deploy): # Example Minimal VM using all defaults (uncomment to deploy):
# "minimal-vm" = {} # "minimal-vm" = {}
# "bootstrap-verify-test" = {} # "bootstrap-verify-test" = {}
"testvm01" = {
ip = "10.69.13.101/24"
cpu_cores = 2
memory = 2048
disk_size = "20G"
flake_branch = "pipeline-testing-improvements"
}
"vault01" = { "vault01" = {
ip = "10.69.13.19/24" ip = "10.69.13.19/24"
cpu_cores = 2 cpu_cores = 2
@@ -45,13 +38,25 @@ locals {
disk_size = "20G" disk_size = "20G"
flake_branch = "vault-setup" # Bootstrap from this branch instead of master flake_branch = "vault-setup" # Bootstrap from this branch instead of master
} }
"vaulttest01" = { "testvm01" = {
ip = "10.69.13.150/24" ip = "10.69.13.20/24"
cpu_cores = 2 cpu_cores = 2
memory = 2048 memory = 2048
disk_size = "20G" disk_size = "20G"
flake_branch = "pki-migration" flake_branch = "improve-bootstrap-visibility"
vault_wrapped_token = "s.UCpQCOp7cOKDdtGGBvfRWwAt" vault_wrapped_token = "s.l5q88wzXfEcr5SMDHmO6o96b"
}
"testvm02" = {
ip = "10.69.13.21/24"
cpu_cores = 2
memory = 2048
disk_size = "20G"
}
"testvm03" = {
ip = "10.69.13.22/24"
cpu_cores = 2
memory = 2048
disk_size = "20G"
} }
} }