38 Commits

Author SHA1 Message Date
3a25e3f7bc Merge pull request 'migrate-to-openbao-pki' (#31) from migrate-to-openbao-pki into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Reviewed-on: #31
2026-02-07 17:33:46 +00:00
46f03871f1 docs: update CLAUDE.md for PR creation and labmon removal
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Run nix flake check / flake-check (pull_request) Failing after 1s
- Add note that gh pr create is not supported
- Remove labmon from Prometheus job names list
- Remove labmon from flake inputs list

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:32:27 +01:00
9d019f2b9a testvm01: add nginx with ACME certificate for PKI testing
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Set up a simple nginx server with an ACME certificate from the new
OpenBao PKI infrastructure. This allows testing the ACME migration
before deploying to production hosts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:22:28 +01:00
21db7e9573 acme: migrate from step-ca to OpenBao PKI
Switch all ACME certificate issuance from step-ca (ca.home.2rjus.net)
to OpenBao PKI (vault.home.2rjus.net:8200/v1/pki_int/acme/directory).

- Update default ACME server in system/acme.nix
- Update Caddy acme_ca in http-proxy and nix-cache services
- Remove labmon service from monitoring01 (step-ca monitoring)
- Remove labmon scrape target and certificate_rules alerts
- Remove alloy.nix (only used for labmon profiling)
- Add docs/plans/cert-monitoring.md for future cert monitoring needs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 18:20:10 +01:00
979040aaf7 vault01: enable homelab-deploy listener
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Enable vault.enable and homelab.deploy.enable on vault01 so it can
receive NATS-based remote deployments. Vault fetches secrets from
itself using AppRole after auto-unseal.

Add systemd ordering to ensure vault-secret services wait for openbao
to be unsealed before attempting to fetch secrets.

Also adds vault01 AppRole entry to Terraform.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:55:09 +01:00
8791c29402 hosts: enable homelab-deploy listener on pgdb1, nats1, jelly01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Enable vault.enable and homelab.deploy.enable for these hosts to
allow NATS-based remote deployments and expose metrics on port 9972.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:43:06 +01:00
c7a067d7b3 flake: update homelab-deploy input
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:31:24 +01:00
c518093578 docs: move prometheus-scrape-target-labels plan to completed
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:29:31 +01:00
0b462f0a96 Merge pull request 'prometheus-scrape-target-labels' (#30) from prometheus-scrape-target-labels into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Reviewed-on: #30
2026-02-07 16:27:38 +00:00
116abf3bec CLAUDE.md: document homelab-deploy CLI for prod hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Run nix flake check / flake-check (pull_request) Failing after 1s
Add instructions for deploying to prod hosts using the CLI directly,
since the MCP server only handles test-tier deployments.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:23:10 +01:00
b794aa89db skills: update observability with new target labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Document the new hostname and host metadata labels available on all
Prometheus scrape targets:
- hostname: short hostname for easy filtering
- role: host role (dns, build-host, vault)
- tier: deployment tier (test for test VMs)
- dns_role: primary/secondary for DNS servers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:12:17 +01:00
50a85daa44 docs: update plan with hostname label documentation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:09:46 +01:00
23e561cf49 monitoring: add hostname label to all scrape targets
Add a `hostname` label to all Prometheus scrape targets, making it easy
to query all metrics for a host without wildcarding the instance label.

Example queries:
- {hostname="ns1"} - all metrics from ns1
- node_cpu_seconds_total{hostname="monitoring01"} - specific metric

For external targets (like gunter), the hostname is extracted from the
target string.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:09:19 +01:00
7d291f85bf monitoring: propagate host labels to Prometheus scrape targets
Extract homelab.host metadata (tier, priority, role, labels) from host
configurations and propagate them to Prometheus scrape targets. This
enables semantic alert filtering using labels instead of hardcoded
instance names.

Changes:
- lib/monitoring.nix: Extract host metadata, group targets by labels
- prometheus.nix: Use structured static_configs with labels
- rules.yml: Replace instance filters with role-based filters

Example labels in Prometheus:
- ns1/ns2: role=dns, dns_role=primary/secondary
- nix-cache01: role=build-host
- testvm*: tier=test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 17:04:50 +01:00
2a842c655a docs: update plan status and move completed nats-deploy plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Move nats-deploy-service.md to completed/ folder
- Update prometheus-scrape-target-labels.md with implementation status
- Add status table showing which steps are complete/partial/not started
- Update cross-references to point to new location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 16:44:00 +01:00
1f4a5571dc CLAUDE.md: update documentation from audit
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Fix OpenBao CLI name (bao, not vault)
- Add vault01, testvm01-03 to hosts list
- Document nixos-exporter and homelab-deploy flake inputs
- Add vault/ and actions-runner/ services
- Document homelab.host and homelab.deploy options
- Document automatic Vault credential provisioning via wrapped tokens
- Consolidate homelab module options into dedicated section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 16:37:38 +01:00
13d6d0ea3a Merge pull request 'improve-bootstrap-visibility' (#29) from improve-bootstrap-visibility into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Reviewed-on: #29
2026-02-07 15:00:09 +00:00
eea000b337 CLAUDE.md: document bootstrap logs in Loki
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Run nix flake check / flake-check (pull_request) Failing after 4s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:57:51 +01:00
f19ba2f4b6 CLAUDE.md: use tofu -chdir instead of cd
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:41:59 +01:00
a90d9c33d5 CLAUDE.md: prefer nix develop -c for devshell commands
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:39:56 +01:00
09c9df1bbe terraform: regenerate wrapped token for testvm01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:36:25 +01:00
ae3039af19 template2: send bootstrap status to Loki for remote monitoring
Adds log_to_loki function that pushes structured log entries to Loki
at key bootstrap stages (starting, network_ok, vault_*, building,
success, failed). Enables querying bootstrap state via LogQL without
console access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:34:47 +01:00
11261c4636 template2: revert to journal+console output for bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
TTY output was causing nixos-rebuild to fail. Keep the custom
greeting line to indicate bootstrap image, but use journal+console
for reliable logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:24:39 +01:00
4ca3c8890f terraform: add flake_branch and token for testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:14:57 +01:00
78e8d7a600 template2: add ncurses for clear command in bootstrap
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:10:25 +01:00
0cf72ec191 terraform: update template to nixos-25.11.20260203.e576e3c
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 15:02:16 +01:00
6a3a51407e playbooks: auto-update terraform template name after deploy
Add a third play to build-and-deploy-template.yml that updates
terraform/variables.tf with the new template name after deploying
to Proxmox. Only updates if the template name has changed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:59:13 +01:00
a1ae766eb8 template2: show bootstrap progress on tty1
- Display bootstrap banner and live progress on tty1 instead of login prompt
- Add custom getty greeting on other ttys indicating this is a bootstrap image
- Disable getty on tty1 during bootstrap so output is visible

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:49:58 +01:00
11999b37f3 flake: update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Fixes false "Some deployments failed" warning in MCP server when
deployments are still in progress.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:24:41 +01:00
29b2b7db52 Merge branch 'deploy-test-hosts'
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Add three permanent test hosts (testvm01, testvm02, testvm03) with:
- Static IPs: 10.69.13.20-22
- Vault AppRole integration with homelab-deploy policy
- Remote deployment via NATS (homelab.deploy.enable)
- Test tier configuration

Also updates create-host template to include vault.enable and
homelab.deploy.enable by default.
2026-02-07 14:09:40 +01:00
b046a1b862 terraform: remove flake_branch from test VMs
VMs are now bootstrapped and running. Remove temporary flake_branch
and vault_wrapped_token settings so they use master going forward.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:09:30 +01:00
38348c5980 vault: add homelab-deploy policy to generated hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
The homelab-deploy listener requires access to shared/homelab-deploy/*
secrets. Update hosts-generated.tf and the generator script to include
this policy automatically.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 14:05:42 +01:00
370cf2b03a hosts: enable vault and deploy listener on test VMs
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
- Add vault.enable = true to testvm01, testvm02, testvm03
- Add homelab.deploy.enable = true for remote deployment via NATS
- Update create-host template to include these by default

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:55:33 +01:00
7bc465b414 hosts: add testvm01, testvm02, testvm03 test hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Three permanent test hosts for validating deployment and bootstrapping
workflow. Each host configured with:
- Static IP (10.69.13.20-22/24)
- Vault AppRole integration
- Bootstrap from deploy-test-hosts branch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 13:34:16 +01:00
8d7bc50108 hosts: remove testvm01
Some checks failed
Run nix flake check / flake-check (push) Failing after 1s
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:58:24 +01:00
03e70ac094 hosts: remove vaulttest01
Test host no longer needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:55:38 +01:00
3b32c9479f create-host: add approle removal and secrets detection
- Remove host entries from terraform/vault/approle.tf on --remove
- Detect and warn about secrets in terraform/vault/secrets.tf
- Include vault kv delete commands in removal instructions
- Update check_entries_exist to return approle status

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:54:42 +01:00
b0d35f9a99 create-host: fix flake.nix indentation patterns
The regex patterns expected 6 spaces of indentation but flake.nix uses
8 spaces for host entries. Also updated generated entry template to
match current flake.nix style (using commonModules ++).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 12:48:29 +01:00
37 changed files with 891 additions and 426 deletions

View File

@@ -185,21 +185,60 @@ Common job names:
- `home-assistant` - Home automation
- `step-ca` - Internal CA
### Instance Label Format
### Target Labels
The `instance` label uses FQDN format:
All scrape targets have these labels:
```
<hostname>.home.2rjus.net:<port>
```
**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
Example queries filtering by host:
**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)
### Filtering by Host
Use the `hostname` label for easy host filtering across all jobs:
```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
{hostname="ns1"} # All metrics from ns1
node_load1{hostname="monitoring01"} # Specific metric by hostname
up{hostname="ha1"} # Check if ha1 is up
```
This is simpler than wildcarding the `instance` label:
```promql
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}
# New way (preferred)
up{hostname="monitoring01"}
```
### Filtering by Role/Tier
Filter hosts by their role or tier:
```promql
up{role="dns"} # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01)
up{tier="test"} # All test-tier VMs
up{dns_role="primary"} # Primary DNS only (ns1)
```
Current host labels:
| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| testvm01/02/03 | `tier=test` |
---
## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}
### Investigate Service Issues
1. Check `up{job="<service>"}` for scrape failures
1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
### After Deploying Changes
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- Log `MESSAGE` field contains the actual log content in JSON format

123
CLAUDE.md
View File

@@ -61,10 +61,31 @@ Do not run `nix flake update`. Should only be done manually by user.
### Development Environment
```bash
# Enter development shell (provides ansible, python3)
# Enter development shell
nix develop
```
The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.
**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
```bash
# Good - works regardless of current shell
nix develop -c tofu plan
# Avoid - requires user to be in devshell
tofu plan
```
**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
```bash
# Good - uses -chdir option
nix develop -c tofu -chdir=terraform plan
nix develop -c tofu -chdir=terraform/vault apply
# Avoid - changing directories
cd terraform && tofu plan
```
### Secrets Management
Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
@@ -80,6 +101,8 @@ Legacy sops-nix is still present but only actively used by the `ca` host. Do not
**Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.
**Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.
When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
### Plan Management
@@ -140,11 +163,27 @@ The **lab-monitoring** MCP server can query logs from Loki. All hosts ship syste
- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
**Bootstrap Logs:**
VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`
Query bootstrap status:
```
{job="bootstrap"} # All bootstrap logs
{job="bootstrap", host="testvm01"} # Specific host
{job="bootstrap", stage="failed"} # All failures
{job="bootstrap", stage=~"building|success"} # Track build progress
```
**Example LogQL queries:**
```
# Logs from a specific service on a host
@@ -177,7 +216,7 @@ The **lab-monitoring** MCP server can query Prometheus metrics via PromQL. The `
- `wireguard` - VPN metrics (http-proxy)
- `pushgateway` - Push-based metrics (e.g., backup results)
- `restic_rest` - Backup server metrics
- `labmon` / `ghettoptt` / `alertmanager` - Other service metrics
- `ghettoptt` / `alertmanager` - Other service metrics
**Example PromQL queries:**
```
@@ -229,6 +268,21 @@ deploy(role="vault", action="switch")
**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
**Deploying to Prod Hosts:**
The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:
```bash
nix develop -c homelab-deploy -- deploy \
--nats-url nats://nats1.home.2rjus.net:4222 \
--nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
--branch <branch-name> \
--action switch \
deploy.prod.<hostname>
```
Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
**Verifying Deployments:**
After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
@@ -249,9 +303,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
- `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
- Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
- Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
- Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
- `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
- `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
- `/lib/` - Nix library functions
- `dns-zone.nix` - DNS zone generation functions
- `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,6 +314,8 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
- `home-assistant/` - Home automation stack
- `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
- `ns/` - DNS services (authoritative, resolver, zone generation)
- `vault/` - OpenBao (Vault) secrets server
- `actions-runner/` - GitHub Actions runner
- `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
- `/common/` - Shared configurations (e.g., VM guest agent)
@@ -292,28 +349,33 @@ All hosts automatically get:
### Active Hosts
Production servers managed by `rebuild-all.sh`:
Production servers:
- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `ca` - Internal Certificate Authority
- `vault01` - OpenBao (Vault) secrets server
- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
- `http-proxy` - Reverse proxy
- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
- `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server
- `nix-cache01` - Binary cache server + GitHub Actions runner
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server
Template/test hosts:
- `template1` - Base template for cloning new hosts
Test/staging hosts:
- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
Template hosts:
- `template1`, `template2` - Base templates for cloning new hosts
### Flake Inputs
- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management (legacy, only used by ca)
- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
- Custom packages from git.t-juice.club:
- `alerttonotify` - Alert routing
- `labmon` - Lab monitoring
### Network Architecture
@@ -402,9 +464,21 @@ Example VM deployment includes:
- Custom CPU/memory/disk sizing
- VLAN tagging
- QEMU guest agent
- Automatic Vault credential provisioning via `vault_wrapped_token`
OpenTofu outputs the VM's IP address after deployment for easy SSH access.
**Automatic Vault Credential Provisioning:**
VMs can receive Vault (OpenBao) credentials automatically during bootstrap:
1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
3. The bootstrap script unwraps the token to obtain AppRole credentials
4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild
This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.
#### Template Rebuilding and Terraform State
When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
@@ -484,11 +558,7 @@ Prometheus scrape targets are automatically generated from host configurations,
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
Host monitoring options (`homelab.monitoring.*`):
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets` (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
@@ -507,13 +577,30 @@ DNS zone entries are automatically generated from host configurations:
- **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)
Host DNS options (`homelab.dns.*`):
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
Hosts are automatically excluded from DNS if:
- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP configured (e.g., DHCP-only hosts)
- Network interface is a VPN/tunnel (wg*, tun*, tap*)
To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
### Homelab Module Options
The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.
**Host options (`homelab.host.*`):**
- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
- `labels` - Free-form key-value metadata for host categorization
**DNS options (`homelab.dns.*`):**
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
**Monitoring options (`homelab.monitoring.*`):**
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host
**Deploy options (`homelab.deploy.*`):**
- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.

View File

@@ -0,0 +1,72 @@
# Certificate Monitoring Plan
## Summary
This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.
## What Was Removed
### labmon Service
The `labmon` service was a custom Go application that provided:
1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration
The service exposed Prometheus metrics at `:9969` including:
- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration
### Affected Files
- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
- `services/monitoring/prometheus.nix` - Removed labmon scrape target
- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
- `services/monitoring/default.nix` - Removed alloy.nix import
### Removed Alerts
- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
- `certificate_check_error` - Warned when TLS connection check failed
- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates
## Why It Was Removed
1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
2. **Outdated codebase**: labmon was a custom tool that required maintenance
3. **Limited value**: With ACME auto-renewal, certificates should renew automatically
## Current State
ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.
## Future Needs
While ACME handles renewal automatically, we should consider monitoring for:
1. **ACME renewal failures**: Alert when a certificate fails to renew
- Could monitor ACME client logs (via Loki queries)
- Could check certificate file modification times
2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures
3. **Certificate transparency**: Monitor for unexpected certificate issuance
### Potential Solutions
1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
- `probe_ssl_earliest_cert_expiry` metric
- Already a standard tool, well-maintained
2. **Custom Loki alerting**: Query ACME service logs for renewal failures
- Works with existing infrastructure
- No additional services needed
3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics
## Status
**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.

View File

@@ -1,10 +1,38 @@
# Prometheus Scrape Target Labels
## Implementation Status
| Step | Status | Notes |
|------|--------|-------|
| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |
**Hosts with metadata configured:**
- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
- `nix-cache01`: `role = "build-host"`
- `vault01`: `role = "vault"`
- `testvm01/02/03`: `tier = "test"`
**Implementation complete.** Branch: `prometheus-scrape-target-labels`
**Query examples:**
- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
- `up{role="dns"}` - all DNS servers
- `up{tier="test"}` - all test-tier hosts
---
## Goal
Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
## Motivation
@@ -54,12 +82,11 @@ or
## Implementation
This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.
### 1. Create `homelab.host` module
**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
`modules/homelab/host.nix` with tier, priority, role, and labels options.
**Complete.** The module is in `modules/homelab/host.nix`.
Create `modules/homelab/host.nix` with shared host metadata options:
@@ -98,6 +125,8 @@ Import this module in `modules/homelab/default.nix`.
### 2. Update `lib/monitoring.nix`
**Complete.** Labels are now extracted and propagated.
- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:
@@ -126,6 +155,8 @@ This requires grouping hosts by their label attrset and producing one `static_co
### 3. Update `services/monitoring/prometheus.nix`
**Complete.** Now uses structured static_configs output.
Change the node-exporter scrape config to use the new structured output:
```nix
@@ -138,36 +169,37 @@ static_configs = nodeExporterTargets;
### 4. Set metadata on hosts
**Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.
Example in `hosts/nix-cache01/configuration.nix`:
```nix
homelab.host = {
tier = "test"; # can be deployed by MCP (used by homelab-deploy)
priority = "low"; # relaxed alerting thresholds
role = "build-host";
};
```
**Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.
Example in `hosts/ns1/configuration.nix`:
```nix
homelab.host = {
tier = "prod";
priority = "high";
role = "dns";
labels.dns_role = "primary";
};
```
**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
### 5. Update alert rules
After implementing labels, review and update `services/monitoring/rules.yml`:
**Complete.** Updated `services/monitoring/rules.yml`:
- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.
- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).
Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
### 6. Labels for `generateScrapeConfigs` (service targets)
### 6. Consider labels for `generateScrapeConfigs` (service targets)
The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
**Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.

8
flake.lock generated
View File

@@ -28,11 +28,11 @@
]
},
"locked": {
"lastModified": 1770447502,
"narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
"lastModified": 1770481834,
"narHash": "sha256-Xx9BYnI0C/qgPbwr9nj6NoAdQTbYLunrdbNSaUww9oY=",
"ref": "master",
"rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
"revCount": 22,
"rev": "fd0d63b103dfaf21d1c27363266590e723021c67",
"revCount": 24,
"type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy"
},

View File

@@ -186,15 +186,6 @@
./hosts/nats1
];
};
testvm01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = commonModules ++ [
./hosts/testvm01
];
};
vault01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
@@ -204,13 +195,31 @@
./hosts/vault01
];
};
vaulttest01 = nixpkgs.lib.nixosSystem {
testvm01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = commonModules ++ [
./hosts/vaulttest01
./hosts/testvm01
];
};
testvm02 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = commonModules ++ [
./hosts/testvm02
];
};
testvm03 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self sops-nix;
};
modules = commonModules ++ [
./hosts/testvm03
];
};
};

View File

@@ -61,6 +61,9 @@
# Or disable the firewall altogether.
networking.firewall.enable = false;
vault.enable = true;
homelab.deploy.enable = true;
zramSwap = {
enable = true;
};

View File

@@ -100,61 +100,6 @@
];
};
labmon = {
enable = true;
settings = {
ListenAddr = ":9969";
Profiling = true;
StepMonitors = [
{
Enabled = true;
BaseURL = "https://ca.home.2rjus.net";
RootID = "3381bda8015a86b9a3cd1851439d1091890a79005e0f1f7c4301fe4bccc29d80";
}
];
TLSConnectionMonitors = [
{
Enabled = true;
Address = "ca.home.2rjus.net:443";
Verify = true;
Duration = "12h";
}
{
Enabled = true;
Address = "jelly.home.2rjus.net:443";
Verify = true;
Duration = "12h";
}
{
Enabled = true;
Address = "grafana.home.2rjus.net:443";
Verify = true;
Duration = "12h";
}
{
Enabled = true;
Address = "prometheus.home.2rjus.net:443";
Verify = true;
Duration = "12h";
}
{
Enabled = true;
Address = "alertmanager.home.2rjus.net:443";
Verify = true;
Duration = "12h";
}
{
Enabled = true;
Address = "pyroscope.home.2rjus.net:443";
Verify = true;
Duration = "12h";
}
];
};
};
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];

View File

@@ -59,5 +59,8 @@
# Or disable the firewall altogether.
networking.firewall.enable = false;
vault.enable = true;
homelab.deploy.enable = true;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -59,5 +59,8 @@
# Or disable the firewall altogether.
networking.firewall.enable = false;
vault.enable = true;
homelab.deploy.enable = true;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -6,22 +6,72 @@ let
text = ''
set -euo pipefail
LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
# Send a log entry to Loki with bootstrap status
# Usage: log_to_loki <stage> <message>
# Fails silently if Loki is unreachable
log_to_loki() {
local stage="$1"
local message="$2"
local timestamp_ns
timestamp_ns="$(date +%s)000000000"
local payload
payload=$(jq -n \
--arg host "$HOSTNAME" \
--arg stage "$stage" \
--arg branch "''${BRANCH:-master}" \
--arg ts "$timestamp_ns" \
--arg msg "$message" \
'{
streams: [{
stream: {
job: "bootstrap",
host: $host,
stage: $stage,
branch: $branch
},
values: [[$ts, $msg]]
}]
}')
curl -s --connect-timeout 2 --max-time 5 \
-X POST \
-H "Content-Type: application/json" \
-d "$payload" \
"$LOKI_URL" >/dev/null 2>&1 || true
}
echo "================================================================================"
echo " NIXOS BOOTSTRAP IN PROGRESS"
echo "================================================================================"
echo ""
# Read hostname set by cloud-init (from Terraform VM name via user-data)
# Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
HOSTNAME=$(hostnamectl hostname)
echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"
# Read git branch from environment, default to master
BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
echo "Hostname: $HOSTNAME"
echo ""
echo "Starting NixOS bootstrap for host: $HOSTNAME"
log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"
echo "Waiting for network connectivity..."
# Verify we can reach the git server via HTTPS (doesn't respond to ping)
if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
echo "Check network configuration and DNS settings"
log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
exit 1
fi
echo "Network connectivity confirmed"
log_to_loki "network_ok" "Network connectivity confirmed"
# Unwrap Vault token and store AppRole credentials (if provided)
if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
chmod 600 /var/lib/vault/approle/secret-id
echo "Vault credentials unwrapped and stored successfully"
log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
else
echo "WARNING: Failed to unwrap Vault token"
if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -63,17 +114,17 @@ let
echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
echo ""
echo "Vault secrets will not be available, but continuing bootstrap..."
log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
fi
else
echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
echo "Skipping Vault credential setup"
log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
fi
echo "Fetching and building NixOS configuration from flake..."
# Read git branch from environment, default to master
BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
echo "Using git branch: $BRANCH"
log_to_loki "building" "Starting nixos-rebuild boot"
# Build and activate the host-specific configuration
FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
@@ -81,18 +132,30 @@ let
if nixos-rebuild boot --flake "$FLAKE_URL"; then
echo "Successfully built configuration for $HOSTNAME"
echo "Rebooting into new configuration..."
log_to_loki "success" "Build successful - rebooting into new configuration"
sleep 2
systemctl reboot
else
echo "ERROR: nixos-rebuild failed for $HOSTNAME"
echo "Check that flake has configuration for this hostname"
echo "Manual intervention required - system will not reboot"
log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
exit 1
fi
'';
};
in
{
# Custom greeting line to indicate this is a bootstrap image
services.getty.greetingLine = lib.mkForce ''
================================================================================
BOOTSTRAP IMAGE - NixOS \V (\l)
================================================================================
Bootstrap service is running. Logs are displayed on tty1.
Check status: journalctl -fu nixos-bootstrap
'';
systemd.services."nixos-bootstrap" = {
description = "Bootstrap NixOS configuration from flake on first boot";
@@ -107,12 +170,12 @@ in
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
ExecStart = lib.getExe bootstrap-script;
# Read environment variables from cloud-init (set by cloud-init write_files)
EnvironmentFile = "-/run/cloud-init-env";
# Logging to journald
# Log to journal and console
StandardOutput = "journal+console";
StandardError = "journal+console";
};

View File

@@ -13,14 +13,17 @@
../../common/vm
];
# Test VM - exclude from DNS zone generation
homelab.dns.enable = false;
# Host metadata (adjust as needed)
homelab.host = {
tier = "test";
priority = "low";
tier = "test"; # Start in test tier, move to prod after validation
};
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
@@ -29,7 +32,7 @@
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
@@ -39,7 +42,7 @@
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.101/24"
"10.69.13.20/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
@@ -59,6 +62,39 @@
git
];
# Test nginx with ACME certificate from OpenBao PKI
services.nginx = {
enable = true;
virtualHosts."testvm01.home.2rjus.net" = {
forceSSL = true;
enableACME = true;
locations."/" = {
root = pkgs.writeTextDir "index.html" ''
<!DOCTYPE html>
<html>
<head>
<title>testvm01 - ACME Test</title>
<style>
body { font-family: monospace; max-width: 600px; margin: 50px auto; padding: 20px; }
.joke { background: #f0f0f0; padding: 20px; border-radius: 8px; margin: 20px 0; }
.punchline { margin-top: 15px; font-weight: bold; }
</style>
</head>
<body>
<h1>OpenBao PKI ACME Test</h1>
<p>If you're seeing this over HTTPS, the migration worked!</p>
<div class="joke">
<p>Why do programmers prefer dark mode?</p>
<p class="punchline">Because light attracts bugs.</p>
</div>
<p><small>Certificate issued by: vault.home.2rjus.net</small></p>
</body>
</html>
'';
};
};
};
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];

View File

@@ -0,0 +1,72 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
../template2/hardware-configuration.nix
../../system
../../common/vm
];
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
};
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "testvm02";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.21/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "25.11"; # Did you read the comment?
}

View File

@@ -0,0 +1,72 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
../template2/hardware-configuration.nix
../../system
../../common/vm
];
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
};
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "testvm03";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.22/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "25.11"; # Did you read the comment?
}

View File

@@ -0,0 +1,5 @@
{ ... }: {
imports = [
./configuration.nix
];
}

View File

@@ -62,6 +62,16 @@
# Or disable the firewall altogether.
networking.firewall.enable = false;
# Vault fetches secrets from itself (after unseal)
vault.enable = true;
homelab.deploy.enable = true;
# Ensure vault-secret services wait for openbao to be unsealed
systemd.services.vault-secret-homelab-deploy-nkey = {
after = [ "openbao.service" ];
wants = [ "openbao.service" ];
};
system.stateVersion = "25.11"; # Did you read the comment?
}

View File

@@ -1,135 +0,0 @@
{
config,
lib,
pkgs,
...
}:
let
vault-test-script = pkgs.writeShellApplication {
name = "vault-test";
text = ''
echo "=== Vault Secret Test ==="
echo "Secret path: hosts/vaulttest01/test-service"
if [ -f /run/secrets/test-service/password ]; then
echo " Password file exists"
echo "Password length: $(wc -c < /run/secrets/test-service/password)"
else
echo " Password file missing!"
exit 1
fi
if [ -d /var/lib/vault/cache/test-service ]; then
echo " Cache directory exists"
else
echo " Cache directory missing!"
exit 1
fi
echo "Test successful!"
'';
};
in
{
imports = [
../template2/hardware-configuration.nix
../../system
../../common/vm
];
homelab.host = {
tier = "test";
priority = "low";
role = "vault";
};
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "vaulttest01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.150/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
htop # test deploy verification
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
# Testing config
# Enable Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
# Define a test secret
vault.secrets.test-service = {
secretPath = "hosts/vaulttest01/test-service";
restartTrigger = true;
restartInterval = "daily";
services = [ "vault-test" ];
};
# Create a test service that uses the secret
systemd.services.vault-test = {
description = "Test Vault secret fetching";
wantedBy = [ "multi-user.target" ];
after = [ "vault-secret-test-service.service" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = lib.getExe vault-test-script;
StandardOutput = "journal+console";
};
};
# Test ACME certificate issuance from OpenBao PKI
# Override the global ACME server (from system/acme.nix) to use OpenBao instead of step-ca
security.acme.defaults.server = lib.mkForce "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
# Request a certificate for this host
# Using HTTP-01 challenge with standalone listener on port 80
security.acme.certs."vaulttest01.home.2rjus.net" = {
listenHTTP = ":80";
enableDebugLogs = true;
};
system.stateVersion = "25.11"; # Did you read the comment?
}

View File

@@ -21,6 +21,7 @@ let
cfg = hostConfig.config;
monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
hostConfig' = (cfg.homelab or { }).host or { };
hostname = cfg.networking.hostName;
networks = cfg.systemd.network.networks or { };
@@ -49,20 +50,73 @@ let
inherit hostname;
ip = extractIP firstAddress;
scrapeTargets = monConfig.scrapeTargets or [ ];
# Host metadata for label propagation
tier = hostConfig'.tier or "prod";
priority = hostConfig'.priority or "high";
role = hostConfig'.role or null;
labels = hostConfig'.labels or { };
};
# Build effective labels for a host
# Always includes hostname; only includes tier/priority/role if non-default
buildEffectiveLabels = host:
{ hostname = host.hostname; }
// (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
// (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
// (lib.optionalAttrs (host.role != null) { role = host.role; })
// host.labels;
# Generate node-exporter targets from all flake hosts
# Returns a list of static_configs entries with labels
generateNodeExporterTargets = self: externalTargets:
let
nixosConfigs = self.nixosConfigurations or { };
hostList = lib.filter (x: x != null) (
lib.mapAttrsToList extractHostMonitoring nixosConfigs
);
flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
# Extract hostname from a target string like "gunter.home.2rjus.net:9100"
extractHostnameFromTarget = target:
builtins.head (lib.splitString "." target);
# Build target entries with labels for each host
flakeEntries = map
(host: {
target = "${host.hostname}.home.2rjus.net:9100";
labels = buildEffectiveLabels host;
})
hostList;
# External targets get hostname extracted from the target string
externalEntries = map
(target: {
inherit target;
labels = { hostname = extractHostnameFromTarget target; };
})
(externalTargets.nodeExporter or [ ]);
allEntries = flakeEntries ++ externalEntries;
# Group entries by their label set for efficient static_configs
# Convert labels attrset to a string key for grouping
labelKey = entry: builtins.toJSON entry.labels;
grouped = lib.groupBy labelKey allEntries;
# Convert groups to static_configs format
# Every flake host now has at least a hostname label
staticConfigs = lib.mapAttrsToList
(key: entries:
let
labels = (builtins.head entries).labels;
in
flakeTargets ++ (externalTargets.nodeExporter or [ ]);
{ targets = map (e: e.target) entries; labels = labels; }
)
grouped;
in
staticConfigs;
# Generate scrape configs from all flake hosts and external targets
# Host labels are propagated to service targets for semantic alert filtering
generateScrapeConfigs = self: externalTargets:
let
nixosConfigs = self.nixosConfigurations or { };
@@ -70,13 +124,14 @@ let
lib.mapAttrsToList extractHostMonitoring nixosConfigs
);
# Collect all scrapeTargets from all hosts, grouped by job_name
# Collect all scrapeTargets from all hosts, including host labels
allTargets = lib.flatten (map
(host:
map
(target: {
inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
hostname = host.hostname;
hostLabels = buildEffectiveLabels host;
})
host.scrapeTargets
)
@@ -87,22 +142,32 @@ let
grouped = lib.groupBy (t: t.job_name) allTargets;
# Generate a scrape config for each job
# Within each job, group targets by their host labels for efficient static_configs
flakeScrapeConfigs = lib.mapAttrsToList
(jobName: targets:
let
first = builtins.head targets;
targetAddrs = map
(t:
# Group targets within this job by their host labels
labelKey = t: builtins.toJSON t.hostLabels;
groupedByLabels = lib.groupBy labelKey targets;
# Every flake host now has at least a hostname label
staticConfigs = lib.mapAttrsToList
(key: labelTargets:
let
portStr = toString t.port;
labels = (builtins.head labelTargets).hostLabels;
targetAddrs = map
(t: "${t.hostname}.home.2rjus.net:${toString t.port}")
labelTargets;
in
"${t.hostname}.home.2rjus.net:${portStr}")
targets;
{ targets = targetAddrs; labels = labels; }
)
groupedByLabels;
config = {
job_name = jobName;
static_configs = [{
targets = targetAddrs;
}];
static_configs = staticConfigs;
}
// (lib.optionalAttrs (first.metrics_path != "/metrics") {
metrics_path = first.metrics_path;

View File

@@ -99,3 +99,48 @@
- name: Display success message
ansible.builtin.debug:
msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
- name: Update Terraform template name
hosts: localhost
gather_facts: false
vars:
terraform_dir: "{{ playbook_dir }}/../terraform"
tasks:
- name: Get image filename from earlier play
ansible.builtin.set_fact:
image_filename: "{{ hostvars['localhost']['image_filename'] }}"
- name: Extract template name from image filename
ansible.builtin.set_fact:
new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"
- name: Read current Terraform variables file
ansible.builtin.slurp:
src: "{{ terraform_dir }}/variables.tf"
register: variables_tf_content
- name: Extract current template name from variables.tf
ansible.builtin.set_fact:
current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"
- name: Check if template name has changed
ansible.builtin.set_fact:
template_name_changed: "{{ current_template_name != new_template_name }}"
- name: Display template name status
ansible.builtin.debug:
msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"
- name: Update default_template_name in variables.tf
ansible.builtin.replace:
path: "{{ terraform_dir }}/variables.tf"
regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
replace: '\1"{{ new_template_name }}"'
when: template_name_changed
- name: Display update result
ansible.builtin.debug:
msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
when: template_name_changed

View File

@@ -18,6 +18,8 @@ from manipulators import (
remove_from_flake_nix,
remove_from_terraform_vms,
remove_from_vault_terraform,
remove_from_approle_tf,
find_host_secrets,
check_entries_exist,
)
from models import HostConfig
@@ -255,7 +257,10 @@ def handle_remove(
sys.exit(1)
# Check what entries exist
flake_exists, terraform_exists, vault_exists = check_entries_exist(hostname, repo_root)
flake_exists, terraform_exists, vault_exists, approle_exists = check_entries_exist(hostname, repo_root)
# Check for secrets in secrets.tf
host_secrets = find_host_secrets(hostname, repo_root)
# Collect all files in the host directory recursively
files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()])
@@ -294,6 +299,21 @@ def handle_remove(
else:
console.print(f" • terraform/vault/hosts-generated.tf [dim](not found)[/dim]")
if approle_exists:
console.print(f' • terraform/vault/approle.tf (host_policies["{hostname}"])')
else:
console.print(f" • terraform/vault/approle.tf [dim](not found)[/dim]")
# Warn about secrets in secrets.tf
if host_secrets:
console.print(f"\n[yellow]⚠️ Warning: Found {len(host_secrets)} secret(s) in terraform/vault/secrets.tf:[/yellow]")
for secret_path in host_secrets:
console.print(f'"{secret_path}"')
console.print(f"\n [yellow]These will NOT be removed automatically.[/yellow]")
console.print(f" After removal, manually edit secrets.tf and run:")
for secret_path in host_secrets:
console.print(f" [white]vault kv delete secret/{secret_path}[/white]")
# Warn about secrets directory
if secrets_exist:
console.print(f"\n[yellow]⚠️ Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
@@ -323,6 +343,13 @@ def handle_remove(
else:
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/hosts-generated.tf")
# Remove from terraform/vault/approle.tf
if approle_exists:
if remove_from_approle_tf(hostname, repo_root):
console.print("[green]✓[/green] Removed from terraform/vault/approle.tf")
else:
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/approle.tf")
# Remove from terraform/vms.tf
if terraform_exists:
if remove_from_terraform_vms(hostname, repo_root):
@@ -345,19 +372,34 @@ def handle_remove(
console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n")
# Display next steps
display_removal_next_steps(hostname, vault_exists)
display_removal_next_steps(hostname, vault_exists, approle_exists, host_secrets)
def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
def display_removal_next_steps(hostname: str, had_vault: bool, had_approle: bool, host_secrets: list) -> None:
"""Display next steps after successful removal."""
vault_file = " terraform/vault/hosts-generated.tf" if had_vault else ""
vault_apply = ""
vault_files = ""
if had_vault:
vault_files += " terraform/vault/hosts-generated.tf"
if had_approle:
vault_files += " terraform/vault/approle.tf"
vault_apply = ""
if had_vault or had_approle:
vault_apply = f"""
3. Apply Vault changes:
[white]cd terraform/vault && tofu apply[/white]
"""
secrets_cleanup = ""
if host_secrets:
secrets_cleanup = f"""
5. Clean up secrets (manual):
Edit terraform/vault/secrets.tf to remove entries for {hostname}
Then delete from Vault:"""
for secret_path in host_secrets:
secrets_cleanup += f"\n [white]vault kv delete secret/{secret_path}[/white]"
secrets_cleanup += "\n"
next_steps = f"""[bold cyan]Next Steps:[/bold cyan]
1. Review changes:
@@ -367,9 +409,9 @@ def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
[white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white]
{vault_apply}
4. Commit changes:
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_file}
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_files}
git commit -m "hosts: remove {hostname}"[/white]
"""
{secrets_cleanup}"""
console.print(Panel(next_steps, border_style="cyan"))

View File

@@ -144,7 +144,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["host-\${each.key}"]
token_policies = ["host-\${each.key}", "homelab-deploy"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600
token_max_ttl = 3600

View File

@@ -101,7 +101,68 @@ def remove_from_vault_terraform(hostname: str, repo_root: Path) -> bool:
return True
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool]:
def remove_from_approle_tf(hostname: str, repo_root: Path) -> bool:
"""
Remove host entry from terraform/vault/approle.tf locals.host_policies.
Args:
hostname: Hostname to remove
repo_root: Path to repository root
Returns:
True if found and removed, False if not found
"""
approle_path = repo_root / "terraform" / "vault" / "approle.tf"
if not approle_path.exists():
return False
content = approle_path.read_text()
# Check if hostname exists in host_policies
hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
if not re.search(hostname_pattern, content, re.MULTILINE):
return False
# Match the entire block from "hostname" = { to closing }
# The block contains paths = [ ... ] and possibly extra_policies = [...]
replace_pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
new_content, count = re.subn(replace_pattern, "\n", content, flags=re.DOTALL)
if count == 0:
return False
approle_path.write_text(new_content)
return True
def find_host_secrets(hostname: str, repo_root: Path) -> list:
"""
Find secrets in terraform/vault/secrets.tf that belong to a host.
Args:
hostname: Hostname to search for
repo_root: Path to repository root
Returns:
List of secret paths found (e.g., ["hosts/hostname/test-service"])
"""
secrets_path = repo_root / "terraform" / "vault" / "secrets.tf"
if not secrets_path.exists():
return []
content = secrets_path.read_text()
# Find all secret paths matching hosts/{hostname}/
pattern = rf'"(hosts/{re.escape(hostname)}/[^"]+)"'
matches = re.findall(pattern, content)
# Return unique paths, preserving order
return list(dict.fromkeys(matches))
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool, bool]:
"""
Check which entries exist for a hostname.
@@ -110,7 +171,7 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
repo_root: Path to repository root
Returns:
Tuple of (flake_exists, terraform_vms_exists, vault_exists)
Tuple of (flake_exists, terraform_vms_exists, vault_generated_exists, approle_exists)
"""
# Check flake.nix
flake_path = repo_root / "flake.nix"
@@ -131,7 +192,15 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
vault_content = vault_tf_path.read_text()
vault_exists = f'"{hostname}"' in vault_content
return (flake_exists, terraform_exists, vault_exists)
# Check terraform/vault/approle.tf
approle_path = repo_root / "terraform" / "vault" / "approle.tf"
approle_exists = False
if approle_path.exists():
approle_content = approle_path.read_text()
approle_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
approle_exists = bool(re.search(approle_pattern, approle_content, re.MULTILINE))
return (flake_exists, terraform_exists, vault_exists, approle_exists)
def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None:
@@ -152,15 +221,8 @@ def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -
specialArgs = {{
inherit inputs self sops-nix;
}};
modules = [
(
{{ config, pkgs, ... }}:
{{
nixpkgs.overlays = commonOverlays;
}}
)
modules = commonModules ++ [
./hosts/{config.hostname}
sops-nix.nixosModules.sops
];
}};
"""

View File

@@ -18,6 +18,12 @@
tier = "test"; # Start in test tier, move to prod after validation
};
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";

View File

@@ -5,7 +5,7 @@
package = pkgs.unstable.caddy;
configFile = pkgs.writeText "Caddyfile" ''
{
acme_ca https://ca.home.2rjus.net/acme/acme/directory
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
metrics {
per_host

View File

@@ -1,41 +0,0 @@
{ ... }:
{
services.alloy = {
enable = true;
};
environment.etc."alloy/config.alloy" = {
enable = true;
mode = "0644";
text = ''
pyroscope.write "local_pyroscope" {
endpoint {
url = "http://localhost:4040"
}
}
pyroscope.scrape "labmon" {
targets = [{"__address__" = "localhost:9969", "service_name" = "labmon"}]
forward_to = [pyroscope.write.local_pyroscope.receiver]
profiling_config {
profile.process_cpu {
enabled = true
}
profile.memory {
enabled = true
}
profile.mutex {
enabled = true
}
profile.block {
enabled = true
}
profile.goroutine {
enabled = true
}
}
}
'';
};
}

View File

@@ -7,7 +7,6 @@
./pve.nix
./alerttonotify.nix
./pyroscope.nix
./alloy.nix
./tempo.nix
];
}

View File

@@ -121,22 +121,20 @@ in
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
# Each static_config entry may have labels from homelab.host metadata
{
job_name = "node-exporter";
static_configs = [
{
targets = nodeExporterTargets;
}
];
static_configs = nodeExporterTargets;
}
# Systemd exporter on all hosts (same targets, different port)
# Preserves the same label grouping as node-exporter
{
job_name = "systemd-exporter";
static_configs = [
{
targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
}
];
static_configs = map
(cfg: cfg // {
targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
})
nodeExporterTargets;
}
# Local monitoring services (not auto-generated)
{
@@ -180,14 +178,6 @@ in
}
];
}
{
job_name = "labmon";
static_configs = [
{
targets = [ "monitoring01.home.2rjus.net:9969" ];
}
];
}
# TODO: nix-cache_caddy can't be auto-generated because the cert is issued
# for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
# Consider adding a target override to homelab.monitoring.scrapeTargets.

View File

@@ -17,8 +17,9 @@ groups:
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Disk space is low on {{ $labels.instance }}. Please check."
# Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
- alert: high_cpu_load
expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
for: 15m
labels:
severity: warning
@@ -26,7 +27,7 @@ groups:
summary: "High CPU load on {{ $labels.instance }}"
description: "CPU load is high on {{ $labels.instance }}. Please check."
- alert: high_cpu_load
expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
for: 2h
labels:
severity: warning
@@ -115,8 +116,9 @@ groups:
annotations:
summary: "NSD not running on {{ $labels.instance }}"
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
# Only alert on primary DNS (secondary has cold cache after failover)
- alert: unbound_low_cache_hit_ratio
expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
for: 15m
labels:
severity: warning
@@ -336,40 +338,6 @@ groups:
annotations:
summary: "Pyroscope service not running on {{ $labels.instance }}"
description: "Pyroscope service not running on {{ $labels.instance }}"
- name: certificate_rules
rules:
- alert: certificate_expiring_soon
expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
for: 5m
labels:
severity: warning
annotations:
summary: "TLS certificate expiring soon for {{ $labels.instance }}"
description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
- alert: step_ca_serving_cert_expiring
expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
for: 5m
labels:
severity: critical
annotations:
summary: "Step-CA serving certificate expiring"
description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
- alert: certificate_check_error
expr: labmon_tlsconmon_certificate_check_error == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Error checking certificate for {{ $labels.address }}"
description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
- alert: step_ca_certificate_expiring
expr: labmon_stepmon_certificate_seconds_left < 3600
for: 5m
labels:
severity: critical
annotations:
summary: "Step-CA certificate expiring for {{ $labels.instance }}"
description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
- name: proxmox_rules
rules:
- alert: pve_node_down

View File

@@ -5,7 +5,7 @@
package = pkgs.unstable.caddy;
configFile = pkgs.writeText "Caddyfile" ''
{
acme_ca https://ca.home.2rjus.net/acme/acme/directory
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
metrics
}

View File

@@ -3,7 +3,7 @@
security.acme = {
acceptTerms = true;
defaults = {
server = "https://ca.home.2rjus.net/acme/acme/directory";
server = "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";
email = "root@home.2rjus.net";
dnsPropagationCheck = false;
};

View File

@@ -33,7 +33,7 @@ variable "default_target_node" {
variable "default_template_name" {
description = "Default template VM name to clone from"
type = string
default = "nixos-25.11.20260131.41e216c"
default = "nixos-25.11.20260203.e576e3c"
}
variable "default_ssh_public_key" {

View File

@@ -101,11 +101,13 @@ locals {
]
}
"vaulttest01" = {
# vault01: Vault server itself (fetches secrets from itself)
"vault01" = {
paths = [
"secret/data/hosts/vaulttest01/*",
"secret/data/hosts/vault01/*",
]
}
}
}

View File

@@ -5,9 +5,19 @@
# Each host gets access to its own secrets under hosts/<hostname>/*
locals {
generated_host_policies = {
"vaulttest01" = {
"testvm01" = {
paths = [
"secret/data/hosts/vaulttest01/*",
"secret/data/hosts/testvm01/*",
]
}
"testvm02" = {
paths = [
"secret/data/hosts/testvm02/*",
]
}
"testvm03" = {
paths = [
"secret/data/hosts/testvm03/*",
]
}
@@ -40,7 +50,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["host-${each.key}"]
token_policies = ["host-${each.key}", "homelab-deploy"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600
token_max_ttl = 3600

View File

@@ -45,12 +45,6 @@ locals {
password_length = 24
}
# TODO: Remove after testing
"hosts/vaulttest01/test-service" = {
auto_generate = true
password_length = 32
}
# Shared backup password (auto-generated, add alongside existing restic key)
"shared/backup/password" = {
auto_generate = true

View File

@@ -31,13 +31,6 @@ locals {
# Example Minimal VM using all defaults (uncomment to deploy):
# "minimal-vm" = {}
# "bootstrap-verify-test" = {}
"testvm01" = {
ip = "10.69.13.101/24"
cpu_cores = 2
memory = 2048
disk_size = "20G"
flake_branch = "pipeline-testing-improvements"
}
"vault01" = {
ip = "10.69.13.19/24"
cpu_cores = 2
@@ -45,13 +38,25 @@ locals {
disk_size = "20G"
flake_branch = "vault-setup" # Bootstrap from this branch instead of master
}
"vaulttest01" = {
ip = "10.69.13.150/24"
"testvm01" = {
ip = "10.69.13.20/24"
cpu_cores = 2
memory = 2048
disk_size = "20G"
flake_branch = "improve-bootstrap-visibility"
vault_wrapped_token = "s.l5q88wzXfEcr5SMDHmO6o96b"
}
"testvm02" = {
ip = "10.69.13.21/24"
cpu_cores = 2
memory = 2048
disk_size = "20G"
}
"testvm03" = {
ip = "10.69.13.22/24"
cpu_cores = 2
memory = 2048
disk_size = "20G"
flake_branch = "pki-migration"
vault_wrapped_token = "s.UCpQCOp7cOKDdtGGBvfRWwAt"
}
}