# Compare commits

`deploy-tes...9d019f2b9a` — 29 commits:

9d019f2b9a, 21db7e9573, 979040aaf7, 8791c29402, c7a067d7b3, c518093578, 0b462f0a96, 116abf3bec, b794aa89db, 50a85daa44, 23e561cf49, 7d291f85bf, 2a842c655a, 1f4a5571dc, 13d6d0ea3a, eea000b337, f19ba2f4b6, a90d9c33d5, 09c9df1bbe, ae3039af19, 11261c4636, 4ca3c8890f, 78e8d7a600, 0cf72ec191, 6a3a51407e, a1ae766eb8, 11999b37f3, 29b2b7db52, b046a1b862
@@ -185,21 +185,60 @@ Common job names:
- `home-assistant` - Home automation
- `step-ca` - Internal CA

### Target Labels

All scrape targets have these labels:

**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering

**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)

### Filtering by Host

Use the `hostname` label for easy host filtering across all jobs:

```promql
{hostname="ns1"}                     # All metrics from ns1
node_load1{hostname="monitoring01"}  # Specific metric by hostname
up{hostname="ha1"}                   # Check if ha1 is up
```

This is simpler than wildcarding the `instance` label:

```promql
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}

# New way (preferred)
up{hostname="monitoring01"}
```

### Filtering by Role/Tier

Filter hosts by their role or tier:

```promql
up{role="dns"}                             # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"}  # Build hosts only (nix-cache01)
up{tier="test"}                            # All test-tier VMs
up{dns_role="primary"}                     # Primary DNS only (ns1)
```

Current host labels:

| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| testvm01/02/03 | `tier=test` |

---

## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}

### Investigate Service Issues

1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers
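
Steps 3 and 4 can be combined into a single LogQL query that narrows the error search to one unit (the host and unit names here are placeholders):

```
{host="ns1", systemd_unit="nsd.service"} |= "error"
```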

### After Deploying Changes

@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.

- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- The log `MESSAGE` field contains the actual log content (journal entries are JSON-formatted)
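
To make the counter-vs-gauge note concrete (both metrics are standard node-exporter metrics):

```promql
rate(node_network_receive_bytes_total{hostname="ns1"}[5m])  # counter: wrap in rate()
node_load1{hostname="ns1"}                                  # gauge: query directly
```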
CLAUDE.md (118 lines changed)
@@ -61,10 +61,31 @@ Do not run `nix flake update`. Should only be done manually by user.

### Development Environment

```bash
# Enter development shell
nix develop
```

The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.

**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:

```bash
# Good - works regardless of current shell
nix develop -c tofu plan

# Avoid - requires user to be in devshell
tofu plan
```

**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:

```bash
# Good - uses -chdir option
nix develop -c tofu -chdir=terraform plan
nix develop -c tofu -chdir=terraform/vault apply

# Avoid - changing directories
cd terraform && tofu plan
```
### Secrets Management

Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
@@ -140,11 +161,27 @@ The **lab-monitoring** MCP server can query logs from Loki. All hosts ship syste

- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)

Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.

**Bootstrap Logs:**

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`

Query bootstrap status:

```
{job="bootstrap"}                             # All bootstrap logs
{job="bootstrap", host="testvm01"}            # Specific host
{job="bootstrap", stage="failed"}             # All failures
{job="bootstrap", stage=~"building|success"}  # Track build progress
```

**Example LogQL queries:**

```
# Logs from a specific service on a host
{host="<host>", systemd_unit="<service>.service"}
```
@@ -229,6 +266,21 @@ deploy(role="vault", action="switch")

**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.

**Deploying to Prod Hosts:**

The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:

```bash
nix develop -c homelab-deploy -- deploy \
  --nats-url nats://nats1.home.2rjus.net:4222 \
  --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
  --branch <branch-name> \
  --action switch \
  deploy.prod.<hostname>
```

Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)

**Verifying Deployments:**

After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
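
For example (target host is illustrative):

```promql
nixos_flake_info{hostname="testvm01"}
```

The `current_rev` label on the returned series should match the commit you deployed (`git rev-parse <branch-name>`).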
@@ -249,9 +301,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi

- `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
  - Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
  - Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
  - `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
- `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,6 +312,8 @@ The `current_rev` label contains the git commit hash of the deployed flake confi

  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
  - `vault/` - OpenBao (Vault) secrets server
  - `actions-runner/` - GitHub Actions runner
  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
- `/common/` - Shared configurations (e.g., VM guest agent)
@@ -292,25 +347,31 @@ All hosts automatically get:

### Active Hosts

Production servers:
- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `ca` - Internal Certificate Authority
- `vault01` - OpenBao (Vault) secrets server
- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
- `http-proxy` - Reverse proxy
- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
- `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server + GitHub Actions runner
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server

Test/staging hosts:
- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation

Template hosts:
- `template1`, `template2` - Base templates for cloning new hosts

### Flake Inputs

- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management (legacy, only used by ca)
- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
- Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
  - `labmon` - Lab monitoring
@@ -402,9 +463,21 @@ Example VM deployment includes:

- Custom CPU/memory/disk sizing
- VLAN tagging
- QEMU guest agent
- Automatic Vault credential provisioning via `vault_wrapped_token`

OpenTofu outputs the VM's IP address after deployment for easy SSH access.

**Automatic Vault Credential Provisioning:**

VMs can receive Vault (OpenBao) credentials automatically during bootstrap:

1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
3. The bootstrap script unwraps the token to obtain AppRole credentials
4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild

This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.

#### Template Rebuilding and Terraform State

When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
@@ -484,11 +557,7 @@ Prometheus scrape targets are automatically generated from host configurations,

- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets` (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.

To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
@@ -507,13 +576,30 @@ DNS zone entries are automatically generated from host configurations:

- **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)

Hosts are automatically excluded from DNS if:
- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP configured (e.g., DHCP-only hosts)
- Network interface is a VPN/tunnel (wg*, tun*, tap*)

To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.

### Homelab Module Options

The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.

**Host options (`homelab.host.*`):**
- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
- `labels` - Free-form key-value metadata for host categorization

**DNS options (`homelab.dns.*`):**
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host

**Monitoring options (`homelab.monitoring.*`):**
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host

**Deploy options (`homelab.deploy.*`):**
- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
docs/plans/cert-monitoring.md (new file, 72 lines)
@@ -0,0 +1,72 @@

# Certificate Monitoring Plan

## Summary

This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.

## What Was Removed

### labmon Service

The `labmon` service was a custom Go application that provided:

1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration

The service exposed Prometheus metrics at `:9969` including:
- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration

### Affected Files

- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
- `services/monitoring/prometheus.nix` - Removed labmon scrape target
- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
- `services/monitoring/default.nix` - Removed alloy.nix import

### Removed Alerts

- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
- `certificate_check_error` - Warned when TLS connection check failed
- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates

## Why It Was Removed

1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
2. **Outdated codebase**: labmon was a custom tool that required maintenance
3. **Limited value**: With ACME auto-renewal, certificates should renew automatically

## Current State

ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.

## Future Needs

While ACME handles renewal automatically, we should consider monitoring for:

1. **ACME renewal failures**: Alert when a certificate fails to renew
   - Could monitor ACME client logs (via Loki queries)
   - Could check certificate file modification times

2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures

3. **Certificate transparency**: Monitor for unexpected certificate issuance

### Potential Solutions

1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
   - `probe_ssl_earliest_cert_expiry` metric
   - Already a standard tool, well-maintained
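
If blackbox_exporter were adopted, a backup expiry alert could be expressed against its standard metric (the 7-day threshold is an assumption):

```promql
(probe_ssl_earliest_cert_expiry - time()) < 7 * 24 * 3600
```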

2. **Custom Loki alerting**: Query ACME service logs for renewal failures
   - Works with existing infrastructure
   - No additional services needed
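
A sketch of such a query, assuming the ACME client runs as the standard NixOS `acme-<domain>.service` units (the matched log wording is an assumption):

```
{systemd_unit=~"acme-.+\\.service"} |= "error"
```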

3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics
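
A minimal sketch of such a check (the metric name `cert_seconds_left` is an assumption; a real collector would loop over e.g. `/var/lib/acme/*/cert.pem` and write the output to node-exporter's textfile directory):

```shell
#!/usr/bin/env bash
# Demo is self-contained: generates a short-lived self-signed cert,
# then emits a Prometheus textfile-collector line for it.
set -euo pipefail

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# Stand-in certificate with 30 days validity (hostname is illustrative)
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout "$workdir/key.pem" -out "$workdir/cert.pem" \
  -subj "/CN=demo.home.2rjus.net" 2>/dev/null

# Seconds until expiry: parse notAfter and subtract the current time
end="$(openssl x509 -enddate -noout -in "$workdir/cert.pem" | cut -d= -f2)"
seconds_left=$(( $(date -d "$end" +%s) - $(date +%s) ))

# Prometheus textfile-collector format
echo "# TYPE cert_seconds_left gauge"
echo "cert_seconds_left{path=\"$workdir/cert.pem\"} $seconds_left"
```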

## Status

**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.
@@ -1,10 +1,38 @@

# Prometheus Scrape Target Labels

## Implementation Status

| Step | Status | Notes |
|------|--------|-------|
| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |

**Hosts with metadata configured:**
- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
- `nix-cache01`: `role = "build-host"`
- `vault01`: `role = "vault"`
- `testvm01/02/03`: `tier = "test"`

**Implementation complete.** Branch: `prometheus-scrape-target-labels`

**Query examples:**
- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
- `up{role="dns"}` - all DNS servers
- `up{tier="test"}` - all test-tier hosts

---
## Goal

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.

## Motivation

@@ -54,12 +82,11 @@ or

## Implementation

This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.

### 1. Create `homelab.host` module

✅ **Complete.** The module is in `modules/homelab/host.nix`.

Create `modules/homelab/host.nix` with shared host metadata options:
@@ -98,6 +125,8 @@ Import this module in `modules/homelab/default.nix`.

### 2. Update `lib/monitoring.nix`

✅ **Complete.** Labels are now extracted and propagated.

- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:
@@ -126,6 +155,8 @@ This requires grouping hosts by their label attrset and producing one `static_co

### 3. Update `services/monitoring/prometheus.nix`

✅ **Complete.** Now uses structured static_configs output.

Change the node-exporter scrape config to use the new structured output:

```nix
static_configs = nodeExporterTargets;
```
@@ -138,36 +169,37 @@ static_configs = nodeExporterTargets;

### 4. Set metadata on hosts

✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.

Example in `hosts/nix-cache01/configuration.nix`:

```nix
homelab.host = {
  tier = "test";     # can be deployed by MCP (used by homelab-deploy)
  priority = "low";  # relaxed alerting thresholds
  role = "build-host";
};
```

**Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.

Example in `hosts/ns1/configuration.nix`:

```nix
homelab.host = {
  tier = "prod";
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```

**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.

### 5. Update alert rules

✅ **Complete.** Updated `services/monitoring/rules.yml`:

- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).

### 6. Labels for `generateScrapeConfigs` (service targets)

✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
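
The `high_cpu_load` selector rewrite from step 5 can be sketched as follows (the metric and threshold here are illustrative, not the actual rule):

```promql
# Before: exclude the build host by instance name
node_load1{instance!="nix-cache01.home.2rjus.net:9100"} > 8

# After: exclude by role, so future build hosts are covered automatically
node_load1{role!="build-host"} > 8
```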
flake.lock (generated, 8 lines changed)
@@ -28,11 +28,11 @@
      ]
    },
    "locked": {
      "lastModified": 1770481834,
      "narHash": "sha256-Xx9BYnI0C/qgPbwr9nj6NoAdQTbYLunrdbNSaUww9oY=",
      "ref": "master",
      "rev": "fd0d63b103dfaf21d1c27363266590e723021c67",
      "revCount": 24,
      "type": "git",
      "url": "https://git.t-juice.club/torjus/homelab-deploy"
    },
@@ -61,6 +61,9 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  vault.enable = true;
  homelab.deploy.enable = true;

  zramSwap = {
    enable = true;
  };
@@ -100,61 +100,6 @@
      ];
    };

    labmon = {
      enable = true;

      settings = {
        ListenAddr = ":9969";
        Profiling = true;
        StepMonitors = [
          {
            Enabled = true;
            BaseURL = "https://ca.home.2rjus.net";
            RootID = "3381bda8015a86b9a3cd1851439d1091890a79005e0f1f7c4301fe4bccc29d80";
          }
        ];

        TLSConnectionMonitors = [
          {
            Enabled = true;
            Address = "ca.home.2rjus.net:443";
            Verify = true;
            Duration = "12h";
          }
          {
            Enabled = true;
            Address = "jelly.home.2rjus.net:443";
            Verify = true;
            Duration = "12h";
          }
          {
            Enabled = true;
            Address = "grafana.home.2rjus.net:443";
            Verify = true;
            Duration = "12h";
          }
          {
            Enabled = true;
            Address = "prometheus.home.2rjus.net:443";
            Verify = true;
            Duration = "12h";
          }
          {
            Enabled = true;
            Address = "alertmanager.home.2rjus.net:443";
            Verify = true;
            Duration = "12h";
          }
          {
            Enabled = true;
            Address = "pyroscope.home.2rjus.net:443";
            Verify = true;
            Duration = "12h";
          }
        ];
      };
    };

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
@@ -59,5 +59,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  vault.enable = true;
  homelab.deploy.enable = true;

  system.stateVersion = "23.11"; # Did you read the comment?
}
@@ -59,5 +59,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  vault.enable = true;
  homelab.deploy.enable = true;

  system.stateVersion = "23.11"; # Did you read the comment?
}
@@ -6,22 +6,72 @@ let
  text = ''
    set -euo pipefail

    LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"

    # Send a log entry to Loki with bootstrap status
    # Usage: log_to_loki <stage> <message>
    # Fails silently if Loki is unreachable
    log_to_loki() {
      local stage="$1"
      local message="$2"
      local timestamp_ns
      timestamp_ns="$(date +%s)000000000"

      local payload
      payload=$(jq -n \
        --arg host "$HOSTNAME" \
        --arg stage "$stage" \
        --arg branch "''${BRANCH:-master}" \
        --arg ts "$timestamp_ns" \
        --arg msg "$message" \
        '{
          streams: [{
            stream: {
              job: "bootstrap",
              host: $host,
              stage: $stage,
              branch: $branch
            },
            values: [[$ts, $msg]]
          }]
        }')

      curl -s --connect-timeout 2 --max-time 5 \
        -X POST \
        -H "Content-Type: application/json" \
        -d "$payload" \
        "$LOKI_URL" >/dev/null 2>&1 || true
    }

    echo "================================================================================"
    echo " NIXOS BOOTSTRAP IN PROGRESS"
    echo "================================================================================"
    echo ""

    # Read hostname set by cloud-init (from Terraform VM name via user-data)
    # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
    HOSTNAME=$(hostnamectl hostname)
    echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"
    # Read git branch from environment, default to master
    BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"

    echo "Hostname: $HOSTNAME"
    echo ""
    echo "Starting NixOS bootstrap for host: $HOSTNAME"

    log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"

    echo "Waiting for network connectivity..."

    # Verify we can reach the git server via HTTPS (doesn't respond to ping)
    if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
      echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
      echo "Check network configuration and DNS settings"
      log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
      exit 1
    fi

    echo "Network connectivity confirmed"
    log_to_loki "network_ok" "Network connectivity confirmed"

    # Unwrap Vault token and store AppRole credentials (if provided)
    if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
        chmod 600 /var/lib/vault/approle/secret-id

        echo "Vault credentials unwrapped and stored successfully"
        log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
      else
        echo "WARNING: Failed to unwrap Vault token"
        if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -63,17 +114,17 @@ let
|
||||
echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
|
||||
echo ""
|
||||
echo "Vault secrets will not be available, but continuing bootstrap..."
|
||||
log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
|
||||
fi
|
||||
else
|
||||
echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
|
||||
echo "Skipping Vault credential setup"
|
||||
log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
|
||||
fi
|
||||
|
||||
echo "Fetching and building NixOS configuration from flake..."

# Read git branch from environment, default to master
BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
echo "Using git branch: $BRANCH"
log_to_loki "building" "Starting nixos-rebuild boot"

# Build and activate the host-specific configuration
FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
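The flake URL above selects a branch via `?ref=` and a host-specific `nixosConfiguration` via the `#` fragment (the `''$` is Nix indented-string escaping for a literal `$`). A quick sketch of how it resolves for example values:

```shell
# Hypothetical example values; the repo URL is taken from the script above.
HOSTNAME="testvm01"
BRANCH="master"
FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#$HOSTNAME"
echo "$FLAKE_URL"
```

So a VM named `testvm01` booting with `NIXOS_FLAKE_BRANCH` unset builds `nixosConfigurations.testvm01` from the `master` branch.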
@@ -81,18 +132,30 @@ let
if nixos-rebuild boot --flake "$FLAKE_URL"; then
  echo "Successfully built configuration for $HOSTNAME"
  echo "Rebooting into new configuration..."
  log_to_loki "success" "Build successful - rebooting into new configuration"
  sleep 2
  systemctl reboot
else
  echo "ERROR: nixos-rebuild failed for $HOSTNAME"
  echo "Check that flake has configuration for this hostname"
  echo "Manual intervention required - system will not reboot"
  log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
  exit 1
fi
'';
};
in
{
  # Custom greeting line to indicate this is a bootstrap image
  services.getty.greetingLine = lib.mkForce ''
    ================================================================================
                           BOOTSTRAP IMAGE - NixOS \V (\l)
    ================================================================================

    Bootstrap service is running. Logs are displayed on tty1.
    Check status: journalctl -fu nixos-bootstrap
  '';

  systemd.services."nixos-bootstrap" = {
    description = "Bootstrap NixOS configuration from flake on first boot";

@@ -107,12 +170,12 @@ in
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
-     ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
+     ExecStart = lib.getExe bootstrap-script;

      # Read environment variables from cloud-init (set by cloud-init write_files)
      EnvironmentFile = "-/run/cloud-init-env";

-     # Logging to journald
+     # Log to journal and console
      StandardOutput = "journal+console";
      StandardError = "journal+console";
    };
@@ -62,6 +62,39 @@
    git
  ];

  # Test nginx with ACME certificate from OpenBao PKI
  services.nginx = {
    enable = true;
    virtualHosts."testvm01.home.2rjus.net" = {
      forceSSL = true;
      enableACME = true;
      locations."/" = {
        root = pkgs.writeTextDir "index.html" ''
          <!DOCTYPE html>
          <html>
          <head>
            <title>testvm01 - ACME Test</title>
            <style>
              body { font-family: monospace; max-width: 600px; margin: 50px auto; padding: 20px; }
              .joke { background: #f0f0f0; padding: 20px; border-radius: 8px; margin: 20px 0; }
              .punchline { margin-top: 15px; font-weight: bold; }
            </style>
          </head>
          <body>
            <h1>OpenBao PKI ACME Test</h1>
            <p>If you're seeing this over HTTPS, the migration worked!</p>
            <div class="joke">
              <p>Why do programmers prefer dark mode?</p>
              <p class="punchline">Because light attracts bugs.</p>
            </div>
            <p><small>Certificate issued by: vault.home.2rjus.net</small></p>
          </body>
          </html>
        '';
      };
    };
  };

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
@@ -62,6 +62,16 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  # Vault fetches secrets from itself (after unseal)
  vault.enable = true;
  homelab.deploy.enable = true;

  # Ensure vault-secret services wait for openbao to be unsealed
  systemd.services.vault-secret-homelab-deploy-nkey = {
    after = [ "openbao.service" ];
    wants = [ "openbao.service" ];
  };

  system.stateVersion = "25.11"; # Did you read the comment?
}
@@ -21,6 +21,7 @@ let
  cfg = hostConfig.config;
  monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
  dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
  hostConfig' = (cfg.homelab or { }).host or { };
  hostname = cfg.networking.hostName;
  networks = cfg.systemd.network.networks or { };

@@ -49,20 +50,73 @@ let
    inherit hostname;
    ip = extractIP firstAddress;
    scrapeTargets = monConfig.scrapeTargets or [ ];
    # Host metadata for label propagation
    tier = hostConfig'.tier or "prod";
    priority = hostConfig'.priority or "high";
    role = hostConfig'.role or null;
    labels = hostConfig'.labels or { };
  };

  # Build effective labels for a host
  # Always includes hostname; only includes tier/priority/role if non-default
  buildEffectiveLabels = host:
    { hostname = host.hostname; }
    // (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
    // (lib.optionalAttrs (host.role != null) { role = host.role; })
    // host.labels;
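To make the default-suppression concrete, here is what `buildEffectiveLabels` would produce for a hypothetical test DNS host (the host values below are illustrative, not from the repo):

```nix
# Input (hypothetical homelab.host settings):
#   { hostname = "testdns"; tier = "test"; priority = "high"; role = "dns"; labels = { }; }
# Output: defaults are dropped (priority is "high", the default),
# non-defaults are kept alongside the always-present hostname:
{ hostname = "testdns"; tier = "test"; role = "dns"; }
```

Suppressing default-valued labels keeps prod hosts label-light, so `tier` only ever appears on test VMs, matching the label table at the top of this change.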
  # Generate node-exporter targets from all flake hosts
  # Returns a list of static_configs entries with labels
  generateNodeExporterTargets = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
      hostList = lib.filter (x: x != null) (
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );
-     flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;

      # Extract hostname from a target string like "gunter.home.2rjus.net:9100"
      extractHostnameFromTarget = target:
        builtins.head (lib.splitString "." target);

      # Build target entries with labels for each host
      flakeEntries = map
        (host: {
          target = "${host.hostname}.home.2rjus.net:9100";
          labels = buildEffectiveLabels host;
        })
        hostList;

      # External targets get hostname extracted from the target string
      externalEntries = map
        (target: {
          inherit target;
          labels = { hostname = extractHostnameFromTarget target; };
        })
        (externalTargets.nodeExporter or [ ]);

      allEntries = flakeEntries ++ externalEntries;

      # Group entries by their label set for efficient static_configs
      # Convert labels attrset to a string key for grouping
      labelKey = entry: builtins.toJSON entry.labels;
      grouped = lib.groupBy labelKey allEntries;

      # Convert groups to static_configs format
      # Every flake host now has at least a hostname label
      staticConfigs = lib.mapAttrsToList
        (key: entries:
          let
            labels = (builtins.head entries).labels;
          in
-         flakeTargets ++ (externalTargets.nodeExporter or [ ]);
          { targets = map (e: e.target) entries; labels = labels; }
        )
        grouped;
    in
    staticConfigs;
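Since every host carries a unique `hostname` label, each group ends up as its own `static_configs` entry. Rendered into Prometheus configuration, the output shape would look roughly like this (hostnames and roles are illustrative):

```yaml
# Illustrative shape of the generated static_configs for job node-exporter:
- targets: ["ns1.home.2rjus.net:9100"]
  labels:
    hostname: ns1
    role: dns
    dns_role: primary
- targets: ["gunter.home.2rjus.net:9100"]   # external target: hostname only
  labels:
    hostname: gunter
```

Prometheus attaches each entry's `labels` to every sample scraped from its `targets`, which is what makes the `hostname`/`role` filtering in the alert rules below possible.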

  # Generate scrape configs from all flake hosts and external targets
  # Host labels are propagated to service targets for semantic alert filtering
  generateScrapeConfigs = self: externalTargets:
    let
      nixosConfigs = self.nixosConfigurations or { };
@@ -70,13 +124,14 @@ let
        lib.mapAttrsToList extractHostMonitoring nixosConfigs
      );

-     # Collect all scrapeTargets from all hosts, grouped by job_name
+     # Collect all scrapeTargets from all hosts, including host labels
      allTargets = lib.flatten (map
        (host:
          map
            (target: {
              inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
              hostname = host.hostname;
              hostLabels = buildEffectiveLabels host;
            })
            host.scrapeTargets
        )
@@ -87,22 +142,32 @@ let
      grouped = lib.groupBy (t: t.job_name) allTargets;

      # Generate a scrape config for each job
+     # Within each job, group targets by their host labels for efficient static_configs
      flakeScrapeConfigs = lib.mapAttrsToList
        (jobName: targets:
          let
            first = builtins.head targets;
-           targetAddrs = map
-             (t:

+           # Group targets within this job by their host labels
+           labelKey = t: builtins.toJSON t.hostLabels;
+           groupedByLabels = lib.groupBy labelKey targets;

+           # Every flake host now has at least a hostname label
+           staticConfigs = lib.mapAttrsToList
+             (key: labelTargets:
+               let
-                 portStr = toString t.port;
+                 labels = (builtins.head labelTargets).hostLabels;
+                 targetAddrs = map
+                   (t: "${t.hostname}.home.2rjus.net:${toString t.port}")
+                   labelTargets;
+               in
-                 "${t.hostname}.home.2rjus.net:${portStr}")
-               targets;
+               { targets = targetAddrs; labels = labels; }
+             )
+             groupedByLabels;

            config = {
              job_name = jobName;
-             static_configs = [{
-               targets = targetAddrs;
-             }];
+             static_configs = staticConfigs;
            }
            // (lib.optionalAttrs (first.metrics_path != "/metrics") {
              metrics_path = first.metrics_path;
@@ -99,3 +99,48 @@
    - name: Display success message
      ansible.builtin.debug:
        msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"

- name: Update Terraform template name
  hosts: localhost
  gather_facts: false

  vars:
    terraform_dir: "{{ playbook_dir }}/../terraform"

  tasks:
    - name: Get image filename from earlier play
      ansible.builtin.set_fact:
        image_filename: "{{ hostvars['localhost']['image_filename'] }}"

    - name: Extract template name from image filename
      ansible.builtin.set_fact:
        new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"

    - name: Read current Terraform variables file
      ansible.builtin.slurp:
        src: "{{ terraform_dir }}/variables.tf"
      register: variables_tf_content

    - name: Extract current template name from variables.tf
      ansible.builtin.set_fact:
        current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"

    - name: Check if template name has changed
      ansible.builtin.set_fact:
        template_name_changed: "{{ current_template_name != new_template_name }}"

    - name: Display template name status
      ansible.builtin.debug:
        msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"

    - name: Update default_template_name in variables.tf
      ansible.builtin.replace:
        path: "{{ terraform_dir }}/variables.tf"
        regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
        replace: '\1"{{ new_template_name }}"'
      when: template_name_changed

    - name: Display update result
      ansible.builtin.debug:
        msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
      when: template_name_changed
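The two `regex_replace` filters above strip the Proxmox backup suffix and prefix to recover the template name. The same transformation can be sanity-checked with `sed`, using a filename matching the format seen in this change:

```shell
# Reproduce the Ansible regex_replace chain: drop ".vma.zst" suffix, then
# the "vzdump-qemu-" prefix, leaving the bare template name.
image_filename="vzdump-qemu-nixos-25.11.20260203.e576e3c.vma.zst"
new_template_name=$(printf '%s' "$image_filename" | sed -e 's/\.vma\.zst$//' -e 's/^vzdump-qemu-//')
echo "$new_template_name"   # → nixos-25.11.20260203.e576e3c
```

The result is exactly the `default_template_name` value written into `terraform/variables.tf`.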
@@ -5,7 +5,7 @@
  package = pkgs.unstable.caddy;
  configFile = pkgs.writeText "Caddyfile" ''
    {
-     acme_ca https://ca.home.2rjus.net/acme/acme/directory
+     acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory

      metrics {
        per_host
@@ -1,41 +0,0 @@
{ ... }:
{
  services.alloy = {
    enable = true;
  };

  environment.etc."alloy/config.alloy" = {
    enable = true;
    mode = "0644";
    text = ''
      pyroscope.write "local_pyroscope" {
        endpoint {
          url = "http://localhost:4040"
        }
      }

      pyroscope.scrape "labmon" {
        targets = [{"__address__" = "localhost:9969", "service_name" = "labmon"}]
        forward_to = [pyroscope.write.local_pyroscope.receiver]

        profiling_config {
          profile.process_cpu {
            enabled = true
          }
          profile.memory {
            enabled = true
          }
          profile.mutex {
            enabled = true
          }
          profile.block {
            enabled = true
          }
          profile.goroutine {
            enabled = true
          }
        }
      }
    '';
  };
}
@@ -7,7 +7,6 @@
    ./pve.nix
    ./alerttonotify.nix
    ./pyroscope.nix
-   ./alloy.nix
    ./tempo.nix
  ];
}
@@ -121,22 +121,20 @@ in

  scrapeConfigs = [
    # Auto-generated node-exporter targets from flake hosts + external
+   # Each static_config entry may have labels from homelab.host metadata
    {
      job_name = "node-exporter";
-     static_configs = [
-       {
-         targets = nodeExporterTargets;
-       }
-     ];
+     static_configs = nodeExporterTargets;
    }
    # Systemd exporter on all hosts (same targets, different port)
+   # Preserves the same label grouping as node-exporter
    {
      job_name = "systemd-exporter";
-     static_configs = [
-       {
-         targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
-       }
-     ];
+     static_configs = map
+       (cfg: cfg // {
+         targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
+       })
+       nodeExporterTargets;
    }
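The `replaceStrings` call reuses the node-exporter target list for the systemd exporter by swapping the port. What it does to a single target, sketched in shell with an example hostname:

```shell
# One node-exporter target on :9100 becomes a systemd-exporter target on :9558.
t="ns1.home.2rjus.net:9100"
echo "${t%:9100}:9558"   # → ns1.home.2rjus.net:9558
```

Because the port is rewritten inside each `static_configs` entry, the attached host labels carry over unchanged to the systemd-exporter job.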
    # Local monitoring services (not auto-generated)
    {
@@ -180,14 +178,6 @@ in
      }
    ];
  }
- {
-   job_name = "labmon";
-   static_configs = [
-     {
-       targets = [ "monitoring01.home.2rjus.net:9969" ];
-     }
-   ];
- }
  # TODO: nix-cache_caddy can't be auto-generated because the cert is issued
  # for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
  # Consider adding a target override to homelab.monitoring.scrapeTargets.
@@ -17,8 +17,9 @@ groups:
      annotations:
        summary: "Disk space low on {{ $labels.instance }}"
        description: "Disk space is low on {{ $labels.instance }}. Please check."
+   # Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
    - alert: high_cpu_load
-     expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+     expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
      for: 15m
      labels:
        severity: warning
@@ -26,7 +27,7 @@ groups:
        summary: "High CPU load on {{ $labels.instance }}"
        description: "CPU load is high on {{ $labels.instance }}. Please check."
    - alert: high_cpu_load
-     expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
+     expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
      for: 2h
      labels:
        severity: warning
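Switching these rules from a hard-coded instance address to the propagated `role` label means the exception follows the label, not the hostname. The split can be checked ad hoc in the Prometheus UI with queries like the following (the `role` value comes from `homelab.host`, as described above):

```promql
# Load on build hosts only (2h tolerance applies):
max(node_load5{role="build-host"}) by (instance)

# Load on everything else (15m tolerance applies):
max(node_load5{role!="build-host"}) by (instance)
```

Note that `role!="build-host"` also matches series with no `role` label at all, so hosts without metadata still fall under the stricter rule.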
@@ -115,8 +116,9 @@ groups:
      annotations:
        summary: "NSD not running on {{ $labels.instance }}"
        description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
+   # Only alert on primary DNS (secondary has cold cache after failover)
    - alert: unbound_low_cache_hit_ratio
-     expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
+     expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
      for: 15m
      labels:
        severity: warning
@@ -336,40 +338,6 @@ groups:
      annotations:
        summary: "Pyroscope service not running on {{ $labels.instance }}"
        description: "Pyroscope service not running on {{ $labels.instance }}"
  - name: certificate_rules
    rules:
      - alert: certificate_expiring_soon
        expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon for {{ $labels.instance }}"
          description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
      - alert: step_ca_serving_cert_expiring
        expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Step-CA serving certificate expiring"
          description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
      - alert: certificate_check_error
        expr: labmon_tlsconmon_certificate_check_error == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error checking certificate for {{ $labels.address }}"
          description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
      - alert: step_ca_certificate_expiring
        expr: labmon_stepmon_certificate_seconds_left < 3600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Step-CA certificate expiring for {{ $labels.instance }}"
          description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
  - name: proxmox_rules
    rules:
      - alert: pve_node_down
@@ -5,7 +5,7 @@
  package = pkgs.unstable.caddy;
  configFile = pkgs.writeText "Caddyfile" ''
    {
-     acme_ca https://ca.home.2rjus.net/acme/acme/directory
+     acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
      metrics
    }
@@ -3,7 +3,7 @@
  security.acme = {
    acceptTerms = true;
    defaults = {
-     server = "https://ca.home.2rjus.net/acme/acme/directory";
+     server = "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";
      email = "root@home.2rjus.net";
      dnsPropagationCheck = false;
    };
@@ -33,7 +33,7 @@ variable "default_target_node" {
variable "default_template_name" {
  description = "Default template VM name to clone from"
  type        = string
- default     = "nixos-25.11.20260131.41e216c"
+ default     = "nixos-25.11.20260203.e576e3c"
}

variable "default_ssh_public_key" {
@@ -101,6 +101,13 @@ locals {
      ]
    }

    # vault01: Vault server itself (fetches secrets from itself)
    "vault01" = {
      paths = [
        "secret/data/hosts/vault01/*",
      ]
    }
  }
}
@@ -43,24 +43,20 @@ locals {
      cpu_cores = 2
      memory    = 2048
      disk_size = "20G"
-     flake_branch        = "deploy-test-hosts"
-     vault_wrapped_token = "s.YRGRpAZVVtSYEa3wOYOqFmjt"
+     flake_branch        = "improve-bootstrap-visibility"
+     vault_wrapped_token = "s.l5q88wzXfEcr5SMDHmO6o96b"
    }
    "testvm02" = {
      ip        = "10.69.13.21/24"
      cpu_cores = 2
      memory    = 2048
      disk_size = "20G"
      flake_branch        = "deploy-test-hosts"
      vault_wrapped_token = "s.tvs8yhJOkLjBs548STs6DBw7"
    }
    "testvm03" = {
      ip        = "10.69.13.22/24"
      cpu_cores = 2
      memory    = 2048
      disk_size = "20G"
      flake_branch        = "deploy-test-hosts"
      vault_wrapped_token = "s.sQ80FZGeG3z6jgrsuh74IopC"
    }
  }