Compare commits
17 Commits
deploy-tes
...
2a842c655a
| Author | SHA1 | Date |
|---|---|---|
| | 2a842c655a | |
| | 1f4a5571dc | |
| | 13d6d0ea3a | |
| | eea000b337 | |
| | f19ba2f4b6 | |
| | a90d9c33d5 | |
| | 09c9df1bbe | |
| | ae3039af19 | |
| | 11261c4636 | |
| | 4ca3c8890f | |
| | 78e8d7a600 | |
| | 0cf72ec191 | |
| | 6a3a51407e | |
| | a1ae766eb8 | |
| | 11999b37f3 | |
| | 29b2b7db52 | |
| | b046a1b862 | |
103
CLAUDE.md
@@ -61,10 +61,31 @@ Do not run `nix flake update`. Should only be done manually by user.
 ### Development Environment
 
 ```bash
-# Enter development shell (provides ansible, python3)
+# Enter development shell
 nix develop
 ```
 
+The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.
+
+**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
+```bash
+# Good - works regardless of current shell
+nix develop -c tofu plan
+
+# Avoid - requires user to be in devshell
+tofu plan
+```
+
+**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
+```bash
+# Good - uses -chdir option
+nix develop -c tofu -chdir=terraform plan
+nix develop -c tofu -chdir=terraform/vault apply
+
+# Avoid - changing directories
+cd terraform && tofu plan
+```
+
 ### Secrets Management
 
 Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
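The `nix develop -c tofu -chdir=...` convention added in the hunk above could be wrapped in a small helper. This is only a sketch: the function name `devtofu` and the `DEVTOFU_DRY_RUN` variable are hypothetical and do not exist in the repository. It assembles the same command line the guidance recommends and, in dry-run mode, prints it instead of executing it.

```shell
set -eu

# Hypothetical helper (not part of the repository): run an OpenTofu command
# through the devshell without changing directories.
# Usage: devtofu <dir> <tofu-args...>
# Set DEVTOFU_DRY_RUN=1 to print the command instead of running it.
devtofu() {
  dir="$1"
  shift
  if [ "${DEVTOFU_DRY_RUN:-0}" = "1" ]; then
    printf 'nix develop -c tofu -chdir=%s %s\n' "$dir" "$*"
  else
    nix develop -c tofu "-chdir=$dir" "$@"
  fi
}

DEVTOFU_DRY_RUN=1
devtofu terraform plan          # prints: nix develop -c tofu -chdir=terraform plan
devtofu terraform/vault apply   # prints: nix develop -c tofu -chdir=terraform/vault apply
```

Keeping the directory as an explicit argument means the caller's working directory is never changed, which is the point of preferring `-chdir` over `cd`.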
@@ -140,11 +161,27 @@ The **lab-monitoring** MCP server can query logs from Loki. All hosts ship syste
 
 - `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
 - `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
-- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
+- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
 - `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)
 
 Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.
 
+**Bootstrap Logs:**
+
+VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:
+
+- `host` - Target hostname
+- `branch` - Git branch being deployed
+- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`
+
+Query bootstrap status:
+```
+{job="bootstrap"} # All bootstrap logs
+{job="bootstrap", host="testvm01"} # Specific host
+{job="bootstrap", stage="failed"} # All failures
+{job="bootstrap", stage=~"building|success"} # Track build progress
+```
+
 **Example LogQL queries:**
 ```
 # Logs from a specific service on a host
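The bootstrap queries added in the hunk above all follow one pattern: a fixed `job="bootstrap"` matcher plus optional `host` and `stage` matchers. A minimal sketch of that pattern, assuming a hypothetical helper named `bootstrap_selector` (it is not part of the repository and only builds the selector string; it does not contact Loki):

```shell
set -eu

# Hypothetical helper: build a LogQL stream selector for bootstrap logs.
# Usage: bootstrap_selector [host] [stage]
# Pass an empty string to skip a label.
bootstrap_selector() {
  sel='job="bootstrap"'
  if [ -n "${1:-}" ]; then
    sel="$sel, host=\"$1\""
  fi
  if [ -n "${2:-}" ]; then
    sel="$sel, stage=\"$2\""
  fi
  printf '{%s}\n' "$sel"
}

bootstrap_selector                # {job="bootstrap"}
bootstrap_selector testvm01       # {job="bootstrap", host="testvm01"}
bootstrap_selector "" failed      # {job="bootstrap", stage="failed"}
```

Exact-match selectors are enough for a single stage; a regex matcher such as `stage=~"building|success"` would still need to be written by hand.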
@@ -249,9 +286,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
 - `configuration.nix` - Host-specific settings (networking, hardware, users)
 - `/system/` - Shared system-level configurations applied to ALL hosts
 - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
+- Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
 - Monitoring: node-exporter and promtail on every host
 - `/modules/` - Custom NixOS modules
-- `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
+- `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
 - `/lib/` - Nix library functions
 - `dns-zone.nix` - DNS zone generation functions
 - `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,6 +297,8 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
 - `home-assistant/` - Home automation stack
 - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
 - `ns/` - DNS services (authoritative, resolver, zone generation)
+- `vault/` - OpenBao (Vault) secrets server
+- `actions-runner/` - GitHub Actions runner
 - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
 - `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
 - `/common/` - Shared configurations (e.g., VM guest agent)
@@ -292,25 +332,31 @@ All hosts automatically get:
 
 ### Active Hosts
 
-Production servers managed by `rebuild-all.sh`:
+Production servers:
 - `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
 - `ca` - Internal Certificate Authority
+- `vault01` - OpenBao (Vault) secrets server
 - `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
 - `http-proxy` - Reverse proxy
 - `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
 - `jelly01` - Jellyfin media server
-- `nix-cache01` - Binary cache server
+- `nix-cache01` - Binary cache server + GitHub Actions runner
 - `pgdb1` - PostgreSQL database
 - `nats1` - NATS messaging server
 
-Template/test hosts:
-- `template1` - Base template for cloning new hosts
+Test/staging hosts:
+- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
+
+Template hosts:
+- `template1`, `template2` - Base templates for cloning new hosts
 
 ### Flake Inputs
 
 - `nixpkgs` - NixOS 25.11 stable (primary)
 - `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
 - `sops-nix` - Secrets management (legacy, only used by ca)
+- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
+- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
 - Custom packages from git.t-juice.club:
 - `alerttonotify` - Alert routing
 - `labmon` - Lab monitoring
@@ -402,9 +448,21 @@ Example VM deployment includes:
 - Custom CPU/memory/disk sizing
 - VLAN tagging
 - QEMU guest agent
+- Automatic Vault credential provisioning via `vault_wrapped_token`
 
 OpenTofu outputs the VM's IP address after deployment for easy SSH access.
 
+**Automatic Vault Credential Provisioning:**
+
+VMs can receive Vault (OpenBao) credentials automatically during bootstrap:
+
+1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
+2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
+3. The bootstrap script unwraps the token to obtain AppRole credentials
+4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild
+
+This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.
+
 #### Template Rebuilding and Terraform State
 
 When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
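Step 4 of the provisioning flow in the hunk above (writing the unwrapped AppRole credentials to disk before the rebuild) can be sketched as follows. This is an illustration under assumptions, not the bootstrap script itself: the helper name `store_approle` and the `role-id` file name are hypothetical (only `secret-id` and the `chmod 600` appear in the diff), the credential values are placeholders, and a temporary directory stands in for `/var/lib/vault/approle/`.

```shell
set -eu

# Hypothetical helper: store AppRole credentials readable by the owner only.
# Usage: store_approle <dir> <role_id> <secret_id>
store_approle() {
  dir="$1"
  mkdir -p "$dir"
  # The subshell limits the umask change to these two writes, so the files
  # are never created with wider permissions.
  (
    umask 077
    printf '%s' "$2" > "$dir/role-id"
    printf '%s' "$3" > "$dir/secret-id"
  )
  chmod 600 "$dir/role-id" "$dir/secret-id"
}

# Demonstration in a temporary directory with placeholder values.
demo_dir="$(mktemp -d)"
store_approle "$demo_dir/approle" "example-role-id" "example-secret-id"
echo "credentials stored in $demo_dir/approle"
```

Setting the umask before writing avoids a short window in which the secret would be world-readable; the explicit `chmod 600` then matches what the diff shows for `secret-id`.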
@@ -484,11 +542,7 @@ Prometheus scrape targets are automatically generated from host configurations,
 - **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
 - **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
 
-Host monitoring options (`homelab.monitoring.*`):
-- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
-- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)
-
-Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
+Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets` (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
 
 To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
 
@@ -507,13 +561,30 @@ DNS zone entries are automatically generated from host configurations:
 - **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
 - **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)
 
-Host DNS options (`homelab.dns.*`):
-- `enable` (default: `true`) - Include host in DNS zone generation
-- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
-
 Hosts are automatically excluded from DNS if:
 - `homelab.dns.enable = false` (e.g., template hosts)
 - No static IP configured (e.g., DHCP-only hosts)
 - Network interface is a VPN/tunnel (wg*, tun*, tap*)
 
 To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.
+
+### Homelab Module Options
+
+The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.
+
+**Host options (`homelab.host.*`):**
+- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
+- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
+- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
+- `labels` - Free-form key-value metadata for host categorization
+
+**DNS options (`homelab.dns.*`):**
+- `enable` (default: `true`) - Include host in DNS zone generation
+- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host
+
+**Monitoring options (`homelab.monitoring.*`):**
+- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
+- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host
+
+**Deploy options (`homelab.deploy.*`):**
+- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
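The DNS exclusion rule for VPN/tunnel interfaces (`wg*`, `tun*`, `tap*`) mentioned in the hunk above is a simple prefix match. The sketch below only illustrates that rule in shell; the repository's actual logic is written in Nix and is not shown in this diff. The function name `is_tunnel_iface` and the sample interface name `ens18` are hypothetical.

```shell
set -eu

# Illustration only: returns success when the interface name matches one of
# the documented VPN/tunnel prefixes (wg*, tun*, tap*).
is_tunnel_iface() {
  case "$1" in
    wg*|tun*|tap*) return 0 ;;
    *) return 1 ;;
  esac
}

for iface in ens18 wg0 tun1 tap0; do
  if is_tunnel_iface "$iface"; then
    echo "$iface: excluded from DNS"
  else
    echo "$iface: eligible for DNS"
  fi
done
```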
@@ -1,10 +1,32 @@
 # Prometheus Scrape Target Labels
 
+## Implementation Status
+
+| Step | Status | Notes |
+|------|--------|-------|
+| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
+| 2. Update `lib/monitoring.nix` | ❌ Not started | Labels not extracted or propagated |
+| 3. Update Prometheus config | ❌ Not started | Still uses flat target list |
+| 4. Set metadata on hosts | ⚠️ Partial | Some hosts configured, see below |
+| 5. Update alert rules | ❌ Not started | |
+| 6. Labels for service targets | ❌ Not started | Optional |
+
+**Hosts with metadata configured:**
+- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
+- `nix-cache01`: `role = "build-host"` (missing `priority = "low"` from plan)
+- `vault01`: `role = "vault"`
+- `jump`: `role = "bastion"`
+- `template`, `template2`, `testvm*`: `tier` and `priority` set
+
+**Key gap:** The `homelab.host` module exists and some hosts use it, but `lib/monitoring.nix` does not extract these values—they are not propagated to Prometheus scrape targets.
+
+---
+
 ## Goal
 
 Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.
 
-**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
+**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
 
 ## Motivation
 
@@ -54,12 +76,11 @@ or
 
 ## Implementation
 
-This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
+This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.
 
 ### 1. Create `homelab.host` module
 
-**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
-`modules/homelab/host.nix` with tier, priority, role, and labels options.
+✅ **Complete.** The module is in `modules/homelab/host.nix`.
 
 Create `modules/homelab/host.nix` with shared host metadata options:
 
@@ -98,6 +119,8 @@ Import this module in `modules/homelab/default.nix`.
 
 ### 2. Update `lib/monitoring.nix`
 
+❌ **Not started.** The current implementation does not extract `homelab.host` values.
+
 - `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
 - Build the combined label set from `homelab.host`:
 
@@ -126,6 +149,8 @@ This requires grouping hosts by their label attrset and producing one `static_co
 
 ### 3. Update `services/monitoring/prometheus.nix`
 
+❌ **Not started.** Still uses flat target list (`static_configs = [{ targets = nodeExporterTargets; }]`).
+
 Change the node-exporter scrape config to use the new structured output:
 
 ```nix
@@ -138,29 +163,34 @@ static_configs = nodeExporterTargets;
 
 ### 4. Set metadata on hosts
 
+⚠️ **Partial.** Some hosts configured (see status table above). Current `nix-cache01` only has `role`, missing the `priority = "low"` suggested below.
+
 Example in `hosts/nix-cache01/configuration.nix`:
 
 ```nix
 homelab.host = {
-tier = "test"; # can be deployed by MCP (used by homelab-deploy)
 priority = "low"; # relaxed alerting thresholds
 role = "build-host";
 };
 ```
 
+**Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.
+
 Example in `hosts/ns1/configuration.nix`:
 
 ```nix
 homelab.host = {
-tier = "prod";
-priority = "high";
 role = "dns";
 labels.dns_role = "primary";
 };
 ```
 
+**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.
+
 ### 5. Update alert rules
 
+❌ **Not started.** Requires steps 2-3 to be completed first.
+
 After implementing labels, review and update `services/monitoring/rules.yml`:
 
 - Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
@@ -170,4 +200,6 @@ Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion tha
 
 ### 6. Consider labels for `generateScrapeConfigs` (service targets)
 
+❌ **Not started.** Optional enhancement.
+
 The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
8
flake.lock
generated
@@ -28,11 +28,11 @@
 ]
 },
 "locked": {
-"lastModified": 1770447502,
-"narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
+"lastModified": 1770470613,
+"narHash": "sha256-FOqOBdsQnLzA7+Y3hXuN9PcOHmpEDvNw64Ju8op9J2w=",
 "ref": "master",
-"rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
-"revCount": 22,
+"rev": "36a74b8cf9a873202e28eecfb90f9af8650cca8b",
+"revCount": 23,
 "type": "git",
 "url": "https://git.t-juice.club/torjus/homelab-deploy"
 },
@@ -6,22 +6,72 @@ let
 text = ''
 set -euo pipefail
 
+LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+
+# Send a log entry to Loki with bootstrap status
+# Usage: log_to_loki <stage> <message>
+# Fails silently if Loki is unreachable
+log_to_loki() {
+local stage="$1"
+local message="$2"
+local timestamp_ns
+timestamp_ns="$(date +%s)000000000"
+
+local payload
+payload=$(jq -n \
+--arg host "$HOSTNAME" \
+--arg stage "$stage" \
+--arg branch "''${BRANCH:-master}" \
+--arg ts "$timestamp_ns" \
+--arg msg "$message" \
+'{
+streams: [{
+stream: {
+job: "bootstrap",
+host: $host,
+stage: $stage,
+branch: $branch
+},
+values: [[$ts, $msg]]
+}]
+}')
+
+curl -s --connect-timeout 2 --max-time 5 \
+-X POST \
+-H "Content-Type: application/json" \
+-d "$payload" \
+"$LOKI_URL" >/dev/null 2>&1 || true
+}
+
+echo "================================================================================"
+echo " NIXOS BOOTSTRAP IN PROGRESS"
+echo "================================================================================"
+echo ""
+
 # Read hostname set by cloud-init (from Terraform VM name via user-data)
 # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
 HOSTNAME=$(hostnamectl hostname)
-echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"
+# Read git branch from environment, default to master
+BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
+
+echo "Hostname: $HOSTNAME"
+echo ""
 echo "Starting NixOS bootstrap for host: $HOSTNAME"
+
+log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"
 
 echo "Waiting for network connectivity..."
 
 # Verify we can reach the git server via HTTPS (doesn't respond to ping)
 if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
 echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
 echo "Check network configuration and DNS settings"
+log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
 exit 1
 fi
 
 echo "Network connectivity confirmed"
+log_to_loki "network_ok" "Network connectivity confirmed"
 
 # Unwrap Vault token and store AppRole credentials (if provided)
 if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
 chmod 600 /var/lib/vault/approle/secret-id
 
 echo "Vault credentials unwrapped and stored successfully"
+log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
 else
 echo "WARNING: Failed to unwrap Vault token"
 if [ -n "$UNWRAP_RESPONSE" ]; then
@@ -63,17 +114,17 @@ let
 echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
 echo ""
 echo "Vault secrets will not be available, but continuing bootstrap..."
+log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
 fi
 else
 echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
 echo "Skipping Vault credential setup"
+log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
 fi
 
 echo "Fetching and building NixOS configuration from flake..."
-
-# Read git branch from environment, default to master
-BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
 echo "Using git branch: $BRANCH"
+log_to_loki "building" "Starting nixos-rebuild boot"
 
 # Build and activate the host-specific configuration
 FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
@@ -81,18 +132,30 @@ let
 if nixos-rebuild boot --flake "$FLAKE_URL"; then
 echo "Successfully built configuration for $HOSTNAME"
 echo "Rebooting into new configuration..."
+log_to_loki "success" "Build successful - rebooting into new configuration"
 sleep 2
 systemctl reboot
 else
 echo "ERROR: nixos-rebuild failed for $HOSTNAME"
 echo "Check that flake has configuration for this hostname"
 echo "Manual intervention required - system will not reboot"
+log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
 exit 1
 fi
 '';
 };
 in
 {
+# Custom greeting line to indicate this is a bootstrap image
+services.getty.greetingLine = lib.mkForce ''
+================================================================================
+BOOTSTRAP IMAGE - NixOS \V (\l)
+================================================================================
+
+Bootstrap service is running. Logs are displayed on tty1.
+Check status: journalctl -fu nixos-bootstrap
+'';
+
 systemd.services."nixos-bootstrap" = {
 description = "Bootstrap NixOS configuration from flake on first boot";
 
@@ -107,12 +170,12 @@ in
 serviceConfig = {
 Type = "oneshot";
 RemainAfterExit = true;
-ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
+ExecStart = lib.getExe bootstrap-script;
 
 # Read environment variables from cloud-init (set by cloud-init write_files)
 EnvironmentFile = "-/run/cloud-init-env";
 
-# Logging to journald
+# Log to journal and console
 StandardOutput = "journal+console";
 StandardError = "journal+console";
 };
@@ -99,3 +99,48 @@
|
|||||||
- name: Display success message
|
- name: Display success message
|
||||||
ansible.builtin.debug:
|
ansible.builtin.debug:
|
||||||
msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
|
msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
|
||||||
|
|
||||||
|
- name: Update Terraform template name
|
||||||
|
hosts: localhost
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
terraform_dir: "{{ playbook_dir }}/../terraform"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Get image filename from earlier play
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
image_filename: "{{ hostvars['localhost']['image_filename'] }}"
|
||||||
|
|
||||||
|
- name: Extract template name from image filename
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"
|
||||||
|
|
||||||
|
- name: Read current Terraform variables file
|
||||||
|
ansible.builtin.slurp:
|
||||||
|
src: "{{ terraform_dir }}/variables.tf"
|
||||||
|
register: variables_tf_content
|
||||||
|
|
||||||
|
- name: Extract current template name from variables.tf
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"
|
||||||
|
|
||||||
|
- name: Check if template name has changed
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
template_name_changed: "{{ current_template_name != new_template_name }}"
|
||||||
|
|
||||||
|
- name: Display template name status
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"
|
||||||
|
|
||||||
|
- name: Update default_template_name in variables.tf
|
||||||
|
ansible.builtin.replace:
|
||||||
|
path: "{{ terraform_dir }}/variables.tf"
|
||||||
|
regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
|
||||||
|
replace: '\1"{{ new_template_name }}"'
|
||||||
|
when: template_name_changed
|
||||||
|
|
||||||
|
- name: Display update result
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
|
||||||
|
when: template_name_changed
|
||||||
|
|||||||
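Review note: the regex logic in the "Update Terraform template name" play can be checked outside Ansible. The following is a minimal Python sketch, not the playbook's implementation. The function names and the sample image filename are hypothetical, chosen only to match the `vzdump-qemu-` prefix and `.vma.zst` suffix that the play strips. The patterns are copied from the `regex_replace`, `regex_search`, and `ansible.builtin.replace` tasks above.

```python
import re

# Pattern taken from the play: group 1 is everything up to the opening quote
# of the default value, group 2 is the current template name.
_DEFAULT_RE = re.compile(
    r'(variable "default_template_name"[^}]+default\s*=\s*)"([^"]+)"'
)


def template_name_from_image(image_filename: str) -> str:
    """Strip the '.vma.zst' suffix, then the 'vzdump-qemu-' prefix."""
    name = re.sub(r"\.vma\.zst$", "", image_filename)
    return re.sub(r"^vzdump-qemu-", "", name)


def current_template_name(variables_tf: str):
    """Return the current default, or None if the variable block is missing."""
    match = _DEFAULT_RE.search(variables_tf)
    return match.group(2) if match else None


def update_template_name(variables_tf: str, new_name: str) -> str:
    """Rewrite only the quoted default value, keeping the original spacing."""
    return _DEFAULT_RE.sub(
        lambda m: f'{m.group(1)}"{new_name}"', variables_tf
    )


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    tf = (
        'variable "default_template_name" {\n'
        '  description = "Default template VM name to clone from"\n'
        "  type        = string\n"
        '  default     = "nixos-25.11.20260131.41e216c"\n'
        "}\n"
    )
    new_name = template_name_from_image(
        "vzdump-qemu-nixos-25.11.20260203.e576e3c.vma.zst"
    )
    if current_template_name(tf) != new_name:
        print(update_template_name(tf, new_name))
```

One difference worth noting: in the play, `regex_search(...) | first` fails if `variables.tf` has no matching block, whereas this sketch returns `None` in that case.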
@@ -33,7 +33,7 @@ variable "default_target_node" {
|
|||||||
variable "default_template_name" {
|
variable "default_template_name" {
|
||||||
description = "Default template VM name to clone from"
|
description = "Default template VM name to clone from"
|
||||||
type = string
|
type = string
|
||||||
default = "nixos-25.11.20260131.41e216c"
|
default = "nixos-25.11.20260203.e576e3c"
|
||||||
}
|
}
|
||||||
|
|
||||||
variable "default_ssh_public_key" {
|
variable "default_ssh_public_key" {
|
||||||
|
|||||||
@@ -43,24 +43,20 @@ locals {
|
|||||||
cpu_cores = 2
|
cpu_cores = 2
|
||||||
memory = 2048
|
memory = 2048
|
||||||
disk_size = "20G"
|
disk_size = "20G"
|
||||||
flake_branch = "deploy-test-hosts"
|
flake_branch = "improve-bootstrap-visibility"
|
||||||
vault_wrapped_token = "s.YRGRpAZVVtSYEa3wOYOqFmjt"
|
vault_wrapped_token = "s.l5q88wzXfEcr5SMDHmO6o96b"
|
||||||
}
|
}
|
||||||
"testvm02" = {
|
"testvm02" = {
|
||||||
ip = "10.69.13.21/24"
|
ip = "10.69.13.21/24"
|
||||||
cpu_cores = 2
|
cpu_cores = 2
|
||||||
memory = 2048
|
memory = 2048
|
||||||
disk_size = "20G"
|
disk_size = "20G"
|
||||||
flake_branch = "deploy-test-hosts"
|
|
||||||
vault_wrapped_token = "s.tvs8yhJOkLjBs548STs6DBw7"
|
|
||||||
}
|
}
|
||||||
"testvm03" = {
|
"testvm03" = {
|
||||||
ip = "10.69.13.22/24"
|
ip = "10.69.13.22/24"
|
||||||
cpu_cores = 2
|
cpu_cores = 2
|
||||||
memory = 2048
|
memory = 2048
|
||||||
disk_size = "20G"
|
disk_size = "20G"
|
||||||
flake_branch = "deploy-test-hosts"
|
|
||||||
vault_wrapped_token = "s.sQ80FZGeG3z6jgrsuh74IopC"
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||