Compare commits
36 Commits
homelab-de...9d019f2b9a
| Author | SHA1 | Date | |
|---|---|---|---|
| | 9d019f2b9a | | |
| | 21db7e9573 | | |
| | 979040aaf7 | | |
| | 8791c29402 | | |
| | c7a067d7b3 | | |
| | c518093578 | | |
| | 0b462f0a96 | | |
| | 116abf3bec | | |
| | b794aa89db | | |
| | 50a85daa44 | | |
| | 23e561cf49 | | |
| | 7d291f85bf | | |
| | 2a842c655a | | |
| | 1f4a5571dc | | |
| | 13d6d0ea3a | | |
| | eea000b337 | | |
| | f19ba2f4b6 | | |
| | a90d9c33d5 | | |
| | 09c9df1bbe | | |
| | ae3039af19 | | |
| | 11261c4636 | | |
| | 4ca3c8890f | | |
| | 78e8d7a600 | | |
| | 0cf72ec191 | | |
| | 6a3a51407e | | |
| | a1ae766eb8 | | |
| | 11999b37f3 | | |
| | 29b2b7db52 | | |
| | b046a1b862 | | |
| | 38348c5980 | | |
| | 370cf2b03a | | |
| | 7bc465b414 | | |
| | 8d7bc50108 | | |
| | 03e70ac094 | | |
| | 3b32c9479f | | |
| | b0d35f9a99 | | |
@@ -185,21 +185,60 @@ Common job names:
- `home-assistant` - Home automation
- `step-ca` - Internal CA

### Instance Label Format
### Target Labels

The `instance` label uses FQDN format:
All scrape targets have these labels:

```
<hostname>.home.2rjus.net:<port>
```
**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering

Example queries filtering by host:
**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
- `tier` - Deployment tier (`test` for test VMs, absent for prod)
- `dns_role` - DNS-specific role (`primary` or `secondary` for ns1/ns2)

### Filtering by Host

Use the `hostname` label for easy host filtering across all jobs:

```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
{hostname="ns1"} # All metrics from ns1
node_load1{hostname="monitoring01"} # Specific metric by hostname
up{hostname="ha1"} # Check if ha1 is up
```

This is simpler than wildcarding the `instance` label:

```promql
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}

# New way (preferred)
up{hostname="monitoring01"}
```

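To make the relationship between the two labels concrete, here is a minimal shell sketch that derives the short hostname from an `instance` value. It assumes the `<hostname>.home.2rjus.net:<port>` format described above; the function name `short_hostname` is hypothetical and this is not how `lib/monitoring.nix` computes the label.

```bash
#!/usr/bin/env bash
# Illustration only: derive the short hostname from an instance label value.
# Assumes the instance format <hostname>.home.2rjus.net:<port>.
short_hostname() {
  local instance="$1"
  local fqdn="${instance%%:*}"   # drop the ":<port>" suffix
  printf '%s\n' "${fqdn%%.*}"    # keep everything before the first dot
}

short_hostname "monitoring01.home.2rjus.net:9100"
```

With the `hostname` label present on every target, queries no longer need this kind of string handling or a regex on `instance`.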
### Filtering by Role/Tier

Filter hosts by their role or tier:

```promql
up{role="dns"} # All DNS servers (ns1, ns2)
node_cpu_seconds_total{role="build-host"} # Build hosts only (nix-cache01)
up{tier="test"} # All test-tier VMs
up{dns_role="primary"} # Primary DNS only (ns1)
```

Current host labels:
| Host | Labels |
|------|--------|
| ns1 | `role=dns`, `dns_role=primary` |
| ns2 | `role=dns`, `dns_role=secondary` |
| nix-cache01 | `role=build-host` |
| vault01 | `role=vault` |
| testvm01/02/03 | `tier=test` |

---

## Troubleshooting Workflows
@@ -212,11 +251,12 @@ node_load1{instance=~"ns1.*"}

### Investigate Service Issues

1. Check `up{job="<service>"}` for scrape failures
1. Check `up{job="<service>"}` or `up{hostname="<host>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts
6. Use role filters for group issues: `up{role="dns"}` to check all DNS servers

### After Deploying Changes
@@ -246,5 +286,6 @@ With `start: "24h"` to see last 24 hours of upgrades across all hosts.
- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port, use regex matching (`=~`) for hostname-only filters
- Use the `hostname` label to filter metrics by host (simpler than regex on `instance`)
- Host metadata labels (`role`, `tier`, `dns_role`) are propagated to all scrape targets
- Log `MESSAGE` field contains the actual log content in JSON format
118
CLAUDE.md
@@ -61,10 +61,31 @@ Do not run `nix flake update`. Should only be done manually by user.
### Development Environment

```bash
# Enter development shell (provides ansible, python3)
# Enter development shell
nix develop
```

The devshell provides: `ansible`, `tofu` (OpenTofu), `bao` (OpenBao CLI), `create-host`, and `homelab-deploy`.

**Important:** When suggesting commands that use devshell tools, always use `nix develop -c <command>` syntax rather than assuming the user is already in a devshell. For example:
```bash
# Good - works regardless of current shell
nix develop -c tofu plan

# Avoid - requires user to be in devshell
tofu plan
```

**OpenTofu:** Use the `-chdir` option instead of `cd` when running tofu commands in subdirectories:
```bash
# Good - uses -chdir option
nix develop -c tofu -chdir=terraform plan
nix develop -c tofu -chdir=terraform/vault apply

# Avoid - changing directories
cd terraform && tofu plan
```

### Secrets Management

Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
@@ -140,11 +161,27 @@ The **lab-monitoring** MCP server can query logs from Loki. All hosts ship syste

- `host` - Hostname (e.g., `ns1`, `ns2`, `monitoring01`, `ha1`). Use this label, not `hostname`.
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `prometheus.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs like caddy access logs)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path (e.g., `/var/log/caddy/nix-cache.log`)

Journal log entries are JSON-formatted with the actual log message in the `MESSAGE` field. Other useful fields include `PRIORITY` and `SYSLOG_IDENTIFIER`.

**Bootstrap Logs:**

VMs provisioned from template2 send bootstrap progress directly to Loki via curl (before promtail is available). These logs use `job="bootstrap"` with additional labels:

- `host` - Target hostname
- `branch` - Git branch being deployed
- `stage` - Bootstrap stage: `starting`, `network_ok`, `vault_ok`/`vault_skip`/`vault_warn`, `building`, `success`, `failed`

Query bootstrap status:
```
{job="bootstrap"} # All bootstrap logs
{job="bootstrap", host="testvm01"} # Specific host
{job="bootstrap", stage="failed"} # All failures
{job="bootstrap", stage=~"building|success"} # Track build progress
```

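When the MCP server is not at hand, the same selectors can be assembled in a script. The sketch below is an assumption-laden illustration: the helper name `bootstrap_selector` is hypothetical, and the commented `curl` call assumes Loki's standard `query_range` HTTP endpoint is reachable on the same host and port that the bootstrap script pushes to.

```bash
#!/usr/bin/env bash
# Illustration only: build a LogQL stream selector for bootstrap logs.
# Usage: bootstrap_selector [host] [stage]
bootstrap_selector() {
  local host="${1:-}" stage="${2:-}"
  local selector='job="bootstrap"'
  if [ -n "$host" ]; then
    selector="$selector, host=\"$host\""
  fi
  if [ -n "$stage" ]; then
    selector="$selector, stage=\"$stage\""
  fi
  printf '{%s}\n' "$selector"
}

bootstrap_selector testvm01 failed

# Assumed usage against Loki's HTTP API (not executed here):
# curl -s -G "http://monitoring01.home.2rjus.net:3100/loki/api/v1/query_range" \
#   --data-urlencode "query=$(bootstrap_selector testvm01 failed)"
```

The helper only covers exact-match filters; regex matchers such as `stage=~"building|success"` still need to be written by hand.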
**Example LogQL queries:**
```
# Logs from a specific service on a host
@@ -229,6 +266,21 @@ deploy(role="vault", action="switch")

**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.

**Deploying to Prod Hosts:**

The MCP server only deploys to test-tier hosts. For prod hosts, use the CLI directly:

```bash
nix develop -c homelab-deploy -- deploy \
  --nats-url nats://nats1.home.2rjus.net:4222 \
  --nkey-file ~/.config/homelab-deploy/admin-deployer.nkey \
  --branch <branch-name> \
  --action switch \
  deploy.prod.<hostname>
```

Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)

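The subject format can be captured in a small helper so that scripts do not hand-assemble it. This is a sketch under stated assumptions: the function name `deploy_subject` is hypothetical, and the tier check relies only on the two tiers (`test`, `prod`) named in the Homelab Module Options.

```bash
#!/usr/bin/env bash
# Illustration only: build a NATS deploy subject of the form
# deploy.<tier>.<hostname>, rejecting tiers other than test/prod.
deploy_subject() {
  local tier="$1" hostname="$2"
  case "$tier" in
    test|prod) ;;
    *) echo "unknown tier: $tier" >&2; return 1 ;;
  esac
  printf 'deploy.%s.%s\n' "$tier" "$hostname"
}

deploy_subject prod monitoring01
```

The result can be passed as the final argument of the `homelab-deploy -- deploy` command shown above.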
**Verifying Deployments:**

After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
@@ -249,9 +301,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `configuration.nix` - Host-specific settings (networking, hardware, users)
- `/system/` - Shared system-level configurations applied to ALL hosts
  - Core modules: nix.nix, sshd.nix, sops.nix (legacy), vault-secrets.nix, acme.nix, autoupgrade.nix
  - Additional modules: motd.nix (dynamic MOTD), packages.nix (base packages), root-user.nix (root config), homelab-deploy.nix (NATS listener)
  - Monitoring: node-exporter and promtail on every host
- `/modules/` - Custom NixOS modules
  - `homelab/` - Homelab-specific options (DNS automation, monitoring scrape targets)
  - `homelab/` - Homelab-specific options (see "Homelab Module Options" section below)
- `/lib/` - Nix library functions
  - `dns-zone.nix` - DNS zone generation functions
  - `monitoring.nix` - Prometheus scrape target generation functions
@@ -259,6 +312,8 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
  - `home-assistant/` - Home automation stack
  - `monitoring/` - Observability stack (Prometheus, Grafana, Loki, Tempo)
  - `ns/` - DNS services (authoritative, resolver, zone generation)
  - `vault/` - OpenBao (Vault) secrets server
  - `actions-runner/` - GitHub Actions runner
  - `http-proxy/`, `ca/`, `postgres/`, `nats/`, `jellyfin/`, etc.
- `/secrets/` - SOPS-encrypted secrets with age encryption (legacy, only used by ca)
- `/common/` - Shared configurations (e.g., VM guest agent)
@@ -292,25 +347,31 @@ All hosts automatically get:

### Active Hosts

Production servers managed by `rebuild-all.sh`:
Production servers:
- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `ca` - Internal Certificate Authority
- `vault01` - OpenBao (Vault) secrets server
- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
- `http-proxy` - Reverse proxy
- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
- `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server
- `nix-cache01` - Binary cache server + GitHub Actions runner
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server

Template/test hosts:
- `template1` - Base template for cloning new hosts
Test/staging hosts:
- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation

Template hosts:
- `template1`, `template2` - Base templates for cloning new hosts

### Flake Inputs

- `nixpkgs` - NixOS 25.11 stable (primary)
- `nixpkgs-unstable` - Unstable channel (available via overlay as `pkgs.unstable.<package>`)
- `sops-nix` - Secrets management (legacy, only used by ca)
- `nixos-exporter` - NixOS module for exposing flake revision metrics (used to verify deployments)
- `homelab-deploy` - NATS-based remote deployment tool for test-tier hosts
- Custom packages from git.t-juice.club:
  - `alerttonotify` - Alert routing
  - `labmon` - Lab monitoring
@@ -402,9 +463,21 @@ Example VM deployment includes:
- Custom CPU/memory/disk sizing
- VLAN tagging
- QEMU guest agent
- Automatic Vault credential provisioning via `vault_wrapped_token`

OpenTofu outputs the VM's IP address after deployment for easy SSH access.

**Automatic Vault Credential Provisioning:**

VMs can receive Vault (OpenBao) credentials automatically during bootstrap:

1. OpenTofu generates a wrapped token via `terraform/vault/` and stores it in the VM configuration
2. Cloud-init passes `VAULT_WRAPPED_TOKEN` and `NIXOS_FLAKE_BRANCH` to the bootstrap script
3. The bootstrap script unwraps the token to obtain AppRole credentials
4. Credentials are written to `/var/lib/vault/approle/` before the NixOS rebuild

This eliminates the need for manual `provision-approle.yml` playbook runs on new VMs. Bootstrap progress is logged to Loki with `job="bootstrap"` labels.

#### Template Rebuilding and Terraform State

When the Proxmox template is rebuilt (via `build-and-deploy-template.yml`), the template name may change. This would normally cause Terraform to want to recreate all existing VMs, but that's unnecessary since VMs are independent once cloned.
@@ -484,11 +557,7 @@ Prometheus scrape targets are automatically generated from host configurations,
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`

Host monitoring options (`homelab.monitoring.*`):
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host (job_name, port, metrics_path, scheme, scrape_interval, honor_labels)

Service modules declare their scrape targets directly (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts.
Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets` (e.g., `services/ca/default.nix` declares step-ca on port 9000). The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.

To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
@@ -507,13 +576,30 @@ DNS zone entries are automatically generated from host configurations:
- **External hosts**: Non-flake hosts defined in `/services/ns/external-hosts.nix`
- **Serial number**: Uses `self.sourceInfo.lastModified` (git commit timestamp)

Host DNS options (`homelab.dns.*`):
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host

Hosts are automatically excluded from DNS if:
- `homelab.dns.enable = false` (e.g., template hosts)
- No static IP configured (e.g., DHCP-only hosts)
- Network interface is a VPN/tunnel (wg*, tun*, tap*)

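The interface-name rule in the last bullet amounts to a prefix match. The repository implements the exclusion in Nix; the shell sketch below only illustrates the three name patterns listed above, and the function name `is_tunnel_interface` is hypothetical.

```bash
#!/usr/bin/env bash
# Illustration only: succeed when an interface name matches one of the
# VPN/tunnel prefixes that are excluded from DNS generation.
is_tunnel_interface() {
  case "$1" in
    wg*|tun*|tap*) return 0 ;;
    *) return 1 ;;
  esac
}

for iface in ens18 wg0 tun1; do
  if is_tunnel_interface "$iface"; then
    echo "$iface: excluded (VPN/tunnel)"
  else
    echo "$iface: eligible"
  fi
done
```

The interface names in the loop (`ens18`, `wg0`, `tun1`) are examples, not taken from any host configuration.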
To add DNS entries for non-NixOS hosts, edit `/services/ns/external-hosts.nix`.

### Homelab Module Options

The `modules/homelab/` directory defines custom options used across hosts for automation and metadata.

**Host options (`homelab.host.*`):**
- `tier` - Deployment tier: `test` or `prod`. Test-tier hosts can receive remote deployments and have different credential access.
- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
- `labels` - Free-form key-value metadata for host categorization

**DNS options (`homelab.dns.*`):**
- `enable` (default: `true`) - Include host in DNS zone generation
- `cnames` (default: `[]`) - List of CNAME aliases pointing to this host

**Monitoring options (`homelab.monitoring.*`):**
- `enable` (default: `true`) - Include host in Prometheus node-exporter scrape targets
- `scrapeTargets` (default: `[]`) - Additional scrape targets exposed by this host

**Deploy options (`homelab.deploy.*`):**
- `enable` (default: `false`) - Enable NATS-based remote deployment listener. When enabled, the host listens for deployment commands via NATS and can be targeted by the `homelab-deploy` MCP server.
72
docs/plans/cert-monitoring.md
Normal file
@@ -0,0 +1,72 @@
# Certificate Monitoring Plan

## Summary

This document describes the removal of labmon certificate monitoring and outlines future needs for certificate monitoring in the homelab.

## What Was Removed

### labmon Service

The `labmon` service was a custom Go application that provided:

1. **StepMonitor**: Monitoring for step-ca (Smallstep CA) certificate provisioning and health
2. **TLSConnectionMonitor**: Periodic TLS connection checks to verify certificate validity and expiration

The service exposed Prometheus metrics at `:9969` including:
- `labmon_tlsconmon_certificate_seconds_left` - Time until certificate expiration
- `labmon_tlsconmon_certificate_check_error` - Whether the TLS check failed
- `labmon_stepmon_certificate_seconds_left` - Step-CA internal certificate expiration

### Affected Files

- `hosts/monitoring01/configuration.nix` - Removed labmon configuration block
- `services/monitoring/prometheus.nix` - Removed labmon scrape target
- `services/monitoring/rules.yml` - Removed `certificate_rules` alert group
- `services/monitoring/alloy.nix` - Deleted (was only used for labmon profiling)
- `services/monitoring/default.nix` - Removed alloy.nix import

### Removed Alerts

- `certificate_expiring_soon` - Warned when any monitored TLS cert had < 24h validity
- `step_ca_serving_cert_expiring` - Critical alert for step-ca's own serving certificate
- `certificate_check_error` - Warned when TLS connection check failed
- `step_ca_certificate_expiring` - Critical alert for step-ca issued certificates

## Why It Was Removed

1. **step-ca decommissioned**: The primary monitoring target (step-ca) is no longer in use
2. **Outdated codebase**: labmon was a custom tool that required maintenance
3. **Limited value**: With ACME auto-renewal, certificates should renew automatically

## Current State

ACME certificates are now issued by OpenBao PKI at `vault.home.2rjus.net:8200`. The ACME protocol handles automatic renewal, and certificates are typically renewed well before expiration.

## Future Needs

While ACME handles renewal automatically, we should consider monitoring for:

1. **ACME renewal failures**: Alert when a certificate fails to renew
   - Could monitor ACME client logs (via Loki queries)
   - Could check certificate file modification times

2. **Certificate expiration as backup**: Even with auto-renewal, a last-resort alert for certificates approaching expiration would catch renewal failures

3. **Certificate transparency**: Monitor for unexpected certificate issuance

### Potential Solutions

1. **Prometheus blackbox_exporter**: Can probe TLS endpoints and export certificate expiration metrics
   - `probe_ssl_earliest_cert_expiry` metric
   - Already a standard tool, well-maintained

2. **Custom Loki alerting**: Query ACME service logs for renewal failures
   - Works with existing infrastructure
   - No additional services needed

3. **Node-exporter textfile collector**: Script that checks local certificate files and writes expiration metrics

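As a rough idea of what the textfile-collector option could look like, the sketch below reads a certificate's expiry with `openssl` and writes one metric in the Prometheus text format. Nothing here is implemented in the repository: the metric name `cert_expiry_seconds_left`, the function names, and the paths in the usage comment are all hypothetical, and the script assumes `openssl` and GNU `date` are available.

```bash
#!/usr/bin/env bash
# Sketch of a textfile-collector script. Metric name, function names, and
# paths are illustrative assumptions, not taken from the repository.

# Print the number of seconds until the certificate in $1 expires.
cert_seconds_left() {
  local cert="$1" not_after
  # openssl prints "notAfter=<date>"; keep the part after "=".
  not_after="$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)"
  echo $(( $(date -d "$not_after" +%s) - $(date +%s) ))
}

# Format one sample in the Prometheus text exposition format.
format_cert_metric() {
  local cert="$1" seconds="$2"
  printf 'cert_expiry_seconds_left{path="%s"} %s\n' "$cert" "$seconds"
}

# Write the metric via a temp file and rename, so the collector never
# reads a partially written file.
write_cert_metric() {
  local cert="$1" out="$2" tmp
  tmp="$(mktemp "${out}.XXXXXX")"
  format_cert_metric "$cert" "$(cert_seconds_left "$cert")" > "$tmp"
  mv "$tmp" "$out"
}

# Example call (hypothetical paths):
# write_cert_metric /var/lib/acme/example/cert.pem /var/lib/textfile/certs.prom
```

The output file would need to live in the directory that node-exporter's textfile collector is configured to read, and it must end in `.prom` to be picked up.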
## Status

**Not yet implemented.** This document serves as a placeholder for future work on certificate monitoring.
@@ -1,10 +1,38 @@
# Prometheus Scrape Target Labels

## Implementation Status

| Step | Status | Notes |
|------|--------|-------|
| 1. Create `homelab.host` module | ✅ Complete | `modules/homelab/host.nix` |
| 2. Update `lib/monitoring.nix` | ✅ Complete | Labels extracted and propagated |
| 3. Update Prometheus config | ✅ Complete | Uses structured static_configs |
| 4. Set metadata on hosts | ✅ Complete | All relevant hosts configured |
| 5. Update alert rules | ✅ Complete | Role-based filtering implemented |
| 6. Labels for service targets | ✅ Complete | Host labels propagated to all services |
| 7. Add hostname label | ✅ Complete | All targets have `hostname` label for easy filtering |

**Hosts with metadata configured:**
- `ns1`, `ns2`: `role = "dns"`, `labels.dns_role = "primary"/"secondary"`
- `nix-cache01`: `role = "build-host"`
- `vault01`: `role = "vault"`
- `testvm01/02/03`: `tier = "test"`

**Implementation complete.** Branch: `prometheus-scrape-target-labels`

**Query examples:**
- `{hostname="ns1"}` - all metrics from ns1 (any job/port)
- `node_cpu_seconds_total{hostname="monitoring01"}` - specific metric by hostname
- `up{role="dns"}` - all DNS servers
- `up{tier="test"}` - all test-tier hosts

---

## Goal

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.
**Related:** This plan shares the `homelab.host` module with `docs/plans/completed/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.

## Motivation
@@ -54,12 +82,11 @@ or

## Implementation

This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md` which uses the same module for deployment tier assignment.
This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/completed/nats-deploy-service.md` which uses the same module for deployment tier assignment.

### 1. Create `homelab.host` module

**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in
`modules/homelab/host.nix` with tier, priority, role, and labels options.
✅ **Complete.** The module is in `modules/homelab/host.nix`.

Create `modules/homelab/host.nix` with shared host metadata options:
@@ -98,6 +125,8 @@ Import this module in `modules/homelab/default.nix`.

### 2. Update `lib/monitoring.nix`

✅ **Complete.** Labels are now extracted and propagated.

- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:
@@ -126,6 +155,8 @@ This requires grouping hosts by their label attrset and producing one `static_co

### 3. Update `services/monitoring/prometheus.nix`

✅ **Complete.** Now uses structured static_configs output.

Change the node-exporter scrape config to use the new structured output:

```nix
@@ -138,36 +169,37 @@ static_configs = nodeExporterTargets;

### 4. Set metadata on hosts

✅ **Complete.** All relevant hosts have metadata configured. Note: The implementation filters by `role` rather than `priority`, which matches the existing nix-cache01 configuration.

Example in `hosts/nix-cache01/configuration.nix`:

```nix
homelab.host = {
  tier = "test"; # can be deployed by MCP (used by homelab-deploy)
  priority = "low"; # relaxed alerting thresholds
  role = "build-host";
};
```

**Note:** Current implementation only sets `role = "build-host"`. Consider adding `priority = "low"` when label propagation is implemented.

Example in `hosts/ns1/configuration.nix`:

```nix
homelab.host = {
  tier = "prod";
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```

**Note:** `tier` and `priority` use defaults ("prod" and "high"), which is the intended behavior. The current ns1/ns2 configurations match this pattern.

### 5. Update alert rules

After implementing labels, review and update `services/monitoring/rules.yml`:
✅ **Complete.** Updated `services/monitoring/rules.yml`:

- Replace instance-name exclusions with label-based filters (e.g. `{priority!="low"}` instead of `{instance!="nix-cache01.home.2rjus.net:9100"}`).
- Consider whether any other rules should differentiate by priority or role.
- `high_cpu_load`: Replaced `instance!="nix-cache01..."` with `role!="build-host"` for standard hosts (15m duration) and `role="build-host"` for build hosts (2h duration).
- `unbound_low_cache_hit_ratio`: Added `dns_role="primary"` filter to only alert on the primary DNS resolver (secondary has a cold cache).

Specifically, the `high_cpu_load` rule currently has a nix-cache01 exclusion that should be replaced with a `priority`-based filter.
### 6. Labels for `generateScrapeConfigs` (service targets)

### 6. Consider labels for `generateScrapeConfigs` (service targets)

The same label propagation could be applied to service-level scrape targets. This is optional and can be deferred -- service targets are more specialized and less likely to need generic label-based filtering.
✅ **Complete.** Host labels are now propagated to all auto-generated service scrape targets (unbound, homelab-deploy, nixos-exporter, etc.). This enables semantic filtering on any service metric, such as using `dns_role="primary"` with the unbound job.
8
flake.lock
generated
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
        "lastModified": 1770447502,
        "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
        "lastModified": 1770481834,
        "narHash": "sha256-Xx9BYnI0C/qgPbwr9nj6NoAdQTbYLunrdbNSaUww9oY=",
        "ref": "master",
        "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
        "revCount": 22,
        "rev": "fd0d63b103dfaf21d1c27363266590e723021c67",
        "revCount": 24,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
31
flake.nix
@@ -186,15 +186,6 @@
          ./hosts/nats1
        ];
      };
      testvm01 = nixpkgs.lib.nixosSystem {
        inherit system;
        specialArgs = {
          inherit inputs self sops-nix;
        };
        modules = commonModules ++ [
          ./hosts/testvm01
        ];
      };
      vault01 = nixpkgs.lib.nixosSystem {
        inherit system;
        specialArgs = {
@@ -204,13 +195,31 @@
          ./hosts/vault01
        ];
      };
      vaulttest01 = nixpkgs.lib.nixosSystem {
      testvm01 = nixpkgs.lib.nixosSystem {
        inherit system;
        specialArgs = {
          inherit inputs self sops-nix;
        };
        modules = commonModules ++ [
          ./hosts/vaulttest01
          ./hosts/testvm01
        ];
      };
      testvm02 = nixpkgs.lib.nixosSystem {
        inherit system;
        specialArgs = {
          inherit inputs self sops-nix;
        };
        modules = commonModules ++ [
          ./hosts/testvm02
        ];
      };
      testvm03 = nixpkgs.lib.nixosSystem {
        inherit system;
        specialArgs = {
          inherit inputs self sops-nix;
        };
        modules = commonModules ++ [
          ./hosts/testvm03
        ];
      };
    };
@@ -61,6 +61,9 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  vault.enable = true;
  homelab.deploy.enable = true;

  zramSwap = {
    enable = true;
  };
@@ -100,61 +100,6 @@
    ];
  };

  labmon = {
    enable = true;

    settings = {
      ListenAddr = ":9969";
      Profiling = true;
      StepMonitors = [
        {
          Enabled = true;
          BaseURL = "https://ca.home.2rjus.net";
          RootID = "3381bda8015a86b9a3cd1851439d1091890a79005e0f1f7c4301fe4bccc29d80";
        }
      ];

      TLSConnectionMonitors = [
        {
          Enabled = true;
          Address = "ca.home.2rjus.net:443";
          Verify = true;
          Duration = "12h";
        }
        {
          Enabled = true;
          Address = "jelly.home.2rjus.net:443";
          Verify = true;
          Duration = "12h";
        }
        {
          Enabled = true;
          Address = "grafana.home.2rjus.net:443";
          Verify = true;
          Duration = "12h";
        }
        {
          Enabled = true;
          Address = "prometheus.home.2rjus.net:443";
          Verify = true;
          Duration = "12h";
        }
        {
          Enabled = true;
          Address = "alertmanager.home.2rjus.net:443";
          Verify = true;
          Duration = "12h";
        }
        {
          Enabled = true;
          Address = "pyroscope.home.2rjus.net:443";
          Verify = true;
          Duration = "12h";
        }
      ];
    };
  };

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
@@ -59,5 +59,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  vault.enable = true;
  homelab.deploy.enable = true;

  system.stateVersion = "23.11"; # Did you read the comment?
}
@@ -59,5 +59,8 @@
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  vault.enable = true;
  homelab.deploy.enable = true;

  system.stateVersion = "23.11"; # Did you read the comment?
}
@@ -6,22 +6,72 @@ let
    text = ''
      set -euo pipefail

      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"

      # Send a log entry to Loki with bootstrap status
      # Usage: log_to_loki <stage> <message>
      # Fails silently if Loki is unreachable
      log_to_loki() {
        local stage="$1"
        local message="$2"
        local timestamp_ns
        timestamp_ns="$(date +%s)000000000"

        local payload
        payload=$(jq -n \
          --arg host "$HOSTNAME" \
          --arg stage "$stage" \
          --arg branch "''${BRANCH:-master}" \
          --arg ts "$timestamp_ns" \
          --arg msg "$message" \
          '{
            streams: [{
              stream: {
                job: "bootstrap",
                host: $host,
                stage: $stage,
                branch: $branch
              },
              values: [[$ts, $msg]]
            }]
          }')

        curl -s --connect-timeout 2 --max-time 5 \
          -X POST \
          -H "Content-Type: application/json" \
          -d "$payload" \
          "$LOKI_URL" >/dev/null 2>&1 || true
      }

      echo "================================================================================"
      echo " NIXOS BOOTSTRAP IN PROGRESS"
      echo "================================================================================"
      echo ""

      # Read hostname set by cloud-init (from Terraform VM name via user-data)
      # Cloud-init sets the system hostname from user-data.txt, so we read it from hostnamectl
      HOSTNAME=$(hostnamectl hostname)
      echo "DEBUG: Hostname from hostnamectl: '$HOSTNAME'"
      # Read git branch from environment, default to master
      BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"

      echo "Hostname: $HOSTNAME"
      echo ""
      echo "Starting NixOS bootstrap for host: $HOSTNAME"

      log_to_loki "starting" "Bootstrap starting for $HOSTNAME (branch: $BRANCH)"

      echo "Waiting for network connectivity..."

      # Verify we can reach the git server via HTTPS (doesn't respond to ping)
      if ! curl -s --connect-timeout 5 --max-time 10 https://git.t-juice.club >/dev/null 2>&1; then
        echo "ERROR: Cannot reach git.t-juice.club via HTTPS"
        echo "Check network configuration and DNS settings"
        log_to_loki "failed" "Network check failed - cannot reach git.t-juice.club"
        exit 1
      fi

      echo "Network connectivity confirmed"
      log_to_loki "network_ok" "Network connectivity confirmed"

      # Unwrap Vault token and store AppRole credentials (if provided)
      if [ -n "''${VAULT_WRAPPED_TOKEN:-}" ]; then
@@ -50,6 +100,7 @@ let
|
||||
chmod 600 /var/lib/vault/approle/secret-id
|
||||
|
||||
echo "Vault credentials unwrapped and stored successfully"
|
||||
log_to_loki "vault_ok" "Vault credentials unwrapped and stored"
|
||||
else
|
||||
echo "WARNING: Failed to unwrap Vault token"
|
||||
if [ -n "$UNWRAP_RESPONSE" ]; then
|
||||
@@ -63,17 +114,17 @@ let
|
||||
echo "To regenerate token, run: create-host --hostname $HOSTNAME --force"
|
||||
echo ""
|
||||
echo "Vault secrets will not be available, but continuing bootstrap..."
|
||||
log_to_loki "vault_warn" "Failed to unwrap Vault token - continuing without secrets"
|
||||
fi
|
||||
else
|
||||
echo "No Vault wrapped token provided (VAULT_WRAPPED_TOKEN not set)"
|
||||
echo "Skipping Vault credential setup"
|
||||
log_to_loki "vault_skip" "No Vault token provided - skipping credential setup"
|
||||
fi
|
||||
|
||||
echo "Fetching and building NixOS configuration from flake..."
|
||||
|
||||
# Read git branch from environment, default to master
|
||||
BRANCH="''${NIXOS_FLAKE_BRANCH:-master}"
|
||||
echo "Using git branch: $BRANCH"
|
||||
log_to_loki "building" "Starting nixos-rebuild boot"
|
||||
|
||||
# Build and activate the host-specific configuration
|
||||
FLAKE_URL="git+https://git.t-juice.club/torjus/nixos-servers.git?ref=$BRANCH#''${HOSTNAME}"
|
||||
@@ -81,18 +132,30 @@ let
|
||||
if nixos-rebuild boot --flake "$FLAKE_URL"; then
|
||||
echo "Successfully built configuration for $HOSTNAME"
|
||||
echo "Rebooting into new configuration..."
|
||||
log_to_loki "success" "Build successful - rebooting into new configuration"
|
||||
sleep 2
|
||||
systemctl reboot
|
||||
else
|
||||
echo "ERROR: nixos-rebuild failed for $HOSTNAME"
|
||||
echo "Check that flake has configuration for this hostname"
|
||||
echo "Manual intervention required - system will not reboot"
|
||||
log_to_loki "failed" "nixos-rebuild failed - manual intervention required"
|
||||
exit 1
|
||||
fi
|
||||
'';
|
||||
};
|
||||
in
|
||||
{
|
||||
# Custom greeting line to indicate this is a bootstrap image
|
||||
services.getty.greetingLine = lib.mkForce ''
|
||||
================================================================================
|
||||
BOOTSTRAP IMAGE - NixOS \V (\l)
|
||||
================================================================================
|
||||
|
||||
Bootstrap service is running. Logs are displayed on tty1.
|
||||
Check status: journalctl -fu nixos-bootstrap
|
||||
'';
|
||||
|
||||
systemd.services."nixos-bootstrap" = {
|
||||
description = "Bootstrap NixOS configuration from flake on first boot";
|
||||
|
||||
@@ -107,12 +170,12 @@ in
|
||||
serviceConfig = {
|
||||
Type = "oneshot";
|
||||
RemainAfterExit = true;
|
||||
- ExecStart = "${bootstrap-script}/bin/nixos-bootstrap";
+ ExecStart = lib.getExe bootstrap-script;
|
||||
|
||||
# Read environment variables from cloud-init (set by cloud-init write_files)
|
||||
EnvironmentFile = "-/run/cloud-init-env";
|
||||
|
||||
# Logging to journald
|
||||
# Log to journal and console
|
||||
StandardOutput = "journal+console";
|
||||
StandardError = "journal+console";
|
||||
};
|
||||
|
||||
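The `log_to_loki` helper in the bootstrap script builds a one-entry push payload with `jq` and posts it to Loki. As a rough illustration of the payload shape only, here is a minimal Python sketch. The function name `build_loki_payload` and the example host and stage values are assumptions made for this example; the label names and the "seconds padded with nine zeros" timestamp mirror the shell code.

```python
import json
import time


def build_loki_payload(host, stage, message, branch="master", now=None):
    """Build a Loki push body shaped like the jq filter in log_to_loki.

    `now` (seconds since the epoch) is a hypothetical parameter added here
    so the example is deterministic; the shell script uses `date +%s`.
    """
    seconds = int(time.time() if now is None else now)
    # Same approach as the script: whole seconds padded to nanoseconds.
    timestamp_ns = f"{seconds}000000000"
    return {
        "streams": [
            {
                "stream": {
                    "job": "bootstrap",
                    "host": host,
                    "stage": stage,
                    "branch": branch,
                },
                # Loki expects [timestamp-as-string, log line] pairs.
                "values": [[timestamp_ns, message]],
            }
        ]
    }


if __name__ == "__main__":
    payload = build_loki_payload("testvm01", "starting", "Bootstrap starting")
    print(json.dumps(payload, indent=2))
```

The sketch only constructs the body. The script additionally swallows every `curl` failure with `|| true`, so an unreachable Loki never aborts the bootstrap.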
@@ -13,14 +13,17 @@
|
||||
../../common/vm
|
||||
];
|
||||
|
||||
# Test VM - exclude from DNS zone generation
|
||||
homelab.dns.enable = false;
|
||||
|
||||
# Host metadata (adjust as needed)
|
||||
homelab.host = {
|
||||
- tier = "test";
- priority = "low";
+ tier = "test"; # Start in test tier, move to prod after validation
|
||||
};
|
||||
|
||||
# Enable Vault integration
|
||||
vault.enable = true;
|
||||
|
||||
# Enable remote deployment via NATS
|
||||
homelab.deploy.enable = true;
|
||||
|
||||
nixpkgs.config.allowUnfree = true;
|
||||
boot.loader.grub.enable = true;
|
||||
boot.loader.grub.device = "/dev/vda";
|
||||
@@ -29,7 +32,7 @@
|
||||
networking.domain = "home.2rjus.net";
|
||||
networking.useNetworkd = true;
|
||||
networking.useDHCP = false;
|
||||
- services.resolved.enable = false;
+ services.resolved.enable = true;
|
||||
networking.nameservers = [
|
||||
"10.69.13.5"
|
||||
"10.69.13.6"
|
||||
@@ -39,7 +42,7 @@
|
||||
systemd.network.networks."ens18" = {
|
||||
matchConfig.Name = "ens18";
|
||||
address = [
|
||||
- "10.69.13.101/24"
+ "10.69.13.20/24"
|
||||
];
|
||||
routes = [
|
||||
{ Gateway = "10.69.13.1"; }
|
||||
@@ -59,6 +62,39 @@
|
||||
git
|
||||
];
|
||||
|
||||
# Test nginx with ACME certificate from OpenBao PKI
|
||||
services.nginx = {
|
||||
enable = true;
|
||||
virtualHosts."testvm01.home.2rjus.net" = {
|
||||
forceSSL = true;
|
||||
enableACME = true;
|
||||
locations."/" = {
|
||||
root = pkgs.writeTextDir "index.html" ''
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<title>testvm01 - ACME Test</title>
|
||||
<style>
|
||||
body { font-family: monospace; max-width: 600px; margin: 50px auto; padding: 20px; }
|
||||
.joke { background: #f0f0f0; padding: 20px; border-radius: 8px; margin: 20px 0; }
|
||||
.punchline { margin-top: 15px; font-weight: bold; }
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<h1>OpenBao PKI ACME Test</h1>
|
||||
<p>If you're seeing this over HTTPS, the migration worked!</p>
|
||||
<div class="joke">
|
||||
<p>Why do programmers prefer dark mode?</p>
|
||||
<p class="punchline">Because light attracts bugs.</p>
|
||||
</div>
|
||||
<p><small>Certificate issued by: vault.home.2rjus.net</small></p>
|
||||
</body>
|
||||
</html>
|
||||
'';
|
||||
};
|
||||
};
|
||||
};
|
||||
|
||||
# Open ports in the firewall.
|
||||
# networking.firewall.allowedTCPPorts = [ ... ];
|
||||
# networking.firewall.allowedUDPPorts = [ ... ];
|
||||
|
||||
72
hosts/testvm02/configuration.nix
Normal file
@@ -0,0 +1,72 @@
|
||||
{
|
||||
config,
|
||||
lib,
|
||||
pkgs,
|
||||
...
|
||||
}:
|
||||
|
||||
{
|
||||
imports = [
|
||||
../template2/hardware-configuration.nix
|
||||
|
||||
../../system
|
||||
../../common/vm
|
||||
];
|
||||
|
||||
# Host metadata (adjust as needed)
|
||||
homelab.host = {
|
||||
tier = "test"; # Start in test tier, move to prod after validation
|
||||
};
|
||||
|
||||
# Enable Vault integration
|
||||
vault.enable = true;
|
||||
|
||||
# Enable remote deployment via NATS
|
||||
homelab.deploy.enable = true;
|
||||
|
||||
nixpkgs.config.allowUnfree = true;
|
||||
boot.loader.grub.enable = true;
|
||||
boot.loader.grub.device = "/dev/vda";
|
||||
|
||||
networking.hostName = "testvm02";
|
||||
networking.domain = "home.2rjus.net";
|
||||
networking.useNetworkd = true;
|
||||
networking.useDHCP = false;
|
||||
services.resolved.enable = true;
|
||||
networking.nameservers = [
|
||||
"10.69.13.5"
|
||||
"10.69.13.6"
|
||||
];
|
||||
|
||||
systemd.network.enable = true;
|
||||
systemd.network.networks."ens18" = {
|
||||
matchConfig.Name = "ens18";
|
||||
address = [
|
||||
"10.69.13.21/24"
|
||||
];
|
||||
routes = [
|
||||
{ Gateway = "10.69.13.1"; }
|
||||
];
|
||||
linkConfig.RequiredForOnline = "routable";
|
||||
};
|
||||
time.timeZone = "Europe/Oslo";
|
||||
|
||||
nix.settings.experimental-features = [
|
||||
"nix-command"
|
||||
"flakes"
|
||||
];
|
||||
nix.settings.tarball-ttl = 0;
|
||||
environment.systemPackages = with pkgs; [
|
||||
vim
|
||||
wget
|
||||
git
|
||||
];
|
||||
|
||||
# Open ports in the firewall.
|
||||
# networking.firewall.allowedTCPPorts = [ ... ];
|
||||
# networking.firewall.allowedUDPPorts = [ ... ];
|
||||
# Or disable the firewall altogether.
|
||||
networking.firewall.enable = false;
|
||||
|
||||
system.stateVersion = "25.11"; # Did you read the comment?
|
||||
}
|
||||
72
hosts/testvm03/configuration.nix
Normal file
@@ -0,0 +1,72 @@
|
||||
{
|
||||
config,
|
||||
lib,
|
||||
pkgs,
|
||||
...
|
||||
}:
|
||||
|
||||
{
|
||||
imports = [
|
||||
../template2/hardware-configuration.nix
|
||||
|
||||
../../system
|
||||
../../common/vm
|
||||
];
|
||||
|
||||
# Host metadata (adjust as needed)
|
||||
homelab.host = {
|
||||
tier = "test"; # Start in test tier, move to prod after validation
|
||||
};
|
||||
|
||||
# Enable Vault integration
|
||||
vault.enable = true;
|
||||
|
||||
# Enable remote deployment via NATS
|
||||
homelab.deploy.enable = true;
|
||||
|
||||
nixpkgs.config.allowUnfree = true;
|
||||
boot.loader.grub.enable = true;
|
||||
boot.loader.grub.device = "/dev/vda";
|
||||
|
||||
networking.hostName = "testvm03";
|
||||
networking.domain = "home.2rjus.net";
|
||||
networking.useNetworkd = true;
|
||||
networking.useDHCP = false;
|
||||
services.resolved.enable = true;
|
||||
networking.nameservers = [
|
||||
"10.69.13.5"
|
||||
"10.69.13.6"
|
||||
];
|
||||
|
||||
systemd.network.enable = true;
|
||||
systemd.network.networks."ens18" = {
|
||||
matchConfig.Name = "ens18";
|
||||
address = [
|
||||
"10.69.13.22/24"
|
||||
];
|
||||
routes = [
|
||||
{ Gateway = "10.69.13.1"; }
|
||||
];
|
||||
linkConfig.RequiredForOnline = "routable";
|
||||
};
|
||||
time.timeZone = "Europe/Oslo";
|
||||
|
||||
nix.settings.experimental-features = [
|
||||
"nix-command"
|
||||
"flakes"
|
||||
];
|
||||
nix.settings.tarball-ttl = 0;
|
||||
environment.systemPackages = with pkgs; [
|
||||
vim
|
||||
wget
|
||||
git
|
||||
];
|
||||
|
||||
# Open ports in the firewall.
|
||||
# networking.firewall.allowedTCPPorts = [ ... ];
|
||||
# networking.firewall.allowedUDPPorts = [ ... ];
|
||||
# Or disable the firewall altogether.
|
||||
networking.firewall.enable = false;
|
||||
|
||||
system.stateVersion = "25.11"; # Did you read the comment?
|
||||
}
|
||||
5
hosts/testvm03/default.nix
Normal file
@@ -0,0 +1,5 @@
|
||||
{ ... }: {
  imports = [
    ./configuration.nix
  ];
}
|
||||
@@ -62,6 +62,16 @@
|
||||
# Or disable the firewall altogether.
|
||||
networking.firewall.enable = false;
|
||||
|
||||
# Vault fetches secrets from itself (after unseal)
|
||||
vault.enable = true;
|
||||
homelab.deploy.enable = true;
|
||||
|
||||
# Ensure vault-secret services wait for openbao to be unsealed
|
||||
systemd.services.vault-secret-homelab-deploy-nkey = {
|
||||
after = [ "openbao.service" ];
|
||||
wants = [ "openbao.service" ];
|
||||
};
|
||||
|
||||
system.stateVersion = "25.11"; # Did you read the comment?
|
||||
}
|
||||
|
||||
|
||||
@@ -1,135 +0,0 @@
|
||||
{
|
||||
config,
|
||||
lib,
|
||||
pkgs,
|
||||
...
|
||||
}:
|
||||
|
||||
let
|
||||
vault-test-script = pkgs.writeShellApplication {
|
||||
name = "vault-test";
|
||||
text = ''
|
||||
echo "=== Vault Secret Test ==="
|
||||
echo "Secret path: hosts/vaulttest01/test-service"
|
||||
|
||||
if [ -f /run/secrets/test-service/password ]; then
|
||||
echo "✓ Password file exists"
|
||||
echo "Password length: $(wc -c < /run/secrets/test-service/password)"
|
||||
else
|
||||
echo "✗ Password file missing!"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [ -d /var/lib/vault/cache/test-service ]; then
|
||||
echo "✓ Cache directory exists"
|
||||
else
|
||||
echo "✗ Cache directory missing!"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "Test successful!"
|
||||
'';
|
||||
};
|
||||
in
|
||||
{
|
||||
imports = [
|
||||
../template2/hardware-configuration.nix
|
||||
|
||||
../../system
|
||||
../../common/vm
|
||||
];
|
||||
|
||||
homelab.host = {
|
||||
tier = "test";
|
||||
priority = "low";
|
||||
role = "vault";
|
||||
};
|
||||
|
||||
nixpkgs.config.allowUnfree = true;
|
||||
boot.loader.grub.enable = true;
|
||||
boot.loader.grub.device = "/dev/vda";
|
||||
|
||||
networking.hostName = "vaulttest01";
|
||||
networking.domain = "home.2rjus.net";
|
||||
networking.useNetworkd = true;
|
||||
networking.useDHCP = false;
|
||||
services.resolved.enable = true;
|
||||
networking.nameservers = [
|
||||
"10.69.13.5"
|
||||
"10.69.13.6"
|
||||
];
|
||||
|
||||
systemd.network.enable = true;
|
||||
systemd.network.networks."ens18" = {
|
||||
matchConfig.Name = "ens18";
|
||||
address = [
|
||||
"10.69.13.150/24"
|
||||
];
|
||||
routes = [
|
||||
{ Gateway = "10.69.13.1"; }
|
||||
];
|
||||
linkConfig.RequiredForOnline = "routable";
|
||||
};
|
||||
time.timeZone = "Europe/Oslo";
|
||||
|
||||
nix.settings.experimental-features = [
|
||||
"nix-command"
|
||||
"flakes"
|
||||
];
|
||||
nix.settings.tarball-ttl = 0;
|
||||
environment.systemPackages = with pkgs; [
|
||||
vim
|
||||
wget
|
||||
git
|
||||
htop # test deploy verification
|
||||
];
|
||||
|
||||
# Open ports in the firewall.
|
||||
# networking.firewall.allowedTCPPorts = [ ... ];
|
||||
# networking.firewall.allowedUDPPorts = [ ... ];
|
||||
# Or disable the firewall altogether.
|
||||
networking.firewall.enable = false;
|
||||
|
||||
# Testing config
|
||||
# Enable Vault secrets management
|
||||
vault.enable = true;
|
||||
homelab.deploy.enable = true;
|
||||
|
||||
# Define a test secret
|
||||
vault.secrets.test-service = {
|
||||
secretPath = "hosts/vaulttest01/test-service";
|
||||
restartTrigger = true;
|
||||
restartInterval = "daily";
|
||||
services = [ "vault-test" ];
|
||||
};
|
||||
|
||||
# Create a test service that uses the secret
|
||||
systemd.services.vault-test = {
|
||||
description = "Test Vault secret fetching";
|
||||
wantedBy = [ "multi-user.target" ];
|
||||
after = [ "vault-secret-test-service.service" ];
|
||||
|
||||
serviceConfig = {
|
||||
Type = "oneshot";
|
||||
RemainAfterExit = true;
|
||||
|
||||
ExecStart = lib.getExe vault-test-script;
|
||||
|
||||
StandardOutput = "journal+console";
|
||||
};
|
||||
};
|
||||
|
||||
# Test ACME certificate issuance from OpenBao PKI
|
||||
# Override the global ACME server (from system/acme.nix) to use OpenBao instead of step-ca
|
||||
security.acme.defaults.server = lib.mkForce "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";
|
||||
|
||||
# Request a certificate for this host
|
||||
# Using HTTP-01 challenge with standalone listener on port 80
|
||||
security.acme.certs."vaulttest01.home.2rjus.net" = {
|
||||
listenHTTP = ":80";
|
||||
enableDebugLogs = true;
|
||||
};
|
||||
|
||||
system.stateVersion = "25.11"; # Did you read the comment?
|
||||
}
|
||||
|
||||
@@ -21,6 +21,7 @@ let
|
||||
cfg = hostConfig.config;
|
||||
monConfig = (cfg.homelab or { }).monitoring or { enable = true; scrapeTargets = [ ]; };
|
||||
dnsConfig = (cfg.homelab or { }).dns or { enable = true; };
|
||||
hostConfig' = (cfg.homelab or { }).host or { };
|
||||
hostname = cfg.networking.hostName;
|
||||
networks = cfg.systemd.network.networks or { };
|
||||
|
||||
@@ -49,20 +50,73 @@ let
|
||||
inherit hostname;
|
||||
ip = extractIP firstAddress;
|
||||
scrapeTargets = monConfig.scrapeTargets or [ ];
|
||||
# Host metadata for label propagation
|
||||
tier = hostConfig'.tier or "prod";
|
||||
priority = hostConfig'.priority or "high";
|
||||
role = hostConfig'.role or null;
|
||||
labels = hostConfig'.labels or { };
|
||||
};
|
||||
|
||||
  # Build effective labels for a host
  # Always includes hostname; only includes tier/priority/role if non-default
  buildEffectiveLabels = host:
    { hostname = host.hostname; }
    // (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
    // (lib.optionalAttrs (host.role != null) { role = host.role; })
    // host.labels;
|
||||
|
||||
# Generate node-exporter targets from all flake hosts
|
||||
# Returns a list of static_configs entries with labels
|
||||
generateNodeExporterTargets = self: externalTargets:
|
||||
let
|
||||
nixosConfigs = self.nixosConfigurations or { };
|
||||
hostList = lib.filter (x: x != null) (
|
||||
lib.mapAttrsToList extractHostMonitoring nixosConfigs
|
||||
);
|
||||
flakeTargets = map (host: "${host.hostname}.home.2rjus.net:9100") hostList;
|
||||
|
||||
# Extract hostname from a target string like "gunter.home.2rjus.net:9100"
|
||||
extractHostnameFromTarget = target:
|
||||
builtins.head (lib.splitString "." target);
|
||||
|
||||
# Build target entries with labels for each host
|
||||
flakeEntries = map
|
||||
(host: {
|
||||
target = "${host.hostname}.home.2rjus.net:9100";
|
||||
labels = buildEffectiveLabels host;
|
||||
})
|
||||
hostList;
|
||||
|
||||
# External targets get hostname extracted from the target string
|
||||
externalEntries = map
|
||||
(target: {
|
||||
inherit target;
|
||||
labels = { hostname = extractHostnameFromTarget target; };
|
||||
})
|
||||
(externalTargets.nodeExporter or [ ]);
|
||||
|
||||
allEntries = flakeEntries ++ externalEntries;
|
||||
|
||||
# Group entries by their label set for efficient static_configs
|
||||
# Convert labels attrset to a string key for grouping
|
||||
labelKey = entry: builtins.toJSON entry.labels;
|
||||
grouped = lib.groupBy labelKey allEntries;
|
||||
|
||||
# Convert groups to static_configs format
|
||||
# Every flake host now has at least a hostname label
|
||||
staticConfigs = lib.mapAttrsToList
|
||||
(key: entries:
|
||||
let
|
||||
labels = (builtins.head entries).labels;
|
||||
in
|
||||
{ targets = map (e: e.target) entries; labels = labels; }
|
||||
)
|
||||
grouped;
|
||||
in
|
||||
- flakeTargets ++ (externalTargets.nodeExporter or [ ]);
+ staticConfigs;
|
||||
|
||||
# Generate scrape configs from all flake hosts and external targets
|
||||
# Host labels are propagated to service targets for semantic alert filtering
|
||||
generateScrapeConfigs = self: externalTargets:
|
||||
let
|
||||
nixosConfigs = self.nixosConfigurations or { };
|
||||
@@ -70,13 +124,14 @@ let
|
||||
lib.mapAttrsToList extractHostMonitoring nixosConfigs
|
||||
);
|
||||
|
||||
# Collect all scrapeTargets from all hosts, grouped by job_name
|
||||
# Collect all scrapeTargets from all hosts, including host labels
|
||||
allTargets = lib.flatten (map
|
||||
(host:
|
||||
map
|
||||
(target: {
|
||||
inherit (target) job_name port metrics_path scheme scrape_interval honor_labels;
|
||||
hostname = host.hostname;
|
||||
hostLabels = buildEffectiveLabels host;
|
||||
})
|
||||
host.scrapeTargets
|
||||
)
|
||||
@@ -87,22 +142,32 @@ let
|
||||
grouped = lib.groupBy (t: t.job_name) allTargets;
|
||||
|
||||
# Generate a scrape config for each job
|
||||
# Within each job, group targets by their host labels for efficient static_configs
|
||||
flakeScrapeConfigs = lib.mapAttrsToList
|
||||
(jobName: targets:
|
||||
let
|
||||
first = builtins.head targets;
|
||||
targetAddrs = map
|
||||
(t:
|
||||
|
||||
# Group targets within this job by their host labels
|
||||
labelKey = t: builtins.toJSON t.hostLabels;
|
||||
groupedByLabels = lib.groupBy labelKey targets;
|
||||
|
||||
# Every flake host now has at least a hostname label
|
||||
staticConfigs = lib.mapAttrsToList
|
||||
(key: labelTargets:
|
||||
let
|
||||
portStr = toString t.port;
|
||||
labels = (builtins.head labelTargets).hostLabels;
|
||||
targetAddrs = map
|
||||
(t: "${t.hostname}.home.2rjus.net:${toString t.port}")
|
||||
labelTargets;
|
||||
in
|
||||
"${t.hostname}.home.2rjus.net:${portStr}")
|
||||
targets;
|
||||
{ targets = targetAddrs; labels = labels; }
|
||||
)
|
||||
groupedByLabels;
|
||||
|
||||
config = {
|
||||
job_name = jobName;
|
||||
static_configs = [{
|
||||
targets = targetAddrs;
|
||||
}];
|
||||
static_configs = staticConfigs;
|
||||
}
|
||||
// (lib.optionalAttrs (first.metrics_path != "/metrics") {
|
||||
metrics_path = first.metrics_path;
|
||||
|
||||
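The monitoring library change above computes a per-host label set (`buildEffectiveLabels`) and then groups targets that share an identical label set into one `static_configs` entry, using the JSON encoding of the labels as the grouping key. The following Python sketch restates that idea under stated assumptions: hosts are plain dictionaries, and the names `build_effective_labels` and `group_static_configs` are hypothetical stand-ins for the Nix functions, not part of the repository.

```python
import json
from collections import defaultdict


def build_effective_labels(host):
    """Always include hostname; add tier/priority/role only when non-default."""
    labels = {"hostname": host["hostname"]}
    if host.get("tier", "prod") != "prod":
        labels["tier"] = host["tier"]
    if host.get("priority", "high") != "high":
        labels["priority"] = host["priority"]
    if host.get("role") is not None:
        labels["role"] = host["role"]
    # Free-form labels are merged last, so they win on key collisions,
    # matching the right-biased `//` merge in the Nix code.
    labels.update(host.get("labels", {}))
    return labels


def group_static_configs(hosts, port=9100, domain="home.2rjus.net"):
    """Group targets by identical label sets, keyed by their JSON encoding."""
    grouped = defaultdict(list)
    for host in hosts:
        key = json.dumps(build_effective_labels(host), sort_keys=True)
        grouped[key].append(f"{host['hostname']}.{domain}:{port}")
    return [
        {"targets": targets, "labels": json.loads(key)}
        for key, targets in sorted(grouped.items())
    ]


if __name__ == "__main__":
    example_hosts = [
        {"hostname": "ns1", "role": "dns"},
        {"hostname": "testvm01", "tier": "test", "priority": "low"},
    ]
    print(json.dumps(group_static_configs(example_hosts), indent=2))
```

Because `hostname` is always part of the label set, two different hosts never share a group in this sketch. Grouping only merges entries when the same host exposes several targets with the same labels.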
@@ -99,3 +99,48 @@
|
||||
- name: Display success message
|
||||
ansible.builtin.debug:
|
||||
msg: "Template VM {{ template_vmid }} created successfully on {{ storage }}"
|
||||
|
||||
- name: Update Terraform template name
  hosts: localhost
  gather_facts: false

  vars:
    terraform_dir: "{{ playbook_dir }}/../terraform"

  tasks:
    - name: Get image filename from earlier play
      ansible.builtin.set_fact:
        image_filename: "{{ hostvars['localhost']['image_filename'] }}"

    - name: Extract template name from image filename
      ansible.builtin.set_fact:
        new_template_name: "{{ image_filename | regex_replace('\\.vma\\.zst$', '') | regex_replace('^vzdump-qemu-', '') }}"

    - name: Read current Terraform variables file
      ansible.builtin.slurp:
        src: "{{ terraform_dir }}/variables.tf"
      register: variables_tf_content

    - name: Extract current template name from variables.tf
      ansible.builtin.set_fact:
        current_template_name: "{{ (variables_tf_content.content | b64decode) | regex_search('variable \"default_template_name\"[^}]+default\\s*=\\s*\"([^\"]+)\"', '\\1') | first }}"

    - name: Check if template name has changed
      ansible.builtin.set_fact:
        template_name_changed: "{{ current_template_name != new_template_name }}"

    - name: Display template name status
      ansible.builtin.debug:
        msg: "Template name: {{ current_template_name }} -> {{ new_template_name }} ({{ 'changed' if template_name_changed else 'unchanged' }})"

    - name: Update default_template_name in variables.tf
      ansible.builtin.replace:
        path: "{{ terraform_dir }}/variables.tf"
        regexp: '(variable "default_template_name"[^}]+default\s*=\s*)"[^"]+"'
        replace: '\1"{{ new_template_name }}"'
      when: template_name_changed

    - name: Display update result
      ansible.builtin.debug:
        msg: "Updated terraform/variables.tf with new template name: {{ new_template_name }}"
      when: template_name_changed
|
||||
|
||||
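The playbook play above derives a template name from the image filename and rewrites the `default` value of `default_template_name` in `variables.tf` with two regular expressions. To make the regex behaviour easier to check in isolation, here is a small Python sketch that applies the same patterns to an in-memory string. The helper names, the sample filename, and the sample `variables.tf` content are all assumptions for illustration.

```python
import re

# Same pattern as the playbook: capture everything up to the opening quote
# of the default value, then the quoted value itself.
TEMPLATE_RE = re.compile(
    r'(variable "default_template_name"[^}]+default\s*=\s*)"([^"]+)"'
)


def template_name_from_image(image_filename):
    """Strip the .vma.zst suffix and the vzdump-qemu- prefix."""
    name = re.sub(r"\.vma\.zst$", "", image_filename)
    return re.sub(r"^vzdump-qemu-", "", name)


def current_template_name(variables_tf):
    """Return the current default value, or None if the variable is missing."""
    match = TEMPLATE_RE.search(variables_tf)
    return match.group(2) if match else None


def update_template_name(variables_tf, new_name):
    """Replace only the quoted default value, keeping the rest of the block."""
    return TEMPLATE_RE.sub(lambda m: f'{m.group(1)}"{new_name}"', variables_tf)


if __name__ == "__main__":
    sample = (
        'variable "default_template_name" {\n'
        "  type    = string\n"
        '  default = "old-template"\n'
        "}\n"
    )
    new_name = template_name_from_image("vzdump-qemu-example-image.vma.zst")
    print(current_template_name(sample), "->", new_name)
    print(update_template_name(sample, new_name))
```

One design point worth noting: `[^}]+` keeps the match inside a single `variable` block, so the pattern relies on no `}` appearing between the variable name and its `default` line.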
@@ -18,6 +18,8 @@ from manipulators import (
|
||||
remove_from_flake_nix,
|
||||
remove_from_terraform_vms,
|
||||
remove_from_vault_terraform,
|
||||
remove_from_approle_tf,
|
||||
find_host_secrets,
|
||||
check_entries_exist,
|
||||
)
|
||||
from models import HostConfig
|
||||
@@ -255,7 +257,10 @@ def handle_remove(
|
||||
sys.exit(1)
|
||||
|
||||
# Check what entries exist
|
||||
- flake_exists, terraform_exists, vault_exists = check_entries_exist(hostname, repo_root)
+ flake_exists, terraform_exists, vault_exists, approle_exists = check_entries_exist(hostname, repo_root)
|
||||
|
||||
# Check for secrets in secrets.tf
|
||||
host_secrets = find_host_secrets(hostname, repo_root)
|
||||
|
||||
# Collect all files in the host directory recursively
|
||||
files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()])
|
||||
@@ -294,6 +299,21 @@ def handle_remove(
|
||||
else:
|
||||
console.print(f" • terraform/vault/hosts-generated.tf [dim](not found)[/dim]")
|
||||
|
||||
if approle_exists:
|
||||
console.print(f' • terraform/vault/approle.tf (host_policies["{hostname}"])')
|
||||
else:
|
||||
console.print(f" • terraform/vault/approle.tf [dim](not found)[/dim]")
|
||||
|
||||
# Warn about secrets in secrets.tf
|
||||
if host_secrets:
|
||||
console.print(f"\n[yellow]⚠️ Warning: Found {len(host_secrets)} secret(s) in terraform/vault/secrets.tf:[/yellow]")
|
||||
for secret_path in host_secrets:
|
||||
console.print(f' • "{secret_path}"')
|
||||
console.print(f"\n [yellow]These will NOT be removed automatically.[/yellow]")
|
||||
console.print(f" After removal, manually edit secrets.tf and run:")
|
||||
for secret_path in host_secrets:
|
||||
console.print(f" [white]vault kv delete secret/{secret_path}[/white]")
|
||||
|
||||
# Warn about secrets directory
|
||||
if secrets_exist:
|
||||
console.print(f"\n[yellow]⚠️ Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
|
||||
@@ -323,6 +343,13 @@ def handle_remove(
|
||||
else:
|
||||
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/hosts-generated.tf")
|
||||
|
||||
# Remove from terraform/vault/approle.tf
|
||||
if approle_exists:
|
||||
if remove_from_approle_tf(hostname, repo_root):
|
||||
console.print("[green]✓[/green] Removed from terraform/vault/approle.tf")
|
||||
else:
|
||||
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/approle.tf")
|
||||
|
||||
# Remove from terraform/vms.tf
|
||||
if terraform_exists:
|
||||
if remove_from_terraform_vms(hostname, repo_root):
|
||||
@@ -345,19 +372,34 @@ def handle_remove(
|
||||
console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n")
|
||||
|
||||
# Display next steps
|
||||
display_removal_next_steps(hostname, vault_exists)
|
||||
display_removal_next_steps(hostname, vault_exists, approle_exists, host_secrets)
|
||||
|
||||
|
||||
def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
|
||||
def display_removal_next_steps(hostname: str, had_vault: bool, had_approle: bool, host_secrets: list) -> None:
|
||||
"""Display next steps after successful removal."""
|
||||
vault_file = " terraform/vault/hosts-generated.tf" if had_vault else ""
|
||||
vault_apply = ""
|
||||
vault_files = ""
|
||||
if had_vault:
|
||||
vault_files += " terraform/vault/hosts-generated.tf"
|
||||
if had_approle:
|
||||
vault_files += " terraform/vault/approle.tf"
|
||||
|
||||
vault_apply = ""
|
||||
if had_vault or had_approle:
|
||||
vault_apply = f"""
|
||||
3. Apply Vault changes:
|
||||
[white]cd terraform/vault && tofu apply[/white]
|
||||
"""
|
||||
|
||||
secrets_cleanup = ""
|
||||
if host_secrets:
|
||||
secrets_cleanup = f"""
|
||||
5. Clean up secrets (manual):
|
||||
Edit terraform/vault/secrets.tf to remove entries for {hostname}
|
||||
Then delete from Vault:"""
|
||||
for secret_path in host_secrets:
|
||||
secrets_cleanup += f"\n [white]vault kv delete secret/{secret_path}[/white]"
|
||||
secrets_cleanup += "\n"
|
||||
|
||||
next_steps = f"""[bold cyan]Next Steps:[/bold cyan]
|
||||
|
||||
1. Review changes:
|
||||
@@ -367,9 +409,9 @@ def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
|
||||
[white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white]
|
||||
{vault_apply}
|
||||
4. Commit changes:
|
||||
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_file}
|
||||
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_files}
|
||||
git commit -m "hosts: remove {hostname}"[/white]
|
||||
"""
|
||||
{secrets_cleanup}"""
|
||||
console.print(Panel(next_steps, border_style="cyan"))
|
||||
|
||||
|
||||
|
||||
@@ -144,7 +144,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
|
||||
|
||||
backend = vault_auth_backend.approle.path
|
||||
role_name = each.key
|
||||
- token_policies = ["host-\${each.key}"]
+ token_policies = ["host-\${each.key}", "homelab-deploy"]
|
||||
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
|
||||
token_ttl = 3600
|
||||
token_max_ttl = 3600
|
||||
|
||||
@@ -22,12 +22,12 @@ def remove_from_flake_nix(hostname: str, repo_root: Path) -> bool:
|
||||
content = flake_path.read_text()
|
||||
|
||||
# Check if hostname exists
|
||||
hostname_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
|
||||
hostname_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
|
||||
if not re.search(hostname_pattern, content, re.MULTILINE):
|
||||
return False
|
||||
|
||||
# Match the entire block from "hostname = " to "};"
|
||||
replace_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
|
||||
replace_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
|
||||
new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)
|
||||
|
||||
if count == 0:
|
||||
@@ -101,7 +101,68 @@ def remove_from_vault_terraform(hostname: str, repo_root: Path) -> bool:
|
||||
return True
|
||||
|
||||
|
||||
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool]:
|
||||
def remove_from_approle_tf(hostname: str, repo_root: Path) -> bool:
    """
    Remove host entry from terraform/vault/approle.tf locals.host_policies.

    Args:
        hostname: Hostname to remove
        repo_root: Path to repository root

    Returns:
        True if found and removed, False if not found
    """
    approle_path = repo_root / "terraform" / "vault" / "approle.tf"

    if not approle_path.exists():
        return False

    content = approle_path.read_text()

    # Check if hostname exists in host_policies
    hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
    if not re.search(hostname_pattern, content, re.MULTILINE):
        return False

    # Match the entire block from "hostname" = { to closing }
    # The block contains paths = [ ... ] and possibly extra_policies = [...]
    replace_pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
    new_content, count = re.subn(replace_pattern, "\n", content, flags=re.DOTALL)

    if count == 0:
        return False

    approle_path.write_text(new_content)
    return True
|
||||
|
||||
|
||||
def find_host_secrets(hostname: str, repo_root: Path) -> list:
    """
    Find secrets in terraform/vault/secrets.tf that belong to a host.

    Args:
        hostname: Hostname to search for
        repo_root: Path to repository root

    Returns:
        List of secret paths found (e.g., ["hosts/hostname/test-service"])
    """
    secrets_path = repo_root / "terraform" / "vault" / "secrets.tf"

    if not secrets_path.exists():
        return []

    content = secrets_path.read_text()

    # Find all secret paths matching hosts/{hostname}/
    pattern = rf'"(hosts/{re.escape(hostname)}/[^"]+)"'
    matches = re.findall(pattern, content)

    # Return unique paths, preserving order
    return list(dict.fromkeys(matches))
|
||||
|
||||
|
||||
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool, bool]:
|
||||
"""
|
||||
Check which entries exist for a hostname.
|
||||
|
||||
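The two new helpers in this file, `remove_from_approle_tf` and `find_host_secrets`, both operate on Terraform files with regular expressions. The sketch below applies the same patterns to in-memory strings so their behaviour can be checked without touching the repository. The function names `remove_host_policy` and `host_secret_paths` and the sample HCL snippets are assumptions made for this example; the regexes are copied from the diff.

```python
import re


def remove_host_policy(content, hostname):
    """Drop the `"<hostname>" = { ... }` block; return (new_content, removed)."""
    pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
    new_content, count = re.subn(pattern, "\n", content, flags=re.DOTALL)
    return new_content, count > 0


def host_secret_paths(content, hostname):
    """Return unique quoted paths under hosts/<hostname>/, in first-seen order."""
    pattern = rf'"(hosts/{re.escape(hostname)}/[^"]+)"'
    # dict.fromkeys de-duplicates while preserving insertion order.
    return list(dict.fromkeys(re.findall(pattern, content)))


if __name__ == "__main__":
    # Hypothetical approle.tf fragment, for illustration only.
    approle_tf = (
        "locals {\n"
        "  host_policies = {\n"
        '    "ns1" = {\n'
        '      paths = ["secret/data/hosts/ns1/*"]\n'
        "    }\n"
        '    "testvm01" = {\n'
        '      paths = ["secret/data/hosts/testvm01/*"]\n'
        "    }\n"
        "  }\n"
        "}\n"
    )
    updated, removed = remove_host_policy(approle_tf, "testvm01")
    print(removed)
    print(updated)
```

A caveat that follows from the pattern itself: `[^}]*` stops at the first closing brace, so a host block that contains a nested `{ ... }` would be only partially matched. The sketch, like the original, assumes flat blocks.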
@@ -110,12 +171,12 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
|
||||
repo_root: Path to repository root
|
||||
|
||||
Returns:
|
||||
Tuple of (flake_exists, terraform_vms_exists, vault_exists)
|
||||
Tuple of (flake_exists, terraform_vms_exists, vault_generated_exists, approle_exists)
|
||||
"""
|
||||
# Check flake.nix
|
||||
flake_path = repo_root / "flake.nix"
|
||||
flake_content = flake_path.read_text()
|
||||
flake_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
|
||||
flake_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
|
||||
flake_exists = bool(re.search(flake_pattern, flake_content, re.MULTILINE))
|
||||
|
||||
# Check terraform/vms.tf
|
||||
@@ -131,7 +192,15 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
|
||||
vault_content = vault_tf_path.read_text()
|
||||
vault_exists = f'"{hostname}"' in vault_content
|
||||
|
||||
return (flake_exists, terraform_exists, vault_exists)
|
||||
# Check terraform/vault/approle.tf
|
||||
approle_path = repo_root / "terraform" / "vault" / "approle.tf"
|
||||
approle_exists = False
|
||||
if approle_path.exists():
|
||||
approle_content = approle_path.read_text()
|
||||
approle_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
|
||||
approle_exists = bool(re.search(approle_pattern, approle_content, re.MULTILINE))
|
||||
|
||||
return (flake_exists, terraform_exists, vault_exists, approle_exists)
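The approle check above uses a line-anchored regular expression, while the generated-policy check uses a plain substring test. A minimal sketch of how the two can differ, assuming a made-up file body (the `has_map_entry` and `has_quoted_name` helpers and the `ghost01` host are hypothetical names for this example only):

```python
import re


def has_map_entry(hostname: str, content: str) -> bool:
    """Hypothetical helper: same pattern as the approle.tf check."""
    # In an rf-string, \{{ produces the regex \{ (a literal opening brace)
    pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
    return bool(re.search(pattern, content, re.MULTILINE))


def has_quoted_name(hostname: str, content: str) -> bool:
    """Hypothetical helper: same idea as the substring check."""
    return f'"{hostname}"' in content


# Made-up content: one real entry and one commented-out entry
sample = '''locals {
  host_policies = {
    # "ghost01" = { was removed
    "vault01" = {
      paths = [
        "secret/data/hosts/vault01/*",
      ]
    }
  }
}
'''

print(has_map_entry("vault01", sample))    # True
print(has_map_entry("ghost01", sample))    # False
print(has_quoted_name("ghost01", sample))  # True
```

The anchored pattern only matches when the quoted hostname is the first token on an indented line and is followed by ` = {`, so a commented-out or incidental mention is ignored. The substring test reports any quoted occurrence.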
|
||||
|
||||
|
||||
def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None:
|
||||
@@ -147,32 +216,25 @@ def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -
|
||||
content = flake_path.read_text()
|
||||
|
||||
# Create new entry
|
||||
new_entry = f""" {config.hostname} = nixpkgs.lib.nixosSystem {{
|
||||
inherit system;
|
||||
specialArgs = {{
|
||||
inherit inputs self sops-nix;
|
||||
new_entry = f""" {config.hostname} = nixpkgs.lib.nixosSystem {{
|
||||
inherit system;
|
||||
specialArgs = {{
|
||||
inherit inputs self sops-nix;
|
||||
}};
|
||||
modules = commonModules ++ [
|
||||
./hosts/{config.hostname}
|
||||
];
|
||||
}};
|
||||
modules = [
|
||||
(
|
||||
{{ config, pkgs, ... }}:
|
||||
{{
|
||||
nixpkgs.overlays = commonOverlays;
|
||||
}}
|
||||
)
|
||||
./hosts/{config.hostname}
|
||||
sops-nix.nixosModules.sops
|
||||
];
|
||||
}};
|
||||
"""
|
||||
|
||||
# Check if hostname already exists
|
||||
hostname_pattern = rf"^ {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem"
|
||||
hostname_pattern = rf"^ {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem"
|
||||
existing_match = re.search(hostname_pattern, content, re.MULTILINE)
|
||||
|
||||
if existing_match and force:
|
||||
# Replace existing entry
|
||||
# Match the entire block from "hostname = " to "};"
|
||||
replace_pattern = rf"^ {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
|
||||
replace_pattern = rf"^ {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
|
||||
new_content, count = re.subn(replace_pattern, new_entry, content, flags=re.MULTILINE | re.DOTALL)
|
||||
|
||||
if count == 0:
|
||||
|
||||
@@ -18,6 +18,12 @@
|
||||
tier = "test"; # Start in test tier, move to prod after validation
|
||||
};
|
||||
|
||||
# Enable Vault integration
|
||||
vault.enable = true;
|
||||
|
||||
# Enable remote deployment via NATS
|
||||
homelab.deploy.enable = true;
|
||||
|
||||
nixpkgs.config.allowUnfree = true;
|
||||
boot.loader.grub.enable = true;
|
||||
boot.loader.grub.device = "/dev/vda";
|
||||
|
||||
@@ -5,7 +5,7 @@
|
||||
package = pkgs.unstable.caddy;
|
||||
configFile = pkgs.writeText "Caddyfile" ''
|
||||
{
|
||||
acme_ca https://ca.home.2rjus.net/acme/acme/directory
|
||||
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
|
||||
|
||||
metrics {
|
||||
per_host
|
||||
|
||||
@@ -1,41 +0,0 @@
|
||||
{ ... }:
|
||||
{
|
||||
services.alloy = {
|
||||
enable = true;
|
||||
};
|
||||
|
||||
environment.etc."alloy/config.alloy" = {
|
||||
enable = true;
|
||||
mode = "0644";
|
||||
text = ''
|
||||
pyroscope.write "local_pyroscope" {
|
||||
endpoint {
|
||||
url = "http://localhost:4040"
|
||||
}
|
||||
}
|
||||
|
||||
pyroscope.scrape "labmon" {
|
||||
targets = [{"__address__" = "localhost:9969", "service_name" = "labmon"}]
|
||||
forward_to = [pyroscope.write.local_pyroscope.receiver]
|
||||
|
||||
profiling_config {
|
||||
profile.process_cpu {
|
||||
enabled = true
|
||||
}
|
||||
profile.memory {
|
||||
enabled = true
|
||||
}
|
||||
profile.mutex {
|
||||
enabled = true
|
||||
}
|
||||
profile.block {
|
||||
enabled = true
|
||||
}
|
||||
profile.goroutine {
|
||||
enabled = true
|
||||
}
|
||||
}
|
||||
}
|
||||
'';
|
||||
};
|
||||
}
|
||||
@@ -7,7 +7,6 @@
|
||||
./pve.nix
|
||||
./alerttonotify.nix
|
||||
./pyroscope.nix
|
||||
./alloy.nix
|
||||
./tempo.nix
|
||||
];
|
||||
}
|
||||
|
||||
@@ -121,22 +121,20 @@ in
|
||||
|
||||
scrapeConfigs = [
|
||||
# Auto-generated node-exporter targets from flake hosts + external
|
||||
# Each static_config entry may have labels from homelab.host metadata
|
||||
{
|
||||
job_name = "node-exporter";
|
||||
static_configs = [
|
||||
{
|
||||
targets = nodeExporterTargets;
|
||||
}
|
||||
];
|
||||
static_configs = nodeExporterTargets;
|
||||
}
|
||||
# Systemd exporter on all hosts (same targets, different port)
|
||||
# Preserves the same label grouping as node-exporter
|
||||
{
|
||||
job_name = "systemd-exporter";
|
||||
static_configs = [
|
||||
{
|
||||
targets = map (t: builtins.replaceStrings [":9100"] [":9558"] t) nodeExporterTargets;
|
||||
}
|
||||
];
|
||||
static_configs = map
|
||||
(cfg: cfg // {
|
||||
targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
|
||||
})
|
||||
nodeExporterTargets;
|
||||
}
|
||||
# Local monitoring services (not auto-generated)
|
||||
{
|
||||
@@ -180,14 +178,6 @@ in
|
||||
}
|
||||
];
|
||||
}
|
||||
{
|
||||
job_name = "labmon";
|
||||
static_configs = [
|
||||
{
|
||||
targets = [ "monitoring01.home.2rjus.net:9969" ];
|
||||
}
|
||||
];
|
||||
}
|
||||
# TODO: nix-cache_caddy can't be auto-generated because the cert is issued
|
||||
# for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
|
||||
# Consider adding a target override to homelab.monitoring.scrapeTargets.
|
||||
|
||||
@@ -17,8 +17,9 @@ groups:
|
||||
annotations:
|
||||
summary: "Disk space low on {{ $labels.instance }}"
|
||||
description: "Disk space is low on {{ $labels.instance }}. Please check."
|
||||
# Build hosts (e.g., nix-cache01) are expected to have high CPU during builds
|
||||
- alert: high_cpu_load
|
||||
expr: max(node_load5{instance!="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance!="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
|
||||
expr: max(node_load5{role!="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role!="build-host", mode="idle"}) * 0.7)
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
@@ -26,7 +27,7 @@ groups:
|
||||
summary: "High CPU load on {{ $labels.instance }}"
|
||||
description: "CPU load is high on {{ $labels.instance }}. Please check."
|
||||
- alert: high_cpu_load
|
||||
expr: max(node_load5{instance="nix-cache01.home.2rjus.net:9100"}) by (instance) > (count by (instance)(node_cpu_seconds_total{instance="nix-cache01.home.2rjus.net:9100", mode="idle"}) * 0.7)
|
||||
expr: max(node_load5{role="build-host"}) by (instance) > (count by (instance)(node_cpu_seconds_total{role="build-host", mode="idle"}) * 0.7)
|
||||
for: 2h
|
||||
labels:
|
||||
severity: warning
|
||||
@@ -115,8 +116,9 @@ groups:
|
||||
annotations:
|
||||
summary: "NSD not running on {{ $labels.instance }}"
|
||||
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
|
||||
# Only alert on primary DNS (secondary has cold cache after failover)
|
||||
- alert: unbound_low_cache_hit_ratio
|
||||
expr: (rate(unbound_cache_hits_total[5m]) / (rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))) < 0.5
|
||||
expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
@@ -336,40 +338,6 @@ groups:
|
||||
annotations:
|
||||
summary: "Pyroscope service not running on {{ $labels.instance }}"
|
||||
description: "Pyroscope service not running on {{ $labels.instance }}"
|
||||
- name: certificate_rules
|
||||
rules:
|
||||
- alert: certificate_expiring_soon
|
||||
expr: labmon_tlsconmon_certificate_seconds_left{address!="ca.home.2rjus.net:443"} < 86400
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "TLS certificate expiring soon for {{ $labels.instance }}"
|
||||
description: "TLS certificate for {{ $labels.address }} is expiring within 24 hours."
|
||||
- alert: step_ca_serving_cert_expiring
|
||||
expr: labmon_tlsconmon_certificate_seconds_left{address="ca.home.2rjus.net:443"} < 3600
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Step-CA serving certificate expiring"
|
||||
description: "The step-ca serving certificate (24h auto-renewed) has less than 1 hour of validity left. Renewal may have failed."
|
||||
- alert: certificate_check_error
|
||||
expr: labmon_tlsconmon_certificate_check_error == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Error checking certificate for {{ $labels.address }}"
|
||||
description: "Certificate check is failing for {{ $labels.address }} on {{ $labels.instance }}."
|
||||
- alert: step_ca_certificate_expiring
|
||||
expr: labmon_stepmon_certificate_seconds_left < 3600
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Step-CA certificate expiring for {{ $labels.instance }}"
|
||||
description: "Step-CA certificate is expiring within 1 hour on {{ $labels.instance }}."
|
||||
- name: proxmox_rules
|
||||
rules:
|
||||
- alert: pve_node_down
|
||||
|
||||
@@ -5,7 +5,7 @@
|
||||
package = pkgs.unstable.caddy;
|
||||
configFile = pkgs.writeText "Caddyfile" ''
|
||||
{
|
||||
acme_ca https://ca.home.2rjus.net/acme/acme/directory
|
||||
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
|
||||
metrics
|
||||
}
|
||||
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
security.acme = {
|
||||
acceptTerms = true;
|
||||
defaults = {
|
||||
server = "https://ca.home.2rjus.net/acme/acme/directory";
|
||||
server = "https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory";
|
||||
email = "root@home.2rjus.net";
|
||||
dnsPropagationCheck = false;
|
||||
};
|
||||
|
||||
@@ -33,7 +33,7 @@ variable "default_target_node" {
|
||||
variable "default_template_name" {
|
||||
description = "Default template VM name to clone from"
|
||||
type = string
|
||||
default = "nixos-25.11.20260131.41e216c"
|
||||
default = "nixos-25.11.20260203.e576e3c"
|
||||
}
|
||||
|
||||
variable "default_ssh_public_key" {
|
||||
|
||||
@@ -101,11 +101,13 @@ locals {
|
||||
]
|
||||
}
|
||||
|
||||
"vaulttest01" = {
|
||||
# vault01: Vault server itself (fetches secrets from itself)
|
||||
"vault01" = {
|
||||
paths = [
|
||||
"secret/data/hosts/vaulttest01/*",
|
||||
"secret/data/hosts/vault01/*",
|
||||
]
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -5,12 +5,22 @@
|
||||
# Each host gets access to its own secrets under hosts/<hostname>/*
|
||||
locals {
|
||||
generated_host_policies = {
|
||||
"vaulttest01" = {
|
||||
"testvm01" = {
|
||||
paths = [
|
||||
"secret/data/hosts/vaulttest01/*",
|
||||
"secret/data/hosts/testvm01/*",
|
||||
]
|
||||
}
|
||||
|
||||
"testvm02" = {
|
||||
paths = [
|
||||
"secret/data/hosts/testvm02/*",
|
||||
]
|
||||
}
|
||||
"testvm03" = {
|
||||
paths = [
|
||||
"secret/data/hosts/testvm03/*",
|
||||
]
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
# Placeholder secrets - user should add actual secrets manually or via tofu
|
||||
@@ -40,7 +50,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
|
||||
|
||||
backend = vault_auth_backend.approle.path
|
||||
role_name = each.key
|
||||
token_policies = ["host-${each.key}"]
|
||||
token_policies = ["host-${each.key}", "homelab-deploy"]
|
||||
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
|
||||
token_ttl = 3600
|
||||
token_max_ttl = 3600
|
||||
|
||||
@@ -45,12 +45,6 @@ locals {
|
||||
password_length = 24
|
||||
}
|
||||
|
||||
# TODO: Remove after testing
|
||||
"hosts/vaulttest01/test-service" = {
|
||||
auto_generate = true
|
||||
password_length = 32
|
||||
}
|
||||
|
||||
# Shared backup password (auto-generated, add alongside existing restic key)
|
||||
"shared/backup/password" = {
|
||||
auto_generate = true
|
||||
|
||||
@@ -31,13 +31,6 @@ locals {
|
||||
# Example Minimal VM using all defaults (uncomment to deploy):
|
||||
# "minimal-vm" = {}
|
||||
# "bootstrap-verify-test" = {}
|
||||
"testvm01" = {
|
||||
ip = "10.69.13.101/24"
|
||||
cpu_cores = 2
|
||||
memory = 2048
|
||||
disk_size = "20G"
|
||||
flake_branch = "pipeline-testing-improvements"
|
||||
}
|
||||
"vault01" = {
|
||||
ip = "10.69.13.19/24"
|
||||
cpu_cores = 2
|
||||
@@ -45,13 +38,25 @@ locals {
|
||||
disk_size = "20G"
|
||||
flake_branch = "vault-setup" # Bootstrap from this branch instead of master
|
||||
}
|
||||
"vaulttest01" = {
|
||||
ip = "10.69.13.150/24"
|
||||
"testvm01" = {
|
||||
ip = "10.69.13.20/24"
|
||||
cpu_cores = 2
|
||||
memory = 2048
|
||||
disk_size = "20G"
|
||||
flake_branch = "pki-migration"
|
||||
vault_wrapped_token = "s.UCpQCOp7cOKDdtGGBvfRWwAt"
|
||||
flake_branch = "improve-bootstrap-visibility"
|
||||
vault_wrapped_token = "s.l5q88wzXfEcr5SMDHmO6o96b"
|
||||
}
|
||||
"testvm02" = {
|
||||
ip = "10.69.13.21/24"
|
||||
cpu_cores = 2
|
||||
memory = 2048
|
||||
disk_size = "20G"
|
||||
}
|
||||
"testvm03" = {
|
||||
ip = "10.69.13.22/24"
|
||||
cpu_cores = 2
|
||||
memory = 2048
|
||||
disk_size = "20G"
|
||||
}
|
||||
}
|
||||
|
||||
|
||||