Compare commits

32 Commits: d3de2a1511...deploy-tes

38348c5980, 370cf2b03a, 7bc465b414, 8d7bc50108, 03e70ac094, 3b32c9479f,
b0d35f9a99, 26ca6817f0, b03a9b3b64, f805b9f629, f3adf7e77f, f6eca9decc,
6e93b8eae3, c214f8543c, 7933127d77, 13c3897e86, 0643f23281, ad8570f8db,
2f195d26d3, a926d34287, be2421746e, 12bf0683f5, e8a43c6715, eef52bb8c5,
c6cdbc6799, 4d724329a6, 881e70df27, b9a269d280, fcf1a66103, 2034004280,
af43f88394, a834497fe8
250
.claude/skills/observability/SKILL.md
Normal file
@@ -0,0 +1,250 @@
---
name: observability
description: Reference guide for exploring Prometheus metrics and Loki logs when troubleshooting homelab issues. Use when investigating system state, deployments, service health, or searching logs.
---

# Observability Troubleshooting Guide

Quick reference for exploring Prometheus metrics and Loki logs to troubleshoot homelab issues.

## Available Tools

Use the `lab-monitoring` MCP server tools:

**Metrics:**
- `search_metrics` - Find metrics by name substring
- `get_metric_metadata` - Get type/help for a specific metric
- `query` - Execute PromQL queries
- `list_targets` - Check scrape target health
- `list_alerts` / `get_alert` - View active alerts

**Logs:**
- `query_logs` - Execute LogQL queries against Loki
- `list_labels` - List available log labels
- `list_label_values` - List values for a specific label

---

## Logs Reference

### Label Reference

Available labels for log queries:
- `host` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`)
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs) or `varlog` (file-based logs)
- `filename` - For the `varlog` job, the log file path
- `hostname` - Alternative to `host` for some streams

### Log Format

Journal logs are JSON-formatted. Key fields:
- `MESSAGE` - The actual log message
- `PRIORITY` - Syslog priority (6=info, 4=warning, 3=error)
- `SYSLOG_IDENTIFIER` - Program name

### Basic LogQL Queries

**Logs from a specific service on a host:**
```logql
{host="ns1", systemd_unit="nsd.service"}
```

**All logs from a host:**
```logql
{host="monitoring01"}
```

**Logs from a service across all hosts:**
```logql
{systemd_unit="nixos-upgrade.service"}
```

**Substring matching (case-sensitive):**
```logql
{host="ha1"} |= "error"
```

**Exclude pattern:**
```logql
{host="ns1"} != "routine"
```

**Regex matching:**
```logql
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```

**File-based logs (Caddy access logs, etc.):**
```logql
{job="varlog", hostname="nix-cache01"}
{job="varlog", filename="/var/log/caddy/nix-cache.log"}
```

### Time Ranges

Default lookback is 1 hour. Use the `start` parameter for older logs:
- `start: "1h"` - Last hour (default)
- `start: "24h"` - Last 24 hours
- `start: "168h"` - Last 7 days

### Common Services

Useful systemd units for troubleshooting:
- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
- `step-ca.service` - Internal CA
- `openbao.service` - Secrets management
- `sshd.service` - SSH daemon
- `nix-gc.service` - Nix garbage collection

### Extracting JSON Fields

Parse JSON and filter on fields:
```logql
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
```

---

## Metrics Reference

### Deployment & Version Status

Check which NixOS revision hosts are running:

```promql
nixos_flake_info
```

Labels:
- `current_rev` - Git commit of the running NixOS configuration
- `remote_rev` - Latest commit on the remote repository
- `nixpkgs_rev` - Nixpkgs revision used to build the system
- `nixos_version` - Full NixOS version string (e.g., `25.11.20260203.e576e3c`)

Check if hosts are behind on updates:

```promql
nixos_flake_revision_behind == 1
```

View flake input versions:

```promql
nixos_flake_input_info
```

Labels: `input` (name), `rev` (revision), `type` (git/github)

Check flake input age:

```promql
nixos_flake_input_age_seconds / 86400
```

Returns age in days for each flake input.

### System Health

Basic host availability:

```promql
up{job="node-exporter"}
```

CPU usage by host:

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Memory usage:

```promql
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Disk space (root filesystem):

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
```
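
The same ratio expression works as an ad-hoc threshold check; for example, this sketch returns only hosts with less than 10% free space on the root filesystem:

```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
```

The `< 0.10` comparison filters the instant vector, so an empty result means every host is above the threshold.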

### Service-Specific Metrics

Common job names:
- `node-exporter` - System metrics (all hosts)
- `nixos-exporter` - NixOS version/generation metrics
- `caddy` - Reverse proxy metrics
- `prometheus` / `loki` / `grafana` - Monitoring stack
- `home-assistant` - Home automation
- `step-ca` - Internal CA

### Instance Label Format

The `instance` label uses FQDN format:

```
<hostname>.home.2rjus.net:<port>
```

Example queries filtering by host:

```promql
up{instance=~"monitoring01.*"}
node_load1{instance=~"ns1.*"}
```

---

## Troubleshooting Workflows

### Check Deployment Status Across Fleet

1. Query `nixos_flake_info` to see all hosts' current revisions
2. Check `nixos_flake_revision_behind` for hosts needing updates
3. Look at upgrade logs: `{systemd_unit="nixos-upgrade.service"}` with `start: "24h"`
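
Steps 1-2 can also be answered with a single aggregation; for example, this sketch counts hosts per running revision, where more than one result bucket means the fleet is out of sync:

```promql
count by (current_rev) (nixos_flake_info)
```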

### Investigate Service Issues

1. Check `up{job="<service>"}` for scrape failures
2. Use `list_targets` to see target health details
3. Query service logs: `{host="<host>", systemd_unit="<service>.service"}`
4. Search for errors: `{host="<host>"} |= "error"`
5. Check `list_alerts` for related alerts

### After Deploying Changes

1. Verify `current_rev` updated in `nixos_flake_info`
2. Confirm `nixos_flake_revision_behind == 0`
3. Check service logs for startup issues
4. Check service metrics are being scraped

### Debug SSH/Access Issues

```logql
{host="<host>", systemd_unit="sshd.service"}
```

### Check Recent Upgrades

```logql
{systemd_unit="nixos-upgrade.service"}
```

Use `start: "24h"` to see the last 24 hours of upgrades across all hosts.

---

## Notes

- Default scrape interval is 15s for most metrics targets
- Default log lookback is 1h - use the `start` parameter for older logs
- Use `rate()` for counter metrics, direct queries for gauges
- The `instance` label includes the port; use regex matching (`=~`) for hostname-only filters
- Log `MESSAGE` field contains the actual log content in JSON format
1
.gitignore
vendored
@@ -1,5 +1,6 @@
.direnv/
result
result-*

# Terraform/OpenTofu
terraform/.terraform/
11
.mcp.json
@@ -22,6 +22,17 @@
          "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
          "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
        }
      },
      "homelab-deploy": {
        "command": "nix",
        "args": [
          "run",
          "git+https://git.t-juice.club/torjus/homelab-deploy",
          "--",
          "mcp",
          "--nats-url", "nats://nats1.home.2rjus.net:4222",
          "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
        ]
      }
    }
  }
45
CLAUDE.md
@@ -194,6 +194,51 @@ node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
node_filesystem_avail_bytes{mountpoint="/"}
```

### Deploying to Test Hosts

The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.

**Available Tools:**

- `deploy` - Deploy NixOS configuration to test-tier hosts
- `list_hosts` - List available deployment targets

**Deploy Parameters:**

- `hostname` - Target a specific host (e.g., `vaulttest01`)
- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
- `all` - Deploy to all test-tier hosts
- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
- `branch` - Git branch or commit to deploy (default: `master`)

**Examples:**

```
# List available hosts
list_hosts()

# Deploy to a specific host
deploy(hostname="vaulttest01", action="switch")

# Dry-run deployment
deploy(hostname="vaulttest01", action="dry-activate")

# Deploy to all hosts with a role
deploy(role="vault", action="switch")
```

**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.

**Verifying Deployments:**

After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:

```promql
nixos_flake_info{instance=~"vaulttest01.*"}
```

The `current_rev` label contains the git commit hash of the deployed flake configuration.

## Architecture

### Directory Structure
122
docs/plans/long-term-metrics-storage.md
Normal file
@@ -0,0 +1,122 @@
# Long-Term Metrics Storage Options

## Problem Statement

Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.

Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.

## Current Configuration

Location: `services/monitoring/prometheus.nix`

- **Retention**: 30 days
- **Scrape interval**: 15s
- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
- **Storage**: Local disk on monitoring01

## Options Evaluated

### Option 1: VictoriaMetrics

VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).

**NixOS Options Available:**
- `services.victoriametrics.enable`
- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
- `services.vmagent` - dedicated scraping agent
- `services.vmalert` - alerting rules evaluation

**Pros:**
- Simple migration - single service replacement
- Same PromQL query language - Grafana dashboards work unchanged
- Same scrape config format - existing auto-generated configs work as-is
- 5-10x better compression means 30 days of Prometheus data could become 180+ days
- Lightweight, single binary

**Cons:**
- No automatic downsampling (relies on compression alone)
- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
- Would need to migrate existing data or start fresh

**Migration Steps:**
1. Replace `services.prometheus` with `services.victoriametrics`
2. Move scrape configs to `prometheusConfig`
3. Set up `services.vmalert` for alerting rules
4. Update Grafana datasource to VictoriaMetrics port (8428)
5. Keep Alertmanager for notification routing
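
A minimal sketch of steps 1-2, assuming the `services.victoriametrics` options listed above (the scrape config shown is a placeholder for the existing auto-generated configs):

```nix
services.victoriametrics = {
  enable = true;
  retentionPeriod = "6m"; # six months, vs. the current Prometheus retentionTime = "30d"
  prometheusConfig = {
    # Prometheus-format scrape configs move here unchanged.
    scrape_configs = [
      {
        job_name = "node-exporter";
        static_configs = [ { targets = [ "monitoring01.home.2rjus.net:9100" ]; } ];
      }
    ];
  };
};
```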

### Option 2: Thanos

Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.

**NixOS Options Available:**
- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
- `services.thanos.compact` - compacts and downsamples data
- `services.thanos.query` - unified query gateway
- `services.thanos.query-frontend` - query caching and parallelization
- `services.thanos.downsample` - dedicated downsampling service

**Downsampling Behavior:**
- Raw resolution kept for configurable period (default: indefinite)
- 5-minute resolution created after 40 hours
- 1-hour resolution created after 10 days

**Retention Configuration (in compactor):**
```nix
services.thanos.compact = {
  retention.resolution-raw = "30d"; # Keep raw for 30 days
  retention.resolution-5m = "180d"; # Keep 5m samples for 6 months
  retention.resolution-1h = "2y";   # Keep 1h samples for 2 years
};
```

**Pros:**
- True downsampling - older data uses progressively less storage
- Keep metrics for years with minimal storage impact
- Prometheus continues running unchanged
- Existing Alertmanager integration preserved

**Cons:**
- Requires object storage (MinIO, S3, or local filesystem)
- Multiple services to manage (sidecar, compactor, query)
- More complex architecture
- Additional infrastructure (MinIO) may be needed

**Required Components:**
1. Thanos Sidecar (runs alongside Prometheus)
2. Object storage (MinIO or local filesystem)
3. Thanos Compactor (handles downsampling)
4. Thanos Query (provides unified query endpoint)

**Migration Steps:**
1. Deploy object storage (MinIO or configure filesystem backend)
2. Add Thanos sidecar pointing to Prometheus data directory
3. Add Thanos compactor with retention policies
4. Add Thanos query gateway
5. Update Grafana datasource to Thanos Query port (10902)

## Comparison

| Aspect | VictoriaMetrics | Thanos |
|--------|-----------------|--------|
| Complexity | Low (1 service) | Higher (3-4 services) |
| Downsampling | No | Yes (automatic) |
| Storage savings | 5-10x compression | Compression + downsampling |
| Object storage required | No | Yes |
| Migration effort | Minimal | Moderate |
| Grafana changes | Change port only | Change port only |
| Alerting changes | Need vmalert | Keep existing |

## Recommendation

**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.

If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.

## References

- VictoriaMetrics docs: https://docs.victoriametrics.com/
- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)
371
docs/plans/nats-deploy-service.md
Normal file
@@ -0,0 +1,371 @@
# NATS-Based Deployment Service

## Overview

Create a message-based deployment system that allows triggering NixOS configuration updates on-demand, rather than waiting for the daily auto-upgrade timer. This enables faster iteration when testing changes and immediate fleet-wide deployments.

## Goals

1. **On-demand deployment** - Trigger config updates immediately via NATS message
2. **Targeted deployment** - Deploy to specific hosts or all hosts
3. **Branch/revision support** - Test feature branches before merging to master
4. **MCP integration** - Allow Claude Code to trigger deployments during development

## Current State

- **Auto-upgrade**: All hosts run `nixos-upgrade.service` daily, pulling from master
- **Manual testing**: `nixos-rebuild-test <action> <branch>` helper exists on all hosts
- **NATS**: Running on nats1 with JetStream enabled, using NKey authentication
- **Accounts**: ADMIN (system) and HOMELAB (user workloads with JetStream)

## Architecture

```
┌─────────────┐                          ┌─────────────┐
│  MCP Tool   │  deploy.test.>           │  Admin CLI  │  deploy.test.> + deploy.prod.>
│  (claude)   │────────────┐      ┌──────│  (torjus)   │
└─────────────┘            │      │      └─────────────┘
                           ▼      ▼
                      ┌──────────────┐
                      │    nats1     │
                      │   (authz)    │
                      └──────┬───────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
            ▼                ▼                ▼
      ┌──────────┐     ┌──────────┐     ┌──────────┐
      │ template1│     │   ns1    │     │   ha1    │
      │ tier=test│     │ tier=prod│     │ tier=prod│
      └──────────┘     └──────────┘     └──────────┘
```

## Repository Structure

The project lives in a **separate repository** (e.g., `homelab-deploy`) containing:

```
homelab-deploy/
├── flake.nix              # Nix flake with Go package + NixOS module
├── go.mod
├── go.sum
├── cmd/
│   └── homelab-deploy/
│       └── main.go        # CLI entrypoint with subcommands
├── internal/
│   ├── listener/          # Listener mode logic
│   ├── mcp/               # MCP server mode logic
│   └── deploy/            # Shared deployment logic
└── nixos/
    └── module.nix         # NixOS module for listener service
```

This repo imports the flake as an input and uses the NixOS module.

## Single Binary with Subcommands

The `homelab-deploy` binary supports multiple modes:

```bash
# Run as listener on a host (systemd service)
homelab-deploy listener --hostname ns1 --nats-url nats://nats1:4222

# Run as MCP server (for Claude Code)
homelab-deploy mcp --nats-url nats://nats1:4222

# CLI commands for manual use
homelab-deploy deploy ns1 --branch feature-x --action switch   # single host
homelab-deploy deploy --tier test --all --action boot          # all test hosts
homelab-deploy deploy --tier prod --all --action boot          # all prod hosts (admin only)
homelab-deploy deploy --tier prod --role dns --action switch   # all prod dns hosts
homelab-deploy status
```

## Components

### Listener Mode

A systemd service on each host that:
- Subscribes to multiple subjects for targeted and group deployments
- Validates incoming messages (revision, action)
- Executes `nixos-rebuild` with specified parameters
- Reports status back via NATS

**Subject structure:**
```
deploy.<tier>.<hostname>     # specific host (e.g., deploy.prod.ns1)
deploy.<tier>.all            # all hosts in tier (e.g., deploy.test.all)
deploy.<tier>.role.<role>    # all hosts with role in tier (e.g., deploy.prod.role.dns)
```

**Listener subscriptions** (based on `homelab.host` config):
- `deploy.<tier>.<hostname>` - direct messages to this host
- `deploy.<tier>.all` - broadcast to all hosts in tier
- `deploy.<tier>.role.<role>` - broadcast to hosts with matching role (if role is set)

Example: ns1 with `tier=prod, role=dns` subscribes to:
- `deploy.prod.ns1`
- `deploy.prod.all`
- `deploy.prod.role.dns`
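
The subscription rule above can be sketched in Go (the project's language); `subscriptionSubjects` is a hypothetical helper name for illustration, not necessarily the real listener API:

```go
package main

import "fmt"

// subscriptionSubjects returns the deploy subjects a host listener should
// subscribe to, derived from its homelab.host tier, hostname, and optional
// role. Illustrative sketch; the actual listener implementation may differ.
func subscriptionSubjects(tier, hostname, role string) []string {
	subjects := []string{
		fmt.Sprintf("deploy.%s.%s", tier, hostname), // direct messages to this host
		fmt.Sprintf("deploy.%s.all", tier),          // tier-wide broadcast
	}
	if role != "" {
		subjects = append(subjects, fmt.Sprintf("deploy.%s.role.%s", tier, role))
	}
	return subjects
}

func main() {
	// ns1 with tier=prod, role=dns:
	fmt.Println(subscriptionSubjects("prod", "ns1", "dns"))
	// [deploy.prod.ns1 deploy.prod.all deploy.prod.role.dns]
}
```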

**NixOS module configuration:**
```nix
services.homelab-deploy.listener = {
  enable = true;
  timeout = 600; # seconds, default 10 minutes
};
```

The listener reads tier and role from `config.homelab.host` (see Host Metadata below).

**Request message format:**
```json
{
  "action": "switch" | "boot" | "test" | "dry-activate",
  "revision": "master" | "feature-branch" | "abc123...",
  "reply_to": "deploy.responses.<request-id>"
}
```

**Response message format:**
```json
{
  "status": "accepted" | "rejected" | "started" | "completed" | "failed",
  "error": "invalid_revision" | "already_running" | "build_failed" | null,
  "message": "human-readable details"
}
```

**Request/Reply flow:**
1. MCP/CLI sends deploy request with unique `reply_to` subject
2. Listener validates request (e.g., `git ls-remote` to check revision exists)
3. Listener sends immediate response:
   - `{"status": "rejected", "error": "invalid_revision", "message": "branch 'foo' not found"}`, or
   - `{"status": "started", "message": "starting nixos-rebuild switch"}`
4. If started, listener runs nixos-rebuild
5. Listener sends final response:
   - `{"status": "completed", "message": "successfully switched to generation 42"}`, or
   - `{"status": "failed", "error": "build_failed", "message": "nixos-rebuild exited with code 1"}`

This provides immediate feedback on validation errors (bad revision, already running) without waiting for the build to fail.
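
The shape check behind step 2 can be sketched in Go. This only validates that the requested revision looks like a branch name or commit hash (the `git ls-remote` existence check is separate); `validRevision` and its pattern are illustrative assumptions, not the real implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

// revisionPattern accepts branch names and commit hashes: must start with an
// alphanumeric character, then letters, digits, dots, underscores, slashes,
// or hyphens. It deliberately rejects a leading "-" (flag injection into
// nixos-rebuild/git) and a leading "." or "/" (path tricks).
var revisionPattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]*$`)

func validRevision(rev string) bool {
	return rev != "" && len(rev) <= 255 && revisionPattern.MatchString(rev)
}

func main() {
	for _, rev := range []string{"master", "feature-x", "abc123", "--force", "../etc"} {
		fmt.Printf("%-10s %v\n", rev, validRevision(rev))
	}
}
```

A real listener would run this before replying `started`, so a request like `{"revision": "--force"}` is rejected immediately rather than failing mid-build.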

### MCP Mode

Runs as an MCP server providing tools for Claude Code.

**Tools:**
| Tool | Description | Tier Access |
|------|-------------|-------------|
| `deploy` | Deploy to test hosts (individual, all, or by role) | test only |
| `deploy_admin` | Deploy to any host (requires `--enable-admin` flag) | test + prod |
| `deploy_status` | Check deployment status/history | n/a |
| `list_hosts` | List available deployment targets | n/a |

**CLI flags:**
```bash
# Default: only test-tier deployments available
homelab-deploy mcp --nats-url nats://nats1:4222

# Enable admin tool (requires admin NKey to be configured)
homelab-deploy mcp --nats-url nats://nats1:4222 --enable-admin --admin-nkey-file /path/to/admin.nkey
```

**Security layers:**
1. **MCP flag**: `deploy_admin` tool only exposed when `--enable-admin` is passed
2. **NATS authz**: Even if tool is exposed, NATS rejects publishes without valid admin NKey
3. **Claude Code permissions**: Can set `mcp__homelab-deploy__deploy_admin` to `ask` mode for confirmation popup

By default, the MCP only loads test-tier credentials and exposes the `deploy` tool. Claude can:
- Deploy to individual test hosts
- Deploy to all test hosts at once (`deploy.test.all`)
- Deploy to test hosts by role (`deploy.test.role.<role>`)

### Tiered Permissions

Authorization is enforced at the NATS layer using subject-based permissions. Different deployer credentials have different publish rights:

**NATS user configuration (on nats1):**
```nix
accounts = {
  HOMELAB = {
    users = [
      # MCP/Claude - test tier only
      {
        nkey = "UABC..."; # mcp-deployer
        permissions = {
          publish = [ "deploy.test.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Admin - full access to all tiers
      {
        nkey = "UXYZ..."; # admin-deployer
        permissions = {
          publish = [ "deploy.test.>" "deploy.prod.>" ];
          subscribe = [ "deploy.responses.>" ];
        };
      }
      # Host listeners - subscribe to their tier, publish responses
      {
        nkey = "UDEF..."; # host-listener (one per host)
        permissions = {
          subscribe = [ "deploy.*.>" ];
          publish = [ "deploy.responses.>" ];
        };
      }
    ];
  };
};
```

**Host tier assignments** (via `homelab.host.tier`):
| Tier | Hosts |
|------|-------|
| test | template1, nix-cache01, future test hosts |
| prod | ns1, ns2, ha1, monitoring01, http-proxy, etc. |

**Example deployment scenarios:**

| Command | Subject | MCP | Admin |
|---------|---------|-----|-------|
| Deploy to ns1 | `deploy.prod.ns1` | ❌ | ✅ |
| Deploy to template1 | `deploy.test.template1` | ✅ | ✅ |
| Deploy to all test hosts | `deploy.test.all` | ✅ | ✅ |
| Deploy to all prod hosts | `deploy.prod.all` | ❌ | ✅ |
| Deploy to all DNS servers | `deploy.prod.role.dns` | ❌ | ✅ |

All NKeys stored in Vault - MCP gets limited credentials, admin CLI gets full-access credentials.

### Host Metadata

Rather than defining `tier` in the listener config, use a central `homelab.host` module that provides host metadata for multiple consumers. This aligns with the approach proposed in `docs/plans/prometheus-scrape-target-labels.md`.

**Status:** The `homelab.host` module is implemented in `modules/homelab/host.nix`.
Hosts can be filtered by tier using `config.homelab.host.tier`.

**Module definition (in `modules/homelab/host.nix`):**
```nix
homelab.host = {
  tier = lib.mkOption {
    type = lib.types.enum [ "test" "prod" ];
    default = "prod";
    description = "Deployment tier - controls which credentials can deploy to this host";
  };

  priority = lib.mkOption {
    type = lib.types.enum [ "high" "low" ];
    default = "high";
    description = "Alerting priority - low priority hosts have relaxed thresholds";
  };

  role = lib.mkOption {
    type = lib.types.nullOr lib.types.str;
    default = null;
    description = "Primary role of this host (dns, database, monitoring, etc.)";
  };

  labels = lib.mkOption {
    type = lib.types.attrsOf lib.types.str;
    default = { };
    description = "Additional free-form labels";
  };
};
```

**Consumers:**
- `homelab-deploy` listener reads `config.homelab.host.tier` for subject subscription
- Prometheus scrape config reads `priority`, `role`, `labels` for target labels
- Future services can consume the same metadata

**Example host config:**
```nix
# hosts/nix-cache01/configuration.nix
homelab.host = {
  tier = "test";     # can be deployed by MCP
  priority = "low";  # relaxed alerting thresholds
  role = "build-host";
};

# hosts/ns1/configuration.nix
homelab.host = {
  tier = "prod";     # requires admin credentials
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```

## Implementation Steps

### Phase 1: Core Binary + Listener

1. **Create homelab-deploy repository**
   - Initialize Go module
   - Set up flake.nix with Go package build

2. **Implement listener mode**
   - NATS subscription logic
   - nixos-rebuild execution
   - Status reporting via NATS reply

3. **Create NixOS module**
   - Systemd service definition
   - Configuration options (hostname, NATS URL, NKey path)
   - Vault secret integration for NKeys

4. **Create `homelab.host` module** (in nixos-servers)
   - Define `tier`, `priority`, `role`, `labels` options
   - This module is shared with Prometheus label work (see `docs/plans/prometheus-scrape-target-labels.md`)

5. **Integrate with nixos-servers**
   - Add flake input for homelab-deploy
   - Import listener module in `system/`
   - Set `homelab.host.tier` per host (test vs prod)

6. **Configure NATS tiered permissions**
   - Add deployer users to nats1 config (mcp-deployer, admin-deployer)
   - Set up subject ACLs per user (test-only vs full access)
   - Add deployer NKeys to Vault
   - Create Terraform resources for NKey secrets

### Phase 2: MCP + CLI

7. **Implement MCP mode**
   - MCP server with deploy/status tools
   - Request/reply pattern for deployment feedback

8. **Implement CLI commands**
   - `deploy` command for manual deployments
   - `status` command to check deployment state

9. **Configure Claude Code**
   - Add MCP server to configuration
   - Document usage

### Phase 3: Enhancements

10. Add deployment locking (prevent concurrent deploys)
11. Prometheus metrics for deployment status

## Security Considerations

- **Privilege escalation**: Listener runs as root to execute nixos-rebuild
- **Input validation**: Strictly validate revision format (branch name or commit hash)
- **Rate limiting**: Prevent rapid-fire deployments
- **Audit logging**: Log all deployment requests with source identity
- **Network isolation**: NATS only accessible from internal network

## Decisions

All open questions have been resolved. See the Notes section for decision rationale.

## Notes

- The existing `nixos-rebuild-test` helper provides a good reference for the rebuild logic
- Uses NATS request/reply pattern for immediate validation feedback and completion status
- Consider using NATS headers for metadata (request ID, timestamp)
- **Timeout decision**: Metrics show no-change upgrades complete in 5-55 seconds. A 10-minute default provides ample headroom for actual updates with package downloads. Per-host override available for hosts with known longer build times.
- **Rollback**: Not needed as a separate feature - deploy an older commit hash to effectively roll back.
- **Offline hosts**: No message persistence - if a host is offline, the deploy fails. Daily auto-upgrade is the safety net. Avoids the complexity of JetStream deduplication (a host coming online and applying 10 queued updates instead of just the latest).
- **Deploy history**: Use existing Loki - the listener logs deployments to journald, queryable via Loki. No need for separate JetStream persistence.
- **Naming**: `homelab-deploy` - ties it to the infrastructure rather than implementation details.
@@ -4,6 +4,8 @@

Add support for custom per-host labels on Prometheus scrape targets, enabling alert rules to reference host metadata (priority, role) instead of hardcoding instance names.

**Related:** This plan shares the `homelab.host` module with `docs/plans/nats-deploy-service.md`, which uses the same metadata for deployment tier assignment.

## Motivation

Some hosts have workloads that make generic alert thresholds inappropriate. For example, `nix-cache01` regularly hits high CPU during builds, requiring a longer `for` duration on `high_cpu_load`. Currently this is handled by excluding specific instance names in PromQL expressions, which is brittle and doesn't scale.
@@ -52,22 +54,62 @@ or

## Implementation

### 1. Add `labels` option to `homelab.monitoring`
This implementation uses a shared `homelab.host` module that provides host metadata for multiple consumers (Prometheus labels, deployment tiers, etc.). See also `docs/plans/nats-deploy-service.md`, which uses the same module for deployment tier assignment.

In `modules/homelab/monitoring.nix`, add:
### 1. Create `homelab.host` module

**Status:** Step 1 (Create `homelab.host` module) is complete. The module is in `modules/homelab/host.nix` with tier, priority, role, and labels options.

Create `modules/homelab/host.nix` with shared host metadata options:

```nix
labels = lib.mkOption {
  type = lib.types.attrsOf lib.types.str;
  default = { };
  description = "Custom labels to attach to this host's scrape targets";
};
{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };

    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };

    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels (e.g., dns_role = 'primary')";
    };
  };
}
```

Import this module in `modules/homelab/default.nix`.

### 2. Update `lib/monitoring.nix`

- `extractHostMonitoring` should carry `labels` through in its return value.
- `generateNodeExporterTargets` currently returns a flat list of target strings. It needs to return structured `static_configs` entries instead, grouping targets by their label sets:
- `extractHostMonitoring` should also extract `homelab.host` values (priority, role, labels).
- Build the combined label set from `homelab.host`:

```nix
# Combine structured options + free-form labels
effectiveLabels =
  (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
  // (lib.optionalAttrs (host.role != null) { role = host.role; })
  // host.labels;
```

- `generateNodeExporterTargets` returns structured `static_configs` entries, grouping targets by their label sets:

```nix
# Before (flat list):
@@ -80,7 +122,7 @@ labels = lib.mkOption {
]
```

This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with no custom labels get grouped together with no extra labels (preserving current behavior).
This requires grouping hosts by their label attrset and producing one `static_configs` entry per unique label combination. Hosts with default values (priority=high, no role, no labels) get grouped together with no extra labels (preserving current behavior).
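One way to sketch this grouping (illustrative only; `hostsWithLabels` is hypothetical example data, not the actual `lib/monitoring.nix` API) is `lib.groupBy` keyed on the serialized label set:

```nix
# Sketch: one static_configs entry per unique label set.
# Nix attrsets are name-sorted, so builtins.toJSON gives equal
# keys for equal label sets.
let
  lib = (import <nixpkgs> { }).lib;
  # Hypothetical extracted host metadata:
  hostsWithLabels = [
    { name = "ns1"; labels = { role = "dns"; dns_role = "primary"; }; }
    { name = "ha1"; labels = { }; }
    { name = "monitoring01"; labels = { }; }
  ];
  grouped = lib.groupBy (h: builtins.toJSON h.labels) hostsWithLabels;
in
lib.mapAttrsToList (_: hosts: {
  targets = map (h: "${h.name}.home.2rjus.net:9100") hosts;
  labels = (builtins.head hosts).labels;
}) grouped
```

Hosts with an empty label set collapse into one entry, matching the "preserving current behavior" requirement above.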

### 3. Update `services/monitoring/prometheus.nix`

@@ -94,17 +136,29 @@ static_configs = [{ targets = nodeExporterTargets; }];
static_configs = nodeExporterTargets;
```
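For reference, the structured value handed to `static_configs` would then have roughly this shape (a hand-written illustration; the hostnames and node-exporter port 9100 are assumptions, not generated output):

```nix
[
  {
    targets = [ "ha1.home.2rjus.net:9100" "monitoring01.home.2rjus.net:9100" ];
    labels = { };
  }
  {
    targets = [ "nix-cache01.home.2rjus.net:9100" ];
    labels = { priority = "low"; role = "build-host"; };
  }
]
```

Each attrset maps one-to-one onto a Prometheus `static_configs` item, so the scrape config needs no further transformation.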

### 4. Set labels on hosts
### 4. Set metadata on hosts

Example in `hosts/nix-cache01/configuration.nix` or the relevant service module:
Example in `hosts/nix-cache01/configuration.nix`:

```nix
homelab.monitoring.labels = {
  priority = "low";
homelab.host = {
  tier = "test"; # can be deployed by MCP (used by homelab-deploy)
  priority = "low"; # relaxed alerting thresholds
  role = "build-host";
};
```

Example in `hosts/ns1/configuration.nix`:

```nix
homelab.host = {
  tier = "prod";
  priority = "high";
  role = "dns";
  labels.dns_role = "primary";
};
```

### 5. Update alert rules

After implementing labels, review and update `services/monitoring/rules.yml`:
30 flake.lock generated
@@ -21,6 +21,27 @@
      "url": "https://git.t-juice.club/torjus/alerttonotify"
    }
  },
  "homelab-deploy": {
    "inputs": {
      "nixpkgs": [
        "nixpkgs-unstable"
      ]
    },
    "locked": {
      "lastModified": 1770447502,
      "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
      "ref": "master",
      "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
      "revCount": 22,
      "type": "git",
      "url": "https://git.t-juice.club/torjus/homelab-deploy"
    },
    "original": {
      "ref": "master",
      "type": "git",
      "url": "https://git.t-juice.club/torjus/homelab-deploy"
    }
  },
  "labmon": {
    "inputs": {
      "nixpkgs": [
@@ -49,11 +70,11 @@
      ]
    },
    "locked": {
      "lastModified": 1770417341,
      "narHash": "sha256-Js2EJZlpol6eSWVGyYy6SAhIISKHfURnD6S+NBaLfHU=",
      "lastModified": 1770422522,
      "narHash": "sha256-WmIFnquu4u58v8S2bOVWmknRwHn4x88CRfBFTzJ1inQ=",
      "ref": "refs/heads/master",
      "rev": "e381038537bbee314fb0326ac7d17ec4f651fe1e",
      "revCount": 7,
      "rev": "cf0ce858997af4d8dcc2ce10393ff393e17fc911",
      "revCount": 11,
      "type": "git",
      "url": "https://git.t-juice.club/torjus/nixos-exporter"
    },
@@ -97,6 +118,7 @@
  "root": {
    "inputs": {
      "alerttonotify": "alerttonotify",
      "homelab-deploy": "homelab-deploy",
      "labmon": "labmon",
      "nixos-exporter": "nixos-exporter",
      "nixpkgs": "nixpkgs",
48 flake.nix
@@ -21,6 +21,10 @@
    url = "git+https://git.t-juice.club/torjus/nixos-exporter";
    inputs.nixpkgs.follows = "nixpkgs-unstable";
  };
  homelab-deploy = {
    url = "git+https://git.t-juice.club/torjus/homelab-deploy?ref=master";
    inputs.nixpkgs.follows = "nixpkgs-unstable";
  };
};

outputs =
@@ -32,6 +36,7 @@
  alerttonotify,
  labmon,
  nixos-exporter,
  homelab-deploy,
  ...
}@inputs:
let
@@ -53,10 +58,13 @@
  { config, pkgs, ... }:
  {
    nixpkgs.overlays = commonOverlays;
    system.configurationRevision = self.rev or self.dirtyRev or "dirty";
  }
)
sops-nix.nixosModules.sops
nixos-exporter.nixosModules.default
homelab-deploy.nixosModules.default
./modules/homelab
];
allSystems = [
  "x86_64-linux"
@@ -178,15 +186,6 @@
  ./hosts/nats1
];
};
testvm01 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self sops-nix;
  };
  modules = commonModules ++ [
    ./hosts/testvm01
  ];
};
vault01 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
@@ -196,13 +195,31 @@
  ./hosts/vault01
];
};
vaulttest01 = nixpkgs.lib.nixosSystem {
testvm01 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self sops-nix;
  };
  modules = commonModules ++ [
    ./hosts/vaulttest01
    ./hosts/testvm01
  ];
};
testvm02 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self sops-nix;
  };
  modules = commonModules ++ [
    ./hosts/testvm02
  ];
};
testvm03 = nixpkgs.lib.nixosSystem {
  inherit system;
  specialArgs = {
    inherit inputs self sops-nix;
  };
  modules = commonModules ++ [
    ./hosts/testvm03
  ];
};
};
@@ -217,11 +234,12 @@
{ pkgs }:
{
  default = pkgs.mkShell {
    packages = with pkgs; [
      ansible
      opentofu
      openbao
    packages = [
      pkgs.ansible
      pkgs.opentofu
      pkgs.openbao
      (pkgs.callPackage ./scripts/create-host { })
      homelab-deploy.packages.${pkgs.system}.default
    ];
  };
}
@@ -57,6 +57,7 @@

# Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
vault.secrets.backup-helper = {
  secretPath = "shared/backup/password";
  extractKey = "password";

@@ -61,6 +61,7 @@
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;

nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [

@@ -8,6 +8,9 @@
];

nixpkgs.config.allowUnfree = true;

homelab.host.role = "bastion";

# Use the systemd-boot EFI boot loader.
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";

@@ -58,6 +58,7 @@

# Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
vault.secrets.backup-helper = {
  secretPath = "shared/backup/password";
  extractKey = "password";

@@ -13,6 +13,8 @@

homelab.dns.cnames = [ "nix-cache" "actions1" ];

homelab.host.role = "build-host";

fileSystems."/nix" = {
  device = "/dev/disk/by-label/nixcache";
  fsType = "xfs";
@@ -53,6 +55,7 @@
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;

nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [

@@ -48,6 +48,12 @@
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;

homelab.host = {
  role = "dns";
  labels.dns_role = "primary";
};

nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [

@@ -48,6 +48,12 @@
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;

homelab.host = {
  role = "dns";
  labels.dns_role = "secondary";
};

environment.systemPackages = with pkgs; [
  vim

@@ -11,6 +11,11 @@
# Template host - exclude from DNS zone generation
homelab.dns.enable = false;

homelab.host = {
  tier = "test";
  priority = "low";
};

boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/sda";

@@ -32,6 +32,11 @@
datasource_list = [ "ConfigDrive" "NoCloud" ];
};

homelab.host = {
  tier = "test";
  priority = "low";
};

boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "nixos-template2";

@@ -13,8 +13,16 @@
../../common/vm
];

# Test VM - exclude from DNS zone generation
homelab.dns.enable = false;
# Host metadata (adjust as needed)
homelab.host = {
  tier = "test"; # Start in test tier, move to prod after validation
};

# Enable Vault integration
vault.enable = true;

# Enable remote deployment via NATS
homelab.deploy.enable = true;

nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
@@ -24,7 +32,7 @@
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = false;
services.resolved.enable = true;
networking.nameservers = [
  "10.69.13.5"
  "10.69.13.6"
@@ -34,7 +42,7 @@
systemd.network.networks."ens18" = {
  matchConfig.Name = "ens18";
  address = [
    "10.69.13.101/24"
    "10.69.13.20/24"
  ];
  routes = [
    { Gateway = "10.69.13.1"; }
72 hosts/testvm02/configuration.nix Normal file
@@ -0,0 +1,72 @@
{
  config,
  lib,
  pkgs,
  ...
}:

{
  imports = [
    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

  # Host metadata (adjust as needed)
  homelab.host = {
    tier = "test"; # Start in test tier, move to prod after validation
  };

  # Enable Vault integration
  vault.enable = true;

  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "testvm02";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  services.resolved.enable = true;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
  ];

  systemd.network.enable = true;
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.21/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";

  nix.settings.experimental-features = [
    "nix-command"
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
    wget
    git
  ];

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  system.stateVersion = "25.11"; # Did you read the comment?
}
72 hosts/testvm03/configuration.nix Normal file
@@ -0,0 +1,72 @@
{
  config,
  lib,
  pkgs,
  ...
}:

{
  imports = [
    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

  # Host metadata (adjust as needed)
  homelab.host = {
    tier = "test"; # Start in test tier, move to prod after validation
  };

  # Enable Vault integration
  vault.enable = true;

  # Enable remote deployment via NATS
  homelab.deploy.enable = true;

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "testvm03";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  services.resolved.enable = true;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
  ];

  systemd.network.enable = true;
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.22/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";

  nix.settings.experimental-features = [
    "nix-command"
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
    wget
    git
  ];

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  system.stateVersion = "25.11"; # Did you read the comment?
}
5 hosts/testvm03/default.nix Normal file
@@ -0,0 +1,5 @@
{ ... }: {
  imports = [
    ./configuration.nix
  ];
}
@@ -16,6 +16,8 @@

homelab.dns.cnames = [ "vault" ];

homelab.host.role = "vault";

nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
@@ -1,127 +0,0 @@
{
  config,
  lib,
  pkgs,
  ...
}:

let
  vault-test-script = pkgs.writeShellApplication {
    name = "vault-test";
    text = ''
      echo "=== Vault Secret Test ==="
      echo "Secret path: hosts/vaulttest01/test-service"

      if [ -f /run/secrets/test-service/password ]; then
        echo "✓ Password file exists"
        echo "Password length: $(wc -c < /run/secrets/test-service/password)"
      else
        echo "✗ Password file missing!"
        exit 1
      fi

      if [ -d /var/lib/vault/cache/test-service ]; then
        echo "✓ Cache directory exists"
      else
        echo "✗ Cache directory missing!"
        exit 1
      fi

      echo "Test successful!"
    '';
  };
in
{
  imports = [
    ../template2/hardware-configuration.nix

    ../../system
    ../../common/vm
  ];

  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";

  networking.hostName = "vaulttest01";
  networking.domain = "home.2rjus.net";
  networking.useNetworkd = true;
  networking.useDHCP = false;
  services.resolved.enable = true;
  networking.nameservers = [
    "10.69.13.5"
    "10.69.13.6"
  ];

  systemd.network.enable = true;
  systemd.network.networks."ens18" = {
    matchConfig.Name = "ens18";
    address = [
      "10.69.13.150/24"
    ];
    routes = [
      { Gateway = "10.69.13.1"; }
    ];
    linkConfig.RequiredForOnline = "routable";
  };
  time.timeZone = "Europe/Oslo";

  nix.settings.experimental-features = [
    "nix-command"
    "flakes"
  ];
  nix.settings.tarball-ttl = 0;
  environment.systemPackages = with pkgs; [
    vim
    wget
    git
  ];

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  networking.firewall.enable = false;

  # Testing config
  # Enable Vault secrets management
  vault.enable = true;

  # Define a test secret
  vault.secrets.test-service = {
    secretPath = "hosts/vaulttest01/test-service";
    restartTrigger = true;
    restartInterval = "daily";
    services = [ "vault-test" ];
  };

  # Create a test service that uses the secret
  systemd.services.vault-test = {
    description = "Test Vault secret fetching";
    wantedBy = [ "multi-user.target" ];
    after = [ "vault-secret-test-service.service" ];

    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;

      ExecStart = lib.getExe vault-test-script;

      StandardOutput = "journal+console";
    };
  };

  # Test ACME certificate issuance from OpenBao PKI
  # Override the global ACME server (from system/acme.nix) to use OpenBao instead of step-ca
  security.acme.defaults.server = lib.mkForce "https://vault01.home.2rjus.net:8200/v1/pki_int/acme/directory";

  # Request a certificate for this host
  # Using HTTP-01 challenge with standalone listener on port 80
  security.acme.certs."vaulttest01.home.2rjus.net" = {
    listenHTTP = ":80";
    enableDebugLogs = true;
  };

  system.stateVersion = "25.11"; # Did you read the comment?
}
@@ -1,7 +1,9 @@
{ ... }:
{
  imports = [
    ./deploy.nix
    ./dns.nix
    ./host.nix
    ./monitoring.nix
  ];
}
16 modules/homelab/deploy.nix Normal file
@@ -0,0 +1,16 @@
{ config, lib, ... }:

{
  options.homelab.deploy = {
    enable = lib.mkEnableOption "homelab-deploy listener for NATS-based deployments";
  };

  config = {
    assertions = [
      {
        assertion = config.homelab.deploy.enable -> config.vault.enable;
        message = "homelab.deploy.enable requires vault.enable to be true (needed for NKey secret)";
      }
    ];
  };
}
28 modules/homelab/host.nix Normal file
@@ -0,0 +1,28 @@
{ lib, ... }:
{
  options.homelab.host = {
    tier = lib.mkOption {
      type = lib.types.enum [ "test" "prod" ];
      default = "prod";
      description = "Deployment tier - controls which credentials can deploy to this host";
    };

    priority = lib.mkOption {
      type = lib.types.enum [ "high" "low" ];
      default = "high";
      description = "Alerting priority - low priority hosts have relaxed thresholds";
    };

    role = lib.mkOption {
      type = lib.types.nullOr lib.types.str;
      default = null;
      description = "Primary role of this host (dns, database, monitoring, etc.)";
    };

    labels = lib.mkOption {
      type = lib.types.attrsOf lib.types.str;
      default = { };
      description = "Additional free-form labels (e.g., dns_role = 'primary')";
    };
  };
}
@@ -18,6 +18,8 @@ from manipulators import (
|
||||
remove_from_flake_nix,
|
||||
remove_from_terraform_vms,
|
||||
remove_from_vault_terraform,
|
||||
remove_from_approle_tf,
|
||||
find_host_secrets,
|
||||
check_entries_exist,
|
||||
)
|
||||
from models import HostConfig
|
||||
@@ -255,7 +257,10 @@ def handle_remove(
|
||||
sys.exit(1)
|
||||
|
||||
# Check what entries exist
|
||||
flake_exists, terraform_exists, vault_exists = check_entries_exist(hostname, repo_root)
|
||||
flake_exists, terraform_exists, vault_exists, approle_exists = check_entries_exist(hostname, repo_root)
|
||||
|
||||
# Check for secrets in secrets.tf
|
||||
host_secrets = find_host_secrets(hostname, repo_root)
|
||||
|
||||
# Collect all files in the host directory recursively
|
||||
files_in_host_dir = sorted([f for f in host_dir.rglob("*") if f.is_file()])
|
||||
@@ -294,6 +299,21 @@ def handle_remove(
|
||||
else:
|
||||
console.print(f" • terraform/vault/hosts-generated.tf [dim](not found)[/dim]")
|
||||
|
||||
if approle_exists:
|
||||
console.print(f' • terraform/vault/approle.tf (host_policies["{hostname}"])')
|
||||
else:
|
||||
console.print(f" • terraform/vault/approle.tf [dim](not found)[/dim]")
|
||||
|
||||
# Warn about secrets in secrets.tf
|
||||
if host_secrets:
|
||||
console.print(f"\n[yellow]⚠️ Warning: Found {len(host_secrets)} secret(s) in terraform/vault/secrets.tf:[/yellow]")
|
||||
for secret_path in host_secrets:
|
||||
console.print(f' • "{secret_path}"')
|
||||
console.print(f"\n [yellow]These will NOT be removed automatically.[/yellow]")
|
||||
console.print(f" After removal, manually edit secrets.tf and run:")
|
||||
for secret_path in host_secrets:
|
||||
console.print(f" [white]vault kv delete secret/{secret_path}[/white]")
|
||||
|
||||
# Warn about secrets directory
|
||||
if secrets_exist:
|
||||
console.print(f"\n[yellow]⚠️ Warning: secrets/{hostname}/ directory exists and will NOT be deleted[/yellow]")
|
||||
@@ -323,6 +343,13 @@ def handle_remove(
|
||||
else:
|
||||
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/hosts-generated.tf")
|
||||
|
||||
# Remove from terraform/vault/approle.tf
|
||||
if approle_exists:
|
||||
if remove_from_approle_tf(hostname, repo_root):
|
||||
console.print("[green]✓[/green] Removed from terraform/vault/approle.tf")
|
||||
else:
|
||||
console.print("[yellow]⚠[/yellow] Could not remove from terraform/vault/approle.tf")
|
||||
|
||||
# Remove from terraform/vms.tf
|
||||
if terraform_exists:
|
||||
if remove_from_terraform_vms(hostname, repo_root):
|
||||
@@ -345,19 +372,34 @@ def handle_remove(
|
||||
console.print(f"\n[bold green]✓ Host {hostname} removed successfully![/bold green]\n")
|
||||
|
||||
# Display next steps
|
||||
display_removal_next_steps(hostname, vault_exists)
|
||||
display_removal_next_steps(hostname, vault_exists, approle_exists, host_secrets)
|
||||
|
||||
|
||||
def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
|
||||
def display_removal_next_steps(hostname: str, had_vault: bool, had_approle: bool, host_secrets: list) -> None:
|
||||
"""Display next steps after successful removal."""
|
||||
vault_file = " terraform/vault/hosts-generated.tf" if had_vault else ""
|
||||
vault_apply = ""
|
||||
vault_files = ""
|
||||
if had_vault:
|
||||
vault_files += " terraform/vault/hosts-generated.tf"
|
||||
if had_approle:
|
||||
vault_files += " terraform/vault/approle.tf"
|
||||
|
||||
vault_apply = ""
|
||||
if had_vault or had_approle:
|
||||
vault_apply = f"""
|
||||
3. Apply Vault changes:
|
||||
[white]cd terraform/vault && tofu apply[/white]
|
||||
"""
|
||||
|
||||
secrets_cleanup = ""
|
||||
if host_secrets:
|
||||
secrets_cleanup = f"""
|
||||
5. Clean up secrets (manual):
|
||||
Edit terraform/vault/secrets.tf to remove entries for {hostname}
|
||||
Then delete from Vault:"""
|
||||
for secret_path in host_secrets:
|
||||
secrets_cleanup += f"\n [white]vault kv delete secret/{secret_path}[/white]"
|
||||
secrets_cleanup += "\n"
|
||||
|
||||
next_steps = f"""[bold cyan]Next Steps:[/bold cyan]
|
||||
|
||||
1. Review changes:
|
||||
@@ -367,9 +409,9 @@ def display_removal_next_steps(hostname: str, had_vault: bool) -> None:
|
||||
[white]cd terraform && tofu destroy -target='proxmox_vm_qemu.vm["{hostname}"]'[/white]
|
||||
{vault_apply}
|
||||
4. Commit changes:
|
||||
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_file}
|
||||
[white]git add -u hosts/{hostname} flake.nix terraform/vms.tf{vault_files}
|
||||
git commit -m "hosts: remove {hostname}"[/white]
|
||||
"""
|
||||
{secrets_cleanup}"""
|
||||
console.print(Panel(next_steps, border_style="cyan"))
|
||||
|
||||
|
||||
|
||||
@@ -144,7 +144,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
|
||||
|
||||
backend = vault_auth_backend.approle.path
|
||||
role_name = each.key
|
||||
token_policies = ["host-\${each.key}"]
|
||||
token_policies = ["host-\${each.key}", "homelab-deploy"]
|
||||
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
|
||||
token_ttl = 3600
|
||||
token_max_ttl = 3600
|
||||
|
||||
@@ -22,12 +22,12 @@ def remove_from_flake_nix(hostname: str, repo_root: Path) -> bool:
|
||||
content = flake_path.read_text()
|
||||
|
||||
# Check if hostname exists
|
||||
hostname_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
|
||||
hostname_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
|
||||
if not re.search(hostname_pattern, content, re.MULTILINE):
|
||||
return False
|
||||
|
||||
# Match the entire block from "hostname = " to "};"
|
||||
replace_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
|
||||
replace_pattern = rf"^ {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^ \}};\n"
|
||||
new_content, count = re.subn(replace_pattern, "", content, flags=re.MULTILINE | re.DOTALL)
|
||||
|
||||
if count == 0:
|
||||
@@ -101,7 +101,68 @@ def remove_from_vault_terraform(hostname: str, repo_root: Path) -> bool:
|
||||
return True
|
||||
|
||||
|
||||
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool]:
|
||||
def remove_from_approle_tf(hostname: str, repo_root: Path) -> bool:
|
||||
"""
|
||||
Remove host entry from terraform/vault/approle.tf locals.host_policies.
|
||||
|
||||
Args:
|
||||
hostname: Hostname to remove
|
||||
repo_root: Path to repository root
|
||||
|
||||
Returns:
|
||||
True if found and removed, False if not found
|
||||
"""
|
||||
approle_path = repo_root / "terraform" / "vault" / "approle.tf"
|
||||
|
||||
if not approle_path.exists():
|
||||
return False
|
||||
|
||||
content = approle_path.read_text()
|
||||
|
||||
# Check if hostname exists in host_policies
|
||||
hostname_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
|
||||
if not re.search(hostname_pattern, content, re.MULTILINE):
|
||||
return False
|
||||
|
||||
# Match the entire block from "hostname" = { to closing }
|
||||
# The block contains paths = [ ... ] and possibly extra_policies = [...]
|
||||
replace_pattern = rf'\n?\s+"{re.escape(hostname)}" = \{{[^}}]*\}}\n?'
|
||||
new_content, count = re.subn(replace_pattern, "\n", content, flags=re.DOTALL)
|
||||
|
||||
if count == 0:
|
||||
return False
|
||||
|
||||
approle_path.write_text(new_content)
|
||||
return True
|
||||
|
||||
|
||||
def find_host_secrets(hostname: str, repo_root: Path) -> list:
|
||||
"""
|
||||
Find secrets in terraform/vault/secrets.tf that belong to a host.
|
||||
|
||||
Args:
|
||||
hostname: Hostname to search for
|
||||
repo_root: Path to repository root
|
||||
|
||||
Returns:
|
||||
List of secret paths found (e.g., ["hosts/hostname/test-service"])
|
||||
"""
|
||||
secrets_path = repo_root / "terraform" / "vault" / "secrets.tf"
|
||||
|
||||
if not secrets_path.exists():
|
||||
return []
|
||||
|
||||
content = secrets_path.read_text()
|
||||
|
||||
# Find all secret paths matching hosts/{hostname}/
|
||||
pattern = rf'"(hosts/{re.escape(hostname)}/[^"]+)"'
|
||||
matches = re.findall(pattern, content)
|
||||
|
||||
# Return unique paths, preserving order
|
||||
return list(dict.fromkeys(matches))
|
||||
|
||||
def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, bool, bool]:
    """
    Check which entries exist for a hostname.

@@ -110,12 +171,12 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
        repo_root: Path to repository root

    Returns:
-       Tuple of (flake_exists, terraform_vms_exists, vault_exists)
+       Tuple of (flake_exists, terraform_vms_exists, vault_generated_exists, approle_exists)
    """
    # Check flake.nix
    flake_path = repo_root / "flake.nix"
    flake_content = flake_path.read_text()
    flake_pattern = rf"^  {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem"
    flake_exists = bool(re.search(flake_pattern, flake_content, re.MULTILINE))

    # Check terraform/vms.tf
@@ -131,7 +192,15 @@ def check_entries_exist(hostname: str, repo_root: Path) -> Tuple[bool, bool, boo
    vault_content = vault_tf_path.read_text()
    vault_exists = f'"{hostname}"' in vault_content

-   return (flake_exists, terraform_exists, vault_exists)
+   # Check terraform/vault/approle.tf
+   approle_path = repo_root / "terraform" / "vault" / "approle.tf"
+   approle_exists = False
+   if approle_path.exists():
+       approle_content = approle_path.read_text()
+       approle_pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
+       approle_exists = bool(re.search(approle_pattern, approle_content, re.MULTILINE))
+
+   return (flake_exists, terraform_exists, vault_exists, approle_exists)
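The approle check above anchors the hostname to a line-start map-key position; this matters because the same hostname also appears inside quoted path strings. A minimal sketch with an invented `approle.tf` fragment:

```python
import re

hostname = "testvm01"

# Invented approle.tf fragment for illustration only.
approle_content = (
    'locals {\n'
    '  generated_host_policies = {\n'
    '    "testvm01" = {\n'
    '      paths = ["secret/data/hosts/testvm01/*"]\n'
    '    }\n'
    '  }\n'
    '}\n'
)

# ^\s+"<name>" = { (with re.MULTILINE) only matches a map key at the start
# of a line, not a name embedded in a quoted path string.
pattern = rf'^\s+"{re.escape(hostname)}" = \{{'
exists = bool(re.search(pattern, approle_content, re.MULTILINE))
print(exists)  # True

# A name that only shows up inside a path string is not reported:
ghost = bool(re.search(r'^\s+"hosts" = \{', approle_content, re.MULTILINE))
print(ghost)  # False
```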
def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -> None:
@@ -147,32 +216,25 @@ def update_flake_nix(config: HostConfig, repo_root: Path, force: bool = False) -
    content = flake_path.read_text()

    # Create new entry
-   new_entry = f"""  {config.hostname} = nixpkgs.lib.nixosSystem {{
-    inherit system;
-    specialArgs = {{
-      inherit inputs self sops-nix;
-    }};
-    modules = [
-      (
-        {{ config, pkgs, ... }}:
-        {{
-          nixpkgs.overlays = commonOverlays;
-        }}
-      )
-      ./hosts/{config.hostname}
-      sops-nix.nixosModules.sops
-    ];
-  }};
+   new_entry = f"""  {config.hostname} = nixpkgs.lib.nixosSystem {{
+    inherit system;
+    specialArgs = {{
+      inherit inputs self sops-nix;
+    }};
+    modules = commonModules ++ [
+      ./hosts/{config.hostname}
+    ];
+  }};
"""

    # Check if hostname already exists
    hostname_pattern = rf"^  {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem"
    existing_match = re.search(hostname_pattern, content, re.MULTILINE)

    if existing_match and force:
        # Replace existing entry
        # Match the entire block from "hostname = " to "};"
        replace_pattern = rf"^  {re.escape(config.hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^  \}};\n"
        new_content, count = re.subn(replace_pattern, new_entry, content, flags=re.MULTILINE | re.DOTALL)

        if count == 0:
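The force-replace branch above hinges on `re.MULTILINE | re.DOTALL` together with a non-greedy `.*?`. A standalone sketch (the flake excerpt and entry names are invented for illustration):

```python
import re

hostname = "testvm01"

# Invented flake.nix excerpt; entries are indented two spaces as in the script.
flake_content = (
    "  testvm01 = nixpkgs.lib.nixosSystem {\n"
    "    modules = [ ./hosts/testvm01 ];\n"
    "  };\n"
    "  other01 = nixpkgs.lib.nixosSystem {\n"
    "    modules = [ ./hosts/other01 ];\n"
    "  };\n"
)

new_entry = (
    "  testvm01 = nixpkgs.lib.nixosSystem {\n"
    "    modules = commonModules ++ [ ./hosts/testvm01 ];\n"
    "  };\n"
)

# re.MULTILINE makes both ^ anchors match at line starts; re.DOTALL lets the
# non-greedy .*? span newlines, stopping at the first two-space-indented "};",
# so only this host's block is replaced and the neighbouring entry survives.
pattern = rf"^  {re.escape(hostname)} = nixpkgs\.lib\.nixosSystem \{{.*?^  \}};\n"
new_content, count = re.subn(pattern, new_entry, flake_content,
                             flags=re.MULTILINE | re.DOTALL)
print(count)                     # 1: exactly one block replaced
print("other01" in new_content)  # True: the other entry is untouched
```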
@@ -13,6 +13,17 @@
    ../../common/vm
  ];

+ # Host metadata (adjust as needed)
+ homelab.host = {
+   tier = "test"; # Start in test tier, move to prod after validation
+ };
+
+ # Enable Vault integration
+ vault.enable = true;
+
+ # Enable remote deployment via NATS
+ homelab.deploy.enable = true;
+
  nixpkgs.config.allowUnfree = true;
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/vda";
@@ -75,12 +75,12 @@ groups:
        description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
    - alert: systemd_not_running
      expr: node_systemd_system_running == 0
-     for: 5m
+     for: 10m
      labels:
-       severity: critical
+       severity: warning
      annotations:
        summary: "Systemd not in running state on {{ $labels.instance }}"
-       description: "Systemd is not in running state on {{ $labels.instance }}. The system may be in a degraded state."
+       description: "Systemd is not in running state on {{ $labels.instance }}. The system may be in a degraded state. Note: brief degraded states during nixos-rebuild are normal."
    - alert: high_file_descriptors
      expr: node_filefd_allocated / node_filefd_maximum > 0.8
      for: 5m
@@ -1,16 +1,18 @@
{ ... }:
{
- homelab.monitoring.scrapeTargets = [{
-   job_name = "nats";
-   port = 7777;
- }];
+ homelab.monitoring.scrapeTargets = [
+   {
+     job_name = "nats";
+     port = 7777;
+   }
+ ];

  services.prometheus.exporters.nats = {
    enable = true;
    url = "http://localhost:8222";
    extraFlags = [
      "-varz"    # General server info
      "-connz"   # Connection info
      "-jsz=all" # JetStream info
    ];
  };
@@ -38,6 +40,48 @@
      }
    ];
  };

+ DEPLOY = {
+   users = [
+     # Shared listener (all hosts use this)
+     {
+       nkey = "UCCZJSUGLCSLBBKHBPL4QA66TUMQUGIXGLIFTWDEH43MGWM3LDD232X4";
+       permissions = {
+         subscribe = [
+           "deploy.test.>"
+           "deploy.prod.>"
+           "deploy.discover"
+         ];
+         publish = [
+           "deploy.responses.>"
+           "deploy.discover"
+         ];
+       };
+     }
+     # Test deployer (MCP without admin)
+     {
+       nkey = "UBR66CX2ZNY5XNVQF5VBG4WFAF54LSGUYCUNNCEYRILDQ4NXDAD2THZU";
+       permissions = {
+         publish = [
+           "deploy.test.>"
+           "deploy.discover"
+         ];
+         subscribe = [
+           "deploy.responses.>"
+           "deploy.discover"
+         ];
+       };
+     }
+     # Admin deployer (full access)
+     {
+       nkey = "UD2BFB7DLM67P5UUVCKBUJMCHADIZLGGVUNSRLZE2ZC66FW2XT44P73Y";
+       permissions = {
+         publish = [ "deploy.>" ];
+         subscribe = [ "deploy.>" ];
+       };
+     }
+   ];
+ };
  };
  system_account = "ADMIN";
  jetstream = {
@@ -3,6 +3,7 @@
  imports = [
    ./acme.nix
    ./autoupgrade.nix
+   ./homelab-deploy.nix
    ./monitoring
    ./motd.nix
    ./packages.nix
@@ -12,7 +13,5 @@
    ./sops.nix
    ./sshd.nix
    ./vault-secrets.nix
-
-   ../modules/homelab
  ];
}
system/homelab-deploy.nix (new file, 37 lines)
@@ -0,0 +1,37 @@
+{ config, lib, ... }:
+
+let
+  hostCfg = config.homelab.host;
+in
+{
+  config = lib.mkIf config.homelab.deploy.enable {
+    # Fetch listener NKey from Vault
+    vault.secrets.homelab-deploy-nkey = {
+      secretPath = "shared/homelab-deploy/listener-nkey";
+      extractKey = "nkey";
+    };
+
+    # Enable homelab-deploy listener
+    services.homelab-deploy.listener = {
+      enable = true;
+      tier = hostCfg.tier;
+      role = hostCfg.role;
+      natsUrl = "nats://nats1.home.2rjus.net:4222";
+      nkeyFile = "/run/secrets/homelab-deploy-nkey";
+      flakeUrl = "git+https://git.t-juice.club/torjus/nixos-servers.git";
+      metrics.enable = true;
+    };
+
+    # Expose metrics for Prometheus scraping
+    homelab.monitoring.scrapeTargets = [{
+      job_name = "homelab-deploy";
+      port = 9972;
+    }];
+
+    # Ensure listener starts after vault secret is available
+    systemd.services.homelab-deploy-listener = {
+      after = [ "vault-secret-homelab-deploy-nkey.service" ];
+      requires = [ "vault-secret-homelab-deploy-nkey.service" ];
+    };
+  };
+}
@@ -4,6 +4,17 @@ resource "vault_auth_backend" "approle" {
  path = "approle"
}

+# Shared policy for homelab-deploy (all hosts need this for NATS-based deployments)
+resource "vault_policy" "homelab_deploy" {
+  name = "homelab-deploy"
+
+  policy = <<EOT
+path "secret/data/shared/homelab-deploy/*" {
+  capabilities = ["read", "list"]
+}
+EOT
+}
+
# Define host access policies
locals {
  host_policies = {
@@ -89,6 +100,7 @@ locals {
        "secret/data/hosts/nix-cache01/*",
      ]
    }
+
  }
}

@@ -114,7 +126,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
  backend = vault_auth_backend.approle.path
  role_name = each.key
  token_policies = concat(
-   ["${each.key}-policy"],
+   ["${each.key}-policy", "homelab-deploy"],
    lookup(each.value, "extra_policies", [])
  )
@@ -5,12 +5,22 @@
# Each host gets access to its own secrets under hosts/<hostname>/*
locals {
  generated_host_policies = {
-   "vaulttest01" = {
+   "testvm01" = {
      paths = [
-       "secret/data/hosts/vaulttest01/*",
+       "secret/data/hosts/testvm01/*",
      ]
    }

+   "testvm02" = {
+     paths = [
+       "secret/data/hosts/testvm02/*",
+     ]
+   }
+   "testvm03" = {
+     paths = [
+       "secret/data/hosts/testvm03/*",
+     ]
+   }

  }

# Placeholder secrets - user should add actual secrets manually or via tofu
@@ -40,7 +50,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {

  backend = vault_auth_backend.approle.path
  role_name = each.key
- token_policies = ["host-${each.key}"]
+ token_policies = ["host-${each.key}", "homelab-deploy"]
  secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
  token_ttl = 3600
  token_max_ttl = 3600
@@ -45,12 +45,6 @@ locals {
    password_length = 24
  }

- # TODO: Remove after testing
- "hosts/vaulttest01/test-service" = {
-   auto_generate = true
-   password_length = 32
- }
-
  # Shared backup password (auto-generated, add alongside existing restic key)
  "shared/backup/password" = {
    auto_generate = true
@@ -92,6 +86,22 @@ locals {
    auto_generate = false
    data = { token = var.actions_token_1 }
  }

+ # Homelab-deploy NKeys
+ "shared/homelab-deploy/listener-nkey" = {
+   auto_generate = false
+   data = { nkey = var.homelab_deploy_listener_nkey }
+ }
+
+ "shared/homelab-deploy/test-deployer-nkey" = {
+   auto_generate = false
+   data = { nkey = var.homelab_deploy_test_deployer_nkey }
+ }
+
+ "shared/homelab-deploy/admin-deployer-nkey" = {
+   auto_generate = false
+   data = { nkey = var.homelab_deploy_admin_deployer_nkey }
+ }
}
}
@@ -52,3 +52,24 @@ variable "actions_token_1" {
  sensitive = true
}

+variable "homelab_deploy_listener_nkey" {
+  description = "NKey seed for homelab-deploy listeners"
+  type        = string
+  default     = "PLACEHOLDER"
+  sensitive   = true
+}
+
+variable "homelab_deploy_test_deployer_nkey" {
+  description = "NKey seed for test-tier deployer"
+  type        = string
+  default     = "PLACEHOLDER"
+  sensitive   = true
+}
+
+variable "homelab_deploy_admin_deployer_nkey" {
+  description = "NKey seed for admin deployer"
+  type        = string
+  default     = "PLACEHOLDER"
+  sensitive   = true
+}
@@ -31,13 +31,6 @@ locals {
  # Example Minimal VM using all defaults (uncomment to deploy):
  # "minimal-vm" = {}
  # "bootstrap-verify-test" = {}
- "testvm01" = {
-   ip = "10.69.13.101/24"
-   cpu_cores = 2
-   memory = 2048
-   disk_size = "20G"
-   flake_branch = "pipeline-testing-improvements"
- }
  "vault01" = {
    ip = "10.69.13.19/24"
    cpu_cores = 2
@@ -45,13 +38,29 @@ locals {
    disk_size = "20G"
    flake_branch = "vault-setup" # Bootstrap from this branch instead of master
  }
- "vaulttest01" = {
-   ip = "10.69.13.150/24"
+ "testvm01" = {
+   ip = "10.69.13.20/24"
    cpu_cores = 2
    memory = 2048
    disk_size = "20G"
-   flake_branch = "pki-migration"
-   vault_wrapped_token = "s.UCpQCOp7cOKDdtGGBvfRWwAt"
+   flake_branch = "deploy-test-hosts"
+   vault_wrapped_token = "s.YRGRpAZVVtSYEa3wOYOqFmjt"
  }
+ "testvm02" = {
+   ip = "10.69.13.21/24"
+   cpu_cores = 2
+   memory = 2048
+   disk_size = "20G"
+   flake_branch = "deploy-test-hosts"
+   vault_wrapped_token = "s.tvs8yhJOkLjBs548STs6DBw7"
+ }
+ "testvm03" = {
+   ip = "10.69.13.22/24"
+   cpu_cores = 2
+   memory = 2048
+   disk_size = "20G"
+   flake_branch = "deploy-test-hosts"
+   vault_wrapped_token = "s.sQ80FZGeG3z6jgrsuh74IopC"
+ }
}