11 Commits

Author SHA1 Message Date
26ca6817f0 homelab-deploy: enable prometheus metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m57s
- Update homelab-deploy input to get metrics support
- Enable metrics endpoint on port 9972
- Add scrape target for prometheus auto-discovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 08:04:23 +01:00
b03a9b3b64 docs: add long-term metrics storage plan
Compare VictoriaMetrics and Thanos as options for extending
metrics retention beyond 30 days while managing disk usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:56:10 +01:00
f805b9f629 mcp: add homelab-deploy MCP server
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:27:12 +01:00
f3adf7e77f CLAUDE.md: add homelab-deploy MCP documentation
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:25:44 +01:00
f6eca9decc vaulttest01: add htop for deploy verification test
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:23:22 +01:00
6e93b8eae3 Merge pull request 'add-deploy-homelab' (#28) from add-deploy-homelab into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Reviewed-on: #28
2026-02-07 05:56:51 +00:00
c214f8543c homelab: add deploy.enable option with assertion
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Run nix flake check / flake-check (pull_request) Successful in 2m7s
- Add homelab.deploy.enable option (requires vault.enable)
- Create shared homelab-deploy Vault policy for all hosts
- Enable homelab.deploy on all vault-enabled hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
7933127d77 system: enable homelab-deploy listener for all vault hosts
Add system/homelab-deploy.nix module that automatically enables the
listener on all hosts with vault.enable=true. Uses homelab.host.tier
and homelab.host.role for NATS subject subscriptions.

- Add homelab-deploy access to all host AppRole policies
- Remove manual listener config from vaulttest01 (now handled by system module)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
13c3897e86 flake: update homelab-deploy, add to devShell
Update homelab-deploy to include bugfix. Add CLI to devShell for
easier testing and deployment operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 06:54:42 +01:00
0643f23281 vaulttest01: add vault secret dependency to listener
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m32s
Ensure homelab-deploy-listener waits for the NKey secret to be
fetched from Vault before starting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:29:29 +01:00
ad8570f8db homelab-deploy: add NATS-based deployment system
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m45s
Add homelab-deploy flake input and NixOS module for message-based
deployments across the fleet. Configure DEPLOY account in NATS with
tiered access control (listener, test-deployer, admin-deployer).
Enable listener on vaulttest01 as initial test host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 05:22:06 +01:00
20 changed files with 379 additions and 11 deletions


@@ -22,6 +22,17 @@
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
"LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
}
},
"homelab-deploy": {
"command": "nix",
"args": [
"run",
"git+https://git.t-juice.club/torjus/homelab-deploy",
"--",
"mcp",
"--nats-url", "nats://nats1.home.2rjus.net:4222",
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
]
}
}
}


@@ -194,6 +194,51 @@ node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
node_filesystem_avail_bytes{mountpoint="/"}
```
### Deploying to Test Hosts
The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.
**Available Tools:**
- `deploy` - Deploy NixOS configuration to test-tier hosts
- `list_hosts` - List available deployment targets
**Deploy Parameters:**
- `hostname` - Target a specific host (e.g., `vaulttest01`)
- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
- `all` - Deploy to all test-tier hosts
- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
- `branch` - Git branch or commit to deploy (default: `master`)
**Examples:**
```
# List available hosts
list_hosts()
# Deploy to a specific host
deploy(hostname="vaulttest01", action="switch")
# Dry-run deployment
deploy(hostname="vaulttest01", action="dry-activate")
# Deploy to all hosts with a role
deploy(role="vault", action="switch")
```
**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
**Verifying Deployments:**
After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
```promql
nixos_flake_info{instance=~"vaulttest01.*"}
```
The `current_rev` label contains the git commit hash of the deployed flake configuration.
## Architecture
### Directory Structure


@@ -0,0 +1,122 @@
# Long-Term Metrics Storage Options
## Problem Statement
Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
## Current Configuration
Location: `services/monitoring/prometheus.nix`
- **Retention**: 30 days
- **Scrape interval**: 15s
- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
- **Storage**: Local disk on monitoring01
## Options Evaluated
### Option 1: VictoriaMetrics
VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
**NixOS Options Available:**
- `services.victoriametrics.enable`
- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
- `services.vmagent` - dedicated scraping agent
- `services.vmalert` - alerting rules evaluation
**Pros:**
- Simple migration - single service replacement
- Same PromQL query language - Grafana dashboards work unchanged
- Same scrape config format - existing auto-generated configs work as-is
- 5-10x better compression means the disk space that currently holds 30 days of Prometheus data could hold 180+ days
- Lightweight, single binary
**Cons:**
- No automatic downsampling (relies on compression alone)
- Alert rule evaluation moves from Prometheus to vmalert (Alertmanager itself can be kept for notification routing)
- Would need to migrate existing data or start fresh
**Migration Steps:**
1. Replace `services.prometheus` with `services.victoriametrics`
2. Move scrape configs to `prometheusConfig`
3. Set up `services.vmalert` for alerting rules
4. Update Grafana datasource to VictoriaMetrics port (8428)
5. Keep Alertmanager for notification routing
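The steps above could be sketched as follows. This is a minimal illustration, not a tested configuration: option names come from the NixOS `services.victoriametrics` module, while the retention value and the inline scrape target are assumptions (the real config would carry over the auto-generated scrape configs from the flake hosts).
```nix
# Hypothetical replacement for services.prometheus on monitoring01.
# Retention and the example scrape target are illustrative assumptions.
services.victoriametrics = {
  enable = true;
  retentionPeriod = "6m";    # ~6 months, vs. the current 30d in Prometheus
  listenAddress = ":8428";   # Grafana datasource would point here
  prometheusConfig = {
    scrape_configs = [
      {
        job_name = "node";
        scrape_interval = "15s";
        static_configs = [ { targets = [ "monitoring01.home.2rjus.net:9100" ]; } ];
      }
    ];
  };
};
```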
### Option 2: Thanos
Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
**NixOS Options Available:**
- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
- `services.thanos.compact` - compacts and downsamples data
- `services.thanos.query` - unified query gateway
- `services.thanos.query-frontend` - query caching and parallelization
- `services.thanos.downsample` - dedicated downsampling service
**Downsampling Behavior:**
- Raw resolution kept for configurable period (default: indefinite)
- 5-minute resolution created after 40 hours
- 1-hour resolution created after 10 days
**Retention Configuration (in compactor):**
```nix
services.thanos.compact = {
retention.resolution-raw = "30d"; # Keep raw for 30 days
retention.resolution-5m = "180d"; # Keep 5m samples for 6 months
retention.resolution-1h = "2y"; # Keep 1h samples for 2 years
};
```
**Pros:**
- True downsampling - older data uses progressively less storage
- Keep metrics for years with minimal storage impact
- Prometheus continues running unchanged
- Existing Alertmanager integration preserved
**Cons:**
- Requires object storage (MinIO, S3, or local filesystem)
- Multiple services to manage (sidecar, compactor, query)
- More complex architecture
- Additional infrastructure (MinIO) may be needed
**Required Components:**
1. Thanos Sidecar (runs alongside Prometheus)
2. Object storage (MinIO or local filesystem)
3. Thanos Compactor (handles downsampling)
4. Thanos Query (provides unified query endpoint)
**Migration Steps:**
1. Deploy object storage (MinIO or configure filesystem backend)
2. Add Thanos sidecar pointing to Prometheus data directory
3. Add Thanos compactor with retention policies
4. Add Thanos query gateway
5. Update Grafana datasource to Thanos Query port (10902)
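Steps 1-3 might look roughly like this, assuming a local filesystem object store (avoiding MinIO entirely); this is a hedged sketch, and the attribute names, which mirror the Thanos CLI flags exposed by the NixOS `services.thanos` module, should be verified against the module before use.
```nix
# Sketch only: the sidecar uploads Prometheus TSDB blocks to a
# filesystem-backed object store, and the compactor applies the
# retention/downsampling policies shown earlier. Paths are assumptions.
services.thanos = {
  sidecar = {
    enable = true;
    prometheus.url = "http://localhost:9090";
    objstore.config = {
      type = "FILESYSTEM";
      config.directory = "/var/lib/thanos/objstore";
    };
  };
  compact = {
    enable = true;
    objstore.config = {
      type = "FILESYSTEM";
      config.directory = "/var/lib/thanos/objstore";
    };
  };
};
```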
## Comparison
| Aspect | VictoriaMetrics | Thanos |
|--------|-----------------|--------|
| Complexity | Low (1 service) | Higher (3-4 services) |
| Downsampling | No | Yes (automatic) |
| Storage savings | 5-10x compression | Compression + downsampling |
| Object storage required | No | Yes |
| Migration effort | Minimal | Moderate |
| Grafana changes | Change port only | Change port only |
| Alerting changes | Need vmalert | Keep existing |
## Recommendation
**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
## References
- VictoriaMetrics docs: https://docs.victoriametrics.com/
- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)

flake.lock generated

@@ -21,6 +21,27 @@
"url": "https://git.t-juice.club/torjus/alerttonotify"
}
},
"homelab-deploy": {
"inputs": {
"nixpkgs": [
"nixpkgs-unstable"
]
},
"locked": {
"lastModified": 1770447502,
"narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
"ref": "master",
"rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
"revCount": 22,
"type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy"
},
"original": {
"ref": "master",
"type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy"
}
},
"labmon": {
"inputs": {
"nixpkgs": [
@@ -97,6 +118,7 @@
"root": {
"inputs": {
"alerttonotify": "alerttonotify",
"homelab-deploy": "homelab-deploy",
"labmon": "labmon",
"nixos-exporter": "nixos-exporter",
"nixpkgs": "nixpkgs",


@@ -21,6 +21,10 @@
url = "git+https://git.t-juice.club/torjus/nixos-exporter";
inputs.nixpkgs.follows = "nixpkgs-unstable";
};
homelab-deploy = {
url = "git+https://git.t-juice.club/torjus/homelab-deploy?ref=master";
inputs.nixpkgs.follows = "nixpkgs-unstable";
};
};
outputs =
@@ -32,6 +36,7 @@
alerttonotify,
labmon,
nixos-exporter,
homelab-deploy,
...
}@inputs:
let
@@ -58,6 +63,7 @@
)
sops-nix.nixosModules.sops
nixos-exporter.nixosModules.default
homelab-deploy.nixosModules.default
./modules/homelab
];
allSystems = [
@@ -219,11 +225,12 @@
{ pkgs }:
{
default = pkgs.mkShell {
packages = [
pkgs.ansible
pkgs.opentofu
pkgs.openbao
(pkgs.callPackage ./scripts/create-host { })
homelab-deploy.packages.${pkgs.system}.default
];
};
}


@@ -57,6 +57,7 @@
# Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
vault.secrets.backup-helper = {
secretPath = "shared/backup/password";
extractKey = "password";


@@ -61,6 +61,7 @@
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [


@@ -58,6 +58,7 @@
# Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
vault.secrets.backup-helper = {
secretPath = "shared/backup/password";
extractKey = "password";


@@ -55,6 +55,7 @@
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [


@@ -48,6 +48,7 @@
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;
homelab.host = {
role = "dns";


@@ -48,6 +48,7 @@
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;
homelab.host = {
role = "dns";


@@ -81,6 +81,7 @@ in
vim
wget
git
htop # test deploy verification
];
# Open ports in the firewall.
@@ -92,6 +93,7 @@ in
# Testing config
# Enable Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
# Define a test secret
vault.secrets.test-service = {


@@ -1,6 +1,7 @@
{ ... }:
{
imports = [
./deploy.nix
./dns.nix
./host.nix
./monitoring.nix


@@ -0,0 +1,16 @@
{ config, lib, ... }:
{
options.homelab.deploy = {
enable = lib.mkEnableOption "homelab-deploy listener for NATS-based deployments";
};
config = {
assertions = [
{
assertion = config.homelab.deploy.enable -> config.vault.enable;
message = "homelab.deploy.enable requires vault.enable to be true (needed for NKey secret)";
}
];
};
}


@@ -1,16 +1,18 @@
{ ... }:
{
homelab.monitoring.scrapeTargets = [
{
job_name = "nats";
port = 7777;
}
];
services.prometheus.exporters.nats = {
enable = true;
url = "http://localhost:8222";
extraFlags = [
"-varz" # General server info
"-connz" # Connection info
"-jsz=all" # JetStream info
];
};
@@ -38,6 +40,48 @@
}
];
};
DEPLOY = {
users = [
# Shared listener (all hosts use this)
{
nkey = "UCCZJSUGLCSLBBKHBPL4QA66TUMQUGIXGLIFTWDEH43MGWM3LDD232X4";
permissions = {
subscribe = [
"deploy.test.>"
"deploy.prod.>"
"deploy.discover"
];
publish = [
"deploy.responses.>"
"deploy.discover"
];
};
}
# Test deployer (MCP without admin)
{
nkey = "UBR66CX2ZNY5XNVQF5VBG4WFAF54LSGUYCUNNCEYRILDQ4NXDAD2THZU";
permissions = {
publish = [
"deploy.test.>"
"deploy.discover"
];
subscribe = [
"deploy.responses.>"
"deploy.discover"
];
};
}
# Admin deployer (full access)
{
nkey = "UD2BFB7DLM67P5UUVCKBUJMCHADIZLGGVUNSRLZE2ZC66FW2XT44P73Y";
permissions = {
publish = [ "deploy.>" ];
subscribe = [ "deploy.>" ];
};
}
];
};
};
system_account = "ADMIN";
jetstream = {


@@ -3,6 +3,7 @@
imports = [
./acme.nix
./autoupgrade.nix
./homelab-deploy.nix
./monitoring
./motd.nix
./packages.nix

system/homelab-deploy.nix Normal file

@@ -0,0 +1,37 @@
{ config, lib, ... }:
let
hostCfg = config.homelab.host;
in
{
config = lib.mkIf config.homelab.deploy.enable {
# Fetch listener NKey from Vault
vault.secrets.homelab-deploy-nkey = {
secretPath = "shared/homelab-deploy/listener-nkey";
extractKey = "nkey";
};
# Enable homelab-deploy listener
services.homelab-deploy.listener = {
enable = true;
tier = hostCfg.tier;
role = hostCfg.role;
natsUrl = "nats://nats1.home.2rjus.net:4222";
nkeyFile = "/run/secrets/homelab-deploy-nkey";
flakeUrl = "git+https://git.t-juice.club/torjus/nixos-servers.git";
metrics.enable = true;
};
# Expose metrics for Prometheus scraping
homelab.monitoring.scrapeTargets = [{
job_name = "homelab-deploy";
port = 9972;
}];
# Ensure listener starts after vault secret is available
systemd.services.homelab-deploy-listener = {
after = [ "vault-secret-homelab-deploy-nkey.service" ];
requires = [ "vault-secret-homelab-deploy-nkey.service" ];
};
};
}


@@ -4,6 +4,17 @@ resource "vault_auth_backend" "approle" {
path = "approle"
}
# Shared policy for homelab-deploy (all hosts need this for NATS-based deployments)
resource "vault_policy" "homelab_deploy" {
name = "homelab-deploy"
policy = <<EOT
path "secret/data/shared/homelab-deploy/*" {
capabilities = ["read", "list"]
}
EOT
}
# Define host access policies
locals {
host_policies = {
@@ -89,6 +100,12 @@ locals {
"secret/data/hosts/nix-cache01/*",
]
}
"vaulttest01" = {
paths = [
"secret/data/hosts/vaulttest01/*",
]
}
}
}
@@ -114,7 +131,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = concat(
["${each.key}-policy", "homelab-deploy"],
lookup(each.value, "extra_policies", [])
)


@@ -92,6 +92,22 @@ locals {
auto_generate = false
data = { token = var.actions_token_1 }
}
# Homelab-deploy NKeys
"shared/homelab-deploy/listener-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_listener_nkey }
}
"shared/homelab-deploy/test-deployer-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_test_deployer_nkey }
}
"shared/homelab-deploy/admin-deployer-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_admin_deployer_nkey }
}
}
}


@@ -52,3 +52,24 @@ variable "actions_token_1" {
sensitive = true
}
variable "homelab_deploy_listener_nkey" {
description = "NKey seed for homelab-deploy listeners"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "homelab_deploy_test_deployer_nkey" {
description = "NKey seed for test-tier deployer"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "homelab_deploy_admin_deployer_nkey" {
description = "NKey seed for admin deployer"
type = string
default = "PLACEHOLDER"
sensitive = true
}