6 Commits

Author SHA1 Message Date
26ca6817f0 homelab-deploy: enable prometheus metrics
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m57s
- Update homelab-deploy input to get metrics support
- Enable metrics endpoint on port 9972
- Add scrape target for prometheus auto-discovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 08:04:23 +01:00
b03a9b3b64 docs: add long-term metrics storage plan
Compare VictoriaMetrics and Thanos as options for extending
metrics retention beyond 30 days while managing disk usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:56:10 +01:00
f805b9f629 mcp: add homelab-deploy MCP server
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:27:12 +01:00
f3adf7e77f CLAUDE.md: add homelab-deploy MCP documentation
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:25:44 +01:00
f6eca9decc vaulttest01: add htop for deploy verification test
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:23:22 +01:00
6e93b8eae3 Merge pull request 'add-deploy-homelab' (#28) from add-deploy-homelab into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Reviewed-on: #28
2026-02-07 05:56:51 +00:00
6 changed files with 190 additions and 4 deletions

View File

@@ -22,6 +22,17 @@
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
"LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
}
},
"homelab-deploy": {
"command": "nix",
"args": [
"run",
"git+https://git.t-juice.club/torjus/homelab-deploy",
"--",
"mcp",
"--nats-url", "nats://nats1.home.2rjus.net:4222",
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
]
}
}
}

View File

@@ -194,6 +194,51 @@ node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
node_filesystem_avail_bytes{mountpoint="/"}
```
### Deploying to Test Hosts
The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.
**Available Tools:**
- `deploy` - Deploy NixOS configuration to test-tier hosts
- `list_hosts` - List available deployment targets
**Deploy Parameters:**
- `hostname` - Target a specific host (e.g., `vaulttest01`)
- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
- `all` - Deploy to all test-tier hosts
- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
- `branch` - Git branch or commit to deploy (default: `master`)
**Examples:**
```
# List available hosts
list_hosts()
# Deploy to a specific host
deploy(hostname="vaulttest01", action="switch")
# Dry-run deployment
deploy(hostname="vaulttest01", action="dry-activate")
# Deploy to all hosts with a role
deploy(role="vault", action="switch")
```
**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
**Verifying Deployments:**
After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
```promql
nixos_flake_info{instance=~"vaulttest01.*"}
```
The `current_rev` label contains the git commit hash of the deployed flake configuration.
## Architecture
### Directory Structure

View File

@@ -0,0 +1,122 @@
# Long-Term Metrics Storage Options
## Problem Statement
Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
## Current Configuration
Location: `services/monitoring/prometheus.nix`
- **Retention**: 30 days
- **Scrape interval**: 15s
- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
- **Storage**: Local disk on monitoring01
## Options Evaluated
### Option 1: VictoriaMetrics
VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
**NixOS Options Available:**
- `services.victoriametrics.enable`
- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
- `services.vmagent` - dedicated scraping agent
- `services.vmalert` - alerting rules evaluation
**Pros:**
- Simple migration - single service replacement
- Same PromQL query language - Grafana dashboards work unchanged
- Same scrape config format - existing auto-generated configs work as-is
- 5-10x better compression means 30 days of Prometheus data could become 180+ days
- Lightweight, single binary
**Cons:**
- No automatic downsampling (relies on compression alone)
- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
- Would need to migrate existing data or start fresh
**Migration Steps:**
1. Replace `services.prometheus` with `services.victoriametrics`
2. Move scrape configs to `prometheusConfig`
3. Set up `services.vmalert` for alerting rules
4. Update Grafana datasource to VictoriaMetrics port (8428)
5. Keep Alertmanager for notification routing
### Option 2: Thanos
Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
**NixOS Options Available:**
- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
- `services.thanos.compact` - compacts and downsamples data
- `services.thanos.query` - unified query gateway
- `services.thanos.query-frontend` - query caching and parallelization
- `services.thanos.downsample` - dedicated downsampling service
**Downsampling Behavior:**
- Raw resolution kept for configurable period (default: indefinite)
- 5-minute resolution created after 40 hours
- 1-hour resolution created after 10 days
**Retention Configuration (in compactor):**
```nix
services.thanos.compact = {
retention.resolution-raw = "30d"; # Keep raw for 30 days
retention.resolution-5m = "180d"; # Keep 5m samples for 6 months
retention.resolution-1h = "2y"; # Keep 1h samples for 2 years
};
```
**Pros:**
- True downsampling - older data uses progressively less storage
- Keep metrics for years with minimal storage impact
- Prometheus continues running unchanged
- Existing Alertmanager integration preserved
**Cons:**
- Requires object storage (MinIO, S3, or local filesystem)
- Multiple services to manage (sidecar, compactor, query)
- More complex architecture
- Additional infrastructure (MinIO) may be needed
**Required Components:**
1. Thanos Sidecar (runs alongside Prometheus)
2. Object storage (MinIO or local filesystem)
3. Thanos Compactor (handles downsampling)
4. Thanos Query (provides unified query endpoint)
**Migration Steps:**
1. Deploy object storage (MinIO or configure filesystem backend)
2. Add Thanos sidecar pointing to Prometheus data directory
3. Add Thanos compactor with retention policies
4. Add Thanos query gateway
5. Update Grafana datasource to Thanos Query port (10902)
## Comparison
| Aspect | VictoriaMetrics | Thanos |
|--------|-----------------|--------|
| Complexity | Low (1 service) | Higher (3-4 services) |
| Downsampling | No | Yes (automatic) |
| Storage savings | 5-10x compression | Compression + downsampling |
| Object storage required | No | Yes |
| Migration effort | Minimal | Moderate |
| Grafana changes | Change port only | Change port only |
| Alerting changes | Need vmalert | Keep existing |
## Recommendation
**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
## References
- VictoriaMetrics docs: https://docs.victoriametrics.com/
- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)

8
flake.lock generated
View File

@@ -28,11 +28,11 @@
]
},
"locked": {
"lastModified": 1770443536,
"narHash": "sha256-UufZIVggiioMFDSjKx+ifgkDOk9alNSiRmkvc4/+HIA=",
"lastModified": 1770447502,
"narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
"ref": "master",
"rev": "95b795dcfd86b7b36045bba67e536b3a1c61dd33",
"revCount": 20,
"rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
"revCount": 22,
"type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy"
},

View File

@@ -81,6 +81,7 @@ in
vim
wget
git
htop # test deploy verification
];
# Open ports in the firewall.

View File

@@ -19,8 +19,15 @@ in
natsUrl = "nats://nats1.home.2rjus.net:4222";
nkeyFile = "/run/secrets/homelab-deploy-nkey";
flakeUrl = "git+https://git.t-juice.club/torjus/nixos-servers.git";
metrics.enable = true;
};
# Expose metrics for Prometheus scraping
homelab.monitoring.scrapeTargets = [{
job_name = "homelab-deploy";
port = 9972;
}];
# Ensure listener starts after vault secret is available
systemd.services.homelab-deploy-listener = {
after = [ "vault-secret-homelab-deploy-nkey.service" ];