Compare commits
9 Commits
homelab-de
...
2669b10f0e
| Author | SHA1 | Date | |
|---|---|---|---|
| | 2669b10f0e | | |
| | db6d610e16 | | |
| | e4eb8afe5c | | |
| | df9246a0f8 | | |
| | ec3b87f7fa | | |
| | 913fa11c64 | | |
| | 3e85e2527f | | |
| | 543ca18b14 | | |
| | c83218b3bc | | |
.mcp.json (11 lines changed)
@@ -22,17 +22,6 @@
         "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
         "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
       }
-    },
-    "homelab-deploy": {
-      "command": "nix",
-      "args": [
-        "run",
-        "git+https://git.t-juice.club/torjus/homelab-deploy",
-        "--",
-        "mcp",
-        "--nats-url", "nats://nats1.home.2rjus.net:4222",
-        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
-      ]
     }
   }
 }
CLAUDE.md (45 lines changed)
@@ -194,51 +194,6 @@ node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
 node_filesystem_avail_bytes{mountpoint="/"}
 ```
 
-### Deploying to Test Hosts
-
-The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.
-
-**Available Tools:**
-
-- `deploy` - Deploy NixOS configuration to test-tier hosts
-- `list_hosts` - List available deployment targets
-
-**Deploy Parameters:**
-
-- `hostname` - Target a specific host (e.g., `vaulttest01`)
-- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
-- `all` - Deploy to all test-tier hosts
-- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
-- `branch` - Git branch or commit to deploy (default: `master`)
-
-**Examples:**
-
-```
-# List available hosts
-list_hosts()
-
-# Deploy to a specific host
-deploy(hostname="vaulttest01", action="switch")
-
-# Dry-run deployment
-deploy(hostname="vaulttest01", action="dry-activate")
-
-# Deploy to all hosts with a role
-deploy(role="vault", action="switch")
-```
-
-**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
-
-**Verifying Deployments:**
-
-After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
-
-```promql
-nixos_flake_info{instance=~"vaulttest01.*"}
-```
-
-The `current_rev` label contains the git commit hash of the deployed flake configuration.
-
 ## Architecture
 
 ### Directory Structure
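The verification step described in the removed CLAUDE.md section (query `nixos_flake_info`, then read the `current_rev` label) could be scripted roughly as below. This is a minimal sketch, not part of the change set: it assumes the standard Prometheus instant-query endpoint `/api/v1/query`, and the base URL, the exporter port, and the revision in the sample response are hypothetical placeholders that do not appear in the diff.

```python
import json
from urllib.parse import urlencode

# Assumption: hypothetical Prometheus base URL; the diff does not name one.
PROMETHEUS_URL = "http://prometheus.example:9090"


def build_query_url(host: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for nixos_flake_info on one host."""
    promql = f'nixos_flake_info{{instance=~"{host}.*"}}'
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def deployed_revisions(response_body: str) -> dict[str, str]:
    """Map each instance label to the value of its current_rev label."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    return {
        series["metric"].get("instance", ""): series["metric"].get("current_rev", "")
        for series in payload["data"]["result"]
    }


# Hand-written sample in the shape of an instant-query response.
# The port (9971) and the revision ("0123abc") are made up for illustration.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [{
        "metric": {
            "__name__": "nixos_flake_info",
            "instance": "vaulttest01.home.2rjus.net:9971",
            "current_rev": "0123abc",
        },
        "value": [1770447502, "1"],
    }]},
})

print(build_query_url("vaulttest01"))
print(deployed_revisions(SAMPLE))
```

Comparing the returned revision against the commit passed as `branch` to `deploy` would confirm that the host picked up the intended configuration.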
@@ -1,122 +0,0 @@
-# Long-Term Metrics Storage Options
-
-## Problem Statement
-
-Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
-
-Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
-
-## Current Configuration
-
-Location: `services/monitoring/prometheus.nix`
-
-- **Retention**: 30 days
-- **Scrape interval**: 15s
-- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
-- **Storage**: Local disk on monitoring01
-
-## Options Evaluated
-
-### Option 1: VictoriaMetrics
-
-VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
-
-**NixOS Options Available:**
-- `services.victoriametrics.enable`
-- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
-- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
-- `services.vmagent` - dedicated scraping agent
-- `services.vmalert` - alerting rules evaluation
-
-**Pros:**
-- Simple migration - single service replacement
-- Same PromQL query language - Grafana dashboards work unchanged
-- Same scrape config format - existing auto-generated configs work as-is
-- 5-10x better compression means 30 days of Prometheus data could become 180+ days
-- Lightweight, single binary
-
-**Cons:**
-- No automatic downsampling (relies on compression alone)
-- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
-- Would need to migrate existing data or start fresh
-
-**Migration Steps:**
-1. Replace `services.prometheus` with `services.victoriametrics`
-2. Move scrape configs to `prometheusConfig`
-3. Set up `services.vmalert` for alerting rules
-4. Update Grafana datasource to VictoriaMetrics port (8428)
-5. Keep Alertmanager for notification routing
-
-### Option 2: Thanos
-
-Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
-
-**NixOS Options Available:**
-- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
-- `services.thanos.compact` - compacts and downsamples data
-- `services.thanos.query` - unified query gateway
-- `services.thanos.query-frontend` - query caching and parallelization
-- `services.thanos.downsample` - dedicated downsampling service
-
-**Downsampling Behavior:**
-- Raw resolution kept for configurable period (default: indefinite)
-- 5-minute resolution created after 40 hours
-- 1-hour resolution created after 10 days
-
-**Retention Configuration (in compactor):**
-```nix
-services.thanos.compact = {
-  retention.resolution-raw = "30d"; # Keep raw for 30 days
-  retention.resolution-5m = "180d"; # Keep 5m samples for 6 months
-  retention.resolution-1h = "2y"; # Keep 1h samples for 2 years
-};
-```
-
-**Pros:**
-- True downsampling - older data uses progressively less storage
-- Keep metrics for years with minimal storage impact
-- Prometheus continues running unchanged
-- Existing Alertmanager integration preserved
-
-**Cons:**
-- Requires object storage (MinIO, S3, or local filesystem)
-- Multiple services to manage (sidecar, compactor, query)
-- More complex architecture
-- Additional infrastructure (MinIO) may be needed
-
-**Required Components:**
-1. Thanos Sidecar (runs alongside Prometheus)
-2. Object storage (MinIO or local filesystem)
-3. Thanos Compactor (handles downsampling)
-4. Thanos Query (provides unified query endpoint)
-
-**Migration Steps:**
-1. Deploy object storage (MinIO or configure filesystem backend)
-2. Add Thanos sidecar pointing to Prometheus data directory
-3. Add Thanos compactor with retention policies
-4. Add Thanos query gateway
-5. Update Grafana datasource to Thanos Query port (10902)
-
-## Comparison
-
-| Aspect | VictoriaMetrics | Thanos |
-|--------|-----------------|--------|
-| Complexity | Low (1 service) | Higher (3-4 services) |
-| Downsampling | No | Yes (automatic) |
-| Storage savings | 5-10x compression | Compression + downsampling |
-| Object storage required | No | Yes |
-| Migration effort | Minimal | Moderate |
-| Grafana changes | Change port only | Change port only |
-| Alerting changes | Need vmalert | Keep existing |
-
-## Recommendation
-
-**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
-
-If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
-
-## References
-
-- VictoriaMetrics docs: https://docs.victoriametrics.com/
-- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
-- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)
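The retention claim in the removed storage document ("30 days of Prometheus data could become 180+ days") rests on simple arithmetic that can be checked with a short sketch. The function name is hypothetical, and the model assumes that disk usage grows linearly with retention and that the quoted compression ratio applies uniformly to every series, neither of which the document verifies.

```python
def estimated_retention_days(current_days: float, compression_ratio: float) -> float:
    """Days of retention that fit in the same disk space.

    Assumes disk usage scales linearly with retention and that the
    compression ratio applies uniformly to all stored series.
    """
    if current_days <= 0 or compression_ratio <= 0:
        raise ValueError("inputs must be positive")
    return current_days * compression_ratio


CURRENT_RETENTION_DAYS = 30  # retention stated in the removed document

for ratio in (5, 10):  # the 5-10x range quoted in the removed document
    days = estimated_retention_days(CURRENT_RETENTION_DAYS, ratio)
    print(f"{ratio}x compression: ~{days:.0f} days (~{days / 30:.0f} months)")
```

Under these assumptions the 5x end of the range yields about 150 days, so the "180+ days" and "6+ months" figures only hold once the effective ratio reaches 6x or better.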
flake.lock (generated, 8 lines changed)
@@ -28,11 +28,11 @@
         ]
       },
       "locked": {
-        "lastModified": 1770447502,
-        "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
+        "lastModified": 1770443536,
+        "narHash": "sha256-UufZIVggiioMFDSjKx+ifgkDOk9alNSiRmkvc4/+HIA=",
         "ref": "master",
-        "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
-        "revCount": 22,
+        "rev": "95b795dcfd86b7b36045bba67e536b3a1c61dd33",
+        "revCount": 20,
         "type": "git",
         "url": "https://git.t-juice.club/torjus/homelab-deploy"
       },
@@ -81,7 +81,6 @@ in
     vim
     wget
     git
-    htop # test deploy verification
   ];
 
   # Open ports in the firewall.
@@ -19,15 +19,8 @@ in
     natsUrl = "nats://nats1.home.2rjus.net:4222";
     nkeyFile = "/run/secrets/homelab-deploy-nkey";
     flakeUrl = "git+https://git.t-juice.club/torjus/nixos-servers.git";
-    metrics.enable = true;
   };
 
-  # Expose metrics for Prometheus scraping
-  homelab.monitoring.scrapeTargets = [{
-    job_name = "homelab-deploy";
-    port = 9972;
-  }];
-
   # Ensure listener starts after vault secret is available
   systemd.services.homelab-deploy-listener = {
     after = [ "vault-secret-homelab-deploy-nkey.service" ];