docs: add long-term metrics storage plan
Compare VictoriaMetrics and Thanos as options for extending metrics retention beyond 30 days while managing disk usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
122
docs/plans/long-term-metrics-storage.md
Normal file
122
docs/plans/long-term-metrics-storage.md
Normal file
@@ -0,0 +1,122 @@
|
|||||||
|
# Long-Term Metrics Storage Options
|
||||||
|
|
||||||
|
## Problem Statement
|
||||||
|
|
||||||
|
Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
|
||||||
|
|
||||||
|
Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
|
||||||
|
|
||||||
|
## Current Configuration
|
||||||
|
|
||||||
|
Location: `services/monitoring/prometheus.nix`
|
||||||
|
|
||||||
|
- **Retention**: 30 days
|
||||||
|
- **Scrape interval**: 15s
|
||||||
|
- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
|
||||||
|
- **Storage**: Local disk on monitoring01
|
||||||
|
|
||||||
|
## Options Evaluated
|
||||||
|
|
||||||
|
### Option 1: VictoriaMetrics
|
||||||
|
|
||||||
|
VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
|
||||||
|
|
||||||
|
**NixOS Options Available:**
|
||||||
|
- `services.victoriametrics.enable`
|
||||||
|
- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
|
||||||
|
- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
|
||||||
|
- `services.vmagent` - dedicated scraping agent
|
||||||
|
- `services.vmalert` - alerting rules evaluation
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Simple migration - single service replacement
|
||||||
|
- Same PromQL query language - Grafana dashboards work unchanged
|
||||||
|
- Same scrape config format - existing auto-generated configs work as-is
|
||||||
|
- 5-10x better compression means 30 days of Prometheus data could become 180+ days
|
||||||
|
- Lightweight, single binary
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- No automatic downsampling (relies on compression alone)
|
||||||
|
- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
|
||||||
|
- Would need to migrate existing data or start fresh
|
||||||
|
|
||||||
|
**Migration Steps:**
|
||||||
|
1. Replace `services.prometheus` with `services.victoriametrics`
|
||||||
|
2. Move scrape configs to `prometheusConfig`
|
||||||
|
3. Set up `services.vmalert` for alerting rules
|
||||||
|
4. Update Grafana datasource to VictoriaMetrics port (8428)
|
||||||
|
5. Keep Alertmanager for notification routing
|
||||||
|
|
||||||
|
### Option 2: Thanos
|
||||||
|
|
||||||
|
Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
|
||||||
|
|
||||||
|
**NixOS Options Available:**
|
||||||
|
- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
|
||||||
|
- `services.thanos.compact` - compacts and downsamples data
|
||||||
|
- `services.thanos.query` - unified query gateway
|
||||||
|
- `services.thanos.query-frontend` - query caching and parallelization
|
||||||
|
- `services.thanos.downsample` - dedicated downsampling service
|
||||||
|
|
||||||
|
**Downsampling Behavior:**
|
||||||
|
- Raw resolution kept for configurable period (default: indefinite)
|
||||||
|
- 5-minute resolution created after 40 hours
|
||||||
|
- 1-hour resolution created after 10 days
|
||||||
|
|
||||||
|
**Retention Configuration (in compactor):**
|
||||||
|
```nix
|
||||||
|
services.thanos.compact = {
|
||||||
|
retention.resolution-raw = "30d"; # Keep raw for 30 days
|
||||||
|
retention.resolution-5m = "180d"; # Keep 5m samples for 6 months
|
||||||
|
retention.resolution-1h = "2y"; # Keep 1h samples for 2 years
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- True downsampling - older data uses progressively less storage
|
||||||
|
- Keep metrics for years with minimal storage impact
|
||||||
|
- Prometheus continues running unchanged
|
||||||
|
- Existing Alertmanager integration preserved
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Requires object storage (MinIO, S3, or local filesystem)
|
||||||
|
- Multiple services to manage (sidecar, compactor, query)
|
||||||
|
- More complex architecture
|
||||||
|
- Additional infrastructure (MinIO) may be needed
|
||||||
|
|
||||||
|
**Required Components:**
|
||||||
|
1. Thanos Sidecar (runs alongside Prometheus)
|
||||||
|
2. Object storage (MinIO or local filesystem)
|
||||||
|
3. Thanos Compactor (handles downsampling)
|
||||||
|
4. Thanos Query (provides unified query endpoint)
|
||||||
|
|
||||||
|
**Migration Steps:**
|
||||||
|
1. Deploy object storage (MinIO or configure filesystem backend)
|
||||||
|
2. Add Thanos sidecar pointing to Prometheus data directory
|
||||||
|
3. Add Thanos compactor with retention policies
|
||||||
|
4. Add Thanos query gateway
|
||||||
|
5. Update Grafana datasource to Thanos Query port (10902)
|
||||||
|
|
||||||
|
## Comparison
|
||||||
|
|
||||||
|
| Aspect | VictoriaMetrics | Thanos |
|
||||||
|
|--------|-----------------|--------|
|
||||||
|
| Complexity | Low (1 service) | Higher (3-4 services) |
|
||||||
|
| Downsampling | No | Yes (automatic) |
|
||||||
|
| Storage savings | 5-10x compression | Compression + downsampling |
|
||||||
|
| Object storage required | No | Yes |
|
||||||
|
| Migration effort | Minimal | Moderate |
|
||||||
|
| Grafana changes | Change port only | Change port only |
|
||||||
|
| Alerting changes | Need vmalert | Keep existing |
|
||||||
|
|
||||||
|
## Recommendation
|
||||||
|
|
||||||
|
**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
|
||||||
|
|
||||||
|
If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- VictoriaMetrics docs: https://docs.victoriametrics.com/
|
||||||
|
- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
|
||||||
|
- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)
|
||||||
Reference in New Issue
Block a user