# Long-Term Metrics Storage Options

## Problem Statement

The current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor, which has limited local storage.
Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
## Current Configuration

Location: `services/monitoring/prometheus.nix`

- Retention: 30 days
- Scrape interval: 15s
- Features: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
- Storage: local disk on monitoring01
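For reference, the setup above roughly corresponds to a module like the following. This is a hedged sketch, not the actual file: only the retention, scrape interval, and bundled Alertmanager/Pushgateway are taken from this document; everything else is illustrative.

```nix
# Hypothetical sketch of services/monitoring/prometheus.nix; the
# auto-generated scrape configs are elided.
services.prometheus = {
  enable = true;
  retentionTime = "30d";
  globalConfig.scrape_interval = "15s";

  # Bundled components mentioned above.
  alertmanager.enable = true;
  pushgateway.enable = true;

  # In the real module these are generated from flake hosts.
  scrapeConfigs = [ ];
};
```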
## Options Evaluated

### Option 1: VictoriaMetrics
VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
**NixOS Options Available:**

- `services.victoriametrics.enable`
- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
- `services.victoriametrics.retentionPeriod` - e.g., `"6m"` for 6 months
- `services.vmagent` - dedicated scraping agent
- `services.vmalert` - alerting rules evaluation
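Using these options, a minimal replacement might look like this. The scrape job and target below are placeholders, not taken from the existing configuration.

```nix
# Sketch of a minimal VictoriaMetrics setup under the assumptions above.
services.victoriametrics = {
  enable = true;
  retentionPeriod = "6m";  # six months, vs. Prometheus's 30 days

  # Accepts the same format as prometheus.yml scrape configuration.
  prometheusConfig = {
    scrape_configs = [
      {
        job_name = "node";  # placeholder job
        static_configs = [ { targets = [ "monitoring01:9100" ]; } ];
      }
    ];
  };
};
```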
**Pros:**
- Simple migration - single service replacement
- Same PromQL query language - Grafana dashboards work unchanged
- Same scrape config format - existing auto-generated configs work as-is
- 5-10x better compression means the disk space that currently holds 30 days of Prometheus data could hold 180+ days
- Lightweight, single binary
**Cons:**
- No automatic downsampling (relies on compression alone)
- Alert rule evaluation moves to vmalert rather than Prometheus's built-in rule engine
- Would need to migrate existing data or start fresh
**Migration Steps:**

- Replace `services.prometheus` with `services.victoriametrics`
- Move scrape configs to `prometheusConfig`
- Set up `services.vmalert` for alerting rules
- Update the Grafana datasource to the VictoriaMetrics port (8428)
- Keep Alertmanager for notification routing
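The Grafana datasource change could be provisioned declaratively, roughly as below. The datasource name and host are assumptions; since VictoriaMetrics speaks PromQL, the existing `prometheus` datasource type keeps working.

```nix
# Hypothetical Grafana provisioning pointing the default datasource at
# VictoriaMetrics' Prometheus-compatible HTTP endpoint on port 8428.
services.grafana.provision.datasources.settings.datasources = [
  {
    name = "VictoriaMetrics";       # assumed name
    type = "prometheus";            # PromQL-compatible, so this type works
    url = "http://localhost:8428";  # assumed host
    isDefault = true;
  }
];
```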
### Option 2: Thanos
Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
**NixOS Options Available:**

- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
- `services.thanos.compact` - compacts and downsamples data
- `services.thanos.query` - unified query gateway
- `services.thanos.query-frontend` - query caching and parallelization
- `services.thanos.downsample` - dedicated downsampling service
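As an illustration of how these pieces connect, a sidecar definition might look like the following. This is a sketch: the TSDB path and objstore file location are assumptions, and the flag-style attribute names should be verified against the NixOS Thanos module.

```nix
# Hedged sketch: sidecar reading Prometheus's local TSDB and shipping
# completed blocks to the object storage described in objstore.yml.
services.thanos.sidecar = {
  enable = true;
  prometheus.url = "http://127.0.0.1:9090";           # local Prometheus
  tsdb.path = "/var/lib/prometheus2";                 # assumed data directory
  objstore.config-file = "/etc/thanos/objstore.yml";  # assumed path
};
```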
**Downsampling Behavior:**
- Raw resolution kept for configurable period (default: indefinite)
- 5-minute resolution created after 40 hours
- 1-hour resolution created after 10 days
**Retention Configuration (in the compactor):**

```nix
services.thanos.compact = {
  retention.resolution-raw = "30d";  # keep raw samples for 30 days
  retention.resolution-5m = "180d";  # keep 5m samples for 6 months
  retention.resolution-1h = "2y";    # keep 1h samples for 2 years
};
```
**Pros:**
- True downsampling - older data uses progressively less storage
- Keep metrics for years with minimal storage impact
- Prometheus continues running unchanged
- Existing Alertmanager integration preserved
**Cons:**
- Requires object storage (MinIO, S3, or local filesystem)
- Multiple services to manage (sidecar, compactor, query)
- More complex architecture
- Additional infrastructure (MinIO) may be needed
**Required Components:**
- Thanos Sidecar (runs alongside Prometheus)
- Object storage (MinIO or local filesystem)
- Thanos Compactor (handles downsampling)
- Thanos Query (provides unified query endpoint)
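If deploying MinIO is undesirable, Thanos's filesystem object-store backend can be declared directly. A sketch, with an assumed bucket directory; this keeps long-term data on local disk, trading away MinIO's operational overhead.

```nix
# Hypothetical object-storage config consumed by the sidecar and compactor.
environment.etc."thanos/objstore.yml".text = ''
  type: FILESYSTEM
  config:
    directory: /var/lib/thanos/bucket
'';
```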
**Migration Steps:**
- Deploy object storage (MinIO or configure filesystem backend)
- Add Thanos sidecar pointing to Prometheus data directory
- Add Thanos compactor with retention policies
- Add Thanos query gateway
- Update Grafana datasource to Thanos Query port (10902)
## Comparison
| Aspect | VictoriaMetrics | Thanos |
|---|---|---|
| Complexity | Low (1 service) | Higher (3-4 services) |
| Downsampling | No | Yes (automatic) |
| Storage savings | 5-10x compression | Compression + downsampling |
| Object storage required | No | Yes |
| Migration effort | Minimal | Moderate |
| Grafana changes | Change port only | Change port only |
| Alerting changes | Need vmalert | Keep existing |
## Recommendation
Start with VictoriaMetrics for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
## References
- VictoriaMetrics docs: https://docs.victoriametrics.com/
- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)