# Long-Term Metrics Storage Options

## Problem Statement

The current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor, which has limited local storage.
Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
## Current Configuration

Location: `services/monitoring/prometheus.nix`

- Retention: 30 days
- Scrape interval: 15s
- Features: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
- Storage: local disk on monitoring01
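For reference, the setup above roughly corresponds to a module like the following. This is a hedged sketch, not the actual file: only the retention, scrape interval, and bundled Alertmanager/Pushgateway are taken from this document; everything else is illustrative.

```nix
# Hypothetical sketch of services/monitoring/prometheus.nix; the
# auto-generated scrape configs are elided.
services.prometheus = {
  enable = true;
  retentionTime = "30d";
  globalConfig.scrape_interval = "15s";

  # Bundled components mentioned above.
  alertmanager.enable = true;
  pushgateway.enable = true;

  # In the real module these are generated from flake hosts.
  scrapeConfigs = [ ];
};
```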
## Options Evaluated

### Option 1: VictoriaMetrics
VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
**NixOS Options Available:**

- `services.victoriametrics.enable`
- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
- `services.victoriametrics.retentionPeriod` - e.g., `"6m"` for 6 months
- `services.vmagent` - dedicated scraping agent
- `services.vmalert` - alerting rules evaluation
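Using these options, a minimal replacement might look like this. The scrape job and target below are placeholders, not taken from the existing configuration.

```nix
# Sketch of a minimal VictoriaMetrics setup under the assumptions above.
services.victoriametrics = {
  enable = true;
  retentionPeriod = "6m";  # six months, vs. Prometheus's 30 days

  # Accepts the same format as prometheus.yml scrape configuration.
  prometheusConfig = {
    scrape_configs = [
      {
        job_name = "node";  # placeholder job
        static_configs = [ { targets = [ "monitoring01:9100" ]; } ];
      }
    ];
  };
};
```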
**Pros:**
- Simple migration - single service replacement
- Same PromQL query language - Grafana dashboards work unchanged
- Same scrape config format - existing auto-generated configs work as-is
- 5-10x better compression means the disk space that currently holds 30 days of Prometheus data could hold 180+ days
- Lightweight, single binary
**Cons:**
- No automatic downsampling (relies on compression alone)
- Alert rule evaluation moves to vmalert rather than Prometheus's built-in rule engine
- Would need to migrate existing data or start fresh
**Migration Steps:**

- Replace `services.prometheus` with `services.victoriametrics`
- Move scrape configs to `prometheusConfig`
- Set up `services.vmalert` for alerting rules
- Update the Grafana datasource to the VictoriaMetrics port (8428)
- Keep Alertmanager for notification routing
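The Grafana datasource change could be provisioned declaratively, roughly as below. The datasource name and host are assumptions; since VictoriaMetrics speaks PromQL, the existing `prometheus` datasource type keeps working.

```nix
# Hypothetical Grafana provisioning pointing the default datasource at
# VictoriaMetrics' Prometheus-compatible HTTP endpoint on port 8428.
services.grafana.provision.datasources.settings.datasources = [
  {
    name = "VictoriaMetrics";       # assumed name
    type = "prometheus";            # PromQL-compatible, so this type works
    url = "http://localhost:8428";  # assumed host
    isDefault = true;
  }
];
```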
### Option 2: Thanos
Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
**NixOS Options Available:**

- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
- `services.thanos.compact` - compacts and downsamples data
- `services.thanos.query` - unified query gateway
- `services.thanos.query-frontend` - query caching and parallelization
- `services.thanos.downsample` - dedicated downsampling service
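As an illustration of how these pieces connect, a sidecar definition might look like the following. This is a sketch: the TSDB path and objstore file location are assumptions, and the flag-style attribute names should be verified against the NixOS Thanos module.

```nix
# Hedged sketch: sidecar reading Prometheus's local TSDB and shipping
# completed blocks to the object storage described in objstore.yml.
services.thanos.sidecar = {
  enable = true;
  prometheus.url = "http://127.0.0.1:9090";           # local Prometheus
  tsdb.path = "/var/lib/prometheus2";                 # assumed data directory
  objstore.config-file = "/etc/thanos/objstore.yml";  # assumed path
};
```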
**Downsampling Behavior:**
- Raw resolution kept for configurable period (default: indefinite)
- 5-minute resolution created after 40 hours
- 1-hour resolution created after 10 days
**Retention Configuration (in the compactor):**

```nix
services.thanos.compact = {
  retention.resolution-raw = "30d";  # keep raw samples for 30 days
  retention.resolution-5m = "180d";  # keep 5m samples for 6 months
  retention.resolution-1h = "2y";    # keep 1h samples for 2 years
};
```
**Pros:**
- True downsampling - older data uses progressively less storage
- Keep metrics for years with minimal storage impact
- Prometheus continues running unchanged
- Existing Alertmanager integration preserved
**Cons:**
- Requires object storage (MinIO, S3, or local filesystem)
- Multiple services to manage (sidecar, compactor, query)
- More complex architecture
- Additional infrastructure (MinIO) may be needed
**Required Components:**
- Thanos Sidecar (runs alongside Prometheus)
- Object storage (MinIO or local filesystem)
- Thanos Compactor (handles downsampling)
- Thanos Query (provides unified query endpoint)
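If deploying MinIO is undesirable, Thanos's filesystem object-store backend can be declared directly. A sketch, with an assumed bucket directory; this keeps long-term data on local disk, trading away MinIO's operational overhead.

```nix
# Hypothetical object-storage config consumed by the sidecar and compactor.
environment.etc."thanos/objstore.yml".text = ''
  type: FILESYSTEM
  config:
    directory: /var/lib/thanos/bucket
'';
```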
**Migration Steps:**
- Deploy object storage (MinIO or configure filesystem backend)
- Add Thanos sidecar pointing to Prometheus data directory
- Add Thanos compactor with retention policies
- Add Thanos query gateway
- Update Grafana datasource to Thanos Query port (10902)
## Comparison
| Aspect | VictoriaMetrics | Thanos |
|---|---|---|
| Complexity | Low (1 service) | Higher (3-4 services) |
| Downsampling | No | Yes (automatic) |
| Storage savings | 5-10x compression | Compression + downsampling |
| Object storage required | No | Yes |
| Migration effort | Minimal | Moderate |
| Grafana changes | Change port only | Change port only |
| Alerting changes | Need vmalert | Keep existing |
## Recommendation
Start with VictoriaMetrics for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
## References
- VictoriaMetrics docs: https://docs.victoriametrics.com/
- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)