From b03a9b3b6437f5f7afe9ebdbf1af51b072fdd14a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?=
Date: Sat, 7 Feb 2026 07:56:10 +0100
Subject: [PATCH] docs: add long-term metrics storage plan

Compare VictoriaMetrics and Thanos as options for extending metrics retention beyond 30 days while managing disk usage.

Co-Authored-By: Claude Opus 4.5
---
 docs/plans/long-term-metrics-storage.md | 122 ++++++++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 docs/plans/long-term-metrics-storage.md

diff --git a/docs/plans/long-term-metrics-storage.md b/docs/plans/long-term-metrics-storage.md
new file mode 100644
index 0000000..160e75c
--- /dev/null
+++ b/docs/plans/long-term-metrics-storage.md
@@ -0,0 +1,122 @@
+# Long-Term Metrics Storage Options
+
+## Problem Statement
+
+Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
+
+Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
+
+## Current Configuration
+
+Location: `services/monitoring/prometheus.nix`
+
+- **Retention**: 30 days
+- **Scrape interval**: 15s
+- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
+- **Storage**: Local disk on monitoring01
+
+## Options Evaluated
+
+### Option 1: VictoriaMetrics
+
+VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
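+
+As a rough illustration of the single-service replacement, a NixOS module along the following lines might be enough. This is a sketch, not tested configuration: it assumes the `services.victoriametrics` options from nixpkgs (`enable`, `retentionPeriod`, `prometheusConfig`) behave as their names suggest, and the job name, exporter port, and retention value are hypothetical placeholders.
+
+```nix
+{
+  services.victoriametrics = {
+    enable = true;
+    # Assumed value: roughly six months, written in days.
+    retentionPeriod = "180d";
+    # Prometheus-style scrape configuration; the job below is a hypothetical example.
+    prometheusConfig = {
+      scrape_configs = [
+        {
+          job_name = "node";
+          scrape_interval = "15s";
+          static_configs = [
+            # Port 9100 is an assumption (conventional node exporter port).
+            { targets = [ "monitoring01:9100" ]; }
+          ];
+        }
+      ];
+    };
+  };
+}
+```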
+
+**NixOS Options Available:**
+- `services.victoriametrics.enable`
+- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
+- `services.victoriametrics.retentionPeriod` - e.g., "180d"; suffixes h, d, w, y are accepted and a bare number is counted in months ("6" = 6 months)
+- `services.vmagent` - dedicated scraping agent
+- `services.vmalert` - alerting rules evaluation
+
+**Pros:**
+- Simple migration - single service replacement
+- Same PromQL query language - Grafana dashboards work unchanged
+- Same scrape config format - existing auto-generated configs work as-is
+- At 5-10x compression, the disk space that holds 30 days of Prometheus data could hold roughly 150-300 days
+- Lightweight, single binary
+
+**Cons:**
+- No downsampling in the open-source edition (relies on compression alone)
+- Alerting rules must be evaluated by vmalert instead of Prometheus (Alertmanager stays for notification routing)
+- Existing data must be migrated, or history starts fresh
+
+**Migration Steps:**
+1. Replace `services.prometheus` with `services.victoriametrics`
+2. Move scrape configs to `prometheusConfig`
+3. Set up `services.vmalert` for alerting rules
+4. Update Grafana datasource to VictoriaMetrics port (8428)
+5. Keep Alertmanager for notification routing
+
+### Option 2: Thanos
+
+Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
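+
+To make the object storage dependency concrete, the sidecar could be pointed at a filesystem bucket instead of MinIO. The sketch below is an illustration under assumptions, not tested configuration: it relies on the `services.thanos.sidecar` options in nixpkgs (including `objstore.config`) and on the Thanos `FILESYSTEM` bucket type, and the directory path is hypothetical.
+
+```nix
+{
+  services.thanos.sidecar = {
+    enable = true;
+    # Rendered into the Thanos object store config; FILESYSTEM avoids running MinIO.
+    objstore.config = {
+      type = "FILESYSTEM";
+      # Hypothetical path; the compactor and any other Thanos component
+      # reading the bucket would need the same directory.
+      config.directory = "/var/lib/thanos/objstore";
+    };
+  };
+}
+```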
+
+**NixOS Options Available:**
+- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
+- `services.thanos.compact` - compacts and downsamples data
+- `services.thanos.query` - unified query gateway
+- `services.thanos.query-frontend` - query caching and parallelization
+- `services.thanos.downsample` - dedicated downsampling service
+
+**Downsampling Behavior:**
+- Raw resolution kept for configurable period (default: indefinite)
+- 5-minute resolution created for blocks older than 40 hours
+- 1-hour resolution created for blocks older than 10 days
+
+**Retention Configuration (in compactor):**
+```nix
+services.thanos.compact = {
+  retention.resolution-raw = "30d";   # Keep raw for 30 days
+  retention.resolution-5m = "180d";   # Keep 5m samples for 6 months
+  retention.resolution-1h = "2y";     # Keep 1h samples for 2 years
+};
+```
+
+**Pros:**
+- True downsampling - with the retention settings above, older data uses progressively less storage
+- Keep metrics for years with minimal storage impact
+- Prometheus continues running largely unchanged (the sidecar requires local compaction to be disabled)
+- Existing Alertmanager integration preserved
+
+**Cons:**
+- Requires an object storage backend (MinIO, S3, or the local filesystem backend)
+- Multiple services to manage (sidecar, compactor, store gateway, query)
+- More complex architecture
+- Additional infrastructure (e.g. MinIO) is needed unless the filesystem backend is used
+
+**Required Components:**
+1. Thanos Sidecar (runs alongside Prometheus)
+2. Object storage (MinIO or local filesystem)
+3. Thanos Compactor (handles downsampling)
+4. Thanos Query (provides unified query endpoint), plus Thanos Store (`services.thanos.store`) to serve blocks from object storage
+
+**Migration Steps:**
+1. Deploy object storage (MinIO or configure filesystem backend)
+2. Add Thanos sidecar pointing to Prometheus data directory
+3. Add Thanos compactor with retention policies
+4. Add Thanos store gateway and query gateway
+5. Update Grafana datasource to Thanos Query port (10902)
+
+## Comparison
+
+| Aspect | VictoriaMetrics | Thanos |
+|--------|-----------------|--------|
+| Complexity | Low (1 service) | Higher (4+ services) |
+| Downsampling | No | Yes (automatic) |
+| Storage savings | 5-10x compression | Compression + downsampling |
+| Object storage required | No | Yes (bucket or filesystem backend) |
+| Migration effort | Minimal | Moderate |
+| Grafana changes | Change port only | Change port only |
+| Alerting changes | Need vmalert | Keep existing |
+
+## Recommendation
+
+**Start with VictoriaMetrics** for simplicity. At 5-10x compression, the disk space currently used for 30 days could hold roughly 5-10 months of data.
+
+If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires an object storage backend (such as MinIO), which adds operational complexity.
+
+## References
+
+- VictoriaMetrics docs: https://docs.victoriametrics.com/
+- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
+- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)