docs: add long-term metrics storage plan

Compare VictoriaMetrics and Thanos as options for extending metrics retention beyond 30 days while managing disk usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 07:56:10 +01:00
parent f805b9f629
commit b03a9b3b64
1 changed files with 122 additions and 0 deletions
--- a/docs/plans/long-term-metrics-storage.md
+++ b/docs/plans/long-term-metrics-storage.md
@@ -0,0 +1,122 @@
 # Long-Term Metrics Storage Options
 ## Problem Statement
 The current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention raises disk usage concerns because the homelab hypervisor has limited local storage.
 Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
 ## Current Configuration
 Location: `services/monitoring/prometheus.nix`
 - **Retention**: 30 days
 - **Scrape interval**: 15s
 - **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
 - **Storage**: Local disk on monitoring01
 ## Options Evaluated
 ### Option 1: VictoriaMetrics
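 VictoriaMetrics is a Prometheus-compatible TSDB that typically compresses samples more efficiently than Prometheus. This plan assumes a 5-10x smaller storage footprint; the actual ratio depends on the workload and should be measured on monitoring01.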
 **NixOS Options Available:**
 - `services.victoriametrics.enable`
 - `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
 - `services.victoriametrics.retentionPeriod` - e.g., "180d" for roughly 6 months (documented suffixes are h, d, w, and y; a bare number such as "6" is read as months)
 - `services.vmagent` - dedicated scraping agent
 - `services.vmalert` - alerting rules evaluation
 **Pros:**
 - Simple migration - single service replacement
 - Same PromQL query language - Grafana dashboards work unchanged
 - Same scrape config format - existing auto-generated configs work as-is
 - If the 5-10x compression estimate holds, the disk space that holds 30 days of Prometheus data could hold roughly 150-300 days
 - Lightweight, single binary
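 The retention estimate in the list above is simple multiplication. As a sketch, assuming disk usage scales linearly with retention and that the 5-10x ratio holds (it is an estimate, not a measurement from this setup), it can be checked by saving the expression to a file and running `nix-instantiate --eval --strict` on it:
 ```nix
 # Hypothetical helper expression, not part of the repository.
 let
   currentRetentionDays = 30;             # from services/monitoring/prometheus.nix
   assumedCompressionRatios = [ 5 10 ];   # assumed range; measure before relying on it
 in
 # Days of retention that fit in the same disk space at each assumed ratio.
 map (ratio: currentRetentionDays * ratio) assumedCompressionRatios
 # evaluates to [ 150 300 ]
 ```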
 **Cons:**
 - No automatic downsampling (relies on compression alone)
 - Alerting rule evaluation moves from Prometheus to vmalert; Alertmanager stays in place for notification routing
 - Would need to migrate existing data or start fresh
 **Migration Steps:**
 1. Replace `services.prometheus` with `services.victoriametrics`
 2. Move scrape configs to `prometheusConfig`
 3. Set up `services.vmalert` for alerting rules
 4. Update Grafana datasource to VictoriaMetrics port (8428)
 5. Keep Alertmanager for notification routing
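 The migration steps above could look roughly like the following NixOS module. This is a minimal sketch, not the final configuration: the `node` job, its target `monitoring01:9100`, and the datasource name are hypothetical placeholders, and the Grafana block assumes Grafana is managed by NixOS on the same host. vmalert (step 3) is left out because its option layout should be checked against the pinned nixpkgs revision first.
 ```nix
 { ... }:
 {
   # Step 1: VictoriaMetrics takes over from services.prometheus as the TSDB.
   services.victoriametrics = {
     enable = true;
     # Roughly 6 months; an assumed value to adjust once disk usage is measured.
     retentionPeriod = "180d";
     # Step 2: scrape configs in Prometheus format.
     prometheusConfig = {
       scrape_configs = [
         {
           job_name = "node";  # hypothetical job name
           static_configs = [
             { targets = [ "monitoring01:9100" ]; }  # hypothetical target
           ];
         }
       ];
     };
   };
   # Step 4: point Grafana at the VictoriaMetrics HTTP port.
   services.grafana.provision = {
     enable = true;
     datasources.settings.datasources = [
       {
         name = "VictoriaMetrics";       # hypothetical datasource name
         type = "prometheus";            # VictoriaMetrics serves the Prometheus query API
         url = "http://localhost:8428";
       }
     ];
   };
 }
 ```
 In the real module, the auto-generated scrape configs from flake hosts would replace the single placeholder job.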
 ### Option 2: Thanos
 Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
 **NixOS Options Available:**
 - `services.thanos.sidecar` - uploads Prometheus blocks to object storage
 - `services.thanos.compact` - compacts and downsamples data
 - `services.thanos.store` - store gateway that serves blocks from object storage
 - `services.thanos.query` - unified query gateway
 - `services.thanos.query-frontend` - query caching and parallelization
 - `services.thanos.downsample` - dedicated downsampling service
 **Downsampling Behavior:**
 - Raw resolution kept for configurable period (default: indefinite)
 - 5-minute resolution created after 40 hours
 - 1-hour resolution created after 10 days
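 To give a feel for what each resolution means, the sketch below counts time windows per series per day, assuming the 15s scrape interval from the current configuration. It counts windows, not bytes: Thanos stores several aggregates per downsampled window, so on-disk savings are smaller than these ratios suggest.
 ```nix
 # Hypothetical helper expression, not part of the repository.
 let
   secondsPerDay = 24 * 60 * 60;
   # Number of windows of the given length in one day.
   windowsPerDay = stepSeconds: secondsPerDay / stepSeconds;
 in
 {
   raw = windowsPerDay 15;           # 5760 samples at the 15s scrape interval
   fiveMinute = windowsPerDay 300;   # 288 windows at 5-minute resolution
   oneHour = windowsPerDay 3600;     # 24 windows at 1-hour resolution
 }
 ```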
 **Retention Configuration (in compactor):**
 ```nix
 services.thanos.compact = {
  retention.resolution-raw = "30d";   # Keep raw for 30 days
  retention.resolution-5m = "180d";   # Keep 5m samples for 6 months
  retention.resolution-1h = "2y";     # Keep 1h samples for 2 years
 };
 ```
 **Pros:**
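 - True downsampling - older data uses progressively less storage once raw and 5-minute blocks age out under the compactor retention settings (until then, downsampled blocks are stored in addition to the raw blocks)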
 - Keep metrics for years with minimal storage impact
 - Prometheus continues running unchanged
 - Existing Alertmanager integration preserved
 **Cons:**
 - Requires object storage (MinIO, S3, or local filesystem)
 - Multiple services to manage (sidecar, compactor, store gateway, query)
 - More complex architecture
 - Additional infrastructure (MinIO) may be needed
 **Required Components:**
 1. Thanos Sidecar (runs alongside Prometheus)
 2. Object storage (MinIO or local filesystem)
 3. Thanos Compactor (handles downsampling)
 4. Thanos Store Gateway (serves blocks from object storage to Query)
 5. Thanos Query (provides unified query endpoint)
 **Migration Steps:**
 1. Deploy object storage (MinIO or configure filesystem backend)
 2. Add Thanos sidecar pointing to Prometheus data directory
 3. Add Thanos compactor with retention policies
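 4. Add Thanos store gateway so that blocks in object storage can be queried
 5. Add Thanos query gateway, connected to the sidecar and the store gateway
 6. Update Grafana datasource to Thanos Query port (10902)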
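 The steps above might translate into a module along these lines. Treat it as a sketch under several assumptions: the bucket directory and all listen addresses other than the Thanos defaults (10901 gRPC, 10902 HTTP) are hypothetical, the `http-address`, `grpc-address`, and `query.endpoints` option names should be verified against the pinned nixpkgs revision, and Prometheus-side settings (external labels, disabling local compaction) and file ownership of the bucket directory are not shown.
 ```nix
 { ... }:
 let
   # Filesystem-backed bucket shared by all components (hypothetical path).
   objstoreConfig = {
     type = "FILESYSTEM";
     config.directory = "/var/lib/thanos/objstore";
   };
 in
 {
   services.thanos = {
     # Sidecar uploads finished Prometheus blocks to the bucket.
     sidecar = {
       enable = true;
       objstore.config = objstoreConfig;
       grpc-address = "127.0.0.1:10901";
       http-address = "127.0.0.1:10911";  # moved off 10902 to avoid clashing with query
     };
     # Compactor applies downsampling and the retention policy.
     compact = {
       enable = true;
       objstore.config = objstoreConfig;
       http-address = "127.0.0.1:10912";
       retention.resolution-raw = "30d";
       retention.resolution-5m = "180d";
       retention.resolution-1h = "2y";
     };
     # Store gateway exposes blocks in the bucket to the query gateway.
     store = {
       enable = true;
       objstore.config = objstoreConfig;
       grpc-address = "127.0.0.1:10903";
       http-address = "127.0.0.1:10913";
     };
     # Query gateway fans out to the sidecar (recent data) and store gateway (history).
     query = {
       enable = true;
       http-address = "127.0.0.1:10902";  # the port Grafana is pointed at
       grpc-address = "127.0.0.1:10904";
       endpoints = [ "127.0.0.1:10901" "127.0.0.1:10903" ];  # option name is an assumption
     };
   };
 }
 ```
 All components run on one host here, which is why each needs its own listen address; with MinIO or S3 only `objstoreConfig` would change.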
 ## Comparison
 | Aspect | VictoriaMetrics | Thanos |
 |--------|-----------------|--------|
 | Complexity | Low (1 service) | Higher (4 services plus object storage) |
 | Downsampling | No | Yes (automatic) |
 | Storage savings | 5-10x compression | Compression + downsampling |
 | Object storage required | No | Yes |
 | Migration effort | Minimal | Moderate |
 | Grafana changes | Change port only | Change port only |
 | Alerting changes | Need vmalert | Keep existing |
 ## Recommendation
 **Start with VictoriaMetrics** for simplicity. If the 5-10x compression estimate holds, compression alone would provide roughly 5-10 months of retention in the disk space currently used for 30 days.
 If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires an object storage backend (such as MinIO) and several additional services, which adds operational complexity.
 ## References
 - VictoriaMetrics docs: https://docs.victoriametrics.com/
 - Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
 - NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)