From b03a9b3b6437f5f7afe9ebdbf1af51b072fdd14a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Torjus=20H=C3=A5kestad?= <torjus@usit.uio.no>
Date: Sat, 7 Feb 2026 07:56:10 +0100
Subject: [PATCH] docs: add long-term metrics storage plan

Compare VictoriaMetrics and Thanos as options for extending
metrics retention beyond 30 days while managing disk usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 docs/plans/long-term-metrics-storage.md | 122 ++++++++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 docs/plans/long-term-metrics-storage.md

diff --git a/docs/plans/long-term-metrics-storage.md b/docs/plans/long-term-metrics-storage.md
new file mode 100644
index 0000000..160e75c
--- /dev/null
+++ b/docs/plans/long-term-metrics-storage.md
@@ -0,0 +1,122 @@
+# Long-Term Metrics Storage Options
+
+## Problem Statement
+
+The current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor, which has limited local storage.
+
+Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
+
+## Current Configuration
+
+Location: `services/monitoring/prometheus.nix`
+
+- **Retention**: 30 days
+- **Scrape interval**: 15s
+- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
+- **Storage**: Local disk on monitoring01
+
+## Options Evaluated
+
+### Option 1: VictoriaMetrics
+
+VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression. This plan assumes a 5-10x smaller storage footprint; actual savings depend on the data and should be measured on this setup.
+
+**NixOS Options Available:**
+- `services.victoriametrics.enable`
+- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
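+- `services.victoriametrics.retentionPeriod` - e.g., "180d" for roughly 6 months (do not use "6m": a bare number is counted in months, and the supported suffixes are s, h, d, w, y)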
+- `services.vmagent` - dedicated scraping agent
+- `services.vmalert` - alerting rules evaluation
+
+**Pros:**
+- Simple migration - single service replacement
+- PromQL-compatible query language (MetricsQL) - existing Grafana dashboards should work largely unchanged
+- Same scrape config format - existing auto-generated configs work as-is
+- If the 5-10x compression estimate holds, the disk space used by 30 days of Prometheus data could hold roughly 150-300 days
+- Lightweight, single binary
+
+**Cons:**
+- No automatic downsampling in the open-source edition (relies on compression alone)
+- Alerting rule evaluation moves from Prometheus to vmalert; Alertmanager stays for notification routing
+- Would need to migrate existing data or start fresh
+
+**Migration Steps:**
+1. Replace `services.prometheus` with `services.victoriametrics`
+2. Move scrape configs to `prometheusConfig`
+3. Set up `services.vmalert` for alerting rules
+4. Update Grafana datasource to VictoriaMetrics port (8428)
+5. Keep Alertmanager for notification routing, with vmalert sending alerts to it
+
+### Option 2: Thanos
+
+Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
+
+**NixOS Options Available:**
+- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
+- `services.thanos.compact` - compacts and downsamples data
+- `services.thanos.store` and `services.thanos.query` - store gateway for data in object storage, and unified query gateway
+- `services.thanos.query-frontend` - query caching and parallelization
+- `services.thanos.downsample` - dedicated downsampling service
+
+**Downsampling Behavior:**
+- Raw resolution kept for configurable period (default: indefinite)
+- 5-minute resolution created after 40 hours
+- 1-hour resolution created after 10 days
+
+**Retention Configuration (in compactor):**
+```nix
+services.thanos.compact = {
+  retention.resolution-raw = "30d";   # Keep raw for 30 days
+  retention.resolution-5m = "180d";   # Keep 5m samples for 6 months
+  retention.resolution-1h = "2y";     # Keep 1h samples for 2 years
+};
+```
+
+**Pros:**
+- True downsampling - older data uses progressively less storage once raw samples expire (downsampled blocks are stored alongside raw blocks until then)
+- Keep metrics for years with minimal storage impact
+- Prometheus continues running as the scraper, with local compaction disabled so the sidecar can upload its blocks
+- Existing Alertmanager integration preserved
+
+**Cons:**
+- Requires object storage (MinIO, S3, or local filesystem)
+- Multiple services to manage (sidecar, compactor, store gateway, query)
+- More complex architecture
+- Additional infrastructure (MinIO) may be needed
+
+**Required Components:**
+1. Thanos Sidecar (runs alongside Prometheus)
+2. Object storage (MinIO or local filesystem)
+3. Thanos Compactor (handles downsampling)
+4. Thanos Store Gateway and Thanos Query (serve blocks from object storage behind a unified query endpoint)
+
+**Migration Steps:**
+1. Deploy object storage (MinIO or configure filesystem backend)
+2. Add Thanos sidecar pointing to Prometheus data directory
+3. Add Thanos compactor with retention policies
+4. Add Thanos store gateway and query gateway
+5. Update Grafana datasource to Thanos Query port (10902)
+
+## Comparison
+
+| Aspect | VictoriaMetrics | Thanos |
+|--------|-----------------|--------|
+| Complexity | Low (1 service, plus vmalert) | Higher (4 services plus object storage) |
+| Downsampling | No | Yes (automatic) |
+| Storage savings | 5-10x compression | Compression + downsampling |
+| Object storage required | No | Yes |
+| Migration effort | Minimal | Moderate |
+| Grafana changes | Change port only | Change port only |
+| Alerting changes | Need vmalert | Keep existing |
+
+## Recommendation
+
+**Start with VictoriaMetrics** for simplicity. If the 5-10x compression estimate holds, the disk space currently used for 30 days could hold roughly 5-10 months of data.
+
+If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (such as MinIO), which adds operational complexity.
+
+## References
+
+- VictoriaMetrics docs: https://docs.victoriametrics.com/
+- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
+- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)