homelab-deploy: enable prometheus metrics

- Update homelab-deploy input to get metrics support
- Enable metrics endpoint on port 9972
- Add scrape target for prometheus auto-discovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
6 changed files with 190 additions and 4 deletions
--- a/.mcp.json
+++ b/.mcp.json
@@ -22,6 +22,17 @@
        "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
        "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
      }
+    },
+    "homelab-deploy": {
+      "command": "nix",
+      "args": [
+        "run",
+        "git+https://git.t-juice.club/torjus/homelab-deploy",
+        "--",
+        "mcp",
+        "--nats-url", "nats://nats1.home.2rjus.net:4222",
+        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
+      ]
    }
  }
 }
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -194,6 +194,51 @@ node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
 node_filesystem_avail_bytes{mountpoint="/"}
 ```

+### Deploying to Test Hosts
+
+The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.
+
+**Available Tools:**
+
+- `deploy` - Deploy NixOS configuration to test-tier hosts
+- `list_hosts` - List available deployment targets
+
+**Deploy Parameters:**
+
+- `hostname` - Target a specific host (e.g., `vaulttest01`)
+- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
+- `all` - Deploy to all test-tier hosts
+- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
+- `branch` - Git branch or commit to deploy (default: `master`)
+
+**Examples:**
+
+```
+# List available hosts
+list_hosts()
+
+# Deploy to a specific host
+deploy(hostname="vaulttest01", action="switch")
+
+# Dry-run deployment
+deploy(hostname="vaulttest01", action="dry-activate")
+
+# Deploy to all hosts with a role
+deploy(role="vault", action="switch")
+```
+
+**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
+
+**Verifying Deployments:**
+
+After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
+
+```promql
+nixos_flake_info{instance=~"vaulttest01.*"}
+```
+
+The `current_rev` label contains the git commit hash of the deployed flake configuration.
+
 ## Architecture

 ### Directory Structure
--- a/docs/plans/long-term-metrics-storage.md
+++ b/docs/plans/long-term-metrics-storage.md
@@ -0,0 +1,122 @@
+# Long-Term Metrics Storage Options
+
+## Problem Statement
+
+Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
+
+Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
+
+## Current Configuration
+
+Location: `services/monitoring/prometheus.nix`
+
+- **Retention**: 30 days
+- **Scrape interval**: 15s
+- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
+- **Storage**: Local disk on monitoring01
+
+## Options Evaluated
+
+### Option 1: VictoriaMetrics
+
+VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
+
+**NixOS Options Available:**
+- `services.victoriametrics.enable`
+- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
+- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
+- `services.vmagent` - dedicated scraping agent
+- `services.vmalert` - alerting rules evaluation
+
+**Pros:**
+- Simple migration - single service replacement
+- Same PromQL query language - Grafana dashboards work unchanged
+- Same scrape config format - existing auto-generated configs work as-is
+- 5-10x better compression means 30 days of Prometheus data could become 180+ days
+- Lightweight, single binary
+
+**Cons:**
+- No automatic downsampling (relies on compression alone)
+- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
+- Would need to migrate existing data or start fresh
+
+**Migration Steps:**
+1. Replace `services.prometheus` with `services.victoriametrics`
+2. Move scrape configs to `prometheusConfig`
+3. Set up `services.vmalert` for alerting rules
+4. Update Grafana datasource to VictoriaMetrics port (8428)
+5. Keep Alertmanager for notification routing
+
+### Option 2: Thanos
+
+Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
+
+**NixOS Options Available:**
+- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
+- `services.thanos.compact` - compacts and downsamples data
+- `services.thanos.query` - unified query gateway
+- `services.thanos.query-frontend` - query caching and parallelization
+- `services.thanos.downsample` - dedicated downsampling service
+
+**Downsampling Behavior:**
+- Raw resolution kept for configurable period (default: indefinite)
+- 5-minute resolution created after 40 hours
+- 1-hour resolution created after 10 days
+
+**Retention Configuration (in compactor):**
+```nix
+services.thanos.compact = {
+  retention.resolution-raw = "30d";   # Keep raw for 30 days
+  retention.resolution-5m = "180d";   # Keep 5m samples for 6 months
+  retention.resolution-1h = "2y";     # Keep 1h samples for 2 years
+};
+```
+
+**Pros:**
+- True downsampling - older data uses progressively less storage
+- Keep metrics for years with minimal storage impact
+- Prometheus continues running unchanged
+- Existing Alertmanager integration preserved
+
+**Cons:**
+- Requires object storage (MinIO, S3, or local filesystem)
+- Multiple services to manage (sidecar, compactor, query)
+- More complex architecture
+- Additional infrastructure (MinIO) may be needed
+
+**Required Components:**
+1. Thanos Sidecar (runs alongside Prometheus)
+2. Object storage (MinIO or local filesystem)
+3. Thanos Compactor (handles downsampling)
+4. Thanos Query (provides unified query endpoint)
+
+**Migration Steps:**
+1. Deploy object storage (MinIO or configure filesystem backend)
+2. Add Thanos sidecar pointing to Prometheus data directory
+3. Add Thanos compactor with retention policies
+4. Add Thanos query gateway
+5. Update Grafana datasource to Thanos Query port (10902)
+
+## Comparison
+
+| Aspect | VictoriaMetrics | Thanos |
+|--------|-----------------|--------|
+| Complexity | Low (1 service) | Higher (3-4 services) |
+| Downsampling | No | Yes (automatic) |
+| Storage savings | 5-10x compression | Compression + downsampling |
+| Object storage required | No | Yes |
+| Migration effort | Minimal | Moderate |
+| Grafana changes | Change port only | Change port only |
+| Alerting changes | Need vmalert | Keep existing |
+
+## Recommendation
+
+**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
+
+If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
+
+## References
+
+- VictoriaMetrics docs: https://docs.victoriametrics.com/
+- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
+- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770443536,
-        "narHash": "sha256-UufZIVggiioMFDSjKx+ifgkDOk9alNSiRmkvc4/+HIA=",
+        "lastModified": 1770447502,
+        "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
        "ref": "master",
-        "rev": "95b795dcfd86b7b36045bba67e536b3a1c61dd33",
-        "revCount": 20,
+        "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
+        "revCount": 22,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
--- a/hosts/vaulttest01/configuration.nix
+++ b/hosts/vaulttest01/configuration.nix
@@ -81,6 +81,7 @@ in
    vim
    wget
    git
+    htop # test deploy verification
  ];

  # Open ports in the firewall.
--- a/system/homelab-deploy.nix
+++ b/system/homelab-deploy.nix
@@ -19,8 +19,15 @@ in
      natsUrl = "nats://nats1.home.2rjus.net:4222";
      nkeyFile = "/run/secrets/homelab-deploy-nkey";
      flakeUrl = "git+https://git.t-juice.club/torjus/nixos-servers.git";
+      metrics.enable = true;
    };

+    # Expose metrics for Prometheus scraping
+    homelab.monitoring.scrapeTargets = [{
+      job_name = "homelab-deploy";
+      port = 9972;
+    }];
+
    # Ensure listener starts after vault secret is available
    systemd.services.homelab-deploy-listener = {
      after = [ "vault-secret-homelab-deploy-nkey.service" ];
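
Review note: the "Verifying Deployments" step added to CLAUDE.md in this change (querying `nixos_flake_info` and reading its `current_rev` label) could be scripted roughly as follows. This is a minimal sketch, not part of the change itself. It assumes the metric is reachable through the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL `PROMETHEUS_URL` and the helper names are hypothetical, since the diff does not state where Prometheus listens.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL of the Prometheus HTTP API; the diff does not state it.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, host: str) -> str:
    """Build an instant-query URL for nixos_flake_info on one host."""
    # Same selector as the PromQL example in CLAUDE.md.
    promql = f'nixos_flake_info{{instance=~"{host}.*"}}'
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def deployed_revs(response: dict) -> list[str]:
    """Extract the current_rev label from each series in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    return [s["metric"].get("current_rev", "") for s in response["data"]["result"]]


def is_deployed(response: dict, expected_rev: str) -> bool:
    """True if at least one series exists and every one reports a rev starting with expected_rev."""
    revs = deployed_revs(response)
    return bool(revs) and all(rev.startswith(expected_rev) for rev in revs)


def check_host(host: str, expected_rev: str, base_url: str = PROMETHEUS_URL) -> bool:
    """Query Prometheus and compare the deployed revision (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, host), timeout=10) as resp:
        return is_deployed(json.load(resp), expected_rev)
```

A call such as `check_host("vaulttest01", "<expected commit hash>")` after `deploy(hostname="vaulttest01", action="switch")` would then report whether the host has picked up the intended revision. Prefix matching is used so that an abbreviated commit hash can be passed.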
Author	SHA1	Message	Checks	Date
Torjus Håkestad	26ca6817f0	homelab-deploy: enable prometheus metrics. Update homelab-deploy input to get metrics support; enable metrics endpoint on port 9972; add scrape target for prometheus auto-discovery. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	nix flake check / flake-check (push): failing after 3m57s	2026-02-07 08:04:23 +01:00
Torjus Håkestad	b03a9b3b64	docs: add long-term metrics storage plan. Compare VictoriaMetrics and Thanos as options for extending metrics retention beyond 30 days while managing disk usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	(none listed)	2026-02-07 07:56:10 +01:00
Torjus Håkestad	f805b9f629	mcp: add homelab-deploy MCP server. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	nix flake check / flake-check (push): failing after 4m20s	2026-02-07 07:27:12 +01:00
Torjus Håkestad	f3adf7e77f	CLAUDE.md: add homelab-deploy MCP documentation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	nix flake check / flake-check (push): cancelled	2026-02-07 07:25:44 +01:00
Torjus Håkestad	f6eca9decc	vaulttest01: add htop for deploy verification test. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	nix flake check / flake-check (push): successful in 2m3s	2026-02-07 07:23:22 +01:00