flake: update homelab-deploy

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
homelab: add deploy.enable option with assertion
2026-02-07 06:53:13 +01:00 · 2026-02-07 06:47:12 +01:00 · 2026-02-07 06:41:03 +01:00 · 2026-02-07 06:27:21 +01:00 · 2026-02-07 06:20:14 +01:00 · 2026-02-07 06:11:37 +01:00
6 changed files with 4 additions and 190 deletions
--- a/.mcp.json
+++ b/.mcp.json
@@ -22,17 +22,6 @@
        "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
        "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
      }
-    },
-    "homelab-deploy": {
-      "command": "nix",
-      "args": [
-        "run",
-        "git+https://git.t-juice.club/torjus/homelab-deploy",
-        "--",
-        "mcp",
-        "--nats-url", "nats://nats1.home.2rjus.net:4222",
-        "--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
-      ]
    }
  }
 }
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -194,51 +194,6 @@ node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
 node_filesystem_avail_bytes{mountpoint="/"}
 ```
-### Deploying to Test Hosts
-The **homelab-deploy** MCP server enables remote deployments to test-tier hosts via NATS messaging.
-**Available Tools:**
-- `deploy` - Deploy NixOS configuration to test-tier hosts
-- `list_hosts` - List available deployment targets
-**Deploy Parameters:**
-- `hostname` - Target a specific host (e.g., `vaulttest01`)
-- `role` - Deploy to all hosts with a specific role (e.g., `vault`)
-- `all` - Deploy to all test-tier hosts
-- `action` - nixos-rebuild action: `switch` (default), `boot`, `test`, `dry-activate`
-- `branch` - Git branch or commit to deploy (default: `master`)
-**Examples:**
-```
-# List available hosts
-list_hosts()
-# Deploy to a specific host
-deploy(hostname="vaulttest01", action="switch")
-# Dry-run deployment
-deploy(hostname="vaulttest01", action="dry-activate")
-# Deploy to all hosts with a role
-deploy(role="vault", action="switch")
-```
-**Note:** Only test-tier hosts with `homelab.deploy.enable = true` and the listener service running will respond to deployments.
-**Verifying Deployments:**
-After deploying, use the `nixos_flake_info` metric from nixos-exporter to verify the host is running the expected revision:
-```promql
-nixos_flake_info{instance=~"vaulttest01.*"}
-```
-The `current_rev` label contains the git commit hash of the deployed flake configuration.
 ## Architecture
 ### Directory Structure
--- a/docs/plans/long-term-metrics-storage.md
+++ b/docs/plans/long-term-metrics-storage.md
@@ -1,122 +0,0 @@
-# Long-Term Metrics Storage Options
-## Problem Statement
-Current Prometheus configuration retains metrics for 30 days (`retentionTime = "30d"`). Extending retention further raises disk usage concerns on the homelab hypervisor with limited local storage.
-Prometheus does not support downsampling - it stores all data at full resolution until the retention period expires, then deletes it entirely.
-## Current Configuration
-Location: `services/monitoring/prometheus.nix`
-- **Retention**: 30 days
-- **Scrape interval**: 15s
-- **Features**: Alertmanager, Pushgateway, auto-generated scrape configs from flake hosts
-- **Storage**: Local disk on monitoring01
-## Options Evaluated
-### Option 1: VictoriaMetrics
-VictoriaMetrics is a Prometheus-compatible TSDB with significantly better compression (5-10x smaller storage footprint).
-**NixOS Options Available:**
-- `services.victoriametrics.enable`
-- `services.victoriametrics.prometheusConfig` - accepts Prometheus scrape config format
-- `services.victoriametrics.retentionPeriod` - e.g., "6m" for 6 months
-- `services.vmagent` - dedicated scraping agent
-- `services.vmalert` - alerting rules evaluation
-**Pros:**
-- Simple migration - single service replacement
-- Same PromQL query language - Grafana dashboards work unchanged
-- Same scrape config format - existing auto-generated configs work as-is
-- 5-10x better compression means 30 days of Prometheus data could become 180+ days
-- Lightweight, single binary
-**Cons:**
-- No automatic downsampling (relies on compression alone)
-- Alerting requires switching to vmalert instead of Prometheus alertmanager integration
-- Would need to migrate existing data or start fresh
-**Migration Steps:**
-1. Replace `services.prometheus` with `services.victoriametrics`
-2. Move scrape configs to `prometheusConfig`
-3. Set up `services.vmalert` for alerting rules
-4. Update Grafana datasource to VictoriaMetrics port (8428)
-5. Keep Alertmanager for notification routing
-### Option 2: Thanos
-Thanos extends Prometheus with long-term storage and automatic downsampling by uploading data to object storage.
-**NixOS Options Available:**
-- `services.thanos.sidecar` - uploads Prometheus blocks to object storage
-- `services.thanos.compact` - compacts and downsamples data
-- `services.thanos.query` - unified query gateway
-- `services.thanos.query-frontend` - query caching and parallelization
-- `services.thanos.downsample` - dedicated downsampling service
-**Downsampling Behavior:**
-- Raw resolution kept for configurable period (default: indefinite)
-- 5-minute resolution created after 40 hours
-- 1-hour resolution created after 10 days
-**Retention Configuration (in compactor):**
-```nix
-services.thanos.compact = {
-  retention.resolution-raw = "30d";   # Keep raw for 30 days
-  retention.resolution-5m = "180d";   # Keep 5m samples for 6 months
-  retention.resolution-1h = "2y";     # Keep 1h samples for 2 years
-};
-```
-**Pros:**
-- True downsampling - older data uses progressively less storage
-- Keep metrics for years with minimal storage impact
-- Prometheus continues running unchanged
-- Existing Alertmanager integration preserved
-**Cons:**
-- Requires object storage (MinIO, S3, or local filesystem)
-- Multiple services to manage (sidecar, compactor, query)
-- More complex architecture
-- Additional infrastructure (MinIO) may be needed
-**Required Components:**
-1. Thanos Sidecar (runs alongside Prometheus)
-2. Object storage (MinIO or local filesystem)
-3. Thanos Compactor (handles downsampling)
-4. Thanos Query (provides unified query endpoint)
-**Migration Steps:**
-1. Deploy object storage (MinIO or configure filesystem backend)
-2. Add Thanos sidecar pointing to Prometheus data directory
-3. Add Thanos compactor with retention policies
-4. Add Thanos query gateway
-5. Update Grafana datasource to Thanos Query port (10902)
-## Comparison
-| Aspect | VictoriaMetrics | Thanos |
-|--------|-----------------|--------|
-| Complexity | Low (1 service) | Higher (3-4 services) |
-| Downsampling | No | Yes (automatic) |
-| Storage savings | 5-10x compression | Compression + downsampling |
-| Object storage required | No | Yes |
-| Migration effort | Minimal | Moderate |
-| Grafana changes | Change port only | Change port only |
-| Alerting changes | Need vmalert | Keep existing |
-## Recommendation
-**Start with VictoriaMetrics** for simplicity. The compression alone may provide 6+ months of retention in the same disk space currently used for 30 days.
-If multi-year retention with true downsampling becomes necessary, Thanos can be evaluated later. However, it requires deploying object storage infrastructure (MinIO) which adds operational complexity.
-## References
-- VictoriaMetrics docs: https://docs.victoriametrics.com/
-- Thanos docs: https://thanos.io/tip/thanos/getting-started.md/
-- NixOS options searched from nixpkgs revision e576e3c9 (NixOS 25.11)
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770447502,
+        "lastModified": 1770443536,
-        "narHash": "sha256-xH1PNyE3ydj4udhe1IpK8VQxBPZETGLuORZdSWYRmSU=",
+        "narHash": "sha256-UufZIVggiioMFDSjKx+ifgkDOk9alNSiRmkvc4/+HIA=",
        "ref": "master",
-        "rev": "79db119d1ca6630023947ef0a65896cc3307c2ff",
+        "rev": "95b795dcfd86b7b36045bba67e536b3a1c61dd33",
-        "revCount": 22,
+        "revCount": 20,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
--- a/hosts/vaulttest01/configuration.nix
+++ b/hosts/vaulttest01/configuration.nix
@@ -81,7 +81,6 @@ in
    vim
    wget
    git
-    htop # test deploy verification
  ];
  # Open ports in the firewall.
--- a/system/homelab-deploy.nix
+++ b/system/homelab-deploy.nix
@@ -19,15 +19,8 @@ in
      natsUrl = "nats://nats1.home.2rjus.net:4222";
      nkeyFile = "/run/secrets/homelab-deploy-nkey";
      flakeUrl = "git+https://git.t-juice.club/torjus/nixos-servers.git";
-      metrics.enable = true;
    };
-    # Expose metrics for Prometheus scraping
-    homelab.monitoring.scrapeTargets = [{
-      job_name = "homelab-deploy";
-      port = 9972;
-    }];
    # Ensure listener starts after vault secret is available
    systemd.services.homelab-deploy-listener = {
      after = [ "vault-secret-homelab-deploy-nkey.service" ];
Author	SHA1	Message	Checks	Date
Torjus Håkestad	2669b10f0e	flake: update homelab-deploy · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	Some checks failed: Run nix flake check / flake-check (push), has been cancelled	2026-02-07 06:53:13 +01:00
Torjus Håkestad	db6d610e16	homelab: add deploy.enable option with assertion · - Add homelab.deploy.enable option (requires vault.enable) - Create shared homelab-deploy Vault policy for all hosts - Enable homelab.deploy on all vault-enabled hosts · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	All checks were successful: Run nix flake check / flake-check (push), successful in 2m3s	2026-02-07 06:47:12 +01:00
Torjus Håkestad	e4eb8afe5c	system: enable homelab-deploy listener for all vault hosts · Add system/homelab-deploy.nix module that automatically enables the listener on all hosts with vault.enable=true. Uses homelab.host.tier and homelab.host.role for NATS subject subscriptions. - Add homelab-deploy access to all host AppRole policies - Remove manual listener config from vaulttest01 (now handled by system module) · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	All checks were successful: Run nix flake check / flake-check (push), successful in 2m4s	2026-02-07 06:41:03 +01:00
Torjus Håkestad	df9246a0f8	flake: update homelab-deploy · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	Some checks failed: Run nix flake check / flake-check (push), failing after 12m46s	2026-02-07 06:27:21 +01:00
Torjus Håkestad	ec3b87f7fa	flake: update homelab-deploy · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	All checks were successful: Run nix flake check / flake-check (push), successful in 2m5s	2026-02-07 06:20:14 +01:00
Torjus Håkestad	913fa11c64	flake: update homelab-deploy · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	Some checks failed: Run nix flake check / flake-check (push), failing after 3m39s	2026-02-07 06:11:37 +01:00
Torjus Håkestad	3e85e2527f	flake: update homelab-deploy · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	All checks were successful: Run nix flake check / flake-check (push), successful in 2m7s	2026-02-07 05:58:26 +01:00
Torjus Håkestad	543ca18b14	flake: update homelab-deploy · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	All checks were successful: Run nix flake check / flake-check (push), successful in 2m6s	2026-02-07 05:54:05 +01:00
Torjus Håkestad	c83218b3bc	flake: update homelab-deploy, add to devShell · Update homelab-deploy to include bugfix. Add CLI to devShell for easier testing and deployment operations. · Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	All checks were successful: Run nix flake check / flake-check (push), successful in 2m6s	2026-02-07 05:45:54 +01:00