monitoring02: enable alerting and migrate CNAMEs from http-proxy #42
docs/plans/completed/monitoring-migration-victoriametrics.md (new file, 156 lines)

@@ -0,0 +1,156 @@
# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring02** (10.69.13.24) - **PRIMARY**:

- 4 CPU cores, 8GB RAM, 60GB disk
- VictoriaMetrics with 3-month retention
- vmalert with alerting enabled (routes to local Alertmanager)
- Alertmanager -> alerttonotify -> NATS notification pipeline
- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
- Loki (log aggregation)
- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki

**monitoring01** (10.69.13.13) - **SHUT DOWN**:

- No longer running, pending decommission

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:

- Single-binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

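The retention arithmetic behind the compression claim is straightforward; this sketch just multiplies it out, using the 5-10x ratios quoted above (the numbers are illustrative, not measured):

```shell
# Same disk budget, better compression -> proportionally longer retention.
# The 5x and 10x ratios come from the claim above; real results vary by workload.
days=30
for ratio in 5 10; do
  echo "${ratio}x compression -> $(( days * ratio )) days retention"
done
```

Even the conservative end of the range comfortably covers the 3-month retention target chosen for monitoring02.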
## Architecture

```
                ┌─────────────────┐
                │ monitoring02    │
                │ VictoriaMetrics │
                │ + Grafana       │
monitoring      │ + Loki          │
CNAME ──────────│ + Alertmanager  │
                │ (vmalert)       │
                └─────────────────┘
                         ▲
                         │ scrapes
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────┴────┐     ┌────┴─────┐    ┌────┴─────┐
    │ ns1     │     │ ha1      │    │ ...      │
    │ :9100   │     │ :9100    │    │ :9100    │
    └─────────┘     └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:

- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as a test instance (`grafana-test.home.2rjus.net`)

### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]

New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.

1. **VictoriaMetrics** (port 8428):
   - `services.victoriametrics.enable = true`
   - `retentionPeriod = "3"` (3 months)
   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
   - Static user override (DynamicUser disabled) for credential file access
   - OpenBao token fetch service + 30min refresh timer
   - Apiary bearer token via vault.secrets

2. **vmalert** for alerting rules:
   - Points to the VictoriaMetrics datasource at localhost:8428
   - Reuses the existing `services/monitoring/rules.yml` directly via `settings.rule`
   - Notifier sends to the local Alertmanager at localhost:9093

3. **Alertmanager** (port 9093):
   - Same configuration as monitoring01 (alerttonotify webhook routing)
   - alerttonotify imported on monitoring02, routes alerts via NATS

4. **Grafana** (port 3000):
   - VictoriaMetrics datasource (localhost:8428) as default
   - Loki datasource pointing to localhost:3100

5. **Loki** (port 3100):
   - Same configuration as monitoring01, in a standalone `services/loki/` module
   - Grafana datasource updated to localhost:3100

**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics'
native push support.

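The pushgateway replacement mentioned in the note can be sketched as follows. VictoriaMetrics accepts Prometheus text-format samples on its `/api/v1/import/prometheus` endpoint; the metric name and `host` label here are hypothetical, and the target hostname assumes the `monitoring` CNAME from this plan:

```shell
# Build a Prometheus text-format sample; a batch job would push a line like
# this at the end of a run instead of going through pushgateway.
metric='backup_last_success_timestamp{host="ns1"}'
line="${metric} $(date +%s)"
echo "$line"

# Actual push (commented out so the sketch runs anywhere):
# curl --data-binary "$line" \
#   http://monitoring.home.2rjus.net:8428/api/v1/import/prometheus
```

Unlike pushgateway, VictoriaMetrics stores the pushed sample directly, so there is no extra service to run or proxy.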
### Phase 3: Parallel Operation [COMPLETE]

Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.

### Phase 4: Add monitoring CNAME [COMPLETE]

Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.

### Phase 5: Update References [COMPLETE]

- Moved the alertmanager, grafana, and prometheus CNAMEs from http-proxy to monitoring02
- Removed the corresponding Caddy reverse proxy entries from http-proxy
- monitoring02's Caddy serves alertmanager, grafana, metrics, and vmalert directly

### Phase 6: Enable Alerting [COMPLETE]

- Switched vmalert from blackhole mode to the local Alertmanager
- alerttonotify service running on monitoring02 (NATS nkey from Vault)
- prometheus-metrics Vault policy added for OpenBao scraping
- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS

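One quick way to re-verify the pipeline end to end is to inject a synthetic alert into Alertmanager's v2 API, the same API vmalert posts to. The alert name and labels below are made up for the test, and the command assumes it runs on monitoring02 where Alertmanager listens on 9093:

```shell
# Synthetic alert for a pipeline smoke test (vmalert normally posts these).
payload='[{"labels":{"alertname":"PipelineSmokeTest","severity":"info"},"annotations":{"summary":"monitoring02 pipeline test"}}]'
echo "$payload"

# Post to the local Alertmanager (commented out so the sketch runs anywhere):
# curl -s -X POST -H 'Content-Type: application/json' \
#   -d "$payload" http://localhost:9093/api/v2/alerts
```

If the pipeline is healthy, the alert should show up in Alertmanager and arrive as a NATS notification via alerttonotify within the route's group interval.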
### Phase 7: Cutover and Decommission [IN PROGRESS]

- monitoring01 shut down (2026-02-17)
- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support

**Remaining cleanup (separate branch):**

- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
- [ ] Update `hosts/template2/bootstrap.nix` - bootstrap Loki URL still points to monitoring01
- [ ] Remove monitoring01 from flake.nix and host configuration
- [ ] Destroy the monitoring01 VM in Proxmox
- [ ] Remove monitoring01 from terraform state
- [ ] Remove or archive `services/monitoring/` (Prometheus config)

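Before ticking off the decommission items, a repo-wide search helps confirm nothing still points at the old host; the directory list mirrors the checklist above and assumes the command is run from the repo root:

```shell
# List any remaining hard-coded monitoring01 references; prints a marker
# line when the tree is clean (grep exits non-zero when nothing matches).
grep -rn "monitoring01" system/ hosts/ services/ 2>/dev/null \
  || echo "no monitoring01 references left"
```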
## Completed

- 2026-02-08: Phase 1 - monitoring02 host created
- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
- 2026-02-17: Phase 6 - alerting enabled, CNAMEs migrated, monitoring01 shut down

## VictoriaMetrics Service Configuration

Implemented in `services/victoriametrics/default.nix`. Key design decisions:

- **Static user**: the VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
  `victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via a `settings.rule` path
  reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: uses the same `lib/monitoring.nix` functions and
  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets

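The static-user override follows a common NixOS pattern. A minimal sketch, assuming the module's systemd unit is named `victoriametrics` (the real definition lives in `services/victoriametrics/default.nix`):

```nix
{ lib, ... }:
{
  # Disable DynamicUser so credential files provisioned by vault.secrets
  # resolve against a stable user; unit name is an assumption.
  systemd.services.victoriametrics.serviceConfig = {
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };

  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };
}
```

`lib.mkForce` is needed because the upstream module already sets `DynamicUser`, and a plain assignment would conflict at the same priority.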
## Notes

- VictoriaMetrics uses port 8428 vs Prometheus' 9090
- PromQL compatibility is excellent
- VictoriaMetrics' native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)

@@ -1,201 +0,0 @@
# Monitoring Stack Migration to VictoriaMetrics

## Overview

Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for a seamless transition.

## Current State

**monitoring01** (10.69.13.13):

- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing) - not actively used
- Pyroscope (continuous profiling) - not actively used

**Hardcoded references to monitoring01:**

- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway

**Auto-generated:**

- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)

## Decision: VictoriaMetrics

Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:

- Single-binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in the same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)

If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.

## Architecture

```
                ┌─────────────────┐
                │ monitoring02    │
                │ VictoriaMetrics │
                │ + Grafana       │
monitoring      │ + Loki          │
CNAME ──────────│ + Alertmanager  │
                │ (vmalert)       │
                └─────────────────┘
                         ▲
                         │ scrapes
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────┴────┐     ┌────┴─────┐    ┌────┴─────┐
    │ ns1     │     │ ha1      │    │ ...      │
    │ :9100   │     │ :9100    │    │ :9100    │
    └─────────┘     └──────────┘    └──────────┘
```

## Implementation Plan

### Phase 1: Create monitoring02 Host [COMPLETE]

Host created and deployed at 10.69.13.24 (prod tier) with:

- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as a test instance (`grafana-test.home.2rjus.net`)

### Phase 2: Set Up VictoriaMetrics Stack

New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.

1. **VictoriaMetrics** (port 8428): [DONE]
   - `services.victoriametrics.enable = true`
   - `retentionPeriod = "3"` (3 months)
   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
   - Static user override (DynamicUser disabled) for credential file access
   - OpenBao token fetch service + 30min refresh timer
   - Apiary bearer token via vault.secrets

2. **vmalert** for alerting rules: [DONE]
   - Points to the VictoriaMetrics datasource at localhost:8428
   - Reuses the existing `services/monitoring/rules.yml` directly via `settings.rule`
   - No notifier configured during parallel operation (prevents duplicate alerts)

3. **Alertmanager** (port 9093): [DONE]
   - Same configuration as monitoring01 (alerttonotify webhook routing)
   - Will only receive alerts after cutover (vmalert notifier disabled)

4. **Grafana** (port 3000): [DONE]
   - VictoriaMetrics datasource (localhost:8428) as default
   - monitoring01 Prometheus datasource kept for comparison during parallel operation
   - Loki datasource pointing to localhost (after Loki migrates to monitoring02)

5. **Loki** (port 3100): [DONE]
   - Same configuration as monitoring01, in a standalone `services/loki/` module
   - Grafana datasource updated to localhost:3100

**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics'
native push support.

### Phase 3: Parallel Operation

Run both monitoring01 and monitoring02 simultaneously:

1. **Dual scraping**: both hosts scrape the same targets
   - Validates that VictoriaMetrics is collecting data correctly

2. **Dual log shipping**: configure Promtail to send logs to both Loki instances
   - Add a second client in `system/monitoring/logs.nix` pointing to monitoring02

3. **Validate dashboards**: access Grafana on monitoring02, verify dashboards work

4. **Validate alerts**: verify vmalert evaluates rules correctly (no receiver = no notifications)

5. **Compare resource usage**: monitor disk/memory consumption between hosts

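The dual log shipping in step 2 amounts to listing both Loki endpoints in Promtail's `clients`. A sketch assuming the repo's Promtail config follows the upstream NixOS option shape (the exact option path in `system/monitoring/logs.nix` may differ):

```nix
{
  services.promtail.configuration.clients = [
    # Existing target; dropped again in Phase 5 after cutover.
    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
    # Second client so monitoring02's Loki accumulates data during
    # parallel operation.
    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```

Promtail tracks positions once and fans each entry out to every client, so both Loki instances receive the same stream.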
### Phase 4: Add monitoring CNAME

Add the CNAME to monitoring02 once validated:

```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```

This creates `monitoring.home.2rjus.net` pointing to monitoring02.

### Phase 5: Update References

Update hardcoded references to use the CNAME:

1. **system/monitoring/logs.nix**:
   - Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`

2. **services/http-proxy/proxy.nix**: update reverse proxy backends:
   - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
   - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
   - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000

Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.

### Phase 6: Enable Alerting

Once ready to cut over:

1. Enable the Alertmanager receiver on monitoring02
2. Verify test alerts route correctly

### Phase 7: Cutover and Decommission

1. **Stop monitoring01**: prevent duplicate alerts during the transition
2. **Update bootstrap.nix**: point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: check the VictoriaMetrics UI
4. **Verify logs flowing**: check Loki on monitoring02
5. **Decommission monitoring01**:
   - Remove from flake.nix
   - Remove host configuration
   - Destroy VM in Proxmox
   - Remove from terraform state

## Current Progress

- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
- **Phase 2** complete (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana datasources configured
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)

## Open Questions

- [ ] What disk size for monitoring02? The current 60GB may need expansion for 3+ months of retention with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01's Grafana for the current set)
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as its successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so the ergonomics are less Nix-native. The migration could be bundled with monitoring02 to consolidate disruption.

## VictoriaMetrics Service Configuration

Implemented in `services/victoriametrics/default.nix`. Key design decisions:

- **Static user**: the VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
  `victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via a `settings.rule` path
  reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: uses the same `lib/monitoring.nix` functions and
  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets

## Rollback Plan

If issues arise after cutover:

1. Move the `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert the Promtail config to point only to monitoring01
4. Revert the http-proxy backends

## Notes

- VictoriaMetrics uses port 8428 vs Prometheus' 9090
- PromQL compatibility is excellent
- VictoriaMetrics' native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using the `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state

@@ -18,9 +18,6 @@
 "sonarr"
 "ha"
 "z2m"
-"grafana"
-"prometheus"
-"alertmanager"
 "jelly"
 "pyroscope"
 "pushgw"
@@ -18,7 +18,7 @@
 role = "monitoring";
 };

-homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" "loki" ];
+homelab.dns.cnames = [ "monitoring" "alertmanager" "grafana" "grafana-test" "metrics" "vmalert" "loki" ];

 # Enable Vault integration
 vault.enable = true;
@@ -4,5 +4,6 @@
 ../../services/grafana
 ../../services/victoriametrics
 ../../services/loki
+../../services/monitoring/alerttonotify.nix
 ];
 }
@@ -6,7 +6,8 @@ let
 text = ''
 set -euo pipefail

-LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
+LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"

 # Send a log entry to Loki with bootstrap status
 # Usage: log_to_loki <stage> <message>
@@ -36,8 +37,14 @@ let
 }]
 }')

+local auth_args=()
+if [[ -f "$LOKI_AUTH_FILE" ]]; then
+auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
+fi
+
 curl -s --connect-timeout 2 --max-time 5 \
 -X POST \
+"''${auth_args[@]}" \
 -H "Content-Type: application/json" \
 -d "$payload" \
 "$LOKI_URL" >/dev/null 2>&1 || true
@@ -91,6 +91,14 @@
 acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
 metrics
 '';
+virtualHosts."grafana.home.2rjus.net".extraConfig = ''
+log {
+output file /var/log/caddy/grafana.log {
+mode 644
+}
+}
+reverse_proxy http://127.0.0.1:3000
+'';
 virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
 log {
 output file /var/log/caddy/grafana.log {
@@ -54,30 +54,7 @@
 }
 reverse_proxy http://ha1.home.2rjus.net:8080
 }
-prometheus.home.2rjus.net {
-log {
-output file /var/log/caddy/prometheus.log {
-mode 644
-}
-}
-reverse_proxy http://monitoring01.home.2rjus.net:9090
-}
-alertmanager.home.2rjus.net {
-log {
-output file /var/log/caddy/alertmanager.log {
-mode 644
-}
-}
-reverse_proxy http://monitoring01.home.2rjus.net:9093
-}
-grafana.home.2rjus.net {
-log {
-output file /var/log/caddy/grafana.log {
-mode 644
-}
-}
-reverse_proxy http://monitoring01.home.2rjus.net:3000
-}
 jelly.home.2rjus.net {
 log {
 output file /var/log/caddy/jelly.log {
@@ -170,15 +170,12 @@ in
 };
 };

-# vmalert for alerting rules - no notifier during parallel operation
+# vmalert for alerting rules
 services.vmalert.instances.default = {
 enable = true;
 settings = {
 "datasource.url" = "http://localhost:8428";
-# Blackhole notifications during parallel operation to prevent duplicate alerts.
-# Replace with notifier.url after cutover from monitoring01:
-# "notifier.url" = [ "http://localhost:9093" ];
-"notifier.blackhole" = true;
+"notifier.url" = [ "http://localhost:9093" ];
 "rule" = [ ../monitoring/rules.yml ];
 };
 };
@@ -191,8 +188,11 @@ in
 reverse_proxy http://127.0.0.1:8880
 '';

-# Alertmanager - same config as monitoring01 but will only receive
-# alerts after cutover (vmalert notifier is disabled above)
+# Alertmanager
+services.caddy.virtualHosts."alertmanager.home.2rjus.net".extraConfig = ''
+reverse_proxy http://127.0.0.1:9093
+'';
+
 services.prometheus.alertmanager = {
 enable = true;
 configuration = {
@@ -38,10 +38,6 @@ in
 };

 clients = [
-{
-url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
-}
-] ++ lib.optionals config.vault.enable [
 {
 url = "https://loki.home.2rjus.net/loki/api/v1/push";
 basic_auth = {
@@ -16,7 +16,8 @@ let
 text = ''
 set -euo pipefail

-LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
+LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
 HOSTNAME=$(hostname)
 SESSION_ID=""
 RECORD_MODE=false
@@ -69,7 +70,13 @@ let
 }]
 }')

+local auth_args=()
+if [[ -f "$LOKI_AUTH_FILE" ]]; then
+auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
+fi
+
 if curl -s -X POST "$LOKI_URL" \
+"''${auth_args[@]}" \
 -H "Content-Type: application/json" \
 -d "$payload" > /dev/null; then
 return 0
@@ -115,15 +115,6 @@ locals {
 ]
 }

-# monitoring02: Grafana + VictoriaMetrics
-"monitoring02" = {
-paths = [
-"secret/data/hosts/monitoring02/*",
-"secret/data/hosts/monitoring01/apiary-token",
-"secret/data/services/grafana/*",
-]
-}
-
 }
 }
@@ -44,6 +44,15 @@ locals {
 "secret/data/hosts/garage01/*",
 ]
 }
+"monitoring02" = {
+paths = [
+"secret/data/hosts/monitoring02/*",
+"secret/data/hosts/monitoring01/apiary-token",
+"secret/data/services/grafana/*",
+"secret/data/shared/nats/nkey",
+]
+extra_policies = ["prometheus-metrics"]
+}

 }
@@ -74,7 +83,10 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {

 backend = vault_auth_backend.approle.path
 role_name = each.key
-token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"]
+token_policies = concat(
+["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"],
+lookup(each.value, "extra_policies", [])
+)
 secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
 token_ttl = 3600
 token_max_ttl = 3600