Compare commits
11 Commits: c151f31011...loki-monit

| Author | SHA1 | Date |
|---|---|---|
| | 35924c7b01 | |
| | 87d8571d62 | |
| | 43c81f6688 | |
| | 58f901ad3e | |
| | c13921d302 | |
| | 2903873d52 | |
| | 74e7c9faa4 | |
| | 471f536f1f | |
| | a013e80f1a | |
| | 4cbaa33475 | |
| | e329f87b0b | |
.gitignore (vendored, 3 changes)

@@ -2,6 +2,9 @@
 result
 result-*
 
+# MCP config (contains secrets)
+.mcp.json
+
 # Terraform/OpenTofu
 terraform/.terraform/
 terraform/.terraform.lock.hcl
@@ -20,7 +20,9 @@
     "env": {
       "PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
       "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
-      "LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
+      "LOKI_URL": "https://loki.home.2rjus.net",
+      "LOKI_USERNAME": "promtail",
+      "LOKI_PASSWORD": "<password from: bao kv get -field=password secret/shared/loki/push-auth>"
     }
   },
   "homelab-deploy": {
@@ -44,4 +46,3 @@
 }
 }
 }
-
@@ -14,8 +14,8 @@ a `monitoring` CNAME for seamless transition.
 - Alertmanager (routes to alerttonotify webhook)
 - Grafana (dashboards, datasources)
 - Loki (log aggregation from all hosts via Promtail)
-- Tempo (distributed tracing)
-- Pyroscope (continuous profiling)
+- Tempo (distributed tracing) - not actively used
+- Pyroscope (continuous profiling) - not actively used
 
 **Hardcoded References to monitoring01:**
 - `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
@@ -44,9 +44,7 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be
 │ VictoriaMetrics│
 │ + Grafana │
 monitoring │ + Loki │
-CNAME ──────────│ + Tempo │
-│ + Pyroscope │
-│ + Alertmanager │
+CNAME ──────────│ + Alertmanager │
 │ (vmalert) │
 └─────────────────┘
 ▲
@@ -61,53 +59,48 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be
 
 ## Implementation Plan
 
-### Phase 1: Create monitoring02 Host
+### Phase 1: Create monitoring02 Host [COMPLETE]
 
-Use `create-host` script which handles flake.nix and terraform/vms.tf automatically.
-
-1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
-2. **Update VM resources** in `terraform/vms.tf`:
-   - 4 cores (same as monitoring01)
-   - 8GB RAM (double, for VictoriaMetrics headroom)
-   - 100GB disk (for 3+ months retention with compression)
-3. **Update host configuration**: Import monitoring services
-4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
+Host created and deployed at 10.69.13.24 (prod tier) with:
+- 4 CPU cores, 8GB RAM, 60GB disk
+- Vault integration enabled
+- NATS-based remote deployment enabled
+- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
 
 ### Phase 2: Set Up VictoriaMetrics Stack
 
-Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
-Prometheus config. Once validated, this can replace the Prometheus module.
+New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
+Imported by monitoring02 alongside the existing Grafana service.
 
-1. **VictoriaMetrics** (port 8428):
+1. **VictoriaMetrics** (port 8428): [DONE]
    - `services.victoriametrics.enable = true`
-   - `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
-   - Migrate scrape configs via `prometheusConfig`
-   - Use native push support (replaces Pushgateway)
+   - `retentionPeriod = "3"` (3 months)
+   - All scrape configs migrated from Prometheus (22 jobs including auto-generated)
+   - Static user override (DynamicUser disabled) for credential file access
+   - OpenBao token fetch service + 30min refresh timer
+   - Apiary bearer token via vault.secrets
 
-2. **vmalert** for alerting rules:
-   - `services.vmalert.enable = true`
-   - Point to VictoriaMetrics for metrics evaluation
-   - Keep rules in separate `rules.yml` file (same format as Prometheus)
-   - No receiver configured during parallel operation (prevents duplicate alerts)
+2. **vmalert** for alerting rules: [DONE]
+   - Points to VictoriaMetrics datasource at localhost:8428
+   - Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
+   - No notifier configured during parallel operation (prevents duplicate alerts)
 
-3. **Alertmanager** (port 9093):
-   - Keep existing configuration (alerttonotify webhook routing)
-   - Only enable receiver after cutover from monitoring01
+3. **Alertmanager** (port 9093): [DONE]
+   - Same configuration as monitoring01 (alerttonotify webhook routing)
+   - Will only receive alerts after cutover (vmalert notifier disabled)
 
-4. **Loki** (port 3100):
-   - Same configuration as current
+4. **Grafana** (port 3000): [DONE]
+   - VictoriaMetrics datasource (localhost:8428) as default
+   - monitoring01 Prometheus datasource kept for comparison during parallel operation
+   - Loki datasource pointing to localhost (after Loki migrated to monitoring02)
 
-5. **Grafana** (port 3000):
-   - Define dashboards declaratively via NixOS options (not imported from monitoring01)
-   - Reference existing dashboards on monitoring01 for content inspiration
-   - Configure VictoriaMetrics datasource (port 8428)
-   - Configure Loki datasource
+5. **Loki** (port 3100): [DONE]
+   - Same configuration as monitoring01 in standalone `services/loki/` module
+   - Grafana datasource updated to localhost:3100
 
-6. **Tempo** (ports 3200, 3201):
-   - Same configuration
-
-7. **Pyroscope** (port 4040):
-   - Same Docker-based deployment
+**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
+pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
+native push support.
 
 ### Phase 3: Parallel Operation
 
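The "no duplicate alerts" requirement during parallel operation maps onto vmalert's blackhole notifier. A minimal sketch of that wiring, using the option names that appear later in this diff's `services/victoriametrics/default.nix` (illustrative, not the full module):

```nix
# Sketch: evaluate the shared rules against the new VictoriaMetrics instance,
# but discard resulting notifications so monitoring01 remains the alerting source.
services.vmalert.instances.default = {
  enable = true;
  settings = {
    "datasource.url" = "http://localhost:8428";
    "notifier.blackhole" = true; # swap for "notifier.url" after cutover
    "rule" = [ ../monitoring/rules.yml ];
  };
};
```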
@@ -147,7 +140,6 @@ Update hardcoded references to use the CNAME:
 - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
 - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
 - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
-- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
 
 Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
 
@@ -171,24 +163,9 @@ Once ready to cut over:
 
 ## Current Progress
 
-### monitoring02 Host Created (2026-02-08)
-
-Host deployed at 10.69.13.24 (test tier) with:
-- 4 CPU cores, 8GB RAM, 60GB disk
-- Vault integration enabled
-- NATS-based remote deployment enabled
-
-### Grafana with Kanidm OIDC (2026-02-08)
-
-Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):
-- Kanidm OIDC authentication (PKCE enabled)
-- Role mapping: `admins` → Admin, others → Viewer
-- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
-- Local Caddy for TLS termination via internal ACME CA
-
-This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing
-`services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/`
-module once monitoring02 becomes the primary monitoring host.
+- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
+- **Phase 2** complete (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana datasources configured
+- Tempo and Pyroscope deferred (not actively used; can be added later if needed)
 
 ## Open Questions
 
@@ -198,31 +175,14 @@ module once monitoring02 becomes the primary monitoring host.
 
 ## VictoriaMetrics Service Configuration
 
-Example NixOS configuration for monitoring02:
-
-```nix
-# VictoriaMetrics replaces Prometheus
-services.victoriametrics = {
-  enable = true;
-  retentionPeriod = "3m"; # 3 months, increase based on disk usage
-  prometheusConfig = {
-    global.scrape_interval = "15s";
-    scrape_configs = [
-      # Auto-generated node-exporter targets
-      # Service-specific scrape targets
-      # External targets
-    ];
-  };
-};
-
-# vmalert for alerting rules (no receiver during parallel operation)
-services.vmalert = {
-  enable = true;
-  datasource.url = "http://localhost:8428";
-  # notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
-  rule = [ ./rules.yml ];
-};
-```
+Implemented in `services/victoriametrics/default.nix`. Key design decisions:
+
+- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
+  `victoriametrics` user so vault.secrets and credential files work correctly
+- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
+  reference (no YAML-to-Nix conversion needed)
+- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
+  `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
 
 ## Rollback Plan
 
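The static-user decision described above can be sketched as a small NixOS fragment (mirroring what the new module in this diff does; treat it as illustrative, not authoritative):

```nix
# Sketch: replace the upstream module's DynamicUser with a fixed user so
# credential files under /run/secrets can be chowned to a known owner.
users.users.victoriametrics = {
  isSystemUser = true;
  group = "victoriametrics";
};
users.groups.victoriametrics = { };

systemd.services.victoriametrics.serviceConfig = {
  DynamicUser = lib.mkForce false;
  User = "victoriametrics";
  Group = "victoriametrics";
};
```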
@@ -18,8 +18,7 @@
 role = "monitoring";
 };
 
-# DNS CNAME for Grafana test instance
-homelab.dns.cnames = [ "grafana-test" ];
+homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" "loki" ];
 
 # Enable Vault integration
 vault.enable = true;
@@ -2,5 +2,7 @@
 imports = [
   ./configuration.nix
   ../../services/grafana
+  ../../services/victoriametrics
+  ../../services/loki
 ];
 }
@@ -34,21 +34,27 @@
 };
 };
 
-# Declarative datasources pointing to monitoring01
+# Declarative datasources
 provision.datasources.settings = {
   apiVersion = 1;
   datasources = [
     {
-      name = "Prometheus";
+      name = "VictoriaMetrics";
+      type = "prometheus";
+      url = "http://localhost:8428";
+      isDefault = true;
+      uid = "victoriametrics";
+    }
+    {
+      name = "Prometheus (monitoring01)";
       type = "prometheus";
       url = "http://monitoring01.home.2rjus.net:9090";
-      isDefault = true;
       uid = "prometheus";
     }
     {
       name = "Loki";
       type = "loki";
-      url = "http://monitoring01.home.2rjus.net:3100";
+      url = "http://localhost:3100";
       uid = "loki";
     }
   ];
@@ -81,22 +87,20 @@
 services.caddy = {
   enable = true;
   package = pkgs.unstable.caddy;
-  configFile = pkgs.writeText "Caddyfile" ''
-    {
+  globalConfig = ''
     acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
     metrics
-    }
+  '';
+  virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
-    grafana-test.home.2rjus.net {
     log {
       output file /var/log/caddy/grafana.log {
         mode 644
       }
     }
 
     reverse_proxy http://127.0.0.1:3000
-    }
+  '';
+  # Metrics endpoint on plain HTTP for Prometheus scraping
+  extraConfig = ''
     http://${config.networking.hostName}.home.2rjus.net/metrics {
       metrics
     }
services/loki/default.nix (new file, 104 lines)

@@ -0,0 +1,104 @@
{ config, lib, pkgs, ... }:
let
  # Script to generate bcrypt hash from Vault password for Caddy basic_auth
  generateCaddyAuth = pkgs.writeShellApplication {
    name = "generate-caddy-loki-auth";
    runtimeInputs = [ config.services.caddy.package ];
    text = ''
      PASSWORD=$(cat /run/secrets/loki-push-auth)
      HASH=$(caddy hash-password --plaintext "$PASSWORD")
      echo "LOKI_PUSH_HASH=$HASH" > /run/secrets/caddy-loki-auth.env
      chmod 0400 /run/secrets/caddy-loki-auth.env
    '';
  };
in
{
  # Fetch Loki push password from Vault
  vault.secrets.loki-push-auth = {
    secretPath = "shared/loki/push-auth";
    extractKey = "password";
    services = [ "caddy" ];
  };

  # Generate bcrypt hash for Caddy before it starts
  systemd.services.caddy-loki-auth = {
    description = "Generate Caddy basic auth hash for Loki";
    after = [ "vault-secret-loki-push-auth.service" ];
    requires = [ "vault-secret-loki-push-auth.service" ];
    before = [ "caddy.service" ];
    requiredBy = [ "caddy.service" ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
      ExecStart = lib.getExe generateCaddyAuth;
    };
  };

  # Load the bcrypt hash as environment variable for Caddy
  services.caddy.environmentFile = "/run/secrets/caddy-loki-auth.env";

  # Caddy reverse proxy for Loki with basic auth
  services.caddy.virtualHosts."loki.home.2rjus.net".extraConfig = ''
    basic_auth {
      promtail {env.LOKI_PUSH_HASH}
    }
    reverse_proxy http://127.0.0.1:3100
  '';

  services.loki = {
    enable = true;
    configuration = {
      auth_enabled = false;

      server = {
        http_listen_address = "127.0.0.1";
        http_listen_port = 3100;
      };
      common = {
        ring = {
          instance_addr = "127.0.0.1";
          kvstore = {
            store = "inmemory";
          };
        };
        replication_factor = 1;
        path_prefix = "/var/lib/loki";
      };
      schema_config = {
        configs = [
          {
            from = "2024-01-01";
            store = "tsdb";
            object_store = "filesystem";
            schema = "v13";
            index = {
              prefix = "loki_index_";
              period = "24h";
            };
          }
        ];
      };
      storage_config = {
        filesystem = {
          directory = "/var/lib/loki/chunks";
        };
      };
      compactor = {
        working_directory = "/var/lib/loki/compactor";
        compaction_interval = "10m";
        retention_enabled = true;
        retention_delete_delay = "2h";
        retention_delete_worker_count = 150;
        delete_request_store = "filesystem";
      };
      limits_config = {
        retention_period = "30d";
        ingestion_rate_mb = 10;
        ingestion_burst_size_mb = 20;
        max_streams_per_user = 10000;
        max_query_series = 500;
        max_query_parallelism = 8;
      };
    };
  };
}
services/victoriametrics/default.nix (new file, 219 lines)

@@ -0,0 +1,219 @@
{ self, config, lib, pkgs, ... }:
let
  monLib = import ../../lib/monitoring.nix { inherit lib; };
  externalTargets = import ../monitoring/external-targets.nix;

  nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
  autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;

  # Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
  fetchOpenbaoToken = pkgs.writeShellApplication {
    name = "fetch-openbao-token-vm";
    runtimeInputs = [ pkgs.curl pkgs.jq ];
    text = ''
      VAULT_ADDR="https://vault01.home.2rjus.net:8200"
      APPROLE_DIR="/var/lib/vault/approle"
      OUTPUT_FILE="/run/secrets/victoriametrics/openbao-token"

      # Read AppRole credentials
      if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
        echo "AppRole credentials not found at $APPROLE_DIR" >&2
        exit 1
      fi

      ROLE_ID=$(cat "$APPROLE_DIR/role-id")
      SECRET_ID=$(cat "$APPROLE_DIR/secret-id")

      # Authenticate to Vault
      AUTH_RESPONSE=$(curl -sf -k -X POST \
        -d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
        "$VAULT_ADDR/v1/auth/approle/login")

      # Extract token
      VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
      if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
        echo "Failed to extract Vault token from response" >&2
        exit 1
      fi

      # Write token to file
      mkdir -p "$(dirname "$OUTPUT_FILE")"
      echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
      chown victoriametrics:victoriametrics "$OUTPUT_FILE"
      chmod 0400 "$OUTPUT_FILE"

      echo "Successfully fetched OpenBao token"
    '';
  };

  scrapeConfigs = [
    # Auto-generated node-exporter targets from flake hosts + external
    {
      job_name = "node-exporter";
      static_configs = nodeExporterTargets;
    }
    # Systemd exporter on all hosts (same targets, different port)
    {
      job_name = "systemd-exporter";
      static_configs = map
        (cfg: cfg // {
          targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
        })
        nodeExporterTargets;
    }
    # Local monitoring services
    {
      job_name = "victoriametrics";
      static_configs = [{ targets = [ "localhost:8428" ]; }];
    }
    {
      job_name = "loki";
      static_configs = [{ targets = [ "localhost:3100" ]; }];
    }
    {
      job_name = "grafana";
      static_configs = [{ targets = [ "localhost:3000" ]; }];
    }
    {
      job_name = "alertmanager";
      static_configs = [{ targets = [ "localhost:9093" ]; }];
    }
    # Caddy metrics from nix-cache02
    {
      job_name = "nix-cache_caddy";
      scheme = "https";
      static_configs = [{ targets = [ "nix-cache.home.2rjus.net" ]; }];
    }
    # OpenBao metrics with bearer token auth
    {
      job_name = "openbao";
      scheme = "https";
      metrics_path = "/v1/sys/metrics";
      params = { format = [ "prometheus" ]; };
      static_configs = [{ targets = [ "vault01.home.2rjus.net:8200" ]; }];
      authorization = {
        type = "Bearer";
        credentials_file = "/run/secrets/victoriametrics/openbao-token";
      };
    }
    # Apiary external service
    {
      job_name = "apiary";
      scheme = "https";
      scrape_interval = "60s";
      static_configs = [{ targets = [ "apiary.t-juice.club" ]; }];
      authorization = {
        type = "Bearer";
        credentials_file = "/run/secrets/victoriametrics-apiary-token";
      };
    }
  ] ++ autoScrapeConfigs;
in
{
  # Static user for VictoriaMetrics (overrides DynamicUser) so vault.secrets
  # and credential files can be owned by this user
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };
  users.groups.victoriametrics = { };

  # Override DynamicUser since we need a static user for credential file access
  systemd.services.victoriametrics.serviceConfig = {
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };

  # Systemd service to fetch AppRole token for OpenBao scraping
  systemd.services.victoriametrics-openbao-token = {
    description = "Fetch OpenBao token for VictoriaMetrics metrics scraping";
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    before = [ "victoriametrics.service" ];
    requiredBy = [ "victoriametrics.service" ];

    serviceConfig = {
      Type = "oneshot";
      ExecStart = lib.getExe fetchOpenbaoToken;
    };
  };

  # Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
  systemd.timers.victoriametrics-openbao-token = {
    description = "Refresh OpenBao token for VictoriaMetrics";
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "5min";
      OnUnitActiveSec = "30min";
      RandomizedDelaySec = "5min";
    };
  };

  # Fetch apiary bearer token from Vault
  vault.secrets.victoriametrics-apiary-token = {
    secretPath = "hosts/monitoring01/apiary-token";
    extractKey = "password";
    owner = "victoriametrics";
    group = "victoriametrics";
    services = [ "victoriametrics" ];
  };

  services.victoriametrics = {
    enable = true;
    retentionPeriod = "3"; # 3 months
    # Disable config check since we reference external credential files
    checkConfig = false;
    prometheusConfig = {
      global.scrape_interval = "15s";
      scrape_configs = scrapeConfigs;
    };
  };

  # vmalert for alerting rules - no notifier during parallel operation
  services.vmalert.instances.default = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # Blackhole notifications during parallel operation to prevent duplicate alerts.
      # Replace with notifier.url after cutover from monitoring01:
      # "notifier.url" = [ "http://localhost:9093" ];
      "notifier.blackhole" = true;
      "rule" = [ ../monitoring/rules.yml ];
    };
  };

  # Caddy reverse proxy for VictoriaMetrics and vmalert
  services.caddy.virtualHosts."metrics.home.2rjus.net".extraConfig = ''
    reverse_proxy http://127.0.0.1:8428
  '';
  services.caddy.virtualHosts."vmalert.home.2rjus.net".extraConfig = ''
    reverse_proxy http://127.0.0.1:8880
  '';

  # Alertmanager - same config as monitoring01 but will only receive
  # alerts after cutover (vmalert notifier is disabled above)
  services.prometheus.alertmanager = {
    enable = true;
    configuration = {
      global = { };
      route = {
        receiver = "webhook_natstonotify";
        group_wait = "30s";
        group_interval = "5m";
        repeat_interval = "1h";
        group_by = [ "alertname" ];
      };
      receivers = [
        {
          name = "webhook_natstonotify";
          webhook_configs = [
            { url = "http://localhost:5001/alert"; }
          ];
        }
      ];
    };
  };
}
@@ -16,6 +16,16 @@ in
 SystemKeepFree=1G
 '';
 };
+
+# Fetch Loki push password from Vault (only on hosts with Vault enabled)
+vault.secrets.promtail-loki-auth = lib.mkIf config.vault.enable {
+  secretPath = "shared/loki/push-auth";
+  extractKey = "password";
+  owner = "promtail";
+  group = "promtail";
+  services = [ "promtail" ];
+};
+
 # Configure promtail
 services.promtail = {
 enable = true;
@@ -31,6 +41,14 @@ in
 {
 url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
 }
+] ++ lib.optionals config.vault.enable [
+  {
+    url = "https://loki.home.2rjus.net/loki/api/v1/push";
+    basic_auth = {
+      username = "promtail";
+      password_file = "/run/secrets/promtail-loki-auth";
+    };
+  }
 ];
 
 scrape_configs = [
@@ -26,6 +26,17 @@ path "secret/data/shared/nixos-exporter/*" {
 EOT
 }
+
+# Shared policy for Loki push authentication (all hosts push logs)
+resource "vault_policy" "loki_push" {
+  name = "loki-push"
+
+  policy = <<EOT
+path "secret/data/shared/loki/*" {
+  capabilities = ["read", "list"]
+}
+EOT
+}
 
 # Define host access policies
 locals {
 host_policies = {
@@ -78,7 +89,7 @@ locals {
 ]
 }
 
-# Wave 3: DNS servers
+# Wave 3: DNS servers (managed in hosts-generated.tf)
 
 # Wave 4: http-proxy
 "http-proxy" = {
@@ -104,10 +115,11 @@ locals {
 ]
 }
 
-# monitoring02: Grafana test instance
+# monitoring02: Grafana + VictoriaMetrics
 "monitoring02" = {
   paths = [
     "secret/data/hosts/monitoring02/*",
+    "secret/data/hosts/monitoring01/apiary-token",
     "secret/data/services/grafana/*",
   ]
 }
@@ -137,7 +149,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
 backend = vault_auth_backend.approle.path
 role_name = each.key
 token_policies = concat(
-  ["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
+  ["${each.key}-policy", "homelab-deploy", "nixos-exporter", "loki-push"],
   lookup(each.value, "extra_policies", [])
 )
@@ -74,7 +74,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
 backend = vault_auth_backend.approle.path
 role_name = each.key
-token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
+token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"]
 secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
 token_ttl = 3600
 token_max_ttl = 3600
@@ -153,6 +153,12 @@ locals {
 auto_generate = true
 password_length = 64
 }
+
+# Loki push authentication (used by Promtail on all hosts)
+"shared/loki/push-auth" = {
+  auto_generate = true
+  password_length = 32
+}
 }
 }