11 Commits

Author SHA1 Message Date
35924c7b01 mcp: move config to .mcp.json.example, gitignore real config
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m57s
Run nix flake check / flake-check (pull_request) Failing after 16m45s
The real .mcp.json now contains Loki credentials for basic auth,
so it should not be committed. The example file has placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:35:14 +01:00
87d8571d62 promtail: fix vault secret ownership for loki auth
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m24s
The secret file needs to be owned by promtail since Promtail runs
as a dedicated user and can't read root-owned files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:17:02 +01:00
43c81f6688 terraform: fix loki-push policy for generated hosts
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Revert ns1/ns2 from approle.tf (they're in hosts-generated.tf) and add
loki-push policy to generated AppRoles instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:13:22 +01:00
58f901ad3e terraform: add ns1 and ns2 to AppRole policies
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
They were missing from the host_policies map, so they didn't get
shared policies like loki-push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:10:37 +01:00
c13921d302 loki: add basic auth for log push and dual-ship promtail
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m36s
- Loki bound to localhost, Caddy reverse proxy with basic_auth
- Vault secret (shared/loki/push-auth) for password, bcrypt hash
  generated at boot for Caddy environment
- Promtail dual-ships to monitoring01 (direct) and loki.home.2rjus.net
  (with basic auth), conditional on vault.enable
- Terraform: new shared loki-push policy added to all AppRoles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:00:08 +01:00
2903873d52 monitoring02: add loki CNAME and Caddy reverse proxy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:48:06 +01:00
74e7c9faa4 monitoring02: add Loki service
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m19s
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:42:19 +01:00
471f536f1f Merge pull request 'victoriametrics-monitoring02' (#40) from victoriametrics-monitoring02 into master
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Periodic flake update / flake-update (push) Successful in 3m29s
Reviewed-on: #40
2026-02-16 23:56:04 +00:00
a013e80f1a terraform: grant monitoring02 access to apiary-token secret
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m59s
Run nix flake check / flake-check (pull_request) Failing after 4m20s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
4cbaa33475 monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.

Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
e329f87b0b monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
12 changed files with 438 additions and 110 deletions

3
.gitignore vendored

@@ -2,6 +2,9 @@
 result
 result-*
+# MCP config (contains secrets)
+.mcp.json
 # Terraform/OpenTofu
 terraform/.terraform/
 terraform/.terraform.lock.hcl


@@ -20,7 +20,9 @@
 "env": {
 "PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
 "ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
-"LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
+"LOKI_URL": "https://loki.home.2rjus.net",
+"LOKI_USERNAME": "promtail",
+"LOKI_PASSWORD": "<password from: bao kv get -field=password secret/shared/loki/push-auth>"
 }
 },
 "homelab-deploy": {
@@ -44,4 +46,3 @@
 }
 }
 }


@@ -14,8 +14,8 @@ a `monitoring` CNAME for seamless transition.
 - Alertmanager (routes to alerttonotify webhook)
 - Grafana (dashboards, datasources)
 - Loki (log aggregation from all hosts via Promtail)
-- Tempo (distributed tracing)
-- Pyroscope (continuous profiling)
+- Tempo (distributed tracing) - not actively used
+- Pyroscope (continuous profiling) - not actively used
 **Hardcoded References to monitoring01:**
 - `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
@@ -44,9 +44,7 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be
 │ VictoriaMetrics│
 │ + Grafana │
 monitoring │ + Loki │
-CNAME ──────────│ + Tempo
-│ + Pyroscope │
-│ + Alertmanager │
+CNAME ──────────│ + Alertmanager
 │ (vmalert) │
 └─────────────────┘
@@ -61,53 +59,48 @@ If multi-year retention with downsampling becomes necessary later, Thanos can be
 ## Implementation Plan
-### Phase 1: Create monitoring02 Host
-Use `create-host` script which handles flake.nix and terraform/vms.tf automatically.
-1. **Run create-host**: `nix develop -c create-host monitoring02 10.69.13.24`
-2. **Update VM resources** in `terraform/vms.tf`:
-- 4 cores (same as monitoring01)
-- 8GB RAM (double, for VictoriaMetrics headroom)
-- 100GB disk (for 3+ months retention with compression)
-3. **Update host configuration**: Import monitoring services
-4. **Create Vault AppRole**: Add to `terraform/vault/approle.tf`
+### Phase 1: Create monitoring02 Host [COMPLETE]
+Host created and deployed at 10.69.13.24 (prod tier) with:
+- 4 CPU cores, 8GB RAM, 60GB disk
+- Vault integration enabled
+- NATS-based remote deployment enabled
+- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
 ### Phase 2: Set Up VictoriaMetrics Stack
-Create new service module at `services/monitoring/victoriametrics/` for testing alongside existing
-Prometheus config. Once validated, this can replace the Prometheus module.
+New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
+Imported by monitoring02 alongside the existing Grafana service.
-1. **VictoriaMetrics** (port 8428):
+1. **VictoriaMetrics** (port 8428): [DONE]
 - `services.victoriametrics.enable = true`
-- `services.victoriametrics.retentionPeriod = "3m"` (3 months, increase later based on disk usage)
-- Migrate scrape configs via `prometheusConfig`
-- Use native push support (replaces Pushgateway)
+- `retentionPeriod = "3"` (3 months)
+- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
+- Static user override (DynamicUser disabled) for credential file access
+- OpenBao token fetch service + 30min refresh timer
+- Apiary bearer token via vault.secrets
-2. **vmalert** for alerting rules:
-- `services.vmalert.enable = true`
-- Point to VictoriaMetrics for metrics evaluation
-- Keep rules in separate `rules.yml` file (same format as Prometheus)
-- No receiver configured during parallel operation (prevents duplicate alerts)
+2. **vmalert** for alerting rules: [DONE]
+- Points to VictoriaMetrics datasource at localhost:8428
+- Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
+- No notifier configured during parallel operation (prevents duplicate alerts)
-3. **Alertmanager** (port 9093):
-- Keep existing configuration (alerttonotify webhook routing)
-- Only enable receiver after cutover from monitoring01
+3. **Alertmanager** (port 9093): [DONE]
+- Same configuration as monitoring01 (alerttonotify webhook routing)
+- Will only receive alerts after cutover (vmalert notifier disabled)
-4. **Loki** (port 3100):
-- Same configuration as current
+4. **Grafana** (port 3000): [DONE]
+- VictoriaMetrics datasource (localhost:8428) as default
+- monitoring01 Prometheus datasource kept for comparison during parallel operation
+- Loki datasource pointing to localhost (after Loki migrated to monitoring02)
-5. **Grafana** (port 3000):
-- Define dashboards declaratively via NixOS options (not imported from monitoring01)
-- Reference existing dashboards on monitoring01 for content inspiration
-- Configure VictoriaMetrics datasource (port 8428)
-- Configure Loki datasource
+5. **Loki** (port 3100): [DONE]
+- Same configuration as monitoring01 in standalone `services/loki/` module
+- Grafana datasource updated to localhost:3100
-6. **Tempo** (ports 3200, 3201):
-- Same configuration
-7. **Pyroscope** (port 4040):
-- Same Docker-based deployment
+**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
+pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
+native push support.
 ### Phase 3: Parallel Operation
@@ -147,7 +140,6 @@ Update hardcoded references to use the CNAME:
 - prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
 - alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
 - grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
-- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
 Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
@@ -171,24 +163,9 @@ Once ready to cut over:
 ## Current Progress
-### monitoring02 Host Created (2026-02-08)
-Host deployed at 10.69.13.24 (test tier) with:
-- 4 CPU cores, 8GB RAM, 60GB disk
-- Vault integration enabled
-- NATS-based remote deployment enabled
-### Grafana with Kanidm OIDC (2026-02-08)
-Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):
-- Kanidm OIDC authentication (PKCE enabled)
-- Role mapping: `admins` → Admin, others → Viewer
-- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
-- Local Caddy for TLS termination via internal ACME CA
-This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing
-`services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/`
-module once monitoring02 becomes the primary monitoring host.
+- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
+- **Phase 2** complete (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana datasources configured
+- Tempo and Pyroscope deferred (not actively used; can be added later if needed)
 ## Open Questions
@@ -198,31 +175,14 @@ module once monitoring02 becomes the primary monitoring host.
 ## VictoriaMetrics Service Configuration
-Example NixOS configuration for monitoring02:
-```nix
-# VictoriaMetrics replaces Prometheus
-services.victoriametrics = {
-enable = true;
-retentionPeriod = "3m"; # 3 months, increase based on disk usage
-prometheusConfig = {
-global.scrape_interval = "15s";
-scrape_configs = [
-# Auto-generated node-exporter targets
-# Service-specific scrape targets
-# External targets
-];
-};
-};
-# vmalert for alerting rules (no receiver during parallel operation)
-services.vmalert = {
-enable = true;
-datasource.url = "http://localhost:8428";
-# notifier.alertmanager.url = "http://localhost:9093"; # Enable after cutover
-rule = [ ./rules.yml ];
-};
-```
+Implemented in `services/victoriametrics/default.nix`. Key design decisions:
+- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
+`victoriametrics` user so vault.secrets and credential files work correctly
+- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
+reference (no YAML-to-Nix conversion needed)
+- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
+`services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
 ## Rollback Plan


@@ -18,8 +18,7 @@
 role = "monitoring";
 };
-# DNS CNAME for Grafana test instance
-homelab.dns.cnames = [ "grafana-test" ];
+homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" "loki" ];
 # Enable Vault integration
 vault.enable = true;


@@ -2,5 +2,7 @@
 imports = [
 ./configuration.nix
 ../../services/grafana
+../../services/victoriametrics
+../../services/loki
 ];
 }


@@ -34,21 +34,27 @@
 };
 };
-# Declarative datasources pointing to monitoring01
+# Declarative datasources
 provision.datasources.settings = {
 apiVersion = 1;
 datasources = [
 {
-name = "Prometheus";
+name = "VictoriaMetrics";
+type = "prometheus";
+url = "http://localhost:8428";
+isDefault = true;
+uid = "victoriametrics";
+}
+{
+name = "Prometheus (monitoring01)";
 type = "prometheus";
 url = "http://monitoring01.home.2rjus.net:9090";
-isDefault = true;
 uid = "prometheus";
 }
 {
 name = "Loki";
 type = "loki";
-url = "http://monitoring01.home.2rjus.net:3100";
+url = "http://localhost:3100";
 uid = "loki";
 }
 ];
@@ -81,22 +87,20 @@
 services.caddy = {
 enable = true;
 package = pkgs.unstable.caddy;
-configFile = pkgs.writeText "Caddyfile" ''
-{
+globalConfig = ''
 acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
 metrics
-}
-grafana-test.home.2rjus.net {
+'';
+virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
 log {
 output file /var/log/caddy/grafana.log {
 mode 644
 }
 }
 reverse_proxy http://127.0.0.1:3000
-}
+'';
+# Metrics endpoint on plain HTTP for Prometheus scraping
+extraConfig = ''
 http://${config.networking.hostName}.home.2rjus.net/metrics {
 metrics
 }

104
services/loki/default.nix Normal file

@@ -0,0 +1,104 @@
{ config, lib, pkgs, ... }:
let
# Script to generate bcrypt hash from Vault password for Caddy basic_auth
generateCaddyAuth = pkgs.writeShellApplication {
name = "generate-caddy-loki-auth";
runtimeInputs = [ config.services.caddy.package ];
text = ''
PASSWORD=$(cat /run/secrets/loki-push-auth)
HASH=$(caddy hash-password --plaintext "$PASSWORD")
echo "LOKI_PUSH_HASH=$HASH" > /run/secrets/caddy-loki-auth.env
chmod 0400 /run/secrets/caddy-loki-auth.env
'';
};
in
{
# Fetch Loki push password from Vault
vault.secrets.loki-push-auth = {
secretPath = "shared/loki/push-auth";
extractKey = "password";
services = [ "caddy" ];
};
# Generate bcrypt hash for Caddy before it starts
systemd.services.caddy-loki-auth = {
description = "Generate Caddy basic auth hash for Loki";
after = [ "vault-secret-loki-push-auth.service" ];
requires = [ "vault-secret-loki-push-auth.service" ];
before = [ "caddy.service" ];
requiredBy = [ "caddy.service" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = lib.getExe generateCaddyAuth;
};
};
# Load the bcrypt hash as environment variable for Caddy
services.caddy.environmentFile = "/run/secrets/caddy-loki-auth.env";
# Caddy reverse proxy for Loki with basic auth
services.caddy.virtualHosts."loki.home.2rjus.net".extraConfig = ''
basic_auth {
promtail {env.LOKI_PUSH_HASH}
}
reverse_proxy http://127.0.0.1:3100
'';
services.loki = {
enable = true;
configuration = {
auth_enabled = false;
server = {
http_listen_address = "127.0.0.1";
http_listen_port = 3100;
};
common = {
ring = {
instance_addr = "127.0.0.1";
kvstore = {
store = "inmemory";
};
};
replication_factor = 1;
path_prefix = "/var/lib/loki";
};
schema_config = {
configs = [
{
from = "2024-01-01";
store = "tsdb";
object_store = "filesystem";
schema = "v13";
index = {
prefix = "loki_index_";
period = "24h";
};
}
];
};
storage_config = {
filesystem = {
directory = "/var/lib/loki/chunks";
};
};
compactor = {
working_directory = "/var/lib/loki/compactor";
compaction_interval = "10m";
retention_enabled = true;
retention_delete_delay = "2h";
retention_delete_worker_count = 150;
delete_request_store = "filesystem";
};
limits_config = {
retention_period = "30d";
ingestion_rate_mb = 10;
ingestion_burst_size_mb = 20;
max_streams_per_user = 10000;
max_query_series = 500;
max_query_parallelism = 8;
};
};
};
}
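
Review sketch (an illustration under assumptions, not code from this PR): the `generate-caddy-loki-auth` script above writes the env file first and restricts it with `chmod` afterwards. With systemd's default `UMask` of 0022 the file is created with mode 0644, so it may be briefly readable by other users, depending on the permissions of `/run/secrets` itself, which this diff does not show. One possible hardening is to set `umask 077` before the write. Only the `umask` line is new; everything else is copied from the module above.

```nix
{ config, lib, pkgs, ... }:
let
  # Hypothetical variant of generateCaddyAuth from the diff above.
  generateCaddyAuth = pkgs.writeShellApplication {
    name = "generate-caddy-loki-auth";
    runtimeInputs = [ config.services.caddy.package ];
    text = ''
      # Assumption: create the env file owner-only from the start instead of
      # relying solely on the chmod that follows the write.
      umask 077
      PASSWORD=$(cat /run/secrets/loki-push-auth)
      HASH=$(caddy hash-password --plaintext "$PASSWORD")
      echo "LOKI_PUSH_HASH=$HASH" > /run/secrets/caddy-loki-auth.env
      chmod 0400 /run/secrets/caddy-loki-auth.env
    '';
  };
in
{
  # Same unit as in the diff; repeated only to keep the sketch self-contained.
  systemd.services.caddy-loki-auth.serviceConfig.ExecStart = lib.getExe generateCaddyAuth;
}
```

If adopted, this would replace the `let` binding in `services/loki/default.nix` rather than live in a separate module, since two definitions of `ExecStart` would conflict.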


@@ -0,0 +1,219 @@
{ self, config, lib, pkgs, ... }:
let
monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ../monitoring/external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token-vm";
runtimeInputs = [ pkgs.curl pkgs.jq ];
text = ''
VAULT_ADDR="https://vault01.home.2rjus.net:8200"
APPROLE_DIR="/var/lib/vault/approle"
OUTPUT_FILE="/run/secrets/victoriametrics/openbao-token"
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
echo "AppRole credentials not found at $APPROLE_DIR" >&2
exit 1
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
AUTH_RESPONSE=$(curl -sf -k -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login")
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "Failed to extract Vault token from response" >&2
exit 1
fi
# Write token to file
mkdir -p "$(dirname "$OUTPUT_FILE")"
echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
chown victoriametrics:victoriametrics "$OUTPUT_FILE"
chmod 0400 "$OUTPUT_FILE"
echo "Successfully fetched OpenBao token"
'';
};
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
{
job_name = "node-exporter";
static_configs = nodeExporterTargets;
}
# Systemd exporter on all hosts (same targets, different port)
{
job_name = "systemd-exporter";
static_configs = map
(cfg: cfg // {
targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
})
nodeExporterTargets;
}
# Local monitoring services
{
job_name = "victoriametrics";
static_configs = [{ targets = [ "localhost:8428" ]; }];
}
{
job_name = "loki";
static_configs = [{ targets = [ "localhost:3100" ]; }];
}
{
job_name = "grafana";
static_configs = [{ targets = [ "localhost:3000" ]; }];
}
{
job_name = "alertmanager";
static_configs = [{ targets = [ "localhost:9093" ]; }];
}
# Caddy metrics from nix-cache02
{
job_name = "nix-cache_caddy";
scheme = "https";
static_configs = [{ targets = [ "nix-cache.home.2rjus.net" ]; }];
}
# OpenBao metrics with bearer token auth
{
job_name = "openbao";
scheme = "https";
metrics_path = "/v1/sys/metrics";
params = { format = [ "prometheus" ]; };
static_configs = [{ targets = [ "vault01.home.2rjus.net:8200" ]; }];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/victoriametrics/openbao-token";
};
}
# Apiary external service
{
job_name = "apiary";
scheme = "https";
scrape_interval = "60s";
static_configs = [{ targets = [ "apiary.t-juice.club" ]; }];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/victoriametrics-apiary-token";
};
}
] ++ autoScrapeConfigs;
in
{
# Static user for VictoriaMetrics (overrides DynamicUser) so vault.secrets
# and credential files can be owned by this user
users.users.victoriametrics = {
isSystemUser = true;
group = "victoriametrics";
};
users.groups.victoriametrics = { };
# Override DynamicUser since we need a static user for credential file access
systemd.services.victoriametrics.serviceConfig = {
DynamicUser = lib.mkForce false;
User = "victoriametrics";
Group = "victoriametrics";
};
# Systemd service to fetch AppRole token for OpenBao scraping
systemd.services.victoriametrics-openbao-token = {
description = "Fetch OpenBao token for VictoriaMetrics metrics scraping";
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
before = [ "victoriametrics.service" ];
requiredBy = [ "victoriametrics.service" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe fetchOpenbaoToken;
};
};
# Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
systemd.timers.victoriametrics-openbao-token = {
description = "Refresh OpenBao token for VictoriaMetrics";
wantedBy = [ "timers.target" ];
timerConfig = {
OnBootSec = "5min";
OnUnitActiveSec = "30min";
RandomizedDelaySec = "5min";
};
};
# Fetch apiary bearer token from Vault
vault.secrets.victoriametrics-apiary-token = {
secretPath = "hosts/monitoring01/apiary-token";
extractKey = "password";
owner = "victoriametrics";
group = "victoriametrics";
services = [ "victoriametrics" ];
};
services.victoriametrics = {
enable = true;
retentionPeriod = "3"; # 3 months
# Disable config check since we reference external credential files
checkConfig = false;
prometheusConfig = {
global.scrape_interval = "15s";
scrape_configs = scrapeConfigs;
};
};
# vmalert for alerting rules - no notifier during parallel operation
services.vmalert.instances.default = {
enable = true;
settings = {
"datasource.url" = "http://localhost:8428";
# Blackhole notifications during parallel operation to prevent duplicate alerts.
# Replace with notifier.url after cutover from monitoring01:
# "notifier.url" = [ "http://localhost:9093" ];
"notifier.blackhole" = true;
"rule" = [ ../monitoring/rules.yml ];
};
};
# Caddy reverse proxy for VictoriaMetrics and vmalert
services.caddy.virtualHosts."metrics.home.2rjus.net".extraConfig = ''
reverse_proxy http://127.0.0.1:8428
'';
services.caddy.virtualHosts."vmalert.home.2rjus.net".extraConfig = ''
reverse_proxy http://127.0.0.1:8880
'';
# Alertmanager - same config as monitoring01 but will only receive
# alerts after cutover (vmalert notifier is disabled above)
services.prometheus.alertmanager = {
enable = true;
configuration = {
global = { };
route = {
receiver = "webhook_natstonotify";
group_wait = "30s";
group_interval = "5m";
repeat_interval = "1h";
group_by = [ "alertname" ];
};
receivers = [
{
name = "webhook_natstonotify";
webhook_configs = [
{
url = "http://localhost:5001/alert";
}
];
}
];
};
};
}
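
Review sketch: the comment in the vmalert block above describes the change planned for after cutover from monitoring01. A minimal illustration of how that settings block might then look, assuming Alertmanager keeps listening on `localhost:9093` as in the commented-out line. `notifier.blackhole` is dropped here because blackhole mode and a notifier URL are not meant to be combined. This is not part of the PR.

```nix
{ ... }:
{
  services.vmalert.instances.default = {
    enable = true;
    settings = {
      "datasource.url" = "http://localhost:8428";
      # Assumption: Alertmanager runs on this host on port 9093, as in the
      # commented-out line of the diff above.
      "notifier.url" = [ "http://localhost:9093" ];
      "rule" = [ ../monitoring/rules.yml ];
    };
  };
}
```

The relative `rule` path assumes the block stays in `services/victoriametrics/default.nix`, replacing the current vmalert definition.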


@@ -16,6 +16,16 @@ in
 SystemKeepFree=1G
 '';
 };
+# Fetch Loki push password from Vault (only on hosts with Vault enabled)
+vault.secrets.promtail-loki-auth = lib.mkIf config.vault.enable {
+secretPath = "shared/loki/push-auth";
+extractKey = "password";
+owner = "promtail";
+group = "promtail";
+services = [ "promtail" ];
+};
 # Configure promtail
 services.promtail = {
 enable = true;
@@ -31,6 +41,14 @@ in
 {
 url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
 }
+] ++ lib.optionals config.vault.enable [
+{
+url = "https://loki.home.2rjus.net/loki/api/v1/push";
+basic_auth = {
+username = "promtail";
+password_file = "/run/secrets/promtail-loki-auth";
+};
+}
 ];
 scrape_configs = [


@@ -26,6 +26,17 @@ path "secret/data/shared/nixos-exporter/*" {
 EOT
 }
+# Shared policy for Loki push authentication (all hosts push logs)
+resource "vault_policy" "loki_push" {
+name = "loki-push"
+policy = <<EOT
+path "secret/data/shared/loki/*" {
+capabilities = ["read", "list"]
+}
+EOT
+}
 # Define host access policies
 locals {
 host_policies = {
@@ -78,7 +89,7 @@ locals {
 ]
 }
-# Wave 3: DNS servers
+# Wave 3: DNS servers (managed in hosts-generated.tf)
 # Wave 4: http-proxy
 "http-proxy" = {
@@ -104,10 +115,11 @@ locals {
 ]
 }
-# monitoring02: Grafana test instance
+# monitoring02: Grafana + VictoriaMetrics
 "monitoring02" = {
 paths = [
 "secret/data/hosts/monitoring02/*",
+"secret/data/hosts/monitoring01/apiary-token",
 "secret/data/services/grafana/*",
 ]
 }
@@ -137,7 +149,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
 backend = vault_auth_backend.approle.path
 role_name = each.key
 token_policies = concat(
-["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
+["${each.key}-policy", "homelab-deploy", "nixos-exporter", "loki-push"],
 lookup(each.value, "extra_policies", [])
 )


@@ -74,7 +74,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
 backend = vault_auth_backend.approle.path
 role_name = each.key
-token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
+token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"]
 secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
 token_ttl = 3600
 token_max_ttl = 3600


@@ -153,6 +153,12 @@ locals {
 auto_generate = true
 password_length = 64
 }
+# Loki push authentication (used by Promtail on all hosts)
+"shared/loki/push-auth" = {
+auto_generate = true
+password_length = 32
+}
 }
 }