Compare commits: jellyfin-m ... ef850d91a4 (6 commits)

| Author | SHA1 | Date | |
|---|---|---|---|
| | ef850d91a4 | | |
| | a99fb5b959 | | |
| | d385f02c89 | | |
| | 8dfd04b406 | | |
| | 63cf690598 | | |
| | ef8eeaa2f5 | | |
@@ -130,7 +130,7 @@ get_commit_info(<hash>) # Get full details of a specific change
|
|||||||
```
|
```
|
||||||
|
|
||||||
**Example workflow for a service-related alert:**
|
**Example workflow for a service-related alert:**
|
||||||
1. Query `nixos_flake_info{hostname="monitoring02"}` → `current_rev: 8959829`
|
1. Query `nixos_flake_info{hostname="monitoring01"}` → `current_rev: 8959829`
|
||||||
2. `resolve_ref("master")` → `4633421`
|
2. `resolve_ref("master")` → `4633421`
|
||||||
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
|
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
|
||||||
4. `commits_between("8959829", "4633421")` → 7 commits missing
|
4. `commits_between("8959829", "4633421")` → 7 commits missing
|
||||||
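The four workflow steps above can be sketched as a small Python model. This is only an illustration: `HISTORY`, `REFS`, the placeholder hashes `c1`..`c6`, and `deployment_gap` are hypothetical, and the three helper functions are toy stand-ins that mimic the tool names from the workflow over a linear history rather than calling the real tools.

```python
# Toy model of the "is this host behind master?" workflow.
# Only 8959829 and 4633421 come from the example; c1..c6 are made-up placeholders.
HISTORY = ["8959829", "c1", "c2", "c3", "c4", "c5", "c6", "4633421"]  # oldest -> newest
REFS = {"master": "4633421"}


def resolve_ref(ref):
    """Return the commit hash a ref points to (stand-in for the real tool)."""
    return REFS[ref]


def is_ancestor(ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` in the linear history."""
    return HISTORY.index(ancestor) <= HISTORY.index(descendant)


def commits_between(old, new):
    """Commits after `old` up to and including `new`."""
    return HISTORY[HISTORY.index(old) + 1 : HISTORY.index(new) + 1]


def deployment_gap(current_rev, ref="master"):
    """List the commits a host is missing relative to `ref`."""
    head = resolve_ref(ref)
    if not is_ancestor(current_rev, head):
        # The host runs something that is not on the branch at all.
        raise ValueError(f"{current_rev} is not an ancestor of {head}")
    return commits_between(current_rev, head)


missing = deployment_gap("8959829")
print(f"host is behind by {len(missing)} commits")
```

If the ancestry check fails, the host is running a revision that is not part of `master`, so counting missing commits would be misleading; the sketch raises instead.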
|
|||||||
@@ -30,7 +30,7 @@ Use the `lab-monitoring` MCP server tools:
|
|||||||
### Label Reference
|
### Label Reference
|
||||||
|
|
||||||
Available labels for log queries:
|
Available labels for log queries:
|
||||||
- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
|
- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
|
||||||
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
|
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
|
||||||
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
|
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
|
||||||
- `filename` - For `varlog` job, the log file path
|
- `filename` - For `varlog` job, the log file path
|
||||||
@@ -54,7 +54,7 @@ Journal logs are JSON-formatted. Key fields:
|
|||||||
|
|
||||||
**All logs from a host:**
|
**All logs from a host:**
|
||||||
```logql
|
```logql
|
||||||
{hostname="monitoring02"}
|
{hostname="monitoring01"}
|
||||||
```
|
```
|
||||||
|
|
||||||
**Logs from a service across all hosts:**
|
**Logs from a service across all hosts:**
|
||||||
@@ -74,7 +74,7 @@ Journal logs are JSON-formatted. Key fields:
|
|||||||
|
|
||||||
**Regex matching:**
|
**Regex matching:**
|
||||||
```logql
|
```logql
|
||||||
{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
|
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
|
||||||
```
|
```
|
||||||
|
|
||||||
**Filter by level (journal scrape only):**
|
**Filter by level (journal scrape only):**
|
||||||
@@ -109,7 +109,7 @@ Default lookback is 1 hour. Use `start` parameter for older logs:
|
|||||||
Useful systemd units for troubleshooting:
|
Useful systemd units for troubleshooting:
|
||||||
- `nixos-upgrade.service` - Daily auto-upgrade logs
|
- `nixos-upgrade.service` - Daily auto-upgrade logs
|
||||||
- `nsd.service` - DNS server (ns1/ns2)
|
- `nsd.service` - DNS server (ns1/ns2)
|
||||||
- `victoriametrics.service` - Metrics collection
|
- `prometheus.service` - Metrics collection
|
||||||
- `loki.service` - Log aggregation
|
- `loki.service` - Log aggregation
|
||||||
- `caddy.service` - Reverse proxy
|
- `caddy.service` - Reverse proxy
|
||||||
- `home-assistant.service` - Home automation
|
- `home-assistant.service` - Home automation
|
||||||
@@ -152,7 +152,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl
|
|||||||
|
|
||||||
Parse JSON and filter on fields:
|
Parse JSON and filter on fields:
|
||||||
```logql
|
```logql
|
||||||
{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
|
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -242,11 +242,12 @@ All available Prometheus job names:
|
|||||||
- `unbound` - DNS resolver metrics (ns1, ns2)
|
- `unbound` - DNS resolver metrics (ns1, ns2)
|
||||||
- `wireguard` - VPN tunnel metrics (http-proxy)
|
- `wireguard` - VPN tunnel metrics (http-proxy)
|
||||||
|
|
||||||
**Monitoring stack (localhost on monitoring02):**
|
**Monitoring stack (localhost on monitoring01):**
|
||||||
- `victoriametrics` - VictoriaMetrics self-metrics
|
- `prometheus` - Prometheus self-metrics
|
||||||
- `loki` - Loki self-metrics
|
- `loki` - Loki self-metrics
|
||||||
- `grafana` - Grafana self-metrics
|
- `grafana` - Grafana self-metrics
|
||||||
- `alertmanager` - Alertmanager metrics
|
- `alertmanager` - Alertmanager metrics
|
||||||
|
- `pushgateway` - Push-based metrics gateway
|
||||||
|
|
||||||
**External/infrastructure:**
|
**External/infrastructure:**
|
||||||
- `pve-exporter` - Proxmox hypervisor metrics
|
- `pve-exporter` - Proxmox hypervisor metrics
|
||||||
@@ -261,7 +262,7 @@ All scrape targets have these labels:
|
|||||||
**Standard labels:**
|
**Standard labels:**
|
||||||
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
|
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
|
||||||
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
|
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
|
||||||
- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering
|
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
|
||||||
|
|
||||||
**Host metadata labels** (when configured in `homelab.host`):
|
**Host metadata labels** (when configured in `homelab.host`):
|
||||||
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
|
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
|
||||||
@@ -274,7 +275,7 @@ Use the `hostname` label for easy host filtering across all jobs:
|
|||||||
|
|
||||||
```promql
|
```promql
|
||||||
{hostname="ns1"} # All metrics from ns1
|
{hostname="ns1"} # All metrics from ns1
|
||||||
node_load1{hostname="monitoring02"} # Specific metric by hostname
|
node_load1{hostname="monitoring01"} # Specific metric by hostname
|
||||||
up{hostname="ha1"} # Check if ha1 is up
|
up{hostname="ha1"} # Check if ha1 is up
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -282,10 +283,10 @@ This is simpler than wildcarding the `instance` label:
|
|||||||
|
|
||||||
```promql
|
```promql
|
||||||
# Old way (still works but verbose)
|
# Old way (still works but verbose)
|
||||||
up{instance=~"monitoring02.*"}
|
up{instance=~"monitoring01.*"}
|
||||||
|
|
||||||
# New way (preferred)
|
# New way (preferred)
|
||||||
up{hostname="monitoring02"}
|
up{hostname="monitoring01"}
|
||||||
```
|
```
|
||||||
|
|
||||||
### Filtering by Role/Tier
|
### Filtering by Role/Tier
|
||||||
|
|||||||
@@ -73,7 +73,6 @@ Additional context, caveats, or references.
|
|||||||
- **Reference existing patterns**: Mention how this fits with existing infrastructure
|
- **Reference existing patterns**: Mention how this fits with existing infrastructure
|
||||||
- **Tables for comparisons**: Use markdown tables when comparing options
|
- **Tables for comparisons**: Use markdown tables when comparing options
|
||||||
- **Practical focus**: Emphasize what needs to happen, not theory
|
- **Practical focus**: Emphasize what needs to happen, not theory
|
||||||
- **Mermaid diagrams**: Use mermaid code blocks for architecture diagrams, flow charts, or other graphs when relevant to the plan. Keep node labels short and use `<br/>` for line breaks
|
|
||||||
|
|
||||||
## Examples of Good Plans
|
## Examples of Good Plans
|
||||||
|
|
||||||
|
|||||||
3
.gitignore
vendored
@@ -2,9 +2,6 @@
|
|||||||
result
|
result
|
||||||
result-*
|
result-*
|
||||||
|
|
||||||
# MCP config (contains secrets)
|
|
||||||
.mcp.json
|
|
||||||
|
|
||||||
# Terraform/OpenTofu
|
# Terraform/OpenTofu
|
||||||
terraform/.terraform/
|
terraform/.terraform/
|
||||||
terraform/.terraform.lock.hcl
|
terraform/.terraform.lock.hcl
|
||||||
|
|||||||
@@ -20,9 +20,7 @@
|
|||||||
"env": {
|
"env": {
|
||||||
"PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
|
"PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
|
||||||
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
|
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
|
||||||
"LOKI_URL": "https://loki.home.2rjus.net",
|
"LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
|
||||||
"LOKI_USERNAME": "promtail",
|
|
||||||
"LOKI_PASSWORD": "<password from: bao kv get -field=password secret/shared/loki/push-auth>"
|
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"homelab-deploy": {
|
"homelab-deploy": {
|
||||||
@@ -46,3 +44,4 @@
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
20
CLAUDE.md
@@ -247,7 +247,7 @@ nix develop -c homelab-deploy -- deploy \
|
|||||||
deploy.prod.<hostname>
|
deploy.prod.<hostname>
|
||||||
```
|
```
|
||||||
|
|
||||||
Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring02`, `deploy.test.testvm01`)
|
Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
|
||||||
|
|
||||||
**Verifying Deployments:**
|
**Verifying Deployments:**
|
||||||
|
|
||||||
@@ -309,7 +309,7 @@ All hosts automatically get:
|
|||||||
- OpenBao (Vault) secrets management via AppRole
|
- OpenBao (Vault) secrets management via AppRole
|
||||||
- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
|
- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
|
||||||
- Daily auto-upgrades with auto-reboot
|
- Daily auto-upgrades with auto-reboot
|
||||||
- Prometheus node-exporter + Promtail (logs to monitoring02)
|
- Prometheus node-exporter + Promtail (logs to monitoring01)
|
||||||
- Monitoring scrape target auto-registration via `homelab.monitoring` options
|
- Monitoring scrape target auto-registration via `homelab.monitoring` options
|
||||||
- Custom root CA trust
|
- Custom root CA trust
|
||||||
- DNS zone auto-registration via `homelab.dns` options
|
- DNS zone auto-registration via `homelab.dns` options
|
||||||
@@ -335,7 +335,7 @@ Use `nix flake show` or `nix develop -c ansible-inventory --graph` to list all h
|
|||||||
- Infrastructure subnet: `10.69.13.x`
|
- Infrastructure subnet: `10.69.13.x`
|
||||||
- DNS: ns1/ns2 provide authoritative DNS with primary-secondary setup
|
- DNS: ns1/ns2 provide authoritative DNS with primary-secondary setup
|
||||||
- Internal CA for ACME certificates (no Let's Encrypt)
|
- Internal CA for ACME certificates (no Let's Encrypt)
|
||||||
- Centralized monitoring at monitoring02
|
- Centralized monitoring at monitoring01
|
||||||
- Static networking via systemd-networkd
|
- Static networking via systemd-networkd
|
||||||
|
|
||||||
### Secrets Management
|
### Secrets Management
|
||||||
@@ -480,21 +480,23 @@ See [docs/host-creation.md](docs/host-creation.md) for the complete host creatio
|
|||||||
|
|
||||||
### Monitoring Stack
|
### Monitoring Stack
|
||||||
|
|
||||||
All hosts ship metrics and logs to `monitoring02`:
|
All hosts ship metrics and logs to `monitoring01`:
|
||||||
- **Metrics**: VictoriaMetrics scrapes node-exporter from all hosts
|
- **Metrics**: Prometheus scrapes node-exporter from all hosts
|
||||||
- **Logs**: Promtail ships logs to Loki on monitoring02
|
- **Logs**: Promtail ships logs to Loki on monitoring01
|
||||||
- **Access**: Grafana at monitoring02 for visualization
|
- **Access**: Grafana at monitoring01 for visualization
|
||||||
|
- **Tracing**: Tempo for distributed tracing
|
||||||
|
- **Profiling**: Pyroscope for continuous profiling
|
||||||
|
|
||||||
**Scrape Target Auto-Generation:**
|
**Scrape Target Auto-Generation:**
|
||||||
|
|
||||||
VictoriaMetrics scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
|
Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
|
||||||
|
|
||||||
- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
|
- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
|
||||||
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
|
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
|
||||||
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
|
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
|
||||||
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
|
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
|
||||||
|
|
||||||
Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The VictoriaMetrics config on monitoring02 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
|
Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
|
||||||
|
|
||||||
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
|
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.
|
||||||
|
|
||||||
|
|||||||
@@ -10,7 +10,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
|
|||||||
| `ca` | Internal Certificate Authority |
|
| `ca` | Internal Certificate Authority |
|
||||||
| `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
|
| `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
|
||||||
| `http-proxy` | Reverse proxy |
|
| `http-proxy` | Reverse proxy |
|
||||||
| `monitoring02` | VictoriaMetrics, Grafana, Loki, Alertmanager |
|
| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
|
||||||
| `jelly01` | Jellyfin media server |
|
| `jelly01` | Jellyfin media server |
|
||||||
| `nix-cache02` | Nix binary cache + NATS-based build service |
|
| `nix-cache02` | Nix binary cache + NATS-based build service |
|
||||||
| `nats1` | NATS messaging |
|
| `nats1` | NATS messaging |
|
||||||
@@ -121,4 +121,4 @@ No manual intervention is required after `tofu apply`.
|
|||||||
- Infrastructure subnet: `10.69.13.0/24`
|
- Infrastructure subnet: `10.69.13.0/24`
|
||||||
- DNS: ns1/ns2 authoritative with primary-secondary AXFR
|
- DNS: ns1/ns2 authoritative with primary-secondary AXFR
|
||||||
- Internal CA for TLS certificates (migrating from step-ca to OpenBao PKI)
|
- Internal CA for TLS certificates (migrating from step-ca to OpenBao PKI)
|
||||||
- Centralized monitoring at monitoring02
|
- Centralized monitoring at monitoring01
|
||||||
|
|||||||
@@ -1,156 +0,0 @@
|
|||||||
# Monitoring Stack Migration to VictoriaMetrics
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
|
|
||||||
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
|
|
||||||
a `monitoring` CNAME for seamless transition.
|
|
||||||
|
|
||||||
## Current State
|
|
||||||
|
|
||||||
**monitoring02** (10.69.13.24) - **PRIMARY**:
|
|
||||||
- 4 CPU cores, 8GB RAM, 60GB disk
|
|
||||||
- VictoriaMetrics with 3-month retention
|
|
||||||
- vmalert with alerting enabled (routes to local Alertmanager)
|
|
||||||
- Alertmanager -> alerttonotify -> NATS notification pipeline
|
|
||||||
- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
|
|
||||||
- Loki (log aggregation)
|
|
||||||
- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki
|
|
||||||
|
|
||||||
**monitoring01** (10.69.13.13) - **SHUT DOWN**:
|
|
||||||
- No longer running, pending decommission
|
|
||||||
|
|
||||||
## Decision: VictoriaMetrics
|
|
||||||
|
|
||||||
Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
|
|
||||||
- Single binary replacement for Prometheus
|
|
||||||
- 5-10x better compression (30 days could become 180+ days in same space)
|
|
||||||
- Same PromQL query language (Grafana dashboards work unchanged)
|
|
||||||
- Same scrape config format (existing auto-generated configs work)
|
|
||||||
|
|
||||||
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────┐
|
|
||||||
│ monitoring02 │
|
|
||||||
│ VictoriaMetrics│
|
|
||||||
│ + Grafana │
|
|
||||||
monitoring │ + Loki │
|
|
||||||
CNAME ──────────│ + Alertmanager │
|
|
||||||
│ (vmalert) │
|
|
||||||
└─────────────────┘
|
|
||||||
▲
|
|
||||||
│ scrapes
|
|
||||||
┌───────────────┼───────────────┐
|
|
||||||
│ │ │
|
|
||||||
┌────┴────┐ ┌─────┴────┐ ┌─────┴────┐
|
|
||||||
│ ns1 │ │ ha1 │ │ ... │
|
|
||||||
│ :9100 │ │ :9100 │ │ :9100 │
|
|
||||||
└─────────┘ └──────────┘ └──────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## Implementation Plan
|
|
||||||
|
|
||||||
### Phase 1: Create monitoring02 Host [COMPLETE]
|
|
||||||
|
|
||||||
Host created and deployed at 10.69.13.24 (prod tier) with:
|
|
||||||
- 4 CPU cores, 8GB RAM, 60GB disk
|
|
||||||
- Vault integration enabled
|
|
||||||
- NATS-based remote deployment enabled
|
|
||||||
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
|
|
||||||
|
|
||||||
### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]
|
|
||||||
|
|
||||||
New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
|
|
||||||
Imported by monitoring02 alongside the existing Grafana service.
|
|
||||||
|
|
||||||
1. **VictoriaMetrics** (port 8428):
|
|
||||||
- `services.victoriametrics.enable = true`
|
|
||||||
- `retentionPeriod = "3"` (3 months)
|
|
||||||
- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
|
|
||||||
- Static user override (DynamicUser disabled) for credential file access
|
|
||||||
- OpenBao token fetch service + 30min refresh timer
|
|
||||||
- Apiary bearer token via vault.secrets
|
|
||||||
|
|
||||||
2. **vmalert** for alerting rules:
|
|
||||||
- Points to VictoriaMetrics datasource at localhost:8428
|
|
||||||
- Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
|
|
||||||
- Notifier sends to local Alertmanager at localhost:9093
|
|
||||||
|
|
||||||
3. **Alertmanager** (port 9093):
|
|
||||||
- Same configuration as monitoring01 (alerttonotify webhook routing)
|
|
||||||
- alerttonotify imported on monitoring02, routes alerts via NATS
|
|
||||||
|
|
||||||
4. **Grafana** (port 3000):
|
|
||||||
- VictoriaMetrics datasource (localhost:8428) as default
|
|
||||||
- Loki datasource pointing to localhost:3100
|
|
||||||
|
|
||||||
5. **Loki** (port 3100):
|
|
||||||
- Same configuration as monitoring01 in standalone `services/loki/` module
|
|
||||||
- Grafana datasource updated to localhost:3100
|
|
||||||
|
|
||||||
**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
|
|
||||||
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
|
|
||||||
native push support.
|
|
||||||
|
|
||||||
### Phase 3: Parallel Operation [COMPLETE]
|
|
||||||
|
|
||||||
Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.
|
|
||||||
|
|
||||||
### Phase 4: Add monitoring CNAME [COMPLETE]
|
|
||||||
|
|
||||||
Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.
|
|
||||||
|
|
||||||
### Phase 5: Update References [COMPLETE]
|
|
||||||
|
|
||||||
- Moved alertmanager, grafana, prometheus CNAMEs from http-proxy to monitoring02
|
|
||||||
- Removed corresponding Caddy reverse proxy entries from http-proxy
|
|
||||||
- monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly
|
|
||||||
|
|
||||||
### Phase 6: Enable Alerting [COMPLETE]
|
|
||||||
|
|
||||||
- Switched vmalert from blackhole mode to local Alertmanager
|
|
||||||
- alerttonotify service running on monitoring02 (NATS nkey from Vault)
|
|
||||||
- prometheus-metrics Vault policy added for OpenBao scraping
|
|
||||||
- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS
|
|
||||||
|
|
||||||
### Phase 7: Cutover and Decommission [IN PROGRESS]
|
|
||||||
|
|
||||||
- monitoring01 shut down (2026-02-17)
|
|
||||||
- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support
|
|
||||||
|
|
||||||
**Remaining cleanup (separate branch):**
|
|
||||||
- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
|
|
||||||
- [ ] Update `hosts/template2/bootstrap.nix` - Bootstrap Loki URL still points to monitoring01
|
|
||||||
- [ ] Remove monitoring01 from flake.nix and host configuration
|
|
||||||
- [ ] Destroy monitoring01 VM in Proxmox
|
|
||||||
- [ ] Remove monitoring01 from terraform state
|
|
||||||
- [ ] Remove or archive `services/monitoring/` (Prometheus config)
|
|
||||||
|
|
||||||
## Completed
|
|
||||||
|
|
||||||
- 2026-02-08: Phase 1 - monitoring02 host created
|
|
||||||
- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
|
|
||||||
- 2026-02-17: Phase 6 - Alerting enabled, CNAMEs migrated, monitoring01 shut down
|
|
||||||
|
|
||||||
## VictoriaMetrics Service Configuration
|
|
||||||
|
|
||||||
Implemented in `services/victoriametrics/default.nix`. Key design decisions:
|
|
||||||
|
|
||||||
- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
|
|
||||||
`victoriametrics` user so vault.secrets and credential files work correctly
|
|
||||||
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
|
|
||||||
reference (no YAML-to-Nix conversion needed)
|
|
||||||
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
|
|
||||||
`services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- VictoriaMetrics uses port 8428 vs Prometheus 9090
|
|
||||||
- PromQL compatibility is excellent
|
|
||||||
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
|
|
||||||
- monitoring02 deployed via OpenTofu using `create-host` script
|
|
||||||
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
|
|
||||||
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)
|
|
||||||
@@ -20,9 +20,9 @@ Hosts to migrate:
|
|||||||
| http-proxy | Stateless | Reverse proxy, recreate |
|
| http-proxy | Stateless | Reverse proxy, recreate |
|
||||||
| nats1 | Stateless | Messaging, recreate |
|
| nats1 | Stateless | Messaging, recreate |
|
||||||
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
|
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
|
||||||
| ~~monitoring01~~ | ~~Decommission~~ | ✓ Complete — replaced by monitoring02 (VictoriaMetrics) |
|
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
|
||||||
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
|
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
|
||||||
| ~~pgdb1~~ | ~~Decommission~~ | ✓ Complete |
|
| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
|
||||||
| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
|
| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
|
||||||
| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
|
| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
|
||||||
| ~~ca~~ | ~~Deferred~~ | ✓ Complete |
|
| ~~ca~~ | ~~Deferred~~ | ✓ Complete |
|
||||||
@@ -31,12 +31,10 @@ Hosts to migrate:
|
|||||||
|
|
||||||
Before migrating any stateful host, ensure restic backups are in place and verified.
|
Before migrating any stateful host, ensure restic backups are in place and verified.
|
||||||
|
|
||||||
### ~~1a. Expand monitoring01 Grafana Backup~~ ✓ N/A
|
### 1a. Expand monitoring01 Grafana Backup
|
||||||
|
|
||||||
~~The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
|
The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
|
||||||
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.~~
|
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.
|
||||||
|
|
||||||
No longer needed — monitoring01 decommissioned, replaced by monitoring02 with declarative Grafana dashboards.
|
|
||||||
|
|
||||||
### 1b. Add Jellyfin Backup to jelly01
|
### 1b. Add Jellyfin Backup to jelly01
|
||||||
|
|
||||||
@@ -96,17 +94,15 @@ For each stateful host, the procedure is:
|
|||||||
7. Start services and verify functionality
|
7. Start services and verify functionality
|
||||||
8. Decommission the old VM
|
8. Decommission the old VM
|
||||||
|
|
||||||
### 3a. monitoring01 ✓ COMPLETE
|
### 3a. monitoring01
|
||||||
|
|
||||||
~~1. Run final Grafana backup~~
|
1. Run final Grafana backup
|
||||||
~~2. Provision new monitoring01 via OpenTofu~~
|
2. Provision new monitoring01 via OpenTofu
|
||||||
~~3. After bootstrap, restore `/var/lib/grafana/` from restic~~
|
3. After bootstrap, restore `/var/lib/grafana/` from restic
|
||||||
~~4. Restart Grafana, verify dashboards and datasources are intact~~
|
4. Restart Grafana, verify dashboards and datasources are intact
|
||||||
~~5. Prometheus and Loki start fresh with empty data (acceptable)~~
|
5. Prometheus and Loki start fresh with empty data (acceptable)
|
||||||
~~6. Verify all scrape targets are being collected~~
|
6. Verify all scrape targets are being collected
|
||||||
~~7. Decommission old VM~~
|
7. Decommission old VM
|
||||||
|
|
||||||
Replaced by monitoring02 with VictoriaMetrics, standalone Loki and Grafana modules. Host configuration, old service modules, and terraform resources removed.
|
|
||||||
|
|
||||||
### 3b. jelly01
|
### 3b. jelly01
|
||||||
|
|
||||||
@@ -167,19 +163,19 @@ Host was already removed from flake.nix and VM destroyed. Configuration cleaned
|
|||||||
|
|
||||||
Host configuration, services, and VM already removed.
|
Host configuration, services, and VM already removed.
|
||||||
|
|
||||||
### pgdb1 ✓ COMPLETE
|
### pgdb1 (in progress)
|
||||||
|
|
||||||
~~Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.~~
|
Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
|
||||||
|
|
||||||
~~1. Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
|
1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~ ✓
|
||||||
~~2. Remove host configuration from `hosts/pgdb1/`~~
|
2. ~~Remove host configuration from `hosts/pgdb1/`~~ ✓
|
||||||
~~3. Remove `services/postgres/` (only used by pgdb1)~~
|
3. ~~Remove `services/postgres/` (only used by pgdb1)~~ ✓
|
||||||
~~4. Remove from `flake.nix`~~
|
4. ~~Remove from `flake.nix`~~ ✓
|
||||||
~~5. Remove Vault AppRole from `terraform/vault/approle.tf`~~
|
5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~ ✓
|
||||||
~~6. Destroy the VM in Proxmox~~
|
6. Destroy the VM in Proxmox
|
||||||
~~7. Commit cleanup~~
|
7. ~~Commit cleanup~~ ✓
|
||||||
|
|
||||||
Host configuration, services, terraform resources, and VM removed. See `docs/plans/pgdb1-decommission.md` for detailed plan.
|
See `docs/plans/pgdb1-decommission.md` for detailed plan.
|
||||||
|
|
||||||
## Phase 5: Decommission ca Host ✓ COMPLETE
|
## Phase 5: Decommission ca Host ✓ COMPLETE
|
||||||
|
|
||||||
|
|||||||
209
docs/plans/monitoring-migration-victoriametrics.md
Normal file
@@ -0,0 +1,209 @@
|
|||||||
|
# Monitoring Stack Migration to VictoriaMetrics
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
|
||||||
|
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
|
||||||
|
a `monitoring` CNAME for seamless transition.
|
||||||
|
|
||||||
|
## Current State
|
||||||
|
|
||||||
|
**monitoring01** (10.69.13.13):
|
||||||
|
- 4 CPU cores, 4GB RAM, 33GB disk
|
||||||
|
- Prometheus with 30-day retention (15s scrape interval)
|
||||||
|
- Alertmanager (routes to alerttonotify webhook)
|
||||||
|
- Grafana (dashboards, datasources)
|
||||||
|
- Loki (log aggregation from all hosts via Promtail)
|
||||||
|
- Tempo (distributed tracing)
|
||||||
|
- Pyroscope (continuous profiling)
|
||||||
|
|
||||||
|
**Hardcoded References to monitoring01:**
|
||||||
|
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
|
||||||
|
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
|
||||||
|
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
|
||||||
|
|
||||||
|
**Auto-generated:**
|
||||||
|
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
|
||||||
|
- Node-exporter targets (from all hosts with static IPs)
|
||||||
|
|
||||||
|
## Decision: VictoriaMetrics
|
||||||
|
|
||||||
|
Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
|
||||||
|
- Single binary replacement for Prometheus
|
||||||
|
- 5-10x better compression (30 days could become 180+ days in same space)
|
||||||
|
- Same PromQL query language (Grafana dashboards work unchanged)
|
||||||
|
- Same scrape config format (existing auto-generated configs work)
|
||||||
|
|
||||||
|
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
|
||||||
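The compression claim above can be checked with simple arithmetic. The sketch below assumes disk usage scales linearly with retention and that the compression gain applies uniformly to all stored data; `projected_retention_days` is a hypothetical helper, not part of the plan.

```python
def projected_retention_days(current_days, compression_gain):
    """Days of data that fit in the same disk space after a compression gain.

    Assumes disk usage grows linearly with retention (an illustrative simplification).
    """
    return current_days * compression_gain


# 30-day Prometheus retention with the quoted 5-10x compression range.
low = projected_retention_days(30, 5)
high = projected_retention_days(30, 10)
print(f"{low}-{high} days in the same space")
```

Under these assumptions the "180+ days" figure corresponds to a gain of about 6x, which sits inside the quoted range.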
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────┐
|
||||||
|
│ monitoring02 │
|
||||||
|
│ VictoriaMetrics│
|
||||||
|
│ + Grafana │
|
||||||
|
monitoring │ + Loki │
|
||||||
|
CNAME ──────────│ + Tempo │
|
||||||
|
│ + Pyroscope │
|
||||||
|
│ + Alertmanager │
|
||||||
|
│ (vmalert) │
|
||||||
|
└─────────────────┘
|
||||||
|
▲
|
||||||
|
│ scrapes
|
||||||
|
┌───────────────┼───────────────┐
|
||||||
|
│ │ │
|
||||||
|
┌────┴────┐ ┌─────┴────┐ ┌─────┴────┐
|
||||||
|
│ ns1 │ │ ha1 │ │ ... │
|
||||||
|
│ :9100 │ │ :9100 │ │ :9100 │
|
||||||
|
└─────────┘ └──────────┘ └──────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## Implementation Plan
|
||||||
|
|
||||||
|
### Phase 1: Create monitoring02 Host [COMPLETE]
|
||||||
|
|
||||||
|
Host created and deployed at 10.69.13.24 (prod tier) with:
|
||||||
|
- 4 CPU cores, 8GB RAM, 60GB disk
|
||||||
|
- Vault integration enabled
|
||||||
|
- NATS-based remote deployment enabled
|
||||||
|
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
|
||||||
|
|
||||||
|
### Phase 2: Set Up VictoriaMetrics Stack
|
||||||
|
|
||||||
|
New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
|
||||||
|
Imported by monitoring02 alongside the existing Grafana service.
|
||||||
|
|
||||||
|
1. **VictoriaMetrics** (port 8428): [DONE]
|
||||||
|
- `services.victoriametrics.enable = true`
|
||||||
|
- `retentionPeriod = "3"` (3 months)
|
||||||
|
- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
|
||||||
|
- Static user override (DynamicUser disabled) for credential file access
|
||||||
|
- OpenBao token fetch service + 30min refresh timer
|
||||||
|
- Apiary bearer token via vault.secrets
|
||||||
|
|
||||||
|
2. **vmalert** for alerting rules: [DONE]
|
||||||
|
- Points to VictoriaMetrics datasource at localhost:8428
|
||||||
|
- Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
|
||||||
|
- No notifier configured during parallel operation (prevents duplicate alerts)
|
||||||
|
|
||||||
|
3. **Alertmanager** (port 9093): [DONE]
|
||||||
|
- Same configuration as monitoring01 (alerttonotify webhook routing)
|
||||||
|
- Will only receive alerts after cutover (vmalert notifier disabled)
|
||||||
|
|
||||||
|
4. **Grafana** (port 3000): [DONE]
|
||||||
|
- VictoriaMetrics datasource (localhost:8428) as default
|
||||||
|
- monitoring01 Prometheus datasource kept for comparison during parallel operation
|
||||||
|
- Loki datasource pointing to monitoring01 (until Loki migrated)
|
||||||
|
|
||||||
|
5. **Loki** (port 3100):
|
||||||
|
- TODO: Same configuration as current
|
||||||
|
|
||||||
|
6. **Tempo** (ports 3200, 3201):
|
||||||
|
- TODO: Same configuration
|
||||||
|
|
||||||
|
7. **Pyroscope** (port 4040):
|
||||||
|
- TODO: Same Docker-based deployment
|
||||||
|
|
||||||
|
**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02. pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics native push support.
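
A minimal sketch of the VictoriaMetrics and vmalert parts of the module described above. This is not the content of `services/victoriametrics/default.nix`. Option names follow the nixpkgs `services.victoriametrics` and `services.vmalert` modules and should be verified against the pinned nixpkgs revision. The relative path to `rules.yml` and the use of `notifier.blackhole` to run vmalert without a notifier are assumptions.

```nix
# Hypothetical sketch only; verify option names against the pinned nixpkgs.
{ ... }:
{
  services.victoriametrics = {
    enable = true;
    # A bare number is interpreted by VictoriaMetrics as months.
    retentionPeriod = "3";
    listenAddress = ":8428";
  };

  services.vmalert = {
    enable = true;
    settings = {
      # Evaluate rules against the local VictoriaMetrics instance.
      "datasource.url" = "http://localhost:8428";
      # Reuse the existing Prometheus rules file (path is an assumption).
      rule = [ ../monitoring/rules.yml ];
      # Assumption: discard notifications during parallel operation so that
      # only monitoring01 sends alerts. Replace with "notifier.url" at cutover.
      "notifier.blackhole" = true;
    };
  };
}
```

Keeping the rules in the shared YAML file means both Prometheus and vmalert evaluate the same expressions during the parallel phase, which makes differences easier to attribute to the backend rather than to the rules.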
|
||||||
|
|
||||||
|
### Phase 3: Parallel Operation
|
||||||
|
|
||||||
|
Run both monitoring01 and monitoring02 simultaneously:
|
||||||
|
|
||||||
|
1. **Dual scraping**: Both hosts scrape the same targets
|
||||||
|
- Validates VictoriaMetrics is collecting data correctly
|
||||||
|
|
||||||
|
2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
|
||||||
|
- Add second client in `system/monitoring/logs.nix` pointing to monitoring02
|
||||||
|
|
||||||
|
3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
|
||||||
|
|
||||||
|
4. **Validate alerts**: Verify vmalert evaluates rules correctly (no notifier is configured, so no notifications are sent)
|
||||||
|
|
||||||
|
5. **Compare resource usage**: Monitor disk/memory consumption between hosts
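
Dual log shipping could look roughly like the following in `system/monitoring/logs.nix`. This is a sketch that assumes Promtail is configured through `services.promtail.configuration` and that Loki on monitoring02 listens on the same port and path as on monitoring01; the monitoring02 URL is an assumption.

```nix
# Hypothetical fragment: ship every log line to both Loki instances.
{ ... }:
{
  services.promtail.configuration.clients = [
    # Existing target
    { url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"; }
    # Second client added for parallel operation (assumed URL)
    { url = "http://monitoring02.home.2rjus.net:3100/loki/api/v1/push"; }
  ];
}
```

Promtail sends each entry to every configured client independently, so a failure on one Loki instance should not block delivery to the other.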
|
||||||
|
|
||||||
|
### Phase 4: Add monitoring CNAME
|
||||||
|
|
||||||
|
Add CNAME to monitoring02 once validated:
|
||||||
|
|
||||||
|
```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```
|
||||||
|
|
||||||
|
This creates `monitoring.home.2rjus.net` pointing to monitoring02.
|
||||||
|
|
||||||
|
### Phase 5: Update References
|
||||||
|
|
||||||
|
Update hardcoded references to use the CNAME:
|
||||||
|
|
||||||
|
1. **system/monitoring/logs.nix**:
|
||||||
|
- Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
|
||||||
|
|
||||||
|
2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
|
||||||
|
- prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
|
||||||
|
- alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
|
||||||
|
- grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
|
||||||
|
- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
|
||||||
|
|
||||||
|
Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
|
||||||
|
|
||||||
|
### Phase 6: Enable Alerting
|
||||||
|
|
||||||
|
Once ready to cut over:
|
||||||
|
1. Configure the vmalert notifier on monitoring02 so that alerts are delivered to the local Alertmanager
|
||||||
|
2. Verify test alerts route correctly
|
||||||
|
|
||||||
|
### Phase 7: Cutover and Decommission
|
||||||
|
|
||||||
|
1. **Stop monitoring01**: Prevent duplicate alerts during transition
|
||||||
|
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
|
||||||
|
3. **Verify all targets scraped**: Check VictoriaMetrics UI
|
||||||
|
4. **Verify logs flowing**: Check Loki on monitoring02
|
||||||
|
5. **Decommission monitoring01**:
|
||||||
|
- Remove from flake.nix
|
||||||
|
- Remove host configuration
|
||||||
|
- Destroy VM in Proxmox
|
||||||
|
- Remove from terraform state
|
||||||
|
|
||||||
|
## Current Progress
|
||||||
|
|
||||||
|
- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
|
||||||
|
- **Phase 2** in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Grafana datasources configured
|
||||||
|
- Remaining: Loki, Tempo, Pyroscope migration
|
||||||
|
|
||||||
|
## Open Questions
|
||||||
|
|
||||||
|
- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
|
||||||
|
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
|
||||||
|
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as its successor. Alloy is a unified collector (logs, metrics, traces, profiles), but it uses its own "River" config format instead of YAML, which makes it less ergonomic to generate from Nix. The migration could be bundled with the monitoring02 move so that both disruptions happen at once.
|
||||||
|
|
||||||
|
## VictoriaMetrics Service Configuration
|
||||||
|
|
||||||
|
Implemented in `services/victoriametrics/default.nix`. Key design decisions:
|
||||||
|
|
||||||
|
- **Static user**: The VictoriaMetrics NixOS module uses `DynamicUser`; this is overridden with a static `victoriametrics` user so that vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via a `settings.rule` path reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and `services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
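
The static user override can be sketched as follows. This is an illustration of the technique, not the actual module: it assumes the upstream unit is named `victoriametrics` and sets `DynamicUser = true`, and the user and group names are chosen to match the bullet above.

```nix
# Hypothetical sketch of replacing DynamicUser with a fixed system user.
{ lib, ... }:
{
  users.groups.victoriametrics = { };
  users.users.victoriametrics = {
    isSystemUser = true;
    group = "victoriametrics";
  };

  systemd.services.victoriametrics.serviceConfig = {
    # mkForce is needed because the upstream module already sets this value.
    DynamicUser = lib.mkForce false;
    User = "victoriametrics";
    Group = "victoriametrics";
  };
}
```

With a fixed UID, files written under `/run/secrets` can be owned by a user that exists before the service starts, which is not the case for a dynamically allocated one.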
|
||||||
|
|
||||||
|
## Rollback Plan
|
||||||
|
|
||||||
|
If issues arise after cutover:
|
||||||
|
1. Move `monitoring` CNAME back to monitoring01
|
||||||
|
2. Restart monitoring01 services
|
||||||
|
3. Revert Promtail config to point only to monitoring01
|
||||||
|
4. Revert http-proxy backends
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- VictoriaMetrics uses port 8428 vs Prometheus 9090
|
||||||
|
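- PromQL compatibility: VictoriaMetrics' query language (MetricsQL) is designed to be backward-compatible with PromQL, so the existing rules and dashboard queries are expected to work unchanged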
|
||||||
|
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
|
||||||
|
- monitoring02 deployed via OpenTofu using `create-host` script
|
||||||
|
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
|
||||||
@@ -24,20 +24,29 @@ After evaluating WireGuard gateway vs Headscale (self-hosted Tailscale), the **W
|
|||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
|
||||||
```mermaid
|
```
|
||||||
graph TD
|
┌─────────────────────────────────┐
|
||||||
clients["Laptop / Phone"]
|
│ VPS (OpenStack) │
|
||||||
vps["VPS<br/>(WireGuard endpoint)"]
|
Laptop/Phone ──→ │ WireGuard endpoint │
|
||||||
extgw["extgw01<br/>(gateway + bastion)"]
|
(WireGuard) │ Client peers: laptop, phone │
|
||||||
grafana["Grafana<br/>monitoring01:3000"]
|
│ Routes 10.69.13.0/24 via tunnel│
|
||||||
jellyfin["Jellyfin<br/>jelly01:8096"]
|
└──────────┬──────────────────────┘
|
||||||
arr["arr stack<br/>*-jail hosts"]
|
│ WireGuard tunnel
|
||||||
|
▼
|
||||||
clients -->|WireGuard| vps
|
┌─────────────────────────────────┐
|
||||||
vps -->|WireGuard tunnel| extgw
|
│ extgw01 (gateway + bastion) │
|
||||||
extgw -->|allowed traffic| grafana
|
│ - WireGuard tunnel to VPS │
|
||||||
extgw -->|allowed traffic| jellyfin
|
│ - Firewall (allowlist only) │
|
||||||
extgw -->|allowed traffic| arr
|
│ - SSH + 2FA (full access) │
|
||||||
|
└──────────┬──────────────────────┘
|
||||||
|
│ allowed traffic only
|
||||||
|
▼
|
||||||
|
┌─────────────────────────────────┐
|
||||||
|
│ Internal network 10.69.13.0/24 │
|
||||||
|
│ - monitoring01:3000 (Grafana) │
|
||||||
|
│ - jelly01:8096 (Jellyfin) │
|
||||||
|
│ - *-jail hosts (arr stack) │
|
||||||
|
└─────────────────────────────────┘
|
||||||
```
|
```
|
||||||
|
|
||||||
### Existing path (unchanged)
|
### Existing path (unchanged)
|
||||||
|
|||||||
@@ -39,17 +39,23 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
|
|||||||
- nzbget: NixOS service or OCI container
|
- nzbget: NixOS service or OCI container
|
||||||
- NFS exports: `services.nfs.server`
|
- NFS exports: `services.nfs.server`
|
||||||
|
|
||||||
### Filesystem: Keep ZFS
|
### Filesystem: BTRFS RAID1
|
||||||
|
|
||||||
**Decision**: Keep existing ZFS pool, import on NixOS
|
**Decision**: Migrate from ZFS to BTRFS with RAID1
|
||||||
|
|
||||||
**Rationale**:
|
**Rationale**:
|
||||||
- **No data migration needed**: Existing ZFS pool can be imported directly on NixOS
|
- **In-kernel**: No out-of-tree module issues like ZFS
|
||||||
- **Proven reliability**: Pool has been running reliably on TrueNAS
|
- **Flexible expansion**: Add individual disks, not required to buy pairs
|
||||||
- **NixOS ZFS support**: Well-supported, declarative configuration via `boot.zfs` and `services.zfs`
|
- **Mixed disk sizes**: Better handling than ZFS multi-vdev approach
|
||||||
- **BTRFS RAID5/6 unreliable**: Research showed BTRFS RAID5/6 write hole is still unresolved
|
- **RAID level conversion**: Can convert between RAID levels in place
|
||||||
- **BTRFS RAID1 wasteful**: With mixed disk sizes, RAID1 wastes significant capacity vs ZFS mirrors
|
- Built-in checksumming, snapshots, compression (zstd)
|
||||||
- Checksumming, snapshots, compression (lz4/zstd) all available
|
- NixOS has good BTRFS support
|
||||||
|
|
||||||
|
**BTRFS RAID1 notes**:
|
||||||
|
- "RAID1" means 2 copies of all data
|
||||||
|
- Distributes across all available devices
|
||||||
|
- With 6+ disks, provides redundancy + capacity scaling
|
||||||
|
- RAID5/6 avoided (known issues), RAID1/10 are stable
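
For the BTRFS variant, mounting the pool and scheduling scrubs on NixOS might look like the sketch below. The mount point `/mnt/bulk` and the filesystem label `bulk` are hypothetical; the RAID1 profile itself is chosen when the filesystem is created and is not expressed here.

```nix
# Hypothetical fragment: mount a multi-device BTRFS pool with zstd compression.
{ ... }:
{
  fileSystems."/mnt/bulk" = {
    # Any member device or the shared label identifies the whole pool.
    device = "/dev/disk/by-label/bulk";
    fsType = "btrfs";
    options = [ "compress=zstd" "noatime" ];
  };

  # Periodic scrub verifies checksums and repairs from the second copy.
  services.btrfs.autoScrub = {
    enable = true;
    fileSystems = [ "/mnt/bulk" ];
    interval = "monthly";
  };
}
```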
|
||||||
|
|
||||||
### Hardware: Keep Existing + Add Disks
|
### Hardware: Keep Existing + Add Disks
|
||||||
|
|
||||||
@@ -63,94 +69,83 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
|
|||||||
|
|
||||||
**Storage architecture**:
|
**Storage architecture**:
|
||||||
|
|
||||||
**hdd-pool** (ZFS mirrors):
|
**Bulk storage** (BTRFS RAID1 on HDDs):
|
||||||
- Current: 3 mirror vdevs (2x16TB + 2x8TB + 2x8TB) = 32TB usable
|
- Current: 6x HDDs (2x16TB + 2x8TB + 2x8TB)
|
||||||
- Add: mirror-3 with 2x 24TB = +24TB usable
|
- Add: 2x new HDDs (size TBD)
|
||||||
- Total after expansion: ~56TB usable
|
|
||||||
- Use: Media, downloads, backups, non-critical data
|
- Use: Media, downloads, backups, non-critical data
|
||||||
|
- Risk tolerance: High (data mostly replaceable)
|
||||||
|
|
||||||
|
**Critical data** (small volume):
|
||||||
|
- Use 2x 240GB SSDs in mirror (BTRFS or ZFS)
|
||||||
|
- Or use 2TB NVMe for critical data
|
||||||
|
- Risk tolerance: Low (data important but small)
|
||||||
|
|
||||||
### Disk Purchase Decision
|
### Disk Purchase Decision
|
||||||
|
|
||||||
**Decision**: 2x 24TB drives (ordered, arriving 2026-02-21)
|
**Options under consideration**:
|
||||||
|
|
||||||
|
**Option A: 2x 16TB drives**
|
||||||
|
- Matches largest current drives
|
||||||
|
- Enables potential future RAID5 if desired (6x 16TB array)
|
||||||
|
- More conservative capacity increase
|
||||||
|
|
||||||
|
**Option B: 2x 20-24TB drives**
|
||||||
|
- Larger capacity headroom
|
||||||
|
- Better $/TB ratio typically
|
||||||
|
- Future-proofs better
|
||||||
|
|
||||||
|
**Initial purchase**: 2 drives (chassis has space for 2 more without modifications)
|
||||||
|
|
||||||
## Migration Strategy
|
## Migration Strategy
|
||||||
|
|
||||||
### High-Level Plan
|
### High-Level Plan
|
||||||
|
|
||||||
1. **Expand ZFS pool** (on TrueNAS):
|
1. **Preparation**:
|
||||||
- Install 2x 24TB drives (may need new drive trays - order from abroad if needed)
|
- Purchase 2x new HDDs (16TB or 20-24TB)
|
||||||
- If chassis space is limited, temporarily replace the two oldest 8TB drives (da0/ada4)
|
- Create NixOS configuration for new storage host
|
||||||
- Add as mirror-3 vdev to hdd-pool
|
- Set up bare metal NixOS installation
|
||||||
- Verify pool health and resilver completes
|
|
||||||
- Check SMART data on old 8TB drives (all healthy as of 2026-02-20, no reallocated sectors)
|
|
||||||
- Burn-in: at minimum short + long SMART test before adding to pool
|
|
||||||
|
|
||||||
2. **Prepare NixOS configuration**:
|
2. **Initial BTRFS pool**:
|
||||||
- Create host configuration (`hosts/nas1/` or similar)
|
- Install 2 new disks
|
||||||
- Configure ZFS pool import (`boot.zfs.extraPools`)
|
- Create BTRFS filesystem in RAID1
|
||||||
- Set up services: radarr, sonarr, nzbget, restic-rest, NFS
|
- Mount and test NFS exports
|
||||||
- Configure monitoring (node-exporter, promtail, smartctl-exporter)
|
|
||||||
|
|
||||||
3. **Install NixOS**:
|
3. **Data migration**:
|
||||||
- `zfs export hdd-pool` on TrueNAS before shutdown (clean export)
|
- Copy data from TrueNAS ZFS pool to new BTRFS pool over 10GbE
|
||||||
- Wipe TrueNAS boot-pool SSDs, set up as mdadm RAID1 for NixOS root
|
- Verify data integrity
|
||||||
- Install NixOS on mdadm mirror (keeps boot path ZFS-independent)
|
|
||||||
- Import hdd-pool via `boot.zfs.extraPools`
|
|
||||||
- Verify all datasets mount correctly
|
|
||||||
|
|
||||||
4. **Service migration**:
|
4. **Expand pool**:
|
||||||
- Configure NixOS services to use ZFS dataset paths
|
- As old ZFS pool is emptied, wipe drives and add to BTRFS pool
|
||||||
- Update NFS exports
|
- Pool grows incrementally: 2 → 4 → 6 → 8 disks
|
||||||
- Test from consuming hosts
|
- BTRFS rebalances data across new devices
|
||||||
|
|
||||||
5. **Cutover**:
|
5. **Service migration**:
|
||||||
- Update DNS/client mounts if IP changes
|
- Set up radarr/sonarr/nzbget/restic as NixOS services
|
||||||
- Verify monitoring integration
|
- Update NFS client mounts on consuming hosts
|
||||||
|
|
||||||
|
6. **Cutover**:
|
||||||
|
- Point consumers to new NAS host
|
||||||
- Decommission TrueNAS
|
- Decommission TrueNAS
|
||||||
|
- Repurpose hardware or keep as spare
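
For the ZFS variant of the plan, importing the existing pool on NixOS could be sketched as follows. The pool name `hdd-pool` comes from the plan; the `hostId` value is a placeholder and must be replaced with a real 8-digit hex ID for the host.

```nix
# Hypothetical fragment for hosts/nas1 (host name not yet decided).
{ ... }:
{
  boot.supportedFilesystems = [ "zfs" ];
  # ZFS refuses to import pools without a stable host ID (placeholder value).
  networking.hostId = "deadbeef";
  # Import the data pool at boot; the root filesystem stays off ZFS.
  boot.zfs.extraPools = [ "hdd-pool" ];
  services.zfs.autoScrub.enable = true;
}
```

Because the pool is listed under `extraPools` rather than used for root, a failed import should leave the system bootable for troubleshooting.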
|
||||||
### Post-Expansion: Vdev Rebalancing
|
|
||||||
|
|
||||||
ZFS has no built-in rebalance command. After adding the new 24TB vdev, ZFS will
|
|
||||||
write new data preferentially to it (most free space), leaving old vdevs packed
|
|
||||||
at ~97%. This is suboptimal but not urgent once overall pool usage drops to ~50%.
|
|
||||||
|
|
||||||
To gradually rebalance, rewrite files in place so ZFS redistributes blocks across
|
|
||||||
all vdevs proportional to free space:
|
|
||||||
|
|
||||||
```bash
# Rewrite files individually (spreads blocks across all vdevs).
# cp -a preserves ownership, mode, and timestamps on the rewritten copy.
# Note: existing snapshots keep referencing the old blocks until they are destroyed.
find /pool/dataset -type f -exec sh -c '
  for f; do cp -a -- "$f" "$f.rebal" && mv -- "$f.rebal" "$f"; done
' _ {} +
```
|
|
||||||
|
|
||||||
Avoid `zfs send/recv` for large datasets (e.g. 20TB) as this would concentrate
|
|
||||||
data on the emptiest vdev rather than spreading it evenly.
|
|
||||||
|
|
||||||
**Recommendation**: Do this after NixOS migration is stable. Not urgent - the pool
|
|
||||||
will function fine with uneven distribution, just slightly suboptimal for performance.
|
|
||||||
|
|
||||||
### Migration Advantages
|
### Migration Advantages
|
||||||
|
|
||||||
- **No data migration**: ZFS pool imported directly, no copying terabytes of data
|
- **Low risk**: New pool created independently, old data remains intact during migration
|
||||||
- **Low risk**: Pool expansion done on stable TrueNAS before OS swap
|
- **Incremental**: Can add old disks one at a time as space allows
|
||||||
- **Reversible**: Can boot back to TrueNAS if NixOS has issues (ZFS pool is OS-independent)
|
- **Flexible**: BTRFS handles mixed disk sizes gracefully
|
||||||
- **Quick cutover**: Once NixOS config is ready, the OS swap is fast
|
- **Reversible**: Keep TrueNAS running until fully validated
|
||||||
|
|
||||||
## Next Steps
|
## Next Steps
|
||||||
|
|
||||||
1. ~~Decide on disk size~~ - 2x 24TB ordered
|
1. Decide on disk size (16TB vs 20-24TB)
|
||||||
2. Install drives and add mirror vdev to ZFS pool
|
2. Purchase disks
|
||||||
3. Check SMART data on 8TB drives - decide whether to keep or retire
|
3. Design NixOS host configuration (`hosts/nas1/`)
|
||||||
4. Design NixOS host configuration (`hosts/nas1/`)
|
4. Plan detailed migration timeline
|
||||||
5. Document NFS export mapping (current -> new)
|
5. Document NFS export mapping (current → new)
|
||||||
6. Plan NixOS installation and cutover
|
|
||||||
|
|
||||||
## Open Questions
|
## Open Questions
|
||||||
|
|
||||||
|
- [ ] Final decision on disk size?
|
||||||
- [ ] Hostname for new NAS host? (nas1? storage1?)
|
- [ ] Hostname for new NAS host? (nas1? storage1?)
|
||||||
- [ ] IP address/subnet: NAS and Proxmox are both on 10GbE to the same switch but different subnets, forcing traffic through the router (bottleneck). Move to same subnet during migration.
|
- [ ] IP address allocation (keep 10.69.12.50 or new IP?)
|
||||||
- [x] Boot drive: Reuse TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root (no ZFS on boot path)
|
- [ ] Timeline/maintenance window for migration?
|
||||||
- [ ] Retire old 8TB drives? (SMART looks healthy, keep unless chassis space is needed)
|
|
||||||
- [ ] Drive trays: do new 24TB drives fit, or order trays from abroad?
|
|
||||||
- [ ] Timeline/maintenance window for NixOS swap?
|
|
||||||
|
|||||||
20
flake.lock
generated
@@ -28,11 +28,11 @@
|
|||||||
]
|
]
|
||||||
},
|
},
|
||||||
"locked": {
|
"locked": {
|
||||||
"lastModified": 1771488195,
|
"lastModified": 1771004123,
|
||||||
"narHash": "sha256-2kMxqdDyPluRQRoES22Y0oSjp7pc5fj2nRterfmSIyc=",
|
"narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
|
||||||
"ref": "master",
|
"ref": "master",
|
||||||
"rev": "2d26de50559d8acb82ea803764e138325d95572c",
|
"rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
|
||||||
"revCount": 37,
|
"revCount": 36,
|
||||||
"type": "git",
|
"type": "git",
|
||||||
"url": "https://git.t-juice.club/torjus/homelab-deploy"
|
"url": "https://git.t-juice.club/torjus/homelab-deploy"
|
||||||
},
|
},
|
||||||
@@ -64,11 +64,11 @@
|
|||||||
},
|
},
|
||||||
"nixpkgs": {
|
"nixpkgs": {
|
||||||
"locked": {
|
"locked": {
|
||||||
"lastModified": 1771419570,
|
"lastModified": 1771043024,
|
||||||
"narHash": "sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU=",
|
"narHash": "sha256-O1XDr7EWbRp+kHrNNgLWgIrB0/US5wvw9K6RERWAj6I=",
|
||||||
"owner": "nixos",
|
"owner": "nixos",
|
||||||
"repo": "nixpkgs",
|
"repo": "nixpkgs",
|
||||||
"rev": "6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47",
|
"rev": "3aadb7ca9eac2891d52a9dec199d9580a6e2bf44",
|
||||||
"type": "github"
|
"type": "github"
|
||||||
},
|
},
|
||||||
"original": {
|
"original": {
|
||||||
@@ -80,11 +80,11 @@
|
|||||||
},
|
},
|
||||||
"nixpkgs-unstable": {
|
"nixpkgs-unstable": {
|
||||||
"locked": {
|
"locked": {
|
||||||
"lastModified": 1771369470,
|
"lastModified": 1771008912,
|
||||||
"narHash": "sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ=",
|
"narHash": "sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb+ZnAo5RzSxJg=",
|
||||||
"owner": "nixos",
|
"owner": "nixos",
|
||||||
"repo": "nixpkgs",
|
"repo": "nixpkgs",
|
||||||
"rev": "0182a361324364ae3f436a63005877674cf45efb",
|
"rev": "a82ccc39b39b621151d6732718e3e250109076fa",
|
||||||
"type": "github"
|
"type": "github"
|
||||||
},
|
},
|
||||||
"original": {
|
"original": {
|
||||||
|
|||||||
@@ -92,6 +92,15 @@
|
|||||||
./hosts/http-proxy
|
./hosts/http-proxy
|
||||||
];
|
];
|
||||||
};
|
};
|
||||||
|
monitoring01 = nixpkgs.lib.nixosSystem {
|
||||||
|
inherit system;
|
||||||
|
specialArgs = {
|
||||||
|
inherit inputs self;
|
||||||
|
};
|
||||||
|
modules = commonModules ++ [
|
||||||
|
./hosts/monitoring01
|
||||||
|
];
|
||||||
|
};
|
||||||
jelly01 = nixpkgs.lib.nixosSystem {
|
jelly01 = nixpkgs.lib.nixosSystem {
|
||||||
inherit system;
|
inherit system;
|
||||||
specialArgs = {
|
specialArgs = {
|
||||||
|
|||||||
@@ -18,7 +18,12 @@
|
|||||||
"sonarr"
|
"sonarr"
|
||||||
"ha"
|
"ha"
|
||||||
"z2m"
|
"z2m"
|
||||||
|
"grafana"
|
||||||
|
"prometheus"
|
||||||
|
"alertmanager"
|
||||||
"jelly"
|
"jelly"
|
||||||
|
"pyroscope"
|
||||||
|
"pushgw"
|
||||||
];
|
];
|
||||||
|
|
||||||
nixpkgs.config.allowUnfree = true;
|
nixpkgs.config.allowUnfree = true;
|
||||||
|
|||||||
114
hosts/monitoring01/configuration.nix
Normal file
@@ -0,0 +1,114 @@
|
|||||||
|
{
|
||||||
|
pkgs,
|
||||||
|
...
|
||||||
|
}:
|
||||||
|
|
||||||
|
{
|
||||||
|
imports = [
|
||||||
|
./hardware-configuration.nix
|
||||||
|
|
||||||
|
../../system
|
||||||
|
../../common/vm
|
||||||
|
];
|
||||||
|
|
||||||
|
homelab.host.role = "monitoring";
|
||||||
|
|
||||||
|
nixpkgs.config.allowUnfree = true;
|
||||||
|
# Use the GRUB boot loader, installed to /dev/sda.
|
||||||
|
boot.loader.grub = {
|
||||||
|
enable = true;
|
||||||
|
device = "/dev/sda";
|
||||||
|
configurationLimit = 3;
|
||||||
|
};
|
||||||
|
|
||||||
|
networking.hostName = "monitoring01";
|
||||||
|
networking.domain = "home.2rjus.net";
|
||||||
|
networking.useNetworkd = true;
|
||||||
|
networking.useDHCP = false;
|
||||||
|
services.resolved.enable = true;
|
||||||
|
networking.nameservers = [
|
||||||
|
"10.69.13.5"
|
||||||
|
"10.69.13.6"
|
||||||
|
];
|
||||||
|
|
||||||
|
systemd.network.enable = true;
|
||||||
|
systemd.network.networks."ens18" = {
|
||||||
|
matchConfig.Name = "ens18";
|
||||||
|
address = [
|
||||||
|
"10.69.13.13/24"
|
||||||
|
];
|
||||||
|
routes = [
|
||||||
|
{ Gateway = "10.69.13.1"; }
|
||||||
|
];
|
||||||
|
linkConfig.RequiredForOnline = "routable";
|
||||||
|
};
|
||||||
|
time.timeZone = "Europe/Oslo";
|
||||||
|
|
||||||
|
nix.settings.experimental-features = [
|
||||||
|
"nix-command"
|
||||||
|
"flakes"
|
||||||
|
];
|
||||||
|
nix.settings.tarball-ttl = 0;
|
||||||
|
environment.systemPackages = with pkgs; [
|
||||||
|
vim
|
||||||
|
wget
|
||||||
|
git
|
||||||
|
sqlite
|
||||||
|
];
|
||||||
|
|
||||||
|
services.qemuGuest.enable = true;
|
||||||
|
|
||||||
|
# Vault secrets management
|
||||||
|
vault.enable = true;
|
||||||
|
homelab.deploy.enable = true;
|
||||||
|
vault.secrets.backup-helper = {
|
||||||
|
secretPath = "shared/backup/password";
|
||||||
|
extractKey = "password";
|
||||||
|
outputDir = "/run/secrets/backup_helper_secret";
|
||||||
|
services = [ "restic-backups-grafana" "restic-backups-grafana-db" ];
|
||||||
|
};
|
||||||
|
|
||||||
|
services.restic.backups.grafana = {
|
||||||
|
repository = "rest:http://10.69.12.52:8000/backup-nix";
|
||||||
|
passwordFile = "/run/secrets/backup_helper_secret";
|
||||||
|
paths = [ "/var/lib/grafana/plugins" ];
|
||||||
|
timerConfig = {
|
||||||
|
OnCalendar = "daily";
|
||||||
|
Persistent = true;
|
||||||
|
RandomizedDelaySec = "2h";
|
||||||
|
};
|
||||||
|
pruneOpts = [
|
||||||
|
"--keep-daily 7"
|
||||||
|
"--keep-weekly 4"
|
||||||
|
"--keep-monthly 6"
|
||||||
|
"--keep-within 1d"
|
||||||
|
];
|
||||||
|
extraOptions = [ "--retry-lock=5m" ];
|
||||||
|
};
|
||||||
|
|
||||||
|
services.restic.backups.grafana-db = {
|
||||||
|
repository = "rest:http://10.69.12.52:8000/backup-nix";
|
||||||
|
passwordFile = "/run/secrets/backup_helper_secret";
|
||||||
|
command = [ "${pkgs.sqlite}/bin/sqlite3" "/var/lib/grafana/data/grafana.db" ".dump" ];
|
||||||
|
timerConfig = {
|
||||||
|
OnCalendar = "daily";
|
||||||
|
Persistent = true;
|
||||||
|
RandomizedDelaySec = "2h";
|
||||||
|
};
|
||||||
|
pruneOpts = [
|
||||||
|
"--keep-daily 7"
|
||||||
|
"--keep-weekly 4"
|
||||||
|
"--keep-monthly 6"
|
||||||
|
"--keep-within 1d"
|
||||||
|
];
|
||||||
|
extraOptions = [ "--retry-lock=5m" ];
|
||||||
|
};
|
||||||
|
|
||||||
|
# Open ports in the firewall.
|
||||||
|
# networking.firewall.allowedTCPPorts = [ ... ];
|
||||||
|
# networking.firewall.allowedUDPPorts = [ ... ];
|
||||||
|
# Or disable the firewall altogether.
|
||||||
|
networking.firewall.enable = false;
|
||||||
|
|
||||||
|
system.stateVersion = "23.11"; # Did you read the comment?
|
||||||
|
}
|
||||||
7
hosts/monitoring01/default.nix
Normal file
@@ -0,0 +1,7 @@
|
|||||||
|
{ ... }:
|
||||||
|
{
|
||||||
|
imports = [
|
||||||
|
./configuration.nix
|
||||||
|
../../services/monitoring
|
||||||
|
];
|
||||||
|
}
|
||||||
42
hosts/monitoring01/hardware-configuration.nix
Normal file
@@ -0,0 +1,42 @@
|
|||||||
|
{
|
||||||
|
config,
|
||||||
|
lib,
|
||||||
|
pkgs,
|
||||||
|
modulesPath,
|
||||||
|
...
|
||||||
|
}:
|
||||||
|
|
||||||
|
{
|
||||||
|
imports = [
|
||||||
|
(modulesPath + "/profiles/qemu-guest.nix")
|
||||||
|
];
|
||||||
|
boot.initrd.availableKernelModules = [
|
||||||
|
"ata_piix"
|
||||||
|
"uhci_hcd"
|
||||||
|
"virtio_pci"
|
||||||
|
"virtio_scsi"
|
||||||
|
"sd_mod"
|
||||||
|
"sr_mod"
|
||||||
|
];
|
||||||
|
boot.initrd.kernelModules = [ "dm-snapshot" ];
|
||||||
|
boot.kernelModules = [
|
||||||
|
"ptp_kvm"
|
||||||
|
];
|
||||||
|
boot.extraModulePackages = [ ];
|
||||||
|
|
||||||
|
fileSystems."/" = {
|
||||||
|
device = "/dev/disk/by-label/root";
|
||||||
|
fsType = "xfs";
|
||||||
|
};
|
||||||
|
|
||||||
|
swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
|
||||||
|
|
||||||
|
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
|
||||||
|
# (the default) this is the recommended approach. When using systemd-networkd it's
|
||||||
|
# still possible to use this option, but it's recommended to use it in conjunction
|
||||||
|
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
|
||||||
|
networking.useDHCP = lib.mkDefault true;
|
||||||
|
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
|
||||||
|
|
||||||
|
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
|
||||||
|
}
|
||||||
@@ -18,7 +18,7 @@
|
|||||||
role = "monitoring";
|
role = "monitoring";
|
||||||
};
|
};
|
||||||
|
|
||||||
homelab.dns.cnames = [ "monitoring" "alertmanager" "grafana" "grafana-test" "metrics" "vmalert" "loki" ];
|
homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" ];
|
||||||
|
|
||||||
# Enable Vault integration
|
# Enable Vault integration
|
||||||
vault.enable = true;
|
vault.enable = true;
|
||||||
|
|||||||
@@ -3,10 +3,5 @@
|
|||||||
./configuration.nix
|
./configuration.nix
|
||||||
../../services/grafana
|
../../services/grafana
|
||||||
../../services/victoriametrics
|
../../services/victoriametrics
|
||||||
../../services/loki
|
|
||||||
../../services/monitoring/alerttonotify.nix
|
|
||||||
../../services/monitoring/blackbox.nix
|
|
||||||
../../services/monitoring/exportarr.nix
|
|
||||||
../../services/monitoring/pve.nix
|
|
||||||
];
|
];
|
||||||
}
|
}
|
||||||
@@ -25,7 +25,7 @@
|
|||||||
};
|
};
|
||||||
};
|
};
|
||||||
|
|
||||||
timeout = 14400;
|
timeout = 7200;
|
||||||
metrics.enable = true;
|
metrics.enable = true;
|
||||||
};
|
};
|
||||||
|
|
||||||
|
|||||||
@@ -6,8 +6,7 @@ let
|
|||||||
text = ''
|
text = ''
|
||||||
set -euo pipefail
|
set -euo pipefail
|
||||||
|
|
||||||
LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
|
LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
|
||||||
LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
|
|
||||||
|
|
||||||
# Send a log entry to Loki with bootstrap status
|
# Send a log entry to Loki with bootstrap status
|
||||||
# Usage: log_to_loki <stage> <message>
|
# Usage: log_to_loki <stage> <message>
|
||||||
@@ -37,14 +36,8 @@ let
|
|||||||
}]
|
}]
|
||||||
}')
|
}')
|
||||||
|
|
||||||
local auth_args=()
|
|
||||||
if [[ -f "$LOKI_AUTH_FILE" ]]; then
|
|
||||||
auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
|
|
||||||
fi
|
|
||||||
|
|
||||||
curl -s --connect-timeout 2 --max-time 5 \
|
curl -s --connect-timeout 2 --max-time 5 \
|
||||||
-X POST \
|
-X POST \
|
||||||
"''${auth_args[@]}" \
|
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d "$payload" \
|
-d "$payload" \
|
||||||
"$LOKI_URL" >/dev/null 2>&1 || true
|
"$LOKI_URL" >/dev/null 2>&1 || true
|
||||||
|
|||||||
@@ -20,10 +20,10 @@ vault-fetch <secret-path> <output-directory> [cache-directory]
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Fetch Grafana admin secrets
|
# Fetch Grafana admin secrets
|
||||||
vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana
|
vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
|
||||||
|
|
||||||
# Use default cache location
|
# Use default cache location
|
||||||
vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana
|
vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana
|
||||||
```
|
```
|
||||||
|
|
||||||
## How It Works
|
## How It Works
|
||||||
@@ -53,13 +53,13 @@ If Vault is unreachable or authentication fails:
|
|||||||
This tool is designed to be called from systemd service `ExecStartPre` hooks via the `vault.secrets` NixOS module:
|
This tool is designed to be called from systemd service `ExecStartPre` hooks via the `vault.secrets` NixOS module:
|
||||||
|
|
||||||
```nix
|
```nix
|
||||||
vault.secrets.mqtt-password = {
|
vault.secrets.grafana-admin = {
|
||||||
secretPath = "hosts/ha1/mqtt-password";
|
secretPath = "hosts/monitoring01/grafana-admin";
|
||||||
};
|
};
|
||||||
|
|
||||||
# Service automatically gets secrets fetched before start
|
# Service automatically gets secrets fetched before start
|
||||||
systemd.services.mosquitto.serviceConfig = {
|
systemd.services.grafana.serviceConfig = {
|
||||||
EnvironmentFile = "/run/secrets/mqtt-password/password";
|
EnvironmentFile = "/run/secrets/grafana-admin/password";
|
||||||
};
|
};
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ set -euo pipefail
|
|||||||
#
|
#
|
||||||
# Usage: vault-fetch <secret-path> <output-directory> [cache-directory]
|
# Usage: vault-fetch <secret-path> <output-directory> [cache-directory]
|
||||||
#
|
#
|
||||||
# Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana
|
# Example: vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
|
||||||
#
|
#
|
||||||
# This script:
|
# This script:
|
||||||
# 1. Authenticates to Vault using AppRole credentials from /var/lib/vault/approle/
|
# 1. Authenticates to Vault using AppRole credentials from /var/lib/vault/approle/
|
||||||
@@ -17,7 +17,7 @@ set -euo pipefail
|
|||||||
# Parse arguments
|
# Parse arguments
|
||||||
if [ $# -lt 2 ]; then
|
if [ $# -lt 2 ]; then
|
||||||
echo "Usage: vault-fetch <secret-path> <output-directory> [cache-directory]" >&2
|
echo "Usage: vault-fetch <secret-path> <output-directory> [cache-directory]" >&2
|
||||||
echo "Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
|
echo "Example: vault-fetch hosts/monitoring01/grafana /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
|||||||
@@ -19,7 +19,7 @@
|
|||||||
"title": "SSH Connections",
|
"title": "SSH Connections",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
|
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})",
|
"expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})",
|
||||||
@@ -51,7 +51,7 @@
 "title": "Active Sessions",
 "type": "stat",
 "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "oubliette_sessions_active{job=\"apiary\"}",
@@ -86,7 +86,7 @@
 "title": "Unique IPs",
 "type": "stat",
 "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "oubliette_storage_unique_ips{job=\"apiary\"}",
@@ -118,7 +118,7 @@
 "title": "Total Login Attempts",
 "type": "stat",
 "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "oubliette_storage_login_attempts_total{job=\"apiary\"}",
@@ -150,7 +150,7 @@
 "title": "SSH Connections Over Time",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "interval": "60s",
 "targets": [
 {
@@ -183,7 +183,7 @@
 "title": "Auth Attempts Over Time",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "interval": "60s",
 "targets": [
 {
@@ -216,7 +216,7 @@
 "title": "Sessions by Shell",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "interval": "60s",
 "targets": [
 {
@@ -249,7 +249,7 @@
 "title": "Attempts by Country",
 "type": "geomap",
 "gridPos": {"h": 10, "w": 24, "x": 0, "y": 12},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "oubliette_auth_attempts_by_country_total{job=\"apiary\"}",
@@ -318,7 +318,7 @@
 "title": "Session Duration Distribution",
 "type": "heatmap",
 "gridPos": {"h": 8, "w": 24, "x": 0, "y": 30},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "interval": "60s",
 "targets": [
 {
@@ -359,7 +359,7 @@
 "title": "Commands Executed by Shell",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "interval": "60s",
 "targets": [
 {
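Every hunk in the dashboard above makes the same one-line change: the `uid` inside a `datasource` object goes from `victoriametrics` to `prometheus`. The following is a minimal Python sketch of that rewrite, assuming the dashboards are plain JSON documents. The function name `retarget_datasource` and the two-panel sample are hypothetical illustrations and are not part of this change set.

```python
import json

OLD_UID = "victoriametrics"  # uid on the old side of the hunks
NEW_UID = "prometheus"       # uid on the new side of the hunks


def retarget_datasource(node, old_uid=OLD_UID, new_uid=NEW_UID):
    """Recursively rewrite datasource uids in a parsed dashboard.

    Returns the number of datasource objects that were changed.
    """
    changed = 0
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and ds.get("uid") == old_uid:
            ds["uid"] = new_uid
            changed += 1
        # Descend into nested objects (panels, templating variables, ...).
        for value in node.values():
            changed += retarget_datasource(value, old_uid, new_uid)
    elif isinstance(node, list):
        for item in node:
            changed += retarget_datasource(item, old_uid, new_uid)
    return changed


# Hypothetical two-panel dashboard, shaped like the panels in the hunks.
dashboard = json.loads("""
{
  "panels": [
    {"title": "SSH Connections", "type": "stat",
     "datasource": {"type": "prometheus", "uid": "victoriametrics"}},
    {"title": "Active Sessions", "type": "stat",
     "datasource": {"type": "prometheus", "uid": "victoriametrics"}}
  ]
}
""")

count = retarget_datasource(dashboard)
print(count, json.dumps(dashboard["panels"][0]["datasource"]))
```

Walking the parsed structure rather than doing a text substitution means only `datasource` objects are touched, so a panel title or query that happens to contain the old uid string would be left alone.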
|
|||||||
@@ -16,7 +16,7 @@
 "title": "Endpoints Monitored",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})",
@@ -48,7 +48,7 @@
 "title": "Probe Failures",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)",
@@ -82,7 +82,7 @@
 "title": "Expiring Soon (< 7d)",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)",
@@ -116,7 +116,7 @@
 "title": "Expiring Critical (< 24h)",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)",
@@ -150,7 +150,7 @@
 "title": "Minimum Days Remaining",
 "type": "gauge",
 "gridPos": {"h": 4, "w": 8, "x": 16, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)",
@@ -187,7 +187,7 @@
 "title": "Certificate Expiry by Endpoint",
 "type": "table",
 "gridPos": {"h": 12, "w": 12, "x": 0, "y": 4},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -253,7 +253,7 @@
 "title": "Probe Status",
 "type": "table",
 "gridPos": {"h": 12, "w": 12, "x": 12, "y": 4},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "probe_success{job=\"blackbox_tls\"}",
@@ -340,7 +340,7 @@
 "title": "Certificate Expiry Over Time",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -378,7 +378,7 @@
 "title": "Probe Success Rate",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100",
@@ -418,7 +418,7 @@
 "title": "Probe Duration",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "probe_duration_seconds{job=\"blackbox_tls\"}",
|
|||||||
@@ -15,7 +15,7 @@
 {
 "name": "tier",
 "type": "query",
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "query": "label_values(nixos_flake_info, tier)",
 "refresh": 2,
 "includeAll": true,
@@ -30,7 +30,7 @@
 "title": "Hosts Behind Remote",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 1)",
@@ -65,7 +65,7 @@
 "title": "Hosts Needing Reboot",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(nixos_config_mismatch{tier=~\"$tier\"} == 1)",
@@ -100,7 +100,7 @@
 "title": "Total Hosts",
 "type": "stat",
 "gridPos": {"h": 4, "w": 3, "x": 8, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(nixos_flake_info{tier=~\"$tier\"})",
@@ -128,7 +128,7 @@
 "title": "Nixpkgs Age",
 "type": "stat",
 "gridPos": {"h": 4, "w": 3, "x": 11, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "max(nixos_flake_input_age_seconds{input=\"nixpkgs\", tier=~\"$tier\"})",
@@ -163,7 +163,7 @@
 "title": "Hosts Up-to-date",
 "type": "stat",
 "gridPos": {"h": 4, "w": 3, "x": 14, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 0)",
@@ -192,7 +192,7 @@
 "title": "Deployments (24h)",
 "type": "stat",
 "gridPos": {"h": 4, "w": 3, "x": 17, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sum(increase(homelab_deploy_deployments_total{status=\"completed\"}[24h]))",
@@ -222,7 +222,7 @@
 "title": "Avg Deploy Time",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sum(increase(homelab_deploy_deployment_duration_seconds_sum{success=\"true\"}[24h])) / sum(increase(homelab_deploy_deployment_duration_seconds_count{success=\"true\"}[24h]))",
@@ -256,7 +256,7 @@
 "title": "Fleet Status",
 "type": "table",
 "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "nixos_flake_info{tier=~\"$tier\"}",
@@ -430,7 +430,7 @@
 "title": "Generation Age by Host",
 "type": "bargauge",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sort_desc(nixos_generation_age_seconds{tier=~\"$tier\"})",
@@ -467,7 +467,7 @@
 "title": "Generations per Host",
 "type": "bargauge",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sort_desc(nixos_generation_count{tier=~\"$tier\"})",
@@ -501,7 +501,7 @@
 "title": "Deployment Activity (Generation Age Over Time)",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 24, "x": 0, "y": 22},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
@@ -534,7 +534,7 @@
 "title": "Flake Input Ages",
 "type": "table",
 "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "max by (input) (nixos_flake_input_age_seconds)",
@@ -577,7 +577,7 @@
 "title": "Hosts by Revision",
 "type": "piechart",
 "gridPos": {"h": 6, "w": 6, "x": 12, "y": 30},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count by (current_rev) (nixos_flake_info{tier=~\"$tier\"})",
@@ -601,7 +601,7 @@
 "title": "Hosts by Tier",
 "type": "piechart",
 "gridPos": {"h": 6, "w": 6, "x": 18, "y": 30},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count by (tier) (nixos_flake_info)",
@@ -641,7 +641,7 @@
 "title": "Builds (24h)",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 0, "y": 37},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[24h]))",
@@ -671,7 +671,7 @@
 "title": "Failed Builds (24h)",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 4, "y": 37},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[24h])) or vector(0)",
@@ -705,7 +705,7 @@
 "title": "Last Build",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 8, "y": 37},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "time() - max(homelab_deploy_build_last_timestamp)",
@@ -739,7 +739,7 @@
 "title": "Avg Build Time",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 12, "y": 37},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sum(increase(homelab_deploy_build_duration_seconds_sum[24h])) / sum(increase(homelab_deploy_build_duration_seconds_count[24h]))",
@@ -773,7 +773,7 @@
 "title": "Total Hosts Built",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 16, "y": 37},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(homelab_deploy_build_duration_seconds_count)",
@@ -802,7 +802,7 @@
 "title": "Build Jobs (24h)",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 20, "y": 37},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sum(increase(homelab_deploy_builds_total[24h]))",
@@ -832,7 +832,7 @@
 "title": "Build Time by Host",
 "type": "bargauge",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 41},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sort_desc(homelab_deploy_build_duration_seconds_sum / homelab_deploy_build_duration_seconds_count)",
@@ -869,7 +869,7 @@
 "title": "Build Count by Host",
 "type": "bargauge",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 41},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sort_desc(sum by (host) (homelab_deploy_build_host_total))",
@@ -903,7 +903,7 @@
 "title": "Build Activity",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 24, "x": 0, "y": 49},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[1h]))",
|
|||||||
@@ -11,7 +11,7 @@
 {
 "name": "instance",
 "type": "query",
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "query": "label_values(node_uname_info, instance)",
 "refresh": 2,
 "includeAll": false,
@@ -26,7 +26,7 @@
 "title": "CPU Usage",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
@@ -55,7 +55,7 @@
 "title": "Memory Usage",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
@@ -84,7 +84,7 @@
 "title": "Disk Usage",
 "type": "gauge",
 "gridPos": {"h": 8, "w": 8, "x": 0, "y": 8},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
@@ -113,7 +113,7 @@
 "title": "System Load",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "node_load1{instance=~\"$instance\"}",
@@ -142,7 +142,7 @@
 "title": "Uptime",
 "type": "stat",
 "gridPos": {"h": 8, "w": 8, "x": 16, "y": 8},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "time() - node_boot_time_seconds{instance=~\"$instance\"}",
@@ -161,7 +161,7 @@
 "title": "Network Traffic",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
@@ -185,7 +185,7 @@
 "title": "Disk I/O",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",
|
|||||||
@@ -15,7 +15,7 @@
 {
 "name": "vm",
 "type": "query",
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "query": "label_values(pve_guest_info{template=\"0\"}, name)",
 "refresh": 2,
 "includeAll": true,
@@ -30,7 +30,7 @@
 "title": "VMs Running",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 1)",
@@ -56,7 +56,7 @@
 "title": "VMs Stopped",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 0)",
@@ -87,7 +87,7 @@
 "title": "Node CPU",
 "type": "gauge",
 "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-"datasource": {"type": "prometheus", "uid": "victoriametrics"},
+"datasource": {"type": "prometheus", "uid": "prometheus"},
 "targets": [
 {
 "expr": "pve_cpu_usage_ratio{id=~\"node/.*\"} * 100",
@@ -120,7 +120,7 @@
|
|||||||
"title": "Node Memory",
|
"title": "Node Memory",
|
||||||
"type": "gauge",
|
"type": "gauge",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
|
"expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
|
||||||
@@ -153,7 +153,7 @@
|
|||||||
"title": "Node Uptime",
|
"title": "Node Uptime",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "pve_uptime_seconds{id=~\"node/.*\"}",
|
"expr": "pve_uptime_seconds{id=~\"node/.*\"}",
|
||||||
@@ -180,7 +180,7 @@
|
|||||||
"title": "Templates",
|
"title": "Templates",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "count(pve_guest_info{template=\"1\"})",
|
"expr": "count(pve_guest_info{template=\"1\"})",
|
||||||
@@ -206,7 +206,7 @@
|
|||||||
"title": "VM Status",
|
"title": "VM Status",
|
||||||
"type": "table",
|
"type": "table",
|
||||||
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
|
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
|
"expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
|
||||||
@@ -362,7 +362,7 @@
|
|||||||
"title": "VM CPU Usage",
|
"title": "VM CPU Usage",
|
||||||
"type": "timeseries",
|
"type": "timeseries",
|
||||||
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
|
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
|
"expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
|
||||||
@@ -391,7 +391,7 @@
|
|||||||
"title": "VM Memory Usage",
|
"title": "VM Memory Usage",
|
||||||
"type": "timeseries",
|
"type": "timeseries",
|
||||||
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
|
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
|
"expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
|
||||||
@@ -420,7 +420,7 @@
|
|||||||
"title": "VM Network Traffic",
|
"title": "VM Network Traffic",
|
||||||
"type": "timeseries",
|
"type": "timeseries",
|
||||||
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
|
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
|
"expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
|
||||||
@@ -453,7 +453,7 @@
|
|||||||
"title": "VM Disk I/O",
|
"title": "VM Disk I/O",
|
||||||
"type": "timeseries",
|
"type": "timeseries",
|
||||||
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
|
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
|
"expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
|
||||||
@@ -486,7 +486,7 @@
|
|||||||
"title": "Storage Usage",
|
"title": "Storage Usage",
|
||||||
"type": "bargauge",
|
"type": "bargauge",
|
||||||
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
|
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
|
"expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
|
||||||
@@ -531,7 +531,7 @@
|
|||||||
"title": "Storage Capacity",
|
"title": "Storage Capacity",
|
||||||
"type": "table",
|
"type": "table",
|
||||||
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
|
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",
|
"expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",
|
||||||
|
|||||||
@@ -15,7 +15,7 @@
|
|||||||
{
|
{
|
||||||
"name": "hostname",
|
"name": "hostname",
|
||||||
"type": "query",
|
"type": "query",
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"query": "label_values(systemd_unit_state, hostname)",
|
"query": "label_values(systemd_unit_state, hostname)",
|
||||||
"refresh": 2,
|
"refresh": 2,
|
||||||
"includeAll": true,
|
"includeAll": true,
|
||||||
@@ -30,7 +30,7 @@
|
|||||||
"title": "Failed Units",
|
"title": "Failed Units",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
|
"expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
|
||||||
@@ -60,7 +60,7 @@
|
|||||||
"title": "Active Units",
|
"title": "Active Units",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
|
"expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
|
||||||
@@ -86,7 +86,7 @@
|
|||||||
"title": "Hosts Monitored",
|
"title": "Hosts Monitored",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
|
"expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
|
||||||
@@ -112,7 +112,7 @@
|
|||||||
"title": "Total Service Restarts",
|
"title": "Total Service Restarts",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
|
"expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
|
||||||
@@ -143,7 +143,7 @@
|
|||||||
"title": "Inactive Units",
|
"title": "Inactive Units",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
|
"expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
|
||||||
@@ -169,7 +169,7 @@
|
|||||||
"title": "Timers",
|
"title": "Timers",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
|
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
|
"expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
|
||||||
@@ -195,7 +195,7 @@
|
|||||||
"title": "Failed Units",
|
"title": "Failed Units",
|
||||||
"type": "table",
|
"type": "table",
|
||||||
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
|
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
|
"expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
|
||||||
@@ -251,7 +251,7 @@
|
|||||||
"title": "Service Restarts (Top 15)",
|
"title": "Service Restarts (Top 15)",
|
||||||
"type": "table",
|
"type": "table",
|
||||||
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
|
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
|
"expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
|
||||||
@@ -309,7 +309,7 @@
|
|||||||
"title": "Active Units per Host",
|
"title": "Active Units per Host",
|
||||||
"type": "bargauge",
|
"type": "bargauge",
|
||||||
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
|
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
|
"expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
|
||||||
@@ -339,7 +339,7 @@
|
|||||||
"title": "NixOS Upgrade Timers",
|
"title": "NixOS Upgrade Timers",
|
||||||
"type": "table",
|
"type": "table",
|
||||||
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
|
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
|
"expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
|
||||||
@@ -429,7 +429,7 @@
|
|||||||
"title": "Backup Timers",
|
"title": "Backup Timers",
|
||||||
"type": "table",
|
"type": "table",
|
||||||
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
|
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
|
"expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
|
||||||
@@ -524,7 +524,7 @@
|
|||||||
"title": "Service Restarts Over Time",
|
"title": "Service Restarts Over Time",
|
||||||
"type": "timeseries",
|
"type": "timeseries",
|
||||||
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
|
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",
|
"expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",
|
||||||
|
|||||||
@@ -19,7 +19,7 @@
|
|||||||
"title": "Current Temperatures",
|
"title": "Current Temperatures",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
|
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
|
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
|
||||||
@@ -71,7 +71,7 @@
|
|||||||
"title": "Average Home Temperature",
|
"title": "Average Home Temperature",
|
||||||
"type": "gauge",
|
"type": "gauge",
|
||||||
"gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
|
"gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
|
"expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
|
||||||
@@ -108,7 +108,7 @@
|
|||||||
"title": "Current Humidity",
|
"title": "Current Humidity",
|
||||||
"type": "stat",
|
"type": "stat",
|
||||||
"gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
|
"gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
|
"expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
|
||||||
@@ -154,7 +154,7 @@
|
|||||||
"title": "Temperature History (30 Days)",
|
"title": "Temperature History (30 Days)",
|
||||||
"type": "timeseries",
|
"type": "timeseries",
|
||||||
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
|
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
|
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
|
||||||
@@ -207,7 +207,7 @@
|
|||||||
"title": "Temperature Trend (1h rate of change)",
|
"title": "Temperature Trend (1h rate of change)",
|
||||||
"type": "timeseries",
|
"type": "timeseries",
|
||||||
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
|
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
|
"expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
|
||||||
@@ -268,7 +268,7 @@
|
|||||||
"title": "24h Min / Max / Avg",
|
"title": "24h Min / Max / Avg",
|
||||||
"type": "table",
|
"type": "table",
|
||||||
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
|
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
|
"expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
|
||||||
@@ -346,7 +346,7 @@
|
|||||||
"title": "Humidity History (30 Days)",
|
"title": "Humidity History (30 Days)",
|
||||||
"type": "timeseries",
|
"type": "timeseries",
|
||||||
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
|
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
|
||||||
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
|
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||||
"targets": [
|
"targets": [
|
||||||
{
|
{
|
||||||
"expr": "hass_sensor_humidity_percent",
|
"expr": "hass_sensor_humidity_percent",
|
||||||
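The dashboard hunks above change only the `uid` inside each panel's `datasource` object, so every value they reference should match a `uid` declared under `provision.datasources.settings`. A minimal sketch for listing the distinct datasource uids a dashboard file uses, assuming the single-line `"datasource": {...}` formatting seen in these hunks; the function name `list_datasource_uids` is hypothetical and not part of the repository:

```shell
#!/bin/sh
# List the distinct datasource uids referenced by panels in a dashboard JSON file.
# Assumes each datasource object sits on one line, as in the hunks above.
list_datasource_uids() {
  grep -o '"datasource": {[^}]*}' "$1" \
    | sed 's/.*"uid": "\([^"]*\)".*/\1/' \
    | sort -u
}
```

Called with the path to a dashboard file, it prints one uid per line; any uid that is not also provisioned as a datasource would likely leave the affected panels without data.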
|
|||||||
@@ -37,10 +37,6 @@
|
|||||||
# Declarative datasources
|
# Declarative datasources
|
||||||
provision.datasources.settings = {
|
provision.datasources.settings = {
|
||||||
apiVersion = 1;
|
apiVersion = 1;
|
||||||
prune = true;
|
|
||||||
deleteDatasources = [
|
|
||||||
{ name = "Prometheus (monitoring01)"; orgId = 1; }
|
|
||||||
];
|
|
||||||
datasources = [
|
datasources = [
|
||||||
{
|
{
|
||||||
name = "VictoriaMetrics";
|
name = "VictoriaMetrics";
|
||||||
@@ -49,10 +45,16 @@
|
|||||||
isDefault = true;
|
isDefault = true;
|
||||||
uid = "victoriametrics";
|
uid = "victoriametrics";
|
||||||
}
|
}
|
||||||
|
{
|
||||||
|
name = "Prometheus (monitoring01)";
|
||||||
|
type = "prometheus";
|
||||||
|
url = "http://monitoring01.home.2rjus.net:9090";
|
||||||
|
uid = "prometheus";
|
||||||
|
}
|
||||||
{
|
{
|
||||||
name = "Loki";
|
name = "Loki";
|
||||||
type = "loki";
|
type = "loki";
|
||||||
url = "http://localhost:3100";
|
url = "http://monitoring01.home.2rjus.net:3100";
|
||||||
uid = "loki";
|
uid = "loki";
|
||||||
}
|
}
|
||||||
];
|
];
|
||||||
@@ -89,14 +91,6 @@
|
|||||||
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
|
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
|
||||||
metrics
|
metrics
|
||||||
'';
|
'';
|
||||||
virtualHosts."grafana.home.2rjus.net".extraConfig = ''
|
|
||||||
log {
|
|
||||||
output file /var/log/caddy/grafana.log {
|
|
||||||
mode 644
|
|
||||||
}
|
|
||||||
}
|
|
||||||
reverse_proxy http://127.0.0.1:3000
|
|
||||||
'';
|
|
||||||
virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
|
virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
|
||||||
log {
|
log {
|
||||||
output file /var/log/caddy/grafana.log {
|
output file /var/log/caddy/grafana.log {
|
||||||
|
|||||||
@@ -54,49 +54,53 @@
|
|||||||
}
|
}
|
||||||
reverse_proxy http://ha1.home.2rjus.net:8080
|
reverse_proxy http://ha1.home.2rjus.net:8080
|
||||||
}
|
}
|
||||||
|
prometheus.home.2rjus.net {
|
||||||
|
log {
|
||||||
|
output file /var/log/caddy/prometheus.log {
|
||||||
|
mode 644
|
||||||
|
}
|
||||||
|
}
|
||||||
|
reverse_proxy http://monitoring01.home.2rjus.net:9090
|
||||||
|
}
|
||||||
|
alertmanager.home.2rjus.net {
|
||||||
|
log {
|
||||||
|
output file /var/log/caddy/alertmanager.log {
|
||||||
|
mode 644
|
||||||
|
}
|
||||||
|
}
|
||||||
|
reverse_proxy http://monitoring01.home.2rjus.net:9093
|
||||||
|
}
|
||||||
|
grafana.home.2rjus.net {
|
||||||
|
log {
|
||||||
|
output file /var/log/caddy/grafana.log {
|
||||||
|
mode 644
|
||||||
|
}
|
||||||
|
}
|
||||||
|
reverse_proxy http://monitoring01.home.2rjus.net:3000
|
||||||
|
}
|
||||||
jelly.home.2rjus.net {
|
jelly.home.2rjus.net {
|
||||||
log {
|
log {
|
||||||
output file /var/log/caddy/jelly.log {
|
output file /var/log/caddy/jelly.log {
|
||||||
mode 644
|
mode 644
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
header Content-Type text/html
|
reverse_proxy http://jelly01.home.2rjus.net:8096
|
||||||
respond <<HTML
|
|
||||||
<!DOCTYPE html>
|
|
||||||
<html>
|
|
||||||
<head>
|
|
||||||
<title>Jellyfin - Maintenance</title>
|
|
||||||
<style>
|
|
||||||
body {
|
|
||||||
background: #101020;
|
|
||||||
color: #ddd;
|
|
||||||
font-family: sans-serif;
|
|
||||||
display: flex;
|
|
||||||
justify-content: center;
|
|
||||||
align-items: center;
|
|
||||||
min-height: 100vh;
|
|
||||||
margin: 0;
|
|
||||||
text-align: center;
|
|
||||||
}
|
}
|
||||||
.container { max-width: 500px; }
|
pyroscope.home.2rjus.net {
|
||||||
.disk { font-size: 80px; animation: spin 3s linear infinite; display: inline-block; }
|
log {
|
||||||
@keyframes spin { from { transform: rotate(0deg); } to { transform: rotate(360deg); } }
|
output file /var/log/caddy/pyroscope.log {
|
||||||
h1 { color: #00a4dc; }
|
mode 644
|
||||||
p { font-size: 1.2em; line-height: 1.6; }
|
}
|
||||||
</style>
|
}
|
||||||
</head>
|
reverse_proxy http://monitoring01.home.2rjus.net:4040
|
||||||
<body>
|
}
|
||||||
<div class="container">
|
pushgw.home.2rjus.net {
|
||||||
<div class="disk">💿</div>
|
log {
|
||||||
<h1>Jellyfin is taking a nap</h1>
|
output file /var/log/caddy/pushgw.log {
|
||||||
<p>The NAS is getting shiny new hard drives.<br>
|
mode 644
|
||||||
Jellyfin will be back once the disks stop spinning up.</p>
|
}
|
||||||
<p style="color:#666;font-size:0.9em;">In the meantime, maybe go outside?</p>
|
}
|
||||||
</div>
|
reverse_proxy http://monitoring01.home.2rjus.net:9091
|
||||||
</body>
|
|
||||||
</html>
|
|
||||||
HTML 200
|
|
||||||
}
|
}
|
||||||
http://http-proxy.home.2rjus.net/metrics {
|
http://http-proxy.home.2rjus.net/metrics {
|
||||||
log {
|
log {
|
||||||
|
|||||||
@@ -1,104 +0,0 @@
|
|||||||
{ config, lib, pkgs, ... }:
|
|
||||||
let
|
|
||||||
# Script to generate bcrypt hash from Vault password for Caddy basic_auth
|
|
||||||
generateCaddyAuth = pkgs.writeShellApplication {
|
|
||||||
name = "generate-caddy-loki-auth";
|
|
||||||
runtimeInputs = [ config.services.caddy.package ];
|
|
||||||
text = ''
|
|
||||||
PASSWORD=$(cat /run/secrets/loki-push-auth)
|
|
||||||
HASH=$(caddy hash-password --plaintext "$PASSWORD")
|
|
||||||
echo "LOKI_PUSH_HASH=$HASH" > /run/secrets/caddy-loki-auth.env
|
|
||||||
chmod 0400 /run/secrets/caddy-loki-auth.env
|
|
||||||
'';
|
|
||||||
};
|
|
||||||
in
|
|
||||||
{
|
|
||||||
# Fetch Loki push password from Vault
|
|
||||||
vault.secrets.loki-push-auth = {
|
|
||||||
secretPath = "shared/loki/push-auth";
|
|
||||||
extractKey = "password";
|
|
||||||
services = [ "caddy" ];
|
|
||||||
};
|
|
||||||
|
|
||||||
# Generate bcrypt hash for Caddy before it starts
|
|
||||||
systemd.services.caddy-loki-auth = {
|
|
||||||
description = "Generate Caddy basic auth hash for Loki";
|
|
||||||
after = [ "vault-secret-loki-push-auth.service" ];
|
|
||||||
requires = [ "vault-secret-loki-push-auth.service" ];
|
|
||||||
before = [ "caddy.service" ];
|
|
||||||
requiredBy = [ "caddy.service" ];
|
|
||||||
serviceConfig = {
|
|
||||||
Type = "oneshot";
|
|
||||||
RemainAfterExit = true;
|
|
||||||
ExecStart = lib.getExe generateCaddyAuth;
|
|
||||||
};
|
|
||||||
};
|
|
||||||
|
|
||||||
# Load the bcrypt hash as environment variable for Caddy
|
|
||||||
services.caddy.environmentFile = "/run/secrets/caddy-loki-auth.env";
|
|
||||||
|
|
||||||
# Caddy reverse proxy for Loki with basic auth
|
|
||||||
services.caddy.virtualHosts."loki.home.2rjus.net".extraConfig = ''
|
|
||||||
basic_auth {
|
|
||||||
promtail {env.LOKI_PUSH_HASH}
|
|
||||||
}
|
|
||||||
reverse_proxy http://127.0.0.1:3100
|
|
||||||
'';
|
|
||||||
|
|
||||||
services.loki = {
|
|
||||||
enable = true;
|
|
||||||
configuration = {
|
|
||||||
auth_enabled = false;
|
|
||||||
|
|
||||||
server = {
|
|
||||||
http_listen_address = "127.0.0.1";
|
|
||||||
http_listen_port = 3100;
|
|
||||||
};
|
|
||||||
common = {
|
|
||||||
ring = {
|
|
||||||
instance_addr = "127.0.0.1";
|
|
||||||
kvstore = {
|
|
||||||
store = "inmemory";
|
|
||||||
};
|
|
||||||
};
|
|
||||||
replication_factor = 1;
|
|
||||||
path_prefix = "/var/lib/loki";
|
|
||||||
};
|
|
||||||
schema_config = {
|
|
||||||
configs = [
|
|
||||||
{
|
|
||||||
from = "2024-01-01";
|
|
||||||
store = "tsdb";
|
|
||||||
object_store = "filesystem";
|
|
||||||
schema = "v13";
|
|
||||||
index = {
|
|
||||||
prefix = "loki_index_";
|
|
||||||
period = "24h";
|
|
||||||
};
|
|
||||||
}
|
|
||||||
];
|
|
||||||
};
|
|
||||||
storage_config = {
|
|
||||||
filesystem = {
|
|
||||||
directory = "/var/lib/loki/chunks";
|
|
||||||
};
|
|
||||||
};
|
|
||||||
compactor = {
|
|
||||||
working_directory = "/var/lib/loki/compactor";
|
|
||||||
compaction_interval = "10m";
|
|
||||||
retention_enabled = true;
|
|
||||||
retention_delete_delay = "2h";
|
|
||||||
retention_delete_worker_count = 150;
|
|
||||||
delete_request_store = "filesystem";
|
|
||||||
};
|
|
||||||
limits_config = {
|
|
||||||
retention_period = "30d";
|
|
||||||
ingestion_rate_mb = 10;
|
|
||||||
ingestion_burst_size_mb = 20;
|
|
||||||
max_streams_per_user = 10000;
|
|
||||||
max_query_series = 500;
|
|
||||||
max_query_parallelism = 8;
|
|
||||||
};
|
|
||||||
};
|
|
||||||
};
|
|
||||||
}
|
|
||||||
@@ -1,4 +1,33 @@
|
|||||||
{ pkgs, ... }:
|
{ pkgs, ... }:
|
||||||
|
let
|
||||||
|
# TLS endpoints to monitor for certificate expiration
|
||||||
|
# These are all services using ACME certificates from OpenBao PKI
|
||||||
|
tlsTargets = [
|
||||||
|
# Direct ACME certs (security.acme.certs)
|
||||||
|
"https://vault.home.2rjus.net:8200"
|
||||||
|
"https://auth.home.2rjus.net"
|
||||||
|
"https://testvm01.home.2rjus.net"
|
||||||
|
|
||||||
|
# Caddy auto-TLS on http-proxy
|
||||||
|
"https://nzbget.home.2rjus.net"
|
||||||
|
"https://radarr.home.2rjus.net"
|
||||||
|
"https://sonarr.home.2rjus.net"
|
||||||
|
"https://ha.home.2rjus.net"
|
||||||
|
"https://z2m.home.2rjus.net"
|
||||||
|
"https://prometheus.home.2rjus.net"
|
||||||
|
"https://alertmanager.home.2rjus.net"
|
||||||
|
"https://grafana.home.2rjus.net"
|
||||||
|
"https://jelly.home.2rjus.net"
|
||||||
|
"https://pyroscope.home.2rjus.net"
|
||||||
|
"https://pushgw.home.2rjus.net"
|
||||||
|
|
||||||
|
# Caddy auto-TLS on nix-cache02
|
||||||
|
"https://nix-cache.home.2rjus.net"
|
||||||
|
|
||||||
|
# Caddy auto-TLS on grafana01
|
||||||
|
"https://grafana-test.home.2rjus.net"
|
||||||
|
];
|
||||||
|
in
|
||||||
{
|
{
|
||||||
services.prometheus.exporters.blackbox = {
|
services.prometheus.exporters.blackbox = {
|
||||||
enable = true;
|
enable = true;
|
||||||
@@ -28,4 +57,36 @@
|
|||||||
- 503
|
- 503
|
||||||
'';
|
'';
|
||||||
};
|
};
|
||||||
|
|
||||||
|
# Add blackbox scrape config to Prometheus
|
||||||
|
# Alert rules are in rules.yml (certificate_rules group)
|
||||||
|
services.prometheus.scrapeConfigs = [
|
||||||
|
{
|
||||||
|
job_name = "blackbox_tls";
|
||||||
|
metrics_path = "/probe";
|
||||||
|
params = {
|
||||||
|
module = [ "https_cert" ];
|
||||||
|
};
|
||||||
|
static_configs = [{
|
||||||
|
targets = tlsTargets;
|
||||||
|
}];
|
||||||
|
relabel_configs = [
|
||||||
|
# Pass the target URL to blackbox as a parameter
|
||||||
|
{
|
||||||
|
source_labels = [ "__address__" ];
|
||||||
|
target_label = "__param_target";
|
||||||
|
}
|
||||||
|
# Use the target URL as the instance label
|
||||||
|
{
|
||||||
|
source_labels = [ "__param_target" ];
|
||||||
|
target_label = "instance";
|
||||||
|
}
|
||||||
|
# Point the actual scrape at the local blackbox exporter
|
||||||
|
{
|
||||||
|
target_label = "__address__";
|
||||||
|
replacement = "127.0.0.1:9115";
|
||||||
|
}
|
||||||
|
];
|
||||||
|
}
|
||||||
|
];
|
||||||
}
|
}
|
||||||
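The three `relabel_configs` entries above copy each target URL into the `target` query parameter, reuse it as the `instance` label, and redirect the actual scrape to the exporter on `127.0.0.1:9115`. The resulting request can be approximated by hand; this is a sketch only, assuming the exporter's standard `/probe` endpoint with `module` and `target` parameters, and `probe_url` is a hypothetical helper name:

```shell
#!/bin/sh
# Build the probe URL that Prometheus effectively requests for one target.
# The exporter address and module name are taken from the scrape config above.
# Note: Prometheus percent-encodes the target; this sketch passes it as-is.
probe_url() {
  printf 'http://127.0.0.1:9115/probe?module=https_cert&target=%s\n' "$1"
}
```

Running `curl -s "$(probe_url https://vault.home.2rjus.net:8200)"` on the monitoring host should then return the probe metrics for that endpoint, which is a quick way to check a new entry in `tlsTargets` before relying on the alert rules.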
|
|||||||
14
services/monitoring/default.nix
Normal file
@@ -0,0 +1,14 @@
|
|||||||
|
{ ... }:
|
||||||
|
{
|
||||||
|
imports = [
|
||||||
|
./loki.nix
|
||||||
|
./grafana.nix
|
||||||
|
./prometheus.nix
|
||||||
|
./blackbox.nix
|
||||||
|
./exportarr.nix
|
||||||
|
./pve.nix
|
||||||
|
./alerttonotify.nix
|
||||||
|
./pyroscope.nix
|
||||||
|
./tempo.nix
|
||||||
|
];
|
||||||
|
}
|
||||||
@@ -14,4 +14,14 @@
|
|||||||
apiKeyFile = config.vault.secrets.sonarr-api-key.outputDir;
|
apiKeyFile = config.vault.secrets.sonarr-api-key.outputDir;
|
||||||
port = 9709;
|
port = 9709;
|
||||||
};
|
};
|
||||||
|
|
||||||
|
# Scrape config
|
||||||
|
services.prometheus.scrapeConfigs = [
|
||||||
|
{
|
||||||
|
job_name = "sonarr";
|
||||||
|
static_configs = [{
|
||||||
|
targets = [ "localhost:9709" ];
|
||||||
|
}];
|
||||||
|
}
|
||||||
|
];
|
||||||
}
|
}
|
||||||
|
|||||||
11
services/monitoring/grafana.nix
Normal file
@@ -0,0 +1,11 @@
|
|||||||
|
{ pkgs, ... }:
|
||||||
|
{
|
||||||
|
services.grafana = {
|
||||||
|
enable = true;
|
||||||
|
settings = {
|
||||||
|
server = {
|
||||||
|
http_addr = "";
|
||||||
|
};
|
||||||
|
};
|
||||||
|
};
|
||||||
|
}
|
||||||
58
services/monitoring/loki.nix
Normal file
@@ -0,0 +1,58 @@
{ ... }:
{
  services.loki = {
    enable = true;
    configuration = {
      auth_enabled = false;

      server = {
        http_listen_port = 3100;
      };
      common = {
        ring = {
          instance_addr = "127.0.0.1";
          kvstore = {
            store = "inmemory";
          };
        };
        replication_factor = 1;
        path_prefix = "/var/lib/loki";
      };
      schema_config = {
        configs = [
          {
            from = "2024-01-01";
            store = "tsdb";
            object_store = "filesystem";
            schema = "v13";
            index = {
              prefix = "loki_index_";
              period = "24h";
            };
          }
        ];
      };
      storage_config = {
        filesystem = {
          directory = "/var/lib/loki/chunks";
        };
      };
      compactor = {
        working_directory = "/var/lib/loki/compactor";
        compaction_interval = "10m";
        retention_enabled = true;
        retention_delete_delay = "2h";
        retention_delete_worker_count = 150;
        delete_request_store = "filesystem";
      };
      limits_config = {
        retention_period = "30d";
        ingestion_rate_mb = 10;
        ingestion_burst_size_mb = 20;
        max_streams_per_user = 10000;
        max_query_series = 500;
        max_query_parallelism = 8;
      };
    };
  };
}
267
services/monitoring/prometheus.nix
Normal file
@@ -0,0 +1,267 @@
{ self, lib, pkgs, ... }:
let
  monLib = import ../../lib/monitoring.nix { inherit lib; };
  externalTargets = import ./external-targets.nix;

  nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
  autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;

  # Script to fetch AppRole token for Prometheus to use when scraping OpenBao metrics
  fetchOpenbaoToken = pkgs.writeShellApplication {
    name = "fetch-openbao-token";
    runtimeInputs = [ pkgs.curl pkgs.jq ];
    text = ''
      VAULT_ADDR="https://vault01.home.2rjus.net:8200"
      APPROLE_DIR="/var/lib/vault/approle"
      OUTPUT_FILE="/run/secrets/prometheus/openbao-token"

      # Read AppRole credentials
      if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
        echo "AppRole credentials not found at $APPROLE_DIR" >&2
        exit 1
      fi

      ROLE_ID=$(cat "$APPROLE_DIR/role-id")
      SECRET_ID=$(cat "$APPROLE_DIR/secret-id")

      # Authenticate to Vault
      AUTH_RESPONSE=$(curl -sf -k -X POST \
        -d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
        "$VAULT_ADDR/v1/auth/approle/login")

      # Extract token
      VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
      if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
        echo "Failed to extract Vault token from response" >&2
        exit 1
      fi

      # Write token to file
      mkdir -p "$(dirname "$OUTPUT_FILE")"
      echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
      chown prometheus:prometheus "$OUTPUT_FILE"
      chmod 0400 "$OUTPUT_FILE"

      echo "Successfully fetched OpenBao token"
    '';
  };
in
{
  # Systemd service to fetch AppRole token for Prometheus OpenBao scraping
  # The token is used to authenticate when scraping /v1/sys/metrics
  systemd.services.prometheus-openbao-token = {
    description = "Fetch OpenBao token for Prometheus metrics scraping";
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    before = [ "prometheus.service" ];
    requiredBy = [ "prometheus.service" ];

    serviceConfig = {
      Type = "oneshot";
      ExecStart = lib.getExe fetchOpenbaoToken;
    };
  };

  # Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
  systemd.timers.prometheus-openbao-token = {
    description = "Refresh OpenBao token for Prometheus";
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "5min";
      OnUnitActiveSec = "30min";
      RandomizedDelaySec = "5min";
    };
  };

  # Fetch apiary bearer token from Vault
  vault.secrets.prometheus-apiary-token = {
    secretPath = "hosts/monitoring01/apiary-token";
    extractKey = "password";
    owner = "prometheus";
    group = "prometheus";
    services = [ "prometheus" ];
  };

  services.prometheus = {
    enable = true;
    # syntax-only check because we use external credential files (e.g., openbao-token)
    checkConfig = "syntax-only";
    alertmanager = {
      enable = true;
      configuration = {
        global = {
        };
        route = {
          receiver = "webhook_natstonotify";
          group_wait = "30s";
          group_interval = "5m";
          repeat_interval = "1h";
          group_by = [ "alertname" ];
        };
        receivers = [
          {
            name = "webhook_natstonotify";
            webhook_configs = [
              {
                url = "http://localhost:5001/alert";
              }
            ];
          }
        ];
      };
    };
    alertmanagers = [
      {
        static_configs = [
          {
            targets = [ "localhost:9093" ];
          }
        ];
      }
    ];

    retentionTime = "30d";
    globalConfig = {
      scrape_interval = "15s";
    };
    rules = [
      (builtins.readFile ./rules.yml)
    ];

    scrapeConfigs = [
      # Auto-generated node-exporter targets from flake hosts + external
      # Each static_config entry may have labels from homelab.host metadata
      {
        job_name = "node-exporter";
        static_configs = nodeExporterTargets;
      }
      # Systemd exporter on all hosts (same targets, different port)
      # Preserves the same label grouping as node-exporter
      {
        job_name = "systemd-exporter";
        static_configs = map
          (cfg: cfg // {
            targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
          })
          nodeExporterTargets;
      }
      # Local monitoring services (not auto-generated)
      {
        job_name = "prometheus";
        static_configs = [
          {
            targets = [ "localhost:9090" ];
          }
        ];
      }
      {
        job_name = "loki";
        static_configs = [
          {
            targets = [ "localhost:3100" ];
          }
        ];
      }
      {
        job_name = "grafana";
        static_configs = [
          {
            targets = [ "localhost:3000" ];
          }
        ];
      }
      {
        job_name = "alertmanager";
        static_configs = [
          {
            targets = [ "localhost:9093" ];
          }
        ];
      }
      {
        job_name = "pushgateway";
        honor_labels = true;
        static_configs = [
          {
            targets = [ "localhost:9091" ];
          }
        ];
      }
      # Caddy metrics from nix-cache02 (serves nix-cache.home.2rjus.net)
      {
        job_name = "nix-cache_caddy";
        scheme = "https";
        static_configs = [
          {
            targets = [ "nix-cache.home.2rjus.net" ];
          }
        ];
      }
      # pve-exporter with complex relabel config
      {
        job_name = "pve-exporter";
        static_configs = [
          {
            targets = [ "10.69.12.75" ];
          }
        ];
        metrics_path = "/pve";
        params = {
          module = [ "default" ];
          cluster = [ "1" ];
          node = [ "1" ];
        };
        relabel_configs = [
          {
            source_labels = [ "__address__" ];
            target_label = "__param_target";
          }
          {
            source_labels = [ "__param_target" ];
            target_label = "instance";
          }
          {
            target_label = "__address__";
            replacement = "127.0.0.1:9221";
          }
        ];
      }
      # OpenBao metrics with bearer token auth
      {
        job_name = "openbao";
        scheme = "https";
        metrics_path = "/v1/sys/metrics";
        params = {
          format = [ "prometheus" ];
        };
        static_configs = [{
          targets = [ "vault01.home.2rjus.net:8200" ];
        }];
        authorization = {
          type = "Bearer";
          credentials_file = "/run/secrets/prometheus/openbao-token";
        };
      }
      # Apiary external service
      {
        job_name = "apiary";
        scheme = "https";
        scrape_interval = "60s";
        static_configs = [{
          targets = [ "apiary.t-juice.club" ];
        }];
        authorization = {
          type = "Bearer";
          credentials_file = "/run/secrets/prometheus-apiary-token";
        };
      }
    ] ++ autoScrapeConfigs;

    pushgateway = {
      enable = true;
      web = {
        external-url = "https://pushgw.home.2rjus.net";
      };
    };
  };
}
@@ -1,7 +1,7 @@
|
|||||||
{ config, ... }:
|
{ config, ... }:
|
||||||
{
|
{
|
||||||
vault.secrets.pve-exporter = {
|
vault.secrets.pve-exporter = {
|
||||||
secretPath = "hosts/monitoring02/pve-exporter";
|
secretPath = "hosts/monitoring01/pve-exporter";
|
||||||
extractKey = "config";
|
extractKey = "config";
|
||||||
outputDir = "/run/secrets/pve_exporter";
|
outputDir = "/run/secrets/pve_exporter";
|
||||||
mode = "0444";
|
mode = "0444";
|
||||||
|
|||||||
8
services/monitoring/pyroscope.nix
Normal file
@@ -0,0 +1,8 @@
{ ... }:
{
  virtualisation.oci-containers.containers.pyroscope = {
    pull = "missing";
    image = "grafana/pyroscope:latest";
    ports = [ "4040:4040" ];
  };
}
@@ -67,13 +67,13 @@ groups:
|
|||||||
summary: "Promtail service not running on {{ $labels.instance }}"
|
summary: "Promtail service not running on {{ $labels.instance }}"
|
||||||
description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
|
description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
|
||||||
- alert: filesystem_filling_up
|
- alert: filesystem_filling_up
|
||||||
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[24h], 24*3600) < 0
|
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
|
||||||
for: 1h
|
for: 1h
|
||||||
labels:
|
labels:
|
||||||
severity: warning
|
severity: warning
|
||||||
annotations:
|
annotations:
|
||||||
summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
|
summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
|
||||||
description: "Based on the last 24h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
|
description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
|
||||||
- alert: systemd_not_running
|
- alert: systemd_not_running
|
||||||
expr: node_systemd_system_running == 0
|
expr: node_systemd_system_running == 0
|
||||||
for: 10m
|
for: 10m
|
||||||
@@ -259,32 +259,32 @@ groups:
|
|||||||
description: "Wireguard handshake timeout on {{ $labels.instance }} for peer {{ $labels.public_key }}."
|
description: "Wireguard handshake timeout on {{ $labels.instance }} for peer {{ $labels.public_key }}."
|
||||||
- name: monitoring_rules
|
- name: monitoring_rules
|
||||||
rules:
|
rules:
|
||||||
- alert: victoriametrics_not_running
|
- alert: prometheus_not_running
|
||||||
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="victoriametrics.service", state="active"} == 0
|
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="prometheus.service", state="active"} == 0
|
||||||
for: 5m
|
for: 5m
|
||||||
labels:
|
labels:
|
||||||
severity: critical
|
severity: critical
|
||||||
annotations:
|
annotations:
|
||||||
summary: "VictoriaMetrics service not running on {{ $labels.instance }}"
|
summary: "Prometheus service not running on {{ $labels.instance }}"
|
||||||
description: "VictoriaMetrics service not running on {{ $labels.instance }}"
|
description: "Prometheus service not running on {{ $labels.instance }}"
|
||||||
- alert: vmalert_not_running
|
|
||||||
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="vmalert.service", state="active"} == 0
|
|
||||||
for: 5m
|
|
||||||
labels:
|
|
||||||
severity: critical
|
|
||||||
annotations:
|
|
||||||
summary: "vmalert service not running on {{ $labels.instance }}"
|
|
||||||
description: "vmalert service not running on {{ $labels.instance }}"
|
|
||||||
- alert: alertmanager_not_running
|
- alert: alertmanager_not_running
|
||||||
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
|
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
|
||||||
for: 5m
|
for: 5m
|
||||||
labels:
|
labels:
|
||||||
severity: critical
|
severity: critical
|
||||||
annotations:
|
annotations:
|
||||||
summary: "Alertmanager service not running on {{ $labels.instance }}"
|
summary: "Alertmanager service not running on {{ $labels.instance }}"
|
||||||
description: "Alertmanager service not running on {{ $labels.instance }}"
|
description: "Alertmanager service not running on {{ $labels.instance }}"
|
||||||
|
- alert: pushgateway_not_running
|
||||||
|
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
|
||||||
|
for: 5m
|
||||||
|
labels:
|
||||||
|
severity: critical
|
||||||
|
annotations:
|
||||||
|
summary: "Pushgateway service not running on {{ $labels.instance }}"
|
||||||
|
description: "Pushgateway service not running on {{ $labels.instance }}"
|
||||||
- alert: loki_not_running
|
- alert: loki_not_running
|
||||||
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="loki.service", state="active"} == 0
|
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="loki.service", state="active"} == 0
|
||||||
for: 5m
|
for: 5m
|
||||||
labels:
|
labels:
|
||||||
severity: critical
|
severity: critical
|
||||||
@@ -292,13 +292,29 @@ groups:
|
|||||||
summary: "Loki service not running on {{ $labels.instance }}"
|
summary: "Loki service not running on {{ $labels.instance }}"
|
||||||
description: "Loki service not running on {{ $labels.instance }}"
|
description: "Loki service not running on {{ $labels.instance }}"
|
||||||
- alert: grafana_not_running
|
- alert: grafana_not_running
|
||||||
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
|
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
|
||||||
for: 5m
|
for: 5m
|
||||||
labels:
|
labels:
|
||||||
severity: warning
|
severity: warning
|
||||||
annotations:
|
annotations:
|
||||||
summary: "Grafana service not running on {{ $labels.instance }}"
|
summary: "Grafana service not running on {{ $labels.instance }}"
|
||||||
description: "Grafana service not running on {{ $labels.instance }}"
|
description: "Grafana service not running on {{ $labels.instance }}"
|
||||||
|
- alert: tempo_not_running
|
||||||
|
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="tempo.service", state="active"} == 0
|
||||||
|
for: 5m
|
||||||
|
labels:
|
||||||
|
severity: warning
|
||||||
|
annotations:
|
||||||
|
summary: "Tempo service not running on {{ $labels.instance }}"
|
||||||
|
description: "Tempo service not running on {{ $labels.instance }}"
|
||||||
|
- alert: pyroscope_not_running
|
||||||
|
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="podman-pyroscope.service", state="active"} == 0
|
||||||
|
for: 5m
|
||||||
|
labels:
|
||||||
|
severity: warning
|
||||||
|
annotations:
|
||||||
|
summary: "Pyroscope service not running on {{ $labels.instance }}"
|
||||||
|
description: "Pyroscope service not running on {{ $labels.instance }}"
|
||||||
- name: proxmox_rules
|
- name: proxmox_rules
|
||||||
rules:
|
rules:
|
||||||
- alert: pve_node_down
|
- alert: pve_node_down
|
||||||
|
|||||||
37
services/monitoring/tempo.nix
Normal file
@@ -0,0 +1,37 @@
{ ... }:
{
  services.tempo = {
    enable = true;
    settings = {
      server = {
        http_listen_port = 3200;
        grpc_listen_port = 3201;
      };
      distributor = {
        receivers = {
          otlp = {
            protocols = {
              http = {
                endpoint = ":4318";
                cors = {
                  allowed_origins = [ "*.home.2rjus.net" ];
                };
              };
            };
          };
        };
      };
      storage = {
        trace = {
          backend = "local";
          local = {
            path = "/var/lib/tempo";
          };
          wal = {
            path = "/var/lib/tempo/wal";
          };
        };
      };
    };
  };
}
@@ -6,24 +6,6 @@ let
|
|||||||
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
|
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
|
||||||
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
|
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
|
||||||
|
|
||||||
# TLS endpoints to monitor for certificate expiration via blackbox exporter
|
|
||||||
tlsTargets = [
|
|
||||||
"https://vault.home.2rjus.net:8200"
|
|
||||||
"https://auth.home.2rjus.net"
|
|
||||||
"https://testvm01.home.2rjus.net"
|
|
||||||
"https://nzbget.home.2rjus.net"
|
|
||||||
"https://radarr.home.2rjus.net"
|
|
||||||
"https://sonarr.home.2rjus.net"
|
|
||||||
"https://ha.home.2rjus.net"
|
|
||||||
"https://z2m.home.2rjus.net"
|
|
||||||
"https://metrics.home.2rjus.net"
|
|
||||||
"https://alertmanager.home.2rjus.net"
|
|
||||||
"https://grafana.home.2rjus.net"
|
|
||||||
"https://jelly.home.2rjus.net"
|
|
||||||
"https://nix-cache.home.2rjus.net"
|
|
||||||
"https://grafana-test.home.2rjus.net"
|
|
||||||
];
|
|
||||||
|
|
||||||
# Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
|
# Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
|
||||||
fetchOpenbaoToken = pkgs.writeShellApplication {
|
fetchOpenbaoToken = pkgs.writeShellApplication {
|
||||||
name = "fetch-openbao-token-vm";
|
name = "fetch-openbao-token-vm";
|
||||||
@@ -125,39 +107,6 @@ let
|
|||||||
credentials_file = "/run/secrets/victoriametrics-apiary-token";
|
credentials_file = "/run/secrets/victoriametrics-apiary-token";
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
# Blackbox TLS certificate monitoring
|
|
||||||
{
|
|
||||||
job_name = "blackbox_tls";
|
|
||||||
metrics_path = "/probe";
|
|
||||||
params = {
|
|
||||||
module = [ "https_cert" ];
|
|
||||||
};
|
|
||||||
static_configs = [{ targets = tlsTargets; }];
|
|
||||||
relabel_configs = [
|
|
||||||
{
|
|
||||||
source_labels = [ "__address__" ];
|
|
||||||
target_label = "__param_target";
|
|
||||||
}
|
|
||||||
{
|
|
||||||
source_labels = [ "__param_target" ];
|
|
||||||
target_label = "instance";
|
|
||||||
}
|
|
||||||
{
|
|
||||||
target_label = "__address__";
|
|
||||||
replacement = "127.0.0.1:9115";
|
|
||||||
}
|
|
||||||
];
|
|
||||||
}
|
|
||||||
# Sonarr exporter
|
|
||||||
{
|
|
||||||
job_name = "sonarr";
|
|
||||||
static_configs = [{ targets = [ "localhost:9709" ]; }];
|
|
||||||
}
|
|
||||||
# Proxmox VE exporter
|
|
||||||
{
|
|
||||||
job_name = "pve";
|
|
||||||
static_configs = [{ targets = [ "localhost:9221" ]; }];
|
|
||||||
}
|
|
||||||
] ++ autoScrapeConfigs;
|
] ++ autoScrapeConfigs;
|
||||||
in
|
in
|
||||||
{
|
{
|
||||||
@@ -203,7 +152,7 @@ in
|
|||||||
|
|
||||||
# Fetch apiary bearer token from Vault
|
# Fetch apiary bearer token from Vault
|
||||||
vault.secrets.victoriametrics-apiary-token = {
|
vault.secrets.victoriametrics-apiary-token = {
|
||||||
secretPath = "hosts/monitoring02/apiary-token";
|
secretPath = "hosts/monitoring01/apiary-token";
|
||||||
extractKey = "password";
|
extractKey = "password";
|
||||||
owner = "victoriametrics";
|
owner = "victoriametrics";
|
||||||
group = "victoriametrics";
|
group = "victoriametrics";
|
||||||
@@ -221,12 +170,15 @@ in
|
|||||||
};
|
};
|
||||||
};
|
};
|
||||||
|
|
||||||
# vmalert for alerting rules
|
# vmalert for alerting rules - no notifier during parallel operation
|
||||||
services.vmalert.instances.default = {
|
services.vmalert.instances.default = {
|
||||||
enable = true;
|
enable = true;
|
||||||
settings = {
|
settings = {
|
||||||
"datasource.url" = "http://localhost:8428";
|
"datasource.url" = "http://localhost:8428";
|
||||||
"notifier.url" = [ "http://localhost:9093" ];
|
# Blackhole notifications during parallel operation to prevent duplicate alerts.
|
||||||
|
# Replace with notifier.url after cutover from monitoring01:
|
||||||
|
# "notifier.url" = [ "http://localhost:9093" ];
|
||||||
|
"notifier.blackhole" = true;
|
||||||
"rule" = [ ../monitoring/rules.yml ];
|
"rule" = [ ../monitoring/rules.yml ];
|
||||||
};
|
};
|
||||||
};
|
};
|
||||||
@@ -239,11 +191,8 @@ in
|
|||||||
reverse_proxy http://127.0.0.1:8880
|
reverse_proxy http://127.0.0.1:8880
|
||||||
'';
|
'';
|
||||||
|
|
||||||
# Alertmanager
|
# Alertmanager - same config as monitoring01 but will only receive
|
||||||
services.caddy.virtualHosts."alertmanager.home.2rjus.net".extraConfig = ''
|
# alerts after cutover (vmalert notifier is disabled above)
|
||||||
reverse_proxy http://127.0.0.1:9093
|
|
||||||
'';
|
|
||||||
|
|
||||||
services.prometheus.alertmanager = {
|
services.prometheus.alertmanager = {
|
||||||
enable = true;
|
enable = true;
|
||||||
configuration = {
|
configuration = {
|
||||||
|
|||||||
@@ -16,16 +16,6 @@ in
|
|||||||
SystemKeepFree=1G
|
SystemKeepFree=1G
|
||||||
'';
|
'';
|
||||||
};
|
};
|
||||||
|
|
||||||
# Fetch Loki push password from Vault (only on hosts with Vault enabled)
|
|
||||||
vault.secrets.promtail-loki-auth = lib.mkIf config.vault.enable {
|
|
||||||
secretPath = "shared/loki/push-auth";
|
|
||||||
extractKey = "password";
|
|
||||||
owner = "promtail";
|
|
||||||
group = "promtail";
|
|
||||||
services = [ "promtail" ];
|
|
||||||
};
|
|
||||||
|
|
||||||
# Configure promtail
|
# Configure promtail
|
||||||
services.promtail = {
|
services.promtail = {
|
||||||
enable = true;
|
enable = true;
|
||||||
@@ -39,11 +29,7 @@ in
|
|||||||
|
|
||||||
clients = [
|
clients = [
|
||||||
{
|
{
|
||||||
url = "https://loki.home.2rjus.net/loki/api/v1/push";
|
url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
|
||||||
basic_auth = {
|
|
||||||
username = "promtail";
|
|
||||||
password_file = "/run/secrets/promtail-loki-auth";
|
|
||||||
};
|
|
||||||
}
|
}
|
||||||
];
|
];
|
||||||
|
|
||||||
|
|||||||
@@ -16,8 +16,7 @@ let
|
|||||||
text = ''
|
text = ''
|
||||||
set -euo pipefail
|
set -euo pipefail
|
||||||
|
|
||||||
LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
|
LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
|
||||||
LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
|
|
||||||
HOSTNAME=$(hostname)
|
HOSTNAME=$(hostname)
|
||||||
SESSION_ID=""
|
SESSION_ID=""
|
||||||
RECORD_MODE=false
|
RECORD_MODE=false
|
||||||
@@ -70,13 +69,7 @@ let
|
|||||||
}]
|
}]
|
||||||
}')
|
}')
|
||||||
|
|
||||||
local auth_args=()
|
|
||||||
if [[ -f "$LOKI_AUTH_FILE" ]]; then
|
|
||||||
auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
|
|
||||||
fi
|
|
||||||
|
|
||||||
if curl -s -X POST "$LOKI_URL" \
|
if curl -s -X POST "$LOKI_URL" \
|
||||||
"''${auth_args[@]}" \
|
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d "$payload" > /dev/null; then
|
-d "$payload" > /dev/null; then
|
||||||
return 0
|
return 0
|
||||||
|
|||||||
@@ -57,7 +57,7 @@ let
|
|||||||
type = types.str;
|
type = types.str;
|
||||||
description = ''
|
description = ''
|
||||||
Path to the secret in Vault (without /v1/secret/data/ prefix).
|
Path to the secret in Vault (without /v1/secret/data/ prefix).
|
||||||
Example: "hosts/ha1/mqtt-password"
|
Example: "hosts/monitoring01/grafana-admin"
|
||||||
'';
|
'';
|
||||||
};
|
};
|
||||||
|
|
||||||
@@ -152,11 +152,13 @@ in
|
|||||||
'';
|
'';
|
||||||
example = literalExpression ''
|
example = literalExpression ''
|
||||||
{
|
{
|
||||||
mqtt-password = {
|
grafana-admin = {
|
||||||
secretPath = "hosts/ha1/mqtt-password";
|
secretPath = "hosts/monitoring01/grafana-admin";
|
||||||
owner = "mosquitto";
|
owner = "grafana";
|
||||||
group = "mosquitto";
|
group = "grafana";
|
||||||
services = [ "mosquitto" ];
|
restartTrigger = true;
|
||||||
|
restartInterval = "daily";
|
||||||
|
services = [ "grafana" ];
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
'';
|
'';
|
||||||
|
|||||||
@@ -26,27 +26,26 @@ path "secret/data/shared/nixos-exporter/*" {
|
|||||||
EOT
|
EOT
|
||||||
}
|
}
|
||||||
|
|
||||||
# Shared policy for Loki push authentication (all hosts push logs)
|
|
||||||
resource "vault_policy" "loki_push" {
|
|
||||||
name = "loki-push"
|
|
||||||
|
|
||||||
policy = <<EOT
|
|
||||||
path "secret/data/shared/loki/*" {
|
|
||||||
capabilities = ["read", "list"]
|
|
||||||
}
|
|
||||||
EOT
|
|
||||||
}
|
|
||||||
|
|
||||||
# Define host access policies
|
# Define host access policies
|
||||||
locals {
|
locals {
|
||||||
host_policies = {
|
host_policies = {
|
||||||
# Example:
|
# Example: monitoring01 host
|
||||||
|
# "monitoring01" = {
|
||||||
|
# paths = [
|
||||||
|
# "secret/data/hosts/monitoring01/*",
|
||||||
|
# "secret/data/services/prometheus/*",
|
||||||
|
# "secret/data/services/grafana/*",
|
||||||
|
# "secret/data/shared/smtp/*"
|
||||||
|
# ]
|
||||||
|
# extra_policies = ["some-other-policy"] # Optional: additional policies
|
||||||
|
# }
|
||||||
|
|
||||||
|
# Example: ha1 host
|
||||||
# "ha1" = {
|
# "ha1" = {
|
||||||
# paths = [
|
# paths = [
|
||||||
# "secret/data/hosts/ha1/*",
|
# "secret/data/hosts/ha1/*",
|
||||||
# "secret/data/shared/mqtt/*"
|
# "secret/data/shared/mqtt/*"
|
||||||
# ]
|
# ]
|
||||||
# extra_policies = ["some-other-policy"] # Optional: additional policies
|
|
||||||
# }
|
# }
|
||||||
|
|
||||||
"ha1" = {
|
"ha1" = {
|
||||||
@@ -56,6 +55,16 @@ locals {
|
|||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
|
||||||
|
"monitoring01" = {
|
||||||
|
paths = [
|
||||||
|
"secret/data/hosts/monitoring01/*",
|
||||||
|
"secret/data/shared/backup/*",
|
||||||
|
"secret/data/shared/nats/*",
|
||||||
|
"secret/data/services/exportarr/*",
|
||||||
|
]
|
||||||
|
extra_policies = ["prometheus-metrics"]
|
||||||
|
}
|
||||||
|
|
||||||
# Wave 1: hosts with no service secrets (only need vault.enable for future use)
|
# Wave 1: hosts with no service secrets (only need vault.enable for future use)
|
||||||
"nats1" = {
|
"nats1" = {
|
||||||
paths = [
|
paths = [
|
||||||
@@ -69,7 +78,7 @@ locals {
|
|||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
|
||||||
# Wave 3: DNS servers (managed in hosts-generated.tf)
|
# Wave 3: DNS servers
|
||||||
|
|
||||||
# Wave 4: http-proxy
|
# Wave 4: http-proxy
|
||||||
"http-proxy" = {
|
"http-proxy" = {
|
||||||
@@ -95,6 +104,15 @@ locals {
|
|||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# monitoring02: Grafana + VictoriaMetrics
|
||||||
|
"monitoring02" = {
|
||||||
|
paths = [
|
||||||
|
"secret/data/hosts/monitoring02/*",
|
||||||
|
"secret/data/hosts/monitoring01/apiary-token",
|
||||||
|
"secret/data/services/grafana/*",
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -120,7 +138,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
|
|||||||
backend = vault_auth_backend.approle.path
|
backend = vault_auth_backend.approle.path
|
||||||
role_name = each.key
|
role_name = each.key
|
||||||
token_policies = concat(
|
token_policies = concat(
|
||||||
["${each.key}-policy", "homelab-deploy", "nixos-exporter", "loki-push"],
|
["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
|
||||||
lookup(each.value, "extra_policies", [])
|
lookup(each.value, "extra_policies", [])
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -44,15 +44,6 @@ locals {
|
|||||||
"secret/data/hosts/garage01/*",
|
"secret/data/hosts/garage01/*",
|
||||||
]
|
]
|
||||||
}
|
}
|
||||||
"monitoring02" = {
|
|
||||||
paths = [
|
|
||||||
"secret/data/hosts/monitoring02/*",
|
|
||||||
"secret/data/services/grafana/*",
|
|
||||||
"secret/data/services/exportarr/*",
|
|
||||||
"secret/data/shared/nats/nkey",
|
|
||||||
]
|
|
||||||
extra_policies = ["prometheus-metrics"]
|
|
||||||
}
|
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -83,10 +74,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
|
|||||||
|
|
||||||
backend = vault_auth_backend.approle.path
|
backend = vault_auth_backend.approle.path
|
||||||
role_name = each.key
|
role_name = each.key
|
||||||
token_policies = concat(
|
token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
|
||||||
["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"],
|
|
||||||
lookup(each.value, "extra_policies", [])
|
|
||||||
)
|
|
||||||
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
|
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
|
||||||
token_ttl = 3600
|
token_ttl = 3600
|
||||||
token_max_ttl = 3600
|
token_max_ttl = 3600
|
||||||
|
|||||||
@@ -10,6 +10,10 @@ resource "vault_mount" "kv" {
|
|||||||
locals {
|
locals {
|
||||||
secrets = {
|
secrets = {
|
||||||
# Example host-specific secrets
|
# Example host-specific secrets
|
||||||
|
# "hosts/monitoring01/grafana-admin" = {
|
||||||
|
# auto_generate = true
|
||||||
|
# password_length = 32
|
||||||
|
# }
|
||||||
# "hosts/ha1/mqtt-password" = {
|
# "hosts/ha1/mqtt-password" = {
|
||||||
# auto_generate = true
|
# auto_generate = true
|
||||||
# password_length = 24
|
# password_length = 24
|
||||||
@@ -31,6 +35,11 @@ locals {
|
|||||||
# }
|
# }
|
||||||
# }
|
# }
|
||||||
|
|
||||||
|
"hosts/monitoring01/grafana-admin" = {
|
||||||
|
auto_generate = true
|
||||||
|
password_length = 32
|
||||||
|
}
|
||||||
|
|
||||||
"hosts/ha1/mqtt-password" = {
|
"hosts/ha1/mqtt-password" = {
|
||||||
auto_generate = true
|
auto_generate = true
|
||||||
password_length = 24
|
password_length = 24
|
||||||
@@ -48,8 +57,8 @@ locals {
|
|||||||
data = { nkey = var.nats_nkey }
|
data = { nkey = var.nats_nkey }
|
||||||
}
|
}
|
||||||
|
|
||||||
# PVE exporter config for monitoring02
|
# PVE exporter config for monitoring01
|
||||||
"hosts/monitoring02/pve-exporter" = {
|
"hosts/monitoring01/pve-exporter" = {
|
||||||
auto_generate = false
|
auto_generate = false
|
||||||
data = { config = var.pve_exporter_config }
|
data = { config = var.pve_exporter_config }
|
||||||
}
|
}
|
||||||
@@ -140,16 +149,10 @@ locals {
|
|||||||
}
|
}
|
||||||
|
|
||||||
# Bearer token for scraping apiary metrics
|
# Bearer token for scraping apiary metrics
|
||||||
"hosts/monitoring02/apiary-token" = {
|
"hosts/monitoring01/apiary-token" = {
|
||||||
auto_generate = true
|
auto_generate = true
|
||||||
password_length = 64
|
password_length = 64
|
||||||
}
|
}
|
||||||
|
|
||||||
# Loki push authentication (used by Promtail on all hosts)
|
|
||||||
"shared/loki/push-auth" = {
|
|
||||||
auto_generate = true
|
|
||||||
password_length = 32
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||