6 Commits

Author SHA1 Message Date
ef850d91a4 terraform: grant monitoring02 access to apiary-token secret
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:52:07 +01:00
a99fb5b959 grafana: remove one-time deleteDatasources cleanup
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:49:27 +01:00
d385f02c89 grafana: fix datasource provisioning crash from renamed Prometheus datasource
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:44:35 +01:00
8dfd04b406 monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.

Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:36:11 +01:00
63cf690598 victoriametrics: fix vmalert crash by adding notifier.blackhole
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
vmalert requires either a notifier URL or -notifier.blackhole when
alerting rules are present. Add blackhole flag for parallel operation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:31:53 +01:00
ef8eeaa2f5 monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:29:34 +01:00
51 changed files with 1252 additions and 721 deletions

View File

@@ -130,7 +130,7 @@ get_commit_info(<hash>) # Get full details of a specific change
```
**Example workflow for a service-related alert:**
1. Query `nixos_flake_info{hostname="monitoring02"}``current_rev: 8959829`
1. Query `nixos_flake_info{hostname="monitoring01"}``current_rev: 8959829`
2. `resolve_ref("master")``4633421`
3. `is_ancestor("8959829", "4633421")` → Yes, host is behind
4. `commits_between("8959829", "4633421")` → 7 commits missing

View File

@@ -30,7 +30,7 @@ Use the `lab-monitoring` MCP server tools:
### Label Reference
Available labels for log queries:
- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path
@@ -54,7 +54,7 @@ Journal logs are JSON-formatted. Key fields:
**All logs from a host:**
```logql
{hostname="monitoring02"}
{hostname="monitoring01"}
```
**Logs from a service across all hosts:**
@@ -74,7 +74,7 @@ Journal logs are JSON-formatted. Key fields:
**Regex matching:**
```logql
{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
```
**Filter by level (journal scrape only):**
@@ -109,7 +109,7 @@ Default lookback is 1 hour. Use `start` parameter for older logs:
Useful systemd units for troubleshooting:
- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `victoriametrics.service` - Metrics collection
- `prometheus.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
@@ -152,7 +152,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl
Parse JSON and filter on fields:
```logql
{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
```
---
@@ -242,11 +242,12 @@ All available Prometheus job names:
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)
**Monitoring stack (localhost on monitoring02):**
- `victoriametrics` - VictoriaMetrics self-metrics
**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway
**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
@@ -261,7 +262,7 @@ All scrape targets have these labels:
**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
@@ -274,7 +275,7 @@ Use the `hostname` label for easy host filtering across all jobs:
```promql
{hostname="ns1"} # All metrics from ns1
node_load1{hostname="monitoring02"} # Specific metric by hostname
node_load1{hostname="monitoring01"} # Specific metric by hostname
up{hostname="ha1"} # Check if ha1 is up
```
@@ -282,10 +283,10 @@ This is simpler than wildcarding the `instance` label:
```promql
# Old way (still works but verbose)
up{instance=~"monitoring02.*"}
up{instance=~"monitoring01.*"}
# New way (preferred)
up{hostname="monitoring02"}
up{hostname="monitoring01"}
```
### Filtering by Role/Tier

View File

@@ -73,7 +73,6 @@ Additional context, caveats, or references.
- **Reference existing patterns**: Mention how this fits with existing infrastructure
- **Tables for comparisons**: Use markdown tables when comparing options
- **Practical focus**: Emphasize what needs to happen, not theory
- **Mermaid diagrams**: Use mermaid code blocks for architecture diagrams, flow charts, or other graphs when relevant to the plan. Keep node labels short and use `<br/>` for line breaks
## Examples of Good Plans

3
.gitignore vendored
View File

@@ -2,9 +2,6 @@
result
result-*
# MCP config (contains secrets)
.mcp.json
# Terraform/OpenTofu
terraform/.terraform/
terraform/.terraform.lock.hcl

View File

@@ -20,9 +20,7 @@
"env": {
"PROMETHEUS_URL": "https://prometheus.home.2rjus.net",
"ALERTMANAGER_URL": "https://alertmanager.home.2rjus.net",
"LOKI_URL": "https://loki.home.2rjus.net",
"LOKI_USERNAME": "promtail",
"LOKI_PASSWORD": "<password from: bao kv get -field=password secret/shared/loki/push-auth>"
"LOKI_URL": "http://monitoring01.home.2rjus.net:3100"
}
},
"homelab-deploy": {
@@ -46,3 +44,4 @@
}
}
}

View File

@@ -247,7 +247,7 @@ nix develop -c homelab-deploy -- deploy \
deploy.prod.<hostname>
```
Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring02`, `deploy.test.testvm01`)
Subject format: `deploy.<tier>.<hostname>` (e.g., `deploy.prod.monitoring01`, `deploy.test.testvm01`)
**Verifying Deployments:**
@@ -309,7 +309,7 @@ All hosts automatically get:
- OpenBao (Vault) secrets management via AppRole
- Internal ACME CA integration (OpenBao PKI at vault.home.2rjus.net)
- Daily auto-upgrades with auto-reboot
- Prometheus node-exporter + Promtail (logs to monitoring02)
- Prometheus node-exporter + Promtail (logs to monitoring01)
- Monitoring scrape target auto-registration via `homelab.monitoring` options
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options
@@ -335,7 +335,7 @@ Use `nix flake show` or `nix develop -c ansible-inventory --graph` to list all h
- Infrastructure subnet: `10.69.13.x`
- DNS: ns1/ns2 provide authoritative DNS with primary-secondary setup
- Internal CA for ACME certificates (no Let's Encrypt)
- Centralized monitoring at monitoring02
- Centralized monitoring at monitoring01
- Static networking via systemd-networkd
### Secrets Management
@@ -480,21 +480,23 @@ See [docs/host-creation.md](docs/host-creation.md) for the complete host creatio
### Monitoring Stack
All hosts ship metrics and logs to `monitoring02`:
- **Metrics**: VictoriaMetrics scrapes node-exporter from all hosts
- **Logs**: Promtail ships logs to Loki on monitoring02
- **Access**: Grafana at monitoring02 for visualization
All hosts ship metrics and logs to `monitoring01`:
- **Metrics**: Prometheus scrapes node-exporter from all hosts
- **Logs**: Promtail ships logs to Loki on monitoring01
- **Access**: Grafana at monitoring01 for visualization
- **Tracing**: Tempo for distributed tracing
- **Profiling**: Pyroscope for continuous profiling
**Scrape Target Auto-Generation:**
VictoriaMetrics scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
Prometheus scrape targets are automatically generated from host configurations, following the same pattern as DNS zone generation:
- **Node-exporter**: All flake hosts with static IPs are automatically added as node-exporter targets
- **Service targets**: Defined via `homelab.monitoring.scrapeTargets` in service modules
- **External targets**: Non-flake hosts defined in `/services/monitoring/external-targets.nix`
- **Library**: `lib/monitoring.nix` provides `generateNodeExporterTargets` and `generateScrapeConfigs`
Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The VictoriaMetrics config on monitoring02 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
Service modules declare their scrape targets directly via `homelab.monitoring.scrapeTargets`. The Prometheus config on monitoring01 auto-generates scrape configs from all hosts. See "Homelab Module Options" section for available options.
To add monitoring targets for non-NixOS hosts, edit `/services/monitoring/external-targets.nix`.

View File

@@ -10,7 +10,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
| `ca` | Internal Certificate Authority |
| `ha1` | Home Assistant + Zigbee2MQTT + Mosquitto |
| `http-proxy` | Reverse proxy |
| `monitoring02` | VictoriaMetrics, Grafana, Loki, Alertmanager |
| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `jelly01` | Jellyfin media server |
| `nix-cache02` | Nix binary cache + NATS-based build service |
| `nats1` | NATS messaging |
@@ -121,4 +121,4 @@ No manual intervention is required after `tofu apply`.
- Infrastructure subnet: `10.69.13.0/24`
- DNS: ns1/ns2 authoritative with primary-secondary AXFR
- Internal CA for TLS certificates (migrating from step-ca to OpenBao PKI)
- Centralized monitoring at monitoring02
- Centralized monitoring at monitoring01

View File

@@ -1,156 +0,0 @@
# Monitoring Stack Migration to VictoriaMetrics
## Overview
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for seamless transition.
## Current State
**monitoring02** (10.69.13.24) - **PRIMARY**:
- 4 CPU cores, 8GB RAM, 60GB disk
- VictoriaMetrics with 3-month retention
- vmalert with alerting enabled (routes to local Alertmanager)
- Alertmanager -> alerttonotify -> NATS notification pipeline
- Grafana with Kanidm OIDC (`grafana.home.2rjus.net`)
- Loki (log aggregation)
- CNAMEs: monitoring, alertmanager, grafana, grafana-test, metrics, vmalert, loki
**monitoring01** (10.69.13.13) - **SHUT DOWN**:
- No longer running, pending decommission
## Decision: VictoriaMetrics
Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
## Architecture
```
┌─────────────────┐
│ monitoring02 │
│ VictoriaMetrics│
│ + Grafana │
monitoring │ + Loki │
CNAME ──────────│ + Alertmanager │
│ (vmalert) │
└─────────────────┘
│ scrapes
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌─────┴────┐ ┌─────┴────┐
│ ns1 │ │ ha1 │ │ ... │
│ :9100 │ │ :9100 │ │ :9100 │
└─────────┘ └──────────┘ └──────────┘
```
## Implementation Plan
### Phase 1: Create monitoring02 Host [COMPLETE]
Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
### Phase 2: Set Up VictoriaMetrics Stack [COMPLETE]
New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.
1. **VictoriaMetrics** (port 8428):
- `services.victoriametrics.enable = true`
- `retentionPeriod = "3"` (3 months)
- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
- Static user override (DynamicUser disabled) for credential file access
- OpenBao token fetch service + 30min refresh timer
- Apiary bearer token via vault.secrets
2. **vmalert** for alerting rules:
- Points to VictoriaMetrics datasource at localhost:8428
- Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
- Notifier sends to local Alertmanager at localhost:9093
3. **Alertmanager** (port 9093):
- Same configuration as monitoring01 (alerttonotify webhook routing)
- alerttonotify imported on monitoring02, routes alerts via NATS
4. **Grafana** (port 3000):
- VictoriaMetrics datasource (localhost:8428) as default
- Loki datasource pointing to localhost:3100
5. **Loki** (port 3100):
- Same configuration as monitoring01 in standalone `services/loki/` module
- Grafana datasource updated to localhost:3100
**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
native push support.
### Phase 3: Parallel Operation [COMPLETE]
Ran both monitoring01 and monitoring02 simultaneously to validate data collection and dashboards.
### Phase 4: Add monitoring CNAME [COMPLETE]
Added CNAMEs to monitoring02: monitoring, alertmanager, grafana, metrics, vmalert, loki.
### Phase 5: Update References [COMPLETE]
- Moved alertmanager, grafana, prometheus CNAMEs from http-proxy to monitoring02
- Removed corresponding Caddy reverse proxy entries from http-proxy
- monitoring02 Caddy serves alertmanager, grafana, metrics, vmalert directly
### Phase 6: Enable Alerting [COMPLETE]
- Switched vmalert from blackhole mode to local Alertmanager
- alerttonotify service running on monitoring02 (NATS nkey from Vault)
- prometheus-metrics Vault policy added for OpenBao scraping
- Full alerting pipeline verified: vmalert -> Alertmanager -> alerttonotify -> NATS
### Phase 7: Cutover and Decommission [IN PROGRESS]
- monitoring01 shut down (2026-02-17)
- Vault AppRole moved from approle.tf to hosts-generated.tf with extra_policies support
**Remaining cleanup (separate branch):**
- [ ] Update `system/monitoring/logs.nix` - Promtail still points to monitoring01
- [ ] Update `hosts/template2/bootstrap.nix` - Bootstrap Loki URL still points to monitoring01
- [ ] Remove monitoring01 from flake.nix and host configuration
- [ ] Destroy monitoring01 VM in Proxmox
- [ ] Remove monitoring01 from terraform state
- [ ] Remove or archive `services/monitoring/` (Prometheus config)
## Completed
- 2026-02-08: Phase 1 - monitoring02 host created
- 2026-02-17: Phase 2 - VictoriaMetrics, vmalert, Alertmanager, Loki, Grafana configured
- 2026-02-17: Phase 6 - Alerting enabled, CNAMEs migrated, monitoring01 shut down
## VictoriaMetrics Service Configuration
Implemented in `services/victoriametrics/default.nix`. Key design decisions:
- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
`victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
`services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
## Notes
- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state
- Tempo and Pyroscope deferred (not actively used; can be added later if needed)

View File

@@ -20,9 +20,9 @@ Hosts to migrate:
| http-proxy | Stateless | Reverse proxy, recreate |
| nats1 | Stateless | Messaging, recreate |
| ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
| ~~monitoring01~~ | ~~Decommission~~ | ✓ Complete — replaced by monitoring02 (VictoriaMetrics) |
| monitoring01 | Stateful | Prometheus, Grafana, Loki |
| jelly01 | Stateful | Jellyfin metadata, watch history, config |
| ~~pgdb1~~ | ~~Decommission~~ | ✓ Complete |
| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
| ~~jump~~ | ~~Decommission~~ | ✓ Complete |
| ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
| ~~ca~~ | ~~Deferred~~ | ✓ Complete |
@@ -31,12 +31,10 @@ Hosts to migrate:
Before migrating any stateful host, ensure restic backups are in place and verified.
### ~~1a. Expand monitoring01 Grafana Backup~~ ✓ N/A
### 1a. Expand monitoring01 Grafana Backup
~~The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.~~
No longer needed — monitoring01 decommissioned, replaced by monitoring02 with declarative Grafana dashboards.
The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.
### 1b. Add Jellyfin Backup to jelly01
@@ -96,17 +94,15 @@ For each stateful host, the procedure is:
7. Start services and verify functionality
8. Decommission the old VM
### 3a. monitoring01 ✓ COMPLETE
### 3a. monitoring01
~~1. Run final Grafana backup~~
~~2. Provision new monitoring01 via OpenTofu~~
~~3. After bootstrap, restore `/var/lib/grafana/` from restic~~
~~4. Restart Grafana, verify dashboards and datasources are intact~~
~~5. Prometheus and Loki start fresh with empty data (acceptable)~~
~~6. Verify all scrape targets are being collected~~
~~7. Decommission old VM~~
Replaced by monitoring02 with VictoriaMetrics, standalone Loki and Grafana modules. Host configuration, old service modules, and terraform resources removed.
1. Run final Grafana backup
2. Provision new monitoring01 via OpenTofu
3. After bootstrap, restore `/var/lib/grafana/` from restic
4. Restart Grafana, verify dashboards and datasources are intact
5. Prometheus and Loki start fresh with empty data (acceptable)
6. Verify all scrape targets are being collected
7. Decommission old VM
### 3b. jelly01
@@ -167,19 +163,19 @@ Host was already removed from flake.nix and VM destroyed. Configuration cleaned
Host configuration, services, and VM already removed.
### pgdb1 ✓ COMPLETE
### pgdb1 (in progress)
~~Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.~~
Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
~~1. Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
~~2. Remove host configuration from `hosts/pgdb1/`~~
~~3. Remove `services/postgres/` (only used by pgdb1)~~
~~4. Remove from `flake.nix`~~
~~5. Remove Vault AppRole from `terraform/vault/approle.tf`~~
~~6. Destroy the VM in Proxmox~~
~~7. Commit cleanup~~
1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
2. ~~Remove host configuration from `hosts/pgdb1/`~~
3. ~~Remove `services/postgres/` (only used by pgdb1)~~
4. ~~Remove from `flake.nix`~~
5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~
6. Destroy the VM in Proxmox
7. ~~Commit cleanup~~
Host configuration, services, terraform resources, and VM removed. See `docs/plans/pgdb1-decommission.md` for detailed plan.
See `docs/plans/pgdb1-decommission.md` for detailed plan.
## Phase 5: Decommission ca Host ✓ COMPLETE

View File

@@ -0,0 +1,209 @@
# Monitoring Stack Migration to VictoriaMetrics
## Overview
Migrate from Prometheus to VictoriaMetrics on a new host (monitoring02) to gain better compression
and longer retention. Run in parallel with monitoring01 until validated, then switch over using
a `monitoring` CNAME for seamless transition.
## Current State
**monitoring01** (10.69.13.13):
- 4 CPU cores, 4GB RAM, 33GB disk
- Prometheus with 30-day retention (15s scrape interval)
- Alertmanager (routes to alerttonotify webhook)
- Grafana (dashboards, datasources)
- Loki (log aggregation from all hosts via Promtail)
- Tempo (distributed tracing)
- Pyroscope (continuous profiling)
**Hardcoded References to monitoring01:**
- `system/monitoring/logs.nix` - Promtail sends logs to `http://monitoring01.home.2rjus.net:3100`
- `hosts/template2/bootstrap.nix` - Bootstrap logs to Loki (keep as-is until decommission)
- `services/http-proxy/proxy.nix` - Caddy proxies Prometheus, Alertmanager, Grafana, Pyroscope, Pushgateway
**Auto-generated:**
- Prometheus scrape targets (from `lib/monitoring.nix` + `homelab.monitoring.scrapeTargets`)
- Node-exporter targets (from all hosts with static IPs)
## Decision: VictoriaMetrics
Per `docs/plans/long-term-metrics-storage.md`, VictoriaMetrics is the recommended starting point:
- Single binary replacement for Prometheus
- 5-10x better compression (30 days could become 180+ days in same space)
- Same PromQL query language (Grafana dashboards work unchanged)
- Same scrape config format (existing auto-generated configs work)
If multi-year retention with downsampling becomes necessary later, Thanos can be evaluated.
## Architecture
```
┌─────────────────┐
│ monitoring02 │
│ VictoriaMetrics│
│ + Grafana │
monitoring │ + Loki │
CNAME ──────────│ + Tempo │
│ + Pyroscope │
│ + Alertmanager │
│ (vmalert) │
└─────────────────┘
│ scrapes
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌─────┴────┐ ┌─────┴────┐
│ ns1 │ │ ha1 │ │ ... │
│ :9100 │ │ :9100 │ │ :9100 │
└─────────┘ └──────────┘ └──────────┘
```
## Implementation Plan
### Phase 1: Create monitoring02 Host [COMPLETE]
Host created and deployed at 10.69.13.24 (prod tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
- Grafana with Kanidm OIDC deployed as test instance (`grafana-test.home.2rjus.net`)
### Phase 2: Set Up VictoriaMetrics Stack
New service module at `services/victoriametrics/` for VictoriaMetrics + vmalert + Alertmanager.
Imported by monitoring02 alongside the existing Grafana service.
1. **VictoriaMetrics** (port 8428): [DONE]
- `services.victoriametrics.enable = true`
- `retentionPeriod = "3"` (3 months)
- All scrape configs migrated from Prometheus (22 jobs including auto-generated)
- Static user override (DynamicUser disabled) for credential file access
- OpenBao token fetch service + 30min refresh timer
- Apiary bearer token via vault.secrets
2. **vmalert** for alerting rules: [DONE]
- Points to VictoriaMetrics datasource at localhost:8428
- Reuses existing `services/monitoring/rules.yml` directly via `settings.rule`
- No notifier configured during parallel operation (prevents duplicate alerts)
3. **Alertmanager** (port 9093): [DONE]
- Same configuration as monitoring01 (alerttonotify webhook routing)
- Will only receive alerts after cutover (vmalert notifier disabled)
4. **Grafana** (port 3000): [DONE]
- VictoriaMetrics datasource (localhost:8428) as default
- monitoring01 Prometheus datasource kept for comparison during parallel operation
- Loki datasource pointing to monitoring01 (until Loki migrated)
5. **Loki** (port 3100):
- TODO: Same configuration as current
6. **Tempo** (ports 3200, 3201):
- TODO: Same configuration
7. **Pyroscope** (port 4040):
- TODO: Same Docker-based deployment
**Note:** pve-exporter and pushgateway scrape targets are not included on monitoring02.
pve-exporter requires a local exporter instance; pushgateway is replaced by VictoriaMetrics
native push support.
### Phase 3: Parallel Operation
Run both monitoring01 and monitoring02 simultaneously:
1. **Dual scraping**: Both hosts scrape the same targets
- Validates VictoriaMetrics is collecting data correctly
2. **Dual log shipping**: Configure Promtail to send logs to both Loki instances
- Add second client in `system/monitoring/logs.nix` pointing to monitoring02
3. **Validate dashboards**: Access Grafana on monitoring02, verify dashboards work
4. **Validate alerts**: Verify vmalert evaluates rules correctly (no receiver = no notifications)
5. **Compare resource usage**: Monitor disk/memory consumption between hosts
### Phase 4: Add monitoring CNAME
Add CNAME to monitoring02 once validated:
```nix
# hosts/monitoring02/configuration.nix
homelab.dns.cnames = [ "monitoring" ];
```
This creates `monitoring.home.2rjus.net` pointing to monitoring02.
### Phase 5: Update References
Update hardcoded references to use the CNAME:
1. **system/monitoring/logs.nix**:
- Remove dual-shipping, point only to `http://monitoring.home.2rjus.net:3100`
2. **services/http-proxy/proxy.nix**: Update reverse proxy backends:
- prometheus.home.2rjus.net -> monitoring.home.2rjus.net:8428
- alertmanager.home.2rjus.net -> monitoring.home.2rjus.net:9093
- grafana.home.2rjus.net -> monitoring.home.2rjus.net:3000
- pyroscope.home.2rjus.net -> monitoring.home.2rjus.net:4040
Note: `hosts/template2/bootstrap.nix` stays pointed at monitoring01 until decommission.
### Phase 6: Enable Alerting
Once ready to cut over:
1. Enable Alertmanager receiver on monitoring02
2. Verify test alerts route correctly
### Phase 7: Cutover and Decommission
1. **Stop monitoring01**: Prevent duplicate alerts during transition
2. **Update bootstrap.nix**: Point to `monitoring.home.2rjus.net`
3. **Verify all targets scraped**: Check VictoriaMetrics UI
4. **Verify logs flowing**: Check Loki on monitoring02
5. **Decommission monitoring01**:
- Remove from flake.nix
- Remove host configuration
- Destroy VM in Proxmox
- Remove from terraform state
## Current Progress
- **Phase 1** complete (2026-02-08): monitoring02 host created, Grafana with Kanidm OIDC validated
- **Phase 2** in progress (2026-02-17): VictoriaMetrics, vmalert, Alertmanager, Grafana datasources configured
- Remaining: Loki, Tempo, Pyroscope migration
## Open Questions
- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.
## VictoriaMetrics Service Configuration
Implemented in `services/victoriametrics/default.nix`. Key design decisions:
- **Static user**: VictoriaMetrics NixOS module uses `DynamicUser`, overridden with a static
`victoriametrics` user so vault.secrets and credential files work correctly
- **Shared rules**: vmalert reuses `services/monitoring/rules.yml` via `settings.rule` path
reference (no YAML-to-Nix conversion needed)
- **Scrape config reuse**: Uses the same `lib/monitoring.nix` functions and
`services/monitoring/external-targets.nix` as Prometheus for auto-generated targets
## Rollback Plan
If issues arise after cutover:
1. Move `monitoring` CNAME back to monitoring01
2. Restart monitoring01 services
3. Revert Promtail config to point only to monitoring01
4. Revert http-proxy backends
## Notes
- VictoriaMetrics uses port 8428 vs Prometheus 9090
- PromQL compatibility is excellent
- VictoriaMetrics native push replaces Pushgateway (remove from http-proxy if not needed)
- monitoring02 deployed via OpenTofu using `create-host` script
- Grafana dashboards defined declaratively via NixOS, not imported from monitoring01 state

View File

@@ -24,20 +24,29 @@ After evaluating WireGuard gateway vs Headscale (self-hosted Tailscale), the **W
## Architecture
```mermaid
graph TD
clients["Laptop / Phone"]
vps["VPS<br/>(WireGuard endpoint)"]
extgw["extgw01<br/>(gateway + bastion)"]
grafana["Grafana<br/>monitoring01:3000"]
jellyfin["Jellyfin<br/>jelly01:8096"]
arr["arr stack<br/>*-jail hosts"]
clients -->|WireGuard| vps
vps -->|WireGuard tunnel| extgw
extgw -->|allowed traffic| grafana
extgw -->|allowed traffic| jellyfin
extgw -->|allowed traffic| arr
```
┌─────────────────────────────────┐
│ VPS (OpenStack) │
Laptop/Phone ──→ │ WireGuard endpoint
(WireGuard) │ Client peers: laptop, phone │
│ Routes 10.69.13.0/24 via tunnel│
└──────────┬──────────────────────┘
│ WireGuard tunnel
┌─────────────────────────────────┐
│ extgw01 (gateway + bastion) │
│ - WireGuard tunnel to VPS │
│ - Firewall (allowlist only) │
│ - SSH + 2FA (full access) │
└──────────┬──────────────────────┘
│ allowed traffic only
┌─────────────────────────────────┐
│ Internal network 10.69.13.0/24 │
│ - monitoring01:3000 (Grafana) │
│ - jelly01:8096 (Jellyfin) │
│ - *-jail hosts (arr stack) │
└─────────────────────────────────┘
```
### Existing path (unchanged)

View File

@@ -39,17 +39,23 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
- nzbget: NixOS service or OCI container
- NFS exports: `services.nfs.server`
### Filesystem: Keep ZFS
### Filesystem: BTRFS RAID1
**Decision**: Keep existing ZFS pool, import on NixOS
**Decision**: Migrate from ZFS to BTRFS with RAID1
**Rationale**:
- **No data migration needed**: Existing ZFS pool can be imported directly on NixOS
- **Proven reliability**: Pool has been running reliably on TrueNAS
- **NixOS ZFS support**: Well-supported, declarative configuration via `boot.zfs` and `services.zfs`
- **BTRFS RAID5/6 unreliable**: Research showed BTRFS RAID5/6 write hole is still unresolved
- **BTRFS RAID1 wasteful**: With mixed disk sizes, RAID1 wastes significant capacity vs ZFS mirrors
- Checksumming, snapshots, compression (lz4/zstd) all available
- **In-kernel**: No out-of-tree module issues like ZFS
- **Flexible expansion**: Add individual disks, not required to buy pairs
- **Mixed disk sizes**: Better handling than ZFS multi-vdev approach
- **RAID level conversion**: Can convert between RAID levels in place
- Built-in checksumming, snapshots, compression (zstd)
- NixOS has good BTRFS support
**BTRFS RAID1 notes**:
- "RAID1" means 2 copies of all data
- Distributes across all available devices
- With 6+ disks, provides redundancy + capacity scaling
- RAID5/6 avoided (known issues), RAID1/10 are stable
### Hardware: Keep Existing + Add Disks
@@ -63,94 +69,83 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
**Storage architecture**:
**hdd-pool** (ZFS mirrors):
- Current: 3 mirror vdevs (2x16TB + 2x8TB + 2x8TB) = 32TB usable
- Add: mirror-3 with 2x 24TB = +24TB usable
- Total after expansion: ~56TB usable
**Bulk storage** (BTRFS RAID1 on HDDs):
- Current: 6x HDDs (2x16TB + 2x8TB + 2x8TB)
- Add: 2x new HDDs (size TBD)
- Use: Media, downloads, backups, non-critical data
- Risk tolerance: High (data mostly replaceable)
**Critical data** (small volume):
- Use 2x 240GB SSDs in mirror (BTRFS or ZFS)
- Or use 2TB NVMe for critical data
- Risk tolerance: Low (data important but small)
### Disk Purchase Decision
**Decision**: 2x 24TB drives (ordered, arriving 2026-02-21)
**Options under consideration**:
**Option A: 2x 16TB drives**
- Matches largest current drives
- Enables potential future RAID5 if desired (6x 16TB array)
- More conservative capacity increase
**Option B: 2x 20-24TB drives**
- Larger capacity headroom
- Better $/TB ratio typically
- Future-proofs better
**Initial purchase**: 2 drives (chassis has space for 2 more without modifications)
## Migration Strategy
### High-Level Plan
1. **Expand ZFS pool** (on TrueNAS):
- Install 2x 24TB drives (may need new drive trays - order from abroad if needed)
- If chassis space is limited, temporarily replace the two oldest 8TB drives (da0/ada4)
- Add as mirror-3 vdev to hdd-pool
- Verify pool health and resilver completes
- Check SMART data on old 8TB drives (all healthy as of 2026-02-20, no reallocated sectors)
- Burn-in: at minimum short + long SMART test before adding to pool
1. **Preparation**:
- Purchase 2x new HDDs (16TB or 20-24TB)
- Create NixOS configuration for new storage host
- Set up bare metal NixOS installation
2. **Prepare NixOS configuration**:
- Create host configuration (`hosts/nas1/` or similar)
- Configure ZFS pool import (`boot.zfs.extraPools`)
- Set up services: radarr, sonarr, nzbget, restic-rest, NFS
- Configure monitoring (node-exporter, promtail, smartctl-exporter)
2. **Initial BTRFS pool**:
- Install 2 new disks
- Create BTRFS filesystem in RAID1
- Mount and test NFS exports
3. **Install NixOS**:
- `zfs export hdd-pool` on TrueNAS before shutdown (clean export)
- Wipe TrueNAS boot-pool SSDs, set up as mdadm RAID1 for NixOS root
- Install NixOS on mdadm mirror (keeps boot path ZFS-independent)
- Import hdd-pool via `boot.zfs.extraPools`
- Verify all datasets mount correctly
3. **Data migration**:
- Copy data from TrueNAS ZFS pool to new BTRFS pool over 10GbE
- Verify data integrity
4. **Service migration**:
- Configure NixOS services to use ZFS dataset paths
- Update NFS exports
- Test from consuming hosts
4. **Expand pool**:
- As old ZFS pool is emptied, wipe drives and add to BTRFS pool
- Pool grows incrementally: 2 → 4 → 6 → 8 disks
- BTRFS rebalances data across new devices
5. **Cutover**:
- Update DNS/client mounts if IP changes
- Verify monitoring integration
5. **Service migration**:
- Set up radarr/sonarr/nzbget/restic as NixOS services
- Update NFS client mounts on consuming hosts
6. **Cutover**:
- Point consumers to new NAS host
- Decommission TrueNAS
### Post-Expansion: Vdev Rebalancing
ZFS has no built-in rebalance command. After adding the new 24TB vdev, ZFS will
write new data preferentially to it (most free space), leaving old vdevs packed
at ~97%. This is suboptimal but not urgent once overall pool usage drops to ~50%.
To gradually rebalance, rewrite files in place so ZFS redistributes blocks across
all vdevs proportional to free space:
```bash
# Rewrite files individually (spreads blocks across all vdevs)
find /pool/dataset -type f -exec sh -c '
for f; do cp "$f" "$f.rebal" && mv "$f.rebal" "$f"; done
' _ {} +
```
Avoid `zfs send/recv` for large datasets (e.g. 20TB) as this would concentrate
data on the emptiest vdev rather than spreading it evenly.
**Recommendation**: Do this after NixOS migration is stable. Not urgent - the pool
will function fine with uneven distribution, just slightly suboptimal for performance.
- Repurpose hardware or keep as spare
### Migration Advantages
- **No data migration**: ZFS pool imported directly, no copying terabytes of data
- **Low risk**: Pool expansion done on stable TrueNAS before OS swap
- **Reversible**: Can boot back to TrueNAS if NixOS has issues (ZFS pool is OS-independent)
- **Quick cutover**: Once NixOS config is ready, the OS swap is fast
- **Low risk**: New pool created independently, old data remains intact during migration
- **Incremental**: Can add old disks one at a time as space allows
- **Flexible**: BTRFS handles mixed disk sizes gracefully
- **Reversible**: Keep TrueNAS running until fully validated
## Next Steps
1. ~~Decide on disk size~~ - 2x 24TB ordered
2. Install drives and add mirror vdev to ZFS pool
3. Check SMART data on 8TB drives - decide whether to keep or retire
4. Design NixOS host configuration (`hosts/nas1/`)
5. Document NFS export mapping (current -> new)
6. Plan NixOS installation and cutover
1. Decide on disk size (16TB vs 20-24TB)
2. Purchase disks
3. Design NixOS host configuration (`hosts/nas1/`)
4. Plan detailed migration timeline
5. Document NFS export mapping (current new)
## Open Questions
- [ ] Final decision on disk size?
- [ ] Hostname for new NAS host? (nas1? storage1?)
- [ ] IP address/subnet: NAS and Proxmox are both on 10GbE to the same switch but different subnets, forcing traffic through the router (bottleneck). Move to same subnet during migration.
- [x] Boot drive: Reuse TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root (no ZFS on boot path)
- [ ] Retire old 8TB drives? (SMART looks healthy, keep unless chassis space is needed)
- [ ] Drive trays: do new 24TB drives fit, or order trays from abroad?
- [ ] Timeline/maintenance window for NixOS swap?
- [ ] IP address allocation (keep 10.69.12.50 or new IP?)
- [ ] Timeline/maintenance window for migration?

20
flake.lock generated
View File

@@ -28,11 +28,11 @@
]
},
"locked": {
"lastModified": 1771488195,
"narHash": "sha256-2kMxqdDyPluRQRoES22Y0oSjp7pc5fj2nRterfmSIyc=",
"lastModified": 1771004123,
"narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
"ref": "master",
"rev": "2d26de50559d8acb82ea803764e138325d95572c",
"revCount": 37,
"rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
"revCount": 36,
"type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy"
},
@@ -64,11 +64,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1771419570,
"narHash": "sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU=",
"lastModified": 1771043024,
"narHash": "sha256-O1XDr7EWbRp+kHrNNgLWgIrB0/US5wvw9K6RERWAj6I=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47",
"rev": "3aadb7ca9eac2891d52a9dec199d9580a6e2bf44",
"type": "github"
},
"original": {
@@ -80,11 +80,11 @@
},
"nixpkgs-unstable": {
"locked": {
"lastModified": 1771369470,
"narHash": "sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ=",
"lastModified": 1771008912,
"narHash": "sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb+ZnAo5RzSxJg=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "0182a361324364ae3f436a63005877674cf45efb",
"rev": "a82ccc39b39b621151d6732718e3e250109076fa",
"type": "github"
},
"original": {

View File

@@ -92,6 +92,15 @@
./hosts/http-proxy
];
};
monitoring01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/monitoring01
];
};
jelly01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {

View File

@@ -18,7 +18,12 @@
"sonarr"
"ha"
"z2m"
"grafana"
"prometheus"
"alertmanager"
"jelly"
"pyroscope"
"pushgw"
];
nixpkgs.config.allowUnfree = true;

View File

@@ -0,0 +1,114 @@
{
pkgs,
...
}:
{
imports = [
./hardware-configuration.nix
../../system
../../common/vm
];
homelab.host.role = "monitoring";
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
enable = true;
device = "/dev/sda";
configurationLimit = 3;
};
networking.hostName = "monitoring01";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.13/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
sqlite
];
services.qemuGuest.enable = true;
# Vault secrets management
vault.enable = true;
homelab.deploy.enable = true;
vault.secrets.backup-helper = {
secretPath = "shared/backup/password";
extractKey = "password";
outputDir = "/run/secrets/backup_helper_secret";
services = [ "restic-backups-grafana" "restic-backups-grafana-db" ];
};
services.restic.backups.grafana = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
paths = [ "/var/lib/grafana/plugins" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
RandomizedDelaySec = "2h";
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
services.restic.backups.grafana-db = {
repository = "rest:http://10.69.12.52:8000/backup-nix";
passwordFile = "/run/secrets/backup_helper_secret";
command = [ "${pkgs.sqlite}/bin/sqlite3" "/var/lib/grafana/data/grafana.db" ".dump" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
RandomizedDelaySec = "2h";
};
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "23.11"; # Did you read the comment?
}

View File

@@ -0,0 +1,7 @@
{ ... }:
{
imports = [
./configuration.nix
../../services/monitoring
];
}

View File

@@ -0,0 +1,42 @@
{
config,
lib,
pkgs,
modulesPath,
...
}:
{
imports = [
(modulesPath + "/profiles/qemu-guest.nix")
];
boot.initrd.availableKernelModules = [
"ata_piix"
"uhci_hcd"
"virtio_pci"
"virtio_scsi"
"sd_mod"
"sr_mod"
];
boot.initrd.kernelModules = [ "dm-snapshot" ];
boot.kernelModules = [
"ptp_kvm"
];
boot.extraModulePackages = [ ];
fileSystems."/" = {
device = "/dev/disk/by-label/root";
fsType = "xfs";
};
swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
# (the default) this is the recommended approach. When using systemd-networkd it's
# still possible to use this option, but it's recommended to use it in conjunction
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
networking.useDHCP = lib.mkDefault true;
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}

View File

@@ -18,7 +18,7 @@
role = "monitoring";
};
homelab.dns.cnames = [ "monitoring" "alertmanager" "grafana" "grafana-test" "metrics" "vmalert" "loki" ];
homelab.dns.cnames = [ "grafana-test" "metrics" "vmalert" ];
# Enable Vault integration
vault.enable = true;

View File

@@ -3,10 +3,5 @@
./configuration.nix
../../services/grafana
../../services/victoriametrics
../../services/loki
../../services/monitoring/alerttonotify.nix
../../services/monitoring/blackbox.nix
../../services/monitoring/exportarr.nix
../../services/monitoring/pve.nix
];
}

View File

@@ -25,7 +25,7 @@
};
};
timeout = 14400;
timeout = 7200;
metrics.enable = true;
};

View File

@@ -6,8 +6,7 @@ let
text = ''
set -euo pipefail
LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
# Send a log entry to Loki with bootstrap status
# Usage: log_to_loki <stage> <message>
@@ -37,14 +36,8 @@ let
}]
}')
local auth_args=()
if [[ -f "$LOKI_AUTH_FILE" ]]; then
auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
fi
curl -s --connect-timeout 2 --max-time 5 \
-X POST \
"''${auth_args[@]}" \
-H "Content-Type: application/json" \
-d "$payload" \
"$LOKI_URL" >/dev/null 2>&1 || true

View File

@@ -20,10 +20,10 @@ vault-fetch <secret-path> <output-directory> [cache-directory]
```bash
# Fetch Grafana admin secrets
vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana
vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
# Use default cache location
vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana
vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana
```
## How It Works
@@ -53,13 +53,13 @@ If Vault is unreachable or authentication fails:
This tool is designed to be called from systemd service `ExecStartPre` hooks via the `vault.secrets` NixOS module:
```nix
vault.secrets.mqtt-password = {
secretPath = "hosts/ha1/mqtt-password";
vault.secrets.grafana-admin = {
secretPath = "hosts/monitoring01/grafana-admin";
};
# Service automatically gets secrets fetched before start
systemd.services.mosquitto.serviceConfig = {
EnvironmentFile = "/run/secrets/mqtt-password/password";
systemd.services.grafana.serviceConfig = {
EnvironmentFile = "/run/secrets/grafana-admin/password";
};
```

View File

@@ -5,7 +5,7 @@ set -euo pipefail
#
# Usage: vault-fetch <secret-path> <output-directory> [cache-directory]
#
# Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana
# Example: vault-fetch hosts/monitoring01/grafana-admin /run/secrets/grafana /var/lib/vault/cache/grafana
#
# This script:
# 1. Authenticates to Vault using AppRole credentials from /var/lib/vault/approle/
@@ -17,7 +17,7 @@ set -euo pipefail
# Parse arguments
if [ $# -lt 2 ]; then
echo "Usage: vault-fetch <secret-path> <output-directory> [cache-directory]" >&2
echo "Example: vault-fetch hosts/ha1/mqtt-password /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
echo "Example: vault-fetch hosts/monitoring01/grafana /run/secrets/grafana /var/lib/vault/cache/grafana" >&2
exit 1
fi

View File

@@ -19,7 +19,7 @@
"title": "SSH Connections",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})",
@@ -51,7 +51,7 @@
"title": "Active Sessions",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "oubliette_sessions_active{job=\"apiary\"}",
@@ -86,7 +86,7 @@
"title": "Unique IPs",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "oubliette_storage_unique_ips{job=\"apiary\"}",
@@ -118,7 +118,7 @@
"title": "Total Login Attempts",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "oubliette_storage_login_attempts_total{job=\"apiary\"}",
@@ -150,7 +150,7 @@
"title": "SSH Connections Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
@@ -183,7 +183,7 @@
"title": "Auth Attempts Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
@@ -216,7 +216,7 @@
"title": "Sessions by Shell",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
@@ -249,7 +249,7 @@
"title": "Attempts by Country",
"type": "geomap",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 12},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "oubliette_auth_attempts_by_country_total{job=\"apiary\"}",
@@ -318,7 +318,7 @@
"title": "Session Duration Distribution",
"type": "heatmap",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{
@@ -359,7 +359,7 @@
"title": "Commands Executed by Shell",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"interval": "60s",
"targets": [
{

View File

@@ -16,7 +16,7 @@
"title": "Endpoints Monitored",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})",
@@ -48,7 +48,7 @@
"title": "Probe Failures",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)",
@@ -82,7 +82,7 @@
"title": "Expiring Soon (< 7d)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)",
@@ -116,7 +116,7 @@
"title": "Expiring Critical (< 24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)",
@@ -150,7 +150,7 @@
"title": "Minimum Days Remaining",
"type": "gauge",
"gridPos": {"h": 4, "w": 8, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)",
@@ -187,7 +187,7 @@
"title": "Certificate Expiry by Endpoint",
"type": "table",
"gridPos": {"h": 12, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -253,7 +253,7 @@
"title": "Probe Status",
"type": "table",
"gridPos": {"h": 12, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "probe_success{job=\"blackbox_tls\"}",
@@ -340,7 +340,7 @@
"title": "Certificate Expiry Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -378,7 +378,7 @@
"title": "Probe Success Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100",
@@ -418,7 +418,7 @@
"title": "Probe Duration",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "probe_duration_seconds{job=\"blackbox_tls\"}",

View File

@@ -15,7 +15,7 @@
{
"name": "tier",
"type": "query",
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"query": "label_values(nixos_flake_info, tier)",
"refresh": 2,
"includeAll": true,
@@ -30,7 +30,7 @@
"title": "Hosts Behind Remote",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 1)",
@@ -65,7 +65,7 @@
"title": "Hosts Needing Reboot",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(nixos_config_mismatch{tier=~\"$tier\"} == 1)",
@@ -100,7 +100,7 @@
"title": "Total Hosts",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(nixos_flake_info{tier=~\"$tier\"})",
@@ -128,7 +128,7 @@
"title": "Nixpkgs Age",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 11, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "max(nixos_flake_input_age_seconds{input=\"nixpkgs\", tier=~\"$tier\"})",
@@ -163,7 +163,7 @@
"title": "Hosts Up-to-date",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 14, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 0)",
@@ -192,7 +192,7 @@
"title": "Deployments (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 17, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_deployments_total{status=\"completed\"}[24h]))",
@@ -222,7 +222,7 @@
"title": "Avg Deploy Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_deployment_duration_seconds_sum{success=\"true\"}[24h])) / sum(increase(homelab_deploy_deployment_duration_seconds_count{success=\"true\"}[24h]))",
@@ -256,7 +256,7 @@
"title": "Fleet Status",
"type": "table",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "nixos_flake_info{tier=~\"$tier\"}",
@@ -430,7 +430,7 @@
"title": "Generation Age by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(nixos_generation_age_seconds{tier=~\"$tier\"})",
@@ -467,7 +467,7 @@
"title": "Generations per Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(nixos_generation_count{tier=~\"$tier\"})",
@@ -501,7 +501,7 @@
"title": "Deployment Activity (Generation Age Over Time)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
@@ -534,7 +534,7 @@
"title": "Flake Input Ages",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "max by (input) (nixos_flake_input_age_seconds)",
@@ -577,7 +577,7 @@
"title": "Hosts by Revision",
"type": "piechart",
"gridPos": {"h": 6, "w": 6, "x": 12, "y": 30},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count by (current_rev) (nixos_flake_info{tier=~\"$tier\"})",
@@ -601,7 +601,7 @@
"title": "Hosts by Tier",
"type": "piechart",
"gridPos": {"h": 6, "w": 6, "x": 18, "y": 30},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count by (tier) (nixos_flake_info)",
@@ -641,7 +641,7 @@
"title": "Builds (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 37},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[24h]))",
@@ -671,7 +671,7 @@
"title": "Failed Builds (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 37},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[24h])) or vector(0)",
@@ -705,7 +705,7 @@
"title": "Last Build",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 37},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "time() - max(homelab_deploy_build_last_timestamp)",
@@ -739,7 +739,7 @@
"title": "Avg Build Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 37},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_duration_seconds_sum[24h])) / sum(increase(homelab_deploy_build_duration_seconds_count[24h]))",
@@ -773,7 +773,7 @@
"title": "Total Hosts Built",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 37},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(homelab_deploy_build_duration_seconds_count)",
@@ -802,7 +802,7 @@
"title": "Build Jobs (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 37},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_builds_total[24h]))",
@@ -832,7 +832,7 @@
"title": "Build Time by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 41},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(homelab_deploy_build_duration_seconds_sum / homelab_deploy_build_duration_seconds_count)",
@@ -869,7 +869,7 @@
"title": "Build Count by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 41},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(sum by (host) (homelab_deploy_build_host_total))",
@@ -903,7 +903,7 @@
"title": "Build Activity",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 49},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[1h]))",

View File

@@ -11,7 +11,7 @@
{
"name": "instance",
"type": "query",
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"query": "label_values(node_uname_info, instance)",
"refresh": 2,
"includeAll": false,
@@ -26,7 +26,7 @@
"title": "CPU Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
@@ -55,7 +55,7 @@
"title": "Memory Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
@@ -84,7 +84,7 @@
"title": "Disk Usage",
"type": "gauge",
"gridPos": {"h": 8, "w": 8, "x": 0, "y": 8},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
@@ -113,7 +113,7 @@
"title": "System Load",
"type": "timeseries",
"gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "node_load1{instance=~\"$instance\"}",
@@ -142,7 +142,7 @@
"title": "Uptime",
"type": "stat",
"gridPos": {"h": 8, "w": 8, "x": 16, "y": 8},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "time() - node_boot_time_seconds{instance=~\"$instance\"}",
@@ -161,7 +161,7 @@
"title": "Network Traffic",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
@@ -185,7 +185,7 @@
"title": "Disk I/O",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",

View File

@@ -15,7 +15,7 @@
{
"name": "vm",
"type": "query",
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"query": "label_values(pve_guest_info{template=\"0\"}, name)",
"refresh": 2,
"includeAll": true,
@@ -30,7 +30,7 @@
"title": "VMs Running",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 1)",
@@ -56,7 +56,7 @@
"title": "VMs Stopped",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 0)",
@@ -87,7 +87,7 @@
"title": "Node CPU",
"type": "gauge",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_cpu_usage_ratio{id=~\"node/.*\"} * 100",
@@ -120,7 +120,7 @@
"title": "Node Memory",
"type": "gauge",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
@@ -153,7 +153,7 @@
"title": "Node Uptime",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_uptime_seconds{id=~\"node/.*\"}",
@@ -180,7 +180,7 @@
"title": "Templates",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(pve_guest_info{template=\"1\"})",
@@ -206,7 +206,7 @@
"title": "VM Status",
"type": "table",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -362,7 +362,7 @@
"title": "VM CPU Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
@@ -391,7 +391,7 @@
"title": "VM Memory Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -420,7 +420,7 @@
"title": "VM Network Traffic",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -453,7 +453,7 @@
"title": "VM Disk I/O",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -486,7 +486,7 @@
"title": "Storage Usage",
"type": "bargauge",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
@@ -531,7 +531,7 @@
"title": "Storage Capacity",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",

View File

@@ -15,7 +15,7 @@
{
"name": "hostname",
"type": "query",
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"query": "label_values(systemd_unit_state, hostname)",
"refresh": 2,
"includeAll": true,
@@ -30,7 +30,7 @@
"title": "Failed Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
@@ -60,7 +60,7 @@
"title": "Active Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
@@ -86,7 +86,7 @@
"title": "Hosts Monitored",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
@@ -112,7 +112,7 @@
"title": "Total Service Restarts",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
@@ -143,7 +143,7 @@
"title": "Inactive Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
@@ -169,7 +169,7 @@
"title": "Timers",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
@@ -195,7 +195,7 @@
"title": "Failed Units",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
@@ -251,7 +251,7 @@
"title": "Service Restarts (Top 15)",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
@@ -309,7 +309,7 @@
"title": "Active Units per Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
@@ -339,7 +339,7 @@
"title": "NixOS Upgrade Timers",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
@@ -429,7 +429,7 @@
"title": "Backup Timers",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
@@ -524,7 +524,7 @@
"title": "Service Restarts Over Time",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",

View File

@@ -19,7 +19,7 @@
"title": "Current Temperatures",
"type": "stat",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -71,7 +71,7 @@
"title": "Average Home Temperature",
"type": "gauge",
"gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
@@ -108,7 +108,7 @@
"title": "Current Humidity",
"type": "stat",
"gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
@@ -154,7 +154,7 @@
"title": "Temperature History (30 Days)",
"type": "timeseries",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -207,7 +207,7 @@
"title": "Temperature Trend (1h rate of change)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
@@ -268,7 +268,7 @@
"title": "24h Min / Max / Avg",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
@@ -346,7 +346,7 @@
"title": "Humidity History (30 Days)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
"datasource": {"type": "prometheus", "uid": "victoriametrics"},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "hass_sensor_humidity_percent",

View File

@@ -37,10 +37,6 @@
# Declarative datasources
provision.datasources.settings = {
apiVersion = 1;
prune = true;
deleteDatasources = [
{ name = "Prometheus (monitoring01)"; orgId = 1; }
];
datasources = [
{
name = "VictoriaMetrics";
@@ -49,10 +45,16 @@
isDefault = true;
uid = "victoriametrics";
}
{
name = "Prometheus (monitoring01)";
type = "prometheus";
url = "http://monitoring01.home.2rjus.net:9090";
uid = "prometheus";
}
{
name = "Loki";
type = "loki";
url = "http://localhost:3100";
url = "http://monitoring01.home.2rjus.net:3100";
uid = "loki";
}
];
@@ -89,14 +91,6 @@
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
metrics
'';
virtualHosts."grafana.home.2rjus.net".extraConfig = ''
log {
output file /var/log/caddy/grafana.log {
mode 644
}
}
reverse_proxy http://127.0.0.1:3000
'';
virtualHosts."grafana-test.home.2rjus.net".extraConfig = ''
log {
output file /var/log/caddy/grafana.log {

View File

@@ -54,49 +54,53 @@
}
reverse_proxy http://ha1.home.2rjus.net:8080
}
prometheus.home.2rjus.net {
log {
output file /var/log/caddy/prometheus.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:9090
}
alertmanager.home.2rjus.net {
log {
output file /var/log/caddy/alertmanager.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:9093
}
grafana.home.2rjus.net {
log {
output file /var/log/caddy/grafana.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:3000
}
jelly.home.2rjus.net {
log {
output file /var/log/caddy/jelly.log {
mode 644
}
}
header Content-Type text/html
respond <<HTML
<!DOCTYPE html>
<html>
<head>
<title>Jellyfin - Maintenance</title>
<style>
body {
background: #101020;
color: #ddd;
font-family: sans-serif;
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
text-align: center;
reverse_proxy http://jelly01.home.2rjus.net:8096
}
.container { max-width: 500px; }
.disk { font-size: 80px; animation: spin 3s linear infinite; display: inline-block; }
@keyframes spin { from { transform: rotate(0deg); } to { transform: rotate(360deg); } }
h1 { color: #00a4dc; }
p { font-size: 1.2em; line-height: 1.6; }
</style>
</head>
<body>
<div class="container">
<div class="disk">&#x1F4BF;</div>
<h1>Jellyfin is taking a nap</h1>
<p>The NAS is getting shiny new hard drives.<br>
Jellyfin will be back once the disks stop spinning up.</p>
<p style="color:#666;font-size:0.9em;">In the meantime, maybe go outside?</p>
</div>
</body>
</html>
HTML 200
pyroscope.home.2rjus.net {
log {
output file /var/log/caddy/pyroscope.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:4040
}
pushgw.home.2rjus.net {
log {
output file /var/log/caddy/pushgw.log {
mode 644
}
}
reverse_proxy http://monitoring01.home.2rjus.net:9091
}
http://http-proxy.home.2rjus.net/metrics {
log {

View File

@@ -1,104 +0,0 @@
{ config, lib, pkgs, ... }:
let
# Script to generate bcrypt hash from Vault password for Caddy basic_auth
generateCaddyAuth = pkgs.writeShellApplication {
name = "generate-caddy-loki-auth";
runtimeInputs = [ config.services.caddy.package ];
text = ''
PASSWORD=$(cat /run/secrets/loki-push-auth)
HASH=$(caddy hash-password --plaintext "$PASSWORD")
echo "LOKI_PUSH_HASH=$HASH" > /run/secrets/caddy-loki-auth.env
chmod 0400 /run/secrets/caddy-loki-auth.env
'';
};
in
{
# Fetch Loki push password from Vault
vault.secrets.loki-push-auth = {
secretPath = "shared/loki/push-auth";
extractKey = "password";
services = [ "caddy" ];
};
# Generate bcrypt hash for Caddy before it starts
systemd.services.caddy-loki-auth = {
description = "Generate Caddy basic auth hash for Loki";
after = [ "vault-secret-loki-push-auth.service" ];
requires = [ "vault-secret-loki-push-auth.service" ];
before = [ "caddy.service" ];
requiredBy = [ "caddy.service" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = lib.getExe generateCaddyAuth;
};
};
# Load the bcrypt hash as environment variable for Caddy
services.caddy.environmentFile = "/run/secrets/caddy-loki-auth.env";
# Caddy reverse proxy for Loki with basic auth
services.caddy.virtualHosts."loki.home.2rjus.net".extraConfig = ''
basic_auth {
promtail {env.LOKI_PUSH_HASH}
}
reverse_proxy http://127.0.0.1:3100
'';
services.loki = {
enable = true;
configuration = {
auth_enabled = false;
server = {
http_listen_address = "127.0.0.1";
http_listen_port = 3100;
};
common = {
ring = {
instance_addr = "127.0.0.1";
kvstore = {
store = "inmemory";
};
};
replication_factor = 1;
path_prefix = "/var/lib/loki";
};
schema_config = {
configs = [
{
from = "2024-01-01";
store = "tsdb";
object_store = "filesystem";
schema = "v13";
index = {
prefix = "loki_index_";
period = "24h";
};
}
];
};
storage_config = {
filesystem = {
directory = "/var/lib/loki/chunks";
};
};
compactor = {
working_directory = "/var/lib/loki/compactor";
compaction_interval = "10m";
retention_enabled = true;
retention_delete_delay = "2h";
retention_delete_worker_count = 150;
delete_request_store = "filesystem";
};
limits_config = {
retention_period = "30d";
ingestion_rate_mb = 10;
ingestion_burst_size_mb = 20;
max_streams_per_user = 10000;
max_query_series = 500;
max_query_parallelism = 8;
};
};
};
}

View File

@@ -1,4 +1,33 @@
{ pkgs, ... }:
let
# TLS endpoints to monitor for certificate expiration
# These are all services using ACME certificates from OpenBao PKI
tlsTargets = [
# Direct ACME certs (security.acme.certs)
"https://vault.home.2rjus.net:8200"
"https://auth.home.2rjus.net"
"https://testvm01.home.2rjus.net"
# Caddy auto-TLS on http-proxy
"https://nzbget.home.2rjus.net"
"https://radarr.home.2rjus.net"
"https://sonarr.home.2rjus.net"
"https://ha.home.2rjus.net"
"https://z2m.home.2rjus.net"
"https://prometheus.home.2rjus.net"
"https://alertmanager.home.2rjus.net"
"https://grafana.home.2rjus.net"
"https://jelly.home.2rjus.net"
"https://pyroscope.home.2rjus.net"
"https://pushgw.home.2rjus.net"
# Caddy auto-TLS on nix-cache02
"https://nix-cache.home.2rjus.net"
# Caddy auto-TLS on grafana01
"https://grafana-test.home.2rjus.net"
];
in
{
services.prometheus.exporters.blackbox = {
enable = true;
@@ -28,4 +57,36 @@
- 503
'';
};
# Add blackbox scrape config to Prometheus
# Alert rules are in rules.yml (certificate_rules group)
services.prometheus.scrapeConfigs = [
{
job_name = "blackbox_tls";
metrics_path = "/probe";
params = {
module = [ "https_cert" ];
};
static_configs = [{
targets = tlsTargets;
}];
relabel_configs = [
# Pass the target URL to blackbox as a parameter
{
source_labels = [ "__address__" ];
target_label = "__param_target";
}
# Use the target URL as the instance label
{
source_labels = [ "__param_target" ];
target_label = "instance";
}
# Point the actual scrape at the local blackbox exporter
{
target_label = "__address__";
replacement = "127.0.0.1:9115";
}
];
}
];
}

View File

@@ -0,0 +1,14 @@
{ ... }:
{
imports = [
./loki.nix
./grafana.nix
./prometheus.nix
./blackbox.nix
./exportarr.nix
./pve.nix
./alerttonotify.nix
./pyroscope.nix
./tempo.nix
];
}

View File

@@ -14,4 +14,14 @@
apiKeyFile = config.vault.secrets.sonarr-api-key.outputDir;
port = 9709;
};
# Scrape config
services.prometheus.scrapeConfigs = [
{
job_name = "sonarr";
static_configs = [{
targets = [ "localhost:9709" ];
}];
}
];
}

View File

@@ -0,0 +1,11 @@
{ pkgs, ... }:
{
services.grafana = {
enable = true;
settings = {
server = {
http_addr = "";
};
};
};
}

View File

@@ -0,0 +1,58 @@
{ ... }:
{
services.loki = {
enable = true;
configuration = {
auth_enabled = false;
server = {
http_listen_port = 3100;
};
common = {
ring = {
instance_addr = "127.0.0.1";
kvstore = {
store = "inmemory";
};
};
replication_factor = 1;
path_prefix = "/var/lib/loki";
};
schema_config = {
configs = [
{
from = "2024-01-01";
store = "tsdb";
object_store = "filesystem";
schema = "v13";
index = {
prefix = "loki_index_";
period = "24h";
};
}
];
};
storage_config = {
filesystem = {
directory = "/var/lib/loki/chunks";
};
};
compactor = {
working_directory = "/var/lib/loki/compactor";
compaction_interval = "10m";
retention_enabled = true;
retention_delete_delay = "2h";
retention_delete_worker_count = 150;
delete_request_store = "filesystem";
};
limits_config = {
retention_period = "30d";
ingestion_rate_mb = 10;
ingestion_burst_size_mb = 20;
max_streams_per_user = 10000;
max_query_series = 500;
max_query_parallelism = 8;
};
};
};
}

View File

@@ -0,0 +1,267 @@
{ self, lib, pkgs, ... }:
let
monLib = import ../../lib/monitoring.nix { inherit lib; };
externalTargets = import ./external-targets.nix;
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# Script to fetch AppRole token for Prometheus to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token";
runtimeInputs = [ pkgs.curl pkgs.jq ];
text = ''
VAULT_ADDR="https://vault01.home.2rjus.net:8200"
APPROLE_DIR="/var/lib/vault/approle"
OUTPUT_FILE="/run/secrets/prometheus/openbao-token"
# Read AppRole credentials
if [ ! -f "$APPROLE_DIR/role-id" ] || [ ! -f "$APPROLE_DIR/secret-id" ]; then
echo "AppRole credentials not found at $APPROLE_DIR" >&2
exit 1
fi
ROLE_ID=$(cat "$APPROLE_DIR/role-id")
SECRET_ID=$(cat "$APPROLE_DIR/secret-id")
# Authenticate to Vault
AUTH_RESPONSE=$(curl -sf -k -X POST \
-d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
"$VAULT_ADDR/v1/auth/approle/login")
# Extract token
VAULT_TOKEN=$(echo "$AUTH_RESPONSE" | jq -r '.auth.client_token')
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "Failed to extract Vault token from response" >&2
exit 1
fi
# Write token to file
mkdir -p "$(dirname "$OUTPUT_FILE")"
echo -n "$VAULT_TOKEN" > "$OUTPUT_FILE"
chown prometheus:prometheus "$OUTPUT_FILE"
chmod 0400 "$OUTPUT_FILE"
echo "Successfully fetched OpenBao token"
'';
};
in
{
# Systemd service to fetch AppRole token for Prometheus OpenBao scraping
# The token is used to authenticate when scraping /v1/sys/metrics
systemd.services.prometheus-openbao-token = {
description = "Fetch OpenBao token for Prometheus metrics scraping";
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
before = [ "prometheus.service" ];
requiredBy = [ "prometheus.service" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe fetchOpenbaoToken;
};
};
# Timer to periodically refresh the token (AppRole tokens have 1-hour TTL)
systemd.timers.prometheus-openbao-token = {
description = "Refresh OpenBao token for Prometheus";
wantedBy = [ "timers.target" ];
timerConfig = {
OnBootSec = "5min";
OnUnitActiveSec = "30min";
RandomizedDelaySec = "5min";
};
};
# Fetch apiary bearer token from Vault
vault.secrets.prometheus-apiary-token = {
secretPath = "hosts/monitoring01/apiary-token";
extractKey = "password";
owner = "prometheus";
group = "prometheus";
services = [ "prometheus" ];
};
services.prometheus = {
enable = true;
# syntax-only check because we use external credential files (e.g., openbao-token)
checkConfig = "syntax-only";
alertmanager = {
enable = true;
configuration = {
global = {
};
route = {
receiver = "webhook_natstonotify";
group_wait = "30s";
group_interval = "5m";
repeat_interval = "1h";
group_by = [ "alertname" ];
};
receivers = [
{
name = "webhook_natstonotify";
webhook_configs = [
{
url = "http://localhost:5001/alert";
}
];
}
];
};
};
alertmanagers = [
{
static_configs = [
{
targets = [ "localhost:9093" ];
}
];
}
];
retentionTime = "30d";
globalConfig = {
scrape_interval = "15s";
};
rules = [
(builtins.readFile ./rules.yml)
];
scrapeConfigs = [
# Auto-generated node-exporter targets from flake hosts + external
# Each static_config entry may have labels from homelab.host metadata
{
job_name = "node-exporter";
static_configs = nodeExporterTargets;
}
# Systemd exporter on all hosts (same targets, different port)
# Preserves the same label grouping as node-exporter
{
job_name = "systemd-exporter";
static_configs = map
(cfg: cfg // {
targets = map (t: builtins.replaceStrings [ ":9100" ] [ ":9558" ] t) cfg.targets;
})
nodeExporterTargets;
}
# Local monitoring services (not auto-generated)
{
job_name = "prometheus";
static_configs = [
{
targets = [ "localhost:9090" ];
}
];
}
{
job_name = "loki";
static_configs = [
{
targets = [ "localhost:3100" ];
}
];
}
{
job_name = "grafana";
static_configs = [
{
targets = [ "localhost:3000" ];
}
];
}
{
job_name = "alertmanager";
static_configs = [
{
targets = [ "localhost:9093" ];
}
];
}
{
job_name = "pushgateway";
honor_labels = true;
static_configs = [
{
targets = [ "localhost:9091" ];
}
];
}
# Caddy metrics from nix-cache02 (serves nix-cache.home.2rjus.net)
{
job_name = "nix-cache_caddy";
scheme = "https";
static_configs = [
{
targets = [ "nix-cache.home.2rjus.net" ];
}
];
}
# pve-exporter with complex relabel config
{
job_name = "pve-exporter";
static_configs = [
{
targets = [ "10.69.12.75" ];
}
];
metrics_path = "/pve";
params = {
module = [ "default" ];
cluster = [ "1" ];
node = [ "1" ];
};
relabel_configs = [
{
source_labels = [ "__address__" ];
target_label = "__param_target";
}
{
source_labels = [ "__param_target" ];
target_label = "instance";
}
{
target_label = "__address__";
replacement = "127.0.0.1:9221";
}
];
}
# OpenBao metrics with bearer token auth
{
job_name = "openbao";
scheme = "https";
metrics_path = "/v1/sys/metrics";
params = {
format = [ "prometheus" ];
};
static_configs = [{
targets = [ "vault01.home.2rjus.net:8200" ];
}];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/prometheus/openbao-token";
};
}
# Apiary external service
{
job_name = "apiary";
scheme = "https";
scrape_interval = "60s";
static_configs = [{
targets = [ "apiary.t-juice.club" ];
}];
authorization = {
type = "Bearer";
credentials_file = "/run/secrets/prometheus-apiary-token";
};
}
] ++ autoScrapeConfigs;
pushgateway = {
enable = true;
web = {
external-url = "https://pushgw.home.2rjus.net";
};
};
};
}

View File

@@ -1,7 +1,7 @@
{ config, ... }:
{
vault.secrets.pve-exporter = {
secretPath = "hosts/monitoring02/pve-exporter";
secretPath = "hosts/monitoring01/pve-exporter";
extractKey = "config";
outputDir = "/run/secrets/pve_exporter";
mode = "0444";

View File

@@ -0,0 +1,8 @@
{ ... }:
{
virtualisation.oci-containers.containers.pyroscope = {
pull = "missing";
image = "grafana/pyroscope:latest";
ports = [ "4040:4040" ];
};
}

View File

@@ -67,13 +67,13 @@ groups:
summary: "Promtail service not running on {{ $labels.instance }}"
description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
- alert: filesystem_filling_up
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[24h], 24*3600) < 0
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 1h
labels:
severity: warning
annotations:
summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
description: "Based on the last 24h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
- alert: systemd_not_running
expr: node_systemd_system_running == 0
for: 10m
@@ -259,32 +259,32 @@ groups:
description: "Wireguard handshake timeout on {{ $labels.instance }} for peer {{ $labels.public_key }}."
- name: monitoring_rules
rules:
- alert: victoriametrics_not_running
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="victoriametrics.service", state="active"} == 0
- alert: prometheus_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="prometheus.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "VictoriaMetrics service not running on {{ $labels.instance }}"
description: "VictoriaMetrics service not running on {{ $labels.instance }}"
- alert: vmalert_not_running
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="vmalert.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "vmalert service not running on {{ $labels.instance }}"
description: "vmalert service not running on {{ $labels.instance }}"
summary: "Prometheus service not running on {{ $labels.instance }}"
description: "Prometheus service not running on {{ $labels.instance }}"
- alert: alertmanager_not_running
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="alertmanager.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Alertmanager service not running on {{ $labels.instance }}"
description: "Alertmanager service not running on {{ $labels.instance }}"
- alert: pushgateway_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="pushgateway.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pushgateway service not running on {{ $labels.instance }}"
description: "Pushgateway service not running on {{ $labels.instance }}"
- alert: loki_not_running
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="loki.service", state="active"} == 0
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="loki.service", state="active"} == 0
for: 5m
labels:
severity: critical
@@ -292,13 +292,29 @@ groups:
summary: "Loki service not running on {{ $labels.instance }}"
description: "Loki service not running on {{ $labels.instance }}"
- alert: grafana_not_running
expr: node_systemd_unit_state{instance="monitoring02.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="grafana.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Grafana service not running on {{ $labels.instance }}"
description: "Grafana service not running on {{ $labels.instance }}"
- alert: tempo_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="tempo.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Tempo service not running on {{ $labels.instance }}"
description: "Tempo service not running on {{ $labels.instance }}"
- alert: pyroscope_not_running
expr: node_systemd_unit_state{instance="monitoring01.home.2rjus.net:9100", name="podman-pyroscope.service", state="active"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pyroscope service not running on {{ $labels.instance }}"
description: "Pyroscope service not running on {{ $labels.instance }}"
- name: proxmox_rules
rules:
- alert: pve_node_down

View File

@@ -0,0 +1,37 @@
{ ... }:
{
services.tempo = {
enable = true;
settings = {
server = {
http_listen_port = 3200;
grpc_listen_port = 3201;
};
distributor = {
receivers = {
otlp = {
protocols = {
http = {
endpoint = ":4318";
cors = {
allowed_origins = [ "*.home.2rjus.net" ];
};
};
};
};
};
};
storage = {
trace = {
backend = "local";
local = {
path = "/var/lib/tempo";
};
wal = {
path = "/var/lib/tempo/wal";
};
};
};
};
};
}

View File

@@ -6,24 +6,6 @@ let
nodeExporterTargets = monLib.generateNodeExporterTargets self externalTargets;
autoScrapeConfigs = monLib.generateScrapeConfigs self externalTargets;
# TLS endpoints to monitor for certificate expiration via blackbox exporter
tlsTargets = [
"https://vault.home.2rjus.net:8200"
"https://auth.home.2rjus.net"
"https://testvm01.home.2rjus.net"
"https://nzbget.home.2rjus.net"
"https://radarr.home.2rjus.net"
"https://sonarr.home.2rjus.net"
"https://ha.home.2rjus.net"
"https://z2m.home.2rjus.net"
"https://metrics.home.2rjus.net"
"https://alertmanager.home.2rjus.net"
"https://grafana.home.2rjus.net"
"https://jelly.home.2rjus.net"
"https://nix-cache.home.2rjus.net"
"https://grafana-test.home.2rjus.net"
];
# Script to fetch AppRole token for VictoriaMetrics to use when scraping OpenBao metrics
fetchOpenbaoToken = pkgs.writeShellApplication {
name = "fetch-openbao-token-vm";
@@ -125,39 +107,6 @@ let
credentials_file = "/run/secrets/victoriametrics-apiary-token";
};
}
# Blackbox TLS certificate monitoring
{
job_name = "blackbox_tls";
metrics_path = "/probe";
params = {
module = [ "https_cert" ];
};
static_configs = [{ targets = tlsTargets; }];
relabel_configs = [
{
source_labels = [ "__address__" ];
target_label = "__param_target";
}
{
source_labels = [ "__param_target" ];
target_label = "instance";
}
{
target_label = "__address__";
replacement = "127.0.0.1:9115";
}
];
}
# Sonarr exporter
{
job_name = "sonarr";
static_configs = [{ targets = [ "localhost:9709" ]; }];
}
# Proxmox VE exporter
{
job_name = "pve";
static_configs = [{ targets = [ "localhost:9221" ]; }];
}
] ++ autoScrapeConfigs;
in
{
@@ -203,7 +152,7 @@ in
# Fetch apiary bearer token from Vault
vault.secrets.victoriametrics-apiary-token = {
secretPath = "hosts/monitoring02/apiary-token";
secretPath = "hosts/monitoring01/apiary-token";
extractKey = "password";
owner = "victoriametrics";
group = "victoriametrics";
@@ -221,12 +170,15 @@ in
};
};
# vmalert for alerting rules
# vmalert for alerting rules - no notifier during parallel operation
services.vmalert.instances.default = {
enable = true;
settings = {
"datasource.url" = "http://localhost:8428";
"notifier.url" = [ "http://localhost:9093" ];
# Blackhole notifications during parallel operation to prevent duplicate alerts.
# Replace with notifier.url after cutover from monitoring01:
# "notifier.url" = [ "http://localhost:9093" ];
"notifier.blackhole" = true;
"rule" = [ ../monitoring/rules.yml ];
};
};
@@ -239,11 +191,8 @@ in
reverse_proxy http://127.0.0.1:8880
'';
# Alertmanager
services.caddy.virtualHosts."alertmanager.home.2rjus.net".extraConfig = ''
reverse_proxy http://127.0.0.1:9093
'';
# Alertmanager - same config as monitoring01 but will only receive
# alerts after cutover (vmalert notifier is disabled above)
services.prometheus.alertmanager = {
enable = true;
configuration = {

View File

@@ -16,16 +16,6 @@ in
SystemKeepFree=1G
'';
};
# Fetch Loki push password from Vault (only on hosts with Vault enabled)
vault.secrets.promtail-loki-auth = lib.mkIf config.vault.enable {
secretPath = "shared/loki/push-auth";
extractKey = "password";
owner = "promtail";
group = "promtail";
services = [ "promtail" ];
};
# Configure promtail
services.promtail = {
enable = true;
@@ -39,11 +29,7 @@ in
clients = [
{
url = "https://loki.home.2rjus.net/loki/api/v1/push";
basic_auth = {
username = "promtail";
password_file = "/run/secrets/promtail-loki-auth";
};
url = "http://monitoring01.home.2rjus.net:3100/loki/api/v1/push";
}
];

View File

@@ -16,8 +16,7 @@ let
text = ''
set -euo pipefail
LOKI_URL="https://loki.home.2rjus.net/loki/api/v1/push"
LOKI_AUTH_FILE="/run/secrets/promtail-loki-auth"
LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
HOSTNAME=$(hostname)
SESSION_ID=""
RECORD_MODE=false
@@ -70,13 +69,7 @@ let
}]
}')
local auth_args=()
if [[ -f "$LOKI_AUTH_FILE" ]]; then
auth_args=(-u "promtail:$(cat "$LOKI_AUTH_FILE")")
fi
if curl -s -X POST "$LOKI_URL" \
"''${auth_args[@]}" \
-H "Content-Type: application/json" \
-d "$payload" > /dev/null; then
return 0

View File

@@ -57,7 +57,7 @@ let
type = types.str;
description = ''
Path to the secret in Vault (without /v1/secret/data/ prefix).
Example: "hosts/ha1/mqtt-password"
Example: "hosts/monitoring01/grafana-admin"
'';
};
@@ -152,11 +152,13 @@ in
'';
example = literalExpression ''
{
mqtt-password = {
secretPath = "hosts/ha1/mqtt-password";
owner = "mosquitto";
group = "mosquitto";
services = [ "mosquitto" ];
grafana-admin = {
secretPath = "hosts/monitoring01/grafana-admin";
owner = "grafana";
group = "grafana";
restartTrigger = true;
restartInterval = "daily";
services = [ "grafana" ];
};
}
'';

View File

@@ -26,27 +26,26 @@ path "secret/data/shared/nixos-exporter/*" {
EOT
}
# Shared policy for Loki push authentication (all hosts push logs)
resource "vault_policy" "loki_push" {
name = "loki-push"
policy = <<EOT
path "secret/data/shared/loki/*" {
capabilities = ["read", "list"]
}
EOT
}
# Define host access policies
locals {
host_policies = {
# Example:
# Example: monitoring01 host
# "monitoring01" = {
# paths = [
# "secret/data/hosts/monitoring01/*",
# "secret/data/services/prometheus/*",
# "secret/data/services/grafana/*",
# "secret/data/shared/smtp/*"
# ]
# extra_policies = ["some-other-policy"] # Optional: additional policies
# }
# Example: ha1 host
# "ha1" = {
# paths = [
# "secret/data/hosts/ha1/*",
# "secret/data/shared/mqtt/*"
# ]
# extra_policies = ["some-other-policy"] # Optional: additional policies
# }
"ha1" = {
@@ -56,6 +55,16 @@ locals {
]
}
"monitoring01" = {
paths = [
"secret/data/hosts/monitoring01/*",
"secret/data/shared/backup/*",
"secret/data/shared/nats/*",
"secret/data/services/exportarr/*",
]
extra_policies = ["prometheus-metrics"]
}
# Wave 1: hosts with no service secrets (only need vault.enable for future use)
"nats1" = {
paths = [
@@ -69,7 +78,7 @@ locals {
]
}
# Wave 3: DNS servers (managed in hosts-generated.tf)
# Wave 3: DNS servers
# Wave 4: http-proxy
"http-proxy" = {
@@ -95,6 +104,15 @@ locals {
]
}
# monitoring02: Grafana + VictoriaMetrics
"monitoring02" = {
paths = [
"secret/data/hosts/monitoring02/*",
"secret/data/hosts/monitoring01/apiary-token",
"secret/data/services/grafana/*",
]
}
}
}
@@ -120,7 +138,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = concat(
["${each.key}-policy", "homelab-deploy", "nixos-exporter", "loki-push"],
["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
lookup(each.value, "extra_policies", [])
)

View File

@@ -44,15 +44,6 @@ locals {
"secret/data/hosts/garage01/*",
]
}
"monitoring02" = {
paths = [
"secret/data/hosts/monitoring02/*",
"secret/data/services/grafana/*",
"secret/data/services/exportarr/*",
"secret/data/shared/nats/nkey",
]
extra_policies = ["prometheus-metrics"]
}
}
@@ -83,10 +74,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = concat(
["host-${each.key}", "homelab-deploy", "nixos-exporter", "loki-push"],
lookup(each.value, "extra_policies", [])
)
token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600
token_max_ttl = 3600

View File

@@ -10,6 +10,10 @@ resource "vault_mount" "kv" {
locals {
secrets = {
# Example host-specific secrets
# "hosts/monitoring01/grafana-admin" = {
# auto_generate = true
# password_length = 32
# }
# "hosts/ha1/mqtt-password" = {
# auto_generate = true
# password_length = 24
@@ -31,6 +35,11 @@ locals {
# }
# }
"hosts/monitoring01/grafana-admin" = {
auto_generate = true
password_length = 32
}
"hosts/ha1/mqtt-password" = {
auto_generate = true
password_length = 24
@@ -48,8 +57,8 @@ locals {
data = { nkey = var.nats_nkey }
}
# PVE exporter config for monitoring02
"hosts/monitoring02/pve-exporter" = {
# PVE exporter config for monitoring01
"hosts/monitoring01/pve-exporter" = {
auto_generate = false
data = { config = var.pve_exporter_config }
}
@@ -140,16 +149,10 @@ locals {
}
# Bearer token for scraping apiary metrics
"hosts/monitoring02/apiary-token" = {
"hosts/monitoring01/apiary-token" = {
auto_generate = true
password_length = 64
}
# Loki push authentication (used by Promtail on all hosts)
"shared/loki/push-auth" = {
auto_generate = true
password_length = 32
}
}
}