monitoring01: remove host and migrate services to monitoring02
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m15s
Run nix flake check / flake-check (pull_request) Failing after 3m8s

Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-02-17 21:50:20 +01:00
parent 1bba6f106a
commit 4f593126c0
29 changed files with 115 additions and 773 deletions

View File

@@ -30,7 +30,7 @@ Use the `lab-monitoring` MCP server tools:
### Label Reference
Available labels for log queries:
- `hostname` - Hostname (e.g., `ns1`, `monitoring01`, `ha1`) - matches the Prometheus `hostname` label
- `hostname` - Hostname (e.g., `ns1`, `monitoring02`, `ha1`) - matches the Prometheus `hostname` label
- `systemd_unit` - Systemd unit name (e.g., `nsd.service`, `nixos-upgrade.service`)
- `job` - Either `systemd-journal` (most logs), `varlog` (file-based logs), or `bootstrap` (VM bootstrap logs)
- `filename` - For `varlog` job, the log file path
@@ -54,7 +54,7 @@ Journal logs are JSON-formatted. Key fields:
**All logs from a host:**
```logql
{hostname="monitoring01"}
{hostname="monitoring02"}
```
**Logs from a service across all hosts:**
@@ -74,7 +74,7 @@ Journal logs are JSON-formatted. Key fields:
**Regex matching:**
```logql
{systemd_unit="prometheus.service"} |~ "scrape.*failed"
{systemd_unit="victoriametrics.service"} |~ "scrape.*failed"
```
**Filter by level (journal scrape only):**
@@ -109,7 +109,7 @@ Default lookback is 1 hour. Use `start` parameter for older logs:
Useful systemd units for troubleshooting:
- `nixos-upgrade.service` - Daily auto-upgrade logs
- `nsd.service` - DNS server (ns1/ns2)
- `prometheus.service` - Metrics collection
- `victoriametrics.service` - Metrics collection
- `loki.service` - Log aggregation
- `caddy.service` - Reverse proxy
- `home-assistant.service` - Home automation
@@ -152,7 +152,7 @@ VMs provisioned from template2 send bootstrap progress directly to Loki via curl
Parse JSON and filter on fields:
```logql
{systemd_unit="prometheus.service"} | json | PRIORITY="3"
{systemd_unit="victoriametrics.service"} | json | PRIORITY="3"
```
---
@@ -242,12 +242,11 @@ All available Prometheus job names:
- `unbound` - DNS resolver metrics (ns1, ns2)
- `wireguard` - VPN tunnel metrics (http-proxy)
**Monitoring stack (localhost on monitoring01):**
- `prometheus` - Prometheus self-metrics
**Monitoring stack (localhost on monitoring02):**
- `victoriametrics` - VictoriaMetrics self-metrics
- `loki` - Loki self-metrics
- `grafana` - Grafana self-metrics
- `alertmanager` - Alertmanager metrics
- `pushgateway` - Push-based metrics gateway
**External/infrastructure:**
- `pve-exporter` - Proxmox hypervisor metrics
@@ -262,7 +261,7 @@ All scrape targets have these labels:
**Standard labels:**
- `instance` - Full target address (`<hostname>.home.2rjus.net:<port>`)
- `job` - Job name (e.g., `node-exporter`, `unbound`, `nixos-exporter`)
- `hostname` - Short hostname (e.g., `ns1`, `monitoring01`) - use this for host filtering
- `hostname` - Short hostname (e.g., `ns1`, `monitoring02`) - use this for host filtering
**Host metadata labels** (when configured in `homelab.host`):
- `role` - Host role (e.g., `dns`, `build-host`, `vault`)
@@ -275,7 +274,7 @@ Use the `hostname` label for easy host filtering across all jobs:
```promql
{hostname="ns1"} # All metrics from ns1
node_load1{hostname="monitoring01"} # Specific metric by hostname
node_load1{hostname="monitoring02"} # Specific metric by hostname
up{hostname="ha1"} # Check if ha1 is up
```
@@ -283,10 +282,10 @@ This is simpler than wildcarding the `instance` label:
```promql
# Old way (still works but verbose)
up{instance=~"monitoring01.*"}
up{instance=~"monitoring02.*"}
# New way (preferred)
up{hostname="monitoring01"}
up{hostname="monitoring02"}
```
### Filtering by Role/Tier