Commit Graph

24 Commits

Author SHA1 Message Date
d2a4e4a0a1 grafana: add storage query performance panels to apiary dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m23s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 22:47:30 +01:00
65acf13e6f grafana: fix datasource UIDs for VictoriaMetrics migration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all dashboard datasource references from "prometheus" to
"victoriametrics" to match the declared datasource UID. Enable
prune and deleteDatasources to clean up the old Prometheus
(monitoring01) datasource from Grafana's database.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:23:04 +01:00
4f593126c0 monitoring01: remove host and migrate services to monitoring02
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m15s
Run nix flake check / flake-check (pull_request) Failing after 3m8s
Remove monitoring01 host configuration and unused service modules
(prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox,
exportarr, and pve exporters to monitoring02 with scrape configs
moved to VictoriaMetrics. Update alert rules, terraform vault
policies/secrets, http-proxy entries, and documentation to reflect
the monitoring02 migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:50:20 +01:00
a6013d3950 monitoring02: enable alerting and migrate CNAMEs from http-proxy
Some checks failed
Run nix flake check / flake-check (push) Failing after 6m25s
Run nix flake check / flake-check (pull_request) Failing after 3m52s
- Switch vmalert from blackhole mode to sending alerts to local
  Alertmanager
- Import alerttonotify service so alerts route to NATS notifications
- Move alertmanager and grafana CNAMEs from http-proxy to monitoring02
- Add monitoring CNAME to monitoring02
- Add Caddy reverse proxy entries for alertmanager and grafana
- Remove prometheus, alertmanager, and grafana Caddy entries from
  http-proxy (now served directly by monitoring02)
- Move monitoring02 Vault AppRole to hosts-generated.tf with
  extra_policies support and prometheus-metrics policy
- Update Promtail to use authenticated loki.home.2rjus.net endpoint
  only (remove unauthenticated monitoring01 client)
- Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with
  basic auth from Vault secret
- Move migration plan to completed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 21:23:21 +01:00
74e7c9faa4 monitoring02: add Loki service
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m19s
Add standalone Loki service module (services/loki/) with same config as
monitoring01 and import it on monitoring02. Update Grafana Loki datasource
to localhost. Defer Tempo and Pyroscope migration (not actively used).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 19:42:19 +01:00
4cbaa33475 monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert
Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with
Caddy TLS termination via internal ACME CA.

Refactors Grafana's Caddy config from configFile to globalConfig +
virtualHosts so both modules can contribute routes to the same
Caddy instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
e329f87b0b monitoring02: add VictoriaMetrics, vmalert, and Alertmanager
Set up the core metrics stack on monitoring02 as Phase 2 of the
monitoring migration. VictoriaMetrics replaces Prometheus with
identical scrape configs (22 jobs including auto-generated targets).

- VictoriaMetrics with 3-month retention and all scrape configs
- vmalert evaluating existing rules.yml (notifier disabled)
- Alertmanager with same routing config (no alerts during parallel op)
- Grafana datasources updated: local VictoriaMetrics as default
- Static user override for credential file access (OpenBao, Apiary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 00:55:08 +01:00
c151f31011 grafana: fix apiary dashboard panels empty on short time ranges
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m54s
Set interval=60s on rate() panels to match the actual Prometheus scrape
interval, so Grafana calculates $__rate_interval correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 20:03:26 +01:00
3e7aabc73a grafana: fix apiary geomap and make it full-width
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m6s
Periodic flake update / flake-update (push) Successful in 5m25s
Add gazetteer reference for country code lookup resolution.
Remove unnecessary reduce transformation. Make geomap panel
full-width (24 cols) and taller (h=10) on its own row.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:36:24 +01:00
361e7f2a1b grafana: add apiary honeypot dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 21:31:06 +01:00
0bc10cb1fe grafana: add build service panels to nixos-fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m48s
Periodic flake update / flake-update (push) Successful in 2m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:49:50 +01:00
1460eea700 grafana: fix probe status table join
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Use joinByField transformation instead of merge to properly align
rows by instance. Also exclude duplicate Time/job columns from join.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:38:02 +01:00
98c4f54f94 grafana: add TLS certificates dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Dashboard includes:
- Stat panels for endpoints monitored, probe failures, expiring certs
- Gauge showing minimum days until any cert expires
- Table of all endpoints sorted by expiry (color-coded)
- Probe status table with HTTP status and duration
- Time series graphs for expiry trends and probe success rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:35:44 +01:00
2f5a2a4bf1 grafana: use instant queries for fleet dashboard stat panels
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Prevents stat panels from being affected by dashboard time range selection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 19:00:33 +01:00
ed7d2aa727 grafana: add deployment metrics to nixos-fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 15:58:28 +01:00
f66dfc753c grafana: add NixOS operations dashboard
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m24s
Run nix flake check / flake-check (pull_request) Successful in 4m5s
Loki-based dashboard for tracking NixOS operations including:
- Upgrade activity and success/failure stats
- Build activity during upgrades
- Bootstrap logs for new VM deployments
- ACME certificate renewal activity

Log panels use LogQL json parsing with | keep host to show
clean messages with host labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 22:03:28 +01:00
89d0a6f358 grafana: add systemd services dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m30s
Run nix flake check / flake-check (pull_request) Failing after 16m49s
Dashboard for monitoring systemd across the fleet:
- Summary stats: failed/active/inactive units, restarts, timers
- Failed units table (shows any units in failed state)
- Service restarts table (top 15 services by restart count)
- Active units per host bar chart
- NixOS upgrade timer table with last trigger time
- Backup timers table (restic jobs)
- Service restarts over time chart
- Hostname filter to focus on specific hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:06:59 +01:00
03ebee4d82 grafana: fix proxmox table __name__ column
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:04:41 +01:00
05630eb4d4 grafana: add Proxmox dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Dashboard for monitoring Proxmox VMs:
- Summary stats: VMs running/stopped, node CPU/memory, uptime
- VM status table with name, status, CPU%, memory%, uptime
- VM CPU usage over time
- VM memory usage over time
- Network traffic (RX/TX) per VM
- Disk I/O (read/write) per VM
- Storage usage gauges and capacity table
- VM filter to focus on specific VMs

Filters out template VMs, shows only actual guests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:02:28 +01:00
d333aa0164 grafana: fix fleet table __name__ columns
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Exclude the __name__ columns that were leaking through the
table transformations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:52:39 +01:00
a5d5827dcc grafana: add NixOS fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Dashboard for monitoring NixOS deployments across the homelab:
- Hosts behind remote / needing reboot stat panels
- Fleet status table with revision, behind status, reboot needed, age
- Generation age bar chart (shows stale configs)
- Generations per host bar chart
- Deployment activity time series (see when hosts were updated)
- Flake input ages table
- Pie charts for hosts by revision and tier
- Tier filter variable

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:50:08 +01:00
1c13ec12a4 grafana: add temperature dashboard
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Dashboard includes:
- Current temperatures per room (stat panel)
- Average home temperature (gauge)
- Current humidity (stat panel)
- 30-day temperature history with mean/min/max in legend
- Temperature trend (rate of change per hour)
- 24h min/max/avg table per room
- 30-day humidity history

Filters out device_temperature (internal sensor) metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:45:52 +01:00
4bf0eeeadb grafana: add dashboards and fix permissions
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
- Change default OIDC role from Viewer to Editor for Explore access
- Add declarative dashboard provisioning
- Add node-exporter dashboard (CPU, memory, disk, load, network, I/O)
- Add Loki logs dashboard with host/job filters

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:39:21 +01:00
030e8518c5 grafana: add Grafana on monitoring02 with Kanidm OIDC
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Deploy Grafana test instance on monitoring02 with:
- Kanidm OIDC authentication (admins -> Admin role, others -> Viewer)
- PKCE enabled for secure OAuth2 flow (required by Kanidm)
- Declarative datasources for Prometheus and Loki on monitoring01
- Local Caddy for TLS termination via internal ACME CA
- DNS CNAME grafana-test.home.2rjus.net

Terraform changes add OAuth2 client secret and AppRole policies for
kanidm01 and monitoring02.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:23:26 +01:00