nixos-servers

Author	SHA1	Message	Date
Torjus Håkestad	4f593126c0	monitoring01: remove host and migrate services to monitoring02 Remove monitoring01 host configuration and unused service modules (prometheus, grafana, loki, tempo, pyroscope). Migrate blackbox, exportarr, and pve exporters to monitoring02 with scrape configs moved to VictoriaMetrics. Update alert rules, terraform vault policies/secrets, http-proxy entries, and documentation to reflect the monitoring02 migration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:50:20 +01:00
Torjus Håkestad	a6013d3950	monitoring02: enable alerting and migrate CNAMEs from http-proxy - Switch vmalert from blackhole mode to sending alerts to local Alertmanager - Import alerttonotify service so alerts route to NATS notifications - Move alertmanager and grafana CNAMEs from http-proxy to monitoring02 - Add monitoring CNAME to monitoring02 - Add Caddy reverse proxy entries for alertmanager and grafana - Remove prometheus, alertmanager, and grafana Caddy entries from http-proxy (now served directly by monitoring02) - Move monitoring02 Vault AppRole to hosts-generated.tf with extra_policies support and prometheus-metrics policy - Update Promtail to use authenticated loki.home.2rjus.net endpoint only (remove unauthenticated monitoring01 client) - Update pipe-to-loki and bootstrap to use loki.home.2rjus.net with basic auth from Vault secret - Move migration plan to completed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 21:23:21 +01:00
Torjus Håkestad	74e7c9faa4	monitoring02: add Loki service Add standalone Loki service module (services/loki/) with same config as monitoring01 and import it on monitoring02. Update Grafana Loki datasource to localhost. Defer Tempo and Pyroscope migration (not actively used). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 19:42:19 +01:00
Torjus Håkestad	4cbaa33475	monitoring02: add Caddy reverse proxy for VictoriaMetrics and vmalert Add metrics.home.2rjus.net and vmalert.home.2rjus.net CNAMEs with Caddy TLS termination via internal ACME CA. Refactors Grafana's Caddy config from configFile to globalConfig + virtualHosts so both modules can contribute routes to the same Caddy instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:55:08 +01:00
Torjus Håkestad	e329f87b0b	monitoring02: add VictoriaMetrics, vmalert, and Alertmanager Set up the core metrics stack on monitoring02 as Phase 2 of the monitoring migration. VictoriaMetrics replaces Prometheus with identical scrape configs (22 jobs including auto-generated targets). - VictoriaMetrics with 3-month retention and all scrape configs - vmalert evaluating existing rules.yml (notifier disabled) - Alertmanager with same routing config (no alerts during parallel op) - Grafana datasources updated: local VictoriaMetrics as default - Static user override for credential file access (OpenBao, Apiary) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 00:55:08 +01:00
Torjus Håkestad	c151f31011	grafana: fix apiary dashboard panels empty on short time ranges Set interval=60s on rate() panels to match the actual Prometheus scrape interval, so Grafana calculates $__rate_interval correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 20:03:26 +01:00
Torjus Håkestad	3e7aabc73a	grafana: fix apiary geomap and make it full-width Add gazetteer reference for country code lookup resolution. Remove unnecessary reduce transformation. Make geomap panel full-width (24 cols) and taller (h=10) on its own row. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 21:36:24 +01:00
Torjus Håkestad	361e7f2a1b	grafana: add apiary honeypot dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 21:31:06 +01:00
Torjus Håkestad	0bc10cb1fe	grafana: add build service panels to nixos-fleet dashboard Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-11 00:49:50 +01:00
Torjus Håkestad	1460eea700	grafana: fix probe status table join Use joinByField transformation instead of merge to properly align rows by instance. Also exclude duplicate Time/job columns from join. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:38:02 +01:00
Torjus Håkestad	98c4f54f94	grafana: add TLS certificates dashboard Dashboard includes: - Stat panels for endpoints monitored, probe failures, expiring certs - Gauge showing minimum days until any cert expires - Table of all endpoints sorted by expiry (color-coded) - Probe status table with HTTP status and duration - Time series graphs for expiry trends and probe success rate Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 22:35:44 +01:00
Torjus Håkestad	2f5a2a4bf1	grafana: use instant queries for fleet dashboard stat panels Prevents stat panels from being affected by dashboard time range selection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 19:00:33 +01:00
Torjus Håkestad	ed7d2aa727	grafana: add deployment metrics to nixos-fleet dashboard Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-09 15:58:28 +01:00
Torjus Håkestad	f66dfc753c	grafana: add NixOS operations dashboard Loki-based dashboard for tracking NixOS operations including: - Upgrade activity and success/failure stats - Build activity during upgrades - Bootstrap logs for new VM deployments - ACME certificate renewal activity Log panels use LogQL json parsing with | keep host to show clean messages with host labels. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 22:03:28 +01:00
Torjus Håkestad	89d0a6f358	grafana: add systemd services dashboard Dashboard for monitoring systemd across the fleet: - Summary stats: failed/active/inactive units, restarts, timers - Failed units table (shows any units in failed state) - Service restarts table (top 15 services by restart count) - Active units per host bar chart - NixOS upgrade timer table with last trigger time - Backup timers table (restic jobs) - Service restarts over time chart - Hostname filter to focus on specific hosts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:06:59 +01:00
Torjus Håkestad	03ebee4d82	grafana: fix proxmox table __name__ column Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:04:41 +01:00
Torjus Håkestad	05630eb4d4	grafana: add Proxmox dashboard Dashboard for monitoring Proxmox VMs: - Summary stats: VMs running/stopped, node CPU/memory, uptime - VM status table with name, status, CPU%, memory%, uptime - VM CPU usage over time - VM memory usage over time - Network traffic (RX/TX) per VM - Disk I/O (read/write) per VM - Storage usage gauges and capacity table - VM filter to focus on specific VMs Filters out template VMs, shows only actual guests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 21:02:28 +01:00
Torjus Håkestad	d333aa0164	grafana: fix fleet table __name__ columns Exclude the __name__ columns that were leaking through the table transformations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:52:39 +01:00
Torjus Håkestad	a5d5827dcc	grafana: add NixOS fleet dashboard Dashboard for monitoring NixOS deployments across the homelab: - Hosts behind remote / needing reboot stat panels - Fleet status table with revision, behind status, reboot needed, age - Generation age bar chart (shows stale configs) - Generations per host bar chart - Deployment activity time series (see when hosts were updated) - Flake input ages table - Pie charts for hosts by revision and tier - Tier filter variable Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:50:08 +01:00
Torjus Håkestad	1c13ec12a4	grafana: add temperature dashboard Dashboard includes: - Current temperatures per room (stat panel) - Average home temperature (gauge) - Current humidity (stat panel) - 30-day temperature history with mean/min/max in legend - Temperature trend (rate of change per hour) - 24h min/max/avg table per room - 30-day humidity history Filters out device_temperature (internal sensor) metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:45:52 +01:00
Torjus Håkestad	4bf0eeeadb	grafana: add dashboards and fix permissions - Change default OIDC role from Viewer to Editor for Explore access - Add declarative dashboard provisioning - Add node-exporter dashboard (CPU, memory, disk, load, network, I/O) - Add Loki logs dashboard with host/job filters Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:39:21 +01:00
Torjus Håkestad	030e8518c5	grafana: add Grafana on monitoring02 with Kanidm OIDC Deploy Grafana test instance on monitoring02 with: - Kanidm OIDC authentication (admins -> Admin role, others -> Viewer) - PKCE enabled for secure OAuth2 flow (required by Kanidm) - Declarative datasources for Prometheus and Loki on monitoring01 - Local Caddy for TLS termination via internal ACME CA - DNS CNAME grafana-test.home.2rjus.net Terraform changes add OAuth2 client secret and AppRole policies for kanidm01 and monitoring02. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 20:23:26 +01:00

22 Commits