75 Commits

Author SHA1 Message Date
5d3d93b280 docs: move completed plans to completed folder
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m22s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:08:17 +01:00
ae823e439d monitoring: lower unbound cache hit ratio alert threshold to 20%
Some checks failed
Run nix flake check / flake-check (push) Failing after 9m2s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:55:03 +01:00
0d9f49a3b4 flake.lock: Update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m25s
Improves builder logging: build failure output is now logged as
individual lines instead of a single JSON blob, making errors
readable in Loki/Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:36:18 +01:00
08d9e1ec3f docs: add garage S3 storage plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m26s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:06:53 +01:00
fa8d65b612 nix-cache02: increase builder timeout to 2 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 14m21s
Periodic flake update / flake-update (push) Successful in 5m17s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:44:55 +01:00
6726f111e3 flake.lock: Update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:42:23 +01:00
3a083285cb flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)
  → 'github:nixos/nixpkgs/6c5e707c6b5339359a9a9e215c5e66d6d802fd7a?narHash=sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs%3D' (2026-02-11)
2026-02-12 00:01:27 +00:00
ed1821b073 nix-cache02: add scheduled builds timer
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m7s
Periodic flake update / flake-update (push) Successful in 2m18s
Add a systemd timer that triggers builds for all hosts every 2 hours
via NATS, keeping the binary cache warm.

- Add scheduler.nix with timer (every 2h) and oneshot service
- Add scheduler NATS user to DEPLOY account
- Add Vault secret and variable for scheduler NKey
- Increase nix-cache02 memory from 16GB to 20GB
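
A rough sketch of what a timer/oneshot pair like this can look like in NixOS (unit names and the trigger command are illustrative; the real scheduler.nix runs homelab-deploy with the scheduler NKey):

```nix
{ pkgs, ... }:
{
  # Hypothetical oneshot that asks the builder to rebuild everything.
  systemd.services.cache-warm-builds = {
    description = "Trigger builds for all hosts via NATS";
    serviceConfig.Type = "oneshot";
    # Placeholder command: publish an empty build request with natscli.
    # The actual service authenticates with the scheduler NKey.
    script = ''
      ${pkgs.natscli}/bin/nats pub build.nixos-servers.all ""
    '';
  };

  systemd.timers.cache-warm-builds = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "15m";
      OnUnitActiveSec = "2h";
    };
  };
}
```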

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 00:50:09 +01:00
fa4a418007 restic: add --retry-lock=5m to all backup jobs
Some checks failed
Run nix flake check / flake-check (push) Failing after 23m42s
Prevents lock conflicts when multiple backup jobs targeting the same
repository run concurrently. Jobs will now retry acquiring the lock
every 10 seconds for up to 5 minutes before failing.
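
As a sketch, the flag can be passed through the NixOS restic module's extraBackupArgs (the repository, password path, and paths below are placeholders, not the real job definitions):

```nix
{
  services.restic.backups.example = {
    repository = "rest:https://backup.example.net/repo";  # placeholder
    passwordFile = "/run/secrets/restic-password";        # placeholder
    paths = [ "/var/lib" ];                               # placeholder
    # Wait up to 5 minutes for a concurrent job to release the repo lock.
    extraBackupArgs = [ "--retry-lock=5m" ];
  };
}
```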

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 01:22:00 +01:00
963e5f6d3c flake.lock: Update
Flake lock file updates:

• Updated input 'homelab-deploy':
    'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=a8aab16d0e7400aaa00500d08c12734da3b638e0' (2026-02-10)
  → 'git+https://git.t-juice.club/torjus/homelab-deploy?ref=master&rev=c13914bf5acdcda33de63ad5ed9d661e4dc3118c' (2026-02-10)
• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07)
  → 'github:nixos/nixpkgs/2db38e08fdadcc0ce3232f7279bab59a15b94482?narHash=sha256-1jZvgZoAagZZB6NwGRv2T2ezPy%2BX6EFDsJm%2BYSlsvEs%3D' (2026-02-09)
2026-02-11 00:01:28 +00:00
0bc10cb1fe grafana: add build service panels to nixos-fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m48s
Periodic flake update / flake-update (push) Successful in 2m20s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:49:50 +01:00
b03e2e8ee4 monitoring: add alerts for homelab-deploy build failures
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-11 00:45:07 +01:00
ddcbc30665 docs: mark nix-cache01 decommission complete
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m38s
Phase 4 fully complete. nix-cache01 has been:
- Removed from repo (host config, build scripts, flake entry)
- Vault resources cleaned up
- VM deleted from Proxmox

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:43:12 +01:00
75210805d5 nix-cache01: decommission and remove all references
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Removed:
- hosts/nix-cache01/ directory
- services/nix-cache/build-flakes.{nix,sh} (replaced by NATS builder)
- Vault secret and AppRole for nix-cache01
- Old signing key variable from terraform
- Old trusted public key from system/nix.nix

Updated:
- flake.nix: removed nixosConfiguration
- README.md: nix-cache01 -> nix-cache02
- Monitoring rules: removed build-flakes alerts, updated harmonia to nix-cache02
- Simplified proxy.nix (no longer needs hostname conditional)

nix-cache02 is now the sole binary cache host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:40:51 +01:00
ade0538717 docs: mark nix-cache DNS cutover complete
Some checks are pending
Run nix flake check / flake-check (push) Has started running
nix-cache.home.2rjus.net now served by nix-cache02.
nix-cache01 ready for decommission.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:34:04 +01:00
83fce5f927 nix-cache: switch DNS to nix-cache02
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Move nix-cache CNAME from nix-cache01 to nix-cache02
- Remove actions1 CNAME (service removed)
- Update proxy.nix to serve canonical domain on nix-cache02
- Promote nix-cache02 to prod tier with build-host role

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:22:23 +01:00
afff3f28ca docs: update nix-cache-reprovision plan with Harmonia progress
Some checks failed
Run nix flake check / flake-check (push) Failing after 52s
- Phase 4 now in progress
- Harmonia configured on nix-cache02 with new signing key
- Trusted public key deployed to all hosts
- Cache tested successfully from testvm01
- Actions runner removed from scope

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:17:51 +01:00
49f7e3ae2e nix-cache: use hostname-based domain for Caddy proxy
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m18s
nix-cache01 serves nix-cache.home.2rjus.net (canonical)
nix-cache02 serves nix-cache02.home.2rjus.net (for testing)

This allows testing nix-cache02 independently before DNS cutover.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:14:14 +01:00
751edfc11d nix-cache02: add Harmonia binary cache service
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Parameterize harmonia.nix to use hostname-based Vault paths
- Add nix-cache services to nix-cache02
- Add Vault secret and variable for nix-cache02 signing key
- Add nix-cache02 public key to trusted-public-keys on all hosts
- Update plan doc to remove actions runner references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 23:08:48 +01:00
98a7301985 nix-cache: remove unused Gitea Actions runner
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m23s
The actions runner on nix-cache01 was never actively used.
Removing it before migrating to nix-cache02.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:57:08 +01:00
34efa58cfe Merge pull request 'nix-cache02-builder' (#39) from nix-cache02-builder into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m27s
Reviewed-on: #39
2026-02-10 21:47:58 +00:00
5bfb51a497 docs: add observability phase to nix-cache plan
Some checks failed
Run nix flake check / flake-check (push) Successful in 2m35s
Run nix flake check / flake-check (pull_request) Failing after 16m1s
- Add Phase 6 for alerting and Grafana dashboards
- Document available Prometheus metrics
- Include example alerting rules for build failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:46:38 +01:00
f83145d97a docs: update nix-cache-reprovision plan with progress
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Mark Phase 1 (new build host) and Phase 2 (NATS build triggering) complete
- Document nix-cache02 configuration and tested build times
- Add remaining work for Harmonia, Actions runner, and DNS cutover
- Enable --enable-builds flag in MCP config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:43:48 +01:00
47747329c4 nix-cache02: add homelab-deploy builder service
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m51s
- Configure builder to build nixos-servers and nixos (gunter) repos
- Add builder NKey to Vault secrets
- Update NATS permissions for builder, test-deployer, and admin-deployer
- Grant nix-cache02 access to shared homelab-deploy secrets

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 22:26:40 +01:00
2d9ca2a73f hosts: add nix-cache02 build host
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m26s
New build host to replace nix-cache01 with:
- 8 CPU cores, 16GB RAM, 200GB disk
- Static IP 10.69.13.25

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 21:53:29 +01:00
98ea679ef2 docs: add monitoring02 reboot alert investigation
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m41s
Document findings from false positive host_reboot alert caused by
NTP clock adjustment affecting node_boot_time_seconds metric.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-10 17:59:53 +01:00
b709c0b703 monitoring: disable radarr exporter (version mismatch)
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m20s
Periodic flake update / flake-update (push) Successful in 2m23s
Radarr in the TrueNAS jail is too old: exportarr fails on the
/api/v3/wanted/cutoff endpoint (404). Keep the Sonarr exporter, which works.

Vault secret kept for when Radarr is updated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:59:45 +01:00
33c5d5b3f0 monitoring: add exportarr for radarr/sonarr metrics
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Add Prometheus exportarr exporters for the Radarr and Sonarr media
services. They run on monitoring01 and query the remote APIs.

- Radarr exporter on port 9708
- Sonarr exporter on port 9709
- API keys fetched from Vault

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:56:03 +01:00
0a28c5f495 terraform: add radarr/sonarr API keys for exportarr
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Add vault secrets for Radarr and Sonarr API keys to enable
exportarr metrics collection on monitoring01.

- services/exportarr/radarr - Radarr API key
- services/exportarr/sonarr - Sonarr API key
- Grant monitoring01 access to services/exportarr/*

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:52:34 +01:00
9bd48e0808 monitoring: explicitly list valid HTTP status codes
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Empty valid_status_codes defaults to 2xx only, not "any".
Explicitly list common status codes (2xx, 3xx, 4xx, 5xx) so
services returning 400/401 like ha and nzbget pass the probe.
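
A minimal sketch of a blackbox module with an explicit status list (the real module definitions live in the monitoring config; the set of codes below is illustrative):

```nix
{ pkgs, ... }:
{
  services.prometheus.exporters.blackbox = {
    enable = true;
    configFile = pkgs.writeText "blackbox.yml" (builtins.toJSON {
      modules.https_any_status = {
        prober = "http";
        # Listing codes explicitly; an empty list would mean "2xx only".
        http.valid_status_codes = [ 200 301 302 400 401 403 404 500 503 ];
      };
    });
  };
}
```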

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:41:47 +01:00
1460eea700 grafana: fix probe status table join
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Use joinByField transformation instead of merge to properly align
rows by instance. Also exclude duplicate Time/job columns from join.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:38:02 +01:00
98c4f54f94 grafana: add TLS certificates dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Dashboard includes:
- Stat panels for endpoints monitored, probe failures, expiring certs
- Gauge showing minimum days until any cert expires
- Table of all endpoints sorted by expiry (color-coded)
- Probe status table with HTTP status and duration
- Time series graphs for expiry trends and probe success rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:35:44 +01:00
d1b0a5dc20 monitoring: accept any HTTP status in TLS probe
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Only care about TLS handshake success for certificate monitoring.
Services like nzbget (401) and ha (400) return non-2xx but have
valid certificates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:33:45 +01:00
4d32707130 monitoring: remove duplicate rules from blackbox.nix
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
The rules were already added to rules.yml but the blackbox.nix file
still had them, causing duplicate 'groups' key errors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:28:42 +01:00
8e1753c2c8 monitoring: fix blackbox rules and add force-push policy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Move certificate alert rules to rules.yml instead of adding them as a
separate rules string in blackbox.nix. The previous approach caused a
YAML parse error due to duplicate 'groups' keys.

Also add policy to CLAUDE.md: never force push to master.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:26:05 +01:00
75e4fb61a5 monitoring: add blackbox exporter for TLS certificate monitoring
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Add blackbox exporter to monitoring01 to probe TLS endpoints and alert
on expiring certificates. Monitors all ACME-managed certificates from
OpenBao PKI including Caddy auto-TLS services.

Alerts:
- tls_certificate_expiring_soon (< 7 days, warning)
- tls_certificate_expiring_critical (< 24h, critical)
- tls_probe_failed (connectivity issues)
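
A hedged sketch of the first alert, assuming the standard blackbox probe_ssl_earliest_cert_expiry metric and the services.prometheus.rules option (this repo actually keeps its rules in rules.yml):

```nix
{
  services.prometheus.rules = [ ''
    groups:
      - name: tls
        rules:
          - alert: tls_certificate_expiring_soon
            expr: probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
            for: 1h
            labels:
              severity: warning
  '' ];
}
```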

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 22:21:42 +01:00
2be213e454 terraform: update default template to nixos-25.11.20260207
Some checks failed
Run nix flake check / flake-check (push) Failing after 12m13s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 21:57:17 +01:00
12c252653b ansible: add reboot playbook and short hostname support
- Add reboot.yml playbook with rolling reboot (serial: 1)
  - Uses systemd reboot.target for NixOS compatibility
  - Waits for each host to come back before proceeding
- Update dynamic inventory to use short hostnames
  - ansible_host set to FQDN for connections
  - Allows -l testvm01 instead of -l testvm01.home.2rjus.net
- Update static.yml to match short hostname convention

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 21:56:32 +01:00
6493338c4c ansible: fix deprecated yaml callback plugin
Use result_format=yaml with builtin default callback instead of
the removed community.general.yaml plugin.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 21:47:16 +01:00
6e08ba9720 ansible: restructure with dynamic inventory from flake
- Move playbooks/ to ansible/playbooks/
- Add dynamic inventory script that extracts hosts from flake
  - Groups by tier (tier_test, tier_prod) and role (role_dns, etc.)
  - Reads homelab.host.* options for metadata
- Add static inventory for non-flake hosts (Proxmox)
- Add ansible.cfg with inventory path and SSH optimizations
- Add group_vars/all.yml for common variables
- Add restart-service.yml playbook for restarting systemd services
- Update provision-approle.yml with single-host safeguard
- Add ANSIBLE_CONFIG to devshell for automatic inventory discovery
- Add ansible = "false" label to template2 to exclude from inventory
- Update CLAUDE.md to reference ansible/README.md for details

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 21:41:29 +01:00
7ff3d2a09b docs: move openbao-kanidm-oidc plan to completed
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
2026-02-09 19:44:06 +01:00
e85f15b73d vault: add OpenBao OIDC integration with Kanidm
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Enable Kanidm users to authenticate to OpenBao via OIDC for Web UI access.
Members of the admins group get full read/write access to secrets.

Changes:
- Add OIDC auth backend in Terraform (oidc.tf)
- Add oidc-admin and oidc-default policies
- Add openbao OAuth2 client to Kanidm
- Enable legacy crypto (RS256) for OpenBao compatibility
- Allow imperative group membership management in Kanidm

Limitations:
- CLI login not supported (Kanidm requires HTTPS for confidential client redirects)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 19:42:26 +01:00
2f5a2a4bf1 grafana: use instant queries for fleet dashboard stat panels
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Prevents stat panels from being affected by dashboard time range selection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 19:00:33 +01:00
287141c623 hosts: add role metadata to all hosts
Some checks failed
Run nix flake check / flake-check (push) Failing after 13m51s
Assign roles to hosts for better organization and filtering:
- ha1: home-automation
- monitoring01, monitoring02: monitoring
- jelly01: media
- nats1: messaging
- http-proxy: proxy
- testvm01-03: test

Also promote kanidm01 and monitoring02 from test to prod tier.
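
For illustration, the per-host metadata from the custom homelab module looks roughly like this (option names are taken from modules/homelab as referenced elsewhere in the repo; the values are examples):

```nix
{
  homelab.host = {
    tier = "prod";        # "test" or "prod"
    role = "monitoring";  # e.g. dns, media, messaging, proxy
  };
}
```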

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 16:21:08 +01:00
9ed11b712f home-assistant: fix Jinja2 battery template syntax
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m13s
The template used | min(100) | max(0) which is invalid Jinja2 syntax.
These filters expect iterables (lists), not scalar arguments. This
caused TypeError warnings on every MQTT message and left battery
sensors unavailable.

Fixed by using proper list-based min/max:
  [[value, 100] | min, 0] | max

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 16:12:59 +01:00
ffad2dd205 monitoring: increase zigbee_sensor_stale threshold to 4 hours
The 2-hour threshold was too aggressive for temperature sensors in
stable environments. Historical data shows gaps up to 2.75 hours when
temperature hasn't changed (Home Assistant only updates last_updated
when values change). Increasing to 4 hours avoids false positives
while still catching genuine failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 16:10:54 +01:00
ed7d2aa727 grafana: add deployment metrics to nixos-fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 15:58:28 +01:00
bf7a025364 flake: update homelab-deploy input
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m49s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 15:45:30 +01:00
4ae99dbc89 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/e576e3c9cf9bad747afcddd9e34f51d18c855b4e?narHash=sha256-tlFqNG/uzz2%2B%2BaAmn4v8J0vAkV3z7XngeIIB3rM3650%3D' (2026-02-03)
  → 'github:nixos/nixpkgs/23d72dabcb3b12469f57b37170fcbc1789bd7457?narHash=sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY%3D' (2026-02-07)
• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/00c21e4c93d963c50d4c0c89bfa84ed6e0694df2?narHash=sha256-AYqlWrX09%2BHvGs8zM6ebZ1pwUqjkfpnv8mewYwAo%2BiM%3D' (2026-02-04)
  → 'github:nixos/nixpkgs/d6c71932130818840fc8fe9509cf50be8c64634f?narHash=sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84%3D' (2026-02-08)
2026-02-09 00:01:58 +00:00
5c142b1323 flake: update homelab-deploy input
Some checks failed
Run nix flake check / flake-check (push) Failing after 10m7s
Periodic flake update / flake-update (push) Successful in 2m51s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:42:51 +01:00
4091e51f41 nixos-exporter: use nkeySeedFile option
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m26s
Use the new nkeySeedFile option instead of credentialsFile for NATS
authentication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:34:22 +01:00
a8e558a6b7 flake: update nixos-exporter input
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:32:56 +01:00
4efc798c38 nixos-exporter: fix nkey file permissions
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m6s
Set owner/group to nixos-exporter so the service can read the
NATS credentials file.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:18:10 +01:00
016f8c9119 terraform: add nixos-exporter shared policy
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
- Create shared policy granting all hosts access to nixos-exporter nkey
- Add policy to both manual and generated host AppRoles
- Remove duplicate kanidm01/monitoring02 entries from hosts-generated.tf

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 00:04:17 +01:00
fec2a261ab Merge pull request 'nixos-exporter: enable NATS cache sharing' (#38) from nixos-exporter-nats-cache into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m18s
Reviewed-on: #38
2026-02-08 22:58:24 +00:00
60c04a2052 nixos-exporter: enable NATS cache sharing
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m17s
Run nix flake check / flake-check (push) Failing after 5m16s
When one host fetches the latest flake revision, it publishes to NATS
and all other hosts receive the update immediately. This reduces
redundant nix flake metadata calls across the fleet.

- Add nkeys to devshell for key generation
- Add nixos-exporter user to NATS HOMELAB account
- Add Vault secret for NKey storage
- Configure all hosts to use NATS for revision sharing
- Update nixos-exporter input to version with NATS support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 23:57:28 +01:00
39e3f37263 flake: update homelab-deploy input
Some checks failed
Run nix flake check / flake-check (push) Failing after 15m17s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 22:49:44 +01:00
a2d93baba8 Merge pull request 'grafana: add NixOS operations dashboard' (#37) from grafana-nixos-operations-dashboard into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m54s
Reviewed-on: #37
2026-02-08 21:04:19 +00:00
f66dfc753c grafana: add NixOS operations dashboard
All checks were successful
Run nix flake check / flake-check (push) Successful in 3m24s
Run nix flake check / flake-check (pull_request) Successful in 4m5s
Loki-based dashboard for tracking NixOS operations including:
- Upgrade activity and success/failure stats
- Build activity during upgrades
- Bootstrap logs for new VM deployments
- ACME certificate renewal activity

Log panels use LogQL json parsing with | keep host to show
clean messages with host labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 22:03:28 +01:00
79a6a72719 Merge pull request 'grafana-dashboards-permissions' (#36) from grafana-dashboards-permissions into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m4s
Reviewed-on: #36
2026-02-08 20:18:22 +00:00
89d0a6f358 grafana: add systemd services dashboard
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m30s
Run nix flake check / flake-check (pull_request) Failing after 16m49s
Dashboard for monitoring systemd across the fleet:
- Summary stats: failed/active/inactive units, restarts, timers
- Failed units table (shows any units in failed state)
- Service restarts table (top 15 services by restart count)
- Active units per host bar chart
- NixOS upgrade timer table with last trigger time
- Backup timers table (restic jobs)
- Service restarts over time chart
- Hostname filter to focus on specific hosts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:06:59 +01:00
03ebee4d82 grafana: fix proxmox table __name__ column
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m9s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:04:41 +01:00
05630eb4d4 grafana: add Proxmox dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Dashboard for monitoring Proxmox VMs:
- Summary stats: VMs running/stopped, node CPU/memory, uptime
- VM status table with name, status, CPU%, memory%, uptime
- VM CPU usage over time
- VM memory usage over time
- Network traffic (RX/TX) per VM
- Disk I/O (read/write) per VM
- Storage usage gauges and capacity table
- VM filter to focus on specific VMs

Filters out template VMs, shows only actual guests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 21:02:28 +01:00
1e52eec02a monitoring: always include tier label in scrape configs
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m8s
Previously tier was only included if non-default (not "prod"), which
meant prod hosts had no tier label. This made the Grafana tier filter
only show "test" since "prod" never appeared in label_values().

Now tier is always included, so both "prod" and "test" appear in the
fleet dashboard tier selector.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:58:52 +01:00
d333aa0164 grafana: fix fleet table __name__ columns
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Exclude the __name__ columns that were leaking through the
table transformations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:52:39 +01:00
a5d5827dcc grafana: add NixOS fleet dashboard
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Dashboard for monitoring NixOS deployments across the homelab:
- Hosts behind remote / needing reboot stat panels
- Fleet status table with revision, behind status, reboot needed, age
- Generation age bar chart (shows stale configs)
- Generations per host bar chart
- Deployment activity time series (see when hosts were updated)
- Flake input ages table
- Pie charts for hosts by revision and tier
- Tier filter variable

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:50:08 +01:00
1c13ec12a4 grafana: add temperature dashboard
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m5s
Dashboard includes:
- Current temperatures per room (stat panel)
- Average home temperature (gauge)
- Current humidity (stat panel)
- 30-day temperature history with mean/min/max in legend
- Temperature trend (rate of change per hour)
- 24h min/max/avg table per room
- 30-day humidity history

Filters out device_temperature (internal sensor) metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:45:52 +01:00
4bf0eeeadb grafana: add dashboards and fix permissions
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m3s
- Change default OIDC role from Viewer to Editor for Explore access
- Add declarative dashboard provisioning
- Add node-exporter dashboard (CPU, memory, disk, load, network, I/O)
- Add Loki logs dashboard with host/job filters

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:39:21 +01:00
304cb117ce Merge pull request 'grafana-kanidm-oidc' (#35) from grafana-kanidm-oidc into master
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m7s
Reviewed-on: #35
2026-02-08 19:30:20 +00:00
02270a0e4a docs: update plans with Grafana OIDC progress
Some checks failed
Run nix flake check / flake-check (pull_request) Successful in 2m7s
Run nix flake check / flake-check (push) Failing after 16m31s
- auth-system-replacement.md: Mark OAuth2 client (Grafana) as completed,
  document key findings (PKCE, attribute paths, user requirements)
- monitoring-migration-victoriametrics.md: Note Grafana deployment on
  monitoring02 with Kanidm OIDC as test instance

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:28:10 +01:00
030e8518c5 grafana: add Grafana on monitoring02 with Kanidm OIDC
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m3s
Deploy Grafana test instance on monitoring02 with:
- Kanidm OIDC authentication (admins -> Admin role, others -> Viewer)
- PKCE enabled for secure OAuth2 flow (required by Kanidm)
- Declarative datasources for Prometheus and Loki on monitoring01
- Local Caddy for TLS termination via internal ACME CA
- DNS CNAME grafana-test.home.2rjus.net

Terraform changes add OAuth2 client secret and AppRole policies for
kanidm01 and monitoring02.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 20:23:26 +01:00
9ffdd4f862 terraform: increase monitoring02 disk to 60G
Some checks failed
Run nix flake check / flake-check (push) Failing after 11m8s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 19:23:40 +01:00
0b977808ca hosts: add monitoring02 configuration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
New test-tier host for monitoring stack expansion with:
- Static IP 10.69.13.24
- 4 CPU cores, 4GB RAM, 20GB disk
- Vault integration and NATS-based deployment enabled

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 19:19:38 +01:00
8786113f8f docs: add OpenBao + Kanidm OIDC integration plan
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m10s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:45:44 +01:00
fdb2c31f84 docs: add pipe-to-loki documentation to CLAUDE.md
All checks were successful
Run nix flake check / flake-check (push) Successful in 2m1s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:34:01 +01:00
76 changed files with 5422 additions and 565 deletions

View File

@@ -31,7 +31,8 @@
"--",
"mcp",
"--nats-url", "nats://nats1.home.2rjus.net:4222",
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey"
"--nkey-file", "/home/torjus/.config/homelab-deploy/test-deployer.nkey",
"--enable-builds"
]
},
"git-explorer": {

View File

@@ -39,6 +39,30 @@ Do not automatically deploy changes. Deployments are usually done by updating th
Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.
### Sharing Command Output via Loki
All hosts have the `pipe-to-loki` script for sending command output or terminal sessions to Loki, allowing users to share output with Claude without copy-pasting.
**Pipe mode** - send command output:
```bash
command | pipe-to-loki # Auto-generated ID
command | pipe-to-loki --id my-test # Custom ID
```
**Session mode** - record interactive terminal session:
```bash
pipe-to-loki --record # Start recording, exit to send
pipe-to-loki --record --id my-session # With custom ID
```
The script prints the session ID which the user can share. Query results with:
```logql
{job="pipe-to-loki"} # All entries
{job="pipe-to-loki", id="my-test"} # Specific ID
{job="pipe-to-loki", host="testvm01"} # From specific host
{job="pipe-to-loki", type="session"} # Only sessions
```
### Testing Feature Branches on Hosts
All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
@@ -90,6 +114,12 @@ nix develop -c tofu -chdir=terraform/vault apply
cd terraform && tofu plan
```
### Ansible
Ansible configuration and playbooks are in `/ansible/`. See [ansible/README.md](ansible/README.md) for inventory groups, available playbooks, and usage examples.
The devshell sets `ANSIBLE_CONFIG` automatically, so no `-i` flag is needed.
### Secrets Management
Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
@@ -102,6 +132,8 @@ Terraform manages the secrets and AppRole policies in `terraform/vault/`.
**Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.
**Important:** Never force push to `master`. If a commit on master has an error, fix it with a new commit rather than rewriting history.
**Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.
When starting a new plan or task, the first step should typically be to create and check out a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
@@ -255,7 +287,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
- `/docs/` - Documentation and plans
- `plans/` - Future plans and proposals
- `plans/completed/` - Completed plans (moved here when done)
- `/playbooks/` - Ansible playbooks for fleet management
- `/ansible/` - Ansible configuration and playbooks
- `ansible.cfg` - Ansible configuration (inventory path, defaults)
- `inventory/` - Dynamic and static inventory sources
- `playbooks/` - Ansible playbooks for fleet management
### Configuration Inheritance
@@ -279,24 +314,11 @@ All hosts automatically get:
- Custom root CA trust
- DNS zone auto-registration via `homelab.dns` options
### Active Hosts
### Hosts
Production servers:
- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
- `vault01` - OpenBao (Vault) secrets server + PKI CA
- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
- `http-proxy` - Reverse proxy
- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
- `jelly01` - Jellyfin media server
- `nix-cache01` - Binary cache server + GitHub Actions runner
- `pgdb1` - PostgreSQL database
- `nats1` - NATS messaging server
Host configurations are in `/hosts/<hostname>/`. See `flake.nix` for the complete list of `nixosConfigurations`.
Test/staging hosts:
- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
Template hosts:
- `template1`, `template2` - Base templates for cloning new hosts
Use `nix flake show` or `nix develop -c ansible-inventory --graph` to list all hosts.
### Flake Inputs
@@ -327,7 +349,7 @@ Most hosts use OpenBao (Vault) for secrets:
- `extractKey` option extracts a single key from vault JSON as a plain file
- Secrets fetched at boot by `vault-secret-<name>.service` systemd units
- Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
- Provision AppRole credentials: `nix develop -c ansible-playbook ansible/playbooks/provision-approle.yml -l <hostname>`
### Auto-Upgrade System
@@ -351,7 +373,7 @@ Template VMs are built from `hosts/template2` and deployed to Proxmox using Ansi
```bash
# Build NixOS image and deploy to Proxmox as template
nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml
```
This playbook:
@@ -426,7 +448,7 @@ This means:
- `tofu plan` won't show spurious changes for Proxmox-managed defaults
**When rebuilding the template:**
1. Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
1. Run `nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml`
2. Update `default_template_name` in `terraform/variables.tf` if the name changed
3. Run `tofu plan` - should show no VM recreations (only template name in state)
4. Run `tofu apply` - updates state without touching existing VMs
@@ -509,6 +531,7 @@ The `modules/homelab/` directory defines custom options used across hosts for au
- `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
- `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
- `labels` - Free-form key-value metadata for host categorization
- `ansible = "false"` - Exclude host from Ansible dynamic inventory
**DNS options (`homelab.dns.*`):**
- `enable` (default: `true`) - Include host in DNS zone generation

View File

@@ -12,7 +12,7 @@ NixOS Flake-based configuration repository for a homelab infrastructure. All hos
| `http-proxy` | Reverse proxy |
| `monitoring01` | Prometheus, Grafana, Loki, Tempo, Pyroscope |
| `jelly01` | Jellyfin media server |
| `nix-cache01` | Nix binary cache |
| `nix-cache02` | Nix binary cache + NATS-based build service |
| `nats1` | NATS messaging |
| `vault01` | OpenBao (Vault) secrets management |
| `template1`, `template2` | VM templates for cloning new hosts |

120
ansible/README.md Normal file
View File

@@ -0,0 +1,120 @@
# Ansible Configuration
This directory contains Ansible configuration for fleet management tasks.
## Structure
```
ansible/
├── ansible.cfg                # Ansible configuration
├── inventory/
│   ├── dynamic_flake.py       # Dynamic inventory from NixOS flake
│   ├── static.yml             # Non-flake hosts (Proxmox, etc.)
│   └── group_vars/
│       └── all.yml            # Common variables
└── playbooks/
    ├── build-and-deploy-template.yml
    ├── provision-approle.yml
    ├── restart-service.yml
    └── run-upgrade.yml
```
## Usage
The devshell automatically configures `ANSIBLE_CONFIG`, so commands work without extra flags:
```bash
# List inventory groups
nix develop -c ansible-inventory --graph
# List hosts in a specific group
nix develop -c ansible-inventory --list | jq '.role_dns'
# Run a playbook
nix develop -c ansible-playbook ansible/playbooks/run-upgrade.yml -l tier_test
```
## Inventory
The inventory combines dynamic and static sources automatically.
### Dynamic Inventory (from flake)
The `dynamic_flake.py` script extracts hosts from the NixOS flake using `homelab.host.*` options:
**Groups generated:**
- `flake_hosts` - All NixOS hosts from the flake
- `tier_test`, `tier_prod` - By `homelab.host.tier`
- `role_dns`, `role_vault`, `role_monitoring`, etc. - By `homelab.host.role`
**Host variables set:**
- `tier` - Deployment tier (test/prod)
- `role` - Host role
- `short_hostname` - Hostname without domain
### Static Inventory
Non-flake hosts are defined in `inventory/static.yml`:
- `proxmox` - Proxmox hypervisors
## Playbooks
| Playbook | Description | Example |
|----------|-------------|---------|
| `run-upgrade.yml` | Trigger nixos-upgrade on hosts | `-l tier_prod` |
| `restart-service.yml` | Restart a systemd service | `-l role_dns -e service=unbound` |
| `reboot.yml` | Rolling reboot (one host at a time) | `-l tier_test` |
| `provision-approle.yml` | Deploy Vault credentials (single host only) | `-l testvm01` |
| `build-and-deploy-template.yml` | Build and deploy Proxmox template | (no limit needed) |
### Examples
```bash
# Restart unbound on all DNS servers
nix develop -c ansible-playbook ansible/playbooks/restart-service.yml \
  -l role_dns -e service=unbound
# Trigger upgrade on all test hosts
nix develop -c ansible-playbook ansible/playbooks/run-upgrade.yml -l tier_test
# Provision Vault credentials for a specific host
nix develop -c ansible-playbook ansible/playbooks/provision-approle.yml -l testvm01
# Build and deploy Proxmox template
nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml
# Rolling reboot of test hosts (one at a time, waits for each to come back)
nix develop -c ansible-playbook ansible/playbooks/reboot.yml -l tier_test
```
## Excluding Flake Hosts
To exclude a flake host from the dynamic inventory, add the `ansible = "false"` label in the host's configuration:
```nix
homelab.host.labels.ansible = "false";
```
Hosts with `homelab.dns.enable = false` are also excluded automatically.
## Adding Non-Flake Hosts
Edit `inventory/static.yml` to add hosts not managed by the NixOS flake:
```yaml
all:
  children:
    my_group:
      hosts:
        host1.example.com:
          ansible_user: admin
```
## Common Variables
Variables in `inventory/group_vars/all.yml` apply to all hosts:
- `ansible_user` - Default SSH user (root)
- `domain` - Domain name (home.2rjus.net)
- `vault_addr` - Vault server URL

17
ansible/ansible.cfg Normal file
View File

@@ -0,0 +1,17 @@
[defaults]
inventory = inventory/
remote_user = root
host_key_checking = False
# Reduce SSH connection overhead
forks = 10
pipelining = True
# Output formatting (YAML output via builtin default callback)
stdout_callback = default
callbacks_enabled = profile_tasks
result_format = yaml
[ssh_connection]
# Reuse SSH connections
ssh_args = -o ControlMaster=auto -o ControlPersist=60s

View File

@@ -0,0 +1,162 @@
#!/usr/bin/env python3
"""
Dynamic Ansible inventory script that extracts host information from the NixOS flake.

Generates groups:
- flake_hosts: All hosts defined in the flake
- tier_test, tier_prod: Hosts by deployment tier
- role_<name>: Hosts by role (dns, vault, monitoring, etc.)

Usage:
    ./dynamic_flake.py --list    # Return full inventory
    ./dynamic_flake.py --host X  # Return host vars (not used, but required by Ansible)
"""
import json
import subprocess
import sys
from pathlib import Path


def get_flake_dir() -> Path:
    """Find the flake root directory."""
    script_dir = Path(__file__).resolve().parent
    # ansible/inventory/dynamic_flake.py -> repo root
    return script_dir.parent.parent


def evaluate_flake() -> dict:
    """Evaluate the flake and extract host metadata."""
    flake_dir = get_flake_dir()

    # Nix expression to extract relevant config from each host
    nix_expr = """
      configs: builtins.mapAttrs (name: cfg: {
        hostname = cfg.config.networking.hostName;
        domain = cfg.config.networking.domain or "home.2rjus.net";
        tier = cfg.config.homelab.host.tier;
        role = cfg.config.homelab.host.role;
        labels = cfg.config.homelab.host.labels;
        dns_enabled = cfg.config.homelab.dns.enable;
      }) configs
    """

    try:
        result = subprocess.run(
            [
                "nix",
                "eval",
                "--json",
                f"{flake_dir}#nixosConfigurations",
                "--apply",
                nix_expr,
            ],
            capture_output=True,
            text=True,
            check=True,
            cwd=flake_dir,
        )
        return json.loads(result.stdout)
    except subprocess.CalledProcessError as e:
        print(f"Error evaluating flake: {e.stderr}", file=sys.stderr)
        sys.exit(1)
    except json.JSONDecodeError as e:
        print(f"Error parsing nix output: {e}", file=sys.stderr)
        sys.exit(1)


def sanitize_group_name(name: str) -> str:
    """Sanitize a string for use as an Ansible group name.

    Ansible group names should contain only alphanumeric characters and underscores.
    """
    return name.replace("-", "_")


def build_inventory(hosts_data: dict) -> dict:
    """Build Ansible inventory structure from host data."""
    inventory = {
        "_meta": {"hostvars": {}},
        "flake_hosts": {"hosts": []},
    }

    # Track groups we need to create
    tier_groups: dict[str, list[str]] = {}
    role_groups: dict[str, list[str]] = {}

    for _config_name, host_info in hosts_data.items():
        hostname = host_info["hostname"]
        domain = host_info["domain"]
        tier = host_info["tier"]
        role = host_info["role"]
        labels = host_info["labels"]
        dns_enabled = host_info["dns_enabled"]

        # Skip hosts that have DNS disabled (like templates)
        if not dns_enabled:
            continue

        # Skip hosts with ansible = "false" label
        if labels.get("ansible") == "false":
            continue

        fqdn = f"{hostname}.{domain}"
        # Use short hostname as inventory name, FQDN for connection
        inventory_name = hostname

        # Add to flake_hosts group
        inventory["flake_hosts"]["hosts"].append(inventory_name)

        # Add host variables
        inventory["_meta"]["hostvars"][inventory_name] = {
            "ansible_host": fqdn,  # Connect using FQDN
            "fqdn": fqdn,
            "tier": tier,
            "role": role,
        }

        # Group by tier
        tier_group = f"tier_{sanitize_group_name(tier)}"
        if tier_group not in tier_groups:
            tier_groups[tier_group] = []
        tier_groups[tier_group].append(inventory_name)

        # Group by role (if set)
        if role:
            role_group = f"role_{sanitize_group_name(role)}"
            if role_group not in role_groups:
                role_groups[role_group] = []
            role_groups[role_group].append(inventory_name)

    # Add tier groups to inventory
    for group_name, hosts in tier_groups.items():
        inventory[group_name] = {"hosts": hosts}

    # Add role groups to inventory
    for group_name, hosts in role_groups.items():
        inventory[group_name] = {"hosts": hosts}

    return inventory


def main():
    if len(sys.argv) < 2:
        print("Usage: dynamic_flake.py --list | --host <hostname>", file=sys.stderr)
        sys.exit(1)

    if sys.argv[1] == "--list":
        hosts_data = evaluate_flake()
        inventory = build_inventory(hosts_data)
        print(json.dumps(inventory, indent=2))
    elif sys.argv[1] == "--host":
        # Ansible calls this to get vars for a specific host
        # We provide all vars in _meta.hostvars, so just return empty
        print(json.dumps({}))
    else:
        print(f"Unknown option: {sys.argv[1]}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,5 @@
# Common variables for all hosts
ansible_user: root
domain: home.2rjus.net
vault_addr: https://vault01.home.2rjus.net:8200

View File

@@ -0,0 +1,13 @@
# Static inventory for non-flake hosts
#
# Hosts defined here are merged with the dynamic flake inventory.
# Use this for infrastructure that isn't managed by NixOS.
#
# Use short hostnames as inventory names with ansible_host for FQDN.
all:
  children:
    proxmox:
      hosts:
        pve1:
          ansible_host: pve1.home.2rjus.net

View File

@@ -15,13 +15,13 @@
- name: Build NixOS image
ansible.builtin.command:
cmd: "nixos-rebuild build-image --image-variant proxmox --flake .#template2"
chdir: "{{ playbook_dir }}/.."
chdir: "{{ playbook_dir }}/../.."
register: build_result
changed_when: true
- name: Find built image file
ansible.builtin.find:
paths: "{{ playbook_dir}}/../result"
paths: "{{ playbook_dir}}/../../result"
patterns: "*.vma.zst"
recurse: true
register: image_files
@@ -105,7 +105,7 @@
gather_facts: false
vars:
terraform_dir: "{{ playbook_dir }}/../terraform"
terraform_dir: "{{ playbook_dir }}/../../terraform"
tasks:
- name: Get image filename from earlier play

View File

@@ -1,7 +1,27 @@
---
# Provision OpenBao AppRole credentials to an existing host
# Usage: nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=ha1
# Provision OpenBao AppRole credentials to a host
#
# Usage: ansible-playbook ansible/playbooks/provision-approle.yml -l <hostname>
# Requires: BAO_ADDR and BAO_TOKEN environment variables set
#
# IMPORTANT: This playbook must target exactly one host to prevent
# accidentally regenerating credentials for multiple hosts.
- name: Validate single host target
hosts: all
gather_facts: false
tasks:
- name: Fail if targeting multiple hosts
ansible.builtin.fail:
msg: |
This playbook must target exactly one host.
Use: ansible-playbook provision-approle.yml -l <hostname>
Targeting multiple hosts would regenerate credentials for all of them,
potentially breaking existing services.
when: ansible_play_hosts | length != 1
run_once: true
- name: Fetch AppRole credentials from OpenBao
hosts: localhost
@@ -9,18 +29,17 @@
gather_facts: false
vars:
vault_addr: "{{ lookup('env', 'BAO_ADDR') | default('https://vault01.home.2rjus.net:8200', true) }}"
domain: "home.2rjus.net"
target_host: "{{ groups['all'] | first }}"
target_hostname: "{{ hostvars[target_host]['short_hostname'] | default(target_host.split('.')[0]) }}"
tasks:
- name: Validate hostname is provided
ansible.builtin.fail:
msg: "hostname variable is required. Use: -e hostname=<name>"
when: hostname is not defined
- name: Display target host
ansible.builtin.debug:
msg: "Provisioning AppRole credentials for: {{ target_hostname }}"
- name: Get role-id for host
ansible.builtin.command:
cmd: "bao read -field=role_id auth/approle/role/{{ hostname }}/role-id"
cmd: "bao read -field=role_id auth/approle/role/{{ target_hostname }}/role-id"
environment:
BAO_ADDR: "{{ vault_addr }}"
BAO_SKIP_VERIFY: "1"
@@ -29,25 +48,26 @@
- name: Generate secret-id for host
ansible.builtin.command:
cmd: "bao write -field=secret_id -f auth/approle/role/{{ hostname }}/secret-id"
cmd: "bao write -field=secret_id -f auth/approle/role/{{ target_hostname }}/secret-id"
environment:
BAO_ADDR: "{{ vault_addr }}"
BAO_SKIP_VERIFY: "1"
register: secret_id_result
changed_when: true
- name: Add target host to inventory
ansible.builtin.add_host:
name: "{{ hostname }}.{{ domain }}"
groups: vault_target
ansible_user: root
- name: Store credentials for next play
ansible.builtin.set_fact:
vault_role_id: "{{ role_id_result.stdout }}"
vault_secret_id: "{{ secret_id_result.stdout }}"
- name: Deploy AppRole credentials to host
hosts: vault_target
hosts: all
gather_facts: false
vars:
vault_role_id: "{{ hostvars['localhost']['vault_role_id'] }}"
vault_secret_id: "{{ hostvars['localhost']['vault_secret_id'] }}"
tasks:
- name: Create AppRole directory
ansible.builtin.file:

View File

@@ -0,0 +1,48 @@
---
# Reboot hosts with rolling strategy to avoid taking down redundant services
#
# Usage examples:
#   # Reboot a single host
#   ansible-playbook reboot.yml -l testvm01
#
#   # Reboot all test hosts (one at a time)
#   ansible-playbook reboot.yml -l tier_test
#
#   # Reboot all DNS servers safely (one at a time)
#   ansible-playbook reboot.yml -l role_dns
#
# Safety features:
#   - serial: 1 ensures only one host reboots at a time
#   - Waits for host to come back online before proceeding
#   - Groups hosts by role to avoid rebooting same-role hosts consecutively

- name: Reboot hosts (rolling)
  hosts: all
  serial: 1
  order: shuffle  # Randomize to spread out same-role hosts
  gather_facts: false
  vars:
    reboot_timeout: 300  # 5 minutes to wait for host to come back
  tasks:
    - name: Display reboot target
      ansible.builtin.debug:
        msg: "Rebooting {{ inventory_hostname }} (role: {{ role | default('none') }})"

    - name: Reboot the host
      ansible.builtin.systemd:
        name: reboot.target
        state: started
      async: 1
      poll: 0
      ignore_errors: true

    - name: Wait for host to come back online
      ansible.builtin.wait_for_connection:
        delay: 5
        timeout: "{{ reboot_timeout }}"

    - name: Display reboot result
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} rebooted successfully"

View File

@@ -0,0 +1,40 @@
---
# Restart a systemd service on target hosts
#
# Usage examples:
#   # Restart unbound on all DNS servers
#   ansible-playbook restart-service.yml -l role_dns -e service=unbound
#
#   # Restart nginx on a specific host
#   ansible-playbook restart-service.yml -l http-proxy.home.2rjus.net -e service=nginx
#
#   # Restart promtail on all prod hosts
#   ansible-playbook restart-service.yml -l tier_prod -e service=promtail

- name: Restart systemd service
  hosts: all
  gather_facts: false
  tasks:
    - name: Validate service name provided
      ansible.builtin.fail:
        msg: |
          The 'service' variable is required.
          Usage: ansible-playbook restart-service.yml -l <target> -e service=<name>
          Examples:
            -e service=nginx
            -e service=unbound
            -e service=promtail
      when: service is not defined
      run_once: true

    - name: Restart {{ service }}
      ansible.builtin.systemd:
        name: "{{ service }}"
        state: restarted
      register: restart_result

    - name: Display result
      ansible.builtin.debug:
        msg: "Service {{ service }} restarted on {{ inventory_hostname }}"

View File

@@ -151,11 +151,30 @@ Rationale:
- Well above NixOS system users (typically <1000)
- Avoids Podman/container issues with very high GIDs
### Completed (2026-02-08) - OAuth2/OIDC for Grafana
**OAuth2 client deployed for Grafana on monitoring02:**
- Client ID: `grafana`
- Redirect URL: `https://grafana-test.home.2rjus.net/login/generic_oauth`
- Scope maps: `openid`, `profile`, `email`, `groups` for `users` group
- Role mapping: `admins` group → Grafana Admin, others → Viewer
**Configuration locations:**
- Kanidm OAuth2 client: `services/kanidm/default.nix`
- Grafana OIDC config: `services/grafana/default.nix`
- Vault secret: `services/grafana/oauth2-client-secret`
**Key findings** (illustrated by the sketch after this list):
- PKCE is required by Kanidm - enable `use_pkce = true` in Grafana
- Must set `email_attribute_path`, `login_attribute_path`, `name_attribute_path` to extract from userinfo
- Users need: primary credential (password + TOTP for MFA), membership in `users` group, email address set
- Unix password is separate from primary credential (web login requires primary credential)
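A minimal sketch of the Grafana side of these findings (endpoint URLs are placeholders; the real settings live in `services/grafana/default.nix`):
```nix
{
  services.grafana.settings."auth.generic_oauth" = {
    enabled = true;
    client_id = "grafana";
    # client_secret comes from the Vault-fetched secret file, omitted here.
    use_pkce = true;                              # required by Kanidm
    scopes = "openid profile email groups";
    auth_url = "https://idm.example.net/ui/oauth2";                      # placeholder
    token_url = "https://idm.example.net/oauth2/token";                  # placeholder
    api_url = "https://idm.example.net/oauth2/openid/grafana/userinfo";  # placeholder
    email_attribute_path = "email";
    login_attribute_path = "preferred_username";
    name_attribute_path = "name";
    role_attribute_path = "contains(groups[*], 'admins') && 'Admin' || 'Viewer'";
  };
}
```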
### Next Steps
1. Enable PAM/NSS on production hosts (after test tier validation)
2. Configure TrueNAS LDAP client for NAS integration testing
3. Add OAuth2 clients (Grafana first)
3. Add OAuth2 clients for other services as needed
## References

View File

@@ -0,0 +1,135 @@
# monitoring02 Reboot Alert Investigation
**Date:** 2026-02-10
**Status:** Completed - False positive identified
## Summary
A `host_reboot` alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a **false positive** caused by NTP clock adjustments, not an actual reboot.
## Alert Details
- **Alert:** `host_reboot`
- **Rule:** `changes(node_boot_time_seconds[10m]) > 0`
- **Host:** monitoring02
- **Time:** 2026-02-10T16:27:36Z
## Investigation Findings
### Evidence Against Actual Reboot
1. **Uptime:** System had been up for ~40 hours (143,751 seconds) at time of alert
2. **Consistent BOOT_ID:** All logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time
3. **No log gaps:** Logs were continuous - no shutdown/restart cycle visible
4. **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal
### Root Cause: NTP Clock Adjustment
The `node_boot_time_seconds` metric fluctuated by 1 second due to how Linux calculates boot time:
```
btime = current_wall_clock_time - monotonic_uptime
```
When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:
| Metric | Value |
|--------|-------|
| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
| `node_timex_sync_status` | 1 (synced) |
| Current `node_timex_offset_seconds` | ~9ms (normal) |
The kernel's estimated maximum clock error spiked to over 1 second, causing the boot time calculation to drift momentarily.
Additionally, `systemd-resolved` logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.
## Current Time Sync Configuration
### NixOS Guests
- **NTP client:** systemd-timesyncd (NixOS default)
- **No explicit configuration** in the codebase
- Uses default NixOS NTP server pool
### Proxmox VMs
- **Clocksource:** `kvm-clock` (optimal for KVM VMs)
- **QEMU guest agent:** Enabled
- **No additional QEMU timing args** configured
## Potential Improvements
### 1. Improve Alert Rule (Recommended)
Add tolerance to filter out small NTP adjustments:
```yaml
# Current rule (triggers on any change)
expr: changes(node_boot_time_seconds[10m]) > 0
# Improved rule (requires >60 second shift)
expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
```
### 2. Switch to Chrony (Optional)
Chrony handles time adjustments more gracefully than systemd-timesyncd:
```nix
# In common/vm/qemu-guest.nix
{
services.qemuGuest.enable = true;
services.timesyncd.enable = false;
services.chrony = {
enable = true;
extraConfig = ''
makestep 1 3
rtcsync
'';
};
}
```
### 3. Add QEMU Timing Args (Optional)
In `terraform/vms.tf`:
```hcl
args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
```
### 4. Local NTP Server (Optional)
Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.
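A sketch of what serving NTP locally could look like with chrony on the DNS hosts (the subnet and stratum below are assumptions):
```nix
{
  # Hypothetical: let ns1/ns2 answer NTP queries for the homelab subnet.
  services.chrony = {
    enable = true;
    extraConfig = ''
      allow 10.69.13.0/24
      local stratum 10
    '';
  };
}
```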
## Monitoring NTP Health
The `node_timex_*` metrics from node_exporter provide visibility into NTP health:
```promql
# Clock offset from reference
node_timex_offset_seconds
# Sync status (1 = synced)
node_timex_sync_status
# Maximum estimated error - useful for alerting
node_timex_maxerror_seconds
```
A potential alert for NTP issues:
```yaml
- alert: ntp_clock_drift
expr: node_timex_maxerror_seconds > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High clock drift on {{ $labels.hostname }}"
description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
```
## Conclusion
No action required for the alert itself - the system was healthy. Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.

View File

@@ -0,0 +1,156 @@
# Nix Cache Host Reprovision
## Overview
Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master
## Status
**Phase 1: New Build Host** - COMPLETE
**Phase 2: NATS Build Triggering** - COMPLETE
**Phase 3: Safe Flake Update Workflow** - NOT STARTED
**Phase 4: Complete Migration** - COMPLETE
**Phase 5: Scheduled Builds** - COMPLETE
## Completed Work
### New Build Host (nix-cache02)
Instead of reprovisioning nix-cache01 in-place, we created a new host `nix-cache02` at 10.69.13.25:
- **Specs**: 8 CPU cores, 16GB RAM (temporary; to be increased to 24GB once nix-cache01 is decommissioned), 200GB disk
- **Provisioned via OpenTofu** with automatic Vault credential bootstrapping
- **Builder service** configured with two repos:
- `nixos-servers` → `git+https://git.t-juice.club/torjus/nixos-servers.git`
- `nixos` (gunter) → `git+https://git.t-juice.club/torjus/nixos.git`
### NATS-Based Build Triggering
The `homelab-deploy` tool was extended with a builder mode:
**NATS Subjects:**
- `build.<repo>.<target>` - e.g., `build.nixos-servers.all` or `build.nixos-servers.ns1`
**NATS Permissions (in DEPLOY account):**
| User | Publish | Subscribe |
|------|---------|-----------|
| Builder | `build.responses.>` | `build.>` |
| Test deployer | `deploy.test.>`, `deploy.discover`, `build.>` | `deploy.responses.>`, `deploy.discover`, `build.responses.>` |
| Admin deployer | `deploy.>`, `build.>` | `deploy.>`, `build.responses.>` |
**Vault Secrets:**
- `shared/homelab-deploy/builder-nkey` - NKey seed for builder authentication
**NixOS Configuration:**
- `hosts/nix-cache02/builder.nix` - Builder service configuration
- `services/nats/default.nix` - Updated with builder NATS user
**MCP Integration:**
- `.mcp.json` updated with `--enable-builds` flag
- Build tool available via MCP for Claude Code
**Tested:**
- Single host build: `build nixos-servers testvm01` (~30s)
- All hosts build: `build nixos-servers all` (16 hosts in ~226s)
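For reference, a manual invocation of the commands above might look roughly like this; the flags mirror the ones used by the scheduled-build service later in this plan, and the NKey path is an assumption for an operator machine:
```bash
# Sketch only - flags follow the scheduled-build service; the nkey path is an assumption
homelab-deploy build \
  --nats-url "nats://nats1.home.2rjus.net:4222" \
  --nkey-file /path/to/deployer-nkey \
  nixos-servers testvm01
```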
### Harmonia Binary Cache
- Parameterized `services/nix-cache/harmonia.nix` to use hostname-based Vault paths
- Parameterized `services/nix-cache/proxy.nix` for hostname-based domain
- New signing key: `nix-cache02.home.2rjus.net-1`
- Vault secret: `hosts/nix-cache02/cache-secret`
- Removed unused Gitea Actions runner from nix-cache01
## Current State
### nix-cache02 (Active)
- Running at 10.69.13.25
- Serving `https://nix-cache.home.2rjus.net` (canonical URL)
- Builder service active, responding to NATS build requests
- Metrics exposed on port 9973 (`homelab-deploy-builder` job)
- Harmonia binary cache server running
- Signing key: `nix-cache02.home.2rjus.net-1`
- Prod tier with `build-host` role
### nix-cache01 (Decommissioned)
- VM deleted from Proxmox
- Host configuration removed from repo
- Vault AppRole and secrets removed
- Old signing key removed from trusted-public-keys
## Remaining Work
### Phase 3: Safe Flake Update Workflow
1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run
### Phase 4: Complete Migration ✅
1. ~~**Add Harmonia to nix-cache02**~~ ✅ Done - new signing key, parameterized service
2. ~~**Add trusted public key to all hosts**~~ ✅ Done - `system/nix.nix` updated
3. ~~**Test cache from other hosts**~~ ✅ Done - verified from testvm01
4. ~~**Update proxy and DNS**~~ ✅ Done - `nix-cache.home.2rjus.net` CNAME now points to nix-cache02
5. ~~**Deploy to all hosts**~~ ✅ Done - all hosts have new trusted key
6. ~~**Decommission nix-cache01**~~ ✅ Done - 2026-02-10:
- Removed `hosts/nix-cache01/` directory
- Removed `services/nix-cache/build-flakes.{nix,sh}`
- Removed Vault AppRole and secrets
- Removed old signing key from `system/nix.nix`
- Removed from `flake.nix`
- Deleted VM from Proxmox
### Phase 5: Scheduled Builds ✅
Implemented a systemd timer on nix-cache02 that triggers builds every 2 hours:
- **Timer**: `scheduled-build.timer` runs every 2 hours with 5m random jitter
- **Service**: `scheduled-build.service` calls `homelab-deploy build` for both repos
- **Authentication**: Dedicated scheduler NKey stored in Vault
- **NATS user**: Added to DEPLOY account with publish `build.>` and subscribe `build.responses.>`
Files:
- `hosts/nix-cache02/scheduler.nix` - Timer and service configuration
- `services/nats/default.nix` - Scheduler NATS user
- `terraform/vault/secrets.tf` - Scheduler NKey secret
- `terraform/vault/variables.tf` - Variable for scheduler NKey
## Resolved Questions
- **Parallel vs sequential builds?** Sequential; hosts share most packages, so subsequent builds are fast after the first
- **What about gunter?** Configured as `nixos` repo in builder settings
- **Disk size?** 200GB for new host
- **Build host specs?** 8 cores and 16-24GB RAM, matching the current nix-cache01
### Phase 6: Observability
1. **Alerting rules** for build failures:
```promql
# Alert if any build fails
increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
# Alert if no successful builds in 24h (scheduled builds stopped)
time() - homelab_deploy_build_last_success_timestamp > 86400
```
2. **Grafana dashboard** for build metrics:
- Build success/failure rate over time
- Average build duration per host (histogram)
- Build frequency (builds per hour/day)
- Last successful build timestamp per repo
Available metrics:
- `homelab_deploy_builds_total{repo, status}` - total builds by repo and status
- `homelab_deploy_build_host_total{repo, host, status}` - per-host build counts
- `homelab_deploy_build_duration_seconds_{bucket,sum,count}` - build duration histogram
- `homelab_deploy_build_last_timestamp{repo}` - last build attempt
- `homelab_deploy_build_last_success_timestamp{repo}` - last successful build
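As a sketch of how the expressions above could be wired into Prometheus alerting rules; the rule names, labels, and thresholds are assumptions:
```yaml
groups:
  - name: nix-builds
    rules:
      # Hypothetical rule names and labels; expressions taken from the notes above
      - alert: nix_build_failure
        expr: increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Build failed for {{ $labels.host }} ({{ $labels.repo }})"
      - alert: nix_builds_stale
        expr: time() - homelab_deploy_build_last_success_timestamp > 86400
        labels:
          severity: warning
        annotations:
          summary: "No successful build for {{ $labels.repo }} in 24h"
```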
## Open Questions
- [x] ~~When to cut over DNS from nix-cache01 to nix-cache02?~~ Done - 2026-02-10
- [ ] Implement safe flake update workflow before or after full migration?

View File

@@ -0,0 +1,87 @@
# OpenBao + Kanidm OIDC Integration
## Status: Completed
Implemented 2026-02-09.
## Overview
Enable Kanidm users to authenticate to OpenBao (Vault) using OIDC for Web UI access. Members of the `admins` group get full read/write access to secrets.
## Implementation
### Files Modified
| File | Changes |
|------|---------|
| `terraform/vault/oidc.tf` | New - OIDC auth backend and roles |
| `terraform/vault/policies.tf` | Added oidc-admin and oidc-default policies |
| `terraform/vault/secrets.tf` | Added OAuth2 client secret |
| `terraform/vault/approle.tf` | Granted kanidm01 access to openbao secrets |
| `services/kanidm/default.nix` | Added openbao OAuth2 client, enabled imperative group membership |
### Kanidm Configuration
OAuth2 client `openbao` with:
- Confidential client (uses client secret)
- Web UI callback only: `https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback`
- Legacy crypto enabled (RS256 for OpenBao compatibility)
- Scope maps for `admins` and `users` groups
Group membership is now managed imperatively (`overwriteMembers = false`) to prevent provisioning from resetting group memberships on service restart.
### OpenBao Configuration
OIDC auth backend at `/oidc` with two roles:
| Role | Bound Claims | Policy | Access |
|------|--------------|--------|--------|
| `admin` | `groups = admins@home.2rjus.net` | `oidc-admin` | Full read/write to secrets, system health/metrics |
| `default` | (none) | `oidc-default` | Token lookup-self, system health |
Both roles request scopes: `openid`, `profile`, `email`, `groups`
### Policies
**oidc-admin:**
- `secret/*` - create, read, update, delete, list
- `sys/health` - read
- `sys/metrics` - read
- `sys/auth` - read
- `sys/mounts` - read
**oidc-default:**
- `auth/token/lookup-self` - read
- `sys/health` - read
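As a sketch, the oidc-admin policy described above would look roughly like this in HCL (assuming the default KV mount at `secret/`):
```hcl
# Sketch of the oidc-admin policy; paths and capabilities follow the list above
path "secret/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

path "sys/health" {
  capabilities = ["read"]
}

path "sys/metrics" {
  capabilities = ["read"]
}

path "sys/auth" {
  capabilities = ["read"]
}

path "sys/mounts" {
  capabilities = ["read"]
}
```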
## Usage
### Web UI Login
1. Navigate to https://vault.home.2rjus.net:8200
2. Select "OIDC" authentication method
3. Enter role: `admin` (for admins) or `default` (for any user)
4. Click "Sign in with OIDC"
5. Authenticate with Kanidm
### Group Management
Add users to admins group for full access:
```bash
kanidm group add-members admins <username>
```
## Limitations
**CLI login not supported:** Kanidm requires HTTPS for all redirect URIs on confidential (non-public) OAuth2 clients. The OpenBao CLI uses `http://localhost:8250/oidc/callback`, which Kanidm rejects. Public clients would allow localhost redirects, but OpenBao requires a client secret for OIDC auth.
## Lessons Learned
1. **Kanidm group names:** Groups are returned as `groupname@domain` (e.g., `admins@home.2rjus.net`), not just the short name
2. **RS256 required:** OpenBao only supports RS256 for JWT signing; Kanidm defaults to ES256, requiring `enableLegacyCrypto = true`
3. **Scope request:** OIDC roles must explicitly request the `groups` scope via `oidc_scopes`
4. **Provisioning resets:** Kanidm provisioning with default `overwriteMembers = true` resets group memberships on restart
5. **Two-phase Terraform:** Secret must exist before OIDC backend can validate discovery URL
## References
- [OpenBao JWT/OIDC Auth Method](https://openbao.org/docs/auth/jwt/)
- [Kanidm OAuth2 Documentation](https://kanidm.github.io/kanidm/stable/integrations/oauth2.html)

View File

@@ -0,0 +1,46 @@
# Garage S3 Storage Server
## Overview
Deploy a Garage instance for self-hosted S3-compatible object storage.
## Garage Basics
- S3-compatible distributed object storage designed for self-hosting
- Supports per-key, per-bucket permissions (read/write/owner)
- Keys without explicit grants have no access
## NixOS Module
Available as `services.garage` with these key options:
- `services.garage.enable` - Enable the service
- `services.garage.package` - Must be set explicitly
- `services.garage.settings` - Freeform TOML config (replication mode, ports, RPC, etc.)
- `services.garage.settings.metadata_dir` - Metadata storage (SSD recommended)
- `services.garage.settings.data_dir` - Data block storage (supports multiple dirs since v0.9)
- `services.garage.environmentFile` - For secrets like `GARAGE_RPC_SECRET`
- `services.garage.logLevel` - error/warn/info/debug/trace
The NixOS module only manages the server daemon. Buckets and keys are managed externally.
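A minimal single-node sketch using these options is shown below; the package attribute, paths, ports, and replication setting are assumptions and depend on the Garage version chosen:
```nix
# Single-node sketch only; option values are assumptions
{ pkgs, ... }:
{
  services.garage = {
    enable = true;
    package = pkgs.garage;                       # must be set explicitly
    environmentFile = "/run/secrets/garage-env"; # provides GARAGE_RPC_SECRET
    settings = {
      replication_factor = 1;                    # older releases use replication_mode instead
      metadata_dir = "/var/lib/garage/meta";     # SSD recommended
      data_dir = "/var/lib/garage/data";
      rpc_bind_addr = "[::]:3901";
      s3_api = {
        api_bind_addr = "[::]:3900";
        s3_region = "garage";
      };
    };
  };
}
```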
## Bucket/Key Management
No declarative NixOS options for buckets or keys. Two options:
1. **Terraform provider** - `jkossis/terraform-provider-garage` manages buckets, keys, and permissions via the Garage Admin API v2. Could live in `terraform/garage/` similar to `terraform/vault/`.
2. **CLI** - `garage key create`, `garage bucket create`, `garage bucket allow`
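If the CLI route is chosen, provisioning would look roughly like this; the bucket and key names are examples, and the exact flag syntax may vary between Garage versions:
```bash
# Example names only; exact flag syntax depends on the Garage version
garage key create backups-app-key
garage bucket create backups
garage bucket allow --read --write backups --key backups-app-key
```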
## Integration Ideas
- Store Garage API keys in Vault, fetch via `vault.secrets` on consuming hosts
- Terraform manages both Vault secrets and Garage buckets/keys
- Enable admin API with token for Terraform provider access
- Add Prometheus metrics scraping (Garage exposes metrics endpoint)
## Open Questions
- Single-node or multi-node replication?
- Which host to deploy on?
- What to store? (backups, media, app data)
- Expose via HTTP proxy or direct S3 API only?

View File

@@ -169,9 +169,30 @@ Once ready to cut over:
- Destroy VM in Proxmox
- Remove from terraform state
## Current Progress
### monitoring02 Host Created (2026-02-08)
Host deployed at 10.69.13.24 (test tier) with:
- 4 CPU cores, 8GB RAM, 60GB disk
- Vault integration enabled
- NATS-based remote deployment enabled
### Grafana with Kanidm OIDC (2026-02-08)
Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):
- Kanidm OIDC authentication (PKCE enabled)
- Role mapping: `admins` → Admin, others → Viewer
- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
- Local Caddy for TLS termination via internal ACME CA
This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing
`services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/`
module once monitoring02 becomes the primary monitoring host.
## Open Questions
- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
## VictoriaMetrics Service Configuration

View File

@@ -1,212 +0,0 @@
# Nix Cache Host Reprovision
## Overview
Reprovision `nix-cache01` using the OpenTofu workflow, and improve the build/cache system with:
1. NATS-based remote build triggering (replacing the current bash script)
2. Safer flake update workflow that validates builds before pushing to master
## Current State
### Host Configuration
- `nix-cache01` at 10.69.13.15 serves the binary cache via Harmonia
- Runs Gitea Actions runner for CI workflows
- Has `homelab.deploy.enable = true` (already supports NATS-based deployment)
- Uses a dedicated XFS volume at `/nix` for cache storage
### Current Build System (`services/nix-cache/build-flakes.sh`)
- Runs every 30 minutes via systemd timer
- Clones/pulls two repos: `nixos-servers` and `nixos` (gunter)
- Builds all hosts with `nixos-rebuild build` (no blacklist despite docs mentioning it)
- Pushes success/failure metrics to pushgateway
- Simple but has no filtering, no parallelism, no remote triggering
### Current Flake Update Workflow (`.github/workflows/flake-update.yaml`)
- Runs daily at midnight via cron
- Runs `nix flake update --commit-lock-file`
- Pushes directly to master
- No build validation — can push broken inputs
## Improvement 1: NATS-Based Remote Build Triggering
### Design
Extend the existing `homelab-deploy` tool to support a "build" command that triggers builds on the cache host. This reuses the NATS infrastructure already in place.
| Approach | Pros | Cons |
|----------|------|------|
| Extend homelab-deploy | Reuses existing NATS auth, NKey handling, CLI | Adds scope to existing tool |
| New nix-cache-tool | Clean separation | Duplicate NATS boilerplate, new credentials |
| Gitea Actions webhook | No custom tooling | Less flexible, tied to Gitea |
**Recommendation:** Extend `homelab-deploy` with a build subcommand. The tool already has NATS client code, authentication handling, and a listener module in NixOS.
### Implementation
1. Add new message type to homelab-deploy: `build.<host>` subject
2. Listener on nix-cache01 subscribes to `build.>` wildcard
3. On message receipt, builds the specified host and returns success/failure
4. CLI command: `homelab-deploy build <hostname>` or `homelab-deploy build --all`
### Benefits
- Trigger rebuild for specific host to ensure it's cached
- Could be called from CI after merging PRs
- Reuses existing NATS infrastructure and auth
- Progress/status could stream back via NATS reply
## Improvement 2: Smarter Flake Update Workflow
### Current Problems
1. Updates can push breaking changes to master
2. No visibility into what broke when it does
3. Hosts that auto-update can pull broken configs
### Proposed Workflow
```
┌─────────────────────────────────────────────────────────────────┐
│ Flake Update Workflow │
├─────────────────────────────────────────────────────────────────┤
│ 1. nix flake update (on feature branch) │
│ 2. Build ALL hosts locally │
│ 3. If all pass → fast-forward merge to master │
│ 4. If any fail → create PR with failure logs attached │
└─────────────────────────────────────────────────────────────────┘
```
### Implementation Options
| Option | Description | Pros | Cons |
|--------|-------------|------|------|
| **A: Self-hosted runner** | Build on nix-cache01 | Fast (local cache), simple | Ties up cache host during build |
| **B: Gitea Actions only** | Use container runner | Clean separation | Slow (no cache), resource limits |
| **C: Hybrid** | Trigger builds on nix-cache01 via NATS from Actions | Best of both | More complex |
**Recommendation:** Option A with nix-cache01 as the runner. The host is already running Gitea Actions runner and has the cache. Building all ~16 hosts is disk I/O heavy but feasible on dedicated hardware.
### Workflow Steps
1. Workflow runs on schedule (daily or weekly)
2. Creates branch `flake-update/YYYY-MM-DD`
3. Runs `nix flake update --commit-lock-file`
4. Builds each host: `nix build .#nixosConfigurations.<host>.config.system.build.toplevel`
5. If all succeed:
- Fast-forward merge to master
- Delete feature branch
6. If any fail:
- Create PR from the update branch
- Attach build logs as PR comment
- Label PR with `needs-review` or `build-failure`
- Do NOT merge automatically
### Workflow File Changes
```yaml
# New: .github/workflows/flake-update-safe.yaml
name: Safe flake update
on:
schedule:
- cron: "0 2 * * 0" # Weekly on Sunday at 2 AM
workflow_dispatch: # Manual trigger
jobs:
update-and-validate:
runs-on: homelab # Use self-hosted runner on nix-cache01
steps:
- uses: actions/checkout@v4
with:
ref: master
fetch-depth: 0 # Need full history for merge
- name: Create update branch
run: |
BRANCH="flake-update/$(date +%Y-%m-%d)"
git checkout -b "$BRANCH"
- name: Update flake
run: nix flake update --commit-lock-file
- name: Build all hosts
id: build
run: |
FAILED=""
for host in $(nix flake show --json | jq -r '.nixosConfigurations | keys[]'); do
echo "Building $host..."
if ! nix build ".#nixosConfigurations.$host.config.system.build.toplevel" 2>&1 | tee "build-$host.log"; then
FAILED="$FAILED $host"
fi
done
echo "failed=$FAILED" >> $GITHUB_OUTPUT
- name: Merge to master (if all pass)
if: steps.build.outputs.failed == ''
run: |
git checkout master
git merge --ff-only "$BRANCH"
git push origin master
git push origin --delete "$BRANCH"
- name: Create PR (if any fail)
if: steps.build.outputs.failed != ''
run: |
git push origin "$BRANCH"
# Create PR via Gitea API with build logs
# ... (PR creation with log attachment)
```
## Migration Steps
### Phase 1: Reprovision Host via OpenTofu
1. Add `nix-cache01` to `terraform/vms.tf`:
```hcl
"nix-cache01" = {
ip = "10.69.13.15/24"
cpu_cores = 4
memory = 8192
disk_size = "100G" # Larger for nix store
}
```
2. Shut down existing nix-cache01 VM
3. Run `tofu apply` to provision new VM
4. Verify bootstrap completes and cache is serving
**Note:** The cache will be cold after reprovision. Run initial builds to populate.
### Phase 2: Add Build Triggering to homelab-deploy
1. Add `build` command to homelab-deploy CLI
2. Add listener handler in NixOS module for `build.*` subjects
3. Update nix-cache01 config to enable build listener
4. Test with `homelab-deploy build testvm01`
### Phase 3: Implement Safe Flake Update Workflow
1. Create `.github/workflows/flake-update-safe.yaml`
2. Disable or remove old `flake-update.yaml`
3. Test manually with `workflow_dispatch`
4. Monitor first automated run
### Phase 4: Remove Old Build Script
1. After new workflow is stable, remove:
- `services/nix-cache/build-flakes.nix`
- `services/nix-cache/build-flakes.sh`
2. The new workflow handles scheduled builds
## Open Questions
- [ ] What runner labels should the self-hosted runner use for the update workflow?
- [ ] Should we build hosts in parallel (faster) or sequentially (easier to debug)?
- [ ] How long to keep flake-update PRs open before auto-closing stale ones?
- [ ] Should successful updates trigger a NATS notification to rebuild all hosts?
- [ ] What to do about `gunter` (external nixos repo) - include in validation?
- [ ] Disk size for new nix-cache01 - is 100G enough for cache + builds?
## Notes
- The existing `homelab.deploy.enable = true` on nix-cache01 means it already has NATS connectivity
- The Harmonia service and cache signing key will work the same after reprovision
- Actions runner token is in Vault, will be provisioned automatically
- Consider adding a `homelab.host.role = "build-host"` label for monitoring/filtering

View File

@@ -43,11 +43,21 @@ kanidm person posix set-password <username>
kanidm person posix set <username> --shell /bin/zsh
```
### Setting Email Address
Email is required for OAuth2/OIDC login (e.g., Grafana):
```bash
kanidm person update <username> --mail <email>
```
### Example: Full User Creation
```bash
kanidm person create testuser "Test User"
kanidm person update testuser --mail testuser@home.2rjus.net
kanidm group add-members ssh-users testuser
kanidm group add-members users testuser # Required for OAuth2 scopes
kanidm person posix set testuser
kanidm person posix set-password testuser
kanidm person get testuser
@@ -129,6 +139,40 @@ Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned G
| 65,536+ | Users (auto-assigned) |
| 68,000 - 68,999 | Groups (manually assigned) |
## OAuth2/OIDC Login (Web Services)
For OAuth2/OIDC login to web services like Grafana, users need:
1. **Primary credential** - Password set via `credential update` (separate from unix password)
2. **MFA** - TOTP or passkey (Kanidm requires MFA for primary credentials)
3. **Group membership** - Member of `users` group (for OAuth2 scope mapping)
4. **Email address** - Set via `person update --mail`
### Setting Up Primary Credential (Web Login)
The primary credential is different from the unix/POSIX password:
```bash
# Interactive credential setup
kanidm person credential update <username>
# In the interactive prompt:
# 1. Type 'password' to set a password
# 2. Type 'totp' to add TOTP (scan QR with authenticator app)
# 3. Type 'commit' to save
```
### Verifying OAuth2 Readiness
```bash
kanidm person get <username>
```
Check for:
- `mail:` - Email address set
- `memberof:` - Includes `users@home.2rjus.net`
- Primary credential status (check via `credential update` → `status`)
## PAM/NSS Client Configuration
Enable central authentication on a host:

flake.lock generated
View File

@@ -28,11 +28,11 @@
]
},
"locked": {
"lastModified": 1770481834,
"narHash": "sha256-Xx9BYnI0C/qgPbwr9nj6NoAdQTbYLunrdbNSaUww9oY=",
"lastModified": 1771004123,
"narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
"ref": "master",
"rev": "fd0d63b103dfaf21d1c27363266590e723021c67",
"revCount": 24,
"rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
"revCount": 36,
"type": "git",
"url": "https://git.t-juice.club/torjus/homelab-deploy"
},
@@ -49,11 +49,11 @@
]
},
"locked": {
"lastModified": 1770422522,
"narHash": "sha256-WmIFnquu4u58v8S2bOVWmknRwHn4x88CRfBFTzJ1inQ=",
"lastModified": 1770593543,
"narHash": "sha256-hT8Rj6JAwGDFvcxWEcUzTCrWSiupCfBa57pBDnM2C5g=",
"ref": "refs/heads/master",
"rev": "cf0ce858997af4d8dcc2ce10393ff393e17fc911",
"revCount": 11,
"rev": "5aa5f7275b7a08015816171ba06d2cbdc2e02d3e",
"revCount": 15,
"type": "git",
"url": "https://git.t-juice.club/torjus/nixos-exporter"
},
@@ -64,11 +64,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1770136044,
"narHash": "sha256-tlFqNG/uzz2++aAmn4v8J0vAkV3z7XngeIIB3rM3650=",
"lastModified": 1770770419,
"narHash": "sha256-iKZMkr6Cm9JzWlRYW/VPoL0A9jVKtZYiU4zSrVeetIs=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "e576e3c9cf9bad747afcddd9e34f51d18c855b4e",
"rev": "6c5e707c6b5339359a9a9e215c5e66d6d802fd7a",
"type": "github"
},
"original": {
@@ -80,11 +80,11 @@
},
"nixpkgs-unstable": {
"locked": {
"lastModified": 1770197578,
"narHash": "sha256-AYqlWrX09+HvGs8zM6ebZ1pwUqjkfpnv8mewYwAo+iM=",
"lastModified": 1770562336,
"narHash": "sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "00c21e4c93d963c50d4c0c89bfa84ed6e0694df2",
"rev": "d6c71932130818840fc8fe9509cf50be8c64634f",
"type": "github"
},
"original": {

View File

@@ -110,15 +110,6 @@
./hosts/jelly01
];
};
nix-cache01 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/nix-cache01
];
};
nats1 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
@@ -191,6 +182,24 @@
./hosts/kanidm01
];
};
monitoring02 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/monitoring02
];
};
nix-cache02 = nixpkgs.lib.nixosSystem {
inherit system;
specialArgs = {
inherit inputs self;
};
modules = commonModules ++ [
./hosts/nix-cache02
];
};
};
packages = forAllSystems (
{ pkgs }:
@@ -208,9 +217,11 @@
pkgs.opentofu
pkgs.openbao
pkgs.kanidm_1_8
pkgs.nkeys
(pkgs.callPackage ./scripts/create-host { })
homelab-deploy.packages.${pkgs.system}.default
];
ANSIBLE_CONFIG = "./ansible/ansible.cfg";
};
}
);

View File

@@ -13,6 +13,8 @@
../../common/vm
];
homelab.host.role = "home-automation";
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
@@ -85,6 +87,7 @@
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
# Open ports in the firewall.

View File

@@ -11,6 +11,7 @@
../../common/vm
];
homelab.host.role = "proxy";
homelab.dns.cnames = [
"nzbget"
"radarr"

View File

@@ -11,6 +11,8 @@
../../common/vm
];
homelab.host.role = "media";
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {

View File

@@ -14,9 +14,8 @@
../../services/kanidm
];
# Host metadata
homelab.host = {
tier = "test";
tier = "prod";
role = "auth";
};

View File

@@ -11,6 +11,8 @@
../../common/vm
];
homelab.host.role = "monitoring";
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
@@ -81,6 +83,7 @@
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
services.restic.backups.grafana-db = {
@@ -98,6 +101,7 @@
"--keep-monthly 6"
"--keep-within 1d"
];
extraOptions = [ "--retry-lock=5m" ];
};
# Open ports in the firewall.

View File

@@ -1,33 +1,37 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
./hardware-configuration.nix
../template2/hardware-configuration.nix
../../system
../../common/vm
];
homelab.dns.cnames = [ "nix-cache" "actions1" ];
homelab.host.role = "build-host";
fileSystems."/nix" = {
device = "/dev/disk/by-label/nixcache";
fsType = "xfs";
homelab.host = {
tier = "prod";
role = "monitoring";
};
# DNS CNAME for Grafana test instance
homelab.dns.cnames = [ "grafana-test" ];
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {
enable = true;
device = "/dev/sda";
configurationLimit = 3;
};
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "nix-cache01";
networking.hostName = "monitoring02";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
@@ -41,7 +45,7 @@
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.15/24"
"10.69.13.24/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
@@ -54,9 +58,6 @@
"nix-command"
"flakes"
];
vault.enable = true;
homelab.deploy.enable = true;
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
@@ -64,13 +65,11 @@
git
];
services.qemuGuest.enable = true;
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "24.05"; # Did you read the comment?
}
system.stateVersion = "25.11"; # Did you read the comment?
}

View File

@@ -0,0 +1,6 @@
{ ... }: {
imports = [
./configuration.nix
../../services/grafana
];
}

View File

@@ -11,6 +11,8 @@
../../common/vm
];
homelab.host.role = "messaging";
nixpkgs.config.allowUnfree = true;
# Use the systemd-boot EFI boot loader.
boot.loader.grub = {

View File

@@ -1,42 +0,0 @@
{
config,
lib,
pkgs,
modulesPath,
...
}:
{
imports = [
(modulesPath + "/profiles/qemu-guest.nix")
];
boot.initrd.availableKernelModules = [
"ata_piix"
"uhci_hcd"
"virtio_pci"
"virtio_scsi"
"sd_mod"
"sr_mod"
];
boot.initrd.kernelModules = [ "dm-snapshot" ];
boot.kernelModules = [
"ptp_kvm"
];
boot.extraModulePackages = [ ];
fileSystems."/" = {
device = "/dev/disk/by-label/root";
fsType = "xfs";
};
swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
# Enables DHCP on each ethernet and wireless interface. In case of scripted networking
# (the default) this is the recommended approach. When using systemd-networkd it's
# still possible to use this option, but it's recommended to use it in conjunction
# with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
networking.useDHCP = lib.mkDefault true;
# networking.interfaces.ens18.useDHCP = lib.mkDefault true;
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}

View File

@@ -0,0 +1,45 @@
{ config, ... }:
{
# Fetch builder NKey from Vault
vault.secrets.builder-nkey = {
secretPath = "shared/homelab-deploy/builder-nkey";
extractKey = "nkey";
outputDir = "/run/secrets/builder-nkey";
services = [ "homelab-deploy-builder" ];
};
# Configure the builder service
services.homelab-deploy.builder = {
enable = true;
natsUrl = "nats://nats1.home.2rjus.net:4222";
nkeyFile = "/run/secrets/builder-nkey";
settings.repos = {
nixos-servers = {
url = "git+https://git.t-juice.club/torjus/nixos-servers.git";
defaultBranch = "master";
};
nixos = {
url = "git+https://git.t-juice.club/torjus/nixos.git";
defaultBranch = "master";
};
};
timeout = 7200;
metrics.enable = true;
};
# Expose builder metrics for Prometheus scraping
homelab.monitoring.scrapeTargets = [
{
job_name = "homelab-deploy-builder";
port = 9973;
}
];
# Ensure builder starts after vault secret is available
systemd.services.homelab-deploy-builder = {
after = [ "vault-secret-builder-nkey.service" ];
requires = [ "vault-secret-builder-nkey.service" ];
};
}

View File

@@ -0,0 +1,74 @@
{
config,
lib,
pkgs,
...
}:
{
imports = [
../template2/hardware-configuration.nix
../../system
../../common/vm
];
homelab.host = {
tier = "prod";
role = "build-host";
};
homelab.dns.cnames = [ "nix-cache" ];
# Enable Vault integration
vault.enable = true;
# Enable remote deployment via NATS
homelab.deploy.enable = true;
nixpkgs.config.allowUnfree = true;
boot.loader.grub.enable = true;
boot.loader.grub.device = "/dev/vda";
networking.hostName = "nix-cache02";
networking.domain = "home.2rjus.net";
networking.useNetworkd = true;
networking.useDHCP = false;
services.resolved.enable = true;
networking.nameservers = [
"10.69.13.5"
"10.69.13.6"
];
systemd.network.enable = true;
systemd.network.networks."ens18" = {
matchConfig.Name = "ens18";
address = [
"10.69.13.25/24"
];
routes = [
{ Gateway = "10.69.13.1"; }
];
linkConfig.RequiredForOnline = "routable";
};
time.timeZone = "Europe/Oslo";
nix.settings.experimental-features = [
"nix-command"
"flakes"
];
nix.settings.tarball-ttl = 0;
environment.systemPackages = with pkgs; [
vim
wget
git
];
# Open ports in the firewall.
# networking.firewall.allowedTCPPorts = [ ... ];
# networking.firewall.allowedUDPPorts = [ ... ];
# Or disable the firewall altogether.
networking.firewall.enable = false;
system.stateVersion = "25.11"; # Did you read the comment?
}

View File

@@ -1,8 +1,8 @@
{ ... }:
{
{ ... }: {
imports = [
./configuration.nix
./builder.nix
./scheduler.nix
../../services/nix-cache
../../services/actions-runner
];
}
}

View File

@@ -0,0 +1,61 @@
{ config, pkgs, lib, inputs, ... }:
let
homelab-deploy = inputs.homelab-deploy.packages.${pkgs.system}.default;
scheduledBuildScript = pkgs.writeShellApplication {
name = "scheduled-build";
runtimeInputs = [ homelab-deploy ];
text = ''
NATS_URL="nats://nats1.home.2rjus.net:4222"
NKEY_FILE="/run/secrets/scheduler-nkey"
echo "Starting scheduled builds at $(date)"
# Build all nixos-servers hosts
homelab-deploy build \
--nats-url "$NATS_URL" \
--nkey-file "$NKEY_FILE" \
nixos-servers --all
# Build all nixos (gunter) hosts
homelab-deploy build \
--nats-url "$NATS_URL" \
--nkey-file "$NKEY_FILE" \
nixos --all
echo "Scheduled builds completed at $(date)"
'';
};
in
{
# Fetch scheduler NKey from Vault
vault.secrets.scheduler-nkey = {
secretPath = "shared/homelab-deploy/scheduler-nkey";
extractKey = "nkey";
outputDir = "/run/secrets/scheduler-nkey";
services = [ "scheduled-build" ];
};
# Timer: every 2 hours
systemd.timers.scheduled-build = {
description = "Trigger scheduled Nix builds";
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "*-*-* 00/2:00:00"; # Every 2 hours at :00
Persistent = true; # Run missed builds on boot
RandomizedDelaySec = "5m"; # Slight jitter
};
};
# Service: oneshot that triggers builds
systemd.services.scheduled-build = {
description = "Trigger builds for all hosts via NATS";
after = [ "network-online.target" "vault-secret-scheduler-nkey.service" ];
requires = [ "vault-secret-scheduler-nkey.service" ];
wants = [ "network-online.target" ];
serviceConfig = {
Type = "oneshot";
ExecStart = lib.getExe scheduledBuildScript;
};
};
}

View File

@@ -35,6 +35,7 @@
homelab.host = {
tier = "test";
priority = "low";
labels.ansible = "false"; # Exclude from Ansible inventory
};
boot.loader.grub.enable = true;

View File

@@ -14,9 +14,9 @@
../../common/ssh-audit.nix
];
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
tier = "test";
role = "test";
};
# Enable Vault integration

View File

@@ -14,9 +14,9 @@
../../common/ssh-audit.nix
];
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
tier = "test";
role = "test";
};
# Enable Vault integration

View File

@@ -14,9 +14,9 @@
../../common/ssh-audit.nix
];
# Host metadata (adjust as needed)
homelab.host = {
tier = "test"; # Start in test tier, move to prod after validation
tier = "test";
role = "test";
};
# Enable Vault integration

View File

@@ -58,10 +58,9 @@ let
};
# Build effective labels for a host
# Always includes hostname; only includes tier/priority/role if non-default
# Always includes hostname and tier; only includes priority/role if non-default
buildEffectiveLabels = host:
{ hostname = host.hostname; }
// (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
{ hostname = host.hostname; tier = host.tier; }
// (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
// (lib.optionalAttrs (host.role != null) { role = host.role; })
// host.labels;

View File

@@ -1,5 +0,0 @@
[proxmox]
pve1.home.2rjus.net
[proxmox:vars]
ansible_user=root

View File

@@ -1,57 +0,0 @@
{ pkgs, config, ... }:
{
vault.secrets.actions-token = {
secretPath = "hosts/nix-cache01/actions-token";
extractKey = "token";
outputDir = "/run/secrets/actions-token-1";
services = [ "gitea-runner-actions1" ];
};
virtualisation.podman = {
enable = true;
dockerCompat = true;
};
services.gitea-actions-runner.instances = {
actions1 = {
enable = true;
tokenFile = "/run/secrets/actions-token-1";
name = "actions1.home.2rjus.net";
settings = {
log = {
level = "debug";
};
runner = {
file = ".runner";
capacity = 4;
timeout = "2h";
shutdown_timeout = "10m";
insecure = false;
fetch_timeout = "10s";
fetch_interval = "30s";
};
cache = {
enabled = true;
dir = "/var/cache/gitea-actions1";
};
container = {
privileged = false;
};
};
labels =
builtins.map (n: "${n}:docker://gitea/runner-images:${n}") [
"ubuntu-latest"
"ubuntu-latest-slim"
"ubuntu-latest-full"
]
++ [
"homelab"
];
url = "https://git.t-juice.club";
};
};
}

View File

@@ -0,0 +1,446 @@
{
"uid": "certificates-homelab",
"title": "TLS Certificates",
"tags": ["certificates", "tls", "security", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "5m",
"time": {
"from": "now-7d",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "Endpoints Monitored",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})",
"legendFormat": "Total",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Total number of TLS endpoints being monitored"
},
{
"id": 2,
"title": "Probe Failures",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)",
"legendFormat": "Failing",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": 1}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Number of endpoints where TLS probe is failing"
},
{
"id": 3,
"title": "Expiring Soon (< 7d)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)",
"legendFormat": "Warning",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Certificates expiring within 7 days"
},
{
"id": 4,
"title": "Expiring Critical (< 24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)",
"legendFormat": "Critical",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": 1}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Certificates expiring within 24 hours"
},
{
"id": 5,
"title": "Minimum Days Remaining",
"type": "gauge",
"gridPos": {"h": 4, "w": 8, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)",
"legendFormat": "Days",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "d",
"min": 0,
"max": 90,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "orange", "value": 7},
{"color": "yellow", "value": 14},
{"color": "green", "value": 30}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"description": "Shortest time until any certificate expires"
},
{
"id": 6,
"title": "Certificate Expiry by Endpoint",
"type": "table",
"gridPos": {"h": 12, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
"legendFormat": "{{instance}}",
"refId": "A",
"instant": true,
"format": "table"
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {"Time": true, "job": true, "__name__": true},
"renameByName": {"instance": "Endpoint", "Value": "Days Until Expiry"}
}
},
{
"id": "sortBy",
"options": {
"sort": [{"field": "Days Until Expiry", "desc": false}]
}
}
],
"fieldConfig": {
"defaults": {
"custom": {
"align": "left"
}
},
"overrides": [
{
"matcher": {"id": "byName", "options": "Days Until Expiry"},
"properties": [
{"id": "unit", "value": "d"},
{"id": "decimals", "value": 1},
{"id": "custom.width", "value": 150},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "orange", "value": 7},
{"color": "yellow", "value": 14},
{"color": "green", "value": 30}
]
}
},
{"id": "custom.cellOptions", "value": {"type": "color-background"}}
]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Days Until Expiry", "desc": false}]
},
"description": "All monitored endpoints sorted by days until certificate expiry"
},
{
"id": 7,
"title": "Probe Status",
"type": "table",
"gridPos": {"h": 12, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "probe_success{job=\"blackbox_tls\"}",
"legendFormat": "{{instance}}",
"refId": "A",
"instant": true,
"format": "table"
},
{
"expr": "probe_http_status_code{job=\"blackbox_tls\"}",
"legendFormat": "{{instance}}",
"refId": "B",
"instant": true,
"format": "table"
},
{
"expr": "probe_duration_seconds{job=\"blackbox_tls\"}",
"legendFormat": "{{instance}}",
"refId": "C",
"instant": true,
"format": "table"
}
],
"transformations": [
{
"id": "joinByField",
"options": {
"byField": "instance",
"mode": "outer"
}
},
{
"id": "organize",
"options": {
"excludeByName": {"Time": true, "Time 1": true, "Time 2": true, "Time 3": true, "job": true, "job 1": true, "job 2": true, "job 3": true, "__name__": true},
"renameByName": {
"instance": "Endpoint",
"Value #A": "Success",
"Value #B": "HTTP Status",
"Value #C": "Duration"
}
}
}
],
"fieldConfig": {
"defaults": {
"custom": {"align": "left"}
},
"overrides": [
{
"matcher": {"id": "byName", "options": "Success"},
"properties": [
{"id": "custom.width", "value": 80},
{"id": "mappings", "value": [
{"type": "value", "options": {"0": {"text": "FAIL", "color": "red"}}},
{"type": "value", "options": {"1": {"text": "OK", "color": "green"}}}
]},
{"id": "custom.cellOptions", "value": {"type": "color-text"}}
]
},
{
"matcher": {"id": "byName", "options": "HTTP Status"},
"properties": [
{"id": "custom.width", "value": 100}
]
},
{
"matcher": {"id": "byName", "options": "Duration"},
"properties": [
{"id": "unit", "value": "s"},
{"id": "decimals", "value": 3},
{"id": "custom.width", "value": 100}
]
}
]
},
"options": {
"showHeader": true
},
"description": "Probe success status, HTTP response code, and probe duration"
},
{
"id": 8,
"title": "Certificate Expiry Over Time",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "d",
"custom": {
"lineWidth": 2,
"fillOpacity": 10,
"showPoints": "never"
},
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "orange", "value": 7},
{"color": "yellow", "value": 14},
{"color": "green", "value": 30}
]
}
}
},
"options": {
"legend": {"displayMode": "table", "placement": "right", "calcs": ["lastNotNull"]},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Days until certificate expiry over time - useful for spotting renewal patterns"
},
{
"id": 9,
"title": "Probe Success Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100",
"legendFormat": "Success Rate",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"custom": {
"lineWidth": 2,
"fillOpacity": 20,
"showPoints": "never"
},
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 90},
{"color": "green", "value": 100}
]
},
"color": {"mode": "thresholds"}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "single"}
},
"description": "Overall probe success rate across all endpoints"
},
{
"id": 10,
"title": "Probe Duration",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "probe_duration_seconds{job=\"blackbox_tls\"}",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"lineWidth": 1,
"fillOpacity": 0,
"showPoints": "never"
}
}
},
"options": {
"legend": {"displayMode": "table", "placement": "right", "calcs": ["mean", "max"]},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Time taken to complete TLS probe for each endpoint"
}
]
}

View File

@@ -0,0 +1,85 @@
{
"uid": "logs-homelab",
"title": "Logs - Homelab",
"tags": ["loki", "logs", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "30s",
"templating": {
"list": [
{
"name": "host",
"type": "query",
"datasource": {"type": "loki", "uid": "loki"},
"query": "label_values(host)",
"refresh": 2,
"includeAll": true,
"multi": false,
"current": {"text": "All", "value": "$__all"}
},
{
"name": "job",
"type": "query",
"datasource": {"type": "loki", "uid": "loki"},
"query": "label_values(job)",
"refresh": 2,
"includeAll": true,
"multi": false,
"current": {"text": "All", "value": "$__all"}
},
{
"name": "search",
"type": "textbox",
"current": {"text": "", "value": ""},
"label": "Search"
}
]
},
"panels": [
{
"id": 1,
"title": "Log Volume",
"type": "timeseries",
"gridPos": {"h": 6, "w": 24, "x": 0, "y": 0},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "sum by (host) (count_over_time({host=~\"$host\", job=~\"$job\"} |~ \"$search\" [1m]))",
"legendFormat": "{{host}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short"
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"}
}
},
{
"id": 2,
"title": "Logs",
"type": "logs",
"gridPos": {"h": 18, "w": 24, "x": 0, "y": 6},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "{host=~\"$host\", job=~\"$job\"} |~ \"$search\"",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": true,
"showCommonLabels": false,
"wrapLogMessage": true,
"prettifyLogMessage": false,
"enableLogDetails": true,
"sortOrder": "Descending"
}
}
]
}

View File

@@ -0,0 +1,949 @@
{
"uid": "nixos-fleet-homelab",
"title": "NixOS Fleet - Homelab",
"tags": ["nixos", "fleet", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "1m",
"time": {
"from": "now-7d",
"to": "now"
},
"templating": {
"list": [
{
"name": "tier",
"type": "query",
"datasource": {"type": "prometheus", "uid": "prometheus"},
"query": "label_values(nixos_flake_info, tier)",
"refresh": 2,
"includeAll": true,
"multi": false,
"current": {"text": "All", "value": "$__all"}
}
]
},
"panels": [
{
"id": 1,
"title": "Hosts Behind Remote",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 1)",
"legendFormat": "Behind",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none",
"textMode": "auto"
},
"description": "Number of hosts where current revision differs from remote master"
},
{
"id": 2,
"title": "Hosts Needing Reboot",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(nixos_config_mismatch{tier=~\"$tier\"} == 1)",
"legendFormat": "Need Reboot",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "orange", "value": 3},
{"color": "red", "value": 5}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Hosts where booted generation differs from current (switched but not rebooted)"
},
{
"id": 3,
"title": "Total Hosts",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(nixos_flake_info{tier=~\"$tier\"})",
"legendFormat": "Hosts",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "blue", "value": null}]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 4,
"title": "Nixpkgs Age",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 11, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "max(nixos_flake_input_age_seconds{input=\"nixpkgs\", tier=~\"$tier\"})",
"legendFormat": "Nixpkgs",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 604800},
{"color": "orange", "value": 1209600},
{"color": "red", "value": 2592000}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Age of nixpkgs flake input (yellow >7d, orange >14d, red >30d)"
},
{
"id": 5,
"title": "Hosts Up-to-date",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 14, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 0)",
"legendFormat": "Up-to-date",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 13,
"title": "Deployments (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 3, "x": 17, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_deployments_total{status=\"completed\"}[24h]))",
"legendFormat": "Deployments",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "blue", "value": null}]
},
"noValue": "0",
"decimals": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Total successful deployments in the last 24 hours"
},
{
"id": 14,
"title": "Avg Deploy Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_deployment_duration_seconds_sum{success=\"true\"}[24h])) / sum(increase(homelab_deploy_deployment_duration_seconds_count{success=\"true\"}[24h]))",
"legendFormat": "Avg Time",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 30},
{"color": "red", "value": 60}
]
},
"noValue": "-"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Average deployment duration over the last 24 hours (yellow >30s, red >60s)"
},
{
"id": 6,
"title": "Fleet Status",
"type": "table",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "nixos_flake_info{tier=~\"$tier\"}",
"format": "table",
"instant": true,
"refId": "info"
},
{
"expr": "nixos_flake_revision_behind{tier=~\"$tier\"}",
"format": "table",
"instant": true,
"refId": "behind"
},
{
"expr": "nixos_config_mismatch{tier=~\"$tier\"}",
"format": "table",
"instant": true,
"refId": "mismatch"
},
{
"expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
"format": "table",
"instant": true,
"refId": "age"
},
{
"expr": "nixos_generation_count{tier=~\"$tier\"}",
"format": "table",
"instant": true,
"refId": "count"
}
],
"fieldConfig": {
"defaults": {},
"overrides": [
{
"matcher": {"id": "byName", "options": "Hostname"},
"properties": [{"id": "custom.width", "value": 120}]
},
{
"matcher": {"id": "byName", "options": "Current Rev"},
"properties": [{"id": "custom.width", "value": 90}]
},
{
"matcher": {"id": "byName", "options": "Remote Rev"},
"properties": [{"id": "custom.width", "value": 90}]
},
{
"matcher": {"id": "byName", "options": "Behind"},
"properties": [
{"id": "custom.width", "value": 70},
{"id": "mappings", "value": [
{"type": "value", "options": {"0": {"text": "No", "color": "green"}}},
{"type": "value", "options": {"1": {"text": "Yes", "color": "red"}}}
]},
{"id": "custom.cellOptions", "value": {"type": "color-text"}}
]
},
{
"matcher": {"id": "byName", "options": "Need Reboot"},
"properties": [
{"id": "custom.width", "value": 100},
{"id": "mappings", "value": [
{"type": "value", "options": {"0": {"text": "No", "color": "green"}}},
{"type": "value", "options": {"1": {"text": "Yes", "color": "orange"}}}
]},
{"id": "custom.cellOptions", "value": {"type": "color-text"}}
]
},
{
"matcher": {"id": "byName", "options": "Config Age"},
"properties": [
{"id": "unit", "value": "s"},
{"id": "custom.width", "value": 100}
]
},
{
"matcher": {"id": "byName", "options": "Generations"},
"properties": [{"id": "custom.width", "value": 100}]
},
{
"matcher": {"id": "byName", "options": "Tier"},
"properties": [{"id": "custom.width", "value": 60}]
},
{
"matcher": {"id": "byName", "options": "Role"},
"properties": [{"id": "custom.width", "value": 80}]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Hostname", "desc": false}]
},
"transformations": [
{
"id": "joinByField",
"options": {"byField": "hostname", "mode": "outer"}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Time 1": true,
"Time 2": true,
"Time 3": true,
"Time 4": true,
"Time 5": true,
"Value #info": true,
"__name__": true,
"__name__ 1": true,
"__name__ 2": true,
"__name__ 3": true,
"__name__ 4": true,
"__name__ 5": true,
"dns_role": true,
"dns_role 1": true,
"dns_role 2": true,
"dns_role 3": true,
"dns_role 4": true,
"instance": true,
"instance 1": true,
"instance 2": true,
"instance 3": true,
"instance 4": true,
"job": true,
"job 1": true,
"job 2": true,
"job 3": true,
"job 4": true,
"nixos_version": true,
"nixpkgs_rev": true,
"role 1": true,
"role 2": true,
"role 3": true,
"role 4": true,
"tier 1": true,
"tier 2": true,
"tier 3": true,
"tier 4": true
},
"indexByName": {
"hostname": 0,
"tier": 1,
"role": 2,
"current_rev": 3,
"remote_rev": 4,
"Value #behind": 5,
"Value #mismatch": 6,
"Value #age": 7,
"Value #count": 8
},
"renameByName": {
"hostname": "Hostname",
"tier": "Tier",
"role": "Role",
"current_rev": "Current Rev",
"remote_rev": "Remote Rev",
"Value #behind": "Behind",
"Value #mismatch": "Need Reboot",
"Value #age": "Config Age",
"Value #count": "Generations"
}
}
}
]
},
{
"id": 7,
"title": "Generation Age by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(nixos_generation_age_seconds{tier=~\"$tier\"})",
"legendFormat": "{{hostname}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 86400},
{"color": "orange", "value": 259200},
{"color": "red", "value": 604800}
]
},
"min": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"description": "How long ago each host's current config was deployed (yellow >1d, orange >3d, red >7d)"
},
{
"id": 8,
"title": "Generations per Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(nixos_generation_count{tier=~\"$tier\"})",
"legendFormat": "{{hostname}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null},
{"color": "purple", "value": 50}
]
},
"min": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"description": "Total number of NixOS generations on each host"
},
{
"id": 9,
"title": "Deployment Activity (Generation Age Over Time)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
"legendFormat": "{{hostname}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"lineWidth": 1,
"fillOpacity": 0,
"showPoints": "never",
"stacking": {"mode": "none"}
}
}
},
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Generation age increases over time, drops to near-zero when deployed. Useful to see deployment patterns."
},
{
"id": 10,
"title": "Flake Input Ages",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "max by (input) (nixos_flake_input_age_seconds)",
"format": "table",
"instant": true,
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s"
},
"overrides": [
{
"matcher": {"id": "byName", "options": "input"},
"properties": [{"id": "custom.width", "value": 150}]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Value", "desc": true}]
},
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {"Time": true},
"renameByName": {
"input": "Flake Input",
"Value": "Age"
}
}
}
],
"description": "Age of each flake input across the fleet"
},
{
"id": 11,
"title": "Hosts by Revision",
"type": "piechart",
"gridPos": {"h": 6, "w": 6, "x": 12, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count by (current_rev) (nixos_flake_info{tier=~\"$tier\"})",
"legendFormat": "{{current_rev}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"legend": {"displayMode": "table", "placement": "right", "values": ["value"]},
"pieType": "pie"
},
"description": "Distribution of hosts by their current flake revision"
},
{
"id": 12,
"title": "Hosts by Tier",
"type": "piechart",
"gridPos": {"h": 6, "w": 6, "x": 18, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count by (tier) (nixos_flake_info)",
"legendFormat": "{{tier}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"legend": {"displayMode": "table", "placement": "right", "values": ["value"]},
"pieType": "pie"
},
"transformations": [
{
"id": "renameByRegex",
"options": {
"regex": "^$",
"renamePattern": "prod"
}
}
],
"description": "Distribution of hosts by tier (test vs prod)"
},
{
"id": 15,
"title": "Build Service",
"type": "row",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 36},
"collapsed": false
},
{
"id": 16,
"title": "Builds (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[24h]))",
"legendFormat": "Builds",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"noValue": "0",
"decimals": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Successful host builds in the last 24 hours"
},
{
"id": 17,
"title": "Failed Builds (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[24h])) or vector(0)",
"legendFormat": "Failed",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
},
"noValue": "0",
"decimals": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Failed host builds in the last 24 hours"
},
{
"id": 18,
"title": "Last Build",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "time() - max(homelab_deploy_build_last_timestamp)",
"legendFormat": "Last Build",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 86400},
{"color": "red", "value": 604800}
]
},
"noValue": "-"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Time since last build attempt (yellow >1d, red >7d)"
},
{
"id": 19,
"title": "Avg Build Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_duration_seconds_sum[24h])) / sum(increase(homelab_deploy_build_duration_seconds_count[24h]))",
"legendFormat": "Avg Time",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 30},
{"color": "red", "value": 60}
]
},
"noValue": "-"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Average build duration per host over the last 24 hours"
},
{
"id": 20,
"title": "Total Hosts Built",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(homelab_deploy_build_duration_seconds_count)",
"legendFormat": "Hosts",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "blue", "value": null}]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Total number of unique hosts that have been built"
},
{
"id": 21,
"title": "Build Jobs (24h)",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 37},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_builds_total[24h]))",
"legendFormat": "Jobs",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "purple", "value": null}]
},
"noValue": "0",
"decimals": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Total build jobs (each job may build multiple hosts) in the last 24 hours"
},
{
"id": 22,
"title": "Build Time by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 41},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(homelab_deploy_build_duration_seconds_sum / homelab_deploy_build_duration_seconds_count)",
"legendFormat": "{{host}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 15},
{"color": "orange", "value": 25},
{"color": "red", "value": 45}
]
},
"min": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"description": "Average build time per host (green <15s, yellow <25s, orange <45s, red >45s)"
},
{
"id": 23,
"title": "Build Count by Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 41},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(sum by (host) (homelab_deploy_build_host_total))",
"legendFormat": "{{host}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null},
{"color": "purple", "value": 10}
]
},
"min": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"description": "Total build count per host (all time)"
},
{
"id": 24,
"title": "Build Activity",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 49},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[1h]))",
"legendFormat": "Successful",
"refId": "A"
},
{
"expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[1h]))",
"legendFormat": "Failed",
"refId": "B"
}
],
"fieldConfig": {
"defaults": {
"custom": {
"lineWidth": 1,
"fillOpacity": 30,
"showPoints": "never",
"stacking": {"mode": "none"}
}
},
"overrides": [
{
"matcher": {"id": "byName", "options": "Successful"},
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "green"}}]
},
{
"matcher": {"id": "byName", "options": "Failed"},
"properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}}]
}
]
},
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Build activity over time (successful vs failed builds per hour)"
}
]
}


@@ -0,0 +1,296 @@
{
"uid": "nixos-operations",
"title": "NixOS Operations",
"tags": ["loki", "nixos", "operations", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "1m",
"time": {
"from": "now-24h",
"to": "now"
},
"templating": {
"list": [
{
"name": "host",
"type": "query",
"datasource": {"type": "loki", "uid": "loki"},
"query": "label_values(host)",
"refresh": 2,
"includeAll": true,
"multi": true,
"current": {"text": "All", "value": "$__all"}
}
]
},
"panels": [
{
"id": 1,
"title": "Upgrade Log Volume",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "sum(count_over_time({systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} [$__range]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "blue", "value": null}]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Total log entries from nixos-upgrade.service in selected time range"
},
{
"id": 2,
"title": "Successful Upgrades",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "sum(count_over_time({systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |= \"Done. The new configuration is\" [$__range]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Upgrades that completed successfully"
},
{
"id": 3,
"title": "Upgrade Errors",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "sum(count_over_time({systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |~ \"(?i)error|failed\" [$__range]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": 1}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Upgrade log entries containing errors"
},
{
"id": 4,
"title": "Bootstrap Events",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "sum(count_over_time({job=\"bootstrap\", host=~\"$host\"} [$__range]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "purple", "value": null}]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
},
"description": "Bootstrap log entries from new VM deployments"
},
{
"id": 5,
"title": "Upgrade Activity by Host",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "sum by (host) (count_over_time({systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} [5m]))",
"legendFormat": "{{host}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"lineWidth": 1,
"fillOpacity": 30,
"showPoints": "never",
"stacking": {"mode": "normal"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "When upgrades ran on each host"
},
{
"id": 6,
"title": "ACME Certificate Activity",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "sum by (host) (count_over_time({systemd_unit=~\"acme.*\", host=~\"$host\"} [5m]))",
"legendFormat": "{{host}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"lineWidth": 1,
"fillOpacity": 30,
"showPoints": "never",
"stacking": {"mode": "normal"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "ACME certificate renewal activity"
},
{
"id": 7,
"title": "Recent Upgrade Completions",
"type": "logs",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "{systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |= \"Done. The new configuration is\" | json | line_format \"{{.MESSAGE}}\" | keep host",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": true,
"showCommonLabels": false,
"wrapLogMessage": true,
"prettifyLogMessage": false,
"enableLogDetails": true,
"sortOrder": "Descending"
},
"description": "Successful upgrade completion messages showing the new system path"
},
{
"id": 8,
"title": "Build Activity",
"type": "logs",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 12},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "{systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |= \"building\" | json | line_format \"{{.MESSAGE}}\" | keep host",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": true,
"showCommonLabels": false,
"wrapLogMessage": true,
"prettifyLogMessage": false,
"enableLogDetails": true,
"sortOrder": "Descending"
},
"description": "Derivations being built during upgrades"
},
{
"id": 9,
"title": "Bootstrap Logs",
"type": "logs",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 20},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "{job=\"bootstrap\", host=~\"$host\"}",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": true,
"showCommonLabels": false,
"wrapLogMessage": true,
"prettifyLogMessage": false,
"enableLogDetails": true,
"sortOrder": "Descending"
},
"description": "Logs from VM bootstrap process (new deployments)"
},
{
"id": 10,
"title": "Upgrade Errors & Failures",
"type": "logs",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 28},
"datasource": {"type": "loki", "uid": "loki"},
"targets": [
{
"expr": "{systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |~ \"(?i)error|failed\" | json | line_format \"{{.MESSAGE}}\" | keep host",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": true,
"showCommonLabels": false,
"wrapLogMessage": true,
"prettifyLogMessage": false,
"enableLogDetails": true,
"sortOrder": "Descending"
},
"description": "Errors and failures during NixOS upgrades"
}
]
}


@@ -0,0 +1,208 @@
{
"uid": "node-exporter-homelab",
"title": "Node Exporter - Homelab",
"tags": ["node-exporter", "prometheus", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "30s",
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"datasource": {"type": "prometheus", "uid": "prometheus"},
"query": "label_values(node_uname_info, instance)",
"refresh": 2,
"includeAll": false,
"multi": false,
"current": {}
}
]
},
"panels": [
{
"id": 1,
"title": "CPU Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
"legendFormat": "CPU %",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
}
},
{
"id": 2,
"title": "Memory Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
"legendFormat": "Memory %",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
}
},
{
"id": 3,
"title": "Disk Usage",
"type": "gauge",
"gridPos": {"h": 8, "w": 8, "x": 0, "y": 8},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
"legendFormat": "Root /",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 85}
]
}
}
}
},
{
"id": 4,
"title": "System Load",
"type": "timeseries",
"gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "node_load1{instance=~\"$instance\"}",
"legendFormat": "1m",
"refId": "A"
},
{
"expr": "node_load5{instance=~\"$instance\"}",
"legendFormat": "5m",
"refId": "B"
},
{
"expr": "node_load15{instance=~\"$instance\"}",
"legendFormat": "15m",
"refId": "C"
}
],
"fieldConfig": {
"defaults": {
"unit": "short"
}
}
},
{
"id": 5,
"title": "Uptime",
"type": "stat",
"gridPos": {"h": 8, "w": 8, "x": 16, "y": 8},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "time() - node_boot_time_seconds{instance=~\"$instance\"}",
"legendFormat": "Uptime",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s"
}
}
},
{
"id": 6,
"title": "Network Traffic",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
"legendFormat": "Receive {{device}}",
"refId": "A"
},
{
"expr": "-rate(node_network_transmit_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
"legendFormat": "Transmit {{device}}",
"refId": "B"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps"
}
}
},
{
"id": 7,
"title": "Disk I/O",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",
"legendFormat": "Read {{device}}",
"refId": "A"
},
{
"expr": "-rate(node_disk_written_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",
"legendFormat": "Write {{device}}",
"refId": "B"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps"
}
}
}
]
}


@@ -0,0 +1,606 @@
{
"uid": "proxmox-homelab",
"title": "Proxmox - Homelab",
"tags": ["proxmox", "virtualization", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
},
"templating": {
"list": [
{
"name": "vm",
"type": "query",
"datasource": {"type": "prometheus", "uid": "prometheus"},
"query": "label_values(pve_guest_info{template=\"0\"}, name)",
"refresh": 2,
"includeAll": true,
"multi": true,
"current": {"text": "All", "value": "$__all"}
}
]
},
"panels": [
{
"id": 1,
"title": "VMs Running",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 1)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 2,
"title": "VMs Stopped",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 3}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 3,
"title": "Node CPU",
"type": "gauge",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_cpu_usage_ratio{id=~\"node/.*\"} * 100",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
}
},
{
"id": 4,
"title": "Node Memory",
"type": "gauge",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
}
},
{
"id": 5,
"title": "Node Uptime",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_uptime_seconds{id=~\"node/.*\"}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [{"color": "blue", "value": null}]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 6,
"title": "Templates",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(pve_guest_info{template=\"1\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "purple", "value": null}]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 7,
"title": "VM Status",
"type": "table",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
"format": "table",
"instant": true,
"refId": "info"
},
{
"expr": "pve_up{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
"format": "table",
"instant": true,
"refId": "status"
},
{
"expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
"format": "table",
"instant": true,
"refId": "cpu"
},
{
"expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} / on(id) pve_memory_size_bytes * 100",
"format": "table",
"instant": true,
"refId": "mem"
},
{
"expr": "pve_uptime_seconds{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
"format": "table",
"instant": true,
"refId": "uptime"
}
],
"fieldConfig": {
"defaults": {},
"overrides": [
{
"matcher": {"id": "byName", "options": "Name"},
"properties": [{"id": "custom.width", "value": 150}]
},
{
"matcher": {"id": "byName", "options": "Status"},
"properties": [
{"id": "custom.width", "value": 80},
{"id": "mappings", "value": [
{"type": "value", "options": {"0": {"text": "Stopped", "color": "red"}}},
{"type": "value", "options": {"1": {"text": "Running", "color": "green"}}}
]},
{"id": "custom.cellOptions", "value": {"type": "color-text"}}
]
},
{
"matcher": {"id": "byName", "options": "CPU %"},
"properties": [
{"id": "unit", "value": "percent"},
{"id": "decimals", "value": 1},
{"id": "custom.width", "value": 80},
{"id": "custom.cellOptions", "value": {"type": "gauge", "mode": "basic"}},
{"id": "min", "value": 0},
{"id": "max", "value": 100},
{"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 50}, {"color": "red", "value": 80}]}}
]
},
{
"matcher": {"id": "byName", "options": "Memory %"},
"properties": [
{"id": "unit", "value": "percent"},
{"id": "decimals", "value": 1},
{"id": "custom.width", "value": 100},
{"id": "custom.cellOptions", "value": {"type": "gauge", "mode": "basic"}},
{"id": "min", "value": 0},
{"id": "max", "value": 100},
{"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 70}, {"color": "red", "value": 90}]}}
]
},
{
"matcher": {"id": "byName", "options": "Uptime"},
"properties": [
{"id": "unit", "value": "s"},
{"id": "custom.width", "value": 100}
]
},
{
"matcher": {"id": "byName", "options": "ID"},
"properties": [{"id": "custom.width", "value": 90}]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Name", "desc": false}]
},
"transformations": [
{
"id": "joinByField",
"options": {"byField": "name", "mode": "outer"}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Time 1": true,
"Time 2": true,
"Time 3": true,
"Time 4": true,
"Value #info": true,
"__name__": true,
"id 1": true,
"id 2": true,
"id 3": true,
"id 4": true,
"instance": true,
"instance 1": true,
"instance 2": true,
"instance 3": true,
"instance 4": true,
"job": true,
"job 1": true,
"job 2": true,
"job 3": true,
"job 4": true,
"name 1": true,
"name 2": true,
"name 3": true,
"name 4": true,
"node": true,
"tags": true,
"template": true,
"type": true
},
"indexByName": {
"name": 0,
"id": 1,
"Value #status": 2,
"Value #cpu": 3,
"Value #mem": 4,
"Value #uptime": 5
},
"renameByName": {
"name": "Name",
"id": "ID",
"Value #status": "Status",
"Value #cpu": "CPU %",
"Value #mem": "Memory %",
"Value #uptime": "Uptime"
}
}
}
]
},
{
"id": 8,
"title": "VM CPU Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
"legendFormat": "{{name}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"custom": {
"lineWidth": 1,
"fillOpacity": 10,
"showPoints": "never"
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
}
},
{
"id": 9,
"title": "VM Memory Usage",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
"legendFormat": "{{name}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes",
"min": 0,
"custom": {
"lineWidth": 1,
"fillOpacity": 10,
"showPoints": "never"
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
}
},
{
"id": 10,
"title": "VM Network Traffic",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
"legendFormat": "{{name}} RX",
"refId": "A"
},
{
"expr": "-rate(pve_network_transmit_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
"legendFormat": "{{name}} TX",
"refId": "B"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"lineWidth": 1,
"fillOpacity": 10,
"showPoints": "never"
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
}
},
{
"id": 11,
"title": "VM Disk I/O",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
"legendFormat": "{{name}} Read",
"refId": "A"
},
{
"expr": "-rate(pve_disk_write_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
"legendFormat": "{{name}} Write",
"refId": "B"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"lineWidth": 1,
"fillOpacity": 10,
"showPoints": "never"
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
}
},
{
"id": 12,
"title": "Storage Usage",
"type": "bargauge",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
"legendFormat": "{{id}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 85}
]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"transformations": [
{
"id": "renameByRegex",
"options": {
"regex": "storage/pve1/(.*)",
"renamePattern": "$1"
}
}
]
},
{
"id": 13,
"title": "Storage Capacity",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",
"format": "table",
"instant": true,
"refId": "size"
},
{
"expr": "pve_disk_usage_bytes{id=~\"storage/.*\"}",
"format": "table",
"instant": true,
"refId": "used"
},
{
"expr": "pve_disk_size_bytes{id=~\"storage/.*\"} - pve_disk_usage_bytes{id=~\"storage/.*\"}",
"format": "table",
"instant": true,
"refId": "free"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes"
},
"overrides": [
{
"matcher": {"id": "byName", "options": "Storage"},
"properties": [{"id": "unit", "value": "none"}]
}
]
},
"options": {
"showHeader": true
},
"transformations": [
{
"id": "joinByField",
"options": {"byField": "id", "mode": "outer"}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Time 1": true,
"Time 2": true,
"instance": true,
"instance 1": true,
"instance 2": true,
"job": true,
"job 1": true,
"job 2": true
},
"renameByName": {
"id": "Storage",
"Value #size": "Total",
"Value #used": "Used",
"Value #free": "Free"
}
}
},
{
"id": "renameByRegex",
"options": {
"regex": "storage/pve1/(.*)",
"renamePattern": "$1"
}
}
]
}
]
}


@@ -0,0 +1,553 @@
{
"uid": "systemd-homelab",
"title": "Systemd Services - Homelab",
"tags": ["systemd", "services", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "1m",
"time": {
"from": "now-24h",
"to": "now"
},
"templating": {
"list": [
{
"name": "hostname",
"type": "query",
"datasource": {"type": "prometheus", "uid": "prometheus"},
"query": "label_values(systemd_unit_state, hostname)",
"refresh": 2,
"includeAll": true,
"multi": true,
"current": {"text": "All", "value": "$__all"}
}
]
},
"panels": [
{
"id": 1,
"title": "Failed Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": 1}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 2,
"title": "Active Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 3,
"title": "Hosts Monitored",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "blue", "value": null}]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 4,
"title": "Total Service Restarts",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 10},
{"color": "orange", "value": 50}
]
},
"noValue": "0"
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 5,
"title": "Inactive Units",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "purple", "value": null}]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 6,
"title": "Timers",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "blue", "value": null}]
}
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "value",
"graphMode": "none"
}
},
{
"id": 7,
"title": "Failed Units",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
"format": "table",
"instant": true,
"refId": "A"
}
],
"fieldConfig": {
"defaults": {},
"overrides": [
{
"matcher": {"id": "byName", "options": "Host"},
"properties": [{"id": "custom.width", "value": 120}]
},
{
"matcher": {"id": "byName", "options": "Unit"},
"properties": [{"id": "custom.width", "value": 300}]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Host", "desc": false}]
},
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Value": true,
"__name__": true,
"dns_role": true,
"instance": true,
"job": true,
"role": true,
"state": true,
"tier": true,
"type": true
},
"renameByName": {
"hostname": "Host",
"name": "Unit"
}
}
}
],
"description": "Units currently in failed state"
},
{
"id": 8,
"title": "Service Restarts (Top 15)",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
"format": "table",
"instant": true,
"refId": "A"
}
],
"fieldConfig": {
"defaults": {},
"overrides": [
{
"matcher": {"id": "byName", "options": "Host"},
"properties": [{"id": "custom.width", "value": 120}]
},
{
"matcher": {"id": "byName", "options": "Service"},
"properties": [{"id": "custom.width", "value": 280}]
},
{
"matcher": {"id": "byName", "options": "Restarts"},
"properties": [{"id": "custom.width", "value": 80}]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Restarts", "desc": true}]
},
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"__name__": true,
"dns_role": true,
"instance": true,
"job": true,
"role": true,
"tier": true
},
"renameByName": {
"hostname": "Host",
"name": "Service",
"Value": "Restarts"
}
}
}
],
"description": "Services that have been restarted (since host boot)"
},
{
"id": 9,
"title": "Active Units per Host",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
"legendFormat": "{{hostname}}",
"refId": "A",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"min": 0
}
},
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
}
},
{
"id": 10,
"title": "NixOS Upgrade Timers",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
"format": "table",
"instant": true,
"refId": "last"
},
{
"expr": "time() - systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
"format": "table",
"instant": true,
"refId": "ago"
}
],
"fieldConfig": {
"defaults": {},
"overrides": [
{
"matcher": {"id": "byName", "options": "Host"},
"properties": [{"id": "custom.width", "value": 130}]
},
{
"matcher": {"id": "byName", "options": "Last Trigger"},
"properties": [
{"id": "unit", "value": "dateTimeAsLocalNoDateIfToday"},
{"id": "custom.width", "value": 180}
]
},
{
"matcher": {"id": "byName", "options": "Time Ago"},
"properties": [
{"id": "unit", "value": "s"},
{"id": "custom.width", "value": 120},
{"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 86400}, {"color": "red", "value": 172800}]}},
{"id": "custom.cellOptions", "value": {"type": "color-text"}}
]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Time Ago", "desc": true}]
},
"transformations": [
{
"id": "joinByField",
"options": {"byField": "hostname", "mode": "outer"}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Time 1": true,
"__name__": true,
"__name__ 1": true,
"dns_role": true,
"dns_role 1": true,
"instance": true,
"instance 1": true,
"job": true,
"job 1": true,
"name": true,
"name 1": true,
"role": true,
"role 1": true,
"tier": true,
"tier 1": true
},
"indexByName": {
"hostname": 0,
"Value #last": 1,
"Value #ago": 2
},
"renameByName": {
"hostname": "Host",
"Value #last": "Last Trigger",
"Value #ago": "Time Ago"
}
}
}
],
"description": "When nixos-upgrade.timer last ran on each host. Yellow >24h, Red >48h."
},
{
"id": 11,
"title": "Backup Timers",
"type": "table",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
"format": "table",
"instant": true,
"refId": "last"
},
{
"expr": "time() - systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
"format": "table",
"instant": true,
"refId": "ago"
}
],
"fieldConfig": {
"defaults": {},
"overrides": [
{
"matcher": {"id": "byName", "options": "Host"},
"properties": [{"id": "custom.width", "value": 120}]
},
{
"matcher": {"id": "byName", "options": "Timer"},
"properties": [{"id": "custom.width", "value": 220}]
},
{
"matcher": {"id": "byName", "options": "Last Trigger"},
"properties": [
{"id": "unit", "value": "dateTimeAsLocalNoDateIfToday"},
{"id": "custom.width", "value": 180}
]
},
{
"matcher": {"id": "byName", "options": "Time Ago"},
"properties": [
{"id": "unit", "value": "s"},
{"id": "custom.width", "value": 100},
{"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 86400}, {"color": "red", "value": 172800}]}},
{"id": "custom.cellOptions", "value": {"type": "color-text"}}
]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Time Ago", "desc": true}]
},
"transformations": [
{
"id": "joinByField",
"options": {"byField": "name", "mode": "outer"}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Time 1": true,
"__name__": true,
"__name__ 1": true,
"dns_role": true,
"dns_role 1": true,
"instance": true,
"instance 1": true,
"job": true,
"job 1": true,
"role": true,
"role 1": true,
"tier": true,
"tier 1": true,
"hostname 1": true
},
"indexByName": {
"hostname": 0,
"name": 1,
"Value #last": 2,
"Value #ago": 3
},
"renameByName": {
"hostname": "Host",
"name": "Timer",
"Value #last": "Last Trigger",
"Value #ago": "Time Ago"
}
}
}
],
"description": "Restic backup timers"
},
{
"id": 12,
"title": "Service Restarts Over Time",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",
"legendFormat": "{{hostname}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"lineWidth": 1,
"fillOpacity": 20,
"showPoints": "never",
"stacking": {"mode": "normal"}
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"description": "Service restart rate per hour"
}
]
}


@@ -0,0 +1,399 @@
{
"uid": "temperature-homelab",
"title": "Temperature - Homelab",
"tags": ["home-assistant", "temperature", "homelab"],
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "1m",
"time": {
"from": "now-30d",
"to": "now"
},
"templating": {
"list": []
},
"panels": [
{
"id": 1,
"title": "Current Temperatures",
"type": "stat",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
"legendFormat": "{{friendly_name}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "celsius",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null},
{"color": "green", "value": 18},
{"color": "yellow", "value": 24},
{"color": "orange", "value": 27},
{"color": "red", "value": 30}
]
},
"mappings": []
},
"overrides": []
},
"options": {
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"orientation": "auto",
"textMode": "auto",
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto"
},
"transformations": [
{
"id": "renameByRegex",
"options": {
"regex": "Temp (.*) Temperature",
"renamePattern": "$1"
}
}
]
},
{
"id": 2,
"title": "Average Home Temperature",
"type": "gauge",
"gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
"legendFormat": "Average",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "celsius",
"min": 15,
"max": 30,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null},
{"color": "green", "value": 18},
{"color": "yellow", "value": 24},
{"color": "red", "value": 28}
]
}
}
},
"options": {
"reduceOptions": {
"calcs": ["lastNotNull"]
},
"showThresholdLabels": false,
"showThresholdMarkers": true
}
},
{
"id": 3,
"title": "Current Humidity",
"type": "stat",
"gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
"legendFormat": "{{friendly_name}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 30},
{"color": "green", "value": 40},
{"color": "yellow", "value": 60},
{"color": "red", "value": 70}
]
}
}
},
"options": {
"reduceOptions": {
"calcs": ["lastNotNull"]
},
"orientation": "horizontal",
"colorMode": "value",
"graphMode": "none"
},
"transformations": [
{
"id": "renameByRegex",
"options": {
"regex": "Temp (.*) Humidity",
"renamePattern": "$1"
}
}
]
},
{
"id": 4,
"title": "Temperature History (30 Days)",
"type": "timeseries",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
"legendFormat": "{{friendly_name}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "celsius",
"custom": {
"lineWidth": 1,
"fillOpacity": 10,
"pointSize": 5,
"showPoints": "never",
"spanNulls": 3600000
}
}
},
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["mean", "min", "max"]
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"transformations": [
{
"id": "renameByRegex",
"options": {
"regex": "Temp (.*) Temperature",
"renamePattern": "$1"
}
},
{
"id": "renameByRegex",
"options": {
"regex": "temp_server Temperature",
"renamePattern": "Server"
}
}
]
},
{
"id": 5,
"title": "Temperature Trend (1h rate of change)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
"legendFormat": "{{friendly_name}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "celsius",
"custom": {
"lineWidth": 1,
"fillOpacity": 20,
"showPoints": "never",
"spanNulls": 3600000
},
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "blue", "value": null},
{"color": "green", "value": -0.5},
{"color": "green", "value": 0.5},
{"color": "red", "value": 1}
]
},
"displayName": "${__field.labels.friendly_name}"
}
},
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
},
"transformations": [
{
"id": "renameByRegex",
"options": {
"regex": "Temp (.*) Temperature",
"renamePattern": "$1"
}
},
{
"id": "renameByRegex",
"options": {
"regex": "temp_server Temperature",
"renamePattern": "Server"
}
}
],
"description": "Rate of temperature change per hour. Positive = warming, Negative = cooling."
},
{
"id": 6,
"title": "24h Min / Max / Avg",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
"legendFormat": "{{friendly_name}}",
"refId": "min",
"instant": true
},
{
"expr": "max_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
"legendFormat": "{{friendly_name}}",
"refId": "max",
"instant": true
},
{
"expr": "avg_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
"legendFormat": "{{friendly_name}}",
"refId": "avg",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "celsius",
"decimals": 1
},
"overrides": [
{
"matcher": {"id": "byName", "options": "Room"},
"properties": [{"id": "custom.width", "value": 150}]
}
]
},
"options": {
"showHeader": true,
"sortBy": [{"displayName": "Room", "desc": false}]
},
"transformations": [
{
"id": "joinByField",
"options": {
"byField": "friendly_name",
"mode": "outer"
}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"domain": true,
"entity": true,
"hostname": true,
"instance": true,
"job": true
},
"renameByName": {
"friendly_name": "Room",
"Value #min": "Min (24h)",
"Value #max": "Max (24h)",
"Value #avg": "Avg (24h)"
}
}
},
{
"id": "renameByRegex",
"options": {
"regex": "Temp (.*) Temperature",
"renamePattern": "$1"
}
}
]
},
{
"id": 7,
"title": "Humidity History (30 Days)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
"datasource": {"type": "prometheus", "uid": "prometheus"},
"targets": [
{
"expr": "hass_sensor_humidity_percent",
"legendFormat": "{{friendly_name}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"custom": {
"lineWidth": 1,
"fillOpacity": 10,
"showPoints": "never",
"spanNulls": 3600000
}
}
},
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["mean", "min", "max"]
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"transformations": [
{
"id": "renameByRegex",
"options": {
"regex": "Temp (.*) Humidity",
"renamePattern": "$1"
}
},
{
"id": "renameByRegex",
"options": {
"regex": "temp_server Humidity",
"renamePattern": "Server"
}
}
]
}
]
}


@@ -0,0 +1,111 @@
{ config, pkgs, ... }:
{
services.grafana = {
enable = true;
settings = {
server = {
http_addr = "127.0.0.1";
http_port = 3000;
domain = "grafana-test.home.2rjus.net";
root_url = "https://grafana-test.home.2rjus.net/";
};
# Disable anonymous access
"auth.anonymous".enabled = false;
# OIDC authentication via Kanidm
"auth.generic_oauth" = {
enabled = true;
name = "Kanidm";
client_id = "grafana";
client_secret = "$__file{/run/secrets/grafana-oauth2}";
auth_url = "https://auth.home.2rjus.net/ui/oauth2";
token_url = "https://auth.home.2rjus.net/oauth2/token";
api_url = "https://auth.home.2rjus.net/oauth2/openid/grafana/userinfo";
scopes = "openid profile email groups";
use_pkce = true; # Required by Kanidm, more secure
# Extract user attributes from userinfo response
email_attribute_path = "email";
login_attribute_path = "preferred_username";
name_attribute_path = "name";
# Map admins group to Admin role, everyone else to Editor (for Explore access)
role_attribute_path = "contains(groups[*], 'admins') && 'Admin' || 'Editor'";
allow_sign_up = true;
};
};
# Declarative datasources pointing to monitoring01
provision.datasources.settings = {
apiVersion = 1;
datasources = [
{
name = "Prometheus";
type = "prometheus";
url = "http://monitoring01.home.2rjus.net:9090";
isDefault = true;
uid = "prometheus";
}
{
name = "Loki";
type = "loki";
url = "http://monitoring01.home.2rjus.net:3100";
uid = "loki";
}
];
};
# Declarative dashboards
provision.dashboards.settings = {
apiVersion = 1;
providers = [
{
name = "homelab";
type = "file";
options.path = ./dashboards;
disableDeletion = true;
}
];
};
};
# Vault secret for OAuth2 client secret
vault.secrets.grafana-oauth2 = {
secretPath = "services/grafana/oauth2-client-secret";
extractKey = "password";
services = [ "grafana" ];
owner = "grafana";
group = "grafana";
};
# Local Caddy for TLS termination
services.caddy = {
enable = true;
package = pkgs.unstable.caddy;
configFile = pkgs.writeText "Caddyfile" ''
{
acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
metrics
}
grafana-test.home.2rjus.net {
log {
output file /var/log/caddy/grafana.log {
mode 644
}
}
reverse_proxy http://127.0.0.1:3000
}
http://${config.networking.hostName}.home.2rjus.net/metrics {
metrics
}
'';
};
# Expose Caddy metrics for Prometheus
homelab.monitoring.scrapeTargets = [{
job_name = "caddy";
port = 80;
}];
}
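
The role_attribute_path value above is a JMESPath expression that Grafana evaluates against the OAuth userinfo/ID-token claims to decide the org role. A minimal sketch of that evaluation using the Python jmespath package (the claim documents below are made-up examples, not real Kanidm output):

# Sketch: how the role_attribute_path expression from grafana.nix resolves
# against userinfo claims. Requires the "jmespath" package; claims are invented.
import jmespath

ROLE_EXPR = "contains(groups[*], 'admins') && 'Admin' || 'Editor'"

admin_claims = {"preferred_username": "alice", "groups": ["admins", "users"]}
user_claims = {"preferred_username": "bob", "groups": ["users"]}

print(jmespath.search(ROLE_EXPR, admin_claims))  # -> Admin
print(jmespath.search(ROLE_EXPR, user_claims))   # -> Editor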


@@ -78,15 +78,15 @@
# Override battery calculation using voltage (mV): (voltage - 2100) / 9
"0x54ef441000a547bd" = {
friendly_name = "0x54ef441000a547bd";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
homeassistant.battery.value_template = "{{ [[(((value_json.voltage | float) - 2100) / 9) | round(0) | int, 100] | min, 0] | max }}";
};
"0x54ef441000a54d3c" = {
friendly_name = "0x54ef441000a54d3c";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
homeassistant.battery.value_template = "{{ [[(((value_json.voltage | float) - 2100) / 9) | round(0) | int, 100] | min, 0] | max }}";
};
"0x54ef441000a564b6" = {
friendly_name = "temp_server";
homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
homeassistant.battery.value_template = "{{ [[(((value_json.voltage | float) - 2100) / 9) | round(0) | int, 100] | min, 0] | max }}";
};
# Other sensors
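
The updated templates clamp the derived battery percentage to the 0-100 range; Jinja's built-in min/max filters take a sequence, which is why the fix wraps intermediate values in lists rather than passing bounds as filter arguments. A rough Python equivalent of the intended arithmetic (the example voltages are made up):

# Rough equivalent of the battery value_template: the sensor reports voltage
# in mV, mapped to a percentage via (voltage - 2100) / 9 and clamped to 0-100.
def battery_percent(voltage_mv: float) -> int:
    raw = round((voltage_mv - 2100) / 9)
    return max(0, min(100, raw))

print(battery_percent(3050))  # -> 100 (clamped)
print(battery_percent(2550))  # -> 50
print(battery_percent(2050))  # -> 0 (clamped)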


@@ -24,12 +24,37 @@
idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;
groups = {
admins = { };
users = { };
ssh-users = { };
# overwriteMembers = false allows imperative member management via CLI
admins = { overwriteMembers = false; };
users = { overwriteMembers = false; };
ssh-users = { overwriteMembers = false; };
};
# Regular users (persons) are managed imperatively via kanidm CLI
# OAuth2/OIDC clients for service authentication
systems.oauth2.grafana = {
displayName = "Grafana";
originUrl = "https://grafana-test.home.2rjus.net/login/generic_oauth";
originLanding = "https://grafana-test.home.2rjus.net/";
basicSecretFile = config.vault.secrets.grafana-oauth2.outputDir;
preferShortUsername = true;
scopeMaps.users = [ "openid" "profile" "email" "groups" ];
};
systems.oauth2.openbao = {
displayName = "OpenBao Secrets";
# Web UI callback only (CLI localhost not supported with confidential clients)
originUrl = "https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback";
originLanding = "https://vault.home.2rjus.net:8200/";
basicSecretFile = config.vault.secrets.openbao-oauth2.outputDir;
preferShortUsername = true;
# Enable RS256 signing algorithm (required by OpenBao)
enableLegacyCrypto = true;
# Allow groups scope for role binding
scopeMaps.admins = [ "openid" "profile" "email" "groups" ];
scopeMaps.users = [ "openid" "profile" "email" "groups" ];
};
};
};
@@ -53,6 +78,24 @@
group = "kanidm";
};
# Vault secret for Grafana OAuth2 client secret
vault.secrets.grafana-oauth2 = {
secretPath = "services/grafana/oauth2-client-secret";
extractKey = "password";
services = [ "kanidm" ];
owner = "kanidm";
group = "kanidm";
};
# Vault secret for OpenBao OAuth2 client secret
vault.secrets.openbao-oauth2 = {
secretPath = "services/openbao/oauth2-client-secret";
extractKey = "password";
services = [ "kanidm" ];
owner = "kanidm";
group = "kanidm";
};
# Note: Kanidm does not expose Prometheus metrics
# If metrics support is added in the future, uncomment:
# homelab.monitoring.scrapeTargets = [


@@ -0,0 +1,92 @@
{ pkgs, ... }:
let
# TLS endpoints to monitor for certificate expiration
# These are all services using ACME certificates from OpenBao PKI
tlsTargets = [
# Direct ACME certs (security.acme.certs)
"https://vault.home.2rjus.net:8200"
"https://auth.home.2rjus.net"
"https://testvm01.home.2rjus.net"
# Caddy auto-TLS on http-proxy
"https://nzbget.home.2rjus.net"
"https://radarr.home.2rjus.net"
"https://sonarr.home.2rjus.net"
"https://ha.home.2rjus.net"
"https://z2m.home.2rjus.net"
"https://prometheus.home.2rjus.net"
"https://alertmanager.home.2rjus.net"
"https://grafana.home.2rjus.net"
"https://jelly.home.2rjus.net"
"https://pyroscope.home.2rjus.net"
"https://pushgw.home.2rjus.net"
# Caddy auto-TLS on nix-cache02
"https://nix-cache.home.2rjus.net"
# Caddy auto-TLS on grafana01
"https://grafana-test.home.2rjus.net"
];
in
{
services.prometheus.exporters.blackbox = {
enable = true;
configFile = pkgs.writeText "blackbox.yml" ''
modules:
https_cert:
prober: http
timeout: 10s
http:
fail_if_not_ssl: true
preferred_ip_protocol: ip4
valid_status_codes:
- 200
- 204
- 301
- 302
- 303
- 307
- 308
- 400
- 401
- 403
- 404
- 405
- 500
- 502
- 503
'';
};
# Add blackbox scrape config to Prometheus
# Alert rules are in rules.yml (certificate_rules group)
services.prometheus.scrapeConfigs = [
{
job_name = "blackbox_tls";
metrics_path = "/probe";
params = {
module = [ "https_cert" ];
};
static_configs = [{
targets = tlsTargets;
}];
relabel_configs = [
# Pass the target URL to blackbox as a parameter
{
source_labels = [ "__address__" ];
target_label = "__param_target";
}
# Use the target URL as the instance label
{
source_labels = [ "__param_target" ];
target_label = "instance";
}
# Point the actual scrape at the local blackbox exporter
{
target_label = "__address__";
replacement = "127.0.0.1:9115";
}
];
}
];
}
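The relabeling above means Prometheus never contacts the TLS targets directly: for each entry in tlsTargets it scrapes the local blackbox exporter with something like http://127.0.0.1:9115/probe?module=https_cert&target=https://grafana.home.2rjus.net, and the exporter performs the actual TLS handshake and exposes probe_success and probe_ssl_earliest_cert_expiry, which the certificate_rules alerts further down consume.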

View File

@@ -4,6 +4,8 @@
./loki.nix
./grafana.nix
./prometheus.nix
./blackbox.nix
./exportarr.nix
./pve.nix
./alerttonotify.nix
./pyroscope.nix

View File

@@ -0,0 +1,27 @@
{ config, ... }:
{
# Vault secret for API key
vault.secrets.sonarr-api-key = {
secretPath = "services/exportarr/sonarr";
extractKey = "api_key";
services = [ "prometheus-exportarr-sonarr-exporter" ];
};
# Sonarr exporter
services.prometheus.exporters.exportarr-sonarr = {
enable = true;
url = "http://sonarr-jail.home.2rjus.net:8989";
apiKeyFile = config.vault.secrets.sonarr-api-key.outputDir;
port = 9709;
};
# Scrape config
services.prometheus.scrapeConfigs = [
{
job_name = "sonarr";
static_configs = [{
targets = [ "localhost:9709" ];
}];
}
];
}
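The Radarr API key stored under services/exportarr/radarr (see the Terraform secrets later in this change) suggests a matching Radarr exporter exists or is planned. A sketch mirroring the Sonarr setup above, where the backend URL and port are assumptions rather than values taken from this diff:
{ config, ... }:
{
  vault.secrets.radarr-api-key = {
    secretPath = "services/exportarr/radarr";
    extractKey = "api_key";
    services = [ "prometheus-exportarr-radarr-exporter" ];
  };
  services.prometheus.exporters.exportarr-radarr = {
    enable = true;
    url = "http://radarr-jail.home.2rjus.net:7878"; # assumed host; 7878 is Radarr's default port
    apiKeyFile = config.vault.secrets.radarr-api-key.outputDir;
    port = 9708; # assumed; any free port works
  };
  services.prometheus.scrapeConfigs = [
    {
      job_name = "radarr";
      static_configs = [{ targets = [ "localhost:9708" ]; }];
    }
  ];
}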

View File

@@ -178,9 +178,7 @@ in
}
];
}
# TODO: nix-cache_caddy can't be auto-generated because the cert is issued
# for nix-cache.home.2rjus.net (service CNAME), not nix-cache01 (hostname).
# Consider adding a target override to homelab.monitoring.scrapeTargets.
# Caddy metrics from nix-cache02 (serves nix-cache.home.2rjus.net)
{
job_name = "nix-cache_caddy";
scheme = "https";

View File

@@ -118,13 +118,13 @@ groups:
description: "NSD has been down on {{ $labels.instance }} more than 5 minutes."
# Only alert on primary DNS (secondary has cold cache after failover)
- alert: unbound_low_cache_hit_ratio
expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.5
expr: (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) / (rate(unbound_cache_hits_total{dns_role="primary"}[5m]) + rate(unbound_cache_misses_total{dns_role="primary"}[5m]))) < 0.2
for: 15m
labels:
severity: warning
annotations:
summary: "Low DNS cache hit ratio on {{ $labels.instance }}"
description: "Unbound cache hit ratio is below 50% on {{ $labels.instance }}."
description: "Unbound cache hit ratio is below 20% on {{ $labels.instance }}."
- name: http_proxy_rules
rules:
- alert: caddy_down
@@ -171,37 +171,14 @@ groups:
description: "NATS has {{ $value }} slow consumers on {{ $labels.instance }}."
- name: nix_cache_rules
rules:
- alert: build_flakes_service_not_active_recently
expr: count_over_time(node_systemd_unit_state{instance="nix-cache01.home.2rjus.net:9100", name="build-flakes.service", state="active"}[1h]) < 1
for: 0m
labels:
severity: critical
annotations:
summary: "The build-flakes service on {{ $labels.instance }} has not run recently"
description: "The build-flakes service on {{ $labels.instance }} has not run recently"
- alert: build_flakes_error
expr: build_flakes_error == 1
labels:
severity: warning
annotations:
summary: "The build-flakes job has failed for host {{ $labels.host }}."
description: "The build-flakes job has failed for host {{ $labels.host }}."
- alert: harmonia_down
expr: node_systemd_unit_state {instance="nix-cache01.home.2rjus.net:9100", name = "harmonia.service", state = "active"} == 0
expr: node_systemd_unit_state{instance="nix-cache02.home.2rjus.net:9100", name="harmonia.service", state="active"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Harmonia not running on {{ $labels.instance }}"
description: "Harmonia has been down on {{ $labels.instance }} more than 5 minutes."
- alert: low_disk_space_nix
expr: node_filesystem_free_bytes{instance="nix-cache01.home.2rjus.net:9100", mountpoint="/nix"} / node_filesystem_size_bytes{instance="nix-cache01.home.2rjus.net:9100", mountpoint="/nix"} * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Disk space low on /nix for {{ $labels.instance }}"
description: "Disk space is low on /nix for host {{ $labels.instance }}. Please check."
- name: home_assistant_rules
rules:
- alert: home_assistant_down
@@ -229,13 +206,13 @@ groups:
summary: "Mosquitto not running on {{ $labels.instance }}"
description: "Mosquitto has been down on {{ $labels.instance }} more than 5 minutes."
- alert: zigbee_sensor_stale
expr: (time() - hass_last_updated_time_seconds{entity=~"sensor\\.(0x[0-9a-f]+|temp_server)_temperature"}) > 7200
expr: (time() - hass_last_updated_time_seconds{entity=~"sensor\\.(0x[0-9a-f]+|temp_server)_temperature"}) > 14400
for: 5m
labels:
severity: warning
annotations:
summary: "Zigbee sensor {{ $labels.friendly_name }} is stale"
description: "Zigbee temperature sensor {{ $labels.entity }} has not reported data for over 2 hours. The sensor may have a dead battery or connectivity issues."
description: "Zigbee temperature sensor {{ $labels.entity }} has not reported data for over 4 hours. The sensor may have a dead battery or connectivity issues."
- name: smartctl_rules
rules:
- alert: smart_critical_warning
@@ -392,3 +369,47 @@ groups:
annotations:
summary: "Cannot scrape OpenBao metrics from {{ $labels.instance }}"
description: "OpenBao metrics endpoint is not responding on {{ $labels.instance }}."
- name: certificate_rules
rules:
- alert: tls_certificate_expiring_soon
expr: (probe_ssl_earliest_cert_expiry - time()) < 86400 * 7
for: 1h
labels:
severity: warning
annotations:
summary: "TLS certificate expiring soon on {{ $labels.instance }}"
description: "The TLS certificate for {{ $labels.instance }} expires in less than 7 days."
- alert: tls_certificate_expiring_critical
expr: (probe_ssl_earliest_cert_expiry - time()) < 86400
for: 0m
labels:
severity: critical
annotations:
summary: "TLS certificate expiring within 24h on {{ $labels.instance }}"
description: "The TLS certificate for {{ $labels.instance }} expires in less than 24 hours. Immediate action required."
- alert: tls_probe_failed
expr: probe_success{job="blackbox_tls"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "TLS probe failed for {{ $labels.instance }}"
description: "Cannot connect to {{ $labels.instance }} to check TLS certificate. The service may be down or unreachable."
- name: homelab_deploy_rules
rules:
- alert: homelab_deploy_build_failed
expr: increase(homelab_deploy_build_host_total{status="failure"}[1h]) > 0
for: 0m
labels:
severity: warning
annotations:
summary: "Build failed for {{ $labels.host }} in repo {{ $labels.repo }}"
description: "Host {{ $labels.host }} failed to build from {{ $labels.repo }} repository."
- alert: homelab_deploy_builder_down
expr: up{job="homelab-deploy-builder"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Homelab deploy builder not responding on {{ $labels.instance }}"
description: "Cannot scrape homelab-deploy-builder metrics from {{ $labels.instance }} for 5 minutes."

View File

@@ -35,9 +35,18 @@
HOMELAB = {
jetstream = "enabled";
users = [
# alerttonotify (full access to HOMELAB account)
{
nkey = "UASLNKLWGICRTZMIXVD3RXLQ57XRIMCKBHP5V3PYFFRNO3E3BIJBCYMZ";
}
# nixos-exporter (restricted to nixos-exporter subjects)
{
nkey = "UBCL3ODHVERVZJNGUJ567YBBKHQZOV3LK3WO6TVVSGQOCTK2NQ3IJVRV"; # Replace with public key from: nix develop -c nk -gen user -pubout
permissions = {
publish = [ "nixos-exporter.>" ];
subscribe = [ "nixos-exporter.>" ];
};
}
];
};
@@ -65,10 +74,12 @@
publish = [
"deploy.test.>"
"deploy.discover"
"build.>"
];
subscribe = [
"deploy.responses.>"
"deploy.discover"
"build.responses.>"
];
};
}
@@ -76,8 +87,30 @@
{
nkey = "UD2BFB7DLM67P5UUVCKBUJMCHADIZLGGVUNSRLZE2ZC66FW2XT44P73Y";
permissions = {
publish = [ "deploy.>" ];
subscribe = [ "deploy.>" ];
publish = [
"deploy.>"
"build.>"
];
subscribe = [
"deploy.>"
"build.responses.>"
];
};
}
# Builder (subscribes to build requests, publishes responses)
{
nkey = "UB4PUHGKAWAK6OS62FX7DOQTPFFJTLZZBTKCOCAXDP75H3NSMWAEDJ7E";
permissions = {
subscribe = [ "build.>" ];
publish = [ "build.responses.>" ];
};
}
# Scheduler (publishes build requests, subscribes to responses)
{
nkey = "UDQ5SFEGDM66AQGLK7KQDW6ZOC2QCXE2P6EJQ6VPBSR2CRCABPOVWRI4";
permissions = {
publish = [ "build.>" ];
subscribe = [ "build.responses.>" ];
};
}
];
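The builder and scheduler permissions are asymmetric: the scheduler (and the admin deployer above) publishes to build.> and subscribes only to build.responses.>, while the builder subscribes to build.> and may publish only to build.responses.>. A leaked builder credential therefore cannot inject new build requests, and the scheduler cannot impersonate build results.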

View File

@@ -1,29 +0,0 @@
{ pkgs, ... }:
let
build-flake-script = pkgs.writeShellApplication {
name = "build-flake-script";
runtimeInputs = with pkgs; [
git
nix
nixos-rebuild
jq
curl
];
text = builtins.readFile ./build-flakes.sh;
};
in
{
systemd.services."build-flakes" = {
serviceConfig = {
Type = "exec";
ExecStart = "${build-flake-script}/bin/build-flake-script";
};
};
systemd.timers."build-flakes" = {
enable = true;
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "*-*-* *:30:00";
};
};
}

View File

@@ -1,44 +0,0 @@
JOB_NAME="build_flakes"
cd /root/nixos-servers
git pull
echo "Starting nixos-servers builds"
for host in $(nix flake show --json| jq -r '.nixosConfigurations | keys[]'); do
echo "Building $host"
if ! nixos-rebuild --verbose -L --flake ".#$host" build; then
echo "Build failed for $host"
cat <<EOF | curl -sS -X PUT --data-binary @- "https://pushgw.home.2rjus.net/metrics/job/$JOB_NAME/host/$host"
# TYPE build_flakes_error gauge
# HELP build_flakes_error 0 if the build was successful, 1 if it failed
build_flakes_error{instance="$HOSTNAME"} 1
EOF
else
echo "Build successful for $host"
cat <<EOF | curl -sS -X PUT --data-binary @- "https://pushgw.home.2rjus.net/metrics/job/$JOB_NAME/host/$host"
# TYPE build_flakes_error gauge
# HELP build_flakes_error 0 if the build was successful, 1 if it failed
build_flakes_error{instance="$HOSTNAME"} 0
EOF
fi
done
echo "All nixos-servers builds complete"
echo "Building gunter"
cd /root/nixos
git pull
host="gunter"
if ! nixos-rebuild --verbose -L --flake ".#gunter" build; then
echo "Build failed for $host"
cat <<EOF | curl -sS -X PUT --data-binary @- "https://pushgw.home.2rjus.net/metrics/job/$JOB_NAME/host/$host"
# TYPE build_flakes_error gauge
# HELP build_flakes_error 0 if the build was successful, 1 if it failed
build_flakes_error{instance="$HOSTNAME"} 1
EOF
else
echo "Build successful for $host"
cat <<EOF | curl -sS -X PUT --data-binary @- "https://pushgw.home.2rjus.net/metrics/job/$JOB_NAME/host/$host"
# TYPE build_flakes_error gauge
# HELP build_flakes_error 0 if the build was successful, 1 if it failed
build_flakes_error{instance="$HOSTNAME"} 0
EOF
fi

View File

@@ -1,10 +1,8 @@
{ ... }:
{
imports = [
./build-flakes.nix
./harmonia.nix
./proxy.nix
./nix.nix
];
}

View File

@@ -1,7 +1,7 @@
{ pkgs, config, ... }:
{
vault.secrets.cache-secret = {
secretPath = "hosts/nix-cache01/cache-secret";
secretPath = "hosts/${config.networking.hostName}/cache-secret";
extractKey = "key";
outputDir = "/run/secrets/cache-secret";
services = [ "harmonia" ];

View File

@@ -19,15 +19,34 @@
];
};
# Fetch NKey from Vault for NATS authentication
vault.secrets.nixos-exporter-nkey = {
secretPath = "shared/nixos-exporter/nkey";
extractKey = "nkey";
owner = "nixos-exporter";
group = "nixos-exporter";
};
services.prometheus.exporters.nixos = {
enable = true;
# Default port: 9971
flake = {
enable = true;
url = "git+https://git.t-juice.club/torjus/nixos-servers.git";
nats = {
enable = true;
url = "nats://nats1.home.2rjus.net:4222";
nkeySeedFile = "/run/secrets/nixos-exporter-nkey";
};
};
};
# Ensure exporter starts after Vault secret is available
systemd.services.prometheus-nixos-exporter = {
after = [ "vault-secret-nixos-exporter-nkey.service" ];
requires = [ "vault-secret-nixos-exporter-nkey.service" ];
};
# Register nixos-exporter as a Prometheus scrape target
homelab.monitoring.scrapeTargets = [
{

View File

@@ -42,7 +42,7 @@ in
"https://cuda-maintainers.cachix.org"
];
trusted-public-keys = [
"nix-cache.home.2rjus.net-1:2kowZOG6pvhoK4AHVO3alBlvcghH20wchzoR0V86UWI="
"nix-cache02.home.2rjus.net-1:QyT5FAvJtV+EPQrgQQ6iV9JMg1kRiWuIAJftM35QMls="
"cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
"cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E="
];
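The new entry implies the nix-cache02 signing key pair was generated with the key name nix-cache02.home.2rjus.net-1 (typically via nix-store --generate-binary-cache-key); clients only substitute paths signed by one of these trusted-public-keys, so the name and public half here must match the secret key harmonia signs with on nix-cache02.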

View File

@@ -33,7 +33,7 @@ variable "default_target_node" {
variable "default_template_name" {
description = "Default template VM name to clone from"
type = string
default = "nixos-25.11.20260203.e576e3c"
default = "nixos-25.11.20260207.23d72da"
}
variable "default_ssh_public_key" {

View File

@@ -15,6 +15,17 @@ path "secret/data/shared/homelab-deploy/*" {
EOT
}
# Shared policy for nixos-exporter NATS cache sharing
resource "vault_policy" "nixos_exporter" {
name = "nixos-exporter"
policy = <<EOT
path "secret/data/shared/nixos-exporter/*" {
capabilities = ["read", "list"]
}
EOT
}
# Define host access policies
locals {
host_policies = {
@@ -49,6 +60,7 @@ locals {
"secret/data/hosts/monitoring01/*",
"secret/data/shared/backup/*",
"secret/data/shared/nats/*",
"secret/data/services/exportarr/*",
]
extra_policies = ["prometheus-metrics"]
}
@@ -75,13 +87,6 @@ locals {
]
}
# Wave 5: nix-cache01
"nix-cache01" = {
paths = [
"secret/data/hosts/nix-cache01/*",
]
}
# vault01: Vault server itself (fetches secrets from itself)
"vault01" = {
paths = [
@@ -89,6 +94,24 @@ locals {
]
}
# kanidm01: Kanidm identity provider
"kanidm01" = {
paths = [
"secret/data/hosts/kanidm01/*",
"secret/data/kanidm/*",
"secret/data/services/grafana/*",
"secret/data/services/openbao/*",
]
}
# monitoring02: Grafana test instance
"monitoring02" = {
paths = [
"secret/data/hosts/monitoring02/*",
"secret/data/services/grafana/*",
]
}
}
}
@@ -114,7 +137,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = concat(
["${each.key}-policy", "homelab-deploy"],
["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
lookup(each.value, "extra_policies", [])
)

View File

@@ -33,10 +33,10 @@ locals {
"secret/data/shared/homelab-deploy/*",
]
}
"kanidm01" = {
"nix-cache02" = {
paths = [
"secret/data/hosts/kanidm01/*",
"secret/data/kanidm/*",
"secret/data/hosts/nix-cache02/*",
"secret/data/shared/homelab-deploy/*",
]
}
@@ -69,7 +69,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {
backend = vault_auth_backend.approle.path
role_name = each.key
token_policies = ["host-${each.key}", "homelab-deploy"]
token_policies = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
secret_id_ttl = 0 # Never expire (wrapped tokens provide time limit)
token_ttl = 3600
token_max_ttl = 3600

terraform/vault/oidc.tf
View File

@@ -0,0 +1,50 @@
# OIDC authentication backend for Kanidm integration
# Web UI only - CLI localhost redirects not supported with confidential clients
resource "vault_jwt_auth_backend" "oidc" {
path = "oidc"
type = "oidc"
oidc_discovery_url = "https://auth.home.2rjus.net/oauth2/openid/openbao"
oidc_client_id = "openbao"
oidc_client_secret = random_password.auto_secrets["services/openbao/oauth2-client-secret"].result
default_role = "default"
tune {
listing_visibility = "unauth"
default_lease_ttl = "1h"
max_lease_ttl = "24h"
token_type = "default-service"
}
}
# Admin role - maps Kanidm admins group to admin policy
resource "vault_jwt_auth_backend_role" "admin" {
backend = vault_jwt_auth_backend.oidc.path
role_name = "admin"
token_policies = ["oidc-admin"]
user_claim = "preferred_username"
groups_claim = "groups"
bound_claims = { groups = "admins@home.2rjus.net" }
role_type = "oidc"
oidc_scopes = ["openid", "profile", "email", "groups"]
allowed_redirect_uris = [
"https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback",
]
}
# Default role - any authenticated user (limited access)
resource "vault_jwt_auth_backend_role" "default" {
backend = vault_jwt_auth_backend.oidc.path
role_name = "default"
token_policies = ["oidc-default"]
user_claim = "preferred_username"
groups_claim = "groups"
role_type = "oidc"
oidc_scopes = ["openid", "profile", "email", "groups"]
allowed_redirect_uris = [
"https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback",
]
}

View File

@@ -8,3 +8,50 @@ path "sys/metrics" {
}
EOT
}
# OIDC admin policy - full read/write to all secrets
resource "vault_policy" "oidc_admin" {
name = "oidc-admin"
policy = <<EOT
# Full access to KV secrets
path "secret/*" {
capabilities = ["create", "read", "update", "delete", "list"]
}
# Read system health and metrics
path "sys/health" {
capabilities = ["read"]
}
path "sys/metrics" {
capabilities = ["read"]
}
# List auth methods and mounts
path "sys/auth" {
capabilities = ["read"]
}
path "sys/mounts" {
capabilities = ["read"]
}
EOT
}
# OIDC default policy - minimal access for authenticated users
resource "vault_policy" "oidc_default" {
name = "oidc-default"
policy = <<EOT
# Read own token info
path "auth/token/lookup-self" {
capabilities = ["read"]
}
# Read system health
path "sys/health" {
capabilities = ["read"]
}
EOT
}

View File

@@ -76,15 +76,9 @@ locals {
}
# Nix cache signing key
"hosts/nix-cache01/cache-secret" = {
"hosts/nix-cache02/cache-secret" = {
auto_generate = false
data = { key = var.cache_signing_key }
}
# Gitea Actions runner token
"hosts/nix-cache01/actions-token" = {
auto_generate = false
data = { token = var.actions_token_1 }
data = { key = var.cache_signing_key_02 }
}
# Homelab-deploy NKeys
@@ -103,11 +97,50 @@ locals {
data = { nkey = var.homelab_deploy_admin_deployer_nkey }
}
"shared/homelab-deploy/builder-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_builder_nkey }
}
"shared/homelab-deploy/scheduler-nkey" = {
auto_generate = false
data = { nkey = var.homelab_deploy_scheduler_nkey }
}
# Kanidm idm_admin password
"kanidm/idm-admin-password" = {
auto_generate = true
password_length = 32
}
# Grafana OAuth2 client secret (for Kanidm OIDC)
"services/grafana/oauth2-client-secret" = {
auto_generate = true
password_length = 64
}
# OpenBao OAuth2 client secret (for Kanidm OIDC)
"services/openbao/oauth2-client-secret" = {
auto_generate = true
password_length = 64
}
# NKey for nixos-exporter NATS cache sharing
"shared/nixos-exporter/nkey" = {
auto_generate = false
data = { nkey = var.nixos_exporter_nkey }
}
# Exportarr API keys for media stack monitoring
"services/exportarr/radarr" = {
auto_generate = false
data = { api_key = var.radarr_api_key }
}
"services/exportarr/sonarr" = {
auto_generate = false
data = { api_key = var.sonarr_api_key }
}
}
}

View File

@@ -40,14 +40,8 @@ variable "wireguard_private_key" {
sensitive = true
}
variable "cache_signing_key" {
description = "Nix binary cache signing key"
type = string
sensitive = true
}
variable "actions_token_1" {
description = "Gitea Actions runner token"
variable "cache_signing_key_02" {
description = "Nix binary cache signing key (nix-cache02)"
type = string
sensitive = true
}
@@ -73,3 +67,38 @@ variable "homelab_deploy_admin_deployer_nkey" {
sensitive = true
}
variable "homelab_deploy_builder_nkey" {
description = "NKey seed for homelab-deploy builder"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "homelab_deploy_scheduler_nkey" {
description = "NKey seed for scheduled build triggering"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "nixos_exporter_nkey" {
description = "NKey seed for nixos-exporter NATS authentication"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "radarr_api_key" {
description = "Radarr API key for exportarr metrics"
type = string
default = "PLACEHOLDER"
sensitive = true
}
variable "sonarr_api_key" {
description = "Sonarr API key for exportarr metrics"
type = string
default = "PLACEHOLDER"
sensitive = true
}
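Giving the sensitive variables a PLACEHOLDER default is presumably a convenience so terraform plan and validate do not prompt for every secret; the real NKey seeds and API keys are expected to be supplied via TF_VAR_* environment variables or a tfvars file before apply.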

View File

@@ -79,6 +79,20 @@ locals {
disk_size = "20G"
vault_wrapped_token = "s.OOqjEECeIV7dNgCS6jNmyY3K"
}
"monitoring02" = {
ip = "10.69.13.24/24"
cpu_cores = 4
memory = 4096
disk_size = "60G"
vault_wrapped_token = "s.uXpdoGxHXpWvTsGbHkZuq1jF"
}
"nix-cache02" = {
ip = "10.69.13.25/24"
cpu_cores = 8
memory = 20480
disk_size = "200G"
vault_wrapped_token = "s.C5EuHFyULACEqZgsLqMC3cJB"
}
}
# Compute VM configurations with defaults applied