16 Commits

Author SHA1 Message Date
16ef202530 http-proxy: set content-type header on maintenance page
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m23s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 12:43:12 +01:00
5f3508a6d4 http-proxy: temporary jellyfin maintenance page
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 12:39:18 +01:00
2ca2509083 monitoring: increase filesystem_filling_up prediction window to 24h
Some checks failed
Run nix flake check / flake-check (push) Failing after 3m55s
Reduces false positives from transient Nix store growth by basing the
linear prediction on a 24h trend instead of 6h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 09:36:27 +01:00
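For reference, a rule built on `predict_linear` over a 24h range vector looks roughly like the sketch below — a minimal illustration, assuming a Prometheus/vmalert-style rule group rendered from Nix; the metric selector, lookahead, and thresholds are placeholders, not the repo's actual rule:

```nix
# Illustrative only — the alert name matches the commit, everything else is assumed.
{
  groups = [
    {
      name = "filesystem";
      rules = [
        {
          alert = "filesystem_filling_up";
          # Fit a linear trend over the last 24h (was 6h) and project 4h ahead;
          # fire if free space is predicted to hit zero.
          expr = ''predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[24h], 4 * 3600) < 0'';
          for = "1h";
          labels.severity = "warning";
        }
      ];
    }
  ];
}
```

The longer window trades alert latency for stability: a transient Nix store spike dominates a 6h fit but barely moves a 24h one.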
58702bd10b truenas-migration: note subnet issue for 10GbE traffic
Some checks failed
Run nix flake check / flake-check (push) Failing after 7m10s
NAS and Proxmox are on the same 10GbE switch but different subnets,
forcing traffic through the router. Need to fix during migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
c9f47acb01 truenas-migration: mdadm boot mirror, clean zfs export step
Use TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root to keep
the boot path ZFS-independent. Added zfs export step before shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
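A rough sketch of what that ZFS-independent boot path could look like on the NixOS side — assuming `boot.swraid`, GRUB mirrored to both SSDs, and ext4 on the md device; device names are placeholders, not the repo's config:

```nix
# Hypothetical hosts/nas1 boot fragment — device paths are illustrative.
{
  boot.swraid.enable = true;          # assemble the mdadm RAID1 early in boot
  boot.loader.grub.mirroredBoots = [  # install the bootloader to both SSDs
    { devices = [ "/dev/sda" ]; path = "/boot"; }
    { devices = [ "/dev/sdb" ]; path = "/boot"; }
  ];
  fileSystems."/" = {
    device = "/dev/md0";              # RAID1 over the old TrueNAS boot-pool SSDs
    fsType = "ext4";                  # no ZFS anywhere on the boot path
  };
}
```

Either SSD can then fail without affecting the ability to boot and import the data pool.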
09ce018fb2 truenas-migration: switch from BTRFS to keeping ZFS, update plan
BTRFS RAID5/6 write hole is still unresolved, and RAID1 wastes
capacity with mixed disk sizes. Keep existing ZFS pool and import
directly on NixOS instead. Updated migration strategy, disk purchase
decision (2x 24TB ordered), SMART health notes, and vdev rebalancing
guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 01:34:46 +01:00
3042803c4d flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/fa56d7d6de78f5a7f997b0ea2bc6efd5868ad9e8?narHash=sha256-X01Q3DgSpjeBpapoGA4rzKOn25qdKxbPnxHeMLNoHTU%3D' (2026-02-16)
  → 'github:nixos/nixpkgs/6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47?narHash=sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU%3D' (2026-02-18)
2026-02-20 00:07:01 +00:00
1e7200b494 quick-plan: add mermaid diagram guideline
Some checks failed
Run nix flake check / flake-check (push) Failing after 5m7s
Periodic flake update / flake-update (push) Successful in 5m26s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:35:53 +01:00
eec1e374b2 docs: simplify mermaid diagram labels
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m0s
Use <br/> for line breaks and shorter node labels so the diagram
renders cleanly in Gitea.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:29:52 +01:00
fcc410afad docs: replace ASCII diagram with mermaid in remote-access plan
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 16:28:57 +01:00
59f0c7ceda flake.lock: update homelab-deploy
Some checks failed
Run nix flake check / flake-check (push) Failing after 8m10s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 09:04:03 +01:00
d713f06c6e flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs-unstable':
    'github:nixos/nixpkgs/a82ccc39b39b621151d6732718e3e250109076fa?narHash=sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb%2BZnAo5RzSxJg%3D' (2026-02-13)
  → 'github:nixos/nixpkgs/0182a361324364ae3f436a63005877674cf45efb?narHash=sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ%3D' (2026-02-17)
2026-02-19 00:01:44 +00:00
7374d1ff7f nix-cache02: increase builder timeout to 4 hours
Some checks failed
Run nix flake check / flake-check (push) Failing after 4m4s
Periodic flake update / flake-update (push) Successful in 2m32s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 23:53:33 +01:00
e912c75b6c flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/3aadb7ca9eac2891d52a9dec199d9580a6e2bf44?narHash=sha256-O1XDr7EWbRp%2BkHrNNgLWgIrB0/US5wvw9K6RERWAj6I%3D' (2026-02-14)
  → 'github:nixos/nixpkgs/fa56d7d6de78f5a7f997b0ea2bc6efd5868ad9e8?narHash=sha256-X01Q3DgSpjeBpapoGA4rzKOn25qdKxbPnxHeMLNoHTU%3D' (2026-02-16)
2026-02-18 00:01:34 +00:00
b218b4f8bc docs: update migration plan for monitoring01 and pgdb1 completion
Some checks failed
Run nix flake check / flake-check (push) Failing after 16m37s
Periodic flake update / flake-update (push) Successful in 2m21s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:26:23 +01:00
65acf13e6f grafana: fix datasource UIDs for VictoriaMetrics migration
Some checks failed
Run nix flake check / flake-check (push) Has been cancelled
Update all dashboard datasource references from "prometheus" to
"victoriametrics" to match the declared datasource UID. Enable
prune and deleteDatasources to clean up the old Prometheus
(monitoring01) datasource from Grafana's database.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:23:04 +01:00
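For context, the provisioning side of this change is a small amount of config; a minimal sketch of what it might look like through the NixOS grafana module, assuming `provision.datasources.settings` maps straight to Grafana's provisioning YAML — the URL, names, and the deleted-datasource entry are illustrative, not the repo's actual values:

```nix
# Hypothetical fragment — keys mirror Grafana's datasource provisioning file.
{
  services.grafana.provision.datasources.settings = {
    apiVersion = 1;
    prune = true;                           # drop provisioned datasources absent from this file
    deleteDatasources = [
      { name = "Prometheus"; orgId = 1; }   # old monitoring01 datasource (name assumed)
    ];
    datasources = [
      {
        name = "VictoriaMetrics";
        type = "prometheus";                # VictoriaMetrics speaks the Prometheus API
        uid = "victoriametrics";            # must match the UID the dashboards reference
        url = "http://monitoring02:8428";   # placeholder URL
        isDefault = true;
      }
    ];
  };
}
```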
16 changed files with 256 additions and 216 deletions

View File

@@ -73,6 +73,7 @@ Additional context, caveats, or references.
 - **Reference existing patterns**: Mention how this fits with existing infrastructure
 - **Tables for comparisons**: Use markdown tables when comparing options
 - **Practical focus**: Emphasize what needs to happen, not theory
+- **Mermaid diagrams**: Use mermaid code blocks for architecture diagrams, flow charts, or other graphs when relevant to the plan. Keep node labels short and use `<br/>` for line breaks
 ## Examples of Good Plans

View File

@@ -20,9 +20,9 @@ Hosts to migrate:
 | http-proxy | Stateless | Reverse proxy, recreate |
 | nats1 | Stateless | Messaging, recreate |
 | ha1 | Stateful | Home Assistant + Zigbee2MQTT + Mosquitto |
-| monitoring01 | Stateful | Prometheus, Grafana, Loki |
+| ~~monitoring01~~ | ~~Decommission~~ | ✓ Complete — replaced by monitoring02 (VictoriaMetrics) |
 | jelly01 | Stateful | Jellyfin metadata, watch history, config |
-| pgdb1 | Decommission | Only used by Open WebUI on gunter, migrating to local postgres |
+| ~~pgdb1~~ | ~~Decommission~~ | ✓ Complete |
 | ~~jump~~ | ~~Decommission~~ | ✓ Complete |
 | ~~auth01~~ | ~~Decommission~~ | ✓ Complete |
 | ~~ca~~ | ~~Deferred~~ | ✓ Complete |
@@ -31,10 +31,12 @@ Hosts to migrate:
 Before migrating any stateful host, ensure restic backups are in place and verified.
-### 1a. Expand monitoring01 Grafana Backup
-The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
-Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.
+### ~~1a. Expand monitoring01 Grafana Backup~~ ✓ N/A
+~~The existing backup only covers `/var/lib/grafana/plugins` and a sqlite dump of `grafana.db`.
+Expand to back up all of `/var/lib/grafana/` to capture config directory and any other state.~~
+No longer needed — monitoring01 decommissioned, replaced by monitoring02 with declarative Grafana dashboards.
 ### 1b. Add Jellyfin Backup to jelly01
@@ -94,15 +96,17 @@ For each stateful host, the procedure is:
 7. Start services and verify functionality
 8. Decommission the old VM
-### 3a. monitoring01
-1. Run final Grafana backup
-2. Provision new monitoring01 via OpenTofu
-3. After bootstrap, restore `/var/lib/grafana/` from restic
-4. Restart Grafana, verify dashboards and datasources are intact
-5. Prometheus and Loki start fresh with empty data (acceptable)
-6. Verify all scrape targets are being collected
-7. Decommission old VM
+### 3a. monitoring01 ✓ COMPLETE
+~~1. Run final Grafana backup~~
+~~2. Provision new monitoring01 via OpenTofu~~
+~~3. After bootstrap, restore `/var/lib/grafana/` from restic~~
+~~4. Restart Grafana, verify dashboards and datasources are intact~~
+~~5. Prometheus and Loki start fresh with empty data (acceptable)~~
+~~6. Verify all scrape targets are being collected~~
+~~7. Decommission old VM~~
+Replaced by monitoring02 with VictoriaMetrics, standalone Loki and Grafana modules. Host configuration, old service modules, and terraform resources removed.
 ### 3b. jelly01
@@ -163,19 +167,19 @@ Host was already removed from flake.nix and VM destroyed. Configuration cleaned
 Host configuration, services, and VM already removed.
-### pgdb1 (in progress)
-Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.
-1. ~~Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
-2. ~~Remove host configuration from `hosts/pgdb1/`~~
-3. ~~Remove `services/postgres/` (only used by pgdb1)~~
-4. ~~Remove from `flake.nix`~~
-5. ~~Remove Vault AppRole from `terraform/vault/approle.tf`~~
-6. Destroy the VM in Proxmox
-7. ~~Commit cleanup~~
-See `docs/plans/pgdb1-decommission.md` for detailed plan.
+### pgdb1 ✓ COMPLETE
+~~Only consumer was Open WebUI on gunter, which has been migrated to use local PostgreSQL.~~
+~~1. Verify Open WebUI on gunter is using local PostgreSQL (not pgdb1)~~
+~~2. Remove host configuration from `hosts/pgdb1/`~~
+~~3. Remove `services/postgres/` (only used by pgdb1)~~
+~~4. Remove from `flake.nix`~~
+~~5. Remove Vault AppRole from `terraform/vault/approle.tf`~~
+~~6. Destroy the VM in Proxmox~~
+~~7. Commit cleanup~~
+Host configuration, services, terraform resources, and VM removed. See `docs/plans/pgdb1-decommission.md` for detailed plan.
 ## Phase 5: Decommission ca Host ✓ COMPLETE

View File

@@ -24,29 +24,20 @@ After evaluating WireGuard gateway vs Headscale (self-hosted Tailscale), the **W
 ## Architecture
-```
-                 ┌─────────────────────────────────┐
-                 │ VPS (OpenStack)                 │
-Laptop/Phone ──→ │ WireGuard endpoint              │
- (WireGuard)     │ Client peers: laptop, phone     │
-                 │ Routes 10.69.13.0/24 via tunnel │
-                 └──────────┬──────────────────────┘
-                            │ WireGuard tunnel
-                 ┌─────────────────────────────────┐
-                 │ extgw01 (gateway + bastion)     │
-                 │ - WireGuard tunnel to VPS       │
-                 │ - Firewall (allowlist only)     │
-                 │ - SSH + 2FA (full access)       │
-                 └──────────┬──────────────────────┘
-                            │ allowed traffic only
-                 ┌─────────────────────────────────┐
-                 │ Internal network 10.69.13.0/24  │
-                 │ - monitoring01:3000 (Grafana)   │
-                 │ - jelly01:8096 (Jellyfin)       │
-                 │ - *-jail hosts (arr stack)      │
-                 └─────────────────────────────────┘
+```mermaid
+graph TD
+    clients["Laptop / Phone"]
+    vps["VPS<br/>(WireGuard endpoint)"]
+    extgw["extgw01<br/>(gateway + bastion)"]
+    grafana["Grafana<br/>monitoring01:3000"]
+    jellyfin["Jellyfin<br/>jelly01:8096"]
+    arr["arr stack<br/>*-jail hosts"]
+    clients -->|WireGuard| vps
+    vps -->|WireGuard tunnel| extgw
+    extgw -->|allowed traffic| grafana
+    extgw -->|allowed traffic| jellyfin
+    extgw -->|allowed traffic| arr
 ```
 ### Existing path (unchanged)
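For reference, the VPS-to-extgw01 leg of this diagram is a single WireGuard interface on each side; a minimal sketch of the gateway side, assuming the `networking.wireguard` module — all addresses, ports, and key paths are placeholders, not the repo's values:

```nix
# Hypothetical extgw01 tunnel fragment — every concrete value here is illustrative.
{
  networking.wireguard.interfaces.wg0 = {
    ips = [ "10.100.0.2/24" ];                  # tunnel-internal address
    privateKeyFile = "/var/lib/wireguard/key";  # keep private keys out of the store
    peers = [
      {
        publicKey = "<VPS public key>";
        endpoint = "vps.example.net:51820";
        allowedIPs = [ "10.100.0.0/24" ];
        persistentKeepalive = 25;               # keep NAT/firewall mappings alive
      }
    ];
  };
}
```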

View File

@@ -39,23 +39,17 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
 - nzbget: NixOS service or OCI container
 - NFS exports: `services.nfs.server`
-### Filesystem: BTRFS RAID1
-**Decision**: Migrate from ZFS to BTRFS with RAID1
+### Filesystem: Keep ZFS
+**Decision**: Keep existing ZFS pool, import on NixOS
 **Rationale**:
-- **In-kernel**: No out-of-tree module issues like ZFS
-- **Flexible expansion**: Add individual disks, not required to buy pairs
-- **Mixed disk sizes**: Better handling than ZFS multi-vdev approach
-- **RAID level conversion**: Can convert between RAID levels in place
-- Built-in checksumming, snapshots, compression (zstd)
-- NixOS has good BTRFS support
-**BTRFS RAID1 notes**:
-- "RAID1" means 2 copies of all data
-- Distributes across all available devices
-- With 6+ disks, provides redundancy + capacity scaling
-- RAID5/6 avoided (known issues), RAID1/10 are stable
+- **No data migration needed**: Existing ZFS pool can be imported directly on NixOS
+- **Proven reliability**: Pool has been running reliably on TrueNAS
+- **NixOS ZFS support**: Well-supported, declarative configuration via `boot.zfs` and `services.zfs`
+- **BTRFS RAID5/6 unreliable**: Research showed BTRFS RAID5/6 write hole is still unresolved
+- **BTRFS RAID1 wasteful**: With mixed disk sizes, RAID1 wastes significant capacity vs ZFS mirrors
+- Checksumming, snapshots, compression (lz4/zstd) all available
 ### Hardware: Keep Existing + Add Disks
@@ -69,83 +63,94 @@ Expand storage capacity for the main hdd-pool. Since we need to add disks anyway
 **Storage architecture**:
-**Bulk storage** (BTRFS RAID1 on HDDs):
-- Current: 6x HDDs (2x16TB + 2x8TB + 2x8TB)
-- Add: 2x new HDDs (size TBD)
+**hdd-pool** (ZFS mirrors):
+- Current: 3 mirror vdevs (2x16TB + 2x8TB + 2x8TB) = 32TB usable
+- Add: mirror-3 with 2x 24TB = +24TB usable
+- Total after expansion: ~56TB usable
 - Use: Media, downloads, backups, non-critical data
-- Risk tolerance: High (data mostly replaceable)
-**Critical data** (small volume):
-- Use 2x 240GB SSDs in mirror (BTRFS or ZFS)
-- Or use 2TB NVMe for critical data
-- Risk tolerance: Low (data important but small)
 ### Disk Purchase Decision
-**Options under consideration**:
-**Option A: 2x 16TB drives**
-- Matches largest current drives
-- Enables potential future RAID5 if desired (6x 16TB array)
-- More conservative capacity increase
-**Option B: 2x 20-24TB drives**
-- Larger capacity headroom
-- Better $/TB ratio typically
-- Future-proofs better
-**Initial purchase**: 2 drives (chassis has space for 2 more without modifications)
+**Decision**: 2x 24TB drives (ordered, arriving 2026-02-21)
 ## Migration Strategy
 ### High-Level Plan
-1. **Preparation**:
-   - Purchase 2x new HDDs (16TB or 20-24TB)
-   - Create NixOS configuration for new storage host
-   - Set up bare metal NixOS installation
+1. **Expand ZFS pool** (on TrueNAS):
+   - Install 2x 24TB drives (may need new drive trays - order from abroad if needed)
+   - If chassis space is limited, temporarily replace the two oldest 8TB drives (da0/ada4)
+   - Add as mirror-3 vdev to hdd-pool
+   - Verify pool health and resilver completes
+   - Check SMART data on old 8TB drives (all healthy as of 2026-02-20, no reallocated sectors)
+   - Burn-in: at minimum short + long SMART test before adding to pool
-2. **Initial BTRFS pool**:
-   - Install 2 new disks
-   - Create BTRFS filesystem in RAID1
-   - Mount and test NFS exports
+2. **Prepare NixOS configuration**:
+   - Create host configuration (`hosts/nas1/` or similar)
+   - Configure ZFS pool import (`boot.zfs.extraPools`)
+   - Set up services: radarr, sonarr, nzbget, restic-rest, NFS
+   - Configure monitoring (node-exporter, promtail, smartctl-exporter)
-3. **Data migration**:
-   - Copy data from TrueNAS ZFS pool to new BTRFS pool over 10GbE
-   - Verify data integrity
+3. **Install NixOS**:
+   - `zfs export hdd-pool` on TrueNAS before shutdown (clean export)
+   - Wipe TrueNAS boot-pool SSDs, set up as mdadm RAID1 for NixOS root
+   - Install NixOS on mdadm mirror (keeps boot path ZFS-independent)
+   - Import hdd-pool via `boot.zfs.extraPools`
+   - Verify all datasets mount correctly
-4. **Expand pool**:
-   - As old ZFS pool is emptied, wipe drives and add to BTRFS pool
-   - Pool grows incrementally: 2 → 4 → 6 → 8 disks
-   - BTRFS rebalances data across new devices
-5. **Service migration**:
-   - Set up radarr/sonarr/nzbget/restic as NixOS services
-   - Update NFS client mounts on consuming hosts
-6. **Cutover**:
-   - Point consumers to new NAS host
+4. **Service migration**:
+   - Configure NixOS services to use ZFS dataset paths
+   - Update NFS exports
+   - Test from consuming hosts
+5. **Cutover**:
+   - Update DNS/client mounts if IP changes
+   - Verify monitoring integration
    - Decommission TrueNAS
-   - Repurpose hardware or keep as spare
+### Post-Expansion: Vdev Rebalancing
+ZFS has no built-in rebalance command. After adding the new 24TB vdev, ZFS will
+write new data preferentially to it (most free space), leaving old vdevs packed
+at ~97%. This is suboptimal but not urgent once overall pool usage drops to ~50%.
+To gradually rebalance, rewrite files in place so ZFS redistributes blocks across
+all vdevs proportional to free space:
+```bash
+# Rewrite files individually (spreads blocks across all vdevs)
+find /pool/dataset -type f -exec sh -c '
+  for f; do cp "$f" "$f.rebal" && mv "$f.rebal" "$f"; done
+' _ {} +
+```
+Avoid `zfs send/recv` for large datasets (e.g. 20TB) as this would concentrate
+data on the emptiest vdev rather than spreading it evenly.
+**Recommendation**: Do this after NixOS migration is stable. Not urgent - the pool
+will function fine with uneven distribution, just slightly suboptimal for performance.
 ### Migration Advantages
-- **Low risk**: New pool created independently, old data remains intact during migration
-- **Incremental**: Can add old disks one at a time as space allows
-- **Flexible**: BTRFS handles mixed disk sizes gracefully
-- **Reversible**: Keep TrueNAS running until fully validated
+- **No data migration**: ZFS pool imported directly, no copying terabytes of data
+- **Low risk**: Pool expansion done on stable TrueNAS before OS swap
+- **Reversible**: Can boot back to TrueNAS if NixOS has issues (ZFS pool is OS-independent)
+- **Quick cutover**: Once NixOS config is ready, the OS swap is fast
 ## Next Steps
-1. Decide on disk size (16TB vs 20-24TB)
-2. Purchase disks
-3. Design NixOS host configuration (`hosts/nas1/`)
-4. Plan detailed migration timeline
-5. Document NFS export mapping (current → new)
+1. ~~Decide on disk size~~ - 2x 24TB ordered
+2. Install drives and add mirror vdev to ZFS pool
+3. Check SMART data on 8TB drives - decide whether to keep or retire
+4. Design NixOS host configuration (`hosts/nas1/`)
+5. Document NFS export mapping (current -> new)
+6. Plan NixOS installation and cutover
 ## Open Questions
-- [ ] Final decision on disk size?
 - [ ] Hostname for new NAS host? (nas1? storage1?)
-- [ ] IP address allocation (keep 10.69.12.50 or new IP?)
-- [ ] Timeline/maintenance window for migration?
+- [ ] IP address/subnet: NAS and Proxmox are both on 10GbE to the same switch but different subnets, forcing traffic through the router (bottleneck). Move to same subnet during migration.
+- [x] Boot drive: Reuse TrueNAS boot-pool SSDs as mdadm RAID1 for NixOS root (no ZFS on boot path)
+- [ ] Retire old 8TB drives? (SMART looks healthy, keep unless chassis space is needed)
+- [ ] Drive trays: do new 24TB drives fit, or order trays from abroad?
+- [ ] Timeline/maintenance window for NixOS swap?
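The pool-import configuration that steps 2 and 3 rely on is only a few lines; a minimal sketch, assuming the `hosts/nas1/` layout floated above — the hostId, dataset path, and export list are placeholders, not the repo's values:

```nix
# Hypothetical hosts/nas1 storage fragment — concrete values are illustrative.
{
  boot.supportedFilesystems = [ "zfs" ];
  networking.hostId = "deadbeef";          # ZFS refuses to import without a hostId
  boot.zfs.extraPools = [ "hdd-pool" ];    # import the existing TrueNAS pool at boot
  services.zfs.autoScrub.enable = true;    # periodic scrubs, as on TrueNAS
  services.nfs.server = {
    enable = true;
    exports = ''
      /hdd-pool/media 10.69.13.0/24(ro)
    '';                                    # export paths and subnet are placeholders
  };
}
```

Because the pool was cleanly exported before shutdown, `extraPools` can import it without `-f`, which is also what keeps the TrueNAS fallback path open.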

flake.lock generated
View File

@@ -28,11 +28,11 @@
         ]
       },
       "locked": {
-        "lastModified": 1771004123,
-        "narHash": "sha256-Jw36EzL4IGIc2TmeZGphAAUrJXoWqfvCbybF8bTHgMA=",
+        "lastModified": 1771488195,
+        "narHash": "sha256-2kMxqdDyPluRQRoES22Y0oSjp7pc5fj2nRterfmSIyc=",
         "ref": "master",
-        "rev": "e5e8be86ecdcae8a5962ba3bddddfe91b574792b",
-        "revCount": 36,
+        "rev": "2d26de50559d8acb82ea803764e138325d95572c",
+        "revCount": 37,
         "type": "git",
         "url": "https://git.t-juice.club/torjus/homelab-deploy"
       },
@@ -64,11 +64,11 @@
     },
     "nixpkgs": {
       "locked": {
-        "lastModified": 1771043024,
-        "narHash": "sha256-O1XDr7EWbRp+kHrNNgLWgIrB0/US5wvw9K6RERWAj6I=",
+        "lastModified": 1771419570,
+        "narHash": "sha256-bxAlQgre3pcQcaRUm/8A0v/X8d2nhfraWSFqVmMcBcU=",
         "owner": "nixos",
         "repo": "nixpkgs",
-        "rev": "3aadb7ca9eac2891d52a9dec199d9580a6e2bf44",
+        "rev": "6d41bc27aaf7b6a3ba6b169db3bd5d6159cfaa47",
         "type": "github"
       },
       "original": {
@@ -80,11 +80,11 @@
     },
     "nixpkgs-unstable": {
      "locked": {
-        "lastModified": 1771008912,
-        "narHash": "sha256-gf2AmWVTs8lEq7z/3ZAsgnZDhWIckkb+ZnAo5RzSxJg=",
+        "lastModified": 1771369470,
+        "narHash": "sha256-0NBlEBKkN3lufyvFegY4TYv5mCNHbi5OmBDrzihbBMQ=",
         "owner": "nixos",
         "repo": "nixpkgs",
-        "rev": "a82ccc39b39b621151d6732718e3e250109076fa",
+        "rev": "0182a361324364ae3f436a63005877674cf45efb",
         "type": "github"
       },
       "original": {

View File

@@ -25,7 +25,7 @@
     };
   };
-  timeout = 7200;
+  timeout = 14400;
   metrics.enable = true;
 };

View File

@@ -19,7 +19,7 @@
"title": "SSH Connections", "title": "SSH Connections",
"type": "stat", "type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}, "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})", "expr": "sum(oubliette_ssh_connections_total{job=\"apiary\"})",
@@ -51,7 +51,7 @@
"title": "Active Sessions", "title": "Active Sessions",
"type": "stat", "type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0}, "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "oubliette_sessions_active{job=\"apiary\"}", "expr": "oubliette_sessions_active{job=\"apiary\"}",
@@ -86,7 +86,7 @@
"title": "Unique IPs", "title": "Unique IPs",
"type": "stat", "type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0}, "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "oubliette_storage_unique_ips{job=\"apiary\"}", "expr": "oubliette_storage_unique_ips{job=\"apiary\"}",
@@ -118,7 +118,7 @@
"title": "Total Login Attempts", "title": "Total Login Attempts",
"type": "stat", "type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 0}, "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "oubliette_storage_login_attempts_total{job=\"apiary\"}", "expr": "oubliette_storage_login_attempts_total{job=\"apiary\"}",
@@ -150,7 +150,7 @@
"title": "SSH Connections Over Time", "title": "SSH Connections Over Time",
"type": "timeseries", "type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}, "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s", "interval": "60s",
"targets": [ "targets": [
{ {
@@ -183,7 +183,7 @@
"title": "Auth Attempts Over Time", "title": "Auth Attempts Over Time",
"type": "timeseries", "type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}, "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s", "interval": "60s",
"targets": [ "targets": [
{ {
@@ -216,7 +216,7 @@
"title": "Sessions by Shell", "title": "Sessions by Shell",
"type": "timeseries", "type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22}, "gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s", "interval": "60s",
"targets": [ "targets": [
{ {
@@ -249,7 +249,7 @@
"title": "Attempts by Country", "title": "Attempts by Country",
"type": "geomap", "type": "geomap",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 12}, "gridPos": {"h": 10, "w": 24, "x": 0, "y": 12},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "oubliette_auth_attempts_by_country_total{job=\"apiary\"}", "expr": "oubliette_auth_attempts_by_country_total{job=\"apiary\"}",
@@ -318,7 +318,7 @@
"title": "Session Duration Distribution", "title": "Session Duration Distribution",
"type": "heatmap", "type": "heatmap",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 30}, "gridPos": {"h": 8, "w": 24, "x": 0, "y": 30},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s", "interval": "60s",
"targets": [ "targets": [
{ {
@@ -359,7 +359,7 @@
"title": "Commands Executed by Shell", "title": "Commands Executed by Shell",
"type": "timeseries", "type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22}, "gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"interval": "60s", "interval": "60s",
"targets": [ "targets": [
{ {

View File

@@ -16,7 +16,7 @@
"title": "Endpoints Monitored", "title": "Endpoints Monitored",
"type": "stat", "type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0}, "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})", "expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})",
@@ -48,7 +48,7 @@
"title": "Probe Failures", "title": "Probe Failures",
"type": "stat", "type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0}, "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)", "expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)",
@@ -82,7 +82,7 @@
"title": "Expiring Soon (< 7d)", "title": "Expiring Soon (< 7d)",
"type": "stat", "type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0}, "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)", "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)",
@@ -116,7 +116,7 @@
"title": "Expiring Critical (< 24h)", "title": "Expiring Critical (< 24h)",
"type": "stat", "type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0}, "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)", "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)",
@@ -150,7 +150,7 @@
"title": "Minimum Days Remaining", "title": "Minimum Days Remaining",
"type": "gauge", "type": "gauge",
"gridPos": {"h": 4, "w": 8, "x": 16, "y": 0}, "gridPos": {"h": 4, "w": 8, "x": 16, "y": 0},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)", "expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)",
@@ -187,7 +187,7 @@
"title": "Certificate Expiry by Endpoint", "title": "Certificate Expiry by Endpoint",
"type": "table", "type": "table",
"gridPos": {"h": 12, "w": 12, "x": 0, "y": 4}, "gridPos": {"h": 12, "w": 12, "x": 0, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400", "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -253,7 +253,7 @@
"title": "Probe Status", "title": "Probe Status",
"type": "table", "type": "table",
"gridPos": {"h": 12, "w": 12, "x": 12, "y": 4}, "gridPos": {"h": 12, "w": 12, "x": 12, "y": 4},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "probe_success{job=\"blackbox_tls\"}", "expr": "probe_success{job=\"blackbox_tls\"}",
@@ -340,7 +340,7 @@
"title": "Certificate Expiry Over Time", "title": "Certificate Expiry Over Time",
"type": "timeseries", "type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 16}, "gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400", "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
@@ -378,7 +378,7 @@
"title": "Probe Success Rate", "title": "Probe Success Rate",
"type": "timeseries", "type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 24}, "gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100", "expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100",
@@ -418,7 +418,7 @@
"title": "Probe Duration", "title": "Probe Duration",
"type": "timeseries", "type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 24}, "gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
"datasource": {"type": "prometheus", "uid": "prometheus"}, "datasource": {"type": "prometheus", "uid": "victoriametrics"},
"targets": [ "targets": [
{ {
"expr": "probe_duration_seconds{job=\"blackbox_tls\"}", "expr": "probe_duration_seconds{job=\"blackbox_tls\"}",

View File

@@ -15,7 +15,7 @@
     {
       "name": "tier",
       "type": "query",
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "query": "label_values(nixos_flake_info, tier)",
       "refresh": 2,
       "includeAll": true,
@@ -30,7 +30,7 @@
       "title": "Hosts Behind Remote",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 1)",
@@ -65,7 +65,7 @@
       "title": "Hosts Needing Reboot",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count(nixos_config_mismatch{tier=~\"$tier\"} == 1)",
@@ -100,7 +100,7 @@
       "title": "Total Hosts",
       "type": "stat",
       "gridPos": {"h": 4, "w": 3, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count(nixos_flake_info{tier=~\"$tier\"})",
@@ -128,7 +128,7 @@
       "title": "Nixpkgs Age",
       "type": "stat",
       "gridPos": {"h": 4, "w": 3, "x": 11, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "max(nixos_flake_input_age_seconds{input=\"nixpkgs\", tier=~\"$tier\"})",
@@ -163,7 +163,7 @@
       "title": "Hosts Up-to-date",
       "type": "stat",
       "gridPos": {"h": 4, "w": 3, "x": 14, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 0)",
@@ -192,7 +192,7 @@
       "title": "Deployments (24h)",
       "type": "stat",
       "gridPos": {"h": 4, "w": 3, "x": 17, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sum(increase(homelab_deploy_deployments_total{status=\"completed\"}[24h]))",
@@ -222,7 +222,7 @@
       "title": "Avg Deploy Time",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sum(increase(homelab_deploy_deployment_duration_seconds_sum{success=\"true\"}[24h])) / sum(increase(homelab_deploy_deployment_duration_seconds_count{success=\"true\"}[24h]))",
@@ -256,7 +256,7 @@
       "title": "Fleet Status",
       "type": "table",
       "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "nixos_flake_info{tier=~\"$tier\"}",
@@ -430,7 +430,7 @@
       "title": "Generation Age by Host",
       "type": "bargauge",
       "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sort_desc(nixos_generation_age_seconds{tier=~\"$tier\"})",
@@ -467,7 +467,7 @@
       "title": "Generations per Host",
       "type": "bargauge",
       "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sort_desc(nixos_generation_count{tier=~\"$tier\"})",
@@ -501,7 +501,7 @@
       "title": "Deployment Activity (Generation Age Over Time)",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 24, "x": 0, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
@@ -534,7 +534,7 @@
       "title": "Flake Input Ages",
       "type": "table",
       "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "max by (input) (nixos_flake_input_age_seconds)",
@@ -577,7 +577,7 @@
       "title": "Hosts by Revision",
       "type": "piechart",
       "gridPos": {"h": 6, "w": 6, "x": 12, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count by (current_rev) (nixos_flake_info{tier=~\"$tier\"})",
@@ -601,7 +601,7 @@
       "title": "Hosts by Tier",
       "type": "piechart",
       "gridPos": {"h": 6, "w": 6, "x": 18, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count by (tier) (nixos_flake_info)",
@@ -641,7 +641,7 @@
       "title": "Builds (24h)",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 0, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[24h]))",
@@ -671,7 +671,7 @@
       "title": "Failed Builds (24h)",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 4, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sum(increase(homelab_deploy_build_host_total{status=\"failure\"}[24h])) or vector(0)",
@@ -705,7 +705,7 @@
       "title": "Last Build",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 8, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "time() - max(homelab_deploy_build_last_timestamp)",
@@ -739,7 +739,7 @@
       "title": "Avg Build Time",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 12, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sum(increase(homelab_deploy_build_duration_seconds_sum[24h])) / sum(increase(homelab_deploy_build_duration_seconds_count[24h]))",
@@ -773,7 +773,7 @@
       "title": "Total Hosts Built",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 16, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count(homelab_deploy_build_duration_seconds_count)",
@@ -802,7 +802,7 @@
       "title": "Build Jobs (24h)",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 20, "y": 37},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sum(increase(homelab_deploy_builds_total[24h]))",
@@ -832,7 +832,7 @@
       "title": "Build Time by Host",
       "type": "bargauge",
       "gridPos": {"h": 8, "w": 12, "x": 0, "y": 41},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sort_desc(homelab_deploy_build_duration_seconds_sum / homelab_deploy_build_duration_seconds_count)",
@@ -869,7 +869,7 @@
       "title": "Build Count by Host",
       "type": "bargauge",
       "gridPos": {"h": 8, "w": 12, "x": 12, "y": 41},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sort_desc(sum by (host) (homelab_deploy_build_host_total))",
@@ -903,7 +903,7 @@
       "title": "Build Activity",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 24, "x": 0, "y": 49},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "sum(increase(homelab_deploy_build_host_total{status=\"success\"}[1h]))",

View File

@@ -11,7 +11,7 @@
     {
       "name": "instance",
       "type": "query",
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "query": "label_values(node_uname_info, instance)",
       "refresh": 2,
       "includeAll": false,
@@ -26,7 +26,7 @@
       "title": "CPU Usage",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
@@ -55,7 +55,7 @@
       "title": "Memory Usage",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
@@ -84,7 +84,7 @@
       "title": "Disk Usage",
       "type": "gauge",
       "gridPos": {"h": 8, "w": 8, "x": 0, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
@@ -113,7 +113,7 @@
       "title": "System Load",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "node_load1{instance=~\"$instance\"}",
@@ -142,7 +142,7 @@
       "title": "Uptime",
       "type": "stat",
       "gridPos": {"h": 8, "w": 8, "x": 16, "y": 8},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "time() - node_boot_time_seconds{instance=~\"$instance\"}",
@@ -161,7 +161,7 @@
       "title": "Network Traffic",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
@@ -185,7 +185,7 @@
       "title": "Disk I/O",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",

View File

@@ -15,7 +15,7 @@
     {
       "name": "vm",
       "type": "query",
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "query": "label_values(pve_guest_info{template=\"0\"}, name)",
       "refresh": 2,
       "includeAll": true,
@@ -30,7 +30,7 @@
       "title": "VMs Running",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 1)",
@@ -56,7 +56,7 @@
       "title": "VMs Stopped",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 0)",
@@ -87,7 +87,7 @@
       "title": "Node CPU",
       "type": "gauge",
       "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "pve_cpu_usage_ratio{id=~\"node/.*\"} * 100",
@@ -120,7 +120,7 @@
       "title": "Node Memory",
       "type": "gauge",
       "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
@@ -153,7 +153,7 @@
       "title": "Node Uptime",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "pve_uptime_seconds{id=~\"node/.*\"}",
@@ -180,7 +180,7 @@
       "title": "Templates",
       "type": "stat",
       "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "count(pve_guest_info{template=\"1\"})",
@@ -206,7 +206,7 @@
       "title": "VM Status",
       "type": "table",
       "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -362,7 +362,7 @@
       "title": "VM CPU Usage",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
@@ -391,7 +391,7 @@
       "title": "VM Memory Usage",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -420,7 +420,7 @@
       "title": "VM Network Traffic",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -453,7 +453,7 @@
       "title": "VM Disk I/O",
       "type": "timeseries",
       "gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
@@ -486,7 +486,7 @@
       "title": "Storage Usage",
       "type": "bargauge",
       "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
@@ -531,7 +531,7 @@
       "title": "Storage Capacity",
       "type": "table",
       "gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
-      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "datasource": {"type": "prometheus", "uid": "victoriametrics"},
       "targets": [
         {
           "expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",

View File

@@ -15,7 +15,7 @@
 {
 "name": "hostname",
 "type": "query",
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "query": "label_values(systemd_unit_state, hostname)",
 "refresh": 2,
 "includeAll": true,
@@ -30,7 +30,7 @@
 "title": "Failed Units",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
@@ -60,7 +60,7 @@
 "title": "Active Units",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
@@ -86,7 +86,7 @@
 "title": "Hosts Monitored",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
@@ -112,7 +112,7 @@
 "title": "Total Service Restarts",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
@@ -143,7 +143,7 @@
 "title": "Inactive Units",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
@@ -169,7 +169,7 @@
 "title": "Timers",
 "type": "stat",
 "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
@@ -195,7 +195,7 @@
 "title": "Failed Units",
 "type": "table",
 "gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
@@ -251,7 +251,7 @@
 "title": "Service Restarts (Top 15)",
 "type": "table",
 "gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
@@ -309,7 +309,7 @@
 "title": "Active Units per Host",
 "type": "bargauge",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
@@ -339,7 +339,7 @@
 "title": "NixOS Upgrade Timers",
 "type": "table",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
@@ -429,7 +429,7 @@
 "title": "Backup Timers",
 "type": "table",
 "gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
@@ -524,7 +524,7 @@
 "title": "Service Restarts Over Time",
 "type": "timeseries",
 "gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",

View File

@@ -19,7 +19,7 @@
 "title": "Current Temperatures",
 "type": "stat",
 "gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -71,7 +71,7 @@
 "title": "Average Home Temperature",
 "type": "gauge",
 "gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
@@ -108,7 +108,7 @@
 "title": "Current Humidity",
 "type": "stat",
 "gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
@@ -154,7 +154,7 @@
 "title": "Temperature History (30 Days)",
 "type": "timeseries",
 "gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
@@ -207,7 +207,7 @@
 "title": "Temperature Trend (1h rate of change)",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
@@ -268,7 +268,7 @@
 "title": "24h Min / Max / Avg",
 "type": "table",
 "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
@@ -346,7 +346,7 @@
 "title": "Humidity History (30 Days)",
 "type": "timeseries",
 "gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
-"datasource": {"type": "prometheus", "uid": "prometheus"},
+"datasource": {"type": "prometheus", "uid": "victoriametrics"},
 "targets": [
 {
 "expr": "hass_sensor_humidity_percent",

View File

@@ -37,6 +37,10 @@
 # Declarative datasources
 provision.datasources.settings = {
 apiVersion = 1;
+prune = true;
+deleteDatasources = [
+{ name = "Prometheus (monitoring01)"; orgId = 1; }
+];
 datasources = [
 {
 name = "VictoriaMetrics";

View File

@@ -61,7 +61,42 @@
 mode 644
 }
 }
-reverse_proxy http://jelly01.home.2rjus.net:8096
+header Content-Type text/html
+respond <<HTML
+<!DOCTYPE html>
+<html>
+<head>
+<title>Jellyfin - Maintenance</title>
+<style>
+body {
+background: #101020;
+color: #ddd;
+font-family: sans-serif;
+display: flex;
+justify-content: center;
+align-items: center;
+min-height: 100vh;
+margin: 0;
+text-align: center;
+}
+.container { max-width: 500px; }
+.disk { font-size: 80px; animation: spin 3s linear infinite; display: inline-block; }
+@keyframes spin { from { transform: rotate(0deg); } to { transform: rotate(360deg); } }
+h1 { color: #00a4dc; }
+p { font-size: 1.2em; line-height: 1.6; }
+</style>
+</head>
+<body>
+<div class="container">
+<div class="disk">&#x1F4BF;</div>
+<h1>Jellyfin is taking a nap</h1>
+<p>The NAS is getting shiny new hard drives.<br>
+Jellyfin will be back once the disks stop spinning up.</p>
+<p style="color:#666;font-size:0.9em;">In the meantime, maybe go outside?</p>
+</div>
+</body>
+</html>
+HTML 200
 }
 http://http-proxy.home.2rjus.net/metrics {
 log {
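
The replaced reverse_proxy line is the whole swap: while jelly01 is down for the disk migration, the site block serves a static page via respond with a Caddyfile heredoc (available since Caddy 2.7), and the explicit header directive ensures the body goes out as text/html instead of relying on content-type detection. The pattern in isolation, with a hypothetical site address:

http://maintenance.example.home.2rjus.net {
	# Without this, the static respond body may not be served as HTML.
	header Content-Type text/html
	respond <<HTML
	<!DOCTYPE html>
	<html><body><h1>Back soon</h1></body></html>
	HTML 200
}

Reverting is a one-line change back to the reverse_proxy directive once the NAS is serving again.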

View File

@@ -67,13 +67,13 @@ groups:
 summary: "Promtail service not running on {{ $labels.instance }}"
 description: "The promtail service has not been active on {{ $labels.instance }} for 5 minutes."
 - alert: filesystem_filling_up
-expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
+expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[24h], 24*3600) < 0
 for: 1h
 labels:
 severity: warning
 annotations:
 summary: "Filesystem predicted to fill within 24h on {{ $labels.instance }}"
-description: "Based on the last 6h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
+description: "Based on the last 24h trend, the root filesystem on {{ $labels.instance }} is predicted to run out of space within 24 hours."
 - alert: systemd_not_running
 expr: node_systemd_system_running == 0
 for: 10m
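
For context on the window change: predict_linear fits a least-squares line over the lookback window and extrapolates it the given number of seconds ahead, so widening the window from 6h to 24h means a short burst of Nix store growth barely tilts the fitted slope. The rule as now deployed, annotated:

# Fit a linear trend to the last 24h of free space on / and
# extrapolate 86400s (24h) ahead; the alert fires only if the
# projection drops below zero and stays there for 1h (the `for:` clause).
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[24h], 24*3600) < 0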