hosts: add nix-cache02 build host

New build host to replace nix-cache01 with:
- 8 CPU cores, 16GB RAM, 200GB disk
- Static IP 10.69.13.25

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs: add monitoring02 reboot alert investigation
2026-02-10 21:53:29 +01:00 · 2026-02-10 17:59:53 +01:00 · 2026-02-09 22:59:45 +01:00 · 2026-02-09 22:56:03 +01:00 · 2026-02-09 22:52:34 +01:00 · 2026-02-09 22:41:47 +01:00
62 changed files with 4906 additions and 85 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -39,6 +39,30 @@ Do not automatically deploy changes. Deployments are usually done by updating th

 Do not run SSH commands directly. If a command needs to be run on a remote host, provide the command to the user and ask them to run it manually.

+### Sharing Command Output via Loki
+
+All hosts have the `pipe-to-loki` script for sending command output or terminal sessions to Loki, allowing users to share output with Claude without copy-pasting.
+
+**Pipe mode** - send command output:
+```bash
+command | pipe-to-loki                  # Auto-generated ID
+command | pipe-to-loki --id my-test     # Custom ID
+```
+
+**Session mode** - record interactive terminal session:
+```bash
+pipe-to-loki --record                   # Start recording, exit to send
+pipe-to-loki --record --id my-session   # With custom ID
+```
+
+The script prints the session ID, which the user can share. Query results with:
+```logql
+{job="pipe-to-loki"}                           # All entries
+{job="pipe-to-loki", id="my-test"}             # Specific ID
+{job="pipe-to-loki", host="testvm01"}          # From specific host
+{job="pipe-to-loki", type="session"}           # Only sessions
+```
+
 ### Testing Feature Branches on Hosts

 All hosts have the `nixos-rebuild-test` helper script for testing feature branches before merging:
@@ -90,6 +114,12 @@ nix develop -c tofu -chdir=terraform/vault apply
 cd terraform && tofu plan
 ```

+### Ansible
+
+Ansible configuration and playbooks are in `/ansible/`. See [ansible/README.md](ansible/README.md) for inventory groups, available playbooks, and usage examples.
+
+The devshell sets `ANSIBLE_CONFIG` automatically, so no `-i` flag is needed.
+
 ### Secrets Management

 Secrets are managed by OpenBao (Vault) using AppRole authentication. Most hosts use the
@@ -102,6 +132,8 @@ Terraform manages the secrets and AppRole policies in `terraform/vault/`.

 **Important:** Never amend commits to `master` unless the user explicitly asks for it. Amending rewrites history and causes issues for deployed configurations.

+**Important:** Never force push to `master`. If a commit on master has an error, fix it with a new commit rather than rewriting history.
+
 **Important:** Do not use `gh pr create` to create pull requests. The git server does not support GitHub CLI for PR creation. Instead, push the branch and let the user create the PR manually via the web interface.

 When starting a new plan or task, the first step should typically be to create and checkout a new branch with an appropriate name (e.g., `git checkout -b dns-automation` or `git checkout -b fix-nginx-config`).
@@ -255,7 +287,10 @@ The `current_rev` label contains the git commit hash of the deployed flake confi
 - `/docs/` - Documentation and plans
  - `plans/` - Future plans and proposals
  - `plans/completed/` - Completed plans (moved here when done)
-- `/playbooks/` - Ansible playbooks for fleet management
+- `/ansible/` - Ansible configuration and playbooks
+  - `ansible.cfg` - Ansible configuration (inventory path, defaults)
+  - `inventory/` - Dynamic and static inventory sources
+  - `playbooks/` - Ansible playbooks for fleet management

 ### Configuration Inheritance

@@ -279,24 +314,11 @@ All hosts automatically get:
 - Custom root CA trust
 - DNS zone auto-registration via `homelab.dns` options

-### Active Hosts
+### Hosts

-Production servers:
-- `ns1`, `ns2` - Primary/secondary DNS servers (10.69.13.5/6)
-- `vault01` - OpenBao (Vault) secrets server + PKI CA
-- `ha1` - Home Assistant + Zigbee2MQTT + Mosquitto
-- `http-proxy` - Reverse proxy
-- `monitoring01` - Full observability stack (Prometheus, Grafana, Loki, Tempo, Pyroscope)
-- `jelly01` - Jellyfin media server
-- `nix-cache01` - Binary cache server + GitHub Actions runner
-- `pgdb1` - PostgreSQL database
-- `nats1` - NATS messaging server
+Host configurations are in `/hosts/<hostname>/`. See `flake.nix` for the complete list of `nixosConfigurations`.

-Test/staging hosts:
-- `testvm01`, `testvm02`, `testvm03` - Test-tier VMs for branch testing and deployment validation
-
-Template hosts:
-- `template1`, `template2` - Base templates for cloning new hosts
+Use `nix flake show` or `nix develop -c ansible-inventory --graph` to list all hosts.

 ### Flake Inputs

@@ -327,7 +349,7 @@ Most hosts use OpenBao (Vault) for secrets:
 - `extractKey` option extracts a single key from vault JSON as a plain file
 - Secrets fetched at boot by `vault-secret-<name>.service` systemd units
 - Fallback to cached secrets in `/var/lib/vault/cache/` when Vault is unreachable
-- Provision AppRole credentials: `nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=<host>`
+- Provision AppRole credentials: `nix develop -c ansible-playbook ansible/playbooks/provision-approle.yml -l <hostname>`

 ### Auto-Upgrade System

@@ -351,7 +373,7 @@ Template VMs are built from `hosts/template2` and deployed to Proxmox using Ansi

 ```bash
 # Build NixOS image and deploy to Proxmox as template
-nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml
+nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml
 ```

 This playbook:
@@ -426,7 +448,7 @@ This means:
 - `tofu plan` won't show spurious changes for Proxmox-managed defaults

 **When rebuilding the template:**
-1. Run `nix develop -c ansible-playbook -i playbooks/inventory.ini playbooks/build-and-deploy-template.yml`
+1. Run `nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml`
 2. Update `default_template_name` in `terraform/variables.tf` if the name changed
 3. Run `tofu plan` - should show no VM recreations (only template name in state)
 4. Run `tofu apply` - updates state without touching existing VMs
@@ -509,6 +531,7 @@ The `modules/homelab/` directory defines custom options used across hosts for au
 - `priority` - Alerting priority: `high` or `low`. Controls alerting thresholds for the host.
 - `role` - Primary role designation (e.g., `dns`, `database`, `bastion`, `vault`)
 - `labels` - Free-form key-value metadata for host categorization
+  - `ansible = "false"` - Exclude host from Ansible dynamic inventory

 **DNS options (`homelab.dns.*`):**
 - `enable` (default: `true`) - Include host in DNS zone generation
--- a/ansible/README.md
+++ b/ansible/README.md
@@ -0,0 +1,120 @@
+# Ansible Configuration
+
+This directory contains Ansible configuration for fleet management tasks.
+
+## Structure
+
+```
+ansible/
+├── ansible.cfg              # Ansible configuration
+├── inventory/
+│   ├── dynamic_flake.py     # Dynamic inventory from NixOS flake
+│   ├── static.yml           # Non-flake hosts (Proxmox, etc.)
+│   └── group_vars/
+│       └── all.yml          # Common variables
+└── playbooks/
+    ├── build-and-deploy-template.yml
+    ├── provision-approle.yml
+    ├── restart-service.yml
+    └── run-upgrade.yml
+```
+
+## Usage
+
+The devshell automatically configures `ANSIBLE_CONFIG`, so commands work without extra flags:
+
+```bash
+# List inventory groups
+nix develop -c ansible-inventory --graph
+
+# List hosts in a specific group
+nix develop -c ansible-inventory --list | jq '.role_dns'
+
+# Run a playbook
+nix develop -c ansible-playbook ansible/playbooks/run-upgrade.yml -l tier_test
+```
+
+## Inventory
+
+The inventory combines dynamic and static sources automatically.
+
+### Dynamic Inventory (from flake)
+
+The `dynamic_flake.py` script extracts hosts from the NixOS flake using `homelab.host.*` options:
+
+**Groups generated:**
+- `flake_hosts` - All NixOS hosts from the flake
+- `tier_test`, `tier_prod` - By `homelab.host.tier`
+- `role_dns`, `role_vault`, `role_monitoring`, etc. - By `homelab.host.role`
+
+**Host variables set:**
+- `tier` - Deployment tier (test/prod)
+- `role` - Host role
+- `ansible_host`, `fqdn` - Fully qualified hostname (`<hostname>.<domain>`), used for the connection
+
+### Static Inventory
+
+Non-flake hosts are defined in `inventory/static.yml`:
+
+- `proxmox` - Proxmox hypervisors
+
+## Playbooks
+
+| Playbook | Description | Example |
+|----------|-------------|---------|
+| `run-upgrade.yml` | Trigger nixos-upgrade on hosts | `-l tier_prod` |
+| `restart-service.yml` | Restart a systemd service | `-l role_dns -e service=unbound` |
+| `reboot.yml` | Rolling reboot (one host at a time) | `-l tier_test` |
+| `provision-approle.yml` | Deploy Vault credentials (single host only) | `-l testvm01` |
+| `build-and-deploy-template.yml` | Build and deploy Proxmox template | (no limit needed) |
+
+### Examples
+
+```bash
+# Restart unbound on all DNS servers
+nix develop -c ansible-playbook ansible/playbooks/restart-service.yml \
+  -l role_dns -e service=unbound
+
+# Trigger upgrade on all test hosts
+nix develop -c ansible-playbook ansible/playbooks/run-upgrade.yml -l tier_test
+
+# Provision Vault credentials for a specific host
+nix develop -c ansible-playbook ansible/playbooks/provision-approle.yml -l testvm01
+
+# Build and deploy Proxmox template
+nix develop -c ansible-playbook ansible/playbooks/build-and-deploy-template.yml
+
+# Rolling reboot of test hosts (one at a time, waits for each to come back)
+nix develop -c ansible-playbook ansible/playbooks/reboot.yml -l tier_test
+```
+
+## Excluding Flake Hosts
+
+To exclude a flake host from the dynamic inventory, add the `ansible = "false"` label in the host's configuration:
+
+```nix
+homelab.host.labels.ansible = "false";
+```
+
+Hosts with `homelab.dns.enable = false` are also excluded automatically.
+
+## Adding Non-Flake Hosts
+
+Edit `inventory/static.yml` to add hosts not managed by the NixOS flake:
+
+```yaml
+all:
+  children:
+    my_group:
+      hosts:
+        host1.example.com:
+          ansible_user: admin
+```
+
+## Common Variables
+
+Variables in `inventory/group_vars/all.yml` apply to all hosts:
+
+- `ansible_user` - Default SSH user (root)
+- `domain` - Domain name (home.2rjus.net)
+- `vault_addr` - Vault server URL
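
Note on the README above: the `jq '.role_dns'` example relies on the JSON shape that `ansible-inventory --list` prints, where each group maps to an object with a `hosts` list and, for parent groups, a `children` list. The following is a minimal Python sketch of the same lookup. The sample data and the helper name `hosts_in_group` are assumptions made for illustration, not part of this change.

```python
import json

# Hypothetical sample shaped like `ansible-inventory --list` output.
# Host and group names are illustrative only.
SAMPLE = json.loads("""
{
  "_meta": {"hostvars": {}},
  "all": {"children": ["flake_hosts", "proxmox"]},
  "flake_hosts": {"hosts": ["ns1", "ns2", "testvm01"]},
  "role_dns": {"hosts": ["ns1", "ns2"]},
  "proxmox": {"hosts": ["pve1"]}
}
""")


def hosts_in_group(inventory: dict, group: str) -> list[str]:
    """Return the hosts of a group, following nested child groups."""
    entry = inventory.get(group, {})
    hosts = list(entry.get("hosts", []))
    for child in entry.get("children", []):
        for host in hosts_in_group(inventory, child):
            if host not in hosts:  # keep first occurrence, drop duplicates
                hosts.append(host)
    return hosts


if __name__ == "__main__":
    print(hosts_in_group(SAMPLE, "role_dns"))
    print(hosts_in_group(SAMPLE, "all"))
```

In practice the dictionary would come from parsing the real command output rather than from an inline sample.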
--- a/ansible/ansible.cfg
+++ b/ansible/ansible.cfg
@@ -0,0 +1,17 @@
+[defaults]
+inventory = inventory/
+remote_user = root
+host_key_checking = False
+
+# Reduce SSH connection overhead
+forks = 10
+pipelining = True
+
+# Output formatting (YAML output via builtin default callback)
+stdout_callback = default
+callbacks_enabled = profile_tasks
+result_format = yaml
+
+[ssh_connection]
+# Reuse SSH connections
+ssh_args = -o ControlMaster=auto -o ControlPersist=60s
--- a/ansible/inventory/dynamic_flake.py
+++ b/ansible/inventory/dynamic_flake.py
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+"""
+Dynamic Ansible inventory script that extracts host information from the NixOS flake.
+
+Generates groups:
+  - flake_hosts: All hosts defined in the flake
+  - tier_test, tier_prod: Hosts by deployment tier
+  - role_<name>: Hosts by role (dns, vault, monitoring, etc.)
+
+Usage:
+  ./dynamic_flake.py --list    # Return full inventory
+  ./dynamic_flake.py --host X  # Return host vars (not used, but required by Ansible)
+"""
+
+import json
+import subprocess
+import sys
+from pathlib import Path
+
+
+def get_flake_dir() -> Path:
+    """Find the flake root directory."""
+    script_dir = Path(__file__).resolve().parent
+    # ansible/inventory/dynamic_flake.py -> repo root
+    return script_dir.parent.parent
+
+
+def evaluate_flake() -> dict:
+    """Evaluate the flake and extract host metadata."""
+    flake_dir = get_flake_dir()
+
+    # Nix expression to extract relevant config from each host
+    nix_expr = """
+    configs: builtins.mapAttrs (name: cfg: {
+      hostname = cfg.config.networking.hostName;
+      domain = cfg.config.networking.domain or "home.2rjus.net";
+      tier = cfg.config.homelab.host.tier;
+      role = cfg.config.homelab.host.role;
+      labels = cfg.config.homelab.host.labels;
+      dns_enabled = cfg.config.homelab.dns.enable;
+    }) configs
+    """
+
+    try:
+        result = subprocess.run(
+            [
+                "nix",
+                "eval",
+                "--json",
+                f"{flake_dir}#nixosConfigurations",
+                "--apply",
+                nix_expr,
+            ],
+            capture_output=True,
+            text=True,
+            check=True,
+            cwd=flake_dir,
+        )
+        return json.loads(result.stdout)
+    except subprocess.CalledProcessError as e:
+        print(f"Error evaluating flake: {e.stderr}", file=sys.stderr)
+        sys.exit(1)
+    except json.JSONDecodeError as e:
+        print(f"Error parsing nix output: {e}", file=sys.stderr)
+        sys.exit(1)
+
+
+def sanitize_group_name(name: str) -> str:
+    """Sanitize a string for use as an Ansible group name.
+
+    Ansible group names should contain only alphanumeric characters and underscores.
+    """
+    return name.replace("-", "_")
+
+
+def build_inventory(hosts_data: dict) -> dict:
+    """Build Ansible inventory structure from host data."""
+    inventory = {
+        "_meta": {"hostvars": {}},
+        "flake_hosts": {"hosts": []},
+    }
+
+    # Track groups we need to create
+    tier_groups: dict[str, list[str]] = {}
+    role_groups: dict[str, list[str]] = {}
+
+    for _config_name, host_info in hosts_data.items():
+        hostname = host_info["hostname"]
+        domain = host_info["domain"]
+        tier = host_info["tier"]
+        role = host_info["role"]
+        labels = host_info["labels"]
+        dns_enabled = host_info["dns_enabled"]
+
+        # Skip hosts that have DNS disabled (like templates)
+        if not dns_enabled:
+            continue
+
+        # Skip hosts with ansible = "false" label
+        if labels.get("ansible") == "false":
+            continue
+
+        fqdn = f"{hostname}.{domain}"
+
+        # Use short hostname as inventory name, FQDN for connection
+        inventory_name = hostname
+
+        # Add to flake_hosts group
+        inventory["flake_hosts"]["hosts"].append(inventory_name)
+
+        # Add host variables
+        inventory["_meta"]["hostvars"][inventory_name] = {
+            "ansible_host": fqdn,  # Connect using FQDN
+            "fqdn": fqdn,
+            "tier": tier,
+            "role": role,
+        }
+
+        # Group by tier
+        tier_group = f"tier_{sanitize_group_name(tier)}"
+        if tier_group not in tier_groups:
+            tier_groups[tier_group] = []
+        tier_groups[tier_group].append(inventory_name)
+
+        # Group by role (if set)
+        if role:
+            role_group = f"role_{sanitize_group_name(role)}"
+            if role_group not in role_groups:
+                role_groups[role_group] = []
+            role_groups[role_group].append(inventory_name)
+
+    # Add tier groups to inventory
+    for group_name, hosts in tier_groups.items():
+        inventory[group_name] = {"hosts": hosts}
+
+    # Add role groups to inventory
+    for group_name, hosts in role_groups.items():
+        inventory[group_name] = {"hosts": hosts}
+
+    return inventory
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: dynamic_flake.py --list | --host <hostname>", file=sys.stderr)
+        sys.exit(1)
+
+    if sys.argv[1] == "--list":
+        hosts_data = evaluate_flake()
+        inventory = build_inventory(hosts_data)
+        print(json.dumps(inventory, indent=2))
+    elif sys.argv[1] == "--host":
+        # With `_meta` in the --list output, Ansible does not need per-host calls;
+        # all vars live in _meta.hostvars, so return an empty dict if asked
+        print(json.dumps({}))
+    else:
+        print(f"Unknown option: {sys.argv[1]}", file=sys.stderr)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
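
Note on `sanitize_group_name` above: its docstring says group names should contain only alphanumeric characters and underscores, while the implementation replaces hyphens only. If roles or tiers ever contain other characters (dots, spaces), a stricter variant could look like the sketch below. The function name `sanitize_group_name_strict` is hypothetical and not part of this change.

```python
import re

# Anything that is not an ASCII letter, digit, or underscore.
_INVALID_GROUP_CHARS = re.compile(r"[^A-Za-z0-9_]")


def sanitize_group_name_strict(name: str) -> str:
    """Replace every character outside [A-Za-z0-9_] with an underscore."""
    return _INVALID_GROUP_CHARS.sub("_", name)


if __name__ == "__main__":
    for raw in ("dns", "build-host", "http.proxy v2"):
        print(raw, "->", sanitize_group_name_strict(raw))
```

For names that contain only letters, digits, and hyphens, this behaves the same as the current implementation.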
--- a/ansible/inventory/group_vars/all.yml
+++ b/ansible/inventory/group_vars/all.yml
@@ -0,0 +1,5 @@
+# Common variables for all hosts
+
+ansible_user: root
+domain: home.2rjus.net
+vault_addr: https://vault01.home.2rjus.net:8200
--- a/ansible/inventory/static.yml
+++ b/ansible/inventory/static.yml
@@ -0,0 +1,13 @@
+# Static inventory for non-flake hosts
+#
+# Hosts defined here are merged with the dynamic flake inventory.
+# Use this for infrastructure that isn't managed by NixOS.
+#
+# Use short hostnames as inventory names with ansible_host for FQDN.
+
+all:
+  children:
+    proxmox:
+      hosts:
+        pve1:
+          ansible_host: pve1.home.2rjus.net
--- a/ansible/playbooks/build-and-deploy-template.yml
+++ b/ansible/playbooks/build-and-deploy-template.yml
@@ -15,13 +15,13 @@
    - name: Build NixOS image
      ansible.builtin.command:
        cmd: "nixos-rebuild build-image --image-variant proxmox --flake .#template2"
-        chdir: "{{ playbook_dir }}/.."
+        chdir: "{{ playbook_dir }}/../.."
      register: build_result
      changed_when: true

    - name: Find built image file
      ansible.builtin.find:
-        paths: "{{ playbook_dir}}/../result"
+        paths: "{{ playbook_dir}}/../../result"
        patterns: "*.vma.zst"
        recurse: true
      register: image_files
@@ -105,7 +105,7 @@
  gather_facts: false

  vars:
-    terraform_dir: "{{ playbook_dir }}/../terraform"
+    terraform_dir: "{{ playbook_dir }}/../../terraform"

  tasks:
    - name: Get image filename from earlier play
--- a/ansible/playbooks/provision-approle.yml
+++ b/ansible/playbooks/provision-approle.yml
@@ -1,7 +1,27 @@
 ---
-# Provision OpenBao AppRole credentials to an existing host
-# Usage: nix develop -c ansible-playbook playbooks/provision-approle.yml -e hostname=ha1
+# Provision OpenBao AppRole credentials to a host
+#
+# Usage: ansible-playbook ansible/playbooks/provision-approle.yml -l <hostname>
 # Requires: BAO_ADDR and BAO_TOKEN environment variables set
+#
+# IMPORTANT: This playbook must target exactly one host to prevent
+# accidentally regenerating credentials for multiple hosts.
+
+- name: Validate single host target
+  hosts: all
+  gather_facts: false
+
+  tasks:
+    - name: Fail if targeting multiple hosts
+      ansible.builtin.fail:
+        msg: |
+          This playbook must target exactly one host.
+          Use: ansible-playbook provision-approle.yml -l <hostname>
+
+          Targeting multiple hosts would regenerate credentials for all of them,
+          potentially breaking existing services.
+      when: ansible_play_hosts | length != 1
+      run_once: true

 - name: Fetch AppRole credentials from OpenBao
  hosts: localhost
@@ -9,18 +29,17 @@
  gather_facts: false

  vars:
-    vault_addr: "{{ lookup('env', 'BAO_ADDR') | default('https://vault01.home.2rjus.net:8200', true) }}"
-    domain: "home.2rjus.net"
+    target_host: "{{ ansible_limit }}"  # groups['all'] is not narrowed by --limit
+    target_hostname: "{{ hostvars[target_host]['short_hostname'] | default(target_host.split('.')[0]) }}"

  tasks:
-    - name: Validate hostname is provided
-      ansible.builtin.fail:
-        msg: "hostname variable is required. Use: -e hostname=<name>"
-      when: hostname is not defined
+    - name: Display target host
+      ansible.builtin.debug:
+        msg: "Provisioning AppRole credentials for: {{ target_hostname }}"

    - name: Get role-id for host
      ansible.builtin.command:
-        cmd: "bao read -field=role_id auth/approle/role/{{ hostname }}/role-id"
+        cmd: "bao read -field=role_id auth/approle/role/{{ target_hostname }}/role-id"
      environment:
        BAO_ADDR: "{{ vault_addr }}"
        BAO_SKIP_VERIFY: "1"
@@ -29,25 +48,26 @@

    - name: Generate secret-id for host
      ansible.builtin.command:
-        cmd: "bao write -field=secret_id -f auth/approle/role/{{ hostname }}/secret-id"
+        cmd: "bao write -field=secret_id -f auth/approle/role/{{ target_hostname }}/secret-id"
      environment:
        BAO_ADDR: "{{ vault_addr }}"
        BAO_SKIP_VERIFY: "1"
      register: secret_id_result
      changed_when: true

-    - name: Add target host to inventory
-      ansible.builtin.add_host:
-        name: "{{ hostname }}.{{ domain }}"
-        groups: vault_target
-        ansible_user: root
+    - name: Store credentials for next play
+      ansible.builtin.set_fact:
        vault_role_id: "{{ role_id_result.stdout }}"
        vault_secret_id: "{{ secret_id_result.stdout }}"

 - name: Deploy AppRole credentials to host
-  hosts: vault_target
+  hosts: all
  gather_facts: false

+  vars:
+    vault_role_id: "{{ hostvars['localhost']['vault_role_id'] }}"
+    vault_secret_id: "{{ hostvars['localhost']['vault_secret_id'] }}"
+
  tasks:
    - name: Create AppRole directory
      ansible.builtin.file:
--- a/ansible/playbooks/reboot.yml
+++ b/ansible/playbooks/reboot.yml
@@ -0,0 +1,48 @@
+---
+# Reboot hosts with rolling strategy to avoid taking down redundant services
+#
+# Usage examples:
+#   # Reboot a single host
+#   ansible-playbook reboot.yml -l testvm01
+#
+#   # Reboot all test hosts (one at a time)
+#   ansible-playbook reboot.yml -l tier_test
+#
+#   # Reboot all DNS servers safely (one at a time)
+#   ansible-playbook reboot.yml -l role_dns
+#
+# Safety features:
+#   - serial: 1 ensures only one host reboots at a time
+#   - Waits for host to come back online before proceeding
+#   - Shuffles host order so same-role hosts are less likely to reboot consecutively (not guaranteed)
+
+- name: Reboot hosts (rolling)
+  hosts: all
+  serial: 1
+  order: shuffle  # Randomize to spread out same-role hosts
+  gather_facts: false
+
+  vars:
+    reboot_timeout: 300  # 5 minutes to wait for host to come back
+
+  tasks:
+    - name: Display reboot target
+      ansible.builtin.debug:
+        msg: "Rebooting {{ inventory_hostname }} (role: {{ role | default('none') }})"
+
+    - name: Reboot the host
+      ansible.builtin.systemd:
+        name: reboot.target
+        state: started
+      async: 1
+      poll: 0
+      ignore_errors: true
+
+    - name: Wait for host to come back online
+      ansible.builtin.wait_for_connection:
+        delay: 5
+        timeout: "{{ reboot_timeout }}"
+
+    - name: Display reboot result
+      ansible.builtin.debug:
+        msg: "{{ inventory_hostname }} rebooted successfully"
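
Note on the reboot ordering above: `order: shuffle` randomizes the host order, so two hosts with the same role can still end up next to each other. One deterministic way to spread them out is a round-robin across roles, sketched below in Python. This is only an illustration of the idea. The helper name `interleave_by_role` and the sample role assignments are assumptions, and round-robin reduces same-role adjacency without eliminating it in every case.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave_by_role(host_roles: dict[str, str]) -> list[str]:
    """Order hosts round-robin across roles so same-role hosts are spread out."""
    by_role: dict[str, list[str]] = defaultdict(list)
    for host, role in sorted(host_roles.items()):
        by_role[role].append(host)

    # Start each round with the largest role so its hosts get the most spacing.
    buckets = sorted(by_role.values(), key=len, reverse=True)

    ordered: list[str] = []
    for batch in zip_longest(*buckets):
        ordered.extend(host for host in batch if host is not None)
    return ordered


if __name__ == "__main__":
    # Hypothetical role assignments for illustration.
    sample = {"ns1": "dns", "ns2": "dns", "vault01": "vault", "pgdb1": "database"}
    print(interleave_by_role(sample))
```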
--- a/ansible/playbooks/restart-service.yml
+++ b/ansible/playbooks/restart-service.yml
@@ -0,0 +1,40 @@
+---
+# Restart a systemd service on target hosts
+#
+# Usage examples:
+#   # Restart unbound on all DNS servers
+#   ansible-playbook restart-service.yml -l role_dns -e service=unbound
+#
+#   # Restart nginx on a specific host
+#   ansible-playbook restart-service.yml -l http-proxy -e service=nginx
+#
+#   # Restart promtail on all prod hosts
+#   ansible-playbook restart-service.yml -l tier_prod -e service=promtail
+
+- name: Restart systemd service
+  hosts: all
+  gather_facts: false
+
+  tasks:
+    - name: Validate service name provided
+      ansible.builtin.fail:
+        msg: |
+          The 'service' variable is required.
+          Usage: ansible-playbook restart-service.yml -l <target> -e service=<name>
+
+          Examples:
+            -e service=nginx
+            -e service=unbound
+            -e service=promtail
+      when: service is not defined
+      run_once: true
+
+    - name: Restart {{ service }}
+      ansible.builtin.systemd:
+        name: "{{ service }}"
+        state: restarted
+      register: restart_result
+
+    - name: Display result
+      ansible.builtin.debug:
+        msg: "Service {{ service }} restarted on {{ inventory_hostname }}"
--- a/ansible/playbooks/run-upgrade.yml
+++ b/ansible/playbooks/run-upgrade.yml
--- a/docs/plans/auth-system-replacement.md
+++ b/docs/plans/auth-system-replacement.md
@@ -151,11 +151,30 @@ Rationale:
 - Well above NixOS system users (typically <1000)
 - Avoids Podman/container issues with very high GIDs

+### Completed (2026-02-08) - OAuth2/OIDC for Grafana
+
+**OAuth2 client deployed for Grafana on monitoring02:**
+- Client ID: `grafana`
+- Redirect URL: `https://grafana-test.home.2rjus.net/login/generic_oauth`
+- Scope maps: `openid`, `profile`, `email`, `groups` for `users` group
+- Role mapping: `admins` group → Grafana Admin, others → Viewer
+
+**Configuration locations:**
+- Kanidm OAuth2 client: `services/kanidm/default.nix`
+- Grafana OIDC config: `services/grafana/default.nix`
+- Vault secret: `services/grafana/oauth2-client-secret`
+
+**Key findings:**
+- PKCE is required by Kanidm - enable `use_pkce = true` in Grafana
+- Must set `email_attribute_path`, `login_attribute_path`, `name_attribute_path` to extract from userinfo
+- Users need: primary credential (password + TOTP for MFA), membership in `users` group, email address set
+- Unix password is separate from primary credential (web login requires primary credential)
+
 ### Next Steps

 1. Enable PAM/NSS on production hosts (after test tier validation)
 2. Configure TrueNAS LDAP client for NAS integration testing
-3. Add OAuth2 clients (Grafana first)
+3. Add OAuth2 clients for other services as needed

 ## References

--- a/docs/plans/completed/cert-monitoring.md
+++ b/docs/plans/completed/cert-monitoring.md
--- a/docs/plans/completed/monitoring02-reboot-alert-investigation.md
+++ b/docs/plans/completed/monitoring02-reboot-alert-investigation.md
@@ -0,0 +1,135 @@
+# monitoring02 Reboot Alert Investigation
+
+**Date:** 2026-02-10
+**Status:** Completed - False positive identified
+
+## Summary
+
+A `host_reboot` alert fired for monitoring02 at 16:27:36 UTC. Investigation determined this was a **false positive** caused by NTP clock adjustments, not an actual reboot.
+
+## Alert Details
+
+- **Alert:** `host_reboot`
+- **Rule:** `changes(node_boot_time_seconds[10m]) > 0`
+- **Host:** monitoring02
+- **Time:** 2026-02-10T16:27:36Z
+
+## Investigation Findings
+
+### Evidence Against Actual Reboot
+
+1. **Uptime:** System had been up for ~40 hours (143,751 seconds) at time of alert
+2. **Consistent BOOT_ID:** All logs showed the same systemd BOOT_ID (`fd26e7f3d86f4cd688d1b1d7af62f2ad`) from Feb 9 through the alert time
+3. **No log gaps:** Logs were continuous - no shutdown/restart cycle visible
+4. **Prometheus metrics:** `node_boot_time_seconds` showed a 1-second fluctuation, then returned to normal
+
+### Root Cause: NTP Clock Adjustment
+
+The `node_boot_time_seconds` metric fluctuated by 1 second due to how Linux calculates boot time:
+
+```
+btime = current_wall_clock_time - monotonic_uptime
+```
+
+When NTP adjusts the wall clock, `btime` shifts by the same amount. The `node_timex_*` metrics confirmed this:
+
+| Metric | Value |
+|--------|-------|
+| `node_timex_maxerror_seconds` (max in 3h) | 1.02 seconds |
+| `node_timex_maxerror_seconds` (max in 24h) | 2.05 seconds |
+| `node_timex_sync_status` | 1 (synced) |
+| Current `node_timex_offset_seconds` | ~9ms (normal) |
+
+The kernel's estimated maximum clock error spiked to over 1 second, consistent with a clock adjustment large enough to shift the computed boot time momentarily.
+
+Additionally, `systemd-resolved` logged "Clock change detected. Flushing caches." at 16:26:53Z, corroborating the NTP adjustment.
+
+## Current Time Sync Configuration
+
+### NixOS Guests
+- **NTP client:** systemd-timesyncd (NixOS default)
+- **No explicit configuration** in the codebase
+- Uses default NixOS NTP server pool
+
+### Proxmox VMs
+- **Clocksource:** `kvm-clock` (optimal for KVM VMs)
+- **QEMU guest agent:** Enabled
+- **No additional QEMU timing args** configured
+
+## Potential Improvements
+
+### 1. Improve Alert Rule (Recommended)
+
+Add tolerance to filter out small NTP adjustments:
+
+```yaml
+# Current rule (triggers on any change)
+expr: changes(node_boot_time_seconds[10m]) > 0
+
+# Improved rule (requires >60 second shift)
+expr: changes(node_boot_time_seconds[10m]) > 0 and abs(delta(node_boot_time_seconds[10m])) > 60
+```
+
+### 2. Switch to Chrony (Optional)
+
+Chrony handles time adjustments more gracefully than systemd-timesyncd:
+
+```nix
+# In common/vm/qemu-guest.nix
+{
+  services.qemuGuest.enable = true;
+
+  services.timesyncd.enable = false;
+  services.chrony = {
+    enable = true;
+    extraConfig = ''
+      makestep 1 3
+      rtcsync
+    '';
+  };
+}
+```
+
+### 3. Add QEMU Timing Args (Optional)
+
+In `terraform/vms.tf`:
+
+```hcl
+args = "-global kvm-pit.lost_tick_policy=delay -rtc driftfix=slew"
+```
+
+### 4. Local NTP Server (Optional)
+
+Running a local NTP server (e.g., on ns1/ns2) would reduce latency and improve sync stability across all hosts.
+
+## Monitoring NTP Health
+
+The `node_timex_*` metrics from node_exporter provide visibility into NTP health:
+
+```promql
+# Clock offset from reference
+node_timex_offset_seconds
+
+# Sync status (1 = synced)
+node_timex_sync_status
+
+# Maximum estimated error - useful for alerting
+node_timex_maxerror_seconds
+```
+
+A potential alert for NTP issues:
+
+```yaml
+- alert: ntp_clock_drift
+  expr: node_timex_maxerror_seconds > 1
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High clock drift on {{ $labels.hostname }}"
+    description: "NTP max error is {{ $value }}s on {{ $labels.hostname }}"
+```
+
+## Conclusion
+
+No action required for the alert itself - the system was healthy. Consider implementing the improved alert rule to prevent future false positives from NTP adjustments.
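
Note on the investigation above: the false positive and the proposed tolerance can be reproduced with a few lines of arithmetic. The sketch below assumes the `btime = wall_clock - uptime` relation from the write-up and approximates the improved rule as "the value changed and the last sample differs from the first by more than the tolerance". It is not how Prometheus evaluates `changes()` and `delta()` internally, and the wall-clock values are arbitrary.

```python
def boot_time(wall_clock: float, uptime: float) -> float:
    """Boot time as described above: wall-clock time minus monotonic uptime."""
    return wall_clock - uptime


def looks_like_reboot(samples: list[float], tolerance: float = 60.0) -> bool:
    """Approximate the improved rule: value changed AND shifted by > tolerance."""
    changed = any(a != b for a, b in zip(samples, samples[1:]))
    shift = abs(samples[-1] - samples[0])
    return changed and shift > tolerance


UPTIME = 143_751.0        # seconds of uptime reported in the investigation
WALL = 1_000_000_000.0    # arbitrary epoch value, for illustration only

base = boot_time(WALL, UPTIME)
stepped = boot_time(WALL + 1.0, UPTIME)   # wall clock stepped by 1 s

ntp_blip = [base, stepped, base]                     # fluctuates, then returns
real_reboot = [base, boot_time(WALL + 600.0, 30.0)]  # uptime reset after reboot

if __name__ == "__main__":
    print("NTP blip flagged:", looks_like_reboot(ntp_blip))
    print("Real reboot flagged:", looks_like_reboot(real_reboot))
```

A one-second step in the wall clock moves the computed boot time by exactly one second, which the original `> 0` rule treats as a reboot, while the tolerance-based check ignores it.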
--- a/docs/plans/completed/openbao-kanidm-oidc.md
+++ b/docs/plans/completed/openbao-kanidm-oidc.md
@@ -0,0 +1,87 @@
+# OpenBao + Kanidm OIDC Integration
+
+## Status: Completed
+
+Implemented 2026-02-09.
+
+## Overview
+
+Enable Kanidm users to authenticate to OpenBao (Vault) using OIDC for Web UI access. Members of the `admins` group get full read/write access to secrets.
+
+## Implementation
+
+### Files Modified
+
+| File | Changes |
+|------|---------|
+| `terraform/vault/oidc.tf` | New - OIDC auth backend and roles |
+| `terraform/vault/policies.tf` | Added oidc-admin and oidc-default policies |
+| `terraform/vault/secrets.tf` | Added OAuth2 client secret |
+| `terraform/vault/approle.tf` | Granted kanidm01 access to openbao secrets |
+| `services/kanidm/default.nix` | Added openbao OAuth2 client, enabled imperative group membership |
+
+### Kanidm Configuration
+
+OAuth2 client `openbao` with:
+- Confidential client (uses client secret)
+- Web UI callback only: `https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback`
+- Legacy crypto enabled (RS256 for OpenBao compatibility)
+- Scope maps for `admins` and `users` groups
+
+Group membership is now managed imperatively (`overwriteMembers = false`) to prevent provisioning from resetting group memberships on service restart.
+
+### OpenBao Configuration
+
+OIDC auth backend at `/oidc` with two roles:
+
+| Role | Bound Claims | Policy | Access |
+|------|--------------|--------|--------|
+| `admin` | `groups = admins@home.2rjus.net` | `oidc-admin` | Full read/write to secrets, system health/metrics |
+| `default` | (none) | `oidc-default` | Token lookup-self, system health |
+
+Both roles request scopes: `openid`, `profile`, `email`, `groups`
+
+### Policies
+
+**oidc-admin:**
+- `secret/*` - create, read, update, delete, list
+- `sys/health` - read
+- `sys/metrics` - read
+- `sys/auth` - read
+- `sys/mounts` - read
+
+**oidc-default:**
+- `auth/token/lookup-self` - read
+- `sys/health` - read
+
+## Usage
+
+### Web UI Login
+1. Navigate to https://vault.home.2rjus.net:8200
+2. Select "OIDC" authentication method
+3. Enter role: `admin` (for admins) or `default` (for any user)
+4. Click "Sign in with OIDC"
+5. Authenticate with Kanidm
+
+### Group Management
+Add users to admins group for full access:
+```bash
+kanidm group add-members admins <username>
+```
+
+## Limitations
+
+**CLI login not supported:** Kanidm requires HTTPS for all redirect URIs on confidential (non-public) OAuth2 clients. OpenBao CLI uses `http://localhost:8250/oidc/callback` which Kanidm rejects. Public clients would allow localhost redirects, but OpenBao requires a client secret for OIDC auth.
+
+## Lessons Learned
+
+1. **Kanidm group names:** Groups are returned as `groupname@domain` (e.g., `admins@home.2rjus.net`), not just the short name
+2. **RS256 required:** OpenBao only supports RS256 for JWT signing; Kanidm defaults to ES256, requiring `enableLegacyCrypto = true`
+3. **Scope request:** OIDC roles must explicitly request the `groups` scope via `oidc_scopes`
+4. **Provisioning resets:** Kanidm provisioning with default `overwriteMembers = true` resets group memberships on restart
+5. **Two-phase Terraform:** Secret must exist before OIDC backend can validate discovery URL
+
+## References
+
+- [OpenBao JWT/OIDC Auth Method](https://openbao.org/docs/auth/jwt/)
+- [Kanidm OAuth2 Documentation](https://kanidm.github.io/kanidm/stable/integrations/oauth2.html)
--- a/docs/plans/monitoring-migration-victoriametrics.md
+++ b/docs/plans/monitoring-migration-victoriametrics.md
@@ -169,9 +169,30 @@ Once ready to cut over:
   - Destroy VM in Proxmox
   - Remove from terraform state

+## Current Progress
+
+### monitoring02 Host Created (2026-02-08)
+
+Host deployed at 10.69.13.24 (test tier) with:
+- 4 CPU cores, 8GB RAM, 60GB disk
+- Vault integration enabled
+- NATS-based remote deployment enabled
+
+### Grafana with Kanidm OIDC (2026-02-08)
+
+Grafana deployed on monitoring02 as a test instance (`grafana-test.home.2rjus.net`):
+- Kanidm OIDC authentication (PKCE enabled)
+- Role mapping: `admins` → Admin, others → Viewer
+- Declarative datasources pointing to monitoring01 (Prometheus, Loki)
+- Local Caddy for TLS termination via internal ACME CA
+
+This validates the Grafana + OIDC pattern before the full VictoriaMetrics migration. The existing
+`services/monitoring/grafana.nix` on monitoring01 can be replaced with the new `services/grafana/`
+module once monitoring02 becomes the primary monitoring host.
+
 ## Open Questions

-- [ ] What disk size for monitoring02? 100GB should allow 3+ months with VictoriaMetrics compression
+- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
 - [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)

 ## VictoriaMetrics Service Configuration
--- a/docs/user-management.md
+++ b/docs/user-management.md
@@ -43,11 +43,21 @@ kanidm person posix set-password <username>
 kanidm person posix set <username> --shell /bin/zsh
 ```

+### Setting Email Address
+
+Email is required for OAuth2/OIDC login (e.g., Grafana):
+
+```bash
+kanidm person update <username> --mail <email>
+```
+
 ### Example: Full User Creation

 ```bash
 kanidm person create testuser "Test User"
+kanidm person update testuser --mail testuser@home.2rjus.net
 kanidm group add-members ssh-users testuser
+kanidm group add-members users testuser  # Required for OAuth2 scopes
 kanidm person posix set testuser
 kanidm person posix set-password testuser
 kanidm person get testuser
@@ -129,6 +139,40 @@ Kanidm auto-assigns UIDs/GIDs from its configured range. For manually assigned G
 | 65,536+ | Users (auto-assigned) |
 | 68,000 - 68,999 | Groups (manually assigned) |

+## OAuth2/OIDC Login (Web Services)
+
+For OAuth2/OIDC login to web services like Grafana, users need:
+
+1. **Primary credential** - Password set via `credential update` (separate from unix password)
+2. **MFA** - TOTP or passkey (Kanidm requires MFA for primary credentials)
+3. **Group membership** - Member of `users` group (for OAuth2 scope mapping)
+4. **Email address** - Set via `person update --mail`
+
+### Setting Up Primary Credential (Web Login)
+
+The primary credential is different from the unix/POSIX password:
+
+```bash
+# Interactive credential setup
+kanidm person credential update <username>
+
+# In the interactive prompt:
+# 1. Type 'password' to set a password
+# 2. Type 'totp' to add TOTP (scan QR with authenticator app)
+# 3. Type 'commit' to save
+```
+
+### Verifying OAuth2 Readiness
+
+```bash
+kanidm person get <username>
+```
+
+Check for:
+- `mail:` - Email address set
+- `memberof:` - Includes `users@home.2rjus.net`
+- Primary credential status (check via `credential update` → `status`)
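The first two checks in the list above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: `check_oauth2_ready` is a hypothetical helper, and it assumes `kanidm person get` prints `mail:` and `memberof:` as line-oriented fields with group names separated by whitespace or commas. The primary credential status is not covered and still has to be checked through `credential update`.

```shell
# Sketch only: check_oauth2_ready is a hypothetical helper, not shipped in this repo.
# It reads `kanidm person get <username>` output on stdin. The field layout
# (one "mail:" / "memberof:" field per line) is an assumption.
check_oauth2_ready() {
  required_group="${1:-users@home.2rjus.net}"
  awk -v group="$required_group" '
    # A mail field with a non-empty value counts as "email set".
    /^[ \t]*mail:[ \t]*[^ \t]/ { has_mail = 1 }
    # Compare each listed group exactly, so ssh-users@... does not match users@...
    /^[ \t]*memberof:/ {
      sub(/^[ \t]*memberof:[ \t]*/, "")
      n = split($0, parts, /[ \t,]+/)
      for (i = 1; i <= n; i++) if (parts[i] == group) has_group = 1
    }
    END {
      if (!has_mail)  print "missing: mail"
      if (!has_group) print "missing: memberof " group
      if (has_mail && has_group) { print "ok"; exit 0 }
      exit 1
    }'
}
```

Usage would look like `kanidm person get testuser | check_oauth2_ready`, which prints `ok` and returns 0 only when both fields are present.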
+
 ## PAM/NSS Client Configuration

 Enable central authentication on a host:
--- a/flake.lock
+++ b/flake.lock
@@ -28,11 +28,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770481834,
-        "narHash": "sha256-Xx9BYnI0C/qgPbwr9nj6NoAdQTbYLunrdbNSaUww9oY=",
+        "lastModified": 1770648258,
+        "narHash": "sha256-sExxD8N9Q0RrHIoppOV6qp4jcJirLVjpQd20C72V78I=",
        "ref": "master",
-        "rev": "fd0d63b103dfaf21d1c27363266590e723021c67",
-        "revCount": 24,
+        "rev": "277a49a666347e2e2ae67128cf732956a9c3be56",
+        "revCount": 27,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/homelab-deploy"
      },
@@ -49,11 +49,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1770422522,
-        "narHash": "sha256-WmIFnquu4u58v8S2bOVWmknRwHn4x88CRfBFTzJ1inQ=",
+        "lastModified": 1770593543,
+        "narHash": "sha256-hT8Rj6JAwGDFvcxWEcUzTCrWSiupCfBa57pBDnM2C5g=",
        "ref": "refs/heads/master",
-        "rev": "cf0ce858997af4d8dcc2ce10393ff393e17fc911",
-        "revCount": 11,
+        "rev": "5aa5f7275b7a08015816171ba06d2cbdc2e02d3e",
+        "revCount": 15,
        "type": "git",
        "url": "https://git.t-juice.club/torjus/nixos-exporter"
      },
@@ -64,11 +64,11 @@
    },
    "nixpkgs": {
      "locked": {
-        "lastModified": 1770136044,
-        "narHash": "sha256-tlFqNG/uzz2++aAmn4v8J0vAkV3z7XngeIIB3rM3650=",
+        "lastModified": 1770464364,
+        "narHash": "sha256-z5NJPSBwsLf/OfD8WTmh79tlSU8XgIbwmk6qB1/TFzY=",
        "owner": "nixos",
        "repo": "nixpkgs",
-        "rev": "e576e3c9cf9bad747afcddd9e34f51d18c855b4e",
+        "rev": "23d72dabcb3b12469f57b37170fcbc1789bd7457",
        "type": "github"
      },
      "original": {
@@ -80,11 +80,11 @@
    },
    "nixpkgs-unstable": {
      "locked": {
-        "lastModified": 1770197578,
-        "narHash": "sha256-AYqlWrX09+HvGs8zM6ebZ1pwUqjkfpnv8mewYwAo+iM=",
+        "lastModified": 1770562336,
+        "narHash": "sha256-ub1gpAONMFsT/GU2hV6ZWJjur8rJ6kKxdm9IlCT0j84=",
        "owner": "nixos",
        "repo": "nixpkgs",
-        "rev": "00c21e4c93d963c50d4c0c89bfa84ed6e0694df2",
+        "rev": "d6c71932130818840fc8fe9509cf50be8c64634f",
        "type": "github"
      },
      "original": {
--- a/flake.nix
+++ b/flake.nix
@@ -191,6 +191,24 @@
            ./hosts/kanidm01
          ];
        };
+        monitoring02 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/monitoring02
+          ];
+        };
+        nix-cache02 = nixpkgs.lib.nixosSystem {
+          inherit system;
+          specialArgs = {
+            inherit inputs self;
+          };
+          modules = commonModules ++ [
+            ./hosts/nix-cache02
+          ];
+        };
      };
      packages = forAllSystems (
        { pkgs }:
@@ -208,9 +226,11 @@
              pkgs.opentofu
              pkgs.openbao
              pkgs.kanidm_1_8
+              pkgs.nkeys
              (pkgs.callPackage ./scripts/create-host { })
              homelab-deploy.packages.${pkgs.system}.default
            ];
+            ANSIBLE_CONFIG = "./ansible/ansible.cfg";
          };
        }
      );
--- a/hosts/ha1/configuration.nix
+++ b/hosts/ha1/configuration.nix
@@ -13,6 +13,8 @@
    ../../common/vm
  ];

+  homelab.host.role = "home-automation";
+
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
--- a/hosts/http-proxy/configuration.nix
+++ b/hosts/http-proxy/configuration.nix
@@ -11,6 +11,7 @@
    ../../common/vm
  ];

+  homelab.host.role = "proxy";
  homelab.dns.cnames = [
    "nzbget"
    "radarr"
--- a/hosts/jelly01/configuration.nix
+++ b/hosts/jelly01/configuration.nix
@@ -11,6 +11,8 @@
    ../../common/vm
  ];

+  homelab.host.role = "media";
+
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
--- a/hosts/kanidm01/configuration.nix
+++ b/hosts/kanidm01/configuration.nix
@@ -14,9 +14,8 @@
    ../../services/kanidm
  ];

-  # Host metadata
  homelab.host = {
-    tier = "test";
+    tier = "prod";
    role = "auth";
  };

--- a/hosts/monitoring01/configuration.nix
+++ b/hosts/monitoring01/configuration.nix
@@ -11,6 +11,8 @@
    ../../common/vm
  ];

+  homelab.host.role = "monitoring";
+
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
--- a/hosts/monitoring02/configuration.nix
+++ b/hosts/monitoring02/configuration.nix
@@ -0,0 +1,75 @@
+{
+  config,
+  lib,
+  pkgs,
+  ...
+}:
+
+{
+  imports = [
+    ../template2/hardware-configuration.nix
+
+    ../../system
+    ../../common/vm
+  ];
+
+  homelab.host = {
+    tier = "prod";
+    role = "monitoring";
+  };
+
+  # DNS CNAME for Grafana test instance
+  homelab.dns.cnames = [ "grafana-test" ];
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+
+  networking.hostName = "monitoring02";
+  networking.domain = "home.2rjus.net";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  services.resolved.enable = true;
+  networking.nameservers = [
+    "10.69.13.5"
+    "10.69.13.6"
+  ];
+
+  systemd.network.enable = true;
+  systemd.network.networks."ens18" = {
+    matchConfig.Name = "ens18";
+    address = [
+      "10.69.13.24/24"
+    ];
+    routes = [
+      { Gateway = "10.69.13.1"; }
+    ];
+    linkConfig.RequiredForOnline = "routable";
+  };
+  time.timeZone = "Europe/Oslo";
+
+  nix.settings.experimental-features = [
+    "nix-command"
+    "flakes"
+  ];
+  nix.settings.tarball-ttl = 0;
+  environment.systemPackages = with pkgs; [
+    vim
+    wget
+    git
+  ];
+
+  # Open ports in the firewall.
+  # networking.firewall.allowedTCPPorts = [ ... ];
+  # networking.firewall.allowedUDPPorts = [ ... ];
+  # Or disable the firewall altogether.
+  networking.firewall.enable = false;
+
+  system.stateVersion = "25.11"; # Did you read the comment?
+}
--- a/hosts/monitoring02/default.nix
+++ b/hosts/monitoring02/default.nix
@@ -0,0 +1,6 @@
+{ ... }: {
+  imports = [
+    ./configuration.nix
+    ../../services/grafana
+  ];
+}
--- a/hosts/nats1/configuration.nix
+++ b/hosts/nats1/configuration.nix
@@ -11,6 +11,8 @@
    ../../common/vm
  ];

+  homelab.host.role = "messaging";
+
  nixpkgs.config.allowUnfree = true;
  # Use the systemd-boot EFI boot loader.
  boot.loader.grub = {
--- a/hosts/nix-cache02/configuration.nix
+++ b/hosts/nix-cache02/configuration.nix
@@ -0,0 +1,72 @@
+{
+  config,
+  lib,
+  pkgs,
+  ...
+}:
+
+{
+  imports = [
+    ../template2/hardware-configuration.nix
+
+    ../../system
+    ../../common/vm
+  ];
+
+  # Host metadata (adjust as needed)
+  homelab.host = {
+    tier = "test";  # Start in test tier, move to prod after validation
+  };
+
+  # Enable Vault integration
+  vault.enable = true;
+
+  # Enable remote deployment via NATS
+  homelab.deploy.enable = true;
+
+  nixpkgs.config.allowUnfree = true;
+  boot.loader.grub.enable = true;
+  boot.loader.grub.device = "/dev/vda";
+
+  networking.hostName = "nix-cache02";
+  networking.domain = "home.2rjus.net";
+  networking.useNetworkd = true;
+  networking.useDHCP = false;
+  services.resolved.enable = true;
+  networking.nameservers = [
+    "10.69.13.5"
+    "10.69.13.6"
+  ];
+
+  systemd.network.enable = true;
+  systemd.network.networks."ens18" = {
+    matchConfig.Name = "ens18";
+    address = [
+      "10.69.13.25/24"
+    ];
+    routes = [
+      { Gateway = "10.69.13.1"; }
+    ];
+    linkConfig.RequiredForOnline = "routable";
+  };
+  time.timeZone = "Europe/Oslo";
+
+  nix.settings.experimental-features = [
+    "nix-command"
+    "flakes"
+  ];
+  nix.settings.tarball-ttl = 0;
+  environment.systemPackages = with pkgs; [
+    vim
+    wget
+    git
+  ];
+
+  # Open ports in the firewall.
+  # networking.firewall.allowedTCPPorts = [ ... ];
+  # networking.firewall.allowedUDPPorts = [ ... ];
+  # Or disable the firewall altogether.
+  networking.firewall.enable = false;
+
+  system.stateVersion = "25.11"; # Did you read the comment?
+}
--- a/hosts/nix-cache02/default.nix
+++ b/hosts/nix-cache02/default.nix
@@ -0,0 +1,5 @@
+{ ... }: {
+  imports = [
+    ./configuration.nix
+  ];
+}
--- a/hosts/template2/configuration.nix
+++ b/hosts/template2/configuration.nix
@@ -35,6 +35,7 @@
  homelab.host = {
    tier = "test";
    priority = "low";
+    labels.ansible = "false";  # Exclude from Ansible inventory
  };

  boot.loader.grub.enable = true;
--- a/hosts/testvm01/configuration.nix
+++ b/hosts/testvm01/configuration.nix
@@ -14,9 +14,9 @@
    ../../common/ssh-audit.nix
  ];

-  # Host metadata (adjust as needed)
  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+    tier = "test";
+    role = "test";
  };

  # Enable Vault integration
--- a/hosts/testvm02/configuration.nix
+++ b/hosts/testvm02/configuration.nix
@@ -14,9 +14,9 @@
    ../../common/ssh-audit.nix
  ];

-  # Host metadata (adjust as needed)
  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+    tier = "test";
+    role = "test";
  };

  # Enable Vault integration
--- a/hosts/testvm03/configuration.nix
+++ b/hosts/testvm03/configuration.nix
@@ -14,9 +14,9 @@
    ../../common/ssh-audit.nix
  ];

-  # Host metadata (adjust as needed)
  homelab.host = {
-    tier = "test";  # Start in test tier, move to prod after validation
+    tier = "test";
+    role = "test";
  };

  # Enable Vault integration
--- a/lib/monitoring.nix
+++ b/lib/monitoring.nix
@@ -58,10 +58,9 @@ let
      };

  # Build effective labels for a host
-  # Always includes hostname; only includes tier/priority/role if non-default
+  # Always includes hostname and tier; only includes priority/role if non-default
  buildEffectiveLabels = host:
-    { hostname = host.hostname; }
-    // (lib.optionalAttrs (host.tier != "prod") { tier = host.tier; })
+    { hostname = host.hostname; tier = host.tier; }
    // (lib.optionalAttrs (host.priority != "high") { priority = host.priority; })
    // (lib.optionalAttrs (host.role != null) { role = host.role; })
    // host.labels;
--- a/playbooks/inventory.ini
+++ b/playbooks/inventory.ini
@@ -1,5 +0,0 @@
-[proxmox]
-pve1.home.2rjus.net
-
-[proxmox:vars]
-ansible_user=root
--- a/services/grafana/dashboards/certificates.json
+++ b/services/grafana/dashboards/certificates.json
@@ -0,0 +1,446 @@
+{
+  "uid": "certificates-homelab",
+  "title": "TLS Certificates",
+  "tags": ["certificates", "tls", "security", "homelab"],
+  "timezone": "browser",
+  "schemaVersion": 39,
+  "version": 1,
+  "refresh": "5m",
+  "time": {
+    "from": "now-7d",
+    "to": "now"
+  },
+  "panels": [
+    {
+      "id": 1,
+      "title": "Endpoints Monitored",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"})",
+          "legendFormat": "Total",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "blue", "value": null}
+            ]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none",
+        "textMode": "auto"
+      },
+      "description": "Total number of TLS endpoints being monitored"
+    },
+    {
+      "id": 2,
+      "title": "Probe Failures",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(probe_success{job=\"blackbox_tls\"} == 0) or vector(0)",
+          "legendFormat": "Failing",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "red", "value": 1}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none",
+        "textMode": "auto"
+      },
+      "description": "Number of endpoints where TLS probe is failing"
+    },
+    {
+      "id": 3,
+      "title": "Expiring Soon (< 7d)",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400 * 7) or vector(0)",
+          "legendFormat": "Warning",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 1}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none",
+        "textMode": "auto"
+      },
+      "description": "Certificates expiring within 7 days"
+    },
+    {
+      "id": 4,
+      "title": "Expiring Critical (< 24h)",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) < 86400) or vector(0)",
+          "legendFormat": "Critical",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "red", "value": 1}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none",
+        "textMode": "auto"
+      },
+      "description": "Certificates expiring within 24 hours"
+    },
+    {
+      "id": 5,
+      "title": "Minimum Days Remaining",
+      "type": "gauge",
+      "gridPos": {"h": 4, "w": 8, "x": 16, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "min((probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400)",
+          "legendFormat": "Days",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "d",
+          "min": 0,
+          "max": 90,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "red", "value": null},
+              {"color": "orange", "value": 7},
+              {"color": "yellow", "value": 14},
+              {"color": "green", "value": 30}
+            ]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true
+      },
+      "description": "Shortest time until any certificate expires"
+    },
+    {
+      "id": 6,
+      "title": "Certificate Expiry by Endpoint",
+      "type": "table",
+      "gridPos": {"h": 12, "w": 12, "x": 0, "y": 4},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
+          "legendFormat": "{{instance}}",
+          "refId": "A",
+          "instant": true,
+          "format": "table"
+        }
+      ],
+      "transformations": [
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {"Time": true, "job": true, "__name__": true},
+            "renameByName": {"instance": "Endpoint", "Value": "Days Until Expiry"}
+          }
+        },
+        {
+          "id": "sortBy",
+          "options": {
+            "sort": [{"field": "Days Until Expiry", "desc": false}]
+          }
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "custom": {
+            "align": "left"
+          }
+        },
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Days Until Expiry"},
+            "properties": [
+              {"id": "unit", "value": "d"},
+              {"id": "decimals", "value": 1},
+              {"id": "custom.width", "value": 150},
+              {
+                "id": "thresholds",
+                "value": {
+                  "mode": "absolute",
+                  "steps": [
+                    {"color": "red", "value": null},
+                    {"color": "orange", "value": 7},
+                    {"color": "yellow", "value": 14},
+                    {"color": "green", "value": 30}
+                  ]
+                }
+              },
+              {"id": "custom.cellOptions", "value": {"type": "color-background"}}
+            ]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Days Until Expiry", "desc": false}]
+      },
+      "description": "All monitored endpoints sorted by days until certificate expiry"
+    },
+    {
+      "id": 7,
+      "title": "Probe Status",
+      "type": "table",
+      "gridPos": {"h": 12, "w": 12, "x": 12, "y": 4},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "probe_success{job=\"blackbox_tls\"}",
+          "legendFormat": "{{instance}}",
+          "refId": "A",
+          "instant": true,
+          "format": "table"
+        },
+        {
+          "expr": "probe_http_status_code{job=\"blackbox_tls\"}",
+          "legendFormat": "{{instance}}",
+          "refId": "B",
+          "instant": true,
+          "format": "table"
+        },
+        {
+          "expr": "probe_duration_seconds{job=\"blackbox_tls\"}",
+          "legendFormat": "{{instance}}",
+          "refId": "C",
+          "instant": true,
+          "format": "table"
+        }
+      ],
+      "transformations": [
+        {
+          "id": "joinByField",
+          "options": {
+            "byField": "instance",
+            "mode": "outer"
+          }
+        },
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {"Time": true, "Time 1": true, "Time 2": true, "Time 3": true, "job": true, "job 1": true, "job 2": true, "job 3": true, "__name__": true},
+            "renameByName": {
+              "instance": "Endpoint",
+              "Value #A": "Success",
+              "Value #B": "HTTP Status",
+              "Value #C": "Duration"
+            }
+          }
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "custom": {"align": "left"}
+        },
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Success"},
+            "properties": [
+              {"id": "custom.width", "value": 80},
+              {"id": "mappings", "value": [
+                {"type": "value", "options": {"0": {"text": "FAIL", "color": "red"}}},
+                {"type": "value", "options": {"1": {"text": "OK", "color": "green"}}}
+              ]},
+              {"id": "custom.cellOptions", "value": {"type": "color-text"}}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "HTTP Status"},
+            "properties": [
+              {"id": "custom.width", "value": 100}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Duration"},
+            "properties": [
+              {"id": "unit", "value": "s"},
+              {"id": "decimals", "value": 3},
+              {"id": "custom.width", "value": 100}
+            ]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true
+      },
+      "description": "Probe success status, HTTP response code, and probe duration"
+    },
+    {
+      "id": 8,
+      "title": "Certificate Expiry Over Time",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "(probe_ssl_earliest_cert_expiry{job=\"blackbox_tls\"} - time()) / 86400",
+          "legendFormat": "{{instance}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "d",
+          "custom": {
+            "lineWidth": 2,
+            "fillOpacity": 10,
+            "showPoints": "never"
+          },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "red", "value": null},
+              {"color": "orange", "value": 7},
+              {"color": "yellow", "value": 14},
+              {"color": "green", "value": 30}
+            ]
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "table", "placement": "right", "calcs": ["lastNotNull"]},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      },
+      "description": "Days until certificate expiry over time - useful for spotting renewal patterns"
+    },
+    {
+      "id": 9,
+      "title": "Probe Success Rate",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "avg(probe_success{job=\"blackbox_tls\"}) * 100",
+          "legendFormat": "Success Rate",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "custom": {
+            "lineWidth": 2,
+            "fillOpacity": 20,
+            "showPoints": "never"
+          },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "red", "value": null},
+              {"color": "yellow", "value": 90},
+              {"color": "green", "value": 100}
+            ]
+          },
+          "color": {"mode": "thresholds"}
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "single"}
+      },
+      "description": "Overall probe success rate across all endpoints"
+    },
+    {
+      "id": 10,
+      "title": "Probe Duration",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "probe_duration_seconds{job=\"blackbox_tls\"}",
+          "legendFormat": "{{instance}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 0,
+            "showPoints": "never"
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "table", "placement": "right", "calcs": ["mean", "max"]},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      },
+      "description": "Time taken to complete TLS probe for each endpoint"
+    }
+  ]
+}
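The expiry panels in this dashboard all derive days remaining as `(probe_ssl_earliest_cert_expiry - time()) / 86400` and colour the result with the steps red (base), orange at 7, yellow at 14, green at 30. The arithmetic can be sketched outside Grafana as below. `cert_status` is a hypothetical helper for illustration only, and it truncates to whole days, whereas the dashboard keeps fractional values.

```shell
# Sketch only: cert_status is a hypothetical helper that mirrors the dashboard math.
# Arguments: certificate expiry and current time, both as Unix timestamps in seconds.
cert_status() {
  expiry="$1"
  now="$2"
  # Same arithmetic as the PromQL expression, truncated toward zero to whole days.
  days=$(( (expiry - now) / 86400 ))
  # A threshold step applies from its value upward, matching the panel steps.
  if [ "$days" -lt 7 ]; then
    color=red
  elif [ "$days" -lt 14 ]; then
    color=orange
  elif [ "$days" -lt 30 ]; then
    color=yellow
  else
    color=green
  fi
  echo "$days $color"
}
```

For example, `cert_status "$expiry_epoch" "$(date +%s)"` prints the whole days remaining followed by the colour band, where `expiry_epoch` stands in for a value read from the metric.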
--- a/services/grafana/dashboards/logs.json
+++ b/services/grafana/dashboards/logs.json
@@ -0,0 +1,85 @@
+{
+  "uid": "logs-homelab",
+  "title": "Logs - Homelab",
+  "tags": ["loki", "logs", "homelab"],
+  "timezone": "browser",
+  "schemaVersion": 39,
+  "version": 1,
+  "refresh": "30s",
+  "templating": {
+    "list": [
+      {
+        "name": "host",
+        "type": "query",
+        "datasource": {"type": "loki", "uid": "loki"},
+        "query": "label_values(host)",
+        "refresh": 2,
+        "includeAll": true,
+        "multi": false,
+        "current": {"text": "All", "value": "$__all"}
+      },
+      {
+        "name": "job",
+        "type": "query",
+        "datasource": {"type": "loki", "uid": "loki"},
+        "query": "label_values(job)",
+        "refresh": 2,
+        "includeAll": true,
+        "multi": false,
+        "current": {"text": "All", "value": "$__all"}
+      },
+      {
+        "name": "search",
+        "type": "textbox",
+        "current": {"text": "", "value": ""},
+        "label": "Search"
+      }
+    ]
+  },
+  "panels": [
+    {
+      "id": 1,
+      "title": "Log Volume",
+      "type": "timeseries",
+      "gridPos": {"h": 6, "w": 24, "x": 0, "y": 0},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "sum by (host) (count_over_time({host=~\"$host\", job=~\"$job\"} |~ \"$search\" [1m]))",
+          "legendFormat": "{{host}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short"
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"}
+      }
+    },
+    {
+      "id": 2,
+      "title": "Logs",
+      "type": "logs",
+      "gridPos": {"h": 18, "w": 24, "x": 0, "y": 6},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "{host=~\"$host\", job=~\"$job\"} |~ \"$search\"",
+          "refId": "A"
+        }
+      ],
+      "options": {
+        "showTime": true,
+        "showLabels": true,
+        "showCommonLabels": false,
+        "wrapLogMessage": true,
+        "prettifyLogMessage": false,
+        "enableLogDetails": true,
+        "sortOrder": "Descending"
+      }
+    }
+  ]
+}
--- a/services/grafana/dashboards/nixos-fleet.json
+++ b/services/grafana/dashboards/nixos-fleet.json
@@ -0,0 +1,633 @@
+{
+  "uid": "nixos-fleet-homelab",
+  "title": "NixOS Fleet - Homelab",
+  "tags": ["nixos", "fleet", "homelab"],
+  "timezone": "browser",
+  "schemaVersion": 39,
+  "version": 1,
+  "refresh": "1m",
+  "time": {
+    "from": "now-7d",
+    "to": "now"
+  },
+  "templating": {
+    "list": [
+      {
+        "name": "tier",
+        "type": "query",
+        "datasource": {"type": "prometheus", "uid": "prometheus"},
+        "query": "label_values(nixos_flake_info, tier)",
+        "refresh": 2,
+        "includeAll": true,
+        "multi": false,
+        "current": {"text": "All", "value": "$__all"}
+      }
+    ]
+  },
+  "panels": [
+    {
+      "id": 1,
+      "title": "Hosts Behind Remote",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 1)",
+          "legendFormat": "Behind",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 1},
+              {"color": "red", "value": 5}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none",
+        "textMode": "auto"
+      },
+      "description": "Number of hosts where current revision differs from remote master"
+    },
+    {
+      "id": 2,
+      "title": "Hosts Needing Reboot",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(nixos_config_mismatch{tier=~\"$tier\"} == 1)",
+          "legendFormat": "Need Reboot",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 1},
+              {"color": "orange", "value": 3},
+              {"color": "red", "value": 5}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "description": "Hosts where booted generation differs from current (switched but not rebooted)"
+    },
+    {
+      "id": 3,
+      "title": "Total Hosts",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 3, "x": 8, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(nixos_flake_info{tier=~\"$tier\"})",
+          "legendFormat": "Hosts",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "blue", "value": null}]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 4,
+      "title": "Nixpkgs Age",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 3, "x": 11, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "max(nixos_flake_input_age_seconds{input=\"nixpkgs\", tier=~\"$tier\"})",
+          "legendFormat": "Nixpkgs",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 604800},
+              {"color": "orange", "value": 1209600},
+              {"color": "red", "value": 2592000}
+            ]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "description": "Age of nixpkgs flake input (yellow >7d, orange >14d, red >30d)"
+    },
+    {
+      "id": 5,
+      "title": "Hosts Up-to-date",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 3, "x": 14, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(nixos_flake_revision_behind{tier=~\"$tier\"} == 0)",
+          "legendFormat": "Up-to-date",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "green", "value": null}]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 13,
+      "title": "Deployments (24h)",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 3, "x": 17, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "sum(increase(homelab_deploy_deployments_total{status=\"completed\"}[24h]))",
+          "legendFormat": "Deployments",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "blue", "value": null}]
+          },
+          "noValue": "0",
+          "decimals": 0
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "description": "Total successful deployments in the last 24 hours"
+    },
+    {
+      "id": 14,
+      "title": "Avg Deploy Time",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "sum(increase(homelab_deploy_deployment_duration_seconds_sum{success=\"true\"}[24h])) / sum(increase(homelab_deploy_deployment_duration_seconds_count{success=\"true\"}[24h]))",
+          "legendFormat": "Avg Time",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 30},
+              {"color": "red", "value": 60}
+            ]
+          },
+          "noValue": "-"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "description": "Average deployment duration over the last 24 hours (yellow >30s, red >60s)"
+    },
+    {
+      "id": 6,
+      "title": "Fleet Status",
+      "type": "table",
+      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "nixos_flake_info{tier=~\"$tier\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "info"
+        },
+        {
+          "expr": "nixos_flake_revision_behind{tier=~\"$tier\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "behind"
+        },
+        {
+          "expr": "nixos_config_mismatch{tier=~\"$tier\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "mismatch"
+        },
+        {
+          "expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "age"
+        },
+        {
+          "expr": "nixos_generation_count{tier=~\"$tier\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "count"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {},
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Hostname"},
+            "properties": [{"id": "custom.width", "value": 120}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Current Rev"},
+            "properties": [{"id": "custom.width", "value": 90}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Remote Rev"},
+            "properties": [{"id": "custom.width", "value": 90}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Behind"},
+            "properties": [
+              {"id": "custom.width", "value": 70},
+              {"id": "mappings", "value": [
+                {"type": "value", "options": {"0": {"text": "No", "color": "green"}}},
+                {"type": "value", "options": {"1": {"text": "Yes", "color": "red"}}}
+              ]},
+              {"id": "custom.cellOptions", "value": {"type": "color-text"}}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Need Reboot"},
+            "properties": [
+              {"id": "custom.width", "value": 100},
+              {"id": "mappings", "value": [
+                {"type": "value", "options": {"0": {"text": "No", "color": "green"}}},
+                {"type": "value", "options": {"1": {"text": "Yes", "color": "orange"}}}
+              ]},
+              {"id": "custom.cellOptions", "value": {"type": "color-text"}}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Config Age"},
+            "properties": [
+              {"id": "unit", "value": "s"},
+              {"id": "custom.width", "value": 100}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Generations"},
+            "properties": [{"id": "custom.width", "value": 100}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Tier"},
+            "properties": [{"id": "custom.width", "value": 60}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Role"},
+            "properties": [{"id": "custom.width", "value": 80}]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Hostname", "desc": false}]
+      },
+      "transformations": [
+        {
+          "id": "joinByField",
+          "options": {"byField": "hostname", "mode": "outer"}
+        },
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true,
+              "Time 1": true,
+              "Time 2": true,
+              "Time 3": true,
+              "Time 4": true,
+              "Time 5": true,
+              "Value #info": true,
+              "__name__": true,
+              "__name__ 1": true,
+              "__name__ 2": true,
+              "__name__ 3": true,
+              "__name__ 4": true,
+              "__name__ 5": true,
+              "dns_role": true,
+              "dns_role 1": true,
+              "dns_role 2": true,
+              "dns_role 3": true,
+              "dns_role 4": true,
+              "instance": true,
+              "instance 1": true,
+              "instance 2": true,
+              "instance 3": true,
+              "instance 4": true,
+              "job": true,
+              "job 1": true,
+              "job 2": true,
+              "job 3": true,
+              "job 4": true,
+              "nixos_version": true,
+              "nixpkgs_rev": true,
+              "role 1": true,
+              "role 2": true,
+              "role 3": true,
+              "role 4": true,
+              "tier 1": true,
+              "tier 2": true,
+              "tier 3": true,
+              "tier 4": true
+            },
+            "indexByName": {
+              "hostname": 0,
+              "tier": 1,
+              "role": 2,
+              "current_rev": 3,
+              "remote_rev": 4,
+              "Value #behind": 5,
+              "Value #mismatch": 6,
+              "Value #age": 7,
+              "Value #count": 8
+            },
+            "renameByName": {
+              "hostname": "Hostname",
+              "tier": "Tier",
+              "role": "Role",
+              "current_rev": "Current Rev",
+              "remote_rev": "Remote Rev",
+              "Value #behind": "Behind",
+              "Value #mismatch": "Need Reboot",
+              "Value #age": "Config Age",
+              "Value #count": "Generations"
+            }
+          }
+        }
+      ]
+    },
+    {
+      "id": 7,
+      "title": "Generation Age by Host",
+      "type": "bargauge",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "sort_desc(nixos_generation_age_seconds{tier=~\"$tier\"})",
+          "legendFormat": "{{hostname}}",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 86400},
+              {"color": "orange", "value": 259200},
+              {"color": "red", "value": 604800}
+            ]
+          },
+          "min": 0
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "orientation": "horizontal",
+        "displayMode": "gradient",
+        "showUnfilled": true
+      },
+      "description": "How long ago each host's current config was deployed (yellow >1d, orange >3d, red >7d)"
+    },
+    {
+      "id": 8,
+      "title": "Generations per Host",
+      "type": "bargauge",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "sort_desc(nixos_generation_count{tier=~\"$tier\"})",
+          "legendFormat": "{{hostname}}",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "blue", "value": null},
+              {"color": "purple", "value": 50}
+            ]
+          },
+          "min": 0
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "orientation": "horizontal",
+        "displayMode": "gradient",
+        "showUnfilled": true
+      },
+      "description": "Total number of NixOS generations on each host"
+    },
+    {
+      "id": 9,
+      "title": "Deployment Activity (Generation Age Over Time)",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 22},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "nixos_generation_age_seconds{tier=~\"$tier\"}",
+          "legendFormat": "{{hostname}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 0,
+            "showPoints": "never",
+            "stacking": {"mode": "none"}
+          }
+        }
+      },
+      "options": {
+        "legend": {
+          "displayMode": "list",
+          "placement": "bottom"
+        },
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      },
+      "description": "Generation age increases over time and drops to near zero on each deployment. Useful for spotting deployment patterns."
+    },
+    {
+      "id": 10,
+      "title": "Flake Input Ages",
+      "type": "table",
+      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "max by (input) (nixos_flake_input_age_seconds)",
+          "format": "table",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s"
+        },
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "input"},
+            "properties": [{"id": "custom.width", "value": 150}]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Value", "desc": true}]
+      },
+      "transformations": [
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {"Time": true},
+            "renameByName": {
+              "input": "Flake Input",
+              "Value": "Age"
+            }
+          }
+        }
+      ],
+      "description": "Age of each flake input across the fleet"
+    },
+    {
+      "id": 11,
+      "title": "Hosts by Revision",
+      "type": "piechart",
+      "gridPos": {"h": 6, "w": 6, "x": 12, "y": 30},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count by (current_rev) (nixos_flake_info{tier=~\"$tier\"})",
+          "legendFormat": "{{current_rev}}",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {}
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "legend": {"displayMode": "table", "placement": "right", "values": ["value"]},
+        "pieType": "pie"
+      },
+      "description": "Distribution of hosts by their current flake revision"
+    },
+    {
+      "id": 12,
+      "title": "Hosts by Tier",
+      "type": "piechart",
+      "gridPos": {"h": 6, "w": 6, "x": 18, "y": 30},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count by (tier) (nixos_flake_info)",
+          "legendFormat": "{{tier}}",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {}
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "legend": {"displayMode": "table", "placement": "right", "values": ["value"]},
+        "pieType": "pie"
+      },
+      "transformations": [
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "^$",
+            "renamePattern": "prod"
+          }
+        }
+      ],
+      "description": "Distribution of hosts by tier (test vs prod)"
+    }
+  ]
+}
--- a/services/grafana/dashboards/nixos-operations.json
+++ b/services/grafana/dashboards/nixos-operations.json
@@ -0,0 +1,296 @@
+{
+  "uid": "nixos-operations",
+  "title": "NixOS Operations",
+  "tags": ["loki", "nixos", "operations", "homelab"],
+  "timezone": "browser",
+  "schemaVersion": 39,
+  "version": 1,
+  "refresh": "1m",
+  "time": {
+    "from": "now-24h",
+    "to": "now"
+  },
+  "templating": {
+    "list": [
+      {
+        "name": "host",
+        "type": "query",
+        "datasource": {"type": "loki", "uid": "loki"},
+        "query": "label_values(host)",
+        "refresh": 2,
+        "includeAll": true,
+        "multi": true,
+        "current": {"text": "All", "value": "$__all"}
+      }
+    ]
+  },
+  "panels": [
+    {
+      "id": 1,
+      "title": "Upgrade Log Volume",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "sum(count_over_time({systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} [$__range]))",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "blue", "value": null}]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "description": "Total log entries from nixos-upgrade.service in the selected time range"
+    },
+    {
+      "id": 2,
+      "title": "Successful Upgrades",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "sum(count_over_time({systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |= \"Done. The new configuration is\" [$__range]))",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "green", "value": null}]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "description": "Upgrades that completed successfully"
+    },
+    {
+      "id": 3,
+      "title": "Upgrade Errors",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "sum(count_over_time({systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |~ \"(?i)error|failed\" [$__range]))",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "red", "value": 1}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "description": "Upgrade log entries matching 'error' or 'failed' (case-insensitive)"
+    },
+    {
+      "id": 4,
+      "title": "Bootstrap Events",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "sum(count_over_time({job=\"bootstrap\", host=~\"$host\"} [$__range]))",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "purple", "value": null}]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "description": "Bootstrap log entries from new VM deployments"
+    },
+    {
+      "id": 5,
+      "title": "Upgrade Activity by Host",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "sum by (host) (count_over_time({systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} [5m]))",
+          "legendFormat": "{{host}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 30,
+            "showPoints": "never",
+            "stacking": {"mode": "normal"}
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      },
+      "description": "When upgrades ran on each host"
+    },
+    {
+      "id": 6,
+      "title": "ACME Certificate Activity",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "sum by (host) (count_over_time({systemd_unit=~\"acme.*\", host=~\"$host\"} [5m]))",
+          "legendFormat": "{{host}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 30,
+            "showPoints": "never",
+            "stacking": {"mode": "normal"}
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      },
+      "description": "ACME certificate renewal activity"
+    },
+    {
+      "id": 7,
+      "title": "Recent Upgrade Completions",
+      "type": "logs",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "{systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |= \"Done. The new configuration is\" | json | line_format \"{{.MESSAGE}}\" | keep host",
+          "refId": "A"
+        }
+      ],
+      "options": {
+        "showTime": true,
+        "showLabels": true,
+        "showCommonLabels": false,
+        "wrapLogMessage": true,
+        "prettifyLogMessage": false,
+        "enableLogDetails": true,
+        "sortOrder": "Descending"
+      },
+      "description": "Successful upgrade completion messages showing the new system path"
+    },
+    {
+      "id": 8,
+      "title": "Build Activity",
+      "type": "logs",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 12},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "{systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |= \"building\" | json | line_format \"{{.MESSAGE}}\" | keep host",
+          "refId": "A"
+        }
+      ],
+      "options": {
+        "showTime": true,
+        "showLabels": true,
+        "showCommonLabels": false,
+        "wrapLogMessage": true,
+        "prettifyLogMessage": false,
+        "enableLogDetails": true,
+        "sortOrder": "Descending"
+      },
+      "description": "Derivations being built during upgrades"
+    },
+    {
+      "id": 9,
+      "title": "Bootstrap Logs",
+      "type": "logs",
+      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 20},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "{job=\"bootstrap\", host=~\"$host\"}",
+          "refId": "A"
+        }
+      ],
+      "options": {
+        "showTime": true,
+        "showLabels": true,
+        "showCommonLabels": false,
+        "wrapLogMessage": true,
+        "prettifyLogMessage": false,
+        "enableLogDetails": true,
+        "sortOrder": "Descending"
+      },
+      "description": "Logs from VM bootstrap process (new deployments)"
+    },
+    {
+      "id": 10,
+      "title": "Upgrade Errors & Failures",
+      "type": "logs",
+      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 28},
+      "datasource": {"type": "loki", "uid": "loki"},
+      "targets": [
+        {
+          "expr": "{systemd_unit=\"nixos-upgrade.service\", host=~\"$host\"} |~ \"(?i)error|failed\" | json | line_format \"{{.MESSAGE}}\" | keep host",
+          "refId": "A"
+        }
+      ],
+      "options": {
+        "showTime": true,
+        "showLabels": true,
+        "showCommonLabels": false,
+        "wrapLogMessage": true,
+        "prettifyLogMessage": false,
+        "enableLogDetails": true,
+        "sortOrder": "Descending"
+      },
+      "description": "Errors and failures during NixOS upgrades"
+    }
+  ]
+}
--- a/services/grafana/dashboards/node-exporter.json
+++ b/services/grafana/dashboards/node-exporter.json
@@ -0,0 +1,208 @@
+{
+  "uid": "node-exporter-homelab",
+  "title": "Node Exporter - Homelab",
+  "tags": ["node-exporter", "prometheus", "homelab"],
+  "timezone": "browser",
+  "schemaVersion": 39,
+  "version": 1,
+  "refresh": "30s",
+  "templating": {
+    "list": [
+      {
+        "name": "instance",
+        "type": "query",
+        "datasource": {"type": "prometheus", "uid": "prometheus"},
+        "query": "label_values(node_uname_info, instance)",
+        "refresh": 2,
+        "includeAll": false,
+        "multi": false,
+        "current": {}
+      }
+    ]
+  },
+  "panels": [
+    {
+      "id": 1,
+      "title": "CPU Usage",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
+          "legendFormat": "CPU %",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 70},
+              {"color": "red", "value": 90}
+            ]
+          }
+        }
+      }
+    },
+    {
+      "id": 2,
+      "title": "Memory Usage",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
+          "legendFormat": "Memory %",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 70},
+              {"color": "red", "value": 90}
+            ]
+          }
+        }
+      }
+    },
+    {
+      "id": 3,
+      "title": "Disk Usage",
+      "type": "gauge",
+      "gridPos": {"h": 8, "w": 8, "x": 0, "y": 8},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
+          "legendFormat": "Root /",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 70},
+              {"color": "red", "value": 85}
+            ]
+          }
+        }
+      }
+    },
+    {
+      "id": 4,
+      "title": "System Load",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 8, "x": 8, "y": 8},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "node_load1{instance=~\"$instance\"}",
+          "legendFormat": "1m",
+          "refId": "A"
+        },
+        {
+          "expr": "node_load5{instance=~\"$instance\"}",
+          "legendFormat": "5m",
+          "refId": "B"
+        },
+        {
+          "expr": "node_load15{instance=~\"$instance\"}",
+          "legendFormat": "15m",
+          "refId": "C"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short"
+        }
+      }
+    },
+    {
+      "id": 5,
+      "title": "Uptime",
+      "type": "stat",
+      "gridPos": {"h": 8, "w": 8, "x": 16, "y": 8},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "time() - node_boot_time_seconds{instance=~\"$instance\"}",
+          "legendFormat": "Uptime",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s"
+        }
+      }
+    },
+    {
+      "id": 6,
+      "title": "Network Traffic",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
+          "legendFormat": "Receive {{device}}",
+          "refId": "A"
+        },
+        {
+          "expr": "-rate(node_network_transmit_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|br.*|docker.*\"}[5m])",
+          "legendFormat": "Transmit {{device}}",
+          "refId": "B"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "Bps"
+        }
+      }
+    },
+    {
+      "id": 7,
+      "title": "Disk I/O",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "rate(node_disk_read_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",
+          "legendFormat": "Read {{device}}",
+          "refId": "A"
+        },
+        {
+          "expr": "-rate(node_disk_written_bytes_total{instance=~\"$instance\",device!~\"dm-.*\"}[5m])",
+          "legendFormat": "Write {{device}}",
+          "refId": "B"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "Bps"
+        }
+      }
+    }
+  ]
+}
--- a/services/grafana/dashboards/proxmox.json
+++ b/services/grafana/dashboards/proxmox.json
@@ -0,0 +1,606 @@
+{
+  "uid": "proxmox-homelab",
+  "title": "Proxmox - Homelab",
+  "tags": ["proxmox", "virtualization", "homelab"],
+  "timezone": "browser",
+  "schemaVersion": 39,
+  "version": 1,
+  "refresh": "30s",
+  "time": {
+    "from": "now-6h",
+    "to": "now"
+  },
+  "templating": {
+    "list": [
+      {
+        "name": "vm",
+        "type": "query",
+        "datasource": {"type": "prometheus", "uid": "prometheus"},
+        "query": "label_values(pve_guest_info{template=\"0\"}, name)",
+        "refresh": 2,
+        "includeAll": true,
+        "multi": true,
+        "current": {"text": "All", "value": "$__all"}
+      }
+    ]
+  },
+  "panels": [
+    {
+      "id": 1,
+      "title": "VMs Running",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 1)",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "green", "value": null}]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 2,
+      "title": "VMs Stopped",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(pve_up{id=~\"qemu/.*\"} * on(id) pve_guest_info{template=\"0\"} == 0)",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 1},
+              {"color": "red", "value": 3}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 3,
+      "title": "Node CPU",
+      "type": "gauge",
+      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "pve_cpu_usage_ratio{id=~\"node/.*\"} * 100",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 70},
+              {"color": "red", "value": 90}
+            ]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true
+      }
+    },
+    {
+      "id": 4,
+      "title": "Node Memory",
+      "type": "gauge",
+      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "pve_memory_usage_bytes{id=~\"node/.*\"} / pve_memory_size_bytes{id=~\"node/.*\"} * 100",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 70},
+              {"color": "red", "value": 90}
+            ]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true
+      }
+    },
+    {
+      "id": 5,
+      "title": "Node Uptime",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "pve_uptime_seconds{id=~\"node/.*\"}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "blue", "value": null}]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 6,
+      "title": "Templates",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(pve_guest_info{template=\"1\"})",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "purple", "value": null}]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 7,
+      "title": "VM Status",
+      "type": "table",
+      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "pve_guest_info{template=\"0\", name=~\"$vm\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "info"
+        },
+        {
+          "expr": "pve_up{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "status"
+        },
+        {
+          "expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
+          "format": "table",
+          "instant": true,
+          "refId": "cpu"
+        },
+        {
+          "expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} / on(id) group_left pve_memory_size_bytes * 100",
+          "format": "table",
+          "instant": true,
+          "refId": "mem"
+        },
+        {
+          "expr": "pve_uptime_seconds{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "uptime"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {},
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Name"},
+            "properties": [{"id": "custom.width", "value": 150}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Status"},
+            "properties": [
+              {"id": "custom.width", "value": 80},
+              {"id": "mappings", "value": [
+                {"type": "value", "options": {"0": {"text": "Stopped", "color": "red"}}},
+                {"type": "value", "options": {"1": {"text": "Running", "color": "green"}}}
+              ]},
+              {"id": "custom.cellOptions", "value": {"type": "color-text"}}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "CPU %"},
+            "properties": [
+              {"id": "unit", "value": "percent"},
+              {"id": "decimals", "value": 1},
+              {"id": "custom.width", "value": 80},
+              {"id": "custom.cellOptions", "value": {"type": "gauge", "mode": "basic"}},
+              {"id": "min", "value": 0},
+              {"id": "max", "value": 100},
+              {"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 50}, {"color": "red", "value": 80}]}}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Memory %"},
+            "properties": [
+              {"id": "unit", "value": "percent"},
+              {"id": "decimals", "value": 1},
+              {"id": "custom.width", "value": 100},
+              {"id": "custom.cellOptions", "value": {"type": "gauge", "mode": "basic"}},
+              {"id": "min", "value": 0},
+              {"id": "max", "value": 100},
+              {"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 70}, {"color": "red", "value": 90}]}}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Uptime"},
+            "properties": [
+              {"id": "unit", "value": "s"},
+              {"id": "custom.width", "value": 100}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "ID"},
+            "properties": [{"id": "custom.width", "value": 90}]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Name", "desc": false}]
+      },
+      "transformations": [
+        {
+          "id": "joinByField",
+          "options": {"byField": "name", "mode": "outer"}
+        },
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true,
+              "Time 1": true,
+              "Time 2": true,
+              "Time 3": true,
+              "Time 4": true,
+              "Value #info": true,
+              "__name__": true,
+              "id 1": true,
+              "id 2": true,
+              "id 3": true,
+              "id 4": true,
+              "instance": true,
+              "instance 1": true,
+              "instance 2": true,
+              "instance 3": true,
+              "instance 4": true,
+              "job": true,
+              "job 1": true,
+              "job 2": true,
+              "job 3": true,
+              "job 4": true,
+              "name 1": true,
+              "name 2": true,
+              "name 3": true,
+              "name 4": true,
+              "node": true,
+              "tags": true,
+              "template": true,
+              "type": true
+            },
+            "indexByName": {
+              "name": 0,
+              "id": 1,
+              "Value #status": 2,
+              "Value #cpu": 3,
+              "Value #mem": 4,
+              "Value #uptime": 5
+            },
+            "renameByName": {
+              "name": "Name",
+              "id": "ID",
+              "Value #status": "Status",
+              "Value #cpu": "CPU %",
+              "Value #mem": "Memory %",
+              "Value #uptime": "Uptime"
+            }
+          }
+        }
+      ]
+    },
+    {
+      "id": 8,
+      "title": "VM CPU Usage",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 14},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "pve_cpu_usage_ratio{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"} * 100",
+          "legendFormat": "{{name}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 10,
+            "showPoints": "never"
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      }
+    },
+    {
+      "id": 9,
+      "title": "VM Memory Usage",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 14},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "pve_memory_usage_bytes{id=~\"qemu/.*\"} * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
+          "legendFormat": "{{name}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "bytes",
+          "min": 0,
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 10,
+            "showPoints": "never"
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      }
+    },
+    {
+      "id": 10,
+      "title": "VM Network Traffic",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 22},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "rate(pve_network_receive_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
+          "legendFormat": "{{name}} RX",
+          "refId": "A"
+        },
+        {
+          "expr": "-rate(pve_network_transmit_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
+          "legendFormat": "{{name}} TX",
+          "refId": "B"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "Bps",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 10,
+            "showPoints": "never"
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      }
+    },
+    {
+      "id": 11,
+      "title": "VM Disk I/O",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 22},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "rate(pve_disk_read_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
+          "legendFormat": "{{name}} Read",
+          "refId": "A"
+        },
+        {
+          "expr": "-rate(pve_disk_write_bytes{id=~\"qemu/.*\"}[5m]) * on(id) group_left(name) pve_guest_info{template=\"0\", name=~\"$vm\"}",
+          "legendFormat": "{{name}} Write",
+          "refId": "B"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "Bps",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 10,
+            "showPoints": "never"
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      }
+    },
+    {
+      "id": 12,
+      "title": "Storage Usage",
+      "type": "bargauge",
+      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 30},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"} * 100",
+          "legendFormat": "{{id}}",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 70},
+              {"color": "red", "value": 85}
+            ]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "orientation": "horizontal",
+        "displayMode": "gradient",
+        "showUnfilled": true
+      },
+      "transformations": [
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "storage/pve1/(.*)",
+            "renamePattern": "$1"
+          }
+        }
+      ]
+    },
+    {
+      "id": 13,
+      "title": "Storage Capacity",
+      "type": "table",
+      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 30},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "pve_disk_size_bytes{id=~\"storage/.*\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "size"
+        },
+        {
+          "expr": "pve_disk_usage_bytes{id=~\"storage/.*\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "used"
+        },
+        {
+          "expr": "pve_disk_size_bytes{id=~\"storage/.*\"} - pve_disk_usage_bytes{id=~\"storage/.*\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "free"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "bytes"
+        },
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Storage"},
+            "properties": [{"id": "unit", "value": "none"}]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true
+      },
+      "transformations": [
+        {
+          "id": "joinByField",
+          "options": {"byField": "id", "mode": "outer"}
+        },
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true,
+              "Time 1": true,
+              "Time 2": true,
+              "instance": true,
+              "instance 1": true,
+              "instance 2": true,
+              "job": true,
+              "job 1": true,
+              "job 2": true
+            },
+            "renameByName": {
+              "id": "Storage",
+              "Value #size": "Total",
+              "Value #used": "Used",
+              "Value #free": "Free"
+            }
+          }
+        },
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "storage/pve1/(.*)",
+            "renamePattern": "$1"
+          }
+        }
+      ]
+    }
+  ]
+}
--- a/services/grafana/dashboards/systemd.json
+++ b/services/grafana/dashboards/systemd.json
@@ -0,0 +1,553 @@
+{
+  "uid": "systemd-homelab",
+  "title": "Systemd Services - Homelab",
+  "tags": ["systemd", "services", "homelab"],
+  "timezone": "browser",
+  "schemaVersion": 39,
+  "version": 1,
+  "refresh": "1m",
+  "time": {
+    "from": "now-24h",
+    "to": "now"
+  },
+  "templating": {
+    "list": [
+      {
+        "name": "hostname",
+        "type": "query",
+        "datasource": {"type": "prometheus", "uid": "prometheus"},
+        "query": "label_values(systemd_unit_state, hostname)",
+        "refresh": 2,
+        "includeAll": true,
+        "multi": true,
+        "current": {"text": "All", "value": "$__all"}
+      }
+    ]
+  },
+  "panels": [
+    {
+      "id": 1,
+      "title": "Failed Units",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1) or vector(0)",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "red", "value": 1}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 2,
+      "title": "Active Units",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1)",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "green", "value": null}]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 3,
+      "title": "Hosts Monitored",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(count by (hostname) (systemd_unit_state{hostname=~\"$hostname\"}))",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "blue", "value": null}]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 4,
+      "title": "Total Service Restarts",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "sum(systemd_service_restart_total{hostname=~\"$hostname\"})",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "green", "value": null},
+              {"color": "yellow", "value": 10},
+              {"color": "orange", "value": 50}
+            ]
+          },
+          "noValue": "0"
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 5,
+      "title": "Inactive Units",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(systemd_unit_state{state=\"inactive\", hostname=~\"$hostname\"} == 1)",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "purple", "value": null}]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 6,
+      "title": "Timers",
+      "type": "stat",
+      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "count(systemd_timer_last_trigger_seconds{hostname=~\"$hostname\"})",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "blue", "value": null}]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "colorMode": "value",
+        "graphMode": "none"
+      }
+    },
+    {
+      "id": 7,
+      "title": "Failed Units",
+      "type": "table",
+      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 4},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "systemd_unit_state{state=\"failed\", hostname=~\"$hostname\"} == 1",
+          "format": "table",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {},
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Host"},
+            "properties": [{"id": "custom.width", "value": 120}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Unit"},
+            "properties": [{"id": "custom.width", "value": 300}]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Host", "desc": false}]
+      },
+      "transformations": [
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true,
+              "Value": true,
+              "__name__": true,
+              "dns_role": true,
+              "instance": true,
+              "job": true,
+              "role": true,
+              "state": true,
+              "tier": true,
+              "type": true
+            },
+            "renameByName": {
+              "hostname": "Host",
+              "name": "Unit"
+            }
+          }
+        }
+      ],
+      "description": "Units currently in failed state"
+    },
+    {
+      "id": 8,
+      "title": "Service Restarts (Top 15)",
+      "type": "table",
+      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 4},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "topk(15, systemd_service_restart_total{hostname=~\"$hostname\"} > 0)",
+          "format": "table",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {},
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Host"},
+            "properties": [{"id": "custom.width", "value": 120}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Service"},
+            "properties": [{"id": "custom.width", "value": 280}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Restarts"},
+            "properties": [{"id": "custom.width", "value": 80}]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Restarts", "desc": true}]
+      },
+      "transformations": [
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true,
+              "__name__": true,
+              "dns_role": true,
+              "instance": true,
+              "job": true,
+              "role": true,
+              "tier": true
+            },
+            "renameByName": {
+              "hostname": "Host",
+              "name": "Service",
+              "Value": "Restarts"
+            }
+          }
+        }
+      ],
+      "description": "Services that have been restarted (since host boot)"
+    },
+    {
+      "id": 9,
+      "title": "Active Units per Host",
+      "type": "bargauge",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "sort_desc(count by (hostname) (systemd_unit_state{state=\"active\", hostname=~\"$hostname\"} == 1))",
+          "legendFormat": "{{hostname}}",
+          "refId": "A",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{"color": "green", "value": null}]
+          },
+          "min": 0
+        }
+      },
+      "options": {
+        "reduceOptions": {"calcs": ["lastNotNull"]},
+        "orientation": "horizontal",
+        "displayMode": "gradient",
+        "showUnfilled": true
+      }
+    },
+    {
+      "id": 10,
+      "title": "NixOS Upgrade Timers",
+      "type": "table",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"} * 1000",
+          "format": "table",
+          "instant": true,
+          "refId": "last"
+        },
+        {
+          "expr": "time() - systemd_timer_last_trigger_seconds{name=\"nixos-upgrade.timer\", hostname=~\"$hostname\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "ago"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {},
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Host"},
+            "properties": [{"id": "custom.width", "value": 130}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Last Trigger"},
+            "properties": [
+              {"id": "unit", "value": "dateTimeAsLocalNoDateIfToday"},
+              {"id": "custom.width", "value": 180}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Time Ago"},
+            "properties": [
+              {"id": "unit", "value": "s"},
+              {"id": "custom.width", "value": 120},
+              {"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 86400}, {"color": "red", "value": 172800}]}},
+              {"id": "custom.cellOptions", "value": {"type": "color-text"}}
+            ]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Time Ago", "desc": true}]
+      },
+      "transformations": [
+        {
+          "id": "joinByField",
+          "options": {"byField": "hostname", "mode": "outer"}
+        },
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true,
+              "Time 1": true,
+              "__name__": true,
+              "__name__ 1": true,
+              "dns_role": true,
+              "dns_role 1": true,
+              "instance": true,
+              "instance 1": true,
+              "job": true,
+              "job 1": true,
+              "name": true,
+              "name 1": true,
+              "role": true,
+              "role 1": true,
+              "tier": true,
+              "tier 1": true
+            },
+            "indexByName": {
+              "hostname": 0,
+              "Value #last": 1,
+              "Value #ago": 2
+            },
+            "renameByName": {
+              "hostname": "Host",
+              "Value #last": "Last Trigger",
+              "Value #ago": "Time Ago"
+            }
+          }
+        }
+      ],
+      "description": "When nixos-upgrade.timer last ran on each host. Yellow >24h, Red >48h."
+    },
+    {
+      "id": 11,
+      "title": "Backup Timers",
+      "type": "table",
+      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "last"
+        },
+        {
+          "expr": "time() - systemd_timer_last_trigger_seconds{name=~\"restic.*\", hostname=~\"$hostname\"}",
+          "format": "table",
+          "instant": true,
+          "refId": "ago"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {},
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Host"},
+            "properties": [{"id": "custom.width", "value": 120}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Timer"},
+            "properties": [{"id": "custom.width", "value": 220}]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Last Trigger"},
+            "properties": [
+              {"id": "unit", "value": "dateTimeAsLocalNoDateIfToday"},
+              {"id": "custom.width", "value": 180}
+            ]
+          },
+          {
+            "matcher": {"id": "byName", "options": "Time Ago"},
+            "properties": [
+              {"id": "unit", "value": "s"},
+              {"id": "custom.width", "value": 100},
+              {"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 86400}, {"color": "red", "value": 172800}]}},
+              {"id": "custom.cellOptions", "value": {"type": "color-text"}}
+            ]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Time Ago", "desc": true}]
+      },
+      "transformations": [
+        {
+          "id": "joinByField",
+          "options": {"byField": "name", "mode": "outer"}
+        },
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true,
+              "Time 1": true,
+              "__name__": true,
+              "__name__ 1": true,
+              "dns_role": true,
+              "dns_role 1": true,
+              "instance": true,
+              "instance 1": true,
+              "job": true,
+              "job 1": true,
+              "role": true,
+              "role 1": true,
+              "tier": true,
+              "tier 1": true,
+              "hostname 1": true
+            },
+            "indexByName": {
+              "hostname": 0,
+              "name": 1,
+              "Value #last": 2,
+              "Value #ago": 3
+            },
+            "renameByName": {
+              "hostname": "Host",
+              "name": "Timer",
+              "Value #last": "Last Trigger",
+              "Value #ago": "Time Ago"
+            }
+          }
+        }
+      ],
+      "description": "When each restic backup timer last ran. Yellow >24h, Red >48h."
+    },
+    {
+      "id": 12,
+      "title": "Service Restarts Over Time",
+      "type": "timeseries",
+      "gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "sum by (hostname) (increase(systemd_service_restart_total{hostname=~\"$hostname\"}[1h]))",
+          "legendFormat": "{{hostname}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 20,
+            "showPoints": "never",
+            "stacking": {"mode": "normal"}
+          }
+        }
+      },
+      "options": {
+        "legend": {"displayMode": "list", "placement": "bottom"},
+        "tooltip": {"mode": "multi", "sort": "desc"}
+      },
+      "description": "Service restarts per host over the preceding hour (stacked)."
+    }
+  ]
+}
--- a/services/grafana/dashboards/temperature.json
+++ b/services/grafana/dashboards/temperature.json
@@ -0,0 +1,399 @@
+{
+  "uid": "temperature-homelab",
+  "title": "Temperature - Homelab",
+  "tags": ["home-assistant", "temperature", "homelab"],
+  "timezone": "browser",
+  "schemaVersion": 39,
+  "version": 1,
+  "refresh": "1m",
+  "time": {
+    "from": "now-30d",
+    "to": "now"
+  },
+  "templating": {
+    "list": []
+  },
+  "panels": [
+    {
+      "id": 1,
+      "title": "Current Temperatures",
+      "type": "stat",
+      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
+          "legendFormat": "{{friendly_name}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "celsius",
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "blue", "value": null},
+              {"color": "green", "value": 18},
+              {"color": "yellow", "value": 24},
+              {"color": "orange", "value": 27},
+              {"color": "red", "value": 30}
+            ]
+          },
+          "mappings": []
+        },
+        "overrides": []
+      },
+      "options": {
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "orientation": "auto",
+        "textMode": "auto",
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto"
+      },
+      "transformations": [
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "Temp (.*) Temperature",
+            "renamePattern": "$1"
+          }
+        }
+      ]
+    },
+    {
+      "id": 2,
+      "title": "Average Home Temperature",
+      "type": "gauge",
+      "gridPos": {"h": 6, "w": 6, "x": 12, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "avg(hass_sensor_temperature_celsius{entity!~\".*device_temperature|.*server.*\"})",
+          "legendFormat": "Average",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "celsius",
+          "min": 15,
+          "max": 30,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "blue", "value": null},
+              {"color": "green", "value": 18},
+              {"color": "yellow", "value": 24},
+              {"color": "red", "value": 28}
+            ]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {
+          "calcs": ["lastNotNull"]
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true
+      }
+    },
+    {
+      "id": 3,
+      "title": "Current Humidity",
+      "type": "stat",
+      "gridPos": {"h": 6, "w": 6, "x": 18, "y": 0},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "hass_sensor_humidity_percent{entity!~\".*server.*\"}",
+          "legendFormat": "{{friendly_name}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "red", "value": null},
+              {"color": "yellow", "value": 30},
+              {"color": "green", "value": 40},
+              {"color": "yellow", "value": 60},
+              {"color": "red", "value": 70}
+            ]
+          }
+        }
+      },
+      "options": {
+        "reduceOptions": {
+          "calcs": ["lastNotNull"]
+        },
+        "orientation": "horizontal",
+        "colorMode": "value",
+        "graphMode": "none"
+      },
+      "transformations": [
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "Temp (.*) Humidity",
+            "renamePattern": "$1"
+          }
+        }
+      ]
+    },
+    {
+      "id": 4,
+      "title": "Temperature History (30 Days)",
+      "type": "timeseries",
+      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 6},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}",
+          "legendFormat": "{{friendly_name}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "celsius",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 10,
+            "pointSize": 5,
+            "showPoints": "never",
+            "spanNulls": 3600000
+          }
+        }
+      },
+      "options": {
+        "legend": {
+          "displayMode": "list",
+          "placement": "bottom",
+          "calcs": ["mean", "min", "max"]
+        },
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "transformations": [
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "Temp (.*) Temperature",
+            "renamePattern": "$1"
+          }
+        },
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "temp_server Temperature",
+            "renamePattern": "Server"
+          }
+        }
+      ]
+    },
+    {
+      "id": 5,
+      "title": "Temperature Trend (1h rate of change)",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "deriv(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[1h]) * 3600",
+          "legendFormat": "{{friendly_name}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "celsius",
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 20,
+            "showPoints": "never",
+            "spanNulls": 3600000
+          },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"color": "blue", "value": null},
+              {"color": "green", "value": -0.5},
+              {"color": "green", "value": 0.5},
+              {"color": "red", "value": 1}
+            ]
+          },
+          "displayName": "${__field.labels.friendly_name}"
+        }
+      },
+      "options": {
+        "legend": {
+          "displayMode": "list",
+          "placement": "bottom"
+        },
+        "tooltip": {
+          "mode": "multi"
+        }
+      },
+      "transformations": [
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "Temp (.*) Temperature",
+            "renamePattern": "$1"
+          }
+        },
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "temp_server Temperature",
+            "renamePattern": "Server"
+          }
+        }
+      ],
+      "description": "Rate of temperature change per hour. Positive = warming, Negative = cooling."
+    },
+    {
+      "id": 6,
+      "title": "24h Min / Max / Avg",
+      "type": "table",
+      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "min_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
+          "format": "table",
+          "refId": "min",
+          "instant": true
+        },
+        {
+          "expr": "max_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
+          "format": "table",
+          "refId": "max",
+          "instant": true
+        },
+        {
+          "expr": "avg_over_time(hass_sensor_temperature_celsius{entity!~\".*device_temperature\"}[24h])",
+          "format": "table",
+          "refId": "avg",
+          "instant": true
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "celsius",
+          "decimals": 1
+        },
+        "overrides": [
+          {
+            "matcher": {"id": "byName", "options": "Room"},
+            "properties": [{"id": "custom.width", "value": 150}]
+          }
+        ]
+      },
+      "options": {
+        "showHeader": true,
+        "sortBy": [{"displayName": "Room", "desc": false}]
+      },
+      "transformations": [
+        {
+          "id": "joinByField",
+          "options": {
+            "byField": "friendly_name",
+            "mode": "outer"
+          }
+        },
+        {
+          "id": "organize",
+          "options": {
+            "excludeByName": {
+              "Time": true,
+              "domain": true,
+              "entity": true,
+              "hostname": true,
+              "instance": true,
+              "job": true
+            },
+            "renameByName": {
+              "friendly_name": "Room",
+              "Value #min": "Min (24h)",
+              "Value #max": "Max (24h)",
+              "Value #avg": "Avg (24h)"
+            }
+          }
+        },
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "Temp (.*) Temperature",
+            "renamePattern": "$1"
+          }
+        }
+      ]
+    },
+    {
+      "id": 7,
+      "title": "Humidity History (30 Days)",
+      "type": "timeseries",
+      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
+      "datasource": {"type": "prometheus", "uid": "prometheus"},
+      "targets": [
+        {
+          "expr": "hass_sensor_humidity_percent",
+          "legendFormat": "{{friendly_name}}",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "custom": {
+            "lineWidth": 1,
+            "fillOpacity": 10,
+            "showPoints": "never",
+            "spanNulls": 3600000
+          }
+        }
+      },
+      "options": {
+        "legend": {
+          "displayMode": "list",
+          "placement": "bottom",
+          "calcs": ["mean", "min", "max"]
+        },
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "transformations": [
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "Temp (.*) Humidity",
+            "renamePattern": "$1"
+          }
+        },
+        {
+          "id": "renameByRegex",
+          "options": {
+            "regex": "temp_server Humidity",
+            "renamePattern": "Server"
+          }
+        }
+      ]
+    }
+  ]
+}
--- a/services/grafana/default.nix
+++ b/services/grafana/default.nix
@@ -0,0 +1,111 @@
+{ config, pkgs, ... }:
+{
+  services.grafana = {
+    enable = true;
+    settings = {
+      server = {
+        http_addr = "127.0.0.1";
+        http_port = 3000;
+        domain = "grafana-test.home.2rjus.net";
+        root_url = "https://grafana-test.home.2rjus.net/";
+      };
+
+      # Disable anonymous access
+      "auth.anonymous".enabled = false;
+
+      # OIDC authentication via Kanidm
+      "auth.generic_oauth" = {
+        enabled = true;
+        name = "Kanidm";
+        client_id = "grafana";
+        client_secret = "$__file{/run/secrets/grafana-oauth2}";
+        auth_url = "https://auth.home.2rjus.net/ui/oauth2";
+        token_url = "https://auth.home.2rjus.net/oauth2/token";
+        api_url = "https://auth.home.2rjus.net/oauth2/openid/grafana/userinfo";
+        scopes = "openid profile email groups";
+        use_pkce = true;  # Required by Kanidm, more secure
+        # Extract user attributes from userinfo response
+        email_attribute_path = "email";
+        login_attribute_path = "preferred_username";
+        name_attribute_path = "name";
+        # Map admins group to Admin role, everyone else to Editor (for Explore access)
+        role_attribute_path = "contains(groups[*], 'admins') && 'Admin' || 'Editor'";
+        allow_sign_up = true;
+      };
+    };
+
+    # Declarative datasources pointing to monitoring01
+    provision.datasources.settings = {
+      apiVersion = 1;
+      datasources = [
+        {
+          name = "Prometheus";
+          type = "prometheus";
+          url = "http://monitoring01.home.2rjus.net:9090";
+          isDefault = true;
+          uid = "prometheus";
+        }
+        {
+          name = "Loki";
+          type = "loki";
+          url = "http://monitoring01.home.2rjus.net:3100";
+          uid = "loki";
+        }
+      ];
+    };
+
+    # Declarative dashboards
+    provision.dashboards.settings = {
+      apiVersion = 1;
+      providers = [
+        {
+          name = "homelab";
+          type = "file";
+          options.path = ./dashboards;
+          disableDeletion = true;
+        }
+      ];
+    };
+  };
+
+  # Vault secret for OAuth2 client secret
+  vault.secrets.grafana-oauth2 = {
+    secretPath = "services/grafana/oauth2-client-secret";
+    extractKey = "password";
+    services = [ "grafana" ];
+    owner = "grafana";
+    group = "grafana";
+  };
+
+  # Local Caddy for TLS termination
+  services.caddy = {
+    enable = true;
+    package = pkgs.unstable.caddy;
+    configFile = pkgs.writeText "Caddyfile" ''
+      {
+        acme_ca https://vault.home.2rjus.net:8200/v1/pki_int/acme/directory
+        metrics
+      }
+
+      grafana-test.home.2rjus.net {
+        log {
+          output file /var/log/caddy/grafana.log {
+            mode 644
+          }
+        }
+
+        reverse_proxy http://127.0.0.1:3000
+      }
+
+      http://${config.networking.hostName}.home.2rjus.net/metrics {
+        metrics
+      }
+    '';
+  };
+
+  # Expose Caddy metrics for Prometheus
+  homelab.monitoring.scrapeTargets = [{
+    job_name = "caddy";
+    port = 80;
+  }];
+}
--- a/services/home-assistant/default.nix
+++ b/services/home-assistant/default.nix
@@ -78,15 +78,15 @@
        # Override battery calculation using voltage (mV): (voltage - 2100) / 9
        "0x54ef441000a547bd" = {
          friendly_name = "0x54ef441000a547bd";
-          homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
+          homeassistant.battery.value_template = "{{ [[(((value_json.voltage | float) - 2100) / 9) | round(0) | int, 100] | min, 0] | max }}";
        };
        "0x54ef441000a54d3c" = {
          friendly_name = "0x54ef441000a54d3c";
-          homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
+          homeassistant.battery.value_template = "{{ [[(((value_json.voltage | float) - 2100) / 9) | round(0) | int, 100] | min, 0] | max }}";
        };
        "0x54ef441000a564b6" = {
          friendly_name = "temp_server";
-          homeassistant.battery.value_template = "{{ (((value_json.voltage | float) - 2100) / 9) | round(0) | int | min(100) | max(0) }}";
+          homeassistant.battery.value_template = "{{ [[(((value_json.voltage | float) - 2100) / 9) | round(0) | int, 100] | min, 0] | max }}";
        };

        # Other sensors
--- a/services/kanidm/default.nix
+++ b/services/kanidm/default.nix
@@ -24,12 +24,37 @@
      idmAdminPasswordFile = config.vault.secrets.kanidm-idm-admin.outputDir;

      groups = {
-        admins = { };
-        users = { };
-        ssh-users = { };
+        # overwriteMembers = false allows imperative member management via CLI
+        admins = { overwriteMembers = false; };
+        users = { overwriteMembers = false; };
+        ssh-users = { overwriteMembers = false; };
      };

      # Regular users (persons) are managed imperatively via kanidm CLI
+
+      # OAuth2/OIDC clients for service authentication
+      systems.oauth2.grafana = {
+        displayName = "Grafana";
+        originUrl = "https://grafana-test.home.2rjus.net/login/generic_oauth";
+        originLanding = "https://grafana-test.home.2rjus.net/";
+        basicSecretFile = config.vault.secrets.grafana-oauth2.outputDir;
+        preferShortUsername = true;
+        scopeMaps.users = [ "openid" "profile" "email" "groups" ];
+      };
+
+      systems.oauth2.openbao = {
+        displayName = "OpenBao Secrets";
+        # Web UI callback only (CLI localhost not supported with confidential clients)
+        originUrl = "https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback";
+        originLanding = "https://vault.home.2rjus.net:8200/";
+        basicSecretFile = config.vault.secrets.openbao-oauth2.outputDir;
+        preferShortUsername = true;
+        # Enable RS256 signing algorithm (required by OpenBao)
+        enableLegacyCrypto = true;
+        # Allow groups scope for role binding
+        scopeMaps.admins = [ "openid" "profile" "email" "groups" ];
+        scopeMaps.users = [ "openid" "profile" "email" "groups" ];
+      };
    };
  };

@@ -53,6 +78,24 @@
    group = "kanidm";
  };

+  # Vault secret for Grafana OAuth2 client secret
+  vault.secrets.grafana-oauth2 = {
+    secretPath = "services/grafana/oauth2-client-secret";
+    extractKey = "password";
+    services = [ "kanidm" ];
+    owner = "kanidm";
+    group = "kanidm";
+  };
+
+  # Vault secret for OpenBao OAuth2 client secret
+  vault.secrets.openbao-oauth2 = {
+    secretPath = "services/openbao/oauth2-client-secret";
+    extractKey = "password";
+    services = [ "kanidm" ];
+    owner = "kanidm";
+    group = "kanidm";
+  };
+
  # Note: Kanidm does not expose Prometheus metrics
  # If metrics support is added in the future, uncomment:
  # homelab.monitoring.scrapeTargets = [
--- a/services/monitoring/blackbox.nix
+++ b/services/monitoring/blackbox.nix
@@ -0,0 +1,92 @@
+{ pkgs, ... }:
+let
+  # TLS endpoints to monitor for certificate expiration
+  # These are all services using ACME certificates from OpenBao PKI
+  tlsTargets = [
+    # Direct ACME certs (security.acme.certs)
+    "https://vault.home.2rjus.net:8200"
+    "https://auth.home.2rjus.net"
+    "https://testvm01.home.2rjus.net"
+
+    # Caddy auto-TLS on http-proxy
+    "https://nzbget.home.2rjus.net"
+    "https://radarr.home.2rjus.net"
+    "https://sonarr.home.2rjus.net"
+    "https://ha.home.2rjus.net"
+    "https://z2m.home.2rjus.net"
+    "https://prometheus.home.2rjus.net"
+    "https://alertmanager.home.2rjus.net"
+    "https://grafana.home.2rjus.net"
+    "https://jelly.home.2rjus.net"
+    "https://pyroscope.home.2rjus.net"
+    "https://pushgw.home.2rjus.net"
+
+    # Caddy auto-TLS on nix-cache01
+    "https://nix-cache.home.2rjus.net"
+
+    # Caddy auto-TLS on grafana01
+    "https://grafana-test.home.2rjus.net"
+  ];
+in
+{
+  services.prometheus.exporters.blackbox = {
+    enable = true;
+    configFile = pkgs.writeText "blackbox.yml" ''
+      modules:
+        https_cert:
+          prober: http
+          timeout: 10s
+          http:
+            fail_if_not_ssl: true
+            preferred_ip_protocol: ip4
+            valid_status_codes:
+              - 200
+              - 204
+              - 301
+              - 302
+              - 303
+              - 307
+              - 308
+              - 400
+              - 401
+              - 403
+              - 404
+              - 405
+              - 500
+              - 502
+              - 503
+    '';
+  };
+
+  # Add blackbox scrape config to Prometheus
+  # Alert rules are in rules.yml (certificate_rules group)
+  services.prometheus.scrapeConfigs = [
+    {
+      job_name = "blackbox_tls";
+      metrics_path = "/probe";
+      params = {
+        module = [ "https_cert" ];
+      };
+      static_configs = [{
+        targets = tlsTargets;
+      }];
+      relabel_configs = [
+        # Pass the target URL to blackbox as a parameter
+        {
+          source_labels = [ "__address__" ];
+          target_label = "__param_target";
+        }
+        # Use the target URL as the instance label
+        {
+          source_labels = [ "__param_target" ];
+          target_label = "instance";
+        }
+        # Point the actual scrape at the local blackbox exporter
+        {
+          target_label = "__address__";
+          replacement = "127.0.0.1:9115";
+        }
+      ];
+    }
+  ];
+}
--- a/services/monitoring/default.nix
+++ b/services/monitoring/default.nix
@@ -4,6 +4,8 @@
    ./loki.nix
    ./grafana.nix
    ./prometheus.nix
+    ./blackbox.nix
+    ./exportarr.nix
    ./pve.nix
    ./alerttonotify.nix
    ./pyroscope.nix
--- a/services/monitoring/exportarr.nix
+++ b/services/monitoring/exportarr.nix
@@ -0,0 +1,27 @@
+{ config, ... }:
+{
+  # Vault secret for API key
+  vault.secrets.sonarr-api-key = {
+    secretPath = "services/exportarr/sonarr";
+    extractKey = "api_key";
+    services = [ "prometheus-exportarr-sonarr-exporter" ];
+  };
+
+  # Sonarr exporter
+  services.prometheus.exporters.exportarr-sonarr = {
+    enable = true;
+    url = "http://sonarr-jail.home.2rjus.net:8989";
+    apiKeyFile = config.vault.secrets.sonarr-api-key.outputDir;
+    port = 9709;
+  };
+
+  # Scrape config
+  services.prometheus.scrapeConfigs = [
+    {
+      job_name = "sonarr";
+      static_configs = [{
+        targets = [ "localhost:9709" ];
+      }];
+    }
+  ];
+}
--- a/services/monitoring/rules.yml
+++ b/services/monitoring/rules.yml
@@ -229,13 +229,13 @@ groups:
          summary: "Mosquitto not running on {{ $labels.instance }}"
          description: "Mosquitto has been down on {{ $labels.instance }} more than 5 minutes."
      - alert: zigbee_sensor_stale
-        expr: (time() - hass_last_updated_time_seconds{entity=~"sensor\\.(0x[0-9a-f]+|temp_server)_temperature"}) > 7200
+        expr: (time() - hass_last_updated_time_seconds{entity=~"sensor\\.(0x[0-9a-f]+|temp_server)_temperature"}) > 14400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Zigbee sensor {{ $labels.friendly_name }} is stale"
-          description: "Zigbee temperature sensor {{ $labels.entity }} has not reported data for over 2 hours. The sensor may have a dead battery or connectivity issues."
+          description: "Zigbee temperature sensor {{ $labels.entity }} has not reported data for over 4 hours. The sensor may have a dead battery or connectivity issues."
  - name: smartctl_rules
    rules:
      - alert: smart_critical_warning
@@ -392,3 +392,29 @@ groups:
        annotations:
          summary: "Cannot scrape OpenBao metrics from {{ $labels.instance }}"
          description: "OpenBao metrics endpoint is not responding on {{ $labels.instance }}."
+  - name: certificate_rules
+    rules:
+      - alert: tls_certificate_expiring_soon
+        expr: (probe_ssl_earliest_cert_expiry - time()) < 86400 * 7
+        for: 1h
+        labels:
+          severity: warning
+        annotations:
+          summary: "TLS certificate expiring soon on {{ $labels.instance }}"
+          description: "The TLS certificate for {{ $labels.instance }} expires in less than 7 days."
+      - alert: tls_certificate_expiring_critical
+        expr: (probe_ssl_earliest_cert_expiry - time()) < 86400
+        for: 0m
+        labels:
+          severity: critical
+        annotations:
+          summary: "TLS certificate expiring within 24h on {{ $labels.instance }}"
+          description: "The TLS certificate for {{ $labels.instance }} expires in less than 24 hours. Immediate action required."
+      - alert: tls_probe_failed
+        expr: probe_success{job="blackbox_tls"} == 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "TLS probe failed for {{ $labels.instance }}"
+          description: "Cannot connect to {{ $labels.instance }} to check TLS certificate. The service may be down or unreachable."
--- a/services/nats/default.nix
+++ b/services/nats/default.nix
@@ -35,9 +35,18 @@
        HOMELAB = {
          jetstream = "enabled";
          users = [
+            # alerttonotify (full access to HOMELAB account)
            {
              nkey = "UASLNKLWGICRTZMIXVD3RXLQ57XRIMCKBHP5V3PYFFRNO3E3BIJBCYMZ";
            }
+            # nixos-exporter (restricted to nixos-exporter subjects)
+            {
+              nkey = "UBCL3ODHVERVZJNGUJ567YBBKHQZOV3LK3WO6TVVSGQOCTK2NQ3IJVRV"; # Replace with public key from: nix develop -c nk -gen user -pubout
+              permissions = {
+                publish = [ "nixos-exporter.>" ];
+                subscribe = [ "nixos-exporter.>" ];
+              };
+            }
          ];
        };

--- a/system/default.nix
+++ b/system/default.nix
@@ -9,6 +9,7 @@
    ./motd.nix
    ./packages.nix
    ./nix.nix
+    ./pipe-to-loki.nix
    ./root-user.nix
    ./pki/root-ca.nix
    ./sshd.nix
--- a/system/monitoring/metrics.nix
+++ b/system/monitoring/metrics.nix
@@ -19,14 +19,33 @@
    ];
  };

+  # Fetch NKey from Vault for NATS authentication
+  vault.secrets.nixos-exporter-nkey = {
+    secretPath = "shared/nixos-exporter/nkey";
+    extractKey = "nkey";
+    owner = "nixos-exporter";
+    group = "nixos-exporter";
+  };
+
  services.prometheus.exporters.nixos = {
    enable = true;
    # Default port: 9971
    flake = {
      enable = true;
      url = "git+https://git.t-juice.club/torjus/nixos-servers.git";
+      nats = {
+        enable = true;
+        url = "nats://nats1.home.2rjus.net:4222";
+        nkeySeedFile = "/run/secrets/nixos-exporter-nkey";
      };
    };
+  };
+
+  # Ensure exporter starts after Vault secret is available
+  systemd.services.prometheus-nixos-exporter = {
+    after = [ "vault-secret-nixos-exporter-nkey.service" ];
+    requires = [ "vault-secret-nixos-exporter-nkey.service" ];
+  };

  # Register nixos-exporter as a Prometheus scrape target
  homelab.monitoring.scrapeTargets = [
--- a/system/pipe-to-loki.nix
+++ b/system/pipe-to-loki.nix
@@ -0,0 +1,140 @@
+{
+  config,
+  pkgs,
+  lib,
+  ...
+}:
+let
+  pipe-to-loki = pkgs.writeShellApplication {
+    name = "pipe-to-loki";
+    runtimeInputs = with pkgs; [
+      curl
+      jq
+      util-linux
+      coreutils
+    ];
+    text = ''
+      set -euo pipefail
+
+      LOKI_URL="http://monitoring01.home.2rjus.net:3100/loki/api/v1/push"
+      HOSTNAME=$(hostname)
+      SESSION_ID=""
+      RECORD_MODE=false
+
+      usage() {
+        echo "Usage: pipe-to-loki [--id ID] [--record]"
+        echo ""
+        echo "Send command output or interactive sessions to Loki."
+        echo ""
+        echo "Options:"
+        echo "  --id ID      Set custom session ID (default: auto-generated)"
+        echo "  --record     Start interactive recording session"
+        echo ""
+        echo "Examples:"
+        echo "  command | pipe-to-loki           # Pipe command output"
+        echo "  command | pipe-to-loki --id foo  # Pipe with custom ID"
+        echo "  pipe-to-loki --record            # Start recording session"
+        exit 1
+      }
+
+      generate_id() {
+        local random_chars
+        random_chars=$(head -c 2 /dev/urandom | od -An -tx1 | tr -d ' \n')
+        echo "''${HOSTNAME}-$(date +%s)-''${random_chars}"
+      }
+
+      send_to_loki() {
+        local content="$1"
+        local type="$2"
+        local timestamp_ns
+        timestamp_ns=$(date +%s%N)
+
+        local payload
+        payload=$(jq -n \
+          --arg job "pipe-to-loki" \
+          --arg host "$HOSTNAME" \
+          --arg type "$type" \
+          --arg id "$SESSION_ID" \
+          --arg ts "$timestamp_ns" \
+          --arg content "$content" \
+          '{
+            streams: [{
+              stream: {
+                job: $job,
+                host: $host,
+                type: $type,
+                id: $id
+              },
+              values: [[$ts, $content]]
+            }]
+          }')
+
+        if curl -fsS -X POST "$LOKI_URL" \
+          -H "Content-Type: application/json" \
+          -d "$payload" > /dev/null; then
+          return 0
+        else
+          echo "Error: Failed to send to Loki" >&2
+          return 1
+        fi
+      }
+
+      # Parse arguments
+      while [[ $# -gt 0 ]]; do
+        case $1 in
+          --id)
+            SESSION_ID="''${2:?--id requires a value}"
+            shift 2
+            ;;
+          --record)
+            RECORD_MODE=true
+            shift
+            ;;
+          --help|-h)
+            usage
+            ;;
+          *)
+            echo "Unknown option: $1" >&2
+            usage
+            ;;
+        esac
+      done
+
+      # Generate ID if not provided
+      if [[ -z "$SESSION_ID" ]]; then
+        SESSION_ID=$(generate_id)
+      fi
+
+      if $RECORD_MODE; then
+        # Session recording mode
+        SCRIPT_FILE=$(mktemp)
+        trap 'rm -f "$SCRIPT_FILE"' EXIT
+
+        echo "Recording session $SESSION_ID... (exit to send)"
+
+        # Use script to record the session
+        script -q "$SCRIPT_FILE"
+
+        # Read the transcript and send to Loki
+        content=$(cat "$SCRIPT_FILE")
+        # errexit stops the script with a non-zero status if the push fails
+        send_to_loki "$content" "session"
+        echo "Session $SESSION_ID sent to Loki"
+      else
+        # Pipe mode - read from stdin
+        if [[ -t 0 ]]; then
+          echo "Error: No input provided. Pipe a command or use --record for interactive mode." >&2
+          exit 1
+        fi
+
+        content=$(cat)
+        # errexit stops the script with a non-zero status if the push fails
+        send_to_loki "$content" "command"
+        echo "Sent to Loki with id: $SESSION_ID"
+      fi
+    '';
+  };
+in
+{
+  environment.systemPackages = [ pipe-to-loki ];
+}
--- a/terraform/variables.tf
+++ b/terraform/variables.tf
@@ -33,7 +33,7 @@ variable "default_target_node" {
 variable "default_template_name" {
  description = "Default template VM name to clone from"
  type        = string
-  default     = "nixos-25.11.20260203.e576e3c"
+  default     = "nixos-25.11.20260207.23d72da"
 }

 variable "default_ssh_public_key" {
--- a/terraform/vault/approle.tf
+++ b/terraform/vault/approle.tf
@@ -15,6 +15,17 @@ path "secret/data/shared/homelab-deploy/*" {
 EOT
 }

+# Shared policy for nixos-exporter NATS cache sharing
+resource "vault_policy" "nixos_exporter" {
+  name = "nixos-exporter"
+
+  policy = <<EOT
+path "secret/data/shared/nixos-exporter/*" {
+  capabilities = ["read", "list"]
+}
+EOT
+}
+
 # Define host access policies
 locals {
  host_policies = {
@@ -49,6 +60,7 @@ locals {
        "secret/data/hosts/monitoring01/*",
        "secret/data/shared/backup/*",
        "secret/data/shared/nats/*",
+        "secret/data/services/exportarr/*",
      ]
      extra_policies = ["prometheus-metrics"]
    }
@@ -89,6 +101,24 @@ locals {
      ]
    }

+    # kanidm01: Kanidm identity provider
+    "kanidm01" = {
+      paths = [
+        "secret/data/hosts/kanidm01/*",
+        "secret/data/kanidm/*",
+        "secret/data/services/grafana/*",
+        "secret/data/services/openbao/*",
+      ]
+    }
+
+    # monitoring02: Grafana test instance
+    "monitoring02" = {
+      paths = [
+        "secret/data/hosts/monitoring02/*",
+        "secret/data/services/grafana/*",
+      ]
+    }
+
  }
 }

@@ -114,7 +144,7 @@ resource "vault_approle_auth_backend_role" "hosts" {
  backend   = vault_auth_backend.approle.path
  role_name = each.key
  token_policies = concat(
-    ["${each.key}-policy", "homelab-deploy"],
+    ["${each.key}-policy", "homelab-deploy", "nixos-exporter"],
    lookup(each.value, "extra_policies", [])
  )

--- a/terraform/vault/hosts-generated.tf
+++ b/terraform/vault/hosts-generated.tf
@@ -33,10 +33,9 @@ locals {
        "secret/data/shared/homelab-deploy/*",
      ]
    }
-    "kanidm01" = {
+    "nix-cache02" = {
      paths = [
-        "secret/data/hosts/kanidm01/*",
-        "secret/data/kanidm/*",
+        "secret/data/hosts/nix-cache02/*",
      ]
    }
  
@@ -69,7 +68,7 @@ resource "vault_approle_auth_backend_role" "generated_hosts" {

  backend            = vault_auth_backend.approle.path
  role_name          = each.key
-  token_policies     = ["host-${each.key}", "homelab-deploy"]
+  token_policies     = ["host-${each.key}", "homelab-deploy", "nixos-exporter"]
  secret_id_ttl      = 0 # Never expire (wrapped tokens provide time limit)
  token_ttl          = 3600
  token_max_ttl      = 3600
--- a/terraform/vault/oidc.tf
+++ b/terraform/vault/oidc.tf
@@ -0,0 +1,50 @@
+# OIDC authentication backend for Kanidm integration
+# Web UI only - CLI localhost redirects not supported with confidential clients
+resource "vault_jwt_auth_backend" "oidc" {
+  path               = "oidc"
+  type               = "oidc"
+  oidc_discovery_url = "https://auth.home.2rjus.net/oauth2/openid/openbao"
+  oidc_client_id     = "openbao"
+  oidc_client_secret = random_password.auto_secrets["services/openbao/oauth2-client-secret"].result
+  default_role       = "default"
+
+  tune {
+    listing_visibility = "unauth"
+    default_lease_ttl  = "1h"
+    max_lease_ttl      = "24h"
+    token_type         = "default-service"
+  }
+}
+
+# Admin role - maps Kanidm admins group to admin policy
+resource "vault_jwt_auth_backend_role" "admin" {
+  backend        = vault_jwt_auth_backend.oidc.path
+  role_name      = "admin"
+  token_policies = ["oidc-admin"]
+
+  user_claim   = "preferred_username"
+  groups_claim = "groups"
+  bound_claims = { groups = "admins@home.2rjus.net" }
+  role_type    = "oidc"
+  oidc_scopes  = ["openid", "profile", "email", "groups"]
+
+  allowed_redirect_uris = [
+    "https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback",
+  ]
+}
+
+# Default role - any authenticated user (limited access)
+resource "vault_jwt_auth_backend_role" "default" {
+  backend        = vault_jwt_auth_backend.oidc.path
+  role_name      = "default"
+  token_policies = ["oidc-default"]
+
+  user_claim   = "preferred_username"
+  groups_claim = "groups"
+  role_type    = "oidc"
+  oidc_scopes  = ["openid", "profile", "email", "groups"]
+
+  allowed_redirect_uris = [
+    "https://vault.home.2rjus.net:8200/ui/vault/auth/oidc/oidc/callback",
+  ]
+}
--- a/terraform/vault/policies.tf
+++ b/terraform/vault/policies.tf
@@ -8,3 +8,50 @@ path "sys/metrics" {
 }
 EOT
 }
+
+# OIDC admin policy - full read/write to all secrets
+resource "vault_policy" "oidc_admin" {
+  name = "oidc-admin"
+
+  policy = <<EOT
+# Full access to KV secrets
+path "secret/*" {
+  capabilities = ["create", "read", "update", "delete", "list"]
+}
+
+# Read system health and metrics
+path "sys/health" {
+  capabilities = ["read"]
+}
+
+path "sys/metrics" {
+  capabilities = ["read"]
+}
+
+# List auth methods and mounts
+path "sys/auth" {
+  capabilities = ["read"]
+}
+
+path "sys/mounts" {
+  capabilities = ["read"]
+}
+EOT
+}
+
+# OIDC default policy - minimal access for authenticated users
+resource "vault_policy" "oidc_default" {
+  name = "oidc-default"
+
+  policy = <<EOT
+# Read own token info
+path "auth/token/lookup-self" {
+  capabilities = ["read"]
+}
+
+# Read system health
+path "sys/health" {
+  capabilities = ["read"]
+}
+EOT
+}
--- a/terraform/vault/secrets.tf
+++ b/terraform/vault/secrets.tf
@@ -108,6 +108,35 @@ locals {
      auto_generate   = true
      password_length = 32
    }
+
+    # Grafana OAuth2 client secret (for Kanidm OIDC)
+    "services/grafana/oauth2-client-secret" = {
+      auto_generate   = true
+      password_length = 64
+    }
+
+    # OpenBao OAuth2 client secret (for Kanidm OIDC)
+    "services/openbao/oauth2-client-secret" = {
+      auto_generate   = true
+      password_length = 64
+    }
+
+    # NKey for nixos-exporter NATS cache sharing
+    "shared/nixos-exporter/nkey" = {
+      auto_generate = false
+      data          = { nkey = var.nixos_exporter_nkey }
+    }
+
+    # Exportarr API keys for media stack monitoring
+    "services/exportarr/radarr" = {
+      auto_generate = false
+      data          = { api_key = var.radarr_api_key }
+    }
+
+    "services/exportarr/sonarr" = {
+      auto_generate = false
+      data          = { api_key = var.sonarr_api_key }
+    }
  }
 }

--- a/terraform/vault/variables.tf
+++ b/terraform/vault/variables.tf
@@ -73,3 +73,24 @@ variable "homelab_deploy_admin_deployer_nkey" {
  sensitive   = true
 }

+variable "nixos_exporter_nkey" {
+  description = "NKey seed for nixos-exporter NATS authentication"
+  type        = string
+  default     = "PLACEHOLDER"
+  sensitive   = true
+}
+
+variable "radarr_api_key" {
+  description = "Radarr API key for exportarr metrics"
+  type        = string
+  default     = "PLACEHOLDER"
+  sensitive   = true
+}
+
+variable "sonarr_api_key" {
+  description = "Sonarr API key for exportarr metrics"
+  type        = string
+  default     = "PLACEHOLDER"
+  sensitive   = true
+}
+
--- a/terraform/vms.tf
+++ b/terraform/vms.tf
@@ -79,6 +79,20 @@ locals {
      disk_size = "20G"
      vault_wrapped_token = "s.OOqjEECeIV7dNgCS6jNmyY3K"
    }
+    "monitoring02" = {
+      ip        = "10.69.13.24/24"
+      cpu_cores = 4
+      memory    = 4096
+      disk_size = "60G"
+      vault_wrapped_token = "s.uXpdoGxHXpWvTsGbHkZuq1jF"
+    }
+    "nix-cache02" = {
+      ip        = "10.69.13.25/24"
+      cpu_cores = 8
+      memory    = 16384
+      disk_size = "200G"
+      vault_wrapped_token = "s.C5EuHFyULACEqZgsLqMC3cJB"
+    }
  }

  # Compute VM configurations with defaults applied