docs: add plans for nixos and homelab prometheus exporters
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
179
docs/plans/homelab-exporter.md
Normal file
@@ -0,0 +1,179 @@
# Homelab Infrastructure Exporter

## Overview

Build a Prometheus exporter for metrics specific to our homelab infrastructure. Unlike the generic nixos-exporter, this covers services and patterns unique to our environment.

## Current State

### Existing Exporters
- **node-exporter** (all hosts): System metrics
- **systemd-exporter** (all hosts): Service restart counts, IP accounting
- **labmon** (monitoring01): TLS certificate monitoring, step-ca health
- **Service-specific**: unbound, postgres, nats, jellyfin, home-assistant, caddy, step-ca

### Gaps
- No visibility into Vault/OpenBao lease expiry
- No ACME certificate expiry from internal CA
- No Proxmox guest agent metrics from inside VMs

## Metrics

### Vault/OpenBao Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_vault_token_expiry_seconds` | Seconds until AppRole token expires | Token metadata or lease file |
| `homelab_vault_token_renewable` | 1 if token is renewable | Token metadata |

Labels: `role` (AppRole name)

### ACME Certificate Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_acme_cert_expiry_seconds` | Seconds until certificate expires | Parse cert from `/var/lib/acme/*/cert.pem` |
| `homelab_acme_cert_not_after` | Unix timestamp of cert expiry | Certificate NotAfter field |

Labels: `domain`, `issuer`

Note: labmon already monitors external TLS endpoints. This covers local ACME-managed certs.
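As a rough illustration of the ACME metrics above, the following Go sketch reads one PEM file and derives both values from the certificate's NotAfter field. The function names `certNotAfter` and `secondsUntil` are hypothetical and not part of any existing code; only the first PEM block is inspected, on the assumption that `cert.pem` starts with the leaf certificate.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
	"time"
)

// certNotAfter decodes the first PEM block in data and returns the
// NotAfter timestamp of the certificate it contains.
func certNotAfter(data []byte) (time.Time, error) {
	block, _ := pem.Decode(data)
	if block == nil || block.Type != "CERTIFICATE" {
		return time.Time{}, errors.New("no PEM certificate block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return time.Time{}, err
	}
	return cert.NotAfter, nil
}

// secondsUntil returns the remaining lifetime in seconds.
// The result is negative once the certificate has expired.
func secondsUntil(notAfterUnix, nowUnix int64) int64 {
	return notAfterUnix - nowUnix
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: certexpiry /var/lib/acme/<domain>/cert.pem")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	notAfter, err := certNotAfter(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("homelab_acme_cert_not_after %d\n", notAfter.Unix())
	fmt.Printf("homelab_acme_cert_expiry_seconds %d\n",
		secondsUntil(notAfter.Unix(), time.Now().Unix()))
}
```

A real collector would presumably walk every directory in `cert_dirs` and attach the `domain` and `issuer` labels; that part is left out here.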

### Proxmox Guest Metrics (future)

| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_proxmox_guest_info` | Info gauge with VM ID, name | QEMU guest agent |
| `homelab_proxmox_guest_agent_running` | 1 if guest agent is responsive | Agent ping |

### DNS Zone Metrics (future)

| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_dns_zone_serial` | Current zone serial number | DNS AXFR or zone file |

Labels: `zone`

## Architecture

Single binary with collectors enabled via config. Runs on hosts that need specific collectors.

```
homelab-exporter
├── main.go
├── collector/
│   ├── vault.go          # Vault/OpenBao token metrics
│   ├── acme.go           # ACME certificate metrics
│   └── proxmox.go        # Proxmox guest agent (future)
└── config/
    └── config.go
```
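One way the "collectors enabled via config" idea could look in Go is sketched below. It uses only the standard library to stay self-contained; an actual implementation would more likely build on the Prometheus Go client. The `Collector` interface, `Sample` type, and `staticCollector` stand-in are all hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
)

// Sample is one metric line in the Prometheus text exposition format.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// Collector is a hypothetical interface that each file under
// collector/ would implement.
type Collector interface {
	Name() string
	Collect() ([]Sample, error)
}

// staticCollector is a stand-in that returns fixed samples.
type staticCollector struct {
	name    string
	samples []Sample
}

func (c staticCollector) Name() string               { return c.name }
func (c staticCollector) Collect() ([]Sample, error) { return c.samples, nil }

// formatSample renders one sample with its labels in sorted order.
// %q is used as an approximation of Prometheus label escaping.
func formatSample(s Sample) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.Labels[k]))
	}
	labels := ""
	if len(pairs) > 0 {
		labels = "{" + strings.Join(pairs, ",") + "}"
	}
	return fmt.Sprintf("%s%s %g\n", s.Name, labels, s.Value)
}

// render runs every enabled collector and concatenates the output.
// A failing collector is skipped so it cannot break the whole scrape.
func render(all []Collector, enabled map[string]bool) string {
	var b strings.Builder
	for _, c := range all {
		if !enabled[c.Name()] {
			continue
		}
		samples, err := c.Collect()
		if err != nil {
			continue
		}
		for _, s := range samples {
			b.WriteString(formatSample(s))
		}
	}
	return b.String()
}

// metricsHandler exposes render over HTTP for use at /metrics.
func metricsHandler(all []Collector, enabled map[string]bool) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, render(all, enabled))
	}
}

func main() {
	all := []Collector{
		staticCollector{name: "acme", samples: []Sample{{
			Name:   "homelab_acme_cert_expiry_seconds",
			Labels: map[string]string{"domain": "example.org"},
			Value:  600,
		}}},
		staticCollector{name: "vault"},
	}
	// Printed instead of served so the sketch terminates; a real main
	// would register metricsHandler and call http.ListenAndServe.
	fmt.Print(render(all, map[string]bool{"acme": true, "vault": false}))
}
```

Keeping the enabled set as a plain map mirrors the `collectors.<name>.enabled` keys in the configuration file, so a disabled collector costs nothing at scrape time.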

## Configuration

```yaml
listen_addr: ":9970"
collectors:
  vault:
    enabled: true
    token_path: "/var/lib/vault/token"
  acme:
    enabled: true
    cert_dirs:
      - "/var/lib/acme"
  proxmox:
    enabled: false
```

## NixOS Module

```nix
services.homelab-exporter = {
  enable = true;
  port = 9970;
  collectors = {
    vault = {
      enable = true;
      tokenPath = "/var/lib/vault/token";
    };
    acme = {
      enable = true;
      certDirs = [ "/var/lib/acme" ];
    };
  };
};

# Auto-register scrape target
homelab.monitoring.scrapeTargets = [{
  job_name = "homelab-exporter";
  port = 9970;
}];
```

## Integration

### Deployment

Deploy on hosts that have relevant data:
- **All hosts with ACME certs**: acme collector
- **All hosts with Vault**: vault collector
- **Proxmox VMs**: proxmox collector (when implemented)

### Relationship with nixos-exporter

These are complementary:
- **nixos-exporter** (port 9971): Generic NixOS metrics, deploy everywhere
- **homelab-exporter** (port 9970): Infrastructure-specific, deploy selectively

Both can run on the same host if needed.

## Implementation

### Language

Go - consistent with labmon and nixos-exporter.

### Phase 1: Core + ACME
1. Create git repository (git.t-juice.club/torjus/homelab-exporter)
2. Implement ACME certificate collector
3. HTTP server with `/metrics`
4. NixOS module

### Phase 2: Vault Collector
1. Implement token expiry detection
2. Handle missing/expired tokens gracefully

### Phase 3: Dashboard
1. Create Grafana dashboard for infrastructure health
2. Add to existing monitoring service module

## Alert Examples

```yaml
- alert: VaultTokenExpiringSoon
  expr: homelab_vault_token_expiry_seconds < 3600
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Vault token on {{ $labels.instance }} expires in < 1 hour"

- alert: ACMECertExpiringSoon
  expr: homelab_acme_cert_expiry_seconds < 7 * 24 * 3600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "ACME cert {{ $labels.domain }} on {{ $labels.instance }} expires in < 7 days"
```

## Open Questions

- [ ] How to read Vault token expiry without re-authenticating?
- [ ] Should ACME collector also check key/cert match?
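One candidate answer to the first question is the token self-lookup endpoint (`GET /v1/auth/token/lookup-self` with the token in the `X-Vault-Token` header), which returns the remaining `ttl` and the `renewable` flag without logging in again. The Go sketch below only parses such a response; the HTTP call is omitted, the sample body is made up, and the names `tokenInfo`, `parseLookupSelf`, and `boolGauge` are hypothetical. Whether OpenBao behaves identically is assumed rather than verified here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tokenInfo holds the two fields of the lookup-self response that the
// planned metrics need.
type tokenInfo struct {
	TTL       int64 `json:"ttl"`       // remaining lifetime in seconds
	Renewable bool  `json:"renewable"` // whether the token can be renewed
}

// parseLookupSelf extracts tokenInfo from the JSON body of a
// lookup-self response, where the fields sit under "data".
func parseLookupSelf(body []byte) (tokenInfo, error) {
	var resp struct {
		Data tokenInfo `json:"data"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return tokenInfo{}, err
	}
	return resp.Data, nil
}

// boolGauge converts a boolean into the 0/1 value used for gauges.
func boolGauge(b bool) int {
	if b {
		return 1
	}
	return 0
}

func main() {
	// Illustrative response body, not captured from a real server.
	sample := []byte(`{"data":{"ttl":3600,"renewable":true,"policies":["default"]}}`)
	info, err := parseLookupSelf(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("homelab_vault_token_expiry_seconds %d\n", info.TTL)
	fmt.Printf("homelab_vault_token_renewable %d\n", boolGauge(info.Renewable))
}
```

Tokens that never expire are typically reported with a `ttl` of 0, so the collector would need to treat that value specially to avoid tripping the `VaultTokenExpiringSoon` alert.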

## Notes

- Port 9970 (labmon uses 9969, nixos-exporter will use 9971)
- Keep infrastructure-specific logic here, generic NixOS stuff in nixos-exporter
- Consider merging Proxmox metrics with pve-exporter if overlap is significant
176
docs/plans/nixos-exporter.md
Normal file
@@ -0,0 +1,176 @@
# NixOS Prometheus Exporter

## Overview

Build a generic Prometheus exporter for NixOS-specific metrics. This exporter should be useful for any NixOS deployment, not just our homelab.

## Goal

Provide visibility into NixOS system state that standard exporters don't cover:
- Generation management (count, age, current vs booted)
- Flake input freshness
- Upgrade status

## Metrics

### Core Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_generation_count` | Number of system generations | Count entries in `/nix/var/nix/profiles/system-*` |
| `nixos_current_generation` | Active generation number | Parse `readlink /run/current-system` |
| `nixos_booted_generation` | Generation that was booted | Parse `/run/booted-system` |
| `nixos_generation_age_seconds` | Age of current generation | File mtime of current system profile |
| `nixos_config_mismatch` | 1 if booted != current, 0 otherwise | Compare symlink targets |
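A minimal Go sketch of how the generation numbers and the mismatch flag might be derived follows. It assumes that `/run/current-system` and `/run/booted-system` resolve to store paths that carry no generation number themselves, so the number is recovered by matching the store path against the `system-N-link` entries in the profiles directory. The helper names `parseGeneration`, `generationFor`, and `configMismatch` are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

const profilesDir = "/nix/var/nix/profiles"

// parseGeneration extracts N from a profile link name such as
// "system-47-link". It reports false for any other name.
func parseGeneration(name string) (int, bool) {
	if !strings.HasPrefix(name, "system-") || !strings.HasSuffix(name, "-link") {
		return 0, false
	}
	num := strings.TrimSuffix(strings.TrimPrefix(name, "system-"), "-link")
	n, err := strconv.Atoi(num)
	if err != nil || n < 0 {
		return 0, false
	}
	return n, true
}

// generationFor returns the highest generation whose profile link
// resolves to storePath.
func generationFor(storePath string) (int, bool) {
	links, err := filepath.Glob(filepath.Join(profilesDir, "system-*-link"))
	if err != nil {
		return 0, false
	}
	best, found := 0, false
	for _, link := range links {
		n, ok := parseGeneration(filepath.Base(link))
		if !ok {
			continue
		}
		target, err := filepath.EvalSymlinks(link)
		if err == nil && target == storePath && n >= best {
			best, found = n, true
		}
	}
	return best, found
}

// configMismatch returns 1 when the two resolved store paths differ.
func configMismatch(current, booted string) int {
	if current != booted {
		return 1
	}
	return 0
}

func main() {
	current, errCur := filepath.EvalSymlinks("/run/current-system")
	booted, errBoot := filepath.EvalSymlinks("/run/booted-system")
	if errCur != nil || errBoot != nil {
		fmt.Println("not a NixOS host, nothing to report")
		return
	}
	if n, ok := generationFor(current); ok {
		fmt.Printf("nixos_current_generation %d\n", n)
	}
	if n, ok := generationFor(booted); ok {
		fmt.Printf("nixos_booted_generation %d\n", n)
	}
	fmt.Printf("nixos_config_mismatch %d\n", configMismatch(current, booted))
}
```

Everything here only reads symlinks, which suggests the generation collector would not need root, though that remains to be confirmed on a real host.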

### Flake Metrics (optional collector)

| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_flake_input_age_seconds` | Age of each flake.lock input | Parse `lastModified` from flake.lock |
| `nixos_flake_input_info` | Info gauge with rev label | Parse `rev` from flake.lock |

Labels: `input` (e.g., "nixpkgs", "home-manager")
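The flake metrics could be computed along these lines. The Go sketch assumes the usual flake.lock layout, where a top-level `root` key names the root node and that node's `inputs` map each input name to a node holding `locked.lastModified` and `locked.rev`. The embedded lock file is a trimmed, made-up sample, and `flakeInputs` and `inputInfo` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lockFile models the parts of flake.lock this sketch needs.
type lockFile struct {
	Root  string              `json:"root"`
	Nodes map[string]lockNode `json:"nodes"`
}

type lockNode struct {
	// Values are usually node names; "follows" entries are arrays.
	Inputs map[string]json.RawMessage `json:"inputs"`
	Locked struct {
		LastModified int64  `json:"lastModified"`
		Rev          string `json:"rev"`
	} `json:"locked"`
}

// inputInfo carries the values behind the two planned metrics.
type inputInfo struct {
	AgeSeconds int64
	Rev        string
}

// flakeInputs returns age and revision for each direct input of the
// root node, relative to nowUnix.
func flakeInputs(data []byte, nowUnix int64) (map[string]inputInfo, error) {
	var lock lockFile
	if err := json.Unmarshal(data, &lock); err != nil {
		return nil, err
	}
	root, ok := lock.Nodes[lock.Root]
	if !ok {
		return nil, fmt.Errorf("root node %q not found", lock.Root)
	}
	out := make(map[string]inputInfo)
	for name, raw := range root.Inputs {
		var nodeName string
		if err := json.Unmarshal(raw, &nodeName); err != nil {
			continue // array-valued "follows" reference, skipped in this sketch
		}
		node, ok := lock.Nodes[nodeName]
		if !ok {
			continue
		}
		out[name] = inputInfo{
			AgeSeconds: nowUnix - node.Locked.LastModified,
			Rev:        node.Locked.Rev,
		}
	}
	return out, nil
}

func main() {
	sample := []byte(`{
	  "nodes": {
	    "nixpkgs": {"locked": {"lastModified": 1700000000, "rev": "0123abc"}},
	    "root": {"inputs": {"nixpkgs": "nixpkgs"}}
	  },
	  "root": "root",
	  "version": 7
	}`)
	inputs, err := flakeInputs(sample, 1700259200)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for name, info := range inputs {
		fmt.Printf("nixos_flake_input_age_seconds{input=%q} %d\n", name, info.AgeSeconds)
		fmt.Printf("nixos_flake_input_info{input=%q,rev=%q} 1\n", name, info.Rev)
	}
}
```

Only direct inputs of the root node are reported, which keeps the `input` label aligned with the names used in `flake.nix` rather than with internal node names.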

### Future Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_upgrade_pending` | 1 if remote differs from local | Compare flake refs (expensive) |
| `nixos_store_size_bytes` | Size of /nix/store | `du` or filesystem stats |
| `nixos_store_path_count` | Number of store paths | Count entries |

## Architecture

Single binary with optional collectors enabled via config or flags.

```
nixos-exporter
├── main.go
├── collector/
│   ├── generation.go     # Core generation metrics
│   └── flake.go          # Flake input metrics
└── config/
    └── config.go
```

## Configuration

```yaml
listen_addr: ":9971"
collectors:
  generation:
    enabled: true
  flake:
    enabled: false
    lock_path: "/etc/nixos/flake.lock"  # or auto-detect from /run/current-system
```

Command-line alternative:
```bash
nixos-exporter --listen=:9971 --collector.flake --flake.lock-path=/etc/nixos/flake.lock
```
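The command line above could be parsed with Go's standard `flag` package, which accepts both `-name` and `--name` forms. The sketch below is one possible shape; the `options` struct and `parseFlags` function are hypothetical, and the defaults are taken from the YAML example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the settings from the YAML configuration.
type options struct {
	Listen       string
	FlakeEnabled bool
	LockPath     string
}

// parseFlags parses args (without the program name) into options.
// A dedicated FlagSet keeps the function testable.
func parseFlags(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("nixos-exporter", flag.ContinueOnError)
	fs.StringVar(&o.Listen, "listen", ":9971", "address to listen on")
	fs.BoolVar(&o.FlakeEnabled, "collector.flake", false, "enable the flake collector")
	fs.StringVar(&o.LockPath, "flake.lock-path", "/etc/nixos/flake.lock", "path to flake.lock")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return o, nil
}

func main() {
	o, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("listen=%s flake=%t lock=%s\n", o.Listen, o.FlakeEnabled, o.LockPath)
}
```

If both a config file and flags are supported, the precedence between them still needs to be decided; this sketch covers flags only.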

## NixOS Module

```nix
services.prometheus.exporters.nixos = {
  enable = true;
  port = 9971;
  collectors = [ "generation" "flake" ];
  flake.lockPath = "/etc/nixos/flake.lock";
};
```

The module should integrate with nixpkgs' existing `services.prometheus.exporters.*` pattern.

## Implementation

### Language

Go - mature prometheus client library, single static binary, easy cross-compilation.

### Phase 1: Core
1. Create git repository
2. Implement generation collector (count, current, booted, age, mismatch)
3. Basic HTTP server with `/metrics` endpoint
4. NixOS module

### Phase 2: Flake Collector
1. Parse flake.lock JSON format
2. Extract lastModified timestamps per input
3. Add input labels

### Phase 3: Packaging
1. Add to nixpkgs or publish as flake
2. Documentation
3. Example Grafana dashboard

## Example Output

```
# HELP nixos_generation_count Total number of system generations
# TYPE nixos_generation_count gauge
nixos_generation_count 47

# HELP nixos_current_generation Currently active generation number
# TYPE nixos_current_generation gauge
nixos_current_generation 47

# HELP nixos_booted_generation Generation that was booted
# TYPE nixos_booted_generation gauge
nixos_booted_generation 46

# HELP nixos_generation_age_seconds Age of current generation in seconds
# TYPE nixos_generation_age_seconds gauge
nixos_generation_age_seconds 3600

# HELP nixos_config_mismatch 1 if booted generation differs from current
# TYPE nixos_config_mismatch gauge
nixos_config_mismatch 1

# HELP nixos_flake_input_age_seconds Age of flake input in seconds
# TYPE nixos_flake_input_age_seconds gauge
nixos_flake_input_age_seconds{input="nixpkgs"} 259200
nixos_flake_input_age_seconds{input="home-manager"} 86400
```

## Alert Examples

```yaml
- alert: NixOSConfigStale
  expr: nixos_generation_age_seconds > 7 * 24 * 3600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "NixOS config on {{ $labels.instance }} is over 7 days old"

- alert: NixOSRebootRequired
  expr: nixos_config_mismatch == 1
  for: 24h
  labels:
    severity: info
  annotations:
    summary: "{{ $labels.instance }} needs reboot to apply config"

- alert: NixpkgsInputStale
  expr: nixos_flake_input_age_seconds{input="nixpkgs"} > 30 * 24 * 3600
  for: 1d
  labels:
    severity: info
  annotations:
    summary: "nixpkgs input on {{ $labels.instance }} is over 30 days old"
```

## Open Questions

- [ ] How to detect flake.lock path automatically? (check /run/current-system for flake info)
- [ ] Should generation collector need root? (probably not, just reading symlinks)
- [ ] Include in nixpkgs or distribute as standalone flake?

## Notes

- Port 9971 suggested (9970 reserved for homelab-exporter)
- Keep scope focused on NixOS-specific metrics - don't duplicate node-exporter
- Consider submitting to prometheus exporter registry once stable