docs: add plans for nixos and homelab prometheus exporters

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
2026-02-06 21:52:01 +01:00
parent 1f5b7b13e2
commit 7e80d2e0bc
2 changed files with 355 additions and 0 deletions

View File

@@ -0,0 +1,179 @@
# Homelab Infrastructure Exporter
## Overview
Build a Prometheus exporter for metrics specific to our homelab infrastructure. Unlike the generic nixos-exporter, this covers services and patterns unique to our environment.
## Current State
### Existing Exporters
- **node-exporter** (all hosts): System metrics
- **systemd-exporter** (all hosts): Service restart counts, IP accounting
- **labmon** (monitoring01): TLS certificate monitoring, step-ca health
- **Service-specific**: unbound, postgres, nats, jellyfin, home-assistant, caddy, step-ca
### Gaps
- No visibility into Vault/OpenBao lease expiry
- No ACME certificate expiry from internal CA
- No Proxmox guest agent metrics from inside VMs
## Metrics
### Vault/OpenBao Metrics
| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_vault_token_expiry_seconds` | Seconds until AppRole token expires | Token metadata or lease file |
| `homelab_vault_token_renewable` | 1 if token is renewable | Token metadata |
Labels: `role` (AppRole name)
### ACME Certificate Metrics
| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_acme_cert_expiry_seconds` | Seconds until certificate expires | Parse cert from `/var/lib/acme/*/cert.pem` |
| `homelab_acme_cert_not_after` | Unix timestamp of cert expiry | Certificate NotAfter field |
Labels: `domain`, `issuer`
Note: labmon already monitors external TLS endpoints. This covers local ACME-managed certs.
### Proxmox Guest Metrics (future)
| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_proxmox_guest_info` | Info gauge with VM ID, name | QEMU guest agent |
| `homelab_proxmox_guest_agent_running` | 1 if guest agent is responsive | Agent ping |
### DNS Zone Metrics (future)
| Metric | Description | Source |
|--------|-------------|--------|
| `homelab_dns_zone_serial` | Current zone serial number | DNS AXFR or zone file |
Labels: `zone`
## Architecture
Single binary with collectors enabled via config. Runs on hosts that need specific collectors.
```
homelab-exporter
├── main.go
├── collector/
│ ├── vault.go # Vault/OpenBao token metrics
│ ├── acme.go # ACME certificate metrics
│ └── proxmox.go # Proxmox guest agent (future)
└── config/
└── config.go
```
## Configuration
```yaml
listen_addr: ":9970"
collectors:
vault:
enabled: true
token_path: "/var/lib/vault/token"
acme:
enabled: true
cert_dirs:
- "/var/lib/acme"
proxmox:
enabled: false
```
## NixOS Module
```nix
services.homelab-exporter = {
enable = true;
port = 9970;
collectors = {
vault = {
enable = true;
tokenPath = "/var/lib/vault/token";
};
acme = {
enable = true;
certDirs = [ "/var/lib/acme" ];
};
};
};
# Auto-register scrape target
homelab.monitoring.scrapeTargets = [{
job_name = "homelab-exporter";
port = 9970;
}];
```
## Integration
### Deployment
Deploy on hosts that have relevant data:
- **All hosts with ACME certs**: acme collector
- **All hosts with Vault**: vault collector
- **Proxmox VMs**: proxmox collector (when implemented)
### Relationship with nixos-exporter
These are complementary:
- **nixos-exporter** (port 9971): Generic NixOS metrics, deploy everywhere
- **homelab-exporter** (port 9970): Infrastructure-specific, deploy selectively
Both can run on the same host if needed.
## Implementation
### Language
Go - consistent with labmon and nixos-exporter.
### Phase 1: Core + ACME
1. Create git repository (git.t-juice.club/torjus/homelab-exporter)
2. Implement ACME certificate collector
3. HTTP server with `/metrics`
4. NixOS module
### Phase 2: Vault Collector
1. Implement token expiry detection
2. Handle missing/expired tokens gracefully
### Phase 3: Dashboard
1. Create Grafana dashboard for infrastructure health
2. Add to existing monitoring service module
## Alert Examples
```yaml
- alert: VaultTokenExpiringSoon
expr: homelab_vault_token_expiry_seconds < 3600
for: 5m
labels:
severity: warning
annotations:
summary: "Vault token on {{ $labels.instance }} expires in < 1 hour"
- alert: ACMECertExpiringSoon
expr: homelab_acme_cert_expiry_seconds < 7 * 24 * 3600
for: 1h
labels:
severity: warning
annotations:
summary: "ACME cert {{ $labels.domain }} on {{ $labels.instance }} expires in < 7 days"
```
## Open Questions
- [ ] How to read Vault token expiry without re-authenticating?
- [ ] Should ACME collector also check key/cert match?
## Notes
- Port 9970 (labmon uses 9969, nixos-exporter will use 9971)
- Keep infrastructure-specific logic here, generic NixOS stuff in nixos-exporter
- Consider merging Proxmox metrics with pve-exporter if overlap is significant

View File

@@ -0,0 +1,176 @@
# NixOS Prometheus Exporter
## Overview
Build a generic Prometheus exporter for NixOS-specific metrics. This exporter should be useful for any NixOS deployment, not just our homelab.
## Goal
Provide visibility into NixOS system state that standard exporters don't cover:
- Generation management (count, age, current vs booted)
- Flake input freshness
- Upgrade status
## Metrics
### Core Metrics
| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_generation_count` | Number of system generations | Count entries in `/nix/var/nix/profiles/system-*` |
| `nixos_current_generation` | Active generation number | Parse `readlink /run/current-system` |
| `nixos_booted_generation` | Generation that was booted | Parse `/run/booted-system` |
| `nixos_generation_age_seconds` | Age of current generation | File mtime of current system profile |
| `nixos_config_mismatch` | 1 if booted != current, 0 otherwise | Compare symlink targets |
### Flake Metrics (optional collector)
| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_flake_input_age_seconds` | Age of each flake.lock input | Parse `lastModified` from flake.lock |
| `nixos_flake_input_info` | Info gauge with rev label | Parse `rev` from flake.lock |
Labels: `input` (e.g., "nixpkgs", "home-manager")
### Future Metrics
| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_upgrade_pending` | 1 if remote differs from local | Compare flake refs (expensive) |
| `nixos_store_size_bytes` | Size of /nix/store | `du` or filesystem stats |
| `nixos_store_path_count` | Number of store paths | Count entries |
## Architecture
Single binary with optional collectors enabled via config or flags.
```
nixos-exporter
├── main.go
├── collector/
│ ├── generation.go # Core generation metrics
│ └── flake.go # Flake input metrics
└── config/
└── config.go
```
## Configuration
```yaml
listen_addr: ":9971"
collectors:
generation:
enabled: true
flake:
enabled: false
lock_path: "/etc/nixos/flake.lock" # or auto-detect from /run/current-system
```
Command-line alternative:
```bash
nixos-exporter --listen=:9971 --collector.flake --flake.lock-path=/etc/nixos/flake.lock
```
## NixOS Module
```nix
services.prometheus.exporters.nixos = {
enable = true;
port = 9971;
collectors = [ "generation" "flake" ];
flake.lockPath = "/etc/nixos/flake.lock";
};
```
The module should integrate with nixpkgs' existing `services.prometheus.exporters.*` pattern.
## Implementation
### Language
Go - mature prometheus client library, single static binary, easy cross-compilation.
### Phase 1: Core
1. Create git repository
2. Implement generation collector (count, current, booted, age, mismatch)
3. Basic HTTP server with `/metrics` endpoint
4. NixOS module
### Phase 2: Flake Collector
1. Parse flake.lock JSON format
2. Extract lastModified timestamps per input
3. Add input labels
### Phase 3: Packaging
1. Add to nixpkgs or publish as flake
2. Documentation
3. Example Grafana dashboard
## Example Output
```
# HELP nixos_generation_count Total number of system generations
# TYPE nixos_generation_count gauge
nixos_generation_count 47
# HELP nixos_current_generation Currently active generation number
# TYPE nixos_current_generation gauge
nixos_current_generation 47
# HELP nixos_booted_generation Generation that was booted
# TYPE nixos_booted_generation gauge
nixos_booted_generation 46
# HELP nixos_generation_age_seconds Age of current generation in seconds
# TYPE nixos_generation_age_seconds gauge
nixos_generation_age_seconds 3600
# HELP nixos_config_mismatch 1 if booted generation differs from current
# TYPE nixos_config_mismatch gauge
nixos_config_mismatch 1
# HELP nixos_flake_input_age_seconds Age of flake input in seconds
# TYPE nixos_flake_input_age_seconds gauge
nixos_flake_input_age_seconds{input="nixpkgs"} 259200
nixos_flake_input_age_seconds{input="home-manager"} 86400
```
## Alert Examples
```yaml
- alert: NixOSConfigStale
expr: nixos_generation_age_seconds > 7 * 24 * 3600
for: 1h
labels:
severity: warning
annotations:
summary: "NixOS config on {{ $labels.instance }} is over 7 days old"
- alert: NixOSRebootRequired
expr: nixos_config_mismatch == 1
for: 24h
labels:
severity: info
annotations:
summary: "{{ $labels.instance }} needs reboot to apply config"
- alert: NixpkgsInputStale
expr: nixos_flake_input_age_seconds{input="nixpkgs"} > 30 * 24 * 3600
for: 1d
labels:
severity: info
annotations:
summary: "nixpkgs input on {{ $labels.instance }} is over 30 days old"
```
## Open Questions
- [ ] How to detect flake.lock path automatically? (check /run/current-system for flake info)
- [ ] Should generation collector need root? (probably not, just reading symlinks)
- [ ] Include in nixpkgs or distribute as standalone flake?
## Notes
- Port 9971 suggested (9970 reserved for homelab-exporter)
- Keep scope focused on NixOS-specific metrics - don't duplicate node-exporter
- Consider submitting to prometheus exporter registry once stable