# NixOS Prometheus Exporter

## Overview

Build a generic Prometheus exporter for NixOS-specific metrics. This exporter should be useful for any NixOS deployment, not just our homelab.

## Goal

Provide visibility into NixOS system state that standard exporters don't cover:

- Generation management (count, age, current vs booted)
- Flake input freshness
- Upgrade status

## Metrics

### Core Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_generation_count` | Number of system generations | Count `/nix/var/nix/profiles/system-*-link` entries |
| `nixos_current_generation` | Active generation number | Resolve `/run/current-system` and match it against the `system-*-link` targets (the store path itself carries no generation number) |
| `nixos_booted_generation` | Generation that was booted | Resolve `/run/booted-system` and match it the same way |
| `nixos_generation_age_seconds` | Age of current generation | mtime of the current generation's profile symlink itself (`lstat`), since store paths carry a fixed mtime |
| `nixos_config_mismatch` | 1 if booted != current, 0 otherwise | Compare symlink targets |

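
The lookups in the table above could be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the planned collector: the helper names (`parseGeneration`, `generationFor`, `readLinks`) are hypothetical, and picking the highest generation when several links share a target is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strconv"
)

// profileDir is where NixOS keeps the system-N-link generation symlinks.
const profileDir = "/nix/var/nix/profiles"

var linkRe = regexp.MustCompile(`^system-(\d+)-link$`)

// parseGeneration extracts N from a "system-N-link" file name.
func parseGeneration(name string) (int, bool) {
	m := linkRe.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

// generationFor returns the generation whose resolved target equals storePath.
// Assumption: if several generations share a target, the highest one wins.
func generationFor(links map[int]string, storePath string) (int, bool) {
	best, found := 0, false
	for gen, target := range links {
		if target == storePath && (!found || gen > best) {
			best, found = gen, true
		}
	}
	return best, found
}

// readLinks resolves every system-N-link in dir to its store path.
func readLinks(dir string) (map[int]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	links := make(map[int]string)
	for _, e := range entries {
		gen, ok := parseGeneration(e.Name())
		if !ok {
			continue
		}
		target, err := filepath.EvalSymlinks(filepath.Join(dir, e.Name()))
		if err != nil {
			continue // dangling link: skip it
		}
		links[gen] = target
	}
	return links, nil
}

func main() {
	links, err := readLinks(profileDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read profiles:", err)
		return
	}
	// Errors are ignored here for brevity; an unresolved path reports generation 0.
	current, _ := filepath.EvalSymlinks("/run/current-system")
	booted, _ := filepath.EvalSymlinks("/run/booted-system")
	cur, _ := generationFor(links, current)
	boot, _ := generationFor(links, booted)
	mismatch := 0
	if current != booted {
		mismatch = 1
	}
	fmt.Printf("nixos_generation_count %d\n", len(links))
	fmt.Printf("nixos_current_generation %d\n", cur)
	fmt.Printf("nixos_booted_generation %d\n", boot)
	fmt.Printf("nixos_config_mismatch %d\n", mismatch)
}
```

Keeping the name parsing and the target matching as pure functions makes them testable without a NixOS host; only `readLinks` and `main` touch the filesystem.
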
### Flake Metrics (optional collector)

| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_flake_input_age_seconds` | Age of each flake.lock input | Parse `lastModified` from flake.lock |
| `nixos_flake_input_info` | Info gauge with rev label | Parse `rev` from flake.lock |

Labels: `input` (e.g., "nixpkgs", "home-manager")

### Future Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| `nixos_upgrade_pending` | 1 if remote differs from local | Compare flake refs (expensive) |
| `nixos_store_size_bytes` | Size of /nix/store | `du` or filesystem stats |
| `nixos_store_path_count` | Number of store paths | Count entries |

## Architecture

Single binary with optional collectors enabled via config or flags.

```
nixos-exporter
├── main.go
├── collector/
│   ├── generation.go   # Core generation metrics
│   └── flake.go        # Flake input metrics
└── config/
    └── config.go
```

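
One way the "optional collectors" idea could look in code is a small interface plus a gather step that skips disabled collectors. Everything here (`Sample`, `Collector`, `staticCollector`, `gather`) is a hypothetical shape for illustration; a real implementation would more likely build on the Prometheus Go client's own collector types.

```go
package main

import "fmt"

// Sample is one metric value produced by a collector.
type Sample struct {
	Name  string
	Value float64
}

// Collector is the hypothetical contract each file in collector/ would satisfy.
type Collector interface {
	Name() string
	Collect() ([]Sample, error)
}

// staticCollector returns fixed samples; it stands in for real collectors here.
type staticCollector struct {
	name    string
	samples []Sample
}

func (c staticCollector) Name() string               { return c.name }
func (c staticCollector) Collect() ([]Sample, error) { return c.samples, nil }

// gather runs only the collectors whose name is enabled.
func gather(all []Collector, enabled map[string]bool) []Sample {
	var out []Sample
	for _, c := range all {
		if !enabled[c.Name()] {
			continue
		}
		samples, err := c.Collect()
		if err != nil {
			continue // one failing collector should not break the whole scrape
		}
		out = append(out, samples...)
	}
	return out
}

func main() {
	all := []Collector{
		staticCollector{name: "generation", samples: []Sample{{Name: "nixos_generation_count", Value: 47}}},
		staticCollector{name: "flake", samples: []Sample{{Name: "nixos_flake_input_age_seconds", Value: 86400}}},
	}
	// Mirrors the default configuration: generation on, flake off.
	for _, s := range gather(all, map[string]bool{"generation": true}) {
		fmt.Printf("%s %v\n", s.Name, s.Value)
	}
}
```
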
## Configuration

```yaml
listen_addr: ":9971"
collectors:
  generation:
    enabled: true
  flake:
    enabled: false
    lock_path: "/etc/nixos/flake.lock"  # or auto-detect from /run/current-system
```

Command-line alternative:

```bash
nixos-exporter --listen=:9971 --collector.flake --flake.lock-path=/etc/nixos/flake.lock
```

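
The flags in the command above map directly onto Go's standard `flag` package, as the sketch below suggests. The `options` struct, the `parseFlags` helper, and the default values are assumptions chosen to match the YAML example, not a settled interface.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the settings shown in the YAML configuration.
type options struct {
	ListenAddr    string
	FlakeEnabled  bool
	FlakeLockPath string
}

// parseFlags parses the exporter's command-line arguments.
// The flag package accepts both "-name" and "--name" spellings.
func parseFlags(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("nixos-exporter", flag.ContinueOnError)
	fs.StringVar(&o.ListenAddr, "listen", ":9971", "address to listen on")
	fs.BoolVar(&o.FlakeEnabled, "collector.flake", false, "enable the flake collector")
	fs.StringVar(&o.FlakeLockPath, "flake.lock-path", "/etc/nixos/flake.lock", "path to flake.lock")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("listen=%s flake=%t lock=%s\n", o.ListenAddr, o.FlakeEnabled, o.FlakeLockPath)
}
```

Using a dedicated `flag.FlagSet` instead of the global one keeps parsing testable, because arguments are passed in explicitly.
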
## NixOS Module

```nix
services.prometheus.exporters.nixos = {
  enable = true;
  port = 9971;
  collectors = [ "generation" "flake" ];
  flake.lockPath = "/etc/nixos/flake.lock";
};
```

The module should integrate with nixpkgs' existing `services.prometheus.exporters.*` pattern.

## Implementation

### Language

Go - mature Prometheus client library, single static binary, easy cross-compilation.

### Phase 1: Core

1. Create git repository
2. Implement generation collector (count, current, booted, age, mismatch)
3. Basic HTTP server with `/metrics` endpoint
4. NixOS module

### Phase 2: Flake Collector

1. Parse flake.lock JSON format
2. Extract lastModified timestamps per input
3. Add input labels

### Phase 3: Packaging

1. Add to nixpkgs or publish as flake
2. Documentation
3. Example Grafana dashboard

## Example Output

```
# HELP nixos_generation_count Total number of system generations
# TYPE nixos_generation_count gauge
nixos_generation_count 47

# HELP nixos_current_generation Currently active generation number
# TYPE nixos_current_generation gauge
nixos_current_generation 47

# HELP nixos_booted_generation Generation that was booted
# TYPE nixos_booted_generation gauge
nixos_booted_generation 46

# HELP nixos_generation_age_seconds Age of current generation in seconds
# TYPE nixos_generation_age_seconds gauge
nixos_generation_age_seconds 3600

# HELP nixos_config_mismatch 1 if booted generation differs from current
# TYPE nixos_config_mismatch gauge
nixos_config_mismatch 1

# HELP nixos_flake_input_age_seconds Age of flake input in seconds
# TYPE nixos_flake_input_age_seconds gauge
nixos_flake_input_age_seconds{input="nixpkgs"} 259200
nixos_flake_input_age_seconds{input="home-manager"} 86400
```

## Alert Examples

```yaml
- alert: NixOSConfigStale
  expr: nixos_generation_age_seconds > 7 * 24 * 3600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "NixOS config on {{ $labels.instance }} is over 7 days old"

- alert: NixOSRebootRequired
  expr: nixos_config_mismatch == 1
  for: 24h
  labels:
    severity: info
  annotations:
    summary: "{{ $labels.instance }} needs reboot to apply config"

- alert: NixpkgsInputStale
  expr: nixos_flake_input_age_seconds{input="nixpkgs"} > 30 * 24 * 3600
  for: 1d
  labels:
    severity: info
  annotations:
    summary: "nixpkgs input on {{ $labels.instance }} is over 30 days old"
```

## Open Questions

- [ ] How to detect flake.lock path automatically? (check /run/current-system for flake info)
- [ ] Does the generation collector need root? (probably not, just reading symlinks)
- [ ] Include in nixpkgs or distribute as standalone flake?

## Notes

- Port 9971 suggested (9970 reserved for homelab-exporter)
- Keep scope focused on NixOS-specific metrics - don't duplicate node-exporter
- Consider submitting to the Prometheus exporter registry once stable