diff --git a/docs/plans/homelab-exporter.md b/docs/plans/homelab-exporter.md
new file mode 100644
index 0000000..1bfb07b
--- /dev/null
+++ b/docs/plans/homelab-exporter.md
@@ -0,0 +1,179 @@
+# Homelab Infrastructure Exporter
+
+## Overview
+
+Build a Prometheus exporter for metrics specific to our homelab infrastructure. Unlike the generic nixos-exporter, this covers services and patterns unique to our environment.
+
+## Current State
+
+### Existing Exporters
+- **node-exporter** (all hosts): System metrics
+- **systemd-exporter** (all hosts): Service restart counts, IP accounting
+- **labmon** (monitoring01): TLS certificate monitoring, step-ca health
+- **Service-specific**: unbound, postgres, nats, jellyfin, home-assistant, caddy, step-ca
+
+### Gaps
+- No visibility into Vault/OpenBao lease expiry
+- No ACME certificate expiry from internal CA
+- No Proxmox guest agent metrics from inside VMs
+
+## Metrics
+
+### Vault/OpenBao Metrics
+
+| Metric | Description | Source |
+|--------|-------------|--------|
+| `homelab_vault_token_expiry_seconds` | Seconds until AppRole token expires | Token metadata or lease file |
+| `homelab_vault_token_renewable` | 1 if token is renewable | Token metadata |
+
+Labels: `role` (AppRole name)
+
+### ACME Certificate Metrics
+
+| Metric | Description | Source |
+|--------|-------------|--------|
+| `homelab_acme_cert_expiry_seconds` | Seconds until certificate expires | Parse cert from `/var/lib/acme/*/cert.pem` |
+| `homelab_acme_cert_not_after` | Unix timestamp of cert expiry | Certificate NotAfter field |
+
+Labels: `domain`, `issuer`
+
+Note: labmon already monitors external TLS endpoints. This covers local ACME-managed certs.
+
+### Proxmox Guest Metrics (future)
+
+| Metric | Description | Source |
+|--------|-------------|--------|
+| `homelab_proxmox_guest_info` | Info gauge with VM ID, name | QEMU guest agent |
+| `homelab_proxmox_guest_agent_running` | 1 if guest agent is responsive | Agent ping |
+
+### DNS Zone Metrics (future)
+
+| Metric | Description | Source |
+|--------|-------------|--------|
+| `homelab_dns_zone_serial` | Current zone serial number | DNS AXFR or zone file |
+
+Labels: `zone`
+
+## Architecture
+
+Single binary with collectors enabled via config. Runs on hosts that need specific collectors.
+
+```
+homelab-exporter
+├── main.go
+├── collector/
+│   ├── vault.go          # Vault/OpenBao token metrics
+│   ├── acme.go           # ACME certificate metrics
+│   └── proxmox.go        # Proxmox guest agent (future)
+└── config/
+    └── config.go
+```
+
+## Configuration
+
+```yaml
+listen_addr: ":9970"
+collectors:
+  vault:
+    enabled: true
+    token_path: "/var/lib/vault/token"
+  acme:
+    enabled: true
+    cert_dirs:
+      - "/var/lib/acme"
+  proxmox:
+    enabled: false
+```
+
+## NixOS Module
+
+```nix
+services.homelab-exporter = {
+  enable = true;
+  port = 9970;
+  collectors = {
+    vault = {
+      enable = true;
+      tokenPath = "/var/lib/vault/token";
+    };
+    acme = {
+      enable = true;
+      certDirs = [ "/var/lib/acme" ];
+    };
+  };
+};
+
+# Auto-register scrape target
+homelab.monitoring.scrapeTargets = [{
+  job_name = "homelab-exporter";
+  port = 9970;
+}];
+```
+
+## Integration
+
+### Deployment
+
+Deploy on hosts that have relevant data:
+- **All hosts with ACME certs**: acme collector
+- **All hosts with Vault**: vault collector
+- **Proxmox VMs**: proxmox collector (when implemented)
+
+### Relationship with nixos-exporter
+
+These are complementary:
+- **nixos-exporter** (port 9971): Generic NixOS metrics, deploy everywhere
+- **homelab-exporter** (port 9970): Infrastructure-specific, deploy selectively
+
+Both can run on the same host if needed.
+
+## Implementation
+
+### Language
+
+Go - consistent with labmon and nixos-exporter.
+
+### Phase 1: Core + ACME
+1. Create git repository (git.t-juice.club/torjus/homelab-exporter)
+2. Implement ACME certificate collector
+3. HTTP server with `/metrics`
+4. NixOS module
+
+### Phase 2: Vault Collector
+1. Implement token expiry detection
+2. Handle missing/expired tokens gracefully
+
+### Phase 3: Dashboard
+1. Create Grafana dashboard for infrastructure health
+2. Add to existing monitoring service module
+
+## Alert Examples
+
+```yaml
+- alert: VaultTokenExpiringSoon
+  expr: homelab_vault_token_expiry_seconds < 3600
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Vault token on {{ $labels.instance }} expires in < 1 hour"
+
+- alert: ACMECertExpiringSoon
+  expr: homelab_acme_cert_expiry_seconds < 7 * 24 * 3600
+  for: 1h
+  labels:
+    severity: warning
+  annotations:
+    summary: "ACME cert {{ $labels.domain }} on {{ $labels.instance }} expires in < 7 days"
+```
+
+## Open Questions
+
+- [ ] How to read Vault token expiry without re-authenticating?
+- [ ] Should ACME collector also check key/cert match?
+
+## Notes
+
+- Port 9970 (labmon uses 9969, nixos-exporter will use 9971)
+- Keep infrastructure-specific logic here, generic NixOS stuff in nixos-exporter
+- Consider merging Proxmox metrics with pve-exporter if overlap is significant
diff --git a/docs/plans/nixos-exporter.md b/docs/plans/nixos-exporter.md
new file mode 100644
index 0000000..e18536c
--- /dev/null
+++ b/docs/plans/nixos-exporter.md
@@ -0,0 +1,176 @@
+# NixOS Prometheus Exporter
+
+## Overview
+
+Build a generic Prometheus exporter for NixOS-specific metrics. This exporter should be useful for any NixOS deployment, not just our homelab.
+
+## Goal
+
+Provide visibility into NixOS system state that standard exporters don't cover:
+- Generation management (count, age, current vs booted)
+- Flake input freshness
+- Upgrade status
+
+## Metrics
+
+### Core Metrics
+
+| Metric | Description | Source |
+|--------|-------------|--------|
+| `nixos_generation_count` | Number of system generations | Count entries in `/nix/var/nix/profiles/system-*` |
+| `nixos_current_generation` | Active generation number | Parse `readlink /run/current-system` |
+| `nixos_booted_generation` | Generation that was booted | Parse `/run/booted-system` |
+| `nixos_generation_age_seconds` | Age of current generation | File mtime of current system profile |
+| `nixos_config_mismatch` | 1 if booted != current, 0 otherwise | Compare symlink targets |
+
+### Flake Metrics (optional collector)
+
+| Metric | Description | Source |
+|--------|-------------|--------|
+| `nixos_flake_input_age_seconds` | Age of each flake.lock input | Parse `lastModified` from flake.lock |
+| `nixos_flake_input_info` | Info gauge with rev label | Parse `rev` from flake.lock |
+
+Labels: `input` (e.g., "nixpkgs", "home-manager")
+
+### Future Metrics
+
+| Metric | Description | Source |
+|--------|-------------|--------|
+| `nixos_upgrade_pending` | 1 if remote differs from local | Compare flake refs (expensive) |
+| `nixos_store_size_bytes` | Size of /nix/store | `du` or filesystem stats |
+| `nixos_store_path_count` | Number of store paths | Count entries |
+
+## Architecture
+
+Single binary with optional collectors enabled via config or flags.
+
+```
+nixos-exporter
+├── main.go
+├── collector/
+│   ├── generation.go     # Core generation metrics
+│   └── flake.go          # Flake input metrics
+└── config/
+    └── config.go
+```
+
+## Configuration
+
+```yaml
+listen_addr: ":9971"
+collectors:
+  generation:
+    enabled: true
+  flake:
+    enabled: false
+    lock_path: "/etc/nixos/flake.lock"  # or auto-detect from /run/current-system
+```
+
+Command-line alternative:
+```bash
+nixos-exporter --listen=:9971 --collector.flake --flake.lock-path=/etc/nixos/flake.lock
+```
+
+## NixOS Module
+
+```nix
+services.prometheus.exporters.nixos = {
+  enable = true;
+  port = 9971;
+  collectors = [ "generation" "flake" ];
+  flake.lockPath = "/etc/nixos/flake.lock";
+};
+```
+
+The module should integrate with nixpkgs' existing `services.prometheus.exporters.*` pattern.
+
+## Implementation
+
+### Language
+
+Go - mature prometheus client library, single static binary, easy cross-compilation.
+
+### Phase 1: Core
+1. Create git repository
+2. Implement generation collector (count, current, booted, age, mismatch)
+3. Basic HTTP server with `/metrics` endpoint
+4. NixOS module
+
+### Phase 2: Flake Collector
+1. Parse flake.lock JSON format
+2. Extract lastModified timestamps per input
+3. Add input labels
+
+### Phase 3: Packaging
+1. Add to nixpkgs or publish as flake
+2. Documentation
+3. Example Grafana dashboard
+
+## Example Output
+
+```
+# HELP nixos_generation_count Total number of system generations
+# TYPE nixos_generation_count gauge
+nixos_generation_count 47
+
+# HELP nixos_current_generation Currently active generation number
+# TYPE nixos_current_generation gauge
+nixos_current_generation 47
+
+# HELP nixos_booted_generation Generation that was booted
+# TYPE nixos_booted_generation gauge
+nixos_booted_generation 46
+
+# HELP nixos_generation_age_seconds Age of current generation in seconds
+# TYPE nixos_generation_age_seconds gauge
+nixos_generation_age_seconds 3600
+
+# HELP nixos_config_mismatch 1 if booted generation differs from current
+# TYPE nixos_config_mismatch gauge
+nixos_config_mismatch 1
+
+# HELP nixos_flake_input_age_seconds Age of flake input in seconds
+# TYPE nixos_flake_input_age_seconds gauge
+nixos_flake_input_age_seconds{input="nixpkgs"} 259200
+nixos_flake_input_age_seconds{input="home-manager"} 86400
+```
+
+## Alert Examples
+
+```yaml
+- alert: NixOSConfigStale
+  expr: nixos_generation_age_seconds > 7 * 24 * 3600
+  for: 1h
+  labels:
+    severity: warning
+  annotations:
+    summary: "NixOS config on {{ $labels.instance }} is over 7 days old"
+
+- alert: NixOSRebootRequired
+  expr: nixos_config_mismatch == 1
+  for: 24h
+  labels:
+    severity: info
+  annotations:
+    summary: "{{ $labels.instance }} needs reboot to apply config"
+
+- alert: NixpkgsInputStale
+  expr: nixos_flake_input_age_seconds{input="nixpkgs"} > 30 * 24 * 3600
+  for: 1d
+  labels:
+    severity: info
+  annotations:
+    summary: "nixpkgs input on {{ $labels.instance }} is over 30 days old"
+```
+
+## Open Questions
+
+- [ ] How to detect flake.lock path automatically? (check /run/current-system for flake info)
+- [ ] Should generation collector need root? (probably not, just reading symlinks)
+- [ ] Include in nixpkgs or distribute as standalone flake?
+
+## Notes
+
+- Port 9971 suggested (9970 reserved for homelab-exporter)
+- Keep scope focused on NixOS-specific metrics - don't duplicate node-exporter
+- Consider submitting to prometheus exporter registry once stable