# Loki Setup Improvements

## Overview

The current Loki deployment on monitoring01 is functional but minimal. It lacks retention policies, rate limiting, and uses local filesystem storage.

This plan evaluates improvement options across several dimensions: retention management, storage backend, resource limits, and operational improvements.

## Current State

**Loki** on monitoring01 (`services/monitoring/loki.nix`):

- Single-node deployment, no HA
- Filesystem storage at `/var/lib/loki/chunks`
- TSDB index (v13 schema, 24h period)
- No retention policy configured (logs grow indefinitely)
- No `limits_config` (no rate limiting, stream limits, or query guards)
- No caching layer
- Auth disabled (trusted network)

**Promtail** on all 16 hosts (`system/monitoring/logs.nix`):

- Ships systemd journal (JSON) + `/var/log/**/*.log`
- Labels: `host`, `job` (systemd-journal/varlog), `systemd_unit`
- Hardcoded to `http://monitoring01.home.2rjus.net:3100`

**Additional log sources:**

- `pipe-to-loki` script (manual log submission, `job=pipe-to-loki`)
- Bootstrap logs from template2 (`job=bootstrap`)

**Context:** The VictoriaMetrics migration plan (`docs/plans/monitoring-migration-victoriametrics.md`) includes moving Loki to monitoring02 with "same configuration as current". These improvements could be applied either before or after that migration.

## Improvement Areas

### 1. Retention Policy

**Problem:** No retention configured. Logs accumulate until disk fills up.

**Options:**

| Approach | Config Location | How It Works |
|----------|----------------|--------------|
| **Compactor retention** | `compactor` + `limits_config` | Compactor runs periodic retention sweeps, deleting chunks older than threshold |
| **Table manager** | `table_manager` | Legacy approach, not recommended for TSDB |

**Recommendation:** Use compactor-based retention (the modern approach for TSDB/filesystem):

```nix
compactor = {
  working_directory = "/var/lib/loki/compactor";
  compaction_interval = "10m";
  retention_enabled = true;
  retention_delete_delay = "2h";
  retention_delete_worker_count = 150;
  delete_request_store = "filesystem";  # Required by Loki 3.x when retention is enabled
};

limits_config = {
  retention_period = "30d";  # Default retention for all tenants
};
```

Loki 3.x fails config validation when `retention_enabled = true` is set without `delete_request_store`, so it is included above.

30 days aligns with the Prometheus retention and is reasonable for a homelab. Older logs are rarely useful, and anything important can be found in journal archives on the hosts themselves.

### 2. Storage Backend

**Decision:** Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.

### 3. Limits Configuration

**Problem:** No rate limiting or stream cardinality protection. A misbehaving service could generate excessive logs and overwhelm Loki.

**Recommendation:** Add basic guardrails:

```nix
limits_config = {
  retention_period = "30d";
  ingestion_rate_mb = 10;           # MB/s per tenant
  ingestion_burst_size_mb = 20;     # Burst allowance
  max_streams_per_user = 10000;     # Prevent label explosion
  max_query_series = 500;           # Limit query resource usage
  max_query_parallelism = 8;
};
```

These are generous limits that shouldn't affect normal operation but protect against runaway log generators.

### 4. Promtail Label Improvements

**Problem:** Label inconsistencies and missing useful metadata:

- The `varlog` scrape config uses `hostname` while journal uses `host` (different label name)
- No `tier` or `role` labels, making it hard to filter logs by deployment tier or host function

**Implemented:** Standardized on `hostname` to match Prometheus labels. The journal scrape previously used a relabel from `__journal__hostname` to `host`; now both scrape configs use a static `hostname` label from `config.networking.hostName`. Also updated `pipe-to-loki` and bootstrap scripts to use `hostname` instead of `host`.

1. **Standardized label:** Both scrape configs use `hostname` (matching Prometheus) via shared `hostLabels`
2. **Added `tier` label:** Static label from `config.homelab.host.tier` (`test`/`prod`) on both scrape configs
3. **Added `role` label:** Static label from `config.homelab.host.role` on both scrape configs (conditionally, only when non-null)

No cardinality impact - `tier` and `role` are 1:1 with `hostname`, so they add metadata to existing streams without creating new ones.

This enables queries like:

- `{tier="prod"} |= "error"` - all errors on prod hosts
- `{role="dns"}` - all DNS server logs
- `{tier="test", job="systemd-journal"}` - journal logs from test hosts

### 5. Journal Priority → Level Label

**Problem:** Loki 3.6.3 auto-detects a `detected_level` label by parsing log message text for keywords like "INFO", "ERROR", etc. This works for applications that embed level strings in messages (Go apps, Loki itself), but **fails for traditional Unix services** that use the journal `PRIORITY` field without level text in the message.

Example: NSD logs `"signal received, shutting down..."` with `PRIORITY="4"` (warning), but Loki sets `detected_level="unknown"` because the message has no level keyword. Querying `{detected_level="warn"}` misses these entirely.

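
One way to close this gap is to derive a label from the journal `PRIORITY` field with Promtail pipeline stages. The following is a minimal sketch, not the actual contents of `system/monitoring/logs.nix`: the `job_name` value and the surrounding attribute layout are assumptions, and the template assumes `PRIORITY` arrives as a string such as `"4"` in the JSON journal entry.

```nix
# Hypothetical journal scrape config entry for Promtail (names are illustrative).
{
  job_name = "journal"; # assumed name
  journal = {
    json = true; # ship the full journal entry as JSON so PRIORITY is available
    labels.job = "systemd-journal";
  };
  pipeline_stages = [
    # Extract PRIORITY from the JSON entry into the extracted map as `priority`.
    { json.expressions.priority = "PRIORITY"; }
    # Map the numeric priority to a level name; 7 falls through to "debug".
    {
      template = {
        source = "level";
        template = ''{{ if eq .priority "0" "1" "2" }}critical{{ else if eq .priority "3" }}error{{ else if eq .priority "4" }}warning{{ else if eq .priority "5" }}notice{{ else if eq .priority "6" }}info{{ else }}debug{{ end }}'';
      };
    }
    # Promote the extracted `level` value to a stream label.
    { labels.level = "level"; }
  ];
}
```

The mapping is kept in a single `template` stage so the priority-to-name table lives in one place; a chain of `match` stages would work too but is more verbose.
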
**Recommendation:** Add a Promtail pipeline stage to the journal scrape config that maps the `PRIORITY` field to a `level` label:

| PRIORITY | level |
|----------|-------|
| 0-2 | critical |
| 3 | error |
| 4 | warning |
| 5 | notice |
| 6 | info |
| 7 | debug |

This can be done with a `json` stage to extract PRIORITY, then a `template` + `labels` stage to map and attach it. The journal `PRIORITY` field is always present, so this gives reliable level filtering for all journal logs.

**Cardinality impact:** Moderate. Adds up to ~6 label values per host+unit combination. In practice most services log at 1-2 levels, so the stream count increase is manageable for 16 hosts. The filtering benefit (e.g., `{level="error"}` to find all errors across the fleet) outweighs the cost.

This enables queries like:

- `{level="error"}` - all errors across the fleet
- `{level=~"critical|error", tier="prod"}` - prod errors and criticals
- `{level="warning", role="dns"}` - warnings from DNS servers

### 6. Enable JSON Logging on Services

**Problem:** Many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki - `| json` cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching.

**Recommendation:** Audit all configured services and enable JSON logging where supported. Candidates to check include:

- Caddy (already JSON by default)
- Prometheus / Alertmanager / Loki / Tempo
- Grafana
- NSD / Unbound
- Home Assistant
- NATS
- Jellyfin
- OpenBao (Vault)
- Kanidm
- Garage

For each service, check whether it supports a JSON log format option and whether enabling it would break anything (e.g., log volume increase from verbose JSON, or dashboards that parse text format).

### 7. Monitoring CNAME for Promtail Target

**Problem:** Promtail hardcodes `monitoring01.home.2rjus.net:3100`. The VictoriaMetrics migration plan already addresses this by switching to a `monitoring` CNAME.

**Recommendation:** This should happen as part of the monitoring02 migration, not independently. If we do Loki improvements before that migration, keep pointing to monitoring01.

## Priority Ranking

| # | Improvement | Effort | Impact | Recommendation |
|---|-------------|--------|--------|----------------|
| 1 | **Retention policy** | Low | High | Do first - prevents disk exhaustion |
| 2 | **Limits config** | Low | Medium | Do with retention - minimal additional effort |
| 3 | **Promtail label fix** | Trivial | Low | Quick fix, do with other label changes |
| 4 | **Journal priority → level** | Low-medium | Medium | Reliable level filtering across the fleet |
| 5 | **JSON logging audit** | Low-medium | Medium | Audit services, enable JSON where supported |
| 6 | **Monitoring CNAME** | Low | Medium | Part of monitoring02 migration |

## Implementation Steps

### Phase 1: Retention + Limits (quick win)

1. Add `compactor` section to `services/monitoring/loki.nix`
2. Add `limits_config` with 30-day retention and basic rate limits
3. Update `system/monitoring/logs.nix`:
   - ~~Fix `hostname` → `host` label in varlog scrape config~~ Done: standardized on `hostname` (matching Prometheus)
   - ~~Add `tier` static label from `config.homelab.host.tier` to both scrape configs~~ Done
   - ~~Add `role` static label from `config.homelab.host.role` (conditionally, only when set) to both scrape configs~~ Done
   - ~~Add pipeline stages to journal scrape config: `json` to extract PRIORITY, `template` to map to level name, `labels` to attach as `level`~~ Done
4. Deploy to monitoring01, verify compactor runs and old data gets cleaned
5. Verify `level` label works: `{level="error"}` should return results, and match cases where `detected_level="unknown"`

### Phase 2 (future): S3 Storage Migration

Revisit after NAS migration when a proper S3-compatible storage solution is available.
At that point, add a new schema period with `object_store = "s3"` - the old filesystem period will continue serving historical data until it ages out past retention.

## Open Questions

- [ ] What retention period makes sense? 30 days suggested, but could be 14d or 60d depending on disk/storage budget
- [ ] Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?

## Notes

- Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
- The compactor is already part of single-process Loki in recent versions - it just needs to be configured.
- S3 storage deferred until post-NAS migration when a proper solution is available.
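
The note about schema periods can be made concrete with a rough sketch of what a Phase 2 `schema_config` might look like. This is illustrative only: both `from` dates and the `index_` prefix are placeholders rather than values taken from the current configuration, and the matching S3 storage settings (endpoint, bucket, credentials) are omitted.

```nix
# Hypothetical sketch; dates and index prefix are placeholders, not real values.
services.loki.configuration.schema_config.configs = [
  {
    # Existing filesystem period: leave this entry exactly as it is today.
    from = "2024-01-01"; # placeholder for the current period's real start date
    store = "tsdb";
    object_store = "filesystem";
    schema = "v13";
    index = {
      prefix = "index_"; # assumed prefix
      period = "24h";
    };
  }
  {
    # New period: chunks written from this date onward go to S3.
    from = "2030-01-01"; # placeholder cut-over date; must be in the future at deploy time
    store = "tsdb";
    object_store = "s3";
    schema = "v13";
    index = {
      prefix = "index_"; # assumed prefix
      period = "24h";
    };
  }
];
```

Appending a period rather than editing the existing one lets Loki keep reading older chunks from local disk until they age out past retention.
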