docs: add Loki improvements plan
Covers retention policy, limits config, Promtail label improvements (tier/role/level), and journal PRIORITY extraction. Also adds Alloy consideration to VictoriaMetrics migration plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs/plans/loki-improvements.md (new file, 171 lines)
# Loki Setup Improvements

## Overview

The current Loki deployment on monitoring01 is functional but minimal: it has no retention policy or rate limiting, and it uses local filesystem storage. This plan evaluates improvement options across several dimensions: retention management, storage backend, resource limits, and operational improvements.

## Current State

**Loki** on monitoring01 (`services/monitoring/loki.nix`):
- Single-node deployment, no HA
- Filesystem storage at `/var/lib/loki/chunks`
- TSDB index (v13 schema, 24h period)
- No retention policy configured (logs grow indefinitely)
- No `limits_config` (no rate limiting, stream limits, or query guards)
- No caching layer
- Auth disabled (trusted network)

**Promtail** on all 16 hosts (`system/monitoring/logs.nix`):
- Ships systemd journal (JSON) + `/var/log/**/*.log`
- Labels: `host`, `job` (systemd-journal/varlog), `systemd_unit`
- Hardcoded to `http://monitoring01.home.2rjus.net:3100`

**Additional log sources:**
- `pipe-to-loki` script (manual log submission, `job=pipe-to-loki`)
- Bootstrap logs from template2 (`job=bootstrap`)

**Context:** The VictoriaMetrics migration plan (`docs/plans/monitoring-migration-victoriametrics.md`) includes moving Loki to monitoring02 with "same configuration as current". These improvements could be applied either before or after that migration.

## Improvement Areas

### 1. Retention Policy

**Problem:** No retention configured. Logs accumulate until disk fills up.

**Options:**

| Approach | Config Location | How It Works |
|----------|----------------|--------------|
| **Compactor retention** | `compactor` + `limits_config` | Compactor runs periodic retention sweeps, deleting chunks older than threshold |
| **Table manager** | `table_manager` | Legacy approach, not recommended for TSDB |

**Recommendation:** Use compactor-based retention (the modern approach for TSDB/filesystem):

```nix
compactor = {
  working_directory = "/var/lib/loki/compactor";
  compaction_interval = "10m";
  retention_enabled = true;
  retention_delete_delay = "2h";
  retention_delete_worker_count = 150;
  delete_request_store = "filesystem"; # Loki 3.x requires this when retention_enabled is true
};

limits_config = {
  retention_period = "30d"; # Default retention for all tenants
};
```

30 days aligns with the Prometheus retention and is reasonable for a homelab. Older logs are rarely useful, and anything important can be found in journal archives on the hosts themselves.

### 2. Storage Backend

**Decision:** Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.

### 3. Limits Configuration

**Problem:** No rate limiting or stream cardinality protection. A misbehaving service could generate excessive logs and overwhelm Loki.

**Recommendation:** Add basic guardrails:

```nix
limits_config = {
  retention_period = "30d";
  ingestion_rate_mb = 10;       # MB/s per tenant
  ingestion_burst_size_mb = 20; # Burst allowance
  max_streams_per_user = 10000; # Prevent label explosion
  max_query_series = 500;       # Limit query resource usage
  max_query_parallelism = 8;
};
```

These are generous limits that shouldn't affect normal operation but protect against runaway log generators.

### 4. Promtail Label Improvements

**Problem:** Label inconsistencies and missing useful metadata:
- The `varlog` scrape config uses `hostname` while the journal config uses `host` (inconsistent label names)
- No `tier` or `role` labels, making it hard to filter logs by deployment tier or host function

**Recommendations:**

1. **Fix varlog label:** Rename `hostname` to `host` for consistency with journal scrape config
2. **Add `tier` label:** Static label from `config.homelab.host.tier` (`test`/`prod`) on both scrape configs
3. **Add `role` label:** Static label from `config.homelab.host.role` on both scrape configs, only when set (10 hosts have no role, so omit to keep labels clean)

No cardinality impact - `tier` and `role` are 1:1 with `host`, so they add metadata to existing streams without creating new ones.

This enables queries like:

- `{tier="prod"} |= "error"` - all errors on prod hosts
- `{role="dns"}` - all DNS server logs
- `{tier="test", job="systemd-journal"}` - journal logs from test hosts

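
The label changes above could look roughly like this in `system/monitoring/logs.nix`. This is a sketch only: the `homelab.host.tier`/`homelab.host.role` options exist per this plan, but the `let` binding, the `role != null` check, and the surrounding scrape-config shape are assumptions about the module's layout:

```nix
# Hypothetical fragment for the varlog scrape config. `lib.optionalAttrs`
# omits the `role` label entirely on hosts where no role is set.
let
  cfg = config.homelab.host;
  staticLabels = {
    host = config.networking.hostName; # renamed from `hostname`
    tier = cfg.tier;                   # "test" or "prod"
  } // lib.optionalAttrs (cfg.role != null) { role = cfg.role; };
in {
  scrape_configs = [
    {
      job_name = "varlog";
      static_configs = [
        {
          targets = [ "localhost" ];
          labels = staticLabels // {
            job = "varlog";
            __path__ = "/var/log/**/*.log";
          };
        }
      ];
    }
  ];
}
```

The same `staticLabels` attribute set can be merged into the journal scrape config's `journal.labels`, so both sources stay consistent.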
### 5. Journal Priority → Level Label
**Problem:** Loki 3.6.3 auto-detects a `detected_level` label by parsing log message text for keywords like "INFO", "ERROR", etc. This works for applications that embed level strings in messages (Go apps, Loki itself), but **fails for traditional Unix services** that use the journal `PRIORITY` field without level text in the message.

Example: NSD logs `"signal received, shutting down..."` with `PRIORITY="4"` (warning), but Loki sets `detected_level="unknown"` because the message has no level keyword. Querying `{detected_level="warn"}` misses these entirely.

**Recommendation:** Add a Promtail pipeline stage to the journal scrape config that maps the `PRIORITY` field to a `level` label:

| PRIORITY | level |
|----------|-------|
| 0-2 | critical |
| 3 | error |
| 4 | warning |
| 5 | notice |
| 6 | info |
| 7 | debug |

This can be done with a `json` stage to extract PRIORITY, then a `template` + `labels` stage to map and attach it. The journal `PRIORITY` field is always present, so this gives reliable level filtering for all journal logs.
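
An alternative to the pipeline-stage approach, mentioned here as an option rather than the plan's chosen method: Promtail's journal target exposes every journal field as a `__journal_*` source label, so `PRIORITY` is available as `__journal_priority` and can be mapped with plain relabel rules (attribute layout below is assumed):

```nix
# Hypothetical fragment for the journal scrape config: map the journal's
# __journal_priority field straight to a `level` label. Each rule only
# fires when the regex matches, so exactly one replacement applies.
relabel_configs = [
  { source_labels = [ "__journal_priority" ]; regex = "[0-2]"; target_label = "level"; replacement = "critical"; }
  { source_labels = [ "__journal_priority" ]; regex = "3"; target_label = "level"; replacement = "error"; }
  { source_labels = [ "__journal_priority" ]; regex = "4"; target_label = "level"; replacement = "warning"; }
  { source_labels = [ "__journal_priority" ]; regex = "5"; target_label = "level"; replacement = "notice"; }
  { source_labels = [ "__journal_priority" ]; regex = "6"; target_label = "level"; replacement = "info"; }
  { source_labels = [ "__journal_priority" ]; regex = "7"; target_label = "level"; replacement = "debug"; }
];
```

This avoids parsing the JSON message body at all, at the cost of six repetitive rules.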

**Cardinality impact:** Moderate. Adds up to ~6 label values per host+unit combination. In practice most services log at 1-2 levels, so the stream count increase is manageable for 16 hosts. The filtering benefit (e.g., `{level="error"}` to find all errors across the fleet) outweighs the cost.

This enables queries like:

- `{level="error"}` - all errors across the fleet
- `{level=~"critical|error", tier="prod"}` - prod errors and criticals
- `{level="warning", role="dns"}` - warnings from DNS servers

### 6. Monitoring CNAME for Promtail Target
**Problem:** Promtail hardcodes `monitoring01.home.2rjus.net:3100`. The VictoriaMetrics migration plan already addresses this by switching to a `monitoring` CNAME.

**Recommendation:** This should happen as part of the monitoring02 migration, not independently. If we do Loki improvements before that migration, keep pointing to monitoring01.
## Priority Ranking

| # | Improvement | Effort | Impact | Recommendation |
|---|-------------|--------|--------|----------------|
| 1 | **Retention policy** | Low | High | Do first - prevents disk exhaustion |
| 2 | **Limits config** | Low | Medium | Do with retention - minimal additional effort |
| 3 | **Promtail label fix** | Trivial | Low | Quick fix, do with other label changes |
| 4 | **Journal priority → level** | Low-medium | Medium | Reliable level filtering across the fleet |
| 5 | **Monitoring CNAME** | Low | Medium | Part of monitoring02 migration |

## Implementation Steps
### Phase 1: Retention + Limits (quick win)

1. Add `compactor` section to `services/monitoring/loki.nix`
2. Add `limits_config` with 30-day retention and basic rate limits
3. Update `system/monitoring/logs.nix`:
   - Fix `hostname` → `host` label in varlog scrape config
   - Add `tier` static label from `config.homelab.host.tier` to both scrape configs
   - Add `role` static label from `config.homelab.host.role` (conditionally, only when set) to both scrape configs
   - Add pipeline stages to journal scrape config: `json` to extract PRIORITY, `template` to map to level name, `labels` to attach as `level`
4. Deploy to monitoring01, verify compactor runs and old data gets cleaned
5. Verify `level` label works: `{level="error"}` should return results, including entries where `detected_level="unknown"`
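
The pipeline stages in step 3 could take roughly this shape. A sketch, not a tested config: it assumes the journal messages arrive as JSON (they do, per the current setup), that the `template` stage can reference the extracted `priority` key, and that Promtail's YAML stage schema translates attribute-for-attribute into Nix:

```nix
# Hypothetical pipeline_stages for the journal scrape config:
# `json` extracts PRIORITY into the extracted map, `template` maps the
# numeric string to a level name, `labels` attaches it as `level`.
pipeline_stages = [
  { json = { expressions = { priority = "PRIORITY"; }; }; }
  {
    template = {
      source = "level";
      # Single-digit string comparison, so `le` works lexically for 0-2.
      template = ''{{ if le .priority "2" }}critical{{ else if eq .priority "3" }}error{{ else if eq .priority "4" }}warning{{ else if eq .priority "5" }}notice{{ else if eq .priority "6" }}info{{ else }}debug{{ end }}'';
    };
  }
  { labels = { level = null; }; } # null value = use extracted key of the same name
];
```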
### Phase 2 (future): S3 Storage Migration
Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with `object_store = "s3"` - the old filesystem period will continue serving historical data until it ages out past retention.
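
When that time comes, the new period entry might look like this (the cutover date, and of course the S3 client settings that would accompany it, are placeholders):

```nix
# Hypothetical: append a new period to schema_config.configs. Do not edit
# the existing filesystem period; it keeps serving data written before
# the cutover date until that data ages out past retention.
{
  from = "2026-06-01"; # placeholder cutover date, must be in the future at deploy time
  store = "tsdb";
  object_store = "s3";
  schema = "v13";
  index = {
    prefix = "index_";
    period = "24h";
  };
}
```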
## Open Questions

- [ ] What retention period makes sense? 30 days suggested, but could be 14d or 60d depending on disk/storage budget
- [ ] Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?
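
If per-stream retention is wanted, Loki supports it via `retention_stream` in `limits_config`; a sketch, with placeholder periods:

```nix
# Hypothetical: selector-based retention overrides. Streams matching a
# selector use its period instead of the 30d default; `priority` breaks
# ties when multiple selectors match.
limits_config = {
  retention_period = "30d";
  retention_stream = [
    { selector = ''{job="bootstrap"}''; priority = 1; period = "90d"; }
    { selector = ''{job="pipe-to-loki"}''; priority = 1; period = "90d"; }
  ];
};
```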
## Notes

- Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
- The compactor is already part of single-process Loki in recent versions - it just needs to be configured.
- S3 storage deferred until post-NAS migration when a proper solution is available.

docs/plans/monitoring-migration-victoriametrics.md

@@ -194,6 +194,7 @@ module once monitoring02 becomes the primary monitoring host.

- [ ] What disk size for monitoring02? Current 60GB may need expansion for 3+ months with VictoriaMetrics
- [ ] Which dashboards to recreate declaratively? (Review monitoring01 Grafana for current set)
- [ ] Consider replacing Promtail with Grafana Alloy (`services.alloy`, v1.12.2 in nixpkgs). Promtail is in maintenance mode and Grafana recommends Alloy as the successor. Alloy is a unified collector (logs, metrics, traces, profiles) but uses its own "River" config format instead of YAML, so less Nix-native ergonomics. Could bundle the migration with monitoring02 to consolidate disruption.

## VictoriaMetrics Service Configuration