docs: update Loki improvements plan with implementation status

Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:04:16 +01:00
parent 5d68662035
commit 0db9fc6802


@@ -8,16 +8,16 @@ The current Loki deployment on monitoring01 is functional but minimal. It lacks
**Loki** on monitoring01 (`services/monitoring/loki.nix`): **Loki** on monitoring01 (`services/monitoring/loki.nix`):
- Single-node deployment, no HA - Single-node deployment, no HA
- Filesystem storage at `/var/lib/loki/chunks` - Filesystem storage at `/var/lib/loki/chunks` (~6.8 GB as of 2026-02-13)
- TSDB index (v13 schema, 24h period) - TSDB index (v13 schema, 24h period)
- No retention policy configured (logs grow indefinitely) - 30-day compactor-based retention with basic rate limits
- No `limits_config` (no rate limiting, stream limits, or query guards)
- No caching layer - No caching layer
- Auth disabled (trusted network) - Auth disabled (trusted network)
**Promtail** on all 16 hosts (`system/monitoring/logs.nix`): **Promtail** on all 16 hosts (`system/monitoring/logs.nix`):
- Ships systemd journal (JSON) + `/var/log/**/*.log` - Ships systemd journal (JSON) + `/var/log/**/*.log`
- Labels: `host`, `job` (systemd-journal/varlog), `systemd_unit` - Labels: `hostname`, `tier`, `role`, `level`, `job` (systemd-journal/varlog), `systemd_unit`
- `level` label mapped from journal PRIORITY (critical/error/warning/notice/info/debug)
- Hardcoded to `http://monitoring01.home.2rjus.net:3100` - Hardcoded to `http://monitoring01.home.2rjus.net:3100`
**Additional log sources:** **Additional log sources:**
@@ -30,16 +30,7 @@ The current Loki deployment on monitoring01 is functional but minimal. It lacks
### 1. Retention Policy ### 1. Retention Policy
**Problem:** No retention configured. Logs accumulate until disk fills up. **Implemented.** Compactor-based retention with 30-day period. Note: Loki 3.6.3 requires `delete_request_store = "filesystem"` when retention is enabled (not documented in older guides).
**Options:**
| Approach | Config Location | How It Works |
|----------|----------------|--------------|
| **Compactor retention** | `compactor` + `limits_config` | Compactor runs periodic retention sweeps, deleting chunks older than threshold |
| **Table manager** | `table_manager` | Legacy approach, not recommended for TSDB |
**Recommendation:** Use compactor-based retention (the modern approach for TSDB/filesystem):
```nix ```nix
compactor = { compactor = {
@@ -48,24 +39,21 @@ compactor = {
retention_enabled = true; retention_enabled = true;
retention_delete_delay = "2h"; retention_delete_delay = "2h";
retention_delete_worker_count = 150; retention_delete_worker_count = 150;
delete_request_store = "filesystem";
}; };
limits_config = { limits_config = {
retention_period = "30d"; # Default retention for all tenants retention_period = "30d";
}; };
``` ```
30 days aligns with the Prometheus retention and is reasonable for a homelab. Older logs are rarely useful, and anything important can be found in journal archives on the hosts themselves.
### 2. Storage Backend ### 2. Storage Backend
**Decision:** Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available. **Decision:** Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.
### 3. Limits Configuration ### 3. Limits Configuration
**Problem:** No rate limiting or stream cardinality protection. A misbehaving service could generate excessive logs and overwhelm Loki. **Implemented.** Basic guardrails added alongside retention in `limits_config`:
**Recommendation:** Add basic guardrails:
```nix ```nix
limits_config = { limits_config = {
@@ -78,8 +66,6 @@ limits_config = {
}; };
``` ```
These are generous limits that shouldn't affect normal operation but protect against runaway log generators.
### 4. Promtail Label Improvements ### 4. Promtail Label Improvements
**Problem:** Label inconsistencies and missing useful metadata: **Problem:** Label inconsistencies and missing useful metadata:
@@ -101,11 +87,7 @@ This enables queries like:
### 5. Journal Priority → Level Label ### 5. Journal Priority → Level Label
**Problem:** Loki 3.6.3 auto-detects a `detected_level` label by parsing log message text for keywords like "INFO", "ERROR", etc. This works for applications that embed level strings in messages (Go apps, Loki itself), but **fails for traditional Unix services** that use the journal `PRIORITY` field without level text in the message. **Implemented.** Promtail pipeline stages map journal `PRIORITY` to a `level` label:
Example: NSD logs `"signal received, shutting down..."` with `PRIORITY="4"` (warning), but Loki sets `detected_level="unknown"` because the message has no level keyword. Querying `{detected_level="warn"}` misses these entirely.
**Recommendation:** Add a Promtail pipeline stage to the journal scrape config that maps the `PRIORITY` field to a `level` label:
| PRIORITY | level | | PRIORITY | level |
|----------|-------| |----------|-------|
@@ -116,11 +98,9 @@ Example: NSD logs `"signal received, shutting down..."` with `PRIORITY="4"` (war
| 6 | info | | 6 | info |
| 7 | debug | | 7 | debug |
This can be done with a `json` stage to extract PRIORITY, then a `template` + `labels` stage to map and attach it. The journal `PRIORITY` field is always present, so this gives reliable level filtering for all journal logs. Uses a `json` stage to extract PRIORITY, `template` to map to level name, and `labels` to attach it. This gives reliable level filtering for all journal logs, unlike Loki's `detected_level` which only works for apps that embed level keywords in message text.
**Cardinality impact:** Moderate. Adds up to ~6 label values per host+unit combination. In practice most services log at 1-2 levels, so the stream count increase is manageable for 16 hosts. The filtering benefit (e.g., `{level="error"}` to find all errors across the fleet) outweighs the cost. Example queries:
This enables queries like:
- `{level="error"}` - all errors across the fleet - `{level="error"}` - all errors across the fleet
- `{level=~"critical|error", tier="prod"}` - prod errors and criticals - `{level=~"critical|error", tier="prod"}` - prod errors and criticals
- `{level="warning", role="dns"}` - warnings from DNS servers - `{level="warning", role="dns"}` - warnings from DNS servers
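The `json` / `template` / `labels` chain described above could look roughly like the following in the journal scrape config. This is a minimal sketch, not the committed configuration: the attribute layout is assumed, it relies on the journal being shipped as JSON, and the collapse of priorities 0-2 into `critical` (and 5 into `notice`) is an assumption inferred from the level names listed in the current-state summary, since those table rows are not visible in this hunk.

```nix
# Hypothetical fragment of the journal scrape config in system/monitoring/logs.nix.
{
  pipeline_stages = [
    # Extract the journal PRIORITY field from the JSON log line.
    { json.expressions.priority = "PRIORITY"; }
    # Map the numeric priority (a string in journal JSON) to a level name.
    # The 0-2 -> critical grouping is an assumption, not taken from the diff.
    {
      template = {
        source = "level";
        template = ''{{ if or (eq .priority "0") (eq .priority "1") (eq .priority "2") }}critical{{ else if eq .priority "3" }}error{{ else if eq .priority "4" }}warning{{ else if eq .priority "5" }}notice{{ else if eq .priority "6" }}info{{ else if eq .priority "7" }}debug{{ end }}'';
      };
    }
    # Promote the extracted value to the `level` stream label.
    { labels.level = "level"; }
  ];
}
```

Keeping the mapping in a single `template` stage means only one extra label is attached per entry, which keeps the cardinality impact limited to the level names themselves.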
@@ -129,19 +109,39 @@ This enables queries like:
**Problem:** Many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki - `| json` cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching. **Problem:** Many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki - `| json` cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching.
**Recommendation:** Audit all configured services and enable JSON logging where supported. Candidates to check include: **Audit results (2026-02-13):**
- Caddy (already JSON by default)
- Prometheus / Alertmanager / Loki / Tempo
- Grafana
- NSD / Unbound
- Home Assistant
- NATS
- Jellyfin
- OpenBao (Vault)
- Kanidm
- Garage
For each service, check whether it supports a JSON log format option and whether enabling it would break anything (e.g., log volume increase from verbose JSON, or dashboards that parse text format). **Already logging JSON:**
- Caddy (all instances) - JSON by default for access logs
- homelab-deploy (listener/builder) - Go app, logs structured JSON
**Supports JSON, not configured (high value):**
| Service | How to enable | Config file |
|---------|--------------|-------------|
| Prometheus | `--log.format=json` | `services/monitoring/prometheus.nix` |
| Alertmanager | `--log.format=json` | `services/monitoring/prometheus.nix` |
| Loki | `--log.format=json` | `services/monitoring/loki.nix` |
| Grafana | `log.console.format = "json"` | `services/monitoring/grafana.nix` |
| Tempo | `log_format: json` in config | `services/monitoring/tempo.nix` |
| OpenBao | `log_format = "json"` | `services/vault/default.nix` |
**Supports JSON, not configured (lower value - minimal log output):**
| Service | How to enable |
|---------|--------------|
| Pyroscope | `--log.format=json` (OCI container) |
| Blackbox Exporter | `--log.format=json` |
| Node Exporter | `--log.format=json` (all 16 hosts) |
| Systemd Exporter | `--log.format=json` (all 16 hosts) |
**No JSON support (syslog/text only):**
- NSD, Unbound, OpenSSH, Mosquitto
**Needs verification:**
- Kanidm, Jellyfin, Home Assistant, Harmonia, Zigbee2MQTT, NATS
**Recommendation:** Start with the monitoring stack (Prometheus, Alertmanager, Loki, Grafana, Tempo) since they're all Go apps with a JSON log format option: `--log.format=json` for Prometheus, Alertmanager, and Loki, and a config setting for Grafana and Tempo (see the table above). Then OpenBao. The exporters are lower priority since they produce minimal log output.
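For the first batch, the changes might look something like this. It is a sketch under assumptions: the option names (`extraFlags`, `settings`) are taken from the upstream NixOS modules rather than from this repository and should be checked against the nixpkgs version in use, and the settings are shown in one attribute set only for brevity (per the table, they live in separate files).

```nix
# Hypothetical sketch; option names assumed from the upstream NixOS modules.
{
  # Flag-based services share the same switch.
  services.prometheus.extraFlags = [ "--log.format=json" ];
  services.prometheus.alertmanager.extraFlags = [ "--log.format=json" ];
  services.loki.extraFlags = [ "--log.format=json" ];

  # Grafana reads this from the [log.console] section of grafana.ini,
  # so the section name is assumed to need quoting in Nix.
  services.grafana.settings."log.console".format = "json";
}
```

After deploying, a query such as `{systemd_unit="prometheus.service"} | json` should start returning parsed fields if the change took effect.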
### 7. Monitoring CNAME for Promtail Target ### 7. Monitoring CNAME for Promtail Target
@@ -151,40 +151,46 @@ For each service, check whether it supports a JSON log format option and whether
## Priority Ranking ## Priority Ranking
| # | Improvement | Effort | Impact | Recommendation | | # | Improvement | Effort | Impact | Status |
|---|-------------|--------|--------|----------------| |---|-------------|--------|--------|--------|
| 1 | **Retention policy** | Low | High | Do first - prevents disk exhaustion | | 1 | **Retention policy** | Low | High | Done (30d compactor retention) |
| 2 | **Limits config** | Low | Medium | Do with retention - minimal additional effort | | 2 | **Limits config** | Low | Medium | Done (rate limits + stream guards) |
| 3 | **Promtail label fix** | Trivial | Low | Quick fix, do with other label changes | | 3 | **Promtail labels** | Trivial | Low | Done (hostname/tier/role/level) |
| 4 | **Journal priority → level** | Low-medium | Medium | Reliable level filtering across the fleet | | 4 | **Journal priority → level** | Low-medium | Medium | Done (pipeline stages) |
| 5 | **JSON logging audit** | Low-medium | Medium | Audit services, enable JSON where supported | | 5 | **JSON logging audit** | Low-medium | Medium | Audited, not yet enabled |
| 6 | **Monitoring CNAME** | Low | Medium | Part of monitoring02 migration | | 6 | **Monitoring CNAME** | Low | Medium | Part of monitoring02 migration |
## Implementation Steps ## Implementation Steps
### Phase 1: Retention + Limits (quick win) ### Phase 1: Retention + Labels (done 2026-02-13)
1. Add `compactor` section to `services/monitoring/loki.nix` 1. ~~Add `compactor` section to `services/monitoring/loki.nix`~~ Done
2. Add `limits_config` with 30-day retention and basic rate limits 2. ~~Add `limits_config` with 30-day retention and basic rate limits~~ Done
3. Update `system/monitoring/logs.nix`: 3. ~~Update `system/monitoring/logs.nix`~~ Done:
- ~~Fix `hostname` → `host` label in varlog scrape config~~ Done: standardized on `hostname` (matching Prometheus) - Standardized on `hostname` label (matching Prometheus) for both scrape configs
- ~~Add `tier` static label from `config.homelab.host.tier` to both scrape configs~~ Done - Added `tier` and `role` static labels from `homelab.host` options
- ~~Add `role` static label from `config.homelab.host.role` (conditionally, only when set) to both scrape configs~~ Done - Added pipeline stages for journal PRIORITY → `level` label mapping
- ~~Add pipeline stages to journal scrape config: `json` to extract PRIORITY, `template` to map to level name, `labels` to attach as `level`~~ Done 4. ~~Update `pipe-to-loki` and bootstrap scripts to use `hostname`~~ Done
4. Deploy to monitoring01, verify compactor runs and old data gets cleaned 5. ~~Deploy and verify labels~~ Done - all 15 hosts reporting with correct labels
5. Verify `level` label works: `{level="error"}` should return results, and match cases where `detected_level="unknown"`
### Phase 2 (future): S3 Storage Migration ### Phase 2: JSON Logging (not started)
Enable JSON logging on services that support it, starting with the monitoring stack:
1. Prometheus, Alertmanager, Loki (`--log.format=json`); Grafana (`log.console.format = "json"`); Tempo (`log_format: json`)
2. OpenBao (`log_format = "json"`)
3. Lower priority: exporters (node-exporter, systemd-exporter, blackbox)
### Phase 3 (future): S3 Storage Migration
Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with `object_store = "s3"` - the old filesystem period will continue serving historical data until it ages out past retention. Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with `object_store = "s3"` - the old filesystem period will continue serving historical data until it ages out past retention.
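As a rough illustration of adding a schema period, the `schema_config` could end up shaped like the sketch below. Both `from` dates and the index prefix are placeholders, not values from the actual configuration; the store, schema version, and index period mirror the current-state summary.

```nix
# Hypothetical sketch; dates and index prefix are placeholders.
{
  schema_config.configs = [
    # Existing period: left untouched so historical chunks stay readable.
    {
      from = "2024-01-01"; # placeholder for the real start date
      store = "tsdb";
      object_store = "filesystem";
      schema = "v13";
      index = { prefix = "index_"; period = "24h"; };
    }
    # New period: must start on a future date at the time of deployment.
    {
      from = "2027-01-01"; # placeholder cut-over date
      store = "tsdb";
      object_store = "s3";
      schema = "v13";
      index = { prefix = "index_"; period = "24h"; };
    }
  ];
}
```

Once the filesystem period has aged out past the 30-day retention, its entry only matters for data that no longer exists.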
## Open Questions ## Open Questions
- [ ] What retention period makes sense? 30 days suggested, but could be 14d or 60d depending on disk/storage budget
- [ ] Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)? - [ ] Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?
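If per-stream retention turns out to be wanted, Loki's `retention_stream` setting in `limits_config` is one way to express it. The selector and the 90-day period below are purely illustrative: the actual label values used by the bootstrap and `pipe-to-loki` streams are not shown in this plan.

```nix
# Hypothetical sketch; selector and period are illustrative assumptions.
{
  limits_config = {
    retention_period = "30d"; # default for all other streams
    retention_stream = [
      {
        selector = ''{job="bootstrap"}''; # assumed label value
        priority = 1;
        period = "90d";
      }
    ];
  };
}
```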
## Notes ## Notes
- Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data. - Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
- The compactor is already part of single-process Loki in recent versions - it just needs to be configured. - Loki 3.6.3 requires `delete_request_store = "filesystem"` in the compactor config when retention is enabled.
- S3 storage deferred until post-NAS migration when a proper solution is available. - S3 storage deferred until post-NAS migration when a proper solution is available.
- As of 2026-02-13, Loki uses ~6.8 GB for ~30 days of logs from 16 hosts. Prometheus uses ~7.6 GB on the same disk (33 GB total, ~8 GB free).