# Loki Setup Improvements

## Overview
The current Loki deployment on monitoring01 is functional but minimal: it lacks retention policies and rate limiting, and uses local filesystem storage. This plan evaluates improvements across several dimensions: retention management, storage backend, resource limits, and general operations.
## Current State
Loki on monitoring01 (`services/monitoring/loki.nix`):

- Single-node deployment, no HA
- Filesystem storage at `/var/lib/loki/chunks`
- TSDB index (v13 schema, 24h period)
- No retention policy configured (logs grow indefinitely)
- No `limits_config` (no rate limiting, stream limits, or query guards)
- No caching layer
- Auth disabled (trusted network)

Promtail on all 16 hosts (`system/monitoring/logs.nix`):

- Ships systemd journal (JSON) + `/var/log/**/*.log`
- Labels: `host`, `job` (systemd-journal/varlog), `systemd_unit`
- Hardcoded to `http://monitoring01.home.2rjus.net:3100`

Additional log sources:

- `pipe-to-loki` script (manual log submission, `job=pipe-to-loki`)
- Bootstrap logs from template2 (`job=bootstrap`)
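For orientation, the journal half of that Promtail setup roughly corresponds to the shape below. This is a sketch using the standard NixOS `services.promtail` module, not copied from `logs.nix`; the hostname expression is an assumption.

```nix
services.promtail.configuration.scrape_configs = [
  {
    job_name = "journal";
    journal = {
      json = true; # ship the full journal entry as JSON
      labels = {
        job = "systemd-journal";
        host = config.networking.hostName; # assumed; logs.nix may derive this differently
      };
    };
    # Promote the unit name from journal metadata to a label
    relabel_configs = [
      {
        source_labels = [ "__journal__systemd_unit" ];
        target_label = "systemd_unit";
      }
    ];
  }
];
```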
Context: The VictoriaMetrics migration plan (`docs/plans/monitoring-migration-victoriametrics.md`) includes moving Loki to monitoring02 with "same configuration as current". These improvements could be applied either before or after that migration.
## Improvement Areas

### 1. Retention Policy
Problem: No retention configured. Logs accumulate until disk fills up.
Options:
| Approach | Config Location | How It Works |
|---|---|---|
| Compactor retention | `compactor` + `limits_config` | Compactor runs periodic retention sweeps, deleting chunks older than threshold |
| Table manager | `table_manager` | Legacy approach, not recommended for TSDB |
Recommendation: Use compactor-based retention (the modern approach for TSDB/filesystem):
```nix
compactor = {
  working_directory = "/var/lib/loki/compactor";
  compaction_interval = "10m";
  retention_enabled = true;
  retention_delete_delay = "2h";
  retention_delete_worker_count = 150;
};

limits_config = {
  retention_period = "30d"; # Default retention for all tenants
};
```
30 days aligns with the Prometheus retention and is reasonable for a homelab. Older logs are rarely useful, and anything important can be found in journal archives on the hosts themselves.
### 2. Storage Backend
Decision: Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (`replication_factor = 1`) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.
### 3. Limits Configuration
Problem: No rate limiting or stream cardinality protection. A misbehaving service could generate excessive logs and overwhelm Loki.
Recommendation: Add basic guardrails:
```nix
limits_config = {
  retention_period = "30d";
  ingestion_rate_mb = 10;          # MB/s per tenant
  ingestion_burst_size_mb = 20;    # Burst allowance
  max_streams_per_user = 10000;    # Prevent label explosion
  max_query_series = 500;          # Limit query resource usage
  max_query_parallelism = 8;
};
```
These are generous limits that shouldn't affect normal operation but protect against runaway log generators.
### 4. Promtail Label Improvements
Problem: Label inconsistencies and missing useful metadata:
- The varlog scrape config uses `hostname` while the journal scrape uses `host` (different label names)
- No `tier` or `role` labels, making it hard to filter logs by deployment tier or host function
Recommendations:
- Fix varlog label: Rename `hostname` to `host` for consistency with the journal scrape config
- Add `tier` label: Static label from `config.homelab.host.tier` (test/prod) on both scrape configs
- Add `role` label: Static label from `config.homelab.host.role` on both scrape configs, only when set (10 hosts have no role, so omit it to keep labels clean)
No cardinality impact - tier and role are 1:1 with host, so they add metadata to existing streams without creating new ones.
This enables queries like:

- `{tier="prod"} |= "error"` - all errors on prod hosts
- `{role="dns"}` - all DNS server logs
- `{tier="test", job="systemd-journal"}` - journal logs from test hosts
### 5. Journal Priority → Level Label

Problem: Loki 3.6.3 auto-detects a `detected_level` label by parsing log message text for keywords like "INFO", "ERROR", etc. This works for applications that embed level strings in messages (Go apps, Loki itself), but fails for traditional Unix services that use the journal PRIORITY field without level text in the message.

Example: NSD logs "signal received, shutting down..." with `PRIORITY="4"` (warning), but Loki sets `detected_level="unknown"` because the message has no level keyword. Querying `{detected_level="warn"}` misses these entirely.
Recommendation: Add a Promtail pipeline stage to the journal scrape config that maps the PRIORITY field to a level label:
| PRIORITY | level |
|---|---|
| 0-2 | critical |
| 3 | error |
| 4 | warning |
| 5 | notice |
| 6 | info |
| 7 | debug |
This can be done with a `json` stage to extract `PRIORITY`, then a `template` + `labels` stage to map and attach it. The journal `PRIORITY` field is always present, so this gives reliable level filtering for all journal logs.
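A sketch of those stages, using the Go-template mapping of Promtail's `template` stage. It assumes the journal scrape runs with `json = true` so `PRIORITY` appears in the entry body:

```nix
pipeline_stages = [
  # Pull the journal PRIORITY field ("0".."7") into the extracted map
  { json.expressions.priority = "PRIORITY"; }
  # Map the numeric priority to a level name; 0-2 collapse to "critical"
  {
    template = {
      source = "priority";
      template = ''{{ if eq .Value "7" }}debug{{ else if eq .Value "6" }}info{{ else if eq .Value "5" }}notice{{ else if eq .Value "4" }}warning{{ else if eq .Value "3" }}error{{ else }}critical{{ end }}'';
    };
  }
  # Attach the mapped value as the `level` label
  { labels.level = "priority"; }
];
```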
Cardinality impact: Moderate. Adds up to ~6 label values per host+unit combination. In practice most services log at 1-2 levels, so the stream count increase is manageable for 16 hosts. The filtering benefit (e.g., `{level="error"}` to find all errors across the fleet) outweighs the cost.
This enables queries like:

- `{level="error"}` - all errors across the fleet
- `{level=~"critical|error", tier="prod"}` - prod errors and criticals
- `{level="warning", role="dns"}` - warnings from DNS servers
### 6. Monitoring CNAME for Promtail Target

Problem: Promtail hardcodes `monitoring01.home.2rjus.net:3100`. The VictoriaMetrics migration plan already addresses this by switching to a `monitoring` CNAME.
Recommendation: This should happen as part of the monitoring02 migration, not independently. If we do Loki improvements before that migration, keep pointing to monitoring01.
## Priority Ranking
| # | Improvement | Effort | Impact | Recommendation |
|---|---|---|---|---|
| 1 | Retention policy | Low | High | Do first - prevents disk exhaustion |
| 2 | Limits config | Low | Medium | Do with retention - minimal additional effort |
| 3 | Promtail label fix | Trivial | Low | Quick fix, do with other label changes |
| 4 | Journal priority → level | Low-medium | Medium | Reliable level filtering across the fleet |
| 5 | Monitoring CNAME | Low | Medium | Part of monitoring02 migration |
## Implementation Steps

### Phase 1: Retention + Limits (quick win)
- Add `compactor` section to `services/monitoring/loki.nix`
- Add `limits_config` with 30-day retention and basic rate limits
- Update `system/monitoring/logs.nix`:
  - Fix `hostname` → `host` label in varlog scrape config
  - Add `tier` static label from `config.homelab.host.tier` to both scrape configs
  - Add `role` static label from `config.homelab.host.role` (conditionally, only when set) to both scrape configs
  - Add pipeline stages to journal scrape config: `json` to extract PRIORITY, `template` to map to the level name, `labels` to attach it as `level`
- Deploy to monitoring01, verify the compactor runs and old data gets cleaned
- Verify the `level` label works: `{level="error"}` should return results, and match cases where `detected_level="unknown"`
### Phase 2 (future): S3 Storage Migration

Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with `object_store = "s3"` - the old filesystem period will continue serving historical data until it ages out past retention.
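When that time comes, the change would look roughly like this. Dates are illustrative placeholders, not real cutover dates, and S3 client settings are omitted:

```nix
schema_config.configs = [
  # Existing period stays untouched and keeps serving old data
  {
    from = "2024-01-01"; # placeholder for the current period's start date
    store = "tsdb";
    object_store = "filesystem";
    schema = "v13";
    index = { prefix = "index_"; period = "24h"; };
  }
  # New period appended at cutover; Loki writes here from `from` onward
  {
    from = "2026-06-01"; # placeholder cutover date
    store = "tsdb";
    object_store = "s3";
    schema = "v13";
    index = { prefix = "index_"; period = "24h"; };
  }
];
```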
## Open Questions
- What retention period makes sense? 30 days suggested, but could be 14d or 60d depending on disk/storage budget
- Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?
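If per-stream retention is wanted, Loki's `retention_stream` setting in `limits_config` supports it. A sketch, with illustrative selectors and periods:

```nix
limits_config.retention_stream = [
  # Keep manually submitted and bootstrap logs longer than the 30d default
  { selector = ''{job="pipe-to-loki"}''; priority = 1; period = "90d"; }
  { selector = ''{job="bootstrap"}'';    priority = 1; period = "90d"; }
];
```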
## Notes
- Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
- The compactor is already part of single-process Loki in recent versions - it just needs to be configured.
- S3 storage deferred until post-NAS migration when a proper solution is available.