# Loki Setup Improvements

## Overview
The current Loki deployment on monitoring01 is functional but minimal. It lacks retention policies, rate limiting, and uses local filesystem storage. This plan evaluates improvement options across several dimensions: retention management, storage backend, resource limits, and operational improvements.
## Current State

Loki on monitoring01 (`services/monitoring/loki.nix`):

- Single-node deployment, no HA
- Filesystem storage at `/var/lib/loki/chunks` (~6.8 GB as of 2026-02-13)
- TSDB index (v13 schema, 24h period)
- 30-day compactor-based retention with basic rate limits
- No caching layer
- Auth disabled (trusted network)
Promtail on all 16 hosts (`system/monitoring/logs.nix`):

- Ships systemd journal (JSON) + `/var/log/**/*.log`
- Labels: `hostname`, `tier`, `role`, `level`, `job` (systemd-journal/varlog), `systemd_unit`
- `level` label mapped from journal PRIORITY (critical/error/warning/notice/info/debug)
- Hardcoded to `http://monitoring01.home.2rjus.net:3100`

Additional log sources:

- `pipe-to-loki` script (manual log submission, `job=pipe-to-loki`)
- Bootstrap logs from template2 (`job=bootstrap`)
Context: The VictoriaMetrics migration plan (`docs/plans/monitoring-migration-victoriametrics.md`) includes moving Loki to monitoring02 with "same configuration as current". These improvements could be applied either before or after that migration.
## Improvement Areas

### 1. Retention Policy
Implemented. Compactor-based retention with a 30-day period. Note: Loki 3.6.3 requires `delete_request_store = "filesystem"` when retention is enabled (not documented in older guides).
```nix
compactor = {
  working_directory = "/var/lib/loki/compactor";
  compaction_interval = "10m";
  retention_enabled = true;
  retention_delete_delay = "2h";
  retention_delete_worker_count = 150;
  delete_request_store = "filesystem";
};

limits_config = {
  retention_period = "30d";
};
```
### 2. Storage Backend

Decision: stay with filesystem storage for now. Garage S3 was considered but ruled out: the current single-node Garage (`replication_factor = 1`) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.
### 3. Limits Configuration

Implemented. Basic guardrails added alongside retention in `limits_config`:
```nix
limits_config = {
  retention_period = "30d";
  ingestion_rate_mb = 10; # MB/s per tenant
  ingestion_burst_size_mb = 20; # burst allowance
  max_streams_per_user = 10000; # prevent label explosion
  max_query_series = 500; # limit query resource usage
  max_query_parallelism = 8;
};
```
### 4. Promtail Label Improvements

Problem: label inconsistencies and missing useful metadata:

- The `varlog` scrape config uses `hostname` while the journal scrape uses `host` (different label names)
- No `tier` or `role` labels, making it hard to filter logs by deployment tier or host function
Implemented: standardized on `hostname` to match Prometheus labels. The journal scrape previously relabeled `__journal__hostname` to `host`; now both scrape configs use a static `hostname` label from `config.networking.hostName`. The `pipe-to-loki` and bootstrap scripts were also updated to use `hostname` instead of `host`.

- Standardized label: both scrape configs use `hostname` (matching Prometheus) via a shared `hostLabels`
- Added `tier` label: static label from `config.homelab.host.tier` (test/prod) on both scrape configs
- Added `role` label: static label from `config.homelab.host.role` on both scrape configs (conditionally, only when non-null)
No cardinality impact: `tier` and `role` are 1:1 with `hostname`, so they add metadata to existing streams without creating new ones.
This enables queries like:

- `{tier="prod"} |= "error"` - all errors on prod hosts
- `{role="dns"}` - all DNS server logs
- `{tier="test", job="systemd-journal"}` - journal logs from test hosts
### 5. Journal Priority → Level Label

Implemented. Promtail pipeline stages map journal PRIORITY to a `level` label:
| PRIORITY | level |
|---|---|
| 0-2 | critical |
| 3 | error |
| 4 | warning |
| 5 | notice |
| 6 | info |
| 7 | debug |
Uses a `json` stage to extract PRIORITY, a `template` stage to map it to a level name, and a `labels` stage to attach it. This gives reliable level filtering for all journal logs, unlike Loki's `detected_level`, which only works for apps that embed level keywords in the message text.
Example queries:

- `{level="error"}` - all errors across the fleet
- `{level=~"critical|error", tier="prod"}` - prod errors and criticals
- `{level="warning", role="dns"}` - warnings from DNS servers
### 6. Enable JSON Logging on Services

Problem: many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki: `| json` cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching.
Audit results (2026-02-13):

Already logging JSON:

- Caddy (all instances) - JSON by default for access logs
- homelab-deploy (listener/builder) - Go app, logs structured JSON

Supports JSON, not configured (high value):
| Service | How to enable | Config file |
|---|---|---|
| Prometheus | `--log.format=json` | `services/monitoring/prometheus.nix` |
| Alertmanager | `--log.format=json` | `services/monitoring/prometheus.nix` |
| Loki | `--log.format=json` | `services/monitoring/loki.nix` |
| Grafana | `log.console.format = "json"` | `services/monitoring/grafana.nix` |
| Tempo | `log_format: json` in config | `services/monitoring/tempo.nix` |
| OpenBao | `log_format = "json"` | `services/vault/default.nix` |
Supports JSON, not configured (lower value - minimal log output):

| Service | How to enable |
|---|---|
| Pyroscope | `--log.format=json` (OCI container) |
| Blackbox Exporter | `--log.format=json` |
| Node Exporter | `--log.format=json` (all 16 hosts) |
| Systemd Exporter | `--log.format=json` (all 16 hosts) |
No JSON support (syslog/text only):

- NSD, Unbound, OpenSSH, Mosquitto

Needs verification:

- Kanidm, Jellyfin, Home Assistant, Harmonia, Zigbee2MQTT, NATS
Recommendation: start with the monitoring stack (Prometheus, Alertmanager, Loki, Grafana, Tempo), since they are all Go apps with the same `--log.format=json` flag. Then OpenBao. The exporters are lower priority since they produce minimal log output.
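For the NixOS-managed services this is typically a one-line change per module. A sketch assuming the upstream NixOS module options (`services.prometheus.extraFlags`, `services.prometheus.alertmanager.extraFlags`, `services.grafana.settings`); the repo's own wrapper modules may expose these differently:

```nix
{
  # Prometheus and Alertmanager take the flag directly
  services.prometheus.extraFlags = [ "--log.format=json" ];
  services.prometheus.alertmanager.extraFlags = [ "--log.format=json" ];

  # Grafana: renders as [log.console] format = json in grafana.ini
  services.grafana.settings."log.console".format = "json";
}
```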
### 7. Monitoring CNAME for Promtail Target

Problem: Promtail hardcodes `monitoring01.home.2rjus.net:3100`. The VictoriaMetrics migration plan already addresses this by switching to a `monitoring` CNAME.

Recommendation: this should happen as part of the monitoring02 migration, not independently. If Loki improvements land before that migration, keep pointing at monitoring01.
## Priority Ranking
| # | Improvement | Effort | Impact | Status |
|---|---|---|---|---|
| 1 | Retention policy | Low | High | Done (30d compactor retention) |
| 2 | Limits config | Low | Medium | Done (rate limits + stream guards) |
| 3 | Promtail labels | Trivial | Low | Done (hostname/tier/role/level) |
| 4 | Journal priority → level | Low-medium | Medium | Done (pipeline stages) |
| 5 | JSON logging audit | Low-medium | Medium | Audited, not yet enabled |
| 6 | Monitoring CNAME | Low | Medium | Part of monitoring02 migration |
## Implementation Steps

### Phase 1: Retention + Labels (done 2026-02-13)
- [x] Add `compactor` section to `services/monitoring/loki.nix`
- [x] Add `limits_config` with 30-day retention and basic rate limits
- [x] Update `system/monitoring/logs.nix`:
  - Standardized on `hostname` label (matching Prometheus) for both scrape configs
  - Added `tier` and `role` static labels from `homelab.host` options
  - Added pipeline stages for journal PRIORITY → `level` label mapping
- [x] Update `pipe-to-loki` and bootstrap scripts to use `hostname`
- [x] Deploy and verify labels - all 15 hosts reporting with correct labels
### Phase 2: JSON Logging (not started)

Enable JSON logging on services that support it, starting with the monitoring stack:

- Prometheus, Alertmanager, Loki, Grafana, Tempo (`--log.format=json`)
- OpenBao (`log_format = "json"`)
- Lower priority: exporters (node-exporter, systemd-exporter, blackbox)
### Phase 3 (future): S3 Storage Migration

Revisit after the NAS migration, when a proper S3-compatible storage solution is available. At that point, add a new schema period with `object_store = "s3"`; the old filesystem period will continue serving historical data until it ages out past retention.
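Schematically, the cutover would add a second entry to `schema_config.configs` alongside the existing one. The dates below are placeholders, not decisions, and the existing period's `from` must match whatever is deployed today:

```nix
schema_config.configs = [
  # Existing period: filesystem storage (must stay unchanged)
  {
    from = "2024-01-01"; # placeholder for the current period's start date
    store = "tsdb";
    object_store = "filesystem";
    schema = "v13";
    index = { prefix = "index_"; period = "24h"; };
  }
  # New period: chunks written to S3 from this date forward
  {
    from = "2026-06-01"; # placeholder for a future cutover date
    store = "tsdb";
    object_store = "s3";
    schema = "v13";
    index = { prefix = "index_"; period = "24h"; };
  }
];
```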
## Open Questions

- Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki logs longer)?
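If so, Loki supports per-stream overrides via `retention_stream` in `limits_config`, matched by label selector. A sketch; the selectors and periods here are illustrative, not proposals:

```nix
limits_config = {
  retention_period = "30d"; # default for everything else
  retention_stream = [
    # Keep bootstrap and manually submitted logs longer (illustrative periods)
    { selector = ''{job="bootstrap"}''; priority = 1; period = "90d"; }
    { selector = ''{job="pipe-to-loki"}''; priority = 1; period = "90d"; }
  ];
};
```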
## Notes

- Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
- Loki 3.6.3 requires `delete_request_store = "filesystem"` in the compactor config when retention is enabled.
- S3 storage deferred until after the NAS migration, when a proper solution is available.
- As of 2026-02-13, Loki uses ~6.8 GB for ~30 days of logs from 16 hosts. Prometheus uses ~7.6 GB on the same disk (33 GB total, ~8 GB free).