nixos-servers/docs/plans/loki-improvements.md
Torjus Håkestad 2f0dad1acc
docs: add JSON logging audit to Loki improvements plan
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:44:05 +01:00


Loki Setup Improvements

Overview

The current Loki deployment on monitoring01 is functional but minimal: it uses local filesystem storage and lacks retention policies and rate limiting. This plan evaluates improvement options across several dimensions: retention management, storage backend, resource limits, and operational improvements.

Current State

Loki on monitoring01 (services/monitoring/loki.nix):

  • Single-node deployment, no HA
  • Filesystem storage at /var/lib/loki/chunks
  • TSDB index (v13 schema, 24h period)
  • No retention policy configured (logs grow indefinitely)
  • No limits_config (no rate limiting, stream limits, or query guards)
  • No caching layer
  • Auth disabled (trusted network)

Promtail on all 16 hosts (system/monitoring/logs.nix):

  • Ships systemd journal (JSON) + /var/log/**/*.log
  • Labels: host, job (systemd-journal/varlog), systemd_unit
  • Hardcoded to http://monitoring01.home.2rjus.net:3100

Additional log sources:

  • pipe-to-loki script (manual log submission, job=pipe-to-loki)
  • Bootstrap logs from template2 (job=bootstrap)

Context: The VictoriaMetrics migration plan (docs/plans/monitoring-migration-victoriametrics.md) includes moving Loki to monitoring02 with "same configuration as current". These improvements could be applied either before or after that migration.

Improvement Areas

1. Retention Policy

Problem: No retention configured. Logs accumulate until disk fills up.

Options:

| Approach | Config location | How it works |
|---|---|---|
| Compactor retention | compactor + limits_config | Compactor runs periodic retention sweeps, deleting chunks older than the threshold |
| Table manager | table_manager | Legacy approach, not recommended for TSDB |

Recommendation: Use compactor-based retention (the modern approach for TSDB/filesystem):

```nix
compactor = {
  working_directory = "/var/lib/loki/compactor";
  compaction_interval = "10m";
  retention_enabled = true;
  retention_delete_delay = "2h";
  retention_delete_worker_count = 150;
};

limits_config = {
  retention_period = "30d";  # Default retention for all tenants
};
```

30 days aligns with the Prometheus retention and is reasonable for a homelab. Older logs are rarely useful, and anything important can be found in journal archives on the hosts themselves.

2. Storage Backend

Decision: Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.

3. Limits Configuration

Problem: No rate limiting or stream cardinality protection. A misbehaving service could generate excessive logs and overwhelm Loki.

Recommendation: Add basic guardrails:

```nix
limits_config = {
  retention_period = "30d";
  ingestion_rate_mb = 10;           # MB/s per tenant
  ingestion_burst_size_mb = 20;     # Burst allowance
  max_streams_per_user = 10000;     # Prevent label explosion
  max_query_series = 500;           # Limit query resource usage
  max_query_parallelism = 8;
};
```

These are generous limits that shouldn't affect normal operation but protect against runaway log generators.

4. Promtail Label Improvements

Problem: Label inconsistencies and missing useful metadata:

  • The varlog scrape config uses hostname while journal uses host (different label name)
  • No tier or role labels, making it hard to filter logs by deployment tier or host function

Recommendations:

  1. Fix varlog label: Rename hostname to host for consistency with journal scrape config
  2. Add tier label: Static label from config.homelab.host.tier (test/prod) on both scrape configs
  3. Add role label: Static label from config.homelab.host.role on both scrape configs, only when set (10 hosts have no role, so omit to keep labels clean)

No cardinality impact - tier and role are 1:1 with host, so they add metadata to existing streams without creating new ones.
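A minimal sketch of these label changes, assuming the scrape configs in system/monitoring/logs.nix are built from a Nix attrset (exact option paths and module shape may differ from the real file):

```nix
{ config, lib, ... }:
let
  inherit (config.homelab.host) tier;       # "test" or "prod"
  role = config.homelab.host.role or null;  # unset on most hosts
  # Shared static labels for both the journal and varlog scrape configs;
  # role is attached only when it is actually set
  commonLabels = {
    host = config.networking.hostName;      # varlog previously used `hostname`
    inherit tier;
  } // lib.optionalAttrs (role != null) { inherit role; };
in
{
  services.promtail.configuration.scrape_configs = [
    {
      job_name = "varlog";
      static_configs = [{
        targets = [ "localhost" ];
        labels = commonLabels // {
          job = "varlog";
          __path__ = "/var/log/**/*.log";
        };
      }];
    }
    # ...journal scrape config gets the same commonLabels
  ];
}
```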

This enables queries like:

  • {tier="prod"} |= "error" - all errors on prod hosts
  • {role="dns"} - all DNS server logs
  • {tier="test", job="systemd-journal"} - journal logs from test hosts

5. Journal Priority → Level Label

Problem: Loki 3.6.3 auto-detects a detected_level label by parsing log message text for keywords like "INFO", "ERROR", etc. This works for applications that embed level strings in messages (Go apps, Loki itself), but fails for traditional Unix services that use the journal PRIORITY field without level text in the message.

Example: NSD logs "signal received, shutting down..." with PRIORITY="4" (warning), but Loki sets detected_level="unknown" because the message has no level keyword. Querying {detected_level="warn"} misses these entirely.

Recommendation: Add a Promtail pipeline stage to the journal scrape config that maps the PRIORITY field to a level label:

| PRIORITY | level |
|---|---|
| 0-2 | critical |
| 3 | error |
| 4 | warning |
| 5 | notice |
| 6 | info |
| 7 | debug |

This can be done with a json stage to extract PRIORITY, then a template + labels stage to map and attach it. The journal PRIORITY field is always present, so this gives reliable level filtering for all journal logs.
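A sketch of such a pipeline in Nix, assuming the journal scrape config forwards entries as JSON (Promtail's `json: true` journal option) so that PRIORITY is visible to a json stage; the exact option paths are illustrative:

```nix
pipeline_stages = [
  # Extract the numeric journal priority from the JSON-formatted entry
  { json.expressions.priority = "PRIORITY"; }
  # Map "0".."7" to a level name; PRIORITY is a single-digit string,
  # so lexicographic comparison with `le` is safe
  {
    template = {
      source = "priority";
      template = ''{{ if le .Value "2" }}critical{{ else if eq .Value "3" }}error{{ else if eq .Value "4" }}warning{{ else if eq .Value "5" }}notice{{ else if eq .Value "6" }}info{{ else }}debug{{ end }}'';
    };
  }
  # Attach the mapped value as the `level` label
  { labels.level = "priority"; }
];
```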

Cardinality impact: Moderate. Adds up to ~6 label values per host+unit combination. In practice most services log at 1-2 levels, so the stream count increase is manageable for 16 hosts. The filtering benefit (e.g., {level="error"} to find all errors across the fleet) outweighs the cost.

This enables queries like:

  • {level="error"} - all errors across the fleet
  • {level=~"critical|error", tier="prod"} - prod errors and criticals
  • {level="warning", role="dns"} - warnings from DNS servers

6. Enable JSON Logging on Services

Problem: Many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki - | json cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching.

Recommendation: Audit all configured services and enable JSON logging where supported. Candidates to check include:

  • Caddy (already JSON by default)
  • Prometheus / Alertmanager / Loki / Tempo
  • Grafana
  • NSD / Unbound
  • Home Assistant
  • NATS
  • Jellyfin
  • OpenBao (Vault)
  • Kanidm
  • Garage

For each service, check whether it supports a JSON log format option and whether enabling it would break anything (e.g., log volume increase from verbose JSON, or dashboards that parse text format).
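As a sketch of the shape of these changes, two services where a JSON option is known to exist (verify the option names against the deployed versions before relying on them):

```nix
# Loki's own server block supports logfmt/json log output
services.loki.configuration.server.log_format = "json";

# Prometheus takes a command-line flag, passed via extraFlags in NixOS
services.prometheus.extraFlags = [ "--log.format=json" ];
```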

7. Monitoring CNAME for Promtail Target

Problem: Promtail hardcodes monitoring01.home.2rjus.net:3100. The VictoriaMetrics migration plan already addresses this by switching to a monitoring CNAME.

Recommendation: This should happen as part of the monitoring02 migration, not independently. If we do Loki improvements before that migration, keep pointing to monitoring01.

Priority Ranking

| # | Improvement | Effort | Impact | Recommendation |
|---|---|---|---|---|
| 1 | Retention policy | Low | High | Do first - prevents disk exhaustion |
| 2 | Limits config | Low | Medium | Do with retention - minimal additional effort |
| 3 | Promtail label fix | Trivial | Low | Quick fix, do with other label changes |
| 4 | Journal priority → level | Low-medium | Medium | Reliable level filtering across the fleet |
| 5 | JSON logging audit | Low-medium | Medium | Audit services, enable JSON where supported |
| 6 | Monitoring CNAME | Low | Medium | Part of monitoring02 migration |

Implementation Steps

Phase 1: Retention + Limits (quick win)

  1. Add compactor section to services/monitoring/loki.nix
  2. Add limits_config with 30-day retention and basic rate limits
  3. Update system/monitoring/logs.nix:
    • Rename the hostname label to host in the varlog scrape config
    • Add tier static label from config.homelab.host.tier to both scrape configs
    • Add role static label from config.homelab.host.role (conditionally, only when set) to both scrape configs
    • Add pipeline stages to journal scrape config: json to extract PRIORITY, template to map to level name, labels to attach as level
  4. Deploy to monitoring01, verify compactor runs and old data gets cleaned
  5. Verify the level label works: {level="error"} should return results, including entries where Loki's detected_level is "unknown"

Phase 2 (future): S3 Storage Migration

Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with object_store = "s3" - the old filesystem period will continue serving historical data until it ages out past retention.
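A sketch of what that future schema change might look like; the from dates, and the S3 details they imply, are placeholders, not decisions:

```nix
services.loki.configuration.schema_config.configs = [
  {
    # Existing filesystem period, kept to serve historical data
    from = "2024-01-01";
    store = "tsdb";
    object_store = "filesystem";
    schema = "v13";
    index = { prefix = "index_"; period = "24h"; };
  }
  {
    # New period starting at the cutover date
    from = "2026-06-01";
    store = "tsdb";
    object_store = "s3";
    schema = "v13";
    index = { prefix = "index_"; period = "24h"; };
  }
];
```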

Open Questions

  • What retention period makes sense? 30 days suggested, but could be 14d or 60d depending on disk/storage budget
  • Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?
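
If per-stream retention turns out to be wanted, Loki supports retention_stream overrides inside limits_config; a sketch with illustrative periods:

```nix
limits_config = {
  retention_period = "30d";  # default for everything else
  retention_stream = [
    { selector = ''{job="bootstrap"}'';    priority = 1; period = "90d"; }
    { selector = ''{job="pipe-to-loki"}''; priority = 1; period = "90d"; }
  ];
};
```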

Notes

  • Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
  • The compactor is already part of single-process Loki in recent versions - it just needs to be configured.
  • S3 storage deferred until post-NAS migration when a proper solution is available.