nixos-servers/docs/plans/loki-improvements.md
Torjus Håkestad 0db9fc6802
docs: update Loki improvements plan with implementation status
Mark retention, limits, labels, and level mapping as done. Add
JSON logging audit results with per-service details. Update current
state and disk usage notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:04:16 +01:00


Loki Setup Improvements

Overview

The initial Loki deployment on monitoring01 was functional but minimal: no retention policy, no rate limiting, and local filesystem storage. This plan evaluates improvement options across several dimensions: retention management, storage backend, resource limits, and operational improvements. Several of these have since been implemented; status is noted per section below.

Current State

Loki on monitoring01 (services/monitoring/loki.nix):

  • Single-node deployment, no HA
  • Filesystem storage at /var/lib/loki/chunks (~6.8 GB as of 2026-02-13)
  • TSDB index (v13 schema, 24h period)
  • 30-day compactor-based retention with basic rate limits
  • No caching layer
  • Auth disabled (trusted network)

Promtail on all 16 hosts (system/monitoring/logs.nix):

  • Ships systemd journal (JSON) + /var/log/**/*.log
  • Labels: hostname, tier, role, level, job (systemd-journal/varlog), systemd_unit
  • level label mapped from journal PRIORITY (critical/error/warning/notice/info/debug)
  • Hardcoded to http://monitoring01.home.2rjus.net:3100

Additional log sources:

  • pipe-to-loki script (manual log submission, job=pipe-to-loki)
  • Bootstrap logs from template2 (job=bootstrap)

Context: The VictoriaMetrics migration plan (docs/plans/monitoring-migration-victoriametrics.md) includes moving Loki to monitoring02 with "same configuration as current". These improvements could be applied either before or after that migration.

Improvement Areas

1. Retention Policy

Implemented. Compactor-based retention with 30-day period. Note: Loki 3.6.3 requires delete_request_store = "filesystem" when retention is enabled (not documented in older guides).

compactor = {
  working_directory = "/var/lib/loki/compactor";
  compaction_interval = "10m";
  retention_enabled = true;
  retention_delete_delay = "2h";
  retention_delete_worker_count = 150;
  delete_request_store = "filesystem";
};

limits_config = {
  retention_period = "30d";
};

2. Storage Backend

Decision: Stay with filesystem storage for now. Garage S3 was considered but ruled out - the current single-node Garage (replication_factor=1) offers no real durability benefit over local disk. S3 storage can be revisited after the NAS migration, when a more robust S3-compatible solution will likely be available.

3. Limits Configuration

Implemented. Basic guardrails added alongside retention in limits_config:

limits_config = {
  retention_period = "30d";
  ingestion_rate_mb = 10;           # MB/s per tenant
  ingestion_burst_size_mb = 20;     # Burst allowance
  max_streams_per_user = 10000;     # Prevent label explosion
  max_query_series = 500;           # Limit query resource usage
  max_query_parallelism = 8;
};

4. Promtail Label Improvements

Problem: Label inconsistencies and missing useful metadata:

  • The varlog scrape config uses hostname while journal uses host (different label name)
  • No tier or role labels, making it hard to filter logs by deployment tier or host function

Implemented: Standardized on hostname to match Prometheus labels. The journal scrape previously used a relabel from __journal__hostname to host; now both scrape configs use a static hostname label from config.networking.hostName. Also updated pipe-to-loki and bootstrap scripts to use hostname instead of host.

  1. Standardized label: Both scrape configs use hostname (matching Prometheus) via shared hostLabels
  2. Added tier label: Static label from config.homelab.host.tier (test/prod) on both scrape configs
  3. Added role label: Static label from config.homelab.host.role on both scrape configs (conditionally, only when non-null)

No cardinality impact - tier and role are 1:1 with hostname, so they add metadata to existing streams without creating new ones.
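A minimal sketch of the shared label set described above (option names like homelab.host.tier/role are taken from this document; the exact shape of hostLabels in logs.nix may differ):

```nix
# Sketch: shared static labels merged into both scrape configs.
# Assumes config.homelab.host.{tier,role} exist as described above.
let
  hostLabels = {
    hostname = config.networking.hostName;
    tier = config.homelab.host.tier;
  } // lib.optionalAttrs (config.homelab.host.role != null) {
    # role is only attached when set, per the conditional noted above
    role = config.homelab.host.role;
  };
in
{
  # e.g. in the journal scrape config:
  journal.labels = { job = "systemd-journal"; } // hostLabels;
  # and in the varlog scrape config's static_configs labels likewise.
}
```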

This enables queries like:

  • {tier="prod"} |= "error" - all errors on prod hosts
  • {role="dns"} - all DNS server logs
  • {tier="test", job="systemd-journal"} - journal logs from test hosts

5. Journal Priority → Level Label

Implemented. Promtail pipeline stages map journal PRIORITY to a level label:

PRIORITY   level
0-2        critical
3          error
4          warning
5          notice
6          info
7          debug

Uses a json stage to extract PRIORITY, template to map to level name, and labels to attach it. This gives reliable level filtering for all journal logs, unlike Loki's detected_level which only works for apps that embed level keywords in message text.
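The three stages described above can be sketched as Promtail pipeline_stages (a sketch of the approach, not the verbatim logs.nix config; the template body in particular may differ):

```nix
# Sketch: map journal PRIORITY (0-7) to a level label.
pipeline_stages = [
  # Journal entries ship as JSON; extract the PRIORITY field.
  { json.expressions.priority = "PRIORITY"; }
  # Map the numeric priority to a level name via a Go template.
  {
    template = {
      source = "level";
      template = ''
        {{- if le .priority "2" -}}critical
        {{- else if eq .priority "3" -}}error
        {{- else if eq .priority "4" -}}warning
        {{- else if eq .priority "5" -}}notice
        {{- else if eq .priority "6" -}}info
        {{- else -}}debug{{- end -}}'';
    };
  }
  # Promote the extracted value to a stream label.
  { labels.level = ""; }
];
```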

Example queries:

  • {level="error"} - all errors across the fleet
  • {level=~"critical|error", tier="prod"} - prod errors and criticals
  • {level="warning", role="dns"} - warnings from DNS servers

6. Enable JSON Logging on Services

Problem: Many services support structured JSON log output but may be using plain text by default. JSON logs are significantly easier to query in Loki - | json cleanly extracts all fields, whereas plain text requires fragile regex or pattern matching.

Audit results (2026-02-13):

Already logging JSON:

  • Caddy (all instances) - JSON by default for access logs
  • homelab-deploy (listener/builder) - Go app, logs structured JSON

Supports JSON, not configured (high value):

Service        How to enable                  Config file
Prometheus     --log.format=json              services/monitoring/prometheus.nix
Alertmanager   --log.format=json              services/monitoring/prometheus.nix
Loki           --log.format=json              services/monitoring/loki.nix
Grafana        log.console.format = "json"    services/monitoring/grafana.nix
Tempo          log_format: json in config     services/monitoring/tempo.nix
OpenBao        log_format = "json"            services/vault/default.nix

Supports JSON, not configured (lower value - minimal log output):

Service             How to enable
Pyroscope           --log.format=json (OCI container)
Blackbox Exporter   --log.format=json
Node Exporter       --log.format=json (all 16 hosts)
Systemd Exporter    --log.format=json (all 16 hosts)

No JSON support (syslog/text only):

  • NSD, Unbound, OpenSSH, Mosquitto

Needs verification:

  • Kanidm, Jellyfin, Home Assistant, Harmonia, Zigbee2MQTT, NATS

Recommendation: Start with the monitoring stack (Prometheus, Alertmanager, Loki, Grafana, Tempo) since they're all Go apps with the same --log.format=json flag. Then OpenBao. The exporters are lower priority since they produce minimal log output.
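For the monitoring stack, the change might look like the following sketch (the NixOS option paths are assumptions and should be checked against each module; Grafana takes an ini setting rather than a flag):

```nix
# Sketch: enable JSON log output on the Go monitoring daemons.
# Option paths are assumptions; verify against the actual modules.
{
  services.prometheus.extraFlags = [ "--log.format=json" ];
  services.prometheus.alertmanager.extraFlags = [ "--log.format=json" ];
  services.loki.extraFlags = [ "--log.format=json" ];
  # Grafana: [log.console] format = json in grafana.ini
  services.grafana.settings."log.console".format = "json";
}
```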

7. Monitoring CNAME for Promtail Target

Problem: Promtail hardcodes monitoring01.home.2rjus.net:3100. The VictoriaMetrics migration plan already addresses this by switching to a monitoring CNAME.

Recommendation: This should happen as part of the monitoring02 migration, not independently. If we do Loki improvements before that migration, keep pointing to monitoring01.

Priority Ranking

#  Improvement               Effort      Impact  Status
1  Retention policy          Low         High    Done (30d compactor retention)
2  Limits config             Low         Medium  Done (rate limits + stream guards)
3  Promtail labels           Trivial     Low     Done (hostname/tier/role/level)
4  Journal priority → level  Low-medium  Medium  Done (pipeline stages)
5  JSON logging audit        Low-medium  Medium  Audited, not yet enabled
6  Monitoring CNAME          Low         Medium  Part of monitoring02 migration

Implementation Steps

Phase 1: Retention + Labels (done 2026-02-13)

  1. Add compactor section to services/monitoring/loki.nix (done)
  2. Add limits_config with 30-day retention and basic rate limits (done)
  3. Update system/monitoring/logs.nix (done):
    • Standardized on hostname label (matching Prometheus) for both scrape configs
    • Added tier and role static labels from homelab.host options
    • Added pipeline stages for journal PRIORITY → level label mapping
  4. Update pipe-to-loki and bootstrap scripts to use hostname (done)
  5. Deploy and verify labels (done; all 15 hosts reporting with correct labels)

Phase 2: JSON Logging (not started)

Enable JSON logging on services that support it, starting with the monitoring stack:

  1. Prometheus, Alertmanager, Loki, Grafana, Tempo (--log.format=json)
  2. OpenBao (log_format = "json")
  3. Lower priority: exporters (node-exporter, systemd-exporter, blackbox)

Phase 3 (future): S3 Storage Migration

Revisit after NAS migration when a proper S3-compatible storage solution is available. At that point, add a new schema period with object_store = "s3" - the old filesystem period will continue serving historical data until it ages out past retention.
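A sketch of what the schema change would look like (the from dates and the existing period's fields are illustrative, not the actual config):

```nix
# Sketch: append a new schema period for S3; dates are hypothetical.
schema_config.configs = [
  {
    # Existing filesystem period, left untouched; keeps serving old data.
    from = "2024-01-01";
    store = "tsdb";
    object_store = "filesystem";
    schema = "v13";
    index = { prefix = "index_"; period = "24h"; };
  }
  {
    # New period: Loki switches to S3 for chunks written from this date.
    from = "2026-06-01";
    store = "tsdb";
    object_store = "s3";
    schema = "v13";
    index = { prefix = "index_"; period = "24h"; };
  }
];
```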

Open Questions

  • Do we want per-stream retention (e.g., keep bootstrap/pipe-to-loki longer)?
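If per-stream retention is adopted, Loki supports retention_stream overrides inside limits_config; a sketch, with illustrative periods (the 90d value is an assumption, not a decision):

```nix
# Sketch: keep bootstrap and pipe-to-loki streams longer than the
# 30d default. Periods here are hypothetical.
limits_config = {
  retention_period = "30d";
  retention_stream = [
    { selector = ''{job="bootstrap"}'';    priority = 1; period = "90d"; }
    { selector = ''{job="pipe-to-loki"}''; priority = 1; period = "90d"; }
  ];
};
```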

Notes

  • Loki schema changes require adding a new period entry (not modifying existing ones). The old period continues serving historical data.
  • Loki 3.6.3 requires delete_request_store = "filesystem" in the compactor config when retention is enabled.
  • S3 storage deferred until post-NAS migration when a proper solution is available.
  • As of 2026-02-13, Loki uses ~6.8 GB for ~30 days of logs from 16 hosts. Prometheus uses ~7.6 GB on the same disk (33 GB total, ~8 GB free).