Description
After a server reboot, if the persisted committed read offset is significantly behind the journal's log start offset (due to retention cleaning), the retry loop in LocalKafkaJournal.readNext() generates millions of redundant log entries and delays journal recovery.
see: Graylog2/support#410
Root Cause
In LocalKafkaJournal.java (lines 618-650), the readNext() method captures the original startOffset and, when read() returns an empty result, retries by incrementing from that original offset:
long failedReadOffset = startOffset; // e.g. 1,670,968
long retryReadOffset = failedReadOffset + 1;
while (messages.isEmpty() && failedReadOffset < (logEndOffset - 1)) {
    LOG.warn("Couldn't read any messages from offset <{}>...", failedReadOffset, retryReadOffset);
    messages = read(retryReadOffset, requestedMaximumCount);
    failedReadOffset++;
    retryReadOffset++;
}

However, read() (lines 686-693) silently adjusts offsets that are behind logStartOffset:
if (readOffset < logStartOffset) {
    readOffset = logStartOffset;
    maxOffset = readOffset + maximumCount;
}

The retry loop is unaware of this internal adjustment. Every retry offset (1,670,969, 1,670,970, ...) is still below logStartOffset (29,535,578), so each call to read() adjusts to the same logStartOffset, gets the same empty result, and the loop continues.
This produces approximately logStartOffset - committedOffset iterations (~27.8 million in the reported case), each generating:
- 1 WARN in readNext()
- 1 INFO in read()
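
The iteration count can be checked with a small standalone simulation. This is only a sketch and not Graylog code: the class name RetryLoopSimulation, the stub read(), and the log end offset of 30,000,000 are assumptions made for illustration. The stub models the reported behavior by returning an empty result for any requested offset below logStartOffset.

```java
import java.util.Collections;
import java.util.List;

public class RetryLoopSimulation {

    // Hypothetical stand-in for read(): any request below logStartOffset is
    // treated as empty, mirroring the reported case where the clamped read
    // keeps coming back without messages.
    static List<Long> read(long readOffset, long logStartOffset) {
        if (readOffset < logStartOffset) {
            return Collections.emptyList();
        }
        return Collections.singletonList(readOffset);
    }

    // Same shape as the retry loop quoted above, with logging replaced by a counter.
    static long countRetries(long startOffset, long logStartOffset, long logEndOffset) {
        List<Long> messages = read(startOffset, logStartOffset);
        long failedReadOffset = startOffset;
        long retryReadOffset = failedReadOffset + 1;
        long retries = 0;
        while (messages.isEmpty() && failedReadOffset < (logEndOffset - 1)) {
            retries++; // one WARN in readNext() plus one INFO in read() per pass
            messages = read(retryReadOffset, logStartOffset);
            failedReadOffset++;
            retryReadOffset++;
        }
        return retries;
    }

    public static void main(String[] args) {
        // Offsets from the report; the log end offset is an assumed value.
        long retries = countRetries(1_670_968L, 29_535_578L, 30_000_000L);
        System.out.println("retries:   " + retries);     // 27864610
        System.out.println("log lines: " + 2 * retries); // 55729220
    }
}
```

Under these assumptions the loop runs exactly logStartOffset - startOffset times, which matches the ~27.8 million figure above.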
Steps to Reproduce
- Run Graylog with an active journal long enough for retention cleaning to delete old segments (e.g., log starts at offset 29M)
- Let the committed read offset fall significantly behind the log start offset (e.g., offset ~1.6M)
- Reboot the server
- Observe massive WARN/INFO log spam from LocalKafkaJournal
Expected Behavior
When the committed offset is behind logStartOffset, the retry loop should skip directly to logStartOffset rather than incrementing one-by-one from the stale offset.
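
One possible shape for this behavior is to clamp the starting point of the retry loop to logStartOffset. The following is a minimal sketch under that assumption, not the actual fix: the class ClampedRetrySketch, the stubRead() helper, and the LongFunction parameter are hypothetical and exist only to make the example self-contained.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.LongFunction;

public class ClampedRetrySketch {

    // Hypothetical stub: requests below logStartOffset come back empty,
    // as in the reported case; anything else yields one message.
    static List<Long> stubRead(long readOffset, long logStartOffset) {
        if (readOffset < logStartOffset) {
            return Collections.emptyList();
        }
        return Collections.singletonList(readOffset);
    }

    // Retry loop with the starting offset clamped to logStartOffset.
    static long countRetries(long startOffset, long logStartOffset, long logEndOffset,
                             LongFunction<List<Long>> read) {
        List<Long> messages = read.apply(startOffset);
        // A read below logStartOffset was already served from logStartOffset,
        // so that is the offset that effectively failed.
        long failedReadOffset = Math.max(startOffset, logStartOffset);
        long retryReadOffset = failedReadOffset + 1;
        long retries = 0;
        while (messages.isEmpty() && failedReadOffset < (logEndOffset - 1)) {
            retries++;
            messages = read.apply(retryReadOffset);
            failedReadOffset++;
            retryReadOffset++;
        }
        return retries;
    }

    public static void main(String[] args) {
        long logStartOffset = 29_535_578L;
        // The log end offset of 30,000,000 is an assumed value.
        long retries = countRetries(1_670_968L, logStartOffset, 30_000_000L,
                offset -> stubRead(offset, logStartOffset));
        System.out.println("retries: " + retries); // 1
    }
}
```

With the clamp in place, the number of retries no longer depends on how far the committed offset lags behind logStartOffset; it is bounded by the distance between logStartOffset and logEndOffset.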