Excessive logging after server reboot when journal offset is outdated #25006

@AntonEbel

Description

After a server reboot, if the persisted committed read offset is significantly behind the journal's log start offset (due to retention cleaning), the retry loop in LocalKafkaJournal.readNext() generates millions of redundant log entries and delays journal recovery.

see: Graylog2/support#410

Root Cause

In LocalKafkaJournal.java, lines 618-650, the readNext() method captures the original startOffset and, when read() returns an empty result, retries by incrementing one offset at a time from that original value:

long failedReadOffset = startOffset;       // e.g. 1,670,968
long retryReadOffset = failedReadOffset + 1;

while (messages.isEmpty() && failedReadOffset < (logEndOffset - 1)) {
    LOG.warn("Couldn't read any messages from offset <{}>...", failedReadOffset, retryReadOffset);
    messages = read(retryReadOffset, requestedMaximumCount);
    failedReadOffset++;
    retryReadOffset++;
}

However, read() at lines 686-693 silently adjusts offsets that are behind logStartOffset:

if (readOffset < logStartOffset) {
    readOffset = logStartOffset;
    maxOffset = readOffset + maximumCount;
}

The retry loop is unaware of this internal adjustment. Every retry offset (1,670,969, 1,670,970, ...) is still below logStartOffset (29,535,578), so each call to read() adjusts to the same logStartOffset, gets the same empty result, and the loop continues.

This produces approximately logStartOffset - committedOffset iterations (29,535,578 - 1,670,968 = 27,864,610, i.e. ~27.9 million in the reported case), each generating:

  • 1 WARN in readNext()
  • 1 INFO in read()
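
To make the iteration count concrete, here is a minimal simulation of the retry loop quoted above. It is a sketch, not Graylog code: the class name, countRetries() and readIsEmpty() are hypothetical, and the model assumes that a read at exactly logStartOffset comes back empty (as in the report) while any later offset returns messages. The logEndOffset value in main() is also assumed.

```java
public class RetryLoopSimulation {

    /** Hypothetical model of read(): true if a read at readOffset returns no messages. */
    static boolean readIsEmpty(long readOffset, long logStartOffset) {
        // Same clamping as described in the issue: stale offsets are moved up to logStartOffset.
        long effectiveOffset = Math.max(readOffset, logStartOffset);
        // Assumption for this sketch only: a read at exactly logStartOffset is empty,
        // and any later offset yields messages.
        return effectiveOffset == logStartOffset;
    }

    /** Counts how often the retry loop body runs; each run corresponds to one WARN line. */
    static long countRetries(long startOffset, long logStartOffset, long logEndOffset) {
        boolean empty = readIsEmpty(startOffset, logStartOffset);
        long failedReadOffset = startOffset;
        long retryReadOffset = failedReadOffset + 1;
        long iterations = 0;
        while (empty && failedReadOffset < (logEndOffset - 1)) {
            iterations++;
            empty = readIsEmpty(retryReadOffset, logStartOffset);
            failedReadOffset++;
            retryReadOffset++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        // Offsets taken from the report; logEndOffset is an assumed value.
        System.out.println(countRetries(1_670_968L, 29_535_578L, 30_000_000L));
    }
}
```

Under these assumptions the loop body runs logStartOffset - startOffset + 1 times, because every retry offset up to and including logStartOffset resolves to the same clamped read.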

Steps to Reproduce

  1. Run Graylog with an active journal long enough for retention cleaning to delete old segments (e.g., log starts at offset 29M)
  2. Let the committed read offset fall significantly behind the log start offset (e.g., offset ~1.67M)
  3. Reboot the server
  4. Observe massive WARN/INFO log spam from LocalKafkaJournal

Expected Behavior

When the committed offset is behind logStartOffset, the retry loop should skip directly to logStartOffset rather than incrementing one-by-one from the stale offset.
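
One possible shape for this, as a hedged sketch rather than a proposed patch: compute the next retry offset so that a stale offset jumps straight to logStartOffset, while an up-to-date offset still advances by one. The class and the nextRetryOffset() helper are hypothetical and do not exist in LocalKafkaJournal.

```java
public class SkipToLogStartSketch {

    /**
     * Hypothetical helper: picks the offset for the next retry.
     * A stale failedReadOffset jumps to logStartOffset; otherwise it advances by one.
     */
    static long nextRetryOffset(long failedReadOffset, long logStartOffset) {
        return Math.max(failedReadOffset + 1, logStartOffset);
    }

    public static void main(String[] args) {
        // Stale committed offset from the report: jumps directly to logStartOffset.
        System.out.println(nextRetryOffset(1_670_968L, 29_535_578L)); // 29535578
        // Offset already at logStartOffset: normal one-step advance.
        System.out.println(nextRetryOffset(29_535_578L, 29_535_578L)); // 29535579
    }
}
```

With a helper like this, the number of retries spent below logStartOffset no longer depends on how far the committed offset has fallen behind.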
