Skip to content

Conversation

@dlg99
Copy link
Contributor

@dlg99 dlg99 commented Jan 8, 2025

Motivation

corrupt entry log file causes OODMEs and stuck GC.

Log (from bk 4.15.x):

2025-01-03T00:24:55,037+0000 [GarbageCollectorThread-7-1] INFO  org.apache.bookkeeper.bookie.GarbageCollectorThread - Extracting entry log meta from entryLogId: 45795
2025-01-03T00:24:55,038+0000 [GarbageCollectorThread-7-1] INFO  org.apache.bookkeeper.bookie.EntryLogger - Failed to get ledgers map index from: 45795.log : Negative position
2025-01-03T00:24:55,039+0000 [GarbageCollectorThread-7-1] ERROR org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 1936946533 byte(s) of direct memory (used: 1140850688, max: 2147483648)
        at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:845) ~[io.netty-netty-common-4.1.86.Final.jar:4.1.86.Final]
        at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:774) ~[io.netty-netty-common-4.1.86.Final.jar:4.1.86.Final]
        at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701) ~[io.netty-netty-buffer-4.1.86.Final.jar:4.1.86.Final]
        at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:690) ~[io.netty-netty-buffer-4.1.86.Final.jar:4.1.86.Final]
        at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:226) ~[io.netty-netty-buffer-4.1.86.Final.jar:4.1.86.Final]
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:144) ~[io.netty-netty-buffer-4.1.86.Final.jar:4.1.86.Final]
        at io.netty.buffer.PoolArena.reallocate(PoolArena.java:302) ~[io.netty-netty-buffer-4.1.86.Final.jar:4.1.86.Final]
        at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:122) ~[io.netty-netty-buffer-4.1.86.Final.jar:4.1.86.Final]
        at org.apache.bookkeeper.bookie.EntryLogger.scanEntryLog(EntryLogger.java:1030) ~[org.apache.bookkeeper-bookkeeper-server-4.15.4.jar:4.15.4]
        at org.apache.bookkeeper.bookie.EntryLogger.extractEntryLogMetadataByScanning(EntryLogger.java:1168) ~[org.apache.bookkeeper-bookkeeper-server-4.15.4.jar:4.15.4]
        at org.apache.bookkeeper.bookie.EntryLogger.getEntryLogMetadata(EntryLogger.java:1071) ~[org.apache.bookkeeper-bookkeeper-server-4.15.4.jar:4.15.4]
        at org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:758) ~[org.apache.bookkeeper-bookkeeper-server-4.15.4.jar:4.15.4]
        at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:411) ~[org.apache.bookkeeper-bookkeeper-server-4.15.4.jar:4.15.4]
        at org.apache.bookkeeper.bookie.GarbageCollectorThread.safeRun(GarbageCollectorThread.java:391) ~[org.apache.bookkeeper-bookkeeper-server-4.15.4.jar:4.15.4]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.15.4.jar:4.15.4]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.86.Final.jar:4.1.86.Final]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]

I don't have access to the environment, AFAIK there is enough direct memory and other entry logs can be compacted ok.
I don't know how it got corrupted.

Changes

Handle exception, log, skip the file. Similar to #3901

@dlg99 dlg99 requested review from eolivelli and hangc0276 January 8, 2025 18:09
@dlg99 dlg99 added this to the 4.18.0 milestone Jan 17, 2025
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@eolivelli eolivelli merged commit 175e294 into apache:master Feb 13, 2025
23 checks passed
zymap pushed a commit that referenced this pull request Aug 11, 2025
zymap pushed a commit that referenced this pull request Aug 11, 2025
zymap added a commit that referenced this pull request Aug 12, 2025
### Motivation

We met the GarbageCollectionThread was stopped by some runtime error, but we didn't catch it then, causing the GC to stop. 
Such as: 
#3901
#4544

In our case, the GC stopped because of the OutOfDirectMemoryException then the process stopped and the files can not be deleted. But we didn't see any error logs. This PR enhance the log info when an unhandled error happens.
We already have the [PR](#4544) fixed that. 

And another fix in this PR is to change the Exception to the Throwable in the getEntryLogMetadata.

Here is the error stack:

```
    io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 213909504 byte(s) of direct memory (used: 645922847, max: 858783744)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:880)
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:809)
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
	at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:707)
	at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:224)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:142)
	at io.netty.buffer.PoolArena.reallocate(PoolArena.java:317)
	at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:123)
	at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:305)
	at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:280)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:104)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.getEntryLogMetadata(DefaultEntryLogger.java:1060)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:678)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:365)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.lambda$triggerGC$4(GarbageCollectorThread.java:268)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
```

You can see it get much more memory used here extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109). The reason is that the header has the wrong data of the header, which should already be fixed by #4607. Then it reading with a wrong map size which could take a lot of memory.
zymap added a commit that referenced this pull request Aug 12, 2025
### Motivation

We met the GarbageCollectionThread was stopped by some runtime error, but we didn't catch it then, causing the GC to stop.
Such as:
#3901
#4544

In our case, the GC stopped because of the OutOfDirectMemoryException then the process stopped and the files can not be deleted. But we didn't see any error logs. This PR enhance the log info when an unhandled error happens.
We already have the [PR](#4544) fixed that.

And another fix in this PR is to change the Exception to the Throwable in the getEntryLogMetadata.

Here is the error stack:

```
    io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 213909504 byte(s) of direct memory (used: 645922847, max: 858783744)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:880)
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:809)
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
	at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:707)
	at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:224)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:142)
	at io.netty.buffer.PoolArena.reallocate(PoolArena.java:317)
	at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:123)
	at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:305)
	at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:280)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:104)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.getEntryLogMetadata(DefaultEntryLogger.java:1060)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:678)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:365)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.lambda$triggerGC$4(GarbageCollectorThread.java:268)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
```

You can see it get much more memory used here extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109). The reason is that the header has the wrong data of the header, which should already be fixed by #4607. Then it reading with a wrong map size which could take a lot of memory.

(cherry picked from commit e80d031)
zymap added a commit that referenced this pull request Aug 12, 2025
### Motivation

We met the GarbageCollectionThread was stopped by some runtime error, but we didn't catch it then, causing the GC to stop.
Such as:
#3901
#4544

In our case, the GC stopped because of the OutOfDirectMemoryException then the process stopped and the files can not be deleted. But we didn't see any error logs. This PR enhance the log info when an unhandled error happens.
We already have the [PR](#4544) fixed that.

And another fix in this PR is to change the Exception to the Throwable in the getEntryLogMetadata.

Here is the error stack:

```
    io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 213909504 byte(s) of direct memory (used: 645922847, max: 858783744)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:880)
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:809)
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
	at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:707)
	at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:224)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:142)
	at io.netty.buffer.PoolArena.reallocate(PoolArena.java:317)
	at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:123)
	at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:305)
	at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:280)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:104)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.getEntryLogMetadata(DefaultEntryLogger.java:1060)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:678)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:365)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.lambda$triggerGC$4(GarbageCollectorThread.java:268)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
```

You can see it get much more memory used here extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109). The reason is that the header has the wrong data of the header, which should already be fixed by #4607. Then it reading with a wrong map size which could take a lot of memory.

(cherry picked from commit e80d031)
priyanshu-ctds pushed a commit to datastax/bookkeeper that referenced this pull request Aug 20, 2025
(cherry picked from commit 175e294)
(cherry picked from commit d89930a)
priyanshu-ctds pushed a commit to datastax/bookkeeper that referenced this pull request Aug 20, 2025
### Motivation

We met the GarbageCollectionThread was stopped by some runtime error, but we didn't catch it then, causing the GC to stop.
Such as:
apache#3901
apache#4544

In our case, the GC stopped because of the OutOfDirectMemoryException then the process stopped and the files can not be deleted. But we didn't see any error logs. This PR enhance the log info when an unhandled error happens.
We already have the [PR](apache#4544) fixed that.

And another fix in this PR is to change the Exception to the Throwable in the getEntryLogMetadata.

Here is the error stack:

```
    io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 213909504 byte(s) of direct memory (used: 645922847, max: 858783744)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:880)
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:809)
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
	at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:707)
	at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:224)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:142)
	at io.netty.buffer.PoolArena.reallocate(PoolArena.java:317)
	at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:123)
	at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:305)
	at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:280)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:104)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.getEntryLogMetadata(DefaultEntryLogger.java:1060)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:678)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:365)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.lambda$triggerGC$4(GarbageCollectorThread.java:268)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
```

You can see it get much more memory used here extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109). The reason is that the header has the wrong data of the header, which should already be fixed by apache#4607. Then it reading with a wrong map size which could take a lot of memory.

(cherry picked from commit e80d031)
(cherry picked from commit 55308e9)
srinath-ctds pushed a commit to datastax/bookkeeper that referenced this pull request Aug 21, 2025
(cherry picked from commit 175e294)
(cherry picked from commit d89930a)
srinath-ctds pushed a commit to datastax/bookkeeper that referenced this pull request Aug 21, 2025
### Motivation

We met the GarbageCollectionThread was stopped by some runtime error, but we didn't catch it then, causing the GC to stop.
Such as:
apache#3901
apache#4544

In our case, the GC stopped because of the OutOfDirectMemoryException then the process stopped and the files can not be deleted. But we didn't see any error logs. This PR enhance the log info when an unhandled error happens.
We already have the [PR](apache#4544) fixed that.

And another fix in this PR is to change the Exception to the Throwable in the getEntryLogMetadata.

Here is the error stack:

```
    io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 213909504 byte(s) of direct memory (used: 645922847, max: 858783744)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:880)
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:809)
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
	at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:707)
	at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:224)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:142)
	at io.netty.buffer.PoolArena.reallocate(PoolArena.java:317)
	at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:123)
	at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:305)
	at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:280)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:104)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109)
	at org.apache.bookkeeper.bookie.DefaultEntryLogger.getEntryLogMetadata(DefaultEntryLogger.java:1060)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:678)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:365)
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.lambda$triggerGC$4(GarbageCollectorThread.java:268)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
```

You can see it get much more memory used here extractEntryLogMetadataFromIndex(DefaultEntryLogger.java:1109). The reason is that the header has the wrong data of the header, which should already be fixed by apache#4607. Then it reading with a wrong map size which could take a lot of memory.

(cherry picked from commit e80d031)
(cherry picked from commit 55308e9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants