
Fix race condition NPE in V3 response handling during timeout check#4737

Open
jprieto-temporal wants to merge 1 commit into apache:master from jprieto-temporal:jprieto/lwwyxzowllvx

Conversation

@jprieto-temporal

When BookKeeper receives a V3 protocol response from a bookie, readV3Response retrieves the pending operation's completion object from a map using a non-removing get, schedules an async handler on an executor, and then removes the entry from the map on the next line.

The async handler eventually calls release() on the completion object, which nulls out its fields and recycles it back into an object pool. If the executor finishes this work before the calling thread reaches the remove call, the entry is still in the map but points to an object whose fields have been nulled out. A periodic timeout-checking thread that scans the same map can then encounter this entry and throw an NPE when it reads a field through the null reference, as I observed in a test environment.

java.lang.NullPointerException: Cannot read field "addEntryTimeoutNanos" because "this.perChannelBookieClient" is null
	at org.apache.bookkeeper.proto.AddCompletion.maybeTimeout(AddCompletion.java:92)
	at org.apache.bookkeeper.proto.PerChannelBookieClient.lambda$static$2(PerChannelBookieClient.java:1068)
	at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap$Section.removeIf(ConcurrentOpenHashMap.java:563)
	at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap.removeIf(ConcurrentOpenHashMap.java:277)
	at org.apache.bookkeeper.proto.PerChannelBookieClient.checkTimeoutOnPendingOperations(PerChannelBookieClient.java:1072)
	at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.checkTimeoutOnPendingOperations(DefaultPerChannelBookieClientPool.java:131)
	at org.apache.bookkeeper.proto.BookieClientImpl.monitorPendingOperations(BookieClientImpl.java:595)
	at org.apache.bookkeeper.proto.BookieClientImpl.lambda$new$0(BookieClientImpl.java:130)
	at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:639)
	at org.apache.bookkeeper.common.util.SingleThreadSafeScheduledExecutorService$SafeRunnable.run(SingleThreadSafeScheduledExecutorService.java:46)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Three threads are involved:

  1. Netty I/O thread runs readV3Response: it calls get(key) and schedules the async handler, but is delayed before reaching the remove(key) call on the next line.
  2. Executor thread runs the scheduled handler, which completes the operation and calls release(), nulling out fields on the completion object. This finishes before the remove line is reached by thread 1.
  3. Also before thread 1 reaches the remove, the timeout monitor thread accesses the map with removeIf, calling maybeTimeout() on each entry.

The release() in thread 2 mutates the object's fields, not the map, so it doesn't need any map lock and can happen concurrently with thread 3 reading the same object out of the map.
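The interleaving above can be replayed deterministically on a single thread. This is a minimal, hypothetical model of the racy get-then-remove pattern; the class and field names are illustrative stand-ins, not BookKeeper's actual code:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the racy get-then-remove pattern;
// names are illustrative, not BookKeeper's actual classes.
public class RacyGetThenRemove {

    static class Completion {
        Long timeoutNanos = 100L;               // nulled out by release()
        void release() { timeoutNanos = null; } // simulates recycling to a pool
    }

    static final ConcurrentHashMap<Long, Completion> pending = new ConcurrentHashMap<>();

    // Deterministically replays the three-thread interleaving.
    static boolean simulateRace() {
        long txnId = 1L;
        pending.put(txnId, new Completion());

        // Thread 1 (Netty I/O): non-removing get; the async handler is
        // scheduled and the thread has not yet reached remove(txnId).
        Completion c = pending.get(txnId);

        // Thread 2 (executor): the scheduled handler wins the race and
        // releases the completion, nulling its fields.
        c.release();

        // Thread 3 (timeout monitor): the entry is still in the map, so
        // the scan reads a nulled field and hits an NPE.
        boolean sawNpe = false;
        try {
            pending.forEach((k, v) -> {
                long t = v.timeoutNanos; // unboxing null -> NullPointerException
            });
        } catch (NullPointerException e) {
            sawNpe = true;
        }

        // Thread 1 finally reaches remove(txnId): too late.
        pending.remove(txnId);
        return sawNpe;
    }

    public static void main(String[] args) {
        System.out.println("NPE observed: " + simulateRace());
    }
}
```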

The V2 response handler (readV2Response) already does this correctly. It atomically removes the entry from the map before scheduling async work, so the entry is gone before any executor thread can call release(). This change makes the V3 path match. V3 keys use unique transaction IDs so there is no duplicate-key concern that would require keeping the entry in the map.
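The safe ordering can be sketched as follows. This is a hypothetical illustration of the remove-before-schedule shape the fix adopts, not the actual patched BookKeeper code; handleResponse and the map/executor names are assumptions for the example:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the fixed ordering: remove the entry atomically
// before scheduling async work, mirroring what readV2Response already does.
public class RemoveThenSchedule {

    static class Completion {
        Long timeoutNanos = 100L;
        void release() { timeoutNanos = null; }
    }

    static final ConcurrentHashMap<Long, Completion> pending = new ConcurrentHashMap<>();

    static boolean handleResponse(long txnId, ExecutorService executor) {
        // Atomic removal: once this returns, no timeout scan can find the
        // entry, so release() may run at any later point without a race.
        Completion c = pending.remove(txnId);
        if (c == null) {
            return false; // already completed or timed out elsewhere
        }
        executor.execute(c::release); // async completion is now safe
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        pending.put(1L, new Completion());
        handleResponse(1L, executor);
        // The timeout monitor can no longer observe a released object:
        System.out.println("entry still in map: " + pending.containsKey(1L));
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

Because remove(key) returns null on the second delivery of the same transaction ID, a late or duplicate response simply becomes a no-op instead of touching a recycled object.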
