Fix race condition NPE in V3 response handling during timeout check #4737
Open
jprieto-temporal wants to merge 1 commit into apache:master from
When BookKeeper receives a V3 protocol response from a bookie, `readV3Response` retrieves the pending operation's completion object from a map using a non-removing `get`, schedules an async handler on an executor, and then removes the entry from the map on the next line.
The async handler eventually calls `release()` on the completion object, which nulls out its fields and recycles it back into an object pool. If the executor completes this work before the calling thread reaches the `remove` call, the entry is still in the map but points to an object with nulled-out fields. A periodic timeout-checking thread that scans the same map can then encounter this entry and throw an NPE when it tries to read a field off the null reference, as I observed in a test environment.

Three threads are involved:

1. The thread running `readV3Response` calls `get(key)`, schedules async work, and there is a delay before the next line, `remove(key)`, is reached.
2. An executor thread runs the async handler, which calls `release()`, nulling out fields on the completion object. This finishes before the `remove` line is reached by thread 1.
3. Before thread 1 reaches `remove`, the timeout monitor thread accesses the map with `removeIf`, calling `maybeTimeout()` on each entry.

The `release()` in thread 2 mutates the object's fields, not the map, so it doesn't need any map lock and can happen concurrently with thread 3 reading the same object out of the map.

The V2 response handler (`readV2Response`) already does this correctly. It atomically removes the entry from the map before scheduling async work, so the entry is gone before any executor thread can call `release()`. This change makes the V3 path match. V3 keys use unique transaction IDs, so there is no duplicate-key concern that would require keeping the entry in the map.
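
For reviewers who want to see the ordering problem in isolation, here is a minimal sketch that replays the interleaving described above on a single thread, so the outcome is deterministic. It is not the BookKeeper code: `CompletionMapSketch`, `Completion`, `startTimeNanos`, and the helper methods are hypothetical stand-ins, and the only behaviour assumed from the description is that `release()` nulls a field that `maybeTimeout()` later dereferences.

```java
import java.util.concurrent.ConcurrentHashMap;

final class CompletionMapSketch {
    /** Hypothetical stand-in for a pooled completion object. */
    static final class Completion {
        // Boxed on purpose so that release() can null it, as described above.
        volatile Long startTimeNanos = System.nanoTime();

        // Clears fields, as the real object does before being recycled.
        void release() {
            startTimeNanos = null;
        }

        // Reads a field; unboxing a null Long throws NullPointerException.
        boolean maybeTimeout(long nowNanos, long timeoutNanos) {
            return nowNanos - startTimeNanos > timeoutNanos;
        }
    }

    // Stand-in for the timeout monitor: scans the map with removeIf.
    static boolean scanThrowsNpe(ConcurrentHashMap<Long, Completion> pending) {
        try {
            pending.values().removeIf(c -> c.maybeTimeout(System.nanoTime(), Long.MAX_VALUE));
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // get, then release, then scan, then remove: the order that hits the NPE.
    static boolean getThenRemoveOrder() {
        ConcurrentHashMap<Long, Completion> pending = new ConcurrentHashMap<>();
        pending.put(1L, new Completion());
        Completion c = pending.get(1L);       // non-removing lookup
        c.release();                          // async handler finishes early
        boolean npe = scanThrowsNpe(pending); // monitor runs inside the window
        pending.remove(1L);                   // too late
        return npe;
    }

    // remove first: the entry is gone before release() can run.
    static boolean removeFirstOrder() {
        ConcurrentHashMap<Long, Completion> pending = new ConcurrentHashMap<>();
        pending.put(1L, new Completion());
        Completion c = pending.remove(1L);    // atomic claim of the entry
        c.release();
        return scanThrowsNpe(pending);        // nothing left for the monitor to read
    }

    public static void main(String[] args) {
        System.out.println("get-then-remove hits NPE: " + getThenRemoveOrder());
        System.out.println("remove-first hits NPE: " + removeFirstOrder());
    }
}
```

In the sketch, the remove-first order is safe because the monitor can only see objects that are still in the map, and a released object is never in the map at that point.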