fix(consumer): Fix deadlock when consumer event handling and kafka client polling are scheduled onto the same POSIX thread #32
Overview
Now that the consumer event handling code runs in a dedicated green thread, it turns out that it can deadlock with the green thread that polls our rdkafka client. This change runs the rdkafka client polling in a dedicated blocking thread pool, so it will never contend with our green threads.
Details
Async Rust does not rely on preemption to decide which future to execute. It uses a cooperative model: an underlying POSIX thread can only switch which green thread it executes when the currently executing green thread hits an `.await` point in its code. Therefore, any code that runs between `.await` points is guaranteed to hog that POSIX thread.

The rdkafka Rust binding library provides an async client, but the rebalance callback is not async. In order to know when the rebalance event that the rebalance callback sent to the event handler has been processed, it has to use a `sync_channel` instead of something async like a oneshot channel to signal completion. This also means that while the rebalance callback is waiting for the rendezvous, it does not yield execution.

```mermaid
sequenceDiagram
    Broker->>+pre_rebalance: Partition revocation
    pre_rebalance->>+handle_events: Write event & rendezvous sender to event sender
    handle_events->>+ActorHandles: .shutdown()
    ActorHandles->>-handle_events: .shutdown() completes
    handle_events->>pre_rebalance: Write to rendezvous sender
    deactivate handle_events
    pre_rebalance->>-Broker: Ack
```

Meanwhile, the handle_events green thread can be scheduled onto the same POSIX thread as the consumer client green thread. Since the rebalance callback is waiting for the rendezvous and does not yield execution, the handle_events green thread never gets to run. This deadlocks the consumer.
How do we prevent future problems like this?
This problem has existed ever since the consumer was merged in here. We just got lucky because the consumer polling was running on a separate POSIX thread from the event handling thread. We can reproduce this problem from earlier commits by running them with a single-threaded runtime:

```rust
#[tokio::main(flavor = "current_thread")]
```

Since this is the only non-async part of our codebase, now that it is on a dedicated thread pool, it should never happen again.
But to give us more confidence, once we have our integration test, we should also test our program with the `current_thread` tokio setting.