KAFKA-20198: Fix StickyTaskAssignor capacity calculation to use proportional instance limit#22006
Open
ChoMinGi wants to merge 2 commits intoapache:trunkfrom
Open
KAFKA-20198: Fix StickyTaskAssignor capacity calculation to use proportional instance limit#22006ChoMinGi wants to merge 2 commits intoapache:trunkfrom
ChoMinGi wants to merge 2 commits intoapache:trunkfrom
Conversation
…rtional instance limit Replace per-thread floor division with per-instance proportional ceiling in hasRoomForActiveTask. The previous floor-based calculation (capacity * floor(taskCount / totalCapacity)) underestimates the instance limit when tasks don't divide evenly across threads, causing overflow tasks to bounce between instances during cooperative rebalancing.
harimm
reviewed
Apr 10, 2026
| .collect(Collectors.toSet()) | ||
| .size(); | ||
| return newActiveTaskCount < capacity * activeTasksPerThread; | ||
| final int instanceLimit = (taskCount * capacity + totalCapacity - 1) / totalCapacity; |
There was a problem hiding this comment.
I think this needs a regression test that fails with the old quota math and passes with the new proportional ceiling.
On top of that, can we add a behavior test that simulates repeated StickyTaskAssignor.assign(...) rounds, and then assert that an uneven-capacity case converges in 2 rounds? That would cover the actual failure mode described in the PR, not just the arithmetic change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
KAFKA-20198
Saw the discussion on Confluent Slack about
StickyTaskAssignortaking many rebalance rounds to converge with the classic group protocol when scaling up. Dug into it and found the root cause inhasRoomForActiveTask.The per-instance limit is computed as
capacity * (taskCount / totalCapacity), but the integer division floors — so with 450 tasks and 20 threads, each instance gets a limit of10 * 22 = 220instead of the fair share 225. The 10-task gap pushes overflow intofindBestClientForTask, which picks differently each round due to HashMap iteration order, creating a feedback loop that prevents convergence.Fixed by computing the instance limit directly as a proportional share:
(taskCount * capacity + totalCapacity - 1) / totalCapacity, givingceil(450 * 10 / 20) = 225. Same approach asAbstractStickyAssignor.maxQuota.Simulated with the actual
StickyTaskAssignoracross cooperative rebalance rounds:All existing tests pass.