KAFKA-20198: Fix StickyTaskAssignor capacity calculation to use proportional instance limit #22006

Open
ChoMinGi wants to merge 2 commits into apache:trunk from ChoMinGi:kafka-20198-sticky-assignor-fix

ChoMinGi (Contributor) commented Apr 9, 2026

KAFKA-20198

Saw the discussion on Confluent Slack about StickyTaskAssignor taking many rebalance rounds to converge with the classic group protocol when scaling up. Dug into it and found the root cause in hasRoomForActiveTask.

The per-instance limit is computed as capacity * (taskCount / totalCapacity), but the integer division floors. With 450 tasks and 20 threads (two instances of 10 threads each), each instance gets a limit of 10 * 22 = 220 instead of its fair share of 225. The 10 tasks left over (5 per instance) overflow into findBestClientForTask, which picks differently each round due to HashMap iteration order, creating a feedback loop that delays convergence.

Fixed by computing the instance limit directly as a proportional share: (taskCount * capacity + totalCapacity - 1) / totalCapacity, giving ceil(450 * 10 / 20) = 225. Same approach as AbstractStickyAssignor.maxQuota.
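To make the arithmetic concrete, here is a minimal standalone sketch of the two calculations. The class and method names (`InstanceLimitSketch`, `floorLimit`, `ceilLimit`) are made up for illustration and are not part of StickyTaskAssignor; only the two formulas are taken from the description above.

```java
// Standalone illustration only; not the actual StickyTaskAssignor code.
final class InstanceLimitSketch {

    // Old calculation as described above: floor the per-thread share,
    // then scale by the instance's capacity (its thread count).
    static int floorLimit(final int taskCount, final int capacity, final int totalCapacity) {
        final int activeTasksPerThread = taskCount / totalCapacity; // int division floors
        return capacity * activeTasksPerThread;
    }

    // New calculation: ceil(taskCount * capacity / totalCapacity) in integer arithmetic.
    // The product is assumed to fit in an int for the sizes discussed here.
    static int ceilLimit(final int taskCount, final int capacity, final int totalCapacity) {
        return (taskCount * capacity + totalCapacity - 1) / totalCapacity;
    }

    public static void main(final String[] args) {
        // 450 tasks, 10 threads on this instance, 20 threads in total.
        System.out.println(floorLimit(450, 10, 20)); // 220
        System.out.println(ceilLimit(450, 10, 20));  // 225
    }
}
```

When tasks divide evenly across threads (for example 400 tasks over 20 threads), both methods return the same limit, which matches the unchanged "Even cases" result.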

Simulated with the actual StickyTaskAssignor across cooperative rebalance rounds:

| Scenario | Before | After |
|---|---|---|
| 450p/10t | 5 rounds | 2 rounds |
| 100p/4t | 4 rounds | 2 rounds |
| 200p/8t | 6 rounds | 2 rounds |
| 1000p/16t | 9 rounds | 5 rounds |
| 500p/12t | 9 rounds | 3 rounds |
| Even cases | 2 rounds | 2 rounds |

All existing tests pass.

…rtional instance limit

Replace per-thread floor division with per-instance proportional ceiling
in hasRoomForActiveTask. The previous floor-based calculation
(capacity * floor(taskCount / totalCapacity)) underestimates the
instance limit when tasks don't divide evenly across threads, causing
overflow tasks to bounce between instances during cooperative
rebalancing.
github-actions bot added the streams, small (Small PRs), and triage (PRs from the community) labels on Apr 9, 2026
          .collect(Collectors.toSet())
          .size();
-     return newActiveTaskCount < capacity * activeTasksPerThread;
+     final int instanceLimit = (taskCount * capacity + totalCapacity - 1) / totalCapacity;

I think this needs a regression test that fails with the old quota math and passes with the new proportional ceiling.

On top of that, can we add a behavior test that simulates repeated StickyTaskAssignor.assign(...) rounds, and then assert that an uneven-capacity case converges in 2 rounds? That would cover the actual failure mode described in the PR, not just the arithmetic change.
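As a starting point for the arithmetic half of that request, one property that separates the two formulas is whether the per-instance limits add up to at least the number of tasks. The sketch below is standalone and does not call the real assignor; `QuotaCoverageSketch` and its methods are hypothetical names, and the uneven-capacity case (3 and 5 threads) is an assumed example rather than one from the PR.

```java
// Standalone property check; not wired into the real StickyTaskAssignor tests.
final class QuotaCoverageSketch {

    // Sum of per-instance limits under the old floor-based formula.
    static int totalFloorQuota(final int taskCount, final int[] capacities) {
        final int totalCapacity = sum(capacities);
        int total = 0;
        for (final int capacity : capacities) {
            total += capacity * (taskCount / totalCapacity);
        }
        return total;
    }

    // Sum of per-instance limits under the proportional ceiling formula.
    static int totalCeilQuota(final int taskCount, final int[] capacities) {
        final int totalCapacity = sum(capacities);
        int total = 0;
        for (final int capacity : capacities) {
            total += (taskCount * capacity + totalCapacity - 1) / totalCapacity;
        }
        return total;
    }

    private static int sum(final int[] values) {
        int total = 0;
        for (final int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(final String[] args) {
        final int[] even = {10, 10};
        final int[] uneven = {3, 5}; // assumed example of uneven capacity
        // Floor limits leave tasks without room; ceiling limits cover all of them.
        System.out.println(totalFloorQuota(450, even) + " vs " + totalCeilQuota(450, even));
        System.out.println(totalFloorQuota(100, uneven) + " vs " + totalCeilQuota(100, uneven));
    }
}
```

A real regression test would presumably assert this through hasRoomForActiveTask itself, and the multi-round convergence check would still need the actual assign(...) loop.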

github-actions bot removed the triage (PRs from the community) label on Apr 11, 2026