KAFKA-20198: Fix StickyTaskAssignor capacity calculation to use proportional instance limit by ChoMinGi · Pull Request #22006 · apache/kafka

ChoMinGi · 2026-04-09T06:42:21Z

KAFKA-20198

Saw the discussion on Confluent Slack about StickyTaskAssignor taking many rebalance rounds to converge with the classic group protocol when scaling up. Dug into it and found the root cause in hasRoomForActiveTask.

The per-instance limit is computed as capacity * (taskCount / totalCapacity), but the integer division floors. With 450 tasks and 20 threads (two instances of 10 threads each), each instance gets a limit of 10 * 22 = 220 instead of the fair share of 225. The resulting 10-task gap (5 per instance) pushes overflow into findBestClientForTask, which picks differently each round due to HashMap iteration order, creating a feedback loop that slows convergence.

Fixed by computing the instance limit directly as a proportional share: (taskCount * capacity + totalCapacity - 1) / totalCapacity, giving ceil(450 * 10 / 20) = 225. Same approach as AbstractStickyAssignor.maxQuota.
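To make the arithmetic concrete, here is a minimal standalone sketch comparing the two calculations for the 450-task, 20-thread example. The class and method names (`InstanceLimitSketch`, `floorBasedLimit`, `proportionalLimit`) are made up for illustration and are not part of StickyTaskAssignor; only the two formulas come from the description above.

```java
// Illustrative sketch only; not the StickyTaskAssignor source.
final class InstanceLimitSketch {

    // Old calculation as described above: floor per thread, then scale by capacity.
    static int floorBasedLimit(final int taskCount, final int capacity, final int totalCapacity) {
        final int tasksPerThread = taskCount / totalCapacity; // integer division floors
        return capacity * tasksPerThread;
    }

    // New calculation: ceil(taskCount * capacity / totalCapacity) using integer arithmetic.
    static int proportionalLimit(final int taskCount, final int capacity, final int totalCapacity) {
        return (taskCount * capacity + totalCapacity - 1) / totalCapacity;
    }

    public static void main(final String[] args) {
        final int taskCount = 450;
        final int capacity = 10;      // threads on one instance
        final int totalCapacity = 20; // threads across all instances
        System.out.println("floor-based:  " + floorBasedLimit(taskCount, capacity, totalCapacity));
        System.out.println("proportional: " + proportionalLimit(taskCount, capacity, totalCapacity));
    }
}
```

For these inputs the floor-based limit is 220 and the proportional limit is 225. When tasks divide evenly across threads (for example 400 tasks over 20 threads), both formulas give the same result.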

Simulated with the actual StickyTaskAssignor across cooperative rebalance rounds:

| Scenario | Before | After |
|---|---|---|
| 450p/10t | 5 rounds | 2 rounds |
| 100p/4t | 4 rounds | 2 rounds |
| 200p/8t | 6 rounds | 2 rounds |
| 1000p/16t | 9 rounds | 5 rounds |
| 500p/12t | 9 rounds | 3 rounds |
| Even cases | 2 rounds | 2 rounds |

All existing tests pass.

…rtional instance limit

Replace per-thread floor division with per-instance proportional ceiling in hasRoomForActiveTask. The previous floor-based calculation (capacity * floor(taskCount / totalCapacity)) underestimates the instance limit when tasks don't divide evenly across threads, causing overflow tasks to bounce between instances during cooperative rebalancing.

harimm · 2026-04-10T15:23:47Z

...rc/main/java/org/apache/kafka/streams/processor/assignment/assignors/StickyTaskAssignor.java

                .collect(Collectors.toSet())
                .size();
-            return newActiveTaskCount < capacity * activeTasksPerThread;
+            final int instanceLimit = (taskCount * capacity + totalCapacity - 1) / totalCapacity;


I think this needs a regression test that fails with the old quota math and passes with the new proportional ceiling.

On top of that, can we add a behavior test that simulates repeated StickyTaskAssignor.assign(...) rounds, and then assert that an uneven-capacity case converges in 2 rounds? That would cover the actual failure mode described in the PR, not just the arithmetic change.
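The arithmetic half of this request could be pinned down with a property along these lines: the per-instance limits, summed over all instances, should leave room for every task. The sketch below is a standalone check, not a test against the real assignor. The class name, the helper `totalRoom`, and the capacity arrays are hypothetical, and it does not call StickyTaskAssignor.assign(...) or measure convergence rounds.

```java
// Hypothetical standalone check; it does not use the Kafka Streams test harness.
final class InstanceLimitRegressionSketch {

    // Old quota math: capacity * floor(taskCount / totalCapacity).
    static int oldLimit(final int taskCount, final int capacity, final int totalCapacity) {
        return capacity * (taskCount / totalCapacity);
    }

    // New quota math: ceil(taskCount * capacity / totalCapacity).
    static int newLimit(final int taskCount, final int capacity, final int totalCapacity) {
        return (taskCount * capacity + totalCapacity - 1) / totalCapacity;
    }

    // Sum of per-instance limits, i.e. how many tasks fit before any overflow handling.
    static int totalRoom(final int taskCount, final int[] capacities, final boolean proportional) {
        int totalCapacity = 0;
        for (final int capacity : capacities) {
            totalCapacity += capacity;
        }
        int room = 0;
        for (final int capacity : capacities) {
            room += proportional
                ? newLimit(taskCount, capacity, totalCapacity)
                : oldLimit(taskCount, capacity, totalCapacity);
        }
        return room;
    }

    public static void main(final String[] args) {
        final int taskCount = 450;
        final int[] capacities = {10, 10}; // two instances with 10 threads each

        // With the old math the limits do not cover all tasks, so some overflow.
        if (totalRoom(taskCount, capacities, false) >= taskCount) {
            throw new AssertionError("expected the floor-based limits to fall short");
        }
        // With the proportional ceiling every task fits within some instance's limit.
        if (totalRoom(taskCount, capacities, true) < taskCount) {
            throw new AssertionError("expected the proportional limits to cover all tasks");
        }
        System.out.println("property holds for " + taskCount + " tasks");
    }
}
```

A check like this would fail if the floor-based formula were reintroduced. The multi-round convergence behavior would still need a separate test that drives the assignor itself.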


github-actions bot added the streams, small (Small PRs), and triage (PRs from the community) labels on Apr 9, 2026

harimm reviewed Apr 10, 2026


KAFKA-20198: Add regression and convergence tests for proportional in…

884e21e

…stance limit

github-actions bot removed the triage (PRs from the community) label on Apr 11, 2026

ChoMinGi wants to merge 2 commits into apache:trunk from ChoMinGi:kafka-20198-sticky-assignor-fix


2 participants
