
refactor: Separate TaskPool abstraction from Scheduler #7750

Draft
Shaddoll wants to merge 2 commits into cadence-workflow:master from Shaddoll:scheduler

Conversation

@Shaddoll

What changed?
Introduce TaskPool interface to decouple task storage and scheduling
logic from task execution and dispatch logic in the Weighted Round
Robin Scheduler.

Key changes:

  • Add TaskPool interface with Submit/TrySubmit/GetNextTask/Len methods
  • Create WeightedRoundRobinTaskPool implementation wrapping channel pool
  • Refactor Scheduler to use TaskPool abstraction instead of direct
    channel pool access
  • Move notification and metrics handling to Scheduler layer for better
    separation of concerns
  • Implement O(1) Len() using atomic counter for efficient task tracking
  • Make GetNextTask() stateful to maintain position across calls
  • Add comprehensive tests validating weighted round-robin ordering

Why?

  • Clear separation between scheduling algorithm and task execution
  • Enables future alternative TaskPool implementations (priority queue,
    FIFO, etc.) without changing dispatcher logic
  • Improved testability with smaller, focused components
  • Maintains 100% backward compatibility with existing code

How did you test it?
cd common/task && go test ./...

Potential risks
If the refactoring introduces a bug, it could break the history task scheduler, which depends on this code path.

Release notes
N/A

Documentation Changes
N/A


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)


gitar-bot bot commented Feb 26, 2026

🔍 CI failure analysis for a029718: Lint failure due to mock file generation path mismatch. Integration tests also timing out due to infinite loop bug in GetNextTask().

Lint Failure (Related to PR)

The generated mock file common/task/interface_mock.go contains full paths in its header:

//	mockgen -package task -source common/task/interface.go -destination common/task/interface_mock.go

CI's make go-generate regenerates it with relative paths:

//	mockgen -package task -source interface.go -destination interface_mock.go

This mismatch causes lint validation to fail.
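
One conventional remedy (an assumption on my part, not something the bot prescribes) is to regenerate the mock from inside the package directory so mockgen records relative paths, matching what CI's make go-generate produces:

```shell
# Hypothetical fix: run mockgen from the package directory so the command
# recorded in the generated file's header uses relative paths.
cd common/task
mockgen -package task -source interface.go -destination interface_mock.go
```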

Integration Test Timeouts (Related to PR)

Two Cassandra persistence tests timeout with "context deadline exceeded":

  • TestCassandraVisibilityPersistence/TestFilteringByWorkflowID
  • TestCassandraExecutionManager/TestConflictResolveWorkflowExecutionCurrentIsSelf

The bugfix commit a0297189c removed loop termination logic from GetNextTask() in weighted_round_robin_task_pool.go, creating an infinite loop when Len() > 0 but channels are empty. This causes the history task processor to hang.

Code Review ⚠️ Changes requested 1 resolved / 2 findings

Clean refactoring that successfully decouples TaskPool from Scheduler, but the GetNextTask infinite-spin bug (Wxc0Ixbt) remains unresolved — the for p.Len() > 0 loop still lacks cycle detection and can spin indefinitely while holding the mutex if scheduled channels don't contain the counted tasks.

⚠️ Bug: GetNextTask can spin infinitely when schedule shrinks mid-scan

📄 common/task/weighted_round_robin_task_pool.go:173 📄 common/task/weighted_round_robin_task_pool.go:188-200

In GetNextTask(), startIndex is captured from p.scheduleIndex at line 173 before the scan loop begins. If the loop exhausts the current schedule and fetches a fresh (shorter) schedule at line 189, p.scheduleIndex resets to 0 but startIndex retains its old value. When startIndex >= len(newSchedule), the termination check p.scheduleIndex == startIndex at line 198 can never be true, causing an infinite busy loop while holding the mutex.

Concrete scenario: Schedule has 5 entries, scheduleIndex=3 (so startIndex=3). All channels empty. After wrapping to the end, a fresh 3-entry schedule is fetched. Index resets to 0, iterates 0→1→2, wraps again, fetches the same 3-entry schedule, index resets to 0 — and the cycle repeats forever. The condition index == 3 is unreachable when the schedule only has 3 entries.

This holds the mutex indefinitely, blocking all Submit(), TrySubmit(), and other GetNextTask() callers — effectively a deadlock for the entire task pool.

Trigger conditions: A channel is cleaned up (reducing schedule size) while the pool has a non-zero scheduleIndex from a prior call, and all remaining channels happen to be empty at the moment of the scan.

Suggested fix
	// Scan from the saved cursor, refreshing the cached schedule at most
	// once per call so the loop stays bounded even when Len() > 0 but
	// every channel is momentarily empty.
	refreshed := false
	if p.scheduleIndex >= len(p.schedule) {
		p.schedule = p.pool.GetSchedule()
		p.scheduleIndex = 0
		refreshed = true
	}
	checked := 0
	for checked < len(p.schedule) {
		select {
		case task := <-p.schedule[p.scheduleIndex]:
			p.scheduleIndex++
			p.taskCount.Add(-1)
			return task, true
		case <-p.ctx.Done():
			return nil, false
		default:
			p.scheduleIndex++
			checked++
			if p.scheduleIndex >= len(p.schedule) {
				p.scheduleIndex = 0
				if !refreshed {
					p.schedule = p.pool.GetSchedule()
					refreshed = true
					checked = 0
				}
			}
		}
	}
	return nil, false
✅ 1 resolved
Bug: GetNextTask holds mutex during channel reads, blocking submitters

📄 common/task/weighted_round_robin_task_pool.go:159
GetNextTask() acquires the pool's mutex at line 159 and holds it for the entire scan across all schedule entries — including non-blocking channel reads. Since Submit() and TrySubmit() don't acquire this mutex (they operate on the underlying channel pool), this doesn't block submitters directly. However, it does serialize all GetNextTask() calls. With multiple dispatchers (DispatcherCount > 1), they will contend on this lock and only one dispatcher can scan at a time, reducing the benefit of multiple dispatchers.

This is a design trade-off rather than a correctness bug, but worth noting for future optimization — the old code's dispatchTasks fetched the schedule lock-free and iterated over it without holding any additional lock.


Shaddoll marked this pull request as draft on February 26, 2026 at 22:10.
