
refactor: Separate TaskPool abstraction from Scheduler #7750

Draft
Shaddoll wants to merge 2 commits into cadence-workflow:master from Shaddoll:scheduler

Conversation

@Shaddoll

What changed?
Introduce TaskPool interface to decouple task storage and scheduling
logic from task execution and dispatch logic in the Weighted Round
Robin Scheduler.

Key changes:

  • Add TaskPool interface with Submit/TrySubmit/GetNextTask/Len methods
  • Create WeightedRoundRobinTaskPool implementation wrapping channel pool
  • Refactor Scheduler to use TaskPool abstraction instead of direct
    channel pool access
  • Move notification and metrics handling to Scheduler layer for better
    separation of concerns
  • Implement O(1) Len() using atomic counter for efficient task tracking
  • Make GetNextTask() stateful to maintain position across calls
  • Add comprehensive tests validating weighted round-robin ordering

Why?

  • Clear separation between scheduling algorithm and task execution
  • Enables future alternative TaskPool implementations (priority queue,
    FIFO, etc.) without changing dispatcher logic
  • Improved testability with smaller, focused components
  • Maintains 100% backward compatibility with existing code

How did you test it?
cd common/task && go test ./...

Potential risks
If the refactoring introduces a bug, it could break the history task scheduler, which depends on this code path.

Release notes
N/A

Documentation Changes
N/A


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)


gitar-bot bot commented Feb 26, 2026

🔍 CI failure analysis for a029718: Lint failure due to mock file generation path mismatch. Integration tests also timing out due to infinite loop bug in GetNextTask().

Lint Failure (Related to PR)

The generated mock file common/task/interface_mock.go contains full paths in its header:

//	mockgen -package task -source common/task/interface.go -destination common/task/interface_mock.go

CI's make go-generate regenerates it with relative paths:

//	mockgen -package task -source interface.go -destination interface_mock.go

This mismatch causes lint validation to fail.
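
One conventional remedy (an assumption on my part, not something the bot prescribes) is to regenerate the mock from inside the package directory so mockgen records relative paths, matching what CI's make go-generate produces:

```shell
# Hypothetical fix: run mockgen from the package directory so the command
# recorded in the generated file's header uses relative paths.
cd common/task
mockgen -package task -source interface.go -destination interface_mock.go
```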

Integration Test Timeouts (Related to PR)

Two Cassandra persistence tests timeout with "context deadline exceeded":

  • TestCassandraVisibilityPersistence/TestFilteringByWorkflowID
  • TestCassandraExecutionManager/TestConflictResolveWorkflowExecutionCurrentIsSelf

The bugfix commit a0297189c removed loop termination logic from GetNextTask() in weighted_round_robin_task_pool.go, creating an infinite loop when Len() > 0 but channels are empty. This causes the history task processor to hang.

Code Review ⚠️ Changes requested 1 resolved / 2 findings

Clean refactoring that successfully decouples TaskPool from Scheduler, but the GetNextTask infinite-spin bug (Wxc0Ixbt) remains unresolved — the for p.Len() > 0 loop still lacks cycle detection and can spin indefinitely while holding the mutex if scheduled channels don't contain the counted tasks.

⚠️ Bug: GetNextTask can spin infinitely when schedule shrinks mid-scan

📄 common/task/weighted_round_robin_task_pool.go:173 📄 common/task/weighted_round_robin_task_pool.go:188-200

In GetNextTask(), startIndex is captured from p.scheduleIndex at line 173 before the scan loop begins. If the loop exhausts the current schedule and fetches a fresh (shorter) schedule at line 189, p.scheduleIndex resets to 0 but startIndex retains its old value. When startIndex >= len(newSchedule), the termination check p.scheduleIndex == startIndex at line 198 can never be true, causing an infinite busy loop while holding the mutex.

Concrete scenario: Schedule has 5 entries, scheduleIndex=3 (so startIndex=3). All channels empty. After wrapping to the end, a fresh 3-entry schedule is fetched. Index resets to 0, iterates 0→1→2, wraps again, fetches the same 3-entry schedule, index resets to 0 — and the cycle repeats forever. The condition index == 3 is unreachable when the schedule only has 3 entries.

This holds the mutex indefinitely, blocking all Submit(), TrySubmit(), and other GetNextTask() callers — effectively a deadlock for the entire task pool.

Trigger conditions: A channel is cleaned up (reducing schedule size) while the pool has a non-zero scheduleIndex from a prior call, and all remaining channels happen to be empty at the moment of the scan.

Suggested fix
	// Scan from the saved cursor, refreshing the cached schedule at most
	// once per call so the loop stays bounded even when Len() > 0 but
	// every channel is momentarily empty.
	refreshed := false
	if p.scheduleIndex >= len(p.schedule) {
		p.schedule = p.pool.GetSchedule()
		p.scheduleIndex = 0
		refreshed = true
	}
	checked := 0
	for checked < len(p.schedule) {
		select {
		case task := <-p.schedule[p.scheduleIndex]:
			p.scheduleIndex++
			p.taskCount.Add(-1)
			return task, true
		case <-p.ctx.Done():
			return nil, false
		default:
			p.scheduleIndex++
			checked++
			if p.scheduleIndex >= len(p.schedule) {
				p.scheduleIndex = 0
				if !refreshed {
					p.schedule = p.pool.GetSchedule()
					refreshed = true
					checked = 0
				}
			}
		}
	}
	return nil, false
✅ 1 resolved
Bug: GetNextTask holds mutex during channel reads, blocking submitters

📄 common/task/weighted_round_robin_task_pool.go:159
GetNextTask() acquires the pool's mutex at line 159 and holds it for the entire scan across all schedule entries — including non-blocking channel reads. Since Submit() and TrySubmit() don't acquire this mutex (they operate on the underlying channel pool), this doesn't block submitters directly. However, it does serialize all GetNextTask() calls. With multiple dispatchers (DispatcherCount > 1), they will contend on this lock and only one dispatcher can scan at a time, reducing the benefit of multiple dispatchers.

This is a design trade-off rather than a correctness bug, but worth noting for future optimization — the old code's dispatchTasks fetched the schedule lock-free and iterated over it without holding any additional lock.


Shaddoll marked this pull request as draft on February 26, 2026 at 22:10.
