fix(queue): replace semaphore with sync.Cond to eliminate worker serialization #9

Merged: jwnx merged 3 commits into master from jwnx/attempt-to-fix-30min-semaphore on Feb 2, 2026

jwnx commented on Jan 13, 2026

The current implementation uses a size-1 semaphore to ensure a single
worker reads from q.items.Front() at a given time. The problem is that
after the worker realizes it has to wait until plannedToStartWorkAt,
it continues to hold the semaphore while sleeping. Other workers block
until the sleeping worker wakes up, processes its task, and releases.
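
To make the failure mode concrete, here is a minimal sketch of the serialization, not the queue's real code. It assumes the size-1 semaphore behaves like a buffered channel of capacity 1; `readyItemDelayMillis` and the two inline workers are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// readyItemDelayMillis models two workers sharing a size-1 semaphore
// (a buffered channel here, which is an assumption). Worker 1 acquires it,
// decides it must wait backoffMillis for its item, and sleeps without
// releasing. Worker 2 is idle but cannot get to the queue until worker 1
// wakes up. The result is the time from start until worker 2 acquires.
func readyItemDelayMillis(backoffMillis int) int64 {
	sem := make(chan struct{}, 1)
	holding := make(chan struct{})
	acquiredAfter := make(chan int64)
	start := time.Now()

	go func() { // worker 1
		sem <- struct{}{} // acquire
		close(holding)
		time.Sleep(time.Duration(backoffMillis) * time.Millisecond)
		<-sem // released only after waking up
	}()

	go func() { // worker 2
		<-holding
		sem <- struct{}{} // blocks for the whole backoff
		acquiredAfter <- time.Since(start).Milliseconds()
		<-sem
	}()

	return <-acquiredAfter
}

func main() {
	// Worker 2 never gets in before worker 1's backoff has elapsed.
	fmt.Println(readyItemDelayMillis(50) >= 50) // prints "true"
}
```

With a 50 ms backoff the cost is small; with the backoff values discussed in this PR the same wait applies to every other worker.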

The q.items list acts as a FIFO queue and keeps the invariant that
elements are ordered by how soon they are to be processed, given the
plannedToStartWorkAt value. While a worker sleeps holding the semaphore,
elements that should be scheduled immediately cannot be processed, even
though the other workers might have nothing else to do. When we consider
that the default max backoff is 1000s, we can have a worker holding the
semaphore for as much as 16 minutes and 40 seconds.
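
For illustration, the ordering invariant and the backoff arithmetic can be sketched as below. This assumes q.items is a container/list (suggested by the Front() call, but not confirmed here); `item`, `insertOrdered` and `queueOrder` are hypothetical names, and the real insert may differ.

```go
package main

import (
	"container/list"
	"fmt"
	"strings"
	"time"
)

// maxBackoff is the default max backoff mentioned above.
const maxBackoff = 1000 * time.Second

// item is a hypothetical queue element.
type item struct {
	name                 string
	plannedToStartWorkAt time.Time
}

// insertOrdered keeps l sorted by plannedToStartWorkAt, soonest first.
// Items with equal times stay in insertion order.
func insertOrdered(l *list.List, it item) {
	for e := l.Front(); e != nil; e = e.Next() {
		if it.plannedToStartWorkAt.Before(e.Value.(item).plannedToStartWorkAt) {
			l.InsertBefore(it, e)
			return
		}
	}
	l.PushBack(it)
}

// queueOrder inserts one item per delay (seconds from a common base time)
// and returns the item names from front to back.
func queueOrder(delaysSeconds ...int) string {
	base := time.Now()
	l := list.New()
	for _, d := range delaysSeconds {
		insertOrdered(l, item{
			name:                 fmt.Sprintf("%ds", d),
			plannedToStartWorkAt: base.Add(time.Duration(d) * time.Second),
		})
	}
	var names []string
	for e := l.Front(); e != nil; e = e.Next() {
		names = append(names, e.Value.(item).name)
	}
	return strings.Join(names, " ")
}

func main() {
	fmt.Println(queueOrder(1000, 0, 30)) // prints "0s 30s 1000s"
	fmt.Println(maxBackoff)              // prints "16m40s"
}
```

Because the front is always the soonest item, a worker only ever needs to look at Front() to decide whether to run or wait.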

This patch removes the semaphore and replaces it with a sync.Cond. We'll
c.Wait in two situations: there's nothing to do (no elements to process),
so we wait until we get a Signal from insert, or we got an element but
have to wait for the plannedToStartWorkAt timer. To avoid a timer mutex
race, we prefer Signals when we know an item is ready (insert) or the
timer fired. Broadcast is used for cancellations and shutdowns.
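
The approach described above can be sketched roughly as follows. This is not the code in this PR: the `queue` type, `next`, `shutdown`, and the use of time.AfterFunc for the timer are assumptions made to keep the example self-contained, and the extra Signal after removing an item is a choice of this sketch.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

type item struct {
	name                 string
	plannedToStartWorkAt time.Time
}

// queue is a hypothetical stand-in, not the real type from this repository.
type queue struct {
	mu     sync.Mutex
	cond   *sync.Cond
	items  *list.List // kept ordered by plannedToStartWorkAt
	closed bool
}

// insert places it in planned-start order and wakes one waiting worker.
func (q *queue) insert(it item) {
	q.mu.Lock()
	defer q.mu.Unlock()
	e := q.items.Front()
	for e != nil && !it.plannedToStartWorkAt.Before(e.Value.(item).plannedToStartWorkAt) {
		e = e.Next()
	}
	if e == nil {
		q.items.PushBack(it)
	} else {
		q.items.InsertBefore(it, e)
	}
	q.cond.Signal()
}

// shutdown wakes every waiting worker so each one can see closed and exit.
func (q *queue) shutdown() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
	q.cond.Broadcast()
}

// next blocks until the front item is due or the queue is shut down.
func (q *queue) next() (item, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for !q.closed {
		front := q.items.Front()
		if front == nil {
			q.cond.Wait() // nothing to do: wait for a Signal from insert
			continue
		}
		it := front.Value.(item)
		wait := time.Until(it.plannedToStartWorkAt)
		if wait <= 0 {
			q.items.Remove(front)
			q.cond.Signal() // let another worker look at the new front
			return it, true
		}
		// Not due yet. The callback locks mu, so it cannot Signal before we Wait.
		timer := time.AfterFunc(wait, func() {
			q.mu.Lock()
			q.cond.Signal()
			q.mu.Unlock()
		})
		q.cond.Wait() // releases mu while parked, so other workers proceed
		timer.Stop()
	}
	return item{}, false
}

func demo() (string, int) {
	q := &queue{items: list.New()}
	q.cond = sync.NewCond(&q.mu)
	q.insert(item{"delayed", time.Now().Add(time.Hour)})
	results := make(chan string, 2)
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it, ok := q.next(); ok; it, ok = q.next() {
				results <- it.name
			}
		}()
	}
	q.insert(item{"ready", time.Now()})
	first := <-results // no worker is stuck behind "delayed"
	q.shutdown()
	wg.Wait()
	return first, q.items.Len()
}

func main() { fmt.Println(demo()) } // prints "ready 1"
```

The key difference from the semaphore version is that Wait releases the mutex while a worker is parked, so a worker waiting on a far-off plannedToStartWorkAt no longer prevents the others from picking up an item that is ready now.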

Signed-off-by: Juliana Oliveira <juliana@fly.io>

soulware (Member) commented:

Nice find!

jwnx added 2 commits January 14, 2026 12:58
Signed-off-by: Juliana Oliveira <juliana@fly.io>
Signed-off-by: Juliana Oliveira <juliana@fly.io>
jwnx changed the title from "fix(queue): removes semaphore while avoiding timer mutex race" to "fix(queue): replace semaphore with sync.Cond to eliminate worker serialization" on Feb 2, 2026
jwnx merged commit 72d5452 into master on Feb 2, 2026
1 of 4 checks passed