Skip to content

Fix infinite spin-wait and missing cancellation propagation in TaskDeduplicator.#28938

Closed
tjgq wants to merge 1 commit intobazelbuild:masterfrom
tjgq:dedup-fix
Closed

Fix infinite spin-wait and missing cancellation propagation in TaskDeduplicator.#28938
tjgq wants to merge 1 commit intobazelbuild:masterfrom
tjgq:dedup-fix

Conversation

@tjgq
Copy link
Contributor

@tjgq tjgq commented Mar 10, 2026

The test TaskDeduplicatorTest.executeIfNeeded_executeAndCancelLoop_noErrors sporadically hung during the ExecutorService.close() call. This was due to two primary issues:

  1. In executeIfNew, if a thread encountered a RefcountedFuture that was already canceled (refcount = 0), it would call Thread.yield() and continue the while(true) loop. It relied on a listener attached to the canceled future to eventually remove it from the map. However, when using virtual threads, many threads could enter this spin loop simultaneously, saturating the underlying carrier threads. This prevented the listener from being scheduled, creating a deadlock where the spinning threads never saw the entry removed.

  2. RefcountedFuture was not propagating the cancel() call to its delegate future. This meant that even if all callers canceled their interest in a task, the task would continue to execute in the background, wasting resources and increasing contention.

This PR makes the following changes:

  1. executeIfNew and maybeJoinExecution are modified to explicitly call inFlightTasks.remove(key, future) if retain() fails. This ensures the spin loop is broken immediately by the next thread to encounter the canceled future, rather than waiting for an asynchronous listener.

  2. RefcountedFuture.cancel now calls delegate.cancel(mayInterruptIfRunning) when the internal reference count drops to zero.

  3. Thread.yield() and the associated @SuppressWarnings("ThreadPriorityCheck") are removed, as they are no longer necessary.

Fixes #28302.

@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Mar 10, 2026
@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Mar 10, 2026
@fmeum
Copy link
Collaborator

fmeum commented Mar 10, 2026

@iancha1992 @meteorcloudy This is another PR for a 9.0.2 release.

@fmeum
Copy link
Collaborator

fmeum commented Mar 10, 2026

@bazel-io fork 9.1.0

@iancha1992
Copy link
Member

@bazel-io fork 9.0.2

…duplicator.

The test TaskDeduplicatorTest.executeIfNeeded_executeAndCancelLoop_noErrors
sporadically hung during the ExecutorService.close() call. This was due to
two primary issues:

1. In executeIfNew, if a thread encountered a RefcountedFuture that was already
   canceled (refcount = 0), it would call Thread.yield() and continue the
   while(true) loop. It relied on a listener attached to the canceled future to
   eventually remove it from the map. However, when using virtual threads, many
   threads could enter this spin loop simultaneously, saturating the underlying
   carrier threads. This prevented the listener from being scheduled, creating
   a deadlock where the spinning threads never saw the entry removed.

2. RefcountedFuture was not propagating the cancel() call to its delegate
   future. This meant that even if all callers canceled their interest in a
   task, the task would continue to execute in the background, wasting
   resources and increasing contention.

This PR makes the following changes:

1. executeIfNew and maybeJoinExecution are modified to explicitly call
   inFlightTasks.remove(key, future) if retain() fails. This ensures the spin
   loop is broken immediately by the next thread to encounter the canceled
   future, rather than waiting for an asynchronous listener.

2. RefcountedFuture.cancel now calls delegate.cancel(mayInterruptIfRunning)
   when the internal reference count drops to zero.

3. Thread.yield() and the associated @SuppressWarnings("ThreadPriorityCheck")
   are removed, as they are no longer necessary.

Fixes bazelbuild#28302.
copybara-service bot pushed a commit that referenced this pull request Mar 13, 2026
Due to limitation of GitHub, the workflow won't have permission to edit labels of a PR when it's triggered by pull_request_review

https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#workflows-in-forked-repositories-1

Caused failure in #28938

PiperOrigin-RevId: 883083524
Change-Id: I4320841489e884a17602271afc37b4a03783898b
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Mar 13, 2026
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Mar 13, 2026
…duplicator. (bazelbuild#28938)

The test TaskDeduplicatorTest.executeIfNeeded_executeAndCancelLoop_noErrors sporadically hung during the ExecutorService.close() call. This was due to two issues:

1. In executeIfNew, if a thread encountered a RefcountedFuture that was already canceled (refcount = 0), it would call Thread.yield() and continue the while(true) loop. It relied on a listener attached to the canceled future to eventually remove it from the map. However, when using virtual threads, many threads could enter this spin loop simultaneously, saturating the underlying carrier threads. This prevented the listener from being scheduled, creating a deadlock where the spinning threads never saw the entry removed.

2. RefcountedFuture was not propagating the cancel() call to its delegate future. This meant that even if all callers canceled their interest in a task, the task would continue to execute in the background, wasting resources and increasing contention.

This PR makes the following changes:

1. executeIfNew and maybeJoinExecution are modified to explicitly call inFlightTasks.remove(key, future) if retain() fails. This ensures the spin loop is broken immediately by the next thread to encounter the canceled future, rather than waiting for an asynchronous listener.

2. RefcountedFuture.cancel now calls delegate.cancel(mayInterruptIfRunning) when the internal reference count drops to zero.

3. Thread.yield() and the associated @SuppressWarnings("ThreadPriorityCheck") are removed, as they are no longer necessary.

Fixes bazelbuild#28302.

Closes bazelbuild#28938.

PiperOrigin-RevId: 883147973
Change-Id: I28c1db252573a4c39b1a9e53d32e218327340054
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Mar 13, 2026
…duplicator. (bazelbuild#28938)

The test TaskDeduplicatorTest.executeIfNeeded_executeAndCancelLoop_noErrors sporadically hung during the ExecutorService.close() call. This was due to two issues:

1. In executeIfNew, if a thread encountered a RefcountedFuture that was already canceled (refcount = 0), it would call Thread.yield() and continue the while(true) loop. It relied on a listener attached to the canceled future to eventually remove it from the map. However, when using virtual threads, many threads could enter this spin loop simultaneously, saturating the underlying carrier threads. This prevented the listener from being scheduled, creating a deadlock where the spinning threads never saw the entry removed.

2. RefcountedFuture was not propagating the cancel() call to its delegate future. This meant that even if all callers canceled their interest in a task, the task would continue to execute in the background, wasting resources and increasing contention.

This PR makes the following changes:

1. executeIfNew and maybeJoinExecution are modified to explicitly call inFlightTasks.remove(key, future) if retain() fails. This ensures the spin loop is broken immediately by the next thread to encounter the canceled future, rather than waiting for an asynchronous listener.

2. RefcountedFuture.cancel now calls delegate.cancel(mayInterruptIfRunning) when the internal reference count drops to zero.

3. Thread.yield() and the associated @SuppressWarnings("ThreadPriorityCheck") are removed, as they are no longer necessary.

Fixes bazelbuild#28302.

Closes bazelbuild#28938.

PiperOrigin-RevId: 883147973
Change-Id: I28c1db252573a4c39b1a9e53d32e218327340054
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2026
…n TaskDeduplicator. (#28938) (#28990)

The test
TaskDeduplicatorTest.executeIfNeeded_executeAndCancelLoop_noErrors
sporadically hung during the ExecutorService.close() call. This was due
to two issues:

1. In executeIfNew, if a thread encountered a RefcountedFuture that was
already canceled (refcount = 0), it would call Thread.yield() and
continue the while(true) loop. It relied on a listener attached to the
canceled future to eventually remove it from the map. However, when
using virtual threads, many threads could enter this spin loop
simultaneously, saturating the underlying carrier threads. This
prevented the listener from being scheduled, creating a deadlock where
the spinning threads never saw the entry removed.

2. RefcountedFuture was not propagating the cancel() call to its
delegate future. This meant that even if all callers canceled their
interest in a task, the task would continue to execute in the
background, wasting resources and increasing contention.

This PR makes the following changes:

1. executeIfNew and maybeJoinExecution are modified to explicitly call
inFlightTasks.remove(key, future) if retain() fails. This ensures the
spin loop is broken immediately by the next thread to encounter the
canceled future, rather than waiting for an asynchronous listener.

2. RefcountedFuture.cancel now calls
delegate.cancel(mayInterruptIfRunning) when the internal reference count
drops to zero.

3. Thread.yield() and the associated
@SuppressWarnings("ThreadPriorityCheck") are removed, as they are no
longer necessary.

Fixes #28302.

Closes #28938.

PiperOrigin-RevId: 883147973
Change-Id: I28c1db252573a4c39b1a9e53d32e218327340054

Commit
a0760f1

Co-authored-by: Tiago Quelhas <tjgq@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Remote-Exec Issues and PRs for the Execution (Remote) team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky timeout: TaskDeduplicatorTest.executeIfNeeded_executeAndCancelLoop_noErrors()

3 participants