@JeremyDahlgren
Contributor

Fixes a race condition in testConcurrentExecuteAndClose() where awaitClose() is called before addCloseListener() has been called, resulting in a situation where closeListener is never completed and the closeLatch is never pulled.

Closes #129121
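
The interleaving described above can be replayed deterministically in a small stand-alone sketch. Everything here is an assumption made for illustration: `SketchChannel` is not the real `TestHttpChannel`, a `CompletableFuture<Runnable>` stands in for `closeListener`, and the `applyFix` flag is hypothetical. It only shows why the latch is never pulled when the channel is closed before a listener is added, and how completing `closeListener` with a no-op avoids that.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class CloseRaceSketch {

    /** Simplified stand-in for the test channel; NOT the real TestHttpChannel. */
    static final class SketchChannel {
        private final AtomicBoolean open = new AtomicBoolean(true);
        // Assumption: a CompletableFuture stands in for closeListener.
        private final CompletableFuture<Runnable> closeListener = new CompletableFuture<>();
        private final CountDownLatch closeLatch = new CountDownLatch(1);
        private final boolean applyFix; // hypothetical switch, for comparison only

        SketchChannel(boolean applyFix) {
            this.applyFix = applyFix;
        }

        void close() {
            if (open.compareAndSet(true, false) == false) {
                throw new IllegalStateException("channel is already closed");
            }
            // Runs only once closeListener has been completed with some listener.
            closeListener.thenAccept(listener -> {
                listener.run();
                closeLatch.countDown();
            });
        }

        void addCloseListener(Runnable listener) {
            if (open.get() == false) {
                // Already closed: notify the caller's listener directly.
                listener.run();
                if (applyFix) {
                    // Complete closeListener with a no-op so the latch gets pulled;
                    // complete() has no effect if the future is already completed.
                    closeListener.complete(() -> {});
                }
            } else {
                closeListener.complete(listener);
            }
        }

        boolean latchPulledWithin(long millis) {
            try {
                return closeLatch.await(millis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Replays the losing interleaving: close() first, addCloseListener() second. */
    static boolean latchPulledWhenClosedFirst(boolean applyFix) {
        SketchChannel channel = new SketchChannel(applyFix);
        channel.close();
        channel.addCloseListener(() -> {});
        return channel.latchPulledWithin(200);
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + latchPulledWhenClosedFirst(false));
        System.out.println("with fix:    " + latchPulledWhenClosedFirst(true));
    }
}
```

In this sketch the run without the no-op completion times out waiting on the latch, while the run with it returns as soon as `addCloseListener()` is called on the closed channel.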

@JeremyDahlgren added the >test, :Distributed Coordination/Task Management, auto-backport, Team:Distributed Coordination, v8.19.0, v9.1.0, v9.0.3, v8.17.8, v8.18.3 labels Jun 11, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

// if the channel is already closed, the listener gets notified immediately, from the same thread.
if (open.get() == false) {
    listener.onResponse(null);
    // Handle scenario where awaitClose() was called before any calls to addCloseListener(), this ensures closeLatch is pulled.
Contributor

Ah, good point, but then could we achieve the same thing by dropping the whole if (open.get() == false) and always calling closeListener.onResponse(ActionListener.assertOnce(listener));? I guess that doesn't guarantee that listener is completed on the calling thread in that case; not sure if this is important.
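
The calling-thread concern can be illustrated with a generic sketch. It uses `CompletableFuture` purely as an analogy and makes no claim about how `closeListener` or the test's thread pool behave; the class and method names are hypothetical. A callback handed to a deferred holder runs on the subscribing thread only if the holder is already complete; otherwise it runs on whichever thread completes it later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

final class CallingThreadSketch {

    /** Returns the name of the thread that ended up running the callback. */
    static String callbackThread(boolean completedBeforeSubscribe) {
        // Assumption: a plain CompletableFuture stands in for a deferred listener holder.
        CompletableFuture<Void> holder = new CompletableFuture<>();
        AtomicReference<String> ranOn = new AtomicReference<>();

        if (completedBeforeSubscribe) {
            holder.complete(null);
        }
        // Subscribe from the calling thread.
        holder.thenRun(() -> ranOn.set(Thread.currentThread().getName()));

        if (completedBeforeSubscribe == false) {
            // Complete from a different thread after the callback was registered.
            Thread completer = new Thread(() -> holder.complete(null), "completer");
            completer.start();
            try {
                completer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return ranOn.get();
    }
}
```

With `completedBeforeSubscribe` set to `true` the callback runs on the caller's thread; with `false` it runs on the thread named `completer`, which is the kind of asynchronous completion a test counting channels right after the call would have to account for.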

Contributor Author

Yes, testChannelAlreadyClosed() can fail on an unexpected RestCancellableNodeClient.getNumChannels() value if a listener is completed asynchronously. I can update this test and refactor TestHttpChannel to support waiting for the closeLatch to be pulled after close() has already been called. Right now you can't just call awaitClose(), since it first tries to call close(), which fails because the atomic has already been set.

I'll try to simplify TestHttpChannel.addCloseListener() to address this comment and the other comment below, adjusting the existing tests and TestHttpChannel as needed.
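
As a rough illustration of the limitation mentioned above, the sketch below shows a close-then-wait method failing on an already-closed channel, next to a hypothetical wait-only helper. `SketchChannel`, `awaitLatch()`, and the use of `IllegalStateException` are assumptions for this example; the latch is also pulled directly in `close()` to keep the sketch short, which is not how the real test channel works.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class AwaitAfterCloseSketch {

    /** Simplified stand-in; NOT the real TestHttpChannel. */
    static final class SketchChannel {
        private final AtomicBoolean open = new AtomicBoolean(true);
        private final CountDownLatch closeLatch = new CountDownLatch(1);

        void close() {
            // The atomic flips exactly once; a second close() is rejected.
            if (open.compareAndSet(true, false) == false) {
                throw new IllegalStateException("channel is already closed");
            }
            // Simplification: pull the latch right away.
            closeLatch.countDown();
        }

        /** Close-then-wait: unusable once close() has already been called. */
        boolean awaitClose(long millis) throws InterruptedException {
            close();
            return closeLatch.await(millis, TimeUnit.MILLISECONDS);
        }

        /** Hypothetical helper: only waits for the latch, never closes. */
        boolean awaitLatch(long millis) throws InterruptedException {
            return closeLatch.await(millis, TimeUnit.MILLISECONDS);
        }
    }

    static boolean awaitCloseFailsAfterClose() {
        SketchChannel channel = new SketchChannel();
        channel.close();
        try {
            channel.awaitClose(100);
            return false;
        } catch (IllegalStateException e) {
            return true; // second close() was rejected
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    static boolean awaitLatchWorksAfterClose() {
        SketchChannel channel = new SketchChannel();
        channel.close();
        try {
            return channel.awaitLatch(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Separating the wait from the close is one possible shape for such a refactoring; the thread does not say which shape was eventually chosen.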

if (open.get() == false) {
    listener.onResponse(null);
    // Handle scenario where awaitClose() was called before any calls to addCloseListener(), this ensures closeLatch is pulled.
    if (closeListener.isDone() == false) {
Contributor

Can we assertFalse(closeListener.isDone()) on this branch too? Seems like it'd be a test bug to want to add two of them?

Contributor Author

When the channel is closed we can get more calls here, since the channel is removed from httpChannels in RestCancellableNodeClient.CloseListener.onResponse(). A new RestCancellableNodeClient.CloseListener is then created in RestCancellableNodeClient.doExecute(), which leads to maybeRegisterChannel() and then httpChannel.addCloseListener().
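
The re-registration loop described in this comment can be sketched with a plain map. This is a simplified model under stated assumptions, not the actual `RestCancellableNodeClient` code: `SketchChannel`, `SketchCloseListener`, and the helper `addCloseListenerCalls()` are hypothetical, and only the remove-on-close and register-if-absent behavior mentioned in the comment is modelled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ReRegistrationSketch {

    /** Simplified channel that counts addCloseListener() calls. */
    static final class SketchChannel {
        volatile boolean open = true;
        final AtomicInteger addCloseListenerCalls = new AtomicInteger();

        void addCloseListener(Runnable listener) {
            addCloseListenerCalls.incrementAndGet();
            if (open == false) {
                listener.run(); // already closed: notify immediately
            }
        }
    }

    /** Simplified close listener that removes its channel from the map when notified. */
    static final class SketchCloseListener implements Runnable {
        private final Map<SketchChannel, SketchCloseListener> httpChannels;
        private final SketchChannel channel;
        private boolean registered;

        SketchCloseListener(Map<SketchChannel, SketchCloseListener> httpChannels, SketchChannel channel) {
            this.httpChannels = httpChannels;
            this.channel = channel;
        }

        void maybeRegisterChannel() {
            if (registered == false) {
                registered = true;
                channel.addCloseListener(this);
            }
        }

        @Override
        public void run() {
            httpChannels.remove(channel);
        }
    }

    static void doExecute(Map<SketchChannel, SketchCloseListener> httpChannels, SketchChannel channel) {
        // A new listener is created whenever the channel is not tracked (any more).
        SketchCloseListener listener = httpChannels.computeIfAbsent(channel, c -> new SketchCloseListener(httpChannels, c));
        listener.maybeRegisterChannel();
    }

    /** Runs doExecute() several times and reports how often addCloseListener() was hit. */
    static int addCloseListenerCalls(boolean channelOpen, int executions) {
        Map<SketchChannel, SketchCloseListener> httpChannels = new ConcurrentHashMap<>();
        SketchChannel channel = new SketchChannel();
        channel.open = channelOpen;
        for (int i = 0; i < executions; i++) {
            doExecute(httpChannels, channel);
        }
        return channel.addCloseListenerCalls.get();
    }
}
```

In this model an open channel sees a single `addCloseListener()` call no matter how many executions reuse it, while a closed channel sees one call per execution, which is why an assertion on `isDone()` in the closed branch would trip.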

Contributor Author

@JeremyDahlgren Jun 12, 2025

If we want to avoid multiple addCloseListener() calls, it looks like we'd want to check whether the httpChannel is closed at the beginning of RestCancellableNodeClient.doExecute() and complete the listener with a failure before returning. But this would change existing behavior and the expected cancellations, and we'd need to alter the tests for it. To keep the change in this PR minimal, I removed the isDone() check and added a comment to clarify what is expected when the channel is closed. WDYT?

Contributor

@DaveCTurner left a comment

LGTM

if (open.get() == false) {
    listener.onResponse(null);
    // Ensure closeLatch is pulled by completing the closeListener with a noop that is ignored if it is already completed.
    // Note that when the channel is closed we may see multiple addCloseListener() calls, so we do not assert on isDone() here.
Contributor

... nor can we rely on the listener with which closeListener is completed itself being completed, which is why we pass in a noop() after completing listener directly ourselves. Would you add words to that effect to this comment for the next reader?

This all seems kinda ugly but I don't have a better suggestion.

@JeremyDahlgren merged commit 9a8e503 into elastic:main Jun 13, 2025
18 checks passed
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request Jun 13, 2025
…lastic#129294)

Fixes a race condition in testConcurrentExecuteAndClose() where
awaitClose() is called before addCloseListener() has been called,
resulting in a situation where closeListener is never completed and
the closeLatch is never pulled.

Closes elastic#129121
@elasticsearchmachine
Collaborator

💚 Backport successful

Branches: 8.19, 9.0, 8.17, 8.18

elasticsearchmachine pushed a commit that referenced this pull request Jun 13, 2025
…129294) (#129426)

Fixes a race condition in testConcurrentExecuteAndClose() where
awaitClose() is called before addCloseListener() has been called,
resulting in a situation where closeListener is never completed and
the closeLatch is never pulled.

Closes #129121
elasticsearchmachine pushed a commit that referenced this pull request Jun 13, 2025
…129294) (#129424)

elasticsearchmachine pushed a commit that referenced this pull request Jun 13, 2025
…129294) (#129425)

elasticsearchmachine pushed a commit that referenced this pull request Jun 13, 2025
…129294) (#129423)

nicktindall pushed a commit to nicktindall/elasticsearch that referenced this pull request Jul 1, 2025
…lastic#129294)

Fixes a race condition in testConcurrentExecuteAndClose() where
awaitClose() is called before addCloseListener() has been called,
resulting in a situation where closeListener is never completed and
the closeLatch is never pulled.

Closes elastic#129121

(cherry picked from commit 9a8e503)

Development

Successfully merging this pull request may close these issues.

[CI] RestCancellableNodeClientTests testConcurrentExecuteAndClose failing
