Skip to content

Conversation

@JeremyDahlgren
Copy link
Contributor

A race condition exists in TransportTestAction that could result in a subrequest timing out before the corresponding task has been registered on the remote node. This change add a trace log message to the TaskManager for when a children list is not found in cancelChildLocal() and a counter that is incremented. The test still verifies that at least one child task is cancelled via the expected cancellation mechanism, manually cancelling any orphaned tasks and verifies that the number of manual cancellations required equals the number of cancelChildLocal() calls observed where a children list was not found.

To reproduce the failure reliably I changed the timeout in TransportTestAction from 400 milliseconds to 1 milliseconds when testing locally.

See #123568 for details about possible follow ups on either implementing cancellation retry support, refactoring TransportTestAction to enforce ordering of requests so timeouts do not occur until the corresponding remote task has been registered, and discussion of support for transport-level timeouts in general.

Closes #123568

A race condition exists in TransportTestAction that could result
in a subrequest timing out before the corresponding task has been
registered on the remote node.  This change add a trace log message
to the task manager for when a children list is not found in
cancelChildLocal() and a counter that is incremented.  The test
still verifies that at least one child task is cancelled via the
expected cancellation mechanism, manually cancelling any orphaned tasks
and verifies that the number of manual cancellations required equals
the number of cancelChildLocal() calls observed where a children
list was not found.

Closes elastic#123568
@JeremyDahlgren JeremyDahlgren added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. auto-backport Automatically create backport pull requests when merged Team:Distributed Coordination Meta label for Distributed Coordination team v8.19.0 v9.1.0 v9.0.3 v8.18.3 labels Jun 17, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Copy link
Member

ywangd commented Jun 17, 2025

I am not sure whether we want to fix it as a pure test problem. I'd personally rather keep it and fix the underlying issue. If we do prefer to fix the test for now, I think it would be better to fix the test assumption by ensuring a cancellation request is sent after the associated task is executing, i.e. some coordinations between sending internal:admin/tasks/cancel_child request (by intercepting the transport action) and TransportTestAction#doExecute.

@DaveCTurner
Copy link
Contributor

Yeah I'm inclined to agree. Seems weird to make this test bend itself around the bugs in the very subsystem it's supposed to be testing.

@JeremyDahlgren
Copy link
Contributor Author

Closing per the review comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v8.18.3 v8.19.0 v9.0.3 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] CancellableTasksIT testChildrenTasksCancelledOnTimeout failing

4 participants