Skip to content

Conversation

joshua-adams-1
Copy link
Contributor

Backports #134926 to 9.0

I was able to reproduce the error by following the steps outlined by @JeremyDahlgren here

The error occurs due to the following sequence of events:

  1. One of the snapshot threads has a sleep when executing snapshot tasks
  2. The other is not delayed, so it will get blocked in the repository to allow the test to get to the putShutdownForRemovalMetadata() call
  3. After the delays the first thread processes the other snapshot tasks, which are now paused, which throws a PausedSnapshotException, and then immediately notifies the master with the failure
  4. The first four SnapshotShutdownProgressTracker log messages are verified in the test and then the masterTransportService.addRequestHandlingBehavior() is called to block the notifications
  5. The repo is unblocked on the nodeForRemoval, but by this time it is too late and one or more updates have already been sent, possibly leaving only the single blocked shard, causing the assertion at 581 to fail

The proposed change:

  1. Ensures that there are n shards on otherNode with data. At the moment, there is only 1.
  2. I have reordered the blocking on the master node to come earlier. This is so we are always capturing the notifications back to the master, either by the snapshot thread being blocked, or if there was some delay on a thread due to other concurrent scheduling.

A full explanation of the changes can be found here


Testing

  • Having been able to reproducible make the test fail, it now succeeds 100 times in a row
  • The original gradle command ./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.snapshots.SnapshotShutdownIT.testSnapshotShutdownProgressTracker" -Dtests.seed=C44FB8C3BA8AC78E -Dtests.locale=dz -Dtests.timezone=Australia/Queensland -Druntime.java=23 succeeds
  • CI passes

Resolves: #134620

Reorders the test to avoid a race condition due to one snapshot
thread becoming blocked

Relates: 134620
@joshua-adams-1 joshua-adams-1 self-assigned this Sep 25, 2025
@joshua-adams-1 joshua-adams-1 changed the title Fixes testSnapshotShutdownProgressTracker [9.0] Fixes testSnapshotShutdownProgressTracker (#134926) Sep 25, 2025
@joshua-adams-1 joshua-adams-1 added :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests labels Sep 25, 2025
@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review September 25, 2025 10:07
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Sep 25, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Contributor

@JeremyDahlgren JeremyDahlgren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@joshua-adams-1 joshua-adams-1 merged commit b54eb5f into elastic:9.0 Sep 25, 2025
20 checks passed
@joshua-adams-1 joshua-adams-1 deleted the 9.0-snapshot-shutdown-it-test-failure branch September 25, 2025 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.0.8
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants