[9.0] Fix deadlock in ThreadPoolMergeScheduler when a failing merge closes the IndexWriter (#134656) #135175

tlrx · 2025-09-22T08:31:56Z

This change fixes a bug that causes a deadlock in the thread pool merge scheduler when a merge fails due to a tragic event.

The deadlock occurs because Lucene aborts running merges when failing with a tragic event and then waits for them to complete. But those "running" merges might in fact be waiting in Elasticsearch's thread pool merge scheduler task queue, or they might be waiting in the backlogged merge task queue because the per-shard concurrent merge count limit has been reached, or they might simply be waiting for enough disk space to be executed. In any of these cases, the failing merge thread waits indefinitely.
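
To make the hang concrete, here is a minimal toy model, not Elasticsearch or Lucene code. `StuckMergeSketch`, `PendingMerge` and `abortAndWait` are hypothetical names invented for illustration. It assumes a merge that has been registered as running but whose task never leaves the queue, and it uses a timeout where the real wait would be unbounded.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy model only: none of these types are Elasticsearch or Lucene classes.
final class StuckMergeSketch {

    /** A merge that is counted as "running" although its task has not started. */
    static final class PendingMerge {
        final CountDownLatch finished = new CountDownLatch(1);
        volatile boolean aborted;

        void run() {
            // A real merge would do work here and check `aborted` along the way.
            finished.countDown();
        }
    }

    /**
     * Mimics the failing merge thread: flag every registered merge as aborted,
     * then wait for each one to finish. Returns false if a wait timed out
     * or the thread was interrupted.
     */
    static boolean abortAndWait(Iterable<PendingMerge> registered, long timeoutMillis) {
        for (PendingMerge merge : registered) {
            merge.aborted = true;
        }
        try {
            for (PendingMerge merge : registered) {
                if (!merge.finished.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    return false; // without the timeout this wait would never return
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Queue<PendingMerge> schedulerQueue = new ArrayDeque<>();
        schedulerQueue.add(new PendingMerge()); // queued, but no thread ever picks it up
        // prints "all merges finished: false"
        System.out.println("all merges finished: " + abortAndWait(schedulerQueue, 200));
    }
}
```

Setting the abort flag is not enough on its own: a task that never starts never observes the flag, so nothing ever signals completion to the waiting thread.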

The proposed fix in this change uses the merge thread that is failing due to a tragic event to abort all other enqueued and backlogged merge tasks of the same shard before proceeding to close the IndexWriter. This way Lucene won't have to wait for any running merges, as they will all have been aborted up front.
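
The idea can be sketched roughly as follows. This is a single-threaded illustration under stated assumptions, not the code in this PR: `AbortBeforeCloseSketch`, `MergeTask`, `abortTasksOfShard` and `abortPendingMerges` are hypothetical names, shards are identified by plain strings, and the synchronization a real scheduler needs is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

// Toy model only: names are hypothetical, not the actual Elasticsearch API.
final class AbortBeforeCloseSketch {

    static final class MergeTask {
        final String shardId;
        boolean aborted;
        boolean completed;

        MergeTask(String shardId) {
            this.shardId = shardId;
        }

        /** Abort a task that never started, so that nothing is left waiting on it. */
        void abortNotStarted() {
            aborted = true;
            completed = true;
        }
    }

    /** Removes and aborts every not-yet-started task of the given shard from one queue. */
    static List<MergeTask> abortTasksOfShard(Queue<MergeTask> queue, String shardId) {
        List<MergeTask> abortedTasks = new ArrayList<>();
        for (Iterator<MergeTask> it = queue.iterator(); it.hasNext();) {
            MergeTask task = it.next();
            if (task.shardId.equals(shardId)) {
                it.remove();
                task.abortNotStarted();
                abortedTasks.add(task);
            }
        }
        return abortedTasks;
    }

    /** Called by the failing merge thread before it goes on to close the writer. */
    static int abortPendingMerges(Queue<MergeTask> queued, Queue<MergeTask> backlogged, String shardId) {
        return abortTasksOfShard(queued, shardId).size() + abortTasksOfShard(backlogged, shardId).size();
    }

    public static void main(String[] args) {
        Queue<MergeTask> queued = new ArrayDeque<>();
        queued.add(new MergeTask("shard-0"));
        queued.add(new MergeTask("shard-1"));
        Queue<MergeTask> backlogged = new ArrayDeque<>();
        backlogged.add(new MergeTask("shard-0"));

        int abortedCount = abortPendingMerges(queued, backlogged, "shard-0");
        // prints "aborted 2 tasks, 1 still queued"
        System.out.println("aborted " + abortedCount + " tasks, " + queued.size() + " still queued");
    }
}
```

Tasks belonging to other shards stay in their queues, which matches the description above: only merges of the shard whose IndexWriter is closing are aborted.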

Backport of #134656 to 9.0.8
Relates ES-12664


tlrx added backport v9.0.8 labels Sep 22, 2025

elasticsearchmachine and others added 2 commits September 22, 2025 08:40

[CI] Auto commit changes from spotless

5cbf960

Adjust for 9.0

e43d509

tlrx merged commit 1c119b6 into elastic:9.0 Sep 22, 2025
20 checks passed

tlrx deleted the 2025/09/09/ES-12664-9.0 branch September 22, 2025 10:03
