@tlrx tlrx commented Sep 22, 2025

This change fixes a bug that causes a deadlock in the thread pool merge scheduler when a merge fails due to a tragic event.

The deadlock occurs because Lucene aborts running merges when it fails with a tragic event and then waits for them to complete. But those "running" merges might in fact be waiting in Elasticsearch's thread pool merge scheduler task queue, or in the backlogged merge task queue because the per-shard concurrent merge count limit has been reached, or they might simply be waiting for enough disk space to be executed. In any of these cases, the failing merge thread waits indefinitely.
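The wait described above can be illustrated with a minimal, self-contained sketch. It is not Elasticsearch or Lucene code: `QueuedMergeWaitSketch`, `MergeTask`, and `simulate` are hypothetical names, and a bounded wait stands in for the real unbounded one so that the example terminates.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class QueuedMergeWaitSketch {

    /** Hypothetical stand-in for a merge that is registered as "running". */
    static final class MergeTask {
        final CountDownLatch done = new CountDownLatch(1);

        /** Would be called by an executor thread once the task is dequeued. */
        void run() {
            done.countDown();
        }
    }

    /** Waits for the task to finish; returns false if the timeout elapsed first. */
    static boolean waitForMerge(MergeTask task, long timeoutMillis) {
        try {
            return task.done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Models a merge that sits in a queue and is never picked up by any thread. */
    static boolean simulate() {
        Queue<MergeTask> backlog = new ArrayDeque<>();
        MergeTask queued = new MergeTask();
        backlog.add(queued); // enqueued, but nothing ever calls queued.run()
        // The "failing" thread waits for a merge that cannot make progress.
        return waitForMerge(queued, 100);
    }

    public static void main(String[] args) {
        System.out.println("wait completed: " + simulate());
    }
}
```

With an unbounded `await()` in place of the timed one, the waiting thread in this model would never return, which mirrors the indefinite wait described above.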

The fix uses the merge thread that is failing due to a tragic event to abort all other enqueued and backlogged merge tasks of the same shard before proceeding to close the IndexWriter. This way Lucene does not have to wait for any running merges, because they will all have been aborted up front.
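The idea of aborting pending tasks before waiting can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this change: `AbortBeforeCloseSketch`, `Scheduler`, `MergeTask`, `abortPendingFor`, and `abortThenWait` are hypothetical names, and shards are identified by plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AbortBeforeCloseSketch {

    /** Hypothetical merge task that belongs to one shard. */
    static final class MergeTask {
        final String shardId;
        final CountDownLatch done = new CountDownLatch(1);
        volatile boolean aborted;

        MergeTask(String shardId) {
            this.shardId = shardId;
        }

        /** Marks the task aborted and releases anyone waiting on it. */
        void abort() {
            aborted = true;
            done.countDown();
        }
    }

    /** Hypothetical scheduler holding enqueued and backlogged tasks. */
    static final class Scheduler {
        private final List<MergeTask> queued = new ArrayList<>();
        private final List<MergeTask> backlogged = new ArrayList<>();

        synchronized void enqueue(MergeTask task) {
            queued.add(task);
        }

        synchronized void backlog(MergeTask task) {
            backlogged.add(task);
        }

        /** Removes and aborts every pending task of the given shard. */
        synchronized int abortPendingFor(String shardId) {
            int count = 0;
            for (List<MergeTask> tasks : Arrays.asList(queued, backlogged)) {
                Iterator<MergeTask> it = tasks.iterator();
                while (it.hasNext()) {
                    MergeTask task = it.next();
                    if (task.shardId.equals(shardId)) {
                        it.remove();
                        task.abort();
                        count++;
                    }
                }
            }
            return count;
        }
    }

    /** Bounded wait so the sketch always terminates. */
    static boolean awaitDone(MergeTask task, long timeoutMillis) {
        try {
            return task.done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Called by the failing thread: abort pending tasks first, then wait. */
    static boolean abortThenWait(Scheduler scheduler, String shardId, List<MergeTask> shardMerges, long timeoutMillis) {
        scheduler.abortPendingFor(shardId);
        for (MergeTask task : shardMerges) {
            if (!awaitDone(task, timeoutMillis)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        MergeTask a = new MergeTask("shard-0");
        MergeTask b = new MergeTask("shard-0");
        scheduler.enqueue(a);
        scheduler.backlog(b);
        System.out.println("wait completed: " + abortThenWait(scheduler, "shard-0", Arrays.asList(a, b), 100));
    }
}
```

In this model, tasks of other shards are left untouched, and the subsequent wait returns immediately because every pending task of the failing shard has already been released.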

Backport of #134656 to 9.0.8
Relates ES-12664

…the IndexWriter (elastic#134656)

@tlrx tlrx merged commit 1c119b6 into elastic:9.0 Sep 22, 2025
20 checks passed
@tlrx tlrx deleted the 2025/09/09/ES-12664-9.0 branch September 22, 2025 10:03