@tlrx (Member) commented Mar 28, 2025

We still have a potential deadlock if a thread (t1) is refreshing a reader and tries to acquire (in a non-reentrant manner) the engine read lock while holding the reader refresh lock.

If another thread (t2) is waiting for the engine write lock, new non-reentrant attempts to acquire the engine read lock are blocked behind it (this is how reentrant read/write locks work). t1 then blocks indefinitely on the read lock without releasing the refresh lock, which in turn can block other threads waiting for the refresh lock to be released.

This change reorders the acquisition of the engine read lock and the refresh lock (the engine read lock is now taken first), so that if t1 is refreshing the reader and holds the refresh lock, any attempt to acquire a read lock will be reentrant and should succeed.
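To make the lock behaviour described above concrete, here is a minimal, self-contained sketch. It is not code from this PR: the class name `ReadLockReentrancyDemo` and all variable names are hypothetical, and it uses a plain `java.util.concurrent.locks.ReentrantReadWriteLock` (default non-fair mode) rather than the real engine and refresh locks. It shows that once a writer is queued, a thread already holding the read lock can re-acquire it, while a thread that does not hold it has to wait. Timed `tryLock` is used on purpose, because the untimed `tryLock()` barges past queued threads.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class, for illustration only.
final class ReadLockReentrancyDemo {

    /** Returns { reentrantReadAcquired, freshReadAcquired }. */
    static boolean[] run() {
        ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();
        try {
            // "t1" (this thread) already holds the read lock.
            engineLock.readLock().lock();

            // "t2" asks for the write lock and parks behind the reader.
            Thread writer = new Thread(() -> {
                engineLock.writeLock().lock();
                engineLock.writeLock().unlock();
            });
            writer.start();
            while (writer.getState() != Thread.State.WAITING) {
                Thread.sleep(1);
            }

            // Reentrant acquisition by the current holder.
            boolean reentrant = engineLock.readLock().tryLock(200, TimeUnit.MILLISECONDS);

            // Non-reentrant acquisition by a thread that holds nothing.
            AtomicBoolean fresh = new AtomicBoolean();
            Thread otherReader = new Thread(() -> {
                try {
                    if (engineLock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                        fresh.set(true);
                        engineLock.readLock().unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            otherReader.start();
            otherReader.join();

            // Release every hold so the writer can proceed and exit.
            if (reentrant) {
                engineLock.readLock().unlock();
            }
            engineLock.readLock().unlock();
            writer.join();
            return new boolean[] { reentrant, fresh.get() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("reentrant read acquired: " + result[0]);
        System.out.println("fresh read acquired while writer waits: " + result[1]);
    }
}
```

In this sketch the reentrant acquisition succeeds and the fresh one times out, which is the property the reordering relies on: a thread that already holds the engine read lock while refreshing cannot get stuck behind a waiting writer.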

The compromise here is that reader refreshes can now block the reset and closing of a shard.

Relates #124635

@tlrx added the >bug, :Distributed Indexing/Engine, and v9.1.0 labels Mar 28, 2025
@elasticsearchmachine added the Team:Distributed Indexing label Mar 28, 2025
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine (Collaborator) commented

Hi @tlrx, I've created a changelog YAML for you.

@fcofdez (Contributor) left a comment

LGTM 👍

}
}

// Some engine implementations use a references counting mechanism to avoid closing the engine until all operations
Contributor

nit: this is duplicated?

engineLock.readLock().lock();
var release = true;
Engine previousEngine = null;
engineLock.writeLock().lock();
Contributor

Really nice simplification out of this work 🎉

postResetNewEngineConsumer.accept(newEngine);
onNewEngine(newEngine);
engineLock.writeLock().unlock();
// Some engine implementations use a references counting mechanism to avoid closing the engine until all operations
Contributor

Thanks for adding the comment 👍

closeShards(readonlyShard);
}

@AwaitsFix(bugUrl = "Adjust this test")
Contributor

This should be tackled I think?

/**
* Indicates that the {@link #close(String, boolean, Executor, ActionListener)} has been called
*/
private final AtomicBoolean isClosing = new AtomicBoolean();
Contributor

I wonder why we needed to introduce this new flag?

Member Author

Only to help with some Engine.Warmer logic that is executed by the refresh listener and otherwise cannot abort early when the shard is closing.

@elasticsearchmachine added the serverless-linked label Mar 28, 2025
@tlrx (Member, Author) commented Apr 4, 2025

Closed in favor of #126311

Thanks Francisco for the feedback!

@tlrx closed this Apr 4, 2025