Use read/write engine lock to guard operations against resets #124635

tlrx · 2025-03-12T11:40:19Z

Today, a shard's engine mutations are guarded by an engineMutex object monitor. But we would like to be able to execute one or more operations on an engine instance without that instance being reset during the execution of the operation.

In order to do that, this change replaces the engineMutex with a reentrant read/write lock and introduces two new methods, IndexShard#withEngine() and IndexShard#withEngineOrNull(), that can be used to execute an operation while preventing the current engine instance from being reset. It does not prevent the engine from being closed during execution, though.
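
As a rough illustration of the pattern described above, the sketch below guards an engine reference with a `ReentrantReadWriteLock`: operations run under the read lock, and a reset swaps the instance under the write lock. Only the method names `withEngine` and `withEngineOrNull` and the field name `engineLock` come from this PR; `EngineHolder`, `FakeEngine`, `resetEngine`, and all method bodies are hypothetical stand-ins, not the actual `IndexShard` code.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Hypothetical stand-in for an engine instance; not the real Elasticsearch Engine.
final class FakeEngine {
    private final String name;
    FakeEngine(String name) { this.name = name; }
    String name() { return name; }
}

// Illustrative holder, assumed for this sketch only.
final class EngineHolder {
    private final ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();
    private FakeEngine currentEngine; // guarded by engineLock

    EngineHolder(FakeEngine initial) { this.currentEngine = initial; }

    // Runs the operation under the read lock, so a concurrent reset has to wait.
    // The engine passed to the operation may be null.
    <R> R withEngineOrNull(Function<FakeEngine, R> operation) {
        engineLock.readLock().lock();
        try {
            return operation.apply(currentEngine);
        } finally {
            engineLock.readLock().unlock();
        }
    }

    // Same as above, but fails if there is no engine at the moment.
    <R> R withEngine(Function<FakeEngine, R> operation) {
        return withEngineOrNull(engine -> {
            if (engine == null) {
                throw new IllegalStateException("no engine available");
            }
            return operation.apply(engine);
        });
    }

    // A reset swaps the instance under the write lock, excluding all readers.
    void resetEngine(FakeEngine replacement) {
        engineLock.writeLock().lock();
        try {
            currentEngine = replacement;
        } finally {
            engineLock.writeLock().unlock();
        }
    }
}
```

In this sketch the read lock only keeps the reference from being swapped while the operation runs; as the description notes, it says nothing about whether the engine is closed in the meantime.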

Relates ES-10826

Note: I'm opening this change for further discussion and hand-off.

elasticsearchmachine · 2025-03-14T11:40:31Z

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

henningandersen

Looks good, I assume tests pass?

henningandersen · 2025-03-14T11:49:40Z

server/src/main/java/org/elasticsearch/index/shard/IndexShard.java

-                    return store.getMetadata(null, true);
+            engineLock.readLock().lock();
+            try {
+                synchronized (closeMutex) {


I think we should take the closeMutex first, to have the same lock acquisition ordering as in close?

Right, I pushed a1b5ff3
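
To make the ordering concern concrete, here is a minimal sketch in which every code path takes `closeMutex` before touching `engineLock`, so two threads can never each hold one lock while waiting for the other. The names `closeMutex` and `engineLock` come from the diff above; the class `OrderedLocks`, its methods, and the choice of the write lock in `close()` are assumptions made for illustration and do not reproduce the real `IndexShard` close path.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical class used only to illustrate consistent lock ordering.
final class OrderedLocks {
    private final Object closeMutex = new Object();
    private final ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();
    private boolean closed; // guarded by closeMutex

    // Reader path: closeMutex first, then the engine read lock.
    <R> R readUnderBothLocks(Supplier<R> action) {
        synchronized (closeMutex) {
            engineLock.readLock().lock();
            try {
                if (closed) {
                    throw new IllegalStateException("closed");
                }
                return action.get();
            } finally {
                engineLock.readLock().unlock();
            }
        }
    }

    // Close path: same order, closeMutex first, then the engine lock
    // (the write lock here is an assumption of this sketch).
    void close() {
        synchronized (closeMutex) {
            engineLock.writeLock().lock();
            try {
                closed = true;
            } finally {
                engineLock.writeLock().unlock();
            }
        }
    }
}
```

If one path took `engineLock` first and the other took `closeMutex` first, each thread could end up holding the lock the other needs, which is the classic lock-ordering deadlock the review comment is guarding against.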

henningandersen · 2025-03-14T12:03:17Z

server/src/main/java/org/elasticsearch/index/shard/IndexShard.java

+                    // How do we ensure that no indexing operations have been processed since prepareForEngineReset() here? We're not
+                    // blocking all operations when resetting the engine nor we are blocking flushes or force-merges.
+


I think the assumption is that we are indeed blocking operations? IIUC, we only hollow under the permit today and unhollow will happen prior to the indexing. I suppose we will get to this issue down the road once we start online hollowing, but would it not be ok to assume that the client has acquired permits/blocked operations?

I think the relocation also does something to merge and flush.

But you are probably right about the need to protect against changes. Could the IndexEngine do so, given that it knows it is now hollow through a method called from all such mutating methods?

We are indeed blocking ingestion when hollowing during the primary relocation.

We are also in general blocking ingestion (with our own ingestion blocker in stateless) when unhollowing.

I think maybe we could add an assertion that either the permits are held, or the ingestion blocker in stateless is installed here. But it might be a bit cumbersome to put the assertion on the ingestion blocker here (since it's in stateless code). I'd leave it to @fcofdez to figure this out (and can help if needed).

There is a small chance that a force merge might come through and fail; we created ES-11277 to investigate later whether this is serious enough to handle (for the moment we believe it is not).

fcofdez · 2025-03-21T18:39:44Z

Unrelated failure:

> Task :x-pack:plugin:logsdb:javaRestTest
java.lang.AssertionError: Source matching failed at document id [13]. Source documents don't match for field [bIfLGBoqaw.GgwIIyZm.xjjfhj]: Error [Values of type [half_float] don't match after normalization

Triggering CI again.

tlrx

LGTM

…c#124635) Today, a shard's engine mutations are guarded by an engineMutex object monitor. But we would like to be able to execute one or more operations on an engine instance without that instance being reset during the execution of the operation. In order to do that, this change replaces the engineMutex with a reentrant read/write lock and introduces two new methods, IndexShard#withEngine() and IndexShard#withEngineOrNull(), that can be used to execute an operation while preventing the current engine instance from being reset. It does not prevent the engine from being closed during execution, though. Relates ES-10826 Co-authored-by: Francisco Fernández Castaño <[email protected]>

…6311) This change re-introduces the engine read/write lock to guard against engine resets. It differs from #124635 in the following ways:

- uses the engineMutex for creating/closing engines
- uses the reentrant r/w lock for retaining engine instances and for resetting the engine
- acquires the reentrant read lock during refreshes to prevent deadlocks during resets
- adds tests to ensure no deadlock when re-acquiring the read lock in refresh listeners

Relates ES-11447
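
The point about re-acquiring the read lock can be illustrated with plain JDK behavior: a thread that already holds the read lock of a `ReentrantReadWriteLock` can take it again even while a writer is queued, whereas a nested acquisition that had to queue behind that writer would never complete, because the writer is itself waiting for the outer read hold to be released. The sketch below is only a demonstration of that property under assumed names (`ReentrantReadDemo`, a `resetter` thread standing in for an engine reset); it is not the test added by the follow-up change.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; the thread roles are stand-ins for refresh and reset.
final class ReentrantReadDemo {

    // Returns true if the thread already holding the read lock could re-acquire it
    // while another thread was queued for the write lock.
    static boolean reacquireReadWhileWriterWaits() {
        ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();
        engineLock.readLock().lock(); // outer hold, e.g. for the duration of a refresh

        Thread resetter = new Thread(() -> {
            engineLock.writeLock().lock(); // stands in for a reset; waits for readers to leave
            engineLock.writeLock().unlock();
        });
        resetter.start();

        boolean acquired = false;
        try {
            // Wait until the writer is actually queued on the lock.
            while (!engineLock.hasQueuedThread(resetter)) {
                Thread.sleep(1);
            }
            // Nested acquisition on the same thread, e.g. from a refresh listener.
            // The timed tryLock is used so the demo cannot hang.
            acquired = engineLock.readLock().tryLock(1, TimeUnit.SECONDS);
            if (acquired) {
                engineLock.readLock().unlock();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            engineLock.readLock().unlock(); // release the outer hold so the writer can proceed
        }
        return acquired;
    }
}
```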

elasticsearchmachine added the v9.1.0 label Mar 12, 2025

tlrx force-pushed the ES-10826-no-refresh-on-close branch from 782187e to acf2a3a Compare March 12, 2025 12:48

elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Mar 12, 2025

tlrx force-pushed the ES-10826-no-refresh-on-close branch 5 times, most recently from fbea5c1 to 4958122 Compare March 13, 2025 10:58

tlrx changed the title ~~Draft ES-10826 (no refresh on engine close)~~ Draft ES-10826 (no write lock held during engine close) Mar 13, 2025

tlrx force-pushed the ES-10826-no-refresh-on-close branch 3 times, most recently from 82ada07 to f565ee1 Compare March 13, 2025 13:42

tlrx removed the serverless-linked Added by automation, don't add manually label Mar 13, 2025

tlrx force-pushed the ES-10826-no-refresh-on-close branch 2 times, most recently from 920c83c to 2256a6f Compare March 13, 2025 17:17

ES-10826

a2f57d2

tlrx force-pushed the ES-10826-no-refresh-on-close branch from 2256a6f to a2f57d2 Compare March 14, 2025 09:37

tlrx changed the title ~~Draft ES-10826 (no write lock held during engine close)~~ Use read/write engine lock to guard operations against resets Mar 14, 2025

tlrx added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label Mar 14, 2025

tlrx marked this pull request as ready for review March 14, 2025 11:40

tlrx requested a review from fcofdez March 14, 2025 11:40

elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Mar 14, 2025

tlrx requested a review from henningandersen March 14, 2025 11:40

henningandersen reviewed Mar 14, 2025

tlrx and others added 5 commits March 14, 2025 14:28

Merge branch 'main' into ES-10826-no-refresh-on-close

f718587

closeMutex -> engineLock

a1b5ff3

comment

0a89b07

verifyNotClosed

eed44c7

[CI] Auto commit changes from spotless

f354205

Merge branch 'main' into ES-10826-no-refresh-on-close

167dced

fcofdez added 8 commits March 21, 2025 19:42

Merge branch 'main' into ES-10826-no-refresh-on-close

1e1dd32

Merge branch 'main' into ES-10826-no-refresh-on-close

52e1343

Close the engine outside of the write lock

8bc5374

Merge branch 'ES-10826-no-refresh-on-close' of github.com:tlrx/elasti…

9809a95

…csearch into ES-10826-no-refresh-on-close

Merge remote-tracking branch 'origin/main' into ES-10826-no-refresh-o…

f0a6ea4

…n-close

Missing javadocs

9b1d3c2

Revert part

cc071be

Merge remote-tracking branch 'origin/main' into ES-10826-no-refresh-o…

f2eae65

…n-close

tlrx commented Mar 25, 2025

fcofdez merged commit 4aa7ce5 into elastic:main Mar 25, 2025
17 checks passed

fcofdez added a commit to fcofdez/elasticsearch that referenced this pull request Mar 26, 2025

Guard Get operations against Engine resets

3aa270a

Relates elastic#124635 Closes ES-11324

fcofdez mentioned this pull request Mar 26, 2025

Guard Get operations against Engine resets #125646

Merged

This was referenced Mar 28, 2025

Ensure that RefreshListener do not access engine under refresh lock #124328

Closed

Draft ES-10826 #122749

Closed

[Draft] Add IndexShard.withEngine method #123688

Closed

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Mar 28, 2025

Hold engine read lock during reader refresh

4547a47

Relates elastic#124635

tlrx mentioned this pull request Mar 28, 2025

Hold engine read lock during reader refresh #125856

Closed

tlrx added a commit that referenced this pull request Mar 31, 2025

Revert "Use read/write engine lock to guard operations against resets (…

d754064

…#124635)" This reverts commit 4aa7ce5.

This was referenced Mar 31, 2025

Revert "Use read/write engine lock to guard operations against resets" #125914

Closed

Revert "Use read/write engine lock to guard operations against resets" #125915

Merged

tlrx added a commit that referenced this pull request Mar 31, 2025

Revert "Use read/write engine lock to guard operations against resets (…

7fadeeb

…#124635)" (#125915) This reverts commit 4aa7ce5.

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Apr 2, 2025

Revert "Revert "Use read/write engine lock to guard operations agains…

29be30d

…t resets (elastic#124635)" (elastic#125915)" This reverts commit 7fadeeb.

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Apr 3, 2025

Revert "Revert "Use read/write engine lock to guard operations agains…

6942041

…t resets (elastic#124635)" (elastic#125915)" This reverts commit 7fadeeb.

tlrx mentioned this pull request Apr 4, 2025

Revive read/write engine lock to guard operations against resets #126311

Merged

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Apr 7, 2025

Revert "Revert "Use read/write engine lock to guard operations agains…

fcb9278

…t resets (elastic#124635)" (elastic#125915)" This reverts commit 7fadeeb.

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Apr 8, 2025

Revert "Revert "Use read/write engine lock to guard operations agains…

d290507

…t resets (elastic#124635)" (elastic#125915)" This reverts commit 7fadeeb.


kingherc Mar 18, 2025
