Suspend Index throttling when relocating #128797
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)
I noticed during testing that throttling gets disabled once a shard is moved. I guess this is because of the way the engine is created for the relocated shard. But I haven't had a chance to dig into the relocation code to verify that this is expected behaviour.
Thanks for working on this. Left a number of comments.
indexShard.suspendThrottling();
waitUntilBlocked(ActionListener.assertOnce(onAcquired), timeout, timeUnit, executor);
// TODO: Does this do anything ? Looks like the relocated shard does not have throttling enabled
indexShard.resumeThrottling();
I think I would prefer to handle this outside this class. We can add a method in IndexShard that wraps blockOperations and does this, which avoids passing an object to this method and the resulting effect on testing.
Also, notice that this is somewhat incorrect as written, because we sometimes call this with the executor set to the generic thread pool. We should instead resume throttling when the listener is called; that will handle all cases.
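A minimal sketch of the pattern suggested in this comment, assuming a simplified model rather than the real Elasticsearch classes: `PermitListener`, `ShardThrottleSketch`, and `blockOperationsWithThrottlingSuspended` are hypothetical names introduced here for illustration, and the private `waitUntilBlocked` is only a stand-in for the real permit-acquiring call. The point it illustrates is that throttling is resumed inside the listener callbacks, so it still happens at the right time when the work completes later on another thread.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

/** Simplified stand-in for a listener; NOT Elasticsearch's ActionListener. */
interface PermitListener {
    void onResponse(String permits);
    void onFailure(Exception e);
}

/** Hypothetical model of a shard; class and method names are illustrative only. */
class ShardThrottleSketch {
    private final AtomicBoolean throttlingSuspended = new AtomicBoolean(false);

    void suspendThrottling() { throttlingSuspended.set(true); }
    void resumeThrottling() { throttlingSuspended.set(false); }
    boolean isThrottlingSuspended() { return throttlingSuspended.get(); }

    /** Stand-in for the permit-acquiring call: it may complete later, on the executor's thread. */
    private void waitUntilBlocked(PermitListener onAcquired, Executor executor) {
        executor.execute(() -> onAcquired.onResponse("all-permits"));
    }

    /** Suspends throttling up front and resumes it only when the listener is called. */
    void blockOperationsWithThrottlingSuspended(PermitListener onAcquired, Executor executor) {
        suspendThrottling();
        PermitListener resumeThenDelegate = new PermitListener() {
            @Override public void onResponse(String permits) {
                resumeThrottling();              // resume on success...
                onAcquired.onResponse(permits);
            }
            @Override public void onFailure(Exception e) {
                resumeThrottling();              // ...and on failure (for example a timeout)
                onAcquired.onFailure(e);
            }
        };
        waitUntilBlocked(resumeThenDelegate, executor);
    }
}
```

With this shape, calling `resumeThrottling()` directly after `waitUntilBlocked(...)` returns is no longer needed; that direct call would run too early whenever the executor defers the work.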
}
}

public boolean isIndexingPaused() {
This does not seem necessary to expose outside IndexShard?
I was calling it from RelocationIT. I can remove it. Just for my understanding, why is it risky to expose this?
Removed in the latest upload.
It exposes internal state from the engine. As such it is not "risky", but exposing more than necessary breaks encapsulation. In particular this one is only there for testing and can be fetched just as easily without this. The IndexShard interface is huge and I'd like to keep the surface it has down.
Thanks for explaining, Henning.
logger.info("--> index more docs so we have something in the translog");
for (int i = 10; i < 20; i++) {
    prepareIndex("test").setId(Integer.toString(i)).setSource("field", "value" + i).get();
}
I do not follow why this is important to the test?
It is not. I wrote this test by modifying testRelocationWhileIndexingRandom() so it's just a carry over from there. Removed it.
assertHitCount(prepareSearch("test").setSize(0), 20);

logger.info("--> relocate the shard from node1 to node2");
ClusterRerouteUtils.reroute(client(), new MoveAllocationCommand("test", 0, node_1, node_2));
I prefer to set an allocation rule through index settings. Something like index.routing.allocation.include._id = node_2.
I am not sure I follow this comment.
Or maybe I do. Like this?
updateIndexSettings(Settings.builder().put("index.routing.allocation.include._id", node_2), "test");
Do you want me to change this everywhere in this file?
I am not sure how to make that work. I tried this:
updateIndexSettings(Settings.builder().put("index.routing.allocation.include._id", nodes[toNode]), "test");
ensureGreen(ACCEPTABLE_RELOCATION_TIME, "test");
But it looks like this is not enough to ensure that the shard has moved to the target node.
You need to use ._name if you use node_2, like done here (though that one excludes, you can do that too - or use include, both should work).
Thank you!
assertThat(clusterHealthResponse.isTimedOut(), equalTo(false));

// Relocated shard is not throttled
assertThat(shard.isIndexingPaused(), equalTo(false));
This seems surprising, why is it not throttled?
I initially thought it might be because the node that we relocate the shard to does not have PAUSE_THROTTLING enabled. But enabling it there doesn't help either. So I am guessing it has to do with how we do the relocation. Wouldn't we have to recreate the engine on the new node, which probably would not carry over the throttling state?
But wait, this is the original source shard we are talking about, not the relocated target shard, so it should have throttling enabled after we resume throttling. I will need to look into this a bit more.
Oh I figured it out, it's because the engine is null for the source shard. I will just get rid of this check, I don't think it is useful.
There was a problem with RelocationIT#testRelocationWhileIndexingRandom() where we were relocating the replica and not the primary. I changed it so we are relocating the primary, and it works fine now.
LGTM.
 * @param timeUnit the time unit of the {@code timeout} argument
 * @param executor executor on which to wait for in-flight operations to finish and acquire all permits
 */
public void blockOperations(
Can this be private?
- public void blockOperations(
+ private void blockOperations(
If index throttling is enabled such that it pauses all indexing threads that try to index into a shard, this can starve other tasks such as relocation that try to acquire all indexing permits. This PR addresses this by suspending throttling to allow the indexing threads that are holding the permits to pass. Addresses ES-11770.
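The starvation described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the Elasticsearch implementation: `PermitStarvationSketch`, its permit count of 4, and all method names are hypothetical, and a plain `Semaphore` plus a monitor stand in for the real permit and throttling machinery. An indexing thread takes one permit and is then paused by the throttle, so a caller that needs all permits cannot make progress until throttling is suspended.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of permit starvation under pause-style throttling. */
class PermitStarvationSketch {
    static final int TOTAL_PERMITS = 4;                       // assumed value for the demo
    private final Semaphore permits = new Semaphore(TOTAL_PERMITS);
    private final Object throttleMonitor = new Object();
    private boolean throttlePaused = true;                    // throttle is actively pausing indexing
    private boolean suspended = false;                        // set while a relocation-like task runs

    /** An indexing operation: holds one permit while the throttle keeps it paused. */
    void index(CountDownLatch holdingPermit) throws InterruptedException {
        permits.acquire();
        try {
            holdingPermit.countDown();
            synchronized (throttleMonitor) {
                while (throttlePaused && !suspended) {
                    throttleMonitor.wait();
                }
            }
        } finally {
            permits.release();
        }
    }

    void suspendThrottling() {
        synchronized (throttleMonitor) { suspended = true; throttleMonitor.notifyAll(); }
    }

    void resumeThrottling() {
        synchronized (throttleMonitor) { suspended = false; }
    }

    /** Relocation-like task: needs every permit, gives up after the timeout. */
    boolean tryAcquireAllPermits(long timeout, TimeUnit unit) throws InterruptedException {
        return permits.tryAcquire(TOTAL_PERMITS, timeout, unit);
    }

    /** Returns {acquiredWhileThrottled, acquiredWhileSuspended}. */
    static boolean[] runDemo() {
        PermitStarvationSketch shard = new PermitStarvationSketch();
        CountDownLatch holdingPermit = new CountDownLatch(1);
        Thread indexer = new Thread(() -> {
            try {
                shard.index(holdingPermit);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();
        try {
            holdingPermit.await();                            // the indexer now holds a permit and is paused
            boolean whileThrottled = shard.tryAcquireAllPermits(200, TimeUnit.MILLISECONDS);
            shard.suspendThrottling();                        // let the paused indexer finish and release
            boolean whileSuspended = shard.tryAcquireAllPermits(5, TimeUnit.SECONDS);
            shard.resumeThrottling();                         // the demo never releases the permits it took
            indexer.join();
            return new boolean[] { whileThrottled, whileSuspended };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this model, `PermitStarvationSketch.runDemo()` should report that acquiring all permits fails while the throttle is pausing the indexer and succeeds once throttling is suspended.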