
Conversation

@DaveCTurner (Contributor)

Today we limit the number of shards concurrently closed by the
`IndicesClusterStateService`, but this limit is currently a function of
the CPU count of the node. On nodes with plentiful CPU but poor IO
performance we may want to restrict this limit further. This commit
exposes the throttling limit as a setting.
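
For instance (a minimal sketch assuming the standard `Settings` builder API; the setting name is taken from the code excerpt further down this thread), a test or internal caller could override the limit like so:

import org.elasticsearch.common.settings.Settings;

// Illustrative only: cap concurrent shard closes at 2 regardless of CPU count.
// The default shown later in this PR is min(10, node.processors).
Settings nodeSettings = Settings.builder()
    .put("indices.store.max_concurrent_closing_shards", 2)
    .build();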
DaveCTurner added the >non-issue, :Distributed Coordination/Cluster Coordination, v9.0.0 and v8.18.0 labels Jan 30, 2025
DaveCTurner requested a review from a team as a code owner January 30, 2025 10:10
elasticsearchmachine added the Team:Distributed Coordination label Jan 30, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner (Contributor, Author)

>non-issue because this setting is only really for internal use

@DiannaHohensee (Contributor) left a comment

Code change looks good, though I'm having trouble understanding the test.

public class ShardCloseExecutorTests extends ESTestCase {

    public void testThrottling() {
        final var defaultProcessors = EsExecutors.NODE_PROCESSORS_SETTING.get(Settings.EMPTY).roundUp();
Contributor

What are the expectations around the value of defaultProcessors for tests? You have if-statements later, and I'm wondering what runs.

Contributor (Author)

It is the number of CPUs of the machine on which the tests are running, so it can be more or less than 10. And it's not permitted to increase node.processors to greater than the default, which is why we have to skip some tests on low-CPU machines.
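
For illustration, the guard being described might look roughly like this (a hypothetical sketch, not the actual test code; `runThrottlingTestWithProcessors` is an invented placeholder):

// node.processors may not be set above the number of CPUs actually available,
// so cases that need a high processor count are skipped on small machines.
final var defaultProcessors = EsExecutors.NODE_PROCESSORS_SETTING.get(Settings.EMPTY).roundUp();
if (defaultProcessors >= 10) {
    // only here is it safe to exercise the branch where the default limit caps out at 10
    runThrottlingTestWithProcessors(defaultProcessors); // hypothetical helper
}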


assertEquals(expectedLimit, tasksToRun.size()); // didn't enqueue the final task yet

for (int i = 0; i < tasksToRun.size(); i++) {
Contributor

I'm struggling to understand this method. Is there any way you could refactor or document it to make it easier to understand?

Contributor (Author)

I added some comments in d1fd519, does that help?

@DiannaHohensee (Contributor) left a comment

Thanks, that's made it easier for me to understand. LGTM!

DaveCTurner added the auto-merge-without-approval and auto-backport labels Jan 31, 2025
elasticsearchmachine merged commit e1c6c3f into elastic:main Jan 31, 2025
17 checks passed
DaveCTurner deleted the 2025/01/30/CONCURRENT_SHARD_CLOSE_LIMIT branch January 31, 2025 17:53
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 31, 2025
Today we limit the number of shards concurrently closed by the
`IndicesClusterStateService`, but this limit is currently a function of
the CPU count of the node. On nodes with plentiful CPU but poor IO
performance we may want to restrict this limit further. This commit
exposes the throttling limit as a setting.
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.x

DaveCTurner added a commit that referenced this pull request Feb 6, 2025
Today we limit the number of shards concurrently closed by the
`IndicesClusterStateService`, but this limit is currently a function of
the CPU count of the node. On nodes with plentiful CPU but poor IO
performance we may want to restrict this limit further. This commit
exposes the throttling limit as a setting.
     */
    public static final Setting<Integer> CONCURRENT_SHARD_CLOSE_LIMIT = Setting.intSetting(
        "indices.store.max_concurrent_closing_shards",
        settings -> Integer.toString(Math.min(10, EsExecutors.NODE_PROCESSORS_SETTING.get(settings).roundUp())),
Member

Previously the default max was

final var maxThreads = Math.max(EsExecutors.NODE_PROCESSORS_SETTING.get(settings).roundUp(), 10);

Note Math.max instead of Math.min. Is this change intentional?

Contributor (Author)

Yes, it was a (my) mistake to use max here in the first place. I noticed the issue when I saw a small (IO-bound) node struggling to close lots of shards at once.
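
To make the difference concrete (illustrative CPU counts, not figures from the PR):

// Old default: Math.max(processors, 10) -> 48-CPU node: 48 concurrent closes; 2-CPU node: 10
// New default: Math.min(10, processors) -> 48-CPU node: 10 concurrent closes; 2-CPU node: 2
// The new default never exceeds 10 and scales down with small CPU counts.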
