
Conversation

@DaveCTurner (Contributor)

Today we limit the number of shards concurrently closed by the
`IndicesClusterStateService`, but this limit is currently a function of
the CPU count of the node. On nodes with plentiful CPU but poor IO
performance we may want to restrict this limit further. This commit
exposes the throttling limit as a setting.
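
For instance (a minimal sketch assuming the standard `Settings` builder API; the setting name is taken from the code excerpt further down this thread), a test or internal caller could override the limit like so:

import org.elasticsearch.common.settings.Settings;

// Illustrative only: cap concurrent shard closes at 2 regardless of CPU count.
// The default shown later in this PR is min(10, node.processors).
Settings nodeSettings = Settings.builder()
    .put("indices.store.max_concurrent_closing_shards", 2)
    .build();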
DaveCTurner added the >non-issue, :Distributed Coordination/Cluster Coordination, v9.0.0 and v8.18.0 labels Jan 30, 2025
DaveCTurner requested a review from a team as a code owner January 30, 2025 10:10
elasticsearchmachine added the Team:Distributed Coordination label Jan 30, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner (Contributor, Author)

>non-issue because this setting is only really for internal use

@DiannaHohensee (Contributor) left a comment

Code change looks good, though I'm having trouble understanding the test.

public class ShardCloseExecutorTests extends ESTestCase {

    public void testThrottling() {
        final var defaultProcessors = EsExecutors.NODE_PROCESSORS_SETTING.get(Settings.EMPTY).roundUp();
Contributor

What are the expectations around the value of defaultProcessors for tests? You have if-statements later, and I'm wondering what runs.

Contributor (Author)

It is the number of CPUs of the machine on which the tests are running, so it can be more or less than 10. And it's not permitted to increase node.processors to greater than the default, which is why we have to skip some tests on low-CPU machines.
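
For illustration, the guard being described might look roughly like this (a hypothetical sketch, not the actual test code; `runThrottlingTestWithProcessors` is an invented placeholder):

// node.processors may not be set above the number of CPUs actually available,
// so cases that need a high processor count are skipped on small machines.
final var defaultProcessors = EsExecutors.NODE_PROCESSORS_SETTING.get(Settings.EMPTY).roundUp();
if (defaultProcessors >= 10) {
    // only here is it safe to exercise the branch where the default limit caps out at 10
    runThrottlingTestWithProcessors(defaultProcessors); // hypothetical helper
}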


assertEquals(expectedLimit, tasksToRun.size()); // didn't enqueue the final task yet

for (int i = 0; i < tasksToRun.size(); i++) {
Contributor

I'm struggling to understand this method. Is there any way you could refactor or document it to make it easier to understand?

Contributor (Author)

I added some comments in d1fd519, does that help?

@DiannaHohensee (Contributor) left a comment

Thanks, that's made it easier for me to understand. LGTM!

DaveCTurner added the auto-merge-without-approval and auto-backport labels Jan 31, 2025
elasticsearchmachine merged commit e1c6c3f into elastic:main Jan 31, 2025
17 checks passed
DaveCTurner deleted the 2025/01/30/CONCURRENT_SHARD_CLOSE_LIMIT branch January 31, 2025 17:53
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 31, 2025
Today we limit the number of shards concurrently closed by the
`IndicesClusterStateService`, but this limit is currently a function of
the CPU count of the node. On nodes with plentiful CPU but poor IO
performance we may want to restrict this limit further. This commit
exposes the throttling limit as a setting.
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.x

DaveCTurner added a commit that referenced this pull request Feb 6, 2025
Today we limit the number of shards concurrently closed by the
`IndicesClusterStateService`, but this limit is currently a function of
the CPU count of the node. On nodes with plentiful CPU but poor IO
performance we may want to restrict this limit further. This commit
exposes the throttling limit as a setting.
     */
    public static final Setting<Integer> CONCURRENT_SHARD_CLOSE_LIMIT = Setting.intSetting(
        "indices.store.max_concurrent_closing_shards",
        settings -> Integer.toString(Math.min(10, EsExecutors.NODE_PROCESSORS_SETTING.get(settings).roundUp())),
Member

Previously the default max was

final var maxThreads = Math.max(EsExecutors.NODE_PROCESSORS_SETTING.get(settings).roundUp(), 10);

Note Math.max instead of Math.min. Is this change intentional?

Contributor (Author)

Yes, it was a (my) mistake to use max here in the first place. I noticed the issue when I saw a small (IO-bound) node struggling to close lots of shards at once.
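
To make the difference concrete (illustrative CPU counts, not figures from the PR):

// Old default: Math.max(processors, 10) -> 48-CPU node: 48 concurrent closes; 2-CPU node: 10
// New default: Math.min(10, processors) -> 48-CPU node: 10 concurrent closes; 2-CPU node: 2
// The new default never exceeds 10 and scales down with small CPU counts.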
