Skip to content

Conversation

@nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Sep 4, 2025

As an alternative to adding code to model shard movements to/from nodes with no NodeUsageStatsForThreadPools in the current ClusterInfo, just add a test to confirm that the situation is only very brief, because we eagerly fetch a new ClusterInfo when that occurs.

There didn't seem to be a test that covered this behaviour specifically (it dates back to the dark ages). Interesting that many of the other scenarios (become master 1, fail master 1, become master 2, fail master 2) don't seem to assert that the InternalClusterInfoService actually did anything? You can comment out all the behaviour in org.elasticsearch.cluster.InternalClusterInfoService#clusterChanged and the only bit that fails is the interval polling part.

I can add assertions to those if we think it's worthwhile?

@nicktindall nicktindall added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Sep 4, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0 labels Sep 4, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

);
// Don't use runUntilFlag because we don't want the scheduled task to run
deterministicTaskQueue.runAllRunnableTasks();
assertTrue(nodeJoined.get());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may be why the existing tests don't assert anything, it feels a little hacky, but this seemed like the right test to add to? Could also implement as an IT

@nicktindall nicktindall requested a review from Copilot September 4, 2025 03:22
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a test to verify that the ClusterInfo is refreshed when a new node joins the cluster, addressing a gap in test coverage for this critical behavior. The test ensures that when nodes are added to a cluster, the InternalClusterInfoService eagerly fetches updated cluster information, which is important for proper shard allocation decisions.

Key changes:

  • Added test coverage for node join and leave scenarios in cluster info service scheduling
  • Verified that cluster info refresh is triggered when nodes join but not when they leave

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

…InfoServiceSchedulingTests.java

Co-authored-by: Copilot <[email protected]>
Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nicktindall nicktindall merged commit dc743a9 into elastic:main Sep 4, 2025
33 checks passed
@nicktindall nicktindall deleted the test_clusterinfo_refresh_on_node_join branch September 4, 2025 23:43
jbaiera pushed a commit to jbaiera/elasticsearch that referenced this pull request Sep 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants