Conversation

@DiannaHohensee DiannaHohensee commented Jul 28, 2025

This is the initial version of the write load decider, with #canAllocate
implemented. It checks whether assigning a shard to a new node would
exceed the node's simulated utilization threshold.

Closes ES-12564


Is this direction alright with folks? I'd like to get this working end-to-end (so we don't block each other), and then we can improve pieces of the system in parallel. I still need to add the testing, but want to check in first.

Update: Ready for review now 👍 I've filed ES-12620 as a follow-up for IT testing. The Monitor (ES-11992) may be needed for more thorough testing; I haven't thought through how to write the tests yet, but I expect we should be able to get something working without it.

@DiannaHohensee DiannaHohensee self-assigned this Jul 28, 2025
@DiannaHohensee DiannaHohensee added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0 labels Jul 28, 2025
@nicktindall nicktindall left a comment

Yep this seems reasonable to me

@DiannaHohensee DiannaHohensee changed the title WriteLoadConstraintDecider PoC Implement WriteLoadConstraintDecider#canAllocate Aug 11, 2025
@DiannaHohensee DiannaHohensee marked this pull request as ready for review August 11, 2025 23:00
@elasticsearchmachine

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

assert nodeUsageStatsForThreadPools.threadPoolUsageStatsMap().get(ThreadPool.Names.WRITE) != null;
var nodeWriteThreadPoolStats = nodeUsageStatsForThreadPools.threadPoolUsageStatsMap().get(ThreadPool.Names.WRITE);
var nodeWriteThreadPoolLoadThreshold = writeLoadConstraintSettings.getWriteThreadPoolHighUtilizationThresholdSetting();
if (nodeWriteThreadPoolStats.averageThreadPoolUtilization() >= nodeWriteThreadPoolLoadThreshold) {
Contributor


if (nodeWriteThreadPoolStats.averageThreadPoolUtilization() >= nodeWriteThreadPoolLoadThreshold) {
This one looks redundant after calculateShardMovementChange: if the simulation fails, this check will fail too. I don't think the overhead of calculateShardMovementChange will be noticeable anywhere.

Contributor Author


Yes, the next check would also catch this. At this point it's really a matter of the explanation message for the NO decision. They could be combined; kept separate, I think the messages are clearer for the user to understand.


@mhl-b mhl-b Aug 12, 2025


calculateShardMovementChange already contains information about the current thread pool utilization, so it's not hard to see that the node is above the high threshold before the movement attempt.

Contributor Author


I'm not sure what you're suggesting. Are you proposing to add logic to calculateShardMovementChange? If the threshold is already exceeded, the calculation just adds on top of that value (exceeding the threshold further), so nothing needs to change there.

I like the clarity of the separate explain messages. Do you feel strongly about merging the two if statements?

Contributor


I don't feel strongly
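For illustration, here is a toy sketch of the two-check structure discussed above, with a distinct explain message for each NO case. The threshold value, method names, and message text are invented for the example; the real decider reads the threshold from WriteLoadConstraintSettings and simulates the move via calculateShardMovementChange.

```java
public class WriteLoadCheckSketch {
    // Invented threshold; in the PR this comes from
    // writeLoadConstraintSettings.getWriteThreadPoolHighUtilizationThresholdSetting().
    static final double HIGH_UTILIZATION_THRESHOLD = 0.9;

    // Stand-in for calculateShardMovementChange: current utilization plus the
    // shard's estimated share of the write thread pool.
    static double simulatedUtilizationAfterMove(double currentUtilization, double shardWriteLoadDelta) {
        return currentUtilization + shardWriteLoadDelta;
    }

    // Returns an explain message for a NO decision, or null for YES.
    static String canAllocate(double currentUtilization, double shardWriteLoadDelta) {
        if (currentUtilization >= HIGH_UTILIZATION_THRESHOLD) {
            // First check: the node is already hot, before simulating the move.
            return "NO: node is already above the write thread pool high utilization threshold";
        }
        if (simulatedUtilizationAfterMove(currentUtilization, shardWriteLoadDelta) >= HIGH_UTILIZATION_THRESHOLD) {
            // Second check: the move itself would push the node over the threshold.
            return "NO: assigning the shard would push the node above the threshold";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(canAllocate(0.95, 0.01)); // caught by the first check
        System.out.println(canAllocate(0.85, 0.10)); // caught by the second check
        System.out.println(canAllocate(0.50, 0.10)); // null: allocation allowed
    }
}
```

As the thread notes, the second check alone would reject both hot cases; keeping both only changes which explanation the user sees.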

Comment on lines 46 to 49
ClusterState clusterState = ClusterStateCreationUtils.stateWithAssignedPrimariesAndReplicas(new String[] { indexName }, 3, 1);
// The number of data nodes the util method above creates is numberOfReplicas+1.
assertEquals(3, clusterState.nodes().size());
assertEquals(1, clusterState.metadata().getTotalNumberOfIndices());
Contributor


I find these assertions very distracting from the actual change. A unit test should assert the behaviour of the unit in question, preferably one or two assertions per test; we should not assert our setup infrastructure.

If these methods are not trusted, we'd better make them trustworthy, or put the assertions inside them.

Contributor Author


The helper method is not documented and goes through several method layers before selecting the number of nodes as a side effect of the input. Properly improving the method, in my mind, would require method renames, a new method parameter through the stack, and documentation. That would make a lot of noise in this PR.

The original intent of the helper method was pretty clearly index focused, not node focused. But it's very helpful in setting up the ClusterState, so I don't have to roll it all by hand myself. Perhaps I can do a follow-up patch to address this, rather than making the noise in this PR?

Contributor


Or don't use assertions for these. IMHO it will be fine: less code, easier to read.

Contributor Author


Uh, I'm not really comfortable relying on hidden method behavior without asserting that it continues to hold. If the behavior were changed by someone unaware of this dependency, this test would fail in unclear ways.

Would a follow-up refactor be satisfactory? The asserts would no longer be necessary if the contract were changed.
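A toy illustration of the pattern being debated (the helper here is invented; in the PR it is ClusterStateCreationUtils.stateWithAssignedPrimariesAndReplicas, whose data node count is numberOfReplicas + 1 as a side effect of its input):

```java
public class HiddenContractSketch {
    // Stand-in helper whose node count is an undocumented side effect of its
    // input, like the ClusterState helper under discussion.
    static int createClusterStateNodeCount(int numberOfReplicas) {
        return numberOfReplicas + 1; // number of data nodes, derived implicitly
    }

    public static void main(String[] args) {
        int nodes = createClusterStateNodeCount(2);
        // Pinning the hidden contract up front makes a change to the helper
        // fail here with a clear message, rather than deep inside test logic.
        if (nodes != 3) {
            throw new AssertionError("expected 3 data nodes, got " + nodes);
        }
        System.out.println("setup holds: " + nodes + " data nodes");
    }
}
```

The alternatives discussed are exactly these two: pin the behavior where the test depends on it, or document/assert it inside the helper so callers can trust it.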


@DiannaHohensee DiannaHohensee left a comment


Thanks for the feedback, updated and ready for another round.




try (var ignoredRefs = fetchRefs) {
maybeFetchIndicesStats(diskThresholdEnabled || writeLoadConstraintEnabled == WriteLoadDeciderStatus.ENABLED);
maybeFetchIndicesStats(diskThresholdEnabled || writeLoadConstraintEnabled.atLeastLowThresholdEnabled());
Contributor


Nit: alternatively we could have a method WriteLoadDeciderStatus#requiresShardLevelWriteLoads() (and requiresNodeLevelWriteLoads()) which would return true for LOW_THRESHOLD_ONLY and ENABLED but false for DISABLED.

It would read more nicely if the writeLoadConstraintEnabled field were called writeLoadDeciderStatus, if we went that way.

Don't feel strongly about this naming thing though.

Contributor Author


requiresShardLevelWriteLoads and requiresNodeLevelWriteLoads don't seem like the right split, as I understand it. I was imagining LOW as best-effort hot-spot prevention (canAllocate) without hot-spot correction (canRemain), and fully enabled as including hot-spot correction. Both node- and shard-level stats are needed for prevention, to compare the shard move's write load change with the node's overall write load.

I'll leave this as is until a follow-up.
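As a sketch of the status levels being described (the names mirror the discussion above, but this is a hypothetical stand-in; the real WriteLoadDeciderStatus and atLeastLowThresholdEnabled live in the Elasticsearch codebase and may differ):

```java
public class WriteLoadStatusSketch {
    // Hypothetical mirror of the WriteLoadDeciderStatus enum discussed above.
    enum Status {
        DISABLED,            // decider off entirely
        LOW_THRESHOLD_ONLY,  // best-effort hot-spot prevention (canAllocate) only
        ENABLED;             // prevention plus hot-spot correction (canRemain)

        // Both LOW_THRESHOLD_ONLY and ENABLED need the write load stats, so
        // stats fetching is gated on this rather than on "== ENABLED".
        boolean atLeastLowThresholdEnabled() {
            return this != DISABLED;
        }
    }

    public static void main(String[] args) {
        for (Status s : Status.values()) {
            System.out.println(s + " -> " + s.atLeastLowThresholdEnabled());
        }
    }
}
```

This is why the diff above replaces the `== WriteLoadDeciderStatus.ENABLED` comparison with `atLeastLowThresholdEnabled()`: the low-threshold mode also needs the stats.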


@nicktindall nicktindall left a comment


LGTM, with some nits


@DiannaHohensee DiannaHohensee left a comment


Applied updates per Nick's review.




@mhl-b mhl-b left a comment


LGTM

@DiannaHohensee DiannaHohensee merged commit c1aadc4 into elastic:main Aug 15, 2025
32 of 33 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Aug 15, 2025
* upstream/main: (32 commits)
  Speed up loading keyword fields with index sorts (elastic#132950)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964
  Simplify EsqlSession (elastic#132848)
  Implement WriteLoadConstraintDecider#canAllocate (elastic#132041)
  Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965
  Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941)
  Breakdown undesired allocations by shard routing role (elastic#132235)
  Implement v_magnitude function (elastic#132765)
  Introduce execution location marker for better handling of remote/local compatibility (elastic#132205)
  Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957
  Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946)
  Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956
  Fix failing UT by adding a required capability (elastic#132947)
  Precompute the BitsetCacheKey hashCode (elastic#132875)
  Adding simulate ingest effective mapping (elastic#132833)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948
  Rename skipping logic to remove hard link to skip_unavailable (elastic#132861)
  Store ignored source in unique stored fields per entry (elastic#132142)
  Add random tests with match_only_text multi-field (elastic#132380)
  ...
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Aug 15, 2025
szybia added a commit to szybia/elasticsearch that referenced this pull request Aug 15, 2025
shardRouting.shardId()
);
logger.debug(explain);
return Decision.single(Decision.Type.NO, NAME, explain);
Contributor


I think this should respond "not-preferred" instead?

Contributor Author


There was disagreement about implementing Decision#NOT_PREFERRED, so we're getting the basics in with NO, and I plan to explore the balancer and decision code in ES-11998 this sprint.
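A toy illustration of the distinction under discussion (the NOT_PREFERRED level here is hypothetical; the PR as merged returns Decision.single(Decision.Type.NO, ...), and a not-preferred decision type was explicitly not implemented):

```java
public class DecisionSketch {
    // Toy decision levels. A hypothetical NOT_PREFERRED would sit between YES
    // and NO: the balancer would avoid the node when alternatives exist, but
    // could still use it, whereas NO rules the node out entirely.
    enum Type { YES, NOT_PREFERRED, NO }

    // Picks the first node whose decision is YES, falling back to the first
    // NOT_PREFERRED node, and never to a NO node.
    static int pickNode(Type[] decisions) {
        int fallback = -1;
        for (int i = 0; i < decisions.length; i++) {
            if (decisions[i] == Type.YES) {
                return i;
            }
            if (decisions[i] == Type.NOT_PREFERRED && fallback == -1) {
                fallback = i;
            }
        }
        return fallback; // -1 if every node said NO
    }

    public static void main(String[] args) {
        System.out.println(pickNode(new Type[] { Type.NO, Type.NOT_PREFERRED, Type.YES })); // 2
        System.out.println(pickNode(new Type[] { Type.NO, Type.NOT_PREFERRED, Type.NO }));  // 1
        System.out.println(pickNode(new Type[] { Type.NO, Type.NO }));                      // -1
    }
}
```

With a hard NO, a hot node can never receive the shard even when every other node says NO too, which is the trade-off the reviewer is pointing at.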
