
Conversation

Contributor @ankikuma commented on Sep 2, 2025

Add the reshard count (the effective shard count as seen by the coordinating node) to the replication request. Note that this new field is not yet serialized over the network.
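
As a rough illustration of the shape of the change (names simplified; this is not the merged code), the request gains an integer field that the coordinating node fills in and a getter the primary can consult:

// Illustrative sketch only, not the merged code: a replication request that carries
// the shard count the coordinating node observed when it routed the request.
public class ReplicationRequestSketch {

    // Effective shard count as seen by the coordinating node; 0 means "unset".
    // In this PR the field is not yet written to the wire.
    private final int reshardSplitShardCount;

    public ReplicationRequestSketch(int reshardSplitShardCount) {
        this.reshardSplitShardCount = reshardSplitShardCount;
    }

    public int reshardSplitShardCount() {
        return reshardSplitShardCount;
    }
}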

@elasticsearchmachine added the v9.2.0 and serverless-linked labels on Sep 2, 2025
@elasticsearchmachine removed the needs:triage label on Sep 16, 2025
@ankikuma added the needs:triage label on Sep 16, 2025
@elasticsearchmachine removed the needs:triage label on Sep 16, 2025
@ankikuma added the :Distributed Indexing/Distributed and needs:triage labels and removed the :Distributed Indexing/CRUD label on Sep 16, 2025
@elasticsearchmachine removed the needs:triage label on Sep 16, 2025
* The purpose of this metadata is to reconcile the cluster state visible at the coordinating
* node with that visible at the source shard node. (w.r.t resharding).
* Note that we are able to get away with a single number, instead of an array of target shard states,
* because we only allow splits in increments of 2x.
Contributor:

I'm tempted to want concrete examples here. I know this makes for a long comment, but I think the field is unintuitive enough to warrant it.

Contributor Author:

I agree that examples would be helpful. Its semantics really are confusing.

Also, I was thinking: if I rename the field in the ReplicationRequest to reshardSplitExpectedShardCount, would that be better?

Contributor:

I think we referred to this value as a "checksum of resharding state" in the design document. I wonder if calling it some kind of checksum will resolve the naming dispute.

Contributor Author:

Yeah, I like reshardSplitShardCountChecksum, because otherwise it sounds like a count that you can actually use for something. It's really a value that isn't meaningful as-is, only in the context of reconciling cluster state between two different nodes.

Contributor:

Bikeshedding: how about "summary" instead of "checksum"?

Contributor Author:

I kind of like checksum better, but I'm happy to change it...

Contributor:

Ok, I won't push hard on this. To me checksum implies an integrity check - you have some data, and alongside it you have a provided checksum, and if you recompute the checksum and it doesn't match the provided checksum you have an integrity problem. It sort of applies here - we have our own view of the routing table, and the coordinator provides a "checksum", and if we don't produce the same value over our view then we don't trust the coordinator's binning (which kind of looks like a response to an integrity error if you squint).

I felt that summary (i.e., a lossily compressed form that still has relevant info) was maybe a little more accurate, but I fully admit I'm bikeshedding. Carry on as you see fit.
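
For concreteness, here is one reading of the reconciliation described above, using a hypothetical 2-shard index splitting into 4 shards (this illustration is editorial, not taken from the PR):

// Hypothetical illustration: a 2-shard index splitting 2x into 4 shards
// (source shards 0 and 1, target shards 2 and 3).
//
// From its view of the cluster state, the coordinator derives one number per
// source shard: the post-split count (4) if that shard's target has reached the
// state required by the operation, otherwise the pre-split count (2). The source
// shard node recomputes the same number from its own cluster state; a mismatch
// means the two nodes saw different resharding states.
static boolean coordinatorViewMatches(int valueFromCoordinator, int valueComputedLocally) {
    // 0 is treated as "not provided", mirroring the assertion quoted later in this thread.
    return valueFromCoordinator == 0 || valueFromCoordinator == valueComputedLocally;
}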

final ProjectId projectId = clusterState.metadata().projectFor(index).id();
final ProjectMetadata project = clusterState.metadata().projectFor(index);
final ProjectId projectId = project.id();
final IndexMetadata indexMetadata = project.index(index);
Contributor:

can this be null? I know we have a shard reference here but I'm not sure whether that ensures that an index isn't deleted in cluster state.

Contributor Author:

Index delete/close are supposed to acquire all permits before proceeding. I think TransportVerifyShardBeforeCloseAction#acquirePrimaryOperationPermit is responsible for this?

Contributor:

ok, thanks

assert shardId >= 0 && shardId < getNumberOfShards() : "shardId is out of bounds";
int shardCount = getNumberOfShards();
if (reshardingMetadata != null) {
    if (reshardingMetadata.getSplit().isTargetShard(shardId)) {
Contributor:

if this is being called for a target shard, does that mean that the coordinator has already determined that the target shard is ready?

Contributor Author (@ankikuma, Sep 17, 2025):

Yes. The reason is that we already have target state checks for routing requests at the coordinator level. The fact that a request is routed to the target shard means that it must be ready.

Contributor Author:

I have added more details to the method description.

Comment on lines 136 to 139
assertThat(
    IndexMetadataAfterReshard.getReshardSplitShardCount(i, IndexReshardingState.Split.TargetShardState.CLONE),
    equalTo(numTargetShards)
);
Contributor:

this state feels like it should be illegal. The query doesn't really make sense, since a shard in CLONE state is by definition not ready to accept operations.

Contributor Author:

getReshardSplitShardCount is agnostic to the semantics of the operation. Given an input shard and the target shard state required for targets to be considered ready for that operation, it returns the "ReshardSplitShardCount" observed based on the IndexMetadata. For example, search will pass in SPLIT as the required state while indexing will pass in HANDOFF.
I will try to make this clearer in the API comments.

Contributor:

I mean, I think it would be nice if the public API were harder to misuse. My feeling is that supplying CLONE to this function is an indication of a logical mistake, so we shouldn't allow it. One way we could do that is to make getReshardSplitShardCount private and have two wrappers, e.g., getReshardSplitShardCountForIndexing and getReshardSplitShardCountForSearch that call getReshardSplitShardCount with HANDOFF and SPLIT respectively.
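
A sketch of that wrapper shape (bodies assumed here; a later hunk in this thread shows the checksum-named variant that was actually adopted):

// Sketch of the suggested wrappers; assumes a private getReshardSplitShardCount helper.
public int getReshardSplitShardCountForIndexing(int shardId) {
    // Indexing requires target shards to have at least reached HANDOFF.
    return getReshardSplitShardCount(shardId, IndexReshardingState.Split.TargetShardState.HANDOFF);
}

public int getReshardSplitShardCountForSearch(int shardId) {
    // Search requires target shards to have at least reached SPLIT.
    return getReshardSplitShardCount(shardId, IndexReshardingState.Split.TargetShardState.SPLIT);
}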

Contributor Author:

Yes sure, that will make the code more readable too I think. We will need a wrapper for refresh and flush as well. I guess we can add wrappers as and when needed.

Contributor:

I was thinking more like "index" and "search" were friendly names for HANDOFF and SPLIT. I'm not sure whether we need wrappers per operation - it seems like they would just be synonyms, and having multiple functions that do the same thing seems noisy?

Contributor Author:

Yeah, we can just use these names, or rename them later to something more generic.

this(shardId, 0, in);
}

public ReplicationRequest(@Nullable ShardId shardId, int reshardSplitShardCount, StreamInput in) throws IOException {
Contributor:

Is this because some inheritors do not have a shardId? (I guess IndexRequest is an example.)

final var numSourceShards = 2;
indexMetadata = IndexMetadata.builder(indexMetadata).reshardAddShards(numSourceShards).build();

assertNull(reshardingMetadata);
Contributor:

You need to re-read reshardingMetadata here.

Contributor Author:

Thanks!

// starting state is as expected
assertEquals(numSourceShards, reshardingMetadata.shardCountBefore());
assertEquals(numSourceShards * multiple, reshardingMetadata.shardCountAfter());
final int numTargetShards = reshardingMetadata.shardCountAfter();
Contributor:

This is not the number of target shards, right? This is the total.

Contributor Author:

Yes. I renamed this to numShardsAfterReshard so it's not confusing.


// Get effective shardCount for shardId and pass it on as parameter to new BulkShardRequest
var indexMetadata = project.index(shardId.getIndexName());
int reshardSplitShardCount = 0;
Contributor:

double checking: my understanding is this is always safe - 0 means that the receiving shard will always inspect the bulk item request and decide whether it needs to resplit, which may not be optimal but won't ever be wrong. Right?

Contributor Author:

No, actually I was thinking of 0 as the case where we do not care what the reshardSplitShardCount is. That way we don't have to change all the tests that were written pre-resharding.
In this particular instance, where we are creating the BulkShardRequest, it is 0 if the indexMetadata is null. But thinking about this now, I wonder if that could lead to wrong results. Perhaps I should reserve 0 for the case where we always inspect the request, and modify the tests.

this(shardId, 0, in);
}

public ReplicationRequest(@Nullable ShardId shardId, int reshardSplitShardCount, StreamInput in) throws IOException {
Contributor:

Ok, I've skimmed #56209 that introduced thin serialization, apparently to save redundant shard ID serialization since shard IDs are large. It looks like the expected case is that shard ID is not null so we can do the thin read, but that the request may override the shard ID even if it is not null (lines 113-121 below, on the right side of the diff).

I suppose it's the case that if we can omit the shard ID in serialization because we're getting it from somewhere else then we can probably also save 4 bytes per request on reshardSplitShardCount, so we can provide it to the constructor instead of reading it over the wire. But as I read the code below, that's not what we're doing. We're just using a provided value if the transport version isn't recent enough. I don't think that's a thin serialization concern. I think if we want to be thin, then we use the provided value and don't serialize it at all? Unless there's a case where we need to override the provided value, as with index. Then we need to do a boolean to signal the presence of the shard ID and I think the savings isn't worth it.

Overall, my feeling is that given this is only a 4-byte field, it would be simpler right now to not provide the shard count to the constructor, and always deserialize it if the transport version is new enough, or supply 0 inline if it isn't. Thoughts?
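
A sketch of that simpler reading path (the transport-version constant name here is made up for illustration; it is not from the PR):

// Sketch only: always deserialize when the transport version supports the field,
// otherwise fall back to 0. The constant name is hypothetical.
public ReplicationRequest(@Nullable ShardId shardId, StreamInput in) throws IOException {
    super(in);
    // ... existing thin/full shardId handling ...
    if (in.getTransportVersion().onOrAfter(TransportVersions.RESHARD_SPLIT_SHARD_COUNT)) {
        this.reshardSplitShardCount = in.readVInt();
    } else {
        this.reshardSplitShardCount = 0;
    }
}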

* @param shardId Input shardId for which we want to calculate the effective shard count
*/
public int getReshardSplitShardCountChecksumForIndexing(int shardId) {
return (getReshardSplitShardCountChecksum(shardId, IndexReshardingState.Split.TargetShardState.HANDOFF));
Contributor:

Small nit, but it seems like there are extra parentheses? Also in getReshardSplitShardCountChecksumForSearch.

Contributor Author:

Yup, will fix it.

Contributor (@bcully) left a comment:

just one question on this version

Comment on lines 473 to 475
assert (reshardSplitShardCountChecksum == 0
    || reshardSplitShardCountChecksum == indexMetadata.getReshardSplitShardCountChecksumForIndexing(
        primaryRequest.getRequest().shardId().getId()
Contributor:

Is this being executed on the source shard? If so, could this assertion fire when the coordinator's checksum is stale vs the source?

@ankikuma merged commit f38f4ad into elastic:main on Sep 25, 2025
34 checks passed

Labels

:Distributed Indexing/Distributed, >non-issue, serverless-linked, Team:Distributed Indexing, v9.2.0
