Skip to content

Cluster state and recovery constructs for in-place shard split#20979

Open
vikasvb90 wants to merge 1 commit intoopensearch-project:mainfrom
vikasvb90:online_shard_split
Open

Cluster state and recovery constructs for in-place shard split#20979
vikasvb90 wants to merge 1 commit intoopensearch-project:mainfrom
vikasvb90:online_shard_split

Conversation

@vikasvb90
Copy link
Copy Markdown
Contributor

Description

Add cluster state infrastructure for in-place shard split

This PR adds the cluster state update service and supporting POJO changes needed to trigger an in-place shard split.

Changes:

Cluster state update service:

  • MetadataInPlaceShardSplitService: Submits an acked cluster state update task that updates SplitShardsMetadata on the target index and triggers a reroute. Validates that the index exists, the shard is not already splitting or already split, and virtual shards are not enabled on the index.
  • InPlaceShardSplitClusterStateUpdateRequest: Request POJO holding index name, shard ID, split-into count, and cause.
  • ClusterManagerTask: Added IN_PLACE_SHARD_SPLIT task key for cluster manager task throttling.

Routing POJO changes to support shard split lifecycle:

  • ShardRoutingState: Added SPLITTING state for parent shards undergoing an in-place split.
  • RecoverySource: Added InPlaceShardSplitRecoverySource type for child shards recovering from a parent shard on the same node.
  • UnassignedInfo: Added CHILD_SHARD_CREATED reason for child shards pending allocation after a split.
  • AllocationId: Added splitChildAllocationIds and parentAllocationId fields with version-gated serialization (V_3_6_0), factory methods (newSplit, newTargetSplit, cancelSplit, finishSplit), and updated equals/hashCode/toXContent.
  • ShardRouting: Added recoveringChildShards and parentShardId fields, new constructor accepting split fields, and query methods (splitting(), isSplitTarget(), getRecoveringChildShards(), getParentShardId()).

The REST API is not exposed in this PR. The routing allocation logic to actually assign child shards to nodes will follow in a subsequent PR.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@vikasvb90 vikasvb90 requested a review from a team as a code owner March 24, 2026 03:52
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 24, 2026

PR Reviewer Guide 🔍

(Review updated until commit 8dbd619)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Routing POJO changes for in-place shard split lifecycle

Relevant files:

  • server/src/main/java/org/opensearch/cluster/routing/AllocationId.java
  • server/src/main/java/org/opensearch/cluster/routing/RecoverySource.java
  • server/src/main/java/org/opensearch/cluster/routing/ShardRoutingState.java
  • server/src/main/java/org/opensearch/cluster/routing/UnassignedInfo.java
  • server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java
  • server/src/test/java/org/opensearch/cluster/routing/AllocationIdSplitTests.java
  • server/src/test/java/org/opensearch/cluster/routing/RecoverySourceSplitTests.java
  • server/src/test/java/org/opensearch/cluster/routing/ShardRoutingStateSplitTests.java

Sub-PR theme: Cluster state update service for in-place shard split

Relevant files:

  • server/src/main/java/org/opensearch/action/admin/indices/split/InPlaceShardSplitClusterStateUpdateRequest.java
  • server/src/main/java/org/opensearch/action/admin/indices/split/package-info.java
  • server/src/main/java/org/opensearch/cluster/metadata/MetadataInPlaceShardSplitService.java
  • server/src/main/java/org/opensearch/cluster/service/ClusterManagerTask.java
  • server/src/test/java/org/opensearch/cluster/metadata/MetadataInPlaceShardSplitServiceTests.java

⚡ Recommended focus areas for review

Broken Assert

The original assert expectedShardSize >= 0 || state != INITIALIZING || state != RELOCATING was always true (a value can't simultaneously be both states), making it a no-op guard. The new version expectedShardSize >= 0 || state != INITIALIZING && state != RELOCATING && state != SPLITTING uses operator precedence where && binds tighter than ||, so it reads as expectedShardSize >= 0 || (state != INITIALIZING && state != RELOCATING && state != SPLITTING). This means the assert still passes whenever expectedShardSize < 0 but state is not one of those three — it never actually enforces that expectedShardSize >= 0 when the state IS one of those three. The intended guard (reject negative shard size for INITIALIZING/RELOCATING/SPLITTING) is not enforced.

assert expectedShardSize >= 0
    || state != ShardRoutingState.INITIALIZING && state != ShardRoutingState.RELOCATING && state != ShardRoutingState.SPLITTING
    : expectedShardSize + " state: " + state;
Mutable Array Exposure

getRecoveringChildShards() returns the raw ShardRouting[] array directly. Callers can mutate the array contents, breaking the immutability contract of ShardRouting. It should return a defensive copy or an unmodifiable view (e.g., Arrays.copyOf or expose as List<ShardRouting>).

public ShardRouting[] getRecoveringChildShards() {
    return recoveringChildShards;
}
Serialization Version Gap

The serialization of splitChildAllocationIds and parentAllocationId is gated on Version.V_3_6_0, but the BytesStreamOutput used in tests (AllocationIdSplitTests) has no version set, so it defaults to the current version. In a mixed-version cluster where a node on an older version receives an AllocationId written by a newer node, the reader will skip these fields and set them to null, silently losing split state. This is expected for BWC, but the comment in ShardRouting's deserialization constructor says these fields are "transient - populated by RoutingNodes constructor, not serialized on the wire" — yet AllocationId does serialize them. This inconsistency could cause confusion and subtle bugs if the RoutingNodes reconstruction path doesn't account for the version-gated absence of these fields.

if (in.getVersion().onOrAfter(Version.V_3_6_0)) {
    splitChildAllocationIds = in.readOptionalStringList();
    parentAllocationId = in.readOptionalString();
} else {
    splitChildAllocationIds = null;
    parentAllocationId = null;
}
Missing Input Validation

applyShardSplitRequest does not validate that request.getSplitInto() is greater than 1 (or at least greater than 0) before calling splitMetadataBuilder.splitShard(shardId, request.getSplitInto()). The test testApplyShardSplitRequestThrowsForSplitIntoZero expects an ArithmeticException to be thrown downstream, but relying on an arithmetic exception from a downstream call for input validation is fragile. A value of 1 (split into 1) would likely silently succeed but produce a meaningless operation. Explicit validation with a clear IllegalArgumentException should be added.

int shardId = request.getShardId();
SplitShardsMetadata splitShardsMetadata = curIndexMetadata.getSplitShardsMetadata();

if (splitShardsMetadata.getInProgressSplitShardIds().contains(shardId)) {
    throw new IllegalArgumentException("Splitting of shard [" + shardId + "] is already in progress");
}

if (splitShardsMetadata.isSplitParent(shardId)) {
    throw new IllegalArgumentException("Shard [" + shardId + "] has already been split.");
}

RoutingTable.Builder routingTableBuilder = RoutingTable.builder(currentState.routingTable());
Metadata.Builder metadataBuilder = Metadata.builder(currentState.metadata());
IndexMetadata.Builder indexMetadataBuilder = IndexMetadata.builder(curIndexMetadata);

SplitShardsMetadata.Builder splitMetadataBuilder = new SplitShardsMetadata.Builder(splitShardsMetadata);
splitMetadataBuilder.splitShard(shardId, request.getSplitInto());

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 24, 2026

PR Code Suggestions ✨

Latest suggestions up to 8dbd619

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix recovery source deserialization for SPLITTING state

The SPLITTING state also requires a recovery source (it uses
InPlaceShardSplitRecoverySource), but the deserialization logic only reads
recoverySource for UNASSIGNED or INITIALIZING states. This means a shard in
SPLITTING state will have a null recovery source after deserialization, which is
inconsistent with the constructor assertion that recovery source is only available
on unassigned or initializing shards. The SPLITTING state should be included in the
condition for reading the recovery source, or the assertion and write logic should
be updated to handle it consistently.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [435-439]

-if (state == ShardRoutingState.UNASSIGNED || state == ShardRoutingState.INITIALIZING) {
+if (state == ShardRoutingState.UNASSIGNED || state == ShardRoutingState.INITIALIZING || state == ShardRoutingState.SPLITTING) {
         recoverySource = RecoverySource.readFrom(in);
     } else {
         recoverySource = null;
     }
Suggestion importance[1-10]: 7

__

Why: The SPLITTING state likely requires a recovery source (InPlaceShardSplitRecoverySource), but the deserialization only reads it for UNASSIGNED or INITIALIZING. However, looking at the constructor assertion at line 153-154, it only checks for UNASSIGNED or INITIALIZING having a recovery source, suggesting SPLITTING may intentionally not use one in the current design. This is a valid concern but depends on the intended design.

Medium
Fix recovery source serialization for SPLITTING state

The writeToThin serialization does not write the recovery source for the SPLITTING
state, which is inconsistent with the read path (once fixed). If a shard in
SPLITTING state has a recovery source, it will not be serialized, causing a mismatch
on deserialization. The SPLITTING state should be included here as well.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [474-476]

-if (state == ShardRoutingState.UNASSIGNED || state == ShardRoutingState.INITIALIZING) {
+if (state == ShardRoutingState.UNASSIGNED || state == ShardRoutingState.INITIALIZING || state == ShardRoutingState.SPLITTING) {
         recoverySource.writeTo(out);
     }
Suggestion importance[1-10]: 7

__

Why: The writeToThin method doesn't serialize the recovery source for SPLITTING state, which would be inconsistent if SPLITTING shards have a recovery source. This is a valid serialization consistency concern that pairs with suggestion 1.

Medium
Fix assertion to include SPLITTING state with recovery source

The assertion does not account for the new SPLITTING state, which also has a
recovery source (InPlaceShardSplitRecoverySource). This will cause assertion
failures when creating a shard in SPLITTING state with a recovery source. The
SPLITTING state should be included in the condition.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [153-154]

-assert (state == ShardRoutingState.UNASSIGNED || state == ShardRoutingState.INITIALIZING) == (recoverySource != null)
-        : "recovery source only available on unassigned or initializing shard but was " + state;
+assert (state == ShardRoutingState.UNASSIGNED || state == ShardRoutingState.INITIALIZING || state == ShardRoutingState.SPLITTING) == (recoverySource != null)
+        : "recovery source only available on unassigned, initializing, or splitting shard but was " + state;
Suggestion importance[1-10]: 6

__

Why: The assertion at line 153-154 doesn't account for SPLITTING state potentially having a recovery source. If SPLITTING shards use InPlaceShardSplitRecoverySource, this assertion would fail. This is a valid concern that could cause assertion errors at runtime.

Low
General
Prevent mutation of internal array via defensive copy

Returning a mutable array directly exposes the internal state of ShardRouting,
allowing callers to modify the array contents. Since ShardRouting is a core routing
data structure, this could lead to subtle bugs. Consider returning a defensive copy
or using an unmodifiable list instead.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [329-331]

 public ShardRouting[] getRecoveringChildShards() {
-    return recoveringChildShards;
+    return recoveringChildShards == null ? null : recoveringChildShards.clone();
 }
Suggestion importance[1-10]: 4

__

Why: Returning the raw recoveringChildShards array exposes internal state to mutation. However, the comment in the code indicates these are transient fields populated by RoutingNodes constructor, so the risk may be limited in practice.

Low

Previous suggestions

Suggestions up to commit 8450d14
CategorySuggestion                                                                                                                                    Impact
Possible issue
Verify enum ordinal and switch case alignment

The readFrom method uses Type.values()[in.readByte()] to deserialize the recovery
type, which relies on enum ordinal order. Adding IN_PLACE_SHARD_SPLIT at the end is
correct for new nodes, but REMOTE_STORE was previously the last entry (ordinal 6).
If any serialized data exists with REMOTE_STORE at ordinal 6, this is fine, but the
readFrom switch uses the enum ordinal directly — verify that IN_PLACE_SHARD_SPLIT
(ordinal 6) does not collide with REMOTE_STORE (also ordinal 6 before this change).
The enum now has REMOTE_STORE at ordinal 6 and IN_PLACE_SHARD_SPLIT at ordinal 7,
but the switch in readFrom handles IN_PLACE_SHARD_SPLIT before REMOTE_STORE, which
could cause a deserialization mismatch for older serialized REMOTE_STORE bytes.

server/src/main/java/org/opensearch/cluster/routing/RecoverySource.java [122-130]

 public enum Type {
     EMPTY_STORE,
     EXISTING_STORE,
     PEER,
     SNAPSHOT,
     LOCAL_SHARDS,
     REMOTE_STORE,
     IN_PLACE_SHARD_SPLIT
 }
+// Ensure readFrom switch case order matches enum ordinals:
+// case 5: LOCAL_SHARDS, case 6: REMOTE_STORE, case 7: IN_PLACE_SHARD_SPLIT
Suggestion importance[1-10]: 7

__

Why: The readFrom method uses Type.values()[in.readByte()] for deserialization, so ordinal order matters critically. The PR adds IN_PLACE_SHARD_SPLIT after REMOTE_STORE, and the switch in readFrom handles IN_PLACE_SHARD_SPLIT before REMOTE_STORE — this could cause a real deserialization mismatch. However, the improved_code only adds a comment rather than fixing the switch case order, limiting the score.

Medium
General
Add shard ID bounds validation

There is no validation that shardId is within the valid range [0, numberOfShards). A
request with an out-of-range shard ID will propagate into SplitShardsMetadata
operations and may cause unexpected behavior or an obscure error. Add an explicit
bounds check before proceeding.

server/src/main/java/org/opensearch/cluster/metadata/MetadataInPlaceShardSplitService.java [100-108]

 if (curIndexMetadata.getNumberOfVirtualShards() != -1) {
     throw new IllegalArgumentException(
         "In-place shard split is not supported on index [" + request.getIndex() + "] with virtual shards enabled"
     );
 }
 
 int shardId = request.getShardId();
+if (shardId < 0 || shardId >= curIndexMetadata.getNumberOfShards()) {
+    throw new IllegalArgumentException(
+        "Shard [" + shardId + "] is out of range for index [" + request.getIndex() + "] with ["
+            + curIndexMetadata.getNumberOfShards() + "] shards"
+    );
+}
 SplitShardsMetadata splitShardsMetadata = curIndexMetadata.getSplitShardsMetadata();
Suggestion importance[1-10]: 6

__

Why: There is no bounds check on shardId before it's used in SplitShardsMetadata operations, which could lead to obscure errors. The test testApplyShardSplitRequestThrowsForInvalidShardId expects an IllegalArgumentException for shard ID 99, but without this check the exception may not be thrown or may come from a different place with a less clear message.

Low
Ensure deserialized list is immutable

The splitChildAllocationIds read from the stream is a mutable list, but the
constructor wraps it in Collections.unmodifiableList. However, when reading from the
stream directly in this constructor (bypassing the private constructor), the list is
assigned directly without making it unmodifiable. This inconsistency could allow
mutation of the internal list. The deserialization path should also wrap the list.

server/src/main/java/org/opensearch/cluster/routing/AllocationId.java [119-129]

 AllocationId(StreamInput in) throws IOException {
     this.id = in.readString();
     this.relocationId = in.readOptionalString();
     if (in.getVersion().onOrAfter(Version.V_3_6_0)) {
-        splitChildAllocationIds = in.readOptionalStringList();
+        List<String> childIds = in.readOptionalStringList();
+        splitChildAllocationIds = childIds == null ? null : Collections.unmodifiableList(childIds);
         parentAllocationId = in.readOptionalString();
     } else {
         splitChildAllocationIds = null;
         parentAllocationId = null;
     }
 }
Suggestion importance[1-10]: 6

__

Why: The stream deserialization constructor assigns splitChildAllocationIds directly without wrapping it in Collections.unmodifiableList, while the private constructor does wrap it. This inconsistency could allow mutation of the internal list when deserialized, breaking the immutability contract of the class.

Low
Clarify assertion for split recovery source

The assertion allows InPlaceShardSplitRecoverySource for replica shards, but the
recovery source is only set on UNASSIGNED or INITIALIZING shards. Child shards
created by an in-place split should be primary shards (or at least the assertion
message should be updated to reflect the new valid case). Verify that child shards
from a split are always primary; if not, the assertion logic may incorrectly permit
non-primary, non-searchOnly replicas to use this recovery source in other contexts.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [155-159]

 assert recoverySource == null
     || recoverySource == PeerRecoverySource.INSTANCE
     || primary
     || searchOnly
-    || recoverySource == RecoverySource.InPlaceShardSplitRecoverySource.INSTANCE : "replica shards always recover from primary";
+    || recoverySource == RecoverySource.InPlaceShardSplitRecoverySource.INSTANCE : "replica shards always recover from primary (except in-place split child shards)";
Suggestion importance[1-10]: 2

__

Why: The suggestion only changes the assertion message string, not the logic. While the clarification is marginally useful, it's a trivial documentation-only change with minimal impact on correctness or functionality.

Low
Suggestions up to commit 8600fa4
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix incorrect logical operator in assertion

The assertion logic is incorrect. Using || between state != X conditions means the
assertion is always true (a state cannot simultaneously be INITIALIZING, RELOCATING,
and SPLITTING). The original code used && for the negated conditions, which is
equivalent to asserting the state is not any of those values when expectedShardSize
< 0. This should use && to properly enforce the constraint.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [149-152]

 assert expectedShardSize >= 0
     || state != ShardRoutingState.INITIALIZING
-    || state != ShardRoutingState.RELOCATING
-    || state != ShardRoutingState.SPLITTING : expectedShardSize + " state: " + state;
+    && state != ShardRoutingState.RELOCATING
+    && state != ShardRoutingState.SPLITTING : expectedShardSize + " state: " + state;
Suggestion importance[1-10]: 8

__

Why: The assertion uses || between state != X conditions, making it always true (tautology). The correct logic requires && to properly enforce that expectedShardSize >= 0 when the state is INITIALIZING, RELOCATING, or SPLITTING. This is a real logical bug that would make the assertion meaningless.

Medium
General
Add shard ID bounds validation before split

There is no validation that shardId is within the valid range [0, numberOfShards). A
request with an out-of-range shard ID (e.g., 99 for a 3-shard index) would silently
proceed and corrupt the split metadata. Add a bounds check before proceeding with
the split logic.

server/src/main/java/org/opensearch/cluster/metadata/MetadataInPlaceShardSplitService.java [100-111]

 if (curIndexMetadata.getNumberOfVirtualShards() != -1) {
     throw new IllegalArgumentException(
         "In-place shard split is not supported on index [" + request.getIndex() + "] with virtual shards enabled"
     );
 }
 
 int shardId = request.getShardId();
+if (shardId < 0 || shardId >= curIndexMetadata.getNumberOfShards()) {
+    throw new IllegalArgumentException(
+        "Shard [" + shardId + "] is out of range for index [" + request.getIndex() + "] with ["
+            + curIndexMetadata.getNumberOfShards() + "] shards"
+    );
+}
+
 SplitShardsMetadata splitShardsMetadata = curIndexMetadata.getSplitShardsMetadata();
 
 if (splitShardsMetadata.getInProgressSplitShardIds().contains(shardId)) {
     throw new IllegalArgumentException("Splitting of shard [" + shardId + "] is already in progress");
 }
Suggestion importance[1-10]: 6

__

Why: Missing bounds validation for shardId is a real gap — an out-of-range shard ID could silently corrupt split metadata. The test testApplyShardSplitRequestThrowsForInvalidShardId expects an IllegalArgumentException for shard ID 99, but without this check the behavior is undefined. Adding this validation improves correctness and matches test expectations.

Low
Validate positive child shard count in split factory

There is no validation that numberOfChildShards is greater than zero. Calling
newSplit with zero or a negative value would create an AllocationId with an empty or
meaningless child list, which could cause silent failures downstream. Add a
precondition check.

server/src/main/java/org/opensearch/cluster/routing/AllocationId.java [200-207]

 public static AllocationId newSplit(AllocationId allocationId, int numberOfChildShards) {
     assert allocationId.getSplitChildAllocationIds() == null && allocationId.getParentAllocationId() == null;
+    if (numberOfChildShards <= 0) {
+        throw new IllegalArgumentException("numberOfChildShards must be > 0, got: " + numberOfChildShards);
+    }
     List<String> childIds = new ArrayList<>();
     for (int c = 0; c < numberOfChildShards; c++) {
         childIds.add(UUIDs.randomBase64UUID());
     }
     return new AllocationId(allocationId.getId(), null, childIds, null);
 }
Suggestion importance[1-10]: 5

__

Why: Calling newSplit with zero or negative numberOfChildShards would silently create an invalid AllocationId with an empty child list. Adding an explicit precondition check prevents silent failures and improves robustness of the split operation.

Low
Document or handle transient split fields in deserialization

When deserializing ShardRouting from a StreamInput, recoveringChildShards and
parentShardId are always set to null, ignoring any serialized values. If these
fields need to be persisted across cluster state serialization, they should be read
from the stream (with version gating similar to AllocationId). If they are
intentionally transient, this should be documented clearly.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [452-453]

+// These fields are intentionally transient and not persisted in the stream.
 recoveringChildShards = null;
 parentShardId = null;
Suggestion importance[1-10]: 2

__

Why: The suggestion only asks to add a comment clarifying that recoveringChildShards and parentShardId are intentionally transient. The improved_code is essentially the same as existing_code with just a comment added, which is a low-impact documentation-only change.

Low

@vikasvb90 vikasvb90 self-assigned this Mar 24, 2026
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 8600fa4: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vikasvb90 vikasvb90 force-pushed the online_shard_split branch from 8600fa4 to 8450d14 Compare March 24, 2026 08:57
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 8450d14

@vikasvb90 vikasvb90 force-pushed the online_shard_split branch from 8450d14 to 8dbd619 Compare March 24, 2026 09:42
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 8dbd619

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 8dbd619: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 7f2da9b: ABORTED

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for d78724a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

* Applies a shard split request to the given cluster state. Updates the split metadata
* in the index metadata and triggers a reroute so that child shards get allocated.
*/
static ClusterState applyShardSplitRequest(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have separate check to not allow split if cluster is in mixed mode during upgrades to have a guardrail

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean for the version lower than the split supported version? In other version upgrades we can support this. And for a version upgrade from a non-supported split version to a supported split version, it is implicitly handled because the REST action as well as the admin client (invoked from a plugin) won't be available on lower version cluster manager.

If you want I can still put an additional check here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I am thinking that we should prevent splitting a shard in all cases where it is now supposed to move off the node except shard rebalancing case.
And if split already started then we should cancel it and let the shard relocate in cases above.
What say?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes makes sense to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added multiple checks now to meet shard split criteria now in this class. Cancellation of split will happen later as part of a new cluster state update so that will come in next PR. I will also need to add cancellation reason in _cat/recovery API to inform user why split was cancelled.

/**
* Creates a new allocation id for a child shard that is the result of a split.
*/
public static AllocationId newTargetSplit(AllocationId allocationId, String childAllocId) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the above method generates a transient value and when do we call this method? Not clear on the flow here.

Copy link
Copy Markdown
Contributor Author

@vikasvb90 vikasvb90 Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya there are actually missing pieces to avoid making this PR huge. But your concern is very valid. So linking the source code from my personal repo here. For this comment,

  1. newSplit is called on parent shard to generate N random UUIDs and store them in splitChildAllocationIds on the parent's allocation object.
  2. newTargetSplit is called on each child shard. Takes one of the generated UUIDs, and creates a link to its parent by storing parentAllocationId.

This is same as relocation flow where in newRelocation source shard generates random UUID for target shard where target shard's id becomes this generated UUID. And this is how both are linked together.

For split, these generations happen in createRecoveringChildShards of RoutingNodes

/**
* Returns the recovering child shards of this splitting shard, or null if not splitting.
*/
public ShardRouting[] getRecoveringChildShards() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRecoveringChildShards() returns the raw ShardRouting[] array directly. Callers can mutate the array contents, breaking the immutability contract of ShardRouting. It should return a defensive copy or an unmodifiable view (e.g., Arrays.copyOf or expose as List).

can you look into this

@vikasvb90 vikasvb90 force-pushed the online_shard_split branch from d78724a to 7b2c720 Compare March 28, 2026 14:04
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@vikasvb90 vikasvb90 force-pushed the online_shard_split branch 2 times, most recently from 7b2c720 to b1ffae2 Compare March 28, 2026 14:17
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for b1ffae2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: vikasvb90 <vikasvb@amazon.com>
@vikasvb90 vikasvb90 force-pushed the online_shard_split branch from b1ffae2 to f3dfee0 Compare March 28, 2026 14:52
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants