Skip to content

Revert "Add metadata to support shard split (#20859)"#20935

Closed
andrross wants to merge 1 commit intoopensearch-project:mainfrom
andrross:revert-ed2e1006
Closed

Revert "Add metadata to support shard split (#20859)"#20935
andrross wants to merge 1 commit intoopensearch-project:mainfrom
andrross:revert-ed2e1006

Conversation

@andrross
Copy link
Member

This reverts commit ed2e100.

The change to IndexMetadata has resulted in bwc failures with the remote cluster state feature. I'm seeing a lot of failures like this: https://build.ci.opensearch.org/job/gradle-check/72870/consoleText

» [2026-03-19T21:47:40,525][ERROR][o.o.g.r.RemoteClusterStateService] [v3.5.1-remote-2] Failed to read cluster state from remote
»  org.opensearch.gateway.remote.RemoteStateTransferException: Download failed for index_with_replicas
»  	at org.opensearch.gateway.remote.RemoteIndexMetadataManager.lambda$getWrappedReadListener$3(RemoteIndexMetadataManager.java:159)
»  	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90)
»  	at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:87)
»  	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:918)
»  	at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090)
»  	at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
»  	at java.****/java.lang.Thread.run(Thread.java:1474)
»  Caused by: java.lang.IllegalStateException: Can't get text on a START_ARRAY at -1:779
»  	at org.opensearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:99)
»  	at org.opensearch.core.xcontent.AbstractXContentParser.map(AbstractXContentParser.java:298)
»  	at org.opensearch.core.xcontent.AbstractXContentParser.mapStrings(AbstractXContentParser.java:282)
»  	at org.opensearch.cluster.metadata.IndexMetadata$Builder.fromXContent(IndexMetadata.java:2468)
»  	at org.opensearch.cluster.metadata.IndexMetadata.fromXContent(IndexMetadata.java:1500)
»  	at org.opensearch.repositories.blobstore.ChecksumBlobStoreFormat.deserialize(ChecksumBlobStoreFormat.java:144)
»  	at org.opensearch.gateway.remote.model.RemoteIndexMetadata.deserialize(RemoteIndexMetadata.java:136)
»  	at org.opensearch.gateway.remote.model.RemoteIndexMetadata.deserialize(RemoteIndexMetadata.java:35)
»  	at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.read(RemoteWriteableEntityBlobStore.java:77)
»  	at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:85)

The old node (v3.5.1-remote-2) tries to read this remote cluster state. Its IndexMetadata.fromXContent() doesn't recognize split_shards_metadata or primary_terms_map, so it falls through to the catch-all
else clause at the end of the START_OBJECT handling:

   } else {
       // assume it's custom index metadata
       builder.putCustom(currentFieldName, parser.mapStrings());
   }

Related to #20910

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This reverts commit ed2e100.

The change to IndexMetadata has resulted in bwc failures with the remote
cluster state feature.

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross requested a review from a team as a code owner March 20, 2026 01:40
@github-actions
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
📝 TODO sections

🔀 No multiple PR themes
⚡ Recommended focus areas for review

BWC Compatibility

The revert unconditionally writes primaryTerms as a VLongArray in writeTo, writeVerifiableTo, and IndexMetadataDiff.writeTo without any version gating. The original pre-revert code had version checks (e.g., if (in.getVersion().before(Version.V_3_6_0))). If nodes running the reverted code communicate with nodes running the original (pre-revert) code that expected the new format, this could cause deserialization failures in mixed-version clusters.

out.writeVLongArray(primaryTerms);
Removed Tests

Three tests (testPrimaryTermsMapXContentRoundTrip, testPrimaryTermsMapStreamRoundTrip, testPrimaryTermsMapDiffRoundTrip) that validated primary terms serialization round-trips were removed as part of this revert. The existing test coverage for primary terms serialization should be verified to ensure it remains adequate.

    key -> Setting.simpleString(key, value -> IP_VALIDATOR.accept(key, value), Property.Dynamic, Property.IndexScope)
);
public static final Setting.AffixSetting<String> INDEX_ROUTING_INCLUDE_GROUP_SETTING = Setting.prefixKeySetting(
    INDEX_ROUTING_INCLUDE_GROUP_PREFIX + ".",
    key -> Setting.simpleString(key, value -> IP_VALIDATOR.accept(key, value), Property.Dynamic, Property.IndexScope)
);
public static final Setting.AffixSetting<String> INDEX_ROUTING_EXCLUDE_GROUP_SETTING = Setting.prefixKeySetting(
    INDEX_ROUTING_EXCLUDE_GROUP_PREFIX + ".",
    key -> Setting.simpleString(key, value -> IP_VALIDATOR.accept(key, value), Property.Dynamic, Property.IndexScope)
Null Safety

In Builder.primaryTerm(int shardId), if primaryTerms is null, initializePrimaryTerms() is called, but the method returns this.primaryTerms[shardId] without re-checking that primaryTerms was actually initialized. Similarly, primaryTerms(long[] primaryTerms) does not guard against a null input array.

public long primaryTerm(int shardId) {
    if (primaryTerms == null) {
        initializePrimaryTerms();
    }
    return this.primaryTerms[shardId];

@github-actions
Copy link
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Move assertion before field assignment

The assertion is placed after the assignment, so this.primaryTerms is already set
even if the assertion fails in production (assertions can be disabled). The
assertion should be placed before the assignment to validate the input before
storing it.

server/src/main/java/org/opensearch/cluster/metadata/IndexMetadata.java [1194-1195]

+assert primaryTerms.length == numberOfShards;
 this.primaryTerms = primaryTerms;
-assert primaryTerms.length == numberOfShards;
Suggestion importance[1-10]: 4

__

Why: While technically valid, assertions are disabled in production by default in Java, so the order doesn't affect production behavior. The suggestion is a minor style improvement for correctness during testing, but has low practical impact.

Low
Add bounds check for array access

The previous code had bounds checking and threw a meaningful
IllegalArgumentException when the shard ID was invalid. The new array access will
throw an ArrayIndexOutOfBoundsException with a less descriptive message. Consider
adding explicit bounds validation to preserve the informative error message.

server/src/main/java/org/opensearch/cluster/metadata/IndexMetadata.java [1282-1284]

 public long primaryTerm(int shardId) {
+    if (shardId < 0 || shardId >= this.primaryTerms.length) {
+        throw new IllegalArgumentException("No primary term available for shard " + shardId);
+    }
     return this.primaryTerms[shardId];
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion improves error messaging by replacing an ArrayIndexOutOfBoundsException with a more descriptive IllegalArgumentException, which aids debugging. However, the old code's null check was removed in favor of array access, and the ArrayIndexOutOfBoundsException is still reasonably informative.

Low
Guard against null array in setter

The primaryTerms setter in the Builder class is private, but it is called from
IndexMetadataDiff.apply() via builder.primaryTerms(primaryTerms). If this method is
intended to be accessible from the diff apply logic (which is in the same file), the
access level is fine, but it should be verified that the array passed is never null
to avoid a NullPointerException.

server/src/main/java/org/opensearch/cluster/metadata/IndexMetadata.java [2180-2182]

 private void primaryTerms(long[] primaryTerms) {
+    if (primaryTerms == null) {
+        throw new IllegalArgumentException("primaryTerms must not be null");
+    }
     this.primaryTerms = primaryTerms.clone();
 }
Suggestion importance[1-10]: 3

__

Why: The null check is a defensive programming practice, but the method is private and callers in the same file already control the input. The suggestion asks to verify null safety which is a minor concern given the controlled usage context.

Low

@github-actions
Copy link
Contributor

❌ Gradle check result for 60dda9b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

✅ Gradle check result for 60dda9b: SUCCESS

@codecov
Copy link

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 82.14286% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.27%. Comparing base (9f5d0e1) to head (60dda9b).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...org/opensearch/cluster/metadata/IndexMetadata.java 80.76% 0 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20935      +/-   ##
============================================
- Coverage     73.30%   73.27%   -0.04%     
+ Complexity    72484    72444      -40     
============================================
  Files          5819     5817       -2     
  Lines        331155   330872     -283     
  Branches      47840    47739     -101     
============================================
- Hits         242769   242444     -325     
- Misses        68876    68951      +75     
+ Partials      19510    19477      -33     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andrross
Copy link
Member Author

I believe removing the remote cluster state tests is the right thing to do in the short term, as opposed to this PR. See the discussion in #20942 and #20910

@andrross
Copy link
Member Author

Closing as the failing test has been removed #20942

@andrross andrross closed this Mar 20, 2026
@andrross andrross deleted the revert-ed2e1006 branch March 20, 2026 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant