Skip to content

Fix array_index_out_of_bounds_exception with wildcard and aggregations#20842

Merged
andrross merged 1 commit intoopensearch-project:mainfrom
ShawnQiang1:wildcardBug
Mar 12, 2026
Merged

Fix array_index_out_of_bounds_exception with wildcard and aggregations#20842
andrross merged 1 commit intoopensearch-project:mainfrom
ShawnQiang1:wildcardBug

Conversation

@ShawnQiang1
Copy link
Contributor

@ShawnQiang1 ShawnQiang1 commented Mar 11, 2026

Description

under concurrent conditions, encountered a situation where multiple threads were simultaneously calling the valueFetcher object, causing the buffer to be read and written at the same time.

I was unable to reproduce the bug in the IT test cases; I only reproduced and fixed the issue locally using the method mentioned in the issue. I would be very grateful if anyone could help supplement the test cases.

This pr need to backport to 2.x

Related Issues

#20838
#18701

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 11, 2026

PR Reviewer Guide 🔍

(Review updated until commit 4af9812)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Null Safety

When valueFetcherSupplier is null (i.e., when context is null), calling valueFetcherSupplier.get() inside createWeight will throw a NullPointerException. There should be a null check before invoking the supplier, similar to how the old code handled the null valueFetcher case.

final ValueFetcher valueFetcher = valueFetcherSupplier.get();
valueFetcher.setNextReader(context);
Performance Concern

A new ValueFetcher instance is created per scorer acquisition (i.e., per leaf segment per thread). Depending on the cost of fieldType.valueFetcher(...), this could introduce overhead. It should be verified that creating a ValueFetcher per call to get(leadCost) is acceptable and does not cause significant performance degradation under high concurrency or many segments.

final ValueFetcher valueFetcher = valueFetcherSupplier.get();
valueFetcher.setNextReader(context);

@github-actions
Copy link
Contributor

github-actions bot commented Mar 11, 2026

PR Code Suggestions ✨

Latest suggestions up to 4af9812
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Add null safety for supplier invocation

The Supplier lambda captures context and fieldType from the outer scope, but
fieldType.valueFetcher may not be thread-safe to call concurrently. Consider
ensuring the supplier itself is thread-safe or documenting that the supplier must
produce independent instances without shared mutable state.

server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java [847-850]

-// Create a new ValueFetcher per thread.
-// ValueFetcher.setNextReader is not thread safe.
-final ValueFetcher valueFetcher = valueFetcherSupplier.get();
+// Create a new ValueFetcher per thread/leaf.
+// ValueFetcher.setNextReader is not thread safe, so a new instance is created each time.
+final ValueFetcher valueFetcher = Objects.requireNonNull(valueFetcherSupplier, "valueFetcherSupplier must not be null").get();
 valueFetcher.setNextReader(context);
Suggestion importance[1-10]: 4

__

Why: Adding Objects.requireNonNull provides a clearer error message if valueFetcherSupplier is null, but as noted above, the null case is already guarded by searchLookup being null. The thread-safety concern about fieldType.valueFetcher is speculative and not directly evidenced by the diff.

Low
Possible issue
Guard against null supplier before invocation

When valueFetcherSupplier is null (i.e., context is null), calling
valueFetcherSupplier.get() in the get(long leadCost) method will throw a
NullPointerException. A null check should be added before invoking
valueFetcherSupplier.get() in the scorer, or the supplier should be guarded with a
null-safe wrapper.

server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java [765-771]

+if (context != null) {
+    this.searchLookup = context.lookup();
+    this.valueFetcherSupplier = () -> fieldType.valueFetcher(context, context.lookup(), null);
+} else {
+    this.searchLookup = null;
+    this.valueFetcherSupplier = null;
+}
 
-
Suggestion importance[1-10]: 3

__

Why: The valueFetcherSupplier is only null when context is null, and in that case searchLookup is also null. The scorer code at line 846 already uses searchLookup (which would NPE first), so the null valueFetcherSupplier path is guarded by the same condition. The improved_code is identical to existing_code, making this suggestion not actionable.

Low

Previous suggestions

Suggestions up to commit d1a02bd
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against null supplier dereference

When valueFetcherSupplier is null (i.e., context is null), calling
valueFetcherSupplier.get() in the createWeight method will throw a
NullPointerException. There should be a null check before invoking
valueFetcherSupplier.get() in the scorer, similar to how the original code likely
guarded against a null valueFetcher. Ensure the scorer path that calls
valueFetcherSupplier.get() is only reached when valueFetcherSupplier is non-null.

server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java [765-771]

+if (context != null) {
+    this.searchLookup = context.lookup();
+    this.valueFetcherSupplier = () -> fieldType.valueFetcher(context, context.lookup(), null);
+} else {
+    this.searchLookup = null;
+    this.valueFetcherSupplier = null;
+}
 
-
Suggestion importance[1-10]: 4

__

Why: The existing_code and improved_code are identical, so no actual fix is demonstrated. However, the concern about a potential NullPointerException when valueFetcherSupplier is null is valid if the scorer path can be reached without a context. The suggestion asks to verify/ensure a guard exists but doesn't provide the actual fix code.

Low
General
Ensure thread-safe supplier invocation

The Supplier lambda captures context and fieldType from the outer scope, but
fieldType.valueFetcher(...) may not be safe to call concurrently if the supplier
itself is not thread-safe. Consider ensuring the supplier implementation is
stateless or synchronized, or document clearly that the supplier must be
thread-safe, to avoid race conditions when multiple threads call
valueFetcherSupplier.get() simultaneously.

server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java [847-850]

-// Create a new ValueFetcher per thread.
-// ValueFetcher.setNextReader is not thread safe.
-final ValueFetcher valueFetcher = valueFetcherSupplier.get();
+// Create a new ValueFetcher per thread/leaf.
+// ValueFetcher.setNextReader is not thread safe, so a new instance is created per scorer.
+final ValueFetcher valueFetcher = Objects.requireNonNull(valueFetcherSupplier, "valueFetcherSupplier must not be null").get();
 valueFetcher.setNextReader(context);
Suggestion importance[1-10]: 3

__

Why: The suggestion adds a Objects.requireNonNull check on valueFetcherSupplier, which is a minor defensive improvement. However, the thread-safety concern about the supplier itself is speculative and the lambda () -> fieldType.valueFetcher(context, context.lookup(), null) creates a new instance each call, which is inherently safe for concurrent use.

Low

@ShawnQiang1 ShawnQiang1 marked this pull request as ready for review March 11, 2026 16:30
@github-actions
Copy link
Contributor

❌ Gradle check result for d1a02bd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ShawnQiang1
Copy link
Contributor Author

@msfroh please take a look at this PR

Copy link
Contributor

@msfroh msfroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ShawnQiang1!

This makes perfect sense and I'm kind of surprised that it hasn't burned us before.

@github-actions
Copy link
Contributor

❌ Gradle check result for d1a02bd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 4af9812

@github-actions
Copy link
Contributor

✅ Gradle check result for 4af9812: SUCCESS

@codecov
Copy link

codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.33%. Comparing base (4b9021f) to head (4af9812).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...g/opensearch/index/mapper/WildcardFieldMapper.java 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main   #20842   +/-   ##
=========================================
  Coverage     73.32%   73.33%           
- Complexity    72267    72297   +30     
=========================================
  Files          5795     5795           
  Lines        330056   330057    +1     
  Branches      47643    47643           
=========================================
+ Hits         242030   242054   +24     
+ Misses        68584    68580    -4     
+ Partials      19442    19423   -19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ShawnQiang1
Copy link
Contributor Author

@msfroh @andrross could anyone help merge this pr?

@andrross andrross merged commit 0e2783c into opensearch-project:main Mar 12, 2026
34 of 35 checks passed
@ShawnQiang1
Copy link
Contributor Author

ShawnQiang1 commented Mar 12, 2026

this bug same exist in 2.x please consider whether to backport @andrross @msfroh

@andrross
Copy link
Member

What do you think regarding the backport @msfroh? Looks like a pretty straightforward fix with minimal risk to me.

@msfroh
Copy link
Contributor

msfroh commented Mar 12, 2026

What do you think regarding the backport @msfroh? Looks like a pretty straightforward fix with minimal risk to me.

Yeah -- I think it makes sense. I'll add the backport tag and fix any merge conflicts as necessary.

@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19 failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.19 2.19
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.19
# Create a new branch
git switch --create backport/backport-20842-to-2.19
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0e2783c3f042f41c09ad2cf38d96272e5834b4ba
# Push it to GitHub
git push --set-upstream origin backport/backport-20842-to-2.19
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.19

Then, create a pull request where the base branch is 2.19 and the compare/head branch is backport/backport-20842-to-2.19.

@ShawnQiang1 ShawnQiang1 deleted the wildcardBug branch March 13, 2026 03:21
@ShawnQiang1
Copy link
Contributor Author

ShawnQiang1 commented Mar 13, 2026

What do you think regarding the backport @msfroh? Looks like a pretty straightforward fix with minimal risk to me.

Yeah -- I think it makes sense. I'll add the backport tag and fix any merge conflicts as necessary.

yes , i just reproduced this bug on 2.19 branch, let me help to backport this

@ShawnQiang1
Copy link
Contributor Author

ShawnQiang1 commented Mar 13, 2026

@andrross @msfroh
Should we merge into the 2.x branch or the 2.19 branch?

@andrross
Copy link
Member

Should we merge into the 2.x branch or the 2.9 branch?

@ShawnQiang1 Only the 2.19 branch is necessary, thanks!

@ShawnQiang1
Copy link
Contributor Author

@andrross this backport is not merge in 2.19 #17001, causing other errors to occur on my fix.

@andrross
Copy link
Member

@msfroh Should we backport #17001 as well?

@ShawnQiang1
Copy link
Contributor Author

ShawnQiang1 commented Mar 13, 2026

In fact, I verified on branch 2.19 that my changes fixed the issue I was responsible for. However, during the verification process, I discovered another error. After tracing the source, I found that branch 2.19 was missing this pr #17001

@ShawnQiang1
Copy link
Contributor Author

@msfroh Should we backport #17001 as well?

This isn't even a question. This backport is for the 2.x branch; I don't know why it wasn't carried over to 2.19.

@ShawnQiang1
Copy link
Contributor Author

ShawnQiang1 commented Mar 13, 2026

I've figured it out: the 2.19 branch parted ways with 2.x on 2025/1/30, but this PR was merged into 2.x on 2025/2/19, so it wasn't included. @andrross

@ShawnQiang1
Copy link
Contributor Author

ShawnQiang1 commented Mar 13, 2026

According to the roadmap, version 2.19 will actually be the last version of 2.x. The historical mission of 2.x has ended. Shouldn't we consider prohibiting further code merges into 2.x? The backport configuration needs to remove the 2.x branch to prevent the aforementioned incorrect merge issues,Perhaps some changes are needed to the configuration of the CI tools. @andrross @msfroh @cwperks what do you think guys

@andrross
Copy link
Member

@ShawnQiang1 I agree the 2.x branch should be removed (see #18603)

the 2.19 branch parted ways with 2.x on 2025/1/30, but this PR was merged into 2.x on 2025/2/19

Think the problem was that the backport was created on 1/10, prior to the 2.19 branch being created, but not merged until 2/19, and no one caught that this got missed from the 2.19 branch.

ShawnQiang1 added a commit to ShawnQiang1/OpenSearch that referenced this pull request Mar 17, 2026
opensearch-project#20842)

Signed-off-by: Shawn Qiang <814238703@qq.com>
(cherry picked from commit 0e2783c)
Signed-off-by: Shawn Qiang <814238703@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants