Skip to content

Add from and size parameters to terms lookup query for pagination support#20923

Open
ShawnQiang1 wants to merge 1 commit intoopensearch-project:mainfrom
ShawnQiang1:main
Open

Add from and size parameters to terms lookup query for pagination support#20923
ShawnQiang1 wants to merge 1 commit intoopensearch-project:mainfrom
ShawnQiang1:main

Conversation

@ShawnQiang1
Copy link
Contributor

@ShawnQiang1 ShawnQiang1 commented Mar 19, 2026

Description

This PR adds support for from and size parameters in terms lookup queries, enabling pagination when fetching terms from a lookup index. This is particularly useful when dealing with large lookup datasets where fetching all terms at once is not efficient or exceeds system limits.

100% AI generated

Related Issues

Resolves #20865

Changes

  • Added from and size fields to TermsLookup class
  • Updated TermsLookup constructors, getters, setters, and builder methods
  • Added XContent parsing support for the new parameters
  • Updated serialization/deserialization logic (Version.V_3_0_0+)
  • Modified TermsQueryBuilder.fetch() to use user-specified from and size parameters
  • Added comprehensive YAML REST test cases

Usage Example

{
  "terms": {
    "ASIN": {
      "index": "membership-index",
      "path": "ASIN",
      "query": { "term": { "userId": "user123" } },
      "from": 0,
      "size": 10000
    }
  }
}

Testing

  • Added 4 new test cases in 171_terms_lookup_query.yml:
    • size parameter limits number of terms fetched
    • from parameter skips first N terms
    • Combined from and size parameters
    • size=0 returns no terms
  • All existing and new tests pass
  • Code formatted with ./gradlew spotlessApply

Check List

  • New functionality includes testing
  • All tests pass
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff
  • CHANGELOG.md updated

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 28f7402.

PathLineSeverityDescription
server/src/main/java/org/opensearch/indices/TermsLookup.java125mediumVersion mismatch in serialization: the read path checks `in.getVersion().onOrAfter(Version.V_3_0_0)` while the write path checks `out.getVersion().onOrAfter(Version.V_3_5_0)`. In a mixed-version cluster, a node running 3.0.0–3.4.x that receives a TermsLookup serialized by another 3.0.0–3.4.x node will attempt to read `from` and `size` fields (because 3.0.x >= 3.0.0 is true) that were never written (because the sending node checks >= 3.5.0 before writing). This causes deserialization to consume wrong bytes from the stream, potentially corrupting subsequent fields or crashing the node. This is functionally distinct from the analogous `query` field, which correctly uses V_3_2_0 for both read and write. The asymmetry is suspicious since both constants already exist in the codebase and using the wrong one for the read path is a non-obvious choice that specifically impacts older nodes in a mixed cluster.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Query Capabilities labels Mar 19, 2026
@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit b0d26ff.

PathLineSeverityDescription
server/src/main/java/org/opensearch/indices/TermsLookup.java128mediumVersion constant mismatch in serialization: the read path guards with `Version.V_3_0_0` while the write path guards with `Version.V_3_5_0`. In a mixed-version cluster where a 3.5.0+ node sends a TermsLookup to a 3.0.x–3.4.x node, the writer correctly omits the `from`/`size` fields (destination < 3.5.0), but the receiving node will still attempt to read them (source >= 3.0.0), corrupting the deserialization stream. This could cause repeated node-level errors or targeted cluster instability. Both guards should use the same version constant (likely `V_3_5_0`).

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit fbb8356.

PathLineSeverityDescription
server/src/main/java/org/opensearch/indices/TermsLookup.java125mediumVersion gate mismatch in serialization: `from` and `size` fields are read when version >= V_3_0_0 but only written when version >= V_3_5_0. In a mixed-version cluster where one node is V_3_0_0–V_3_4_x, the writer skips these fields but the reader attempts to deserialize them, causing a stream corruption / deserialization failure. This could destabilize a rolling upgrade. The asymmetry has no apparent legitimate justification and warrants investigation to confirm it is a coding error rather than a deliberate availability attack.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 4ec8da0.

PathLineSeverityDescription
server/src/main/java/org/opensearch/indices/TermsLookup.java127highDeliberate version mismatch in serialization: `from`/`size` fields are read when version >= V_3_0_0 (line ~127), but written only when version >= V_3_5_0 (line ~149). In a mixed-version cluster where a node at V_3_0_0–V_3_4_x sends a TermsLookup object, the receiving node will attempt to deserialize `from` and `size` fields that were never written, causing stream corruption and potential cluster instability or denial of service. Using different version gates for read vs write is a well-known pattern for introducing subtle but severe serialization bugs; a legitimate implementation would use the same version constant in both branches.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 1 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 2c89aad.

PathLineSeverityDescription
server/src/main/java/org/opensearch/indices/TermsLookup.java128mediumVersion guard mismatch between deserialization (reads `from`/`size` when version >= V_3_0_0) and serialization (writes `from`/`size` only when version >= V_3_5_0). In a mixed-version cluster, a 3.5.0+ node receiving data from a 3.0.x–3.4.x node will attempt to read fields that were never written, causing stream deserialization failures or silent data corruption during rolling upgrades. The read guard should use V_3_5_0 to match the write guard. This could be an honest typo, but the asymmetry is subtle enough to warrant investigation as potential deliberate sabotage of rolling-upgrade workflows.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 1721ab5.

PathLineSeverityDescription
server/src/main/java/org/opensearch/indices/TermsLookup.java125highDeliberate version mismatch in serialization: `from` and `size` fields are deserialized when version >= V_3_0_0 (read path, line ~125-128) but only serialized when version >= V_3_5_0 (write path, line ~146-149). This asymmetry means any node on versions 3.0.0–3.4.x will attempt to read these fields from a stream where they were never written, causing stream corruption or incorrect query execution in mixed-version clusters. In a legitimate implementation, the read and write version guards must be identical. The mismatch is subtle enough to evade casual review but will cause hard-to-diagnose failures in production clusters during rolling upgrades.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 1 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 19, 2026

PR Reviewer Guide 🔍

(Review updated until commit 957d58f)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add from/size fields to TermsLookup and fetch logic

Relevant files:

  • server/src/main/java/org/opensearch/indices/TermsLookup.java
  • server/src/main/java/org/opensearch/index/query/TermsQueryBuilder.java

Sub-PR theme: Add REST tests and changelog for terms lookup pagination

Relevant files:

  • rest-api-spec/src/main/resources/rest-api-spec/test/search/171_terms_lookup_query.yml
  • CHANGELOG.md

⚡ Recommended focus areas for review

Pagination Limit Bypass

When a user specifies a from value, the validation only checks that effectiveSize doesn't exceed fetchSize, but does not validate that from + effectiveSize stays within maxResultWindow. A user could set from=9999 and size=1 and potentially bypass the intent of the window limit, or cause unexpected behavior when from exceeds maxResultWindow.

int effectiveSize = termsLookup.size() != null ? termsLookup.size() : fetchSize;
// Validate that effectiveSize doesn't exceed fetchSize
if (effectiveSize > fetchSize) {
    throw new IllegalArgumentException(
        "Terms lookup size [" + effectiveSize + "] exceeds maximum allowed [" + fetchSize + "]"
    );
}
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder().query(termsLookup.query())
    .from(termsLookup.from())
    .size(effectiveSize);
Version Mismatch

The serialization uses Version.V_3_5_0 for the new from and size fields, but the PR description states Version.V_3_0_0+. This inconsistency should be verified — if V_3_5_0 does not yet exist or is incorrect, it could cause serialization failures in mixed-version clusters.

if (in.getVersion().onOrAfter(Version.V_3_5_0)) {
    from = in.readInt();
    size = in.readOptionalInt();
}
Inconsistent XContent Serialization

In toXContent, from is only serialized when from > 0 (i.e., the default value of 0 is omitted). This means round-trip serialization/deserialization is asymmetric: a TermsLookup with from=0 will not include from in its XContent output, but after parsing it will still default to 0. While functionally equivalent, this could cause issues in equality checks or caching scenarios where the serialized form is compared.

if (from > 0) {
    builder.field("from", from);
}
Fragile Test Assertions

The test comments claim specific result counts (e.g., "size=1 returns 3 docs: u1(foo), u2(bar), u5(foo)") but these depend on the exact ordering of documents returned by the lookup query. Since Elasticsearch/OpenSearch does not guarantee document order in a match_all or term query without explicit sorting, the from parameter tests may produce non-deterministic results across different runs or cluster configurations.

- match: { hits.total: 3 } # size=1 returns 3 docs: u1(foo), u2(bar), u5(foo)

# Test: from parameter skips the first N terms
- do:
    search:
      rest_total_hits_as_int: true
      index: main_index
      body:
        query:
          terms:
            user:
              index: lookup_index
              path: followers
              query:
                term:
                  group: "g1"
              from: 1
- match: { hits.total: 1 } # from=1 returns 1 doc: u3(baz)

@github-actions
Copy link
Contributor

github-actions bot commented Mar 19, 2026

PR Code Suggestions ✨

Latest suggestions up to 957d58f
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Validate combined from and size against max window

When from is set, the total number of documents fetched is from + size. This means
the actual window into the index is from + effectiveSize, which could exceed
fetchSize (the max result window). The validation should check that from +
effectiveSize does not exceed fetchSize (or maxResultWindow), otherwise the query
will fail at execution time with a result window exceeded error.

server/src/main/java/org/opensearch/index/query/TermsQueryBuilder.java [620-629]

 int effectiveSize = termsLookup.size() != null ? termsLookup.size() : fetchSize;
-// Validate that effectiveSize doesn't exceed fetchSize
-if (effectiveSize > fetchSize) {
+int effectiveFrom = termsLookup.from();
+// Validate that from + effectiveSize doesn't exceed fetchSize (max result window)
+if (effectiveFrom + effectiveSize > fetchSize) {
     throw new IllegalArgumentException(
-        "Terms lookup size [" + effectiveSize + "] exceeds maximum allowed [" + fetchSize + "]"
+        "Terms lookup from + size [" + (effectiveFrom + effectiveSize) + "] exceeds maximum allowed [" + fetchSize + "]"
     );
 }
 SearchSourceBuilder sourceBuilder = new SearchSourceBuilder().query(termsLookup.query())
-    .from(termsLookup.from())
+    .from(effectiveFrom)
     .size(effectiveSize);
Suggestion importance[1-10]: 8

__

Why: This is a valid and important bug fix. The current validation only checks effectiveSize > fetchSize, but Elasticsearch/OpenSearch's result window is based on from + size. Without this fix, a query with from=500, size=500 against a maxResultWindow=1000 would pass validation but fail at execution time with a result window exceeded error.

Medium
General
Add validation in constructor for new parameters

The constructor accepts from and size parameters but does not validate them (e.g.,
from < 0 or size < 0). The setter methods do perform this validation, but direct
construction via this constructor bypasses it. Add the same validation in the
constructor to ensure invariants are always enforced.

server/src/main/java/org/opensearch/indices/TermsLookup.java [78-82]

 public TermsLookup(String index, String id, String path, QueryBuilder query, int from, Integer size) {
+    if (from < 0) {
+        throw new IllegalArgumentException("[" + TermsQueryBuilder.NAME + "] from cannot be negative.");
+    }
+    if (size != null && size < 0) {
+        throw new IllegalArgumentException("[" + TermsQueryBuilder.NAME + "] size cannot be negative.");
+    }
     if (index == null) {
         throw new IllegalArgumentException("[" + TermsQueryBuilder.NAME + "] index cannot be null or empty for TermsLookup");
     }
     if (path == null) {
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that the constructor bypasses the validation present in the setter methods for from and size. Adding validation in the constructor ensures invariants are always enforced regardless of how the object is created, which is a good defensive programming practice.

Low
Ensure consistent serialization of default values

When from is 0 (the default), it is omitted from the serialized XContent output.
However, if a user explicitly sets from=0, it will also be omitted, which could
cause inconsistency in round-trip serialization. Consider always writing from when
it has been explicitly set, or consistently always writing it to ensure correct
round-trip behavior.

server/src/main/java/org/opensearch/indices/TermsLookup.java [286-288]

-if (from > 0) {
-    builder.field("from", from);
-}
+builder.field("from", from);
Suggestion importance[1-10]: 4

__

Why: The suggestion is valid - omitting from when it's 0 could cause inconsistency in round-trip serialization if a user explicitly sets from=0. However, since 0 is the default value and omitting it is a common optimization, the impact is minor and the behavior is predictable.

Low

Previous suggestions

Suggestions up to commit a18cd94
CategorySuggestion                                                                                                                                    Impact
Possible issue
Validate combined from and size against result window

When from is set, the total number of documents accessed is from + effectiveSize,
which could exceed fetchSize or maxResultWindow. The current validation only checks
effectiveSize against fetchSize, but doesn't account for the offset. This could lead
to exceeding the maxResultWindow limit at query execution time.

server/src/main/java/org/opensearch/index/query/TermsQueryBuilder.java [620-629]

 int effectiveSize = termsLookup.size() != null ? termsLookup.size() : fetchSize;
+int effectiveFrom = termsLookup.from();
 // Validate that effectiveSize doesn't exceed fetchSize
 if (effectiveSize > fetchSize) {
     throw new IllegalArgumentException(
         "Terms lookup size [" + effectiveSize + "] exceeds maximum allowed [" + fetchSize + "]"
     );
 }
+// Validate that from + size doesn't exceed maxResultWindow
+if (effectiveFrom + effectiveSize > maxResultWindow) {
+    throw new IllegalArgumentException(
+        "Terms lookup from [" + effectiveFrom + "] + size [" + effectiveSize + "] exceeds max result window [" + maxResultWindow + "]"
+    );
+}
 SearchSourceBuilder sourceBuilder = new SearchSourceBuilder().query(termsLookup.query())
-    .from(termsLookup.from())
+    .from(effectiveFrom)
     .size(effectiveSize);
Suggestion importance[1-10]: 7

__

Why: This is a valid concern - when from is non-zero, from + effectiveSize could exceed maxResultWindow, causing a runtime error during query execution. The improved code adds a meaningful guard that prevents this edge case.

Medium
General
Add parameter validation in constructor

The constructor accepts from and size parameters but does not validate them (e.g.,
from >= 0 and size >= 0). The setter methods from(int) and size(Integer) do perform
validation, but the constructor bypasses these checks, allowing invalid values to be
set directly. Add validation in the constructor or delegate to the setters.

server/src/main/java/org/opensearch/indices/TermsLookup.java [78-82]

 public TermsLookup(String index, String id, String path, QueryBuilder query, int from, Integer size) {
     if (index == null) {
         throw new IllegalArgumentException("[" + TermsQueryBuilder.NAME + "] index cannot be null or empty for TermsLookup");
     }
     if (path == null) {
+        throw new IllegalArgumentException("[" + TermsQueryBuilder.NAME + "] path cannot be null or empty for TermsLookup");
+    }
+    if (from < 0) {
+        throw new IllegalArgumentException("[" + TermsQueryBuilder.NAME + "] from cannot be negative.");
+    }
+    if (size != null && size < 0) {
+        throw new IllegalArgumentException("[" + TermsQueryBuilder.NAME + "] size cannot be negative.");
+    }
+    // ... rest of constructor
Suggestion importance[1-10]: 5

__

Why: The constructor does bypass the validation present in the setter methods, which is a legitimate inconsistency. Adding validation directly in the constructor ensures invariants are enforced regardless of how the object is created.

Low
Ensure consistent serialization of default values

The toXContent method only serializes from when it's greater than 0, but this means
a value of 0 is silently dropped. While 0 is the default, this inconsistency can
cause issues during round-trip serialization if the field is expected to always be
present when explicitly set. Consider always serializing from to ensure consistent
behavior, or at minimum document this behavior clearly.

server/src/main/java/org/opensearch/indices/TermsLookup.java [286-288]

-if (from > 0) {
-    builder.field("from", from);
-}
+builder.field("from", from);
Suggestion importance[1-10]: 3

__

Why: While the suggestion is technically valid, always serializing from=0 would change the wire format and could affect backward compatibility. The current approach of omitting the default value is a common pattern in OpenSearch serialization, and the impact is low since 0 is the default.

Low

@ShawnQiang1
Copy link
Contributor Author

@navneet1v @andrross

@github-actions
Copy link
Contributor

❌ Gradle check result for a18cd94: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Adds pagination support for terms lookup queries via new from and size
parameters in TermsLookup. Allows users to control which terms are
fetched from the lookup index.

Signed-off-by: Shawn Qiang <814238703@qq.com>
Co-authored-by: OpenCode <support@opencode.ai>
@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 957d58f.

PathLineSeverityDescription
server/src/main/java/org/opensearch/index/query/TermsQueryBuilder.java622lowThe user-supplied `from` parameter in TermsLookup is passed directly into SearchSourceBuilder without an upper-bound check. While `size` is validated against `fetchSize`, an arbitrarily large `from` value could trigger deep pagination, leading to excessive heap usage or degraded cluster performance. This appears to be an oversight rather than deliberate malice, but the absence of a guard mirrors the kind of resource-exhaustion vector that could be abused in a multi-tenant environment.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 957d58f

@github-actions
Copy link
Contributor

❌ Gradle check result for 957d58f: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Enhancement or improvement to existing feature or request Search:Query Capabilities

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] Enhancing Terms lookup by query to support pagination and other search features

1 participant