absolute rowId and derived source integration for query_then_fetch #19676
base: feature/datafusion
Conversation
❌ Gradle check result for 0047c1a: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
plugins/engine-datafusion/src/main/java/org/opensearch/datafusion/DataFusionQueryJNI.java (outdated review thread, resolved)
Force-pushed 0047c1a to 991699b.
❌ Gradle check result for 991699b: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
/// multiple equivalent orderings, the outer `Vec` will have a
/// single element.
pub file_sort_order: Vec<Vec<SortExpr>>,
Can we not extend the listing table instead of modifying it? Thinking of an object where the default ListingOptions can be the child and we override only the important functions.
We cannot extend it, since there are internal functions we need to change to make this work, such as scan, which in turn uses functions restricted to the ListingTable object that are not accessible to a child; so for those functions we would have to copy them somewhere the child can access anyway. We can think about this, but it might also need some changes in DataFusion; we can plan to do it.
Agreed, I think we should open an issue in DataFusion once we have a plan.
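For illustration only, a rough sketch of the composition idea from this thread. The Table trait and both structs below are hypothetical stand-ins, not DataFusion's real TableProvider/ListingTable API; the point is just to show wrapping the stock table, delegating unchanged behaviour, overriding only scan, and why that stops working once the override needs the inner type's private helpers.

trait Table {
    fn schema(&self) -> String;
    fn scan(&self) -> String;
}

// Stand-in for the stock ListingTable with its default behaviour.
struct ListingTableStandIn;

impl Table for ListingTableStandIn {
    fn schema(&self) -> String {
        "default schema".to_string()
    }
    fn scan(&self) -> String {
        "default scan".to_string()
    }
}

// Wrapper that reuses the inner table for everything except scan.
struct RowIdListingTable<T: Table> {
    inner: T,
}

impl<T: Table> Table for RowIdListingTable<T> {
    fn schema(&self) -> String {
        // Delegate: no change needed here.
        self.inner.schema()
    }
    fn scan(&self) -> String {
        // Custom behaviour; in the real code this is where ListingTable's
        // private helpers would be needed, which is why a plain wrapper is
        // not sufficient today.
        format!("{} + row_id metadata", self.inner.scan())
    }
}

fn main() {
    let table = RowIdListingTable { inner: ListingTableStandIn };
    assert_eq!(table.schema(), "default schema");
    assert_eq!(table.scan(), "default scan + row_id metadata");
}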
fn add_path_preserving_metadata(&self, file_groups: Vec<FileGroup>) -> Vec<FileGroup> {
    let re = Regex::new(r"generation-(\d+)").unwrap();
fn add_path_preserving_metadata(&self, file_groups: Vec<FileGroup>, files_metadata: Arc<Vec<FileMetadata>>) -> Vec<FileGroup> {
Let's return a Result<> here for error handling.
done
fn add_path_preserving_metadata(&self, file_groups: Vec<FileGroup>) -> Vec<FileGroup> {
    let re = Regex::new(r"generation-(\d+)").unwrap();
fn add_path_preserving_metadata(&self, file_groups: Vec<FileGroup>, files_metadata: Arc<Vec<FileMetadata>>) -> Vec<FileGroup> {
How do we make sure the file_groups Vec and the files_metadata Vec have the same order? Does it matter?
The order of files_metadata can differ from the order needed when creating row_base; for now we depend on DatafusionReader for the file ordering. Inside the function, the file_group ordering can be different from the file_metadata ordering, so we map row_base based on location.
Thinking about cases with a high number of files: we search for the matching metadata entry on every iteration. Would it make sense to sort both with the same key and just validate inside the loop?
Makes sense, will think more on this and try to optimize it in the next PR.
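A minimal sketch of one possible shape of that optimization, using simplified stand-in types rather than the plugin's real FileGroup/FileMetadata structs, and assuming both sides can be keyed by the same normalized path (the real code matches by substring): sort the metadata once and look it up per file with a binary search, so the per-iteration linear scan disappears while the original DatafusionReader file order, which row_base depends on, stays untouched.

struct FileMeta {
    location: String,
    row_count: i64,
}

fn row_counts_in_file_order(
    file_locations: &[String],   // files in the order DatafusionReader produced them
    mut metadata: Vec<FileMeta>, // metadata in arbitrary order
) -> Result<Vec<i64>, String> {
    // Sort once by the shared key (assumed here to be an exact path match).
    metadata.sort_by(|a, b| a.location.cmp(&b.location));

    file_locations
        .iter()
        .map(|loc| {
            metadata
                .binary_search_by(|m| m.location.as_str().cmp(loc.as_str()))
                .map(|idx| metadata[idx].row_count)
                .map_err(|_| format!("no metadata found for {loc}"))
        })
        .collect()
}

fn main() {
    let files = vec!["b.parquet".to_string(), "a.parquet".to_string()];
    let metadata = vec![
        FileMeta { location: "a.parquet".to_string(), row_count: 10 },
        FileMeta { location: "b.parquet".to_string(), row_count: 20 },
    ];
    assert_eq!(row_counts_in_file_order(&files, metadata), Ok(vec![20, 10]));
}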
let row_count = files_metadata.iter()
    .find(|meta| { location.contains(meta.object_meta.location.as_ref()) })
    .map(|meta| meta.row_group_row_counts().iter().sum::<i64>() as i32)
    .unwrap_or_default();
What is the default here? Shouldn't we always find it?
Yes, we should always find the row base. Changed.
Should we throw an error here instead then?
done
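A small sketch of the agreed change, with hypothetical stand-in types (in the real code row_group_row_counts() is a method on FileMetadata): instead of silently falling back to 0 via unwrap_or_default() when no metadata matches, the lookup returns an error that the now Result-returning add_path_preserving_metadata can propagate with `?`.

struct ObjectMeta {
    location: String,
}

struct FileMetadata {
    object_meta: ObjectMeta,
    row_group_row_counts: Vec<i64>,
}

fn row_count_for(location: &str, files_metadata: &[FileMetadata]) -> Result<i32, String> {
    files_metadata
        .iter()
        // Same location-based matching as before.
        .find(|meta| location.contains(meta.object_meta.location.as_str()))
        .map(|meta| meta.row_group_row_counts.iter().sum::<i64>() as i32)
        // Missing metadata is now an error, not a silent 0.
        .ok_or_else(|| format!("no file metadata found for location {location}"))
}

fn main() {
    let metadata = vec![FileMetadata {
        object_meta: ObjectMeta { location: "generation-3/file.parquet".to_string() },
        row_group_row_counts: vec![100, 50],
    }];
    assert_eq!(row_count_for("s3://bucket/generation-3/file.parquet", &metadata), Ok(150));
    assert!(row_count_for("s3://bucket/missing.parquet", &metadata).is_err());
}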
Do we want to structure it differently?
We can structure it differently; I will write another test case after completing the fetch phase and update it then.
Force-pushed 991699b to 0bce75a.
❌ Gradle check result for 0bce75a: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Force-pushed 0bce75a to 392e0c1.
❌ Gradle check result for 392e0c1: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
a) Reads the Substrait plan derived from the query "SELECT * FROM "index-7" WHERE ___row_id < 10 LIMIT 9".
b) Queries the Parquet files and returns row_ids.
c) Based on the row ids and projections, creates an access plan and executes it to get the plan's RecordBatchStream (see the sketch below).
d) Using the derived source mappers, converts it to a source JSON object BytesRef.
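A hedged, self-contained sketch of the arithmetic behind step (c): an absolute ___row_id is the file's starting row (row_base) plus an offset within that file, so locating a row means subtracting row_base and walking the row-group row counts. The types and field names below are illustrative stand-ins, not the plugin's actual structs, and the access-plan construction itself is omitted.

// Per-file layout assumed for this sketch: the file's absolute starting row
// (row_base) and the row count of each of its row groups.
struct FileLayout {
    row_base: i64,
    row_group_row_counts: Vec<i64>,
}

// Returns (row_group_index, row_within_group) for one absolute row id, if the
// id falls inside this file.
fn locate_row(layout: &FileLayout, absolute_row_id: i64) -> Option<(usize, i64)> {
    let mut offset = absolute_row_id - layout.row_base;
    if offset < 0 {
        return None; // row id belongs to an earlier file
    }
    for (rg_index, &count) in layout.row_group_row_counts.iter().enumerate() {
        if offset < count {
            return Some((rg_index, offset));
        }
        offset -= count;
    }
    None // row id belongs to a later file
}

fn main() {
    // Two row groups of 100 rows; the file starts at absolute row 1_000.
    let layout = FileLayout { row_base: 1_000, row_group_row_counts: vec![100, 100] };
    assert_eq!(locate_row(&layout, 1_005), Some((0, 5)));
    assert_eq!(locate_row(&layout, 1_150), Some((1, 50)));
    assert_eq!(locate_row(&layout, 2_000), None);
}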
./gradlew :plugins:engine-datafusion:test --tests 'org.opensearch.datafusion.DataFusionServiceTests.testQueryThenFetchE2ETest' -Dtests.seed=EF79DE2C926C4ACC -Dtests.locale=ms-BN -Dtests.timezone=America/Belem
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.