
Upgrade DataFusion to 52.1.0 and Add Liquid Cache Support #20740

Merged
bharath-techie merged 4 commits into opensearch-project:feature/datafusion from cocosz:feature/datafusion-52-liquid-cache
Mar 13, 2026

Conversation


@cocosz cocosz commented Feb 27, 2026

Summary

This PR upgrades the DataFusion engine from version 51.0.0 to 52.1.0 and integrates the liquid-cache-datafusion-local dependency for enhanced caching capabilities.

Changes

Dependency Updates

  • DataFusion Core: 51.0.0 → 52.1.0
  • DataFusion Expression: 51.0.0 → 52.1.0
  • DataFusion DataSource: 51.0.0 → 52.1.0
  • DataFusion Substrait: 51.0.0 → 52.1.0
  • Arrow Libraries: 57.1.0 → 57.3.0
  • Parquet: 57.1.0 → 57.3.0
  • Object Store: 0.12.4 → 0.12.5
  • New: Added liquid-cache-datafusion-local = "0.1.12"

@cocosz cocosz requested a review from a team as a code owner on February 27, 2026 08:30
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 776c709.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added. This is a relatively obscure crate at a specific version that integrates deeply into the query execution path and writes query data to disk. While it appears to be a legitimate open-source caching library, it warrants vetting of its crate origin, ownership, and code for any unexpected network calls or data handling before merging into a production search engine. |
| plugins/engine-datafusion/jni/src/liquid_cache_runtime.rs | 22 | low | Hardcoded cache directory path `/var/lib/opensearch/liquid_cache` is used unconditionally, without any configurable override. Query data (Parquet file contents) will be persisted to this path. If filesystem permissions on this directory are overly permissive, or if the node is shared, cached sensitive query data could be read by other local processes. This is an information disclosure risk rather than malicious intent, but the fixed path with no configuration option is an anomaly. |

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1


Pull Request Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass the diff analyzer by adding the label `skip-diff-analyzer` after reviewing the changes carefully, then re-running the failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
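The hardcoded cache directory the analyzer flags could be derived from configuration with a fallback. A minimal sketch, assuming a hypothetical setting value passed in from OpenSearch; this helper is illustrative, not part of the PR:

```rust
use std::path::PathBuf;

// Hypothetical helper: prefer a configured directory, fall back to the
// path the PR currently hardcodes. The plumbing that reads the setting
// from OpenSearch configuration is assumed, not shown.
fn liquid_cache_dir(configured: Option<&str>) -> PathBuf {
    configured
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("/var/lib/opensearch/liquid_cache"))
}

fn main() {
    // Unset: keep today's default path.
    assert_eq!(
        liquid_cache_dir(None),
        PathBuf::from("/var/lib/opensearch/liquid_cache")
    );
    // Configured: the operator-chosen path wins.
    assert_eq!(liquid_cache_dir(Some("/tmp/liquid")), PathBuf::from("/tmp/liquid"));
}
```

Keeping the current path as the default would preserve behavior for existing deployments while addressing the analyzer's configurability concern.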

@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from 776c709 to ab6614a on February 27, 2026 08:32
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ab6614a.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New third-party crate `liquid-cache-datafusion-local = "0.1.12"` introduced with a fixed minor version. This is a relatively obscure crate (version 0.1.x indicates early-stage). Supply chain risk: the crate's behavior at runtime (file I/O, object store registration) is difficult to audit without reviewing the crate source. The crate is given access to the DataFusion RuntimeEnv and SessionContext, making it capable of intercepting data reads. |
| plugins/engine-datafusion/jni/src/liquid_cache_runtime.rs | 22 | low | Hardcoded filesystem path `/var/lib/opensearch/liquid_cache` is used for cache storage. While plausible for a caching feature, hardcoding a specific system path (rather than deriving it from OpenSearch configuration) bypasses any path sanitization or permission controls that the application normally enforces. |
| plugins/engine-datafusion/jni/src/liquid_cache_runtime.rs | 18 | low | LiquidCacheRef is typed as 'Arc', erasing the concrete type of the cache object. This makes static analysis of what the cache reference holds impossible, and obscures the actual runtime behavior of the stored liquid_cache object returned by the external crate. |


Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2



```rust
    liquid_cache: LiquidCacheRef,
}

static LIQUID_ONLY: OnceLock<Result<LiquidOnlyRuntime, String>> = OnceLock::new();
```
Contributor

Why are we doing it here? Can't it be done in global runtime?
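For context on the reviewer's question, the `OnceLock` pattern quoted above guarantees one-time initialization regardless of which call site triggers it first. A self-contained sketch; the `LiquidOnlyRuntime` body here is a stand-in, not the PR's implementation:

```rust
use std::sync::OnceLock;

// Stand-in for the PR's LiquidOnlyRuntime; only the init shape matters here.
struct LiquidOnlyRuntime {
    max_cache_bytes: usize,
}

impl LiquidOnlyRuntime {
    fn init(max_cache_bytes: usize) -> Result<Self, String> {
        if max_cache_bytes == 0 {
            return Err("cache size must be non-zero".to_string());
        }
        Ok(Self { max_cache_bytes })
    }
}

// get_or_init runs the closure at most once, even under concurrent callers,
// so the runtime is built lazily on first use.
static LIQUID_ONLY: OnceLock<Result<LiquidOnlyRuntime, String>> = OnceLock::new();

fn liquid_runtime() -> &'static Result<LiquidOnlyRuntime, String> {
    LIQUID_ONLY.get_or_init(|| LiquidOnlyRuntime::init(1024 * 1024 * 1024))
}

fn main() {
    let rt = liquid_runtime();
    assert!(rt.is_ok());
    assert_eq!(rt.as_ref().unwrap().max_cache_bytes, 1024 * 1024 * 1024);
}
```

Moving the initialization into the global runtime, as suggested, would mainly change where `get_or_init` is first triggered, not the once-only semantics.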

```rust
config = config.with_files_statistics_cache(Some(default_stats));
}
// Add statistics cache if available - use default since CustomStatisticsCache doesn't implement FileStatisticsCache trait
let default_stats = Arc::new(DefaultFileStatisticsCache::default());
```
Contributor

This defeats the purpose of our custom statistics cache

```rust
config.options_mut().execution.parquet.pushdown_filters = false;
config.options_mut().execution.target_partitions = target_partitions;
config.options_mut().execution.batch_size = 8192;
.with_metadata_cache_limit(250 * 1024 * 1024)
```
Contributor

This should be configurable via settings

```rust
info!("[LiquidCache] Creating Parquet access plans for {} row IDs across {} files",
    row_ids.len(), files_metadata.len());
let access_plans = create_access_plans(row_ids, files_metadata.clone()).await?;
info!("[LiquidCache] ✓ Access plans created, Liquid Cache will optimize data access");
```
Contributor

Kindly trim unnecessary logs throughout this PR and keep only the essential debug logs.

```rust
.build().unwrap();

log_info!("[LiquidCache] Initializing global Liquid Cache (1GB max)");
let liquid_runtime = match LiquidOnlyRuntime::init(1024 * 1024 * 1024) {
```
Contributor

This should be configurable via settings.
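Making the 1 GB limit setting-driven, as requested, mostly reduces to parsing a configured value with a safe default. A sketch with a hypothetical parser; the setting name and plumbing are assumptions, not from the PR:

```rust
// Hypothetical: parse a configured cache size, falling back to the
// currently hardcoded 1 GiB when the setting is absent or malformed.
fn liquid_cache_max_bytes(setting: Option<&str>) -> u64 {
    const DEFAULT: u64 = 1024 * 1024 * 1024; // today's hardcoded 1 GiB
    setting
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(DEFAULT)
}

fn main() {
    assert_eq!(liquid_cache_max_bytes(None), 1024 * 1024 * 1024);
    assert_eq!(liquid_cache_max_bytes(Some("268435456")), 268_435_456);
    // Malformed input falls back rather than failing node startup.
    assert_eq!(liquid_cache_max_bytes(Some("not-a-size")), 1024 * 1024 * 1024);
}
```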

@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from ab6614a to 13e10d5 on March 2, 2026 14:38
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 13e10d5.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added without apparent prior usage elsewhere. Adding a new external crate is a potential supply chain attack vector; the crate's provenance and trustworthiness should be verified against crates.io ownership and audit history before merging. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | `FileStatisticsCache::list_entries` is implemented to always return an empty HashMap with the comment that it is 'used for introspection only'. While plausibly benign, this silently suppresses cache visibility for any monitoring or auditing tooling that relies on this interface, which could obscure cache state from operators. |


Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



@cocosz cocosz changed the title from "Integrate Liquid Cache with DataFusion 52.1 for byte-level Parquet ca…" to "Upgrade DataFusion to 52.1.0 and Add Liquid Cache Support" on Mar 2, 2026
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 17d5398.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New external dependency `liquid-cache-datafusion-local = "0.1.12"` added with only a vague comment ('byte-level caching'). This package is relatively obscure and does not appear to be referenced in any of the code changes shown in this diff. Adding an unused or minimally documented dependency warrants verification of the package's provenance, ownership, and published source code before merging. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | The FileStatisticsCache::list_entries() implementation unconditionally returns an empty HashMap, suppressing all cache introspection data. The comment claims this is 'for introspection only', but silently hiding cache state could mask unexpected behavior or make auditing harder. This is likely a stub implementation for a new trait method, but should be confirmed as intentional. |


Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from 17d5398 to ab300b4 on March 2, 2026 15:40
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ab300b4.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added with no visible usage anywhere in the diff. An obscure crate at version 0.1.x being pulled in without any corresponding code changes is anomalous and warrants verification of the crate's provenance, publisher identity, and whether it was actually intended for this PR. |
| plugins/engine-datafusion/Cargo.toml | 45 | low | object_store version constraint changed from an exact pin '=0.12.4' to an unpinned '0.12.5'. The original exact pin was likely intentional for reproducible/audited builds. Loosening this allows future patch bumps without explicit review, slightly increasing supply chain exposure. |


Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



…pendency

Signed-off-by: Tanvir Alam <tanvralm@amazon.com>
@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from ab300b4 to 76ff645 on March 2, 2026 15:47
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 76ff645.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 54 | low | New external dependency `liquid-cache-datafusion-local = "0.1.12"` added. While the comment describes it as a byte-level caching library and it corresponds to a real open-source project, any new third-party dependency warrants supply chain verification to confirm the crate version and publisher are trustworthy. |


Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1



@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Upgrade DataFusion, Arrow, Parquet, and Object Store dependencies to latest versions

Relevant files:

  • plugins/engine-datafusion/Cargo.toml
  • plugins/engine-datafusion/jni/Cargo.toml

Sub-PR theme: Adapt source code to DataFusion 52 API changes and integrate Liquid Cache

Relevant files:

  • plugins/engine-datafusion/jni/src/absolute_row_id_optimizer.rs
  • plugins/engine-datafusion/jni/src/listing_table.rs
  • plugins/engine-datafusion/jni/src/query_executor.rs
  • plugins/engine-datafusion/jni/src/cache.rs
  • plugins/engine-datafusion/jni/src/custom_cache_manager.rs
  • plugins/engine-datafusion/jni/src/statistics_cache.rs

⚡ Recommended focus areas for review

Error Handling

projected_schema() now uses .expect("projected_schema failed") which will panic on failure. The previous code used .clone() on a direct field access. Consider propagating the error instead of panicking, especially in a production optimizer path.

```rust
let projected_schema = datasource.projected_schema().expect("projected_schema failed");
```
Hardcoded Source

In create_datasource_projection, a new ParquetSource is always created unconditionally, discarding any existing file source configuration (e.g., custom options, pushdowns) from the original datasource. This may silently drop important source-level settings when the optimizer rewrites the plan.

```rust
use datafusion::datasource::physical_plan::ParquetSource;
let new_file_source = Arc::new(ParquetSource::new(new_table_schema));

let file_scan_config = FileScanConfigBuilder::from(datasource.clone())
    .with_source(new_file_source)
    .with_projection_indices(Some(new_projections))
    .expect("Failed to set projection indices")
    .build();
```
Null Table Scope

TableScopedPath is constructed with table: None in both execute_query_with_cross_rt_stream and execute_fetch_phase. If the cache lookup logic in DataFusion 52 uses the table field for scoping/isolation, setting it to None may cause cache collisions between different tables sharing the same path prefix.

```rust
let table_scoped_path = datafusion::execution::cache::TableScopedPath {
    table: None,
    path: table_path.prefix().clone(),
};
list_file_cache.put(&table_scoped_path, object_meta);
```
Empty Implementation

The FileStatisticsCache::list_entries implementation always returns an empty HashMap. If DataFusion or any tooling relies on this for cache warming, eviction decisions, or diagnostics, this stub will silently produce incorrect results. The comment says "introspection only" but this should be validated.

```rust
impl datafusion::execution::cache::cache_manager::FileStatisticsCache for CustomStatisticsCache {
    fn list_entries(&self) -> std::collections::HashMap<object_store::path::Path, datafusion::execution::cache::cache_manager::FileStatisticsCacheEntry> {
        // Return empty map — this is used for introspection only
        std::collections::HashMap::new()
    }
}
```
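If the custom cache keeps its entries in an inner map, `list_entries` could surface a snapshot of them instead of an empty map. A sketch with illustrative field and entry types, since the real `CustomStatisticsCache` internals are not shown in this thread:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative stand-in: the real cache stores DataFusion statistics
// entries; a String -> u64 map keeps this sketch self-contained.
struct CustomStatisticsCache {
    inner: Mutex<HashMap<String, u64>>,
}

impl CustomStatisticsCache {
    fn put(&self, path: &str, row_count: u64) {
        self.inner.lock().unwrap().insert(path.to_string(), row_count);
    }

    // Snapshot the inner map so introspection tooling sees real entries
    // without callers holding the lock.
    fn list_entries(&self) -> HashMap<String, u64> {
        self.inner.lock().unwrap().clone()
    }
}

fn main() {
    let cache = CustomStatisticsCache { inner: Mutex::new(HashMap::new()) };
    cache.put("generation-1.parquet", 1024);
    let entries = cache.list_entries();
    assert_eq!(entries.len(), 1);
    assert_eq!(entries["generation-1.parquet"], 1024);
}
```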
Partition Column Type

In create_file_source_with_schema_adapter, partition columns are constructed with nullable = false. Previously, partition columns may have had different nullability. If any partition column can be null (e.g., missing partition value), this hardcoded false could cause schema mismatches or incorrect query results.

```rust
        .map(|(name, dt)| Arc::new(Field::new(name, dt.clone(), false)) as _)
        .collect(),
);
```

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Propagate error instead of panicking

Using .expect() here will cause a panic if projected_schema() returns an error,
which could crash the query executor in production. This should propagate the error
using ? instead, since the enclosing function likely returns a Result.

plugins/engine-datafusion/jni/src/absolute_row_id_optimizer.rs [44]

```diff
-let projected_schema = datasource.projected_schema().expect("projected_schema failed");
+let projected_schema = datasource.projected_schema()?;
```
Suggestion importance[1-10]: 5


Why: Using .expect() instead of ? for error propagation is a valid concern for production robustness. However, the enclosing function build_updated_file_source_schema returns (SchemaRef, Vec<usize>) not a Result, so ? cannot be used directly without changing the function signature.

Impact: Low
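The note above points out that `?` only works once the enclosing function itself returns a `Result`. A simplified sketch of that signature change; the types are stand-ins for the optimizer's real ones:

```rust
// Stand-in for DataFusion's SchemaRef in this sketch.
#[derive(Debug)]
struct SchemaRef(Vec<String>);

fn projected_schema(available: bool) -> Result<SchemaRef, String> {
    if available {
        Ok(SchemaRef(vec!["a".into(), "b".into()]))
    } else {
        Err("projected_schema failed".into())
    }
}

// Before: -> (SchemaRef, Vec<usize>) forced `.expect()`, panicking on error.
// After: a Result return type lets `?` propagate the failure to the caller.
fn build_updated_file_source_schema(available: bool) -> Result<(SchemaRef, Vec<usize>), String> {
    let schema = projected_schema(available)?;
    let projections: Vec<usize> = (0..schema.0.len()).collect();
    Ok((schema, projections))
}

fn main() {
    assert!(build_updated_file_source_schema(true).is_ok());
    // The error path returns Err instead of crashing the query executor.
    assert!(build_updated_file_source_schema(false).is_err());
}
```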
Verify partition column index alignment after schema change

The ROW_BASE_FIELD_NAME field is being added to the TableSchema as a partition
column, but then with_table_partition_cols was removed from the builder. The
partition column must be declared consistently — if it's embedded in the
TableSchema, ensure the projection index calculation (parquet_schema.fields.len() as
the partition column index) still correctly references it, since the schema length
may not align with the partition column position after this change.

plugins/engine-datafusion/jni/src/query_executor.rs [426-430]

```diff
 let table_schema = datafusion_datasource::table_schema::TableSchema::new(
     parquet_schema.clone(),
     vec![Arc::new(Field::new(ROW_BASE_FIELD_NAME, DataType::Int64, false))],
 );
 let file_source = Arc::new(ParquetSource::new(table_schema));
+// Verify: partition column index = parquet_schema.fields.len() (0-based index after all file fields)
+// This must match the TableSchema partition column position
```
Suggestion importance[1-10]: 3


Why: The suggestion raises a valid concern about whether parquet_schema.fields.len() correctly indexes the partition column after the schema change, but the 'improved_code' only adds a comment without actually fixing anything. The existing code appears correct since TableSchema::new takes file schema + partition cols separately, and the partition column index would still be at parquet_schema.fields.len().

Impact: Low
Fix early return breaking builder chain

The ? operator is applied to with_projection_indices(...) mid-chain, which breaks
the builder pattern — the ? will short-circuit and return early before
.with_limit(), .with_output_ordering(), and .with_expr_adapter() are called. The
result of with_projection_indices should be unwrapped before continuing the chain,
or the builder should be split into separate statements.

plugins/engine-datafusion/jni/src/listing_table.rs [1325-1329]

```diff
-.with_projection_indices(projection.cloned())?
-.with_limit(limit)
-.with_output_ordering(output_ordering)
-.with_expr_adapter(self.expr_adapter_factory.clone())
-.build(),
+let builder = FileScanConfigBuilder::new(object_store_url, file_source)
+    .with_file_groups(partitioned_file_lists)
+    .with_constraints(self.constraints.clone())
+    .with_statistics(statistics)
+    .with_projection_indices(projection.cloned())?
+    .with_limit(limit)
+    .with_output_ordering(output_ordering)
+    .with_expr_adapter(self.expr_adapter_factory.clone())
+    .build();
+self.options.format.create_physical_plan(state, builder).await
```
Suggestion importance[1-10]: 2


Why: The concern about ? breaking the builder chain is theoretically valid in some languages, but in Rust, ? on a method in a chain works correctly — it propagates the error from with_projection_indices and the remaining chain methods are only called if it returns Ok. The 'improved_code' restructures the code but doesn't actually fix a real bug, and the suggested refactoring duplicates builder calls already present in the surrounding context.

Impact: Low
General
Return actual cache entries instead of empty map

Returning an empty map from list_entries means the cache contents are never visible
for introspection or debugging, and any tooling that relies on this method to
enumerate cached entries (e.g., for cache invalidation or monitoring) will silently
see no entries even when the cache is populated. Consider iterating over the actual
inner cache entries to return a correct map.

plugins/engine-datafusion/jni/src/statistics_cache.rs [501-504]

```diff
 fn list_entries(&self) -> std::collections::HashMap<object_store::path::Path, datafusion::execution::cache::cache_manager::FileStatisticsCacheEntry> {
-    // Return empty map — this is used for introspection only
-    std::collections::HashMap::new()
+    self.inner_cache
+        .list_entries()
 }
```
Suggestion importance[1-10]: 4


Why: Returning an empty map from list_entries is a functional gap that could affect cache introspection and monitoring. However, the inner_cache type may not implement list_entries() directly, making the suggested fix potentially incorrect without knowing the inner cache's API.

Impact: Low

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

❌ Gradle check result for 76ff645: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ba9d9c1.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 55 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added with no prior workspace usage. This crate is not a well-known ecosystem package and should be verified for authenticity and provenance before inclusion — potential supply chain risk if the crate name was squatted or typosquatted. |
| plugins/engine-datafusion/src/test/resources/data/index-7/0/parquet/generation-1.parquet | 1 | low | Binary Parquet files added as test resources cannot be statically reviewed for embedded payloads or exfiltration triggers. In context they appear to be legitimate test data, but binary test fixtures should be generated programmatically or have their contents attested. |


Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from ba9d9c1 to 698f7c3 on March 5, 2026 01:00
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 698f7c3.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 53 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added at workspace level. Version 0.1.12 is early-stage and the library has not appeared in prior dependencies. Any new workspace-level crate is a potential supply chain vector and should be audited against its published crate source before merging. |
| plugins/engine-datafusion/src/test/resources/data/index-7/0/parquet/generation-1.parquet | 1 | low | Binary Parquet test file added. Binary blobs cannot be reviewed inline for embedded payloads or malicious content. Content should be verified to contain only expected schema/row data consistent with the declared test scenario. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | FileStatisticsCache::list_entries() is implemented to always return an empty HashMap with comment 'used for introspection only'. If this interface is used by monitoring or auditing subsystems, silently returning empty data could suppress visibility into cache state, though no direct malicious path is evident in context. |


Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2



Signed-off-by: Tanvir Alam <tanvralm@amazon.com>
@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from 698f7c3 to 579060c on March 5, 2026 09:24
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 579060c.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 54 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added as a workspace dependency and pulled into the JNI layer. The package is at a very early version (0.1.x), is not a well-known official DataFusion or Apache project, and no explicit usage of it appears in any Rust source file in this diff. An added-but-not-visibly-used dependency is a classic supply chain staging pattern. The crate should be verified as coming from a trusted, audited source before merging. |
| plugins/engine-datafusion/src/test/resources/data/index-7/0/parquet/generation-1.parquet | 1 | low | Two new binary Parquet test files are introduced. While the context (test data for query/fetch phase tests) is plausible, binary blobs committed to source control cannot be inspected in a standard diff review and are a potential vector for embedding hidden payloads. Both files should be validated to contain only expected schema/row data with no embedded executable content. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 497 | low | The new FileStatisticsCache::list_entries() implementation unconditionally returns an empty HashMap, explicitly suppressing cache introspection for the custom statistics cache. The comment acknowledges this is used for introspection/monitoring. Silencing observability hooks can obscure cache state from audit or monitoring tooling, though a legitimate reason (avoiding double-counting or unsupported operation) may exist. |


Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2



```rust
let list_file_cache = Arc::new(DefaultListFilesCache::default());
list_file_cache.put(table_path.prefix(), object_meta);
let table_scoped_path = datafusion::execution::cache::TableScopedPath {
    table: None,
```
Contributor

Do we need to define the table here as well?

Author

`TableScopedPath` is required by DataFusion's cache API signature: the `put()` method expects this struct type, not a raw path.
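The scoping concern is easiest to see with a toy key type: when `table` is `None`, two different tables sharing a path prefix produce identical cache keys. The struct below is an illustrative stand-in, not DataFusion's actual `TableScopedPath`:

```rust
// Illustrative stand-in for the cache key; DataFusion's real type differs.
#[derive(Hash, PartialEq, Eq, Debug)]
struct TableScopedPath {
    table: Option<String>,
    path: String,
}

fn main() {
    let a = TableScopedPath { table: None, path: "data/index-7".into() };
    let b = TableScopedPath { table: None, path: "data/index-7".into() };
    // Identical keys: a lookup for one table can hit the other's entry.
    assert_eq!(a, b);

    let scoped = TableScopedPath {
        table: Some("index_7".into()),
        path: "data/index-7".into(),
    };
    // Populating `table` disambiguates entries that share a path prefix.
    assert_ne!(a, scoped);
}
```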

```rust
let mut config = SessionConfig::new();
config.options_mut().execution.parquet.pushdown_filters = false;
config.options_mut().execution.target_partitions = target_partitions;
config.options_mut().execution.target_partitions = 4;
```
Contributor

Let's make sure we keep this the same?

```toml
url = { workspace = true }

# Liquid Cache for byte-level caching
liquid-cache-datafusion-local = "0.1.12"
```
Contributor

Do it similar to other packages

Contributor

Why these files?

Author

The /parquet/ subdirectory was created because the DatafusionEngine constructs the file path by appending the data format name to the base path. Looking at the earlier context, when using DataFormat.PARQUET, the engine expects files to be in a parquet/ subdirectory.

Contributor

Why can't we mock the path in the tests? I think we should already have some files.

Contributor

I think only the paths have changed - this should be okay, right?

Author

Yes, only the paths have changed; the files are now inside the parquet directory.

…tructure

Signed-off-by: Tanvir Alam <tanvralm@amazon.com>
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit c496725.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 44 | medium | New dependency `liquid-cache-datafusion-local = "0.1.12"` added with a vague description ('byte-level caching'). This is an early-version (0.1.x), relatively unknown crate. It appears in both workspace and jni Cargo.toml but no actual usage of its API is visible in the Rust source changes in this diff, making its purpose unclear. Warrants supply-chain verification of the crate's publisher and source. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | medium | `list_entries()` in the `FileStatisticsCache` implementation intentionally returns an empty HashMap with the comment 'used for introspection only'. This deliberately hides all cached entries from any monitoring, auditing, or introspection tooling that relies on this interface, which could obscure cache state from operators or security tooling. |
| plugins/engine-datafusion/src/test/java/org/opensearch/datafusion/DataFusionServiceTests.java | 87 | low | Import of `org.opensearch.vectorized.execution.search.spi.QueryResult` references a non-standard package path (`vectorized`) not typical of the OpenSearch core API. The origin and ownership of this package should be confirmed to rule out a shadowed or injected dependency. |

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 0 | Medium: 2 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
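The `list_entries()` finding above likely refers to a stub along these lines. This is a reconstruction from the analyzer's description, not the actual `statistics_cache.rs` code; the trait name and signature are assumptions.

```rust
use std::collections::HashMap;

// Assumed trait shape for illustration; the real interface in
// statistics_cache.rs is not shown in this PR excerpt.
trait CacheIntrospection {
    fn list_entries(&self) -> HashMap<String, usize>;
}

struct FileStatisticsCache;

impl CacheIntrospection for FileStatisticsCache {
    // The stub behavior the analyzer flagged: always report an empty map,
    // so introspection consumers never see the real cache contents.
    fn list_entries(&self) -> HashMap<String, usize> {
        HashMap::new()
    }
}

fn main() {
    let cache = FileStatisticsCache;
    assert!(cache.list_entries().is_empty());
    println!("entries visible: {}", cache.list_entries().len());
}
```

A stub like this is harmless as a placeholder, but any dashboard or audit hook wired to this interface would silently report an empty cache.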

@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit fbb15e0.

Path | Line | Severity | Description
plugins/engine-datafusion/Cargo.toml | 55 | medium | New third-party Rust crate 'liquid-cache-datafusion-local = "0.1.12"' added as a dependency. Early-version (0.1.x) crates from outside the established DataFusion/Arrow ecosystem warrant supply-chain vetting. The crate is inserted into the caching layer, giving it access to cached file metadata and statistics. Verify this crate's provenance on crates.io before merging.
plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | The FileStatisticsCache::list_entries() implementation unconditionally returns an empty HashMap, deliberately hiding all cache entries from any introspection or monitoring consumers. While the comment says 'introspection only', silently suppressing visibility into the cache state could mask unexpected data retention or eviction behavior.
plugins/engine-datafusion/jni/src/query_executor.rs | 124 | low | TableScopedPath is constructed with 'table: None', bypassing any table-level scoping that the DataFusion 52.x API intended to enforce. This could inadvertently (or intentionally) allow cross-table cache hits. Given the API migration context this is likely an adaptation placeholder, but should be confirmed against the upstream intent.

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2


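The `table: None` concern flagged for query_executor.rs can be illustrated with a toy key type. This is a hypothetical sketch of the scoping hazard only; the real `TableScopedPath` in DataFusion 52.x may differ in shape, and the field names here are assumptions.

```rust
// Hypothetical cache key: an optional table scope plus a relative path.
#[derive(Hash, PartialEq, Eq, Debug, Clone)]
struct TableScopedPath {
    table: Option<String>,
    path: String,
}

fn main() {
    // With `table: None`, files from two different tables that happen to
    // share a relative path collapse to the same cache key.
    let from_logs = TableScopedPath { table: None, path: "part-0.parquet".to_string() };
    let from_metrics = TableScopedPath { table: None, path: "part-0.parquet".to_string() };
    assert_eq!(from_logs, from_metrics); // cross-table cache hit is possible

    // Populating the table scope restores isolation between tables.
    let scoped = TableScopedPath { table: Some("logs".to_string()), path: "part-0.parquet".to_string() };
    assert_ne!(scoped, from_metrics);
}
```

This is why the analyzer suggests confirming against upstream intent: if the 52.x API added table scoping deliberately, passing `None` quietly opts out of it.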

…ojection

Signed-off-by: Tanvir Alam <tanvralm@amazon.com>
@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from fbb15e0 to d955609 on March 13, 2026 at 11:12
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit d955609.

Path | Line | Severity | Description
plugins/engine-datafusion/Cargo.toml | 55 | medium | New dependency 'liquid-cache-datafusion-local = "0.1.12"' is added to the workspace and pulled into jni/Cargo.toml, but no actual usage (no 'use liquid_cache' or API calls) appears anywhere in the diff. Adding an obscure, low-version (0.1.12) crate without visible consumption is a supply-chain risk: Rust build scripts execute at compile time and could run arbitrary code. Warrants verification that the crate's source and build.rs are benign before merging.
plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | The newly implemented FileStatisticsCache::list_entries() unconditionally returns an empty HashMap, suppressing all cache state from any monitoring or introspection path that calls this interface. While described as 'introspection only', silently hiding cache contents could impede auditing or anomaly detection. Likely a minimal stub implementation, but worth confirming no security tooling depends on this data.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



@bharath-techie bharath-techie merged commit 6e5699d into opensearch-project:feature/datafusion Mar 13, 2026
9 of 32 checks passed
Labels: None yet
Projects: None yet
3 participants