
Upgrade DataFusion to 52.1.0 and Add Liquid Cache Support #20740

Merged
bharath-techie merged 4 commits into opensearch-project:feature/datafusion from cocosz:feature/datafusion-52-liquid-cache
Mar 13, 2026

Conversation


@cocosz cocosz commented Feb 27, 2026

Summary

This PR upgrades the DataFusion engine from version 51.0.0 to 52.1.0 and integrates the liquid-cache-datafusion-local dependency for enhanced caching capabilities.

Changes

Dependency Updates

  • DataFusion Core: 51.0.0 → 52.1.0
  • DataFusion Expression: 51.0.0 → 52.1.0
  • DataFusion DataSource: 51.0.0 → 52.1.0
  • DataFusion Substrait: 51.0.0 → 52.1.0
  • Arrow Libraries: 57.1.0 → 57.3.0
  • Parquet: 57.1.0 → 57.3.0
  • Object Store: 0.12.4 → 0.12.5
  • New: Added liquid-cache-datafusion-local = "0.1.12"

@cocosz cocosz requested a review from a team as a code owner on February 27, 2026 08:30
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 776c709.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added. This is a relatively obscure crate at a specific version that integrates deeply into the query execution path and writes query data to disk. While it appears to be a legitimate open-source caching library, it warrants vetting of its crate origin, ownership, and code for any unexpected network calls or data handling before merging into a production search engine. |
| plugins/engine-datafusion/jni/src/liquid_cache_runtime.rs | 22 | low | Hardcoded cache directory path `/var/lib/opensearch/liquid_cache` is used unconditionally, without any configurable override. Query data (Parquet file contents) will be persisted to this path. If filesystem permissions on this directory are overly permissive, or if the node is shared, cached sensitive query data could be read by other local processes. This is an information disclosure risk rather than malicious intent, but the fixed path with no configuration option is an anomaly. |

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1


Pull Request Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass the diff analyzer by adding the label `skip-diff-analyzer` after reviewing the changes carefully, then re-running the failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
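The hardcoded cache directory the analyzer flags could be derived from configuration with a fallback. A minimal sketch, assuming a hypothetical setting value passed in from OpenSearch; this helper is illustrative, not part of the PR:

```rust
use std::path::PathBuf;

// Hypothetical helper: prefer a configured directory, fall back to the
// path the PR currently hardcodes. The plumbing that reads the setting
// from OpenSearch configuration is assumed, not shown.
fn liquid_cache_dir(configured: Option<&str>) -> PathBuf {
    configured
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("/var/lib/opensearch/liquid_cache"))
}

fn main() {
    // Unset: keep today's default path.
    assert_eq!(
        liquid_cache_dir(None),
        PathBuf::from("/var/lib/opensearch/liquid_cache")
    );
    // Configured: the operator-chosen path wins.
    assert_eq!(liquid_cache_dir(Some("/tmp/liquid")), PathBuf::from("/tmp/liquid"));
}
```

Keeping the current path as the default would preserve behavior for existing deployments while addressing the analyzer's configurability concern.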

@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from 776c709 to ab6614a on February 27, 2026 08:32
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ab6614a.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New third-party crate `liquid-cache-datafusion-local = "0.1.12"` introduced with a fixed minor version. This is a relatively obscure crate (version 0.1.x indicates early-stage). Supply chain risk: the crate's behavior at runtime (file I/O, object store registration) is difficult to audit without reviewing the crate source. The crate is given access to the DataFusion RuntimeEnv and SessionContext, making it capable of intercepting data reads. |
| plugins/engine-datafusion/jni/src/liquid_cache_runtime.rs | 22 | low | Hardcoded filesystem path `/var/lib/opensearch/liquid_cache` is used for cache storage. While plausible for a caching feature, hardcoding a specific system path (rather than deriving it from OpenSearch configuration) bypasses any path sanitization or permission controls that the application normally enforces. |
| plugins/engine-datafusion/jni/src/liquid_cache_runtime.rs | 18 | low | LiquidCacheRef is typed as 'Arc', erasing the concrete type of the cache object. This makes static analysis of what the cache reference holds impossible, and obscures the actual runtime behavior of the stored liquid_cache object returned by the external crate. |


Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2



```rust
    liquid_cache: LiquidCacheRef,
}

static LIQUID_ONLY: OnceLock<Result<LiquidOnlyRuntime, String>> = OnceLock::new();
```
Contributor

Why are we doing it here? Can't it be done in global runtime?
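For context on the reviewer's question, the `OnceLock` pattern quoted above guarantees one-time initialization regardless of which call site triggers it first. A self-contained sketch; the `LiquidOnlyRuntime` body here is a stand-in, not the PR's implementation:

```rust
use std::sync::OnceLock;

// Stand-in for the PR's LiquidOnlyRuntime; only the init shape matters here.
struct LiquidOnlyRuntime {
    max_cache_bytes: usize,
}

impl LiquidOnlyRuntime {
    fn init(max_cache_bytes: usize) -> Result<Self, String> {
        if max_cache_bytes == 0 {
            return Err("cache size must be non-zero".to_string());
        }
        Ok(Self { max_cache_bytes })
    }
}

// get_or_init runs the closure at most once, even under concurrent callers,
// so the runtime is built lazily on first use.
static LIQUID_ONLY: OnceLock<Result<LiquidOnlyRuntime, String>> = OnceLock::new();

fn liquid_runtime() -> &'static Result<LiquidOnlyRuntime, String> {
    LIQUID_ONLY.get_or_init(|| LiquidOnlyRuntime::init(1024 * 1024 * 1024))
}

fn main() {
    let rt = liquid_runtime();
    assert!(rt.is_ok());
    assert_eq!(rt.as_ref().unwrap().max_cache_bytes, 1024 * 1024 * 1024);
}
```

Moving the initialization into the global runtime, as suggested, would mainly change where `get_or_init` is first triggered, not the once-only semantics.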

```rust
config = config.with_files_statistics_cache(Some(default_stats));
}
// Add statistics cache if available - use default since CustomStatisticsCache doesn't implement FileStatisticsCache trait
let default_stats = Arc::new(DefaultFileStatisticsCache::default());
```
Contributor

This defeats the purpose of our custom statistics cache

```rust
config.options_mut().execution.parquet.pushdown_filters = false;
config.options_mut().execution.target_partitions = target_partitions;
config.options_mut().execution.batch_size = 8192;
.with_metadata_cache_limit(250 * 1024 * 1024)
```
Contributor

This should be configurable via settings

```rust
info!("[LiquidCache] Creating Parquet access plans for {} row IDs across {} files",
    row_ids.len(), files_metadata.len());
let access_plans = create_access_plans(row_ids, files_metadata.clone()).await?;
info!("[LiquidCache] ✓ Access plans created, Liquid Cache will optimize data access");
```
Contributor

Kindly trim unnecessary logs throughout this PR and keep only the essential debug logs.

```rust
.build().unwrap();

log_info!("[LiquidCache] Initializing global Liquid Cache (1GB max)");
let liquid_runtime = match LiquidOnlyRuntime::init(1024 * 1024 * 1024) {
```
Contributor

This should be configurable via settings.
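Making the 1 GB limit setting-driven, as requested, mostly reduces to parsing a configured value with a safe default. A sketch with a hypothetical parser; the setting name and plumbing are assumptions, not from the PR:

```rust
// Hypothetical: parse a configured cache size, falling back to the
// currently hardcoded 1 GiB when the setting is absent or malformed.
fn liquid_cache_max_bytes(setting: Option<&str>) -> u64 {
    const DEFAULT: u64 = 1024 * 1024 * 1024; // today's hardcoded 1 GiB
    setting
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(DEFAULT)
}

fn main() {
    assert_eq!(liquid_cache_max_bytes(None), 1024 * 1024 * 1024);
    assert_eq!(liquid_cache_max_bytes(Some("268435456")), 268_435_456);
    // Malformed input falls back rather than failing node startup.
    assert_eq!(liquid_cache_max_bytes(Some("not-a-size")), 1024 * 1024 * 1024);
}
```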

@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from ab6614a to 13e10d5 on March 2, 2026 14:38
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 13e10d5.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added without apparent prior usage elsewhere. Adding a new external crate is a potential supply chain attack vector; the crate's provenance and trustworthiness should be verified against crates.io ownership and audit history before merging. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | `FileStatisticsCache::list_entries` is implemented to always return an empty HashMap with the comment that it is 'used for introspection only'. While plausibly benign, this silently suppresses cache visibility for any monitoring or auditing tooling that relies on this interface, which could obscure cache state from operators. |


Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



@cocosz cocosz changed the title from "Integrate Liquid Cache with DataFusion 52.1 for byte-level Parquet ca…" to "Upgrade DataFusion to 52.1.0 and Add Liquid Cache Support" on Mar 2, 2026
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 17d5398.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New external dependency `liquid-cache-datafusion-local = "0.1.12"` added with only a vague comment ('byte-level caching'). This package is relatively obscure and does not appear to be referenced in any of the code changes shown in this diff. Adding an unused or minimally documented dependency warrants verification of the package's provenance, ownership, and published source code before merging. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | The FileStatisticsCache::list_entries() implementation unconditionally returns an empty HashMap, suppressing all cache introspection data. The comment claims this is 'for introspection only', but silently hiding cache state could mask unexpected behavior or make auditing harder. This is likely a stub implementation for a new trait method, but should be confirmed as intentional. |


Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from 17d5398 to ab300b4 on March 2, 2026 15:40
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ab300b4.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/jni/Cargo.toml | 64 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added with no visible usage anywhere in the diff. An obscure crate at version 0.1.x being pulled in without any corresponding code changes is anomalous and warrants verification of the crate's provenance, publisher identity, and whether it was actually intended for this PR. |
| plugins/engine-datafusion/Cargo.toml | 45 | low | object_store version constraint changed from an exact pin '=0.12.4' to an unpinned '0.12.5'. The original exact pin was likely intentional for reproducible/audited builds. Loosening this allows future patch bumps without explicit review, slightly increasing supply chain exposure. |


Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



…pendency

Signed-off-by: Tanvir Alam <tanvralm@amazon.com>
@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from ab300b4 to 76ff645 on March 2, 2026 15:47
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 76ff645.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 54 | low | New external dependency `liquid-cache-datafusion-local = "0.1.12"` added. While the comment describes it as a byte-level caching library and it corresponds to a real open-source project, any new third-party dependency warrants supply chain verification to confirm the crate version and publisher are trustworthy. |


Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1



@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Upgrade DataFusion, Arrow, Parquet, and Object Store dependencies to latest versions

Relevant files:

  • plugins/engine-datafusion/Cargo.toml
  • plugins/engine-datafusion/jni/Cargo.toml

Sub-PR theme: Adapt source code to DataFusion 52 API changes and integrate Liquid Cache

Relevant files:

  • plugins/engine-datafusion/jni/src/absolute_row_id_optimizer.rs
  • plugins/engine-datafusion/jni/src/listing_table.rs
  • plugins/engine-datafusion/jni/src/query_executor.rs
  • plugins/engine-datafusion/jni/src/cache.rs
  • plugins/engine-datafusion/jni/src/custom_cache_manager.rs
  • plugins/engine-datafusion/jni/src/statistics_cache.rs

⚡ Recommended focus areas for review

Error Handling

projected_schema() now uses .expect("projected_schema failed") which will panic on failure. The previous code used .clone() on a direct field access. Consider propagating the error instead of panicking, especially in a production optimizer path.

```rust
let projected_schema = datasource.projected_schema().expect("projected_schema failed");
```
Hardcoded Source

In create_datasource_projection, a new ParquetSource is always created unconditionally, discarding any existing file source configuration (e.g., custom options, pushdowns) from the original datasource. This may silently drop important source-level settings when the optimizer rewrites the plan.

```rust
use datafusion::datasource::physical_plan::ParquetSource;
let new_file_source = Arc::new(ParquetSource::new(new_table_schema));

let file_scan_config = FileScanConfigBuilder::from(datasource.clone())
    .with_source(new_file_source)
    .with_projection_indices(Some(new_projections))
    .expect("Failed to set projection indices")
    .build();
```
Null Table Scope

TableScopedPath is constructed with table: None in both execute_query_with_cross_rt_stream and execute_fetch_phase. If the cache lookup logic in DataFusion 52 uses the table field for scoping/isolation, setting it to None may cause cache collisions between different tables sharing the same path prefix.

```rust
let table_scoped_path = datafusion::execution::cache::TableScopedPath {
    table: None,
    path: table_path.prefix().clone(),
};
list_file_cache.put(&table_scoped_path, object_meta);
```
Empty Implementation

The FileStatisticsCache::list_entries implementation always returns an empty HashMap. If DataFusion or any tooling relies on this for cache warming, eviction decisions, or diagnostics, this stub will silently produce incorrect results. The comment says "introspection only" but this should be validated.

```rust
impl datafusion::execution::cache::cache_manager::FileStatisticsCache for CustomStatisticsCache {
    fn list_entries(&self) -> std::collections::HashMap<object_store::path::Path, datafusion::execution::cache::cache_manager::FileStatisticsCacheEntry> {
        // Return empty map — this is used for introspection only
        std::collections::HashMap::new()
    }
}
```
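If the custom cache keeps its entries in an inner map, `list_entries` could surface a snapshot of them instead of an empty map. A sketch with illustrative field and entry types, since the real `CustomStatisticsCache` internals are not shown in this thread:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative stand-in: the real cache stores DataFusion statistics
// entries; a String -> u64 map keeps this sketch self-contained.
struct CustomStatisticsCache {
    inner: Mutex<HashMap<String, u64>>,
}

impl CustomStatisticsCache {
    fn put(&self, path: &str, row_count: u64) {
        self.inner.lock().unwrap().insert(path.to_string(), row_count);
    }

    // Snapshot the inner map so introspection tooling sees real entries
    // without callers holding the lock.
    fn list_entries(&self) -> HashMap<String, u64> {
        self.inner.lock().unwrap().clone()
    }
}

fn main() {
    let cache = CustomStatisticsCache { inner: Mutex::new(HashMap::new()) };
    cache.put("generation-1.parquet", 1024);
    let entries = cache.list_entries();
    assert_eq!(entries.len(), 1);
    assert_eq!(entries["generation-1.parquet"], 1024);
}
```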
Partition Column Type

In create_file_source_with_schema_adapter, partition columns are constructed with nullable = false. Previously, partition columns may have had different nullability. If any partition column can be null (e.g., missing partition value), this hardcoded false could cause schema mismatches or incorrect query results.

```rust
        .map(|(name, dt)| Arc::new(Field::new(name, dt.clone(), false)) as _)
        .collect(),
);
```

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Propagate error instead of panicking

Using .expect() here will cause a panic if projected_schema() returns an error,
which could crash the query executor in production. This should propagate the error
using ? instead, since the enclosing function likely returns a Result.

plugins/engine-datafusion/jni/src/absolute_row_id_optimizer.rs [44]

```diff
-let projected_schema = datasource.projected_schema().expect("projected_schema failed");
+let projected_schema = datasource.projected_schema()?;
```
Suggestion importance[1-10]: 5


Why: Using .expect() instead of ? for error propagation is a valid concern for production robustness. However, the enclosing function build_updated_file_source_schema returns (SchemaRef, Vec<usize>) not a Result, so ? cannot be used directly without changing the function signature.

Impact: Low
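The note above points out that `?` only works once the enclosing function itself returns a `Result`. A simplified sketch of that signature change; the types are stand-ins for the optimizer's real ones:

```rust
// Stand-in for DataFusion's SchemaRef in this sketch.
#[derive(Debug)]
struct SchemaRef(Vec<String>);

fn projected_schema(available: bool) -> Result<SchemaRef, String> {
    if available {
        Ok(SchemaRef(vec!["a".into(), "b".into()]))
    } else {
        Err("projected_schema failed".into())
    }
}

// Before: -> (SchemaRef, Vec<usize>) forced `.expect()`, panicking on error.
// After: a Result return type lets `?` propagate the failure to the caller.
fn build_updated_file_source_schema(available: bool) -> Result<(SchemaRef, Vec<usize>), String> {
    let schema = projected_schema(available)?;
    let projections: Vec<usize> = (0..schema.0.len()).collect();
    Ok((schema, projections))
}

fn main() {
    assert!(build_updated_file_source_schema(true).is_ok());
    // The error path returns Err instead of crashing the query executor.
    assert!(build_updated_file_source_schema(false).is_err());
}
```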
Verify partition column index alignment after schema change

The ROW_BASE_FIELD_NAME field is being added to the TableSchema as a partition
column, but then with_table_partition_cols was removed from the builder. The
partition column must be declared consistently — if it's embedded in the
TableSchema, ensure the projection index calculation (parquet_schema.fields.len() as
the partition column index) still correctly references it, since the schema length
may not align with the partition column position after this change.

plugins/engine-datafusion/jni/src/query_executor.rs [426-430]

```diff
 let table_schema = datafusion_datasource::table_schema::TableSchema::new(
     parquet_schema.clone(),
     vec![Arc::new(Field::new(ROW_BASE_FIELD_NAME, DataType::Int64, false))],
 );
 let file_source = Arc::new(ParquetSource::new(table_schema));
+// Verify: partition column index = parquet_schema.fields.len() (0-based index after all file fields)
+// This must match the TableSchema partition column position
```
Suggestion importance[1-10]: 3


Why: The suggestion raises a valid concern about whether parquet_schema.fields.len() correctly indexes the partition column after the schema change, but the 'improved_code' only adds a comment without actually fixing anything. The existing code appears correct since TableSchema::new takes file schema + partition cols separately, and the partition column index would still be at parquet_schema.fields.len().

Impact: Low
Fix early return breaking builder chain

The ? operator is applied to with_projection_indices(...) mid-chain, which breaks
the builder pattern — the ? will short-circuit and return early before
.with_limit(), .with_output_ordering(), and .with_expr_adapter() are called. The
result of with_projection_indices should be unwrapped before continuing the chain,
or the builder should be split into separate statements.

plugins/engine-datafusion/jni/src/listing_table.rs [1325-1329]

```diff
-.with_projection_indices(projection.cloned())?
-.with_limit(limit)
-.with_output_ordering(output_ordering)
-.with_expr_adapter(self.expr_adapter_factory.clone())
-.build(),
+let builder = FileScanConfigBuilder::new(object_store_url, file_source)
+    .with_file_groups(partitioned_file_lists)
+    .with_constraints(self.constraints.clone())
+    .with_statistics(statistics)
+    .with_projection_indices(projection.cloned())?
+    .with_limit(limit)
+    .with_output_ordering(output_ordering)
+    .with_expr_adapter(self.expr_adapter_factory.clone())
+    .build();
+self.options.format.create_physical_plan(state, builder).await
```
Suggestion importance[1-10]: 2


Why: The concern about ? breaking the builder chain is theoretically valid in some languages, but in Rust, ? on a method in a chain works correctly — it propagates the error from with_projection_indices and the remaining chain methods are only called if it returns Ok. The 'improved_code' restructures the code but doesn't actually fix a real bug, and the suggested refactoring duplicates builder calls already present in the surrounding context.

Impact: Low
General
Return actual cache entries instead of empty map

Returning an empty map from list_entries means the cache contents are never visible
for introspection or debugging, and any tooling that relies on this method to
enumerate cached entries (e.g., for cache invalidation or monitoring) will silently
see no entries even when the cache is populated. Consider iterating over the actual
inner cache entries to return a correct map.

plugins/engine-datafusion/jni/src/statistics_cache.rs [501-504]

```diff
 fn list_entries(&self) -> std::collections::HashMap<object_store::path::Path, datafusion::execution::cache::cache_manager::FileStatisticsCacheEntry> {
-    // Return empty map — this is used for introspection only
-    std::collections::HashMap::new()
+    self.inner_cache
+        .list_entries()
 }
```
Suggestion importance[1-10]: 4


Why: Returning an empty map from list_entries is a functional gap that could affect cache introspection and monitoring. However, the inner_cache type may not implement list_entries() directly, making the suggested fix potentially incorrect without knowing the inner cache's API.

Impact: Low

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

❌ Gradle check result for 76ff645: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ba9d9c1.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 55 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added with no prior workspace usage. This crate is not a well-known ecosystem package and should be verified for authenticity and provenance before inclusion — potential supply chain risk if the crate name was squatted or typosquatted. |
| plugins/engine-datafusion/src/test/resources/data/index-7/0/parquet/generation-1.parquet | 1 | low | Binary Parquet files added as test resources cannot be statically reviewed for embedded payloads or exfiltration triggers. In context they appear to be legitimate test data, but binary test fixtures should be generated programmatically or have their contents attested. |


Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from ba9d9c1 to 698f7c3 on March 5, 2026 01:00
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 698f7c3.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 53 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added at workspace level. Version 0.1.12 is early-stage and the library has not appeared in prior dependencies. Any new workspace-level crate is a potential supply chain vector and should be audited against its published crate source before merging. |
| plugins/engine-datafusion/src/test/resources/data/index-7/0/parquet/generation-1.parquet | 1 | low | Binary Parquet test file added. Binary blobs cannot be reviewed inline for embedded payloads or malicious content. Content should be verified to contain only expected schema/row data consistent with the declared test scenario. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | FileStatisticsCache::list_entries() is implemented to always return an empty HashMap with comment 'used for introspection only'. If this interface is used by monitoring or auditing subsystems, silently returning empty data could suppress visibility into cache state, though no direct malicious path is evident in context. |


Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2



Signed-off-by: Tanvir Alam <tanvralm@amazon.com>
@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from 698f7c3 to 579060c on March 5, 2026 09:24
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 579060c.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 54 | medium | New third-party dependency `liquid-cache-datafusion-local = "0.1.12"` added as a workspace dependency and pulled into the JNI layer. The package is at a very early version (0.1.x), is not a well-known official DataFusion or Apache project, and no explicit usage of it appears in any Rust source file in this diff. An added-but-not-visibly-used dependency is a classic supply chain staging pattern. The crate should be verified as coming from a trusted, audited source before merging. |
| plugins/engine-datafusion/src/test/resources/data/index-7/0/parquet/generation-1.parquet | 1 | low | Two new binary Parquet test files are introduced. While the context (test data for query/fetch phase tests) is plausible, binary blobs committed to source control cannot be inspected in a standard diff review and are a potential vector for embedding hidden payloads. Both files should be validated to contain only expected schema/row data with no embedded executable content. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 497 | low | The new FileStatisticsCache::list_entries() implementation unconditionally returns an empty HashMap, explicitly suppressing cache introspection for the custom statistics cache. The comment acknowledges this is used for introspection/monitoring. Silencing observability hooks can obscure cache state from audit or monitoring tooling, though a legitimate reason (avoiding double-counting or unsupported operation) may exist. |


Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2



```rust
let list_file_cache = Arc::new(DefaultListFilesCache::default());
list_file_cache.put(table_path.prefix(), object_meta);
let table_scoped_path = datafusion::execution::cache::TableScopedPath {
    table: None,
```
Contributor

Do we need to define the table here as well?

Author

`TableScopedPath` is required by DataFusion's cache API signature: the `put()` method expects this struct type, not a raw path.
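The scoping concern is easiest to see with a toy key type: when `table` is `None`, two different tables sharing a path prefix produce identical cache keys. The struct below is an illustrative stand-in, not DataFusion's actual `TableScopedPath`:

```rust
// Illustrative stand-in for the cache key; DataFusion's real type differs.
#[derive(Hash, PartialEq, Eq, Debug)]
struct TableScopedPath {
    table: Option<String>,
    path: String,
}

fn main() {
    let a = TableScopedPath { table: None, path: "data/index-7".into() };
    let b = TableScopedPath { table: None, path: "data/index-7".into() };
    // Identical keys: a lookup for one table can hit the other's entry.
    assert_eq!(a, b);

    let scoped = TableScopedPath {
        table: Some("index_7".into()),
        path: "data/index-7".into(),
    };
    // Populating `table` disambiguates entries that share a path prefix.
    assert_ne!(a, scoped);
}
```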

```rust
let mut config = SessionConfig::new();
config.options_mut().execution.parquet.pushdown_filters = false;
config.options_mut().execution.target_partitions = target_partitions;
config.options_mut().execution.target_partitions = 4;
```
Contributor

Let's make sure we keep this the same?

```toml
url = { workspace = true }

# Liquid Cache for byte-level caching
liquid-cache-datafusion-local = "0.1.12"
```
Contributor

Do it similar to other packages

Contributor

Why these files?

Author

The /parquet/ subdirectory was created because the DatafusionEngine constructs the file path by appending the data format name to the base path. Looking at the earlier context, when using DataFormat.PARQUET, the engine expects files to be in a parquet/ subdirectory.

Contributor

Why can't we mock the path in the tests? I think we should already have some files.

Contributor

I think only the paths have changed - this should be okay, right?

Author

Yes, only the paths have changed; the files are now inside the parquet directory.

…tructure

Signed-off-by: Tanvir Alam <tanvralm@amazon.com>
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit c496725.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/engine-datafusion/Cargo.toml | 44 | medium | New dependency `liquid-cache-datafusion-local = "0.1.12"` added with a vague description ('byte-level caching'). This is an early-version (0.1.x), relatively unknown crate. It appears in both workspace and jni Cargo.toml but no actual usage of its API is visible in the Rust source changes in this diff, making its purpose unclear. Warrants supply-chain verification of the crate's publisher and source. |
| plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | medium | `list_entries()` in the `FileStatisticsCache` implementation intentionally returns an empty HashMap with the comment 'used for introspection only'. This deliberately hides all cached entries from any monitoring, auditing, or introspection tooling that relies on this interface, which could obscure cache state from operators or security tooling. |
| plugins/engine-datafusion/src/test/java/org/opensearch/datafusion/DataFusionServiceTests.java | 87 | low | Import of `org.opensearch.vectorized.execution.search.spi.QueryResult` references a non-standard package path (`vectorized`) not typical of the OpenSearch core API. The origin and ownership of this package should be confirmed to rule out a shadowed or injected dependency. |

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 0 | Medium: 2 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
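The `list_entries()` finding above likely refers to a stub along these lines. This is a reconstruction from the analyzer's description, not the actual `statistics_cache.rs` code; the trait name and signature are assumptions.

```rust
use std::collections::HashMap;

// Assumed trait shape for illustration; the real interface in
// statistics_cache.rs is not shown in this PR excerpt.
trait CacheIntrospection {
    fn list_entries(&self) -> HashMap<String, usize>;
}

struct FileStatisticsCache;

impl CacheIntrospection for FileStatisticsCache {
    // The stub behavior the analyzer flagged: always report an empty map,
    // so introspection consumers never see the real cache contents.
    fn list_entries(&self) -> HashMap<String, usize> {
        HashMap::new()
    }
}

fn main() {
    let cache = FileStatisticsCache;
    assert!(cache.list_entries().is_empty());
    println!("entries visible: {}", cache.list_entries().len());
}
```

A stub like this is harmless as a placeholder, but any dashboard or audit hook wired to this interface would silently report an empty cache.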

@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit fbb15e0.

Path | Line | Severity | Description
plugins/engine-datafusion/Cargo.toml | 55 | medium | New third-party Rust crate 'liquid-cache-datafusion-local = "0.1.12"' added as a dependency. Early-version (0.1.x) crates from outside the established DataFusion/Arrow ecosystem warrant supply-chain vetting. The crate is inserted into the caching layer, giving it access to cached file metadata and statistics. Verify this crate's provenance on crates.io before merging.
plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | The FileStatisticsCache::list_entries() implementation unconditionally returns an empty HashMap, deliberately hiding all cache entries from any introspection or monitoring consumers. While the comment says 'introspection only', silently suppressing visibility into the cache state could mask unexpected data retention or eviction behavior.
plugins/engine-datafusion/jni/src/query_executor.rs | 124 | low | TableScopedPath is constructed with 'table: None', bypassing any table-level scoping that the DataFusion 52.x API intended to enforce. This could inadvertently (or intentionally) allow cross-table cache hits. Given the API migration context this is likely an adaptation placeholder, but should be confirmed against the upstream intent.

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2


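The `table: None` concern flagged for query_executor.rs can be illustrated with a toy key type. This is a hypothetical sketch of the scoping hazard only; the real `TableScopedPath` in DataFusion 52.x may differ in shape, and the field names here are assumptions.

```rust
// Hypothetical cache key: an optional table scope plus a relative path.
#[derive(Hash, PartialEq, Eq, Debug, Clone)]
struct TableScopedPath {
    table: Option<String>,
    path: String,
}

fn main() {
    // With `table: None`, files from two different tables that happen to
    // share a relative path collapse to the same cache key.
    let from_logs = TableScopedPath { table: None, path: "part-0.parquet".to_string() };
    let from_metrics = TableScopedPath { table: None, path: "part-0.parquet".to_string() };
    assert_eq!(from_logs, from_metrics); // cross-table cache hit is possible

    // Populating the table scope restores isolation between tables.
    let scoped = TableScopedPath { table: Some("logs".to_string()), path: "part-0.parquet".to_string() };
    assert_ne!(scoped, from_metrics);
}
```

This is why the analyzer suggests confirming against upstream intent: if the 52.x API added table scoping deliberately, passing `None` quietly opts out of it.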

…ojection

Signed-off-by: Tanvir Alam <tanvralm@amazon.com>
@cocosz cocosz force-pushed the feature/datafusion-52-liquid-cache branch from fbb15e0 to d955609 on March 13, 2026 at 11:12
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit d955609.

Path | Line | Severity | Description
plugins/engine-datafusion/Cargo.toml | 55 | medium | New dependency 'liquid-cache-datafusion-local = "0.1.12"' is added to the workspace and pulled into jni/Cargo.toml, but no actual usage (no 'use liquid_cache' or API calls) appears anywhere in the diff. Adding an obscure, low-version (0.1.12) crate without visible consumption is a supply-chain risk: Rust build scripts execute at compile time and could run arbitrary code. Warrants verification that the crate's source and build.rs are benign before merging.
plugins/engine-datafusion/jni/src/statistics_cache.rs | 500 | low | The newly implemented FileStatisticsCache::list_entries() unconditionally returns an empty HashMap, suppressing all cache state from any monitoring or introspection path that calls this interface. While described as 'introspection only', silently hiding cache contents could impede auditing or anomaly detection. Likely a minimal stub implementation, but worth confirming no security tooling depends on this data.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1



@bharath-techie bharath-techie merged commit 6e5699d into opensearch-project:feature/datafusion Mar 13, 2026
9 of 32 checks passed
Labels: None yet
Projects: None yet
3 participants