Skip to content

Conversation

@cxzl25
Copy link

@cxzl25 cxzl25 commented Dec 20, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

I encountered a short-circuit evaluation issue with BinaryExpr when using DataFusion. The operator on the right side may be a function, which should only be called after the left-side condition is satisfied; otherwise, executing the function may result in errors (such as illegal content exceptions).
The previously implemented short-circuit operation has a default ratio of 0.2, and this value cannot be modified externally. For this specific case, a larger value may need to be set.

apache/auron#1763

What changes are included in this PR?

Supports setting the preselection_threshold for BinaryExpr

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 20, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Dec 20, 2025
@cxzl25
Copy link
Author

cxzl25 commented Dec 22, 2025

@acking-you @alamb Can you help review this PR? Thanks.

Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @cxzl25,

Thanks for your contribution.

The previously implemented short-circuit operation has a default ratio of 0.2, and this value cannot be modified externally. For this specific case, a larger value may need to be set.

The implemented feature is not yet user-accessible.

For example you can make it accessible via ExecutionOptions:

Add to ExecutionOptions

In datafusion/common/src/config.rs, add to ExecutionOptions:

pub struct ExecutionOptions {
    // ... existing fields ...
    
    pub binary_expr_preselection_threshold: f32, default = 0.2
}

Then thread this through physical planning:

// In physical_planner.rs, when creating binary expressions
let threshold = self.execution_props.binary_expr_preselection_threshold;
    
let binary_expr = BinaryExpr::new(left, op, right)
    .with_preselection_threshold(threshold);

@cxzl25
Copy link
Author

cxzl25 commented Dec 26, 2025

@kosiew Thanks for your comments and code. When implementing this PR, I also considered adding a configuration to ExecutionOptions. The reason for choosing to add an option in BinaryExpr is that different binary_expr_preselection_threshold values can be passed for different BinaryExpr instances.
However, I still think adding a configuration in ExecutionOptions is also a good idea.
Can we implement both of these things in this PR?

For example you can make it accessible via ExecutionOptions:

@kosiew
Copy link
Contributor

kosiew commented Dec 28, 2025

@cxzl25

adding a configuration in ExecutionOptions is also a good idea.
Can we implement both of these things in this PR?

Yes, as adding the option fulfills the objective of your PR this value cannot be modified externally.

@acking-you
Copy link
Contributor

acking-you commented Jan 6, 2026

@cxzl25

I encountered a short-circuit evaluation issue with BinaryExpr when using DataFusion. The operator on the right side may be a function, which should only be called after the left-side condition is satisfied; otherwise, executing the function may result in errors (such as illegal content exceptions).

Sorry for the late reply—I'm really curious to know what this specific case is about?

For example, I didn't quite understand the examples you gave below. and why does setting a larger value fix the issue? Because I understand that even if the value is increased, it might still not meet the condition you mentioned: "should only be called after the left-side condition is satisfied"

create table tmp_xml ( id int ,c1 string) stored as parquet ;

insert overwrite table tmp_xml values (1,'a'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(2,'<data><k1>v1</k1></data>'),(3,'<daa>'),(4,'>'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(5,'<');

select * , case when id=2 and xpath_string(c1,'/data/k1')= 'v1' then 'xml' else 'no_xml' end as extracted_value from tmp_xml;

@alamb
Copy link
Contributor

alamb commented Jan 6, 2026

Is this waiting for anything else before we merge it? We made a branch-52 now, so merging to main means it will be included in 53.0.0 (or we can backport to branch-52)

@cxzl25
Copy link
Author

cxzl25 commented Jan 8, 2026

Because I understand that even if the value is increased, it might still not meet the condition you mentioned

In this case it might be possible to set it to 1 so that it always short-circuits the operation.

@cxzl25
Copy link
Author

cxzl25 commented Jan 8, 2026

Is this waiting for anything else before we merge it?

I'm trying to add the configuration in ExecutionOptions.

However, there seems to be no way to read the configuration items of ExecutionOptions when PhysicalExpr evaluate.

impl PhysicalExpr for BinaryExpr {

    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        use arrow::compute::kernels::numeric::*;

        // Evaluate left-hand side expression.
        let lhs = self.left.evaluate(batch)?;

        // Check if we can apply short-circuit evaluation.
        match check_short_circuit(&lhs, &self.op, self.preselection_threshold) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants