-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: Add preselection_threshold option to BinaryExpr
#19420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@acking-you @alamb Can you help review this PR? Thanks. |
kosiew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @cxzl25,
Thanks for your contribution.
The previously implemented short-circuit operation has a default ratio of 0.2, and this value cannot be modified externally. For this specific case, a larger value may need to be set.
The implemented feature is not yet user-accessible.
For example you can make it accessible via ExecutionOptions:
Add to ExecutionOptions
In datafusion/common/src/config.rs, add to ExecutionOptions:
pub struct ExecutionOptions {
// ... existing fields ...
pub binary_expr_preselection_threshold: f32, default = 0.2
}Then thread this through physical planning:
// In physical_planner.rs, when creating binary expressions
let threshold = self.execution_props.binary_expr_preselection_threshold;
let binary_expr = BinaryExpr::new(left, op, right)
.with_preselection_threshold(threshold);|
@kosiew Thanks for your comments and code. When implementing this PR, I also considered adding a configuration to ExecutionOptions. The reason for choosing to add an option in BinaryExpr is that different binary_expr_preselection_threshold values can be passed for different BinaryExpr instances.
|
Yes, as adding the option fulfills the objective of your PR |
Sorry for the late reply—I'm really curious to know what this specific case is about? For example, I didn't quite understand the examples you gave below. and why does setting a larger value fix the issue? Because I understand that even if the value is increased, it might still not meet the condition you mentioned: "should only be called after the left-side condition is satisfied" create table tmp_xml ( id int ,c1 string) stored as parquet ;
insert overwrite table tmp_xml values (1,'a'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(2,'<data><k1>v1</k1></data>'),(3,'<daa>'),(4,'>'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(5,'<'),(5,'<');
select * , case when id=2 and xpath_string(c1,'/data/k1')= 'v1' then 'xml' else 'no_xml' end as extracted_value from tmp_xml; |
|
Is this waiting for anything else before we merge it? We made a branch-52 now, so merging to main means it will be included in 53.0.0 (or we can backport to branch-52) |
In this case it might be possible to set it to 1 so that it always short-circuits the operation. |
I'm trying to add the configuration in ExecutionOptions. However, there seems to be no way to read the configuration items of impl PhysicalExpr for BinaryExpr {
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
use arrow::compute::kernels::numeric::*;
// Evaluate left-hand side expression.
let lhs = self.left.evaluate(batch)?;
// Check if we can apply short-circuit evaluation.
match check_short_circuit(&lhs, &self.op, self.preselection_threshold) { |
Which issue does this PR close?
Rationale for this change
I encountered a short-circuit evaluation issue with
BinaryExprwhen using DataFusion. The operator on the right side may be a function, which should only be called after the left-side condition is satisfied; otherwise, executing the function may result in errors (such as illegal content exceptions).The previously implemented short-circuit operation has a default ratio of 0.2, and this value cannot be modified externally. For this specific case, a larger value may need to be set.
apache/auron#1763
What changes are included in this PR?
Supports setting the
preselection_thresholdforBinaryExprAre these changes tested?
Are there any user-facing changes?