-
Notifications
You must be signed in to change notification settings - Fork 87
Add extended filter pruning optimizations #638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add support for additional filter types in Iceberg file pruning: - CONJUNCTION_OR: Prune files when ALL child filters would prune - COMPARE_NOTEQUAL (!=): Prune when file contains only the excluded value - COMPARE_BETWEEN: Prune when file range doesn't overlap query range - BOUND_FUNCTION (prefix/starts_with): Prune based on string prefix bounds - COMPARE_NOT_IN: Prune when file's single value is in exclusion list All optimizations follow the existing conservative approach - unknown or unhandled cases return true (don't prune) to ensure correctness. Includes test coverage in extended_filter_pruning.test validating both correctness and pruning behavior.
Adds support for filter pruning on expression patterns that DuckDB's FilterCombiner drops, including: - Complex nested OR with AND children: (col >= 0 AND col < 500) OR (col >= 4500) - NOT BETWEEN expressions - NOT (expr) logical negation - ends_with/suffix string functions Implementation: - Add MatchBoundsFromExpression() in IcebergPredicate for direct Expression evaluation - Add ShouldEvaluateDirectly() to detect complex patterns FilterCombiner drops - Store complex expressions in IcebergMultiFileList::complex_filters - Evaluate complex filters alongside TableFilterSet in FileMatchesFilter()
Signed-off-by: Achille Roussel <[email protected]>
|
Hi @achille-roussel, thanks for the PR! I will take a look soon |
|
Hello @Tmonster, I have a follow up change with more optimizations that I would like to submit but didn't want to mix in this one: I'll submit it after we got this one merged in. |
|
Hello @Tmonster Would you be able to take a look at this change in the coming week? |
|
Hi @achille-roussel, Yes, I will try to take a look sometime this week 👍 |
|
@achille-roussel I have encountered a bug where the filter prunning doesn't apply when filtering on the same table in two different CTEs (#610). any idea if this PR will resolve it? |
|
Hello @dor-bernstein, I don't know if this will address it, would you be able to test with a built of the iceberg extension that includes this change? |
|
@achille-roussel yes |
Tmonster
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I have looked at this and stepped through it myself, and am a bit confused with what the original problem is, and how this PR is trying to fix it.
One thing I do I agree on, is that we can add a check for TableFilterType::CONJUNCTION_OR on line 158 of src/iceberg_predicate.cpp.
The rest though I'm unsure about. With just the TableFilterType::CONJUNCTION_OR addition, you test file passes all tests. If you add tests for suffix/starts_with functions, then I think some tests will start failing, and we can look into that later.
Simple BETWEEN expressions are also handled with the present logic currently, the filter_combiner transforms them into AND(greater_than_or_equal_to, less_than_or_equal_to) table filter expressions, so the addition on line 222 of iceberg_predicate either isn't necessary, or it needs another test case that I'm not thinking of
I agree we should try to do more work to prune TableFilterType::EXPRESSION_FILTER:s, but I don't think we need an addition of complex_filters to do this, most filters can be generated as table_filters now.
Also, I have implemented writing upper and lower bounds to manifest files, so you can write more end-to-end tests in your PR. See https://github.com/duckdb/duckdb-iceberg/blob/main/test/sql/local/irc/insert/test_write_upper_and_lower_bounds.test which tests writing of upper and lower bounds to see what is supported
| } | ||
| case ExpressionType::OPERATOR_NOT: | ||
| //! NOT expressions wrapping prunable expressions | ||
| return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we say we should evaluate them, but here you early out, and say we cannot prune. Is there still reasoning behind this?
|
Thanks for the detailed feedback @Tmonster ! I'm still new to contributing to duckdb and it shows it the change I think, your guidance is extremely valuable. Here's the angle that I was coming from: I have code that current uses a pre-processing step to query iceberg tables and push filters to partitions (parsing the SQL AST and converting this into predicate expressions), then use I'm going to digest your feedback and will push an update, thanks again! |
|
So Iceberg has some pretty good partitioning right now. Not yet for You can check the filter push down logs with |
This PR adds support for additional filter types in Iceberg file pruning:
All optimizations follow the existing conservative approach - unknown or unhandled cases return true (don't prune) to ensure correctness.
I included tests in extended_filter_pruning.test validating both correctness and pruning behavior.
The second commit adds support for filter pruning on expression patterns that DuckDB's FilterCombiner drops: