@achille-roussel achille-roussel commented Dec 29, 2025

This PR adds support for additional filter types in Iceberg file pruning:

  • CONJUNCTION_OR: Prune files when ALL child filters would prune
  • COMPARE_NOTEQUAL (!=): Prune when file contains only the excluded value
  • COMPARE_BETWEEN: Prune when file range doesn't overlap query range
  • BOUND_FUNCTION (prefix/starts_with): Prune based on string prefix bounds
  • COMPARE_NOT_IN: Prune when file's single value is in exclusion list

All optimizations follow the existing conservative approach - unknown or unhandled cases return true (don't prune) to ensure correctness.
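The rules listed above can be illustrated with a minimal sketch. It is not the code from this PR: `IntBounds`, `StringBounds`, and the `MayMatch*` helpers are hypothetical names, and the sketch assumes non-NULL columns whose lower and upper bounds are both known. Each helper returns true (keep the file) unless the bounds prove that no row can match.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-ins for per-file column statistics (lower/upper bounds).
// These are illustration types, not types from duckdb-iceberg.
struct IntBounds {
    long lower; // smallest value of the column in the file
    long upper; // largest value of the column in the file
};

struct StringBounds {
    std::string lower;
    std::string upper;
};

// col != v: prune only when the file contains nothing but v.
bool MayMatchNotEqual(const IntBounds &b, long v) {
    return !(b.lower == v && b.upper == v);
}

// col BETWEEN lo AND hi: prune when the file range and query range do not overlap.
bool MayMatchBetween(const IntBounds &b, long lo, long hi) {
    return b.upper >= lo && b.lower <= hi;
}

// col NOT IN (values): prune when the file's single value is in the exclusion list.
bool MayMatchNotIn(const IntBounds &b, const std::vector<long> &values) {
    if (b.lower != b.upper) {
        return true; // more than one possible value, cannot prune
    }
    return std::find(values.begin(), values.end(), b.lower) == values.end();
}

// a OR b: the file is pruned only when both children would prune it.
bool MayMatchOr(bool left_may_match, bool right_may_match) {
    return left_may_match || right_may_match;
}

// starts_with(col, prefix): truncating a string to a fixed length preserves
// lexicographic order, so any matching value forces
// truncate(lower) <= prefix <= truncate(upper).
bool MayMatchPrefix(const StringBounds &b, const std::string &prefix) {
    const std::string lo = b.lower.substr(0, prefix.size());
    const std::string hi = b.upper.substr(0, prefix.size());
    return lo <= prefix && prefix <= hi;
}
```

For example, a file with bounds [0, 499] is pruned for `col BETWEEN 500 AND 600`, and a file whose only value is 7 is pruned for `col NOT IN (1, 7)`.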

I included tests in extended_filter_pruning.test validating both correctness and pruning behavior.


The second commit adds support for filter pruning on expression patterns that DuckDB's FilterCombiner drops:

  • Complex nested OR with AND children: (col >= 0 AND col < 500) OR (col >= 4500)
  • NOT BETWEEN expressions
  • NOT (expr) logical negation
  • ends_with/suffix string functions
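As a rough sketch of how nested OR/AND/NOT expressions can be evaluated against file bounds, one option is a three-valued verdict per node. This is not the PR's `MatchBoundsFromExpression()`; `Verdict`, `Expr`, `Evaluate`, and `ShouldKeepFile` are hypothetical names, and the sketch assumes a single integer column in which NULL rows never match. Tracking "always true" as well as "always false" is what makes NOT prunable: NOT (expr) can prune a file only when expr is provably true for every row in it.

```cpp
#include <vector>

// What the file bounds prove about a predicate for every row in the file.
enum class Verdict { AlwaysFalse, Maybe, AlwaysTrue };

// Hypothetical lower/upper bounds of one integer column in one data file.
struct Bounds {
    long lower;
    long upper;
};

// Hypothetical, simplified expression tree over a single column.
struct Expr {
    enum class Kind { GreaterEqual, LessThan, And, Or, Not } kind;
    long constant;              // used by GreaterEqual / LessThan
    std::vector<Expr> children; // used by And / Or / Not
};

Expr Ge(long v) { return {Expr::Kind::GreaterEqual, v, {}}; }
Expr Lt(long v) { return {Expr::Kind::LessThan, v, {}}; }
Expr And(Expr a, Expr b) { return {Expr::Kind::And, 0, {a, b}}; }
Expr Or(Expr a, Expr b) { return {Expr::Kind::Or, 0, {a, b}}; }
Expr Not(Expr a) { return {Expr::Kind::Not, 0, {a}}; }

Verdict Evaluate(const Expr &e, const Bounds &b) {
    switch (e.kind) {
    case Expr::Kind::GreaterEqual:
        if (b.lower >= e.constant) return Verdict::AlwaysTrue;
        if (b.upper < e.constant) return Verdict::AlwaysFalse;
        return Verdict::Maybe;
    case Expr::Kind::LessThan:
        if (b.upper < e.constant) return Verdict::AlwaysTrue;
        if (b.lower >= e.constant) return Verdict::AlwaysFalse;
        return Verdict::Maybe;
    case Expr::Kind::And: {
        bool all_true = true;
        for (const Expr &child : e.children) {
            const Verdict v = Evaluate(child, b);
            if (v == Verdict::AlwaysFalse) return Verdict::AlwaysFalse;
            all_true = all_true && v == Verdict::AlwaysTrue;
        }
        return all_true ? Verdict::AlwaysTrue : Verdict::Maybe;
    }
    case Expr::Kind::Or: {
        bool all_false = true;
        for (const Expr &child : e.children) {
            const Verdict v = Evaluate(child, b);
            if (v == Verdict::AlwaysTrue) return Verdict::AlwaysTrue;
            all_false = all_false && v == Verdict::AlwaysFalse;
        }
        return all_false ? Verdict::AlwaysFalse : Verdict::Maybe;
    }
    case Expr::Kind::Not: {
        if (e.children.size() != 1) return Verdict::Maybe; // malformed, stay conservative
        const Verdict v = Evaluate(e.children.front(), b);
        if (v == Verdict::AlwaysTrue) return Verdict::AlwaysFalse;
        if (v == Verdict::AlwaysFalse) return Verdict::AlwaysTrue;
        return Verdict::Maybe;
    }
    }
    return Verdict::Maybe; // unknown node kinds never prune
}

// A file is pruned only when the filter is provably false for all of its rows.
bool ShouldKeepFile(const Expr &filter, const Bounds &b) {
    return Evaluate(filter, b) != Verdict::AlwaysFalse;
}
```

With this sketch, `Or(And(Ge(0), Lt(500)), Ge(4500))` prunes a file with bounds [1000, 2000] but keeps files with bounds [100, 200] or [4000, 5000], and `Not(And(Ge(0), Lt(500)))` (the NOT BETWEEN case, written with a half-open range) prunes a file with bounds [100, 200].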

Implementation:
- Add MatchBoundsFromExpression() in IcebergPredicate for direct Expression evaluation
- Add ShouldEvaluateDirectly() to detect complex patterns FilterCombiner drops
- Store complex expressions in IcebergMultiFileList::complex_filters
- Evaluate complex filters alongside TableFilterSet in FileMatchesFilter()
Signed-off-by: Achille Roussel <[email protected]>
@Tmonster
Collaborator

Tmonster commented Jan 2, 2026

Hi @achille-roussel, thanks for the PR! I will take a look soon

@achille-roussel
Author

achille-roussel commented Jan 7, 2026

Hello @Tmonster,

I have a follow-up change with more optimizations that I would like to submit, but I didn't want to mix it into this one:

I'll submit it once this one is merged.

@achille-roussel
Author

Hello @Tmonster

Would you be able to take a look at this change in the coming week?

@Tmonster
Collaborator

Hi @achille-roussel,

Yes, I will try to take a look sometime this week 👍

@dor-bernstein

@achille-roussel I have encountered a bug where filter pruning doesn't apply when filtering on the same table in two different CTEs (#610). Any idea whether this PR will resolve it?

@achille-roussel
Author

Hello @dor-bernstein, I don't know if this will address it. Would you be able to test with a build of the iceberg extension that includes this change?

@dor-bernstein

@achille-roussel yes

Collaborator

@Tmonster Tmonster left a comment

So I have looked at this and stepped through it myself, and I am a bit confused about what the original problem is and how this PR is trying to fix it.

One thing I do agree on is that we can add a check for TableFilterType::CONJUNCTION_OR on line 158 of src/iceberg_predicate.cpp.

The rest, though, I'm unsure about. With just the TableFilterType::CONJUNCTION_OR addition, your test file passes all tests. If you add tests for suffix/starts_with functions, then I think some tests will start failing, and we can look into that later.

Simple BETWEEN expressions are also handled by the present logic: the filter_combiner transforms them into AND(greater_than_or_equal_to, less_than_or_equal_to) table filter expressions. So the addition on line 222 of iceberg_predicate either isn't necessary, or it needs another test case that I'm not thinking of.
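A small sketch of why the two forms prune identically, assuming a hypothetical `Bounds` struct for a non-NULL integer column (none of these names come from the extension): the range-overlap test for BETWEEN is exactly the conjunction of the `>=` and `<=` bound checks.

```cpp
// Hypothetical lower/upper bounds of one integer column in one data file.
struct Bounds {
    long lower;
    long upper;
};

// col >= v may match unless every value in the file is below v.
bool MayMatchGreaterEqual(const Bounds &b, long v) {
    return b.upper >= v;
}

// col <= v may match unless every value in the file is above v.
bool MayMatchLessEqual(const Bounds &b, long v) {
    return b.lower <= v;
}

// BETWEEN rewritten as AND(col >= lo, col <= hi): both children must be satisfiable.
bool MayMatchBetweenViaAnd(const Bounds &b, long lo, long hi) {
    return MayMatchGreaterEqual(b, lo) && MayMatchLessEqual(b, hi);
}

// BETWEEN as a direct range-overlap test.
bool MayMatchBetweenDirect(const Bounds &b, long lo, long hi) {
    return !(b.upper < lo || b.lower > hi);
}
```

Both functions reduce to `b.upper >= lo && b.lower <= hi`, so under these assumptions a dedicated BETWEEN case adds nothing once the AND form is handled.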

I agree we should try to do more work to prune TableFilterType::EXPRESSION_FILTERs, but I don't think we need to add complex_filters to do this. Most filters can be generated as table_filters now.

Also, I have implemented writing upper and lower bounds to manifest files, so you can write more end-to-end tests in your PR. See https://github.com/duckdb/duckdb-iceberg/blob/main/test/sql/local/irc/insert/test_write_upper_and_lower_bounds.test which tests writing of upper and lower bounds to see what is supported

}
case ExpressionType::OPERATOR_NOT:
//! NOT expressions wrapping prunable expressions
return true;
Collaborator

Here we say we should evaluate them, but here you exit early and say we cannot prune. Is there a reason for this?

@achille-roussel
Copy link
Author

Thanks for the detailed feedback @Tmonster !

I'm still new to contributing to duckdb, and I think it shows in the change. Your guidance is extremely valuable.

Here's the angle I was coming from: I have code that currently uses a pre-processing step to query iceberg tables and push filters down to partitions (parsing the SQL AST and converting it into predicate expressions), and then uses read_parquet on the resulting files. This code handles complex nested expressions to get the right set of data files to run the queries against. I would like to port this code to use duckdb-iceberg instead, but the current state of partition filtering in the extension appeared to be less advanced, which is what I tried to address with this change.

I'm going to digest your feedback and will push an update, thanks again!

@Tmonster
Copy link
Collaborator

So Iceberg has some pretty good partitioning right now. Not yet for OR statements, but for most table filters yes.
One thing you can do to find out which data files DuckDB is missing is to check the DuckDB logs and compare them to the parquet files you are manually reading with read_parquet. If you can narrow down which filters are not getting pushed, it becomes easier to know where filter pushdown is lacking.

You can check the filter push down logs with

call enable_logging('Iceberg');
select * from my_partitioned_Iceberg_table where -- filters;
select * from duckdb_logs() where type = 'Iceberg' and (message like '%data_file%' or message like '%manifest_file%');
