@discord9
Contributor

@discord9 discord9 commented Dec 19, 2025

Which issue does this PR close?

Rationale for this change

For dynamic filters to work properly, the table scan must receive the correct column even when the filter passes through an alias introduced by a ProjectionExec. The parent filters therefore need to be rewritten in gather_filters_for_pushdown.

What changes are included in this PR?

As the title says, this adds support for handling simple aliases in pushed-down filters: an aliased column in the filter is expanded to its original expression (or replaced with UnKnownColumn if the column referenced by the filter cannot be found), so aliases in a projection are supported. Unit tests are also added.
AI content claim: the core logic is hand-written and thoroughly understood, but the unit tests are largely generated with some human guidance.
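
To illustrate the alias expansion described above, here is a minimal sketch in plain Rust. It is a toy model, not the code in this PR and not DataFusion's API: the Expr enum, ProjectionItem, expand_aliases, and the col/gt helpers are hypothetical names invented for the example, and columns are matched by name rather than by index. The UnknownColumn variant only stands in for the real UnKnownColumn.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical; a stand-in for a physical expression).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Column reference by name.
    Column(String),
    /// Integer literal.
    Literal(i64),
    /// `left > right`.
    Gt(Box<Expr>, Box<Expr>),
    /// `left + right`.
    Add(Box<Expr>, Box<Expr>),
    /// Marker for a column the projection does not produce.
    UnknownColumn(String),
}

/// One projection output: `expr AS alias` (hypothetical type).
struct ProjectionItem {
    expr: Expr,
    alias: String,
}

/// Small constructors to keep the examples readable.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn gt(left: Expr, right: Expr) -> Expr {
    Expr::Gt(Box::new(left), Box::new(right))
}

/// Rewrite `filter`, written against the projection's output columns,
/// so that it refers to the projection's input columns instead.
fn expand_aliases(filter: &Expr, projection: &[ProjectionItem]) -> Expr {
    let by_alias: HashMap<&str, &Expr> = projection
        .iter()
        .map(|p| (p.alias.as_str(), &p.expr))
        .collect();
    rewrite(filter, &by_alias)
}

fn rewrite(expr: &Expr, by_alias: &HashMap<&str, &Expr>) -> Expr {
    match expr {
        // Substitute the aliased column with its original expression.
        // The substituted expression is not rewritten again, so swapped
        // aliases (`a AS b, b AS a`) are replaced simultaneously.
        Expr::Column(name) => match by_alias.get(name.as_str()) {
            Some(original) => (*original).clone(),
            None => Expr::UnknownColumn(name.clone()),
        },
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(rewrite(l, by_alias)),
            Box::new(rewrite(r, by_alias)),
        ),
        Expr::Add(l, r) => Expr::Add(
            Box::new(rewrite(l, by_alias)),
            Box::new(rewrite(r, by_alias)),
        ),
        Expr::Literal(_) | Expr::UnknownColumn(_) => expr.clone(),
    }
}
```

With a projection of b AS a, the filter a > 800 is rewritten to b > 800. Because every column is substituted in a single pass, a swapped projection (a AS b, b AS a) is also handled without the rewritten column being substituted a second time.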

Are these changes tested?

Unit tests are added; please comment if more tests are needed.

Are there any user-facing changes?

Yes, dynamic filters will now work properly with aliases. I'm not sure whether that counts as a breaking change, though.

@github-actions github-actions bot added physical-plan Changes to the physical-plan crate physical-expr Changes to the physical-expr crates labels Dec 19, 2025
@discord9 discord9 changed the title feat: support alias on dynamic filter with ProjectionExec feat: support pushdown alias on simple dynamic filter with ProjectionExec Dec 25, 2025
@discord9 discord9 marked this pull request as ready for review December 25, 2025 12:38
@discord9 discord9 changed the title feat: support pushdown alias on simple dynamic filter with ProjectionExec feat: support pushdown alias on dynamic filter with ProjectionExec Dec 25, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 26, 2025
@discord9
Contributor Author

@jackkleeman @adriangb hi, I added the projection alias support in #17246, since you have the most context on this, could you please take a look when you have a chance?

@adriangb adriangb self-requested a review December 29, 2025 15:15
@discord9 discord9 force-pushed the feat/dyn_filter_alias branch from ee4e327 to 4775fc7 Compare December 30, 2025 03:27
@adriangb adriangb force-pushed the feat/dyn_filter_alias branch from 4775fc7 to 0ccefc8 Compare January 2, 2026 16:31
Contributor

@adriangb adriangb left a comment

Looking good! Just needs some tweaks and more tests

}

#[test]
fn test_filter_pushdown_with_unknown_column() -> Result<()> {
Contributor

Can you help me understand how an unknown column fits into the picture? How do they get created? Why do we need special handling here?

Contributor Author

> Can you help me understand how an unknown column fits into the picture? How do they get created? Why do we need special handling here?

An unknown column seems right when we encounter a column that can't be found in the input schema, but maybe a better way to handle this is to simply not collect the filter if an unknown column is encountered?

Contributor

My point is what would cause that to end up in a query? Won’t the query fail to execute? It seems like some marker another optimizer rule puts in place and later cleans up. I’m not saying your handling of them is wrong, I’m just trying to understand what’s going on because I’m surprised there is such a thing.

Contributor Author

> My point is what would cause that to end up in a query? Won’t the query fail to execute? It seems like some marker another optimizer rule puts in place and later cleans up. I’m not saying your handling of them is wrong, I’m just trying to understand what’s going on because I’m surprised there is such a thing.

You are right. Previously it was only used by partitioning, here:

Arc::new(UnKnownColumn::new(&expr.to_string()))
to represent an unknown column (also in a kind of projection), so for that use case an unknown column will cause partition by hash to behave incorrectly.

Contributor

I still do not understand why unknown columns are relevant here. Do any tests fail if you don't do special handling of them? It seems to me that any attempt to push an UnknownColumn in a filter through a ProjectionExec means something has already gone seriously wrong and the end result query would fail to execute.

@github-actions github-actions bot added the core Core DataFusion crate label Jan 5, 2026
@discord9
Contributor Author

discord9 commented Jan 5, 2026

Added tests in datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs and slt

@discord9 discord9 force-pushed the feat/dyn_filter_alias branch from 8544e36 to 6c9e95b Compare January 5, 2026 11:16
…test: filter pushdown projection

Signed-off-by: discord9 <[email protected]>
… for clarity

test: add test for filter pushdown with swapped aliases
test: update dynamic filter projection pushdown test name for consistency
@adriangb adriangb force-pushed the feat/dyn_filter_alias branch from 2b6a9a5 to b50b1ab Compare January 7, 2026 18:34
@adriangb
Contributor

adriangb commented Jan 8, 2026

@discord9 I can't make a PR to your fork as far as I can tell, but can you check the diff in 7ee2bfb and see what you think? Sorry, some of it is in 22981aa as well.

glob = { workspace = true }
insta = { workspace = true }
paste = { workspace = true }
pretty_assertions = "1.0"
Contributor

This appears to already be used elsewhere (this is not a net new dependency), so I think it is ok to add.

@discord9
Contributor Author

discord9 commented Jan 9, 2026

> @discord9 I can't make a PR to your fork as far as I can tell, but can you check the diff in 7ee2bfb and see what you think? Sorry, some of it is in 22981aa as well.

I have some doubts about leaving it unchanged. Consider this case:

Sort filter=a@0>800
  <Some strange Extension node that went unhandled> (say, something that causes an unknown column, a new a@0 in this case, to appear)
    Projection of some other columns, old a@0 not included (the filter should be an unknown column at this stage)
      <do something with old a@0>
      Projection b@0 as a@0
        TableScan filter=DynamicFilter [ b@0 > 800 ] (should be an unknown column, but leaving it unchanged confuses it with existing columns)

I guess my point is that replacing a column that is not found with UnknownColumn allows downstream users to better debug their problems when a DynamicFilter is involved. Evaluating an UnknownColumn returns an error, so a downstream custom TableScan can use it to easily determine whether it can use the dynamic filter, and even if a downstream user forgets to handle an Extension node's filter pushdown, the problem is easier to spot because of the UnknownColumn. For DataFusion itself, in the normal case without external extension logical plans, a projection shouldn't cause unknown columns in a filter, so it's OK?
But if we leave unknown columns unchanged, certain queries (with extension plans) might confuse those unchanged columns with something else and cause hidden bugs that are much harder to discover or debug, so having UnknownColumn is more of a defensive guard?
And having an explicit way to see that something went wrong does no harm and is beneficial? Although if you insist, I can cherry-pick 7ee2bfb.
Edit: Maybe a middle ground is to just refuse to push down filters when unknown columns are encountered?
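
The three options discussed here (leave the column unchanged, replace it with an unknown-column marker, or refuse the pushdown) can be compared in a small standalone sketch. This is a toy model under stated assumptions, not DataFusion code: Expr, MissingColumnPolicy, push_through_projection, and evaluate are hypothetical names, columns are matched by name rather than by index, and UnknownColumn only stands in for the real UnKnownColumn.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical; not DataFusion's types).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
    /// Marker for a column the projection does not produce.
    UnknownColumn(String),
}

/// What to do when the filter references a column the projection lacks.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MissingColumnPolicy {
    KeepUnchanged,
    MarkUnknown,
    RefusePushdown,
}

/// Rewrite `filter` through a projection given as `alias -> original expr`.
/// Returns `None` when the policy refuses to push the filter down.
fn push_through_projection(
    filter: &Expr,
    aliases: &HashMap<String, Expr>,
    policy: MissingColumnPolicy,
) -> Option<Expr> {
    match filter {
        Expr::Column(name) => match aliases.get(name) {
            Some(original) => Some(original.clone()),
            None => match policy {
                MissingColumnPolicy::KeepUnchanged => Some(filter.clone()),
                MissingColumnPolicy::MarkUnknown => {
                    Some(Expr::UnknownColumn(name.clone()))
                }
                MissingColumnPolicy::RefusePushdown => None,
            },
        },
        Expr::Gt(l, r) => {
            let l = push_through_projection(l, aliases, policy)?;
            let r = push_through_projection(r, aliases, policy)?;
            Some(Expr::Gt(Box::new(l), Box::new(r)))
        }
        Expr::Literal(_) | Expr::UnknownColumn(_) => Some(filter.clone()),
    }
}

/// Evaluate against one row of named values; comparisons yield 1 or 0.
/// An unknown column always fails, which makes the problem visible.
fn evaluate(expr: &Expr, row: &HashMap<String, i64>) -> Result<i64, String> {
    match expr {
        Expr::Column(name) => row
            .get(name)
            .copied()
            .ok_or_else(|| format!("column {} not found", name)),
        Expr::Literal(value) => Ok(*value),
        Expr::Gt(l, r) => Ok((evaluate(l, row)? > evaluate(r, row)?) as i64),
        Expr::UnknownColumn(name) => Err(format!("unknown column {}", name)),
    }
}
```

In this sketch, if the projection does not produce a but the input below it happens to contain an unrelated column named a, KeepUnchanged evaluates without complaint against the wrong column, MarkUnknown turns the same situation into an evaluation error, and RefusePushdown returns None so the filter is never pushed down at all.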

Signed-off-by: discord9 <[email protected]>
@github-actions github-actions bot removed the physical-expr Changes to the physical-expr crates label Jan 9, 2026
Signed-off-by: discord9 <[email protected]>

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dynamic filter pushdown through projection should support aliases

3 participants