feat(datafusion): Treat timestamp conversion functions like a cast.#945
feat(datafusion): Treat timestamp conversion functions like a cast.#945omerhadari wants to merge 5 commits intoapache:mainfrom
Conversation
|
EDIT: copying this message to the issue itself, since discussion started there. @Fokko hope it's ok to tag :) This PR contains the first (and easiest) part of the fix for the issue. I wanted to ask, is there a way to express function within iceberg predicates? Is this even desired? The reason this could be beneficial is that sometimes you need access to the column value and then you could perform much better manifest elimination. A few examples I have in this context:
This also reveals what I think is a bug. In If I understand correctly, this could cause wrong results for example for the query would result in the predicate Would appreciate some guidance about how to tackle this issue of propagating more information, I don't think it makes sense in the scope of this PR but maybe I am missing something basic. |
|
Hey @omerhadari Thanks for working on this. Unfortunately, I'm afraid that this is a pretty complex task. When we do like |
crates/integrations/datafusion/src/physical_plan/expr_to_predicate.rs
Outdated
Show resolved
Hide resolved
|
Updated the PR with the cases that were discussed to be within the scope for now (in this comment) I think this makes sense for a single PR, though it doesn't solve the entire issue. |
4907956 to
735e7df
Compare
Does it make sense to create an issue for supporting an equivalent to UnboundTransform here? |
|
@Fokko will you be able to do another round of review? |
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @omerhadari for this pr, but I think the correct way to solve problems in this pr is to call expression simplification in datafusion rather hacking in this pr.
| } | ||
| Expr::Cast(c) => to_iceberg_predicate(&c.expr), | ||
| Expr::Cast(c) => { | ||
| if DataType::Date32 == c.data_type || DataType::Date64 == c.data_type { |
There was a problem hiding this comment.
I have concerns handling this case in such a way here. This is a process of simplifying expression in datafusion, which should call datafusion's expression simplification api.
There was a problem hiding this comment.
This is less about simplification, rather than extracting the right information for Iceberg to optimize.
There was a problem hiding this comment.
This is less about simplification, rather than extracting the right information for Iceberg to optimize.
Maybe the word simplification is a little confusing, I think constant folding is better here. The expressions in tests should be simplified by constant folding and replaced with a literal, which could be processed by iceberg.
There was a problem hiding this comment.
Constant folding is being done I think, (e.g. TO_TIMESTAMP(2+2) => TO_TIMESTAMP(4)) but I am not sure it makes sense to turn TO_TIMESTAMP(<date_str>) to CAST(date_str AS TIMESTAMP) as a simplification in Datafusion. Currently the implementation of CAST and TO_TIMESTAMP is the same with no parameters, but I don't think turning TO_TIMESTAMP to CAST is a "simplification" necessarily. It may even be counter intuitive and effect errors in some weird ways, with no gains to readability or performance.
There was a problem hiding this comment.
By simplification I don't mean converting TO_TIMESAMPT(<date_str>) to CAST(date_str AS TIMESTAMP), I mean it should be converting TO_TIMESAMPT(<date_str>) to <timestamp literal>, which should be done in constant folding.
There was a problem hiding this comment.
I think datafusion's expression simplification api should have done this, you could see this example:
https://github.com/apache/datafusion/blob/132b21f02763cabbab85ab99e0e0c7fd6cd3d516/datafusion-examples/examples/expr_api.rs#L153
https://github.com/apache/datafusion/blob/132b21f02763cabbab85ab99e0e0c7fd6cd3d516/datafusion-examples/examples/expr_api.rs#L180
There was a problem hiding this comment.
Ok, I don't mind it being in in DataFusion, I'll try opening an issue/PR there and see.
There was a problem hiding this comment.
This doesn't seem a datafusion problem, maybe we should call the simplification api in the iceberg planner?
There was a problem hiding this comment.
This would be redundant I think. We should call it in the tests to mimic production behaviour.
The current simplification API in DataFusion simplifies casts, not these specific function calls (TO_TIMESTAMP, TO_DATE) as far as I could tell.
There was a problem hiding this comment.
I'm not sure about the details, but I guess in iceberg-datafusion planner we missed sth. It's fine for me to seek for help in datafusion community.
| if DataType::Date32 == c.data_type || DataType::Date64 == c.data_type { | ||
| match c.expr.as_ref() { | ||
| Expr::Literal(ScalarValue::Utf8(Some(literal))) => { | ||
| let date = literal.split('T').next(); |
There was a problem hiding this comment.
I would be reluctant to do any string manipulation here.
There was a problem hiding this comment.
Instead, I would add support for the datetime format in date_from_str:
iceberg-rust/crates/iceberg/src/spec/values.rs
Lines 1459 to 1466 in 75c7e8d
There was a problem hiding this comment.
@Fokko currently date_from_str supports datetime (i.e. 2025-01-01T11:11:11 will not be truncated).
I think this is a mistake given the name of the function (date_from_str and not datetime_from_str) and its documentation, but I don't think I can change its behaviour so drastically now right?
Do you suggest I change it, or create another one that is fix?
|
|
||
| #[test] | ||
| fn test_to_date_comparison_creates_predicate() { | ||
| let sql = "ts >= CAST('2023-01-05T11:11:11' AS DATE)"; |
There was a problem hiding this comment.
I'm surprised to see that this works:
spark-sql (default)> SELECT CAST('2023-01-05T11:11:11' AS DATE);
2023-01-05
There was a problem hiding this comment.
Personally I would have preferred an error, but nobody is asking me 😆
|
@liurenjie1024 @Fokko regardless of whether or not turning Talking about what you can see here |
Thanks @jonathanc-n I think it would make sense to fix the bug you mentioned, and it would be better to put it in another pr. |
|
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |

Partially Solves #933
Note: This PR also fixes a bug that made Casts to Date which implicitly truncate the expression inside them create predicates with the raw expression, rather than with the truncated date. This could result in wrong predicates (see the relevant test in the PR).