Closing this out, since there's #7748! Thanks!
I'd like to point out a subtle yet common error that can result in unexpected Cartesian products in SQL queries when using Ibis.
There are two Python functions in question:
def get_feat_df(df_clusters: ibis.expr.types.TableExpr, user_clusters: list, id_col: str) -> ibis.expr.types.TableExpr:
def get_feat_df2(df_clusters: ibis.expr.types.TableExpr, user_clusters: list, id_col: str) -> ibis.expr.types.TableExpr:
Their key difference is the order of operations. The first applies the filter and only then builds the list of case statements. The second builds the case-statement list and the list needed for the filter in parallel, and then chains a filter call followed by the case statements.
Here are the SQL statements generated by these two methods.
first:
second:
Notably, the second method generates SQL that unexpectedly contains a Cartesian product. The reason is that its case statements were built from the original DataFrame, so they still refer to it rather than to the filtered DataFrame produced by the chained filter call. This is different from what users might be accustomed to in PySpark.
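To make the binding issue concrete, here is a small toy model in plain Python. It is not the Ibis API: `Table`, `ColumnRef`, and `foreign_tables` are hypothetical names invented for this sketch, and the data is made up. It only illustrates the idea that a column reference remembers the table it was taken from, so an expression built from the unfiltered table points at a different relation than the filtered one, while an expression built after the filter (or deferred as a callable) does not.

```python
class ColumnRef:
    """Toy column reference that remembers its parent table (hypothetical, not Ibis)."""

    def __init__(self, parent, column):
        self.parent = parent
        self.column = column


class Table:
    """Toy stand-in for a table expression (hypothetical, not Ibis)."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = list(rows)

    def col(self, column):
        # The reference is bound to *this* table object.
        return ColumnRef(self, column)

    def filter(self, predicate):
        # Filtering returns a new table object, distinct from the original.
        kept = [row for row in self.rows if predicate(row)]
        return Table(f"filtered({self.name})", kept)


def foreign_tables(target, exprs):
    """Names of tables other than `target` that the expressions refer to.

    Callables are treated as deferred expressions and resolved against `target`.
    """
    resolved = [e(target) if callable(e) else e for e in exprs]
    return sorted({ref.parent.name for ref in resolved if ref.parent is not target})


# Made-up sample data for the sketch.
clusters = Table(
    "clusters",
    [
        {"user_id": 1, "cluster": 0},
        {"user_id": 2, "cluster": 1},
        {"user_id": 3, "cluster": 2},
    ],
)
user_clusters = [0, 1]

# Second pattern: expressions are built from the unfiltered table,
# in parallel with the filter, and applied afterwards.
eager_exprs = [clusters.col("cluster")]
filtered = clusters.filter(lambda row: row["cluster"] in user_clusters)
print(foreign_tables(filtered, eager_exprs))  # refers to the original table

# First pattern: expressions are built only once the filtered table exists.
deferred_exprs = [lambda t: t.col("cluster")]
print(foreign_tables(filtered, deferred_exprs))  # refers to no other table
```

In this toy, a non-empty result from `foreign_tables` plays the role of the second relation that ends up in the generated SQL. The practical takeaway matches the first function above: filter first, then build the case statements from the filtered table.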