-
Notifications
You must be signed in to change notification settings - Fork 134
Support string column identifiers for sort/aggregate/window and stricter Expr validation #1221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Refactor expression handling and `_simplify_expression` for stronger type checking and clearer error handling - Improve type annotations for `file_sort_order` and `order_by` to support string inputs - Refactor DataFrame `filter` method to better validate predicates - Replace internal error message variable with public constant - Clarify usage of `col()` and `column()` in DataFrame examples
…handling - Update `order_by` handling in Window class for better type support - Improve type checking in DataFrame expression handling - Replace `Expr`/`SortExpr` with `SortKey` in file_sort_order and related functions - Simplify file_sort_order handling in SessionContext - Rename `_EXPR_TYPE_ERROR` → `EXPR_TYPE_ERROR` for consistency - Clarify usage of `col()` vs `column()` in DataFrame examples - Enhance documentation for file_sort_order in SessionContext
…ng, sorting, and docs - Introduce `ensure_expr` helper and improve internal expression validation - Update error messages and tests to consistently use `EXPR_TYPE_ERROR` - Refactor expression handling with `_to_raw_expr`, `_ensure_expr`, and `SortKey` - Improve type safety and consistency in sort key definitions and file sort order - Add parameterized parquet sorting tests - Enhance DataFrame docstrings with clearer guidance and usage examples - Fix minor typos and error message clarity
…ation - Introduced `ensure_expr_list` to validate and flatten nested expressions, treating strings as atomic - Updated expression utilities to improve consistency across aggregation and window functions - Consolidated and expanded parameterized tests for string equivalence in ranking and window functions - Exposed `EXPR_TYPE_ERROR` for consistent error messaging across modules and tests - Improved internal sort logic using `expr_internal.SortExpr` - Clarified expectations for `join_on` expressions in documentation - Standardized imports and improved test clarity for maintainability
|
Hi, thanks for the PR.
One thing to check, |
|
hi @HeWhoHeWho Yes. Comparisons and arithmetic on an Expr automatically coerce plain Python values to literals, so you can write: df.filter(col("A") > 123)
df.filter(col("B") == "Jack")without explicitly wrapping 123 or "Jack" in lit()/literal(). Internally, each operator checks whether the right‑hand side is already an Expr; if not, it calls Expr.literal to convert the value before performing the operation. |
timsaucer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good improvement overall, and making our approach explicit is a good idea. I do think there is precedent in plenty of other libraries where some functions take strings and assume column names and others require you to provide expressions, so I feel comfortable in what we have here. I do have a couple of suggestions about places where the documentation feels a little awkward. Overall, a very nice addition.
| .. code-block:: python | ||
| from datafusion import col, lit | ||
| df.filter(col('age') > lit(21)) | ||
| Without ``lit()`` DataFusion would treat ``21`` as a column name rather than a | ||
| constant value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this statement true? df.filter(col('age') > 21) would treat 21 as a column name? I think that's a change in how the comparison operator works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
The comparison operators on Expr automatically convert any non-Expr value into a literal expression.
I will correct the documentation
| # Drop columns | ||
| df = df.drop("temporary_column") | ||
| String Columns and Expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the title here is misleading. "String Columns" to me would mean columns that contain string values. I think maybe we should call this something like "Function arguments taking column names" or "Column names as function arguments"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will correct this.
| String Columns and Expressions | ||
| ------------------------------ | ||
|
|
||
| Some ``DataFrame`` methods accept plain strings when an argument refers to an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend "plain strings" -> "column names"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will correct this.
python/datafusion/dataframe.py
Outdated
| :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not | ||
| accepted. If more complex logic is required, see the logical operations in | ||
| :py:mod:`~datafusion.functions`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this part that says "plain strings are not accepted". When you view it from the context of this PR and Issue, the statement makes sense. But if you are an external user coming across this function definition for the first time, I don't think the statement makes sense. I think you can just remove the change here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will correct this.
python/datafusion/dataframe.py
Outdated
| :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not | ||
| accepted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before.
python/datafusion/dataframe.py
Outdated
| pass named expressions use the form name=Expr. | ||
| By passing expressions, iterables of expressions, or named expressions. | ||
| All expressions must be :class:`~datafusion.expr.Expr` objects created via | ||
| :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before
python/datafusion/dataframe.py
Outdated
| On expressions are used to support in-equality predicates. Equality | ||
| predicates are correctly optimized | ||
| Join predicates must be :class:`~datafusion.expr.Expr` objects, typically | ||
| built with :func:`datafusion.col`; plain strings are not accepted. On |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before
python/datafusion/expr.py
Outdated
|
|
||
| def expr_list_to_raw_expr_list( | ||
| expr_list: Optional[list[Expr] | Expr], | ||
| expr_list: Optional[_typing.Union[Sequence[_typing.Union[Expr, str]], Expr, str]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change from | to _typing.Union ? I'm not completely against it but it seems we're inconsistent in our approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I find the _typing.Union to be an eye sore. Would it be possible to just import Union if we're going in this direction?
Also the | is the preferred syntax in Python 3.10 and later and 3.9 reaches end of life next month. Maybe we just stick with | and at the end of October update our minimum version to 3.10. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will amend this.
…raw_expr_list function
… method documentation
|
@timsaucer |
timsaucer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for another excellent PR!
Which issue does this PR close?
Rationale for this change
Users reported inconsistent behavior when using plain string column identifiers vs. explicit
col()/column()andlit()/literal()expressions. SomeDataFramemethods accepted bare strings (e.g.select,sort,drop), while expression-returning APIs (e.g.filter,with_column) requiredExprobjects. This led to confusion and surprising runtime errors.This PR clarifies and documents the intended behavior by:
sort,aggregategrouping keys, manyorder_byparameters, file sort metadata).Exprobjects for APIs that expect an expression (e.g.filter,with_column,with_columnselements,join_onpredicates). When a non-Expris passed to such APIs, a clearTypeErrormessage guides the user to usecol()/column()orlit()/literal().This makes usages consistent and explicit while preserving ergonomics where obvious (accepting column-name strings).
What changes are included in this PR?
High-level summary
Add robust helpers to validate and convert user-provided values:
ensure_expr(value)— validate and return the internal expression object or raiseTypeErrorwith a helpful message.ensure_expr_list(iterable)— flatten and validate nested iterables ofExprobjects._to_raw_expr(value)— convert anExprorstrcolumn name to the internal raw expression.SortKey = Expr | SortExpr | strtype alias to represent items accepted by sort-like APIs.Allow column-name strings in many APIs that logically take column identifiers:
DataFrame.sort,DataFrame.aggregate(group-by keys),Window.order_by, variousfunctions.*order_byparameters, andSessionContextfile sort-order metadata.Enforce explicit
Exprvalues in expression-only APIs and provide clear error messages referencingcol()/column()andlit()/literal()helpers.Update Python doc (
docs/source/user-guide/dataframe/index.rst) to explain which methods accept strings and which requirecol()/lit().Add tests covering string acceptance and error cases (extensive additions to
python/tests/test_dataframe.pyandpython/tests/test_expr.py).Files changed (high level)
python/datafusion/expr.pyensure_expr,ensure_expr_list,_to_raw_expr.expr_list_to_raw_expr_listacceptstrin addition toExprand convert accordingly.sort_list_to_raw_sort_listacceptstrandSortKeyand convert entries to rawSortExpr.SortKeyalias andEXPR_TYPE_ERRORconstant for consistent error messages.python/datafusion/dataframe.pyensure_expr/ensure_expr_listto validate expressions infilter,with_column(s),join_on,aggregate, and other places.sort,aggregategroup-by, and related APIs via conversions.python/datafusion/context.pyfile_sort_orderas nestedSortKeysequences and added_convert_file_sort_orderto create the low-level representation for the Rust bindings.exprnamespace asexpr_internalwhere required.python/datafusion/functions.pySortKeyfororder_byparameters.docs/source/user-guide/dataframe/index.rstcol()/lit()expressions, with examples.python/tests/test_dataframe.py,python/tests/test_expr.pyNotable implementation details
order_byorgroup_by, we convert the string intoExpr.column(name)prior to calling the lower-level bindings.filter,with_column,join_on), passing a plain string now raisesTypeErrorwith the message:Use col()/column() or lit()/literal() to construct expressions(exposed viaEXPR_TYPE_ERROR). Tests assert on that message to ensure consistent behavior.Are these changes tested?
Yes. This PR adds and updates unit tests to cover:
sort,aggregategroup-by,window.order_by,array_agg(..., order_by=...),first_value/last_value/nth_value(..., order_by=...),lead/lag,row_number/rank/dense_rank/percent_rank/cume_dist/ntile, andSessionContextfile sort order conversions.Exprtypes to expression-only APIs (filter,with_column,with_columns,join_on, etc.).col("...")counterparts when allowed.Specifically modified/added tests include (non-exhaustive list):
python/tests/test_dataframe.py— many new tests:test_select_unsupported,test_sort_string_and_expression_equivalent,test_sort_unsupported,test_aggregate_string_and_expression_equivalent,test_aggregate_tuple_group_by,test_filter_string_unsupported,test_with_column_invalid_expr,test_with_columns_invalid_expr,test_join_on_invalid_expr,test_aggregate_invalid_aggs,test_order_by_string_equivalence, and file sort-order tests forregister_parquet,register_listing_table, andread_parquet.python/tests/test_expr.py— tests forensure_exprandensure_expr_listbehavior and error messages.Are there any user-facing changes?
Yes.
Behavioral / UX changes
Methods that naturally take column identifiers now accept plain
strvalues (for convenience):DataFrame.sort("col")DataFrame.aggregate(group_by="col", ...)(and group keys can be sequences/tuples of strings)order_byparameters in window and aggregate functions now accept string column names.SessionContextfile sort-order metadata accepts column name strings.Methods that inherently require expressions (computation or predicate) now enforce
Exprvalues and raise a clearTypeErrordirecting the user tocol()/column()orlit()/literal():DataFrame.filter— must pass anExpr(e.g.col("x") > lit(1)).DataFrame.with_column,with_columns— items must beExprobjects.DataFrame.join_on— ON predicates must beExpr(equality or other expressions).Documentation
Compatibility / API surface
SortKeywas introduced in the codebase to express the union of accepted types for sort-like parameters. This is an internal typing convenience and should not affect external users directly.Example usage
Notes for reviewers
expr.py(ensure_expr,ensure_expr_list,_to_raw_expr) and the conversions indataframe.pyandcontext.pythat call them. These are the core safety/validation changes.