Support string column identifiers for sort/aggregate/window and stricter Expr validation #1221

kosiew · 2025-09-02T09:37:46Z

Which issue does this PR close?

Closes Potential bug(?): Inconsistent usage of column() / col() and literal() / lit() #1214

Rationale for this change

Users reported inconsistent behavior when using plain string column identifiers vs. explicit col()/column() and lit()/literal() expressions. Some DataFrame methods accepted bare strings (e.g. select, sort, drop), while expression-taking APIs (e.g. filter, with_column) required Expr objects. This led to confusion and surprising runtime errors.

This PR clarifies and documents the intended behavior by:

Accepting plain string column identifiers where passing a column name is natural (e.g. sort, aggregate grouping keys, many order_by parameters, file sort metadata).
Requiring explicit Expr objects for APIs that expect an expression (e.g. filter, with_column, with_columns elements, join_on predicates). When a non-Expr is passed to such APIs, a clear TypeError message guides the user to use col()/column() or lit()/literal().
Updating documentation and tests to reflect the semantics and show examples.

This makes usage consistent and explicit while preserving ergonomics where the intent is obvious (accepting column-name strings).

What changes are included in this PR?

High-level summary

Add robust helpers to validate and convert user-provided values:
- ensure_expr(value) — validate and return the internal expression object or raise TypeError with a helpful message.
- ensure_expr_list(iterable) — flatten and validate nested iterables of Expr objects.
- _to_raw_expr(value) — convert an Expr or str column name to the internal raw expression.
Introduce the SortKey = Expr | SortExpr | str type alias to represent items accepted by sort-like APIs.
Allow column-name strings in many APIs that logically take column identifiers: DataFrame.sort, DataFrame.aggregate (group-by keys), Window.order_by, various functions.* order_by parameters, and SessionContext file sort-order metadata.
Enforce explicit Expr values in expression-only APIs and provide clear error messages referencing col()/column() and lit()/literal() helpers.
Update Python doc (docs/source/user-guide/dataframe/index.rst) to explain which methods accept strings and which require col()/lit().
Add tests covering string acceptance and error cases (extensive additions to python/tests/test_dataframe.py and python/tests/test_expr.py).
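
A minimal sketch of how the three helpers listed above might behave. The Expr class here is a hypothetical stand-in rather than the real datafusion.expr.Expr, the raw expression is modelled as a plain tuple, and the error text is the message quoted elsewhere in this description. The actual code in python/datafusion/expr.py may differ, in particular in the message raised by _to_raw_expr.

```python
from collections.abc import Iterable
from typing import Any

# Message text as quoted in this PR description.
EXPR_TYPE_ERROR = "Use col()/column() or lit()/literal() to construct expressions"


class Expr:
    """Hypothetical stand-in for datafusion.expr.Expr (illustration only)."""

    def __init__(self, raw: tuple) -> None:
        self.expr = raw  # plays the role of the internal raw expression

    @staticmethod
    def column(name: str) -> "Expr":
        return Expr(("column", name))


def ensure_expr(value: Any) -> tuple:
    """Return the raw expression of an Expr, or raise TypeError."""
    if not isinstance(value, Expr):
        raise TypeError(EXPR_TYPE_ERROR)
    return value.expr


def ensure_expr_list(exprs: Iterable[Any]) -> list[tuple]:
    """Flatten nested iterables of Expr; strings are atomic and rejected."""

    def _flatten(items: Iterable[Any]):
        for item in items:
            # str/bytes are iterable too, so exclude them before recursing.
            if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
                yield from _flatten(item)
            else:
                yield ensure_expr(item)

    return list(_flatten(exprs))


def to_raw_expr(value: Any) -> tuple:
    """Permissive variant: a str is treated as a column name."""
    if isinstance(value, str):
        return Expr.column(value).expr
    # Assumption: other non-Expr values fail with the same message.
    return ensure_expr(value)
```

In this sketch to_raw_expr("a") and to_raw_expr(Expr.column("a")) produce the same raw value, while ensure_expr("a") raises TypeError. That is the split between string-accepting and expression-only APIs described above.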

Files changed (high level)

python/datafusion/expr.py
- Added helpers: ensure_expr, ensure_expr_list, _to_raw_expr.
- Made expr_list_to_raw_expr_list accept str in addition to Expr and convert accordingly.
- Made sort_list_to_raw_sort_list accept str and SortKey and convert entries to raw SortExpr.
- Added SortKey alias and EXPR_TYPE_ERROR constant for consistent error messages.
python/datafusion/dataframe.py
- Use ensure_expr/ensure_expr_list to validate expressions in filter, with_column(s), join_on, aggregate, and other places.
- Allow column name strings for sort, aggregate group-by, and related APIs via conversions.
- Added user-facing docstrings clarifying expected types for parameters and examples.
python/datafusion/context.py
- Accept file_sort_order as nested SortKey sequences and added _convert_file_sort_order to create the low-level representation for the Rust bindings.
- Import the internal expr namespace as expr_internal where required.
python/datafusion/functions.py
- Updated numerous function signatures to accept SortKey for order_by parameters.
- Updated docstrings and added usage examples where helpful.
docs/source/user-guide/dataframe/index.rst
- New section: "String Columns and Expressions" documenting which methods accept plain strings and which require explicit col()/lit() expressions, with examples.
python/tests/test_dataframe.py, python/tests/test_expr.py
- Added many tests exercising both permissive string-accepting behavior and strict expression validation.
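
As a rough illustration of the nested SortKey conversion mentioned for context.py, the sketch below normalizes a list of lists of sort keys. Expr, SortExpr, to_raw_sort and convert_file_sort_order are hypothetical stand-ins written for this example. The default ordering for bare keys and the handling of None are assumptions, not a description of the real _convert_file_sort_order.

```python
from collections.abc import Sequence
from typing import Optional, Union


class SortExpr:
    """Hypothetical stand-in for a sort expression (column plus ordering)."""

    def __init__(self, name: str, ascending: bool = True, nulls_first: bool = True) -> None:
        self.name = name
        self.ascending = ascending
        self.nulls_first = nulls_first

    def raw(self) -> tuple:
        # Stand-in for the low-level object handed to the Rust bindings.
        return (self.name, self.ascending, self.nulls_first)


class Expr:
    """Hypothetical stand-in for a column expression."""

    def __init__(self, name: str) -> None:
        self.name = name

    def sort(self, ascending: bool = True, nulls_first: bool = True) -> SortExpr:
        return SortExpr(self.name, ascending, nulls_first)


# Alias as given in the PR description.
SortKey = Union[Expr, SortExpr, str]


def to_raw_sort(key: SortKey) -> tuple:
    """Normalize one sort key; a str is treated as a column name."""
    if isinstance(key, str):
        key = Expr(key)
    if isinstance(key, Expr):
        # Assumption: keys without explicit ordering sort ascending, nulls first.
        key = key.sort()
    if not isinstance(key, SortExpr):
        raise TypeError(f"Expected Expr, SortExpr, or str, got {type(key).__name__}")
    return key.raw()


def convert_file_sort_order(
    file_sort_order: Optional[Sequence[Sequence[SortKey]]],
) -> list[list[tuple]]:
    """Convert nested sort keys, preserving the outer/inner grouping."""
    if file_sort_order is None:
        return []  # assumption: no declared ordering maps to an empty list
    return [[to_raw_sort(key) for key in group] for group in file_sort_order]
```

With these stand-ins, convert_file_sort_order([["a", Expr("b")]]) treats the string and the Expr the same way, which is the equivalence the file sort-order tests are described as checking.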

Notable implementation details

When a user supplies a plain string for sort/aggregate/window order_by or group_by, we convert the string into Expr.column(name) prior to calling the lower-level bindings.
For APIs where an expression is required (e.g., filter, with_column, join_on), passing a plain string now raises TypeError with the message: Use col()/column() or lit()/literal() to construct expressions (exposed via EXPR_TYPE_ERROR). Tests assert on that message to ensure consistent behavior.

Are these changes tested?

Yes. This PR adds and updates unit tests to cover:

Accepting column-name strings for sort, aggregate group-by, window.order_by, array_agg(..., order_by=...), first_value/last_value/nth_value(..., order_by=...), lead/lag, row_number/rank/dense_rank/percent_rank/cume_dist/ntile, and SessionContext file sort order conversions.
Error paths for passing non-Expr types to expression-only APIs (filter, with_column, with_columns, join_on, etc.).
Tests ensure that string-based usages are equivalent to their col("...") counterparts when allowed.

Specifically modified/added tests include (non-exhaustive list):

python/tests/test_dataframe.py — many new tests: test_select_unsupported, test_sort_string_and_expression_equivalent, test_sort_unsupported, test_aggregate_string_and_expression_equivalent, test_aggregate_tuple_group_by, test_filter_string_unsupported, test_with_column_invalid_expr, test_with_columns_invalid_expr, test_join_on_invalid_expr, test_aggregate_invalid_aggs, test_order_by_string_equivalence, and file sort-order tests for register_parquet, register_listing_table, and read_parquet.
python/tests/test_expr.py — tests for ensure_expr and ensure_expr_list behavior and error messages.

Are there any user-facing changes?

Yes.

Behavioral / UX changes

Methods that naturally take column identifiers now accept plain str values (for convenience):
- DataFrame.sort("col")
- DataFrame.aggregate(group_by="col", ...) (and group keys can be sequences/tuples of strings)
- Many order_by parameters in window and aggregate functions now accept string column names.
- SessionContext file sort-order metadata accepts column name strings.
Methods that inherently require expressions (computation or predicate) now enforce Expr values and raise a clear TypeError directing the user to col()/column() or lit()/literal():
- DataFrame.filter — must pass an Expr (e.g. col("x") > lit(1)).
- DataFrame.with_column, with_columns — items must be Expr objects.
- DataFrame.join_on — ON predicates must be Expr (equality or other expressions).

Documentation

The user guide now contains a new section "String Columns and Expressions" explaining the above and showing usage examples. This should reduce confusion for users migrating from other libraries like Polars and make the library semantics explicit.

Compatibility / API surface

This is backwards-compatible for most users: previously-accepted code that used strings where allowed will continue to work. Code that accidentally passed strings to expression-only APIs will now be rejected with clearer errors instead of producing surprising behavior.
A new type alias SortKey was introduced in the codebase to express the union of accepted types for sort-like parameters. This is an internal typing convenience and should not affect external users directly.

Example usage

from datafusion import col, lit, functions as f

# Allowed: passing a column name string to sort or aggregate grouping
df.sort("id")
df.aggregate("id", [f.count(col("value"))])

# Required: expressions for predicate/transform
# Bad: df.filter("age > 21")  # raises TypeError
# Good:
df.filter(col("age") > lit(21))

# Window example: order_by accepts string
f.first_value(col("a"), order_by="ts")

Notes for reviewers

Focus review on the helpers in expr.py (ensure_expr, ensure_expr_list, _to_raw_expr) and the conversions in dataframe.py and context.py that call them. These are the core safety/validation changes.
The bulk of the diff is tests and doc changes which should be reviewed for correctness and clarity.

- Refactor expression handling and `_simplify_expression` for stronger type checking and clearer error handling
- Improve type annotations for `file_sort_order` and `order_by` to support string inputs
- Refactor DataFrame `filter` method to better validate predicates
- Replace internal error message variable with public constant
- Clarify usage of `col()` and `column()` in DataFrame examples

…handling
- Update `order_by` handling in Window class for better type support
- Improve type checking in DataFrame expression handling
- Replace `Expr`/`SortExpr` with `SortKey` in file_sort_order and related functions
- Simplify file_sort_order handling in SessionContext
- Rename `_EXPR_TYPE_ERROR` → `EXPR_TYPE_ERROR` for consistency
- Clarify usage of `col()` vs `column()` in DataFrame examples
- Enhance documentation for file_sort_order in SessionContext

…ng, sorting, and docs
- Introduce `ensure_expr` helper and improve internal expression validation
- Update error messages and tests to consistently use `EXPR_TYPE_ERROR`
- Refactor expression handling with `_to_raw_expr`, `_ensure_expr`, and `SortKey`
- Improve type safety and consistency in sort key definitions and file sort order
- Add parameterized parquet sorting tests
- Enhance DataFrame docstrings with clearer guidance and usage examples
- Fix minor typos and error message clarity

…ation
- Introduced `ensure_expr_list` to validate and flatten nested expressions, treating strings as atomic
- Updated expression utilities to improve consistency across aggregation and window functions
- Consolidated and expanded parameterized tests for string equivalence in ranking and window functions
- Exposed `EXPR_TYPE_ERROR` for consistent error messaging across modules and tests
- Improved internal sort logic using `expr_internal.SortExpr`
- Clarified expectations for `join_on` expressions in documentation
- Standardized imports and improved test clarity for maintainability


HeWhoHeWho · 2025-09-03T04:57:20Z

Hi, thanks for the PR.

# Required: expressions for predicate/transform ... # Good: df.filter(col("age") > lit(21))

One thing to check: lit() or literal() will remain optional for plain values, as below, right?
Example: df.filter(col('A') > 123) or df.filter(col('B') == 'Jack') # These should be good as well
This behaviour is allowed in Polars, and the current DataFusion version also supports it, so I am checking whether this PR changes it.

kosiew · 2025-09-03T15:12:07Z

hi @HeWhoHeWho

Yes. Comparisons and arithmetic on an Expr automatically coerce plain Python values to literals, so you can write:

df.filter(col("A") > 123)
df.filter(col("B") == "Jack")

without explicitly wrapping 123 or "Jack" in lit()/literal().

Internally, each operator checks whether the right‑hand side is already an Expr; if not, it calls Expr.literal to convert the value before performing the operation.
Consequently, lit() and literal() remain available but are optional for simple constants.
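
The coercion described above can be sketched with a small stand-in class. This is not the real datafusion Expr: the class below is hypothetical and records expressions as nested tuples, purely to show the "wrap non-Expr operands in a literal" step.

```python
from typing import Any


class Expr:
    """Hypothetical stand-in; stores an expression as a nested tuple."""

    def __init__(self, node: tuple) -> None:
        self.node = node

    @staticmethod
    def column(name: str) -> "Expr":
        return Expr(("column", name))

    @staticmethod
    def literal(value: Any) -> "Expr":
        return Expr(("literal", value))

    def _binary(self, op: str, rhs: Any) -> "Expr":
        # Plain Python values are converted to literals before combining.
        if not isinstance(rhs, Expr):
            rhs = Expr.literal(rhs)
        return Expr((op, self.node, rhs.node))

    def __gt__(self, rhs: Any) -> "Expr":
        return self._binary(">", rhs)

    def __eq__(self, rhs: Any) -> "Expr":  # type: ignore[override]
        return self._binary("==", rhs)


col, lit = Expr.column, Expr.literal

implicit = col("A") > 123       # right-hand side wrapped automatically
explicit = col("A") > lit(123)  # same tree, literal built by hand
```

In this sketch implicit.node and explicit.node are identical, which mirrors why col("A") > 123 and col("A") > lit(123) are interchangeable.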

timsaucer

I think this is a good improvement overall, and making our approach explicit is a good idea. I do think there is precedent in plenty of other libraries where some functions take strings and assume column names and others require you to provide expressions, so I feel comfortable in what we have here. I do have a couple of suggestions about places where the documentation feels a little awkward. Overall, a very nice addition.

timsaucer · 2025-09-14T12:19:03Z

docs/source/user-guide/dataframe/index.rst

+.. code-block:: python
+
+    from datafusion import col, lit
+    df.filter(col('age') > lit(21))
+
+Without ``lit()`` DataFusion would treat ``21`` as a column name rather than a
+constant value.


Is this statement true? Would df.filter(col('age') > 21) treat 21 as a column name? I think that would be a change in how the comparison operator works.

You're right.

The comparison operators on Expr automatically convert any non-Expr value into a literal expression.
I will correct the documentation.

timsaucer · 2025-09-14T12:20:39Z

docs/source/user-guide/dataframe/index.rst

    # Drop columns
    df = df.drop("temporary_column")

+String Columns and Expressions


I think the title here is misleading. "String Columns" to me would mean columns that contain string values. I think maybe we should call this something like "Function arguments taking column names" or "Column names as function arguments"

I will correct this.

timsaucer · 2025-09-14T12:21:03Z

docs/source/user-guide/dataframe/index.rst

+String Columns and Expressions
+------------------------------
+
+Some ``DataFrame`` methods accept plain strings when an argument refers to an


recommend "plain strings" -> "column names"

I will correct this.


timsaucer · 2025-09-14T12:30:07Z

python/datafusion/dataframe.py

+        :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not
+        accepted. If more complex logic is required, see the logical operations in
+        :py:mod:`~datafusion.functions`.


I would remove this part that says "plain strings are not accepted". When you view it from the context of this PR and Issue, the statement makes sense. But if you are an external user coming across this function definition for the first time, I don't think the statement makes sense. I think you can just remove the change here.

I will correct this.

timsaucer · 2025-09-14T12:30:33Z

python/datafusion/dataframe.py

+        :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not
+        accepted.


Same comment as before.

timsaucer · 2025-09-14T12:30:49Z

python/datafusion/dataframe.py

-        pass named expressions use the form name=Expr.
+        By passing expressions, iterables of expressions, or named expressions.
+        All expressions must be :class:`~datafusion.expr.Expr` objects created via
+        :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not


Same comment as before

timsaucer · 2025-09-14T12:32:19Z

python/datafusion/dataframe.py

-        On expressions are used to support in-equality predicates. Equality
-        predicates are correctly optimized
+        Join predicates must be :class:`~datafusion.expr.Expr` objects, typically
+        built with :func:`datafusion.col`; plain strings are not accepted. On


Same comment as before

timsaucer · 2025-09-14T12:34:34Z

python/datafusion/expr.py

+
 def expr_list_to_raw_expr_list(
-    expr_list: Optional[list[Expr] | Expr],
+    expr_list: Optional[_typing.Union[Sequence[_typing.Union[Expr, str]], Expr, str]],


Why the change from | to _typing.Union? I'm not completely against it, but it seems we're inconsistent in our approach.

FWIW I find the _typing.Union to be an eyesore. Would it be possible to just import Union if we're going in this direction?

Also the | is the preferred syntax in Python 3.10 and later and 3.9 reaches end of life next month. Maybe we just stick with | and at the end of October update our minimum version to 3.10. What do you think?

I will amend this.


kosiew · 2025-09-15T04:25:58Z

@timsaucer
Thanks for the review and feedback.
I have made the suggested changes.

timsaucer

Thank you for another excellent PR!

kosiew added 11 commits September 2, 2025 16:20

- refactor: update docstring for sort_or_default function to clarify its purpose (31a648f)
- fix Ruff errors (37307b0)
- refactor: update type hints to use typing.Union for better clarity and consistency (05cd237)
- fix Ruff errors (28619d9)
- refactor: simplify type hints by removing unnecessary imports for type checking (9adbf4f)
- refactor: update type hints for rex_type and types methods to improve clarity (0a27617)
- refactor: remove unnecessary type ignore comments from rex_type and types methods (92bc68e)

timsaucer reviewed Sep 14, 2025


kosiew added 9 commits September 15, 2025 11:11

- docs: update section title for clarity on DataFrame method arguments (7258428)
- docs: clarify description of DataFrame methods accepting column names (c3e2a04)
- docs: add note to clarify function documentation reference for DataFrame methods (38043fa)
- docs: remove outdated information about predicate acceptance in DataFrame class (d78fdef)
- refactor: simplify type hint for expr_list parameter in expr_list_to_raw_expr_list function (992d619)
- docs: clarify usage of datafusion.col and datafusion.lit in DataFrame method documentation (bd0f57e)
- docs: clarify usage of col() and lit() in DataFrame filter examples (2b813bf)
- Merge branch 'main' into col-1214 (93c81fa)
- Fix ruff errors (9aa9985)

kosiew requested a review from timsaucer September 15, 2025 04:26

timsaucer approved these changes Sep 16, 2025


timsaucer merged commit d54dc4a into apache:main Sep 16, 2025
16 checks passed
