Support string column identifiers for sort/aggregate/window and stricter Expr validation #1221
Changes from 11 commits
@@ -126,6 +126,53 @@ DataFusion's DataFrame API offers a wide range of operations:
    # Drop columns
    df = df.drop("temporary_column")

String Columns and Expressions
------------------------------

Some ``DataFrame`` methods accept plain strings when an argument refers to an
existing column. These include:

* :py:meth:`~datafusion.DataFrame.select`
* :py:meth:`~datafusion.DataFrame.sort`
* :py:meth:`~datafusion.DataFrame.drop`
* :py:meth:`~datafusion.DataFrame.join` (``on`` argument)
* :py:meth:`~datafusion.DataFrame.aggregate` (grouping columns)

Note that :py:meth:`~datafusion.DataFrame.join_on` expects ``col()``/``column()``
expressions rather than plain strings.

For such methods, you can pass column names directly:

.. code-block:: python

    from datafusion import col, functions as f

    # Sort by the ``id`` column, named with a plain string
    df.sort('id')
    # Group by ``id`` and count the values in each group
    df.aggregate('id', [f.count(col('value'))])

The same operation can also be written with explicit column expressions, using
either ``col()`` or ``column()``:

.. code-block:: python

    from datafusion import col, column, functions as f

    df.sort(col('id'))
    df.aggregate(column('id'), [f.count(col('value'))])

Note that ``column()`` is an alias of ``col()``, so you can use either name; the
example above shows both in action.

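One rough way to picture how a method can accept either form is a small
normalization step that turns plain strings into column expressions and rejects
anything else. The sketch below is a self-contained illustration only, not
DataFusion's implementation: ``Column`` and ``as_column`` are hypothetical names
defined here purely for the example.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class Column:
        """Hypothetical stand-in for a column expression such as col('id')."""

        name: str


    def as_column(arg):
        """Return ``arg`` as a Column, converting plain strings (illustrative helper)."""
        if isinstance(arg, Column):
            return arg  # already an expression: pass it through unchanged
        if isinstance(arg, str):
            return Column(arg)  # a plain string is read as a column name
        # Anything else is rejected rather than guessed at.
        raise TypeError(
            f"expected a column name or Column, got {type(arg).__name__}"
        )


    # Both spellings normalize to the same value.
    same = as_column("id") == as_column(Column("id"))

Under these assumptions, a string and an explicit expression end up equivalent,
while an unsupported argument such as ``21`` raises ``TypeError`` early.
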
Whenever an argument represents an expression, such as in
:py:meth:`~datafusion.DataFrame.filter` or
:py:meth:`~datafusion.DataFrame.with_column`, use ``col()`` to reference columns
and wrap constant values with ``lit()`` (also available as ``literal()``):

.. code-block:: python

    from datafusion import col, lit

    df.filter(col('age') > lit(21))

Here ``col('age')`` refers to an existing column, while ``lit(21)`` marks ``21``
explicitly as a constant value rather than a column name.

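The distinction between a column reference and a constant can be sketched with a
toy expression tree. This is a minimal illustration under stated assumptions, not
how DataFusion represents expressions: ``Col``, ``Lit``, ``Gt`` and ``evaluate``
are hypothetical names invented for the example.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Any


    @dataclass(frozen=True)
    class Lit:
        """Hypothetical constant node, standing in for lit(21)."""

        value: Any


    @dataclass(frozen=True)
    class Col:
        """Hypothetical column-reference node, standing in for col('age')."""

        name: str

        def __gt__(self, other):
            # Build a comparison node instead of evaluating anything.
            return Gt(self, other)


    @dataclass(frozen=True)
    class Gt:
        """Hypothetical 'greater than' node holding two operands."""

        left: Any
        right: Any


    def evaluate(node, row):
        """Evaluate a node against one row given as a dict (illustrative only)."""
        if isinstance(node, Col):
            return row[node.name]  # column reference: look the value up in the row
        if isinstance(node, Lit):
            return node.value  # constant: use the wrapped value as-is
        if isinstance(node, Gt):
            return evaluate(node.left, row) > evaluate(node.right, row)
        raise TypeError(f"unsupported node: {node!r}")


    predicate = Col("age") > Lit(21)
    rows = [{"age": 18}, {"age": 30}]
    kept = [row for row in rows if evaluate(predicate, row)]

In this toy model the two node types are resolved differently at evaluation time,
which is the role ``col()`` and ``lit()`` play in the expression above.
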
Terminal Operations
-------------------

I think the title here is misleading. "String Columns" to me would mean columns that contain string values. I think maybe we should call this something like "Function arguments taking column names" or "Column names as function arguments"
I will correct this.