Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion python/datafusion/dataframe.py
Original file line number Diff line number Diff line change
Expand Up @@ -415,7 +415,14 @@
Returns:
DataFrame with those columns removed in the projection.
"""
Comment on lines 420 to 427
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. This looks good. One request: Would you mind updating the docstring to specify that column case is respected and does not need double quotes like other operations such as select? You can also specify that leading and trailing " are allowable as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the docstring.
Based on the comment from @HeWhoHeWho should we open another issue for select(), drop(), sort() or should we expand the scope of this issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop is unique in that it only takes in column names. The others differ in that they take expressions. We have some syntactic sugar around them to allow passing in strings that get turned into expressions. From the perspective of datafusion, requiring the double quotes for select and sort is the expected behavior. It is also included in the online documentation with a call out box to emphasize it. So I think we are okay with just this PR as it is.

If people want the behavior of select and sort changed then we should consider adding an upstream change to the datafusion repo with a configuration setting to do this. I tested with the datafusion.sql_parser.enable_ident_normalization but that didn't seem to make an impact.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If people want the behavior of select and sort changed then we should consider adding an upstream change to the datafusion repo with a configuration setting to do this. I tested with the datafusion.sql_parser.enable_ident_normalization but that didn't seem to make an impact.

Say we go by - if Expression is used, then force the usage of double quote, otherwise allow select('col') or select('"col"'), same goes for sort.
Would this be a big change on upstream at datafusion repo?

return DataFrame(self.df.drop(*columns))
normalized_columns = []
for col in columns:
if col.startswith('"') and col.endswith('"'):
normalized_columns.append(col.strip('"')) # Removes quotes from both sides of col

Check failure on line 421 in python/datafusion/dataframe.py

View workflow job for this annotation

GitHub Actions / build

Ruff (E501)

python/datafusion/dataframe.py:421:89: E501 Line too long (97 > 88)
else:
normalized_columns.append(col)

Check failure on line 424 in python/datafusion/dataframe.py

View workflow job for this annotation

GitHub Actions / build

Ruff (W293)

python/datafusion/dataframe.py:424:1: W293 Blank line contains whitespace
return DataFrame(self.df.drop(*normalized_columns))

def filter(self, *predicates: Expr) -> DataFrame:
"""Return a DataFrame for which ``predicate`` evaluates to ``True``.
Expand Down
11 changes: 10 additions & 1 deletion python/tests/test_dataframe.py
Original file line number Diff line number Diff line change
Expand Up @@ -216,7 +216,16 @@
assert result.column(0) == pa.array([4, 5, 6])
assert result.column(1) == pa.array([1, 2, 3])


def test_drop_quoted_columns():
ctx = SessionContext()
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["ID_For_Students"])
df = ctx.create_dataframe([[batch]])

Check failure on line 223 in python/tests/test_dataframe.py

View workflow job for this annotation

GitHub Actions / build

Ruff (W293)

python/tests/test_dataframe.py:223:1: W293 Blank line contains whitespace
# Both should work
assert df.drop('"ID_For_Students"').schema().names == []
assert df.drop('ID_For_Students').schema().names == []

Check failure on line 226 in python/tests/test_dataframe.py

View workflow job for this annotation

GitHub Actions / build

Ruff (Q000)

python/tests/test_dataframe.py:226:20: Q000 Single quotes found but double quotes preferred


Check failure on line 228 in python/tests/test_dataframe.py

View workflow job for this annotation

GitHub Actions / build

Ruff (W293)

python/tests/test_dataframe.py:228:1: W293 Blank line contains whitespace
def test_select_mixed_expr_string(df):
df = df.select(column("b"), "a")

Expand Down
Loading