Contributor

@K-dash K-dash commented Oct 14, 2025

Which issue does this PR close?

Closes #1273

Rationale for this change

Users have requested Spark-like support for DataFrame.filter("a > 1") so they can reuse existing SQL predicate strings without converting them to expression objects.

What changes are included in this PR?

  • Allow DataFrame.filter to normalize SQL string predicates via parse_sql_expr before dispatching to the internal API.
  • Add regression tests covering pure SQL strings, mixed string/Expr predicates, and invalid SQL errors.
  • Update the DataFrame user guide to mention SQL string filtering.
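
The normalization step in the first bullet can be sketched in isolation. This is a minimal, self-contained illustration and not the code from the PR: `Expr`, `parse_sql_expr`, and `normalize_predicates` below are toy stand-ins (assumptions) that only mirror the names used above, and the "parser" simply wraps the string instead of parsing SQL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Toy stand-in for an expression object (hypothetical, not datafusion.Expr)."""

    sql: str


def parse_sql_expr(text: str) -> Expr:
    """Toy stand-in for a SQL parser: rejects blank input, otherwise wraps the text."""
    if not text.strip():
        raise ValueError("empty SQL predicate")
    return Expr(text.strip())


def normalize_predicates(*predicates):
    """Turn every str predicate into an Expr; pass Expr objects through unchanged."""
    normalized = []
    for p in predicates:
        if isinstance(p, str):
            normalized.append(parse_sql_expr(p))
        elif isinstance(p, Expr):
            normalized.append(p)
        else:
            raise TypeError(f"expected Expr or str, got {type(p).__name__}")
    return normalized


# Mixed string/Expr input, as in the regression tests described above.
mixed = normalize_predicates("a > 1", Expr("b < 10"))
```

After this step every predicate has the same type, so the code that dispatches to the internal API does not need to know that strings were ever accepted.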

Are there any user-facing changes?

DataFrame.filter now accepts SQL string predicates in addition to Expr objects, and the documentation reflects this capability. No breaking API changes.

@milenkovicm
Contributor

If @timsaucer agrees, can we expand the scope beyond filter to include other similar methods that are not too hard to implement? I think join_on takes expressions.

@timsaucer
Member

> If @timsaucer agrees, can we expand the scope beyond filter to include other similar methods that are not too hard to implement? I think join_on takes expressions.

join_on may be a bit tricky. It is unclear to me which DataFrame we should parse the expressions against. When the SQL being parsed selects columns, the parser returns qualified columns, so an expression evaluated against the wrong DataFrame could end up in a bad state.
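
The ambiguity can be illustrated with a toy model. This sketch does not use or describe the real parser: `QualifiedColumn` and `resolve` are hypothetical helpers, and the relation names "left" and "right" are made up. It only shows that the same bare column name resolves to different qualified columns depending on which side it is parsed against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedColumn:
    """Toy model of a column reference that remembers its source relation."""

    relation: str
    name: str


def resolve(name: str, relation: str, columns: set[str]) -> QualifiedColumn:
    """Resolve a bare column name against one relation (hypothetical helper)."""
    if name not in columns:
        raise KeyError(f"column {name!r} not found in {relation!r}")
    return QualifiedColumn(relation, name)


# Two hypothetical join inputs that both expose a column called "id".
left_cols = {"id", "customer_id"}
right_cols = {"id", "name"}

# The same string yields a different qualified column on each side.
from_left = resolve("id", "left", left_cols)
from_right = resolve("id", "right", right_cols)
```

In this model a string such as "id" has no single correct meaning for a join condition until the caller says which input it refers to.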

That being said, I am not at all opposed to looking at other places in DataFrame that could get similar treatment.

@milenkovicm
Contributor

> join_on may be a bit tricky. It is unclear to me which DataFrame we should parse the expressions against. When the SQL being parsed selects columns, the parser returns qualified columns, so an expression evaluated against the wrong DataFrame could end up in a bad state.

I missed that important case

> That being said, I am not at all opposed to looking at other places in DataFrame that could get similar treatment.

@K-dash would you be interested in investigating?

@timsaucer
Member

FWIW I did a quick test with this:

--- a/python/datafusion/dataframe.py
+++ b/python/datafusion/dataframe.py
@@ -424,7 +424,9 @@ class DataFrame:
             df = df.select("a", col("b"), col("a").alias("alternate_a"))

         """
-        exprs_internal = expr_list_to_raw_expr_list(exprs)
+        expr_list = [self.parse_sql_expr(e) if isinstance(e, str) else e for e in exprs]
+
+        exprs_internal = expr_list_to_raw_expr_list(expr_list)
         return DataFrame(self.df.select(*exprs_internal))

With that you can do df.select("a-b"). It feels very intuitive to me.

@K-dash
Contributor Author

K-dash commented Oct 14, 2025

Thanks for sharing the snippet—being able to call df.select("a - b") does feel very natural.
I’m definitely interested in digging deeper into this. However, since including select and other APIs would likely expand the scope too much,
I'd like to keep this PR focused on filter. I’ll investigate and report back in the next PR if that works for you.

@milenkovicm
Contributor

> FWIW I did a quick test with this:
>
> --- a/python/datafusion/dataframe.py
> +++ b/python/datafusion/dataframe.py
> @@ -424,7 +424,9 @@ class DataFrame:
>              df = df.select("a", col("b"), col("a").alias("alternate_a"))
>
>          """
> -        exprs_internal = expr_list_to_raw_expr_list(exprs)
> +        expr_list = [self.parse_sql_expr(e) if isinstance(e, str) else e for e in exprs]
> +
> +        exprs_internal = expr_list_to_raw_expr_list(expr_list)
>          return DataFrame(self.df.select(*exprs_internal))
>
> With that you can do df.select("a-b"). It feels very intuitive to me.

Should we roll back df.select_expr and do this instead, @timsaucer? It makes sense to me.

@timsaucer
Member

> Should we roll back df.select_expr and do this instead, @timsaucer? It makes sense to me.

Yes, but no. The problem with that snippet is that I think it will fail for people (like me) who have column names that are not SQL-parseable. Those names should still work by being turned into column expressions.
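
One conceivable way to keep such column names working is to check the string against the existing column names before treating it as SQL. The sketch below is only an illustration of that idea under stated assumptions, not something proposed or adopted in this thread: `normalize_select_arg` is a hypothetical helper, and it returns a tag instead of building real expressions.

```python
def normalize_select_arg(arg: str, column_names: set[str]) -> tuple[str, str]:
    """Classify a string argument as a column reference or a SQL expression.

    Returns ("column", arg) when arg exactly matches an existing column name,
    otherwise ("sql", arg) to signal that it should go to the SQL parser.
    """
    if arg in column_names:
        return ("column", arg)
    return ("sql", arg)


# Hypothetical schema containing a name that would not parse as a bare SQL identifier.
columns = {"a", "b", "total-cost (USD)"}

as_column = normalize_select_arg("total-cost (USD)", columns)
as_sql = normalize_select_arg("a - b", columns)
```

The trade-off of this policy is that a column literally named "a - b" would shadow the arithmetic expression with the same spelling.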

Contributor

@milenkovicm milenkovicm left a comment

I think this makes sense, but let's wait for @timsaucer.

Member

@timsaucer timsaucer left a comment

Thank you @K-dash and @milenkovicm !

@timsaucer timsaucer merged commit fe0cf8c into apache:main Oct 15, 2025
16 checks passed

Development

Successfully merging this pull request may close these issues.

Change DataFrame.filter(predicates: Expr | str)

3 participants