[SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown #46143
Conversation
a97d56f to f1d1ddd (force-push)
|
CC @cloud-fan do you have thoughts / cycles? |
|
+CC @shardulm94 |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
ac85ead to a5d8400 (force-push)
|
Hi @cloud-fan, it looks like the "with" suggestion ended up being more complicated than originally anticipated (see #46499 (comment)). In the interest of progress, and to avoid double evaluation of a lot of really expensive work we don't need, I intend to update this PR and merge it. We can still circle back to the with approach eventually. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
|
Just wondering if we have a consensus on the best way to go about this @zml1206 / @cloud-fan? I'm thinking, given it's been over a year since the "with" change, it might be more complicated than we originally thought. I can re-explore as well if @zml1206 is busy, but we could also go for the simpler solution in the meantime since double UDF evaluation is bad. |
|
Hi @holdenk, we tried very hard to solve this issue efficiently but failed. The idea was to let Filter carry a project list and push them down together, but when we push through a Project/Aggregate which also contains a project list, we may still hit expression duplication and need to make a decision based on cost. Sorry, I should have come back to this PR earlier. I think we can simplify it a bit, as we will likely never have a practical cost model for Spark expressions. Let's just avoid UDF expression (extends marker expression |
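A minimal sketch of the direction described above, as an editorial illustration rather than the PR's actual code: it assumes the marker trait being referred to is Catalyst's UserDefinedExpression, and safeToPush is a hypothetical helper name; the alias substitution mirrors what the existing replaceAlias helper does.

import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap, Expression, UserDefinedExpression}

// Substitute the Project's aliases into the predicate, then check whether the
// result contains any user-defined expression. If it does, pushing the filter
// below the Project would duplicate the UDF's evaluation, so keep it above.
def safeToPush(predicate: Expression, aliasMap: AttributeMap[Alias]): Boolean = {
  val substituted = predicate.transformUp {
    case a: Attribute if aliasMap.contains(a) => aliasMap(a).child
  }
  substituted.find(_.isInstanceOf[UserDefinedExpression]).isEmpty
}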
|
Sounds like a plan, I'll work on simplifying this code. |
…n an 'expensive' projected operation (rlike) Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… we can't do partial on the || Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… (idk seems better than the hack of determining if something is eligible for pushdown, but we could also use that maybe? idk) & start updating optimizer filter pushdown past a project for _partial_ pushdowns. Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ct to 'save' Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ushDown Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… projection that we are using Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…o aliases Co-authored-by: Holden Karau <holden@pigscanfly.ca>
|
To discuss #46143 (comment) further:
That's why my initial suggestion was to not do this optimization at all. We just keep the |
So just always leave up complex filters and don't attempt to split them if needed? I think that's sub-optimal for fairly self-evident reasons, but if you still find the current implementation too complex I could move it into a follow-on PR so there's less to review here and we just fix the perf regression introduced in 3.0. |
|
A followup SGTM, at least we can fix the perf regression first. |
|
Awesome, I'll rework this then :) |
…pushdown-split-projection Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ombos) Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… so need to ref it. Co-authored-by: Holden Karau <holden@pigscanfly.ca>
814405a to afbc2ea (force-push)
|
@cloud-fan updated to just fix the regression. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
val resolvedCheapFilter = replaceAlias(combinedCheapFilter, aliasMap)
val baseChild: LogicalPlan = Filter(resolvedCheapFilter, child = grandChild)
// Insert a last projection to match the desired column ordering and
// evaluate any stragglers and select the already computed columns.
hmm why is this needed if we only push a Filter through a Project?
  (cond, AttributeMap.empty[Alias])
} else {
  val (replaced, usedAliases) = replaceAliasWhileTracking(cond, aliasMap)
  (cond, usedAliases)
We return cond instead of replaced, which indicates that the new function replaceAliasWhileTracking is not necessary for this new simplified version.
I think we can make it very simple:
val (stayUp, pushDown) = splitConjunctivePredicates(condition).partition { predicate =>
  replaceAlias(predicate, aliasMap).expensive
}
(stayUp, pushDown) match {
  case (_, Nil) =>
    // Nothing to push down, keep the same filter.
    filter
  case (Nil, _) =>
    // Push all
    project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
  case _ =>
    Filter(
      stayUp.reduce(And),
      project.copy(
        child = Filter(replaceAlias(pushDown.reduce(And), aliasMap), grandChild)
      )
    )
}
So the replaceAliasWhileTracking API gives us back the usedAliases (which we return) and replaceAlias does not. If we just look at whether the substituted expression is expensive, rather than the aliases it uses, we'll incorrectly leave up a filter which does an expensive regex over a projected inexpensive column. I'll write this in the comments though.
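To illustrate the distinction with a sketch (not code from the PR; testStringRelation and the Catalyst test DSL are assumed from the suite's tests quoted later in this thread):

// The condition is expensive looking (an rlike), but the only alias it
// references is `a AS c`, which is cheap to duplicate. Deciding based on the
// whole substituted condition would wrongly keep this filter above the
// Project; deciding based on the aliases it actually uses lets it be pushed.
val query = testStringRelation
  .select($"a" as "c", $"e".rlike("magic") as "f", $"b" as "d")
  .where($"c".rlike("^expensive.*pattern$"))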
…pushdown-split-projection
…but still remember this is a best effort heuristic
…g alias replace and discussions around why not to use replaceAliasWhileTracking
val usedAliasesForCondition = splitCondition.map { cond =>
  // If the legacy double evaluation behavior is enabled we just say
  // every filter is "free."
  if (!SQLConf.get.avoidDoubleFilterEval) {
If avoidDoubleFilterEval is false we can just return project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild)), to make sure the legacy code path is exactly the same as before.
newElem match {
  case None => a
  case Some(b) =>
    replaced += (a, b)
we should skip this operation if the AttributeMap already contains a. It does not change the result but is more efficient.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
// For all filter which do not reference any expensive aliases then
// just push the filter while resolving the non-expensive aliases.
val combinedCheapFilter = cheap.reduce(And)
val baseChild: LogicalPlan = Filter(combinedCheapFilter, child = grandChild)
Suggested change:
- val baseChild: LogicalPlan = Filter(combinedCheapFilter, child = grandChild)
+ val baseChild = Filter(combinedCheapFilter, child = grandChild)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
comparePlans(optimized, correctAnswer)
}

test("SPARK-47672: Avoid double evaluation with projections can't push past certain items")
Is this just a more complicated case of the test "SPARK-47672: Make sure that we handle the case where everything is expensive"?
Pretty much, makes sure we handle the split correctly though.
| test("SPARK-47672: Case 1 - multiple filters not referencing projection aliases") { | ||
| val originalQuery = testStringRelation | ||
| .select($"a" as "c", $"e".rlike("magic") as "f", $"b" as "d") | ||
| .where($"c" > 5 && $"d" < 10) |
c and d do reference projection aliases, shall we use .select($"a", $"b", ...)?
comparePlans(optimized, correctAnswer)
}

// Case 2: Multiple filters with inexpensive references - all should be pushed
do we have a test case for the mixed cases?
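For reference, a mixed case might look like the following sketch (not from the PR; it reuses testStringRelation and the DSL from the quoted test above). The expected outcome would be that the conjunct on the cheap alias d is pushed below the Project while the conjunct on the rlike alias f stays above it.

val mixedQuery = testStringRelation
  .select($"a" as "c", $"e".rlike("magic") as "f", $"b" as "d")
  .where($"d" < 10 && $"f")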
…ing into the tracking map only add if not present (same result, possibly faster, since the value is always the same), remove val result = ... (was useful for debugging).
…nd push down correctly. test
What changes were proposed in this pull request?
Changes the filter pushdown optimizer so that it does not push a filter down past a projection when we reasonably expect that the projected element the filter references is expensive to compute.
This is a slightly more complex alternative to #45802 which also moves parts of projections down so that the filters can move further down.
An expression can indicate that it is expensive enough that double evaluation as a result of pushdown is not worth the potential savings (by default we treat all UDFs this way).
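As a concrete illustration of the double evaluation being avoided (a sketch only; spark is assumed to be an active SparkSession and slowUdf stands in for any expensive UDF):

import org.apache.spark.sql.functions.{col, udf}

// An expensive UDF behind a projection, filtered on its output.
val slowUdf = udf((s: String) => { Thread.sleep(10); s.length })
val df = spark.range(100)
  .selectExpr("cast(id as string) AS s")
  .select(slowUdf(col("s")).as("len"))
  .filter(col("len") > 1)

// Before this change, pushing the Filter below the Project rewrites the
// condition in terms of the UDF, so the optimized plan can evaluate slowUdf
// twice per row: once in the pushed-down Filter and once in the Project.
df.explain(true)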
Future Work / What else remains to do?
Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the thing we reference is expensive rather than the condition itself. We could do this as a follow-up item or as part of this PR.
Why are the changes needed?
Currently Spark may double-compute expensive operations (like JSON parsing, UDF evaluation, etc.) as a result of filter pushdown past projections.
Does this PR introduce any user-facing change?
This SQL optimizer change may impact some user queries; results should be the same and hopefully a little faster.
How was this patch tested?
New tests were added to FilterPushdownSuite, and the initial problem of double evaluation was confirmed with a GitHub gist.
Was this patch authored or co-authored using generative AI tooling?
Used Claude to generate more test coverage.