
[SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown#46143

Open
holdenk wants to merge 68 commits into apache:master from holdenk:SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection

Conversation

@holdenk
Contributor

@holdenk holdenk commented Apr 20, 2024

What changes were proposed in this pull request?

Changes the filter pushDown optimizer to not push down past projections of the same element if we reasonably expect that computing that element is likely to be expensive.

This is a slightly more complex alternative to #45802, which also moves parts of projections down so that the filters can move further down.

An expression can indicate that it is too expensive to be worth the potential savings of being double evaluated as a result of pushdown (by default we treat all UDFs this way).

Future Work / What else remains to do?

Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the thing we reference is expensive rather than the condition itself. We could do this as a follow-up item or as part of this PR.

Why are the changes needed?

Currently Spark may double compute expensive operations (like json parsing, UDF eval, etc.) as a result of filter pushdown past projections.
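The double evaluation can be sketched with a small model (illustrative Python, not Spark code; `expensive` stands in for a UDF or JSON parse) that counts evaluations before and after the filter is pushed past the projection:

```python
# Illustrative model: count how often an "expensive" expression runs
# when a filter stays above a projection vs. when it is pushed below it.
calls = 0

def expensive(v):
    # Stands in for a UDF, JSON parse, regex, etc.
    global calls
    calls += 1
    return v * 10

rows = [1, 2, 3]

# Plan A: Filter above Project -- expensive() runs once per input row.
calls = 0
projected = [expensive(r) for r in rows]
kept_above = [x for x in projected if x > 15]
evals_above = calls                 # 3 evaluations

# Plan B: filter pushed below the Project -- the alias is inlined into the
# condition, so expensive() runs again for every row that survives.
calls = 0
survivors = [r for r in rows if expensive(r) > 15]
kept_below = [expensive(r) for r in survivors]
evals_below = calls                 # 3 (filter) + 2 (project) = 5
```

Both plans return the same rows, but the pushed-down plan evaluates the expensive expression five times instead of three; the gap grows with the filter's selectivity.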

Does this PR introduce any user-facing change?

This SQL optimizer change may impact some user query plans; results should be the same and hopefully a little faster.

How was this patch tested?

New tests were added to the FilterPushDownSuite, and the initial problem of double evaluation was confirmed with a GitHub gist.

Was this patch authored or co-authored using generative AI tooling?

Used Claude to generate additional test coverage.

@github-actions github-actions bot added the SQL label Apr 20, 2024
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from a97d56f to f1d1ddd Compare May 6, 2024 17:58
@holdenk holdenk changed the title [WIP][SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown [SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown May 8, 2024
@holdenk
Contributor Author

holdenk commented May 8, 2024

CC @cloud-fan do you have thoughts / cycles?

@mridulm
Contributor

mridulm commented May 9, 2024

+CC @shardulm94

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 21, 2024
@github-actions github-actions bot closed this Aug 22, 2024
@holdenk holdenk removed the Stale label Dec 2, 2024
@holdenk holdenk reopened this Dec 2, 2024
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from ac85ead to a5d8400 Compare December 5, 2024 23:28
@holdenk
Contributor Author

holdenk commented Dec 5, 2024

Hi @cloud-fan, looks like the "with" suggestion ended up being more complicated than originally suggested (see #46499 (comment)). In the interest of progress, and to avoid double evaluation of a lot of really expensive things we don't need, I intend to update this PR and merge it. We can still circle back to the with approach eventually.

@cloud-fan
Contributor

cloud-fan commented Dec 6, 2024

Sorry for the late response to this project. I think the With approach is not that complicated and I'm fixing the nested With limitation here: #49093 . After this is merged, I can follow up with the actual pushdown implementation if @zml1206 can't continue his work.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 22, 2025
@github-actions github-actions bot closed this Mar 23, 2025
@holdenk
Contributor Author

holdenk commented Nov 3, 2025

Just wondering if we have a consensus on the best way to go about this, @zml1206 / @cloud-fan? Based on the > 1 year since the With change, I'm thinking it might be more complicated than we originally thought. I can re-explore as well if @zml1206 is busy, but we could also go for the simpler solution in the meantime, since double UDF evaluation is bad.

@holdenk holdenk reopened this Nov 3, 2025
@holdenk holdenk removed the Stale label Nov 3, 2025
@cloud-fan
Contributor

Hi @holdenk, we tried very hard to solve this issue efficiently but failed. The idea was to let the Filter carry a project list and push them down together, but when we push through a Project/Aggregate that also contains a project list, we may still hit expression duplication and need to make a decision based on cost.

Sorry, I should have come back to this PR earlier. I think we can simplify it a bit, as we will likely never have a practical cost model for Spark expressions. Let's just avoid duplicating UDF expressions (those that extend the marker expression UserDefinedExpression) during filter pushdown, and add a config to enable it.
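The simplified rule described above can be sketched as follows (illustrative Python model, not Spark code; expressions are modeled as nested tuples and all names here are made up for the sketch):

```python
# Hedged sketch: split a conjunctive filter and push only the conjuncts
# whose alias-substituted form contains no user-defined (expensive)
# expression; the rest stay above the Project.

def substitute(expr, alias_map):
    """Inline projection aliases into a predicate."""
    if isinstance(expr, str):
        return alias_map.get(expr, expr)
    return tuple(substitute(e, alias_map) for e in expr)

def contains_udf(expr):
    """Detect a UDF marker anywhere in the expression tree."""
    if isinstance(expr, str):
        return expr.startswith("udf:")
    return any(contains_udf(e) for e in expr)

# Project: c = a, f = parse_json_udf(e);  Filter: c > 5 AND f = 'x'
alias_map = {"c": "a", "f": ("udf:parse_json", "e")}
conjuncts = [(">", "c", "5"), ("=", "f", "'x'")]

stay_up, push_down = [], []
for pred in conjuncts:
    inlined = substitute(pred, alias_map)
    if contains_udf(inlined):
        stay_up.append(pred)       # keep above the Project, unsubstituted
    else:
        push_down.append(inlined)  # push below with aliases inlined
```

Here `c > 5` is pushed down as `a > 5`, while `f = 'x'` stays above the Project because inlining `f` would duplicate the UDF call.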

@holdenk
Contributor Author

holdenk commented Nov 3, 2025

Sounds like a plan, I'll work on simplifying this code.

holdenk and others added 8 commits November 3, 2025 15:50
…n an 'expensive' projected operation (rlike)

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… we can't do partial on the ||

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… (idk seems better than the hack of determining if something is eligible for pushdown, but we could also use that maybe? idk) & start updating optimizer filter pushdown past a project for _partial_ pushdowns.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ct to 'save'

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ushDown

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… projection that we are using

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…o aliases

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@cloud-fan
Contributor

cloud-fan commented Jan 5, 2026

To discuss #46143 (comment) further:

Yes that's true, but given your previous statement around how adding projections is not free, I don't think that's the right way to structure this.

That's why my initial suggestion was to not do this optimization at all. We just keep the Filter above the Project. By doing so we avoid the expensive expression duplication caused by filter pushdown, but all expressions in Project now need to be evaluated against the full input. I'm not sure how serious this issue is, and I was just trying to help simplify the algorithm given you are doing this optimization. I'm more than happy if you agree to drop this optimization and simplify the code.

@holdenk
Contributor Author

holdenk commented Jan 5, 2026

That's why my initial suggestion was to not do this optimization at all. We just keep the Filter above the Project. By doing so we avoid the expensive expression duplication caused by filter pushdown, but all expressions in Project now need to be evaluated against the full input. I'm not sure how serious this issue is, and I was just trying to help simplify the algorithm given you are doing this optimization. I'm more than happy if you agree to drop this optimization and simplify the code.

So just always leave complex filters up and don't attempt to split them? I think that's sub-optimal for fairly self-evident reasons, but if you still find the current implementation too complex I could move it into a follow-on PR so there's less to review here and we just fix the perf regression introduced in 3.0.

@cloud-fan
Contributor

A followup SGTM, at least we can fix the perf regression first.

@holdenk
Contributor Author

holdenk commented Jan 6, 2026

Awesome, I'll rework this then :)

sfc-gh-hkarau and others added 3 commits January 12, 2026 12:26
…pushdown-split-projection

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ombos)

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… so need to ref it.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from 814405a to afbc2ea Compare January 12, 2026 21:00
@holdenk
Contributor Author

holdenk commented Jan 12, 2026

@cloud-fan updated to just fix the regression.

val resolvedCheapFilter = replaceAlias(combinedCheapFilter, aliasMap)
val baseChild: LogicalPlan = Filter(resolvedCheapFilter, child = grandChild)
// Insert a last projection to match the desired column ordering and
// evaluate any stragglers and select the already computed columns.
Contributor

hmm why is this needed if we only push a Filter through a Project?

(cond, AttributeMap.empty[Alias])
} else {
val (replaced, usedAliases) = replaceAliasWhileTracking(cond, aliasMap)
(cond, usedAliases)
Contributor

we return cond instead of replaced, which indicates that the new function replaceAliasWhileTracking is not necessary for this new simple version.

I think we can make it very simple:

val (stayUp, pushDown) = splitConjunctivePredicates(condition).partition { predicate =>
  replaceAlias(predicate, aliasMap).expensive
}
(stayUp, pushDown) match {
  case (_, Nil) =>
    // Nothing to push down, keep the same filter.
    filter
  case (Nil, _) =>
    // Push all
    project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
  case _ =>
    Filter(
      stayUp.reduce(And),
      project.copy(
        child = Filter(replaceAlias(pushDown.reduce(And), aliasMap), grandChild)
      )
    )
}

Contributor Author

So the WhileTracking API gives us back the usedAliases, which we are returning, and replaceAlias does not. If we just look at whether the substituted expression is expensive rather than the alias, we'll incorrectly leave up a filter that does an expensive regex over a projected inexpensive column. I'll note this in the comments though.
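The distinction being made here can be sketched as follows (illustrative Python model, not Spark code; the alias names and `safe_to_push` helper are made up for the sketch): pushdown is gated on whether the predicate references an expensive alias, not on whether the predicate as a whole looks expensive.

```python
# d = lower(b) is a cheap alias; f = someUdf(e) is an expensive alias.
expensive_aliases = {"f"}

def safe_to_push(pred_refs):
    """pred_refs: the set of projection aliases the predicate mentions.
    Pushing is safe unless it would duplicate an expensive alias."""
    return not (pred_refs & expensive_aliases)

# `d rlike 'magic'` is an expensive *operation*, but pushing it only
# duplicates the cheap alias d, so it is still safe to push.
rlike_on_cheap = safe_to_push({"d"})    # safe

# `f > 1` is a cheap comparison, but pushing it duplicates the UDF.
compare_on_udf = safe_to_push({"f"})    # not safe
```

Checking the substituted expression instead would wrongly classify the rlike case as expensive and leave the filter up.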

val usedAliasesForCondition = splitCondition.map { cond =>
// If the legacy double evaluation behavior is enabled we just say
// every filter is "free."
if (!SQLConf.get.avoidDoubleFilterEval) {
Contributor

If avoidDoubleFilterEval is false we can just return project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild)), to make sure the legacy code path is exactly the same as before.

Contributor Author

Sure

newElem match {
case None => a
case Some(b) =>
replaced += (a, b)
Contributor

we should skip this operation if the AttributeMap already contains a. It does not change the result but is more efficient.

Contributor Author

Good call

// For all filter which do not reference any expensive aliases then
// just push the filter while resolving the non-expensive aliases.
val combinedCheapFilter = cheap.reduce(And)
val baseChild: LogicalPlan = Filter(combinedCheapFilter, child = grandChild)
Contributor

Suggested change
val baseChild: LogicalPlan = Filter(combinedCheapFilter, child = grandChild)
val baseChild = Filter(combinedCheapFilter, child = grandChild)

comparePlans(optimized, correctAnswer)
}

test("SPARK-47672: Avoid double evaluation with projections can't push past certain items") {
Contributor

Is this just a more complicated case of the test SPARK-47672: Make sure that we handle the case where everything is expensive?

Contributor Author

Pretty much, makes sure we handle the split correctly though.

test("SPARK-47672: Case 1 - multiple filters not referencing projection aliases") {
val originalQuery = testStringRelation
.select($"a" as "c", $"e".rlike("magic") as "f", $"b" as "d")
.where($"c" > 5 && $"d" < 10)
Contributor

c and d do reference projection aliases, shall we use .select($"a", $"b", ...)?

comparePlans(optimized, correctAnswer)
}

// Case 2: Multiple filters with inexpensive references - all should be pushed
Contributor

do we have a test case for the mixed cases?

…ing into the tracking map only add if not present (same result, possibly faster, since the value is always the same), remove val result = ... (was useful for debugging).