[SPARK-53192][CONNECT] Always cache a DataSource in the Spark Connect Plan Cache
### What changes were proposed in this pull request?
I believe we can dramatically improve the performance of the `SparkConnectPlanner` for plans that use the same `Read.DataSource` (`spark.read`) multiple times within the same session by actively caching these relations in the Spark Connect Plan Cache (commit a1fc6d5).
At the moment, every occurrence of a `Read.DataSource` triggers a separate analysis of the `DataSource`, and each analysis kicks off a new Spark Job when no explicit schema is provided. This makes plan translation very slow, because we fetch the (meta)data every time.
For example, the following code, which unions the same CSV file N times, kicks off N+1 Spark Jobs to analyze the final DataFrame in Spark Connect (compared to exactly 1 in Spark Classic):
```scala
val df = spark.read.csv("abc.csv")
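// Each a.union(df) adds another reference to the same Read.DataSource;
// accessing .schema then analyzes every occurrence separately in Spark Connect.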
(0 until N).foldLeft(df)((a, _) => a.union(df)).schema
```
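For reference, the inference jobs disappear entirely when an explicit schema is supplied, since analysis then needs no data access. A minimal sketch (the schema and column name are illustrative):
```scala
import org.apache.spark.sql.types.{StringType, StructType}

// With an explicit schema, no Spark Job is needed to infer it, so the
// repeated translation of the DataSource stays cheap even today.
val schema = new StructType().add("_c0", StringType)
val dfWithSchema = spark.read.schema(schema).csv("abc.csv")
```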
I propose to adjust the Spark Connect Plan Cache to always cache a `Read.DataSource`, even when it is not the root of a relation. This reduces the number of Spark Jobs needed for analysis to at most 1 per **unique** `DataSource`. It has the same effect as explicitly analyzing the base `DataSource` today to populate the Spark Connect Plan Cache with its base plan, which greatly improves the performance of subsequent queries using this `DataSource`:
```scala
val df = spark.read.csv("abc.csv")
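// df.schema triggers one analysis of the base DataSource and populates
// the Spark Connect Plan Cache with its plan; the unions below then hit
// the cache instead of re-analyzing the CSV file.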
df.schema
(0 until N).foldLeft(df)((a, _) => a.union(df)).schema
```
To compensate for the increase in cached plans, I increased the Plan Cache size from 16 to 32. The `DataSource` plans are always leaf nodes, so I do not expect them to add much memory pressure. This does increase the cache size for all plans, though, so it is up for debate whether we really want to do this.
I also added a flag that lets one turn off this aggressive caching: `spark.connect.session.planCache.alwaysCacheDataSourceReadsEnabled`.
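For illustration, this is how one could opt out when creating the server-side session (a sketch: the flag name comes from this PR, but treating it as settable via `SparkSession.builder` is an assumption):
```scala
import org.apache.spark.sql.SparkSession

// Disable the new aggressive DataSource caching (hypothetical setup).
val session = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.connect.session.planCache.alwaysCacheDataSourceReadsEnabled", "false")
  .getOrCreate()
```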
Side note:
This fix works so well because the `SparkConnectPlanner` [actively analyzes](https://github.com/apache/spark/blob/8b80ea04d0b14d19b819cd4648b5ddd3e1c42650/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L1518) the `LogicalPlan` when translating the `Read.DataSource`. I wonder whether this makes sense conceptually, and why we chose `queryExecution.analyzed` over `queryExecution.logical` here.
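To make the distinction concrete, here is a contrast sketch (plain DataFrame API, not the planner's actual code):
```scala
// queryExecution.logical is the plan as constructed, before the analyzer
// runs; queryExecution.analyzed forces eager analysis, which is where the
// per-translation cost shows up.
val df = spark.read.csv("abc.csv")
val parsedPlan = df.queryExecution.logical
val analyzedPlan = df.queryExecution.analyzed
```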
### Why are the changes needed?
Translating `DataSource`s with the `SparkConnectPlanner` is currently very expensive and inefficient: there are many Spark Connect Plan Cache misses because we only cache top-level plans. This change greatly improves the performance of Spark Connect plan translation involving `Read.DataSource`s, which are very common.
### Does this PR introduce _any_ user-facing change?
We now always cache the analyzed `Read.DataSource` in the Spark Connect Plan Cache instead of only a top-level plan containing it. This is equivalent to accessing the `DataSource`'s schema right after its creation, so I would argue it makes the caching behavior more consistent while greatly improving performance.
### How was this patch tested?
Added a test case to `SparkConnectSessionHolderSuite` that checks the improved caching and verifies that it can be turned off with the newly added flag.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #51921 from dillitz/datasource-caching.
Authored-by: Robert Dillitz <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
Changed file: `sql/connect/server/src/test/scala/org/apache/spark/sql/connect/service/SparkConnectSessionHolderSuite.scala` (+50 −10 lines)