[Spark] Refactor Spark project structure to combine both DSv1 connector and kernel-backed DSv2 connector #5320
base: master
Conversation
build.sbt
Outdated
// Module 3: delta-spark-v2 (kernel-spark based, depends on v1-shaded)
// ============================================================
lazy val `delta-spark-v2` = (project in file("kernel-spark"))
  .dependsOn(`delta-spark-v1-shaded`) // Only depends on shaded v1 (no DeltaLog)
exactly what i had in mind.
build.sbt
Outdated
// Test sources point to original spark/src/test/ (no file movement)
Test / unmanagedSourceDirectories ++= Seq(
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala",
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "java"
),
Test / unmanagedResourceDirectories +=
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "resources",

// Include spark-version-specific test sources
Test / unmanagedSourceDirectories ++= {
  val sparkVer = sparkVersion.value
  if (sparkVer.startsWith("3.5")) {
    Seq(baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala-spark-3.5")
  } else if (sparkVer.startsWith("4.0")) {
    Seq(baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala-spark-master")
  } else {
    Seq.empty
  }
},
it might be simpler to actually move the files, but this is okay for the first cut.
build.sbt
Outdated
// ============================================================
// Module 1: delta-spark-v1 (prod code only, no tests)
// ============================================================
lazy val `delta-spark-v1` = (project in file("spark"))
why are you using "`" (backticks)?
why not just name it sparkV1 (it's all already delta)?
and fundamentally... this is Scala, and these are Scala variables. nobody uses "-" in variable names :)
Ah I see, will rename
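A minimal sketch of what the rename could look like, assuming the module set from this PR:

// Plain camelCase identifiers need no backticks in build.sbt.
lazy val sparkV1 = (project in file("spark"))

lazy val sparkV2 = (project in file("kernel-spark"))
  .dependsOn(sparkV1) // the real build depends on a shaded variant of V1 instead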
build.sbt
Outdated
// ============================================================
lazy val spark = (project in file("spark-combined"))
  .dependsOn(`delta-spark-shaded`) // Direct dependency on shaded (for delegation classes)
  .dependsOn(`delta-spark-v1` % "test->test") // Test utilities from v1
why do you need this? v1 has no tests.
build.sbt
Outdated
// This module contains delegation code like:
// - DeltaCatalog (delegates to V1 or V2)
// - DeltaSparkSessionExtension (registers both)
do we really have to shade anything?
These comments and many others are stale; I will fix them (after fixing tests) and let you know.
…java
- build.sbt: Removed old kernelSpark module (replaced by sparkV2 in our refactor)
- StreamingHelperTest.java: Use deltaLog.initialCatalogTable() instead of Option.empty()
 */
class DeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
class LegacyDeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
class AbstractDeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
Let's add a blank line between the two classes.
Also, add a comment for each class.
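A possible shape for that, as a sketch (the comment wording is illustrative, not from the PR):

import org.apache.spark.sql.SparkSessionExtensions

/** V1-only entry point kept for backward compatibility with code that
  * referenced the extension from the old module layout. */
class LegacyDeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension

/** Base extension in sparkV1; the published DeltaSparkSessionExtension in the
  * combined module extends this. */
class AbstractDeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  def apply(extensions: SparkSessionExtensions): Unit = {
    // existing rule and parser injections remain unchanged here
  }
}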
class DeltaCatalog extends DelegatingCatalogExtension
class LegacyDeltaCatalog extends AbstractDeltaCatalog

class AbstractDeltaCatalog extends DelegatingCatalogExtension
Let's add comments
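For example, as a rough sketch (comment text is illustrative):

import org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension

/** V1-only catalog class kept for backward compatibility. */
class LegacyDeltaCatalog extends AbstractDeltaCatalog

/** Base catalog implementation in sparkV1; the published DeltaCatalog in the
  * combined module extends this. */
class AbstractDeltaCatalog extends DelegatingCatalogExtension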
Shall we add a README.md under spark-combined?
Also, shall we call it spark-unified instead of spark-combined?
Or, we can have:
- spark
- sparkV1
- sparkV2
// Filter out DeltaLog, Snapshot, OptimisticTransaction classes
v1Mappings.filterNot { case (file, path) =>
  path.contains("org/apache/spark/sql/delta/DeltaLog") ||
What do you think about org/apache/spark/sql/delta/actions/actions.scala? Should we filter this out too?
yes, this is very v1 specific.
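A sketch of how the mapping filter above could be extended to also drop the action classes (the package path is assumed from org.apache.spark.sql.delta.actions):

// Exclude the legacy V1 log classes and the actions package from the merged mappings
v1Mappings.filterNot { case (_, path) =>
  path.contains("org/apache/spark/sql/delta/DeltaLog") ||
  path.contains("org/apache/spark/sql/delta/Snapshot") ||
  path.contains("org/apache/spark/sql/delta/OptimisticTransaction") ||
  path.contains("org/apache/spark/sql/delta/actions/")
}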
Which Delta project/connector is this regarding?
Description
This PR refactors the Delta Spark build system into a modular architecture that separates the V1 and V2 implementations while still publishing a single delta-spark jar serving both connectors, with the public entry points DeltaCatalog and DeltaSparkSessionExtension unchanged.

Architecture Changes
Module Structure (sketched below):
- sparkV1 (internal): Delta Spark V1 implementation (production code only, no tests)
- sparkV1Shaded (internal): V1 without the DeltaLog/Snapshot/OptimisticTransaction classes, used to prevent the V2 connector from accidentally depending on the legacy V1 representation of the Delta log
- sparkV2 (internal): kernel-based Delta Spark implementation (formerly kernelSpark)
- spark (combined): final published module that merges V1 + V2 + storage into delta-spark.jar
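A rough build.sbt sketch of that layout (the directory for the shaded module and the exact dependency wiring are simplified assumptions):

lazy val sparkV1 = (project in file("spark"))
  .settings(skipReleaseSettings) // internal, not published

lazy val sparkV1Shaded = (project in file("spark-v1-shaded")) // hypothetical path
  .dependsOn(sparkV1)
  .settings(skipReleaseSettings)

lazy val sparkV2 = (project in file("kernel-spark"))
  .dependsOn(sparkV1Shaded) // no direct dependency on DeltaLog and friends
  .settings(skipReleaseSettings)

lazy val spark = (project in file("spark-combined")) // published as delta-spark
  .dependsOn(sparkV2)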
Detailed Changes

Rename old catalog plugin:
- Abstract* base classes live in sparkV1, with Legacy* subclasses kept for backward compatibility
- In spark-combined, DeltaCatalog extends AbstractDeltaCatalog and DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
- This keeps the public entry points DeltaCatalog and DeltaSparkSessionExtension unchanged
Single jar rules all connectors:
- Internal modules (sparkV1, sparkV2, sparkV1Shaded) are marked with skipReleaseSettings and are not published to Maven
- spark (combined) uses packageBin / mappings to merge classes from the internal modules (see the sketch below)
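A minimal sketch of that merge, assuming the internal module names above (the real settings also filter out the shaded classes, as discussed in the review):

// In the combined module: pull compiled classes from the internal modules
// into the single published delta-spark jar.
Compile / packageBin / mappings ++= {
  val v1Mappings = (sparkV1 / Compile / packageBin / mappings).value
  val v2Mappings = (sparkV2 / Compile / packageBin / mappings).value
  v1Mappings ++ v2Mappings
}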
Test Configuration:
- Tests run in the spark (combined) module, with Test / baseDirectory pointing to the spark/ directory (see the sketch below)
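For illustration, a sketch of that setting in the combined module (the exact expression is an assumption):

// Run the existing suites from the combined module while keeping relative
// paths in tests resolving against the original spark/ directory.
Test / baseDirectory := (ThisBuild / baseDirectory).value / "spark"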
How was this patch tested?
All existing unit tests pass
Does this PR introduce any user-facing changes?
No. This is an internal build system refactoring with no user-facing changes.