Conversation

huan233usc
Collaborator

@huan233usc huan233usc commented Oct 9, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR refactors the Delta Spark build system into a modular architecture that separates the V1 and V2 implementations while still publishing a single delta-spark jar. The combined jar serves connectors for both V1 and V2, and the public entry points DeltaCatalog and DeltaSparkSessionExtension remain unchanged.

Architecture Changes

Module Structure:

  • sparkV1 (internal): Delta Spark V1 implementation (production code only, no tests)
  • sparkV1Shaded (internal): V1 without the DeltaLog/Snapshot/OptimisticTransaction classes, used to prevent the V2 connector from accidentally depending on the legacy V1 representation of the Delta log
  • sparkV2 (internal): Kernel-based Delta Spark implementation (formerly kernelSpark)
  • spark (combined): final published module that merges V1 + V2 + storage into delta-spark.jar (see the dependency sketch below)
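For orientation, here is a rough sketch of the module graph in build.sbt terms. The module names mirror the list above; the directory names and the plain dependsOn wiring are simplifying assumptions (in particular, how sparkV1Shaded is actually produced by filtering V1 class mappings is elided).

// Sketch only: module layout and dependency direction, not the exact build definition.
lazy val sparkV1       = (project in file("spark"))            // V1 production code, no tests
lazy val sparkV1Shaded = (project in file("spark-v1-shaded"))  // V1 minus DeltaLog/Snapshot/OptimisticTransaction
lazy val sparkV2       = (project in file("kernel-spark"))
  .dependsOn(sparkV1Shaded)        // V2 sees only the shaded V1, so it cannot reach the legacy log classes
lazy val spark         = (project in file("spark-combined"))   // published as delta-spark.jar
  .dependsOn(sparkV1, sparkV2)     // merged into one jar at packaging time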

Detailed Changes

Renamed the old catalog plugin classes:

  • Introduced Abstract* base classes in sparkV1 and Legacy* subclasses for backward compatibility
  • New unified classes in spark-combined:
    • DeltaCatalog extends AbstractDeltaCatalog
    • DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
      This keeps the public entry points DeltaCatalog and DeltaSparkSessionExtension unchanged (see the sketch below).
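A minimal sketch of the entry-point arrangement (the class bodies are placeholders; only the names and extends relationships come from this PR):

// In sparkV1: the implementation moves into Abstract* base classes,
// with Legacy* subclasses kept for backward compatibility.
class AbstractDeltaCatalog extends DelegatingCatalogExtension { /* V1 catalog logic */ }
class LegacyDeltaCatalog extends AbstractDeltaCatalog

// In spark (combined): the public entry points keep their historical names.
class DeltaCatalog extends AbstractDeltaCatalog
class DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension

Existing user configuration therefore keeps working, e.g. spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension and spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog.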

Single Jar rules all connectors:

  • Internal modules (sparkV1, sparkV2, sparkV1Shaded) are marked with skipReleaseSettings and are not published to Maven
  • spark (combined) uses packageBin / mappings to merge classes from the internal modules (see the sketch below)
  • Duplicate-class detection is added to prevent overlapping code
  • POM post-processing excludes the internal module dependencies
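A hedged sketch of the merge plus duplicate-class check (module references follow the earlier sketch, the storage mappings are elided, and the real settings and error handling differ):

// Sketch: merge class files from the internal modules into delta-spark.jar
// and fail the build if two modules emit the same class file path.
Compile / packageBin / mappings ++= {
  val v1 = (sparkV1 / Compile / packageBin / mappings).value
  val v2 = (sparkV2 / Compile / packageBin / mappings).value
  val merged = v1 ++ v2
  val duplicates = merged.groupBy(_._2).filter(_._2.size > 1).keys
  require(duplicates.isEmpty, s"Duplicate classes in delta-spark.jar: ${duplicates.mkString(", ")}")
  merged
}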

Test Configuration:

  • Tests run in the spark (combined) module, with Test / baseDirectory pointing to the spark/ directory
  • Tests for both the V1 and V2 connectors can be run from this module (see the sketch below)
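The core of that wiring, as a sketch (the actual build also points the test sources at spark/src/test, as shown in the build.sbt excerpt quoted further down):

// Combined module: reuse the existing spark/ test suites and fixtures without moving files.
Test / baseDirectory := (ThisBuild / baseDirectory).value / "spark"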

How was this patch tested?

All existing unit tests pass

Does this PR introduce any user-facing changes?

No user-facing changes; this is an internal build system refactoring.

build.sbt Outdated
// Module 3: delta-spark-v2 (kernel-spark based, depends on v1-shaded)
// ============================================================
lazy val `delta-spark-v2` = (project in file("kernel-spark"))
  .dependsOn(`delta-spark-v1-shaded`) // Only depends on shaded v1 (no DeltaLog)
Contributor

exactly what i had in mind.

build.sbt Outdated
Comment on lines 680 to 698
// Test sources point to original spark/src/test/ (no file movement)
Test / unmanagedSourceDirectories ++= Seq(
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala",
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "java"
),
Test / unmanagedResourceDirectories +=
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "resources",

// Include spark-version-specific test sources
Test / unmanagedSourceDirectories ++= {
  val sparkVer = sparkVersion.value
  if (sparkVer.startsWith("3.5")) {
    Seq(baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala-spark-3.5")
  } else if (sparkVer.startsWith("4.0")) {
    Seq(baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala-spark-master")
  } else {
    Seq.empty
  }
},
Contributor

it might be simpler to actually move the files, but this is okay for the first cut.

build.sbt Outdated
// ============================================================
// Module 1: delta-spark-v1 (prod code only, no tests)
// ============================================================
lazy val `delta-spark-v1` = (project in file("spark"))
Contributor

@tdas tdas Oct 10, 2025

why are you using "`" (backticks)?
why not just name it sparkV1 (it's all already delta)?

Contributor

and fundamentally... this is Scala, and these are Scala variables. Nobody uses - in variable names :)

Collaborator Author

Ah I see, will rename

build.sbt Outdated
// ============================================================
lazy val spark = (project in file("spark-combined"))
  .dependsOn(`delta-spark-shaded`) // Direct dependency on shaded (for delegation classes)
  .dependsOn(`delta-spark-v1` % "test->test") // Test utilities from v1
Contributor

why do you need this? v1 has no tests.

build.sbt Outdated
Comment on lines 579 to 581
// This module contains delegation code like:
// - DeltaCatalog (delegates to V1 or V2)
// - DeltaSparkSessionExtension (registers both)
Contributor

do we really have to shade anything?

Collaborator Author

These comments and many others are stale; I will fix them (after fixing the tests) and let you know.

@huan233usc huan233usc changed the title [POC]New spark structure [WIP][POC]New spark structure Oct 10, 2025
@huan233usc huan233usc changed the title [WIP][POC]New spark structure [Spark]Refactor Spark project structure to combine both Dsv1 connector and kernel backed Dsv2 connector Oct 17, 2025
*/
class DeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
class LegacyDeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
class AbstractDeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
Contributor

Let's add a blank line between the two classes.
Also, add a comment for each class.
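Something along these lines, as a sketch (the ScalaDoc text is illustrative only):

/** Base session extension holding the V1 rule and parser registrations; the public
 *  entry point with the original name lives in the combined module. */
class AbstractDeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  // ... registration logic ...
}

/** Kept for backward compatibility with code that referenced the old class name directly. */
class LegacyDeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension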

class DeltaCatalog extends DelegatingCatalogExtension
class LegacyDeltaCatalog extends AbstractDeltaCatalog

class AbstractDeltaCatalog extends DelegatingCatalogExtension
Contributor

Let's add comments

Contributor

Shall we add a README.md under spark-combined?

Contributor

Also, shall we call it spark-unified instead of spark-combined?

Contributor

Or, we can have

  • spark
  • sparkV1
  • sparkV2


// Filter out DeltaLog, Snapshot, OptimisticTransaction classes
v1Mappings.filterNot { case (file, path) =>
  path.contains("org/apache/spark/sql/delta/DeltaLog") ||
Contributor

What do you think about org/apache/spark/sql/delta/actions/actions.scala? Should we filter this out too?

Collaborator Author

yes, this is very v1 specific.
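A hypothetical extension of the filter if the actions classes are excluded as well (the extra path prefixes are assumptions based on this thread, not the final list):

v1Mappings.filterNot { case (file, path) =>
  path.contains("org/apache/spark/sql/delta/DeltaLog") ||
  path.contains("org/apache/spark/sql/delta/Snapshot") ||
  path.contains("org/apache/spark/sql/delta/OptimisticTransaction") ||
  path.contains("org/apache/spark/sql/delta/actions/")   // V1-specific action representations
}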
