Conversation

@Aitozi
Contributor

@Aitozi Aitozi commented Feb 26, 2025

…pend bucket table

Purpose

Linked issue: close #5148

Currently, we append the bucket column to the Row before shuffling, which introduces DeserializeToObject and SerializeFromObject nodes that are CPU costly.

(screenshots omitted)

To address this, I:

  1. introduce a fixed_bucket function to calculate the bucket directly from the original InternalRow
  2. use mapPartitions on the RDD[InternalRow] to avoid converting to Row when performing the write

(screenshots omitted)

The performance of the latter is significantly better.
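
For context, here is a minimal sketch of how the two steps could fit together, assuming a fixed_bucket UDF registered by the Paimon extension; the names and the write loop are illustrative, not the PR's exact code.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{call_udf, col, lit}

  def sketchWrite(data: DataFrame, bucketNumber: Int, bucketKeys: Seq[String]): Unit = {
    // Step 1: compute the bucket id with an expression over the original columns,
    // so no DeserializeToObject/SerializeFromObject nodes are introduced.
    val args = lit(bucketNumber) +: bucketKeys.map(col)
    val withBucket = data.withColumn("_bucket_", call_udf("fixed_bucket", args: _*))

    // Step 2: stay on RDD[InternalRow] instead of converting each row back to Row.
    withBucket
      .repartition(col("_bucket_"))
      .queryExecution
      .toRdd
      .foreachPartition { iter =>
        // a Paimon SparkTableWrite would consume each InternalRow here
        iter.foreach(_ => ())
      }
  }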

Tests

API and Format

Documentation

@Aitozi Aitozi force-pushed the spark-write-1 branch 5 times, most recently from 7e73de8 to 8d2594d on February 27, 2025 15:57
@Aitozi
Contributor Author

Aitozi commented Feb 28, 2025

CC @JingsongLi @YannByron @Zouxxyy

@Zouxxyy Zouxxyy self-requested a review March 10, 2025 14:59
Contributor

@Zouxxyy Zouxxyy left a comment


Thanks, can we remove SparkRow in the future? Actually, I noticed that Spark's DataWriter takes InternalRow as input too; SparkInternalRowWrapper will be useful for integrating the v2 writer in the future.

private def toPaimonRow(row: Row) =
new SparkRow(rowType, row, SparkRowUtils.getRowKind(row, rowKindColIdx))
private def toPaimonRow(row: InternalRow) =
new SparkInternalRowWrapper(
Contributor


We can use something like this to reduce the cost of initialization.

  SparkInternalRowWrapper wrap(InternalRow internalRow) {
    this.row = internalRow;
    return this;
  }
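
For illustration, a self-contained sketch of the reuse pattern; ReusableWrapper is a simplified stand-in for the real SparkInternalRowWrapper.

  object WrapperReuseSketch extends App {
    // Allocate one wrapper per task and re-point it at each incoming row,
    // instead of constructing a new wrapper per row.
    final class ReusableWrapper {
      private var row: AnyRef = _
      def wrap(r: AnyRef): ReusableWrapper = { row = r; this }
      def current: AnyRef = row
    }

    val wrapper = new ReusableWrapper            // allocated once
    Seq("r1", "r2", "r3").foreach { r =>         // stand-ins for InternalRow
      println(wrapper.wrap(r).current)           // no per-row wrapper allocation
    }
  }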

Contributor Author


Fixed

private final int length;
private final StructType structType;

public SparkInternalRowWrapper(
Contributor


We can keep only these in SparkInternalRowWrapper

private org.apache.spark.sql.catalyst.InternalRow internalRow;
private final StructType structType; // or Paimon rowType
private final int rowKindColIdx;

and get the field count or RowKind via static methods.

Contributor Author


I have one concern: we can get a nested row with fewer fields via getRow(int pos, int numFields). If that is supported, numFields may not equal the structType length, so I kept the length field for now.

} finally {
write.close()
val schema = dataFrame.schema
dataFrame.queryExecution.toRdd
Contributor


Have you tested the performance difference of this modification for the unaware-bucket scenario?

Contributor Author


No, I have not tested that difference yet.

}

def callFunction(name: String, args: Seq[Column]): Column = {
call_udf(name, args: _*)
Contributor


What is the difference between this and call_function in Spark 3.5+? It seems a new Compatibility entry has been added for it.

Contributor Author


Removed; using call_udf instead.

import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, DataTypes, StructField, StructType}

case class FixedBucketExpression(_children: Seq[Expression])
Contributor


Let's add some comments to the args

Contributor Author


Fixed

@Aitozi
Contributor Author

Aitozi commented Mar 11, 2025

Thanks @Zouxxyy for your comments. I think we can remove SparkRow once everything has been switched to InternalRow; it's feasible.

@Aitozi
Contributor Author

Aitozi commented Mar 12, 2025

Another question for discussion: before this PR we did not use the bucket expression, so writes did not depend on PaimonSparkExtension. Now writes are forced to go through the bucket expression, so the extension becomes mandatory.

I think this may break compatibility. What do you think? cc @Zouxxyy @YannByron @JingsongLi

@JingsongLi
Contributor

I remember that Spark's dynamic partition writing also seems to require configuring extensions? cc @Zouxxyy

@Aitozi
Contributor Author

Aitozi commented Mar 12, 2025

I appended a commit so it works as before: when spark.sql.extensions is not set, the bucket expression will not be used.

@Zouxxyy
Contributor

Zouxxyy commented Mar 12, 2025

I appended a commit so it works as before: when spark.sql.extensions is not set, the bucket expression will not be used.

The docs have included spark.sql.extensions in the default conf for a long time. In fact, Paimon Spark has relied heavily on the extension for resolution in earlier versions; without it, there could be data-misalignment issues during writing. Perhaps we should emphasize the necessity of configuring the extension a bit more in the documentation.
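
For reference, a minimal example of enabling the extension when building the session; the extension class name is the one documented by Paimon, so please verify it against the version in use.

  import org.apache.spark.sql.SparkSession

  // Enabling the Paimon session extension; without it, the resolution and
  // type-alignment rules discussed above do not run.
  val spark = SparkSession
    .builder()
    .config(
      "spark.sql.extensions",
      "org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions")
    .getOrCreate()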

@Aitozi
Contributor Author

Aitozi commented Mar 13, 2025

I appended a commit so it works as before: when spark.sql.extensions is not set, the bucket expression will not be used.

The docs have included spark.sql.extensions in the default conf for a long time. In fact, Paimon Spark has relied heavily on the extension for resolution in earlier versions; without it, there could be data-misalignment issues during writing. Perhaps we should emphasize the necessity of configuring the extension a bit more in the documentation.

Yes, the extension helps with some type-alignment work; +1 to

emphasize the necessity of configuring the extension a bit more in the documentation

But I still think we should keep the ability to work without the extension; we currently use the extension optionally (e.g. dynamic overwrite, call procedures). So I lean toward not breaking compatibility.

Contributor

@Zouxxyy Zouxxyy left a comment


Overall LGTM, just need to remove the unnecessary modifications.

write.finish()
} finally {
write.close()
dataFrame
Contributor


remove unnecessary changes

val inputSchema = inputDs.schema
writeWithBucketAssigner(
partitionByKey(),
inputDs,
Contributor


remove unnecessary changes

val encoderGroupWithBucketCol = EncoderSerDeGroup(withInitBucketCol.schema)

def newWrite(): SparkTableWrite = new SparkTableWrite(writeBuilder, rowType, rowKindColIdx)
def newWrite(): SparkTableWrite =
Contributor


remove unnecessary changes

.map(x => col(data.schema.fieldNames(x)))
.toSeq
val args = Seq(lit(bucketNumber)) ++ bucketKeyCol
val repartitioned =
Contributor


What about data.withColumn(BUCKET_COL, call_udf(BucketExpression.FIXED_BUCKET, args: _*)), so that we can use iter.foreach(row => write.write(row, row.getInt(bucketColIdx))) to avoid computing the bucket id twice?
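
A hedged sketch of this suggestion; bucketColIdx and the write call are illustrative stand-ins, with the bucket id assumed to have been added once via withColumn upstream.

  import org.apache.spark.sql.catalyst.InternalRow

  // Inside each partition, read the precomputed bucket id back by ordinal so it
  // is not computed a second time. `bucketColIdx` would be resolved from the
  // schema (e.g. schema.fieldIndex(BUCKET_COL)); the Paimon write call is omitted.
  def writePartition(iter: Iterator[InternalRow], bucketColIdx: Int): Unit =
    iter.foreach { row =>
      val bucket = row.getInt(bucketColIdx) // reuse the bucket id computed upstream
      // write.write(row, bucket)
      val _ = bucket
    }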

@Aitozi
Contributor Author

Aitozi commented Mar 21, 2025

Overall LGTM, just need to remove the unnecessary modifications.

Thanks @Zouxxyy for your comments; addressed all of them.

@Zouxxyy Zouxxyy merged commit 1266c36 into apache:master Mar 21, 2025
18 checks passed
danzhewuju pushed a commit to danzhewuju/paimon that referenced this pull request Mar 31, 2025

Development

Successfully merging this pull request may close these issues.

[Feature] Reduce the de/serialization between the InternalRow and Row for spark writer
