Conversation

@Aitozi
Contributor

@Aitozi Aitozi commented Feb 26, 2025

…pend bucket table

Purpose

Linked issue: close #5148

Currently, we append the bucket column to the Row before shuffling, which introduces DeserializeToObject and SerializeFromObject nodes that are CPU costly.

(screenshots omitted)

To address this, I:

  1. introduce a fixed_bucket function to calculate the bucket directly from the original InternalRow
  2. use mapPartitions on the RDD[InternalRow] to avoid converting to Row when performing the write

(screenshots omitted)

The performance of the latter is significantly better.
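
For context, here is a minimal sketch of how the two steps could fit together, assuming a fixed_bucket UDF registered by the Paimon extension; the names and the write loop are illustrative, not the PR's exact code.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{call_udf, col, lit}

  def sketchWrite(data: DataFrame, bucketNumber: Int, bucketKeys: Seq[String]): Unit = {
    // Step 1: compute the bucket id with an expression over the original columns,
    // so no DeserializeToObject/SerializeFromObject nodes are introduced.
    val args = lit(bucketNumber) +: bucketKeys.map(col)
    val withBucket = data.withColumn("_bucket_", call_udf("fixed_bucket", args: _*))

    // Step 2: stay on RDD[InternalRow] instead of converting each row back to Row.
    withBucket
      .repartition(col("_bucket_"))
      .queryExecution
      .toRdd
      .foreachPartition { iter =>
        // a Paimon SparkTableWrite would consume each InternalRow here
        iter.foreach(_ => ())
      }
  }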

Tests

API and Format

Documentation

@Aitozi Aitozi force-pushed the spark-write-1 branch 5 times, most recently from 7e73de8 to 8d2594d on February 27, 2025 15:57
@Aitozi
Contributor Author

Aitozi commented Feb 28, 2025

CC @JingsongLi @YannByron @Zouxxyy

@Zouxxyy Zouxxyy self-requested a review March 10, 2025 14:59
Contributor

@Zouxxyy Zouxxyy left a comment


Thanks, can we remove SparkRow in the future? Actually, I noticed that Spark's DataWriter takes InternalRow as input too; SparkInternalRowWrapper will be useful for integrating the v2 writer in the future.

private def toPaimonRow(row: Row) =
new SparkRow(rowType, row, SparkRowUtils.getRowKind(row, rowKindColIdx))
private def toPaimonRow(row: InternalRow) =
new SparkInternalRowWrapper(
Contributor


We can use something like this to reduce the cost of initialization.

  SparkInternalRowWrapper wrap(InternalRow internalRow) {
    this.row = internalRow;
    return this;
  }
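
For illustration, a self-contained sketch of the reuse pattern; ReusableWrapper is a simplified stand-in for the real SparkInternalRowWrapper.

  object WrapperReuseSketch extends App {
    // Allocate one wrapper per task and re-point it at each incoming row,
    // instead of constructing a new wrapper per row.
    final class ReusableWrapper {
      private var row: AnyRef = _
      def wrap(r: AnyRef): ReusableWrapper = { row = r; this }
      def current: AnyRef = row
    }

    val wrapper = new ReusableWrapper            // allocated once
    Seq("r1", "r2", "r3").foreach { r =>         // stand-ins for InternalRow
      println(wrapper.wrap(r).current)           // no per-row wrapper allocation
    }
  }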

Contributor Author


Fixed

private final int length;
private final StructType structType;

public SparkInternalRowWrapper(
Contributor


We can keep only these in SparkInternalRowWrapper

private org.apache.spark.sql.catalyst.InternalRow internalRow;
private final StructType structType; // or Paimon rowType
private final int rowKindColIdx;

and get the field count or RowKind via static methods.

Contributor Author


I have one concern: we can get a nested row with fewer fields via getRow(int pos, int numFields). If that is supported, numFields may not equal the structType length, so I kept the length field for now.

} finally {
write.close()
val schema = dataFrame.schema
dataFrame.queryExecution.toRdd
Contributor


Have you tested the performance difference of this modification for the unaware-bucket scenario?

Contributor Author


No, I have not tested that difference yet.

}

def callFunction(name: String, args: Seq[Column]): Column = {
call_udf(name, args: _*)
Contributor


What is the difference between this and call_function in Spark 3.5+? It seems a new Compatibility entry has been added for it.

Contributor Author


Removed; using call_udf instead.

import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, DataTypes, StructField, StructType}

case class FixedBucketExpression(_children: Seq[Expression])
Contributor


Let's add some comments to the args

Contributor Author


Fixed

@Aitozi
Contributor Author

Aitozi commented Mar 11, 2025

Thanks @Zouxxyy for your comments. I think we can remove SparkRow once everything has been switched to InternalRow; it's feasible.

@Aitozi
Contributor Author

Aitozi commented Mar 12, 2025

Another question for discussion: before this PR we did not use the bucket expression, so writes did not depend on PaimonSparkExtension. Now writes are forced to go through the bucket expression, so the extension becomes mandatory.

I think this may break compatibility. What do you think? cc @Zouxxyy @YannByron @JingsongLi

@JingsongLi
Contributor

I remember that Spark's dynamic partition writing also seems to require configuring extensions? cc @Zouxxyy

@Aitozi
Contributor Author

Aitozi commented Mar 12, 2025

I appended a commit so it works as before: when spark.sql.extensions is not set, the bucket expression will not be used.

@Zouxxyy
Contributor

Zouxxyy commented Mar 12, 2025

I appended a commit so it works as before: when spark.sql.extensions is not set, the bucket expression will not be used.

The docs have included spark.sql.extensions in the default conf for a long time. In fact, Paimon Spark has relied heavily on the extension for resolution in earlier versions; without it, there could be data-misalignment issues during writing. Perhaps we should emphasize the necessity of configuring the extension a bit more in the documentation.
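
For reference, a minimal example of enabling the extension when building the session; the extension class name is the one documented by Paimon, so please verify it against the version in use.

  import org.apache.spark.sql.SparkSession

  // Enabling the Paimon session extension; without it, the resolution and
  // type-alignment rules discussed above do not run.
  val spark = SparkSession
    .builder()
    .config(
      "spark.sql.extensions",
      "org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions")
    .getOrCreate()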

@Aitozi
Contributor Author

Aitozi commented Mar 13, 2025

I appended a commit so it works as before: when spark.sql.extensions is not set, the bucket expression will not be used.

The docs have included spark.sql.extensions in the default conf for a long time. In fact, Paimon Spark has relied heavily on the extension for resolution in earlier versions; without it, there could be data-misalignment issues during writing. Perhaps we should emphasize the necessity of configuring the extension a bit more in the documentation.

Yes, the extension helps with some type-alignment work; +1 to

emphasize the necessity of configuring the extension a bit more in the documentation

But I still think we should keep the ability to work without the extension; we currently use the extension optionally (e.g. dynamic overwrite, call procedures). So I lean toward not breaking compatibility.

Contributor

@Zouxxyy Zouxxyy left a comment


Overall LGTM, just need to remove the unnecessary modifications.

write.finish()
} finally {
write.close()
dataFrame
Contributor


remove unnecessary changes

val inputSchema = inputDs.schema
writeWithBucketAssigner(
partitionByKey(),
inputDs,
Contributor


remove unnecessary changes

val encoderGroupWithBucketCol = EncoderSerDeGroup(withInitBucketCol.schema)

def newWrite(): SparkTableWrite = new SparkTableWrite(writeBuilder, rowType, rowKindColIdx)
def newWrite(): SparkTableWrite =
Contributor


remove unnecessary changes

.map(x => col(data.schema.fieldNames(x)))
.toSeq
val args = Seq(lit(bucketNumber)) ++ bucketKeyCol
val repartitioned =
Contributor


What about data.withColumn(BUCKET_COL, call_udf(BucketExpression.FIXED_BUCKET, args: _*)), so that we can use iter.foreach(row => write.write(row, row.getInt(bucketColIdx))) to avoid computing the bucket id twice?
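
A hedged sketch of this suggestion; bucketColIdx and the write call are illustrative stand-ins, with the bucket id assumed to have been added once via withColumn upstream.

  import org.apache.spark.sql.catalyst.InternalRow

  // Inside each partition, read the precomputed bucket id back by ordinal so it
  // is not computed a second time. `bucketColIdx` would be resolved from the
  // schema (e.g. schema.fieldIndex(BUCKET_COL)); the Paimon write call is omitted.
  def writePartition(iter: Iterator[InternalRow], bucketColIdx: Int): Unit =
    iter.foreach { row =>
      val bucket = row.getInt(bucketColIdx) // reuse the bucket id computed upstream
      // write.write(row, bucket)
      val _ = bucket
    }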

@Aitozi
Contributor Author

Aitozi commented Mar 21, 2025

Overall LGTM, just need to remove the unnecessary modifications.

Thanks @Zouxxyy for your comments; addressed all of them.

@Zouxxyy Zouxxyy merged commit 1266c36 into apache:master Mar 21, 2025
18 checks passed
danzhewuju pushed a commit to danzhewuju/paimon that referenced this pull request Mar 31, 2025

Development

Successfully merging this pull request may close these issues.

[Feature] Reduce the de/serialization between the InternalRow and Row for spark writer
