
Conversation

@zhongyujiang
Contributor

Purpose

Linked issue: part of #4816, split out from #5241

Add the Paimon bucket function transform to the Paimon-Spark connector.
It can be used to request data clustering on the Paimon bucket transform from DataSource V2 when integrating the V2 write path: use case
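For orientation, here is a minimal sketch (not code from this PR) of how a DataSource V2 write can ask Spark to cluster incoming rows on a bucket transform through RequiresDistributionAndOrdering; the bucket count (16) and the column name (user_id) are made-up values:

import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.SortOrder;
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

// Illustrative only: a V2 write that requests clustering by bucket(16, user_id).
abstract class IllustrativeBucketedWrite implements RequiresDistributionAndOrdering {

    @Override
    public Distribution requiredDistribution() {
        // All rows of one bucket are routed to the same Spark partition before writing.
        Expression bucket =
                Expressions.apply(
                        "bucket", Expressions.literal(16), Expressions.column("user_id"));
        return Distributions.clustered(new Expression[] {bucket});
    }

    @Override
    public SortOrder[] requiredOrdering() {
        return new SortOrder[0];
    }
}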

Tests

BucketFunctionTest

API and Format

Documentation

Contributor Author

@zhongyujiang zhongyujiang left a comment

cc @Zouxxyy @YannByron Can you please review this when you have time? Thanks!

builder.put("BucketLongString", BucketLongString.class);
builder.put("BucketStringInteger", BucketStringInteger.class);
builder.put("BucketStringLong", BucketStringLong.class);
builder.put("BucketStringString", BucketStringString.class);
Contributor Author

For primitive types and common composite types, I added separate bucket function classes to improve performance, while BucketGeneric handles the remaining types.
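Roughly, the difference looks like the sketch below (simplified, not the PR's actual code; the hash is a placeholder rather than Paimon's real bucket hash): a specialized class exposes a typed magic invoke method that Spark can call without boxing, while a generic variant has to read every argument from an InternalRow.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.unsafe.types.UTF8String;

// Sketch of a specialized (long, string) bucket function.
class BucketLongStringSketch implements ScalarFunction<Integer> {

    // Spark's "magic method" fast path: typed arguments, no boxing, no row access.
    public int invoke(int numBuckets, long key0, UTF8String key1) {
        int hash = Long.hashCode(key0) * 31 + key1.hashCode(); // placeholder hash
        return Math.floorMod(hash, numBuckets);
    }

    @Override
    public DataType[] inputTypes() {
        return new DataType[] {DataTypes.IntegerType, DataTypes.LongType, DataTypes.StringType};
    }

    @Override
    public DataType resultType() {
        return DataTypes.IntegerType;
    }

    @Override
    public String name() {
        return "bucket";
    }

    // Generic fallback path (what a BucketGeneric-style implementation always goes through):
    // every argument is read from an InternalRow by position, which is slower.
    @Override
    public Integer produceResult(InternalRow args) {
        return invoke(args.getInt(0), args.getLong(1), args.getUTF8String(2));
    }
}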

Contributor

Have you tested the performance of this implementation? I think the current approach is too specialized, and the Spark-to-Paimon data conversion doesn't reuse the existing code.

Contributor Author

https://github.com/zhongyujiang/incubator-paimon/tree/gh/spark-v2-write_function-benchmark

Here are the JMH benchmark results:

Benchmark                                            Mode  Cnt  Score   Error  Units
PaimonBucketFunctionBenchmark.bucketFunction           ss    5  0.416 ± 0.033   s/op
PaimonBucketFunctionBenchmark.genericBucketFunction    ss    5  0.685 ± 0.302   s/op

I think this optimization has limited impact on write performance, since a significant amount of time is spent on network and compression during data writing. However, there is indeed a noticeable performance difference between the two implementations.
This part of the code can't be reused since the invoke method takes different input types.
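For reference, the benchmark lives in the branch linked above; the skeleton below only illustrates the JMH setup that produces this kind of table (class name and method bodies are placeholders, not the actual benchmark code):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime) // the "ss" mode shown in the table
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(1)
@Measurement(iterations = 5) // Cnt = 5 in the table
public class PaimonBucketFunctionBenchmarkSketch {

    @Benchmark
    public void bucketFunction() {
        // apply the specialized (typed invoke) bucket function over a batch of rows
    }

    @Benchmark
    public void genericBucketFunction() {
        // apply the generic (InternalRow-based) bucket function over the same batch
    }
}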

Contributor

It looks like the difference is small. I added a benchmark in #5418 and will test the performance of the existing bucket function against this PR's, covering both the function itself and the end-to-end write cost.

Contributor Author

If we only compare the two function implementations, the results above show a clear difference: 0.685 / 0.416 ≈ 1.65×.
In real write scenarios, I agree that the impact of this optimization is very minimal.

Contributor Author

FYI, I’ve already added a benchmark comparison between the v2 write and the previous v1 write here; the results are attached in that PR too. But it didn’t include the newly added v1 bucket function optimization, so I think the JMH benchmark results are more reliable.

Contributor

@zhongyujiang fixed_bucket() and bucket() do the same thing, so it is reasonable to keep only one (keep bucket, because it will also be used in bucket joins and so on).

I tested it and found that the current implementation is indeed faster, but the e2e time is about the same. However, it introduces some duplicate code, such as the conversion between Spark types and Paimon types, as well as some specialized code that I think will be difficult to maintain in the future.

I suggest moving the logic of fixed_bucket into bucket and then removing fixed_bucket. WDYT?

Contributor Author

Sure, let me update this.


if (type instanceof org.apache.paimon.types.TimestampType) {
return ((org.apache.paimon.types.TimestampType) type).getPrecision()
== SPARK_TIMESTAMP_PRECISION;
Contributor Author

The Spark timestamp type is microsecond-precision and does not allow a custom precision. When binding the Paimon bucket function, Paimon’s own type information is not available, so support is only possible when the Paimon timestamp precision equals the Spark timestamp precision (6).

@zhongyujiang
Copy link
Contributor Author

@Zouxxyy @YannByron Hi, can you please help review this when you have time? Thanks.

@Aitozi
Contributor

Aitozi commented Apr 15, 2025

Hi, I'm curious what the difference is between FixedBucketExpression and BucketFunction. I'm just trying to understand why we have to use a BoundFunction rather than an Expression here?

@zhongyujiang
Contributor Author

> Hi, I'm curious what the difference is between FixedBucketExpression and BucketFunction. I'm just trying to understand why we have to use a BoundFunction rather than an Expression here?

@Aitozi Yeah, Zouxxyy also mentioned this. I think when I opened this PR, FixedBucketExpression hadn't been merged yet, so I added BucketFunction. I'll update this PR today.

@zhongyujiang zhongyujiang force-pushed the gh/spark-bucket-functions branch from f468d15 to 0122c24 on April 16, 2025 08:45
Contributor Author

@zhongyujiang zhongyujiang left a comment

I've updated the PR to resolve the comments above; please take a look when you have time. cc @Aitozi @Zouxxyy Thanks!

private transient BinaryRowWriter reuseWriter;

public InternalRowSerializer(RowType rowType, boolean isBucketKeySerializer) {
this(
Contributor Author

Bucket key projection uses different write logic for nulls of non-compact types (ref), so we cannot use the existing InternalRowSerializer constructor for bucket key serialization; this flag was added for that.
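For illustration, a rough sketch of the intended call site: the boolean constructor is the one added in this PR, while the KeyAndBucketExtractor helpers are my assumption of how the bucket is then derived, not code from this change.

import org.apache.paimon.data.BinaryRow;
import org.apache.paimon.data.InternalRow;
import org.apache.paimon.data.serializer.InternalRowSerializer;
import org.apache.paimon.table.sink.KeyAndBucketExtractor;
import org.apache.paimon.types.RowType;

class BucketComputeSketch {

    private final InternalRowSerializer bucketKeySerializer;

    BucketComputeSketch(RowType bucketKeyType) {
        // 'true' selects the bucket-key null-writing behavior added in this PR, so the
        // serialized BinaryRow hashes the same way as the bucket key projection path.
        this.bucketKeySerializer = new InternalRowSerializer(bucketKeyType, true);
    }

    int bucketOf(InternalRow bucketKey, int numBuckets) {
        BinaryRow binary = bucketKeySerializer.toBinaryRow(bucketKey);
        return KeyAndBucketExtractor.bucket(
                KeyAndBucketExtractor.bucketKeyHashCode(binary), numBuckets);
    }
}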

*
* params arg0: bucket number, arg1...argn bucket keys.
*/
class BucketFunction extends UnboundFunction {
Contributor Author

This is copied from master...Zouxxyy:incubator-paimon:dev/replace-bucket-function, except that the input type passed to SparkInternalRowWrapper was changed from bucketKeyStructType to inputType, since the Spark InternalRow wrapped by the wrapper has type inputType (which still includes the NUM_BUCKETS column).

Contributor

We have removed the NUM_BUCKETS column via ProjectedRow.from(mapping), so just pass bucketKeyStructType here.

Contributor Author

Hi @Zouxxyy, what I meant is that when creating the SparkInternalRowWrapper, we should pass in inputType instead of bucketKeyStructType, because the input row passed to this wrapper includes the NUM_BUCKETS column.
For InternalRowSerializer, we still pass in bucketKeyStructType, since as you mentioned, we’ve already removed NUM_BUCKETS using ProjectedRow.
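A compilable-in-spirit sketch of that data flow (approximated, not the PR's exact code; only the constructor with the boolean flag is from this PR, the rest is how I read the bind() internals). The wrapper is created over inputType elsewhere and arrives here as the full Paimon row view:

import org.apache.paimon.data.BinaryRow;
import org.apache.paimon.data.InternalRow;
import org.apache.paimon.data.serializer.InternalRowSerializer;
import org.apache.paimon.types.RowType;
import org.apache.paimon.utils.ProjectedRow;

class BindFlowSketch {

    // Projection that drops the NUM_BUCKETS column and keeps only the bucket key fields,
    // i.e. what ProjectedRow.from(mapping) produces in the real bind() code.
    private final ProjectedRow bucketKeyProjection;

    // Serializer built from the bucket key type only (bucketKeyStructType converted to a
    // Paimon RowType), as discussed above.
    private final InternalRowSerializer bucketKeySerializer;

    BindFlowSketch(RowType bucketKeyRowType, int[] bucketKeyPositionsInInput) {
        this.bucketKeyProjection = ProjectedRow.from(bucketKeyPositionsInInput);
        this.bucketKeySerializer = new InternalRowSerializer(bucketKeyRowType, true);
    }

    // 'fullInputRow' is the Paimon view of the whole Spark input row (for example a
    // SparkInternalRowWrapper created over inputType, which still contains NUM_BUCKETS).
    BinaryRow bucketKeyOf(InternalRow fullInputRow) {
        return bucketKeySerializer.toBinaryRow(bucketKeyProjection.replaceRow(fullInputRow));
    }
}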


Contributor

@Zouxxyy Zouxxyy Apr 18, 2025

Got it, makes sense, thanks!

@zhongyujiang zhongyujiang force-pushed the gh/spark-bucket-functions branch from 0122c24 to 8b4ae0c on April 18, 2025 05:43
@zhongyujiang
Contributor Author

@Zouxxyy rebased, please take another look when you have time, thanks!

Contributor

@Zouxxyy Zouxxyy left a comment

+1, thanks

@Zouxxyy Zouxxyy merged commit b448244 into apache:master Apr 18, 2025
20 checks passed
@zhongyujiang zhongyujiang deleted the gh/spark-bucket-functions branch May 6, 2025 13:16