Conversation

@zhongyujiang
Contributor

Purpose

Linked issue: part of #4816

Support the Spark DataSource V2 write path to reduce write serialization overhead and accelerate writing to primary key tables in Spark. Currently only fixed-bucket tables are supported.

Tests

BucketFunctionTest, SparkWriteITCase

PaimonSourceWriteBenchmark:

Benchmark                           Mode  Cnt   Score    Error  Units
PaimonSourceWriteBenchmark.v1Write    ss    5  13.845 ± 23.192   s/op
PaimonSourceWriteBenchmark.v2Write    ss    5   9.579 ± 14.929   s/op

API and Format

Documentation

Add a config spark.sql.paimon.use-v2-write to enable switching to the v2 write path; it falls back to the v1 write when an unsupported scenario is encountered (e.g. a HASH_DYNAMIC bucket mode table).
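The flag can be toggled at the session level, e.g. (a sketch assuming the standard Spark SQL SET syntax; only the config key itself comes from this PR):

```sql
-- Opt in to the DataSource V2 write path for this session
SET spark.sql.paimon.use-v2-write = true;

-- Writes that hit an unsupported case (e.g. HASH_DYNAMIC bucket mode)
-- silently fall back to the v1 write path
INSERT INTO paimon_pk_table SELECT * FROM source_table;
```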

Note: this is a draft PR covering the overall change; it will be split into smaller PRs for easier review.

import org.apache.spark.sql.catalyst.InternalRow;

import java.io.Serializable;

/** Wraps a Spark {@link InternalRow} as a Paimon {@link org.apache.paimon.data.InternalRow}. */
public class SparkInternalRowWrapper implements org.apache.paimon.data.InternalRow, Serializable {
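The wrapper adapts Spark's row representation to Paimon's row interface by delegating field accessors, so rows flow through the write path without a per-row copy. A simplified, self-contained sketch of that adapter idea (the interface and class names below are illustrative stand-ins, not the actual Spark or Paimon APIs):

```java
import java.io.Serializable;

/** Illustrative stand-in for the source row type (plays the role of Spark's InternalRow). */
interface SourceRow {
    int numFields();
    int getInt(int ordinal);
}

/** Illustrative stand-in for the target row interface (plays the role of Paimon's InternalRow). */
interface TargetRow {
    int getFieldCount();
    int getInt(int pos);
}

/** Adapter: exposes a SourceRow through the TargetRow interface without copying values. */
class RowWrapper implements TargetRow, Serializable {
    private final SourceRow row;

    RowWrapper(SourceRow row) {
        this.row = row;
    }

    @Override
    public int getFieldCount() {
        // Delegate directly; no intermediate row object is materialized.
        return row.numFields();
    }

    @Override
    public int getInt(int pos) {
        return row.getInt(pos);
    }
}
```

Avoiding the serialization round-trip and reading fields lazily through such a wrapper is the kind of overhead reduction the benchmark above measures.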
Contributor


Oh, it looks like #5159 has some duplicated work.

Contributor Author


Yes, I'll rebase once it's merged.

@zhongyujiang
Contributor Author

Replaced by #5242 and #5531.
