@swuferhong
Contributor

@swuferhong swuferhong commented Mar 11, 2025

Purpose

Linked issue: #570

This PR aims to support hashing by bucket key for PrimaryKey tables in the Flink sink. The approach is to shuffle the data by bucket id before it is written to the Flink sink, so that rows with the same bucket id are sent to the same sink subtask. This allows the Fluss writer to batch KV data more effectively, which reduces pressure on the client and improves throughput. In testing, this greatly improved the throughput of writes to Fluss PrimaryKey tables and reduced memory usage, while also reducing pressure on the server side (fewer putKv requests and larger CDC log batches).
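
To illustrate the routing idea, here is a minimal sketch in plain Java. It is not the code in this PR and uses no Flink or Fluss classes: `BucketShuffleSketch`, `bucketId`, `targetSubtask`, and `shuffle` are hypothetical names, and the hash-modulo bucketing is an assumption made only for the example (the real bucket assignment in Fluss may differ).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: plain Java, no Flink or Fluss classes.
final class BucketShuffleSketch {

    // Hypothetical bucketing function (assumption): hash of the bucket key
    // modulo the number of buckets. Fluss's real assignment may differ.
    static int bucketId(String bucketKey, int numBuckets) {
        return Math.floorMod(bucketKey.hashCode(), numBuckets);
    }

    // Rows with the same bucket id always map to the same sink subtask.
    static int targetSubtask(int bucketId, int sinkParallelism) {
        return bucketId % sinkParallelism;
    }

    // Groups bucket keys by the subtask that would receive them after the shuffle.
    static Map<Integer, List<String>> shuffle(
            List<String> bucketKeys, int numBuckets, int sinkParallelism) {
        Map<Integer, List<String>> bySubtask = new TreeMap<>();
        for (String key : bucketKeys) {
            int subtask = targetSubtask(bucketId(key, numBuckets), sinkParallelism);
            bySubtask.computeIfAbsent(subtask, k -> new ArrayList<>()).add(key);
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        // Four rows, four buckets, two sink subtasks.
        List<String> keys = Arrays.asList("a", "b", "a", "c");
        System.out.println(shuffle(keys, 4, 2));
    }
}
```

Because every row of a given bucket ends up on one subtask, each writer only accumulates batches for the buckets it owns, which is the effect the description above attributes to the shuffle.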

Brief change log

Tests

API and Format

Documentation

@swuferhong swuferhong closed this Mar 11, 2025
@swuferhong swuferhong reopened this Mar 11, 2025
@wuchong wuchong changed the title from "[flink] Flink sink support hash by bucket key for PrimaryKey Table" to "[flink] Flink sink support hash by bucket id for PrimaryKey Table" Mar 12, 2025
@wuchong
Member

wuchong commented Mar 12, 2025

The test coverage check failed:

Warning:  Rule violated for class com.alibaba.fluss.connector.flink.sink.RowDataKeySelector: lines covered ratio is 0.00, but expected minimum is 0.70

@swuferhong
Contributor Author

@wuchong comments addressed.

@swuferhong swuferhong linked an issue Mar 20, 2025 that may be closed by this pull request
2 tasks
Comment on lines +160 to +170
void testAppendLogWithBucketKeyWithSinkBucketShuffle() throws Exception {
    testAppendLogWithBucketKey(true);
}

@Test
void testAppendLogWithBucketKeyWithoutSinkBucketShuffle() throws Exception {
    testAppendLogWithBucketKey(false);
}

private void testAppendLogWithBucketKey(boolean sinkBucketShuffle) throws Exception {
Member

Did you try @ParameterizedTest? It would be better to use @ParameterizedTest if possible.

Contributor Author

I tried, but I ran into the same problem as when I developed #351 (add parameterized test in FlinkTableSinkITCase): with @ParameterizedTest, the CI on GitHub hangs until it times out, while the tests pass in my local environment. I'll create an issue to track this problem and look into the root cause later: #659

Member

Thanks. I think we need to investigate the reason; otherwise, we won't know whether to use @ParameterizedTest or not.

@wuchong wuchong merged commit 38b6c4a into apache:main Mar 24, 2025
4 checks passed

Development

Successfully merging this pull request may close these issues.

Flink sink support hash by bucket id for PrimaryKey Table

2 participants