Contributor

@loserwang1024 loserwang1024 commented Oct 9, 2025

Purpose

Linked issue: close #1789

Brief change log

Tests

API and Format

Documentation

@loserwang1024 loserwang1024 force-pushed the poc-sink-shuffle branch 2 times, most recently from 50ad020 to 249e260 Compare October 10, 2025 03:47
@loserwang1024 loserwang1024 requested a review from wuchong October 10, 2025 06:37
@loserwang1024
Contributor Author

@wuchong @leonardBang, CC

@loserwang1024
Contributor Author

@wuchong, CC: this PR is needed.

Copilot AI left a comment

Pull request overview

This PR implements dynamic shuffle mode for Fluss sink to address performance bottlenecks in the current bucket-based shuffle approach. The implementation collects data statistics at each checkpoint, aggregates them in a coordinator, and dynamically redistributes records across downstream subtasks based on partition traffic patterns. This approach prevents single-subtask bottlenecks and provides better load balancing, especially for partitioned tables with uneven traffic distribution.
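To make the aggregation step concrete, here is a minimal sketch of how per-subtask statistics could be merged into one global view at a checkpoint. It is an illustration under stated assumptions, not the code in this PR: the class `StatisticsMergeSketch`, the method `merge`, and the use of a plain `Map<String, Long>` keyed by partition name are all hypothetical stand-ins for whatever `DataStatistics` and `DataStatisticsCoordinator` actually do.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: not the PR's DataStatisticsCoordinator. */
public class StatisticsMergeSketch {

    /**
     * Sums the per-partition weights reported by each subtask.
     * Each input map is one subtask's local view: partition name -> observed weight.
     */
    static Map<String, Long> merge(List<Map<String, Long>> localStatistics) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> local : localStatistics) {
            for (Map.Entry<String, Long> entry : local.entrySet()) {
                // Add this subtask's weight to the running total for the partition.
                global.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return global;
    }
}
```

In a setup like the one described, the merged map would be what gets sent back to the subtasks so that each of them routes records with the same global picture.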

Key Changes:

  • Introduces DistributionMode enum with three modes: NONE, BUCKET_SHUFFLE, and DYNAMIC_SHUFFLE
  • Implements statistics collection and coordination infrastructure (DataStatistics, DataStatisticsOperator, DataStatisticsCoordinator)
  • Adds weighted assignment strategies (WeightedRandomAssignment, WeightedBucketIdAssignment) for dynamic partition-to-subtask mapping
  • Extends FlussSerializationSchema with size calculation method for accurate weight estimation
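As a rough illustration of what a weighted partition-to-subtask mapping can look like, the sketch below splits each partition's weight across subtasks so that every subtask receives about `total / numSubtasks`, then routes a record by picking a slice in proportion to its weight. This is an assumption-laden sketch, not the PR's `WeightedRandomAssignment`: the names `WeightedSplitSketch`, `Slice`, `assign`, and `select` are hypothetical, and the actual strategy in the PR may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: not the PR's WeightedRandomAssignment. */
public class WeightedSplitSketch {

    /** A share of one partition's traffic that is routed to a single subtask. */
    static final class Slice {
        final int subtask;
        final long weight;

        Slice(int subtask, long weight) {
            this.subtask = subtask;
            this.weight = weight;
        }
    }

    /** Fills subtasks in order up to ceil(total / numSubtasks), splitting hot partitions. */
    static Map<String, List<Slice>> assign(Map<String, Long> weights, int numSubtasks) {
        if (numSubtasks <= 0) {
            throw new IllegalArgumentException("numSubtasks must be positive");
        }
        long total = 0;
        for (long w : weights.values()) {
            if (w < 0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            total += w;
        }
        long target = (total + numSubtasks - 1) / numSubtasks; // per-subtask capacity
        Map<String, List<Slice>> result = new LinkedHashMap<>();
        int subtask = 0;
        long remaining = target;
        for (Map.Entry<String, Long> entry : weights.entrySet()) {
            List<Slice> slices = new ArrayList<>();
            long left = entry.getValue();
            while (left > 0) {
                long take = Math.min(left, remaining);
                slices.add(new Slice(subtask, take));
                left -= take;
                remaining -= take;
                // Move on once the current subtask is full; the last one absorbs the rest.
                if (remaining == 0 && subtask < numSubtasks - 1) {
                    subtask++;
                    remaining = target;
                }
            }
            result.put(entry.getKey(), slices);
        }
        return result;
    }

    /** Maps pick, a value in [0, partition weight), to the subtask owning that slice. */
    static int select(List<Slice> slices, long pick) {
        for (Slice slice : slices) {
            if (pick < slice.weight) {
                return slice.subtask;
            }
            pick -= slice.weight;
        }
        throw new IllegalArgumentException("pick is outside the partition's total weight");
    }
}
```

At runtime `pick` would typically come from a random generator bounded by the partition's total weight, which is what makes the routing weighted-random; it is passed in explicitly here only to keep the sketch deterministic.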

Reviewed changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 19 comments.

| File | Description |
| --- | --- |
| DistributionMode.java | New enum defining shuffle distribution modes (NONE, BUCKET_SHUFFLE, DYNAMIC_SHUFFLE) |
| DataStatistics.java | Data structure tracking partition names and their frequency counts |
| DataStatisticsOperator.java | Flink operator collecting local statistics and forwarding records wrapped with statistics |
| DataStatisticsCoordinator.java | Coordinator aggregating statistics from subtasks and broadcasting global statistics |
| StatisticsOrRecordChannelComputer.java | Channel computer implementing dynamic shuffle based on partition statistics |
| WeightedRandomAssignment.java | Partition assignment strategy using weighted random distribution |
| WeightedBucketIdAssignment.java | Bucket-aware partition assignment combining bucketing with weighted distribution |
| RowDataSerializationSchema.java | Extended with size() method to calculate record size for weight estimation |
| FlinkSink.java | Updated to support DYNAMIC_SHUFFLE mode with pre-write topology transformation |
| FlinkTableSink.java | Modified to use DistributionMode and provide DataStreamSinkProvider for dynamic shuffle |
| FlussSinkBuilder.java | Updated API to accept DistributionMode and TypeInformation for shuffle configuration |
| FlinkConnectorOptions.java | Added SINK_DISTRIBUTION_MODE configuration option |
| Test files | Comprehensive unit tests for statistics, channel computer, and partition assignment |

@loserwang1024 loserwang1024 force-pushed the poc-sink-shuffle branch 2 times, most recently from e6591b3 to a1ccee5 Compare December 31, 2025 02:50
@loserwang1024
Contributor Author

@wuchong, I have rebased this PR. CC

Development

Successfully merging this pull request may close these issues.

Fluss sink supports dynamic shuffle