[core] Support upper bound in dynamic bucket mode by liyubin117 · Pull Request #4974 · apache/paimon

liyubin117 · 2025-01-21T07:33:18Z

Purpose

In dynamic bucket mode, an unlimited number of buckets leads to an unpredictable number of small files, which causes stability problems. We should therefore support an upper bound on the number of buckets in dynamic bucket mode.

Linked issue: close #4942

Tests

HashBucketAssignerTest
SimpleHashBucketAssignerTest

With the new feature: 3 buckets are produced after inserting 12 rows.

Without the new feature: 6 buckets are produced after inserting 12 rows.
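One way such counts could arise is sketched below. This is a simplified single-task simulation, not the code in this PR: the per-bucket target of 2 rows, the bound of 3, the round-robin overflow rule, and the class and method names are all assumptions chosen only to reproduce the 6-versus-3 result.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of bucket growth; not part of Paimon. */
class BucketCountSimulation {

    /** Returns how many buckets exist after inserting `rows` rows with distinct keys. */
    static int bucketsCreated(int rows, long targetRowsPerBucket, int maxBuckets) {
        if (maxBuckets < 1) {
            throw new IllegalArgumentException("maxBuckets must be at least 1");
        }
        List<Long> rowCounts = new ArrayList<>(); // index = bucket id, value = rows stored
        for (int row = 0; row < rows; row++) {
            int target = -1;
            // Prefer an existing bucket that has not reached its target row count.
            for (int b = 0; b < rowCounts.size(); b++) {
                if (rowCounts.get(b) < targetRowsPerBucket) {
                    target = b;
                    break;
                }
            }
            // Otherwise create a new bucket, but only while below the upper bound.
            if (target < 0 && rowCounts.size() < maxBuckets) {
                rowCounts.add(0L);
                target = rowCounts.size() - 1;
            }
            // Bound reached: reuse an existing bucket (assumed round-robin choice).
            if (target < 0) {
                target = row % rowCounts.size();
            }
            rowCounts.set(target, rowCounts.get(target) + 1);
        }
        return rowCounts.size();
    }
}
```

Under these assumptions, `bucketsCreated(12, 2, Short.MAX_VALUE)` yields 6 buckets and `bucketsCreated(12, 2, 3)` yields 3.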

API and Format

Documentation

docs/layouts/shortcodes/generated/core_configuration.html

liyubin117 · 2025-01-21T08:20:47Z

@JingsongLi CI passed, PTAL, thanks!

paimon-common/src/main/java/org/apache/paimon/CoreOptions.java

paimon-core/src/main/java/org/apache/paimon/index/SimpleHashBucketAssigner.java

paimon-common/src/main/java/org/apache/paimon/CoreOptions.java

JingsongLi

I'm not sure these modifications are effective, so here is my suggestion:
We only need to modify the logic inside PartitionIndex.assign:

Old code:

// 3. create a new bucket
for (int i = 0; i < Short.MAX_VALUE; i++) {
    if (bucketFilter.test(i) && !totalBucket.contains(i)) {
        hash2Bucket.put(hash, (short) i);
        nonFullBucketInformation.put(i, 1L);
        totalBucket.add(i);
        return i;
    }
}

// 4. too many buckets, throw exception
@SuppressWarnings("OptionalGetWithoutIsPresent")
int maxBucket = totalBucket.stream().mapToInt(Integer::intValue).max().getAsInt();
throw new RuntimeException(
        String.format(
                "Too more bucket %s, you should increase target bucket row number %s.",
                maxBucket, targetBucketRowNumber));

New code:

// 3. create a new bucket
for (int i = 0; i < max_buckets; i++) {
    if (bucketFilter.test(i) && !totalBucket.contains(i)) {
        hash2Bucket.put(hash, (short) i);
        nonFullBucketInformation.put(i, 1L);
        totalBucket.add(i);
        return i;
    }
}

// 4. max_buckets exceeded: just pick a bucket for the record,
//    e.g. the minimum bucket that belongs to this task.
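The suggestion above can be sketched as a small standalone class. This is a minimal illustration under stated assumptions, not the real PartitionIndex: the class name, the `maxBuckets` constructor parameter, and the use of a TreeSet are hypothetical, and the step that fills non-full buckets by row count is omitted, so every new hash creates a bucket until the bound is hit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for the bounded bucket-creation step. */
class BoundedBucketSketch {
    private final int maxBuckets;              // assumed upper bound on bucket ids
    private final IntPredicate bucketFilter;   // true for bucket ids this task may own
    private final Map<Integer, Integer> hash2Bucket = new HashMap<>();
    private final TreeSet<Integer> totalBucket = new TreeSet<>();

    BoundedBucketSketch(int maxBuckets, IntPredicate bucketFilter) {
        this.maxBuckets = maxBuckets;
        this.bucketFilter = bucketFilter;
    }

    int assign(int hash) {
        // 1. a key that has appeared before keeps its bucket
        Integer known = hash2Bucket.get(hash);
        if (known != null) {
            return known;
        }
        // 3. create a new bucket, but never beyond maxBuckets
        for (int i = 0; i < maxBuckets; i++) {
            if (bucketFilter.test(i) && !totalBucket.contains(i)) {
                hash2Bucket.put(hash, i);
                totalBucket.add(i);
                return i;
            }
        }
        // 4. bound reached: fall back to the smallest bucket owned by this task
        if (totalBucket.isEmpty()) {
            throw new IllegalStateException("This task owns no bucket below " + maxBuckets);
        }
        int bucket = totalBucket.first();
        hash2Bucket.put(hash, bucket);
        return bucket;
    }

    int bucketCount() {
        return totalBucket.size();
    }
}
```

With `new BoundedBucketSketch(4, i -> i % 2 == 0)`, the task owns only buckets 0 and 2, so a third distinct hash falls back to bucket 0 instead of raising an exception.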

liyubin117 · 2025-01-23T09:19:22Z

After an offline discussion with @JingsongLi, we reached a consensus: we can't just update the PartitionIndex logic, because that doesn't cover SimpleHashBucketAssigner. When the buckets are full, a random bucket is selected for writing.

liyubin117 · 2025-02-06T08:50:27Z

@JingsongLi The feature has been completed as discussed. Looking forward to your review!

paimon-core/src/main/java/org/apache/paimon/index/PartitionIndex.java

JingsongLi · 2025-02-17T05:42:27Z

paimon-core/src/main/java/org/apache/paimon/index/PartitionIndex.java

-                        maxBucket, targetBucketRowNumber));
+        // 4. exceed buckets upper bound
+        int bucket =
+                KeyAndBucketExtractor.bucketWithUpperBound(totalBucket, hash, totalBucket.size());


You should pick a bucket that this task owns, not a random one from all buckets.

Thanks for the kind reminder.

Only when a new hash arrives and the bucket limit has been exceeded is a bucket randomly selected, and it is selected from the task-owned buckets (the totalBucket instance tracks the buckets owned by this task).

The hash-to-bucket pair is then put into the cache.

After that, whenever this hash arrives again, it always finds its own bucket.

// 1. is it a key that has appeared before
if (hash2Bucket.containsKey(hash)) {
    return hash2Bucket.get(hash);
}
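The "pick once, then stay sticky" behaviour described in this reply can be illustrated in isolation. This is a rough sketch, not the PR's `bucketWithUpperBound` implementation, whose body is not shown in this thread: the class and method names are hypothetical, and a hash-modulo pick is swapped in for the random selection so the result is reproducible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of overflow assignment restricted to task-owned buckets. */
class StickyOverflowSketch {
    private final List<Integer> ownedBuckets;  // buckets owned by this task only
    private final Map<Integer, Integer> hash2Bucket = new HashMap<>();

    StickyOverflowSketch(List<Integer> ownedBuckets) {
        if (ownedBuckets.isEmpty()) {
            throw new IllegalArgumentException("The task must own at least one bucket");
        }
        this.ownedBuckets = new ArrayList<>(ownedBuckets);
    }

    /** Called for a hash once the bucket limit has been reached. */
    int assignWhenFull(int hash) {
        // A hash seen before always returns to the bucket it was given.
        Integer cached = hash2Bucket.get(hash);
        if (cached != null) {
            return cached;
        }
        // Assumed selection rule: floorMod keeps the index non-negative for negative hashes.
        int bucket = ownedBuckets.get(Math.floorMod(hash, ownedBuckets.size()));
        hash2Bucket.put(hash, bucket);
        return bucket;
    }
}
```

Because the pick is cached, replacing the modulo rule with a random choice would not change the stickiness: only the first arrival of a hash triggers a selection.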

JingsongLi · 2025-02-17T05:42:55Z

paimon-core/src/main/java/org/apache/paimon/index/PartitionIndex.java

+        int maxBucketId =
+                totalBucket.isEmpty()
+                        ? 0
+                        : totalBucket.stream().mapToInt(Integer::intValue).max().getAsInt();


The stream should not be invoked every time here; you should cache the maxBucketId.

Wise consideration :)
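The caching idea from this review comment might look roughly like the following. It is a sketch under assumptions rather than the merged code: the class name and the `addBucket` method are hypothetical, and bucket ids are assumed to be non-negative so that 0 can serve as the value for an empty set, matching the diff above.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical holder that keeps maxBucketId up to date instead of re-scanning. */
class MaxBucketCacheSketch {
    private final Set<Integer> totalBucket = new HashSet<>();
    private int maxBucketId = 0; // 0 when no bucket exists, as in the streamed version

    /** Registers a bucket and refreshes the cached maximum in O(1). */
    void addBucket(int bucket) {
        if (totalBucket.add(bucket)) {
            maxBucketId = Math.max(maxBucketId, bucket);
        }
    }

    /** Returns the cached maximum without streaming over totalBucket. */
    int maxBucketId() {
        return maxBucketId;
    }
}
```

This works because buckets are only ever added in this sketch; if buckets could also be removed, the cached value would need to be recomputed on removal.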

JingsongLi

+1

[core] Support upper bound in dynamic bucket mode

ed6ebe2

JingsongLi reviewed Jan 21, 2025

paimon-common/src/main/java/org/apache/paimon/CoreOptions.java (outdated, resolved)

wwj6591812 reviewed Jan 22, 2025

paimon-core/src/main/java/org/apache/paimon/index/SimpleHashBucketAssigner.java (outdated, resolved)

paimon-common/src/main/java/org/apache/paimon/CoreOptions.java (outdated, resolved)

liyubin117 force-pushed the max_dynamic_bucket branch 3 times, most recently from c7b7755 to df576e2, January 22, 2025 10:42

liyubin117 requested a review from JingsongLi, January 23, 2025 05:54

JingsongLi requested changes Jan 23, 2025

refactor to support total buckets upper bound

054f1df

liyubin117 force-pushed the max_dynamic_bucket branch from df576e2 to 054f1df, January 23, 2025 10:18

liyubin117 requested a review from JingsongLi, January 23, 2025 11:09

JingsongLi reviewed Feb 10, 2025

paimon-core/src/main/java/org/apache/paimon/index/PartitionIndex.java (outdated, resolved)

JingsongLi reviewed Feb 10, 2025

paimon-core/src/main/java/org/apache/paimon/index/PartitionIndex.java (outdated, resolved)

modify as Jingsong said

abcd024

JingsongLi reviewed Feb 11, 2025

paimon-core/src/main/java/org/apache/paimon/index/PartitionIndex.java (outdated, resolved)

liyubin117 force-pushed the max_dynamic_bucket branch from 1396cff to 41f3370, February 11, 2025 07:32

refactor to simplify

d419d2c

liyubin117 force-pushed the max_dynamic_bucket branch from 41f3370 to d419d2c, February 11, 2025 08:26

liyubin117 requested a review from JingsongLi, February 13, 2025 02:01

JingsongLi reviewed Feb 17, 2025

cache the maxBucketId

c68ebb5

liyubin117 force-pushed the max_dynamic_bucket branch from b574f89 to c68ebb5, February 17, 2025 09:05

JingsongLi approved these changes Feb 19, 2025

JingsongLi merged commit cf22950 into apache:master Feb 19, 2025
13 checks passed

JingsongLi merged 5 commits into apache:master from liyubin117:max_dynamic_bucket

3 participants
