[core] Support upper bound in dynamic bucket mode #4974
JingsongLi merged 5 commits into apache:master
@JingsongLi CI passed, PTAL, thanks!
paimon-core/src/main/java/org/apache/paimon/index/SimpleHashBucketAssigner.java
c7b7755 to df576e2
I'm not sure these modifications are effective, so here is my suggestion:
We only need to modify the inside of PartitionIndex.assign:
Old code:
// 3. create a new bucket
for (int i = 0; i < Short.MAX_VALUE; i++) {
if (bucketFilter.test(i) && !totalBucket.contains(i)) {
hash2Bucket.put(hash, (short) i);
nonFullBucketInformation.put(i, 1L);
totalBucket.add(i);
return i;
}
}
// 4. too many buckets, throw exception
@SuppressWarnings("OptionalGetWithoutIsPresent")
int maxBucket = totalBucket.stream().mapToInt(Integer::intValue).max().getAsInt();
throw new RuntimeException(
String.format(
"Too more bucket %s, you should increase target bucket row number %s.",
maxBucket, targetBucketRowNumber));
New code:
// 3. create a new bucket
for (int i = 0; i < max_buckets; i++) {
if (bucketFilter.test(i) && !totalBucket.contains(i)) {
hash2Bucket.put(hash, (short) i);
nonFullBucketInformation.put(i, 1L);
totalBucket.add(i);
return i;
}
}
// 4. max_buckets exceeded: just pick an existing bucket for the record,
// namely the minimum bucket that belongs to this task.
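The suggestion above can be sketched as a small standalone class. This is only an illustration under assumptions, not Paimon's PartitionIndex: the class name `BoundedBucketAssignSketch`, the `maxBuckets` field (standing in for `max_buckets`), and the use of a `TreeSet` are hypothetical, and step 2 (reusing non-full buckets by row count) is left out, so every new hash opens a new bucket until the bound is reached.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for the assign logic discussed above; not Paimon's PartitionIndex. */
class BoundedBucketAssignSketch {
    final Map<Integer, Short> hash2Bucket = new HashMap<>();
    final TreeSet<Integer> totalBucket = new TreeSet<>();
    final int maxBuckets;            // assumed upper bound ("max_buckets" in the comment)
    final IntPredicate bucketFilter; // true for bucket ids this task is allowed to own

    BoundedBucketAssignSketch(int maxBuckets, IntPredicate bucketFilter) {
        this.maxBuckets = maxBuckets;
        this.bucketFilter = bucketFilter;
    }

    int assign(int hash) {
        // 1. is it a key that has appeared before
        Short known = hash2Bucket.get(hash);
        if (known != null) {
            return known;
        }
        // 3. create a new bucket, but only below the upper bound
        for (int i = 0; i < maxBuckets; i++) {
            if (bucketFilter.test(i) && !totalBucket.contains(i)) {
                hash2Bucket.put(hash, (short) i);
                totalBucket.add(i);
                return i;
            }
        }
        // 4. bound reached: fall back to the smallest bucket owned by this task
        if (totalBucket.isEmpty()) {
            throw new IllegalStateException("No bucket below the bound belongs to this task");
        }
        int bucket = totalBucket.first();
        hash2Bucket.put(hash, (short) bucket);
        return bucket;
    }

    public static void main(String[] args) {
        // This task owns even bucket ids; at most 4 bucket ids exist in total.
        BoundedBucketAssignSketch index = new BoundedBucketAssignSketch(4, i -> i % 2 == 0);
        System.out.println(index.assign(10));
        System.out.println(index.assign(11));
        System.out.println(index.assign(12)); // bound reached, reuses an owned bucket
    }
}
```

Always falling back to the minimum bucket is simple, but it concentrates all overflow records in one bucket per task.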
After offline discussion with @JingsongLi, we have reached a consensus: We can't just update
df576e2 to 054f1df
@JingsongLi The feature has been completed as discussed. Looking forward to your review!
paimon-core/src/main/java/org/apache/paimon/index/PartitionIndex.java
1396cff to 41f3370
41f3370 to d419d2c
maxBucket, targetBucketRowNumber));
// 4. exceed buckets upper bound
int bucket =
        KeyAndBucketExtractor.bucketWithUpperBound(totalBucket, hash, totalBucket.size());
You should pick a bucket from the ones this task owns, not randomly from all buckets.
Thanks for the kind reminder.
- A bucket is randomly selected from the task-owned buckets only when a new hash arrives and the bucket limit is exceeded (the totalBucket instance maintains the buckets owned by this task).
- Then the key-value pair of this hash and bucket is put into the cache.
- After that, whenever this hash arrives, it will always find its own bucket.
// 1. is it a key that has appeared before
if (hash2Bucket.containsKey(hash)) {
return hash2Bucket.get(hash);
}
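To make the three points above concrete, here is a minimal sketch of the overflow path. It is an assumption-laden illustration: `pickOwnedBucket` is a hypothetical helper that selects by hash modulo the number of owned buckets, and it is not claimed to match the actual `KeyAndBucketExtractor.bucketWithUpperBound` implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the overflow path; not the code merged in this PR. */
class OwnedBucketFallbackSketch {
    final Map<Integer, Integer> hash2Bucket = new HashMap<>();
    final Set<Integer> totalBucket; // buckets owned by this task only

    OwnedBucketFallbackSketch(Set<Integer> ownedBuckets) {
        this.totalBucket = new TreeSet<>(ownedBuckets);
    }

    /** Hypothetical helper: choose one of the owned buckets based on the hash. */
    static int pickOwnedBucket(Set<Integer> owned, int hash) {
        if (owned.isEmpty()) {
            throw new IllegalArgumentException("Task owns no buckets");
        }
        List<Integer> sorted = new ArrayList<>(new TreeSet<>(owned));
        // floorMod keeps the index non-negative for negative hashes
        return sorted.get(Math.floorMod(hash, sorted.size()));
    }

    /** Assignment once the bucket upper bound has been reached. */
    int assignWhenFull(int hash) {
        // 1. is it a key that has appeared before
        Integer cached = hash2Bucket.get(hash);
        if (cached != null) {
            return cached;
        }
        // 4. exceed buckets upper bound: choose among task-owned buckets, then cache
        int bucket = pickOwnedBucket(totalBucket, hash);
        hash2Bucket.put(hash, bucket);
        return bucket;
    }

    public static void main(String[] args) {
        OwnedBucketFallbackSketch index =
                new OwnedBucketFallbackSketch(new TreeSet<>(Arrays.asList(1, 3, 5)));
        System.out.println(index.assignWhenFull(7));
        System.out.println(index.assignWhenFull(7)); // same bucket, served from the cache
    }
}
```

Because the choice is cached in hash2Bucket, a hash keeps its bucket even if the selection rule would later give a different answer.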
int maxBucketId =
        totalBucket.isEmpty()
                ? 0
                : totalBucket.stream().mapToInt(Integer::intValue).max().getAsInt();
This should not invoke the stream every time; you should cache the maxBucketId.
Wise consideration :)
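One way to cache the maximum, sketched under the assumption that buckets are only ever added to totalBucket and never removed. The class `MaxBucketCacheSketch` and its methods are hypothetical names for illustration, not the code that was merged.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: keep maxBucketId up to date on insert instead of streaming on every call. */
class MaxBucketCacheSketch {
    private final Set<Integer> totalBucket = new HashSet<>();
    private int maxBucketId = 0; // 0 while empty, same default as the diff above

    void addBucket(int bucket) {
        if (totalBucket.add(bucket)) {
            // O(1) update; valid only because buckets are never removed in this sketch
            maxBucketId = Math.max(maxBucketId, bucket);
        }
    }

    int maxBucketId() {
        return maxBucketId;
    }

    /** The stream-based computation from the diff, kept here only for comparison. */
    int recomputedMax() {
        return totalBucket.isEmpty()
                ? 0
                : totalBucket.stream().mapToInt(Integer::intValue).max().getAsInt();
    }

    public static void main(String[] args) {
        MaxBucketCacheSketch cache = new MaxBucketCacheSketch();
        cache.addBucket(3);
        cache.addBucket(1);
        cache.addBucket(7);
        System.out.println(cache.maxBucketId() == cache.recomputedMax());
    }
}
```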
b574f89 to c68ebb5
Purpose
In dynamic bucket mode, an unlimited number of buckets leads to an unpredictable number of small files, which in turn causes stability problems. We should therefore support an upper bound on the number of buckets in dynamic bucket mode.
Linked issue: close #4942
Tests
HashBucketAssignerTest
SimpleHashBucketAssignerTest
API and Format
Documentation
docs/layouts/shortcodes/generated/core_configuration.html