feat: Introduce the configuration for forcibly triggering timeline compaction #17779
Conversation
Review thread on hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java (outdated, resolved)
```java
  log.info("No Instants to archive");
}
// run compact and clean if needed even if no instants were archived
if (!instantsToArchive.isEmpty() || config.isTimelineCompactionForced()) {
```
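For context, here is a minimal sketch of how a writer might opt into this flag. The config key name below is an assumption; only the accessor `config.isTimelineCompactionForced()` is visible in the diff:

```java
// Minimal sketch, not the PR's actual wiring: builds a write config with a
// *hypothetical* key for the new flag. Only isTimelineCompactionForced()
// appears in the diff; the key name below is assumed.
import java.util.Properties;

import org.apache.hudi.config.HoodieWriteConfig;

public class ForcedTimelineCompactionConfig {
  public static HoodieWriteConfig build(String basePath) {
    Properties props = new Properties();
    // Hypothetical key name for the flag introduced by this PR.
    props.setProperty("hoodie.archive.timeline.compaction.forced", "true");
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withProps(props)
        .build();
  }
}
```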
The current compaction strategy is kind of eager (it compacts all the layers cascadingly) when a new archived file is generated, so even if we trigger compaction for commits that do not archive, it does not take effect as expected at all.
Wondering whether we could introduce some lazy compaction strategies or add an upper threshold for the target file size of a single round of compaction.
> Wondering whether we could introduce some lazy compaction strategies or add an upper threshold for the target file size of a single round of compaction.
I agree with this idea.
The current strategy is relatively simple, and in some corner cases compaction will be blocked, resulting in all newly added archived files not being compacted.
Therefore, I will propose two PRs later:
- Solve the current corner case that blocks normal compaction
- Introduce more diverse compaction strategies, not only for trigger timing (lazy/eager) but also for candidate selection, such as finding at most one batch of candidates at each level, or compacting each level as much as possible, etc. (see the sketch after this list)
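As a sketch of what such pluggable strategies might look like (all names here are hypothetical, not part of this PR or of Hudi today):

```java
// Illustrative only: a hypothetical interface separating the two axes
// discussed above -- when to trigger (lazy vs. eager) and how many
// candidates to pick per level. None of these names exist in Hudi.
import java.util.List;

interface TimelineCompactionStrategy {

  // Trigger timing: e.g. eager = whenever new files were archived,
  // lazy = only once a level accumulates enough files.
  boolean shouldTrigger(boolean newFilesArchived, int filesInSmallestLevel);

  // Candidate selection for one level: e.g. at most one batch,
  // or every eligible file in the level.
  List<String> selectCandidates(int level, List<String> filesInLevel, int batchSize);
}
```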
> so even if we trigger compaction for commits that do not archive, it does not take effect as expected at all.
This is related to the current strategy.
Currently, each time compaction is triggered, each level will only merge at most one batch of files, and then move on to the next level for merging.
So if I have 100 files at layer L0 that can be merged, but with batch = 10, that compaction will only merge the first ten. Therefore, if we don't provide a switch to force compaction to be triggered, the remaining 90 files won't be merged until some new instants come into the archive.
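A toy illustration of that arithmetic (plain Java, not Hudi code):

```java
// Toy model of the behavior described above: with 100 mergeable files at
// L0 and batch = 10, one compaction pass merges only the first batch,
// leaving 90 files waiting for the next trigger.
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class OneBatchPerLevel {
  public static void main(String[] args) {
    List<String> l0Files = IntStream.range(0, 100)
        .mapToObj(i -> "l0-file-" + i)
        .collect(Collectors.toList());
    int batchSize = 10;

    // One pass per level merges at most one batch.
    List<String> mergedThisRound = l0Files.subList(0, batchSize);
    List<String> stillWaiting = l0Files.subList(batchSize, l0Files.size());

    System.out.println("merged this round: " + mergedThisRound.size()); // 10
    System.out.println("still waiting:     " + stillWaiting.size());   // 90
  }
}
```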
> I agree with this idea. The current strategy is relatively simple, and in some corner cases compaction will be blocked, resulting in all newly added archived files not being compacted. Therefore, I will propose two PRs later:
> - Solve the current corner case that blocks normal compaction

First PR: #17784
> So if I have 100 files at layer L0 that can be merged, but with batch = 10, that compaction will only merge the first ten. Therefore, if we don't provide a switch to force compaction to be triggered, the remaining 90 files won't be merged until some new instants come into the archive.
This could not happen: at most one file is generated in each layer, so we just need to merge one batch of files.
> This could not happen: at most one file is generated in each layer, so we just need to merge one batch of files.
We encountered a scenario like this:
- The instants were written normally, and at that time the archiver was also invoked once normally.
- The archiver archived the active instants.
- Because instants were archived, a timeline compaction was triggered. However, the timeline compaction failed due to some occasional exception, such as an HDFS timeout.
- At this point, I re-ran archival, but since there were no instants that needed to be archived, timeline compaction could never be triggered until there was an active instant that needed to be archived.

In this case, I do need a way to forcibly trigger timeline compaction.
Do you think it makes more sense to add a Spark procedure to trigger the timeline compaction instead of binding it with the write?
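If that route is taken, the invocation might look similar to Hudi's existing CALL procedures; the procedure name below is purely hypothetical:

```java
// Hypothetical usage sketch: invoking a (not yet existing) procedure for
// timeline compaction from Spark SQL, in the style of Hudi's existing
// CALL procedures. "run_timeline_compaction" is an assumed name.
import org.apache.spark.sql.SparkSession;

public class TimelineCompactionProcedure {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("timeline-compaction-procedure")
        .getOrCreate();

    spark.sql("CALL run_timeline_compaction(table => 'hudi_table')").show();
  }
}
```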
Nice idea 👍 I agree with it; it will be clearer and more reasonable.
Force-pushed from ba986e3 to f5c5029:

feat: Introduce the configuration for forcibly triggering timeline compaction
1. Introduce the configuration for forcibly triggering timeline compaction
Signed-off-by: TheR1sing3un <[email protected]>

feat: enable timeline compaction by default
1. enable timeline compaction by default
Signed-off-by: TheR1sing3un <[email protected]>

rerun
Describe the issue this Pull Request addresses
closes #17778
Summary and Changelog
Impact
none
Risk Level
low
Documentation Update
none
Contributor's checklist