-
Notifications
You must be signed in to change notification settings - Fork 113
[controller] Implement concurrently deleting versions in deleteAllVersionsInStore #2323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sushantmane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add unit tests and integration tests to validate concurrent deletion?
| /** | ||
| * Config for concurrently deleting store versions in Venice controller. | ||
| */ | ||
| public static final String ENABLE_CONCURRENTLY_DELETING_STORE_VERSIONS = "enable.concurrent_deleting_store_versions"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's follow the format consistent with other configs in the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
I'd also suggest if these keys have logical grouping to them, prefix that to the configurations as well. Additionally, some of these configurations can be agnostic to type of clean up e.g., (concurrent vs sequential and so-on)
e.g., store.version.cleanup would be the umbrella under which you can put these configs. One way to model is
store.version.cleanup.policy
store.version.cleanup.wait.time.ms
store.version.cleanup.worker.pool.size
I am aware some of the existing configurations maybe spread across but this way, we can eventually bring them under all related configs under a single umbrella/prefix and thus enables us to parse out configurations and extract feature relevant configurations with ease.
| this.sslEnabled = sslEnabled; | ||
|
|
||
| // only create the executor service when concurrent deletion is enabled for the cluster | ||
| this.executorServiceMap = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's create only one shared executor with configurable size
| "Concurrently store deletion: Failed to delete all versions in store: {} in cluster: {}", | ||
| storeName, | ||
| clusterName, | ||
| e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rethrow exception here?
| this.enableConcurrentlyDeletingStoreVersions = | ||
| props.getBoolean(ConfigKeys.ENABLE_CONCURRENTLY_DELETING_STORE_VERSIONS, false); | ||
| this.workerThreadSizeForConcurrentlyDeletingStoreVersions = | ||
| props.getInt(ConfigKeys.WORKER_THREAD_SIZE_FOR_CONCURRENTLY_DELETING_STORE_VERSIONS, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be a case where we don't want to create the threadpool when ENABLE_CONCURRENTLY_DELETING_STORE_VERSIONS is enabled? If so, let's set the default thread count to 1 so we can skip the check in VeniceHelixAdmin L568
mynameborat
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have few fundamental questions.
- Typically, the number of versions for a store should be a couple if not large. With the current solution, we are only improving deletion by a factor of that couple (which may or may not be significant). Do we have some stats on average number of versions that get cleaned up as part of single invocation of this API?
- With the current implementation, we block the caller thread until all the clean up completes successfully. So knowing the speed up factor for [1] is important to measure how soon are we relieving the caller thread to be able to consume other admin messages for the store.
- We already have a
StoreBackupVersionCleanupServicewhich deletes all old version in the store. If we decide to do this, can we extend the deletion to also all versions if needed and move all the concurrency/sequential cleanup policy for orchestration inside to keep it encapsulated. Doing so, we can haveVeniceHelixAdminonly interact w/ the clean up service.
Lastly, if there is a admin message to delete all versions for the store, is the scenario where we want to delete the store all together? That way, the admin messages for the store should no longer be critical to be blocked/unblocked. I think blocking the caller thread is just generally impacting admin message processing for other stores because admin.consumption.max.worker.thread.pool.size is currently set to 3.
Did we try increasing this number to see if that helps with admin message consumption pressure?
| /** | ||
| * Config for concurrently deleting store versions in Venice controller. | ||
| */ | ||
| public static final String ENABLE_CONCURRENTLY_DELETING_STORE_VERSIONS = "enable.concurrent_deleting_store_versions"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
I'd also suggest if these keys have logical grouping to them, prefix that to the configurations as well. Additionally, some of these configurations can be agnostic to type of clean up e.g., (concurrent vs sequential and so-on)
e.g., store.version.cleanup would be the umbrella under which you can put these configs. One way to model is
store.version.cleanup.policy
store.version.cleanup.wait.time.ms
store.version.cleanup.worker.pool.size
I am aware some of the existing configurations maybe spread across but this way, we can eventually bring them under all related configs under a single umbrella/prefix and thus enables us to parse out configurations and extract feature relevant configurations with ease.
|
Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions. |
|
Closing this pull request due to 37 days of inactivity. This is not a judgment on the value of the work. If you would like to continue, please reopen or open a new PR and we will be happy to take another look. Thank you again for contributing. |
Problem Statement
Recently, there are many store deletion operations. Since deleting stores take a long time, this issue became a problem to other admin operation as well (It can block the consumption from the admin topics if the execution takes too long).
This PR add the config and the changes to make deleteAllVersionsInStore in parallel to help speed up the deletion. For many stores, they will delete all versions for the original store, venice_system_store_davinci_push_status_store* store, venice_system_store_meta_store_* store. To preserve the order of deleting different type of stores, we can just make deleting all version concurrently.
Solution
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
Does this PR introduce any user-facing or breaking changes?