Skip to content

Conversation

@sofiaz11
Copy link
Collaborator

@sofiaz11 sofiaz11 commented Dec 2, 2025

Problem Statement

Recently, there are many store deletion operations. Since deleting stores take a long time, this issue became a problem to other admin operation as well (It can block the consumption from the admin topics if the execution takes too long).

This PR add the config and the changes to make deleteAllVersionsInStore in parallel to help speed up the deletion. For many stores, they will delete all versions for the original store, venice_system_store_davinci_push_status_store* store, venice_system_store_meta_store_* store. To preserve the order of deleting different type of stores, we can just make deleting all version concurrently.

Solution

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
  • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@sofiaz11 sofiaz11 changed the title [controller] Implement concurrently deleting versions in deleteAllVer… [controller] Implement concurrently deleting versions in deleteAllVersionsInStore Dec 2, 2025
Copy link
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add unit tests and integration tests to validate concurrent deletion?

/**
* Config for concurrently deleting store versions in Venice controller.
*/
public static final String ENABLE_CONCURRENTLY_DELETING_STORE_VERSIONS = "enable.concurrent_deleting_store_versions";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's follow the format consistent with other configs in the file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.
I'd also suggest if these keys have logical grouping to them, prefix that to the configurations as well. Additionally, some of these configurations can be agnostic to type of clean up e.g., (concurrent vs sequential and so-on)

e.g., store.version.cleanup would be the umbrella under which you can put these configs. One way to model is

store.version.cleanup.policy
store.version.cleanup.wait.time.ms
store.version.cleanup.worker.pool.size

I am aware some of the existing configurations maybe spread across but this way, we can eventually bring them under all related configs under a single umbrella/prefix and thus enables us to parse out configurations and extract feature relevant configurations with ease.

this.sslEnabled = sslEnabled;

// only create the executor service when concurrent deletion is enabled for the cluster
this.executorServiceMap = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's create only one shared executor with configurable size

"Concurrently store deletion: Failed to delete all versions in store: {} in cluster: {}",
storeName,
clusterName,
e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rethrow exception here?

this.enableConcurrentlyDeletingStoreVersions =
props.getBoolean(ConfigKeys.ENABLE_CONCURRENTLY_DELETING_STORE_VERSIONS, false);
this.workerThreadSizeForConcurrentlyDeletingStoreVersions =
props.getInt(ConfigKeys.WORKER_THREAD_SIZE_FOR_CONCURRENTLY_DELETING_STORE_VERSIONS, -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a case where we don't want to create the threadpool when ENABLE_CONCURRENTLY_DELETING_STORE_VERSIONS is enabled? If so, let's set the default thread count to 1 so we can skip the check in VeniceHelixAdmin L568

Copy link
Contributor

@mynameborat mynameborat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have few fundamental questions.

  1. Typically, the number of versions for a store should be a couple if not large. With the current solution, we are only improving deletion by a factor of that couple (which may or may not be significant). Do we have some stats on average number of versions that get cleaned up as part of single invocation of this API?
  2. With the current implementation, we block the caller thread until all the clean up completes successfully. So knowing the speed up factor for [1] is important to measure how soon are we relieving the caller thread to be able to consume other admin messages for the store.
  3. We already have a StoreBackupVersionCleanupService which deletes all old version in the store. If we decide to do this, can we extend the deletion to also all versions if needed and move all the concurrency/sequential cleanup policy for orchestration inside to keep it encapsulated. Doing so, we can have VeniceHelixAdmin only interact w/ the clean up service.

Lastly, if there is a admin message to delete all versions for the store, is the scenario where we want to delete the store all together? That way, the admin messages for the store should no longer be critical to be blocked/unblocked. I think blocking the caller thread is just generally impacting admin message processing for other stores because admin.consumption.max.worker.thread.pool.size is currently set to 3.

Did we try increasing this number to see if that helps with admin message consumption pressure?

/**
* Config for concurrently deleting store versions in Venice controller.
*/
public static final String ENABLE_CONCURRENTLY_DELETING_STORE_VERSIONS = "enable.concurrent_deleting_store_versions";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.
I'd also suggest if these keys have logical grouping to them, prefix that to the configurations as well. Additionally, some of these configurations can be agnostic to type of clean up e.g., (concurrent vs sequential and so-on)

e.g., store.version.cleanup would be the umbrella under which you can put these configs. One way to model is

store.version.cleanup.policy
store.version.cleanup.wait.time.ms
store.version.cleanup.worker.pool.size

I am aware some of the existing configurations maybe spread across but this way, we can eventually bring them under all related configs under a single umbrella/prefix and thus enables us to parse out configurations and extract feature relevant configurations with ease.

@github-actions
Copy link

github-actions bot commented Jan 3, 2026

Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.

@github-actions github-actions bot added the stale label Jan 3, 2026
@github-actions
Copy link

Closing this pull request due to 37 days of inactivity. This is not a judgment on the value of the work. If you would like to continue, please reopen or open a new PR and we will be happy to take another look. Thank you again for contributing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants