-
Notifications
You must be signed in to change notification settings - Fork 791
SOLR-17947: CloudSolrClient refreshes collection state asynchronously using a dedicated thread pool to reduce ZooKeeper blocking under load. #3851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
markrmiller
commented
Nov 7, 2025
- replace striped locks with single-flight cache refresh futures to stop thundering herd
- keep stale entries usable while background refresh runs, update retry semantics accordingly
… using a dedicated thread pool to reduce ZooKeeper blocking under load.
dsmiley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some impressive testing!
| new ConcurrentHashMap<>(); | ||
| private final Object stateRefreshExecutorLock = new Object(); | ||
| private volatile int stateRefreshParallelism = DEFAULT_STATE_REFRESH_PARALLELISM; | ||
| private volatile ExecutorService stateRefreshExecutor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it disappointing to see yet another ExecutorService (ThreadPool) for this use-case, so close to a a place where we can call HttpSolrClientBase.requestAsync, which uses an existing threadpool and non-blocking IO. The parallelism can be controlled with a semaphore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate to see comments like this on this project. The comment correctly flags that adding another ExecutorService is questionable, but it’s mostly expressing frustration rather than explaining the risk, and it doesn’t do much to invite collaboration. It doesn’t say why another executor is a problem (extra threads, lifecycle/timeout divergence, duplicated logic), and it only implicitly suggests requestAsync as the alternative. A better version would explicitly state the operational and maintenance concerns, clearly propose reusing HttpSolrClientBase.requestAsync (or ask why it can’t be reused), and frame this as a question or suggestion instead of a verdict, in a neutral tone that focuses on the design rather than how disappointing it feels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know my audience (you), and figured you would be receptive. As you followed my recommendation and obviously knew why I was recommend this, I was right. Sorry if it rubbed you the wrong way, and thanks for being receptive.
| retriedAtNano = System.nanoTime(); | ||
| } | ||
|
|
||
| boolean markMaybeStaleIfOutsideBackoff(long retryBackoffNano) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit of javadocs would be helpful. Especially to document the return meaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type of comment has similar deficiencies. There is no concrete reason why you suggest this. It’s framed as a drive-by suggestion, not a joint attempt to clarify the API. That makes it easy to dismiss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(shock) You are asking me to give a concrete reason to ask that you write a line of javadoc to a method that had none? Is that unreasonable? I think my request is fair and I was sincere in asking (I was confused; javadoc would help).
| } else { | ||
| future = CompletableFuture.supplyAsync(() -> loadDocCollection(key), executor); | ||
| } | ||
| future.whenComplete( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could happen that whenComplete is invoked by the caller thread if the future has already completed. I think. The lambda you pass in will manipulate collectionRefreshes, which violates the rules of its compute methods on a ConcurrentHashMap.
| ); | ||
|
|
||
| protected volatile Object[] locks = objectList(3); | ||
| private final ConcurrentHashMap<String, CompletableFuture<DocCollection>> collectionRefreshes = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of creating an additional cache, I think it would be better to integrate into StateCache, which is already collection-keyed, already has eviction of old stuff.
| } | ||
| } | ||
|
|
||
| // First retry without sending state versions to avoid needless waits when stale state is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate why "sending state versions" (I assume \_stateVer\_) is a problem?
| inputCollections, | ||
| /*skipStateVersion*/ true, | ||
| refreshesToWaitFor, | ||
| waitedForRefresh); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer like you do elsewhere: /*waitedForRefresh*/ true
|
Addressed @dsmiley’s inline notes: reused the existing update thread pool with a semaphore for refreshes, added javadocs for markMaybeStaleIfOutsideBackoff, switched the refresh cleanup to whenCompleteAsync so we don’t mutate the map inside compute, documented why we temporarily drop stateVer, and tagged the waitedForRefresh parameter. I don't agree that the in-flight refresh tracker should be folded directly into StateCache. |
… using a dedicated thread pool to reduce ZooKeeper blocking under load.
dsmiley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe sometime after this merges, I'll throw up a PR straw-man collectionRefreshes going away if we enhance StateCache. If so, I'll tag you for review.
In the bigger picture, I have a desire to move the state caching from CloudSolrClient into ClusterStateProvider, and then also using CSP within Solr (server) to places like HttpSolrCall and maybe more. This might sound like crazy-talk, but... there's a vision behind this
|
|
||
| // First retry without sending state versions to avoid needless waits when stale state is | ||
| // still usable. | ||
| // First retry without sending state versions so the server does not immediately reject the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we ought to improve the server at some point (not this issue/PR) to not reject a request due to a non-synchronized stateVer. I filed https://issues.apache.org/jira/browse/SOLR-17601 for that. I'll should add a comment there that it should be able to simplify this code/complexity you are adding here. Maybe you can add a reference to that JIRA here, please, to alert the reader.
| ExecutorService executor = threadPool; | ||
| CompletableFuture<DocCollection> future; | ||
| if (executor == null || ExecutorUtil.isShutdown(executor)) { | ||
| future = new CompletableFuture<>(); | ||
| try { | ||
| future.complete(loadDocCollection(key)); | ||
| } catch (Throwable t) { | ||
| future.completeExceptionally(t); | ||
| } | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're trying hard to load a collection's state in spite of a strong indication that our CloudSolrClient is shutting down. Why bother? Imagine instead simply assume threadPool is never null and proceed, leading to a RejectedExecutionException (reasonable). Our close method needn't set threadPool to null.
| } | ||
|
|
||
| private CompletableFuture<DocCollection> triggerCollectionRefresh(String collection) { | ||
| if (closed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm skeptical we should do any work whatsoever if we are closed.
| executor); | ||
| } | ||
| future.whenComplete( | ||
| future.whenCompleteAsync( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This uses the default (ForkJoinPool) executor. Please pass the executor.
Maybe Solr should put the methods that use that executor on a forbidden-api list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also; the task to do here is so trivially quick that I don't think it makes sense to try to do asynchronously.
|
A question for you: in your mind, are code reviews strictly either "drive by" or "collaborative"? Is there something else? Can they be a bit of both? Is "drive by" unappreciative/unwelcome/frowned-upon? I can appreciate a yearning for collaboration... collaboration is good... but can we assume/expect those who review code to always have the time to be, well, collaborative? |
| future = | ||
| CompletableFuture.supplyAsync( | ||
| () -> { | ||
| stateRefreshSemaphore.acquireUninterruptibly(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we should acquire before supplyAsync, thus some potential back-pressure for triggerCollectionRefresh's callers? not sure honestly
dsmiley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really appreciate your improvements here, by the way; thank you!