SOLR-17947: CloudSolrClient refreshes collection state asynchronously using a dedicated thread pool to reduce ZooKeeper blocking under load. #3851

markrmiller · 2025-11-07T22:58:29Z

replace striped locks with single-flight cache refresh futures to stop thundering herd
keep stale entries usable while background refresh runs, update retry semantics accordingly

… using a dedicated thread pool to reduce ZooKeeper blocking under load.

dsmiley

Some impressive testing!

dsmiley · 2025-11-09T22:30:21Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

+      new ConcurrentHashMap<>();
+  private final Object stateRefreshExecutorLock = new Object();
+  private volatile int stateRefreshParallelism = DEFAULT_STATE_REFRESH_PARALLELISM;
+  private volatile ExecutorService stateRefreshExecutor;


I find it disappointing to see yet another ExecutorService (ThreadPool) for this use-case, so close to a a place where we can call HttpSolrClientBase.requestAsync, which uses an existing threadpool and non-blocking IO. The parallelism can be controlled with a semaphore.

I hate to see comments like this on this project. The comment correctly flags that adding another ExecutorService is questionable, but it’s mostly expressing frustration rather than explaining the risk, and it doesn’t do much to invite collaboration. It doesn’t say why another executor is a problem (extra threads, lifecycle/timeout divergence, duplicated logic), and it only implicitly suggests requestAsync as the alternative. A better version would explicitly state the operational and maintenance concerns, clearly propose reusing HttpSolrClientBase.requestAsync (or ask why it can’t be reused), and frame this as a question or suggestion instead of a verdict, in a neutral tone that focuses on the design rather than how disappointing it feels.

I know my audience (you), and figured you would be receptive. As you followed my recommendation and obviously knew why I was recommend this, I was right. Sorry if it rubbed you the wrong way, and thanks for being receptive.

dsmiley · 2025-11-09T22:35:59Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

      retriedAtNano = System.nanoTime();
    }
+
+    boolean markMaybeStaleIfOutsideBackoff(long retryBackoffNano) {


a bit of javadocs would be helpful. Especially to document the return meaning.

This type of comment has similar deficiencies. There is no concrete reason why you suggest this. It’s framed as a drive-by suggestion, not a joint attempt to clarify the API. That makes it easy to dismiss.

(shock) You are asking me to give a concrete reason to ask that you write a line of javadoc to a method that had none? Is that unreasonable? I think my request is fair and I was sincere in asking (I was confused; javadoc would help).

dsmiley · 2025-11-10T03:32:58Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

+          } else {
+            future = CompletableFuture.supplyAsync(() -> loadDocCollection(key), executor);
+          }
+          future.whenComplete(


it could happen that whenComplete is invoked by the caller thread if the future has already completed. I think. The lambda you pass in will manipulate collectionRefreshes, which violates the rules of its compute methods on a ConcurrentHashMap.

dsmiley · 2025-11-10T03:38:26Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

          );

-  protected volatile Object[] locks = objectList(3);
+  private final ConcurrentHashMap<String, CompletableFuture<DocCollection>> collectionRefreshes =


Instead of creating an additional cache, I think it would be better to integrate into StateCache, which is already collection-keyed, already has eviction of old stuff.

dsmiley · 2025-11-10T03:51:33Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

+          }
+        }
+
+        // First retry without sending state versions to avoid needless waits when stale state is


Could you elaborate why "sending state versions" (I assume \_stateVer\_) is a problem?

dsmiley · 2025-11-10T03:53:13Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

+                  inputCollections,
+                  /*skipStateVersion*/ true,
+                  refreshesToWaitFor,
+                  waitedForRefresh);


I'd prefer like you do elsewhere: /*waitedForRefresh*/ true

markrmiller · 2025-11-14T04:55:47Z

Addressed @dsmiley’s inline notes: reused the existing update thread pool with a semaphore for refreshes, added javadocs for markMaybeStaleIfOutsideBackoff, switched the refresh cleanup to whenCompleteAsync so we don’t mutate the map inside compute, documented why we temporarily drop stateVer, and tagged the waitedForRefresh parameter. I don't agree that the in-flight refresh tracker should be folded directly into StateCache.

… using a dedicated thread pool to reduce ZooKeeper blocking under load.

dsmiley

Maybe sometime after this merges, I'll throw up a PR straw-man collectionRefreshes going away if we enhance StateCache. If so, I'll tag you for review.

In the bigger picture, I have a desire to move the state caching from CloudSolrClient into ClusterStateProvider, and then also using CSP within Solr (server) to places like HttpSolrCall and maybe more. This might sound like crazy-talk, but... there's a vision behind this

dsmiley · 2025-11-14T14:18:24Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java


-        // First retry without sending state versions to avoid needless waits when stale state is
-        // still usable.
+        // First retry without sending state versions so the server does not immediately reject the


IMO we ought to improve the server at some point (not this issue/PR) to not reject a request due to a non-synchronized stateVer. I filed https://issues.apache.org/jira/browse/SOLR-17601 for that. I'll should add a comment there that it should be able to simplify this code/complexity you are adding here. Maybe you can add a reference to that JIRA here, please, to alert the reader.

dsmiley · 2025-11-14T14:40:10Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

+          ExecutorService executor = threadPool;
          CompletableFuture<DocCollection> future;
          if (executor == null || ExecutorUtil.isShutdown(executor)) {
            future = new CompletableFuture<>();
            try {
              future.complete(loadDocCollection(key));
            } catch (Throwable t) {
              future.completeExceptionally(t);
            }
          } else {


You're trying hard to load a collection's state in spite of a strong indication that our CloudSolrClient is shutting down. Why bother? Imagine instead simply assume threadPool is never null and proceed, leading to a RejectedExecutionException (reasonable). Our close method needn't set threadPool to null.

dsmiley · 2025-11-14T14:43:10Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

  }

  private CompletableFuture<DocCollection> triggerCollectionRefresh(String collection) {
    if (closed) {


I'm skeptical we should do any work whatsoever if we are closed.

dsmiley · 2025-11-14T15:03:36Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

+                    executor);
          }
-          future.whenComplete(
+          future.whenCompleteAsync(


This uses the default (ForkJoinPool) executor. Please pass the executor.
Maybe Solr should put the methods that use that executor on a forbidden-api list?

Also; the task to do here is so trivially quick that I don't think it makes sense to try to do asynchronously.

dsmiley · 2025-11-14T15:56:08Z

A question for you: in your mind, are code reviews strictly either "drive by" or "collaborative"? Is there something else? Can they be a bit of both? Is "drive by" unappreciative/unwelcome/frowned-upon? I can appreciate a yearning for collaboration... collaboration is good... but can we assume/expect those who review code to always have the time to be, well, collaborative?

dsmiley · 2025-11-14T16:02:23Z

solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java

+            future =
+                CompletableFuture.supplyAsync(
+                    () -> {
+                      stateRefreshSemaphore.acquireUninterruptibly();


perhaps we should acquire before supplyAsync, thus some potential back-pressure for triggerCollectionRefresh's callers? not sure honestly

dsmiley

I really appreciate your improvements here, by the way; thank you!

SOLR-17947: CloudSolrClient refreshes collection state asynchronously…

3e71b9a

… using a dedicated thread pool to reduce ZooKeeper blocking under load.

github-actions bot added client:solrj tests labels Nov 7, 2025

dsmiley reviewed Nov 10, 2025

View reviewed changes

SOLR-17947: CloudSolrClient refreshes collection state asynchronously…

19e3655

… using a dedicated thread pool to reduce ZooKeeper blocking under load.

dsmiley reviewed Nov 14, 2025

View reviewed changes

SOLR-17947: CloudSolrClient refreshes collection state asynchronously using a dedicated thread pool to reduce ZooKeeper blocking under load. #3851

Are you sure you want to change the base?

SOLR-17947: CloudSolrClient refreshes collection state asynchronously using a dedicated thread pool to reduce ZooKeeper blocking under load. #3851

Uh oh!

Conversation

markrmiller commented Nov 7, 2025

Uh oh!

dsmiley left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

markrmiller commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dsmiley left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dsmiley commented Nov 14, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dsmiley left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

markrmiller commented Nov 14, 2025 •

edited

Loading