Parallelize backups and restore file operations#4023
elangelo wants to merge 9 commits into apache:main
Conversation
…to 1 thread, allow overriding with system properties
epugh
left a comment
I don't have the multi-thread chops to approve this, but reading through it, it looks good. I wanted a change to the variable name. Do we need any new tests for this capability, or do the existing ones cover it well enough?
```java
 * SOLR_BACKUP_MAX_PARALLEL_UPLOADS}.
 */
private static final int DEFAULT_MAX_PARALLEL_UPLOADS =
    EnvUtils.getPropertyAsInteger("solr.backup.maxParallelUploads", 1);
```
The pattern we are using now is dot cased, so solr.backup.maxparalleluploads, or maybe if we had multiple properties solr.backup.paralleluploads.max....
Good use of EnvUtils, we need them everywhere.
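For readers unfamiliar with the pattern being praised here, a minimal sketch of what a sysprop-with-env-fallback lookup like `EnvUtils.getPropertyAsInteger` does. The helper name `resolveIntProperty` and the exact env-name mapping are assumptions for illustration; Solr's real EnvUtils also splits camelCase segments (so `solr.backup.maxParallelUploads` maps to `SOLR_BACKUP_MAX_PARALLEL_UPLOADS`).

```java
import java.util.Locale;

public class PropertySketch {
  // Illustrative stand-in for EnvUtils.getPropertyAsInteger: prefer the
  // Java system property, fall back to a derived environment variable,
  // then the default. The dots->underscores, upper-cased env mapping is
  // a simplification of Solr's actual rule (which also splits camelCase).
  static int resolveIntProperty(String prop, int defaultValue) {
    String v = System.getProperty(prop);
    if (v == null) {
      v = System.getenv(prop.replace('.', '_').toUpperCase(Locale.ROOT));
    }
    return v == null ? defaultValue : Integer.parseInt(v);
  }

  public static void main(String[] args) {
    // With neither the sysprop nor the env var set, the default wins.
    System.out.println(resolveIntProperty("demo.backup.maxParallelUploads", 1));
  }
}
```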
```adoc
Backup and restore operations can transfer multiple index files in parallel to improve throughput, especially when using cloud storage repositories like S3 or GCS where latency is higher.
The parallelism is controlled via system properties or environment variables:

`solr.backup.maxParallelUploads`::
```
solr.backup.maxparalleluploads ?
I think the current tests actually cover everything already. Mind that I did change the gcsrepository and s3repository tests to have some parallelism. Unfortunately I was limited to only 2 threads as with more I got an OutOfMemoryException. But I think it still covers what needs covering.
epugh
left a comment
LGTM. I'd love another committer who is more comfortable with this code base and especially the multithreaded nature of it to review as well.
… would be such a bottleneck
…was referred to by the non-canonical name `ExecutorUtil.MDCAwareThreadPoolExecutor.CallerRunsPolicy`
This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!
Pull request overview
This PR adds configurable parallelism to Solr’s backup (incremental shard backup) and restore (core restore) file-transfer loops to improve throughput, especially for higher-latency cloud repositories (e.g., S3/GCS).
Changes:
- Add parallel upload/download execution for index file transfers during backup and restore, gated by new sysprop/env settings.
- Document the new parallel transfer settings in the ref guide.
- Update S3/GCS incremental backup tests to enable parallelism and add an unreleased changelog entry.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| solr/solr-ref-guide/modules/deployment-guide/pages/backup-restore.adoc | Documents new parallel upload/download properties and tuning guidance. |
| solr/modules/s3-repository/src/test/org/apache/solr/s3/S3IncrementalBackupTest.java | Enables parallel backup/restore via sysprops for S3 incremental backup tests. |
| solr/modules/gcs-repository/src/test/org/apache/solr/gcs/GCSIncrementalBackupTest.java | Enables parallel backup/restore via sysprops for GCS incremental backup tests. |
| solr/core/src/java/org/apache/solr/handler/RestoreCore.java | Parallelizes restore file copy/download work via an executor and aggregates errors. |
| solr/core/src/java/org/apache/solr/handler/IncrementalShardBackup.java | Parallelizes incremental backup upload work, makes stats thread-safe, aggregates errors. |
| changelog/unreleased/parallelizebackups.yml | Adds changelog entry for the feature. |
```yaml
authors:
  - name: Samuel Verstraete
    github: elangelo
```
The changelog author metadata uses a github field, but this repository’s changelog format documentation uses nick (optionally with url) under authors. Using an unexpected key may fail changelog validation or omit author info; please switch github: elangelo to nick: elangelo (and add url if desired).
```java
60L,
TimeUnit.SECONDS,
new SynchronousQueue<>(),
new SolrNamedThreadFactory("RestoreCore"),
new ThreadPoolExecutor.CallerRunsPolicy())
```
Using SynchronousQueue with CallerRunsPolicy means once all maxParallelDownloads threads are busy, additional downloads will execute on the calling thread. That can exceed the configured cap (up to maxParallelDownloads + 1 concurrent transfers) and also bypass the MDCAwareThreadPoolExecutor wrapping for those caller-run tasks. Consider a bounded queue/fixed pool or explicitly limiting in-flight submissions to enforce the configured parallelism.
The CallerRunsPolicy fallback does mean the submitting thread can run a task when the pool is saturated, but the submitting thread is the Solr request thread — it already carries full MDC context, so there's no MDC loss here. MDCAwareThreadPoolExecutor exists to propagate MDC to new pool threads; the caller-runs case doesn't need that propagation. On the cap concern: the maxParallel* setting is a throughput knob, not a hard safety limit. An occasional N+1 concurrent transfer when the pool is fully busy is negligible for a backup/restore workload.
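For comparison, a minimal sketch of the reviewer's alternative: instead of letting CallerRunsPolicy run overflow tasks on the submitting thread, a semaphore permit is taken before each submit and released when the task finishes, so the submitter blocks and the configured cap is a hard limit. This is an illustration, not the PR's actual code; the helper `runBounded` and the sleep-based "transfer" are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedSubmitSketch {

  // Runs taskCount dummy "transfers" with at most maxInFlight in flight,
  // returning the peak observed concurrency so the cap can be verified.
  static int runBounded(int taskCount, int maxInFlight) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);
    Semaphore permits = new Semaphore(maxInFlight);
    AtomicInteger active = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    try {
      for (int i = 0; i < taskCount; i++) {
        permits.acquire(); // blocks the submitter once maxInFlight tasks are in flight
        futures.add(pool.submit(() -> {
          try {
            int now = active.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            Thread.sleep(20); // stand-in for one file transfer
            active.decrementAndGet();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            permits.release();
          }
        }));
      }
      for (Future<?> f : futures) {
        try {
          f.get();
        } catch (ExecutionException e) {
          throw new RuntimeException(e.getCause());
        }
      }
    } finally {
      pool.shutdown();
    }
    return peak.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println("peak concurrency: " + runBounded(8, 2));
  }
}
```

The trade-off discussed above still applies: blocking the submitter enforces the cap exactly, while CallerRunsPolicy trades an occasional N+1 transfer for never parking the request thread on a semaphore.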
```java
Throwable cause = e.getCause();
// Unwrap RuntimeExceptions that wrap the original IOException
if (cause instanceof RuntimeException && cause.getCause() != null) {
  firstError = cause.getCause();
} else {
```
In the parallel path, the RuntimeException unwrapping (firstError = cause.getCause()) can discard the wrapper message that contains the filename. Prefer preserving that message (or re-wrapping the cause with filename context) when propagating the first failure.
```java
60L,
TimeUnit.SECONDS,
new SynchronousQueue<>(),
new SolrNamedThreadFactory("IncrementalBackup"),
new ThreadPoolExecutor.CallerRunsPolicy())
```
With SynchronousQueue + CallerRunsPolicy, once maxParallelUploads threads are busy, additional uploads will run on the submitting thread. That can exceed the configured cap (up to maxParallelUploads + 1 concurrent uploads) and bypass MDCAwareThreadPoolExecutor wrapping for caller-run tasks. Consider a bounded queue/fixed pool or explicitly bounding in-flight submissions to enforce the configured limit.
```java
List<Future<?>> uploadFutures = new ArrayList<>();

try {
```
This implementation queues a Future for every index file and holds them in uploadFutures until the end. For large indexes this can create significant memory overhead and delays error reporting. Consider processing completed tasks as they finish (e.g., ExecutorCompletionService) and/or limiting in-flight submissions to maxParallelUploads.
this really doesn't hold. we need to wait for all futures anyway, so storing them in a list is what we need to do here
```java
} catch (RuntimeException e) {
  if (e.getCause() instanceof IOException) {
    throw (IOException) e.getCause();
  }
  throw e;
```
In the synchronous path, rethrowing only e.getCause() (when it’s an IOException) loses the wrapper message that includes the filename ("Failed to process file: ..."). Preserve that per-file context when propagating errors so backup failures are diagnosable.
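A later fixup commit in this PR addresses exactly this by replacing the cast with `new IOException(msg, cause)`. A minimal sketch of that fix; the helper name `preserveContext` and the sample filename `_0.cfs` are made up for illustration.

```java
import java.io.IOException;

public class UnwrapSketch {
  // Instead of casting the cause back to IOException (which drops the
  // wrapper's "Failed to process file: ..." message), re-wrap the cause
  // so both the per-file message and the original cause chain survive.
  static IOException preserveContext(RuntimeException wrapper) {
    if (wrapper.getCause() instanceof IOException) {
      return new IOException(wrapper.getMessage(), wrapper.getCause());
    }
    throw wrapper;
  }

  public static void main(String[] args) {
    RuntimeException e =
        new RuntimeException("Failed to process file: _0.cfs",
            new IOException("connection reset"));
    IOException io = preserveContext(e);
    // Both the filename context and the underlying cause are retained.
    System.out.println(io.getMessage() + " <- " + io.getCause().getMessage());
  }
}
```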
```java
} catch (RuntimeException e) {
  if (e.getCause() instanceof IOException) {
    throw (IOException) e.getCause();
  }
  throw e;
```
In the synchronous path, unwrapping and rethrowing only e.getCause() drops the wrapper message that includes per-file context (filename). Preserve the wrapper message (or re-wrap the cause including the filename) so restore failures remain actionable.
```java
List<Future<?>> downloadFutures = new ArrayList<>();
```
This submits one task per index file and retains every Future in downloadFutures until the end. For large collections with many segment files, that can add substantial memory/GC overhead and delays surfacing failures until all tasks are submitted. Consider processing completions incrementally (e.g., ExecutorCompletionService) and/or bounding the number of in-flight tasks to maxParallelDownloads.
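A minimal sketch of the `ExecutorCompletionService` alternative suggested here: results are consumed in completion order, so the first failure surfaces as soon as any transfer fails rather than only when its Future is reached in submission order. This is an illustration under the reviewer's suggestion, not the PR's code; `copyAll` and the `Callable<Void>` transfer shape are assumptions.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionSketch {

  // Runs every transfer, surfacing the first failure in completion order.
  static void copyAll(List<Callable<Void>> transfers, int parallelism)
      throws IOException, InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    CompletionService<Void> done = new ExecutorCompletionService<>(pool);
    try {
      for (Callable<Void> t : transfers) {
        done.submit(t);
      }
      for (int i = 0; i < transfers.size(); i++) {
        try {
          done.take().get(); // yields tasks as they finish, not as submitted
        } catch (ExecutionException e) {
          if (e.getCause() instanceof IOException) {
            throw (IOException) e.getCause();
          }
          throw new RuntimeException(e.getCause());
        }
      }
    } finally {
      pool.shutdownNow(); // abandon remaining transfers once a failure propagates
    }
  }

  public static void main(String[] args) throws Exception {
    List<Callable<Void>> ok = List.of(() -> null, () -> null);
    copyAll(ok, 2);
    System.out.println("all transfers completed");
  }
}
```

Note that this changes failure timing, not peak memory by much: as the author points out elsewhere in the thread, every task must still be joined before the operation can report success.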
```java
Throwable cause = e.getCause();
// Unwrap RuntimeExceptions that wrap the original IOException
if (cause instanceof RuntimeException && cause.getCause() != null) {
  firstError = cause.getCause();
} else {
```
In the parallel join logic, unwrapping RuntimeException to cause.getCause() can discard the wrapper message that includes the filename. Preserve the wrapper message (or re-wrap the underlying IOException with file context) when surfacing the first failure from future.get().
- Replace unsafe IOException cast with `new IOException(msg, cause)` to preserve the original cause chain in IncrementalShardBackup and RestoreCore
- Simplify ExecutionException handling by removing unnecessary RuntimeException unwrapping; directly assign `e.getCause()` as the first error
- Fix changelog entry: rename `github` field to `nick` for author metadata
Description
This PR ensures multiple threads are used to create backups and to restore backups. This gives a considerable speedup when using cloud storage such as S3.
For comparison: a backup to S3 of 1.8 TiB takes roughly 16 minutes with this code, while a 340 GiB collection on the old code takes roughly 50 minutes.
Restoring the same collection took 7 minutes instead of 1 hour and 20 minutes (on a 6-node cluster).
Solution
As the previous implementation already had a loop over all files that needed to be backed up to the backup repository, I simply wrapped that loop in a ThreadPoolExecutor.
Tests
I have run this code locally on a SolrCloud cluster.
Checklist
Please review the following and check all that apply:
- I have created this PR against the main branch.
- I have run ./gradlew check.