redpanda/migrator: optimize schema registry sync with parallel processing #3951
Merged
Convert listSubjectSchemas() to return iterator for memory-efficient processing of large schema registries.
… in SyncLoop() Add filter function to listSubjectSchemas() to fetch each schema version once. Change knownSubjects to set type since IDs are not tracked.
Fix concurrent access to knownSchemas map and improve code clarity.
Process schema references depth-first to ensure versions are migrated in correct dependency order, matching migrator v1 behavior.
…ma migration Add configurable worker pool to process subjects in parallel. Each worker uses DFS traversal to complete entire subject trees before moving to next. Shuffle subject order for improved load distribution across workers.
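The iterator and filter described in the commits above could look roughly like the following. This is a minimal sketch, not the PR's code: the `SubjectSchema` type, the `versionsOf` callback standing in for a registry call, the `syncOnce` helper, and the signature given to `listSubjectSchemas` here are all assumptions made for illustration.

```go
package main

import "fmt"

// SubjectSchema is a hypothetical record for one version of one subject.
type SubjectSchema struct {
	Subject string
	Version int
}

// schemaSeq is a push-style iterator: it calls yield once per schema and
// stops early when yield returns false.
type schemaSeq func(yield func(SubjectSchema) bool)

// listSubjectSchemas streams subject versions one at a time instead of
// building a full slice. The keep filter runs before a version is yielded,
// so filtered-out versions never reach the caller. versionsOf is a stand-in
// for whatever call lists the versions of a subject.
func listSubjectSchemas(subjects []string, versionsOf func(string) []int, keep func(SubjectSchema) bool) schemaSeq {
	return func(yield func(SubjectSchema) bool) {
		for _, subj := range subjects {
			for _, v := range versionsOf(subj) {
				s := SubjectSchema{Subject: subj, Version: v}
				if keep != nil && !keep(s) {
					continue
				}
				if !yield(s) {
					return
				}
			}
		}
	}
}

// syncOnce performs one pass: it skips versions already in the known set
// and records the ones it handles, so a later pass does not see them again.
func syncOnce(subjects []string, versionsOf func(string) []int, known map[SubjectSchema]struct{}) []SubjectSchema {
	var fetched []SubjectSchema
	notKnown := func(s SubjectSchema) bool {
		_, ok := known[s]
		return !ok
	}
	listSubjectSchemas(subjects, versionsOf, notKnown)(func(s SubjectSchema) bool {
		known[s] = struct{}{}
		fetched = append(fetched, s)
		return true
	})
	return fetched
}

func main() {
	versions := map[string][]int{"orders": {1, 2}, "users": {1}}
	versionsOf := func(s string) []int { return versions[s] }
	known := map[SubjectSchema]struct{}{}

	first := syncOnce([]string{"orders", "users"}, versionsOf, known)
	second := syncOnce([]string{"orders", "users"}, versionsOf, known)
	fmt.Println(len(first), len(second)) // prints "3 0"
}
```

The function-typed iterator has the same shape as Go 1.23's `iter.Seq`, so it can also be consumed with `for s := range seq` on newer toolchains.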
70586b6 to 5cd248d
Improves schema registry migration performance by adding configurable parallel processing. Large schema registries with thousands of schemas now migrate significantly faster.
Changes
Parallel subject processing - Added max_parallel_http_requests config option (default: 10) to control worker pool size for concurrent schema migration
Memory optimization - Converted schema loading from batch to streaming iterator to handle large registries without loading all schemas into memory
Correct dependency order - Implemented DFS traversal to ensure schema references are migrated in proper dependency order, matching migrator v1 behavior
Deduplication - Each schema version is now fetched exactly once during sync loop
Load balancing - Subject order is shuffled to distribute work evenly across workers
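The combination of a worker pool, shuffled subject order, and depth-first reference traversal listed above can be sketched as follows. This is an illustration under stated assumptions, not the migrator's implementation: the `migrator` type, `onceFor`, `migrate`, and `run` are hypothetical names, each schema is modelled as a name with a list of referenced names, and references are assumed to be acyclic.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// migrator is a hypothetical holder for shared state. schemas maps a name
// to the names it references and is read-only once workers start.
type migrator struct {
	mu      sync.Mutex
	schemas map[string][]string
	once    map[string]*sync.Once
	order   []string // order in which schemas were "migrated"
}

// onceFor returns the sync.Once guarding a schema, creating it on demand.
func (m *migrator) onceFor(name string) *sync.Once {
	m.mu.Lock()
	defer m.mu.Unlock()
	o, ok := m.once[name]
	if !ok {
		o = &sync.Once{}
		m.once[name] = o
	}
	return o
}

// migrate walks references depth-first, so every dependency is recorded
// before the schema that refers to it. sync.Once makes a second worker
// that needs the same schema wait until the first one has finished it.
// Assumes references are acyclic; a cycle would deadlock here.
func (m *migrator) migrate(name string) {
	m.onceFor(name).Do(func() {
		for _, ref := range m.schemas[name] {
			m.migrate(ref)
		}
		m.mu.Lock()
		m.order = append(m.order, name)
		m.mu.Unlock()
	})
}

// run shuffles the subjects and feeds them to a fixed number of workers.
func run(schemas map[string][]string, workers int) []string {
	if workers < 1 {
		workers = 1
	}
	m := &migrator{schemas: schemas, once: map[string]*sync.Once{}}
	subjects := make([]string, 0, len(schemas))
	for name := range schemas {
		subjects = append(subjects, name)
	}
	rand.Shuffle(len(subjects), func(i, j int) {
		subjects[i], subjects[j] = subjects[j], subjects[i]
	})

	jobs := make(chan string)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				m.migrate(name)
			}
		}()
	}
	for _, s := range subjects {
		jobs <- s
	}
	close(jobs)
	wg.Wait()
	return m.order
}

func main() {
	schemas := map[string][]string{
		"order":    {"customer", "item"},
		"customer": {"address"},
		"item":     nil,
		"address":  nil,
	}
	// The exact order varies between runs, but references always come first.
	fmt.Println(run(schemas, 3))
}
```

Guarding each schema with its own `sync.Once` is one way to keep dependency order intact when two workers reach the same reference; a single global lock held for a whole subject tree would also be correct but would serialize the workers.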
Configuration
New optional field in schema registry config: