feat(migrator): streaming batch approach for large schema migrations #3949
Closed
vuldin wants to merge 1 commit into redpanda-data:main
Force-pushed: 5bd150b to 59d3efc
Force-pushed: 59d3efc to a4e0a44
Force-pushed: b0d978b to a6c1bfd
Force-pushed: a6c1bfd to f3ffcfb
- Add `batch_size` config to control subjects fetched per batch (default: 100)
- Add `workers` config for parallel sync within batches (default: 10)
- Stream schemas: fetch batch → sync batch → report progress → repeat
- Sample first batch to detect schema references
- Only sort subjects by schema ID when references are detected
- Force sequential processing when references exist to preserve ordering
- Real-time progress logging with synced/skipped counts

Ordering guarantees:
- If references detected: subjects sorted by schema ID, sequential processing
- If no references: parallel processing safe with fixed IDs (translate_ids=false)

Memory optimization:
- Only hold one batch in memory at a time
- Avoid 2x API calls in common case (no references)
- Tracking maps grow with synced schemas (unavoidable for deduplication)
Force-pushed: f3ffcfb to b4ea857
Member, Author:
Thanks for the quick review, @josephwoodward. I've updated the branch based on your suggestions. Let me know if you have any questions or additional suggestions.
Member, Author:
I was able to complete a test using a build of this branch for schema migration:
Collaborator:
Work moved to #3951
Replicating a large number of schemas with Migrator fails or hangs because it first tries to fetch all schemas up front. This PR streams schemas in batches while keeping the order (and schema IDs) in the target cluster the same as in the source cluster. Memory utilization does not grow as schemas are produced. This addresses memory, visibility, and ordering issues when migrating large schema counts (40K+).
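To illustrate how order can be preserved, here is a small Go sketch of the mode selection this PR describes: sample the first batch, and if any schema declares references, sort by schema ID and process sequentially; otherwise keep the configured parallelism. The names `SchemaInfo`, `Plan`, `PlanFor`, `HasReferences`, and `SortByID` are hypothetical and not taken from the branch. The sketch also assumes that a referenced schema was registered before the schema that refers to it and therefore has the lower ID.

```go
package main

import "sort"

// SchemaInfo is a hypothetical view of a subject's schema.
type SchemaInfo struct {
	Subject    string
	ID         int
	References []string // subjects this schema refers to (empty if none)
}

// HasReferences reports whether any schema in the sample declares references.
func HasReferences(sample []SchemaInfo) bool {
	for _, s := range sample {
		if len(s.References) > 0 {
			return true
		}
	}
	return false
}

// Plan describes how the remaining subjects should be processed.
type Plan struct {
	SortByID bool // sort subjects by schema ID before syncing
	Workers  int  // 1 means sequential processing
}

// PlanFor picks a processing mode from the first batch only, so the common
// case (no references) avoids a second pass over the source.
func PlanFor(firstBatch []SchemaInfo, workers int) Plan {
	if HasReferences(firstBatch) {
		return Plan{SortByID: true, Workers: 1}
	}
	return Plan{SortByID: false, Workers: workers}
}

// SortByID orders schemas by ascending ID in place, so that (under the
// assumption above) referenced schemas are synced before their dependents.
func SortByID(schemas []SchemaInfo) {
	sort.SliceStable(schemas, func(i, j int) bool {
		return schemas[i].ID < schemas[j].ID
	})
}
```

One limitation of sampling only the first batch is that references appearing solely in later batches would go undetected; whether that matters depends on how the branch handles that case, which this sketch does not attempt to model.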
- `batch_size` config to control subjects fetched per batch (default: 100)
- `workers` config for parallel sync within batches (default: 10)

Ordering guarantees:

Memory optimization: