feat(migrator): streaming batch approach for large schema migrations by vuldin · Pull Request #3949 · redpanda-data/connect

vuldin · 2026-02-01T00:11:02Z

Replicating a large number of schemas with Migrator fails or hangs because it first tries to fetch every schema up front. This PR streams schemas in batches instead, keeping order (and schema IDs) in the target cluster the same as in the source cluster. Memory used for schema batches does not grow as schemas are produced; only the deduplication tracking maps grow with the number of synced schemas. This addresses memory, visibility, and ordering issues when migrating large schema counts (40K+).

- Add `batch_size` config to control subjects fetched per batch (default: 100)
- Add `workers` config for parallel sync within batches (default: 10)
- Stream schemas: fetch batch → sync batch → report progress → repeat
- Sample first batch to detect schema references
- Only sort subjects by schema ID when references are detected
- Force sequential processing when references exist to preserve ordering
- Real-time progress logging with synced/skipped counts
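
The fetch → sync → report loop above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `streamSync`, `syncFunc`, and `progress` are hypothetical names, the subject names are assumed to be listed already, and the per-subject schema fetch and write are assumed to happen inside the callback.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// syncFunc is a hypothetical stand-in for the per-subject sync call.
// It reports whether the subject was synced (true) or skipped (false).
type syncFunc func(subject string) (synced bool, err error)

// progress holds the running totals reported after each batch.
type progress struct {
	Synced, Skipped int64
}

// streamSync walks subjects in batches of batchSize and syncs each batch
// with at most `workers` goroutines running at once. A batch must finish
// before the next one starts, so only one batch is in flight at a time.
func streamSync(subjects []string, batchSize, workers int, fn syncFunc) (progress, error) {
	var p progress
	if batchSize <= 0 || workers <= 0 {
		return p, fmt.Errorf("batchSize and workers must be positive")
	}
	for start := 0; start < len(subjects); start += batchSize {
		end := start + batchSize
		if end > len(subjects) {
			end = len(subjects)
		}
		var (
			wg       sync.WaitGroup
			sem      = make(chan struct{}, workers) // bounds concurrency
			firstErr error
			errOnce  sync.Once
		)
		for _, s := range subjects[start:end] {
			wg.Add(1)
			sem <- struct{}{}
			go func(subject string) {
				defer wg.Done()
				defer func() { <-sem }()
				ok, err := fn(subject)
				switch {
				case err != nil:
					errOnce.Do(func() { firstErr = err })
				case ok:
					atomic.AddInt64(&p.Synced, 1)
				default:
					atomic.AddInt64(&p.Skipped, 1)
				}
			}(s)
		}
		wg.Wait()
		if firstErr != nil {
			return p, firstErr
		}
		// Report progress once per batch.
		fmt.Printf("progress: %d/%d subjects (synced=%d skipped=%d)\n",
			end, len(subjects), p.Synced, p.Skipped)
	}
	return p, nil
}

func main() {
	subjects := []string{"a", "b", "c", "d", "e"}
	p, err := streamSync(subjects, 2, 2, func(s string) (bool, error) {
		return s != "c", nil // pretend "c" already exists in the target
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("synced:", p.Synced, "skipped:", p.Skipped)
}
```

Waiting for each batch to finish before starting the next is what keeps the working set bounded by `batch_size`, at the cost of some idle workers at the tail of every batch.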

Ordering guarantees:

- If references detected: subjects sorted by schema ID, sequential processing
- If no references: parallel processing safe with fixed IDs (`translate_ids=false`)
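
A rough sketch of that decision, under stated assumptions: `schemaInfo`, `hasReferences`, and `plan` are hypothetical names rather than identifiers from this PR, and sorting by ID relies on the assumption that a referenced schema was registered before the schema that refers to it, so it carries the lower ID.

```go
package main

import (
	"fmt"
	"sort"
)

// schemaInfo is a hypothetical, trimmed-down view of a registered schema.
type schemaInfo struct {
	Subject    string
	ID         int
	References []string // subjects this schema refers to
}

// hasReferences reports whether any schema in the sample declares references.
func hasReferences(sample []schemaInfo) bool {
	for _, s := range sample {
		if len(s.References) > 0 {
			return true
		}
	}
	return false
}

// plan inspects the first sampleSize schemas and returns the processing
// order together with the effective worker count.
func plan(schemas []schemaInfo, sampleSize, workers int) ([]schemaInfo, int) {
	if sampleSize < 0 {
		sampleSize = 0
	}
	if sampleSize > len(schemas) {
		sampleSize = len(schemas)
	}
	if !hasReferences(schemas[:sampleSize]) {
		// No references seen: keep the input order and sync in parallel.
		return schemas, workers
	}
	// References seen: sort a copy by schema ID and force one worker so
	// referenced schemas are written before the schemas that use them.
	ordered := append([]schemaInfo(nil), schemas...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].ID < ordered[j].ID
	})
	return ordered, 1
}

func main() {
	schemas := []schemaInfo{
		{Subject: "order", ID: 7, References: []string{"customer"}},
		{Subject: "customer", ID: 3},
	}
	ordered, workers := plan(schemas, 100, 10)
	for _, s := range ordered {
		fmt.Println(s.ID, s.Subject)
	}
	fmt.Println("workers:", workers)
}
```

Sampling only the first batch keeps the common no-reference case cheap, but it also means references that first appear in a later batch would go undetected in this sketch.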

Memory optimization:

- Only hold one batch in memory at a time
- Avoid 2x API calls in common case (no references)
- Tracking maps grow with synced schemas (unavoidable for deduplication)

internal/impl/redpanda/migrator/migrator_schema_registry.go

vuldin · 2026-02-01T08:19:27Z

Thanks for the quick review @josephwoodward, I've updated the branch based on your suggestions. Let me know if you have any questions or additional suggestions.

vuldin · 2026-02-01T19:27:36Z

I was able to complete a test using a build of this branch for schema migration:

- count: 120K schemas
- duration: ~3 hours
- memory growth: stable (~50MB)
- schema IDs remained identical across clusters
- errors: none

mmatczuk · 2026-02-02T15:42:24Z

Work moved to #3951

vuldin force-pushed the allow-large-schema-count branch from 5bd150b to 59d3efc on February 1, 2026 00:24

vuldin marked this pull request as draft February 1, 2026 00:33

vuldin force-pushed the allow-large-schema-count branch from 59d3efc to a4e0a44 on February 1, 2026 00:41

vuldin changed the title ~~feat(migrator): add parallel schema sync and progress logging for large registries~~ feat(migrator): streaming batch approach for large schema migrations Feb 1, 2026

vuldin marked this pull request as ready for review February 1, 2026 01:28

vuldin force-pushed the allow-large-schema-count branch 2 times, most recently from b0d978b to a6c1bfd on February 1, 2026 01:58

josephwoodward reviewed Feb 1, 2026

internal/impl/redpanda/migrator/migrator_schema_registry.go (outdated, resolved)

josephwoodward reviewed Feb 1, 2026

internal/impl/redpanda/migrator/migrator_schema_registry.go (resolved)

josephwoodward reviewed Feb 1, 2026

internal/impl/redpanda/migrator/migrator_schema_registry.go (outdated, resolved)

josephwoodward reviewed Feb 1, 2026

internal/impl/redpanda/migrator/migrator_schema_registry.go (outdated, resolved)

vuldin force-pushed the allow-large-schema-count branch from a6c1bfd to f3ffcfb on February 1, 2026 08:10

vuldin force-pushed the allow-large-schema-count branch from f3ffcfb to b4ea857 on February 1, 2026 08:14

mmatczuk self-assigned this Feb 2, 2026

mmatczuk closed this Feb 2, 2026

vuldin wants to merge 1 commit into redpanda-data:main from vuldin:allow-large-schema-count
