
feat(migrator): streaming batch approach for large schema migrations#3949

Closed
vuldin wants to merge 1 commit into redpanda-data:main from vuldin:allow-large-schema-count

vuldin (Member) commented Feb 1, 2026

Replicating a large number of schemas with Migrator fails or hangs because it tries to fetch all schemas up front. This PR streams schemas in batches instead, keeping the order (and schema IDs) in the target cluster the same as in the source cluster. Memory utilization does not grow as schemas are produced. This addresses memory, visibility, and ordering issues when migrating large schema counts (40K+).

  • Add batch_size config to control subjects fetched per batch (default: 100)
  • Add workers config for parallel sync within batches (default: 10)
  • Stream schemas: fetch batch → sync batch → report progress → repeat
  • Sample first batch to detect schema references
  • Only sort subjects by schema ID when references are detected
  • Force sequential processing when references exist to preserve ordering
  • Real-time progress logging with synced/skipped counts
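
The fetch, sync, report loop in the list above can be sketched roughly as follows. This is a minimal Python illustration, not the Migrator's actual implementation: `batches`, `migrate`, and the `sync_subject` callback (assumed to return True when a subject is synced and False when it is skipped) are hypothetical names. Only `batch_size`, `workers`, and their defaults come from this PR.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def batches(subjects: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size subjects."""
    it = iter(subjects)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def migrate(
    subjects: Iterable[str],
    sync_subject: Callable[[str], bool],  # hypothetical: True = synced, False = skipped
    batch_size: int = 100,
    workers: int = 10,
    log: Callable[[str], None] = print,
) -> Tuple[int, int]:
    """Fetch a batch, sync it with a worker pool, report progress, repeat."""
    synced = skipped = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for n, batch in enumerate(batches(subjects, batch_size), start=1):
            # Only the current batch is held in memory; results are consumed
            # before the next batch is fetched.
            results = list(pool.map(sync_subject, batch))
            synced += sum(1 for ok in results if ok)
            skipped += sum(1 for ok in results if not ok)
            log(f"batch {n}: synced={synced} skipped={skipped}")
    return synced, skipped
```

Because `subjects` can be any iterable, a paginated listing can be passed in lazily, so the full subject list never has to be materialized.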

Ordering guarantees:

  • If references detected: subjects sorted by schema ID, sequential processing
  • If no references: parallel processing safe with fixed IDs (translate_ids=false)
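
As a rough illustration of the two ordering modes above, the sketch below samples a batch for references and then picks an order and a worker count. The `Subject` type and the `has_references` and `plan_batch` helpers are hypothetical and exist only for this example. It assumes a referenced schema was registered before the schema that refers to it, and therefore has a lower ID in the source, which is presumably why sorting by schema ID preserves the dependency order.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Subject:
    name: str
    schema_id: int
    references: List[str] = field(default_factory=list)  # names of referenced subjects


def has_references(sample: Sequence[Subject]) -> bool:
    """Return True if any subject in the sampled batch references another schema."""
    return any(s.references for s in sample)


def plan_batch(
    batch: Sequence[Subject], references_detected: bool, workers: int
) -> Tuple[List[Subject], int]:
    """Return the processing order and the number of workers to use."""
    if references_detected:
        # Sort by schema ID and force sequential processing so that
        # referenced schemas are written before the schemas that use them.
        return sorted(batch, key=lambda s: s.schema_id), 1
    # No references: order within the batch does not matter, keep it parallel.
    return list(batch), workers
```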

Memory optimization:

  • Only hold one batch in memory at a time
  • Avoid 2x API calls in common case (no references)
  • Tracking maps grow with synced schemas (unavoidable for deduplication)
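
The last point can be illustrated with a small sketch: batches are discarded after each round, but a deduplication map has to remember every schema already written, so it grows with the number of synced schemas. `SyncTracker` and `sync_batch` are hypothetical names, and what the real tracking maps hold is an assumption here. The example uses fixed IDs (translate_ids=false), so the target ID equals the source ID.

```python
from typing import Dict, Iterable, Tuple


class SyncTracker:
    """Remembers which source schema IDs were already written to the target."""

    def __init__(self) -> None:
        self._target_id_by_source_id: Dict[int, int] = {}

    def already_synced(self, source_id: int) -> bool:
        return source_id in self._target_id_by_source_id

    def record(self, source_id: int, target_id: int) -> None:
        self._target_id_by_source_id[source_id] = target_id

    def __len__(self) -> int:
        return len(self._target_id_by_source_id)


def sync_batch(schema_ids: Iterable[int], tracker: SyncTracker) -> Tuple[int, int]:
    """Sync one batch, skipping schema IDs seen in this or any earlier batch."""
    synced = skipped = 0
    for source_id in schema_ids:
        if tracker.already_synced(source_id):
            skipped += 1
            continue
        tracker.record(source_id, source_id)  # fixed IDs: target ID == source ID
        synced += 1
    return synced, skipped
```

The same schema ID can show up more than once, for example when one schema is registered under several subjects, which is why the tracker must outlive the individual batches.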

vuldin force-pushed the allow-large-schema-count branch from 5bd150b to 59d3efc on February 1, 2026 00:24
vuldin marked this pull request as draft on February 1, 2026 00:33
vuldin force-pushed the allow-large-schema-count branch from 59d3efc to a4e0a44 on February 1, 2026 00:41
vuldin changed the title from "feat(migrator): add parallel schema sync and progress logging for large registries" to "feat(migrator): streaming batch approach for large schema migrations" on Feb 1, 2026
vuldin marked this pull request as ready for review on February 1, 2026 01:28
vuldin force-pushed the allow-large-schema-count branch 2 times, most recently from b0d978b to a6c1bfd, on February 1, 2026 01:58
vuldin force-pushed the allow-large-schema-count branch from a6c1bfd to f3ffcfb on February 1, 2026 08:10
vuldin force-pushed the allow-large-schema-count branch from f3ffcfb to b4ea857 on February 1, 2026 08:14

vuldin (Member, Author) commented Feb 1, 2026

Thanks for the quick review @josephwoodward, I've updated the branch based on your suggestions. Let me know if you have any questions or additional suggestions.

vuldin (Member, Author) commented Feb 1, 2026

I was able to complete a test using a build of this branch for schema migration:

  • count: 120K schemas
  • duration: ~3 hours
  • memory growth: stable (~50MB)
  • schema IDs remained identical across clusters
  • errors: none

mmatczuk self-assigned this Feb 2, 2026

mmatczuk (Collaborator) commented Feb 2, 2026

Work moved to #3951

mmatczuk closed this Feb 2, 2026

3 participants