
Commit b4ea857

feat(migrator): streaming batch approach for large schema migrations
- Add `batch_size` config to control subjects fetched per batch (default: 100)
- Add `workers` config for parallel sync within batches (default: 10)
- Stream schemas: fetch batch → sync batch → report progress → repeat
- Sample first batch to detect schema references
- Only sort subjects by schema ID when references are detected
- Force sequential processing when references exist to preserve ordering
- Real-time progress logging with synced/skipped counts

Ordering guarantees:
- If references detected: subjects sorted by schema ID, sequential processing
- If no references: parallel processing safe with fixed IDs (translate_ids=false)

Memory optimization:
- Only hold one batch in memory at a time
- Avoid 2x API calls in common case (no references)
- Tracking maps grow with synced schemas (unavoidable for deduplication)
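
The ordering rules in the commit message can be read as a small planning step that runs before any syncing. The Go sketch below is only an illustration of that reading, not the migrator's actual code: the `Subject` type and the `hasReferences` and `plan` functions are hypothetical names, and it assumes that a referenced schema was registered before (and so has a lower ID than) any schema that references it.

```go
package main

import (
	"fmt"
	"sort"
)

// Subject is a hypothetical, simplified view of one registered schema.
type Subject struct {
	Name       string
	SchemaID   int
	References []string // names of subjects this schema depends on
}

// hasReferences reports whether any subject in the sample declares references.
func hasReferences(sample []Subject) bool {
	for _, s := range sample {
		if len(s.References) > 0 {
			return true
		}
	}
	return false
}

// plan samples the first batch. If it finds references, it returns the
// subjects sorted by schema ID and forces a single worker; otherwise it
// returns the input order and the configured worker count unchanged.
func plan(subjects []Subject, batchSize, workers int) ([]Subject, int) {
	sampleEnd := batchSize
	if sampleEnd > len(subjects) {
		sampleEnd = len(subjects)
	}
	if sampleEnd < 0 {
		sampleEnd = 0
	}
	if !hasReferences(subjects[:sampleEnd]) {
		return subjects, workers
	}
	// Copy before sorting so the caller's slice is left untouched.
	ordered := append([]Subject(nil), subjects...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].SchemaID < ordered[j].SchemaID
	})
	return ordered, 1
}

func main() {
	subjects := []Subject{
		{Name: "order-value", SchemaID: 7, References: []string{"customer-value"}},
		{Name: "customer-value", SchemaID: 3},
	}
	ordered, workers := plan(subjects, 100, 10)
	fmt.Println("effective workers:", workers)
	for _, s := range ordered {
		fmt.Println(s.SchemaID, s.Name)
	}
}
```

Because only the first batch is sampled, this sketch skips the sort entirely when that sample has no references, which matches the "avoid 2x API calls in common case" note above.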
1 parent f789836 commit b4ea857

3 files changed: +428 -27 lines changed

docs/modules/components/pages/outputs/redpanda_migrator.adoc

Lines changed: 22 additions & 0 deletions
@@ -53,6 +53,8 @@ output:
 translate_ids: false
 normalize: false
 strict: false
+workers: 10
+batch_size: 100
 consumer_groups:
 enabled: true
 interval: 1m
@@ -139,6 +141,8 @@ output:
 translate_ids: false
 normalize: false
 strict: false
+workers: 10
+batch_size: 100
 consumer_groups:
 enabled: true
 interval: 1m
@@ -1425,6 +1429,24 @@ Error on unknown schema IDs. Only relevant when translate_ids is true. When fals
 
 *Default*: `false`
 
+=== `schema_registry.workers`
+
+Number of parallel workers for schema sync operations. Higher values improve throughput for large schema counts.
+
+
+*Type*: `int`
+
+*Default*: `10`
+
+=== `schema_registry.batch_size`
+
+Number of subjects to fetch and sync per batch. Schemas are streamed in batches rather than fetched all at once, reducing memory usage and providing real-time progress for large migrations.
+
+
+*Type*: `int`
+
+*Default*: `100`
+
 === `consumer_groups`
 
 Sorry! This field is missing documentation.
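
To make the interaction between `schema_registry.batch_size` and `schema_registry.workers` concrete, here is a minimal Go sketch of a streaming loop: take one batch, sync it with a bounded number of goroutines, report progress, repeat. It is not taken from the commit. `streamBatches`, `syncFunc`, and `Progress` are hypothetical names, and the sync call is a stand-in for whatever the migrator does per subject.

```go
package main

import (
	"fmt"
	"sync"
)

// syncFunc is a hypothetical stand-in for copying one subject to the
// destination registry. It returns true if the subject was synced and
// false if it was skipped.
type syncFunc func(subject string) (synced bool, err error)

// Progress holds the running counters reported after each batch.
type Progress struct {
	Synced, Skipped int
}

// streamBatches walks subjects in slices of batchSize, syncs each slice with
// at most `workers` goroutines in flight, and calls report once per batch.
// It stops after the first batch in which any sync returned an error.
func streamBatches(subjects []string, batchSize, workers int, fn syncFunc, report func(Progress)) (Progress, error) {
	if batchSize < 1 {
		batchSize = 1
	}
	if workers < 1 {
		workers = 1
	}
	var total Progress
	for start := 0; start < len(subjects); start += batchSize {
		end := start + batchSize
		if end > len(subjects) {
			end = len(subjects)
		}
		var (
			mu       sync.Mutex
			wg       sync.WaitGroup
			firstErr error
		)
		sem := make(chan struct{}, workers) // bounds concurrent syncs
		for _, s := range subjects[start:end] {
			wg.Add(1)
			sem <- struct{}{} // blocks while `workers` syncs are running
			go func(subject string) {
				defer wg.Done()
				defer func() { <-sem }()
				synced, err := fn(subject)
				mu.Lock()
				defer mu.Unlock()
				switch {
				case err != nil:
					if firstErr == nil {
						firstErr = err
					}
				case synced:
					total.Synced++
				default:
					total.Skipped++
				}
			}(s)
		}
		wg.Wait()
		if firstErr != nil {
			return total, firstErr
		}
		report(total) // progress after every completed batch
	}
	return total, nil
}

func main() {
	subjects := []string{"a", "b", "c", "d", "e"}
	skipC := func(s string) (bool, error) { return s != "c", nil }
	total, err := streamBatches(subjects, 2, 3, skipC, func(p Progress) {
		fmt.Printf("progress: synced=%d skipped=%d\n", p.Synced, p.Skipped)
	})
	fmt.Println(total, err)
}
```

With `workers` set to 1 the semaphore admits one sync at a time, so subjects within a batch are processed in slice order. That is one way the sequential mode described in the commit message could fall out of the same loop.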
