fix: handle VectorBatchAppender type mismatch in consolidateBatches#43
When records across separate poll cycles have different types for the same field (e.g., string vs number), consolidateBatches would crash the task with "The targetVector to append must have the same type". Now it catches the exception and falls back to writing batches individually. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
don't like this. could end up with too many writes
Two fixes:

1. castBatchToSchema used ArrowType equality for the "same type, direct copy" path. For complex types (Struct, List, Map), ArrowType is always equal (e.g., both ArrowType.Struct) even when children differ. Changed to Field.equals which also compares children, nullability, and metadata. Mismatched children now go through castVectorValues instead of a raw copyFromSafe that can throw or corrupt data.
2. Added missing type promotions in castVectorValues that ArrowSchemaMerge.unifySchemas can produce but had no cast handler:
   - Bool → Int32/Int64/Float64 (Bool is in areAllNumeric)
   - TinyInt/SmallInt → Int32 (promoteNumericTypes can return Int32)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
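To illustrate why a top-level type check is not enough for nested fields, here is a minimal sketch in plain Java. `FieldModel` and `sameTypeTag` are hypothetical stand-ins invented for this example, not the Arrow API, and the model leaves out the metadata that the real `Field.equals` also compares. It only shows the shape of the pitfall: two struct fields share a type tag while their children differ.

```java
import java.util.List;

/** Simplified stand-in for an Arrow field (hypothetical, not the Arrow API). */
record FieldModel(String name, String typeTag, boolean nullable, List<FieldModel> children) {

    /** Shallow check: looks only at the top-level type tag and ignores children. */
    boolean sameTypeTag(FieldModel other) {
        return typeTag.equals(other.typeTag);
    }

    // The record-generated equals() compares name, typeTag, nullable,
    // and the children list element by element, so it is a deep check.

    public static void main(String[] args) {
        FieldModel withStringChild = new FieldModel("payload", "Struct", true,
                List.of(new FieldModel("id", "Utf8", true, List.of())));
        FieldModel withLongChild = new FieldModel("payload", "Struct", true,
                List.of(new FieldModel("id", "Int64", true, List.of())));

        // The shallow check treats the two structs as identical.
        System.out.println("same type tag: " + withStringChild.sameTypeTag(withLongChild));
        // The deep check notices that the child types differ.
        System.out.println("deep equals:   " + withStringChild.equals(withLongChild));
    }
}
```

In this model the shallow check would pick the direct-copy path for two structs whose `id` children have different types, while the deep check would send them through a cast step instead.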
- Remove unused allocator parameter from consolidate()
- Remove per-run catch that could write corrupt data on partial append failure
- Narrow flushBatches catch to RuntimeException, re-throw directly
- Add consolidation failure cleanup path to close original batches
- Update javadoc with explicit ownership contract and failure semantics
- Update SchemaMismatchIntegrationTest javadoc for new approach
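The cleanup-on-failure idea in the list above can be sketched in plain Java. This is an assumption about the general pattern, not the code in this PR: `CleanupOnFailure` and `consolidateOrClose` are hypothetical names, and the ownership rule shown (the caller owns the result on success, the inputs are closed on failure) is one plausible reading of the "ownership contract" the commit mentions.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper showing a close-on-failure path for a list of batches. */
final class CleanupOnFailure {
    private CleanupOnFailure() {}

    static <B extends AutoCloseable, R> R consolidateOrClose(
            List<B> batches, Function<List<B>, R> consolidate) {
        try {
            // On success the caller takes ownership of whatever is returned.
            return consolidate.apply(batches);
        } catch (RuntimeException e) {
            // On failure, close every original batch so nothing leaks.
            for (B batch : batches) {
                try {
                    batch.close();
                } catch (Exception closeError) {
                    // Keep the original failure as the primary exception.
                    e.addSuppressed(closeError);
                }
            }
            // Re-throw the original exception unchanged.
            throw e;
        }
    }
}
```

Catching only `RuntimeException` and re-throwing it directly keeps the original stack trace intact, and secondary close errors are attached as suppressed exceptions rather than replacing it.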
Rather than falling back on single batches, I've changed the code to write contiguous schema-matching batches in the most efficient way possible. This removed a lot of code, and opened up several other efficiencies. Also added a lot of tests.

- BatchConsolidator.java
- DucklakeSinkTask.java (flushBatches)
- SchemaMismatchIntegrationTest.java
- BatchConsolidationTest.java
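The "contiguous schema-matching batches" approach can be sketched as a small grouping step. This is a plain-Java illustration under assumptions, not the actual `BatchConsolidator` code: `ContiguousRuns` and `group` are hypothetical names, and the compatibility check is passed in as a predicate because the real check (comparing Arrow schemas, including nested children) is not shown in this conversation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical helper: splits a list into contiguous runs of compatible items. */
final class ContiguousRuns {
    private ContiguousRuns() {}

    static <T> List<List<T>> group(List<T> items, BiPredicate<T, T> compatible) {
        List<List<T>> runs = new ArrayList<>();
        List<T> current = null;
        for (T item : items) {
            // Start a new run when the item is not compatible with the head of the current run.
            if (current == null || !compatible.test(current.get(0), item)) {
                current = new ArrayList<>();
                runs.add(current);
            }
            current.add(item);
        }
        return runs;
    }
}
```

With schemas reduced to labels, `ContiguousRuns.group(List.of("A", "A", "B", "A"), String::equals)` yields three runs: `[A, A]`, `[B]`, and `[A]`. Order is preserved, and the number of writes equals the number of runs rather than the number of batches, which is the point of the "too many writes" concern raised earlier.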
This also removes the type promotion that had initially been in the PR.
Pull request overview
This PR addresses production task failures caused by Arrow VectorBatchAppender type mismatches during batch consolidation by introducing schema-aware consolidation that only appends batches with compatible (including nested-child) schemas, and by moving consolidation logic out of DucklakeSinkTask into a dedicated helper.
Changes:
- Add `BatchConsolidator` to group contiguous batches by compatible schema and consolidate each run via in-place `VectorBatchAppender` append.
- Update `DucklakeSinkTask.flushBatches` to use `BatchConsolidator` and simplify the flush path by writing one batch per compatible run.
- Add unit + integration coverage for consolidation behavior, schema-compatibility edge cases, and the previously failing mismatch scenario.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/main/java/com/inyo/ducklake/connect/DucklakeSinkTask.java` | Switch flush-time consolidation to `BatchConsolidator` and adjust close/error-handling paths. |
| `src/main/java/com/inyo/ducklake/connect/BatchConsolidator.java` | New utility to group by compatible schema and consolidate runs in-place using `VectorBatchAppender`. |
| `src/test/java/com/inyo/ducklake/connect/BatchConsolidationTest.java` | New unit tests for consolidation correctness, schema compatibility, ordering, and memory management. |
| `src/integrationTest/java/com/inyo/ducklake/connect/SchemaMismatchIntegrationTest.java` | New integration test reproducing the schema mismatch scenario across poll cycles and asserting the task remains running. |
| `CLAUDE.md` | Update local dev environment guidance (mise → flox). |
Summary
- `events-ducklake` tasks are FAILED with `IllegalArgumentException: The targetVector to append must have the same type as the targetVector being appended` in `consolidateBatches`
- `castBatchToSchema` used `ArrowType` equality for the "same type, direct copy" path. For complex types (Struct, List, Map), `ArrowType` is always equal (e.g., both `ArrowType.Struct`) even when children differ. This caused raw `copyFromSafe` calls between vectors with mismatched child types, which either threw or corrupted data
- Changed to `Field.equals`, which compares children, nullability, and metadata. Mismatched fields now go through `castVectorValues` for proper type promotion
- Added missing type promotions that `ArrowSchemaMerge.unifySchemas` can produce but `castVectorValues` didn't handle: `Bool → Int32/Int64/Float64`, `TinyInt/SmallInt → Int32`
- `consolidateBatches` now catches `VectorBatchAppender` exceptions and falls back to writing batches individually, instead of crashing the task

Test plan
- `SchemaMismatchIntegrationTest`: sends records with conflicting types (string vs number) in separate poll cycles, verifies task stays RUNNING after flush

🤖 Generated with Claude Code