Commit 567f441

jecsand838 and alamb authored
Add array/map/fixed schema resolution and default value support to arrow-avro codec (#8292)
# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro’s **schema resolution** requires readers to reconcile differences between the writer and reader schemas, including:

- Using record-field **default values** when the writer lacks a field present in the reader; defaults must be type-correct (i.e., union defaults match the first union member; bytes/fixed defaults are JSON strings).
- Recursively resolving **arrays** (by item schema) and **maps** (by value schema).
- Resolving **fixed** types (size and unqualified name must match) and erroring when they do not.

Prior to this change, arrow-avro’s resolution handled some cases but lacked full Codec support for **default values** and for resolving **array/map/fixed** shapes between writer and reader. This led to gaps when reading evolved data or datasets produced by heterogeneous systems. This PR implements these missing pieces so the Arrow reader behaves per the spec in common evolution scenarios.

# What changes are included in this PR?

This PR modifies **`arrow-avro/src/codec.rs`** to extend the schema-resolution path:

- **Default value handling** for record fields
  - Reads and applies default values when the reader expects a field absent from the writer, including **nested defaults**.
  - Validates defaults per the Avro spec (e.g., union defaults match the first schema; bytes/fixed defaults are JSON strings).
- **Array / Map / Fixed schema resolution**
  - **Array**: recursively resolves item schemas (writer↔reader).
  - **Map**: recursively resolves value schemas.
  - **Fixed**: enforces matching size and (unqualified) name; otherwise signals an error, consistent with the spec.
- **Codec updates**
  - Refactors internal codec logic to support the above during decoding, including resolution for **record fields** and **nested defaults**. (See commit message for the high-level summary.)

# Are these changes tested?

**Yes.** This PR includes new unit tests in `arrow-avro/src/codec.rs` covering:

1) **Default validation & persistence**
   - `Null`/union‑nullability rules; metadata persistence of defaults (`AVRO_FIELD_DEFAULT_METADATA_KEY`).
2) **`AvroLiteral` Parsing**
   - Range checks for `i32`/`f32`; correct literals for `i64`/`f64`; `Utf8`/`Utf8View`; `uuid` strings (RFC‑4122).
   - Byte‑range mapping for `bytes`/`fixed` defaults; `Fixed(n)` length enforcement; `decimal` on `fixed` vs `bytes`; `duration`/interval fixed **12**‑byte enforcement.
3) **Collections & records**
   - Array/map defaults shape; enum symbol validity; record defaults for missing fields, required‑field errors, and honoring field‑level defaults; skip‑fields retained for writer‑only fields.
4) **Resolution mechanics**
   - Element **promotion** (`int` to `long`) for arrays; **reader metadata precedence** for colliding attributes; `fixed` name/size match including **alias**.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Andrew Lamb <[email protected]>
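As an illustration of the array/map/fixed resolution rules described above, here is a minimal sketch. It does not use arrow-avro's actual types: the `Schema` enum and the `unqualified` and `resolve` functions are hypothetical names invented for this example, and the alias check on `fixed` is an assumption based on the "name/size match including alias" test description rather than on the real codec code.

```rust
/// Hypothetical, heavily simplified schema model (NOT arrow-avro's types).
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Int,
    Long,
    Array(Box<Schema>),
    Map(Box<Schema>),
    Fixed { name: String, aliases: Vec<String>, size: usize },
}

/// Strip the namespace from a possibly qualified name ("a.b.Md5" -> "Md5").
fn unqualified(name: &str) -> &str {
    name.rsplit('.').next().unwrap_or(name)
}

/// Resolve a writer schema against a reader schema, returning the schema
/// the decoder should produce, or an error if the two are incompatible.
fn resolve(writer: &Schema, reader: &Schema) -> Result<Schema, String> {
    match (writer, reader) {
        (Schema::Int, Schema::Int) => Ok(Schema::Int),
        // `int` written, `long` requested: promote. The reverse is an error.
        (Schema::Long, Schema::Long) | (Schema::Int, Schema::Long) => Ok(Schema::Long),
        // Arrays resolve recursively by item schema.
        (Schema::Array(w), Schema::Array(r)) => Ok(Schema::Array(Box::new(resolve(w, r)?))),
        // Maps resolve recursively by value schema.
        (Schema::Map(w), Schema::Map(r)) => Ok(Schema::Map(Box::new(resolve(w, r)?))),
        // Fixed requires the same size and the same unqualified name
        // (assumption: a reader alias may also match the writer's name).
        (
            Schema::Fixed { name: wn, size: ws, .. },
            Schema::Fixed { name: rn, aliases, size: rs },
        ) => {
            if ws != rs {
                return Err(format!("fixed size mismatch: writer {ws}, reader {rs}"));
            }
            let w = unqualified(wn);
            let name_ok = unqualified(rn) == w || aliases.iter().any(|a| unqualified(a) == w);
            if !name_ok {
                return Err(format!("fixed name mismatch: writer {wn}, reader {rn}"));
            }
            Ok(reader.clone())
        }
        _ => Err(format!("cannot resolve {writer:?} against {reader:?}")),
    }
}
```

In this sketch, resolving a writer `Array(Int)` against a reader `Array(Long)` yields `Array(Long)` through element promotion, while two `Fixed` schemas with different sizes produce an error instead of a silently truncated read.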
1 parent 226a425 commit 567f441

1 file changed: +810 -53 lines changed
