Added List and Struct Encoding to arrow-avro Writer (#8274)
# Which issue does this PR close?
- Part of #4886
# Rationale for this change
This refactor streamlines the `arrow-avro` writer by introducing a
single, schema‑driven `RecordEncoder` that plans writes up front and
encodes rows using consistent, explicit rules for nullability and type
dispatch. It reduces duplication in nested/struct/list handling, makes
the order of Avro union branches (null‑first vs null‑second) an explicit
choice, and aligns header schema generation with value encoding.
This should improve correctness (especially for nested optionals), make
behavior easier to reason about, and pave the way for future
optimizations.
# What changes are included in this PR?
**High‑level:**
* Introduces a unified, schema‑driven `RecordEncoder` with a builder
that walks the Avro record in Avro order and maps each field to its
Arrow column, producing a reusable write plan. The encoder covers
scalars and nested types (struct, (large) lists, maps,
strings/binaries).
* Applies a single model of **nullability** throughout encoding,
including nested sites (list items, fixed‑size list items, map values),
and uses explicit union‑branch indices according to the chosen order.
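To illustrate how the null-union order might drive both the header schema and the per-value branch index, here is a minimal sketch. It is not the crate's code: `NullOrder`, `nullable_union_json`, `branch_index`, and `branch_byte` are hypothetical names chosen for illustration, and the only facts it relies on are that an Avro union is a JSON array of branches and that a branch index is written as a zigzag varint (so index 0 is `0x00` and index 1 is `0x02`).

```rust
/// Hypothetical stand-in for the writer's null-ordering choice
/// (illustrative only; not the crate's actual `Nullability` type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NullOrder {
    /// Avro union written as ["null", T].
    NullFirst,
    /// Avro union written as [T, "null"].
    NullSecond,
}

/// Builds the Avro JSON union for a nullable site.
/// `inner` is the already-serialized JSON of the non-null branch.
fn nullable_union_json(inner: &str, order: NullOrder) -> String {
    match order {
        NullOrder::NullFirst => format!("[\"null\",{}]", inner),
        NullOrder::NullSecond => format!("[{},\"null\"]", inner),
    }
}

/// Returns the union branch index (0 or 1) for a value at a nullable site.
fn branch_index(order: NullOrder, is_null: bool) -> i64 {
    match (order, is_null) {
        (NullOrder::NullFirst, true) | (NullOrder::NullSecond, false) => 0,
        (NullOrder::NullFirst, false) | (NullOrder::NullSecond, true) => 1,
    }
}

/// Returns the single byte that encodes the branch index on the wire.
/// For indices 0 and 1 the zigzag varint is just `index << 1`.
fn branch_byte(order: NullOrder, is_null: bool) -> u8 {
    (branch_index(order, is_null) << 1) as u8
}
```

Under these assumptions, a null value at a Null-First site is written as `0x00`, and the same null at a Null-Second site is written as `0x02`. Deriving the schema JSON and the branch byte from the same ordering value is what keeps the header and the encoded rows in agreement.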
**API and implementation details:**
* **Writer / encoder refactor**
* Replaces the previous per‑column/child encoding paths with a
**`FieldPlan`** tree (variants for `Scalar`, `Struct { … }`, and `List {
… }`) and per‑site `nullability` carried from the Avro schema.
* Adds encoder variants for `LargeBinary`, `Utf8`, `Utf8Large`, `List`,
`LargeList`, and `Struct`.
* Encodes union branch indices with `write_optional_index`, which writes
`0x00` or `0x02` (the Avro encodings of branch 0 and branch 1) according
to the Null‑First/Null‑Second order, replacing the old branch write.
* **Schema generation & metadata**
* Moves the **`Nullability`** enum to `schema.rs` and threads it through
schema generation and writer logic.
* Adds `AvroSchema::from_arrow_with_options(schema,
Option<Nullability>)` to either reuse embedded Avro JSON or build new
Avro JSON that **honors the requested null‑union order at all nullable
sites**.
* Adds `extend_with_passthrough_metadata` so Arrow schema metadata is
copied into Avro JSON while skipping Avro‑reserved and internal Arrow
keys.
* Introduces helpers like `wrap_nullable` and
`arrow_field_to_avro_with_order` to apply ordering consistently for
arrays, fixed‑size lists, maps, structs, and unions.
* **Format and glue**
* Simplifies `writer/format.rs` by removing the `EncoderOptions`
plumbing from the OCF format; `write_long` remains exported for header
writing.
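For readers unfamiliar with the wire format behind the list encoding and `write_long`, the following sketch shows the two Avro binary rules involved, as described in the Avro specification: a `long` is zigzag-encoded and then written as a base-128 varint, and an array is written as blocks of `count, items...` terminated by a zero count. The function names `encode_long` and `encode_int_array` are hypothetical and do not mirror the crate's signatures; the sketch emits a single block per array, which is one valid layout among several.

```rust
/// Appends `value` in Avro `long` encoding: zigzag, then base-128 varint
/// with the least significant 7-bit group first.
fn encode_long(value: i64, out: &mut Vec<u8>) {
    // Zigzag maps small negative and positive values to small unsigned ones.
    let mut z = ((value << 1) ^ (value >> 63)) as u64;
    while z >= 0x80 {
        // Low 7 bits plus a continuation bit.
        out.push(((z as u8) & 0x7F) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Appends a list of `i32` values as an Avro array: one block holding all
/// items (if any), followed by a zero count that ends the array.
fn encode_int_array(items: &[i32], out: &mut Vec<u8>) {
    if !items.is_empty() {
        encode_long(items.len() as i64, out);
        for &item in items {
            // Avro `int` uses the same zigzag varint encoding as `long`.
            encode_long(i64::from(item), out);
        }
    }
    encode_long(0, out);
}
```

With this sketch, the list `[1, 2]` would be written as the count `2`, the two items, and the terminating `0`, each as a zigzag varint. A nullable list would be preceded by the union branch index written with the same `long` encoding.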
# Are these changes tested?
Yes.
* Adds focused unit tests in `writer/encoder.rs` that verify scalar and
string/binary encodings (e.g., Binary/LargeBinary, Utf8/LargeUtf8) and
validate length/branch encoding primitives used by the writer.
* Adds round‑trip integration tests in `writer/mod.rs` that validate List
and Struct decoding.
* Adjusts existing schema tests (e.g., decimal metadata expectations) to
align with the new schema/metadata handling.
# Are there any user-facing changes?
N/A because arrow-avro is not public yet.
---------
Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Matthijs Brobbel <[email protected]>