Regression Testing, Bug Fixes, and Public API Tightening for arrow-avro (#8492)
# Which issue does this PR close?
- **Related to**: #4886 (“Add Avro Support”)
# Rationale for this change
**NOTE:** This PR contains over **2300 lines of test code**. The actual
production code diff is **less than 800 LOC**.
Before we publish `arrow-avro`, we want to minimize its public API
surface and ship a well‑tested, spec‑compliant implementation. While
adding extensive regression tests and canonical‑form checks, we found
several correctness gaps around alias handling, union resolution,
Unicode/name validation, list child nullability, “null” string
handling, and a mis-wired `Writer` capacity setting. This PR tightens
the API and fixes those issues to align with the Avro spec (aliases and
defaults, union resolution, names and Unicode, etc.).
# What changes are included in this PR?
**Public API tightening**
- Restrict visibility of numerous schema/codec types and functions
within `arrow-avro` so that only the intended entry points are public
before the crate is published.
**Bug fixes discovered via regression testing (All fixed)**
1. **Alias bugs (aliases without defaults / restrictive namespaces)**
- Enforce spec‑compliant alias resolution: aliases may be
fully‑qualified or relative to the reader’s namespace, and alias‑based
rewrites still require reader defaults when the writer field is absent.
This follows Avro’s alias rules and record‑field default behavior.
2. **Special‑case union resolution (writer not a union, reader is)**
- When the writer schema is **not** a union but the reader is, we no
longer attempt to decode a union `type_id`; per spec, the reader must
pick the first union branch that matches the writer’s schema.
3. **Valid Avro Unicode characters & name rules in Schema**
- Distinguish between *Unicode strings* (which may contain any valid
UTF‑8) and *identifiers* (names/enum symbols) which must match
`[A-Za-z_][A-Za-z0-9_]*`. Tests were added to accept valid Unicode
string content while enforcing the ASCII identifier regex.
4. **Nullable `ListArray` child item bug**
- Correct mapping of Avro array item nullability to Arrow `ListArray`’s
inner `"item"` field. (By convention the inner field is named `"item"`
and nullability is explicit.) This aligns with Arrow’s builder/typing
docs.
5. **“null” string vs. hard `null`**
- Fix default/value handling to differentiate JSON `null` from the
string literal `"null"` per the Avro defaults table.
6. **`Writer` capacity knob wired up**
- Plumb the provided capacity through the writer implementation so that
the configured value is actually used for preallocation. (See changes
under `arrow-avro/src/writer/mod.rs`.)
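
For illustration, the identifier rule from item 3 can be sketched in plain Rust. This is a minimal sketch, not the crate's actual code: the helper name `is_valid_avro_name` is hypothetical, and it only checks a single name component against `[A-Za-z_][A-Za-z0-9_]*` (dotted, namespace-qualified names would need to be split first).

```rust
/// Hypothetical helper (not part of `arrow-avro`): returns `true` when `name`
/// matches the Avro identifier pattern `[A-Za-z_][A-Za-z0-9_]*`.
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    // The first character must be an ASCII letter or underscore.
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false, // empty string or invalid leading character
    }
    // Every remaining character must be an ASCII letter, digit, or underscore.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// String *content* (docs, string defaults, values) is unrestricted: any
/// `&str` is already valid UTF-8 in Rust, so no further check is applied.
/// Hypothetical helper, shown only to contrast with the name rule above.
fn is_valid_avro_string_content(_value: &str) -> bool {
    true
}
```

Under this sketch, a name such as `user_id` passes, while `1st` or `naïve` is rejected as an identifier even though both remain acceptable as string content.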
# Are these changes tested?
Yes. This PR adds substantial regression coverage:
- Canonical‑form checks for schemas.
- Alias/namespace + default‑value resolution cases.
- Reader‑union vs. writer‑non‑union decoding paths.
- Unicode content vs. identifier name rules.
- `ListArray` inner field nullability behavior.
- Round‑trips exercising the `Writer` with the capacity knob set.
A new, comprehensive Avro fixture (`test/data/comprehensive_e2e.avro`)
is included to drive end‑to‑end scenarios and edge cases.
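
To make the reader‑union vs. writer‑non‑union case concrete, here is a minimal sketch of the "first matching branch" rule. It does not use the crate's types: `Kind`, `is_readable_as`, and `resolve_reader_union` are hypothetical names, and the matching logic covers only identical types plus the `int` to `long` promotion, which is a small subset of the spec's resolution rules.

```rust
/// Simplified schema kinds; a hypothetical stand-in for real schema types.
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    Null,
    Int,
    Long,
    Str,
}

/// Whether data written as `writer` can be read as `reader`.
/// Assumption: only identity and int -> long promotion are modeled here.
fn is_readable_as(writer: &Kind, reader: &Kind) -> bool {
    writer == reader || matches!((writer, reader), (Kind::Int, Kind::Long))
}

/// When the writer schema is not a union but the reader schema is, no union
/// branch index is read from the data. Instead, pick the first reader branch
/// that matches the writer schema; `None` signals a resolution error.
fn resolve_reader_union(writer: &Kind, reader_branches: &[Kind]) -> Option<usize> {
    reader_branches
        .iter()
        .position(|branch| is_readable_as(writer, branch))
}
```

With a writer schema of `int` and a reader union of `[null, long, int]`, this sketch selects the `long` branch because it is the first one that matches, even though an exact `int` branch appears later.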
# Are there any user-facing changes?
N/A