Commit 5dd3463

Add arrow-avro SchemaStore and fingerprinting (#8039)
# Which issue does this PR close?

- Part of #4886
- Pre-work for #8006

# Rationale for this change

Apache Avro’s [single object encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding) prefixes every record with the marker `0xC3 0x01` followed by a `Rabin` [schema fingerprint](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints) so that readers can identify the correct writer schema without carrying the full definition in each message.

While the current `arrow-avro` implementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time.

The Avro specification recommends computing a 64-bit CRC-64-AVRO (Rabin) fingerprint of the [parsing canonical form of a schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas) and using it to look up the `Schema` in a local schema store or registry.

This PR introduces **`SchemaStore`** and **fingerprinting** to enable:

* **Zero-copy schema identification** for decoding streaming Avro messages published in single-object format (e.g. Kafka, Pulsar) into Arrow.
* **Dynamic schema evolution** by laying the foundation to resolve writer/reader schema differences on the fly.

**NOTE:** Integration with `Decoder` and `Reader` is coming in the next PR.

# What changes are included in this PR?

| Area | Highlights |
| ------------------- | ---------- |
| **`schema.rs`** | *New* `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical-form generator; Rabin fingerprint calculator; `compare_schemas` helper. |
| **`lib.rs`** | `mod schema` is now `pub` |
| **Unit tests** | New tests covering fingerprint generation, store registration/lookup, unknown-fingerprint errors, and interaction with UTF8-view decoding. |
| **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |

# Are these changes tested?

Yes. New tests cover:

1. **Fingerprinting** against the canonical examples from the Avro spec
2. **`SchemaStore` behavior**: deduplication, duplicate registration, and lookup.

# Are there any user-facing changes?

N/A
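The fingerprint-and-lookup flow described in the rationale can be sketched as follows. This is a minimal illustration based on the CRC-64-AVRO algorithm and the single-object framing given in the Avro specification (two marker bytes, then the 8-byte fingerprint in little-endian order, then the record body). The names `rabin_fingerprint`, `split_single_object`, and `ToyStore` are hypothetical and are not the API added by this commit; only the `SINGLE_OBJECT_MAGIC` name is borrowed from the PR description, and its type here is an assumption.

```rust
use std::collections::HashMap;

/// Two-byte marker that starts every single-object-encoded message.
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Seed value ("EMPTY") of the CRC-64-AVRO fingerprint from the Avro spec.
const EMPTY: u64 = 0xc15d213aa4d7a795;

/// Build the 256-entry lookup table used by the byte-at-a-time fingerprint.
fn fp_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // XOR in EMPTY only when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    table
}

/// CRC-64-AVRO (Rabin) fingerprint of a schema's canonical-form bytes.
fn rabin_fingerprint(canonical: &[u8]) -> u64 {
    let table = fp_table();
    canonical.iter().fold(EMPTY, |fp, &b| {
        (fp >> 8) ^ table[((fp ^ u64::from(b)) & 0xff) as usize]
    })
}

/// Split a framed message into (fingerprint, body); None if it is not framed.
fn split_single_object(msg: &[u8]) -> Option<(u64, &[u8])> {
    if msg.len() < 10 || !msg.starts_with(&SINGLE_OBJECT_MAGIC) {
        return None;
    }
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&msg[2..10]);
    // The spec stores the fingerprint in little-endian byte order.
    Some((u64::from_le_bytes(bytes), &msg[10..]))
}

/// Hypothetical stand-in for a schema store: fingerprint -> canonical schema.
#[derive(Default)]
struct ToyStore {
    schemas: HashMap<u64, String>,
}

impl ToyStore {
    /// Register a canonical-form schema and return its fingerprint.
    fn register(&mut self, canonical: &str) -> u64 {
        let fp = rabin_fingerprint(canonical.as_bytes());
        self.schemas.entry(fp).or_insert_with(|| canonical.to_string());
        fp
    }

    /// Look up the writer schema for a fingerprint read from a message.
    fn lookup(&self, fp: u64) -> Option<&str> {
        self.schemas.get(&fp).map(String::as_str)
    }
}
```

A reader would register each known writer schema once, call `split_single_object` on every incoming message, and pass the returned fingerprint to `lookup` before decoding the body. Registering the same schema twice yields the same fingerprint, so the map holds a single entry for it.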
1 parent a3d144f commit 5dd3463

3 files changed: +564 -5 lines changed

arrow-avro/Cargo.toml

Lines changed: 1 addition & 0 deletions

@@ -56,6 +56,7 @@ bzip2 = { version = "0.6.0", optional = true }
 xz = { version = "0.1", default-features = false, optional = true }
 crc = { version = "3.0", optional = true }
 uuid = "1.17"
+strum_macros = "0.27"
 
 [dev-dependencies]
 arrow-data = { workspace = true }

arrow-avro/src/lib.rs

Lines changed: 4 additions & 4 deletions

@@ -33,10 +33,10 @@
 /// Implements the primary reader interface and record decoding logic.
 pub mod reader;
 
-// Avro schema parsing and representation
-//
-// Provides types for parsing and representing Avro schema definitions.
-mod schema;
+/// Avro schema parsing and representation
+///
+/// Provides types for parsing and representing Avro schema definitions.
+pub mod schema;
 
 /// Compression codec implementations for Avro
 ///
