
Datashape scalability #13

@wesm


I have a high-level design question about using text as the serialized representation of array metadata. In my opinion, it is not the best choice as a primary representation. Let me explain why.

A problem the big data community has run into is "large metadata": storing very wide or very complex tables in file formats like Parquet has caused performance problems because manipulating the metadata itself (serialized as a Thrift message) is unwieldy -- parsing and serializing it is expensive, and even a small projection that selects a few fields out of a large data set incurs the cost of deserializing the entire metadata.
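To make the cost concrete, here is a minimal sketch (using JSON as a stand-in for any text/parse-on-read encoding -- the column count and field names are made up for illustration) showing that reading one field out of a wide schema forces a full parse:

```python
import json

# Hypothetical "wide table" metadata: 50,000 columns serialized as text.
schema = {"columns": [{"name": f"col_{i}", "type": "int64"} for i in range(50_000)]}
blob = json.dumps(schema)

# To read the type of a single column, the entire document must be parsed first;
# the cost grows with the size of the whole schema, not with the size of the query.
parsed = json.loads(blob)
one_type = parsed["columns"][123]["type"]
print(one_type)
```

The same shape of cost applies to Thrift-serialized Parquet metadata: a reader touching two columns still pays for deserializing all 50,000.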

In Apache Arrow, we have a more restricted type system than datashape, but we decided to use a "zero copy" or "no parse" technology, Flatbuffers (from Google), to represent metadata in a portable way. We chose Flatbuffers over alternatives (like Cap'n Proto -- created by the main author of Google's Protocol Buffers serialization format) because of community traction and cross-language support (it has first-class support for C++, C#, C, Go, Java, JS, PHP, and Python, with Rust and other languages in the works).

Using a technology like Flatbuffers for the serialized representation enables O(1) inspection of schemas, without any copying or parsing overhead, regardless of how big the schema is. Have you thought about using something like this instead?
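The mechanism behind that O(1) property is an offset table: the reader jumps directly to the field it wants instead of parsing everything before it. Here is a toy sketch of that idea (an illustrative layout of my own, *not* the actual Flatbuffers wire format, and `encode`/`field_name` are hypothetical helpers):

```python
import struct

# Toy offset-table layout, flatbuffers-style (illustrative only):
#   [u32 field count][u32 offset_0 .. offset_{n-1}][length-prefixed field bytes...]
# Reading field k touches O(1) bytes no matter how many fields the schema has.

def encode(names):
    body, offsets = bytearray(), []
    base = 4 + 4 * len(names)  # header size: count word + offset table
    for name in names:
        offsets.append(base + len(body))
        raw = name.encode()
        body += struct.pack("<I", len(raw)) + raw
    return (struct.pack("<I", len(names))
            + b"".join(struct.pack("<I", o) for o in offsets)
            + bytes(body))

def field_name(buf, k):
    # Jump straight to slot k in the offset table, then to the field itself.
    (off,) = struct.unpack_from("<I", buf, 4 + 4 * k)
    (n,) = struct.unpack_from("<I", buf, off)
    return buf[off + 4 : off + 4 + n].decode()

buf = encode([f"col_{i}" for i in range(100_000)])
print(field_name(buf, 42))  # reads one field; the other 99,999 are never parsed
```

Real Flatbuffers adds vtables, schema evolution, and generated accessors on top, but the access pattern -- follow offsets into an unmodified byte buffer -- is the same, which is what makes schema inspection cheap regardless of schema size.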
