I have a high-level design question about using text as the serialized representation of array metadata. In my opinion, it is not the best choice as a primary representation. Let me explain why.
A problem the big data community has run into is "large metadata": storing very wide or very complex tables in file formats like Parquet has caused performance problems because manipulating the metadata itself (serialized as a Thrift message) is unwieldy. Parsing and serializing are expensive, and even a small projection (selecting a few fields out of a large data set) incurs the cost of deserializing the entire metadata.
In Apache Arrow, we have a more restricted type system than datashape, but we decided to use a "zero copy" or "no parse" technology, FlatBuffers (from Google), to represent metadata in a portable way. We chose FlatBuffers over alternatives (like Cap'n Proto, created by the primary author of Google's Protocol Buffers) because of community traction and cross-language support: it has first-class support for C++, C#, C, Go, Java, JS, PHP, and Python, with Rust and other languages in the works.
Using a technology like FlatBuffers for the serialized representation enables O(1) inspection of schemas, with no copying or parsing overhead, regardless of how big the schema is. Have you thought about using something like this instead?
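To make the "no parse" idea concrete, here is a toy sketch (not FlatBuffers itself, and not any Arrow API) of the underlying technique: the buffer carries an offset table up front, so reading one field is a couple of fixed-size reads into the raw bytes rather than a full deserialization. The `encode`/`get_field` names and the layout are invented for illustration only.

```python
import struct

def encode(fields):
    """Serialize a list of byte strings behind an offset table.

    Layout: [count: u32] [(offset: u32, length: u32) per field] [field bytes...]
    """
    header = struct.pack("<I", len(fields))
    base = 4 + 8 * len(fields)  # header plus (offset, length) pairs
    offsets, body = [], b""
    for f in fields:
        offsets.append(struct.pack("<II", base + len(body), len(f)))
        body += f
    return header + b"".join(offsets) + body

def get_field(buf, i):
    """O(1) access: read field i without parsing any other field."""
    off, length = struct.unpack_from("<II", buf, 4 + 8 * i)
    return buf[off:off + length]

buf = encode([b"name:f0", b"type:int64", b"nullable:true"])
print(get_field(buf, 1))  # only field 1's bytes are touched
```

With a text (or Thrift-style) representation, answering "what is the type of field 1?" requires parsing everything up to and including that field; with an offset-table layout, the cost is independent of how many fields the schema has, which is what makes very wide schemas cheap to inspect.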