Apache Arrow Flight and Arrow ADBC for data transfer and data loading

Currently, Airbyte uses custom Airbyte JSON streams to transfer data from sources to destinations. This causes significant CPU overhead when translating data from JSON to other formats. Each Airbyte stream also has to carry the schema for its JSON records, which adds data overhead. JSON is an expensive format to send over the network, since every record is encoded and parsed as text.
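To make the size overhead concrete, here is a rough, stdlib-only sketch. It encodes the same records once as per-record JSON lines and once as a columnar binary payload whose schema is stated a single time. The record shape, field names, and the columnar layout are illustrative assumptions; this is neither Airbyte's actual protocol nor Arrow's wire format.

```python
import json
import struct

# Hypothetical records; the field names are illustrative only.
records = [{"id": i, "price": i * 0.5} for i in range(1000)]

# Row-oriented JSON: every record repeats its field names as text.
json_payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Columnar binary: the schema is stated once, then each column is packed
# as fixed-width values (int64 for id, float64 for price).
schema = json.dumps({"id": "int64", "price": "float64"}).encode("utf-8")
ids = struct.pack(f"<{len(records)}q", *(r["id"] for r in records))
prices = struct.pack(f"<{len(records)}d", *(r["price"] for r in records))
columnar_payload = schema + ids + prices

print("JSON bytes:    ", len(json_payload))
print("columnar bytes:", len(columnar_payload))
```

The columnar payload is smaller here because field names are not repeated per record, and a reader can unpack each column without parsing text.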

An efficient way to handle such workloads would be to translate data to [Apache Arrow](https://arrow.apache.org/docs/index.html) format. 

Advantages:
1. The Arrow IPC file format supports zero-copy reads and stores the schema with the data.
2. [Arrow Flight](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) can be leveraged internally within Airbyte to accelerate data transfer from sources to destinations.
3. Because all data transfers between sources and destinations would be in Arrow format, Airbyte could implement intermediate transformations with custom expressions that query Arrow data directly.
4. [Arrow ADBC](https://arrow.apache.org/docs/format/ADBC.html) is an Arrow-native database access API, comparable to JDBC, whose drivers return query results as Arrow data. For databases with an ADBC driver, this would allow translation to happen on the driver/source side. The resulting Arrow batches could then be transferred to the destination images via Arrow Flight, again with a large performance gain.

Disadvantages:
Converting Arrow data to textual or row-oriented targets such as JSON, CSV, or JDBC can make destinations slower because of the translation cost. If the destination format is columnar, such as Parquet, ORC, or Iceberg, translation from Arrow is cheap.

I am happy to collaborate and contribute to this approach. 


Apache Arrow Flight and Arrow Adbc for data transfer and data loading #24546
