Description
Currently, Airbyte uses custom Airbyte JSON streams to transfer data between sources and destinations. This causes significant CPU overhead when translating data from JSON to other formats. Each Airbyte stream also has to carry the schema for its JSON records, which adds data overhead, and JSON is a verbose, text-based encoding that is expensive to serialize and send over the network.
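To make the overhead concrete, here is a rough sketch using only the Python standard library. The message shape is only loosely modeled on an Airbyte record message, and the stream name `users`, the two fields, and the hand-rolled columnar layout are assumptions for illustration. It is not Arrow itself, just the same idea of writing the schema once and packing each column into a contiguous buffer.

```python
import json
import struct

# Hypothetical stream of 1,000 records with two fields.
ids = list(range(1000))
scores = [i * 0.5 for i in ids]

# Row-oriented JSON: one message per record, so the stream name and the
# field names are serialized again for every single record.
json_messages = [
    json.dumps({"type": "RECORD",
                "record": {"stream": "users", "data": {"id": i, "score": s}}})
    for i, s in zip(ids, scores)
]
json_bytes = sum(len(m.encode("utf-8")) for m in json_messages)

# Columnar binary: the schema is written once, then each column is a
# contiguous buffer of fixed-width little-endian values.
schema = json.dumps(
    {"stream": "users", "fields": {"id": "int64", "score": "float64"}}
).encode("utf-8")
id_buffer = struct.pack(f"<{len(ids)}q", *ids)
score_buffer = struct.pack(f"<{len(scores)}d", *scores)
columnar_bytes = len(schema) + len(id_buffer) + len(score_buffer)

print(f"JSON messages:   {json_bytes} bytes")
print(f"Columnar layout: {columnar_bytes} bytes")
```

Besides the size difference, the JSON side has to parse text and rebuild typed values for every record, while the packed buffers can be read back with a single `struct.unpack` call per column.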
An efficient way to handle such workloads would be to translate data to Apache Arrow format.
Advantages:
- The Arrow file (IPC) format supports zero-copy reads and stores the schema alongside the data.
- Arrow Flight can be leveraged internally within Airbyte to accelerate data transfer from sources to destinations.
- Because all data transfers between sources and destinations would be in Arrow format, Airbyte could implement intermediate transformations with custom expressions that query Arrow data directly.
- Arrow ADBC (Arrow Database Connectivity) is an API for fetching query results from a database directly as Arrow data, provided a driver exists for that database. This would let the translation happen on the driver/source side, and the resulting Arrow batches could then be transferred to the destination images via Arrow Flight, again with a large performance gain.
Disadvantages:
Converting Arrow data to textual or row-oriented targets such as JSON, CSV, or JDBC can make destinations slower because of the data translation involved. If the destination format is columnar, such as Parquet or ORC (or Iceberg tables stored in those formats), translating from Arrow is cheap.
I am happy to collaborate and contribute to this approach.