Delta Lake tables are very slow with DuckDB, faster with DataFusion, break with Polars for 1 billion rows #6771
lostmygithubaccount started this conversation in General
I wanted to generate 1 billion rows of data and do some comparisons between backends. I ended up with a script like this to generate the data:
resulting in a decent amount of data (larger than RAM):
I observed some interesting behavior across DuckDB, DataFusion, and Polars, and between Parquet and Delta Lake. To summarize:

- DuckDB: fast for `read_parquet` and very slow for `read_delta`
- DataFusion: fast for `read_parquet` and much faster for `read_delta`
- Polars: fails on both
(The goofy timing in the code below was due to issues with `%time`.)

DuckDB:
DataFusion:
Polars (both fail after tens of seconds):
I'm not really sure what to make of this, but figured I'd document it here.