In my benchmarks I've found nanoparquet to be much less efficient than arrow in terms of speed and RAM usage:
| expression | median | mem_alloc | name | size |
|---|---:|---:|---|---|
| df_parquet | 1.153 | 5.578 | write | small |
| df_nanoparquet | 0.674 | 183.986 | write | small |
| dt_parquet | 5.172 | 0.018 | write | small |
| dt_nanoparquet | 0.656 | 183.876 | write | small |
| df_parquet | 10.878 | 0.015 | write | big |
| df_nanoparquet | 10.182 | 2068.884 | write | big |
| dt_parquet | 11.461 | 0.015 | write | big |
| dt_nanoparquet | 10.038 | 2068.947 | write | big |
| df_parquet | 0.088 | 34.901 | read | small |
| df_nanoparquet | 0.414 | 183.187 | read | small |
| df_parquet | 1.187 | 0.009 | read | big |
| df_nanoparquet | 5.180 | 1324.072 | read | big |
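For context, the `median`/`mem_alloc` columns look like `bench::mark()` output; a minimal sketch of how one such comparison could be produced (assuming `df` is a data frame already in memory — the file names and setup here are hypothetical, not the exact benchmark code used above):

```r
library(bench)

df <- data.frame(x = runif(1e6), y = sample(letters, 1e6, replace = TRUE))

# Compare write paths; check = FALSE because the expressions return different values
res <- bench::mark(
  df_parquet     = arrow::write_parquet(df, tempfile(fileext = ".parquet")),
  df_nanoparquet = nanoparquet::write_parquet(df, tempfile(fileext = ".parquet")),
  check = FALSE
)
res[, c("expression", "median", "mem_alloc")]
```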
Speed and RAM usage when reading big files are not very good.
On the nanoparquet repo they say:

> Being single-threaded and not fully optimized, nanoparquet is probably not suited well for large data sets. It should be fine for a couple of gigabytes. Reading or writing a ~250MB file that has 32 million rows and 14 columns takes about 10-15 seconds on an M2 MacBook Pro. For larger files, use Apache Arrow or DuckDB.
rio already uses arrow for feather, so I'm not sure why we rely on nanoparquet for parquet. If you keep nanoparquet as the default, maybe we could have an option to use arrow instead?
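In the meantime, a workaround for large files is to bypass rio for parquet and call arrow directly; a minimal sketch, assuming the arrow package is installed (the file name is illustrative):

```r
library(arrow)

# Write a data frame to parquet with arrow instead of rio/nanoparquet
write_parquet(mtcars, "mtcars.parquet")

# Read it back with arrow's reader
df <- read_parquet("mtcars.parquet")

# nanoparquet equivalents, for comparison:
# nanoparquet::write_parquet(mtcars, "mtcars.parquet")
# df <- nanoparquet::read_parquet("mtcars.parquet")
```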