
use arrow::read_parquet instead of nanoparquet #462

@BenoitLondon

Description

In my benchmarks I've found nanoparquet to be much less efficient than arrow in terms of speed and RAM usage (a sketch of how such a comparison can be reproduced follows the table):

        expression median mem_alloc   name   size
            <char>  <num>     <num> <char> <char>
 1:     df_parquet  1.153     5.578  write  small
 2: df_nanoparquet  0.674   183.986  write  small
 3:     dt_parquet  5.172     0.018  write  small
 4: dt_nanoparquet  0.656   183.876  write  small
 5:     df_parquet 10.878     0.015  write    big
 6: df_nanoparquet 10.182  2068.884  write    big
 7:     dt_parquet 11.461     0.015  write    big
 8: dt_nanoparquet 10.038  2068.947  write    big
 9:     df_parquet  0.088    34.901   read  small
10: df_nanoparquet  0.414   183.187   read  small
11:     df_parquet  1.187     0.009   read    big
12: df_nanoparquet  5.180  1324.072   read    big
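
For context, here is a minimal sketch of how a write comparison like this could be reproduced with the bench package. The data frame, its size, and the file paths are made up for illustration; this is not the exact benchmark behind the table above.

```r
library(bench)

# Hypothetical test data; not the exact data behind the table above.
df <- data.frame(x = rnorm(1e6), y = sample(letters, 1e6, replace = TRUE))
f_arrow <- tempfile(fileext = ".parquet")
f_nano  <- tempfile(fileext = ".parquet")

# Compare write speed and allocated memory for the two engines.
bench::mark(
  arrow       = arrow::write_parquet(df, f_arrow),
  nanoparquet = nanoparquet::write_parquet(df, f_nano),
  check = FALSE  # return values differ between engines, so skip comparison
)
```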

Speed and RAM usage when reading big files are not very good.

The nanoparquet repo itself says:

Being single-threaded and not fully optimized, 
nanoparquet is probably not suited well for large data sets. 
It should be fine for a couple of gigabytes. 
Reading or writing a ~250MB file that has 32 million rows 
and 14 columns takes about 10-15 seconds on an M2 MacBook Pro.
 For larger files, use Apache Arrow or DuckDB.

rio already uses arrow for feather, so I'm not sure why we rely on nanoparquet for parquet.

If you keep nanoparquet as the default, maybe we could have an option to use arrow instead? A hypothetical sketch is below.
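
Something along these lines is what I have in mind. Note that `rio.parquet_engine` and `read_parquet_file` are invented names for illustration, not an existing rio option or function:

```r
# Hypothetical engine switch; the option name "rio.parquet_engine" is
# invented for illustration and does not exist in rio today.
read_parquet_file <- function(file) {
  engine <- getOption("rio.parquet_engine", default = "nanoparquet")
  switch(engine,
    arrow       = arrow::read_parquet(file),
    nanoparquet = nanoparquet::read_parquet(file),
    stop("unknown parquet engine: ", engine)
  )
}

# Usage: opt in to arrow for big files.
# options(rio.parquet_engine = "arrow")
# d <- read_parquet_file("big.parquet")
```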
