On the size of an xarray.DataArray loaded via a Parquet store vs. Raw Data Files #539
In #345 (comment) I reported that the total size of thousands of single Parquet stores was 147.3 MB, which is way smaller than the 1.5 GB size of the one aggregate Parquet store combining all of them! @martindurant explains this in #345 (comment).
Today I have what I think is a similar question (see also the Details below). I observe a difference between the reported size of an xarray.DataArray and the total size of the 366 raw NetCDF files it is built from. Specifically, I am trying to understand why these sizes differ:
Details

❯ ls -1 *0.nc |wc -l
366
❯ du -sch *0.nc | tail -1
2.2G total
❯ du -sch SISin2000_Italia_48_64_64_zlib_0_combined.parquet
44K SISin2000_Italia_48_64_64_zlib_0_combined.parquet
44K total

and

In [1]: import xarray
In [2]: index = xarray.open_dataset('SISin2000_Italia_48_64_64_zlib_0_combined.parquet')
In [3]: index.SIS
Out[3]:
<xarray.DataArray 'SIS' (time: 17568, lat: 232, lon: 238)> Size: 4GB
[970034688 values with dtype=float32]
Coordinates:
* lat (lat) float32 928B 35.53 35.58 35.62 35.67 ... 46.97 47.03 47.08
* lon (lon) float32 952B 6.675 6.725 6.775 6.825 ... 18.42 18.48 18.52
* time (time) datetime64[ns] 141kB 2000-01-01 ... 2000-12-31T23:30:00
Attributes:
cell_methods: time: point
long_name: Surface Downwelling Shortwave Radiation
standard_name: surface_downwelling_shortwave_flux_in_air
units: W m-2
In [4]: index.SIS.encoding
Out[4]:
{'chunks': (48, 64, 64),
'preferred_chunks': {'time': 48, 'lat': 64, 'lon': 64},
'compressor': None,
'filters': None,
'missing_value': -999,
'_FillValue': np.int16(-999),
'dtype': dtype('int16')}
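For context, the "Size: 4GB" in that repr is just the nominal in-memory footprint of the decoded array (number of values times 4 bytes for float32), not anything read from disk. A back-of-the-envelope check, using only the shape and dtypes shown in Out[3] and Out[4] above:

```python
# Back-of-the-envelope check of the sizes reported above, using only the
# shape and dtypes shown in the xarray repr and in the encoding dict.
import math
import numpy as np

shape = (17568, 232, 238)                 # (time, lat, lon) of the SIS variable
n_values = math.prod(shape)               # 970_034_688 values, as in the repr

as_float32 = n_values * np.dtype("float32").itemsize  # after the default decoding
as_int16 = n_values * np.dtype("int16").itemsize      # the raw int16 payload on disk

print(f"{as_float32 / 1e9:.2f} GB")       # ~3.88 GB -> displayed by xarray as "Size: 4GB"
print(f"{as_int16 / 1e9:.2f} GB")         # ~1.94 GB -> much closer to the 2.2G of the NetCDF files
```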
Further details, for a single NetCDF file:

❯ stat SISin200001010000004231000101MA_Italia_48_64_64_zlib_0.nc
File: .../SISin200001010000004231000101MA_Italia_48_64_64_zlib_0.nc
Size: 6328605 Blocks: 12368 IO Block: 4096 regular file
Device: 8,97 Inode: 8791662107 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ nik) Gid: ( 1000/autologin)
Access: 2025-01-20 23:08:49.843344130 +0100
Modify: 2025-01-20 23:08:29.186676971 +0100
Change: 2025-01-20 23:08:29.186676971 +0100
Birth: 2025-01-20 23:08:28.636676957 +0100

plus some details of its internal structure:

Variable Shape Chunks Cache Elements Preemption Type Scale Offset Compression Level Shuffling Read Time
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
lat 232 contiguous 1048576 521 0.75 float32 - - 0 False -
lon 238 contiguous 1048576 521 0.75 float32 - - 0 False -
record_status 48 48 16777216 1000 0.75 int8 - - 0 False -
lat_bnds 232 x 2 contiguous 1048576 521 0.75 float32 - - 0 False -
lon_bnds 238 x 2 contiguous 1048576 521 0.75 float32 - - 0 False -
SIS 48 x 232 x 238 48 x 64 x 64 16777216 1000 0.75 int16 - - 0 False 0.014
time 48 48 16777216 1000 0.75 float64 - - 0 False -
File size: 6328605 bytes, Dimensions: time: 48, lon: 238, bnds: 2, lat: 232
* Cache: Size in bytes, Number of elements, Preemption ranging in [0, 1]

yet

In [8]: one_netcdf_file = xarray.open_dataset('SISin200001010000004231000101MA_Italia_48_64_64_zlib_0.nc')
In [9]: one_netcdf_file.SIS
Out[9]:
<xarray.DataArray 'SIS' (time: 48, lat: 232, lon: 238)> Size: 11MB
[2650368 values with dtype=float32]
Coordinates:
* time (time) datetime64[ns] 384B 2000-01-01 ... 2000-01-01T23:30:00
* lon (lon) float32 952B 6.675 6.725 6.775 6.825 ... 18.42 18.48 18.52
* lat (lat) float32 928B 35.53 35.58 35.62 35.67 ... 46.97 47.03 47.08
Attributes:
standard_name: surface_downwelling_shortwave_flux_in_air
long_name: Surface Downwelling Shortwave Radiation
units: W m-2
cell_methods: time: point
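The same arithmetic seems to explain the per-file numbers: 48 × 232 × 238 float32 values are about 10.6 MB (xarray's "Size: 11MB"), while the stored int16 payload is about 5.3 MB, close to the 6328605 bytes reported by stat once coordinates, bounds and HDF5 overhead are added. A small sketch of that calculation; the 366-file total assumes all files have roughly the size of this one:

```python
# Same arithmetic for the single NetCDF file above; the 366-file total assumes
# all files have roughly the size reported by stat for this one.
import math

shape = (48, 232, 238)                    # (time, lat, lon) from Out[9]
n_values = math.prod(shape)               # 2_650_368 values

print(f"{n_values * 4 / 1e6:.1f} MB")     # ~10.6 MB as float32 -> xarray's "Size: 11MB"
print(f"{n_values * 2 / 1e6:.1f} MB")     # ~5.3 MB as int16; stat shows 6_328_605 bytes
                                          # once coordinates, bounds and metadata are added
print(f"{366 * 6_328_605 / 1e9:.2f} GB")  # ~2.32 GB, i.e. roughly the 2.2G (GiB) from du
```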
Answering my own question: xarray.open_dataset sets mask_and_scale=True by default, which casts the data from int16 to float32 and therefore reports a bigger size. Setting mask_and_scale=False prevents this "automatic" data type cast.
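For completeness, a minimal sketch of how to see this on one of the files above; mask_and_scale is a regular xarray.open_dataset keyword, and the path is the file from this thread:

```python
# Open the same file with and without mask_and_scale and compare dtype and nbytes.
import xarray as xr

path = "SISin200001010000004231000101MA_Italia_48_64_64_zlib_0.nc"

decoded = xr.open_dataset(path)                    # mask_and_scale=True by default
raw = xr.open_dataset(path, mask_and_scale=False)  # keep the on-disk int16 values

print(decoded.SIS.dtype, decoded.SIS.nbytes)       # float32, ~10.6 MB
print(raw.SIS.dtype, raw.SIS.nbytes)               # int16,   ~5.3 MB
```

The trade-off is that with mask_and_scale=False the -999 fill values from the encoding stay in the data instead of being decoded to NaN.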