On the size of an xarray.DataArray loaded via a Parquet store vs. Raw Data Files #539
In #345 (comment) I reported that the total size of thousands of single Parquet stores was 147.3 MB, which is way smaller than the 1.5 GB size of the one aggregate Parquet store combining all of them! @martindurant explains this in #345 (comment).
Today I have what I think is a similar question (see also the Details below). I observe a difference between the reported size of an xarray.DataArray and the total size of the 366 raw NetCDF files it is built from. Specifically, I am trying to understand why these sizes differ:
Details

❯ ls -1 *0.nc |wc -l
366
❯ du -sch *0.nc | tail -1
2.2G total
❯ du -sch SISin2000_Italia_48_64_64_zlib_0_combined.parquet
44K SISin2000_Italia_48_64_64_zlib_0_combined.parquet
44K total

and

In [1]: import xarray
In [2]: index = xarray.open_dataset('SISin2000_Italia_48_64_64_zlib_0_combined.parquet')
In [3]: index.SIS
Out[3]:
<xarray.DataArray 'SIS' (time: 17568, lat: 232, lon: 238)> Size: 4GB
[970034688 values with dtype=float32]
Coordinates:
* lat (lat) float32 928B 35.53 35.58 35.62 35.67 ... 46.97 47.03 47.08
* lon (lon) float32 952B 6.675 6.725 6.775 6.825 ... 18.42 18.48 18.52
* time (time) datetime64[ns] 141kB 2000-01-01 ... 2000-12-31T23:30:00
Attributes:
cell_methods: time: point
long_name: Surface Downwelling Shortwave Radiation
standard_name: surface_downwelling_shortwave_flux_in_air
units: W m-2
In [4]: index.SIS.encoding
Out[4]:
{'chunks': (48, 64, 64),
'preferred_chunks': {'time': 48, 'lat': 64, 'lon': 64},
'compressor': None,
'filters': None,
'missing_value': -999,
'_FillValue': np.int16(-999),
'dtype': dtype('int16')}
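For context, the "Size: 4GB" in that repr is just the nominal in-memory footprint of the decoded array (number of values times 4 bytes for float32), not anything read from disk. A back-of-the-envelope check, using only the shape and dtypes shown in Out[3] and Out[4] above:

```python
# Back-of-the-envelope check of the sizes reported above, using only the
# shape and dtypes shown in the xarray repr and in the encoding dict.
import math
import numpy as np

shape = (17568, 232, 238)                 # (time, lat, lon) of the SIS variable
n_values = math.prod(shape)               # 970_034_688 values, as in the repr

as_float32 = n_values * np.dtype("float32").itemsize  # after the default decoding
as_int16 = n_values * np.dtype("int16").itemsize      # the raw int16 payload on disk

print(f"{as_float32 / 1e9:.2f} GB")       # ~3.88 GB -> displayed by xarray as "Size: 4GB"
print(f"{as_int16 / 1e9:.2f} GB")         # ~1.94 GB -> much closer to the 2.2G of the NetCDF files
```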
Further details, for a single NetCDF file:

❯ stat SISin200001010000004231000101MA_Italia_48_64_64_zlib_0.nc
File: .../SISin200001010000004231000101MA_Italia_48_64_64_zlib_0.nc
Size: 6328605 Blocks: 12368 IO Block: 4096 regular file
Device: 8,97 Inode: 8791662107 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ nik) Gid: ( 1000/autologin)
Access: 2025-01-20 23:08:49.843344130 +0100
Modify: 2025-01-20 23:08:29.186676971 +0100
Change: 2025-01-20 23:08:29.186676971 +0100
Birth: 2025-01-20 23:08:28.636676957 +0100

plus some details of its internal structure:

Variable Shape Chunks Cache Elements Preemption Type Scale Offset Compression Level Shuffling Read Time
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
lat 232 contiguous 1048576 521 0.75 float32 - - 0 False -
lon 238 contiguous 1048576 521 0.75 float32 - - 0 False -
record_status 48 48 16777216 1000 0.75 int8 - - 0 False -
lat_bnds 232 x 2 contiguous 1048576 521 0.75 float32 - - 0 False -
lon_bnds 238 x 2 contiguous 1048576 521 0.75 float32 - - 0 False -
SIS 48 x 232 x 238 48 x 64 x 64 16777216 1000 0.75 int16 - - 0 False 0.014
time 48 48 16777216 1000 0.75 float64 - - 0 False -
File size: 6328605 bytes, Dimensions: time: 48, lon: 238, bnds: 2, lat: 232
* Cache: Size in bytes, Number of elements, Preemption ranging in [0, 1]

yet

In [8]: one_netcdf_file = xarray.open_dataset('SISin200001010000004231000101MA_Italia_48_64_64_zlib_0.nc')
In [9]: one_netcdf_file.SIS
Out[9]:
<xarray.DataArray 'SIS' (time: 48, lat: 232, lon: 238)> Size: 11MB
[2650368 values with dtype=float32]
Coordinates:
* time (time) datetime64[ns] 384B 2000-01-01 ... 2000-01-01T23:30:00
* lon (lon) float32 952B 6.675 6.725 6.775 6.825 ... 18.42 18.48 18.52
* lat (lat) float32 928B 35.53 35.58 35.62 35.67 ... 46.97 47.03 47.08
Attributes:
standard_name: surface_downwelling_shortwave_flux_in_air
long_name: Surface Downwelling Shortwave Radiation
units: W m-2
cell_methods: time: point
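The same arithmetic seems to explain the per-file numbers: 48 × 232 × 238 float32 values are about 10.6 MB (xarray's "Size: 11MB"), while the stored int16 payload is about 5.3 MB, close to the 6328605 bytes reported by stat once coordinates, bounds and HDF5 overhead are added. A small sketch of that calculation; the 366-file total assumes all files have roughly the size of this one:

```python
# Same arithmetic for the single NetCDF file above; the 366-file total assumes
# all files have roughly the size reported by stat for this one.
import math

shape = (48, 232, 238)                    # (time, lat, lon) from Out[9]
n_values = math.prod(shape)               # 2_650_368 values

print(f"{n_values * 4 / 1e6:.1f} MB")     # ~10.6 MB as float32 -> xarray's "Size: 11MB"
print(f"{n_values * 2 / 1e6:.1f} MB")     # ~5.3 MB as int16; stat shows 6_328_605 bytes
                                          # once coordinates, bounds and metadata are added
print(f"{366 * 6_328_605 / 1e9:.2f} GB")  # ~2.32 GB, i.e. roughly the 2.2G (GiB) from du
```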
Answering my own question: xarray.open_dataset sets mask_and_scale=True by default, which casts the data from int16 to float32 and therefore reports a bigger size. Setting mask_and_scale=False prevents this "automatic" data type cast.
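For completeness, a minimal sketch of how to see this on one of the files above; mask_and_scale is a regular xarray.open_dataset keyword, and the path is the file from this thread:

```python
# Open the same file with and without mask_and_scale and compare dtype and nbytes.
import xarray as xr

path = "SISin200001010000004231000101MA_Italia_48_64_64_zlib_0.nc"

decoded = xr.open_dataset(path)                    # mask_and_scale=True by default
raw = xr.open_dataset(path, mask_and_scale=False)  # keep the on-disk int16 values

print(decoded.SIS.dtype, decoded.SIS.nbytes)       # float32, ~10.6 MB
print(raw.SIS.dtype, raw.SIS.nbytes)               # int16,   ~5.3 MB
```

The trade-off is that with mask_and_scale=False the -999 fill values from the encoding stay in the data instead of being decoded to NaN.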