Closed
Labels: topic-metadata (relating to the handling of metadata, i.e. attrs and encoding)
Description
What is your issue?
I noticed that open_dataset with chunks="auto" fails when netCDF4 variables/coordinates are encoded as NC_STRING.
The reason is that xarray reads netCDF4 NC_STRING as object dtype, and dask cannot estimate the size of an object dtype.
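To illustrate the size-estimation problem with NumPy alone (a minimal sketch, not the code path xarray or dask actually takes): a fixed-width dtype has a per-element size that reflects the string width, while an object dtype only stores references, so its itemsize says nothing about the strings themselves. The array names below are chosen for illustration.

```python
import numpy as np

# The same two strings stored three ways.
obj_arr = np.array(["M6", "M3"], dtype=object)  # how NC_STRING is read back
str_arr = np.array(["M6", "M3"], dtype=str)     # fixed-width Unicode (U2)
bytes_arr = str_arr.astype("S2")                # fixed-width bytes (S2)

# Fixed-width dtypes: bytes per element follow from the dtype alone.
print(str_arr.dtype.itemsize)    # 8 (2 characters x 4 bytes each)
print(bytes_arr.dtype.itemsize)  # 2
print(str_arr.nbytes, bytes_arr.nbytes)  # 16 4

# Object dtype: the itemsize is only the size of a reference to a Python
# object, so the real memory footprint cannot be derived from the dtype.
print(obj_arr.dtype.hasobject)   # True
```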
As a workaround, the user must currently rewrite the netCDF4 file and set the encoding of each string DataArray to a fixed-length string type (e.g. "S2" if the maximum string length is 2), so that the data are written as NC_CHAR and xarray reads them back as a byte-encoded fixed-length string type.
Below is a reproducible example:
import xarray as xr
import numpy as np
# Define a string DataArray
arr = np.array(["M6", "M3"], dtype=str)
print(arr.dtype) # <U2
da = xr.DataArray(data=arr, dims=("time",))
data_vars = {"str_arr": da}
# Create dataset
ds_nc_string = xr.Dataset(data_vars=data_vars)
# Set chunking to see behaviour at read-time
ds_nc_string["str_arr"] = ds_nc_string["str_arr"].chunk(1) # chunks ((1,1),)
# Write dataset with NC_STRING
ds_nc_string["str_arr"].encoding["dtype"] = str
ds_nc_string.to_netcdf("/tmp/nc_string.nc")
# Write dataset with NC_CHAR
ds_nc_char = xr.Dataset(data_vars=data_vars)
ds_nc_char["str_arr"].encoding["dtype"] = "S2"
ds_nc_char.to_netcdf("/tmp/nc_char.nc")
# When strings are saved as NC_STRING, chunks="auto" does not work
# --> NC_STRING is read as object, and dask cannot estimate the chunk size!
# With chunks={} the NC_STRING array is read into a single dask chunk!
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks="auto") # NotImplementedError
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks={}) # Works
ds_nc_string.chunks # chunks (2,)
# With NC_CHAR, chunks={} and chunks="auto" both work and return the same result!
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks={})
ds_nc_char.chunks # chunks (2,)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks="auto")
ds_nc_char.chunks # chunks (2,)
# NC_STRING is read back as object
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks=None)
ds_nc_string["str_arr"].dtype # object
# NC_CHAR is read back as fixed length byte-string representation (S2)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks=None)
ds_nc_char["str_arr"].dtype # S2
ds_nc_char["str_arr"].data.astype(str) # U2
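The "S2" encoding in the workaround is hard-coded for strings of length 2. A minimal sketch of how the width could be derived from the data instead is shown here; fixed_width_bytes_dtype is a hypothetical helper, not part of xarray, and UTF-8 is an assumed encoding.

```python
import numpy as np

def fixed_width_bytes_dtype(values, encoding="utf-8"):
    """Return an 'S<n>' dtype string wide enough for every value.

    Hypothetical helper for illustration only; not an xarray function.
    """
    flat = np.asarray(values, dtype=object).ravel()
    # Measure the encoded byte length, since non-ASCII characters
    # can take more than one byte.
    max_len = max((len(str(v).encode(encoding)) for v in flat), default=1)
    return f"S{max(max_len, 1)}"

obj_arr = np.array(["M6", "M3"], dtype=object)
dtype = fixed_width_bytes_dtype(obj_arr)
print(dtype)  # S2
```

The result could then be assigned to the variable's encoding before writing, in the same way the reproducer sets encoding["dtype"] = "S2".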
Questions:
Shouldn't open_dataset automatically deserialize the NC_CHAR fixed-length byte-string representation into a Unicode string?
Shouldn't open_dataset automatically read NC_STRING as a Unicode string (converting object to str)?
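For reference, both conversions asked about above can be done by hand with NumPy after reading. This is a sketch on in-memory arrays that stand in for the values read from each file, assuming UTF-8 encoded bytes; it does not show what open_dataset does internally.

```python
import numpy as np

# Stand-ins for the values read from the two files in the reproducer.
char_like = np.array([b"M6", b"M3"], dtype="S2")      # NC_CHAR read back as S2
string_like = np.array(["M6", "M3"], dtype=object)    # NC_STRING read back as object

# Decode fixed-width bytes into fixed-width Unicode.
decoded_char = np.char.decode(char_like, "utf-8")

# Cast object strings to a fixed-width Unicode dtype.
decoded_string = string_like.astype(str)

# Both results are now 2-character Unicode arrays with a known itemsize.
print(decoded_char.dtype.kind, decoded_char.dtype.itemsize)      # U 8
print(decoded_string.dtype.kind, decoded_string.dtype.itemsize)  # U 8
```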
Related issues are: