open_dataset with chunks="auto" fails when a netCDF4 variable/coordinate is encoded as NC_STRING #7868

@ghiggi

Description

What is your issue?

I noticed that open_dataset with chunks="auto" fails when netCDF4 variables/coordinates are encoded as NC_STRING.
The reason is that xarray reads netCDF4 NC_STRING data as object dtype, and dask cannot estimate the size of an object dtype.

As a workaround, the user currently has to rewrite the netCDF4 file and set the encoding of the string DataArray(s) to a fixed-length string dtype (e.g. "S2" if the maximum string length is 2), so that the data are written as NC_CHAR and xarray reads them back as a fixed-length byte-string dtype.

Below is a reproducible example:

import xarray as xr
import numpy as np

# Define string DataArray
arr = np.array(["M6", "M3"], dtype=str)
print(arr.dtype)  # <U2
da = xr.DataArray(data=arr, dims="time")
data_vars = {"str_arr": da}

# Create dataset
ds_nc_string = xr.Dataset(data_vars=data_vars)

# Set chunking to see behaviour at read-time
ds_nc_string["str_arr"] = ds_nc_string["str_arr"].chunk(1)  # chunks ((1,1),)
 
# Write dataset with NC_STRING
ds_nc_string["str_arr"].encoding["dtype"] = str
ds_nc_string.to_netcdf("/tmp/nc_string.nc")

# Write dataset with NC_CHAR
ds_nc_char = xr.Dataset(data_vars=data_vars)
ds_nc_char["str_arr"].encoding["dtype"] = "S2"
ds_nc_char.to_netcdf("/tmp/nc_char.nc")

# With NC_STRING, chunks="auto" does not work:
# --> NC_STRING is read as object dtype, and dask cannot estimate the chunk size!
# With chunks={}, the NC_STRING array is read into a single dask chunk!
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks="auto") # NotImplementedError
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks={})     # Works
ds_nc_string.chunks  # chunks (2,)

# With NC_CHAR, both chunks={} and chunks="auto" work and return the same result!
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks={})   
ds_nc_char.chunks # chunks (2,)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks="auto")
ds_nc_char.chunks # chunks (2,)

# NC_STRING is read back as object 
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks=None)
ds_nc_string["str_arr"].dtype  #  object 

# NC_CHAR is read back as fixed length byte-string representation (S2) 
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks=None)
ds_nc_char["str_arr"].dtype            #  S2 
ds_nc_char["str_arr"].data.astype(str) #  U2 
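The failure can be reproduced with dask alone, without any netCDF file: dask turns the "auto" byte budget into chunk shapes using the dtype's itemsize, which an object dtype does not have. A minimal sketch (assuming only numpy and dask are installed):

```python
import numpy as np
import dask.array as da

fixed = np.array(["M6", "M3"], dtype="S2")  # fixed-width: itemsize is 2 bytes
obj = fixed.astype(object)                  # object dtype: itemsize unknown to dask

# Works: dask can translate the "auto" byte budget into chunk shapes
da.from_array(fixed, chunks="auto")

# Fails: dask cannot size chunks for an object dtype
try:
    da.from_array(obj, chunks="auto")
except NotImplementedError:
    print("auto chunking not supported for object dtype")
```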

Questions:

  • Shouldn't open_dataset automatically decode the NC_CHAR fixed-length byte-string representation into Unicode strings?
  • Shouldn't open_dataset automatically read NC_STRING as Unicode strings (i.e. convert object to str)?
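Until something along those lines is available, one possible in-memory workaround is to open with chunks={} and cast the object-dtype variable to a fixed-width Unicode dtype before rechunking. A sketch (the NC_STRING file is simulated here with an object-dtype variable, so the snippet needs only xarray, numpy, and dask):

```python
import numpy as np
import xarray as xr

# Simulate what open_dataset returns for NC_STRING: an object-dtype variable,
# loaded as a single dask chunk (as with chunks={})
obj = np.array(["M6", "M3"], dtype=object)
ds = xr.Dataset({"str_arr": xr.DataArray(obj, dims="time")}).chunk({})

# Casting to a fixed-width Unicode dtype gives dask a known itemsize,
# so chunks="auto" works afterwards
ds["str_arr"] = ds["str_arr"].astype("U2")
ds = ds.chunk("auto")
print(ds["str_arr"].dtype)  # <U2
```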

Related issues are:

Labels: topic-metadata (Relating to the handling of metadata, i.e. attrs and encoding)