open_dataset with chunks="auto" fails when a netCDF4 variable/coordinate is encoded as NC_STRING #7868

@ghiggi

Description

What is your issue?

I noticed that open_dataset with chunks="auto" fails when netCDF4 variables/coordinates are encoded as NC_STRING.
The reason is that xarray reads netCDF4 NC_STRING data as object dtype, and dask cannot estimate the size of an object dtype.

As a workaround, the user currently has to rewrite the netCDF4 file and set the encoding of the string DataArray(s) to a fixed-length string dtype (e.g. "S2" if the maximum string length is 2), so that the data are written as NC_CHAR and xarray reads them back as a fixed-length byte-string dtype.

Below is a reproducible example:

import xarray as xr
import numpy as np

# Define string DataArray
arr = np.array(["M6", "M3"], dtype=str)
print(arr.dtype)  # <U2
da = xr.DataArray(data=arr, dims="time")
data_vars = {"str_arr": da}

# Create dataset
ds_nc_string = xr.Dataset(data_vars=data_vars)

# Set chunking to see behaviour at read-time
ds_nc_string["str_arr"] = ds_nc_string["str_arr"].chunk(1)  # chunks ((1,1),)
 
# Write dataset with NC_STRING
ds_nc_string["str_arr"].encoding["dtype"] = str
ds_nc_string.to_netcdf("/tmp/nc_string.nc")

# Write dataset with NC_CHAR
ds_nc_char = xr.Dataset(data_vars=data_vars)
ds_nc_char["str_arr"].encoding["dtype"] = "S2"
ds_nc_char.to_netcdf("/tmp/nc_char.nc")

# With NC_STRING, chunks="auto" does not work:
# --> NC_STRING is read as object dtype, and dask cannot estimate the chunk size!
# With chunks={}, the NC_STRING array is read into a single dask chunk!
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks="auto") # NotImplementedError
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks={})     # Works
ds_nc_string.chunks  # chunks (2,)

# With NC_CHAR, both chunks={} and chunks="auto" work and return the same result!
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks={})   
ds_nc_char.chunks # chunks (2,)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks="auto")
ds_nc_char.chunks # chunks (2,)

# NC_STRING is read back as object 
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks=None)
ds_nc_string["str_arr"].dtype  #  object 

# NC_CHAR is read back as fixed length byte-string representation (S2) 
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks=None)
ds_nc_char["str_arr"].dtype            #  S2 
ds_nc_char["str_arr"].data.astype(str) #  U2 
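The failure can be reproduced with dask alone, without any netCDF file: dask turns the "auto" byte budget into chunk shapes using the dtype's itemsize, which an object dtype does not have. A minimal sketch (assuming only numpy and dask are installed):

```python
import numpy as np
import dask.array as da

fixed = np.array(["M6", "M3"], dtype="S2")  # fixed-width: itemsize is 2 bytes
obj = fixed.astype(object)                  # object dtype: itemsize unknown to dask

# Works: dask can translate the "auto" byte budget into chunk shapes
da.from_array(fixed, chunks="auto")

# Fails: dask cannot size chunks for an object dtype
try:
    da.from_array(obj, chunks="auto")
except NotImplementedError:
    print("auto chunking not supported for object dtype")
```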

Questions:

  • Shouldn't open_dataset automatically decode the NC_CHAR fixed-length byte-string representation into Unicode strings?
  • Shouldn't open_dataset automatically read NC_STRING as Unicode strings (i.e. convert object to str)?
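Until something along those lines is available, one possible in-memory workaround is to open with chunks={} and cast the object-dtype variable to a fixed-width Unicode dtype before rechunking. A sketch (the NC_STRING file is simulated here with an object-dtype variable, so the snippet needs only xarray, numpy, and dask):

```python
import numpy as np
import xarray as xr

# Simulate what open_dataset returns for NC_STRING: an object-dtype variable,
# loaded as a single dask chunk (as with chunks={})
obj = np.array(["M6", "M3"], dtype=object)
ds = xr.Dataset({"str_arr": xr.DataArray(obj, dims="time")}).chunk({})

# Casting to a fixed-width Unicode dtype gives dask a known itemsize,
# so chunks="auto" works afterwards
ds["str_arr"] = ds["str_arr"].astype("U2")
ds = ds.chunk("auto")
print(ds["str_arr"].dtype)  # <U2
```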

Related issues are:

Labels: topic-metadata (Relating to the handling of metadata, i.e. attrs and encoding)