ds.to_zarr() fails when trying to write a very large zarr of dtype='object', even with compute=False #10583

@AgedSage

What is your issue?

When creating an empty Zarr store that contains an object-dtype variable, I got a MemoryError despite setting compute=False.

What did I expect to happen?
I expected to save the empty zarr to disk.
My use case is that I am trying to create a Zarr dataset that will be populated with variable-length UTF-8 strings. I know these were not previously supported by Zarr, but they now are. The issue, I think, can be explained by this comment in zarr.py:

File ~/miniconda3/lib/python3.13/site-packages/xarray/backends/zarr.py:531, in encode_zarr_variable(var, needs_copy, name)
    510 """
    511 Converts an Variable into an Variable which follows some
    512 of the CF conventions:
   (...)    527     A variable which has been encoded as described above.
    528 """
    530 var = conventions.encode_cf_variable(var, name=name)
--> 531 var = ensure_dtype_not_object(var, name=name)
    533 # zarr allows unicode, but not variable-length strings, so it's both
    534 # simpler and more compact to always encode as UTF-8 explicitly.
    535 # TODO: allow toggling this explicitly via dtype in encoding.
    536 # TODO: revisit this now that Zarr _does_ allow variable-length strings
    537 coder = coding.strings.EncodedStringCoder(allows_unicode=True)

MCVE:

import dask.array
import numpy as np
import xarray as xr

dummies = dask.array.zeros((5000, 100, 2000, 50), chunks=(10, 10, 500, 50), dtype=np.dtypes.StringDType)
ds = xr.Dataset({"foo": (["x", "y", "z", "alpha"], dummies)}, coords={"x": np.arange(5000), "y": np.arange(100), "z": np.arange(2000), "alpha": np.arange(50)})
bigZarr = xr.merge([ds, dsf])  # NOTE: dsf is not defined anywhere in this report
bigZarr.to_zarr('myzarr.zarr', compute=False, consolidated=False)
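
For scale, here is a rough sketch of how large the array in the MCVE would be if it were ever loaded into memory. It uses only the shape and chunks from the example above. The 8 bytes per element is an assumption (one 64-bit reference per element of an object array), not a measured figure, and it ignores the string payloads themselves.

```python
import math

# Shape and chunking copied from the MCVE.
shape = (5000, 100, 2000, 50)
chunks = (10, 10, 500, 50)

# Total number of elements in the array.
n_elements = math.prod(shape)

# Number of dask chunks: ceil(dim / chunk) along each axis, multiplied together.
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Assumption: 8 bytes per element for the reference alone (64-bit platform).
bytes_per_reference = 8
pointer_array_bytes = n_elements * bytes_per_reference

print(f"elements: {n_elements:,}")
print(f"chunks:   {n_chunks:,}")
print(f"references alone: {pointer_array_bytes / 1e9:,.0f} GB")
```

Under that assumption the references alone come to roughly 400 GB, so a MemoryError would be the expected outcome if any step of the write materializes the whole array instead of keeping it lazy.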
  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.

  • Complete example — the example is self-contained, including all data and the text of any traceback.

  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.

  • New issue — a search of GitHub Issues suggests this is not a duplicate.

  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Also, in the meantime, if anyone has a recommendation for making my project work despite this limitation, I would be keen to hear it. Much appreciated.

Labels: needs triage