Inconsistent behavior between Parquet and JSON when chunks are missing #493

@ashiklom

Taking the first file from here (https://noaa-goes17.s3.amazonaws.com/index.html#ABI-L1b-RadF/2022/001/00/) as an example:

The following code:

import json

import kerchunk.hdf

fname = "OR_ABI-L1b-RadF-M6C01_G17_s20220010000320_e20220010009386_c20220010009424.nc"

# Scan the HDF5/NetCDF4 file and build kerchunk references for its chunks.
h5chunks = kerchunk.hdf.SingleHdf5ToZarr(fname)
refs = h5chunks.translate()

# Write the references to a JSON file.
with open("test.json", "w") as f:
    json.dump(refs, f, indent=2)

Produces the following JSON output (excerpt; slightly clipped):

"..."
"Rad/.zarray": "{\"chunks\":[226,226],\"compressor\":null,\"dtype\":\"<i2\",\"fill_value\":1023, ..."
"Rad/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"y\",\"x\"],\"_Unsigned\":\"true\",\"add_offset\":-25.9"
"Rad/0.16": "base64:eAHt0DENAAAAAqDD/pk1iIwGNMWAAQMGDBgwYMCAAQMGDBgwYMCAAQMGDBgwYMCAAQMGDBgwYMCA..."
"Rad/0.17": [
  "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
  51538,
  1448
],
"Rad/0.18": [
  "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
  52986,
  4155
],
"Rad/0.19": [
  "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
  57141,
  5554
],
"Rad/0.20": [
  "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
  22412,
  7527
],
"..."

Note that the radiance chunks begin at 0.16; there are no Rad/0.0 through Rad/0.15 keys. That's weird, and I'm assuming it is some HDF5 sparse-data cleverness (storage never allocated for those chunks). In any case, xarray.open_dataset("test.json", engine="kerchunk") followed by a reduction over the entire Rad array (dat.Rad.mean().values) works fine here.
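To check which chunk keys are absent without reading the JSON by eye, something like the following could work. It is a minimal sketch: `missing_chunk_keys` and the toy reference set are made up for illustration, and it assumes a flat mapping of keys to references (for kerchunk output that may mean passing the dictionary nested under the "refs" key). Zarr reads an absent chunk key as a chunk filled with fill_value, which would explain why the JSON route works.

```python
import itertools
import json
import math


def missing_chunk_keys(refs, var):
    """Return chunk keys of `var` implied by its .zarray but absent from `refs`."""
    meta = json.loads(refs[f"{var}/.zarray"])
    # Number of chunks along each dimension, rounding up for partial edge chunks.
    grid = [math.ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])]
    expected = (
        f"{var}/" + ".".join(map(str, idx))
        for idx in itertools.product(*(range(n) for n in grid))
    )
    return [key for key in expected if key not in refs]


# Toy reference set (made up): a 4x6 array in 2x2 chunks gives a 2x3 chunk grid,
# and only two of the six chunks are actually stored.
toy_refs = {
    "Rad/.zarray": json.dumps({"shape": [4, 6], "chunks": [2, 2]}),
    "Rad/0.2": ["file.nc", 100, 10],
    "Rad/1.0": ["file.nc", 110, 10],
}

print(missing_chunk_keys(toy_refs, "Rad"))
# ['Rad/0.0', 'Rad/0.1', 'Rad/1.1', 'Rad/1.2']
```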

However, if you write the same references out as a Parquet dataset, the resulting file has rows 0-15 containing NaN paths and 0 values, and the real data start at row 16. That's fine, except that reading that Parquet file fails with an error like this (full backtrace further down):
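One plausible reading of those row numbers: if the Parquet layout stores one row per chunk position in row-major (C) order, then every position needs a row, including chunks that were never written. That layout is an assumption here, not something confirmed in this report. The helper name `chunk_key_to_row` and the grid shape below are hypothetical.

```python
def chunk_key_to_row(key, grid):
    """Flatten a chunk key such as "0.16" into a row-major (C-order) row index."""
    idx = [int(part) for part in key.split(".")]
    if len(idx) != len(grid):
        raise ValueError(f"key {key!r} does not match a {len(grid)}-d chunk grid")
    row = 0
    for i, n in zip(idx, grid):
        if not 0 <= i < n:
            raise IndexError(f"chunk index {i} out of range for grid size {n}")
        # Standard row-major flattening: shift by the size of this dimension.
        row = row * n + i
    return row


GRID = (48, 48)  # hypothetical chunk-grid shape, for illustration only

print(chunk_key_to_row("0.16", GRID))  # 16
print(chunk_key_to_row("0.0", GRID))   # 0
```

Under that assumption, the first stored chunk, Rad/0.16, lands on row 16, and rows 0-15 are placeholders for Rad/0.0 through Rad/0.15.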

  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/mapping.py", line 105, in getitems
    out = self.fs.cat(keys2, on_error=oe)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 836, in cat
    proto_dict = _protocol_groups(path, self.references)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 52, in _protocol_groups
    protocol = _prot_in_references(path, references)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 44, in _prot_in_references
    return split_protocol(ref[0])[0] if ref[0] else ref[0]
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py", line 544, in split_protocol
    if "://" in urlpath:
       ^^^^^^^^^^^^^^^^
TypeError: argument of type 'float' is not iterable

I've traced this back to a references.get(path) call for one of the missing chunks ('Rad/0.0.0' in the session below), which returns a NaN "url" that the subsequent code can't parse. Here are the relevant pdb traces:

> /gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py(544)split_protocol()
-> if "://" in urlpath:
(Pdb) u
> /gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py(44)_prot_in_references()
-> return split_protocol(ref[0])[0] if ref[0] else ref[0]
(Pdb) ll
 41     def _prot_in_references(path, references):
 42         ref = references.get(path)
 43         if isinstance(ref, (list, tuple)):
 44  ->         return split_protocol(ref[0])[0] if ref[0] else ref[0]
(Pdb) p ref
[nan]
(Pdb) p path
'Rad/0.0.0'
(Pdb)
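The reason the `if ref[0]` check in that listing does not help is that NaN is truthy in Python, so the NaN path still reaches the `"://" in urlpath` test. The following is a standalone sketch of that behavior, not fsspec code: `split_protocol_like`, `prot_unguarded`, and `prot_guarded` are hypothetical stand-ins, and the guard is only one possible way to treat a non-string path as a missing chunk.

```python
def split_protocol_like(urlpath):
    """Stand-in for the membership test that fails in the traceback."""
    if "://" in urlpath:
        protocol, path = urlpath.split("://", 1)
        return protocol, path
    return None, urlpath


def prot_unguarded(ref):
    # Same shape as the expression in the pdb listing: NaN is truthy, so the
    # `if ref[0]` check does not skip it.
    return split_protocol_like(ref[0])[0] if ref[0] else ref[0]


def prot_guarded(ref):
    # Hypothetical guard: anything that is not a non-empty string is treated
    # as "no reference" instead of being parsed as a URL.
    path = ref[0]
    if not isinstance(path, str) or not path:
        return None
    return split_protocol_like(path)[0]


nan_ref = [float("nan")]

print(bool(nan_ref[0]))  # True: NaN passes the truthiness check

try:
    prot_unguarded(nan_ref)
except TypeError as exc:
    print("TypeError:", exc)

print(prot_guarded(nan_ref))                          # None
print(prot_guarded(["s3://bucket/file.nc", 0, 10]))  # s3
```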

Full backtrace:

Traceback (most recent call last):
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/read.py", line 6, in <module>
    print(combined_ds.Rad.mean().values)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/_aggregations.py", line 1664, in mean
    return self.reduce(
           ^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/dataarray.py", line 3826, in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/variable.py", line 1663, in reduce
    result = super().reduce(
             ^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/namedarray/core.py", line 912, in reduce
    data = func(self.data, **kwargs)
                ^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/variable.py", line 449, in data
    return self._data.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 837, in get_duck_array
    self._ensure_cached()
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 831, in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 788, in get_duck_array
    return self.array.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 837, in get_duck_array
    self._ensure_cached()
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 831, in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 788, in get_duck_array
    return self.array.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 658, in get_duck_array
    array = array.get_duck_array()
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/coding/variables.py", line 81, in get_duck_array
    return self.func(self.array.get_duck_array())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 651, in get_duck_array
    array = self.array[self.key]
            ~~~~~~~~~~^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/backends/zarr.py", line 104, in __getitem__
    return indexing.explicit_indexing_adapter(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 1015, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/backends/zarr.py", line 94, in _getitem
    return self._array[key]
           ~~~~~~~~~~~^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 798, in __getitem__
    result = self.get_orthogonal_selection(pure_selection, fields=fields)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 1080, in get_orthogonal_selection
    return self._get_selection(indexer=indexer, out=out, fields=fields)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 1343, in _get_selection
    self._chunk_getitems(
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 2179, in _chunk_getitems
    cdatas = self.chunk_store.getitems(ckeys, contexts=contexts)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/storage.py", line 1426, in getitems
    results_transformed = self.map.getitems(list(keys_transformed), on_error="return")
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/mapping.py", line 105, in getitems
    out = self.fs.cat(keys2, on_error=oe)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 836, in cat
    proto_dict = _protocol_groups(path, self.references)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 52, in _protocol_groups
    protocol = _prot_in_references(path, references)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 44, in _prot_in_references
    return split_protocol(ref[0])[0] if ref[0] else ref[0]
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py", line 544, in split_protocol
    if "://" in urlpath:
       ^^^^^^^^^^^^^^^^
TypeError: argument of type 'float' is not iterable
