Struggling with KerchunkParquetParser() #862
I'm trying to read a kerchunk-generated parquet file into Vzarr using KerchunkParquetParser(). I think this is different from #466. I'm trying this reproducible workflow:
import obstore
from virtualizarr import open_virtual_dataset
from virtualizarr.registry import ObjectStoreRegistry
from virtualizarr.parsers import KerchunkParquetParser
import fsspec
import os
bucket = "s3://umassd-fvcom"
region = 'us-east-1'
path = "gom3/hindcast/parquet/combined.parq"
# download parquet file for speed of parsing
local_parq = "combined.parq"
fs = fsspec.filesystem('s3', anon=True)
_ = fs.get(f"{bucket}/{path}", local_parq, recursive=True)
parser = KerchunkParquetParser()
store = obstore.store.from_url(bucket, region=region, skip_signature=True)
registry = ObjectStoreRegistry({bucket: store})
vds = open_virtual_dataset(
    url=local_parq,
    parser=KerchunkParquetParser(),
    registry=registry,
    loadable_variables=[],
)
which produces the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 20
17 store = obstore.store.from_url(bucket, region=region, skip_signature=True)
18 registry = ObjectStoreRegistry({bucket: store})
---> 20 vds = open_virtual_dataset(
21 url=local_parq,
22 parser=KerchunkParquetParser(),
23 registry=registry,
24 loadable_variables=[],
25 )
File /srv/conda/envs/notebook/lib/python3.13/site-packages/virtualizarr/xarray.py:88, in open_virtual_dataset(url, registry, parser, drop_variables, loadable_variables, decode_times)
46 """
47 Open an archival data source as an [xarray.Dataset][] wrapping virtualized zarr arrays.
48
(...) 84 in `loadable_variables` and normal lazily indexed arrays for each variable in `loadable_variables`.
85 """
86 filepath = validate_and_normalize_path_to_uri(url, fs_root=Path.cwd().as_uri())
---> 88 manifest_store = parser(
89 url=filepath,
90 registry=registry,
91 )
93 ds = manifest_store.to_virtual_dataset(
94 loadable_variables=loadable_variables,
95 decode_times=decode_times,
96 )
97 return ds.drop_vars(list(drop_variables or ()))
File /srv/conda/envs/notebook/lib/python3.13/site-packages/virtualizarr/parsers/kerchunk/parquet.py:92, in KerchunkParquetParser.__call__(self, url, registry)
89 full_reference = {"refs": array_refs}
90 refs = KerchunkStoreRefs(full_reference)
---> 92 manifestgroup = manifestgroup_from_kerchunk_refs(
93 refs,
94 group=self.group,
95 fs_root=self.fs_root,
96 skip_variables=self.skip_variables,
97 )
99 return ManifestStore(group=manifestgroup, registry=registry)
File /srv/conda/envs/notebook/lib/python3.13/site-packages/virtualizarr/parsers/kerchunk/translator.py:117, in manifestgroup_from_kerchunk_refs(refs, group, fs_root, skip_variables)
113 arr_names = [var for var in arr_names if var not in skip_variables]
115 # TODO support iterating over multiple nested groups
116 marrs = {
--> 117 arr_name: manifestarray_from_kerchunk_refs(refs, arr_name, fs_root=fs_root)
118 for arr_name in arr_names
119 }
121 # TODO probably need to parse the group-level attributes more here
122 attributes = fully_decode_arr_refs(refs["refs"]).get(".zattrs", {})
File /srv/conda/envs/notebook/lib/python3.13/site-packages/virtualizarr/parsers/kerchunk/translator.py:180, in manifestarray_from_kerchunk_refs(refs, var_name, fs_root)
178 # we want to remove the _ARRAY_DIMENSIONS from the final variables' .attrs
179 if chunk_dict:
--> 180 manifest = manifest_from_kerchunk_chunk_dict(chunk_dict, fs_root=fs_root)
181 marr = ManifestArray(metadata=metadata, chunkmanifest=manifest)
182 elif len(metadata.shape) != 0:
183 # empty variables don't have physical chunks, but zarray shows that the variable
184 # is at least 1D
File /srv/conda/envs/notebook/lib/python3.13/site-packages/virtualizarr/parsers/kerchunk/translator.py:216, in manifest_from_kerchunk_chunk_dict(kerchunk_chunk_dict, fs_root)
214 elif not isinstance(v, (tuple, list)):
215 raise TypeError(f"Unexpected type {type(v)} for chunk value: {v}")
--> 216 chunk_entries[k] = chunkentry_from_kerchunk(v, fs_root=fs_root)
217 return ChunkManifest(entries=chunk_entries)
File /srv/conda/envs/notebook/lib/python3.13/site-packages/virtualizarr/parsers/kerchunk/translator.py:230, in chunkentry_from_kerchunk(path_and_byte_range_info, fs_root)
228 path = path_and_byte_range_info[0]
229 offset = 0
--> 230 length = UPath(path).stat().st_size
231 else:
232 path, offset, length = path_and_byte_range_info
File /srv/conda/envs/notebook/lib/python3.13/site-packages/upath/core.py:153, in _UPathMeta.__call__(cls, *args, **kwargs)
150 # We do this call manually, because cls could be a registered
151 # subclass of UPath that is not directly inheriting from UPath.
152 inst = cls.__new__(cls, *args, **kwargs)
--> 153 inst.__init__(*args, **kwargs) # type: ignore[misc]
154 return inst
File /srv/conda/envs/notebook/lib/python3.13/site-packages/upath/implementations/local.py:159, in LocalPath.__init__(self, protocol, chain_parser, *args, **storage_options)
152 def __init__(
153 self,
154 *args,
(...) 157 **storage_options: Any,
158 ) -> None:
--> 159 super(_UPathMixin, self).__init__(*args)
160 self._chain = Chain(ChainSegment(str(self), "", storage_options))
161 self._chain_parser = chain_parser
File /srv/conda/envs/notebook/lib/python3.13/pathlib/_local.py:503, in Path.__init__(self, *args, **kwargs)
500 msg = ("support for supplying keyword arguments to pathlib.PurePath "
501 "is deprecated and scheduled for removal in Python {remove}")
502 warnings._deprecated("pathlib.PurePath(**kwargs)", msg, remove=(3, 14))
--> 503 super().__init__(*args)
File /srv/conda/envs/notebook/lib/python3.13/pathlib/_local.py:132, in PurePath.__init__(self, *args)
130 path = arg
131 if not isinstance(path, str):
--> 132 raise TypeError(
133 "argument should be a str or an os.PathLike "
134 "object where __fspath__ returns a str, "
135 f"not {type(path).__name__!r}")
136 paths.append(path)
137 # Avoid calling super().__init__, as an optimisation
TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'float'
Replies: 1 comment 2 replies
There are multiple issues going on here.
Firstly, the fsspec referencemapper does not return directories in a consistent order, so the variables in the kerchunk parquet "file" (it's actually a directory of parquet files) get processed in a different order each time I run the test script. That means I get different behaviour and different errors each time (because there are apparently other problems in addition to the error you highlighted).
Sometimes when I run this script I get the error you do. That one is because apparently the kerchunk parquet references spec can use a value of `nan` for `path` (which is normally a string), as a way of saying "this zarr chunk was not initialized". #864 s…
Other times when I run this I get
  File "/Users/tom-em/Documents/Code/VirtualiZarr/virtualizarr/codecs.py", line 30, in zarr_codec_config_to_v3
    if num_codec["id"].startswith("numcodecs."):
       ~~~~~~~~~^^^^^^
TypeError: string indices must be integers, not 'str'
This one is because instead of specifying the whole codec config properly, sometimes kerchunk seems to have just saved the literal string "checksum" instead of a dictionary. Due to this, even with the other fix, this …
Finally, reading the kerchunk parquet spec again made me realize that there is another kerchunk behaviour that we don't explicitly test that we can handle: inlined refs. We can't support it (yet) but #865 at least tests that we forbid it with a clear error.
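To make the `nan`-path case concrete, here is a rough sketch of how chunk references could be triaged before any path handling happens. It is not the VirtualiZarr implementation: `classify_chunk_ref` is a hypothetical helper, and the example keys, URLs, and the exact in-memory shape of a "missing" entry (a list whose first element is a float `nan`) are assumptions based on the traceback above.

```python
import math


def classify_chunk_ref(value):
    """Classify one kerchunk-style chunk reference (hypothetical helper).

    Assumed shapes: a str/bytes value is inlined data, a 1-element list is a
    whole-file reference, and a 3-element list is (path, offset, length).
    """
    if isinstance(value, (str, bytes)):
        return "inlined"
    if not isinstance(value, (list, tuple)) or len(value) not in (1, 3):
        raise TypeError(f"Unexpected chunk reference: {value!r}")

    path = value[0]
    # A float nan in the path slot is treated as "chunk was never written".
    if isinstance(path, float) and math.isnan(path):
        return "uninitialized"
    if not isinstance(path, str):
        raise TypeError(f"Unexpected path type {type(path).__name__}: {path!r}")
    return "whole_file" if len(value) == 1 else "byte_range"


# Made-up references, purely for illustration.
example_refs = {
    "0.0": ["s3://example-bucket/file.nc", 100, 200],
    "0.1": [float("nan"), 0, 0],
    "0.2": [float("nan")],
    "0.3": ["s3://example-bucket/file.nc"],
}
kinds = {key: classify_chunk_ref(ref) for key, ref in example_refs.items()}
print(kinds)
```

Checking for `nan` before constructing any path object is what avoids the `TypeError` from `pathlib` shown in the original traceback; entries classified as "uninitialized" could then simply be left out of the manifest.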