Skip to content

Conversation

@chuckwondo
Copy link
Collaborator

@TomNicholas, here are some changes to help things along a bit.

The tests on your branch were passing because the fits test was skipped due to astropy being missing. But for the same reason, mypy was failing.

This PR fixes this so that the fits test is no longer skipped, and also so that the vendored code is called instead of the code from the kerchunk lib.

Here's what I'm getting at this point with the changes in this PR:

$ pytest --run-network-tests virtualizarr/tests/test_readers/test_fits.py 
__________________________________________________ test_open_hubble_data __________________________________________________

    @requires_kerchunk
    @requires_network
    def test_open_hubble_data():
        # data from https://registry.opendata.aws/hst/
>       vds = open_virtual_dataset(
            "s3://stpubdata/hst/public/f05i/f05i0201m/f05i0201m_a1f.fits",
            reader_options={"storage_options": {"anon": True}},
        )

virtualizarr/tests/test_readers/test_fits.py:14: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
virtualizarr/backend.py:203: in open_virtual_dataset
    vds = backend_cls.open_virtual_dataset(
virtualizarr/readers/fits.py:38: in open_virtual_dataset
    {"refs": process_file(filepath, **(reader_options or {}))}
virtualizarr/vendor/kerchunk/fits.py:164: in process_file
    arr = g.empty(
.venv/lib/python3.11/site-packages/zarr/_compat.py:44: in inner_f
    return f(*args, **kwargs)
.venv/lib/python3.11/site-packages/zarr/core/group.py:2373: in empty
    return Array(self._sync(self._async_group.empty(name=name, shape=shape, **kwargs)))
.venv/lib/python3.11/site-packages/zarr/core/sync.py:185: in _sync
    return sync(
.venv/lib/python3.11/site-packages/zarr/core/sync.py:141: in sync
    raise return_result
.venv/lib/python3.11/site-packages/zarr/core/sync.py:100: in _runner
    return await coro
.venv/lib/python3.11/site-packages/zarr/core/group.py:1476: in empty
    return await async_api.empty(shape=shape, store=self.store_path, path=name, **kwargs)
.venv/lib/python3.11/site-packages/zarr/api/asynchronous.py:947: in empty
    return await create(shape=shape, fill_value=None, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

shape = (17, 589), chunks = None, dtype = '>i4', compressor = None, fill_value = None, order = None
store = StorePath(MemoryStore, 'memory://4778130752'), synchronizer = None, overwrite = False

    async def create(
        ...
    ) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
        ...
>       return await AsyncArray.create(
            store_path,
            shape=shape,
            chunks=chunks,
            dtype=dtype,
            compressor=compressor,
            fill_value=fill_value,
            exists_ok=overwrite,
            filters=filters,
            dimension_separator=dimension_separator,
            zarr_format=zarr_format,
            chunk_shape=chunk_shape,
            chunk_key_encoding=chunk_key_encoding,
            codecs=codecs,
            dimension_names=dimension_names,
            attributes=attributes,
            order=order,
            **kwargs,
        )
E       TypeError: AsyncArray.create() got an unexpected keyword argument 'compression'

.venv/lib/python3.11/site-packages/zarr/api/asynchronous.py:907: TypeError
================================================= short test summary info =================================================
FAILED virtualizarr/tests/test_readers/test_fits.py::test_open_hubble_data - TypeError: AsyncArray.create() got an unexpected keyword argument 'compression'
==================================================== 1 failed in 1.50s ====================================================

When I comment out the compression arg in the following call within the vendored process_file function, I get past the error above:

            arr = g.empty(
                name=name,
                dtype=dtype,
                shape=shape,
                chunks=shape,
                compression=None,
                **kwargs,
            )

Then I hit the following error:

cls = <enum 'DataType'>, dtype = dtype('>i4')

    @classmethod
    def from_numpy(cls, dtype: np.dtype[Any]) -> DataType:
        if dtype.kind in "UT":
            return DataType.string
        elif dtype.kind == "S":
            return DataType.bytes
        dtype_to_data_type = {
            "|b1": "bool",
            "bool": "bool",
            "|i1": "int8",
            "<i2": "int16",
            "<i4": "int32",
            "<i8": "int64",
            "|u1": "uint8",
            "<u2": "uint16",
            "<u4": "uint32",
            "<u8": "uint64",
            "<f2": "float16",
            "<f4": "float32",
            "<f8": "float64",
            "<c8": "complex64",
            "<c16": "complex128",
        }
>       return DataType[dtype_to_data_type[dtype.str]]
E       KeyError: '>i4'

.venv/lib/python3.11/site-packages/zarr/core/metadata/v3.py:626: KeyError

When I change all > to < in the following dict within the vendored fits.py file:

BITPIX2DTYPE = {
    8: "uint8",
    16: ">i2",
    32: ">i4",
    64: ">i8",
    -32: "float32",
    -64: "float64",
}  # always bigendian

Then I get past the KeyError, and hit this error:

virtualizarr/translators/kerchunk.py:103: in virtual_vars_from_kerchunk_refs
    var_names = find_var_names(refs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ds_reference_dict = {'refs': {'PRIMARY/0.0': ['s3://stpubdata/hst/public/f05i/f05i0201m/f05i0201m_a1f.fits', 17280, 40052], 'PRIMARY/zarr....<zarr.core.buffer.cpu.Buffer object at 0x11a989050>, 'zarr.json': <zarr.core.buffer.cpu.Buffer object at 0x11a8cd2d0>}}

    def find_var_names(ds_reference_dict: KerchunkStoreRefs) -> list[str]:
        """Find the names of zarr variables in this store/group."""
    
        refs = ds_reference_dict["refs"]
    
        found_var_names = []
        for key in refs.keys():
            # has to capture "foo/.zarray", but ignore ".zgroup", ".zattrs", and "subgroup/bar/.zarray"
            # TODO this might be a sign that we should introduce a KerchunkGroupRefs type and cut down the references before getting to this point...
            if key not in (".zgroup", ".zattrs", ".zmetadata"):
>               first_part, second_part, *_ = key.split("/")
E               ValueError: not enough values to unpack (expected at least 2, got 1)

virtualizarr/translators/kerchunk.py:232: ValueError

This is because the zarr.json key within the refs does not contain a /.

I suspect the changes I had to make to the vendored fits.py file to get past a couple of errors is the wrong thing to do, which is why this PR does not contain those changes.

However, I wanted to at least see how far I could get so I could at least provide this additional information that my help you get this sorted out correctly.

@TomNicholas
Copy link
Member

Thank you @chuckwondo!

Just FYI there is potentially some duplication of effort going on here unfortunately - see fsspec/kerchunk#516 (comment)

@maxrjones
Copy link
Member

Superseded by #406

@maxrjones maxrjones closed this Jan 30, 2025
@chuckwondo chuckwondo deleted the vendor_kerchunk_fits_reader branch April 8, 2025 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants