Schema inconsistency across backends - some use refs as the top-level key, some don't #561

@TomNicholas

(Raising here something we found a while ago; the original discussion is in zarr-developers/VirtualiZarr#160 (comment).)

tl;dr: There's a schema inconsistency across kerchunk backends - some use refs as the top-level key, some don't.

The output of kerchunk.tiff.tiff_to_zarr(url) looks like

{
  '.zgroup': '{\n "zarr_format": 2\n}',
  '.zattrs': '{"multiscales":[{"datasets":[{"path":"0"},{"path":"1"},{"path":"2"}],"metadata":{},"name":"","version":"0.1"}],"OVR_RESAMPLING_ALG":"NEAREST","LAYOUT":"IFDS_BEFORE_DATA","BLOCK_ORDER":"ROW_MAJOR","BLOCK_LEADER":"SIZE_AS_UINT4","BLOCK_TRAILER":"LAST_4_BYTES_REPEATED","KNOWN_INCOMPATIBLE_EDITION":"NO","KeyDirectoryVersion":1,"KeyRevision":1,"KeyRevisionMinor":0,"GTModelTypeGeoKey":1,"GTRasterTypeGeoKey":1,"GTCitationGeoKey":"Albers","GeographicTypeGeoKey":4326,"GeogCitationGeoKey":"WGS 84","GeogAngularUnitsGeoKey":9102,"GeogSemiMajorAxisGeoKey":6378140.0,"GeogInvFlatteningGeoKey":298.256999999996,"ProjectedCSTypeGeoKey":32767,"ProjectionGeoKey":32767,"ProjCoordTransGeoKey":11,"ProjLinearUnitsGeoKey":9001,"ProjStdParallel1GeoKey":29.5,"ProjStdParallel2GeoKey":45.5,"ProjNatOriginLongGeoKey":-96.0,"ProjNatOriginLatGeoKey":23.0,"ProjFalseEastingGeoKey":0.0,"ProjFalseNorthingGeoKey":0.0,"ModelPixelScale":[30.0,30.0,0.0],"ModelTiepoint":[0.0,0.0,0.0,-1801185.0,2700405.0,0.0]}',
  '0/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n  "Y",\n  "X"\n ]\n}',
  '0/.zarray': '{\n "chunks": [\n  512,\n  512\n ],\n "compressor": {\n  "id": "zlib"\n },\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n  2048,\n  2048\n ],\n "zarr_format": 2\n}',
  ...,
}

It looks like this is not the same structure that e.g. kerchunk.hdf.SingleHdf5ToZarr returns.

What virtualizarr expects (and what the kerchunk docs promise...) is that the keys of the outermost dictionary are 'refs' and 'version'. This kerchunk.tiff.tiff_to_zarr(url) function seems to have jumped straight to giving us the contents that would normally be underneath the 'refs' key.

This is an inconsistency in the schema, and an example of kerchunk not obeying its own specification. It also seems to provide no benefit as far as I can tell.

In VirtualiZarr we simply worked around it by special-casing tiffs to add that top-level {'refs': ...} ourselves (so this is not at all urgent for us, I'm just raising this for completeness), but in theory it should really be fixed here. It would be a breaking change for kerchunk though.
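
For anyone hitting the same thing, here is a minimal sketch of that kind of workaround. It is not the actual VirtualiZarr code: `ensure_refs_wrapper` is a hypothetical helper name, the input is a hand-written stand-in for the tiff output rather than a real `tiff_to_zarr` call, and the `version` value of 1 is an assumption about what the wrapped layout should carry.

```python
from typing import Any


def ensure_refs_wrapper(refs: dict[str, Any], version: int = 1) -> dict[str, Any]:
    """Return a reference dict whose outermost keys are 'version' and 'refs'.

    Hypothetical helper for illustration; not part of kerchunk or VirtualiZarr.
    """
    # Already wrapped: the top-level 'refs' entry holds the mapping of
    # zarr keys to references, so return the input unchanged.
    if isinstance(refs.get("refs"), dict):
        return refs
    # Flat layout (as in the tiff output above): the dict itself is the
    # mapping, so nest a shallow copy of it under 'refs'.
    return {"version": version, "refs": dict(refs)}


# Hand-written stand-in for a flat reference dict (assumed shape).
flat = {
    ".zgroup": '{"zarr_format": 2}',
    "0/.zarray": '{"shape": [2048, 2048], "zarr_format": 2}',
}

wrapped = ensure_refs_wrapper(flat)
```

Passing an already-wrapped dict through the helper returns it as-is, so it could be applied to the output of any backend without special-casing by file type.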
