MultiZarrToZarr uses first file's x/y coordinates for all inputs, causing spatial misalignment in concatenated output #584

@mliukis

Description

When using MultiZarrToZarr to combine multiple (time, y, x) NetCDF files along the time dimension, the resulting Zarr store inherits the x and y coordinate arrays exclusively from the first input file. If subsequent input files have different (but overlapping) spatial extents — e.g., different bounding boxes at the same grid resolution — their data seems to be written at the original chunk offsets without remapping to the correct coordinate positions. This causes spatial displacement of data from all files except the first.

All 3D variables in the input NetCDF files have identical (1, 512, 512) chunking along the (time, y, x) dimensions.
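The identical chunking is what makes the misplacement silent: chunk key "0.0" exists in every file, but it covers a different spatial window once the grid origins differ. A toy sketch of that arithmetic (the origins and the 120 m resolution below are hypothetical, for illustration only):

```python
# Illustration only: the same chunk key (here, column index j) maps to a
# different x-window in each file when the grid origins differ.
CHUNK = 512   # chunk size along x, matching the (1, 512, 512) chunking
DX = 120.0    # hypothetical grid resolution in metres

def chunk_window(x_origin, j, size=CHUNK, dx=DX):
    """x-range covered by chunk column j of a file whose grid starts at x_origin."""
    start = x_origin + j * size * dx
    return (start, start + size * dx)

print(chunk_window(0.0, 0))       # file A, chunk column 0 -> (0.0, 61440.0)
print(chunk_window(61_440.0, 0))  # file B, same chunk key -> (61440.0, 122880.0)
```

So copying chunk references by key, without remapping against a common coordinate grid, places file B's chunk "0.0" at file A's coordinates.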

Is there a way to merge the data and consolidate x and y extents across such input files?

Minimal reproducible example:

import fsspec
import s3fs
import os
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
import xarray as xr

s3 = s3fs.S3FileSystem()

bucket = 's3://its-live-data'

granules = [
    "velocity_image_pair/landsatOLI/v02/S80W170/LC08_L1GT_020121_20231013_20231102_02_T2_X_LC09_L1GT_020121_20231106_20231106_02_T2_G0120V02_P084.nc",
    "velocity_image_pair/landsatOLI/v02/S80W170/LC08_L1GT_020120_20201121_20210315_02_T2_X_LC08_L1GT_020120_20210124_20210305_02_T2_G0120V02_P051.nc",
]

so = dict(
    anon=True, default_fill_cache=False, default_cache_type='first'
)

# Read input granules in
singles = []
for each_granule in granules:
    s3_path = os.path.join(bucket, each_granule)
    print(f'Reading {s3_path}...')
    
    with fsspec.open(s3_path, **so) as inf:
        h5chunks = SingleHdf5ToZarr(inf, s3_path, inline_threshold=300)
        singles.append(h5chunks.translate())

# Combine granules
mzz = MultiZarrToZarr(
    singles,
    remote_protocol="s3",
    remote_options={'anon': True},
    concat_dims=["time"],
    identical_dims=['mapping']
)

out = mzz.translate()

# Open merged cube as xarray.Dataset for exploration (plotting)
fs = fsspec.filesystem(
    "reference",
    fo=out,          # the combined reference dict returned by translate()
    remote_protocol="s3",
    remote_options={
        "anon": True,
        "asynchronous": False,    # must match ReferenceFileSystem's mode
    },
    skip_instance_cache=True,
)

mapper = fs.get_mapper("")
ds = xr.open_zarr(mapper, consolidated=False, decode_timedelta=False, chunks={})

This is (time, y, x) v data variable from original input data (please note the x extents):
[screenshot: plot of v over the granule's native x extent]

And the same input data as it appears in the virtual cube with x extent of original first input file:

[screenshot: same data plotted at the first input file's x extent]

Examining x and y chunking shows reference to the first input file:

for key, ref in out['refs'].items():
    if (key.startswith('x/') or key.startswith('y/')) and '/.' not in key:
        print(f'{key=}: {ref}')
[screenshot: printed x/ and y/ chunk references, all pointing at the first granule]
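For comparison, this is the behaviour I would expect: an outer join on the spatial coordinates, which xarray performs when concatenating in-memory datasets. A minimal sketch with synthetic stand-ins for the two granules (coordinate values are made up, same 120 m grid, x extents offset by two cells):

```python
import numpy as np
import xarray as xr

# Two synthetic granules on the same grid but with offset x extents.
y = np.arange(4) * 120.0
x1 = np.arange(0, 5) * 120.0
x2 = np.arange(2, 7) * 120.0

ds1 = xr.Dataset(
    {"v": (("time", "y", "x"), np.ones((1, 4, 5)))},
    coords={"time": [0], "y": y, "x": x1},
)
ds2 = xr.Dataset(
    {"v": (("time", "y", "x"), 2 * np.ones((1, 4, 5)))},
    coords={"time": [1], "y": y, "x": x2},
)

# join="outer" aligns x/y to the union grid, so each time slice lands at
# its true coordinates and cells outside a granule's extent become NaN.
combined = xr.concat([ds1, ds2], dim="time", join="outer")
print(dict(combined.sizes))  # x is the 7-cell union, not the first file's 5
```

As far as I can tell, MultiZarrToZarr instead copies chunk references by key, so it cannot do this sub-chunk remapping on its own, hence the question above about consolidating extents.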
