Description
When using MultiZarrToZarr to combine multiple (time, y, x) NetCDF files along the time dimension, the resulting Zarr store inherits the x and y coordinate arrays exclusively from the first input file. If subsequent input files cover different (but overlapping) spatial extents — e.g., different bounding boxes on the same grid at the same resolution — their data appears to be written at its original chunk offsets without being remapped to the correct coordinate positions. As a result, data from every file except the first is spatially displaced.
All 3D variables in the input NetCDF files have identical (1, 512, 512) chunking along the (time, y, x) dimensions.
Is there a way to merge the data and consolidate x and y extents across such input files?
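For context on the displacement: if the granules really do sit on a common grid at the same resolution, each file's chunks have a well-defined index offset within the union of the x/y extents, which is the remapping the combined references would need. A minimal numpy sketch of that offset arithmetic — the 120 m resolution and the two x-origins below are hypothetical values for illustration, not read from the actual granules:

```python
import numpy as np

# Hypothetical values: two granules on the same 120 m grid with
# different x-origins (metres); not read from the real files.
dx = 120.0
x0_a, x0_b = -150000.0, -149400.0
nx = 512  # samples per file along x

# Union grid covering both files' x extents
x_union_start = min(x0_a, x0_b)
x_union_stop = max(x0_a, x0_b) + nx * dx
x_union = np.arange(x_union_start, x_union_stop, dx)

# Index offset of each file within the union grid — the shift that
# would place its chunks at the correct coordinate positions
offset_a = int(round((x0_a - x_union_start) / dx))  # 0
offset_b = int(round((x0_b - x_union_start) / dx))  # 5
```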
Minimal reproducible example:
```python
import fsspec
import s3fs
import os
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
import xarray as xr

s3 = s3fs.S3FileSystem()
bucket = 's3://its-live-data'
granules = [
    "velocity_image_pair/landsatOLI/v02/S80W170/LC08_L1GT_020121_20231013_20231102_02_T2_X_LC09_L1GT_020121_20231106_20231106_02_T2_G0120V02_P084.nc",
    "velocity_image_pair/landsatOLI/v02/S80W170/LC08_L1GT_020120_20201121_20210315_02_T2_X_LC08_L1GT_020120_20210124_20210305_02_T2_G0120V02_P051.nc",
]
so = dict(
    anon=True, default_fill_cache=False, default_cache_type='first'
)

# Read input granules in
singles = []
for each_granule in granules:
    s3_path = os.path.join(bucket, each_granule)
    print(f'Reading {s3_path}...')
    with fsspec.open(s3_path, **so) as inf:
        h5chunks = SingleHdf5ToZarr(inf, s3_path, inline_threshold=300)
        singles.append(h5chunks.translate())

# Combine granules
mzz = MultiZarrToZarr(
    singles,
    remote_protocol="s3",
    remote_options={'anon': True},
    concat_dims=["time"],
    identical_dims=['mapping'],
)
out = mzz.translate()

# Open merged cube as xarray.Dataset for exploration (plotting)
fs = fsspec.filesystem(
    "reference",
    fo=out,  # or pass the reference dict directly
    remote_protocol="s3",
    remote_options={
        "anon": True,
        "asynchronous": False,  # must match the ReferenceFileSystem's mode
    },
    skip_instance_cache=True,
)
mapper = fs.get_mapper("")
ds = xr.open_zarr(mapper, consolidated=False, decode_timedelta=False, chunks={})
```
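One possible workaround at the xarray level (a sketch, not a confirmed kerchunk feature): since MultiZarrToZarr has no mechanism for outer-joining the non-concat dimensions, each granule (or each single-file reference set) could be opened separately and concatenated with `join="outer"`, which pads each time slice with NaN where it has no data instead of inheriting the first file's coordinates. The toy in-memory datasets below stand in for two granules with shifted but overlapping x extents:

```python
import numpy as np
import xarray as xr

dx = 120.0
# Toy stand-ins for two granules on the same grid with shifted x extents
x_a = np.arange(0, 10) * dx
x_b = np.arange(5, 15) * dx
ds_a = xr.Dataset({"v": (("time", "x"), np.full((1, 10), 1.0))},
                  coords={"time": [0], "x": x_a})
ds_b = xr.Dataset({"v": (("time", "x"), np.full((1, 10), 2.0))},
                  coords={"time": [1], "x": x_b})

# Outer join along x keeps both extents; each slice is NaN-padded
# where that granule has no coverage
cube = xr.concat([ds_a, ds_b], dim="time", join="outer")
print(cube.sizes)  # x spans the union of both extents
```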
This is the (time, y, x) `v` data variable from the original input data (please note the x extents):

[figure: `v` plotted over the granule's native x/y extent]

And the same input data as it appears in the virtual cube, displaced to the x extent of the original first input file:

[figure: `v` in the merged cube, shifted to the first file's x extent]

Examining the x and y chunk references shows that both point only at the first input file:
```python
for key in out['refs'].keys():
    if (key.startswith('x/') or key.startswith('y/')) and "/." not in key:
        print(f'{key=}: {out["refs"][key]}')
```
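The same check can be run against each entry of `singles` before combining: if the per-file reference sets point `x/0` at different files or byte ranges while the combined output carries only the first, that confirms the coordinates are dropped rather than merged. A sketch with hypothetical refs dicts shaped like kerchunk output (chunk key → `[url, offset, length]`); the real dicts come from `SingleHdf5ToZarr(...).translate()`:

```python
# Hypothetical per-file reference sets for illustration only
singles = [
    {"refs": {"x/0": ["s3://its-live-data/granule_a.nc", 4000, 4096]}},
    {"refs": {"x/0": ["s3://its-live-data/granule_b.nc", 4000, 4096]}},
]

# Distinct x references across inputs mean a plain time-concat cannot
# be spatially correct: only one of them survives in the combined refs
x_refs = [s["refs"]["x/0"] for s in singles]
distinct = {tuple(r) for r in x_refs}
print(len(distinct))  # 2 → the inputs carry different x coordinate chunks
```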