Description
I was trying to virtualize references and then load some NASA MERRA2 data offered at GESDISC. I noticed that when the dataset is loaded using the references made by virtualizarr, the initial timestep has a NaT value whereas the exact same dataset loaded using references made by kerchunk does not. Here is the data file used: https://data.gesdisc.earthdata.nasa.gov/data/MERRA2/M2T1NXSLV.5.12.4/1980/01/MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4
Digging deeper, the only difference I could find between the two JSONs is the fill_value: kerchunk writes null, whereas virtualizarr sets it to 0:
Kerchunk time zarray: "time/.zarray": "{\"chunks\":[1],\"compressor\":null,\"dtype\":\"<i4\",\"fill_value\":null,\"filters\":[{\"elementsize\":4,\"id\":\"shuffle\"},{\"id\":\"zlib\",\"level\":2}],\"order\":\"C\",\"shape\":[24],\"zarr_format\":2}",
Virtualizarr --> kerchunk time zarray: "time/.zarray": "{\"shape\":[24],\"chunks\":[1],\"dtype\":\"<i4\",\"fill_value\":0,\"order\":\"C\",\"compressor\":null,\"filters\":[{\"elementsize\":4,\"id\":\"shuffle\"},{\"id\":\"zlib\",\"level\":2}],\"zarr_format\":2}",
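For anyone who wants to reproduce the comparison without eyeballing the JSON, here is a minimal sketch using only the standard library. `zarray_diff` is a hypothetical helper (not part of kerchunk or virtualizarr), and it assumes the references are a flat mapping of key to JSON-encoded string, as in the two entries quoted above. If your reference file nests everything under a top-level "refs" key, pass that inner mapping instead.

```python
import json


def zarray_diff(refs_a, refs_b, key="time/.zarray"):
    """Return {field: (value_in_a, value_in_b)} for metadata fields that differ.

    Hypothetical helper for illustration; assumes each .zarray entry is
    stored as a JSON-encoded string, as in the entries quoted in this issue.
    """
    meta_a = json.loads(refs_a[key])
    meta_b = json.loads(refs_b[key])
    return {
        field: (meta_a.get(field), meta_b.get(field))
        for field in sorted(set(meta_a) | set(meta_b))
        if meta_a.get(field) != meta_b.get(field)
    }


# The two time/.zarray entries from this issue, unescaped.
kerchunk_refs = {
    "time/.zarray": '{"chunks":[1],"compressor":null,"dtype":"<i4","fill_value":null,'
    '"filters":[{"elementsize":4,"id":"shuffle"},{"id":"zlib","level":2}],'
    '"order":"C","shape":[24],"zarr_format":2}'
}
virtualizarr_refs = {
    "time/.zarray": '{"shape":[24],"chunks":[1],"dtype":"<i4","fill_value":0,"order":"C",'
    '"compressor":null,"filters":[{"elementsize":4,"id":"shuffle"},{"id":"zlib","level":2}],'
    '"zarr_format":2}'
}

print(zarray_diff(kerchunk_refs, virtualizarr_refs))
# → {'fill_value': (None, 0)}
```

Key order differs between the two entries, but comparing the decoded dicts ignores that, so fill_value is the only field reported.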
I tried manually changing the fill_value in the virtualizarr-produced JSON to null, and that fixed the issue. So TL;DR: there seems to be a bug in the default fill_value (potentially datetime-specific?). cc @TomNicholas @mpiannucci
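The manual edit can be scripted roughly like this. It is only a sketch of the workaround, not a fix for the underlying bug. `null_fill_value` is a hypothetical helper name, and the handling of a top-level "refs" key is an assumption about how the reference file is laid out (some kerchunk reference files are flat, others nest entries under "refs").

```python
import json


def null_fill_value(refs, key="time/.zarray"):
    """Return a copy of a kerchunk reference dict with fill_value nulled for one array.

    Hypothetical helper for illustration; the input dict is left unmodified.
    """
    patched = json.loads(json.dumps(refs))  # cheap deep copy of JSON-compatible data
    # Assumption: entries live either at the top level or under a "refs" key.
    store = patched["refs"] if "refs" in patched else patched
    raw = store[key]
    meta = json.loads(raw) if isinstance(raw, str) else raw
    meta["fill_value"] = None  # serialized as JSON null
    store[key] = json.dumps(meta)
    return patched


# Small in-memory example mirroring the virtualizarr-produced entry.
example = {
    "version": 1,
    "refs": {"time/.zarray": '{"shape":[24],"chunks":[1],"dtype":"<i4","fill_value":0}'},
}
patched = null_fill_value(example)
print(patched["refs"]["time/.zarray"])
```

To apply it to the file from this issue, load the JSON with `json.load`, pass the result through `null_fill_value`, and write it back with `json.dump` before opening it with `engine="kerchunk"`.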
Code used to produce the two JSONs and highlight the issue:
import ujson
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Build references with kerchunk directly (inline_threshold=0 disables chunk inlining)
m = SingleHdf5ToZarr('MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4', inline_threshold=0)
m2 = m.translate()
with open("MERRA2_100.tavg1_2d_slv_Nx.19800103_NO_INLINE.nc4.json", "w") as f:
    f.write(ujson.dumps(m2))
ds = xr.open_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103_NO_INLINE.nc4.json", engine="kerchunk")
print(ds.time)

<xarray.DataArray 'time' (time: 24)> Size: 192B
array(['1980-01-03T00:30:00.000000000', '1980-01-03T01:30:00.000000000',
'1980-01-03T02:30:00.000000000', '1980-01-03T03:30:00.000000000',
'1980-01-03T04:30:00.000000000', '1980-01-03T05:30:00.000000000',
'1980-01-03T06:30:00.000000000', '1980-01-03T07:30:00.000000000',
'1980-01-03T08:30:00.000000000', '1980-01-03T09:30:00.000000000',
'1980-01-03T10:30:00.000000000', '1980-01-03T11:30:00.000000000',
'1980-01-03T12:30:00.000000000', '1980-01-03T13:30:00.000000000',
'1980-01-03T14:30:00.000000000', '1980-01-03T15:30:00.000000000',
'1980-01-03T16:30:00.000000000', '1980-01-03T17:30:00.000000000',
'1980-01-03T18:30:00.000000000', '1980-01-03T19:30:00.000000000',
'1980-01-03T20:30:00.000000000', '1980-01-03T21:30:00.000000000',
'1980-01-03T22:30:00.000000000', '1980-01-03T23:30:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 192B 1980-01-03T00:30:00 ... 1980-01-03T23...
Attributes:
begin_date: 19800103
begin_time: 3000
long_name: time
time_increment: 10000
valid_range: [-999999986991104.0, 999999986991104.0]
vmax: 999999986991104.0
vmin: -999999986991104.0

from virtualizarr import open_virtual_dataset
vds = open_virtual_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4", indexes={})
vds.virtualize.to_kerchunk("MERRA2_100.tavg1_2d_slv_Nx.19800103_VIRTUALIZARR.nc4.json", format="json")
ds2 = xr.open_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103_VIRTUALIZARR.nc4.json", engine="kerchunk")
print(ds2.time)

<xarray.DataArray 'time' (time: 24)> Size: 192B
array([ 'NaT', '1980-01-03T01:30:00.000000000',
'1980-01-03T02:30:00.000000000', '1980-01-03T03:30:00.000000000',
'1980-01-03T04:30:00.000000000', '1980-01-03T05:30:00.000000000',
'1980-01-03T06:30:00.000000000', '1980-01-03T07:30:00.000000000',
'1980-01-03T08:30:00.000000000', '1980-01-03T09:30:00.000000000',
'1980-01-03T10:30:00.000000000', '1980-01-03T11:30:00.000000000',
'1980-01-03T12:30:00.000000000', '1980-01-03T13:30:00.000000000',
'1980-01-03T14:30:00.000000000', '1980-01-03T15:30:00.000000000',
'1980-01-03T16:30:00.000000000', '1980-01-03T17:30:00.000000000',
'1980-01-03T18:30:00.000000000', '1980-01-03T19:30:00.000000000',
'1980-01-03T20:30:00.000000000', '1980-01-03T21:30:00.000000000',
'1980-01-03T22:30:00.000000000', '1980-01-03T23:30:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 192B NaT ... 1980-01-03T23:30:00
Attributes:
begin_date: 19800103
begin_time: 3000
long_name: time
time_increment: 10000
valid_range: [-999999986991104.0, 999999986991104.0]
vmax: 999999986991104.0
vmin: -999999986991104.0