Datetime coordinate missing single timestep (NaT) when data is loaded #352

@ayushnag

I was trying to virtualize references and then load some NASA MERRA2 data offered at GESDISC. I noticed that when the dataset is loaded using the references made by virtualizarr, the initial timestep has a NaT value whereas the exact same dataset loaded using references made by kerchunk does not. Here is the data file used: https://data.gesdisc.earthdata.nasa.gov/data/MERRA2/M2T1NXSLV.5.12.4/1980/01/MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4

Diving deeper, the only difference I could find between the two JSONs is that the fill_value is null for kerchunk, whereas virtualizarr sets it to 0:

Kerchunk time zarray: "time/.zarray": "{\"chunks\":[1],\"compressor\":null,\"dtype\":\"<i4\",\"fill_value\":null,\"filters\":[{\"elementsize\":4,\"id\":\"shuffle\"},{\"id\":\"zlib\",\"level\":2}],\"order\":\"C\",\"shape\":[24],\"zarr_format\":2}",

Virtualizarr --> kerchunk time zarray: "time/.zarray": "{\"shape\":[24],\"chunks\":[1],\"dtype\":\"<i4\",\"fill_value\":0,\"order\":\"C\",\"compressor\":null,\"filters\":[{\"elementsize\":4,\"id\":\"shuffle\"},{\"id\":\"zlib\",\"level\":2}],\"zarr_format\":2}",

I tried manually changing the fill_value in the virtualizarr-produced JSON to null, and that fixed the issue. TL;DR: there seems to be a bug in the default fill_value (potentially datetime-specific?). cc @TomNicholas @mpiannucci
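The manual workaround described above can be sketched with the standard library alone. This is only an illustration, not part of virtualizarr or kerchunk: the helper name null_fill_value is hypothetical, and it assumes the reference file uses the kerchunk layout in which each .zarray entry is stored as a JSON string, optionally nested under a top-level "refs" key.

```python
import json


def null_fill_value(refs_doc, var="time"):
    """Return a copy of a kerchunk reference document with var's fill_value set to null.

    Hypothetical helper: it only edits the metadata entry "<var>/.zarray".
    """
    patched = json.loads(json.dumps(refs_doc))  # deep copy via a JSON round trip
    # Assumption: version 1 reference files nest entries under "refs";
    # otherwise treat the document itself as the flat mapping.
    refs = patched.get("refs", patched)
    key = f"{var}/.zarray"
    raw = refs[key]
    # The .zarray entry is normally a JSON string; tolerate an already-parsed dict.
    zarray = json.loads(raw) if isinstance(raw, str) else dict(raw)
    zarray["fill_value"] = None  # serialized as null
    refs[key] = json.dumps(zarray)
    return patched


# Small in-memory example mirroring the virtualizarr-produced time/.zarray above.
example = {
    "version": 1,
    "refs": {
        "time/.zarray": json.dumps(
            {
                "shape": [24],
                "chunks": [1],
                "dtype": "<i4",
                "fill_value": 0,
                "order": "C",
                "compressor": None,
                "filters": [
                    {"elementsize": 4, "id": "shuffle"},
                    {"id": "zlib", "level": 2},
                ],
                "zarr_format": 2,
            }
        )
    },
}
patched = null_fill_value(example)
print(patched["refs"]["time/.zarray"])
```

To apply it to a real file, load the JSON with json.load, pass the result through the helper, and write it back with json.dump before opening it with engine="kerchunk".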

Code used to produce the two JSONs and highlighting the issue:

import ujson
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Build kerchunk references directly from the HDF5 file, with no inlined chunks
m = SingleHdf5ToZarr('MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4', inline_threshold=0)
m2 = m.translate()
with open("MERRA2_100.tavg1_2d_slv_Nx.19800103_NO_INLINE.nc4.json", "w") as f:
    f.write(ujson.dumps(m2))

# Load through the kerchunk references: every timestep decodes correctly
ds = xr.open_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103_NO_INLINE.nc4.json", engine="kerchunk")
print(ds.time)
<xarray.DataArray 'time' (time: 24)> Size: 192B
array(['1980-01-03T00:30:00.000000000', '1980-01-03T01:30:00.000000000',
       '1980-01-03T02:30:00.000000000', '1980-01-03T03:30:00.000000000',
       '1980-01-03T04:30:00.000000000', '1980-01-03T05:30:00.000000000',
       '1980-01-03T06:30:00.000000000', '1980-01-03T07:30:00.000000000',
       '1980-01-03T08:30:00.000000000', '1980-01-03T09:30:00.000000000',
       '1980-01-03T10:30:00.000000000', '1980-01-03T11:30:00.000000000',
       '1980-01-03T12:30:00.000000000', '1980-01-03T13:30:00.000000000',
       '1980-01-03T14:30:00.000000000', '1980-01-03T15:30:00.000000000',
       '1980-01-03T16:30:00.000000000', '1980-01-03T17:30:00.000000000',
       '1980-01-03T18:30:00.000000000', '1980-01-03T19:30:00.000000000',
       '1980-01-03T20:30:00.000000000', '1980-01-03T21:30:00.000000000',
       '1980-01-03T22:30:00.000000000', '1980-01-03T23:30:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 192B 1980-01-03T00:30:00 ... 1980-01-03T23...
Attributes:
    begin_date:      19800103
    begin_time:      3000
    long_name:       time
    time_increment:  10000
    valid_range:     [-999999986991104.0, 999999986991104.0]
    vmax:            999999986991104.0
    vmin:            -999999986991104.0
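
import xarray as xr
from virtualizarr import open_virtual_dataset

# Build references for the same file with virtualizarr and write them as kerchunk JSON
vds = open_virtual_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4", indexes={})
vds.virtualize.to_kerchunk("MERRA2_100.tavg1_2d_slv_Nx.19800103_VIRTUALIZARR.nc4.json", format="json")
# Load through the virtualizarr-produced references: the first timestep comes back as NaT
ds2 = xr.open_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103_VIRTUALIZARR.nc4.json", engine="kerchunk")
print(ds2.time)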
<xarray.DataArray 'time' (time: 24)> Size: 192B
array([                          'NaT', '1980-01-03T01:30:00.000000000',
       '1980-01-03T02:30:00.000000000', '1980-01-03T03:30:00.000000000',
       '1980-01-03T04:30:00.000000000', '1980-01-03T05:30:00.000000000',
       '1980-01-03T06:30:00.000000000', '1980-01-03T07:30:00.000000000',
       '1980-01-03T08:30:00.000000000', '1980-01-03T09:30:00.000000000',
       '1980-01-03T10:30:00.000000000', '1980-01-03T11:30:00.000000000',
       '1980-01-03T12:30:00.000000000', '1980-01-03T13:30:00.000000000',
       '1980-01-03T14:30:00.000000000', '1980-01-03T15:30:00.000000000',
       '1980-01-03T16:30:00.000000000', '1980-01-03T17:30:00.000000000',
       '1980-01-03T18:30:00.000000000', '1980-01-03T19:30:00.000000000',
       '1980-01-03T20:30:00.000000000', '1980-01-03T21:30:00.000000000',
       '1980-01-03T22:30:00.000000000', '1980-01-03T23:30:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 192B NaT ... 1980-01-03T23:30:00
Attributes:
    begin_date:      19800103
    begin_time:      3000
    long_name:       time
    time_increment:  10000
    valid_range:     [-999999986991104.0, 999999986991104.0]
    vmax:            999999986991104.0
    vmin:            -999999986991104.0
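
One plausible explanation for why only the first timestep is affected, offered as a guess rather than a confirmed diagnosis: if the raw time values are stored as integer minutes counted from the first timestep, the first raw value is 0, and a fill value of 0 that gets treated as a _FillValue during CF decoding would mask exactly that element. The sketch below reproduces the effect in memory. The units string and the raw values are assumptions (they are not shown in this issue), and the _FillValue attribute is set by hand to stand in for whatever the backend derives from the .zarray fill_value.

```python
import numpy as np
import xarray as xr

# Assumption: hourly steps encoded as int32 minutes since the first timestep.
raw = np.arange(0, 24 * 60, 60, dtype="int32")
units = "minutes since 1980-01-03 00:30:00"  # assumed, not taken from the file


def decode(attrs):
    """CF-decode the raw time values with the given attributes."""
    ds = xr.Dataset({"time": (("t",), raw.copy(), attrs)})
    return xr.decode_cf(ds)["time"].values


# No fill value: every raw value is decoded to a timestamp.
without_fill = decode({"units": units})
# Fill value of 0: the raw value 0 is treated as missing before time decoding.
with_zero_fill = decode({"units": units, "_FillValue": np.int32(0)})

print(without_fill[:2])
print(with_zero_fill[:2])
```

If the assumptions hold, only the element whose raw value equals the fill value is lost, which matches the single NaT seen above and would explain why setting fill_value to null avoids it.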
