
Integrating the proxy into the data viewer - progress update and performance observations and other issues #6

@andersy005


@katamartin and I have been making progress on integrating the proxy into the data viewer. Our intention is to use the proxy for on-the-fly rechunking of datasets for visualization. The results look promising, and performance is satisfactory for small datasets and for datasets hosted in AWS S3, even without caching on the backend:

  • https://storage.googleapis.com/carbonplan-maps/ncview/demo/single_timestep/air_temperature.zarr

Screenshot 2023-01-25 at 10 48 19

  • s3://carbonplan-data-viewer/demo/MURSST.zarr (the original chunk size is roughly 1.21 GB)
    Screenshot 2023-01-25 at 12 24 27

  • Retrieving data from stores hosted outside of S3 takes a long time (as expected). The following are timings for gs://ldeo-glaciology/bedmachine/bm.zarr (the original chunk size is roughly 35 MB)

Screenshot 2023-01-25 at 11 54 05

There is still more work to do to ensure seamless interoperability with existing zarr clients. To illustrate this, the session below shows how the proxy can be used via the zarr Python library.

  • instantiate a zarr store via fsspec
In [21]: url = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

In [22]: store = zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": "10,10"}})

In [23]: store['.zattrs']
Out[23]: b'{"Author":"Mathieu Morlighem","Conventions":"CF-1.7","Data_citation":"Morlighem M. et al., (2019), Deep glacial troughs and stabilizing ridges unveiled beneath the margins of the Antarctic ice sheet, Nature Geoscience (accepted)","Notes":"Data processed at the Department of Earth System Science, University of California, Irvine","Projection":"Polar Stereographic South (71S,0E)","Title":"BedMachine Antarctica","false_easting":[0.0],"false_northing":[0.0],"grid_mapping_name":"polar_stereographic","ice_density (kg m-3)":[917.0],"inverse_flattening":[298.2794050428205],"latitude_of_projection_origin":[-90.0],"license":"No restrictions on access or use","no_data":[-9999.0],"nx":[13333.0],"ny":[13333.0],"proj4":"+init=epsg:3031","sea_water_density (kg m-3)":[1027.0],"semi_major_axis":[6378273.0],"spacing":[500],"standard_parallel":[-71.0],"straight_vertical_longitude_from_pole":[0.0],"version":"05-Nov-2019 (v1.38)","xmin":[-3333000],"ymax":[3333000]}'
  • open an array within the zarr store
In [25]: arr = zarr.open(store, path='/bed')

In [27]: arr
Out[27]: <zarr.core.Array '/bed' (13333, 13333) float32>
  • retrieve some data
In [28]: arr[:10, :10]
Out[28]: 
array([[-5914.538 , -5919.3955, -5924.865 , -5930.3765, -5935.8853,
        -5941.0205, -5945.997 , -5950.359 , -5954.3784, -5958.045 ],
       [-5910.384 , -5915.8296, -5921.3076, -5927.158 , -5932.7554,
        -5938.29  , -5943.1704, -5947.785 , -5951.881 , -5955.54  ],
       [-5906.422 , -5911.8516, -5917.63  , -5923.6133, -5929.573 ,
        -5935.029 , -5940.271 , -5944.9736, -5949.237 , -5952.898 ],
       [-5902.613 , -5908.093 , -5914.061 , -5920.044 , -5925.9707,
        -5931.7017, -5937.0083, -5941.9688, -5946.243 , -5950.265 ],
       [-5899.054 , -5904.7085, -5910.5   , -5916.532 , -5922.4585,
        -5928.2095, -5933.64  , -5938.608 , -5943.3335, -5947.362 ],
       [-5895.9683, -5901.283 , -5907.2   , -5913.2   , -5919.1235,
        -5924.6836, -5930.077 , -5935.3584, -5940.0796, -5944.544 ],
       [-5892.8423, -5898.332 , -5904.08  , -5910.0503, -5915.838 ,
        -5921.344 , -5926.583 , -5931.785 , -5936.9224, -5941.452 ],
       [-5890.067 , -5895.4604, -5901.1587, -5906.9365, -5912.6836,
        -5918.2617, -5923.3687, -5928.1724, -5933.3447, -5937.538 ],
       [-5887.37  , -5892.716 , -5898.2046, -5903.9224, -5909.691 ,
        -5915.144 , -5920.3755, -5925.193 , -5928.876 , -5933.021 ],
       [-5884.786 , -5890.015 , -5895.455 , -5900.958 , -5906.5366,
        -5912.1353, -5917.4043, -5921.5264, -5925.1343, -5928.5483]],
      dtype=float32)
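
For a sense of scale, the arithmetic below estimates what a 10,10 chunking implies for `bed`. It is a rough sketch that uses only the shape and dtype shown above; it assumes a regular chunk grid and uncompressed sizes, and says nothing about how the proxy actually stores or serves chunks.

```python
import math

shape = (13333, 13333)   # shape of `bed`, from the array repr above
chunks = (10, 10)        # chunk shape requested via the "chunks" header
itemsize = 4             # float32 uses 4 bytes per element

# Number of chunks along each axis (the last chunk on each axis is partial).
chunks_per_axis = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = math.prod(chunks_per_axis)

# Uncompressed size of one full chunk, in bytes.
chunk_nbytes = math.prod(chunks) * itemsize

print(chunks_per_axis)  # (1334, 1334)
print(n_chunks)         # 1779556
print(chunk_nbytes)     # 400
```

Under these assumptions, `arr[:10, :10]` covers exactly one 10 by 10 chunk of about 400 bytes, whereas the original store uses chunks of roughly 35 MB.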

If we attempt to access a variable whose dimensionality does not match the chunks specified in the HTTP headers, the request fails. For instance, in our store, x is 1D, but the chunks we specified earlier are 10,10, as defined in zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": "10,10"}}). The metadata returned for x reports chunks of [10,10] against a 1D shape, and fetching a chunk returns a 500 error:

In [29]: store['x/.zarray']
Out[29]: b'{"chunks":[10,10],"compressor":null,"dtype":"<i4","fill_value":null,"filters":[],"order":"C","shape":[13333],"zarr_format":2}'

In [30]: store['x/0']
---------------------------------------------------------------------------
ClientResponseError                       Traceback (most recent call last)
Cell In[30], line 1
----> 1 store['x/0']

ClientResponseError: 500, message='Internal Server Error', url=URL('http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/x/0')
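
The mismatch behind this error can be made explicit with a small check. The following is a hypothetical sketch: `parse_chunks_header` and `check_chunks` are illustrative names, not functions from the proxy or from zarr, and the header format is assumed to be the comma-separated string used in the session above.

```python
def parse_chunks_header(value):
    """Parse a "chunks" header value such as "10,10" into a tuple of ints."""
    return tuple(int(part) for part in value.split(","))


def check_chunks(shape, chunks):
    """Raise ValueError if the chunk shape cannot apply to an array of this shape."""
    if len(chunks) != len(shape):
        raise ValueError(
            f"chunks {chunks} has {len(chunks)} dimension(s), "
            f"but the array has {len(shape)}"
        )
    if any(c <= 0 for c in chunks):
        raise ValueError(f"chunk sizes must be positive, got {chunks}")
    return chunks


requested = parse_chunks_header("10,10")
check_chunks((13333, 13333), requested)   # bed is 2D: accepted

try:
    check_chunks((13333,), requested)     # x is 1D: rejected
except ValueError as err:
    print(err)
```

A check along these lines, applied before rechunking, might let a server respond with a clearer client error instead of a 500; whether that fits the proxy's design is an open question.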

It would be nice if there were a way to override the headers via fsspec.
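
Until such an override exists, one possible workaround is to build the headers per dimensionality and open a separate store for each. The helper below is a hypothetical sketch (`client_kwargs_for` is an illustrative name), and it assumes the proxy accepts a single-value chunks header such as "10" for 1D variables, which has not been verified here.

```python
def client_kwargs_for(ndim, chunk_len=10):
    """Build FSStore client_kwargs whose "chunks" header has one entry per dimension."""
    if ndim < 1:
        raise ValueError("ndim must be at least 1")
    header = ",".join([str(chunk_len)] * ndim)
    return {"headers": {"chunks": header}}


print(client_kwargs_for(2))  # {'headers': {'chunks': '10,10'}}
print(client_kwargs_for(1))  # {'headers': {'chunks': '10'}}
```

The result could be passed as in the session above, for example zarr.storage.FSStore(url, client_kwargs=client_kwargs_for(1)) for x. The cost is one store per dimensionality, which is part of why a per-request override would be preferable.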

I am also CC-ing some folks (@freeman-lab, @norlandrhagen, @jhamman, @rabernat) who might be interested in this, to keep them in the loop on our progress.
