Usage: Poor performance of NaN-aware xarray computations? #908

@peanutfun

Description

I am using sparse arrays with xarray and found that a simple DataArray.sum() operation performs poorly with skipna=True, which is the default for float data types. It is usually faster to densify the underlying array, compute the sum on it, and then sparsify the result again than to compute the sum on the original sparse array (granted, this only works if the array fits into memory).
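For reference, here is a minimal sketch of the densify-sum-sparsify round trip I mean (assuming the array fits into memory; this mirrors the third benchmark below):

import numpy as np
import sparse
import xarray as xr

arr = xr.DataArray(
    sparse.random((100, 100, 100), density=0.1, fill_value=np.nan),
    dims=["x", "y", "z"],
)

# Round-trip through a dense array: densify, reduce, then sparsify again.
dense = arr.copy(data=arr.data.todense())
result = dense.sum(dim="x")
result = result.copy(data=sparse.as_coo(result.values))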

What is even more confusing is that performance gets worse still when I set fill_value=np.nan, in which case skipna=True should be essentially trivial: every fill position is already NaN, so only the explicitly stored values need to be considered. Why is that?
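To illustrate why I would expect this to be cheap, here is a sketch for the simplest case of a full reduction (a per-axis reduction needs more bookkeeping, but the idea is the same): with fill_value=np.nan, a NaN-skipping sum only ever has to touch the explicitly stored values.

import numpy as np
import sparse

s = sparse.random((100, 100), density=0.1, fill_value=np.nan)

# The stored values are exactly the non-fill entries, so summing them
# matches a NaN-skipping sum over the densified array.
assert np.isclose(np.nansum(s.data), np.nansum(s.todense()))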

I am using fill_value=np.nan because I find it the "natural" choice for xarray data: xarray uses NaN to indicate "no data" and inserts NaN as the default fill value when merging, aligning, or extending data. I therefore think it is important that fill_value=np.nan does not incur a penalty compared to the default fill_value=0.
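For example, aligning two DataArrays with partially overlapping coordinates pads the gaps with NaN:

import xarray as xr

a = xr.DataArray([1.0, 2.0], coords={"x": [0, 1]}, dims="x")
b = xr.DataArray([3.0, 4.0], coords={"x": [1, 2]}, dims="x")

# Outer alignment fills non-overlapping positions with NaN by default.
a2, b2 = xr.align(a, b, join="outer")
print(a2.values)  # [ 1.  2. nan]
print(b2.values)  # [nan  3.  4.]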

PS: I suspect this has to do with the sparse data structures rather than with xarray itself, which is why I raised the issue here.

Example Code

# Fill value is 0
$ python -m timeit -s "import sparse; import xarray as xr; arr = xr.DataArray(sparse.random((100, 100, 100), density=0.1, fill_value=0), dims=['x', 'y', 'z'])" "arr.sum(dim='x')"
1 loop, best of 5: 24.1 msec per loop

# Fill value is NaN, takes twice (!) the time
$ python -m timeit -s "import sparse; import xarray as xr; import numpy as np; arr = xr.DataArray(sparse.random((100, 100, 100), density=0.1, fill_value=np.nan), dims=['x', 'y', 'z'])" "arr.sum(dim='x')"
1 loop, best of 5: 44.6 msec per loop

# Densifying is slightly faster
$ python -m timeit -s "import sparse; import xarray as xr; import numpy as np; arr = xr.DataArray(sparse.random((100, 100, 100), density=0.1, fill_value=np.nan), dims=['x', 'y', 'z'])" "arr.data = arr.data.todense(); arr.sum(dim='x'); arr.data = sparse.as_coo(arr.data)"
5 loops, best of 5: 38.9 msec per loop

# Setting the fill value to zero and not skipping NaNs is way faster
$ python -m timeit -s "import sparse; import xarray as xr; import numpy as np; arr = xr.DataArray(sparse.random((100, 100, 100), density=0.1, fill_value=np.nan), dims=['x', 'y', 'z'])" "arr.data.fill_value=0.0; arr.sum(dim='x', skipna=False)"
20 loops, best of 5: 7.81 msec per loop

# For comparison, the computation on a dense array is faster, even with skipna=True
$ python -m timeit -s "import sparse; import xarray as xr; import numpy as np; arr = xr.DataArray(sparse.random((100, 100, 100), density=0.1, fill_value=np.nan).todense(), dims=['x', 'y', 'z'])" "arr.sum(dim='x')"
100 loops, best of 5: 2.99 msec per loop
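For what it's worth, the fill-value-swapping trick from the fourth benchmark can be wrapped in a small helper. This is only a sketch, valid only under the assumption that the explicitly stored values contain no NaNs themselves; nan_fill_sum is a made-up name, not part of either library:

import sparse
import xarray as xr

def nan_fill_sum(arr: xr.DataArray, dim: str) -> xr.DataArray:
    # Swap the NaN fill value for 0 so the fast skipna=False path can be
    # used; the NaN fill positions would have been skipped anyway.
    coo = arr.data  # the underlying sparse.COO
    swapped = sparse.COO(coo.coords, coo.data, shape=coo.shape, fill_value=0.0)
    return arr.copy(data=swapped).sum(dim=dim, skipna=False)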
