Skip to content

Flexible indexing blog post #795

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
447 changes: 447 additions & 0 deletions public/posts/flexible-indexing/da-pandas-repr.html

Large diffs are not rendered by default.

457 changes: 457 additions & 0 deletions public/posts/flexible-indexing/da-rasterix-repr.html

Large diffs are not rendered by default.

451 changes: 451 additions & 0 deletions public/posts/flexible-indexing/ds-range-repr.html

Large diffs are not rendered by default.

449 changes: 449 additions & 0 deletions public/posts/flexible-indexing/ds-range-slice-repr.html

Large diffs are not rendered by default.

Binary file added public/posts/flexible-indexing/summary-slide.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
498 changes: 498 additions & 0 deletions public/posts/flexible-indexing/xvec-repr.html

Large diffs are not rendered by default.

Binary file added public/posts/flexible-indexing/xvecfig.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
28 changes: 14 additions & 14 deletions src/components/layout.js
Original file line number Diff line number Diff line change
Expand Up @@ -13,26 +13,26 @@ export const Layout = ({
url = 'https://xarray.dev',
enableBanner = false,
}) => {
const bannerTitle = 'Check out the new blog post!:'
const bannerTitle = 'Check out the latest blog post:'
// The first link will be the main description for the banner
const bannerDescription = (
<Link href='/blog/xarray-biology' fontWeight='medium'>
<Link href='/blog/flexible-indexing' fontWeight='medium'>
{' '}
{/* Ensure it stands out a bit */}
Xarray for Biology: Learn how Xarray can be used for Biological workflows.
Xarray indexes: Unleash the power of coordinates!
</Link>
)
// The second link will be passed as children, styled to be smaller
const bannerChildren = (
<Link
href='https://docs.google.com/forms/d/e/1FAIpQLSeGvTLONF-24V7z2HoACm4MhEr82c2V-VIzA9eqM9-jt-Xh8g/viewform?usp=sharing&ouid=111570313164368772519'
fontSize='sm'
>
{' '}
{/* Add your second link here, smaller font */}
<b>SciPy 2025</b> Click here for info about an Xarray for Bio Sprint!
</Link>
)
// const bannerChildren = (
// <Link
// href='https://docs.google.com/forms/d/e/1FAIpQLSeGvTLONF-24V7z2HoACm4MhEr82c2V-VIzA9eqM9-jt-Xh8g/viewform?usp=sharing&ouid=111570313164368772519'
// fontSize='sm'
// >
// {' '}
// {/* Add your second link here, smaller font */}
// <b>SciPy 2025</b> Click here for info about an Xarray for Bio Sprint!
// </Link>
//)
Comment on lines +26 to +35
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove? Or did you want to leave that as an example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way, I commented instead of deleting to more easily go back to the existing multi-line banner


// Determine the base URL based on the environment
const baseUrl = process.env.NEXT_PUBLIC_VERCEL_URL
Expand Down Expand Up @@ -77,7 +77,7 @@ export const Layout = ({
<Header />
{enableBanner && (
<Banner title={bannerTitle} description={bannerDescription}>
{bannerChildren}
{/* {bannerChildren} */}
</Banner>
)}
{children}
Expand Down
218 changes: 218 additions & 0 deletions src/posts/flexible-indexing/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,218 @@
---
title: 'Xarray Indexes: Exciting new ways to slice and dice your data!'
date: '2025-08-11'
authors:
- name: Benoît Bovy
github: benbovy
- name: Scott Henderson
github: scottyhq
- name: Deepak Cherian
github: dcherian
- name: Justus Magin
github: keewis
summary: 'An introduction to customizable coordinate-based data selection and alignment for more efficient handling of both traditional and more exotic data structures'
---

\_TLDR: Xarray>2025.6 has been through a major refactoring of its internals that makes coordinate-based data selection and alignment customizable, enabling more efficient handling of both traditional and more exotic data structures. In this post we highlight a few examples that take advantage of this new superpower!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TLDR looks a bit weird with the leading underscore after rendering, and I'd probably write the version comparison as xarray>2025.06.0. Is it truly > or rather >=, though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xarray>2025.06.0. Is it truly > or rather >= ?

While some pieces of this have been around longer, I know there have been a lot of incremental improvements, so I thought I'd play it safe and go very recent. @benbovy do you have a preference?

Copy link
Member

@keewis keewis Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the version itself is fine (but I didn't check for anything specific that we'd need), I just found the > a bit unusual, I would have used >= instead (and adjusted the version as necessary)


<figure>
<img src='/posts/flexible-indexing/summary-slide.png' />
<figcaption>
*Summary schematic from Deepak Cherian's [2025 SciPy
Presentation](https://www.youtube.com/watch?v=I-NHCuLhRjY) highlighting new
custom Indexes and usecases. [Link to full slide
deck](https://docs.google.com/presentation/d/1sQU2N0-ThNZM8TUhsZy-kT0bZnu0H5X0FRJz2eKwEpA/edit?slide=id.g37373ba88e6_0_214#slide=id.g37373ba88e6_0_214)*
</figcaption>
</figure>

{/* This is a comment that won't be rendered! */}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{/* This is a comment that won't be rendered! */}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll add a small section to the readme on authoring tips. I was commenting some things out while testing the build locally and had to search around to figure out this syntax! Eventually it would be nice for a more streamlined blogging interface, just using mystmd or jupyterbook, I'll open an issue with a couple ideas.


## Indexing basics

First thing's first, _what is an `index` and why is it helpful?_

> In brief, an _index_ makes repeated selection of data more efficient. Xarray Indexes connect coordinate labels to associated data values and encode important contextual information about the coordinate space.

Examples of indexes are all around you and are a fundamental way to organize and simplify access to information.
If you want a book about Natural Sciences, you can go to your local library branch and head straight to section `500`, or if you're in the mood for a good novel go to section `800`. Connecting thematic labels with numbers is a classic indexing system that's been around for hundreds of years [(Dewey Decimal System, 1876)](https://en.wikipedia.org/wiki/Dewey_Decimal_Classification).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be a US thing, I've never seen this kind of numbering scheme (although admittedly I've not been to a lot of libraries with science sections). Still, the example is interesting and I would keep it, I'd just qualify it with something like "In the US ..." (or whatever fits)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, definitely Dewey Decimal is a US thing, and there are plenty of alternatives :)

The need for an index becomes critical as the size of data grows - just imagine the time it would take to find a specific novel amongst a million uncategorized books!

The same efficiencies arise in computing. Consider a simple 1D dataset consisting of measurements `Y=[10,20,30,40,50,60]` at six coordinate positions `X=[1, 2, 4, 8, 16, 32]`. _What was our measurement at `X=8`?_
To answer this in code, we have need an index that is simply a key:value mapping between the coordinate values and integer positions `i=[0,1,2,3,4,5]` in the coordinates array.
With only 6 coordinates, we easily see `X[3]=8` so our measurement of interest is `Y[3]=40`.

> 💡 **Note:** for large datasets we should loop over _all_ the coordinates to ensure there are no repeated values! This initial pass over all the coordinates to build an _index_ may take significant time and may not always be desirable.

## Pandas.Index

Xarray's [label-based selection](https://docs.xarray.dev/en/latest/user-guide/indexing.html#indexing-with-dimension-names) allows a more expressive and simple syntax in which you don't have to think about the index (`da.sel(x=8) = 40`). Up until now, Xarray has relied exclusively on [Pandas.Index](https://pandas.pydata.org/docs/user_guide/indexing.html), which is still used by default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be closer to the import

Suggested change
Xarray's [label-based selection](https://docs.xarray.dev/en/latest/user-guide/indexing.html#indexing-with-dimension-names) allows a more expressive and simple syntax in which you don't have to think about the index (`da.sel(x=8) = 40`). Up until now, Xarray has relied exclusively on [Pandas.Index](https://pandas.pydata.org/docs/user_guide/indexing.html), which is still used by default:
Xarray's [label-based selection](https://docs.xarray.dev/en/latest/user-guide/indexing.html#indexing-with-dimension-names) allows a more expressive and simple syntax in which you don't have to think about the index (`da.sel(x=8) = 40`). Up until now, Xarray has relied exclusively on [pandas.Index](https://pandas.pydata.org/docs/user_guide/indexing.html), which is still used by default:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, should this mention databases and hash-tables? It doesn't have to, just a thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this mention databases and hash-tables? It doesn't have to, just a thought.

I'm actually not sure what you mean here? I thought pandas.Index is a hash-table? But my understanding of this space is admittedly not that great 😅 . If you have a sentence or two please suggest!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's correct. I think I remember hearing that hash tables were created for databases, so that's why I mentioned both (but either way, I don't think it's necessary to follow this, and I don't have any specific sentences in mind).


```python
x = np.array([1, 2, 4, 8, 16, 32])
y = np.array([10, 20, 30, 40, 50, 60])
da = xr.DataArray(y, coords={'x': x})
da
```

<RawHTML filePath='/posts/flexible-indexing/da-pandas-repr.html' />

```python
da.sel(x=8)
# 40
```

## Alternatives to Pandas.Index

Importantly, a loop over all the coordinate values is not the only way to create an index.
You might recognize that our coordinates about can in fact be represented by a function `X(i)=2**i` where `i` is the integer position! Given that function we can quickly get measurement values at any position: `Y(X=4)` = `Y[sqrt(4)]` = `Y[2]=30`. Xarray now has a [CoordinateTransformIndex](https://xarray-indexes.readthedocs.io/blocks/transform.html) to handle this type of on-demand lookup of coordinate positions.

### Xarray RangeIndex

A simple special case of `CoordinateTransformIndex` is a `RangeIndex` where coordinates can be defined by a start, stop, and uniform step size. `Pandas.RangeIndex` only supports _integers_, whereas Xarray handles floating-point values. Coordinate look-up is performed on-the-fly rather than loading all values into memory up-front when creating a Dataset, which is critical for the example below that has a coordinate array of 7TB!

```python
from xarray.indexes import RangeIndex

index = RangeIndex.arange(0.0, 1000.0, 1e-9, dim='x') # 7TB coordinate array!
ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index))
ds
```

<RawHTML filePath='/posts/flexible-indexing/ds-range-repr.html' />

Selection preserves the RangeIndex and does not require loading all the coordinates into memory.

```
sliced = ds.isel(x=slice(1_000, 50_000, 100))
sliced.x
```

<RawHTML filePath='/posts/flexible-indexing/ds-range-slice-repr.html' />

## Third-party custom Indexes

In addition to a few new built-in indexes, Xarray can be extended by third party indexes defined outside of the core codebase. This capability is very important to support a multitude of domain-specific data structures.

### Rasterix RasterIndex

Earlier we mentioned that coordinates often have a _functional representation_.
For 2D raster images, this function often takes the form of an [Affine Transform](https://en.wikipedia.org/wiki/Affine_transformation).
The [rasterix](https://github.com/xarray-contrib/rasterix) library extends Xarray with a `RasterIndex` which computes coordinates for geospatial images such as GeoTiffs via Affine Transform.

Below is a simple example of slicing a large mosaic of GeoTiffs without ever loading the coordinates into memory, note that a new Affine is defined after the slicing operation:

```python
# 811816322401 values!
import rasterix

#26475 GeoTiffs represented by a GDAL VRT
da = xr.open_dataarray('https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt',
engine='rasterio',
parse_coordinates=False).squeeze().pipe(
rasterix.assign_index
)
da
```

<RawHTML filePath='/posts/flexible-indexing/da-rasterix-repr.html' />

```python
print('Original geotransform:\n', da.xindexes['x'].transform())
da_sliced = da.sel(x=slice(-122.4, -120.0), y=slice(-47.1,-49.0))
print('Sliced geotransform:\n', da_sliced.xindexes['x'].transform())
```

```
Original geotransform:
| 0.00, 0.00,-180.00|
| 0.00,-0.00, 84.00|
| 0.00, 0.00, 1.00|

Sliced geotransform:
| 0.00, 0.00,-122.40|
| 0.00,-0.00,-47.10|
| 0.00, 0.00, 1.00|
```

### XProj CRSIndex

> real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc. [Xarray Docs](https://docs.xarray.dev/en/stable/getting-started-guide/why-xarray.html#what-labels-enable)

We often think about metadata providing context for _measurement values_ but metadata is also critical for coordinates!
In particular, to align two different datasets we must ask if the coordinates are in the same coordinate system.
In other words, do they share the same origin and scale?

There are currently over 7000 commonly used [Coordinate Reference Systems (CRS)](https://spatialreference.org/ref/epsg/) for geospatial data in the authoritative EPSG database!
And of course an infinite number of custom-defined CRSs.
[xproj.CRSIndex](https://xproj.readthedocs.io/en/latest/) gives Xarray objects an automatic awareness of the coordinate reference system operations like `xr.align()`, which can raise an an informative error when there is a CRS mismatch:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[xproj.CRSIndex](https://xproj.readthedocs.io/en/latest/) gives Xarray objects an automatic awareness of the coordinate reference system operations like `xr.align()`, which can raise an an informative error when there is a CRS mismatch:
[xproj.CRSIndex](https://xproj.readthedocs.io/en/latest/) gives Xarray objects an automatic awareness of the coordinate reference system operations like `xr.align()`, which can raise an informative error when there is a CRS mismatch:


```python
from xproj import CRSIndex
lons1 = np.arange(-125, -120, 1)
lons2 = np.arange(-122, -118, 1)
ds1 = xr.Dataset(coords={'longitude': lons1}).proj.assign_crs(crs=4267)
ds2 = xr.Dataset(coords={'longitude': lons2}).proj.assign_crs(crs=4326)
ds1 + ds2
```

```pytb
MergeError: conflicting values/indexes on objects to be combined for coordinate 'crs'
```

### XVec GeometryIndex

A "vector data cube" is an n-D array that has at least one dimension indexed by a 2-D array of vector geometries.
Large vector cubes can take advantage of an [R-tree spatial index](https://en.wikipedia.org/wiki/R-tree) for efficiently selecting vector geometries within a given bounding box.
The `XVec.GeometryIndex` provides this functionality, below is a short code snippet but please refer to the [documentation for more](https://xvec.readthedocs.io/en/stable/indexing.html)!

```python
import xvec
import geopandas as gpd
from geodatasets import get_path

# Dataset that contains demographic data indexed by U.S. counties
counties = gpd.read_file(get_path("geoda.natregimes"))

cube = xr.Dataset(
data_vars=dict(
population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
),
coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
```

<RawHTML filePath='/posts/flexible-indexing/xvec-repr.html' />

```python
# Efficient selection using shapely.STRtree
from shapely.geometry import box

subset = cube.xvec.query(
"county",
box(-125.4, 40, -120.0, 50),
predicate="intersects",
)

subset['population'].xvec.plot(col='year');
```

<p align='center'>
<img src='/posts/flexible-indexing/xvecfig.png' />
</p>

### Even more examples!

Be sure to check out the [Gallery of Custom Index Examples](https://xarray-indexes.readthedocs.io) for more detailed examples of all the indexes mentioned in this post and more!

## What's next?

While we're extremely excited about what can _already_ be accomplished with the new indexing capabilities, there are plenty of exciting ideas for future work.

Have an idea for your own custom index? Check out [this section of the Xarray documentation](https://docs.xarray.dev/en/stable/internals/how-to-create-custom-index.html).

## Acknowledgments

This work would not have been possible without technical input from the Xarray core team and community!
Several developers received essential funding from a [CZI Essential Open Source Software for Science (EOSS) grant](https://xarray.dev/blog/czi-eoss-grant-conclusion) as well as NASA's Open Source Tools, Frameworks, and Libraries (OSTFL) grant 80NSSC22K0345.