diff --git a/README.md b/README.md index 8b784312..9cd693df 100644 --- a/README.md +++ b/README.md @@ -49,3 +49,46 @@ You can start editing the page by modifying `pages/index.js`. The page auto-upda

+ +## Authoring blog post tips + +1. To create a new blog post, a good place to start is copying a subfolder under `src/posts/`. For example, https://xarray.dev/blog/flox is written here: https://github.com/xarray-contrib/xarray.dev/blob/e04905f5ea039eb2eb848c0b4945beee323900e4/src/posts/flox/index.md + +### Static assets + +Once you have `src/posts/newpost/index.md`, start writing! If you want to include figures or other static assets, they go into a matching `public/posts/newpost` folder. Note, however, that you should reference an image without the `public` part of the path, like this: + +```html +

+ +

+``` + +### Xarray HTML reprs + +To include an HTML repr, you must save it first: + +```python +with open('da-repr.html', 'w') as f: + f.write(da._repr_html_()) +``` + +Then put it into the post's static assets folder `public/posts/newpost/da-repr.html`. Finally, in `src/posts/newpost/index.md`, you can include it with this syntax: + +``` + +``` + +### Toggling visibility of sections (markdown comments) + +While authoring, you might want to toggle specific sections on and off during rendering. You can do that with this syntax: + +``` +{/* This is a comment that won't be rendered! */} +``` + +### Landing page banner + +If you'd like to add a link to the latest blog post on the landing page banner, edit this section here: + +https://github.com/xarray-contrib/xarray.dev/blob/e04905f5ea039eb2eb848c0b4945beee323900e4/src/components/layout.js#L18 diff --git a/public/posts/flexible-indexing/da-pandas-repr.html b/public/posts/flexible-indexing/da-pandas-repr.html new file mode 100644 index 00000000..9710958f --- /dev/null +++ b/public/posts/flexible-indexing/da-pandas-repr.html @@ -0,0 +1,447 @@ +
<xarray.DataArray (x: 6)> Size: 48B
+array([10, 20, 30, 40, 50, 60])
+Coordinates:
+  * x        (x) int64 48B 1 2 4 8 16 32
\ No newline at end of file diff --git a/public/posts/flexible-indexing/da-rasterix-repr.html b/public/posts/flexible-indexing/da-rasterix-repr.html new file mode 100644 index 00000000..232306f6 --- /dev/null +++ b/public/posts/flexible-indexing/da-rasterix-repr.html @@ -0,0 +1,457 @@ +
<xarray.DataArray 'band_data' (y: 626401, x: 1296001)> Size: 3TB
+[811816322401 values with dtype=float32]
+Coordinates:
+    band         int64 8B 1
+    spatial_ref  int64 8B ...
+  * x            (x) float64 10MB -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
+  * y            (y) float64 5MB 84.0 84.0 84.0 84.0 ... -90.0 -90.0 -90.0 -90.0
+Indexes:
+  ┌ x        RasterIndex (crs=None)
+  └ y
+Attributes:
+    AREA_OR_POINT:  Point
\ No newline at end of file diff --git a/public/posts/flexible-indexing/ds-range-repr.html b/public/posts/flexible-indexing/ds-range-repr.html new file mode 100644 index 00000000..aec8e205 --- /dev/null +++ b/public/posts/flexible-indexing/ds-range-repr.html @@ -0,0 +1,451 @@ +
<xarray.Dataset> Size: 8MB
+Dimensions:  (x: 1000000)
+Coordinates:
+  * x        (x) float64 8MB 0.0 0.1 0.2 0.3 0.4 ... 1e+05 1e+05 1e+05 1e+05
+Data variables:
+    *empty*
+Indexes:
+    x        RangeIndex (start=0, stop=1e+05, step=0.1)
\ No newline at end of file diff --git a/public/posts/flexible-indexing/ds-range-slice-repr.html b/public/posts/flexible-indexing/ds-range-slice-repr.html new file mode 100644 index 00000000..af6ff6d7 --- /dev/null +++ b/public/posts/flexible-indexing/ds-range-slice-repr.html @@ -0,0 +1,449 @@ +
<xarray.DataArray 'x' (x: 490)> Size: 4kB
+[490 values with dtype=float64]
+Coordinates:
+  * x        (x) float64 4kB 1e-06 1.1e-06 1.2e-06 ... 4.98e-05 4.99e-05
+Indexes:
+    x        RangeIndex (start=1e-06, stop=5e-05, step=1e-07)
diff --git a/public/posts/flexible-indexing/summary-slide.png b/public/posts/flexible-indexing/summary-slide.png new file mode 100644 index 00000000..35300973 Binary files /dev/null and b/public/posts/flexible-indexing/summary-slide.png differ diff --git a/public/posts/flexible-indexing/xvec-repr.html b/public/posts/flexible-indexing/xvec-repr.html new file mode 100644 index 00000000..355b5d73 --- /dev/null +++ b/public/posts/flexible-indexing/xvec-repr.html @@ -0,0 +1,498 @@ +
<xarray.Dataset> Size: 173kB
+Dimensions:       (county: 3085, year: 4)
+Coordinates:
+  * county        (county) geometry 25kB POLYGON ((-95.34258270263672 48.5467...
+  * year          (year) int64 32B 1960 1970 1980 1990
+Data variables:
+    population    (county, year) int32 49kB 4304 3987 3764 ... 43766 55800 65077
+    unemployment  (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489
+Indexes:
+    county   GeometryIndex (crs=EPSG:4326)
\ No newline at end of file diff --git a/public/posts/flexible-indexing/xvecfig.png b/public/posts/flexible-indexing/xvecfig.png new file mode 100644 index 00000000..31cdb83a Binary files /dev/null and b/public/posts/flexible-indexing/xvecfig.png differ diff --git a/src/components/layout.js b/src/components/layout.js index eacaceb3..a0c02d16 100644 --- a/src/components/layout.js +++ b/src/components/layout.js @@ -13,26 +13,26 @@ export const Layout = ({ url = 'https://xarray.dev', enableBanner = false, }) => { - const bannerTitle = 'Check out the new blog post!:' + const bannerTitle = 'Check out the latest blog post:' // The first link will be the main description for the banner const bannerDescription = ( - + {' '} {/* Ensure it stands out a bit */} - Xarray for Biology: Learn how Xarray can be used for Biological workflows. + Xarray Indexes: Exciting new ways to slice and dice your data! ) // The second link will be passed as children, styled to be smaller - const bannerChildren = ( - - {' '} - {/* Add your second link here, smaller font */} - SciPy 2025 Click here for info about an Xarray for Bio Sprint! - - ) + // const bannerChildren = ( + // + // {' '} + // {/* Add your second link here, smaller font */} + // SciPy 2025 Click here for info about an Xarray for Bio Sprint! + // + //) // Determine the base URL based on the environment const baseUrl = process.env.NEXT_PUBLIC_VERCEL_URL @@ -77,7 +77,7 @@ export const Layout = ({
{enableBanner && ( - {bannerChildren} + {/* {bannerChildren} */} )} {children} diff --git a/src/posts/flexible-indexing/index.md b/src/posts/flexible-indexing/index.md new file mode 100644 index 00000000..d46aa85d --- /dev/null +++ b/src/posts/flexible-indexing/index.md @@ -0,0 +1,216 @@ +--- +title: 'Xarray Indexes: Exciting new ways to slice and dice your data!' +date: '2025-08-11' +authors: + - name: Benoît Bovy + github: benbovy + - name: Scott Henderson + github: scottyhq + - name: Deepak Cherian + github: dcherian + - name: Justus Magin + github: keewis +summary: 'An introduction to customizable coordinate-based data selection and alignment for more efficient handling of both traditional and more exotic data structures' +--- + +**TL;DR**: xarray>=2025.6 has been through a major refactoring of its internals that makes coordinate-based data selection and alignment customizable, enabling more efficient handling of both traditional and more exotic data structures. In this post we highlight a few examples that take advantage of this new superpower! + +
+ +
+ *Summary schematic from Deepak Cherian's [2025 SciPy + Presentation](https://www.youtube.com/watch?v=I-NHCuLhRjY) highlighting new + custom Indexes and use cases. [Link to full slide + deck](https://docs.google.com/presentation/d/1sQU2N0-ThNZM8TUhsZy-kT0bZnu0H5X0FRJz2eKwEpA/edit?slide=id.g37373ba88e6_0_214#slide=id.g37373ba88e6_0_214)*
+
+ +## Indexing basics + +First things first, _what is an `index` and why is it helpful?_ + +> In brief, an _index_ makes selection of subsets of data more efficient. Xarray Indexes connect coordinate labels to associated data values and encode important contextual information about the coordinate space. + +Examples of indexes are all around you and are a fundamental way to organize and simplify access to information. +In the United States, if you want a book about Natural Sciences, you can go to your local library branch and head straight to section 500. Or if you're in the mood for a classic novel, go to section 800. Connecting thematic labels with numbers (`{'Natural Sciences': 500, 'Literature': 800}`) is a classic indexing system that's been around for about 150 years [(Dewey Decimal System, 1876)](https://en.wikipedia.org/wiki/Dewey_Decimal_Classification). +The need for an index becomes critical as the size of data grows - just imagine the time it would take to find a specific novel amongst a million uncategorized books! + +The same efficiencies arise in computing. Consider a simple 1D dataset consisting of measurements `Y=[10,20,30,40,50,60]` at six coordinate positions `X=[1, 2, 4, 8, 16, 32]`. _What was our measurement at `X=8`?_ +To answer this in code, we need an index that is simply a _key:value_ mapping or "hash table" between the coordinate values and integer positions `i=[0,1,2,3,4,5]` in the coordinates array. +With only 6 coordinates, we easily see `X[3]=8`, so our measurement of interest is `Y[3]=40`. + +> 💡 **Note:** building the index requires a loop over _all_ the coordinates, which also lets us ensure there are no repeated values. For large datasets, this initial pass may take significant time and may not always be desirable. 
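
The key:value lookup described above can be sketched with a plain Python `dict` acting as the hash table. This is only an illustration under stated assumptions: the helper name `build_index` is hypothetical, and the sketch is not how Xarray or pandas implement their indexes internally. It does show the single pass over all coordinates, and the check for repeated values, that the note refers to.

```python
def build_index(coords):
    """Map each coordinate label to its integer position.

    Hypothetical helper for illustration: one pass over all labels,
    raising an error if a label appears more than once.
    """
    index = {}
    for position, label in enumerate(coords):
        if label in index:
            raise ValueError(f"repeated coordinate value: {label!r}")
        index[label] = position
    return index


X = [1, 2, 4, 8, 16, 32]
Y = [10, 20, 30, 40, 50, 60]

index = build_index(X)     # built once, up front
position = index[8]        # constant-time lookup of the integer position
measurement = Y[position]  # the measurement at X=8
print(position, measurement)
```

Building the dictionary costs one pass over the coordinates, after which each lookup no longer depends on the number of coordinates.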
+ +## pandas.Index + +Xarray's [label-based selection](https://docs.xarray.dev/en/latest/user-guide/indexing.html#indexing-with-dimension-names) allows a more expressive and simple syntax in which you don't have to think about the index (`da.sel(x=8)` returns `40`). Up until now, Xarray has relied exclusively on [pandas.Index](https://pandas.pydata.org/docs/user_guide/indexing.html), which is still used by default: + +```python +import numpy as np +import xarray as xr + +x = np.array([1, 2, 4, 8, 16, 32]) +y = np.array([10, 20, 30, 40, 50, 60]) +da = xr.DataArray(y, coords={'x': x}) +da +``` + + + +```python +da.sel(x=8) +# 40 +``` + +## Alternatives to pandas.Index + +Importantly, a loop over all the coordinate values is not the only way to create an index. +You might recognize that our coordinates can in fact be represented by a function `X(i)=2**i` where `i` is the integer position! Given that function, we can quickly get measurement values at any position: `Y(X=8)` = `Y[log2(8)]` = `Y[3]=40`. Xarray now has a [CoordinateTransformIndex](https://xarray-indexes.readthedocs.io/blocks/transform.html) to handle this type of on-demand lookup of coordinate positions! + +### xarray RangeIndex + +A simple special case of `CoordinateTransformIndex` is a `RangeIndex`, where coordinates can be defined by a start, stop, and uniform step size. _`pandas.RangeIndex` only supports integers_, whereas Xarray handles floating-point values. Coordinate look-up is performed on the fly rather than loading all values into memory up front when creating a Dataset, which is critical for the example below that has a coordinate array of 7TB! + +```python +import xarray as xr +from xarray.indexes import RangeIndex + +index = RangeIndex.arange(0.0, 1000.0, 1e-9, dim='x') # 7TB coordinate array! +ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index)) +ds +``` + + + +Selection preserves the RangeIndex and does not require loading all the coordinates into memory. 
+ +```python +sliced = ds.isel(x=slice(1_000, 50_000, 100)) +sliced.x +``` + + + +## Third-party custom Indexes + +In addition to a few new built-in indexes, `xarray.Index` provides an API that allows dealing with coordinate data and metadata in a highly customizable way for the most common Xarray operations such as `sel`, `align`, `concat`, and `stack`. This is a powerful extension mechanism that is very important for supporting a multitude of domain-specific data structures. + +### rasterix RasterIndex + +Earlier we mentioned that coordinates often have a _functional representation_. +For 2D raster images, this function often takes the form of an [Affine Transform](https://en.wikipedia.org/wiki/Affine_transformation). +The [rasterix](https://github.com/xarray-contrib/rasterix) library extends Xarray with a `RasterIndex`, which computes coordinates for geospatial images such as GeoTIFFs via an affine transform. + +Below is a simple example of slicing a large mosaic of GeoTIFFs without ever loading the coordinates into memory. Note that a new affine transform is defined after the slicing operation: + +```python +import rasterix +import xarray as xr + +# 26475 GeoTIFFs represented by a GDAL VRT (811816322401 values!) +da = xr.open_dataarray('https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt', + engine='rasterio', + parse_coordinates=False).squeeze().pipe( + rasterix.assign_index +) +da +``` + + + +```python +print('Original geotransform:\n', da.xindexes['x'].transform()) +da_sliced = da.sel(x=slice(-122.4, -120.0), y=slice(-47.1, -49.0)) +print('Sliced geotransform:\n', da_sliced.xindexes['x'].transform()) +``` + +``` +Original geotransform: + | 0.00, 0.00,-180.00| +| 0.00,-0.00, 84.00| +| 0.00, 0.00, 1.00| + +Sliced geotransform: + | 0.00, 0.00,-122.40| +| 0.00,-0.00,-47.10| +| 0.00, 0.00, 1.00| +``` + +### XProj CRSIndex + +> real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc. 
[Xarray Docs](https://docs.xarray.dev/en/stable/getting-started-guide/why-xarray.html#what-labels-enable) + +We often think about metadata providing context for _measurement values_, but metadata is also critical for coordinates! +In particular, to align two different datasets we must ask if the coordinates are in the same coordinate system. +In other words, do they share the same origin and scale? + +There are currently over 7000 commonly used [Coordinate Reference Systems (CRS)](https://spatialreference.org/ref/epsg/) for geospatial data in the authoritative EPSG database, +and of course an unlimited number of custom-defined CRSs. +[xproj.CRSIndex](https://xproj.readthedocs.io/en/latest/) gives Xarray objects automatic awareness of the coordinate reference system, so that operations like `xr.align()` can raise an informative error when there is a CRS mismatch: + +```python +import numpy as np +import xarray as xr +from xproj import CRSIndex + +lons1 = np.arange(-125, -120, 1) +lons2 = np.arange(-122, -118, 1) +ds1 = xr.Dataset(coords={'longitude': lons1}).proj.assign_crs(crs=4267) +ds2 = xr.Dataset(coords={'longitude': lons2}).proj.assign_crs(crs=4326) +ds1 + ds2 +``` + +```pytb +MergeError: conflicting values/indexes on objects to be combined for coordinate 'crs' +``` + +### XVec GeometryIndex + +A "vector data cube" is an n-D array that has at least one dimension indexed by a 2-D array of vector geometries. +Large vector cubes can take advantage of an [R-tree spatial index](https://en.wikipedia.org/wiki/R-tree) for efficiently selecting vector geometries within a given bounding box. +The `xvec.GeometryIndex` provides this functionality. Below is a short code snippet, but please refer to the [documentation for more](https://xvec.readthedocs.io/en/stable/indexing.html)! + +```python +import xarray as xr +import xvec +import geopandas as gpd +from geodatasets import get_path + +# Dataset that contains demographic data indexed by U.S. 
counties +counties = gpd.read_file(get_path("geoda.natregimes")) + +cube = xr.Dataset( + data_vars=dict( + population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]), + unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]), + ), + coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]), +).xvec.set_geom_indexes("county", crs=counties.crs) +cube +``` + + + +```python +# Efficient selection using shapely.STRtree +from shapely.geometry import box + +subset = cube.xvec.query( + "county", + box(-125.4, 40, -120.0, 50), + predicate="intersects", +) + +subset['population'].xvec.plot(col='year'); +``` + +

+ +

+ +### Even more examples! + +Be sure to check out the [Gallery of Custom Index Examples](https://xarray-indexes.readthedocs.io) for more detailed examples of all the indexes mentioned in this post and more! + +## What's next? + +While we're extremely excited about what can _already_ be accomplished with the new indexing capabilities, there are plenty of exciting ideas for future work. + +Have an idea for your own custom index? Check out [this section of the Xarray documentation](https://docs.xarray.dev/en/stable/internals/how-to-create-custom-index.html). + +## Acknowledgments + +This work would not have been possible without technical input from the Xarray core team and community! +Several developers received essential funding from a [CZI Essential Open Source Software for Science (EOSS) grant](https://xarray.dev/blog/czi-eoss-grant-conclusion) as well as NASA's Open Source Tools, Frameworks, and Libraries (OSTFL) grant 80NSSC22K0345.