Added `filter_table_by_query` #894

srivarra · 2025-03-03T07:22:06Z

Added filtering by a table query as discussed in #626. Added both a standalone function sd.filter_by_table_query and a method sd.SpatialData.filter_by_table_query.

Function signature

class SpatialData:
    ...

    def filter_by_table_query(
        self,
        table_name: str,
        filter_tables: bool = True,
        elements: list[str] | None = None,
        obs_expr: Predicates | None = None,
        var_expr: Predicates | None = None,
        x_expr: Predicates | None = None,
        obs_names_expr: Predicates | None = None,
        var_names_expr: Predicates | None = None,
        layer: str | None = None,
        how: Literal["left", "left_exclusive", "inner", "right", "right_exclusive"] = "right",
    ) -> SpatialData:
        ...

sd.filter_by_table_query is the same, but instead of self, you have to provide the SpatialData object of interest.
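
The `how` options read like join modes between element instances and table rows. The sketch below only illustrates that idea with plain Python sets, assuming the usual join semantics; it is not the spatialdata implementation, and `join_ids` and the ids are hypothetical.

```python
def join_ids(element_ids, table_ids, how="right"):
    """Return (kept_element_ids, kept_table_ids) for a given join mode.

    Hypothetical helper: element instances and table rows are modelled as ids.
    """
    elements, table = set(element_ids), set(table_ids)
    if how == "left":
        # keep every element instance; keep only table rows that match one
        return sorted(elements), sorted(table & elements)
    if how == "left_exclusive":
        # keep only element instances that have no table row
        return sorted(elements - table), []
    if how == "inner":
        both = sorted(elements & table)
        return both, both
    if how == "right":
        # keep every table row; keep only element instances that match one
        return sorted(elements & table), sorted(table)
    if how == "right_exclusive":
        # keep only table rows that have no element instance
        return [], sorted(table - elements)
    raise ValueError(f"unknown join mode: {how!r}")


labels = [1, 2, 3, 4]  # instance ids present in a labels element (made up)
rows = [3, 4, 5]       # instance ids annotated by the filtered table (made up)

print(join_ids(labels, rows, how="right"))  # ([3, 4], [3, 4, 5])
print(join_ids(labels, rows, how="left"))   # ([1, 2, 3, 4], [3, 4])
```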

What expressions can you use?

narwhals supports many expression methods, and any of them can be used as long as the method doesn't aggregate.
- I know that the following work: >, >=, <, <=, ==, is_in.
- From Expr.str, contains, starts_with, and ends_with work.
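
The "doesn't aggregate" condition can be pictured with plain Python: a filter predicate has to produce one boolean per row, whereas an aggregation collapses the column to a single value and so cannot act as a row mask. This is an illustration only, not narwhals or annsel code, and the column values are made up.

```python
cell_size = [350, 420, 610, 90]  # made-up values for an obs column

# Element-wise comparison: one boolean per row, usable as a filter mask.
mask = [size > 400 for size in cell_size]
kept = [size for size, keep in zip(cell_size, mask) if keep]

# Aggregation: a single value for the whole column, not a row mask.
mean_size = sum(cell_size) / len(cell_size)

print(mask)       # [False, True, True, False]
print(kept)       # [420, 610]
print(mean_size)  # 367.5
```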

What parts can you filter on?

You can filter on the obs and var DataFrame attributes of AnnData.

You can filter on obs_names and var_names (use an.obs_names and an.var_names instead of an.col).

You can also filter on the expression matrix X, optionally with respect to a specific layer.
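
To make these targets concrete, here is a toy table in plain Python: `obs` holds per-cell metadata, `var_names` names the columns of `X`, and `X` holds the expression values. The data and the way the filters are combined are assumptions for illustration, not the AnnData or annsel API.

```python
obs = [
    {"name": "0", "donor": "21d7"},
    {"name": "1", "donor": "90de"},
    {"name": "2", "donor": "21d7"},
]
var_names = ["ASCT2", "CD8", "SMA"]
X = [
    [0.5, 1.2, 0.0],
    [0.05, 0.3, 2.1],
    [3.0, 0.0, 0.4],
]

# obs filter: keep rows whose metadata matches.
rows_by_obs = [i for i, row in enumerate(obs) if row["donor"] == "21d7"]

# X filter: keep rows whose value in a named column satisfies a condition.
col = var_names.index("ASCT2")
rows_by_x = [i for i, values in enumerate(X) if 0.1 < values[col] < 2]

# var_names filter: keep columns by name.
cols_by_name = [j for j, name in enumerate(var_names) if name.startswith("CD")]

# Assumed here: row filters are combined with AND, then columns are selected.
rows = [i for i in rows_by_obs if i in rows_by_x]
subset = [[X[i][j] for j in cols_by_name] for i in rows]

print(rows, cols_by_name, subset)  # [0] [1] [[1.2]]
```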

Some Examples

# Using the mibitof dataset cause it's small and has a table which covers multiple spatialdata elements.

import spatialdata as sd
import annsel as an
from upath import UPath

mibitof_path = UPath("~/Downloads/mibitof-dataset.zarr")

sdata = sd.read_zarr(mibitof_path)

sdata

SpatialData Repr

SpatialData object, with associated Zarr store: /Users/srivarra/Downloads/mibitof-dataset.zarr
├── Images
│     ├── 'point8_image': DataArray[cyx] (3, 1024, 1024)
│     ├── 'point16_image': DataArray[cyx] (3, 1024, 1024)
│     └── 'point23_image': DataArray[cyx] (3, 1024, 1024)
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_image (Images), point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_image (Images), point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_image (Images), point23_labels (Labels)

For context here is what the table looks like:

AnnData object with n_obs × n_vars = 3309 × 36
    obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
    uns: 'spatialdata_attrs'
    obsm: 'X_scanorama', 'X_umap', 'spatial'

Filter with respect to the donor "21d7", and filter var_names to keep "ASCT2", "ATP5A", and any marker that starts with "CD".

sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_expr=an.col("donor") == "21d7",
    var_names_expr=(
        an.var_names.is_in(["ASCT2", "ATP5A"])
        | an.var_names.str.starts_with("CD")
    ),
    x_expr=None,
)

Output

SpatialData object
├── Labels
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (1241, 14)
with coordinate systems:
    ▸ 'point23', with elements:
        point23_labels (Labels)

Filter by batches "0" and "1".

sdata.filter_by_table_query(
    table_name="table",
    obs_expr=an.col("batch").is_in(["1", "0"]),
)

Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (2286, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)

Filter by obs_names which start with "9"

sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_names_expr=an.obs_names.str.starts_with("9")
)

Output

SpatialData object
├── Labels
│     └── 'point8_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (624, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)

Note that a tuple of expressions is combined with the & operator.

sd.filter_by_table_query(
    sdata,
    table_name="table",
    var_names_expr=(an.var_names.str.contains("CD"), an.var_names == "CD8"),
    x_expr=None,
)

Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 1)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)
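
The tuple behaviour can be mimicked in plain Python: every predicate in the tuple has to hold for a name to be kept. This is a rough sketch with made-up marker names and a hypothetical `all_of` helper, not how annsel evaluates expressions.

```python
var_names = ["ASCT2", "CD3", "CD8", "CD45", "SMA"]  # made-up subset of markers

# A tuple of predicates, each giving one boolean per name.
predicates = (
    lambda name: "CD" in name,
    lambda name: name == "CD8",
)


def all_of(preds, value):
    """Combine predicates with logical AND (hypothetical helper)."""
    return all(pred(value) for pred in preds)


selected = [name for name in var_names if all_of(predicates, name)]
print(selected)  # ['CD8']
```

Only one name satisfies both conditions, which matches the single remaining variable in the output above.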

Complex query.

sd.filter_by_table_query(
    sdata,
    elements=["point23_labels", "point8_labels"],
    table_name="table",
    # Filter observations (rows) based on multiple conditions
    obs_expr=(
        # NOTE: `&` binds tighter than `|` in Python, so this evaluates as
        # (donor AND cell_size AND Epithelial) OR Tcell.
        # Cells from donor 21d7 OR 90de
        an.col("donor").is_in(["21d7", "90de"])
        # AND cells with size greater than 400
        & (an.col("cell_size") > 400)
        # AND cells in the Epithelial cluster
        & (an.col("Cluster") == "Epithelial")
        # OR any cell whose cluster name contains "Tcell" (regardless of donor or size)
        | (an.col("Cluster").str.contains("Tcell"))
    ),
    # Filter variables (columns) based on multiple conditions
    var_names_expr=(
        # Select columns that start with CD
        an.var_names.str.starts_with("CD")
        # OR columns that contain "ATP"
        | an.var_names.str.contains("ATP")
        # OR specific columns
        | an.var_names.is_in(["ASCT2", "PKM2", "SMA"])
    ),
    # Filter based on expression values
    x_expr=(
        # Keep cells where ASCT2 is greater than 0.1
        (an.col("ASCT2") > 0.1)
        # AND less than 2 for ASCT2
        & (an.col("ASCT2") < 2)
    ),
    how="right",
)

Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (268, 17)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)
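
The `&` and `|` operators in the obs_expr above follow Python's precedence rules, where `&` binds tighter than `|`. A minimal sketch with plain booleans (the variable names are hypothetical stand-ins for the per-cell conditions) shows how the grouping changes the result:

```python
# One hypothetical cell: wrong donor, large enough, a T cell but not epithelial.
donor_ok, size_ok = False, True
epithelial, tcell = False, True

# How Python parses `a & b & c | d`:
unparenthesized = donor_ok & size_ok & epithelial | tcell
explicit = (donor_ok & size_ok & epithelial) | tcell

# Grouping the two Cluster conditions together instead:
grouped = donor_ok & size_ok & (epithelial | tcell)

print(unparenthesized, explicit, grouped)  # True True False
```

If the donor and size conditions should apply to both cluster conditions, the two Cluster terms need their own pair of parentheses.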

Other things to note:

I added a more complex SpatialData for testing in conftest.py. I do not know if this should be there or somewhere else, or if I should make better use of what's there currently.

Any thoughts or suggestions?
Is this a feature which requires a tutorial notebook or additions to an already existing one?

Notebook: Table Queries

codecov · 2025-03-03T07:29:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.09%. Comparing base (60be9ce) to head (1677e35).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
- Coverage   92.10%   92.09%   -0.01%     
==========================================
  Files          48       48              
  Lines        7433     7442       +9     
==========================================
+ Hits         6846     6854       +8     
- Misses        587      588       +1

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/spatialdata/__init__.py | `96.42% <ø> (ø)` | |
| src/spatialdata/_core/query/relational_query.py | `91.23% <100.00%> (+0.09%)` | ⬆️ |
| src/spatialdata/_core/spatialdata.py | `91.49% <100.00%> (+0.03%)` | ⬆️ |

... and 1 file with indirect coverage changes

melonora · 2025-05-27T14:32:12Z

Hey @srivarra, thanks for the PR! Sorry for not getting to it earlier, the last few months have been a bit hectic with the PhD, but I'm checking it now. We might incorporate a tutorial notebook, given that people could be new to this way of writing queries, but in any case I am in favor of this kind of syntax.

srivarra · 2025-05-27T17:52:00Z

@melonora No worries, hope the PhD is going well! Sounds good, I'll draft up a tutorial notebook and request a review when it's done.

melonora · 2025-05-27T20:13:28Z

Awesome!

srivarra · 2025-05-28T21:42:30Z

@melonora Would the proper process be:

1. Make an issue in scverse/spatialdata-notebooks
2. Make a PR there for the notebook
3. ??? Find a way to link it to this branch? A bit confused on this part.

tyty

melonora · 2025-06-01T12:56:52Z

@srivarra Just open a PR in scverse/spatialdata-notebooks :) As a title, you can give it "table queries".

added filter_table_by_query

3140fc0

added tests

7b9ec9b

srivarra marked this pull request as draft March 6, 2025 05:01

srivarra added 7 commits March 10, 2025 22:17

Merge branch 'main' into features/filter_by_table_query

65c58a3

updated annsel, udjusted test

559cc59

fixed docstring: func instead of method

add5880

using SpatialData method for one test for ci

d6a9b95

removed explicit optional in docstrings

877d85c

updated docstring Notes

7998541

updated api/operations.md

3fb202e

LucaMarconato mentioned this pull request Mar 15, 2025

Is there a way to subset cell_boundaries using an AnnData table? #898

Closed

Pancreas-Pratik mentioned this pull request Apr 2, 2025

module 'spatialdata' has no attribute 'match_sdata_to_table' #912

Closed

updated annsel version

33423f6

srivarra marked this pull request as ready for review April 10, 2025 21:05

srivarra and others added 2 commits April 22, 2025 12:24

Merge branch 'main' into features/filter_by_table_query

2285aa8

Merge branch 'main' into features/filter_by_table_query

1677e35

LucaMarconato mentioned this pull request May 28, 2025

Function to subset the entire spatialdata object scverse/squidpy#1007

Open

Merge branch 'main' into features/filter_by_table_query

deabf30

srivarra mentioned this pull request Jul 28, 2025

Table Queries scverse/spatialdata-notebooks#147

Open

1 task
