@srivarra srivarra commented Mar 3, 2025

Added filtering by a table query as discussed in #626. Added both a standalone function sd.filter_by_table_query and a method sd.SpatialData.filter_by_table_query.

Function signature

class SpatialData:
    ...

    def filter_by_table_query(
        self,
        table_name: str,
        filter_tables: bool = True,
        elements: list[str] | None = None,
        obs_expr: Predicates | None = None,
        var_expr: Predicates | None = None,
        x_expr: Predicates | None = None,
        obs_names_expr: Predicates | None = None,
        var_names_expr: Predicates | None = None,
        layer: str | None = None,
        how: Literal["left", "left_exclusive", "inner", "right", "right_exclusive"] = "right",
    ) -> SpatialData:

The standalone sd.filter_by_table_query has the same signature, except that it takes the SpatialData object of interest as its first argument in place of self.
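The `how` options are named like SQL joins between the spatial elements (left side) and the table (right side). The sketch below is only a plain-Python analogy on sets of instance ids. `join_ids` is a hypothetical helper, not part of spatialdata, and the semantics shown are an assumption that should be checked against the spatialdata join documentation.

```python
def join_ids(element_ids: set, table_ids: set, how: str) -> tuple:
    """Return (kept element ids, kept table ids) for a SQL-style join mode.

    Hypothetical illustration only; not the spatialdata implementation.
    """
    matched = element_ids & table_ids
    if how == "left":
        # Keep every element instance, and only the table rows that match.
        return element_ids, matched
    if how == "left_exclusive":
        # Keep only element instances with no matching table row.
        return element_ids - table_ids, set()
    if how == "inner":
        # Keep only instances present on both sides.
        return matched, matched
    if how == "right":
        # Keep every table row, and only the element instances that match.
        return matched, table_ids
    if how == "right_exclusive":
        # Keep only table rows with no matching element instance.
        return set(), table_ids - element_ids
    raise ValueError(f"unknown join mode: {how!r}")


element_ids = {1, 2, 3, 4}
table_ids = {3, 4, 5}
right_result = join_ids(element_ids, table_ids, "right")
inner_result = join_ids(element_ids, table_ids, "inner")
```

Under this reading, the default `how="right"` keeps all rows that survive the table query and trims each element down to the instances those rows annotate.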


What expressions can you use?

  • Most narwhals expression methods are supported, as long as the method does not aggregate.
    • I know that the following work: >, >=, <, <=, ==, is_in.
    • From Expr.str, contains, starts_with, and ends_with work.
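To illustrate why aggregating methods are excluded, here is a minimal plain-Python sketch (no narwhals or annsel involved; the `cell_sizes` values are made up). An element-wise predicate yields one boolean per row and can serve as a filter mask, whereas an aggregation collapses the column to a single value.

```python
# Made-up column of values, standing in for something like obs["cell_size"].
cell_sizes = [120, 450, 610, 380]

# Element-wise predicate: one boolean per row, usable as a row mask.
mask = [size > 400 for size in cell_sizes]
filtered = [size for size, keep in zip(cell_sizes, mask) if keep]

# Aggregation: reduces the whole column to one number, which cannot tell
# the filter which individual rows to keep.
mean_size = sum(cell_sizes) / len(cell_sizes)
```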

What parts can you filter on?

You can filter on the obs and var DataFrame attributes of AnnData.

You can filter on obs_names and var_names (use an.obs_names and an.var_names instead of an.col).

You can also filter on the expression matrix X, or on a specific layer via the layer argument.
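As a rough mental model of filtering on X with respect to a layer, the sketch below selects rows whose value in a named variable satisfies a predicate. It is plain Python with made-up numbers; `rows_where`, the `"scaled"` layer name, and the data are hypothetical and do not reflect how spatialdata or annsel implement this.

```python
from typing import Callable, List, Optional

var_names = ["ASCT2", "CD8"]

# None stands in for the default matrix X; "scaled" is a made-up layer name.
layers = {
    None: [[0.05, 1.0], [0.50, 0.0], [1.50, 2.0]],
    "scaled": [[0.0, 0.3], [0.9, 0.1], [0.2, 0.8]],
}


def rows_where(
    var: str,
    predicate: Callable[[float], bool],
    layer: Optional[str] = None,
) -> List[int]:
    """Return indices of rows whose value for `var` satisfies `predicate`."""
    matrix = layers[layer]
    col = var_names.index(var)
    return [i for i, row in enumerate(matrix) if predicate(row[col])]


# Same variable, evaluated against X and against the hypothetical layer.
kept_default = rows_where("ASCT2", lambda v: 0.1 < v < 2)
kept_scaled = rows_where("ASCT2", lambda v: v > 0.5, layer="scaled")
```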


Some Examples

# Using the mibitof dataset because it's small and has a table that covers multiple SpatialData elements.

import spatialdata as sd
import annsel as an
from upath import UPath

# expanduser() resolves "~" to the home directory before the store is opened
mibitof_path = UPath("~/Downloads/mibitof-dataset.zarr").expanduser()

sdata = sd.read_zarr(mibitof_path)

sdata
SpatialData Repr
SpatialData object, with associated Zarr store: /Users/srivarra/Downloads/mibitof-dataset.zarr
├── Images
│     ├── 'point8_image': DataArray[cyx] (3, 1024, 1024)
│     ├── 'point16_image': DataArray[cyx] (3, 1024, 1024)
│     └── 'point23_image': DataArray[cyx] (3, 1024, 1024)
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_image (Images), point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_image (Images), point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_image (Images), point23_labels (Labels)

For context here is what the table looks like:

AnnData object with n_obs × n_vars = 3309 × 36
    obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
    uns: 'spatialdata_attrs'
    obsm: 'X_scanorama', 'X_umap', 'spatial'

  1. Filter with respect to the donor "21d7", and keep the var_names "ASCT2" and "ATP5A" plus any marker that starts with "CD".
sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_expr=an.col("donor") == "21d7",
    var_names_expr=(
        an.var_names.is_in(["ASCT2", "ATP5A"])
        | an.var_names.str.starts_with("CD")
    ),
    x_expr=None,
)
Output

SpatialData object
├── Labels
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (1241, 14)
with coordinate systems:
    ▸ 'point23', with elements:
        point23_labels (Labels)

  1. Filter by batches "0" and "1".
sdata.filter_by_table_query(
    table_name="table",
    obs_expr=an.col("batch").is_in(["1", "0"]),
)
Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (2286, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)

  1. Filter by obs_names which start with "9"
sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_names_expr=an.obs_names.str.starts_with("9")
)
Output

SpatialData object
├── Labels
│     └── 'point8_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (624, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)

  1. Note that a tuple of expressions is combined with the & operator.
sd.filter_by_table_query(
    sdata,
    table_name="table",
    var_names_expr=(an.var_names.str.contains("CD"), an.var_names == "CD8"),
    x_expr=None,
)
Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 1)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)
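The tuple behaviour in the example above can be pictured as folding all predicates together with a logical AND. This is a plain-Python sketch with a made-up list of marker names; `combine_all` is a hypothetical helper, not an annsel or spatialdata function.

```python
from functools import reduce

# Made-up marker names, standing in for var_names.
var_names = ["CD3", "CD8", "CD45", "ASCT2"]

# Two element-wise predicates, each yielding one boolean per name.
contains_cd = ["CD" in name for name in var_names]
equals_cd8 = [name == "CD8" for name in var_names]


def combine_all(*masks):
    """AND together any number of boolean masks, element by element."""
    return reduce(lambda a, b: [x and y for x, y in zip(a, b)], masks)


combined = combine_all(contains_cd, equals_cd8)
selected = [name for name, keep in zip(var_names, combined) if keep]
```

Only `"CD8"` satisfies both predicates here, which mirrors the single remaining variable in the output above.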

  1. Complex query.
sd.filter_by_table_query(
    sdata,
    elements=["point23_labels", "point8_labels"],
    table_name="table",
    # Filter observations (rows) based on multiple conditions
    obs_expr=(
        # Cells from donor 21d7 OR 90de
        an.col("donor").is_in(["21d7", "90de"])
        # AND cells with size greater than 400
        & (an.col("cell_size") > 400)
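        # AND cells in the Epithelial cluster
        & (an.col("Cluster") == "Epithelial")
        # OR any cell whose cluster name contains "Tcell"; & binds tighter than |,
        # so this branch is not restricted by the donor and size conditions
        | (an.col("Cluster").str.contains("Tcell"))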
    ),
    # Filter variables (columns) based on multiple conditions
    var_names_expr=(
        # Select columns that start with CD
        an.var_names.str.starts_with("CD")
        # OR columns that contain "ATP"
        | an.var_names.str.contains("ATP")
        # OR specific columns
        | an.var_names.is_in(["ASCT2", "PKM2", "SMA"])
    ),
    # Filter based on expression values
    x_expr=(
        # Keep cells where ASCT2 is greater than 0.1
        (an.col("ASCT2") > 0.1)
        # AND less than 2 for ASCT2
        & (an.col("ASCT2") < 2)
    ),
    how="right",
)
Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (268, 17)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)
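One thing to watch in the `obs_expr` of the complex query: it chains `&` and `|` without grouping the two `Cluster` conditions, and in Python `&` binds more tightly than `|`. The sketch below shows the difference with plain sets of made-up cell ids (the same precedence rules apply to expression objects, since precedence is fixed by the language).

```python
# Made-up cell ids satisfying each condition.
donor_ok = {1, 2, 3, 4}
large = {2, 3, 4, 5}
epithelial = {2, 6}
tcell = {3, 7}

# Parsed as (donor_ok & large & epithelial) | tcell
as_written = donor_ok & large & epithelial | tcell

# Explicit grouping: donor AND size AND (Epithelial OR Tcell)
grouped = donor_ok & large & (epithelial | tcell)
```

In the ungrouped form, cell 7 is kept even though it fails the donor and size conditions. Wrapping the two `Cluster` conditions in parentheses would restrict both branches.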


Other things to note:

I added a more complex SpatialData object for testing in conftest.py. I am not sure whether it belongs there or somewhere else, or whether I should make better use of what is already there.

  • Any thoughts or suggestions?
  • Is this a feature which requires a tutorial notebook or additions to an already existing one?

Notebook: Table Queries

codecov bot commented Mar 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.09%. Comparing base (60be9ce) to head (1677e35).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
- Coverage   92.10%   92.09%   -0.01%     
==========================================
  Files          48       48              
  Lines        7433     7442       +9     
==========================================
+ Hits         6846     6854       +8     
- Misses        587      588       +1     
| Files with missing lines | Coverage Δ |
| src/spatialdata/__init__.py | 96.42% <ø> (ø) |
| src/spatialdata/_core/query/relational_query.py | 91.23% <100.00%> (+0.09%) ⬆️ |
| src/spatialdata/_core/spatialdata.py | 91.49% <100.00%> (+0.03%) ⬆️ |

... and 1 file with indirect coverage changes


@srivarra srivarra marked this pull request as draft March 6, 2025 05:01
@srivarra srivarra marked this pull request as ready for review April 10, 2025 21:05
@melonora
Collaborator

Hey @srivarra, thanks for the PR! Sorry for not getting to it earlier; the last few months have been a bit hectic with the PhD, but I am checking it now. We might want to incorporate a tutorial notebook, given that people could be new to this way of writing queries, but in any case I am in favor of this kind of syntax.

@srivarra
Author

@melonora No worries, hope the PhD is going well! Sounds good, I'll draft up a tutorial notebook and request a review when it's done.

@melonora
Collaborator

melonora commented May 27, 2025

Awesome!

@srivarra
Author

@melonora Would the proper process be:

  1. Make an issue in scverse/spatialdata-notebooks
  2. Make a PR there for the notebook
  3. ??? Find a way to link it to this branch? A bit confused on this part.

tyty

@melonora
Collaborator

melonora commented Jun 1, 2025

@srivarra Just open a PR in scverse/spatialdata-notebooks :) You can title it "table queries".
