diff --git a/notebooks.md b/notebooks.md index 2877d3b..1294317 100644 --- a/notebooks.md +++ b/notebooks.md @@ -171,6 +171,18 @@ See the bottom of this page for links to analysis tutorials for external methods :maxdepth: 1 notebooks/examples/densenet.ipynb + + .. grid-item:: + + .. container:: custom-card + + .. image:: _static/img/table-queries.jpg + :target: notebooks/examples/table-queries.html + + .. toctree:: + :maxdepth: 1 + + notebooks/examples/table-queries.ipynb ``` ## Technology-specific diff --git a/notebooks/examples/table-queries.ipynb b/notebooks/examples/table-queries.ipynb new file mode 100644 index 0000000..a4a52d2 --- /dev/null +++ b/notebooks/examples/table-queries.ipynb @@ -0,0 +1,523 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6a1fb5d2", + "metadata": {}, + "source": [ + "# Filtering SpatialData elements with Table Queries\n", + "\n", + "## Introduction\n", + "\n", + "The `spatialdata` framework supports both the representation of `SpatialElement`s (images, labels, points, shapes) and of annotations for these elements. As we explored in the [tables](./tables.ipynb) notebook, some types of `SpatialElement`s can contain annotations within themselves, but the general approach we take is to represent `SpatialElement`s and annotations in separate objects using `AnnData` tables.\n", + "\n", + "In this notebook we introduce **table queries** - a filtering mechanism that allows you to subset both the annotations (tables) and their corresponding spatial elements using an expressive query syntax. This functionality is provided by the `filter_by_table_query()` function, which uses the [`annsel`](https://github.com/srivarra/annsel) library for building query expressions. Under the hood, `annsel` uses [`narwhals`](https://narwhals-dev.github.io/narwhals/), an \"*extremely lightweight and extensible compatibility layer between dataframe libraries*\". 
This notebook assumes that you have familiarized yourself with the content in the [tables](./tables.ipynb) notebook." + ] + }, + { + "cell_type": "markdown", + "id": "c50f0610", + "metadata": {}, + "source": [ + "## Setup and Data Loading\n", + "\n", + "Let's start by importing the necessary libraries and loading the example blobs dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b0b3286", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "import annsel as an\n", + "import numpy as np\n", + "\n", + "import spatialdata as sd\n", + "from spatialdata.datasets import blobs\n", + "\n", + "blobs_sdata = blobs()\n", + "blobs_sdata" + ] + }, + { + "cell_type": "markdown", + "id": "facac31d", + "metadata": {}, + "source": [ + "The table in the blobs dataset is rather minimal, so we will artificially add a few columns (`cell_type`, `cell_type_granular`, and `area`) to help illustrate the functionality." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65a84100", + "metadata": {}, + "outputs": [], + "source": [ + "rng = np.random.default_rng(123456)\n", + "\n", + "blobs_sdata.tables[\"table\"].obs[\"cell_type\"] = rng.choice(\n", + " [\"A\", \"B\", \"C\", \"C\", \"AA\", \"BB\", \"CC\"], size=blobs_sdata.tables[\"table\"].n_obs\n", + ")\n", + "blobs_sdata.tables[\"table\"].obs[\"cell_type_granular\"] = rng.choice(\n", + " [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\", \"G\", \"H\", \"I\", \"J\"], size=blobs_sdata.tables[\"table\"].n_obs\n", + ")\n", + "blobs_sdata.tables[\"table\"].obs[\"area\"] = rng.choice(\n", + " [10, 20, 30, 40, 50, 60, 70, 80, 90, 100], size=blobs_sdata.tables[\"table\"].n_obs\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "56161d84", + "metadata": {}, + "source": [ + "## Supported Operations" + ] + }, + { + "cell_type": "markdown", + "id": "8e2f2314", + "metadata": {}, + "source": [ + "## Basic Filtering Examples\n", + "\n", + "Now let's explore how to filter our blobs 
`SpatialData` object using table queries.\n", + "\n", + "The most common use case is to filter based on observations (`obs`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bea46b5", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(blobs_sdata, table_name=\"table\", obs_expr=an.col(\"cell_type\") == \"A\")\n", + "blobs_sdata_filtered" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9990439f", + "metadata": {}, + "outputs": [], + "source": [ + "print(\n", + " f\"\\nObservations reduced from {blobs_sdata.tables['table'].n_obs} to {blobs_sdata_filtered.tables['table'].n_obs}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e187bc4e", + "metadata": {}, + "source": [ + "### Breaking Down `an.col(\"cell_type\") == \"A\"`\n", + "\n", + "\n", + "\n", + "**What is `an.col(\"cell_type\")`?**\n", + "\n", + "`an.col(\"cell_type\")` creates a column reference that points to the \"cell_type\" column. On its own it does not specify whether that column lives in `obs` or `var`; by passing it to the `obs_expr` argument, you're telling the function to filter the `obs` component of the `AnnData` table based on this column. Think of it as saying \"I want to work with the cell_type column\".\n", + "\n", + "\n", + "**What does `== \"A\"` do?**\n", + "\n", + "`== \"A\"` applies an equality comparison to that column reference, creating a boolean condition that is `True` for rows where `cell_type` equals \"A\" and `False` everywhere else.\n", + "\n", + "**Why This Syntax Design?**\n", + "\n", + "These expressions are built and evaluated by `narwhals` under the hood. 
If you have a keen eye, you may notice that this syntax is similar to Polars, because the Narwhals API follows the ergonomics of Polars as closely as it can.\n" + ] + }, + { + "cell_type": "markdown", + "id": "43ac99c0", + "metadata": {}, + "source": [ + "Let's take a look at another example. This time we want to select the observations that belong to the `blobs_labels` region." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2c9e9bc6", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(\n", + " blobs_sdata,\n", + " table_name=\"table\",\n", + " obs_expr=an.col(\"region\") == \"blobs_labels\",\n", + ")\n", + "blobs_sdata_filtered" + ] + }, + { + "cell_type": "markdown", + "id": "3e3858a1", + "metadata": {}, + "source": [ + "Since all the observations in the table are from the `blobs_labels` element, the table query returns the same `AnnData` table in the resulting `SpatialData` object. Among the `SpatialElement`s, however, only the `blobs_labels` element is kept.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "a6575841", + "metadata": {}, + "source": [ + "You can also filter based on numeric values, as you'd expect." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "392bc718", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(blobs_sdata, table_name=\"table\", obs_expr=an.col(\"instance_id\") <= 10)\n", + "blobs_sdata_filtered" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e7d53b3", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(\n", + " blobs_sdata, table_name=\"table\", obs_expr=an.col(\"instance_id\").is_in([1, 3, 5, 8, 13])\n", + ")\n", + "blobs_sdata_filtered" + ] + }, + { + "cell_type": "markdown", + "id": "bb941902", + "metadata": {}, + "source": [ + "## Supported Operators and Expressions\n", + "\n", + "- `an.col(\"column_name\")` - reference a column in `obs` or `var`\n", + " - *Note:* Can be multiple columns, `an.col([\"column_name1\", \"column_name2\"])`\n", + "- Special \"columns\":\n", + " - `an.obs_names` - reference observation names (row indices, aka `AnnData.obs_names`)\n", + " - `an.var_names` - reference variable names (column names, aka `AnnData.var_names`)\n", + "- Comparison operators:\n", + " - `>`, `>=`, `<`, `<=`, `==`, `!=`\n", + "- Membership:\n", + " - `.is_in([list])`\n", + "- String methods:\n", + " - `.str.contains()`, `.str.starts_with()`, `.str.ends_with()`\n", + "- Logical:\n", + " - `&` (and), `|` (or), `~` (not)\n", + "\n", + "As long as an expression does not perform an aggregation under the hood or change length, it can be used.\n", + "\n", + "For a full list of supported operators and expressions, see the corresponding [narwhals documentation](https://narwhals-dev.github.io/narwhals/api-reference/expr/)." 
+ ] + }, + { + "cell_type": "markdown", + "id": "290b68ac", + "metadata": {}, + "source": [ + "We can also combine multiple expressions per table component (`obs`, `var`, etc.).\n", + "\n", + "Here we select observations whose `cell_type` starts with `\"A\"` or whose `cell_type_granular` is in `[\"A\", \"B\", \"C\"]`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "534b44ed", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(\n", + " blobs_sdata,\n", + " table_name=\"table\",\n", + " obs_expr=((an.col(\"cell_type\").str.starts_with(\"A\")) | (an.col(\"cell_type_granular\").is_in([\"A\", \"B\", \"C\"]))),\n", + ")\n", + "blobs_sdata_filtered" + ] + }, + { + "cell_type": "markdown", + "id": "e08c41da", + "metadata": {}, + "source": [ + "There are two ways to combine expressions with a logical \"and\" in table queries:\n", + "\n", + "1. Using the `&` operator between two expressions\n", + "2. Using a tuple of expressions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9902555b", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(\n", + " blobs_sdata,\n", + " table_name=\"table\",\n", + " obs_expr=((an.col(\"cell_type\").str.starts_with(\"A\")), (an.col(\"cell_type_granular\").is_in([\"A\", \"B\", \"C\"]))),\n", + ")\n", + "blobs_sdata_filtered.tables[\"table\"].obs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "affdf97e", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(\n", + " blobs_sdata,\n", + " table_name=\"table\",\n", + " obs_expr=((an.col(\"cell_type\").str.starts_with(\"A\")) & (an.col(\"cell_type_granular\").is_in([\"A\", \"B\", \"C\"]))),\n", + ")\n", + "blobs_sdata_filtered.tables[\"table\"].obs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc281d89", + "metadata": {}, + "outputs": [], + 
"source": [ + "blobs_sdata_filtered.tables[\"table\"].var_names" + ] + }, + { + "cell_type": "markdown", + "id": "c07836a6", + "metadata": {}, + "source": [ + "Suppose you are only interested in observations whose value for the variable `channel_0_sum` in the `X` matrix is greater than 125. We can filter on that column of the matrix with `x_expr`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8bc5b876", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(\n", + " blobs_sdata,\n", + " table_name=\"table\",\n", + " x_expr=an.col(\"channel_0_sum\") > 125,\n", + ")\n", + "blobs_sdata_filtered.tables[\"table\"].obs" + ] + }, + { + "cell_type": "markdown", + "id": "5924fae5", + "metadata": {}, + "source": [ + "And of course you can combine filters across different components of the `AnnData` table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c8738e1", + "metadata": {}, + "outputs": [], + "source": [ + "blobs_sdata_filtered = sd.filter_by_table_query(\n", + " blobs_sdata,\n", + " table_name=\"table\",\n", + " obs_expr=an.col(\"cell_type\") == \"B\",\n", + " x_expr=an.col(\"channel_0_sum\") > 125,\n", + ")\n", + "blobs_sdata_filtered.tables[\"table\"].obs" + ] + }, + { + "cell_type": "markdown", + "id": "f69cacac", + "metadata": {}, + "source": [ + "## Using a Real Dataset" + ] + }, + { + "cell_type": "markdown", + "id": "0b76dd36", + "metadata": {}, + "source": [ + "To wrap up the notebook, we'll briefly use the queries on a real dataset.\n", + "\n", + "Here we'll take a look at querying the [mibitof dataset](https://spatialdata.scverse.org/en/stable/tutorials/notebooks/datasets/README.html). 
In addition there is a companion notebook " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f475c27a", + "metadata": {}, + "outputs": [], + "source": [ + "mibitof_zarr_path = Path(\"~/Downloads/mibitof.zarr\").expanduser()\n", + "\n", + "mibitof_sdata = sd.read_zarr(mibitof_zarr_path)\n", + "mibitof_sdata" + ] + }, + { + "cell_type": "markdown", + "id": "2ec3c23a", + "metadata": {}, + "source": [ + "Let's also take a brief look at the `obs` component of the `AnnData` table. Here are a few columns of interest:\n", + "\n", + "- `point`: The name of the Field of View (FOV) that an observation belongs to (in this case, the observations are cells)\n", + "- `cell_size`: The area of a cell\n", + "- `donor`: The donor that the cell is from\n", + "- `Cluster`: The cluster / cell type that the cell belongs to\n", + "- `batch`: The batch that the cell is from (usually with respect to the donor or point / FOV)\n", + "- `library_id`: An identifier pointing to which `SpatialElement` the observation belongs to." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9652e669", + "metadata": {}, + "outputs": [], + "source": [ + "mibitof_sdata.tables[\"table\"].obs" + ] + }, + { + "cell_type": "markdown", + "id": "97e10b27", + "metadata": {}, + "source": [ + "In this example, we're picking donor \"21d7\" and keeping `vars` that either start with `\"CD\"` or are `\"ASCT2\"` or `\"ATP5A\"`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8594cdfd", + "metadata": {}, + "outputs": [], + "source": [ + "mibitof_sdata_filtered = sd.filter_by_table_query(\n", + " mibitof_sdata,\n", + " # filter_tables=False,\n", + " table_name=\"table\",\n", + " obs_expr=an.col(\"donor\") == \"21d7\",\n", + " var_names_expr=(an.var_names.is_in([\"ASCT2\", \"ATP5A\"]) | an.var_names.str.starts_with(\"CD\")),\n", + ")\n", + "mibitof_sdata_filtered" + ] + }, + { + "cell_type": "markdown", + "id": "b64edfdf", + "metadata": {}, + "source": [ + "If your `SpatialData` object has a lot of `SpatialElement`s and you only want to apply the filter to a subset of them, you can use the `element_names` parameter to specify which ones to use for the filter.\n", + "\n", + "As a final example, let's take it up a few notches and use most of the features of the `filter_by_table_query` function. We will also use the method version of the query instead of the function. They behave the same way, except that the method version passes in its own `SpatialData` object.\n", + "\n", + "\n", + "We'll be subsetting to specific `SpatialElement`s, and applying filters across the `obs`, `var`, and `X` components of the `AnnData` table with a variety of queries." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1e0b5ba", + "metadata": {}, + "outputs": [], + "source": [ + "mibitof_sdata_filtered = mibitof_sdata.filter_by_table_query(\n", + " table_name=\"table\",\n", + " element_names=[\"point23_labels\", \"point8_labels\"],\n", + " # Filter observations (obs) based on multiple conditions\n", + " obs_expr=(\n", + " # Cells from donor 21d7 OR 90de\n", + " an.col(\"donor\").is_in([\"21d7\", \"90de\"])\n", + " # AND cells with size greater than 400\n", + " & (an.col(\"cell_size\") > 400)\n", + " # AND cells that are either Epithelial or contain \"Tcell\" in their cluster name\n", + " & ((an.col(\"Cluster\") == \"Epithelial\")\n", + " | (an.col(\"Cluster\").str.contains(\"Tcell\")))\n", + " ),\n", + " # Filter variables (var) based on multiple conditions\n", + " var_names_expr=(\n", + " # Select columns that start with CD\n", + " an.var_names.str.starts_with(\"CD\")\n", + " # OR columns that contain \"ATP\"\n", + " | an.var_names.str.contains(\"ATP\")\n", + " # OR specific columns\n", + " | an.var_names.is_in([\"ASCT2\", \"PKM2\", \"SMA\"])\n", + " ),\n", + " # Filter based on expression values\n", + " x_expr=(\n", + " # Keep cells where ASCT2 is greater than 0.1\n", + " (an.col(\"ASCT2\") > 0.1)\n", + " # AND less than 2 for ASCT2\n", + " & (an.col(\"ASCT2\") < 2)\n", + " ),\n", + " how=\"right\",\n", + ")\n", + "mibitof_sdata_filtered" + ] + }, + { + "cell_type": "markdown", + "id": "b81c62ff", + "metadata": {}, + "source": [ + "To wrap up, there are a few things to note:\n", + "\n", + "1. **NOTE:** `SpatialElement`s are filtered, but the components within those elements are not.\n", + " 1. For example, when we filter by the `obs` table and get a subset of the Label `SpatialElement`s, the individual segmentation masks are not modified; each kept element has exactly the same masks as the original Label `SpatialElement`.\n", + "2. 
A layer of a given `AnnData` table can be used by specifying the `layer` parameter in the `filter_by_table_query` function.\n", + "3. You can use either the method or the function; they behave exactly the same." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}