@@ -0,0 +1,267 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d8a7cf09",
"metadata": {},
"source": [
"# Scalability of anndata x alphapepttools\n",
"\n",
"Here, we demonstrate the scalability of anndata using an example dataset from Albrecht et al. (2025):\n",
"\n",
"> Albrecht, V., Müller-Reif, J. B., Brennsteiner, V. & Mann, M. A Simplified Perchloric Acid Workflow With Neutralization (PCA N) for Democratizing Deep Plasma Proteomics at Population Scale. Molecular & Cellular Proteomics 24, 101071 (2025).\n",
"\n",
"In this study, Albrecht et al. processed more than 2000 pooled plasma samples to benchmark the reproducibility of the PCA-N workflow as part of a larger clinical cohort study."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dfe6d9d",
"metadata": {},
"outputs": [],
"source": [
"import alphapepttools as at\n",
"import anndata as ad\n",
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "126e9c4d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"./albrecht_pcan_report.tsv already exists (9664.169023513794 MB)\n"
]
}
],
"source": [
"report_path = at.data.get_data(\"albrecht_pcan\", output_dir=\".\")"
]
},
{
"cell_type": "markdown",
"id": "5686b5b6",
"metadata": {},
"source": [
"We show that anndata can handle this comparatively large study and that the individual feature levels (precursors, proteins, genes) can be extracted from the report:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68ee8f62",
"metadata": {},
"outputs": [],
"source": [
"precursors: ad.AnnData = at.io.read_psm_table(\n",
" report_path,\n",
" search_engine=\"diann\",\n",
" level=\"precursors\",\n",
" intensity_column=\"Precursor.Normalised\",\n",
" feature_id_column=\"Precursor.Id\",\n",
" sample_id_column=\"Run\",\n",
" var_columns=[\n",
" \"Run\",\n",
" \"Stripped.Sequence\",\n",
" \"Precursor.Charge\",\n",
" \"RT\",\n",
" \"RT.Start\",\n",
" \"RT.Stop\",\n",
" \"IM\",\n",
" \"Protein.Group\",\n",
" \"Protein.Ids\",\n",
" \"Genes\",\n",
" \"MS2.Scan\",\n",
" \"CScore\",\n",
" \"Q.Value\",\n",
" \"Precursor.Id\",\n",
" \"Global.Q.Value\",\n",
" \"Global.PG.Q.Value\",\n",
" \"Lib.Q.Value\",\n",
" \"Lib.PG.Q.Value\",\n",
" ],\n",
")"
]
},
{
"cell_type": "markdown",
"id": "96061e92",
"metadata": {},
"source": [
"The metadata from the PSM report is retained in the `.var` attribute of the precursor anndata object."
]
},
{
"cell_type": "markdown",
"id": "1591d888",
"metadata": {},
"source": [
"Here, we subset the data to all precursors whose retention time is below 5 minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09e1a5e6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"View of AnnData object with n_obs × n_vars = 1801 × 3948\n",
" var: 'Run', 'Stripped.Sequence', 'Precursor.Charge', 'RT', 'RT.Start', 'RT.Stop', 'IM', 'Protein.Group', 'Protein.Ids', 'Genes', 'MS2.Scan', 'CScore', 'Q.Value', 'PG.MaxLFQ', 'Precursor.Normalised', 'Genes.MaxLFQ', 'Global.Q.Value', 'Global.PG.Q.Value', 'Lib.Q.Value', 'Lib.PG.Q.Value'"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"precursors[:, precursors.var_names[(precursors.var[\"RT\"] < 5)]] # noqa: PLR2004"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8d0325d",
"metadata": {},
"outputs": [],
"source": [
"# Use plain pandas indices for the observation and variable names\n",
"precursors.obs_names = pd.Index(precursors.obs.index.to_numpy())\n",
"precursors.var_names = pd.Index(precursors.var.index.to_numpy())\n",
"\n",
"# Rebuild `.var` from its raw values, which yields object-dtype columns\n",
"precursors.var = pd.DataFrame(\n",
"    precursors.var.convert_dtypes(dtype_backend=\"numpy_nullable\").values,\n",
"    index=precursors.var.index,\n",
"    columns=precursors.var.columns,\n",
")\n",
"\n",
"# Cast the numeric metadata columns back to float\n",
"float_columns = [\n",
"    \"Precursor.Charge\",\n",
"    \"RT\",\n",
"    \"RT.Start\",\n",
"    \"RT.Stop\",\n",
"    \"IM\",\n",
"    \"MS2.Scan\",\n",
"    \"Q.Value\",\n",
"    \"Global.Q.Value\",\n",
"    \"Global.PG.Q.Value\",\n",
"    \"Lib.Q.Value\",\n",
"    \"Lib.PG.Q.Value\",\n",
"]\n",
"for column in float_columns:\n",
"    precursors.var[column] = precursors.var[column].astype(float)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0647c3dd",
"metadata": {},
"outputs": [],
"source": [
"precursors.X = np.where(pd.isna(precursors.X), np.nan, precursors.X).astype(float)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c72a80fc",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"... storing 'Run' as categorical\n",
"... storing 'Stripped.Sequence' as categorical\n",
"... storing 'Protein.Group' as categorical\n",
"... storing 'Protein.Ids' as categorical\n",
"... storing 'Genes' as categorical\n",
"... storing 'CScore' as categorical\n"
]
}
],
"source": [
"precursors.write_h5ad(\"./albrecht.precursors.h5ad\")"
]
},
{
"cell_type": "markdown",
"id": "5ecad3b6",
"metadata": {},
"source": [
"We can process protein-level and gene-level information in the same way:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98f3cc47",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 1801 × 2161"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"proteins: ad.AnnData = at.io.read_psm_table(report_path, search_engine=\"diann\", level=\"proteins\")\n",
"proteins"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6bbec870",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 1801 × 2105"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"genes: ad.AnnData = at.io.read_psm_table(report_path, search_engine=\"diann\", level=\"genes\")\n",
"genes"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "alphatools",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
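The dtype cleanup and retention-time filter used in the notebook can be sketched on a small synthetic table. This is a minimal illustration, assuming only pandas and NumPy: the toy `var` frame, its index, and its values are made up and merely stand in for `precursors.var`; nothing here calls alphapepttools or anndata.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `precursors.var`: all columns have object dtype,
# as happens after rebuilding a DataFrame from a mixed-type `.values` array.
var = pd.DataFrame(
    {
        "Stripped.Sequence": ["PEPTIDEK", "ANQTHERR", "SAMPLEK"],
        "Precursor.Charge": [2, 3, 2],
        "RT": [3.2, 7.9, 4.4],
        "Q.Value": [0.001, 0.004, np.nan],
    },
    index=["PEPTIDEK2", "ANQTHERR3", "SAMPLEK2"],
    dtype=object,
)

# Cast the numeric columns back to float in a single loop.
float_columns = ["Precursor.Charge", "RT", "Q.Value"]
for column in float_columns:
    var[column] = var[column].astype(float)

# Boolean mask over features: keep those with a retention time below 5 minutes.
early = var.index[var["RT"] < 5]
```

Listing the numeric columns once keeps the casts in one place, so adding or removing a metadata column only changes the list.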
7 changes: 7 additions & 0 deletions src/alphapepttools/data/datasets.yaml
@@ -113,3 +113,10 @@
search_engine: diann
data_type: psm
description: Small example reports for testing PSM readers (DiaNN 1.9.0, parquet)

- name: albrecht_pcan
url: https://datashare.biochem.mpg.de/s/Bn6R9FQRj3Jn2ZB
search_engine: diann
data_type: psm
citation: "Albrecht, V., Müller-Reif, J. B., Brennsteiner, V. & Mann, M. A Simplified Perchloric Acid Workflow With Neutralization (PCA N) for Democratizing Deep Plasma Proteomics at Population Scale. Molecular & Cellular Proteomics 24, 101071 (2025)."
description: "More than 2000 pooled plasma samples processed with the PCA-N workflow (PXD064475)."
2 changes: 1 addition & 1 deletion tests/run_notebook_tests.sh
@@ -5,6 +5,6 @@
export IS_PYTEST_RUN=True

# TODO enable also study_03_biomarker_skin.ipynb
-ALL_NBS=$(find ../docs/notebooks -name "*.ipynb" | grep -v "study_03_biomarker_skin")
+ALL_NBS=$(find ../docs/notebooks -name "*.ipynb" | grep -v -e "study_03_biomarker_skin" -e "supplementary_02_scalability-demo_PCAn")

python -m pytest --nbmake $(echo $ALL_NBS)
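
The exclusion performed by `find ... | grep -v -e ... -e ...` can be mirrored in Python, for example to check which notebooks would be collected. This is a minimal sketch using only the standard library; the helper name `select_notebooks` is hypothetical and is not part of the test script.

```python
from pathlib import Path

# Substrings of notebook paths to skip, taken from the grep patterns in the script.
EXCLUDED = ("study_03_biomarker_skin", "supplementary_02_scalability-demo_PCAn")


def select_notebooks(root: str, excluded: tuple[str, ...] = EXCLUDED) -> list[Path]:
    """Return all .ipynb files under root whose path contains no excluded substring."""
    notebooks = sorted(Path(root).rglob("*.ipynb"))
    return [nb for nb in notebooks if not any(pattern in str(nb) for pattern in excluded)]
```

Calling `select_notebooks("../docs/notebooks")` from the `tests` directory would list the notebooks that remain after both exclusions.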