MannLabs · lucas-diedrich · Feb 3, 2026
diff --git a/docs/notebooks/supplementary/supplementary_02_scalability-demo_PCAn.ipynb b/docs/notebooks/supplementary/supplementary_02_scalability-demo_PCAn.ipynb
@@ -0,0 +1,267 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d8a7cf09",
+   "metadata": {},
+   "source": [
+    "# Scalability of anndata x alphapepttools\n",
+    "\n",
+    "Here, we demonstrate the scalability of anndata using an example dataset from Albrecht et al. (2025):\n",
+    "\n",
+    "> Albrecht, V., Müller-Reif, J. B., Brennsteiner, V. & Mann, M. A Simplified Perchloric Acid Workflow With Neutralization (PCA N) for Democratizing Deep Plasma Proteomics at Population Scale. Molecular & Cellular Proteomics 24, 101071 (2025).\n",
+    "\n",
+    "In this study, Albrecht et al. processed more than 2000 pooled plasma samples to benchmark the reproducibility of the PCA-N workflow as part of a larger clinical cohort study."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6dfe6d9d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import alphapepttools as at\n",
+    "import anndata as ad\n",
+    "import pandas as pd\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "126e9c4d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "./albrecht_pcan_report.tsv already exists (9664.169023513794 MB)\n"
+     ]
+    }
+   ],
+   "source": [
+    "report_path = at.data.get_data(\"albrecht_pcan\", output_dir=\".\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5686b5b6",
+   "metadata": {},
+   "source": [
+    "We show that anndata can handle this comparatively large study and that the individual feature levels (precursors, proteins, genes) can be extracted from the report:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "68ee8f62",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "precursors: ad.AnnData = at.io.read_psm_table(\n",
+    "    report_path,\n",
+    "    search_engine=\"diann\",\n",
+    "    level=\"precursors\",\n",
+    "    intensity_column=\"Precursor.Normalised\",\n",
+    "    feature_id_column=\"Precursor.Id\",\n",
+    "    sample_id_column=\"Run\",\n",
+    "    var_columns=[\n",
+    "        \"Run\",\n",
+    "        \"Stripped.Sequence\",\n",
+    "        \"Precursor.Charge\",\n",
+    "        \"RT\",\n",
+    "        \"RT.Start\",\n",
+    "        \"RT.Stop\",\n",
+    "        \"IM\",\n",
+    "        \"Protein.Group\",\n",
+    "        \"Protein.Ids\",\n",
+    "        \"Genes\",\n",
+    "        \"MS2.Scan\",\n",
+    "        \"CScore\",\n",
+    "        \"Q.Value\",\n",
+    "        \"Precursor.Id\",\n",
+    "        \"Global.Q.Value\",\n",
+    "        \"Global.PG.Q.Value\",\n",
+    "        \"Lib.Q.Value\",\n",
+    "        \"Lib.PG.Q.Value\",\n",
+    "    ],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "96061e92",
+   "metadata": {},
+   "source": [
+    "The metadata from the PSM report is retained in the `.var` attribute of the precursor AnnData object."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1591d888",
+   "metadata": {},
+   "source": [
+    "Here, we subset the data to all precursors whose retention time is below 5 minutes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "09e1a5e6",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "View of AnnData object with n_obs × n_vars = 1801 × 3948\n",
+       "    var: 'Run', 'Stripped.Sequence', 'Precursor.Charge', 'RT', 'RT.Start', 'RT.Stop', 'IM', 'Protein.Group', 'Protein.Ids', 'Genes', 'MS2.Scan', 'CScore', 'Q.Value', 'PG.MaxLFQ', 'Precursor.Normalised', 'Genes.MaxLFQ', 'Global.Q.Value', 'Global.PG.Q.Value', 'Lib.Q.Value', 'Lib.PG.Q.Value'"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "precursors[:, precursors.var_names[(precursors.var[\"RT\"] < 5)]]  # noqa: PLR2004"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d8d0325d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = pd.DataFrame(precursors.X).apply(pd.to_numeric, errors=\"coerce\")\n",
+    "\n",
+    "precursors.obs_names = pd.Index(precursors.obs.index.to_numpy())\n",
+    "precursors.var_names = pd.Index(precursors.var.index.to_numpy())\n",
+    "\n",
+    "precursors.var = pd.DataFrame(\n",
+    "    precursors.var.convert_dtypes(dtype_backend=\"numpy_nullable\").values,\n",
+    "    index=precursors.var.index,\n",
+    "    columns=precursors.var.columns,\n",
+    ")\n",
+    "# Cast the numeric metadata columns to float\n",
+    "float_columns = [\n",
+    "    \"Precursor.Charge\",\n",
+    "    \"RT\",\n",
+    "    \"RT.Start\",\n",
+    "    \"RT.Stop\",\n",
+    "    \"IM\",\n",
+    "    \"MS2.Scan\",\n",
+    "    \"Q.Value\",\n",
+    "    \"Global.Q.Value\",\n",
+    "    \"Global.PG.Q.Value\",\n",
+    "    \"Lib.Q.Value\",\n",
+    "    \"Lib.PG.Q.Value\",\n",
+    "]\n",
+    "for column in float_columns:\n",
+    "    precursors.var[column] = precursors.var[column].astype(float)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0647c3dd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "precursors.X = np.where(pd.isna(precursors.X), np.nan, precursors.X).astype(float)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c72a80fc",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "... storing 'Run' as categorical\n",
+      "... storing 'Stripped.Sequence' as categorical\n",
+      "... storing 'Protein.Group' as categorical\n",
+      "... storing 'Protein.Ids' as categorical\n",
+      "... storing 'Genes' as categorical\n",
+      "... storing 'CScore' as categorical\n"
+     ]
+    }
+   ],
+   "source": [
+    "precursors.write_h5ad(\"./albrecht.precursors.h5ad\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ecad3b6",
+   "metadata": {},
+   "source": [
+    "We can process protein-level and gene-level information in the same way."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "98f3cc47",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "AnnData object with n_obs × n_vars = 1801 × 2161"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "proteins: ad.AnnData = at.io.read_psm_table(report_path, search_engine=\"diann\", level=\"proteins\")\n",
+    "proteins"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6bbec870",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "AnnData object with n_obs × n_vars = 1801 × 2105"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "genes: ad.AnnData = at.io.read_psm_table(report_path, search_engine=\"diann\", level=\"genes\")\n",
+    "genes"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "alphatools",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
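The cleanup cells in this notebook coerce the intensity matrix and the numeric `.var` columns to plain floats before `write_h5ad` is called. The following is a minimal sketch of that pattern on toy data, using only NumPy and pandas instead of a real AnnData object. The array values, the index labels, and the `NUMERIC_COLUMNS` selection are made up for illustration and are not taken from the report.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an object-dtype intensity matrix with mixed missing markers
x_raw = np.array([[1.5, pd.NA], [None, 4.0]], dtype=object)

# Same pattern as in the notebook: map every missing marker to np.nan, then cast
x_clean = np.where(pd.isna(x_raw), np.nan, x_raw).astype(float)

# Toy stand-in for a feature metadata table whose columns are all object dtype
var = pd.DataFrame(
    {"RT": ["1.2", "7.9"], "Precursor.Charge": [2, 3], "Genes": ["ALB", "APOA1"]},
    index=["PEPTIDEK2", "SAMPLER3"],
).astype(object)

# Hypothetical selection of columns that should be numeric
NUMERIC_COLUMNS = ["RT", "Precursor.Charge"]
for column in NUMERIC_COLUMNS:
    var[column] = var[column].astype(float)

# With a float RT column, a boolean mask selects the early-eluting features
early = var.index[var["RT"] < 5]
```

In the notebook, the analogous mask is built from `precursors.var["RT"]` and used to index the second axis of the AnnData object.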
diff --git a/src/alphapepttools/data/datasets.yaml b/src/alphapepttools/data/datasets.yaml
@@ -113,3 +113,10 @@
   search_engine: diann
   data_type: psm
   description: Small example reports for testing PSM readers (DiaNN 1.9.0, parquet)
+
+- name: albrecht_pcan
+  url: https://datashare.biochem.mpg.de/s/Bn6R9FQRj3Jn2ZB
+  search_engine: diann
+  data_type: psm
+  citation: "Albrecht, V., Müller-Reif, J. B., Brennsteiner, V. & Mann, M. A Simplified Perchloric Acid Workflow With Neutralization (PCA N) for Democratizing Deep Plasma Proteomics at Population Scale. Molecular & Cellular Proteomics 24, 101071 (2025)."
+  description: "2000 Plasma samples processed with the PCA-N workflow (PXD064475)."
diff --git a/tests/run_notebook_tests.sh b/tests/run_notebook_tests.sh
@@ -5,6 +5,6 @@
 export IS_PYTEST_RUN=True
 
 # TODO enable also study_03_biomarker_skin.ipynb
-ALL_NBS=$(find ../docs/notebooks -name "*.ipynb" | grep -v "study_03_biomarker_skin")
+ALL_NBS=$(find ../docs/notebooks -name "*.ipynb" | grep -v -e "study_03_biomarker_skin" -e "supplementary_02_scalability-demo_PCAn")
 
 python -m pytest --nbmake $(echo $ALL_NBS)
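The updated `grep -v -e ... -e ...` call drops every notebook path that contains one of the two given substrings, presumably because the scalability notebook depends on a multi-gigabyte report that is impractical to fetch in a test run. A minimal Python sketch of the same selection logic follows; `select_notebooks` and the candidate paths are hypothetical and only illustrate the filter, they are not part of the repository.

```python
from pathlib import PurePosixPath

# Substrings taken from the grep call in run_notebook_tests.sh
EXCLUDED = ("study_03_biomarker_skin", "supplementary_02_scalability-demo_PCAn")


def select_notebooks(paths, excluded=EXCLUDED):
    """Return the .ipynb paths that contain none of the excluded substrings."""
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix == ".ipynb"
        and not any(pattern in path for pattern in excluded)
    ]


# Hypothetical paths, for illustration only
candidates = [
    "../docs/notebooks/tutorial_01_basics.ipynb",
    "../docs/notebooks/study_03_biomarker_skin.ipynb",
    "../docs/notebooks/supplementary/supplementary_02_scalability-demo_PCAn.ipynb",
]
selected = select_notebooks(candidates)
```

Because the match is on substrings of the full path, any future notebook whose path contains one of these names would be skipped as well.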