This tutorial introduces issues that arise when analyzing real-world data made available for research in clinical data warehouses. It is aimed at data scientists who have mastered the basics of Python programming and data analysis. The tutorial is organized as a series of short exercises and a final project. Whereas the short exercises each illustrate a specific issue, the final project mimics an end-to-end research study of the kind that may be reported in a scientific article.
All data is fake, so this project can be freely shared without affecting patients’ privacy. A fake data generator is provided and can be tuned to illustrate various use cases. Its development was loosely inspired by the characteristics and issues observed while analyzing data from the Greater Paris University Hospitals.
We recommend Python, JupyterLab, and an environment manager such as Anaconda.
We also recommend using Visual Studio Code.
Please follow these instructions:
- Open a terminal
- Go to your local repository for the 2025_EI project
- Clone the project locally:

      git clone {URL}

- Using the terminal, move into the cloned directory:

      cd edstuto

- Install the required packages with uv:

      pip install uv==0.7.8
      uv venv --python 3.11.9
      source .venv/bin/activate
      uv sync
NB: For VS Code users, to make the plots clearly visible, we recommend enabling the "Theme Matplotlib Plots" option under Settings > Extensions > Jupyter.
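Once the installation commands have run, a quick sanity check can confirm that the interpreter in use is the one created by `uv venv`. The snippet below is a minimal sketch and is not part of the project: the helper name `check_environment` is hypothetical, and it only assumes that the environment was created with Python 3.11 and activated as described above.

```python
import sys


def check_environment(expected=(3, 11)):
    """Return a list of human-readable problems; an empty list means the setup looks fine.

    `expected` is the (major, minor) Python version the virtual environment
    is assumed to have been created with.
    """
    problems = []

    # Compare only major and minor versions, so any 3.11.x patch release passes.
    if sys.version_info[:2] != expected:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in expected)
        problems.append(f"Python {wanted} expected, found {found}")

    # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
    if sys.prefix == sys.base_prefix:
        problems.append("no virtual environment appears to be active")

    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks fine:", sys.executable)
```

Running this from the terminal in which `.venv` was activated, or from a notebook cell, should report either the path of the active interpreter or the mismatches to fix before starting the exercises.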
The following scientific libraries, developed in the context of the Paris clinical data warehouse, may also be used to help solve some of the exercises:
- eds-scikit: a set of tools to assist data scientists working on a clinical data warehouse (structured data).
- edsnlp: a set of spaCy components that are used to extract information from clinical notes written in French (unstructured data).
We would like to thank Assistance Publique – Hôpitaux de Paris and the AP-HP Foundation for funding this project.