This repository contains the code that was executed to generate the results of the article:
```bibtex
@unpublished{biomedics,
  author = {Adam Remaki and Jacques Ung and Pierre Pages and Perceval Wajsburt and Guillaume Faure and Thomas Petit-Jean and Xavier Tannier and Christel Gérardin},
  title = {Improving phenotyping of patients with immune-mediated inflammatory diseases through automated processing of discharge summaries: a multicenter cohort study},
  note = {Manuscript submitted for publication},
  year = {2024}
}
```
The data and trained models are available via a secure Jupyter interface in a Kubernetes cluster. Access to the clinical data warehouse's raw data can be granted following the process described on its website: www.eds.aphp.fr. Prior validation of access by the local institutional review board is required (IRB number: CSE200093). For non-AP-HP researchers, the signature of a collaboration contract is mandatory.
If you are an AP-HP researcher, you can use your own data in BRAT format.
BioMedics stands on the shoulders of the EDS-NLP library (a collaborative NLP framework aimed primarily at extracting information from French clinical notes). BioMedics specifically targets the extraction of laboratory test and drug information from clinical notes. It consists of two pipelines: one for laboratory tests and one for drug treatments.
In order to process large-scale data, the study uses Spark 2.4 (an open-source engine for large-scale data processing), which requires you to:

- Install a version of Python $\geq 3.7.1$ and $< 3.9$.
- Install Java 8 (you can install OpenJDK 8, an open-source reference implementation of Java 8).
- Clone the repository:

  ```bash
  git clone https://github.com/Aremaki/BioMedics.git
  ```

- Create a virtual environment with a suitable Python version ($\geq 3.7.1$ and $< 3.9$):

  ```bash
  cd biomedics
  python -m venv .venv
  source .venv/bin/activate
  ```
- Install Poetry (a tool for dependency management and packaging in Python) with the following command line:
  - Linux, macOS, Windows (WSL):

    ```bash
    curl -sSL https://install.python-poetry.org | python3 -
    ```

  - Windows (Powershell):

    ```powershell
    (Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
    ```

  For more details, check the installation guide.
- Install dependencies:

  ```bash
  poetry install
  ```

The dataset-creation step extracts the study cohort with a Spark query and saves the discharge summaries in BRAT format in the `data/CRH/raw` folder. If you use another cohort, you can skip this step and place your data in the same folder.
- Install EDS-Toolbox (a Python library that provides an efficient way of submitting PySpark scripts on AP-HP's data platform; as it is AP-HP-specific, it is not available on PyPI):

  ```bash
  pip install edstoolbox
  ```

- Run the script with EDS-Toolbox:
  ```bash
  cd scripts/create_dataset
  bash run.sh
  ```
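The actual cohort query ships in `scripts/create_dataset`; purely as an illustration of what this step produces, here is a minimal sketch that pulls discharge summaries from an OMOP-style note table and writes them out as BRAT files. The table and column names are our assumptions, not the project's schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed OMOP-style "note" table and column names; the actual cohort
# query in scripts/create_dataset is more involved.
notes = spark.sql(
    "SELECT note_id, note_text FROM note WHERE note_class_source_value = 'CRH'"
)

# BRAT format: one .txt file per document plus an (initially empty) .ann file.
for row in notes.limit(100).collect():
    with open(f"data/CRH/raw/{row.note_id}.txt", "w") as f:
        f.write(row.note_text)
    open(f"data/CRH/raw/{row.note_id}.ann", "w").close()
```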
The model used for NER and qualification comes from EDS-Medic, which has been developed by the data science team of AP-HP (Assistance Publique – Hôpitaux de Paris).

- Training this model requires annotated discharge summaries in BRAT format (see the example after this list).
- Training also requires a word embedding model; we recommend EDS-CamemBERT-fine-tuned, available in AP-HP's model catalogue, or CamemBERT-bio, available on Hugging Face. Place it in the `models/word_embedding` folder.
- Modify or create your own configuration in `configs/ner/<your_config>.cfg`: you can set the paths of your training and testing data folders, the hyperparameters of the model, the training parameters, etc.
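For reference, a BRAT dataset pairs each `.txt` document with a tab-separated `.ann` file of standoff annotations. A toy example (the entity labels are illustrative, not the study's actual label set):

```
# doc_001.txt
Traitement par aspirine 100 mg. Hémoglobine : 13,2 g/dL.

# doc_001.ann (fields are tab-separated; spans are character offsets)
T1	Chemical_and_drugs 15 23	aspirine
T2	BIO 32 43	Hémoglobine
```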
Training, evaluation and inference are gathered into one sbatch script, but you can comment out the parts you would like to skip in `run.sh`:

```bash
cd scripts/ner
sbatch run.sh
```
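Once trained, the pipeline is a standard spaCy model (EDS-NLP builds on spaCy), so inference could look like the sketch below; the model path and the qualifier attribute name are assumptions to adapt to your configuration:

```python
import spacy

# Hypothetical output path for the pipeline trained by scripts/ner/run.sh.
nlp = spacy.load("models/ner/model-best")

doc = nlp("Pas d'aspirine. Hémoglobine : 13,2 g/dL.")
for ent in doc.ents:
    # EDS-NLP qualifiers are exposed as spaCy extension attributes; the exact
    # names (negation, hypothesis, ...) depend on the qualifiers you trained.
    print(ent.text, ent.label_, getattr(ent._, "negation", None))
```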
The measurement-extraction step extracts the values and units of the laboratory tests mentioned in the text. It is based on regular expressions processed at very high speed with the Spark engine, as sketched after the commands below. The algorithm was implemented by the data science team of Pitié-Salpêtrière Hospital:

```bash
cd scripts/extract_measurement
sbatch run.sh
```
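The exact expressions live in the repository's Spark scripts; as a rough sketch of the idea (the pattern below is ours and far from exhaustive):

```python
import re
from typing import Dict, List

# Illustrative pattern only: a numeric value (French decimal comma allowed)
# followed by a common laboratory unit. The real pipeline uses a much richer
# set of expressions, applied at scale as Spark jobs.
MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>g/dl|g/l|mmol/l|µmol/l|%)",
    flags=re.IGNORECASE,
)

def extract_measurements(text: str) -> List[Dict[str, str]]:
    """Return every value/unit pair found in a piece of text."""
    return [
        {"value": m["value"].replace(",", "."), "unit": m["unit"]}
        for m in MEASUREMENT.finditer(text)
    ]

print(extract_measurements("Hémoglobine : 13,2 g/dL, créatinine 75 µmol/L"))
# [{'value': '13.2', 'unit': 'g/dL'}, {'value': '75', 'unit': 'µmol/L'}]
```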
The normalization of laboratory tests is based on CODER, a BERT-based model fine-tuned on the UMLS (see the sketch after this list):

- Download a word embedding model; we recommend CODER-all, available on Hugging Face. Place it in the `models/word_embedding` folder.
- Download the full release of the UMLS and follow the `data/umls/manage_umls.ipynb` notebook.
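As a sketch of how CODER-based normalization works, entity mentions and UMLS terms are embedded with the model and matched by cosine similarity; the model path and the [CLS] pooling choice below are our assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed local path for the CODER-all checkpoint downloaded from
# Hugging Face (check the exact model id there).
MODEL_PATH = "models/word_embedding/coder_all"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH)

def embed(terms):
    """[CLS] embeddings, L2-normalised so a dot product is a cosine similarity."""
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = model(**batch).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=1)

# Toy UMLS synonym list; the real one is built via data/umls/manage_umls.ipynb.
umls_terms = ["hemoglobin", "serum creatinine", "c reactive protein"]
mentions = ["hémoglobine"]
scores = embed(mentions) @ embed(umls_terms).T
print(umls_terms[scores.argmax(dim=1).item()])  # nearest UMLS concept
```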
The normalization of drug names is a fuzzy match against a knowledge dictionary (see the sketch below). This dictionary is an aggregation of two open-source dictionaries of drug names with their corresponding ATC codes: the UMLS restricted to the ATC vocabulary, and the Unique Drug Interoperability Repository (RUIM) created by the French National Agency for Medicines and Health Products Safety (ANSM):

- Run `python data/drug_knowledge/dic_generation.py` to create the drug knowledge dictionary.
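A minimal sketch of such fuzzy matching, here using the rapidfuzz library (our choice for illustration; the dictionary slice is a toy sample, the real one comes from the script above):

```python
from rapidfuzz import fuzz, process

# Toy slice of a drug-name -> ATC code dictionary.
DRUG_TO_ATC = {
    "paracetamol": "N02BE01",
    "ibuprofene": "M01AE01",
    "hydroxychloroquine": "P01BA02",
}

def normalize_drug(mention: str, threshold: float = 85.0):
    """Map a raw drug mention to an ATC code via fuzzy string matching."""
    best = process.extractOne(mention.lower(), DRUG_TO_ATC.keys(), scorer=fuzz.ratio)
    if best is not None and best[1] >= threshold:
        return DRUG_TO_ATC[best[0]]
    return None

print(normalize_drug("paracétamol"))  # -> N02BE01, despite the accent
```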
Both normalizations are gathered into one sbatch script, but you can comment out the parts you would like to skip in `run.sh`:

```bash
cd scripts/normalization/
sbatch run.sh
```
The clinical-application step filters the results according to the laboratory tests and drug treatments studied in the paper; if you are studying other laboratory tests or drugs, you will have to modify it:

```bash
cd scripts/clinical_application
sbatch run.sh
```

Finally, generate the result figures of the paper from the notebooks:
- Create a Spark-enabled kernel with your environment:

  ```bash
  eds-toolbox kernel --spark --hdfs
  ```

- Convert the markdown file into a Jupyter notebook:

  ```bash
  cd notebooks
  jupytext --set-formats md,ipynb 'clinical_results.md'
  ```

- Open the .ipynb file and start the kernel you have just created.
- Run the cells to obtain the table results.
The repository is organized as follows:

- `conf`: configuration files.
- `data`: saved processed data and knowledge dictionaries.
- `models`: trained models.
- `figures`: saved results.
- `notebooks`: notebooks that generate figures.
- `biomedics`: source code.
- `scripts`: scripts to process data.
We would like to thank AI4IDF for funding this project, and the data science teams of Assistance Publique – Hôpitaux de Paris for contributing to it.