This repository contains the code that was executed to generate the results of the article:
```bibtex
@unpublished{biomedics,
  author = {Adam Remaki and Jacques Ung and Pierre Pages and Perceval Wajsburt and Guillaume Faure and Thomas Petit-Jean and Xavier Tannier and Christel Gérardin},
  title = {Improving phenotyping of patients with immune-mediated inflammatory diseases through automated processing of discharge summaries: a multicenter cohort study},
  note = {Manuscript submitted for publication},
  year = {2024}
}
```
The data and trained models are available via a secure Jupyter interface in a Kubernetes cluster. Access to the clinical data warehouse's raw data can be granted following the process described on its website: www.eds.aphp.fr. Prior validation of access by the local institutional review board is required (IRB number: CSE200093). For non-AP-HP researchers, the signature of a collaboration contract is mandatory.
If you are an AP-HP researcher, you can use your own data in BRAT format.
BioMedics stands on the shoulders of the EDS-NLP library (a collaborative NLP framework aimed primarily at extracting information from French clinical notes). BioMedics specifically targets the extraction of laboratory test and drug information from clinical notes. It consists of two pipelines: one for laboratory tests and one for drug treatments.
In order to process large-scale data, the study uses Spark 2.4 (an open-source engine for large-scale data processing), which requires you to:

- Install a version of Python $\geq 3.7.1$ and $< 3.9$.
- Install Java 8 (you can install OpenJDK 8, an open-source reference implementation of Java 8).
- Clone the repository:

  ```bash
  git clone https://github.com/Aremaki/BioMedics.git
  ```

- Create a virtual environment with a suitable Python version ($\geq 3.7.1$ and $< 3.9$):

  ```bash
  cd biomedics
  python -m venv .venv
  source .venv/bin/activate
  ```
- Install Poetry (a tool for dependency management and packaging in Python) with the following command line:
  - Linux, macOS, Windows (WSL):

    ```bash
    curl -sSL https://install.python-poetry.org | python3 -
    ```

  - Windows (Powershell):

    ```powershell
    (Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
    ```

  For more details, check the installation guide.
- Install dependencies:

  ```bash
  poetry install
  ```

The dataset-creation step extracts the study cohort with a Spark query and saves the discharge summaries in BRAT format in the `data/CRH/raw` folder. If you use another cohort, you can skip this step and place your data in the same folder.
- Install EDS-Toolbox (a Python library that provides an efficient way of submitting PySpark scripts on AP-HP's data platform; as it is AP-HP-specific, it is not available on PyPI):

  ```bash
  pip install edstoolbox
  ```

- Run the script with EDS-Toolbox:
  ```bash
  cd scripts/create_dataset
  bash run.sh
  ```
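The actual cohort query ships in `scripts/create_dataset`; purely as an illustration of what this step produces, here is a minimal sketch that pulls discharge summaries from an OMOP-style note table and writes them out as BRAT files. The table and column names are our assumptions, not the project's schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed OMOP-style "note" table and column names; the actual cohort
# query in scripts/create_dataset is more involved.
notes = spark.sql(
    "SELECT note_id, note_text FROM note WHERE note_class_source_value = 'CRH'"
)

# BRAT format: one .txt file per document plus an (initially empty) .ann file.
for row in notes.limit(100).collect():
    with open(f"data/CRH/raw/{row.note_id}.txt", "w") as f:
        f.write(row.note_text)
    open(f"data/CRH/raw/{row.note_id}.ann", "w").close()
```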
The model used for NER and qualification comes from EDS-Medic, which has been developed by the data science team of AP-HP (Assistance Publique – Hôpitaux de Paris).

- Training this model requires annotated discharge summaries in BRAT format (see the example after this list).
- Training also requires a word embedding model; we recommend EDS-CamemBERT-fine-tuned, available in AP-HP's model catalogue, or CamemBERT-bio, available on Hugging Face. Place it in the `models/word_embedding` folder.
- Modify or create your own configuration in `configs/ner/<your_config>.cfg`: you can set the paths of your training and testing data folders, the hyperparameters of the model, the training parameters, etc.
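For reference, a BRAT dataset pairs each `.txt` document with a tab-separated `.ann` file of standoff annotations. A toy example (the entity labels are illustrative, not the study's actual label set):

```
# doc_001.txt
Traitement par aspirine 100 mg. Hémoglobine : 13,2 g/dL.

# doc_001.ann (fields are tab-separated; spans are character offsets)
T1	Chemical_and_drugs 15 23	aspirine
T2	BIO 32 43	Hémoglobine
```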
Training, evaluation and inference are gathered into one sbatch script, but you can comment out the parts you would like to skip in `run.sh`:

```bash
cd scripts/ner
sbatch run.sh
```
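Once trained, the pipeline is a standard spaCy model (EDS-NLP builds on spaCy), so inference could look like the sketch below; the model path and the qualifier attribute name are assumptions to adapt to your configuration:

```python
import spacy

# Hypothetical output path for the pipeline trained by scripts/ner/run.sh.
nlp = spacy.load("models/ner/model-best")

doc = nlp("Pas d'aspirine. Hémoglobine : 13,2 g/dL.")
for ent in doc.ents:
    # EDS-NLP qualifiers are exposed as spaCy extension attributes; the exact
    # names (negation, hypothesis, ...) depend on the qualifiers you trained.
    print(ent.text, ent.label_, getattr(ent._, "negation", None))
```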
The measurement-extraction step extracts the values and units of the laboratory tests mentioned in the text. It is based on regular expressions processed at very high speed with the Spark engine, as sketched after the commands below. The algorithm was implemented by the data science team of Pitié-Salpêtrière Hospital:

```bash
cd scripts/extract_measurement
sbatch run.sh
```
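The exact expressions live in the repository's Spark scripts; as a rough sketch of the idea (the pattern below is ours and far from exhaustive):

```python
import re
from typing import Dict, List

# Illustrative pattern only: a numeric value (French decimal comma allowed)
# followed by a common laboratory unit. The real pipeline uses a much richer
# set of expressions, applied at scale as Spark jobs.
MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>g/dl|g/l|mmol/l|µmol/l|%)",
    flags=re.IGNORECASE,
)

def extract_measurements(text: str) -> List[Dict[str, str]]:
    """Return every value/unit pair found in a piece of text."""
    return [
        {"value": m["value"].replace(",", "."), "unit": m["unit"]}
        for m in MEASUREMENT.finditer(text)
    ]

print(extract_measurements("Hémoglobine : 13,2 g/dL, créatinine 75 µmol/L"))
# [{'value': '13.2', 'unit': 'g/dL'}, {'value': '75', 'unit': 'µmol/L'}]
```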
The normalization of laboratory tests is based on CODER, a BERT-based model fine-tuned on the UMLS (see the sketch after this list):

- Download a word embedding model; we recommend CODER-all, available on Hugging Face. Place it in the `models/word_embedding` folder.
- Download the full release of the UMLS and follow the `data/umls/manage_umls.ipynb` notebook.
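As a sketch of how CODER-based normalization works, entity mentions and UMLS terms are embedded with the model and matched by cosine similarity; the model path and the [CLS] pooling choice below are our assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed local path for the CODER-all checkpoint downloaded from
# Hugging Face (check the exact model id there).
MODEL_PATH = "models/word_embedding/coder_all"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH)

def embed(terms):
    """[CLS] embeddings, L2-normalised so a dot product is a cosine similarity."""
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = model(**batch).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=1)

# Toy UMLS synonym list; the real one is built via data/umls/manage_umls.ipynb.
umls_terms = ["hemoglobin", "serum creatinine", "c reactive protein"]
mentions = ["hémoglobine"]
scores = embed(mentions) @ embed(umls_terms).T
print(umls_terms[scores.argmax(dim=1).item()])  # nearest UMLS concept
```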
The normalization of drug names is a fuzzy match against a knowledge dictionary (see the sketch below). This dictionary is an aggregation of two open-source dictionaries of drug names with their corresponding ATC codes: the UMLS restricted to the ATC vocabulary, and the Unique Drug Interoperability Repository (RUIM) created by the French National Agency for Medicines and Health Products Safety (ANSM):

- Run `python data/drug_knowledge/dic_generation.py` to create the drug knowledge dictionary.
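A minimal sketch of such fuzzy matching, here using the rapidfuzz library (our choice for illustration; the dictionary slice is a toy sample, the real one comes from the script above):

```python
from rapidfuzz import fuzz, process

# Toy slice of a drug-name -> ATC code dictionary.
DRUG_TO_ATC = {
    "paracetamol": "N02BE01",
    "ibuprofene": "M01AE01",
    "hydroxychloroquine": "P01BA02",
}

def normalize_drug(mention: str, threshold: float = 85.0):
    """Map a raw drug mention to an ATC code via fuzzy string matching."""
    best = process.extractOne(mention.lower(), DRUG_TO_ATC.keys(), scorer=fuzz.ratio)
    if best is not None and best[1] >= threshold:
        return DRUG_TO_ATC[best[0]]
    return None

print(normalize_drug("paracétamol"))  # -> N02BE01, despite the accent
```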
Both normalizations are gathered into one sbatch script, but you can comment out the parts you would like to skip in `run.sh`:

```bash
cd scripts/normalization/
sbatch run.sh
```
The clinical-application step filters the results according to the laboratory tests and drug treatments studied in the paper; if you are studying other laboratory tests or drugs, you will have to modify it:

```bash
cd scripts/clinical_application
sbatch run.sh
```

Finally, generate the result figures of the paper from the notebooks:
- Create a Spark-enabled kernel with your environment:

  ```bash
  eds-toolbox kernel --spark --hdfs
  ```

- Convert the markdown file into a Jupyter notebook:

  ```bash
  cd notebooks
  jupytext --set-formats md,ipynb 'clinical_results.md'
  ```

- Open the .ipynb file and start the kernel you have just created.
- Run the cells to obtain the table results.
The repository is organized as follows:

- `conf`: configuration files.
- `data`: saved processed data and knowledge dictionaries.
- `models`: trained models.
- `figures`: saved results.
- `notebooks`: notebooks that generate figures.
- `biomedics`: source code.
- `scripts`: scripts to process data.
We would like to thank AI4IDF for funding this project, and the data science teams of Assistance Publique – Hôpitaux de Paris for contributing to it.