# CEHR-BERT
[![PyPI - Version](https://img.shields.io/pypi/v/cehrbert)](https://pypi.org/project/cehrbert/)
![Python](https://img.shields.io/badge/-Python_3.10-blue?logo=python&logoColor=white)
[![tests](https://github.com/cumc-dbmi/cehrbert/actions/workflows/tests.yml/badge.svg)](https://github.com/cumc-dbmi/cehrbert/actions/workflows/tests.yml)
[![license](https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray)](https://github.com/cumc-dbmi/cehrbert/blob/main/LICENSE)
[![contributors](https://img.shields.io/github/contributors/cumc-dbmi/cehrbert.svg)](https://github.com/cumc-dbmi/cehrbert/graphs/contributors)

CEHR-BERT is a large language model developed for structured EHR data; the work was published
at https://proceedings.mlr.press/v158/pang21a.html. CEHR-BERT currently only supports structured EHR data in the
OMOP format, a common data model used to support observational studies and managed by the Observational Health
Data Sciences and Informatics (OHDSI) community.

Build the project:
```console
pip install -e .[dev]
```
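
To verify the editable install, a quick smoke test (assuming the virtual environment is active):
```console
pip show cehrbert
python -c "import cehrbert"
```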

## Instructions for Use with [MEDS](https://github.com/Medical-Event-Data-Standard/meds)

Step 1. Convert MEDS to the [meds_reader](https://github.com/som-shahlab/meds_reader) database
---------------------------
If you don't have a MEDS dataset, you can convert an OMOP dataset to MEDS
using [meds_etl](https://github.com/Medical-Event-Data-Standard/meds_etl).
We have prepared a Synthea dataset with 1M patients for you to test, which you can download.

Convert MEDS to the meds_reader database to get the patient-level data:
```console
meds_reader_convert synthea_meds synthea_meds_reader --num_threads 4
```
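
Once converted, you can spot-check the database from Python. A minimal sketch, assuming the `SubjectDatabase` API shown in the meds_reader README (iterating the database yields subject ids; each subject exposes `events` with `time` and `code` attributes):
```python
import meds_reader

# Open the converted database and print the first few events of one subject.
database = meds_reader.SubjectDatabase("synthea_meds_reader")
subject_id = next(iter(database))
subject = database[subject_id]
for i, event in enumerate(subject.events):
    if i == 5:
        break
    print(subject_id, event.time, event.code)
```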

Step 2. Pretrain CEHR-BERT using the meds_reader database
---------------------------
```console
mkdir test_dataset_prepared;
mkdir test_synthea_results;
python -m cehrbert.runners.hf_cehrbert_pretrain_runner \
  sample_configs/hf_cehrbert_pretrain_runner_meds_config.yaml
```
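
The `hf_` prefix suggests the runner is built on the Hugging Face Trainer; assuming that, and that the sample config writes logs and checkpoints to the `test_synthea_results` folder created above, you can monitor training with TensorBoard:
```console
tensorboard --logdir test_synthea_results
```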

## Instructions for Use with OMOP

Step 1. Download OMOP tables as parquet files
---------------------------
We created a Spark app to download OMOP tables from SQL Server as parquet files. You need to adjust the properties
in `db_properties.ini` to match your database setup. Download [jtds-1.3.1.jar](https://mvnrepository.com/artifact/net.sourceforge.jtds/jtds/1.3.1) into the Spark jars folder in the Python environment:
```console
cp jtds-1.3.1.jar .venv/lib/python3.10/site-packages/pyspark/jars/
```
We use Spark as the data processing engine to generate the pretraining data.
For that, we need to set up the relevant Spark environment variables.
```bash
# the omop derived tables need to be built using pyspark
export SPARK_WORKER_INSTANCES="1"
export SPARK_WORKER_CORES="16"
export SPARK_EXECUTOR_CORES="2"
export SPARK_DRIVER_MEMORY="12g"
export SPARK_EXECUTOR_MEMORY="12g"
```
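
Before launching the heavier Spark jobs, a quick check that PySpark can start a local session (this only verifies the installation; the variables above configure the standalone workers and executors):
```console
python -c "from pyspark.sql import SparkSession; print(SparkSession.builder.getOrCreate().version)"
```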
Download the OMOP tables as parquet files:
```console
python -u -m cehrbert.tools.download_omop_tables -c db_properties.ini \
  -tc person visit_occurrence condition_occurrence procedure_occurrence \
  drug_exposure measurement observation_period \
  concept concept_relationship concept_ancestor \
  -o ~/Documents/omop_test/
```
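
Spark writes each table as a folder of parquet files. A quick sanity check on one of them (a sketch assuming `pandas` and `pyarrow` are installed):
```python
import os
import pandas as pd

# Read the downloaded person table back and eyeball its size and columns.
person = pd.read_parquet(os.path.expanduser("~/Documents/omop_test/person"))
print(person.shape)
print(person.head())
```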

We have prepared a Synthea dataset with 1M patients for you to test; you can download it
at [omop_synthea.tar.gz](https://drive.google.com/file/d/1k7-cZACaDNw8A1JRI37mfM).

Extract it into the OMOP test folder:
```console
tar -xvf omop_synthea.tar.gz -C ~/Documents/omop_test/
```

Step 2. Generate training data for CEHR-BERT using cehrbert_data
---------------------------
We order each patient's events chronologically and put all data points in a single sequence. We insert the artificial
tokens VS (visit start) and VE (visit end) at the start and end of each visit. In addition, we insert artificial time
tokens (ATT) between visits to indicate the time interval between them. This approach allows us to apply BERT to
structured EHR data as-is.
The sequence can be seen conceptually as [VS] [V1] [VE] [ATT] [VS] [V2] [VE], where [V1] and [V2] represent the lists of
concepts associated with those visits.
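
To make the representation concrete, here is a small illustrative sketch (not the library's actual tokenizer; the real ATT vocabulary and bucketing are defined in the paper and in cehrbert's code, and `att_token` below is a hypothetical stand-in):
```python
from datetime import date

def att_token(days: int) -> str:
    # Hypothetical time-bucketing for illustration only.
    return f"W{days // 7}" if days < 365 else "LT"

# Two visits, each a (start_date, [concept_ids]) pair.
visits = [
    (date(2020, 1, 1), ["concept_a", "concept_b"]),
    (date(2020, 3, 1), ["concept_c"]),
]

sequence, prev_start = [], None
for start, concepts in visits:
    if prev_start is not None:
        sequence.append(att_token((start - prev_start).days))
    sequence += ["VS", *concepts, "VE"]
    prev_start = start

print(sequence)
# ['VS', 'concept_a', 'concept_b', 'VE', 'W8', 'VS', 'concept_c', 'VE']
```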

Set up the PySpark environment variables if you haven't done so.
```bash
# the omop derived tables need to be built using pyspark
export SPARK_WORKER_INSTANCES="1"
export SPARK_WORKER_CORES="16"
export SPARK_EXECUTOR_CORES="2"
export SPARK_DRIVER_MEMORY="12g"
export SPARK_EXECUTOR_MEMORY="12g"
```
Generate the pretraining data using the following command:
```bash
sh src/cehrbert/scripts/create_cehrbert_pretraining_data.sh \
  --input_folder $OMOP_DIR \
  --output_folder $CEHR_BERT_DATA_DIR \
  --start_date "1985-01-01"
```
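
If the script succeeds, the output folder should contain a `patient_sequence` directory, the folder name CEHR-BERT expects in the next step:
```console
ls $CEHR_BERT_DATA_DIR/patient_sequence
```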

Step 3. Pre-train CEHR-BERT
---------------------------
If you don't have your own OMOP instance, we have provided a sample of patient sequence data generated using Synthea
at `sample/patient_sequence` in the repo. CEHR-BERT expects the data folder to be named `patient_sequence`.

```console
mkdir test_dataset_prepared;
mkdir test_results;
python -m cehrbert.runners.hf_cehrbert_pretrain_runner \
  sample_configs/hf_cehrbert_pretrain_runner_config.yaml
```

If your dataset is large, you can add `--use_dask` to the command above.

Step 4. Generate the HF readmission prediction task
---------------------------
If you don't have your own OMOP instance, we have provided a sample of patient sequence data generated using Synthea
at `sample/hf_readmissioon` in the repo. Set up the PySpark environment variables if you haven't done so.
```bash
# the omop derived tables need to be built using pyspark
export SPARK_WORKER_INSTANCES="1"
export SPARK_WORKER_CORES="16"
export SPARK_EXECUTOR_CORES="2"
export SPARK_DRIVER_MEMORY="12g"
export SPARK_EXECUTOR_MEMORY="12g"
```
Generate the HF readmission prediction task:
```console
python -u -m cehrbert.prediction_cohorts.hf_readmission \
  -c hf_readmission -i ~/Documents/omop_test/ -o ~/Documents/omop_test/cehr-bert \
  -dl 1985-01-01 -du 2020-12-31 \
  -l 18 -u 100 -ow 360 -ps 0 -pw 30 \
  --is_new_patient_representation
```
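
A rough guide to the flags, inferred from their names and values; consult the script's `--help` output for the authoritative definitions:
```bash
# -dl / -du  lower / upper bound on cohort dates (1985-01-01 to 2020-12-31)
# -l  / -u   lower / upper bound on patient age (18 to 100)
# -ow        observation window in days (360)
# -ps / -pw  prediction start offset and prediction window in days (0 and 30)
```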

Step 5. Fine-tune CEHR-BERT
---------------------------
```console
mkdir test_finetune_results;
python -m cehrbert.runners.hf_cehrbert_finetune_runner \
  sample_configs/hf_cehrbert_finetuning_runner_config.yaml
```

## Contact us