Commit 13d65ca

updated README.md for the OMOP instruction (#105)

1 parent f2df8ce commit 13d65ca

File tree: 5 files changed, +191 -28 lines

README.md

Lines changed: 77 additions & 28 deletions
@@ -1,5 +1,12 @@
# CEHR-BERT

+[![PyPI - Version](https://img.shields.io/pypi/v/cehrbert)](https://pypi.org/project/cehrbert/)
+![Python](https://img.shields.io/badge/-Python_3.10-blue?logo=python&logoColor=white)
+[![tests](https://github.com/cumc-dbmi/cehrbert/actions/workflows/tests.yml/badge.svg)](https://github.com/cumc-dbmi/cehrbert/actions/workflows/tests.yml)
+[![license](https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray)](https://github.com/cumc-dbmi/cehrbert/blob/main/LICENSE)
+[![contributors](https://img.shields.io/github/contributors/cumc-dbmi/cehrbert.svg)](https://github.com/cumc-dbmi/cehrbert/graphs/contributors)
+
CEHR-BERT is a large language model developed for structured EHR data; the work was published
at https://proceedings.mlr.press/v158/pang21a.html. CEHR-BERT currently supports only structured EHR data in the
OMOP format, which is a common data model used to support observational studies and managed by the Observational Health
@@ -55,15 +62,9 @@ Build the project
pip install -e .[dev]
```

-Download [jtds-1.3.1.jar](jtds-1.3.1.jar) into the spark jars folder in the python environment
-```console
-cp jtds-1.3.1.jar .venv/lib/python3.10/site-packages/pyspark/jars/
-```

## Instructions for Use with [MEDS](https://github.com/Medical-Event-Data-Standard/meds)

-### 1. Convert MEDS to the [meds_reader](https://github.com/som-shahlab/meds_reader) database
+Step 1. Convert MEDS to the [meds_reader](https://github.com/som-shahlab/meds_reader) database
+---------------------------
If you don't have a MEDS dataset, you can convert an OMOP dataset to MEDS
using [meds_etl](https://github.com/Medical-Event-Data-Standard/meds_etl).
We have prepared a Synthea dataset with 1M patients for testing; you can download it
@@ -123,22 +124,41 @@ Convert MEDS to the meds_reader database to get the patient level data
meds_reader_convert synthea_meds synthea_meds_reader --num_threads 4
```
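
Once converted, you can sanity-check the database from Python. The following is a minimal sketch, assuming the meds_reader Python API (a `SubjectDatabase` that is iterable over subject IDs, with `subject.events` exposing `time`/`code` attributes); older releases named the class `PatientDatabase`, so check the documentation for your installed version.

```python
import meds_reader

# Assumed API for recent meds_reader releases.
database = meds_reader.SubjectDatabase("synthea_meds_reader")

# Print a few events from the first subject as a quick sanity check.
subject_id = next(iter(database))
for i, event in enumerate(database[subject_id].events):
    print(event.time, event.code)
    if i >= 4:
        break
```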

-### 2. Pretrain CEHR-BERT using the meds_reader database
+Step 2. Pretrain CEHR-BERT using the meds_reader database
+---------------------------
```console
mkdir test_dataset_prepared;
mkdir test_synthea_results;
-python -m cehrbert.runners.hf_cehrbert_pretrain_runner sample_configs/hf_cehrbert_pretrain_runner_meds_config.yaml
+python -m cehrbert.runners.hf_cehrbert_pretrain_runner \
+    sample_configs/hf_cehrbert_pretrain_runner_meds_config.yaml
```

## Instructions for Use with OMOP

-### 1. Download OMOP tables as parquet files
+Step 1. Download OMOP tables as parquet files
+---------------------------
We created a spark app to download OMOP tables from SQL Server as parquet files. You need to adjust the properties
-in `db_properties.ini` to match with your database setup.
+in `db_properties.ini` to match your database setup. Download [jtds-1.3.1.jar](https://mvnrepository.com/artifact/net.sourceforge.jtds/jtds/1.3.1) into the spark jars folder of the python environment.
```console
-PYTHONPATH=./: spark-submit tools/download_omop_tables.py -c db_properties.ini -tc person visit_occurrence condition_occurrence procedure_occurrence drug_exposure measurement observation_period concept concept_relationship concept_ancestor -o ~/Documents/omop_test/
+cp jtds-1.3.1.jar .venv/lib/python3.10/site-packages/pyspark/jars/
```
+We use Spark as the data processing engine to generate the pretraining data.
+For that, we need to set up the relevant Spark environment variables.
+```bash
+# the omop derived tables need to be built using pyspark
+export SPARK_WORKER_INSTANCES="1"
+export SPARK_WORKER_CORES="16"
+export SPARK_EXECUTOR_CORES="2"
+export SPARK_DRIVER_MEMORY="12g"
+export SPARK_EXECUTOR_MEMORY="12g"
+```
+Download the OMOP tables as parquet files:
+```console
+python -u -m cehrbert.tools.download_omop_tables -c db_properties.ini \
+    -tc person visit_occurrence condition_occurrence procedure_occurrence \
+    drug_exposure measurement observation_period \
+    concept concept_relationship concept_ancestor \
+    -o ~/Documents/omop_test/
```
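
After the export finishes, you can spot-check a table from Python. This is a minimal sketch with pandas, assuming pyarrow is installed and that each OMOP table lands in its own parquet folder under the output directory (the exact folder layout is an assumption; adjust the path to what the app produced):

```python
import os

import pandas as pd

# Hypothetical layout: one parquet folder per OMOP table under the output dir.
person_path = os.path.expanduser("~/Documents/omop_test/person")
person = pd.read_parquet(person_path)
print(person.shape)
print(person.columns.tolist())
```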

We have prepared a Synthea dataset with 1M patients for testing; you can download it
@@ -148,44 +168,73 @@ at [omop_synthea.tar.gz](https://drive.google.com/file/d/1k7-cZACaDNw8A1JRI37mfM
tar -xvf omop_synthea.tar.gz -C ~/Documents/omop_test/
```

-### 2. Generate training data for CEHR-BERT
+Step 2. Generate training data for CEHR-BERT using cehrbert_data
+---------------------------
We order the patient events chronologically and put all data points in a single sequence. We insert the artificial tokens
VS (visit start) and VE (visit end) at the start and end of each visit. In addition, we insert artificial time
tokens (ATT) between visits to indicate the time interval between them. This approach allows us to apply BERT to
structured EHR data as-is.
The sequence can be seen conceptually as [VS] [V1] [VE] [ATT] [VS] [V2] [VE], where [V1] and [V2] represent the lists of
concepts associated with those visits, as illustrated in the sketch below.
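
To make the representation concrete, here is a minimal sketch of how such a token sequence could be assembled. The toy visits, token names, and day-count ATT scheme are illustrative assumptions, not the exact tokenization implemented by cehrbert_data:

```python
from datetime import date

# Hypothetical input: each visit is (start_date, [concept tokens]).
visits = [
    (date(2020, 1, 1), ["320128", "4329847"]),  # concepts in visit 1
    (date(2020, 3, 15), ["201826"]),            # concepts in visit 2
]

def att_token(days: int) -> str:
    """Illustrative artificial time token encoding the day gap between visits."""
    return f"D{days}"

sequence = []
for i, (start, concepts) in enumerate(visits):
    if i > 0:
        prev_start = visits[i - 1][0]
        sequence.append(att_token((start - prev_start).days))
    sequence.extend(["VS", *concepts, "VE"])

print(sequence)
# ['VS', '320128', '4329847', 'VE', 'D74', 'VS', '201826', 'VE']
```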

-```console
-PYTHONPATH=./: spark-submit spark_apps/generate_training_data.py -i ~/Documents/omop_test/ -o ~/Documents/omop_test/cehr-bert -tc condition_occurrence procedure_occurrence drug_exposure -d 1985-01-01 --is_new_patient_representation -iv
+Set up the pyspark environment variables if you haven't done so.
+```bash
+# the omop derived tables need to be built using pyspark
+export SPARK_WORKER_INSTANCES="1"
+export SPARK_WORKER_CORES="16"
+export SPARK_EXECUTOR_CORES="2"
+export SPARK_DRIVER_MEMORY="12g"
+export SPARK_EXECUTOR_MEMORY="12g"
+```
+Generate the pretraining data using the following command; the script wraps the two cehrbert_data apps shown at the bottom of this commit:
+```bash
+sh src/cehrbert/scripts/create_cehrbert_pretraining_data.sh \
+    --input_folder $OMOP_DIR \
+    --output_folder $CEHR_BERT_DATA_DIR \
+    --start_date "1985-01-01"
```

-### 3. Pre-train CEHR-BERT
+Step 3. Pre-train CEHR-BERT
+---------------------------
If you don't have your own OMOP instance, we have provided a sample of patient sequence data generated using Synthea
at `sample/patient_sequence` in the repo. CEHR-BERT expects the data folder to be named `patient_sequence`.

```console
mkdir test_dataset_prepared;
mkdir test_results;
-python -m cehrbert.runners.hf_cehrbert_pretrain_runner sample_configs/hf_cehrbert_pretrain_runner_config.yaml
+python -m cehrbert.runners.hf_cehrbert_pretrain_runner \
+    sample_configs/hf_cehrbert_pretrain_runner_config.yaml
```

If your dataset is large, you can add `--use_dask` to the command above.

-### 4. Generate hf readmission prediction task
+Step 4. Generate the HF readmission prediction task
+---------------------------
If you don't have your own OMOP instance, we have provided a sample of patient sequence data generated using Synthea
-at `sample/hf_readmissioon` in the repo
+at `sample/hf_readmissioon` in the repo. Set up the pyspark environment variables if you haven't done so.
+```bash
+# the omop derived tables need to be built using pyspark
+export SPARK_WORKER_INSTANCES="1"
+export SPARK_WORKER_CORES="16"
+export SPARK_EXECUTOR_CORES="2"
+export SPARK_DRIVER_MEMORY="12g"
+export SPARK_EXECUTOR_MEMORY="12g"
+```
+Generate the HF readmission prediction task:
```console
-PYTHONPATH=./:$PYTHONPATH spark-submit spark_apps/prediction_cohorts/hf_readmission.py -c hf_readmission -i ~/Documents/omop_test/ -o ~/Documents/omop_test/cehr-bert -dl 1985-01-01 -du 2020-12-31 -l 18 -u 100 -ow 360 -ps 0 -pw 30 --is_new_patient_representation
+python -u -m cehrbert.prediction_cohorts.hf_readmission \
+    -c hf_readmission -i ~/Documents/omop_test/ -o ~/Documents/omop_test/cehr-bert \
+    -dl 1985-01-01 -du 2020-12-31 \
+    -l 18 -u 100 -ow 360 -ps 0 -pw 30 \
+    --is_new_patient_representation
```

-### 5. Fine-tune CEHR-BERT
+Step 5. Fine-tune CEHR-BERT
+---------------------------
```console
mkdir test_finetune_results;
-python -m cehrbert.runners.hf_cehrbert_finetune_runner sample_configs/hf_cehrbert_finetuning_runner_config.yaml
+python -m cehrbert.runners.hf_cehrbert_finetune_runner \
+    sample_configs/hf_cehrbert_finetuning_runner_config.yaml
```

## Contact us

sample_configs/hf_cehrbert_finetuning_runner_config.yaml

Lines changed: 3 additions & 0 deletions
@@ -1,4 +1,6 @@
+# Please point this to your pretrained model folder
model_name_or_path: "test_results"
+# Please point this to your pretrained model folder
tokenizer_name_or_path: "test_results"

data_folder: "sample_data/finetune/full"
@@ -32,6 +34,7 @@ max_position_embeddings: 512
dataloader_num_workers: 4
dataloader_prefetch_factor: 2

+# Please point this to your finetuned model folder
output_dir: "test_finetune_results"
evaluation_strategy: "epoch"
save_strategy: "epoch"

sample_configs/hf_cehrbert_pretrain_runner_config.yaml

Lines changed: 4 additions & 0 deletions
@@ -1,4 +1,6 @@
+# Please point this to your output model folder
model_name_or_path: "test_results"
+# Please point this to your output model folder
tokenizer_name_or_path: "test_results"

data_folder: "sample_data/pretrain"
@@ -32,7 +34,9 @@ max_position_embeddings: 512
dataloader_num_workers: 4
dataloader_prefetch_factor: 4

+# Please point this to your output model folder
output_dir: "test_results"
+
evaluation_strategy: "epoch"
save_strategy: "epoch"
learning_rate: 0.00005

sample_configs/hf_cehrbert_pretrain_runner_meds_config.yaml

Lines changed: 4 additions & 0 deletions
@@ -1,6 +1,9 @@
+# Please point this to your model folder
model_name_or_path: "test_synthea_results"
+# Please point this to your model folder
tokenizer_name_or_path: "test_synthea_results"

+# Please point this to the meds_reader database folder, since the MEDS data is used as the input
data_folder: "synthea_meds_reader"
dataset_prepared_path: "test_dataset_prepared"
validation_split_percentage: 0.05
@@ -32,6 +35,7 @@ max_position_embeddings: 512
dataloader_num_workers: 4
dataloader_prefetch_factor: 4

+# Please point this to your model folder
output_dir: "test_synthea_results"
evaluation_strategy: "epoch"
save_strategy: "epoch"
src/cehrbert/scripts/create_cehrbert_pretraining_data.sh

Lines changed: 103 additions & 0 deletions

@@ -0,0 +1,103 @@
#!/bin/bash

# Function to display usage
usage() {
    echo "Usage: $0 --input_folder INPUT_FOLDER --output_folder OUTPUT_FOLDER --start_date START_DATE"
    echo ""
    echo "Required Arguments:"
    echo "  --input_folder PATH    Input folder path"
    echo "  --output_folder PATH   Output folder path"
    echo "  --start_date DATE      Start date"
    echo ""
    echo "Example:"
    echo "  $0 --input_folder /path/to/input --output_folder /path/to/output --start_date 1985-01-01"
    exit 1
}

# Check if no arguments were provided
if [ $# -eq 0 ]; then
    usage
fi

# Initialize variables
INPUT_FOLDER=""
OUTPUT_FOLDER=""
START_DATE=""

# Domain tables (fixed list)
DOMAIN_TABLES=("condition_occurrence" "procedure_occurrence" "drug_exposure")

# Parse command line arguments
ARGS=$(getopt -o "" --long input_folder:,output_folder:,start_date:,help -n "$0" -- "$@")

if [ $? -ne 0 ]; then
    usage
fi

eval set -- "$ARGS"

while true; do
    case "$1" in
        --input_folder)
            INPUT_FOLDER="$2"
            shift 2
            ;;
        --output_folder)
            OUTPUT_FOLDER="$2"
            shift 2
            ;;
        --start_date)
            START_DATE="$2"
            shift 2
            ;;
        --help)
            usage
            ;;
        --)
            shift
            break
            ;;
        *)
            echo "Internal error!"
            exit 1
            ;;
    esac
done

# Validate required arguments
if [ -z "$INPUT_FOLDER" ] || [ -z "$OUTPUT_FOLDER" ] || [ -z "$START_DATE" ]; then
    echo "Error: Missing required arguments"
    usage
fi

# Create output folder if it doesn't exist
mkdir -p "$OUTPUT_FOLDER"

# Step 1: Generate included concept list
CONCEPT_LIST_CMD="python -u -m cehrbert_data.apps.generate_included_concept_list \
    -i \"$INPUT_FOLDER\" \
    -o \"$OUTPUT_FOLDER\" \
    --min_num_of_patients 100 \
    --ehr_table_list ${DOMAIN_TABLES[@]}"

echo "Running concept list generation:"
echo "$CONCEPT_LIST_CMD"
eval "$CONCEPT_LIST_CMD"

# Step 2: Generate training data
TRAINING_DATA_CMD="python -m cehrbert_data.apps.generate_training_data \
    --input_folder \"$INPUT_FOLDER\" \
    --output_folder \"$OUTPUT_FOLDER\" \
    -d $START_DATE \
    --att_type day \
    --inpatient_att_type day \
    -iv \
    -ip \
    --include_concept_list \
    --include_death \
    --gpt_patient_sequence \
    --domain_table_list ${DOMAIN_TABLES[@]}"

echo "Running training data generation:"
echo "$TRAINING_DATA_CMD"
eval "$TRAINING_DATA_CMD"
