
Commit 3761d2f

Merge pull request #1335 from MIT-LCP/update_builddb_mimiciv

Update build scripts for MIMIC-IV v2.0

2 parents: 1988828 + f7514e7

34 files changed: +753 -467 lines

.github/workflows/build-db.yml

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+name: Test DB build scripts
+on: pull_request
+
+jobs:
+  build-mimic-iv-psql:
+    # Containers must run in Linux based operating systems
+    runs-on: ubuntu-latest
+    # Docker Hub image that `container-job` executes in
+    container: node:latest
+
+    # Service containers to run with `container-job`
+    services:
+      # Label used to access the service container
+      postgres:
+        # Docker Hub image
+        image: postgres
+        # Provide the password for postgres
+        env:
+          POSTGRES_PASSWORD: postgres
+        # Set health checks to wait until postgres has started
+        options: >-
+          --health-cmd pg_isready
+          --health-interval 10s
+          --health-timeout 5s
+          --health-retries 5
+
+    steps:
+      - name: Check out repository code
+        uses: actions/checkout@v3
+
+      - name: Install psql command
+        run: |
+          apt-get update
+          apt-get install --yes --no-install-recommends postgresql-client
+
+      - id: 'auth'
+        uses: 'google-github-actions/auth@v0'
+        with:
+          project_id: ${{ secrets.GCP_PROJECT_ID }}
+          credentials_json: ${{ secrets.GCP_SA_KEY }}
+
+      - name: 'Set up Cloud SDK'
+        uses: 'google-github-actions/setup-gcloud@v0'
+
+      - name: Download demo and create tables on PostgreSQL
+        run: |
+          echo "Downloading MIMIC-IV demo from GCP."
+          gsutil -q -u $PROJECT_ID -m cp -r gs://mimic-iv-archive/v2.0/demo ./
+          echo "Building and loading data into psql."
+          psql -q -h $POSTGRES_HOST -U postgres -f mimic-iv/buildmimic/postgres/create.sql
+          psql -q -h $POSTGRES_HOST -U postgres -v mimic_data_dir=demo -f mimic-iv/buildmimic/postgres/load_gz.sql
+          echo "Validating build."
+          psql -h $POSTGRES_HOST -U postgres -f mimic-iv/buildmimic/postgres/validate_demo.sql > validate_results.txt
+          cat validate_results.txt
+
+        env:
+          # The hostname used to communicate with the PostgreSQL service container
+          POSTGRES_HOST: postgres
+          PGPASSWORD: postgres
+          # The default PostgreSQL port
+          POSTGRES_PORT: 5432
+          PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
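For debugging the build outside of CI, roughly the same sequence can be reproduced locally. The sketch below is a hedged stand-in, not part of the workflow: it assumes Docker and `postgresql-client` are installed, that the demo `.csv.gz` files already sit in a local `demo/` directory, and the container name `mimic-test-db` is arbitrary.

```sh
#!/bin/bash
# Hypothetical local stand-in for the CI job above; not part of the repository.
set -e

# Start a throwaway PostgreSQL container, mirroring the CI service container.
docker run --name mimic-test-db -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres

# Wait for postgres to accept connections (CI relies on pg_isready health checks).
until pg_isready -h localhost -p 5432 -U postgres; do sleep 2; done

# Build the schema, load the demo data, and validate, as the workflow does.
export PGPASSWORD=postgres
psql -q -h localhost -U postgres -f mimic-iv/buildmimic/postgres/create.sql
psql -q -h localhost -U postgres -v mimic_data_dir=demo -f mimic-iv/buildmimic/postgres/load_gz.sql
psql -h localhost -U postgres -f mimic-iv/buildmimic/postgres/validate_demo.sql

# Remove the container when done.
docker rm -f mimic-test-db
```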

mimic-iv/buildmimic/bigquery/schemas/ed/diagnosis.json renamed to mimic-iv-ed/buildmimic/bigquery/schemas/ed/diagnosis.json

File renamed without changes.

mimic-iv/buildmimic/bigquery/schemas/ed/edstays.json renamed to mimic-iv-ed/buildmimic/bigquery/schemas/ed/edstays.json

File renamed without changes.

mimic-iv/buildmimic/bigquery/schemas/ed/medrecon.json renamed to mimic-iv-ed/buildmimic/bigquery/schemas/ed/medrecon.json

File renamed without changes.
mimic-iv/buildmimic/bigquery/schemas/ed/pyxis.json renamed to mimic-iv-ed/buildmimic/bigquery/schemas/ed/pyxis.json

File renamed without changes.

mimic-iv/buildmimic/bigquery/schemas/ed/triage.json renamed to mimic-iv-ed/buildmimic/bigquery/schemas/ed/triage.json

File renamed without changes.

mimic-iv/buildmimic/bigquery/schemas/ed/vitalsign.json renamed to mimic-iv-ed/buildmimic/bigquery/schemas/ed/vitalsign.json

File renamed without changes.

mimic-iv/buildmimic/bigquery/schemas/ed/vitalsign_hl7.json renamed to mimic-iv-ed/buildmimic/bigquery/schemas/ed/vitalsign_hl7.json

File renamed without changes.

mimic-iv/buildmimic/bigquery/README.md

Lines changed: 22 additions & 27 deletions
@@ -1,6 +1,6 @@
 # Loading MIMIC-IV to BigQuery
 
-**YOU DO NOT NEED TO INSTALL MIMIC-IV YOURSELF!** MIMIC-IV has been loaded onto BigQuery by the LCP, and is available for credentialed researchers to access. If you are credentialed, then you may be granted access MIMIC-IV on BigQuery instantly by following the [cloud configuration tutorial](https://mimic-iv.mit.edu/docs/access/cloud/).
+**YOU DO NOT NEED TO INSTALL MIMIC-IV YOURSELF!** MIMIC-IV has been loaded onto BigQuery by the LCP, and is available for credentialed researchers to access. If you are credentialed, you may be granted access to MIMIC-IV on BigQuery instantly by following the [cloud configuration tutorial](https://mimic.mit.edu/docs/gettingstarted/cloud/).
 
 The following instructions are provided for transparency and were used to create the current copy of MIMIC-IV on BigQuery.
 
@@ -38,43 +38,39 @@ gcloud init
 
 ---
 
-## STEP 3: Verify you can access the MIMIC-IV files on Google Cloud Storage
+## STEP 3: Download the MIMIC-IV files
 
-### A) Check the content of the bucket.
+Download the MIMIC-IV dataset files. The easiest way to download them is to open a terminal, then run:
 
-```sh
-gsutil ls gs://mimiciv-1.0.physionet.org
 ```
-
-It should list a zip file, and some auxiliary files associated with the project (SHA256SUMS.txt).
-
-```sh
-gs://mimiciv-1.0.physionet.org/mimic-iv-1.0.zip
+wget -r -N -c -np --user YOURUSERNAME --ask-password https://physionet.org/files/mimiciv/2.0/
 ```
 
-Download and extract the zip file locally. Then, upload the resultant folders (`core`, `hosp`, and `icu`) to a GCP bucket of your choice:
+Replace `YOURUSERNAME` with your PhysioNet username.
+
+Then, upload the folders (`hosp` and `icu`) to a GCP bucket of your choice:
 
 ```sh
 bucket="mimic-data"
 
-unzip mimic-iv-1.0.zip
-gsutil -m cp -r core hosp icu gs://$bucket/v1.0/
+gsutil -m cp -r hosp icu gs://$bucket/v2.0/
 ```
 
 ## STEP 4: Create a new BigQuery dataset
 
-### A) Create a new dataset for MIMIC-IV version 1.0
+### A) Create a new dataset for MIMIC-IV version 2.0
 
-In this example, we have chosen **mimic4_v1_0** as the dataset name.
+In this example, we have chosen **mimic4_v2_0** as the dataset prefix for the ICU/hosp modules.
 
 ```sh
-bq mk --dataset --data_location US --description "MIMIC-IV version 1.0" mimic4_v1_0
+bq mk --dataset --data_location US --description "MIMIC-IV version 2.0 ICU data" mimic4_v2_0_icu
+bq mk --dataset --data_location US --description "MIMIC-IV version 2.0 hosp data" mimic4_v2_0_hosp
 ```
 
 ### B) Check the status of the dataset created
 
 ```sh
-bq show mimic4_v1_0
+bq show mimic4_v2_0_hosp
 ```
 
 ---
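As a quick sanity check between the upload and the load, the objects can be listed back from the bucket. This is a hedged aside rather than part of the original instructions; the bucket name comes from the example above:

```sh
# Each module directory should contain one .csv.gz file per table.
bucket="mimic-data"
for module in hosp icu; do
  gsutil ls "gs://$bucket/v2.0/$module/"
done
```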
@@ -101,13 +97,12 @@ BigQuery schemas are provided in this GitHub repository. Download the table sche
 
 ## STEP 6: Create tables and load the compressed files
 
-### A) Create a script file (ex: upload_mimic4_v1_0.sh) and copy the code below.
+### A) Create a script file (ex: upload_mimic4_v2_0.sh) and copy the code below.
 
 You will need to change the **schema_local_folder** to match the path to the schemas on your local machine.
 
 Note also that the below assumes the following dataset structure:
 
-* <dataset_prefix>_core
 * <dataset_prefix>_icu
 * <dataset_prefix>_hosp
 
@@ -118,25 +113,25 @@ If you would like all tables on the same dataset, you should modify the below sc
 
 # Initialize parameters
 bucket="mimic-data" # we chose this bucket earlier when uploading data
-dataset_prefix="mimic"
-schema_local_folder="/home/alistairewj/mimic-iv/v1.0/schemas"
+dataset_prefix="mimic4_v2_0"
+schema_local_folder="$HOME/mimic-code/mimic-iv/buildmimic/bigquery/schemas"
 
 # Get the list of files in the bucket
 
-for module in core hosp icu;
+for module in hosp icu;
 do
-FILES=$(gsutil ls gs://$bucket/v1.0/$module/*.csv.gz)
+FILES=$(gsutil ls gs://$bucket/v2.0/$module/*.csv.gz)
 
 for file in $FILES
 do
 
-# Extract the table name from the file path (ex: gs://mimic4_v1_0/ADMISSIONS.csv.gz)
+# Extract the table name from the file path (ex: gs://mimic4_v2_0/ADMISSIONS.csv.gz)
 base=${file##*/} # remove path
 filename=${base%.*} # remove .gz
 tablename=${filename%.*} # remove .csv
 
 # Create table and populate it with data from the bucket
-echo bq load --allow_quoted_newlines --skip_leading_rows=1 --source_format=CSV --replace ${dataset_prefix}_${module}.$tablename gs://$bucket/v1.0/$module/$tablename.csv.gz $schema_local_folder/$module/$tablename.json
+bq load --allow_quoted_newlines --skip_leading_rows=1 --source_format=CSV --replace ${dataset_prefix}_${module}.$tablename gs://$bucket/v2.0/$module/$tablename.csv.gz $schema_local_folder/$module/$tablename.json
 
 # Check for error
 if [ $? -eq 0 ];then
@@ -155,7 +150,7 @@ This code will get the list of files in the bucket, and for each file, it will e
 ### B) Set the CHMOD to allow the file as executable (ex: 755), and execute the script file
 
 ```sh
-./upload_mimic4_v1_0.sh
+./upload_mimic4_v2_0.sh
 ```
 
 ### C) Results of the upload process
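If one table fails while the script runs, it can be easier to re-run the `bq load` call by hand for that table alone. A hypothetical invocation for `admissions`, with the bucket, dataset prefix, and schema folder assumed from the steps above:

```sh
# One-off load of a single table, useful for isolating a failure.
bq load --allow_quoted_newlines --skip_leading_rows=1 --source_format=CSV --replace \
  mimic4_v2_0_hosp.admissions \
  gs://mimic-data/v2.0/hosp/admissions.csv.gz \
  "$HOME/mimic-code/mimic-iv/buildmimic/bigquery/schemas/hosp/admissions.json"
```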
@@ -254,7 +249,7 @@ We can test a successful build by running a check query.
 
 ```sh
 bq query --use_legacy_sql=False 'select CASE WHEN count(*) = 383220 THEN True ELSE
-False end AS check from `mimic4_v1_0.patients`'
+False end AS check from `mimic4_v2_0_hosp.patients`'
 ```
 
 This verifies we have the expected row count in the patients table. It's further possible to check the row counts of the other tables by comparing to the already existing MIMIC-IV BigQuery dataset available on `physionet-data`.
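In the same spirit, a sketch for pulling per-table row counts to compare against the hosted `physionet-data` copy; the table list here is illustrative and the dataset name is assumed from the earlier steps:

```sh
# Print a row count for each listed table in the hosp dataset.
for t in patients admissions transfers; do
  bq query --use_legacy_sql=False \
    "SELECT '$t' AS tbl, count(*) AS n FROM \`mimic4_v2_0_hosp.$t\`"
done
```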

mimic-iv/buildmimic/bigquery/schemas/core/admissions.json renamed to mimic-iv/buildmimic/bigquery/schemas/hosp/admissions.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-[{"name": "subject_id", "type": "INT64", "mode": "REQUIRED"}, {"name": "hadm_id", "type": "INT64", "mode": "REQUIRED"}, {"name": "admittime", "type": "DATETIME", "mode": "REQUIRED"}, {"name": "dischtime", "type": "DATETIME", "mode": "NULLABLE"}, {"name": "deathtime", "type": "DATETIME", "mode": "NULLABLE"}, {"name": "admission_type", "type": "STRING", "mode": "REQUIRED"}, {"name": "admission_location", "type": "STRING", "mode": "NULLABLE"}, {"name": "discharge_location", "type": "STRING", "mode": "NULLABLE"}, {"name": "insurance", "type": "STRING", "mode": "NULLABLE"}, {"name": "language", "type": "STRING", "mode": "NULLABLE"}, {"name": "marital_status", "type": "STRING", "mode": "NULLABLE"}, {"name": "ethnicity", "type": "STRING", "mode": "NULLABLE"}, {"name": "edregtime", "type": "DATETIME", "mode": "NULLABLE"}, {"name": "edouttime", "type": "DATETIME", "mode": "NULLABLE"}, {"name": "hospital_expire_flag", "type": "INT64", "mode": "NULLABLE"}]
+[{"name": "subject_id", "type": "INT64", "mode": "REQUIRED"}, {"name": "hadm_id", "type": "INT64", "mode": "REQUIRED"}, {"name": "admittime", "type": "DATETIME", "mode": "REQUIRED"}, {"name": "dischtime", "type": "DATETIME", "mode": "NULLABLE"}, {"name": "deathtime", "type": "DATETIME", "mode": "NULLABLE"}, {"name": "admission_type", "type": "STRING", "mode": "REQUIRED"}, {"name": "admission_location", "type": "STRING", "mode": "NULLABLE"}, {"name": "discharge_location", "type": "STRING", "mode": "NULLABLE"}, {"name": "insurance", "type": "STRING", "mode": "NULLABLE"}, {"name": "language", "type": "STRING", "mode": "NULLABLE"}, {"name": "marital_status", "type": "STRING", "mode": "NULLABLE"}, {"name": "race", "type": "STRING", "mode": "NULLABLE"}, {"name": "edregtime", "type": "DATETIME", "mode": "NULLABLE"}, {"name": "edouttime", "type": "DATETIME", "mode": "NULLABLE"}, {"name": "hospital_expire_flag", "type": "INT64", "mode": "NULLABLE"}]
