Commit 6acd1dd

Merge pull request #1240 from armando-fandango/sqlite-import-for-mimic4
sqlite import for mimic4 as per #1052
2 parents 15ab4a2 + ecd635b commit 6acd1dd

3 files changed, with 152 additions and 0 deletions
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Building the MIMIC database with SQLite

Either `import.sh` or `import.py` can be used to generate a [SQLite](https://sqlite.org/index.html) database file from the MIMIC-IV demo or full dataset.

`import.sh` is a shell script that will work with any POSIX compliant shell.
It is memory efficient and does not require loading entire data files
into memory. It only needs three things to run:

1. A POSIX compliant shell (e.g., dash, bash, zsh, ksh)
2. [SQLite](https://sqlite.org/index.html)
3. gzip (which is installed by default on most Linux/BSD/Mac systems)

**Note:** The `import.sh` script will set all data fields to *text*.

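Because every column arrives as text, sorting and comparing numeric fields behaves like string comparison unless the value is cast. The sketch below illustrates this with Python's built-in `sqlite3` module on an in-memory database. The table `labevents_demo` and its values are made up for illustration and are not part of MIMIC-IV.

```python
import sqlite3


def numeric_sort_demo():
    """Compare text ordering with numeric ordering on a TEXT column."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table: every column is declared TEXT, as a text-only import would do.
    con.execute("CREATE TABLE labevents_demo (subject_id TEXT, valuenum TEXT)")
    con.executemany(
        "INSERT INTO labevents_demo VALUES (?, ?)",
        [("1", "9"), ("2", "10"), ("3", "100")],
    )
    # Plain ORDER BY compares the stored strings character by character.
    as_text = [row[0] for row in con.execute(
        "SELECT valuenum FROM labevents_demo ORDER BY valuenum")]
    # An explicit CAST makes SQLite compare the values as numbers.
    as_real = [row[0] for row in con.execute(
        "SELECT valuenum FROM labevents_demo ORDER BY CAST(valuenum AS REAL)")]
    con.close()
    return as_text, as_real
```

With text ordering, `"10"` and `"100"` sort before `"9"`; with the cast, the values come back in numeric order. Queries against a database built by `import.sh` may need similar casts on numeric and date columns.
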
`import.py` is a Python script. It requires the following to run:

1. Python 3 installed
2. SQLite
3. [pandas](https://pandas.pydata.org/)
4. SQLAlchemy, which pandas needs in order to open the `sqlite:///` connection string that `import.py` passes to `to_sql`

## Step 1: Download the CSV or CSV.GZ files

- Download the MIMIC-IV dataset from: https://physionet.org/content/mimiciv/
- Place `import.sh` or `import.py` in the top-level dataset folder, alongside the module folders that contain the `csv` or `csv.gz` files

i.e. your folder structure should resemble:

```
path/to/mimic-iv/
├── import.sh
├── import.py
├── hosp
│   ├── admissions.csv.gz
│   ├── ...
│   └── transfers.csv.gz
└── icu
    ├── chartevents.csv.gz
    ├── ...
    └── procedureevents.csv.gz
```

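A quick pre-flight check can confirm the layout before running either script. This is a minimal sketch, assuming the data files sit exactly one folder below the scripts as in the tree above. `find_data_files` is a hypothetical helper and is not part of this commit.

```python
from pathlib import Path


def find_data_files(root="."):
    """Group CSV and CSV.GZ files by the folder they sit in (one level deep)."""
    found = {}
    for path in sorted(Path(root).glob("*/*.csv*")):
        # The glob also matches names such as "x.csv.bak", so filter by suffix.
        if path.is_file() and path.name.endswith((".csv", ".csv.gz")):
            found.setdefault(path.parent.name, []).append(path.name)
    return found


if __name__ == "__main__":
    for folder, names in find_data_files().items():
        print("{}: {} data file(s)".format(folder, len(names)))
```

An empty result suggests the script is being run from the wrong folder.
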
## Step 2: Edit the script if needed

`import.sh` does **not** need edits to work with either the demo or full dataset.
Please continue to Step 3.

If you are using the `import.py` script,
it may be necessary to make minor edits to it. For example:

- The script as committed here searches for `**/*.csv*`, which matches both `.csv` and `.csv.gz` files. If your copy of the script names `csv.gz` explicitly and your files are `.csv`, you will need to change `csv.gz` to `csv`.

## Step 3: Generate the SQLite file

To generate the SQLite file:

If you are using `import.sh`, run on the command line:

```
$ ./import.sh
```

If you are using `import.py`, run on the command line:

```
$ python import.py
```

If loading the full dataset, this will take some time,
particularly the `chartevents` table.

Both scripts stop without importing anything if `mimic4.db` already exists, so move or delete an old file before re-running.

The scripts will ultimately generate a SQLite database file called `mimic4.db`.
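
Once the import has finished, listing the tables and their row counts is a simple sanity check. The following is a sketch using Python's built-in `sqlite3` module. `table_row_counts` is a hypothetical helper name, and the default path assumes the `mimic4.db` file sits in the current folder.

```python
import os
import sqlite3


def table_row_counts(db_path="mimic4.db"):
    """Return {table_name: row_count} for every table in the database file."""
    # sqlite3.connect() would silently create a missing file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    con = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = con.execute(
                "SELECT COUNT(*) FROM {}".format(quoted)).fetchone()[0]
        return counts
    finally:
        con.close()


if __name__ == "__main__":
    for name, count in table_row_counts().items():
        print("{}: {} rows".format(name, count))
```

Counting rows in the largest tables of the full dataset can itself take a while.
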
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
import os
import sys

from glob import glob
import pandas as pd

DATABASE_NAME = "mimic4.db"
THRESHOLD_SIZE = 5 * 10**7  # file size in bytes; larger files are loaded in chunks
CHUNKSIZE = 10**6  # rows per chunk
CONNECTION_STRING = "sqlite:///{}".format(DATABASE_NAME)

if os.path.exists(DATABASE_NAME):
    # Refuse to overwrite an existing database and exit with a non-zero status
    sys.exit("File {} already exists.".format(DATABASE_NAME))

for f in glob("**/*.csv*", recursive=True):
    print("Starting processing {}".format(f))
    folder, filename = os.path.split(f)
    # Use the file stem as the table name. str.strip() removes characters,
    # not suffixes, so it would mangle names such as "services.csv.gz".
    tablename = filename.split(".", 1)[0].lower()
    if os.path.getsize(f) < THRESHOLD_SIZE:
        df = pd.read_csv(f)
        df.to_sql(tablename, CONNECTION_STRING)
    else:
        # If the file is too large, let's do the work in chunks
        for chunk in pd.read_csv(f, chunksize=CHUNKSIZE, low_memory=False):
            chunk.to_sql(tablename, CONNECTION_STRING, if_exists="append")
    print("Finished processing {}".format(f))

print("Should be all done!")
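
Deriving a table name from a file name is easy to get wrong with `str.strip`, which removes any of the given characters from both ends rather than removing a suffix. The sketch below shows one way to take the file stem instead. `table_name` is a hypothetical helper, not a function from this commit.

```python
import os


def table_name(path):
    """Return the lowercased file stem, e.g. 'hosp/Services.csv.gz' -> 'services'."""
    filename = os.path.basename(path)
    # Keep everything before the first dot, then lowercase it.
    return filename.split(".", 1)[0].lower()


if __name__ == "__main__":
    name = "services.csv.gz"
    # Chained strip() calls eat leading and trailing letters of the stem too.
    print(name.strip(".gz").strip(".csv"))
    print(table_name(name))
```

Chained `strip` calls turn `services.csv.gz` into `ervice`, because `s`, `c`, and `v` are stripped from both ends of the name, whereas splitting on the first dot keeps `services` intact.
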
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
#!/bin/sh

# Copyright (c) 2021 Thomas Ward <[email protected]>
#
# Permission to use, copy, modify, and distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

OUTFILE=mimic4.db

if [ -s "$OUTFILE" ]; then
    echo "File \"$OUTFILE\" already exists." >&2
    exit 111
fi

for FILE in */**.csv*; do
    # skip loop if glob didn't match an actual file
    [ -f "$FILE" ] || continue
    # drop the directory, trim off the extension, and lowercase the file stem
    # (e.g., hosp/HELLO.csv.gz -> hello)
    BASENAME=${FILE##*/}
    TABLE_NAME=$(echo "${BASENAME%%.*}" | tr "[:upper:]" "[:lower:]")
    case "$FILE" in
        *csv)
            IMPORT_CMD=".import $FILE $TABLE_NAME"
            ;;
        # need to decompress csv before load
        *csv.gz)
            IMPORT_CMD=".import \"|gzip -dc $FILE\" $TABLE_NAME"
            ;;
        # not a data file so skip
        *)
            continue
            ;;
    esac
    echo "Loading $FILE."
    sqlite3 "$OUTFILE" <<EOF
.headers on
.mode csv
$IMPORT_CMD
EOF
    echo "Finished loading $FILE."
done

echo "Finished loading data into $OUTFILE."
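
The shell script streams each compressed file through `gzip -dc` into SQLite, so no file is held in memory as a whole. The same idea can be sketched with only the Python standard library. This is an illustration under stated assumptions, not the method used by `import.py`: `import_csv_as_text` and `quote_ident` are hypothetical helpers, the first row is assumed to be a header, and every column is stored as TEXT.

```python
import csv
import gzip
import sqlite3


def quote_ident(name):
    """Quote a table or column name for use in a SQL statement."""
    return '"' + name.replace('"', '""') + '"'


def import_csv_as_text(con, path, table):
    """Stream one CSV or CSV.GZ file into `table`, storing every column as TEXT."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # assumption: the first row holds column names
        columns = ", ".join(quote_ident(c) + " TEXT" for c in header)
        con.execute("CREATE TABLE {} ({})".format(quote_ident(table), columns))
        placeholders = ", ".join("?" for _ in header)
        # executemany() pulls rows from the reader one at a time.
        con.executemany(
            "INSERT INTO {} VALUES ({})".format(quote_ident(table), placeholders),
            reader,
        )
    con.commit()
```

A call such as `import_csv_as_text(sqlite3.connect("mimic4.db"), "hosp/admissions.csv.gz", "admissions")` would load a single file. Rows whose field count differs from the header raise an error rather than being padded.
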
