Commit 68cf860

Merge pull request #1489 from MIT-LCP/duckdb_mimiciv
Improvements for duckdb
2 parents 8ca183d + a7425b2 commit 68cf860

4 files changed, +719 -531 lines changed

.github/workflows/duckdb.yml

Lines changed: 1 addition & 4 deletions
@@ -21,14 +21,11 @@ jobs:
 
       - name: Download demo data
         uses: ./.github/actions/download-demo
-        with:
-          gcp-project-id: ${{ secrets.GCP_PROJECT_ID }}
-          gcp-sa-key: ${{ secrets.GCP_SA_KEY }}
 
       - name: Load icu/hosp data into duckdb
         run: |
           echo "Running duckdb build."
-          ./${BUILDCODE_PATH}/import_duckdb.sh
+          ./${BUILDCODE_PATH}/import_duckdb.sh ./
 
           echo `md5sum mimic4.db`

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# DuckDB

The script in this folder creates the schema for MIMIC-IV and
loads the data into the appropriate tables for
[DuckDB](https://duckdb.org/).
DuckDB, like SQLite, is serverless and
stores all information in a single file.
Unlike SQLite, an OLTP database,
DuckDB is an OLAP database, and therefore optimized for analytical queries.
This generally results in faster analytical queries for researchers using MIMIC-IV
with DuckDB compared to SQLite.
To learn more, please read their ["why duckdb"](https://duckdb.org/docs/why_duckdb)
page.

The instructions to load MIMIC-III into a DuckDB database
require only:

1. DuckDB to be installed and
2. Your computer to have a POSIX-compliant terminal shell,
   which is found by default on any macOS, Linux, or BSD installation.

To use these instructions on Windows,
you need a Unix command line environment,
which you can obtain by installing either
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

## Set-up

### Quick overview

1. [Install](https://duckdb.org/docs/installation/) the CLI version of DuckDB
2. [Download](https://physionet.org/content/mimiciii/1.4/) the MIMIC-III files
3. Create DuckDB database and load data

### Install DuckDB

Follow the instructions on the DuckDB website to
[install](https://duckdb.org/docs/installation/)
the CLI version of DuckDB.

You will need to place the `duckdb` binary in a folder listed in your `PATH` environment variable,
e.g. `/usr/local/bin`.

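
One quick way to confirm that the binary is discoverable is to ask the shell where it resolves `duckdb`. The sketch below uses the POSIX `command -v` builtin; the helper name `check_on_path` is hypothetical and is not part of this repository.

```sh
#!/bin/sh
# Hypothetical helper (not part of this repository): report whether a
# command can be found on the current PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Example: verify the DuckDB CLI before running the import script.
check_on_path duckdb || echo "Install DuckDB or add its folder to PATH first."
```

If the check fails after installation, the folder holding the binary is probably missing from `PATH` in the current shell session.
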
### Download MIMIC-III files

[Download](https://physionet.org/content/mimiciii/1.4/)
the CSV files for MIMIC-III by any method you wish.

The instructions assume the CSV files are arranged in the following folder structure:

```
mimic_data_dir
    ADMISSIONS.csv.gz
    ...
```

The CSV files can be uncompressed (end in `.csv`) or compressed (end in `.csv.gz`).

The easiest way to download them is to open a terminal and run:

```
wget -r -N -c -np -nH --cut-dirs=1 --user YOURUSERNAME --ask-password https://physionet.org/files/mimiciii/1.4/
```

Replace `YOURUSERNAME` with your PhysioNet username.

This will make your `mimic_data_dir` be `mimiciii/1.4`.

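
Before moving on, it can help to confirm that the data directory actually contains CSV files in one of the two accepted forms. The following is a minimal sketch, assuming the flat layout shown above (files directly inside `mimic_data_dir`); the helper name `count_csv_files` is hypothetical and is not part of this repository.

```sh
#!/bin/sh
# Hypothetical helper (not part of this repository): count the files in a
# directory that end in .csv or .csv.gz (top level only, no recursion).
count_csv_files() {
    dir="$1"
    count=0
    for f in "$dir"/*.csv "$dir"/*.csv.gz; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

For example, `count_csv_files mimiciii/1.4` prints the number of matching files; a result of `0` suggests the download did not land where expected.
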
### Create DuckDB database and load data

The last step requires creating a DuckDB database and
loading the data into it.

You can do all of this with one shell script, `import_duckdb.sh`,
located in this repository.

See the help for it below:

```sh
$ ./import_duckdb.sh -h
./import_duckdb.sh:
USAGE: ./import_duckdb.sh mimic_data_dir [output_db]
WHERE:
    mimic_data_dir directory that contains csv.gz or csv files
    output_db: optional filename for duckdb file (default: mimic3.db)
$
```

Here's an example invocation that creates the database under the default filename, `mimic3.db`:

```sh
$ ./import_duckdb.sh physionet.org/files/mimiciii/1.4

... output removed
Successfully finished loading data into mimic3.db.

$ ls -lh mimic3.db
-rw-rw-r--. 1 myuser mygroup 26G Jan 25 16:11 mimic3.db
```

The script will print out progress as it goes.
Be patient: loading can take minutes to hours,
depending on your computer's configuration.

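
Once the load finishes, a simple sanity check is to open the database with the DuckDB CLI and list its tables. The sketch below assumes `duckdb` is on your `PATH` and defaults to the `mimic3.db` filename from the help text above. The helper name `list_tables` is hypothetical. It relies on the CLI accepting a SQL statement as an argument after the database file and on DuckDB's `SHOW TABLES` statement.

```sh
#!/bin/sh
# Hypothetical helper (not part of this repository): list the tables in a
# DuckDB database file. Assumes the `duckdb` CLI is installed and on PATH.
list_tables() {
    db="${1:-mimic3.db}"
    if [ ! -f "$db" ]; then
        echo "database file not found: $db" >&2
        return 1
    fi
    # -readonly avoids accidental writes while inspecting the database.
    duckdb -readonly "$db" "SHOW TABLES;"
}
```

Running `list_tables` with no argument inspects `mimic3.db` in the current directory; pass a path to inspect a database created under a different `output_db` name.
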
## Help

Please see the [issues page](https://github.com/MIT-LCP/mimic-iii/issues) to discuss other issues you may be having.
