
Commit 7a98493

Author: Benjamin Moody
sqlite/import.py: avoid dependence on sqlalchemy.
To import MIMIC-IV into SQLite, import.py uses Pandas both to parse each data file (read_csv) and to push the data into an SQL database (to_sql). The latter step can use an SQLAlchemy database connection for full generality (which might be useful sometimes), but it can also simply use an sqlite3.Connection created by the Python standard library. Since this script is solely aimed at providing an easy way to get the data into SQLite format, it's nice to avoid unnecessary dependencies.
1 parent 1ff562b commit 7a98493
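
For context, the heart of the change is that pandas' DataFrame.to_sql accepts a plain sqlite3.Connection just as readily as an SQLAlchemy engine or connection string. A minimal before/after sketch (the file name "admissions.csv" and table name "admissions" are illustrative, not taken from the commit):

import sqlite3

import pandas as pd

# Before: passing a connection string made pandas build an SQLAlchemy engine
# under the hood, so sqlalchemy had to be installed.
# pd.read_csv("admissions.csv").to_sql("admissions", "sqlite:///mimic4.db")

# After: a standard-library sqlite3 connection is enough when the target is SQLite.
with sqlite3.connect("mimic4.db") as connection:
    pd.read_csv("admissions.csv").to_sql("admissions", connection, if_exists="append")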

2 files changed: +19, -20 lines

mimic-iv/buildmimic/sqlite/README.md

Lines changed: 1 addition & 3 deletions

@@ -15,9 +15,7 @@ into memory. It only needs three things to run:
 `import.py` is a python script. It requires the following to run:
 
 1. Python 3 installed
-2. SQLite
-3. [pandas](https://pandas.pydata.org/)
-4. [sqlalchemy](https://www.sqlalchemy.org/)
+2. [pandas](https://pandas.pydata.org/)
 
 ## Step 1: Download the CSV or CSV.GZ files.
 
mimic-iv/buildmimic/sqlite/import.py

Lines changed: 18 additions & 17 deletions
@@ -1,4 +1,5 @@
 import os
+import sqlite3
 import sys
 
 from glob import glob
@@ -7,28 +8,28 @@
 DATABASE_NAME = "mimic4.db"
 THRESHOLD_SIZE = 5 * 10**7
 CHUNKSIZE = 10**6
-CONNECTION_STRING = "sqlite:///{}".format(DATABASE_NAME)
 
 if os.path.exists(DATABASE_NAME):
     msg = "File {} already exists.".format(DATABASE_NAME)
     print(msg)
     sys.exit()
 
-for f in glob("**/*.csv*", recursive=True):
-    print("Starting processing {}".format(f))
-    folder, filename = os.path.split(f)
-    tablename = filename.lower()
-    if tablename.endswith('.gz'):
-        tablename = tablename[:-3]
-    if tablename.endswith('.csv'):
-        tablename = tablename[:-4]
-    if os.path.getsize(f) < THRESHOLD_SIZE:
-        df = pd.read_csv(f)
-        df.to_sql(tablename, CONNECTION_STRING)
-    else:
-        # If the file is too large, let's do the work in chunks
-        for chunk in pd.read_csv(f, chunksize=CHUNKSIZE, low_memory=False):
-            chunk.to_sql(tablename, CONNECTION_STRING, if_exists="append")
-    print("Finished processing {}".format(f))
+with sqlite3.Connection(DATABASE_NAME) as connection:
+    for f in glob("**/*.csv*", recursive=True):
+        print("Starting processing {}".format(f))
+        folder, filename = os.path.split(f)
+        tablename = filename.lower()
+        if tablename.endswith('.gz'):
+            tablename = tablename[:-3]
+        if tablename.endswith('.csv'):
+            tablename = tablename[:-4]
+        if os.path.getsize(f) < THRESHOLD_SIZE:
+            df = pd.read_csv(f)
+            df.to_sql(tablename, connection)
+        else:
+            # If the file is too large, let's do the work in chunks
+            for chunk in pd.read_csv(f, chunksize=CHUNKSIZE, low_memory=False):
+                chunk.to_sql(tablename, connection, if_exists="append")
+        print("Finished processing {}".format(f))
 
 print("Should be all done!")
