In this project, I extracted data from CSV files, created a DuckDB table from each file, transformed the schemas, and loaded the results into Postgres.
```python
tables = ["students", "teachers", "classes", "courses", "enrollments", "grades"]

for table in tables:
    conn.execute(f"""
        CREATE OR REPLACE TABLE {table} AS
        SELECT *
        FROM read_csv_auto('data/raw_data/{table}.csv');
    """)
```

For the transformations, I modified the schema of each table. By default, DuckDB's CSV reader infers plain integer columns as BIGINT (8 bytes). For columns like age, id, and credits, this is unnecessary and wastes storage. For example:
```python
conn.execute("""
    CREATE OR REPLACE TABLE students_clean AS
    SELECT
        CAST(student_id AS INTEGER) AS student_id,
        CAST(age AS SMALLINT) AS age,
        * EXCLUDE (student_id, age)
    FROM students;
""")
```

Once all the tables were transformed, I loaded them into Postgres with one query per table:
```python
for table in tables:
    clean_table_name = f"{table}_clean"  # students_clean, teachers_clean, etc.
    conn.execute(f"""
        CREATE OR REPLACE TABLE pg_db.{table} AS
        SELECT *
        FROM {clean_table_name};
    """)
    print(f"[{table}] table loaded successfully into Postgres")

conn.close()
```
- Clone the repository:

  ```shell
  git clone https://github.com/fran-cornachione/DuckDB-ETL
  ```

- Install the requirements:

  ```shell
  pip install -r requirements.txt
  ```

- Replace this line with your Postgres credentials:

  ```python
  conn.execute("""
      ATTACH 'dbname=your_dbname user=your_user password=your_password host=your_host port=5432'
      AS pg_db (TYPE POSTGRES);
  """)
  ```

- In the `ETL.ipynb` file, click the "Run All" button.
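If you prefer not to hard-code the DSN, a small helper can assemble the ATTACH statement from the individual credentials. This is a hypothetical addition, not code from the repo; the function name and defaults are my own.

```python
# Hypothetical helper (not part of the repo) that assembles the ATTACH
# statement from individual Postgres credentials instead of a hard-coded DSN.
def attach_statement(dbname: str, user: str, password: str,
                     host: str, port: int = 5432, alias: str = "pg_db") -> str:
    dsn = f"dbname={dbname} user={user} password={password} host={host} port={port}"
    return f"ATTACH '{dsn}' AS {alias} (TYPE POSTGRES);"

stmt = attach_statement("school", "etl_user", "secret", "localhost")
print(stmt)
# ATTACH 'dbname=school user=etl_user password=secret host=localhost port=5432' AS pg_db (TYPE POSTGRES);
```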
