Skip to content

An ETL process made with SQL (DuckDB) and Python. Extracts data from .csv files, creates tables from those files,

Notifications You must be signed in to change notification settings

fran-cornachione/DuckDB-ETL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DuckDB ETL

1761912291555

In this project, I extracted data from csv files, then created tables from each file, and loaded them into Postgres.

tables = ["students", "teachers", "classes", "courses", "enrollments", "grades"]

for table in tables: 
	conn.execute(f"""
	CREATE OR REPLACE TABLE
		{table}
	AS SELECT
		*
	FROM
		read_csv_auto('data/raw_data/{table}.csv');""")

For the transformations, I modified the schema of each tabl. By default, Postgres takes INTEGER columns (4 bytes) as BIGINT (8 bytes). For columns like age, id, credits, this is unnecessary and leads to wasted storage. For example:

conn.execute("""
CREATE OR REPLACE TABLE
    students_clean
AS SELECT
    CAST(student_id AS INTEGER) AS student_id,
    CAST(age AS SMALLINT) AS age,
    * EXCLUDE (student_id, age)
FROM
    students;
""")

Once I transformed all the tables, I simply loaded them into Postgres with this simple query

for table in tables:
    clean_table_name = f"{table}_clean" # students_clean, teachers_clean, etc...
  
    conn.execute(f"""
    CREATE OR REPLACE TABLE 
        pg_db.{table} 
    AS SELECT 
        * 
    FROM 
        {clean_table_name};
    """)
  
    print(f"[{table}] table loaded succesfully into Postgres")

conn.close()

How to run the project

  1. Clone repository

    git clone https://github.com/fran-cornachione/DuckDB-ETL
  2. Install requirements

    pip install -r requirements.txt
    
  3. Replace this line with your Postgres credentials

    conn.execute(f"""
    ATTACH 'dbname=your_dbname user=your_user password=your_password host=your_host port=5432' AS pg_db (TYPE POSTGRES);
    """)
  4. On the ETL.ipynb file, click the "Run All" button

About

An ETL process made with SQL (DuckDB) and Python. Extracts data from .csv files, creates tables from those files,

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors