DuckDB-ETL/README.md at main · fran-cornachione/DuckDB-ETL

DuckDB ETL

In this project, I extracted data from csv files, then created tables from each file, and loaded them into Postgres.

tables = ["students", "teachers", "classes", "courses", "enrollments", "grades"]

for table in tables: 
	conn.execute(f"""
	CREATE OR REPLACE TABLE
		{table}
	AS SELECT
		*
	FROM
		read_csv_auto('data/raw_data/{table}.csv');""")

For the transformations, I modified the schema of each tabl. By default, Postgres takes INTEGER columns (4 bytes) as BIGINT (8 bytes). For columns like age, id, credits, this is unnecessary and leads to wasted storage. For example:

conn.execute("""
CREATE OR REPLACE TABLE
    students_clean
AS SELECT
    CAST(student_id AS INTEGER) AS student_id,
    CAST(age AS SMALLINT) AS age,
    * EXCLUDE (student_id, age)
FROM
    students;
""")

Once I transformed all the tables, I simply loaded them into Postgres with this simple query

for table in tables:
    clean_table_name = f"{table}_clean" # students_clean, teachers_clean, etc...
  
    conn.execute(f"""
    CREATE OR REPLACE TABLE 
        pg_db.{table} 
    AS SELECT 
        * 
    FROM 
        {clean_table_name};
    """)
  
    print(f"[{table}] table loaded succesfully into Postgres")

conn.close()

How to run the project

Clone repository

git clone https://github.com/fran-cornachione/DuckDB-ETL

Install requirements
```
pip install -r requirements.txt
```

Replace this line with your Postgres credentials

conn.execute(f"""
ATTACH 'dbname=your_dbname user=your_user password=your_password host=your_host port=5432' AS pg_db (TYPE POSTGRES);
""")

On the ETL.ipynb file, click the "Run All" button

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DuckDB ETL

How to run the project

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

DuckDB ETL

How to run the project