This repository contains the source code and setup instructions for our final project in UC Berkeley’s Data Engineering course.
Note: Due to course policies, the full dataset and final results (report) cannot be made publicly accessible. If you are a recruiter or collaborator interested in reviewing the complete findings, please reach out to me directly:
- Email: nawodakw@berkeley.edu
- LinkedIn: linkedin.com/in/nawodakw
We benchmarked two data systems—PostgreSQL and PySpark—on the Yelp Open Dataset. Our goal was to compare performance, storage costs, and usability across a set of representative queries. Key tasks included:
- Schema Design: Created a relational schema in PostgreSQL and a corresponding plan for PySpark.
- Data Loading: Ingested multi-gigabyte Yelp data locally.
- Benchmark Queries: Tested query performance (time and memory), user ergonomics, and data-modeling flexibility.
- Analysis & Report: Summarized findings to guide system selection for various business use cases.
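As a rough illustration of the time-and-memory measurement mentioned under Benchmark Queries, here is a minimal sketch of a timing harness. It is not the harness used in this project: `benchmark` and `stand_in_query` are hypothetical names, and the workload is a placeholder for whatever callable actually runs a PostgreSQL or PySpark query. Note that `tracemalloc` only sees Python allocations in the calling process, so it would not capture memory used by the PostgreSQL server or the Spark JVM.

```python
import statistics
import time
import tracemalloc
from typing import Callable, Dict


def benchmark(run_query: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run a zero-argument callable several times; report timing and peak memory.

    Hypothetical helper for illustration only. Memory is measured with
    tracemalloc, which tracks Python-side allocations in this process.
    """
    timings = []
    peak_bytes = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "peak_mib": peak_bytes / (1024 * 1024),
    }


def stand_in_query() -> int:
    # Placeholder workload (an assumption); swap in a function that
    # executes a real query and fetches its results.
    return sum(len(str(i)) for i in range(100_000))


if __name__ == "__main__":
    print(benchmark(stand_in_query))
```

Repeating each query and reporting the median reduces the influence of cold caches and one-off system noise on a single run.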
- postgresql/: Scripts and instructions for setting up the PostgreSQL database, schemas, and sample queries.
- pyspark/: PySpark notebooks and scripts to illustrate our benchmarking approach on the same Yelp data.
- docs/: Additional documentation, diagrams, and notes on data modeling.
- README.md: This overview document.
Our full write-up, including detailed performance metrics and analysis, is private. Please contact me if you would like to access it.
- Email: nawodakw@berkeley.edu
- LinkedIn: linkedin.com/in/nawodakw
Thank you for your interest!