nawoda2/Yelp-Database-Benchmarking

Yelp Database Benchmarking

This repository contains the source code and setup instructions for our final project in UC Berkeley’s Data Engineering course.

Note: Due to course policies, the full dataset and the final report cannot be made publicly accessible. If you are a recruiter or collaborator interested in reviewing the complete findings, please reach out to me directly.

Project Overview

We benchmarked two data systems, PostgreSQL and PySpark, on the Yelp Open Dataset. Our goal was to compare performance, storage cost, and usability across a set of representative queries. Key tasks included:

  1. Schema Design: Created a relational schema in PostgreSQL and a corresponding plan for PySpark.
  2. Data Loading: Ingested multi-gigabyte Yelp data locally.
  3. Benchmark Queries: Measured query execution time and memory use, and assessed user ergonomics and data-modeling flexibility.
  4. Analysis & Report: Summarized findings to guide system selection for various business use cases.
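
As a rough illustration of the time and memory measurements in step 3, a query can be wrapped in a small timing harness. This is a minimal sketch and not the harness used in the project: the `benchmark` function, its `repeats` parameter, and the stand-in workload are hypothetical names introduced here. It assumes each query is exposed as a zero-argument Python callable, for example a function that executes a SQL statement through a database driver or triggers a PySpark action.

```python
import statistics
import time
import tracemalloc
from typing import Callable, Dict


def benchmark(run_query: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run a zero-argument callable several times; report timings and peak memory.

    Hypothetical helper for illustration only. `run_query` stands in for any
    function that executes one benchmark query.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    timings = []
    peak_bytes = 0
    for _ in range(repeats):
        tracemalloc.start()
        try:
            start = time.perf_counter()
            run_query()
            elapsed = time.perf_counter() - start
            # get_traced_memory() returns (current, peak) in bytes.
            _, peak = tracemalloc.get_traced_memory()
        finally:
            # Always stop tracing, even if the query raises.
            tracemalloc.stop()
        timings.append(elapsed)
        peak_bytes = max(peak_bytes, peak)

    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
        "peak_bytes": float(peak_bytes),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a callable that runs a real query.
    print(benchmark(lambda: sum(i * i for i in range(100_000)), repeats=3))
```

Two caveats apply to this sketch. `tracemalloc` only tracks allocations made by the Python process, so it does not capture memory used by the PostgreSQL server or the Spark JVM; those need external tools. Tracing also adds overhead, so timing and memory runs are better kept separate when precise timings matter.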

Repository Structure

  • postgresql/: Scripts and instructions for setting up the PostgreSQL database, schemas, and sample queries.
  • pyspark/: PySpark notebooks and scripts to illustrate our benchmarking approach on the same Yelp data.
  • docs/: Additional documentation, diagrams, and notes on data modeling.
  • README.md: This overview document.

Confidential Report

Our full write-up, including detailed performance metrics and analysis, is private. Please contact me if you would like to access it.


Contact

Thank you for your interest!
