(Udacity: Data Engineering Nanodegree) | [email protected] | 2019-12-23

This project is part of Udacity's Data Engineering Nanodegree.
The startup Sparkify wants to analyze the data collected from user activity on their mobile music streaming app.
Based on this activity data, Sparkify would like to run analytics that help the organization better understand user behavior, add features that enhance the user experience, and strategize the product development roadmap.
- The team designed the following STAR SCHEMA, which captures all the important dimensions needed to build the fact table.
- Artists, Users, Songs, and Time are the essential dimensions used to build the fact table, which in turn is used to derive insights.
- FACT TABLE: contains the dimension keys used to categorize records and the measures used to build aggregations, which makes it possible to analyze user behavior.
- Ensured the dimension tables meet Third Normal Form (3NF).
- The most important attributes of the selected dimension data are used as table columns.
- Each dimension table has a one-to-many relationship with the fact (OLAP) table.
Created the following DIMENSION tables
- Users: user info (columns: user_id, first_name, last_name, gender, level)
- Songs: song info (columns: song_id, title, artist_id, year, duration)
- Artists: artist info (columns: artist_id, name, location, latitude, longitude)
- Time: detailed time info about song plays (columns: start_time, hour, day, week, month, year, weekday)
- Ensured the fact table carries the primary keys of all the dimension tables as foreign keys.
- All the categories (dimension attributes) needed for analysis are reachable from the fact table.
- All the required measures can be calculated by applying aggregation functions over these categories.
- Songplays: song play records together with user, artist, and song info (columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
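A minimal sketch of what the fact-table DDL and one dimension-table DDL could look like, written as query strings the way a shared SQL module (e.g. a hypothetical `sql_queries.py`) might hold them; the column types are assumptions inferred from the column lists above, not the project's exact DDL.

```python
# Sketch of the star-schema DDL; types are assumed, not the project's exact definitions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

# The dimension tables follow the same pattern, e.g. users:
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     CHAR(1),
        level      VARCHAR
    );
"""
```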
- Establish a connection to the local instance of the Postgres database.
- Write CREATE TABLE statements to create the dimension and fact tables.
- Write INSERT INTO statements to load the song and log data into the dimension and fact tables (a connection and table-creation sketch follows below).
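A rough sketch of the connection and table-creation step, assuming a local Postgres instance and a `sparkifydb` database; the credentials and the `sql_queries` import are placeholders, not necessarily what the project uses.

```python
import psycopg2

# Hypothetical module holding the CREATE TABLE statements (see the DDL sketch above).
from sql_queries import create_table_queries

def create_tables():
    # Connect to the local Postgres instance; credentials are placeholders.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    # Run each CREATE TABLE statement for the dimension and fact tables.
    for query in create_table_queries:
        cur.execute(query)

    conn.close()

if __name__ == "__main__":
    create_tables()
```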
- Extract data from the song data JSON files to populate the **SONGS** and **ARTISTS** dimension tables.
- Extract data from the log data JSON files to populate the **USERS** and **TIME** dimension tables.
- The timestamp is represented in milliseconds in the log data.
- The timestamp is transformed into **Time (hh:mm:ss), Hour, Day, Week of Year, Month, Year, Day of Week** (see the sketch after this list).
- Combine the log data with lookups against the dimension tables to populate the **SONGPLAYS** fact table.
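A short sketch of the timestamp transformation, assuming pandas is used for the log files; the file path is illustrative and the `ts`/`page` column names are assumptions about the log schema.

```python
import pandas as pd

# Read one log file (path is illustrative) and keep only song-play events.
df = pd.read_json("data/log_data/sample-events.json", lines=True)
df = df[df["page"] == "NextSong"]

# 'ts' holds the play timestamp in milliseconds; convert it to a datetime.
t = pd.to_datetime(df["ts"], unit="ms")

# Break the timestamp into the columns of the TIME dimension table.
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,   # pandas >= 1.1
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})
```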
- Convert the Python scripts from the notebook kernel into modular Python (.py) files.
- Create common functions to perform the database and ETL operations (sketched below).
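A sketch of the kind of shared helpers the modular code could use; the function names (`get_files`, `process_data`) are illustrative, not necessarily the names used in the project.

```python
import glob
import os

def get_files(filepath):
    """Collect every JSON file under a directory tree (shared by the song and log ETL)."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files

def process_data(cur, conn, filepath, func):
    """Apply a per-file processing function to every JSON file found under filepath."""
    for datafile in get_files(filepath):
        func(cur, datafile)
        conn.commit()
```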
- For a given user, what are their favorite (most played) songs?
- What are the most played songs (top tracks) at a given time of day or season, based on user demographics?
- Is there a song that is played across all regions of the country? (Example queries are sketched below.)
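Illustrative examples of how these questions could be answered against the star schema; the SQL below is a sketch, not part of the project code.

```python
# Hypothetical analytical queries against the star schema.

# Most played songs for a single user (user_id supplied as a parameter).
favorite_songs = """
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    WHERE sp.user_id = %s
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 10;
"""

# Top songs played during a given hour of the day, broken down by gender.
top_songs_by_hour = """
    SELECT s.title, u.gender, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    JOIN users u ON sp.user_id = u.user_id
    JOIN time t ON sp.start_time = t.start_time
    WHERE t.hour = %s
    GROUP BY s.title, u.gender
    ORDER BY plays DESC;
"""
```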