(Udacity: Data Engineering Nanodegree) | [email protected] | 2019-12-23

This project is part of Udacity's Data Engineering Nanodegree.
The startup Sparkify wants to analyze the data collected from user activity on their mobile music streaming app.
Based on this activity data, Sparkify would like to run analytics that help the organization better understand user behavior, add features that enhance the user experience, and strategize the product development roadmap.
- The team designed the following STAR SCHEMA, which captures all the important dimensions needed to build the fact table.
- Artists, Users, Songs, and Time are the essential dimensions used to build the fact table, which in turn is used to derive insights.
- FACT TABLE: contains the dimension keys used to categorize records and the measures used to build aggregations, which makes it possible to analyze user behavior.
- Ensured the dimension tables meet Third Normal Form (3NF).
- The most important attributes of the selected dimension data are used as table columns.
- Each dimension table has a one-to-many relationship with the fact (OLAP) table.
Created the following DIMENSION tables
- Users: user info (columns: user_id, first_name, last_name, gender, level)
- Songs: song info (columns: song_id, title, artist_id, year, duration)
- Artists: artist info (columns: artist_id, name, location, latitude, longitude)
- Time: detailed time info about song plays (columns: start_time, hour, day, week, month, year, weekday)
- Ensured the fact table carries the primary keys of all the dimension tables as foreign keys.
- All the categories (dimension attributes) needed for analysis are reachable from the fact table.
- All the required measures can be calculated by applying aggregation functions over these categories.
- Songplays: song play records together with user, artist, and song info (columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
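A minimal sketch of what the fact-table DDL and one dimension-table DDL could look like, written as query strings the way a shared SQL module (e.g. a hypothetical `sql_queries.py`) might hold them; the column types are assumptions inferred from the column lists above, not the project's exact DDL.

```python
# Sketch of the star-schema DDL; types are assumed, not the project's exact definitions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

# The dimension tables follow the same pattern, e.g. users:
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     CHAR(1),
        level      VARCHAR
    );
"""
```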
- Establish a connection to the local instance of the Postgres database.
- Write CREATE TABLE statements to create the dimension and fact tables.
- Write INSERT INTO statements to load the song and log data into the dimension and fact tables (a connection and table-creation sketch follows below).
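A rough sketch of the connection and table-creation step, assuming a local Postgres instance and a `sparkifydb` database; the credentials and the `sql_queries` import are placeholders, not necessarily what the project uses.

```python
import psycopg2

# Hypothetical module holding the CREATE TABLE statements (see the DDL sketch above).
from sql_queries import create_table_queries

def create_tables():
    # Connect to the local Postgres instance; credentials are placeholders.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    # Run each CREATE TABLE statement for the dimension and fact tables.
    for query in create_table_queries:
        cur.execute(query)

    conn.close()

if __name__ == "__main__":
    create_tables()
```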
- Extract data from the song data JSON files to populate the **SONGS** and **ARTISTS** dimension tables.
- Extract data from the log data JSON files to populate the **USERS** and **TIME** dimension tables.
- The timestamp is represented in milliseconds in the log data.
- The timestamp is transformed into **Time (hh:mm:ss), Hour, Day, Week of Year, Month, Year, Day of Week** (see the sketch after this list).
- Combine the log data with lookups against the dimension tables to populate the **SONGPLAYS** fact table.
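A short sketch of the timestamp transformation, assuming pandas is used for the log files; the file path is illustrative and the `ts`/`page` column names are assumptions about the log schema.

```python
import pandas as pd

# Read one log file (path is illustrative) and keep only song-play events.
df = pd.read_json("data/log_data/sample-events.json", lines=True)
df = df[df["page"] == "NextSong"]

# 'ts' holds the play timestamp in milliseconds; convert it to a datetime.
t = pd.to_datetime(df["ts"], unit="ms")

# Break the timestamp into the columns of the TIME dimension table.
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,   # pandas >= 1.1
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})
```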
- Convert the Python scripts from the notebook kernel into modular Python (.py) files.
- Create common functions to perform the database and ETL operations (sketched below).
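A sketch of the kind of shared helpers the modular code could use; the function names (`get_files`, `process_data`) are illustrative, not necessarily the names used in the project.

```python
import glob
import os

def get_files(filepath):
    """Collect every JSON file under a directory tree (shared by the song and log ETL)."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files

def process_data(cur, conn, filepath, func):
    """Apply a per-file processing function to every JSON file found under filepath."""
    for datafile in get_files(filepath):
        func(cur, datafile)
        conn.commit()
```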
- For a given user, what are their favorite (most played) songs?
- What are the most played songs (top tracks) at a given time of day or season, based on user demographics?
- Is there a song that is played across all regions of the country? (Example queries are sketched below.)
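Illustrative examples of how these questions could be answered against the star schema; the SQL below is a sketch, not part of the project code.

```python
# Hypothetical analytical queries against the star schema.

# Most played songs for a single user (user_id supplied as a parameter).
favorite_songs = """
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    WHERE sp.user_id = %s
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 10;
"""

# Top songs played during a given hour of the day, broken down by gender.
top_songs_by_hour = """
    SELECT s.title, u.gender, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    JOIN users u ON sp.user_id = u.user_id
    JOIN time t ON sp.start_time = t.start_time
    WHERE t.hour = %s
    GROUP BY s.title, u.gender
    ORDER BY plays DESC;
"""
```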