Data Modeling with Apache Cassandra (Udacity: Data Engineering Nanodegree) | [email protected] | 2019-12-27

This project is part of Udacity's Data Engineering Nanodegree.
The startup Sparkify wants to analyze the data collected from user activity on its mobile music-streaming app.
Based on this activity, Sparkify would like to run analytics that help the organization better understand user behavior, add features that enhance the user experience, and strategize the product development roadmap.
The business wants to answer the following questions, or analyze the result sets the corresponding queries return:
- Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4
- Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
- Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'
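As a sketch of how the first question maps onto CQL, the query below shows the shape such a statement could take. The table name (session_songs) and column names are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical CQL for the first question: filter on the session and the
# item within that session to fetch one row of artist/song/length data.
QUERY_1 = """
SELECT artist, song_title, song_length
FROM session_songs
WHERE sessionId = 338 AND itemInSession = 4
"""
```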
- Walk the event-data directory to find all sub-directories and the event-log data files (.csv) within them.
- Build a list of the absolute paths of all the files found.
- Loop through each file path and open the file (.csv)
- Skip the first line in the file
- Read the remaining lines of the file and append them to a list
- Close the file
- Open a new csv file
- Loop through the list created earlier
- Skip a list item if its artist_name field is empty.
- Write the selected fields of each remaining item as a line in the new .csv file
- Close the csv file.
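The pre-processing steps above can be sketched in Python roughly as follows. This is a minimal sketch: the directory layout, the assumption that artist_name is the first column, and the function names are illustrative, and the real event logs carry more columns than shown here.

```python
import csv
import glob
import os


def collect_event_rows(root_dir):
    """Walk root_dir, read every event-log CSV (skipping its header line),
    and return the accumulated data rows as lists of strings."""
    file_paths = glob.glob(os.path.join(root_dir, "**", "*.csv"), recursive=True)
    rows = []
    for path in file_paths:
        with open(path, newline="", encoding="utf8") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header line
            rows.extend(reader)
    return rows


def write_combined_csv(rows, out_path, artist_col=0):
    """Write rows to a single CSV, dropping rows whose artist name is empty.

    artist_col is an assumed column position for artist_name."""
    with open(out_path, "w", newline="", encoding="utf8") as f:
        writer = csv.writer(f)
        for row in rows:
            if row and row[artist_col]:  # skip rows with an empty artist_name
                writer.writerow(row)
```

The combined file produced this way is what the INSERT statements later read from.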
- For Apache Cassandra, the best strategy is to have one TABLE per QUERY.
- Artists, Users, and Songs are the dimensions needed to build the tables, which are then used to derive insights.
- Since these tables are used for analytics they are expected to be DE-NORMALIZED.
- Ensured the Fact tables contain the primary keys of all the Dimension tables.
- All the categories of the Dimension tables are included within the Fact tables.
- All the required measures can be calculated by applying aggregation functions to the categories (data).
- Apart from the fields/columns requested in the queries, the other fields required are:
- Fields part of WHERE clause.
- Fields needed to create Partition Keys and Clustering Keys
- Partition Keys and Clustering Keys are determined based on the product usage and business knowledge.
- JOINs are not supported in Apache Cassandra, so normalization techniques are not followed while designing the tables.
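As an example of choosing partition and clustering keys from the query's WHERE clause, a table serving the sessionId/itemInSession query might be defined as below. This is a sketch; the table and column names are illustrative assumptions, not the project's actual schema.

```python
# Illustrative CQL: sessionId is the partition key because the query filters
# on it, and itemInSession is a clustering column, so together they uniquely
# identify a row within the partition.
CREATE_SESSION_SONGS = """
CREATE TABLE IF NOT EXISTS session_songs (
    sessionId int,
    itemInSession int,
    artist text,
    song_title text,
    song_length float,
    PRIMARY KEY (sessionId, itemInSession)
)
"""
```

Each of the other queries would get its own table with keys chosen the same way from its own WHERE clause and sort requirements.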
- Establish a connection to the local instance of Apache Cassandra with replication factor = 1.
- Write CREATE TABLE statements to create one table per query.
- Write INSERT INTO statements to load the event data into the tables.
- Extract data from the CSV file created from the ETL pipeline and add it to the INSERT statements.
- Convert the Python scripts from the Jupyter notebook to a modular Python (.py) file.
- Create common functions to perform the database and ETL functions.
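The connection and load steps could be sketched as below, assuming the DataStax cassandra-driver package, a local single-node cluster, and a keyspace named sparkify. The keyspace name, table schema, and column order are assumptions for illustration.

```python
# Prepared-style CQL with %s placeholders, as the cassandra-driver expects.
INSERT_CQL = (
    "INSERT INTO session_songs "
    "(sessionId, itemInSession, artist, song_title, song_length) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def load_events(rows, keyspace="sparkify"):
    """Insert pre-processed event rows into Cassandra.

    Assumes a local single-node instance (replication factor 1) and that
    the keyspace and table already exist."""
    # Imported here so the module can be used without the driver installed.
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect(keyspace)
    try:
        for row in rows:
            session.execute(
                INSERT_CQL,
                (int(row[0]), int(row[1]), row[2], row[3], float(row[4])),
            )
    finally:
        cluster.shutdown()
```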