Data Modeling with Apache Cassandra (Udacity: Data Engineering Nanodegree) | [email protected] | 2019-12-27

This project is part of Udacity's Data Engineering Nanodegree.
The startup Sparkify wants to analyze the data collected from user activity on its mobile music-streaming app.
Based on this activity, Sparkify would like to run analytics that help the organization better understand user behavior, add features that enhance the user experience, and strategize the product development roadmap.
The business wants to answer the following questions, or analyze the result sets the corresponding queries return:
- Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4
- Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
- Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'
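As a sketch of how the first question maps onto CQL, the query below shows the shape such a statement could take. The table name (session_songs) and column names are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical CQL for the first question: filter on the session and the
# item within that session to fetch one row of artist/song/length data.
QUERY_1 = """
SELECT artist, song_title, song_length
FROM session_songs
WHERE sessionId = 338 AND itemInSession = 4
"""
```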
- Walk the event-data directory to find all sub-directories and the event-log data files (.csv) within them.
- Build a list of the absolute paths of all the files found.
- Loop through each file path and open the file (.csv)
- Skip the first line in the file
- Read the remaining lines of the file and append them to a list
- Close the file
- Open a new csv file
- Loop through the list created earlier
- Skip a list item if its artist_name field is empty.
- Write the selected fields of each remaining item as a line in the new .csv file
- Close the csv file.
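The pre-processing steps above can be sketched in Python roughly as follows. This is a minimal sketch: the directory layout, the assumption that artist_name is the first column, and the function names are illustrative, and the real event logs carry more columns than shown here.

```python
import csv
import glob
import os


def collect_event_rows(root_dir):
    """Walk root_dir, read every event-log CSV (skipping its header line),
    and return the accumulated data rows as lists of strings."""
    file_paths = glob.glob(os.path.join(root_dir, "**", "*.csv"), recursive=True)
    rows = []
    for path in file_paths:
        with open(path, newline="", encoding="utf8") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header line
            rows.extend(reader)
    return rows


def write_combined_csv(rows, out_path, artist_col=0):
    """Write rows to a single CSV, dropping rows whose artist name is empty.

    artist_col is an assumed column position for artist_name."""
    with open(out_path, "w", newline="", encoding="utf8") as f:
        writer = csv.writer(f)
        for row in rows:
            if row and row[artist_col]:  # skip rows with an empty artist_name
                writer.writerow(row)
```

The combined file produced this way is what the INSERT statements later read from.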
- For Apache Cassandra, the best strategy is to have one TABLE per QUERY.
- Artists, Users, and Songs are the dimensions needed to build the tables, which are then used to derive insights.
- Since these tables are used for analytics they are expected to be DE-NORMALIZED.
- Ensured the Fact tables contain the primary keys of all the Dimension tables.
- All the categories of the Dimension tables are included within the Fact tables.
- All the required measures can be calculated by applying aggregation functions to the categories (data).
- Apart from the fields/columns requested in the queries, the other fields required are:
- Fields part of WHERE clause.
- Fields needed to create Partition Keys and Clustering Keys
- Partition Keys and Clustering Keys are determined based on the product usage and business knowledge.
- JOINs are not supported in Apache Cassandra, so normalization techniques are not followed while designing the tables.
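As an example of choosing partition and clustering keys from the query's WHERE clause, a table serving the sessionId/itemInSession query might be defined as below. This is a sketch; the table and column names are illustrative assumptions, not the project's actual schema.

```python
# Illustrative CQL: sessionId is the partition key because the query filters
# on it, and itemInSession is a clustering column, so together they uniquely
# identify a row within the partition.
CREATE_SESSION_SONGS = """
CREATE TABLE IF NOT EXISTS session_songs (
    sessionId int,
    itemInSession int,
    artist text,
    song_title text,
    song_length float,
    PRIMARY KEY (sessionId, itemInSession)
)
"""
```

Each of the other queries would get its own table with keys chosen the same way from its own WHERE clause and sort requirements.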
- Establish a connection to the local instance of Apache Cassandra with replication factor = 1.
- Write CREATE TABLE statements to create one table per query.
- Write INSERT INTO statements to load the event data into the tables.
- Extract data from the CSV file created from the ETL pipeline and add it to the INSERT statements.
- Convert the Python scripts from the Jupyter notebook to a modular Python (.py) file.
- Create common functions to perform the database and ETL functions.
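The connection and load steps could be sketched as below, assuming the DataStax cassandra-driver package, a local single-node cluster, and a keyspace named sparkify. The keyspace name, table schema, and column order are assumptions for illustration.

```python
# Prepared-style CQL with %s placeholders, as the cassandra-driver expects.
INSERT_CQL = (
    "INSERT INTO session_songs "
    "(sessionId, itemInSession, artist, song_title, song_length) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def load_events(rows, keyspace="sparkify"):
    """Insert pre-processed event rows into Cassandra.

    Assumes a local single-node instance (replication factor 1) and that
    the keyspace and table already exist."""
    # Imported here so the module can be used without the driver installed.
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect(keyspace)
    try:
        for row in rows:
            session.execute(
                INSERT_CQL,
                (int(row[0]), int(row[1]), row[2], row[3], float(row[4])),
            )
    finally:
        cluster.shutdown()
```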