This project is intended to create a postgres database for Sparkify platform song and artist data and combine that information with Sparkify songplay data to create a descriptive set of records of Sparkify platform song plays.
- Installation
- Usage
- Database Design
- ETL Process
- Files
- data (folder)
- log_data
- song_data
- create_tables.py
- etl.py
- sql_queries.py
- etl.ipynb
- test.ipynb
- data (folder)
Python 3
pip install pyscopg2
From the command line, run these commands:
$ python create_tables.py
$ python etl.py
The Sparkify database consists of only 5 separate tables linked by primary and foreign keys. Below are descriptions of the tables and schema.
Table: songplay
Contains data to show the songs listened to by Sparkify platform users. Also shows the time and location of when and where the user listened to the song.
| Type | Column | Type |
|---|---|---|
| PK | songplay_id | int |
| FK | start_time | timestamp |
| FK | user_id | int |
| null | level | varchar |
| FK | song_id | varchar |
| FK | artist_id | varchar |
| null | session_id | varchar |
| null | location | varchar |
| null | user_agent | varchar |
Table: user
Contains data on the Sparkify platform user.
| Type | Column | Type |
|---|---|---|
| PK | user_id | int |
| null | first_name | varchar |
| null | last_name | varchar |
| null | gender | varchar |
| null | level | varchar |
Table: song
Contains song data.
| Type | Column | Type |
|---|---|---|
| PK | song_id | varchar |
| null | title | varchar |
| FK | artist_id | varchar |
| null | year | int |
| null | duration | decimal |
Table: artist
Contains artist data.
| Type | Column | Type |
|---|---|---|
| PK | artist_id | varchar |
| null | name | varchar |
| null | location | varchar |
| null | latitude | varchar |
| null | longitude | varchar |
Table: time
Contains timestamp data from song listens as well as other time related.
| Type | Column | Type |
|---|---|---|
| PK | start_time | timestamp |
| null | hour | int |
| null | day | int |
| null | week | int |
| null | month | int |
| null | year | int |
| null | weekday | int |
song_data => insert song data into song table song_data => insert artist data into artist data
log_data => extract timestamp and time information from songplay, insert into time table => find artist_id and song_id from artist and song tables => insert artist, song, and songplay data into songplay table
data (folder)
- log_data
- json records of Sparkify song plays
- song_data
- json records of Sparkify song and artist information
create_tables.py
Python module that drops tables if they exist and creates them in the database using queries from sql_queries.
etl.py
Python module that moves songplay, song and artist data from the song_data and log_data folders and inserts data into the songplay table.
etl.ipynb
Notebook for developing etl.py module.
sql_queries.py
Python module that contains all SQL queries for table creation and inserts.
test.ipynb
Notebook for testing etl.py module.
