This is a UCI Data Science undergraduate capstone project for the Stats 170A/B course. The project group consists of me and Anjali Krishnan (@anjalik2).
The aim was to develop a system to examine a possible association between public sentiment, as reflected in Reddit news posts, and the popularity of songs on online music charts. We hoped that knowledge of current events could provide insight into the kinds of music that attract public interest at the same point in time.
The data_tools/ folder contains scripts that perform web scraping and feature engineering to create one dataset:
- data_tools/data_fetching.py is first run to obtain the raw data and store it in various tables in a SQL database.
- data_tools/text_analysis.py is then run to create model features from the text data collected.
- data_tools/create_features_tables.sql is finally run to set up the schema for the full dataset table and populate this table with the appropriate data.
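The run order above can be sketched as a small driver script. This is only an illustration: the names `build_commands` and `run_pipeline` are hypothetical, and it assumes the two Python scripts take no command-line arguments and are run from the repository root. The final SQL step is left out because it depends on which database client is used.

```python
import subprocess
import sys

# Python steps of the data pipeline, in the order they should be run.
PYTHON_STEPS = [
    "data_tools/data_fetching.py",
    "data_tools/text_analysis.py",
]


def build_commands(python=sys.executable):
    """Return one command (as an argument list) per Python pipeline step."""
    return [[python, script] for script in PYTHON_STEPS]


def run_pipeline(python=sys.executable):
    """Run each step in order, stopping at the first failure."""
    for command in build_commands(python):
        # check=True raises CalledProcessError if a step exits non-zero,
        # so later steps never run on incomplete data.
        subprocess.run(command, check=True)
    # Assumption: create_features_tables.sql is then executed manually
    # with the SQL client for whichever database the scripts wrote to.
```

Calling `run_pipeline()` from the repository root would run both scripts in sequence; `data_tools/create_features_tables.sql` still needs to be run afterwards.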
The EDA/ folder contains code to visualize aspects of the resulting dataset. Running these files is not necessary to fit any models.
- EDA/EDA_sentiment_scores.R creates line plots of the news and music sentiment scores over time.
- EDA/word_embeddings.R exports the music word embeddings for use with the TensorFlow Embedding Projector.
  - This data is stored in EDA/TEP-demo/. Usage information can be found in our report.
The models/ folder contains code to fit various models on the dataset.
We fit three models, each described in the report. The models are provided as model[number].Rmd notebooks and are meant to be knitted to observe model results. Each model is also exported to an .html file for convenient viewing.
Other items in this folder are:
- models/model_template.Rmd, a template file to easily create new model notebooks
- models/saves/, a directory for storing saved copies of the downloaded dataset and fitted models
  - This process is automatic during the knitting process. The first knit may take a while, but subsequent knits are quick.
- models/includes/, a directory containing .Rmd files with R chunks that are reused across several model notebooks.
  - These are imported into each model notebook to facilitate changes to the chunks. An include can be changed once and have its changes apply to all the model files.
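The save-and-reuse behaviour of models/saves/ (slow first knit, quick later knits) follows a common load-or-compute pattern. The notebooks themselves do this in R. The sketch below only illustrates the general idea in Python, and the helper name `load_or_compute` and the use of `pickle` are assumptions made for the example, not the project's code.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the object saved at `path`, computing and saving it on a miss.

    `compute` is any zero-argument function, for example one that
    downloads the dataset or fits a model.
    """
    path = Path(path)
    if path.exists():
        # Fast path: a previous run already saved the result.
        with path.open("rb") as handle:
            return pickle.load(handle)

    # Slow path: do the expensive work once, then save it for next time.
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

Deleting a saved file forces the corresponding step to be recomputed on the next run, which is useful after the dataset or model specification changes.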
Our final project report can be found in the report/ folder, both as a .pdf and as its .Rmd source file.
The demo_for_course_grading/ folder contains a simplified version of our best-performing model for the course staff to easily work with:
- demo_for_course_grading/model2.Rmd is a copy of the Model 2 notebook (with all the includes coded directly into the single file). This file can be run or knitted to observe results.
  - A .html version has been exported for viewing convenience.
- demo_for_course_grading/dataset_20MB.csv is a subset of our full dataset to facilitate quick evaluation.
- demo_for_course_grading/*.rda are model save files that facilitate quick knitting of the notebook.