The COVID-19 response has been largely regional and state-based in nature. Some states have enacted strictly enforced stay-at-home policies, while others have only provided guidelines. It would therefore be worthwhile to compare the sentiment of social media posts across geographic regions against the local trajectory of the pandemic in those areas. Furthermore, it would be useful to build a time series forecasting model based on social media sentiment analysis data.
Hydrator must be used to rehydrate the tweet IDs into full tweets before processing.
This dataset was processed down to the daily mean sentiment as scored by TextBlob, in the form of polarity (TB_polarity) and subjectivity (TB_subjectivity), indexed by day, for the five selected states.
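The aggregation step can be sketched as follows. This is a minimal illustration assuming the per-tweet TB_polarity and TB_subjectivity scores have already been computed (in practice by running TextBlob over each hydrated tweet); the dates, state label, and score values here are made up.

```python
import pandas as pd

# Hypothetical per-tweet scores; in the real pipeline these come from
# TextBlob(tweet_text).sentiment applied to each hydrated tweet.
tweets = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-04-02"]),
    "state": ["NY", "NY", "NY"],
    "TB_polarity": [0.2, 0.4, -0.1],
    "TB_subjectivity": [0.5, 0.7, 0.3],
})

# Collapse to the daily mean sentiment per state, indexed by day
daily = (tweets
         .groupby(["state", "date"])[["TB_polarity", "TB_subjectivity"]]
         .mean())
```

The result is one row per (state, day) pair, which lines up directly with the daily case counts described below.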
COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University
This dataset was processed down to the daily confirmed cases and daily confirmed deaths, indexed by day, for the five selected states.
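Since the CSSE repository reports cumulative totals, the daily figures come from first-differencing the cumulative series. A minimal sketch, with made-up counts for a single state:

```python
import pandas as pd

# Hypothetical cumulative confirmed-case totals in the CSSE style
cumulative = pd.Series(
    [10, 15, 23, 23, 40],
    index=pd.date_range("2020-04-01", periods=5, name="date"),
    name="confirmed",
)

# Daily new cases = first difference of the cumulative series;
# the first day has no prior value, so it keeps its full count
daily_new = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)
```

The same transformation applies to the cumulative deaths series.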
We achieved only 62% accuracy in predicting the sign of the 2nd difference of the confirmed cases. However, based on the confusion matrix, the model might still be useful.
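The classification target described here, whether the 2nd difference of confirmed cases is positive, can be constructed like this; the cumulative counts below are made up for illustration:

```python
import numpy as np

# Hypothetical cumulative confirmed-case counts for one state
confirmed = np.array([100, 120, 150, 170, 180, 200])

first_diff = np.diff(confirmed)    # daily new cases
second_diff = np.diff(first_diff)  # day-over-day change in new cases

# Binary target: is case growth accelerating (2nd difference positive)?
target = (second_diff > 0).astype(int)
```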
One possible explanation for why a next-day model performs this way, despite the virus having an incubation period of 2-12 days, is testing behavior: people may begin tweeting about feeling ill, then go get tested, and then receive their results the next day(s). While waiting for those results, they might tweet more about the virus.
It is also important to note that the majority of the errors are false positives: the model predicted that the 2nd difference would be positive when, in fact, it was negative. Great! Couldn't be happier to be wrong! Over the 34-day forecasting window, there were only 3 false negatives, i.e., cases where we predicted the 2nd difference would be negative and it turned out positive. That's less than 9% of predictions underestimating the coming change in the 2nd difference.
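This error breakdown reads straight off the confusion matrix. The counts below are hypothetical, chosen only to roughly match the figures quoted above (a 34-day window, about 62% accuracy, and 3 false negatives); the actual cell values were not reported.

```python
import numpy as np

# Hypothetical confusion matrix, arranged as [[TN, FP], [FN, TP]].
# Counts are illustrative, not the project's actual results.
cm = np.array([[4, 10],
               [3, 17]])

tn, fp = cm[0]
fn, tp = cm[1]
total = cm.sum()

accuracy = (tp + tn) / total
fn_rate = fn / total  # share of days where the model underestimated the change
```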
This model factored in tweet sentiment standard deviation in addition to mean values, at multiple time lags. While it achieved modest R² values when trained on the entire dataset, under cross-validation the model performed worse than the null model.
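"Worse than the null model" has a precise meaning here: R² measures improvement over always predicting the mean, so the null model scores exactly zero and a negative cross-validated R² means the fitted model predicts held-out data worse than the mean would. A small numpy demonstration with made-up values:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])

# Null model: always predict the mean -> R^2 of exactly 0
null_pred = np.full_like(y, y.mean())

# A model that fits held-out data poorly scores below the null model,
# which is what a negative cross-validated R^2 indicates
bad_pred = np.array([4.0, 1.0, 4.0, 1.0])
```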
It may be the case that there simply isn't enough information in Twitter sentiment to accurately predict a continuous variable like the single-day change in COVID cases per capita. While incorporating sentiment at multiple time lags did improve the model, these features may be best used as a small part of a more holistic model.
