AuraSinis/covid_tweets

COVID-19 Rates via Sentiment Analysis of Tweets

Lucas Dwyer, Henry Valk, and Margret Rubio-Keefer

Problem Statement:

The COVID-19 response has been largely regional and state-based in nature. Some states have enacted strictly enforced stay-at-home policies, while others have only provided guidelines. It would be worthwhile to compare the sentiment of social media posts across geographic regions against the local occurrence of the pandemic in those areas. Furthermore, it would be useful to know whether any time series forecasting model can be built on social media sentiment analysis data.

Datasets

Selected states (for data size purposes): NY, CA, TX, FL, GA
Hydrator must be used to process the tweets.

The tweet dataset was processed down to the daily mean sentiment computed by TextBlob, in the form of polarity (TB_polarity) and subjectivity (TB_subjectivity), indexed by day, for the five selected states.
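As a rough illustration of this aggregation step, the sketch below averages per-tweet polarity and subjectivity by day. The record layout (keys `date` and `text`) and the `score` callback are assumptions made for the example, not the repository's actual pipeline. With TextBlob installed, `score` could be `lambda text: tuple(TextBlob(text).sentiment)`.

```python
from collections import defaultdict
from statistics import mean


def daily_mean_sentiment(tweets, score):
    """Average (polarity, subjectivity) per day.

    tweets: iterable of dicts with hypothetical keys "date" and "text".
    score:  callable mapping a text to a (polarity, subjectivity) pair.
    """
    by_day = defaultdict(list)
    for tweet in tweets:
        by_day[tweet["date"]].append(score(tweet["text"]))

    # One row per day, sorted by date, using the column names from the README.
    return {
        day: {
            "TB_polarity": mean(p for p, _ in scores),
            "TB_subjectivity": mean(s for _, s in scores),
        }
        for day, scores in sorted(by_day.items())
    }
```

Passing the scorer in as a parameter keeps the aggregation independent of any particular sentiment library.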

The case dataset was processed down to the daily confirmed cases and daily confirmed deaths, indexed by day, for the five selected states.

Models

Logistic Regression

Performance

We reached only 62% accuracy in predicting the sign of the 2nd difference of the confirmed cases. However, it is important to note that, based on the confusion matrix, the model might still be useful.
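For readers unfamiliar with the target, here is a minimal sketch of a 2nd difference and a binary label derived from its sign. The README does not state exactly which series was differenced or how zeros were labelled, so both choices below are assumptions.

```python
def second_difference(series):
    """Return x[t] - 2*x[t-1] + x[t-2] for every t >= 2."""
    return [
        series[t] - 2 * series[t - 1] + series[t - 2]
        for t in range(2, len(series))
    ]


def sign_labels(values):
    """Label 1 where the value is positive, else 0 (zero counts as 0 here)."""
    return [1 if v > 0 else 0 for v in values]


# Toy numbers for illustration only, not project data.
cases = [1, 2, 4, 7, 8]
accel = second_difference(cases)   # [1, 1, -2]
labels = sign_labels(accel)        # [1, 1, 0]
```

A positive 2nd difference means case growth is accelerating; a negative one means it is slowing.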

The Florida Next Day Model

[Figure: Florida Model's Confusion Matrix]

The virus has an incubation period of 2-12 days, so it may seem surprising that a next-day model performs like this. One possible explanation is that many people begin tweeting about feeling ill, then go to get tested, and receive their results over the following day(s). While waiting for the results, they might begin tweeting more about the virus.
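A next-day setup of this kind can be sketched as pairing each day's sentiment features with the following day's label. The function name and data layout are hypothetical; the project's own feature construction may differ.

```python
def next_day_pairs(features, target):
    """Pair features[t] with target[t + 1], dropping the unmatched ends."""
    if len(features) != len(target):
        raise ValueError("features and target must cover the same days")
    return list(zip(features[:-1], target[1:]))


# Toy values for illustration only: (polarity, subjectivity) per day.
sentiment = [(0.10, 0.40), (0.05, 0.45), (-0.02, 0.50)]
label = [0, 1, 0]
pairs = next_day_pairs(sentiment, label)
# pairs[0] couples day 0's sentiment with day 1's label.
```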

It is also important to note that the majority of the errors are false positives: the model predicted that the 2nd difference would be positive when, in fact, it was negative. Great! Couldn't be happier to be wrong! By contrast, across the 34-day forecasting window there were only 3 false negatives, cases where we predicted the 2nd difference would be negative and it was positive. That is less than 9% of predictions underestimating the coming change in the 2nd difference.
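The false-negative share quoted above follows directly from the reported counts. The sketch below shows that arithmetic, along with a small helper for tallying a binary confusion matrix. The helper is illustrative and assumes labels coded as 1 (positive 2nd difference) and 0 (otherwise).

```python
def confusion_counts(y_true, y_pred):
    """Tally true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


# Figures reported in the text: 3 false negatives in a 34-day window.
false_negative_share = 3 / 34
print(f"{false_negative_share:.1%}")  # prints 8.8%
```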

OLS Regression

Performance

This model factored in the standard deviation of tweet sentiment in addition to the mean values, at multiple time lags. While it achieved modest R² values when trained on the entire dataset, under cross-validation it performed worse than the null model.
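One common way to express "worse than the null model" is an out-of-sample R² measured against a baseline that always predicts the training mean. The sketch below assumes that definition. The README does not specify the cross-validation scheme or scoring the authors used.

```python
from statistics import mean


def out_of_sample_r2(y_train, y_test, y_pred):
    """1 - SSE(model) / SSE(null), where the null predicts mean(y_train).

    A value below 0 means the model did worse than the null on held-out data.
    """
    baseline = mean(y_train)
    sse_model = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    sse_null = sum((y - baseline) ** 2 for y in y_test)
    if sse_null == 0:
        raise ValueError("null model has zero error; R^2 is undefined")
    return 1 - sse_model / sse_null
```

Averaging this score over held-out folds gives a single number to compare against 0, the score of the null model itself.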

It may be the case that there simply isn't enough information in Twitter sentiment to accurately predict a continuous variable like the single-day change in COVID-19 cases per capita. While incorporating sentiment at multiple time lags did improve the model, these features may be best used as a small part of a more holistic model.

About

Covid-19 Confirmed Case Rate Change via Geotagged Tweet Sentiment Analysis
