This repository contains code and data for a data science project that uses natural language processing techniques to analyze tweets related to disasters and detect which ones describe real disaster events.
The dataset used for this project is the Natural Language Processing with Disaster Tweets competition dataset from Kaggle. It consists of a collection of tweets related to disasters, where each tweet is pre-labelled to indicate whether it refers to a real disaster.
The dataset includes 4,342 tweets labeled as not disaster-related and 3,271 labeled as disaster-related. The tweets cover a range of disaster events, including natural disasters like hurricanes and earthquakes, as well as man-made disasters like terrorist attacks and oil spills.
The main goals of this project are:
- To perform exploratory data analysis on the dataset to gain insights into the characteristics of disaster-related tweets.
- To perform topic modeling on the dataset to identify common themes and topics in disaster-related tweets.
- To build a classification model to detect whether a given tweet is related to a real disaster or not.
In the exploratory data analysis phase, we'll examine the dataset to gain insights into the characteristics of disaster-related tweets. This will include analyzing the distribution of disaster-related tweets, examining the most common words and phrases used in disaster-related tweets, looking for patterns in the data that could be useful for building a classification model, and possibly scraping real-time data for retraining purposes (closed-loop feedback).
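As a rough illustration of this step, the sketch below loads the training data, checks the class balance, and counts the most frequent words in disaster-related tweets. It assumes the Kaggle train.csv (with its "text" and "target" columns) lives in a data/ folder; adjust the path and tokenization to match the actual notebook.

```python
# Minimal EDA sketch (assumed file location: data/train.csv).
from collections import Counter
import re

import pandas as pd

df = pd.read_csv("data/train.csv")

# Class balance: target == 1 marks tweets about real disasters.
print(df["target"].value_counts())

# Most common words in disaster-related tweets (naive tokenization;
# real preprocessing would also strip URLs, mentions, and stop words).
disaster_text = " ".join(df.loc[df["target"] == 1, "text"].str.lower())
tokens = re.findall(r"[a-z']+", disaster_text)
print(Counter(tokens).most_common(20))
```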
In the topic modeling phase, we'll use unsupervised machine learning techniques to identify common themes and topics in disaster-related tweets. This will involve using algorithms like Latent Dirichlet Allocation (LDA) to identify groups of words that tend to appear together in disaster-related tweets.
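A minimal LDA sketch with scikit-learn is shown below; it is illustrative rather than the exact configuration used in the notebook, and it reuses the `df` DataFrame loaded in the EDA sketch above.

```python
# Illustrative topic modeling sketch using Latent Dirichlet Allocation.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

disaster_tweets = df.loc[df["target"] == 1, "text"]

# LDA works on raw term counts, so we build a bag-of-words matrix first.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
doc_term = vectorizer.fit_transform(disaster_tweets)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term)

# Print the top words for each discovered topic.
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```

The number of topics (here 5) is an assumption for illustration; in practice it would be tuned, for example by inspecting topic coherence.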
In the classification model phase, we'll use supervised machine learning techniques to build a model that can detect whether a given tweet is related to a real disaster or not. We'll use a variety of feature extraction techniques, including bag-of-words and word embeddings, and evaluate the performance of different algorithms, such as Naive Bayes, Logistic Regression, and Random Forest.
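The sketch below shows one plausible baseline for this phase: a TF-IDF (bag-of-words) pipeline evaluated with cross-validation across the three algorithms mentioned above. It again assumes the `df` DataFrame from the EDA sketch and is not necessarily the final model used in the notebook.

```python
# Baseline classification sketch: TF-IDF features + three candidate models.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X, y = df["text"], df["target"]

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Compare models with 5-fold cross-validated F1 score.
for name, model in models.items():
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```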
To use the code in this repository, you'll need to have the following software installed:
- Python 3
- Jupyter Notebook
- The required Python packages (listed in requirements.txt)
To get started, clone this repository to your local machine and run the Jupyter Notebook (.ipynb) file. It contains all the code necessary to perform the project tasks outlined above.
The results of this project will be presented in a blog post, which will be linked to in this README once it is published. We expect that the project will provide valuable insights into the characteristics of disaster-related tweets and will produce a classification model that can accurately detect real disaster tweets. This could be of great help to businesses in the disaster relief space, allowing them to provide physical and psychological help more efficiently and quickly.
Overall, the goal of this project is to use natural language processing techniques to analyze tweets related to disasters and detect which ones describe real disasters. We hope that this project will provide useful insights and models that can be applied in real-world disaster relief efforts.