Skip to content

guidesify/DataSci_Crisis

Repository files navigation

Data Science Project: NLP for Disaster Tweets

This repository contains code and data for a data science project focused on using natural language processing techniques to analyze tweets related to disasters and detect the positive cases.

Data Source

The dataset used for this project is the Natural Language Processing with Disaster Tweets competition dataset from Kaggle. It consists of a collection of tweets related to disasters, where each tweet is pre-labelled whether it is related to a real disaster.

The dataset includes 4,342 tweets labeled as not disaster-related and 3,271 labeled as disaster-related. The tweets cover a range of disaster events, including natural disasters like hurricanes and earthquakes, as well as man-made disasters like terrorist attacks and oil spills.

Project Goals

The main goals of this project are:

  • To perform exploratory data analysis on the dataset to gain insights into the characteristics of disaster-related tweets.
  • To perform topic modeling on the dataset to identify common themes and topics in disaster-related tweets.
  • To build a classification model to detect whether a given tweet is related to a real disaster or not.

Exploratory Data Analysis

In the exploratory data analysis phase, we'll examine the dataset to gain insights into the characteristics of disaster-related tweets. This will include analyzing the distribution of disaster-related tweets, examining the most common words and phrases used in disaster-related tweets, looking for patterns in the data that could be useful for building a classification model, and possibly scrape for real-time data for retraining purposes (closed-loop feedback).

Topic Modeling

In the topic modeling phase, we'll use unsupervised machine learning techniques to identify common themes and topics in disaster-related tweets. This will involve using algorithms like Latent Dirichlet Allocation (LDA) to identify groups of words that tend to appear together in disaster-related tweets.

Classification Model

In the classification model phase, we'll use supervised machine learning techniques to build a model that can detect whether a given tweet is related to a real disaster or not. We'll use a variety of feature extraction techniques, including bag-of-words and word embeddings, and evaluate the performance of different algorithms, such as Naive Bayes, Logistic Regression, and Random Forest.

Usage

To use the code in this repository, you'll need to have the following software installed:

  • Python 3
  • Jupyter Notebook
  • The required Python packages (listed in requirements.txt)

To get started, simply clone this repository to your local machine and run the Jupyter Notebook file (.ipynb) files. This file contains all the code necessary to perform the project tasks outlined above.

Results

The results of this project will be presented in a blog post, which will be linked to in this README once it is published. We expect that the project will provide valuable insights into the characteristics of disaster-related tweets and will produce a classification model that can accurately detect real disaster tweets. This could be of great help to businesses in the disaster relief space, allowing them to provide physical and psychological help more efficiently and quickly.

Conclusion

Overall, the goal of this project is to use natural language processing techniques to analyze tweets related to disasters and detect the positive cases. We hope that this project will provide useful insights and models that can be applied in real-world disaster relief efforts.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •