💡[Feature]: Add Sentiment Analysis of Movie Reviews Project

sanchitc05 · sanchitc05 · commit fd3f02502ec8 · 2024-10-21T16:46:01.000Z
Fixes #1510
diff --git a/NLP/sentiment-analysis-movie-reviews/Readme.md b/NLP/sentiment-analysis-movie-reviews/Readme.md
@@ -0,0 +1,154 @@
+# **Sentiment Analysis of Movie Reviews**  
+
+This project implements **sentiment analysis** using **NLP (Natural Language Processing)** techniques to classify movie reviews as **positive, negative, or neutral**. The goal is to process unstructured text data, extract meaningful features, and train classification models to predict the sentiment of each review.
+
+---
+
+## **Table of Contents**
+- [Tech Stack](#tech-stack)  
+- [Features](#features)  
+- [Benefits](#benefits)  
+- [Project Structure](#project-structure)  
+- [Installation](#installation)  
+- [Usage](#usage)  
+- [Evaluation Metrics](#evaluation-metrics)  
+- [Contributing](#contributing)  
+- [License](#license)  
+- [Contact](#contact)  
+
+---
+
+## **Tech Stack**
+
+The following technologies and libraries are used in this project:
+
+- **Python 3.8+**: Core programming language for building the project.
+- **pandas**: For data manipulation and analysis.
+- **scikit-learn**: Machine learning library used for feature extraction and building classifiers.
+- **nltk (Natural Language Toolkit)**: For text preprocessing (tokenization, stopword removal, etc.).
+- **TF-IDF Vectorizer**: Converts text data into numerical features.
+- **Naive Bayes and SVM**: Machine learning algorithms used for classification.
+- **GitHub Codespaces**: Cloud-based development environment for coding and collaboration.
+
+---
+
+## **Features**
+
+- **Text Preprocessing:** 
+  - Removes punctuation, converts text to lowercase, tokenizes, and removes stopwords.
+- **Feature Extraction:** 
+  - Uses **TF-IDF vectorization** to convert reviews into numerical form for model input.
+- **Classification Models:**
+  - Implements **Multinomial Naive Bayes** and **Support Vector Machine (SVM)** classifiers.
+- **Evaluation Metrics:** 
+  - Outputs **accuracy score** and a **classification report** with precision, recall, and F1-score.
+
+---
+
+## **Benefits**
+
+1. **Scalable and Automated Analysis:**  
+   Automates the sentiment analysis of large volumes of reviews, eliminating the need for manual reading.  
+2. **Improved Decision-Making:**  
+   Helps platforms like Netflix or Amazon identify audience reactions and improve content recommendations.  
+3. **Proactive Issue Detection:**  
+   Identifies negative sentiment early to help brands respond to user concerns in real-time.  
+4. **Versatile Solution:**  
+   Can be extended to other text-based sentiment use cases, such as product reviews or social media posts.  
+5. **Brand Reputation Management:**  
+   Helps brands track public opinion trends and address customer feedback proactively.
+
+---
+
+## **Project Structure**
+
+```plaintext
+sentiment-analysis-movie-reviews/
+│
+├── sentiment_analysis.py      # Main script with model code.
+├── requirements.txt           # List of dependencies.
+├── README.md                  # Project documentation.
+```
+
+---
+
+## **Installation**
+
+1. **Clone the Repository:**
+   ```bash
+   git clone https://github.com/<your-username>/machine-learning-repos.git
+   cd machine-learning-repos/sentiment-analysis-movie-reviews
+   ```
+
+2. **Set Up Virtual Environment (Optional):**
+   ```bash
+   python -m venv venv
+   source venv/bin/activate  # For Linux/macOS
+   .\venv\Scripts\activate   # For Windows
+   ```
+
+3. **Install Dependencies:**
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+---
+
+## **Usage**
+
+1. **Prepare the Dataset:**  
+   Optionally, you can store additional datasets inside the `data/` folder.
+
+2. **Run the Script:**
+   ```bash
+   python sentiment_analysis.py
+   ```
+
+3. **Expected Output:**  
+   The console will display:
+   - A **classification report** (precision, recall, and F1-score) for each class.
+   - The **overall accuracy** of the classifier.
+
+---
+
+## **Evaluation Metrics**
+
+- **Accuracy:** Measures the percentage of correct predictions.
+- **Precision:** Measures the correctness of positive predictions.
+- **Recall:** Measures how well the model identifies positive instances.
+- **F1-Score:** Harmonic mean of precision and recall for better evaluation in imbalanced datasets.
+
+---
+
+## **Contributing**
+
+We welcome contributions to this project! To contribute:  
+
+1. **Fork** this repository.  
+2. Create a **new branch** for your feature or bug fix:  
+   ```bash
+   git checkout -b feature-branch
+   ```  
+3. Commit your changes:  
+   ```bash
+   git commit -m "Add feature/bug fix"
+   ```  
+4. **Push** the changes to your forked repository:  
+   ```bash
+   git push origin feature-branch
+   ```  
+5. Open a **Pull Request** to the original repository and provide a detailed description of your changes.
+
+---
+
+## **License**
+
+This project is licensed under the **MIT License**. See the [LICENSE](../LICENSE) file for more details.
+
+---
+
+## **Contact**
+
+- **Author:** [Sanchit Chauhan]
+- **Email:** <sanchitchauhan005@gmail.com.com>  
+- **GitHub:** [Your GitHub Profile](https://github.com/sanchitc05)  
diff --git a/NLP/sentiment-analysis-movie-reviews/requirements.txt b/NLP/sentiment-analysis-movie-reviews/requirements.txt
@@ -0,0 +1,3 @@
+pandas
+scikit-learn
+nltk
diff --git a/NLP/sentiment-analysis-movie-reviews/sentiment_analysis.py b/NLP/sentiment-analysis-movie-reviews/sentiment_analysis.py
@@ -0,0 +1,50 @@
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.naive_bayes import MultinomialNB
+from sklearn.svm import SVC
+from sklearn.metrics import classification_report, accuracy_score
+import nltk
+from nltk.corpus import stopwords
+from nltk.tokenize import word_tokenize
+import re
+
+nltk.download('punkt')
+nltk.download('stopwords')
+
+stop_words = set(stopwords.words('english'))
+
+def preprocess_text(text):
+    text = re.sub(r'[^a-zA-Z\s]', '', text)
+    tokens = word_tokenize(text.lower())
+    filtered_tokens = [word for word in tokens if word not in stop_words]
+    return ' '.join(filtered_tokens)
+
+data = {
+    'review': [
+        "The movie was absolutely fantastic!",
+        "I hated the plot. It was so boring.",
+        "It was an average movie, not great but not terrible either.",
+        "What a waste of time! Completely disappointed.",
+        "Loved the acting and cinematography!"
+    ],
+    'sentiment': ['positive', 'negative', 'neutral', 'negative', 'positive']
+}
+df = pd.DataFrame(data)
+df['cleaned_review'] = df['review'].apply(preprocess_text)
+
+X = df['cleaned_review']
+y = df['sentiment']
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+
+tfidf = TfidfVectorizer(max_features=5000)
+X_train_tfidf = tfidf.fit_transform(X_train)
+X_test_tfidf = tfidf.transform(X_test)
+
+nb_model = MultinomialNB()
+nb_model.fit(X_train_tfidf, y_train)
+nb_predictions = nb_model.predict(X_test_tfidf)
+
+print("Naive Bayes Classifier Report:")
+print(classification_report(y_test, nb_predictions))
+print("Accuracy:", accuracy_score(y_test, nb_predictions))