This project performs sentiment analysis on movie reviews from a large-scale IMDb dataset. It applies machine learning and NLP techniques to classify reviews as positive or negative, and uses explainable AI (XAI) tools to interpret model decisions.
- Convert IMDb ratings (1–10 scale) into binary sentiment classes:
  - 1–4 → Negative
  - 7–10 → Positive
- Build models to classify text-based reviews as positive or negative
- Address challenges like sarcasm, negation, and mixed sentiments
- Use Explainable AI (XAI) to understand and interpret the model’s predictions
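
The rating-to-label rule above can be sketched as a small function. This is a minimal illustration, not the project's actual code: the function name `rating_to_sentiment` is hypothetical, and since the thresholds above do not say what happens to ratings 5 and 6, the sketch assumes they are left unlabeled (returned as `None`).

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[str]:
    """Map a 1-10 rating to a binary sentiment label.

    Assumption: ratings 5 and 6 fall outside both ranges and get no label.
    """
    if 1 <= rating <= 4:
        return "negative"
    if 7 <= rating <= 10:
        return "positive"
    return None


ratings = [1, 4, 5, 6, 7, 10]
labels = [rating_to_sentiment(r) for r in ratings]
print(labels)
```

Rows that map to `None` could then be dropped before training, if that matches how the dataset was actually filtered.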
- 📁 Source: Kaggle IMDb Dataset
- 🔢 Size: 149,780 reviews
- 🎬 Columns:
  - `Review`: Text review
  - `Rating`: Numerical score (1–10)
  - `Movie`: Name of the movie
  - `Resenhas`: Portuguese review (dropped)
- Expanded contractions (e.g., can't → cannot)
- Removed URLs, special characters, and irrelevant stopwords
- Retained negation words like “not” to preserve sentiment context
- Created a `Review_clean` column for processed text
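
The cleaning steps above might look roughly like the following. This is a sketch under stated assumptions: the contraction map, stopword set, and negation set are tiny illustrative subsets, and `clean_review` is a hypothetical name, not the project's function.

```python
import re

# Illustrative subsets only; a real pipeline would use fuller lists.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "isn't": "is not",
    "don't": "do not",
    "it's": "it is",
}
STOPWORDS = {"the", "a", "an", "is", "it", "this", "was", "and", "of", "to", "i", "do", "will"}
NEGATIONS = {"not", "no", "never", "cannot"}  # always kept

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_review(text: str) -> str:
    """Lowercase, strip URLs, expand contractions, drop stopwords but keep negations."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    text = URL_RE.sub(" ", text)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = NON_ALPHA_RE.sub(" ", text)  # remove digits, punctuation, symbols
    tokens = [t for t in text.split() if t in NEGATIONS or t not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean_review("This isn't the movie I can't forget! See https://example.com")
print(cleaned)
```

Contractions are expanded before punctuation is removed so that the apostrophe is still present when the lookup runs.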
- Verified class balance (equal reviews per rating)
- Plotted word clouds by sentiment
- Analyzed review length, word density, and average word length
- N-gram analysis (uni-, bi-, and trigrams) showed key sentiment patterns
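
As a rough illustration of the n-gram analysis, the sketch below counts the most frequent n-grams in a handful of made-up reviews using only the standard library. The helper names and the sample texts are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts: List[str], n: int, k: int = 3) -> List[Tuple[Tuple[str, ...], int]]:
    """Count n-grams across all texts and return the k most common."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(ngrams(text.split(), n))
    return counts.most_common(k)


reviews = ["not bad at all", "not bad but not great", "really not bad"]
top_bigram = top_ngrams(reviews, 2, k=1)
print(top_bigram)
```

Running the same count separately on positive and negative reviews is one way to surface the sentiment-specific phrases mentioned above.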
- Used `CountVectorizer` and `TF-IDF` for text-to-vector conversion
- Focused on unigrams, bigrams, and trigrams
- Applied chi-squared test for feature selection
- Found TF-IDF more effective than raw counts at separating sentiment
- Logistic Regression ✅ (Best performer)
- Decision Tree (Overfit)
- Random Forest (Some overfitting)
- Precision
- F1-score
- AUC-ROC
- Precision: 0.8955 (test)
- F1 Score: 0.8955
- AUC: 0.9606
- Fine-tuned using `GridSearchCV`
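
The training, tuning, and scoring steps could be wired together roughly as below. This is a sketch, not the project's configuration: the parameter grid over `C`, the two-fold cross-validation, and the toy data are all assumptions, and the metrics are computed on the training texts purely to show the calls, so they say nothing about the reported test scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data for illustration only (1 = positive, 0 = negative).
texts = [
    "great movie loved it", "wonderful acting great story",
    "loved the cast great fun", "great story wonderful ending",
    "not good boring plot", "terrible movie not worth watching",
    "boring and terrible acting", "not worth it boring story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: only the regularization strength C is searched.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

predictions = search.predict(texts)
positive_proba = search.predict_proba(texts)[:, 1]

precision = precision_score(labels, predictions)
f1 = f1_score(labels, predictions)
auc = roc_auc_score(labels, positive_proba)
print(search.best_params_, precision, f1, auc)
```

Putting the vectorizer inside the `Pipeline` means it is refit on each cross-validation fold, which avoids leaking vocabulary statistics from the held-out fold into training.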
- Sarcasm: e.g., “Sure, best movie ever... I slept halfway.”
- Mixed polarity: Some reviews contained both praise and criticism.
- Negation Handling: “Not bad” ≠ “bad”
- Introduce deep learning models like LSTM or BERT
- Implement sarcasm detection modules
- Apply domain adaptation for different datasets (e.g., product reviews)
- Nikhil Gupta