๐ฑ Samsung Mobile Reviews โ Sentiment Analysis Using Digikala API | Persian Text Processing | Machine Learning
โญ About the Project
This repository contains a complete end-to-end sentiment analysis pipeline on Persian user reviews of Samsung mobile phones, collected directly from the Digikala API.
The project includes:
- Automated data collection from Digikala API (products + comments)
- Full data cleaning & preprocessing for Persian text
- TF-IDF vectorization optimized for short Persian reviews
- Handling imbalanced classes using targeted oversampling
- Training multiple ML models (Naive Bayes, Logistic Regression, Linear SVM)
- Evaluation using Accuracy, Precision, Recall, F1-score, and Confusion Matrix
๐ Dataset Overview
The dataset contains user reviews for Samsung mobile phones:
| Column | Description |
|---|---|
body |
Original user comment |
sentiment |
Label assigned based on Digikala rating |
comment |
Preprocessed Persian text |
label |
Encoded label |
| (+ extracted product features from API in raw files) |
๐ง Sentiment Classes
Reviews were categorized into:
- Positive
- Neutral
- Negative
Based on the numeric rating provided by users on Digikala.
๐ Machine Learning Models Used
We trained and compared:
- Naive Bayes (Best Performance)
- Logistic Regression
- Linear SVM
Best model selected based on weighted F1-score and overall metrics.
๐ Evaluation Metrics
All models were evaluated using:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix
Final results (example):
| Model | Accuracy | F1-score |
|---|---|---|
| Naive Bayes | 0.78 | 0.71 |
| Logistic Regression | 0.69 | 0.70 |
| Linear SVM | 0.73 | 0.71 |
๐ How to Run the Project
- Install dependencies: pip install -r requirements.txt
- Run data collection (optional): python fetch_data.py
- Open the notebook: jupyter notebook sentiment_analysis.ipynb
โ Preprocessing Techniques
- Persian text normalization
- Removing non-Persian characters
- Tokenization
- Stopword removal
- No stemming (better performance for short reviews)
- TF-IDF with n-grams (1,2)