This project builds an end-to-end Natural Language Processing (NLP) pipeline to classify tweets as positive or negative using supervised machine learning techniques. It demonstrates practical text preprocessing, feature engineering, model training, and evaluation on large-scale social media data.
The dataset used in this project is the Sentiment140 dataset from Kaggle:
https://www.kaggle.com/datasets/kazanova/sentiment140
The dataset contains 1.6 million labeled tweets. Each record includes:
- `target` – Tweet polarity (0 = Negative, 4 = Positive)
- `id` – Unique tweet identifier
- `date` – Timestamp of the tweet
- `flag` – Query associated with the tweet (`NO_QUERY` if none)
- `user` – Username of the tweet author
- `text` – The content of the tweet
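The schema above maps directly onto a pandas loader. The real Kaggle file name (`training.1600000.processed.noemoticon.csv`) and latin-1 encoding are assumptions based on the standard download; the snippet demonstrates on a tiny in-memory sample and remaps the original 0/4 polarity labels to 0/1:

```python
from io import StringIO
import pandas as pd

# Column names follow the Sentiment140 schema described above.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]

def load_sentiment140(path_or_buffer):
    """Load the headerless Sentiment140 CSV and map labels 0/4 to 0/1."""
    df = pd.read_csv(path_or_buffer, encoding="latin-1", names=COLUMNS)
    df["target"] = df["target"].replace(4, 1)
    return df

# Tiny in-memory sample standing in for the real 1.6M-row file.
sample = StringIO(
    '0,1,"Mon Apr 06 22:19:45 PDT 2009",NO_QUERY,userA,"so sad today"\n'
    '4,2,"Mon Apr 06 22:20:00 PDT 2009",NO_QUERY,userB,"loving this!"\n'
)
df = load_sentiment140(sample)
```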
- Text cleaning (removing punctuation, URLs, special characters)
- Lowercasing
- Stopword removal
- Tokenization
- Text vectorization using CountVectorizer / TF-IDF
- Conversion of tweet text into numerical feature vectors
- Supervised machine learning classifier trained on labeled tweet data
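The preprocessing and modeling steps above can be sketched as a single scikit-learn pipeline. The cleaning regexes and the choice of `LogisticRegression` are illustrative assumptions, since the write-up only specifies "a supervised classifier" and CountVectorizer / TF-IDF features:

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def clean_tweet(text):
    """Strip URLs, mentions, and non-letters; lowercase; drop stopwords."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    tokens = [t for t in text.lower().split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

# TF-IDF features + a linear classifier; the actual model used in the
# project is not named, so LogisticRegression here is an assumption.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_tweet),
    LogisticRegression(max_iter=1000),
)

# Toy corpus standing in for the 1.6M labeled tweets.
X = [
    "I love this so much! http://t.co/abc",
    "worst day ever, totally awful @someone",
    "such a great movie",
    "this is terrible and sad",
]
y = [1, 0, 1, 0]
model.fit(X, y)
```

The pipeline object bundles vectorization and classification, so `model.predict(["raw tweet text"])` applies the same cleaning and TF-IDF transform that was fitted during training.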
- Training Accuracy: 79.87%
- Test Accuracy: 77.67%
The small gap (~2%) between training and test accuracy indicates good generalization with no significant overfitting.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 (Negative) | 0.79 | 0.76 | 0.77 | 160,000 |
| 1 (Positive) | 0.77 | 0.80 | 0.78 | 160,000 |
- Overall Accuracy: 78%
- Macro Average F1-Score: 0.78
- Weighted Average F1-Score: 0.78
- The model achieves balanced performance across both sentiment classes.
- Precision and recall values are consistent, indicating stable classification behavior.
- Similar training and testing accuracy suggests the model is not overfitting.
- Performance is solid for a classical machine learning approach on noisy social media text.
- Hyperparameter tuning
- Use of n-grams and advanced vectorization techniques
- Implementation of deep learning models (LSTM / GRU)
- Transformer-based models (BERT)
- Deployment as an API or web application
- Python
- Pandas & NumPy
- Scikit-learn
- NLP preprocessing techniques
- Jupyter Notebook
- Large-scale text classification implementation
- Practical NLP pipeline development
- Balanced sentiment prediction performance
- Strong baseline model with room for advanced improvements
- This implementation is based on a tutorial from GeeksforGeeks and was developed for practice purposes.