An end-to-end Machine Learning-based SMS Spam Detection system that classifies messages as Spam or Ham (legitimate) using NLP, TF-IDF, feature engineering, and Logistic Regression.
This project is designed to handle modern spam patterns such as phishing links, KYC scams, job/loan frauds, cashback scams, and Hinglish spam, going beyond traditional keyword-based filters.
SMS spam today includes:
- Phishing & fake KYC alerts
- Job & loan scams
- Cashback & reward frauds
- Mixed-language (Hinglish) spam
Traditional rule-based systems fail to adapt. This project uses Machine Learning + NLP to build an accurate, explainable, and scalable spam detection pipeline.
- Build a robust ML-based SMS spam classifier
- Handle modern & evolving spam patterns
- Apply NLP preprocessing and feature engineering
- Achieve high accuracy with low false positives
- Provide explainable predictions using feature importance
sms-spam-detection/
├── sms_spam_detection.ipynb   # Main Jupyter notebook with complete analysis
├── sms_spam_detection.py      # Python script version
├── spam_new.csv               # Dataset
├── requirements.txt           # Python dependencies
├── .gitignore                 # Git ignore file
├── LICENSE                    # MIT License
└── README.md                  # This file
- UCI SMS Spam Collection: 5,572 real messages
- Synthetic Dataset: ~5,000 modern spam messages
- Total Dataset Size: ~10,500 SMS
Spam categories include:
- Promotional spam
- Phishing & KYC fraud
- Job & loan scams
- Hinglish spam
- Legitimate transactional & conversational SMS
- Lowercasing
- URL replacement (`urltoken`)
- Phone number masking (`phonetoken`)
- Stopword removal (NLTK)
- Lemmatization
- Special character normalization
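
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regular expressions, the `preprocess` function name, and the tiny stopword set (a stand-in for NLTK's English list) are all assumptions, and lemmatization is left out so the snippet runs without downloading NLTK data.

```python
import re

# Small stand-in stopword set (assumption); the project uses NLTK's stopword list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "your", "you",
             "will", "of", "and", "or", "for", "on", "at"}

# Illustrative patterns (assumptions), not the project's exact expressions.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(text: str) -> str:
    """Lowercase, mask URLs and phone numbers, strip symbols, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" urltoken ", text)      # URL replacement
    text = PHONE_RE.sub(" phonetoken ", text)  # phone number masking
    text = NON_ALNUM_RE.sub(" ", text)         # special character normalization
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("Your KYC will expire! Visit http://bit.ly/x1 or call +91 98765 43210"))
```

Masking happens before special characters are stripped, because the URL and phone patterns rely on punctuation such as `://` and `+` to find their matches.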
TF-IDF Word Features
- Unigrams & bigrams
Character-level TF-IDF
- Robust to spelling variations & noisy text
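
One way to combine word-level and character-level TF-IDF in scikit-learn is a `FeatureUnion` of two `TfidfVectorizer` instances. The sketch below assumes that approach; the `char_wb` analyzer, the 3 to 5 character n-gram range, and the sample messages are illustrative choices, not settings confirmed by this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Word unigrams and bigrams, plus character n-grams (range is an assumption).
text_features = FeatureUnion([
    ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])

# Hypothetical, already-preprocessed messages.
messages = [
    "kyc expire today update urltoken",
    "kyyc expiree tod4y updaate urltoken",  # noisy spelling variant
    "see you at lunch tomorrow",
]

X = text_features.fit_transform(messages)
print(X.shape)  # (number of messages, word features + character features)
```

Character n-grams let the misspelled second message share many features with the first, which is what makes this representation tolerant of noisy text.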
Numeric Features
- URL count
- Phone number presence
- Digit count
- Uppercase ratio
- Special characters
- Message length stats
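
The numeric features listed above could be computed along these lines. The function name, feature names, and regular expressions are hypothetical; the features are computed on the raw message, before lowercasing, so that the uppercase ratio is still meaningful.

```python
import re

# Illustrative patterns (assumptions), not the project's exact expressions.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")


def numeric_features(message: str) -> dict:
    """Compute simple hand-crafted features from a raw (unprocessed) SMS."""
    letters = [ch for ch in message if ch.isalpha()]
    uppercase = sum(ch.isupper() for ch in letters)
    return {
        "url_count": len(URL_RE.findall(message)),
        "has_phone": int(bool(PHONE_RE.search(message))),
        "digit_count": sum(ch.isdigit() for ch in message),
        "uppercase_ratio": uppercase / len(letters) if letters else 0.0,
        "special_char_count": sum(
            not ch.isalnum() and not ch.isspace() for ch in message
        ),
        "char_length": len(message),
        "word_count": len(message.split()),
    }


print(numeric_features("WIN 500 now! http://x.co"))
```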
Keyword-based Features
- Binary indicators for terms like: `reward`, `cashback`, `verify`, `kyc`, `loan`, `job`, `winner`, `selected`, `urgent`, `account`, `blocked`
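
A minimal sketch of the binary keyword indicators, assuming whole-word matching on the lowercased message. The `keyword_features` function and the `has_<term>` feature names are hypothetical; only the keyword list comes from this README.

```python
import re

KEYWORDS = ["reward", "cashback", "verify", "kyc", "loan", "job",
            "winner", "selected", "urgent", "account", "blocked"]


def keyword_features(message: str) -> dict:
    """Return a 0/1 indicator per keyword, matched as a whole word."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return {f"has_{kw}": int(kw in tokens) for kw in KEYWORDS}


feats = keyword_features("URGENT: your account is blocked, verify KYC now")
print({name: value for name, value in feats.items() if value})
```

Whole-word matching avoids false hits such as `job` inside `jobless`, at the cost of missing inflected forms; matching on lemmatized text would be one way to narrow that gap.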
- Algorithm: Logistic Regression
- Class Weight: balanced
- Train/Test Split: 80% / 20% (stratified)
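
The configuration above maps onto scikit-learn roughly as shown below. This is a sketch on a tiny made-up dataset, using a plain word-level TF-IDF in place of the project's full feature set; `random_state`, `max_iter`, and the sample messages are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = ham.
spam = [
    "verify your kyc now or account blocked",
    "you are selected for a job pay fee",
    "claim your cashback reward today",
    "urgent loan approved click link",
    "winner claim reward now",
]
ham = [
    "see you at lunch",
    "call me when you are free",
    "meeting moved to monday",
    "happy birthday dear",
    "your otp for login is 4821",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# 80% / 20% split, stratified so both sets keep the spam/ham ratio.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
print(model.predict(X_test))
```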
- Accuracy
- Precision, Recall, F1-Score
- Confusion Matrix
- ROC Curve (AUC)
- Precision-Recall Curve
- Feature Importance Analysis
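
For reference, the spam-class precision, recall, and F1-score all follow from the confusion matrix counts. scikit-learn provides these in `sklearn.metrics`; the plain-Python sketch below only illustrates the definitions, using made-up labels and hypothetical function names.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def spam_metrics(y_true, y_pred):
    """Compute accuracy plus precision, recall, and F1 for the spam class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(confusion_counts(y_true, y_pred))
print(spam_metrics(y_true, y_pred))
```

A false positive here is a legitimate message flagged as spam, which is why spam-class precision is the metric to watch when the goal is low false positives.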
| Metric | Value |
|---|---|
| Accuracy | ≈ 99% |
| Precision (Spam) | ≈ 0.99 |
| Recall (Spam) | ≈ 0.99 |
| F1-Score (Spam) | ≈ 0.99 |
| ROC-AUC | ~1.0 |
Threshold Optimization:
Using a spam-probability threshold of 0.35, instead of the default 0.5, improved detection of soft spam while keeping false positives low.
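
Applying a custom threshold means comparing the predicted spam probability against 0.35 rather than relying on the classifier's default decision. With a fitted scikit-learn model, that probability would typically come from `predict_proba`. The sketch below assumes a message is labelled spam when its probability is at or above the threshold; the `classify` helper and the sample probabilities are hypothetical.

```python
def classify(spam_probability: float, threshold: float = 0.35) -> str:
    """Label a message from its predicted spam probability."""
    return "Spam" if spam_probability >= threshold else "Ham"


# Hypothetical spam probabilities for five messages.
probabilities = [0.10, 0.34, 0.35, 0.42, 0.90]

for p in probabilities:
    print(f"{p:.2f} -> threshold 0.35: {classify(p)}, "
          f"threshold 0.50: {classify(p, threshold=0.5)}")
```

Messages scoring between 0.35 and 0.5 are the ones whose label changes, so that band is where any extra false positives would appear.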
- Python 3.8+
- Pandas, NumPy
- NLTK
- Scikit-learn
- Matplotlib
- Joblib
- Jupyter Notebook / Google Colab
- Python 3.8 or higher
- pip package manager
- Clone the repository

      git clone https://github.com/UtkarshSrivastava1139/sms-spam-detection.git
      cd sms-spam-detection

- Install dependencies

      pip install -r requirements.txt

- Download NLTK data (first time only)

      import nltk
      nltk.download('stopwords')
      nltk.download('wordnet')
      nltk.download('punkt')

- Run the notebook

      jupyter notebook sms_spam_detection.ipynb

- Classify a single message from Python

      from predict import predict_single_sms

      message = "Your KYC will expire today. Update now!"
      label, probability = predict_single_sms(message)
      print(f"Prediction: {label}, Confidence: {probability:.2%}")

- Keyword feature importance derived from Logistic Regression coefficients
- Clearly highlights which terms strongly influence spam classification
- Improves trust and interpretability of the model
This project demonstrates a complete ML pipeline for SMS spam detection, capable of handling both traditional and modern scam messages with high accuracy and interpretability.
It is suitable for:
- Academic demonstrations
- Industry-level spam filtering
- Extension into real-time systems
- Transformer models (BERT, DistilBERT)
- Real-time API / mobile app deployment
- Multilingual spam detection
- Online learning for evolving spam patterns
- UCI SMS Spam Collection Dataset (Kaggle)
- Scikit-learn Documentation
- NLTK Documentation
- Speech and Language Processing, Jurafsky & Martin
This project is licensed under the MIT License - see the LICENSE file for details.
- Utkarsh Srivastava - GitHub
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
Give a ⭐ if this project helped you!