📩 SMS Spam Detection using Machine Learning

An end-to-end Machine Learning-based SMS Spam Detection system that classifies messages as Spam or Ham (legitimate) using NLP, TF-IDF, feature engineering, and Logistic Regression.

This project is designed to handle modern spam patterns such as phishing links, KYC scams, job/loan frauds, cashback scams, and Hinglish spam, going beyond traditional keyword-based filters.

Developed by: Utkarsh Srivastava

🚀 Motivation

SMS spam today includes:

🎣 Phishing & fake KYC alerts
💼 Job & loan scams
💰 Cashback & reward frauds
🌐 Mixed-language (Hinglish) spam

Traditional rule-based systems fail to adapt. This project uses Machine Learning + NLP to build an accurate, explainable, and scalable spam detection pipeline.

🎯 Objectives

Build a robust ML-based SMS spam classifier
Handle modern & evolving spam patterns
Apply NLP preprocessing and feature engineering
Achieve high accuracy with low false positives
Provide explainable predictions using feature importance

📁 Project Structure

sms-spam-detection/
├── sms_spam_detection.ipynb    # Main Jupyter notebook with complete analysis
├── sms_spam_detection.py       # Python script version
├── spam_new.csv                # Dataset
├── requirements.txt            # Python dependencies
├── .gitignore                  # Git ignore file
├── LICENSE                     # MIT License
└── README.md                   # This file

🧠 Methodology

1️⃣ Dataset Preparation

UCI SMS Spam Collection: 5,572 real messages
Synthetic Dataset: ~5,000 modern spam messages
Total Dataset Size: ~10,500 SMS

Spam categories include:

Promotional spam
Phishing & KYC fraud
Job & loan scams
Hinglish spam
Legitimate transactional & conversational SMS

2️⃣ Text Preprocessing

Lowercasing
URL replacement (urltoken)
Phone number masking (phonetoken)
Stopword removal (NLTK)
Lemmatization
Special character normalization
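
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regular expressions and the tiny `STOPWORDS` set are assumptions standing in for the NLTK stopword list, and lemmatization is left out so the snippet runs without downloading any NLTK data.

```python
import re

# Illustrative stand-in for the NLTK English stopword list (assumption).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "your", "will", "on", "at"}

# Hypothetical patterns; the project's own expressions may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(text: str) -> str:
    """Lowercase, mask URLs and phone numbers, strip symbols, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" urltoken ", text)      # URL replacement
    text = PHONE_RE.sub(" phonetoken ", text)  # phone number masking
    text = NON_ALNUM_RE.sub(" ", text)         # special character normalization
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("Your KYC will expire! Visit http://bit.ly/x1 or call +91 98765 43210"))
# kyc expire visit urltoken or call phonetoken
```

Replacing URLs and phone numbers with fixed tokens lets the classifier learn "this message contains a link" without memorizing individual addresses.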

3️⃣ Feature Engineering

🔹 TF-IDF Word Features

Unigrams & bigrams

🔹 Character-level TF-IDF

Robust to spelling variations & noisy text
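
One way to combine word-level and character-level TF-IDF in scikit-learn is a `FeatureUnion`. The sketch below is only an illustration: the sample messages, the `char_wb` analyzer, and the character n-gram range of 2 to 4 are assumptions, not settings taken from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy messages for illustration only.
texts = [
    "your kyc will expire today update now",
    "claim your cashback reward now",
    "are we still meeting for lunch",
    "call me when you reach home",
]

# Word-level TF-IDF over unigrams and bigrams.
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))

# Character-level TF-IDF; n-grams inside word boundaries tolerate
# misspellings such as "cashbak" or "rewrd" (range is an assumption).
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

features = FeatureUnion([("word", word_tfidf), ("char", char_tfidf)])
X = features.fit_transform(texts)  # sparse matrix: one row per message

print(X.shape[0])  # 4
```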

🔹 Numeric Features

URL count
Phone number presence
Digit count
Uppercase ratio
Special characters
Message length stats
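
The numeric features listed above could be computed along these lines. The function name `numeric_features`, the regular expressions, and the exact feature definitions (for example, uppercase ratio taken over letters only) are assumptions made for this sketch.

```python
import re

# Hypothetical patterns; the project's own expressions may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")


def numeric_features(message: str) -> dict:
    """Compute simple hand-crafted numeric features from the raw message."""
    letters = [c for c in message if c.isalpha()]
    upper = sum(c.isupper() for c in letters)
    return {
        "url_count": len(URL_RE.findall(message)),
        "has_phone": int(bool(PHONE_RE.search(message))),
        "digit_count": sum(c.isdigit() for c in message),
        "uppercase_ratio": upper / len(letters) if letters else 0.0,
        "special_char_count": sum(
            not c.isalnum() and not c.isspace() for c in message
        ),
        "char_length": len(message),
        "word_count": len(message.split()),
    }


feats = numeric_features("WIN Rs 500 now! Visit http://x.co")
print(feats["url_count"], feats["digit_count"], feats["uppercase_ratio"])
# 1 3 0.25
```

These features must be computed on the raw message, before lowercasing, otherwise the uppercase ratio is always zero.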

🔹 Keyword-based Features

Binary indicators for terms like: reward, cashback, verify, kyc, loan, job, winner, selected, urgent, account, blocked
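
A minimal sketch of such binary indicators, using the terms listed above. The `kw_` prefix, the function name, and the choice of whole-token matching (so that "job" does not fire on "jobless") are assumptions for illustration.

```python
import re

KEYWORDS = [
    "reward", "cashback", "verify", "kyc", "loan", "job",
    "winner", "selected", "urgent", "account", "blocked",
]


def keyword_features(message: str) -> dict:
    """Return a 0/1 indicator per keyword, matched on whole lowercase tokens."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return {f"kw_{kw}": int(kw in tokens) for kw in KEYWORDS}


flags = keyword_features("URGENT: your account is blocked, verify KYC now")
print(sum(flags.values()))  # 5
```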

4️⃣ Model Training

Algorithm: Logistic Regression
Class Weight: balanced
Train/Test Split: 80% / 20% (stratified)
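
The training setup above can be sketched with scikit-learn as follows. The toy messages, `random_state=42`, and `max_iter=1000` are assumptions for illustration; the actual project trains on `spam_new.csv` with the full feature set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data for illustration only.
spam = [
    "Your KYC will expire today, update now",
    "Congratulations, you are selected as winner",
    "Instant loan approved, click the link",
    "Claim your cashback reward now",
    "Urgent: your account is blocked, verify now",
    "Work from home job, earn 5000 daily",
    "You won a free prize, call now",
    "Verify your bank account to avoid suspension",
    "Limited offer, get reward points today",
    "Final notice: KYC pending, account will be blocked",
]
ham = [
    "Are we still meeting for lunch today",
    "Please call me when you reach home",
    "The report is attached, thanks",
    "Happy birthday, see you tonight",
    "Can you pick up milk on the way back",
    "Running late, start without me",
    "Thanks for the notes from class",
    "Let me know when the train arrives",
    "Dinner at eight works for me",
    "I will send the slides tomorrow morning",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)  # 1 = spam, 0 = ham

# Stratified 80/20 split keeps the spam/ham ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Probability of the spam class (column 1) for a new message.
spam_prob = model.predict_proba(["Update your KYC now"])[0, 1]
```

`class_weight="balanced"` reweights each class inversely to its frequency, which matters when spam is a minority of the training data.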

5️⃣ Evaluation Metrics

Accuracy
Precision, Recall, F1-Score
Confusion Matrix
ROC Curve (AUC)
Precision-Recall Curve
Feature Importance Analysis
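
Most of these metrics are available in `sklearn.metrics`. The sketch below uses made-up labels and probabilities purely to show the calls; none of the numbers are results from this project.

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Made-up ground truth and predicted spam probabilities (illustration only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.20, 0.40, 0.60, 0.30, 0.70, 0.85, 0.95]
y_pred = [int(p >= 0.5) for p in y_prob]  # default 0.5 cut-off

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
auc = roc_auc_score(y_true, y_prob)    # uses probabilities, not hard labels
ap = average_precision_score(y_true, y_prob)  # summary of the PR curve

print(accuracy, auc)  # 0.75 0.875
print(cm.tolist())    # [[3, 1], [1, 3]]
```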

📊 Results

Metric	Value
Accuracy	≈ 99%
Precision (Spam)	≈ 0.99
Recall (Spam)	≈ 0.99
F1-Score (Spam)	≈ 0.99
ROC-AUC	≈ 1.0

🔧 Threshold Optimization:
A probability threshold of 0.35, lower than scikit-learn's default of 0.5, improved detection of soft spam while keeping false positives low.
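
To illustrate how a lower threshold changes predictions, here is a small sketch that counts outcomes at two cut-offs. The probabilities are invented for the example, and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP/FP/FN/TN when predicting spam for probability >= threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, prob in zip(y_true, y_prob):
        predicted_spam = prob >= threshold
        if predicted_spam and truth == 1:
            counts["tp"] += 1
        elif predicted_spam and truth == 0:
            counts["fp"] += 1
        elif truth == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Invented labels and spam probabilities (illustration only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.60, 0.40, 0.70, 0.85, 0.95]

print(confusion_counts(y_true, y_prob, 0.50))
# {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
print(confusion_counts(y_true, y_prob, 0.35))
# {'tp': 4, 'fp': 1, 'fn': 0, 'tn': 3}
```

In this toy case the borderline spam message scored 0.40 is caught at 0.35 without adding a false positive; on real data the trade-off should be checked on a validation set.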

🛠️ Technologies Used

Python 3.8+
Pandas, NumPy
NLTK
Scikit-learn
Matplotlib
Joblib
Jupyter Notebook / Google Colab

⚙️ Installation

Prerequisites

Python 3.8 or higher
pip package manager

Steps

Clone the repository

git clone https://github.com/UtkarshSrivastava1139/sms-spam-detection.git
cd sms-spam-detection

Install dependencies

pip install -r requirements.txt

Download NLTK data (first time only)

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

🚀 Usage

Option 1: Jupyter Notebook

jupyter notebook sms_spam_detection.ipynb

Option 2: Python Script

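# Note: no predict.py is listed in the project structure above;
# adjust this import to the module that defines predict_single_sms.
from predict import predict_single_sms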

message = "Your KYC will expire today. Update now!"
label, probability = predict_single_sms(message)

print(f"Prediction: {label}, Confidence: {probability:.2%}")

🔍 Explainability

Keyword feature importance derived from Logistic Regression coefficients
Clearly highlights which terms strongly influence spam classification
Improves trust and interpretability of the model
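
A rough sketch of how keyword importance can be read from Logistic Regression coefficients. The toy messages and the use of `CountVectorizer` with a fixed vocabulary as the keyword indicator are assumptions for illustration, not the project's actual implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

KEYWORDS = [
    "reward", "cashback", "verify", "kyc", "loan", "job",
    "winner", "selected", "urgent", "account", "blocked",
]

# Toy data for illustration only.
spam = [
    "Your KYC is pending, verify your account now",
    "Urgent: account blocked, verify KYC today",
    "You are selected as winner, claim reward",
    "Instant loan approved, claim cashback reward",
]
ham = [
    "Are we still meeting for lunch today",
    "Please call me when you reach home",
    "The report is attached, thanks",
    "Happy birthday, see you tonight",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)  # 1 = spam, 0 = ham

# Binary keyword indicators; column i corresponds to KEYWORDS[i].
vectorizer = CountVectorizer(vocabulary=KEYWORDS, binary=True)
X = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# A positive coefficient pushes the prediction towards spam.
weights = dict(zip(KEYWORDS, clf.coef_[0]))
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
for term, weight in ranked[:5]:
    print(f"{term:10s} {weight:+.3f}")
```

Because the model is linear, each coefficient can be read directly as that feature's contribution to the spam log-odds.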

📌 Conclusion

This project demonstrates a complete ML pipeline for SMS spam detection, capable of handling both traditional and modern scam messages with high accuracy and interpretability.

It is suitable for:

Academic demonstrations
Industry-level spam filtering
Extension into real-time systems

🔮 Future Scope

🤖 Transformer models (BERT, DistilBERT)
🌐 Real-time API / mobile app deployment
🌍 Multilingual spam detection
📈 Online learning for evolving spam patterns

📚 References

UCI SMS Spam Collection Dataset (Kaggle)
Scikit-learn Documentation
NLTK Documentation
Speech and Language Processing, Jurafsky & Martin

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Contributors

Utkarsh Srivastava - GitHub

🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

⭐ Show your support

Give a ⭐ if this project helped you!
