UtkarshSrivastava1139/sms-spam-detection


SMS Spam Detection using Machine Learning


An end-to-end machine learning system for SMS spam detection. It classifies messages as Spam or Ham (legitimate) using NLP preprocessing, TF-IDF, feature engineering, and Logistic Regression.

This project is designed to handle modern spam patterns such as phishing links, KYC scams, job/loan frauds, cashback scams, and Hinglish spam, going beyond traditional keyword-based filters.

Developed by: Utkarsh Srivastava

Motivation

SMS spam today includes:

  • Phishing & fake KYC alerts
  • Job & loan scams
  • Cashback & reward frauds
  • Mixed-language (Hinglish) spam

Traditional rule-based systems fail to adapt. This project uses Machine Learning + NLP to build an accurate, explainable, and scalable spam detection pipeline.


Objectives

  • Build a robust ML-based SMS spam classifier
  • Handle modern & evolving spam patterns
  • Apply NLP preprocessing and feature engineering
  • Achieve high accuracy with low false positives
  • Provide explainable predictions using feature importance

Project Structure

sms-spam-detection/
├── sms_spam_detection.ipynb    # Main Jupyter notebook with complete analysis
├── sms_spam_detection.py       # Python script version
├── spam_new.csv                # Dataset
├── requirements.txt            # Python dependencies
├── .gitignore                  # Git ignore file
├── LICENSE                     # MIT License
└── README.md                   # This file

Methodology

1. Dataset Preparation

  • UCI SMS Spam Collection: 5,572 real messages
  • Synthetic Dataset: ~5,000 modern spam messages
  • Total Dataset Size: ~10,500 SMS

Spam categories include:

  • Promotional spam
  • Phishing & KYC fraud
  • Job & loan scams
  • Hinglish spam
  • Legitimate transactional & conversational SMS

2. Text Preprocessing

  • Lowercasing
  • URL replacement (urltoken)
  • Phone number masking (phonetoken)
  • Stopword removal (NLTK)
  • Lemmatization
  • Special character normalization
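
The preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual code: the regular expressions, the `preprocess` function name, and the tiny `STOPWORDS` set (a stand-in for NLTK's English stopword list) are all assumptions. Lemmatization is left out because it needs NLTK's WordNet data.

```python
import re

# Tiny stand-in stopword list; the project uses NLTK's English stopwords instead.
STOPWORDS = {"a", "an", "the", "is", "to", "your", "will", "now", "and", "or", "of"}

# Assumed patterns: the README does not state the exact expressions used.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d{10,13}\b")  # optional "+" followed by 10 to 13 digits


def preprocess(text: str) -> str:
    """Lowercase, mask URLs and phone numbers, strip symbols, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" urltoken ", text)        # URL replacement
    text = PHONE_RE.sub(" phonetoken ", text)    # phone number masking
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # special character normalization
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess(
    "Your KYC will expire today. Update now at http://bit.ly/kyc or call +919876543210!"
)
print(cleaned)
```

Masking URLs and phone numbers before stripping punctuation matters here: otherwise the symbol cleanup would break a link into meaningless fragments before it could be replaced by a single token.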

3. Feature Engineering

TF-IDF Word Features

  • Unigrams & bigrams

Character-level TF-IDF

  • Robust to spelling variations & noisy text
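
To see why character-level features tolerate misspellings, the sketch below compares two spellings of the same word by their character trigrams. It only illustrates the idea and is not the project's vectorizer; the helper names `char_ngrams` and `jaccard` and the n-gram size of 3 are assumptions.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of n-grams that the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


clean = "cashback"
noisy = "cashbaack"  # a misspelling that a word-level vocabulary treats as a new token

word_match = clean == noisy
char_overlap = jaccard(char_ngrams(clean), char_ngrams(noisy))
print(word_match, char_overlap)
```

At the word level the two strings share nothing, while most of their trigrams coincide, so a character-level representation still places them close together.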

Numeric Features

  • URL count
  • Phone number presence
  • Digit count
  • Uppercase ratio
  • Special characters
  • Message length stats

Keyword-based Features

  • Binary indicators for terms like: reward, cashback, verify, kyc, loan, job, winner, selected, urgent, account, blocked
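
The numeric and keyword features listed above could be computed along these lines. This is a sketch under assumptions: the feature names, the regular expressions, and the `handcrafted_features` function are hypothetical, and keyword matching is done by plain substring search (so "job" also matches "jobs"), which may differ from the project's implementation.

```python
import re

# Keyword list taken from the README; everything else here is illustrative.
KEYWORDS = ["reward", "cashback", "verify", "kyc", "loan", "job",
            "winner", "selected", "urgent", "account", "blocked"]

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d{10,13}\b")  # assumed phone number pattern


def handcrafted_features(message: str) -> dict:
    """Compute simple numeric features and binary keyword flags for one SMS."""
    letters = [c for c in message if c.isalpha()]
    lowered = message.lower()
    features = {
        "url_count": len(URL_RE.findall(message)),
        "has_phone": int(bool(PHONE_RE.search(message))),
        "digit_count": sum(c.isdigit() for c in message),
        "upper_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        "special_count": sum(not c.isalnum() and not c.isspace() for c in message),
        "char_length": len(message),
        "word_count": len(message.split()),
    }
    for kw in KEYWORDS:
        features[f"kw_{kw}"] = int(kw in lowered)  # 1 if the keyword occurs, else 0
    return features


feats = handcrafted_features("URGENT: verify your KYC at http://bit.ly/kyc")
print(feats)
```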

4. Model Training

  • Algorithm: Logistic Regression
  • Class Weight: balanced
  • Train/Test Split: 80% / 20% (stratified)
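
As background on the `balanced` class weight setting: scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the rarer class counts for more during training. The sketch below reproduces that formula with the standard library. The label distribution is made up for illustration and is not the project's data, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Made-up label distribution for illustration only (not the project's data).
labels = ["ham"] * 80 + ["spam"] * 20
weights = balanced_class_weights(labels)
print(weights)
```

With this toy split, each spam example carries four times the weight of a ham example, which offsets the 4:1 imbalance.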

5. Evaluation Metrics

  • Accuracy
  • Precision, Recall, F1-Score
  • Confusion Matrix
  • ROC Curve (AUC)
  • Precision-Recall Curve
  • Feature Importance Analysis
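
For reference, accuracy, precision, recall, and F1 all follow from the four confusion matrix counts. The sketch below derives them for the spam class from toy predictions; the labels are invented for illustration, and `binary_metrics` is a hypothetical helper rather than part of the project.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Derive confusion matrix counts and the usual metrics for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only; these are not the project's results.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```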

Results

Metric              Value
Accuracy            ≈ 99%
Precision (Spam)    ≈ 0.99
Recall (Spam)       ≈ 0.99
F1-Score (Spam)     ≈ 0.99
ROC-AUC             ≈ 1.0

Threshold Optimization:
Lowering the spam probability threshold from the usual 0.5 default to 0.35 improved detection of soft spam while keeping false positives low.
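
Applying a custom threshold amounts to comparing the predicted spam probability against 0.35 instead of 0.5. A minimal sketch, assuming the comparison is inclusive (`>=`); the probabilities are hypothetical and `classify` is an illustrative helper name.

```python
def classify(spam_probability: float, threshold: float = 0.35) -> str:
    """Label a message as spam when its probability reaches the threshold."""
    return "spam" if spam_probability >= threshold else "ham"


# Hypothetical model outputs, for illustration only.
probabilities = [0.10, 0.40, 0.90]

at_default = [classify(p, threshold=0.5) for p in probabilities]
at_tuned = [classify(p) for p in probabilities]
print(at_default)
print(at_tuned)
```

The borderline message at 0.40 is the kind of soft spam that the lower threshold catches and the 0.5 cut-off lets through.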


Technologies Used

  • Python 3.8+
  • Pandas, NumPy
  • NLTK
  • Scikit-learn
  • Matplotlib
  • Joblib
  • Jupyter Notebook / Google Colab

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Steps

  1. Clone the repository
git clone https://github.com/UtkarshSrivastava1139/sms-spam-detection.git
cd sms-spam-detection
  2. Install dependencies
pip install -r requirements.txt
  3. Download NLTK data (first time only)
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

Usage

Option 1: Jupyter Notebook

jupyter notebook sms_spam_detection.ipynb

Option 2: Python Script

from predict import predict_single_sms

message = "Your KYC will expire today. Update now!"
label, probability = predict_single_sms(message)

print(f"Prediction: {label}, Confidence: {probability:.2%}")

Explainability

  • Keyword feature importance derived from Logistic Regression coefficients
  • Clearly highlights which terms strongly influence spam classification
  • Improves trust and interpretability of the model
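
Coefficient-based importance can be read by pairing each feature name with its Logistic Regression coefficient and sorting. In the sketch below, a larger positive coefficient pushes a message toward the spam class, assuming spam is the positive label. The feature names and coefficient values are invented for illustration and are not taken from the trained model; `top_features` is a hypothetical helper.

```python
def top_features(names, coefficients, k=3):
    """Return the k features with the largest positive coefficients."""
    ranked = sorted(zip(names, coefficients), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Invented values for illustration only; not the project's learned coefficients.
names = ["kw_kyc", "kw_urgent", "word_count", "kw_reward"]
coefficients = [2.1, 1.4, -0.3, 0.9]

strongest = top_features(names, coefficients, k=2)
print(strongest)
```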

Conclusion

This project demonstrates a complete ML pipeline for SMS spam detection, capable of handling both traditional and modern scam messages with high accuracy and interpretability.

It is suitable for:

  • Academic demonstrations
  • Industry-level spam filtering
  • Extension into real-time systems

Future Scope

  • Transformer models (BERT, DistilBERT)
  • Real-time API / mobile app deployment
  • Multilingual spam detection
  • Online learning for evolving spam patterns

License

This project is licensed under the MIT License - see the LICENSE file for details.


Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.


Show your support

Give a ⭐ if this project helped you!
