This project classifies SMS messages as Spam or Not Spam using Naive Bayes algorithms.
- SMS Spam Collection dataset
- Stored in `data/spam.csv`
- Encoded in UTF-8
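
Loading the file might look like the sketch below. The schema of `data/spam.csv` is not documented here, so the column names `label` and `text` are assumptions, and `load_sms_dataset` is a hypothetical helper rather than a function from the notebook.

```python
import pandas as pd


def load_sms_dataset(path="data/spam.csv"):
    """Read the SMS dataset as UTF-8 and return a cleaned DataFrame.

    Assumes (hypothetically) two columns named 'label' and 'text'.
    """
    df = pd.read_csv(path, encoding="utf-8")
    # Drop rows with missing values, then drop repeated messages.
    df = df.dropna(subset=["label", "text"]).drop_duplicates(subset=["text"])
    return df.reset_index(drop=True)
```

If the real file uses different column names, only the two `subset` arguments need to change.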
Originally, this project used a standard Kaggle SMS spam dataset. While it performed well on promotional spam, recall was poor on modern scam patterns such as:
- account security alerts
- fake delivery messages
- job and investment scams
- invoice and refund phishing
To address this, a custom dataset was generated containing modern spam and scam patterns.
### Models
- Gaussian Naive Bayes
- Multinomial Naive Bayes (selected for deployment)
- Bernoulli Naive Bayes
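
The three Naive Bayes variants can be compared with scikit-learn roughly as follows. This is a minimal sketch, not the notebook's code: the corpus is made up for illustration, and `CountVectorizer` is an assumed choice of feature extractor.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Tiny made-up corpus for illustration only; not taken from the project's data.
texts = [
    "Win a free prize now, click the link",
    "Your account is locked, verify your password",
    "Claim your refund for invoice 1234 today",
    "Are we still meeting for lunch tomorrow",
    "Can you send me the notes from class",
    "I will call you when I get home",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()  # assumed feature extractor
X = vectorizer.fit_transform(texts)  # sparse word-count matrix

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

train_accuracy = {}
for name, model in models.items():
    # GaussianNB does not accept sparse input, so it needs a dense array.
    features = X.toarray() if name == "GaussianNB" else X
    model.fit(features, labels)
    train_accuracy[name] = model.score(features, labels)
```

Multinomial Naive Bayes works directly on word counts, which is one common reason to prefer it for text. In practice the comparison should use a held-out test split rather than training accuracy.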
### Tech Stack
- Python
- Pandas
- NumPy
- Scikit-learn
- Joblib
spam-detection/
├── spam_sms_detection.ipynb
├── data/
│ └── spam.csv
├── models/
│ ├── mnb.pkl
│ └── vectorizer.pkl
└── requirements.txt
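
The two files under `models/` suggest that the classifier and the vectorizer are saved separately with Joblib. A sketch of saving and reloading such a pair is shown below; the helper names `save_artifacts` and `predict_messages`, the toy training data, and the use of `CountVectorizer` are all assumptions, and a temporary directory stands in for `models/`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def save_artifacts(model, vectorizer, model_dir):
    """Write the classifier and vectorizer to mnb.pkl and vectorizer.pkl."""
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, "mnb.pkl"))
    joblib.dump(vectorizer, os.path.join(model_dir, "vectorizer.pkl"))


def predict_messages(messages, model_dir):
    """Load both artifacts and return one predicted label per message."""
    model = joblib.load(os.path.join(model_dir, "mnb.pkl"))
    vectorizer = joblib.load(os.path.join(model_dir, "vectorizer.pkl"))
    # Use transform (not fit_transform) so the training vocabulary is reused.
    return model.predict(vectorizer.transform(messages)).tolist()


# Toy training data, made up for this example.
texts = ["free prize click now", "verify your account now",
         "lunch at noon", "see you at home"]
labels = ["spam", "spam", "ham", "ham"]
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

model_dir = os.path.join(tempfile.mkdtemp(), "models")
save_artifacts(clf, vec, model_dir)
predictions = predict_messages(["free prize", "lunch at home"], model_dir)
```

The vectorizer has to be saved alongside the model because new messages must be mapped onto the same vocabulary the model was trained on.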
- Install dependencies: `pip install -r requirements.txt`
- Open the notebook: `spam_sms_detection.ipynb`
- Run all cells

This project uses NLTK for tokenization. If you face errors related to tokenizers, run:

    import nltk
    nltk.download('punkt')

### Results
- Recall improved from ~74% to ~99.6%
- Precision remains ~99%
- Model now generalizes better to real-world scam messages
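
For reference, recall is the share of actual spam messages that the model catches, and precision is the share of messages flagged as spam that really are spam. The sketch below shows how both can be computed with scikit-learn. The labels are hypothetical and are not the project's evaluation data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten messages (1 = spam, 0 = not spam).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
```

In a spam filter, low recall means scam messages reach the inbox, while low precision means legitimate messages get blocked.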
### Note on Limitations
This is a text-only model. Certain messages such as neutral security
alerts or charity requests may still be ambiguous without metadata
(sender, headers, links).
## Future Improvements
- Web app deployment
- API support