
Commit fd31f17

Added 2 new classifiers and more tuning scripts
1 parent 1dc3273 commit fd31f17

26 files changed: +686 -47 lines

README.md

Lines changed: 111 additions & 13 deletions
@@ -10,10 +10,14 @@
 [![GitHub contributors](https://img.shields.io/github/contributors/adamspd/spam-detection-project)](https://github.com/adamspd/spam-detection-project/graphs/contributors)

 Spam-Detector-AI is a Python package for detecting and filtering spam messages using Machine Learning models. The
-package integrates with Django or any other project that uses python and offers three different classifiers: Naive
-Bayes, Random Forest, and Support Vector Machine (SVM).
+package integrates with Django or any other project that uses Python and offers different types of classifiers: Naive
+Bayes, Random Forest, and Support Vector Machine (SVM). Since version 2.1.0, two new classifiers have been added:
+Logistic Regression and XGBClassifier.

-⚠️ **Warning**: No significant breaking changes were added to the version 2.x.x in terms of usage. ⚠️
+⚠️ _**Warning**: No significant breaking changes were introduced in version 2.x.x in terms of usage. However, the
+fine-tuning of the models has been moved to a separate module (`tuning`) and the tests have been moved to a
+separate module (`tests`)._ ⚠️

 ## Table of Contents

@@ -42,6 +46,7 @@ Make sure you have the following dependencies installed:
 - pandas
 - numpy
 - joblib
+- xgboost

 Additionally, you'll need to download the NLTK data; to do so, use the Python interpreter to run the following
 commands:
@@ -76,9 +81,11 @@ If this happens, use an IDE to run the `trainer.py` file until a fix is implemented.

 This will train all the models and save them as `.joblib` files in the models directory. There are now five models:

-- `naive_bayes.pkl`
-- `random_forest.pkl`
-- `svm.pkl`
+- `naive_bayes_model.joblib`
+- `random_forest_model.joblib`
+- `svm_model.joblib`
+- `logistic_regression_model.joblib`
+- `xgb_model.joblib`

 ### Tests

@@ -166,21 +173,111 @@ The test results are shown below:

 ##### Accuracy: 0.9773572152754308

-The models that performed the best are the Random Forest and the SVM. The SVM model has a slightly better accuracy than
-the Random Forest model. Knowing that all the models were not perfect, I decided to use a voting classifier to combine
-the predictions of the 3 models. The voting classifier will use the majority vote to make the final prediction.
+<br>
+
+#### _Model: LOGISTIC_REGRESSION_
+
+##### Confusion Matrix:
+
+|                  | Predicted: Ham       | Predicted: Spam     |
+|------------------|----------------------|---------------------|
+| **Actual: Ham**  | 2065 (True Negative) | 48 (False Positive) |
+| **Actual: Spam** | 46 (False Negative)  | 989 (True Positive) |
+
+- True Negative (TN): 2065 messages were correctly identified as ham (non-spam).
+- False Positive (FP): 48 ham messages were incorrectly identified as spam.
+- False Negative (FN): 46 spam messages were incorrectly identified as ham.
+- True Positive (TP): 989 messages were correctly identified as spam.
+
+##### Performance Metrics:
+
+|              | Precision | Recall | F1-Score | Support |
+|--------------|-----------|--------|----------|---------|
+| Ham          | 0.98      | 0.98   | 0.98     | 2113    |
+| Spam         | 0.95      | 0.96   | 0.95     | 1035    |
+| **Accuracy** |           |        | **0.97** | 3148    |
+| Macro Avg    | 0.97      | 0.97   | 0.97     | 3148    |
+| Weighted Avg | 0.97      | 0.97   | 0.97     | 3148    |
+
+##### Accuracy: 0.9707680491551459
+
+<br>
+
+#### _Model: XGB_
+
+##### Confusion Matrix:
+
+|                  | Predicted: Ham       | Predicted: Spam      |
+|------------------|----------------------|----------------------|
+| **Actual: Ham**  | 2050 (True Negative) | 63 (False Positive)  |
+| **Actual: Spam** | 28 (False Negative)  | 1007 (True Positive) |
+
+- True Negative (TN): 2050 messages were correctly identified as ham (non-spam).
+- False Positive (FP): 63 ham messages were incorrectly identified as spam.
+- False Negative (FN): 28 spam messages were incorrectly identified as ham.
+- True Positive (TP): 1007 messages were correctly identified as spam.
+
+##### Performance Metrics:
+
+|              | Precision | Recall | F1-Score | Support |
+|--------------|-----------|--------|----------|---------|
+| Ham          | 0.99      | 0.97   | 0.98     | 2113    |
+| Spam         | 0.94      | 0.97   | 0.96     | 1035    |
+| **Accuracy** |           |        | **0.97** | 3148    |
+| Macro Avg    | 0.96      | 0.97   | 0.97     | 3148    |
+| Weighted Avg | 0.97      | 0.97   | 0.97     | 3148    |
+
+##### Accuracy: 0.9710927573062261
+
+The models that performed best are the SVM and Logistic Regression, with the SVM model achieving slightly higher
+accuracy than Logistic Regression. Given that no single model achieved perfect accuracy, I have decided to implement
+a voting classifier. This classifier combines the predictions of the five models (Naive Bayes, Random Forest, SVM,
+Logistic Regression, and XGB) using a majority-vote system to make the final prediction. This approach aims to
+leverage the strengths of each model to improve overall prediction accuracy.
+
+##### Weighted Voting System
+
+To enhance the decision-making process, I've refined the approach to a weighted voting system. This system assigns
+a different weight to each model's vote based on its accuracy: the weights are proportional to each model's accuracy
+relative to the sum of the accuracies of all models, so models with higher accuracy have a greater influence on the
+final decision.
+
+The models and their respective proportional weights are as follows:
+
+- Naive Bayes: Weight = 0.1822
+- Random Forest: Weight = 0.2047
+- SVM (Support Vector Machine): Weight = 0.2052
+- Logistic Regression: Weight = 0.2039
+- XGBoost (XGB): Weight = 0.2039
+
+The final decision on whether a message is spam is determined by the weighted spam score. Each model casts a vote
+(spam or not spam), and that vote is multiplied by the model's weight. The weighted spam scores from all models are
+then summed; if the total exceeds 50% of the total possible weight, the message is classified as spam. Otherwise,
+it's classified as not spam (ham).
+
+This approach ensures that the more accurate models have a larger say in the final decision, thereby increasing the
+reliability of spam detection. It combines the strengths of each model, compensates for individual weaknesses, and
+provides a more nuanced classification.
+
+##### System Output
+
+The system provides a detailed output for each message, showing the vote (spam or ham) from each model along with its
+weight. It also displays the total weighted spam score and the final classification decision (Spam or Not Spam). This
+transparency in the voting process makes it easier to understand and debug the models' performance on different
+messages.
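
To make the weighted rule above concrete, here is a minimal sketch in Python. The weights are the ones listed in the
README; `weighted_vote` and the `votes` mapping are illustrative names, not the package's actual API.

```python
# Sketch of the weighted voting rule described above (illustrative, not the package API).
WEIGHTS = {
    "naive_bayes": 0.1822,
    "random_forest": 0.2047,
    "svm": 0.2052,
    "logistic_regression": 0.2039,
    "xgb": 0.2039,
}

def weighted_vote(votes):
    """votes maps model name -> True (spam) / False (ham)."""
    total_weight = sum(WEIGHTS.values())
    spam_score = sum(WEIGHTS[name] for name, is_spam in votes.items() if is_spam)
    # Spam iff the weighted spam score exceeds 50% of the total possible weight.
    return spam_score > 0.5 * total_weight

# Example: three of the five models vote spam.
votes = {"naive_bayes": False, "random_forest": False,
         "svm": True, "logistic_regression": True, "xgb": True}
print(weighted_vote(votes))  # True: 0.6130 > 0.49995
```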

 If you have trained the models on new data, you can test them by running the following command:

 ```sh
-python test_and_tuning/test.py
+python tests/test.py
 ```

-:warning: **Warning**: A module not found error may occur :warning:
+⚠️ **Warning**: A module not found error may occur ⚠️

 If this happens, use an IDE to run the `test.py` file until a fix is implemented.

-
 ### Making Predictions

 To use the spam detector in your Django project:
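
For context, a minimal usage sketch follows; the module path and `VotingSpamDetector` class name are assumptions
inferred from the `prediction/` package described below, not confirmed by this diff.

```python
# Hypothetical usage sketch: import path and class name are assumptions,
# only the printed line is taken from the README diff itself.
from spam_detector_ai.prediction.predict import VotingSpamDetector

detector = VotingSpamDetector()
is_spam = detector.is_spam("Congratulations, you won a prize! Click here!")
print(f"Is spam: {is_spam}")
```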
@@ -208,7 +305,8 @@ print(f"Is spam: {is_spam}")
 - `loading_and_processing/`: Contains utility functions for loading and preprocessing data.
 - `models/`: Contains the trained models and their vectorizers.
 - `prediction/`: Contains the main spam detector class.
-- `test_and_tuning/`: Contains scripts for testing and tuning the classifiers.
+- `tests/`: Contains scripts for testing the classifiers.
+- `tuning/`: Contains scripts for tuning the classifiers.
 - `training/`: Contains scripts for training the classifiers.

 ## Contributing

requirements.txt

Lines changed: 3 additions & 2 deletions

@@ -1,9 +1,10 @@
 scikit-learn==1.4.0
 imbalanced-learn==0.11.0
-pandas==2.1.4
+pandas==2.2.0
 nltk~=3.8.1
 setuptools==69.0.3
 pytest==7.4.4
 requests~=2.31.0
 imblearn~=0.0
-joblib~=1.3.2
+joblib~=1.3.2
+xgboost~=2.0.3

spam_detector_ai/classifiers/classifier_types.py

Lines changed: 2 additions & 0 deletions

@@ -7,3 +7,5 @@ class ClassifierType(Enum):
     NAIVE_BAYES = auto()
     RANDOM_FOREST = auto()
     SVM = auto()
+    XGB = auto()
+    LOGISTIC_REGRESSION = auto()
spam_detector_ai/classifiers/logistic_regression_classifier.py

Lines changed: 16 additions & 0 deletions

@@ -0,0 +1,16 @@
+# spam_detector_ai/classifiers/logistic_regression_classifier.py
+
+from sklearn.linear_model import LogisticRegression
+from sklearn.feature_extraction.text import TfidfVectorizer
+from .base_classifier import BaseClassifier
+
+
+class LogisticRegressionSpamClassifier(BaseClassifier):
+    def __init__(self):
+        super().__init__()
+        self.vectoriser = TfidfVectorizer(**BaseClassifier.VECTORIZER_PARAMS)
+
+    def train(self, X_train, y_train):
+        X_train_vectorized = self.vectoriser.fit_transform(X_train)
+        self.classifier = LogisticRegression(C=10, max_iter=200, penalty='l2', solver='saga')
+        self.classifier.fit(X_train_vectorized, y_train)
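
A note on the hyperparameters above: `C=10`, `penalty='l2'`, and `solver='saga'` presumably come out of the new
tuning scripts mentioned in the commit message; the `saga` solver is a reasonable choice here since it handles
L2-penalised logistic regression efficiently on large, sparse TF-IDF matrices.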
spam_detector_ai/classifiers/xgb_classifier.py

Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
+# spam_detector_ai/classifiers/xgb_classifier.py
+from sklearn.preprocessing import LabelEncoder
+from xgboost import XGBClassifier
+from sklearn.feature_extraction.text import TfidfVectorizer
+from .base_classifier import BaseClassifier
+
+
+class XGBSpamClassifier(BaseClassifier):
+    def __init__(self):
+        super().__init__()
+        self.vectoriser = TfidfVectorizer(**BaseClassifier.VECTORIZER_PARAMS)
+        self.label_encoder = LabelEncoder()
+
+    def train(self, X_train, y_train):
+        X_train_vectorized = self.vectoriser.fit_transform(X_train)
+        y_train_encoded = self.label_encoder.fit_transform(y_train)
+        self.classifier = XGBClassifier(colsample_bytree=0.8, learning_rate=0.2, max_depth=5, n_estimators=300,
+                                        subsample=1)
+        self.classifier.fit(X_train_vectorized, y_train_encoded)
+
+    def predict(self, X_test):
+        X_test_vectorized = self.vectoriser.transform(X_test)
+        predictions = self.classifier.predict(X_test_vectorized)
+        return self.label_encoder.inverse_transform(predictions)
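
Note the `LabelEncoder` round-trip in this class: unlike the scikit-learn estimators used by the other classifiers,
`XGBClassifier` expects numeric class labels, so the string labels are encoded before `fit` and decoded back to
their original values after `predict`.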
