[](https://github.com/adamspd/spam-detection-project/graphs/contributors)

Spam-Detector-AI is a Python package for detecting and filtering spam messages using Machine Learning models. The
package integrates with Django or any other Python project and offers several classifiers: Naive Bayes, Random Forest,
and Support Vector Machine (SVM). Since version 2.1.0, two new classifiers have been added: Logistic Regression and
XGBClassifier.

⚠️ _**Warning**: No significant breaking changes were introduced in version 2.x.x in terms of usage. However, the
fine-tuning of the models has been moved to a separate module (`tuning`) and the tests have been moved to a separate
module (`tests`)._ ⚠️

## Table of Contents

Make sure you have the following dependencies installed:

- pandas
- numpy
- joblib
- xgboost

Additionally, you'll need to download the NLTK data. To do so, run the following commands in the Python interpreter:
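The exact download commands are not shown in this excerpt; a minimal sketch, assuming the package relies on the common `punkt` tokenizer and `stopwords` corpus (check the package documentation for the definitive list):

```python
import nltk

# Fetch the tokenizer models and stopword lists typically used for
# text preprocessing. The exact corpora this package needs may differ.
nltk.download('punkt')
nltk.download('stopwords')
```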

If this happens, use an IDE to run the `trainer.py` file until a fix is implemented.

This will train all the models and save them as `.joblib` files in the models directory. There are now five models:

- `naive_bayes_model.joblib`
- `random_forest_model.joblib`
- `svm_model.joblib`
- `logistic_regression_model.joblib`
- `xgb_model.joblib`

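Any of these files can later be reloaded with joblib. A minimal round-trip sketch, where a plain dict stands in for a fitted classifier and the filename merely mirrors the list above:

```python
import tempfile
from pathlib import Path

import joblib

# Round-trip sketch: trained classifiers are persisted with joblib.dump
# and reloaded with joblib.load. A dict stands in for a fitted model here.
model_path = Path(tempfile.mkdtemp()) / "naive_bayes_model.joblib"
joblib.dump({"fitted": True}, model_path)
model = joblib.load(model_path)
print(model)  # {'fitted': True}
```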
### Tests

The test results are shown below:

##### Accuracy: 0.9773572152754308

<br>

#### _Model: LOGISTIC_REGRESSION_

##### Confusion Matrix:

|                  | Predicted: Ham       | Predicted: Spam     |
|------------------|----------------------|---------------------|
| **Actual: Ham**  | 2065 (True Negative) | 48 (False Positive) |
| **Actual: Spam** | 46 (False Negative)  | 989 (True Positive) |

- True Negative (TN): 2065 messages were correctly identified as ham (non-spam).
- False Positive (FP): 48 ham messages were incorrectly identified as spam.
- False Negative (FN): 46 spam messages were incorrectly identified as ham.
- True Positive (TP): 989 messages were correctly identified as spam.

##### Performance Metrics:

|              | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| Ham          | 0.98      | 0.98   | 0.98     | 2113    |
| Spam         | 0.95      | 0.96   | 0.95     | 1035    |
| **Accuracy** |           |        | **0.97** | 3148    |
| Macro Avg    | 0.97      | 0.97   | 0.97     | 3148    |
| Weighted Avg | 0.97      | 0.97   | 0.97     | 3148    |

##### Accuracy: 0.9707680491551459

<br>

#### _Model: XGB_

##### Confusion Matrix:

|                  | Predicted: Ham       | Predicted: Spam      |
|------------------|----------------------|----------------------|
| **Actual: Ham**  | 2050 (True Negative) | 63 (False Positive)  |
| **Actual: Spam** | 28 (False Negative)  | 1007 (True Positive) |

- True Negative (TN): 2050 messages were correctly identified as ham (non-spam).
- False Positive (FP): 63 ham messages were incorrectly identified as spam.
- False Negative (FN): 28 spam messages were incorrectly identified as ham.
- True Positive (TP): 1007 messages were correctly identified as spam.

##### Performance Metrics:

|              | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| Ham          | 0.99      | 0.97   | 0.98     | 2113    |
| Spam         | 0.94      | 0.97   | 0.96     | 1035    |
| **Accuracy** |           |        | **0.97** | 3148    |
| Macro Avg    | 0.96      | 0.97   | 0.97     | 3148    |
| Weighted Avg | 0.97      | 0.97   | 0.97     | 3148    |

##### Accuracy: 0.9710927573062261

The models that performed best are the SVM and Logistic Regression, with the SVM achieving slightly higher accuracy.
Since no single model is perfect, I decided to implement a voting classifier that combines the predictions of the five
models (Naive Bayes, Random Forest, SVM, Logistic Regression, and XGB) using a majority-vote system to make the final
prediction. This approach aims to leverage the strengths of each model to improve overall prediction accuracy.

##### Weighted Voting System

To enhance the decision-making process, I've refined the approach to a weighted voting system. Each model's vote is
assigned a weight proportional to its accuracy relative to the sum of the accuracies of all models, so the more
accurate models have a greater influence on the final decision.

The models and their respective proportional weights are as follows:

- Naive Bayes: Weight = 0.1822
- Random Forest: Weight = 0.2047
- SVM (Support Vector Machine): Weight = 0.2052
- Logistic Regression: Weight = 0.2039
- XGBoost (XGB): Weight = 0.2039

The final decision on whether a message is spam is determined by the weighted spam score. Each model casts a vote (spam
or not spam), and each vote is multiplied by the model's weight. The weighted spam scores from all models are then
summed; if the total exceeds 50% of the total possible weight, the message is classified as spam. Otherwise, it's
classified as not spam (ham).

This approach ensures that the more accurate models have a larger say in the final decision, thereby increasing the
reliability of spam detection. It combines the strengths of each model, compensates for individual weaknesses, and
provides a more nuanced classification.

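The decision rule described above can be sketched as follows. The `weighted_vote` helper is illustrative, not the package's actual API; the weights are the ones listed above:

```python
# Proportional weights from the list above.
WEIGHTS = {
    "naive_bayes": 0.1822,
    "random_forest": 0.2047,
    "svm": 0.2052,
    "logistic_regression": 0.2039,
    "xgb": 0.2039,
}

def weighted_vote(votes):
    """votes maps model name -> True (spam) / False (ham).

    Returns True when the weighted spam score exceeds 50% of the
    total possible weight.
    """
    total_weight = sum(WEIGHTS.values())
    spam_score = sum(WEIGHTS[m] for m, is_spam in votes.items() if is_spam)
    return spam_score > 0.5 * total_weight

# Example: SVM, Logistic Regression, and XGB vote spam; the others vote ham.
votes = {
    "naive_bayes": False,
    "random_forest": False,
    "svm": True,
    "logistic_regression": True,
    "xgb": True,
}
print(weighted_vote(votes))  # True: spam carries ~61% of the total weight
```

Note that with these weights, two models voting spam are never enough on their own, while any three of them are, which matches the intuition of a majority vote biased toward the stronger models.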
##### System Output

The system provides a detailed output for each message, showing the vote (spam or ham) from each model, along with its
weight. It also displays the total weighted spam score and the final classification decision (Spam or Not Spam). This
transparency in the voting process allows for easier understanding and debugging of the models' performance on
different messages.

If you have trained the models on new data, you can test them by running the following command:

```sh
python tests/test.py
```

⚠️ **Warning**: A module-not-found error may occur ⚠️

If this happens, use an IDE to run the `test.py` file until a fix is implemented.
182 | 280 |
|
183 |
| - |
184 | 281 | ### Making Predictions
|
185 | 282 |
|
186 | 283 | To use the spam detector in your Django project:
- `loading_and_processing/`: Contains utility functions for loading and preprocessing data.
- `models/`: Contains the trained models and their vectorizers.
- `prediction/`: Contains the main spam detector class.
- `tests/`: Contains scripts for testing the classifiers.
- `tuning/`: Contains scripts for tuning the classifiers.
- `training/`: Contains scripts for training the classifiers.

## Contributing