Phishing is one of the simplest ways to obtain sensitive information from unsuspecting users. Phishers aim to acquire critical information such as usernames, passwords, and bank account details. Security practitioners are therefore looking for reliable and stable techniques for detecting phishing websites. This paper applies machine learning to the detection of phishing URLs by extracting and analyzing various features of legitimate and phishing URLs. Decision Tree, Random Forest, and Support Vector Machine algorithms are used to detect phishing websites. The aim of the paper is to detect phishing URLs and to identify the best machine learning algorithm by comparing the accuracy, false positive rate, and false negative rate of each algorithm.
The goal of this project is to build a model that detects phishing websites from various features, using machine learning algorithms.
The dataset used here was collected from Kaggle: https://www.kaggle.com/eswarchandt/phishing-website-detector. I have also uploaded it to the Dataset folder, so you can access it there.
- Importing all the required libraries. Check `requirements.txt`.
- Upload the dataset and the Jupyter Notebook file.
- After that, I extracted the features and applied the ML algorithms.
- Classification Models:
  - Logistic Regression
  - Decision Tree Classifier
  - Random Forest Classifier
  - K-Nearest Neighbors (KNN)
  - Support Vector Machine (SVM)
  - Gradient Boosting
  - AdaBoost Classifier
  - XGBoost Classifier
- A closer look at Logistic Regression
- Model Comparison
- Conclusion
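
The workflow above (train several classifiers, then compare them) can be sketched roughly as follows. This is a minimal illustration, not the notebook's code. It assumes scikit-learn is installed, uses a synthetic dataset from `make_classification` as a stand-in for the Kaggle data, covers only three of the listed classifiers, and uses an 80/20 split chosen purely for illustration. `compare_models` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phishing feature table (assumption: the real
# notebook loads the Kaggle CSV instead of generating data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# A subset of the classifiers listed above; the others plug in the same way.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
}


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return {name: test accuracy in percent}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        scores[name] = round(100 * accuracy_score(y_test, predictions), 2)
    return scores


scores = compare_models(models, X_train, X_test, y_train, y_test)

# Print the models from highest to lowest accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score}")
```

The numbers printed by this sketch come from synthetic data and are not comparable to the scores reported for the real dataset.
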
We trained nine machine learning models, and all of them ran successfully. We compared the models by their accuracy scores. The score of each model is listed below.
| Name of the Model | Accuracy Score (%) |
|---|---|
| Logistic Regression | 92.76 |
| Decision Tree Classifier | 94.72 |
| Random Forest Classifier | 97.05 |
| KNN Algorithm | 63.52 |
| Support Vector Machine Algorithm | 56.04 |
| Gradient Boosting | 84.11 |
| AdaBoost Classifier | 91.04 |
| XGBoost Classifier | 94.72 |
| Logistic Regression on two features | 84.11 |
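
The table reports accuracy only, while the stated aim also mentions false positive and false negative rates. A minimal sketch of how all three could be computed from true and predicted labels is shown here. The label encoding (1 = phishing, 0 = legitimate), the toy labels, and the helper name `confusion_rates` are assumptions made for illustration and do not come from the dataset or the notebook.

```python
def confusion_rates(y_true, y_pred, positive=1):
    """Return accuracy, false positive rate and false negative rate."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # phishing correctly flagged
            else:
                fp += 1  # legitimate site wrongly flagged
        else:
            if truth == positive:
                fn += 1  # phishing site missed
            else:
                tn += 1  # legitimate site correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Toy labels (assumption): 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

rates = confusion_rates(y_true, y_pred)
print(rates)
```

For a phishing detector, a false negative (a phishing site classified as legitimate) is usually the costlier mistake, so two models with similar accuracy can still differ in practice.
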
Comparing the scores, it is clear that the Random Forest and Decision Tree classifiers perform best on this dataset. The XGBoost classifier, which matches the Decision Tree score of 94.72, is the next best choice ahead of the remaining algorithms.
Ranking of the models by accuracy, best first:
- Random Forest Classifier
- Decision Tree Classifier
- XGBoost Classifier
- Logistic Regression
- AdaBoost Classifier
- Gradient Boosting
- KNN
- SVM
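
The ranking above follows directly from the accuracy table. As a small sketch, the scores can be sorted in descending order. The dictionary below copies the reported values and leaves out the "Logistic Regression on two features" row, since the ranking does not include it. Decision Tree and XGBoost tie at 94.72, and Python's stable sort keeps them in the order they were entered.

```python
# Accuracy scores (%) copied from the table above.
accuracy = {
    "Logistic Regression": 92.76,
    "Decision Tree Classifier": 94.72,
    "Random Forest Classifier": 97.05,
    "KNN": 63.52,
    "SVM": 56.04,
    "Gradient Boosting": 84.11,
    "AdaBoost Classifier": 91.04,
    "XGBoost Classifier": 94.72,
}

# Sort by score, highest first; ties keep their insertion order.
ranking = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)

for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {score}")
```
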


