This project is an end-to-end phishing URL detection system. It fine-tunes a BERT model and exposes a Flask API for real-time predictions. It also includes a CI/CD pipeline built on GitHub Actions and AWS.
.
├── app.py # Flask application
├── main.py # Model training script
├── requirements.txt # Python dependencies
├── Dockerfile # Docker configuration
└── .github/workflows/ # GitHub Actions workflows
- Clone this repository:
git clone https://github.com/darshan8850/Finetune-Bert-Phishing-URL-Detection.git
cd Finetune-Bert-Phishing-URL-Detection
- Python 3.9+
- Docker
- AWS Account with:
  - EC2 instance
  - ECR repository
  - IAM user with appropriate permissions
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Run the Flask application:
python app.py
- Build the Docker image:
docker build -t phishing-url-detection .
- Run the container:
docker run -p 5000:5000 phishing-url-detection
- Set up GitHub Secrets:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - EC2_HOST
  - EC2_USERNAME
  - EC2_SSH_KEY
- Create an ECR repository:
aws ecr create-repository --repository-name phishing-url-detection
- Push to main branch to trigger deployment
GET /health
Response:
{
  "status": "healthy"
}

POST /predict
Request body:
{
  "url": "https://example.com"
}
Response:
{
  "url": "https://example.com",
  "prediction": "Safe",
  "confidence": 0.95
}

The project uses a custom dataset for phishing URL classification, available at darshan8950/phishing_url_detection_BERT on the Hugging Face Hub.
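
To try the GET /health and POST /predict endpoints described above from Python, a small client along these lines may help. It is a minimal sketch using only the standard library. The base URL `http://localhost:5000` is an assumption taken from the `docker run -p 5000:5000` command, and the helper names (`build_predict_request`, `call_json`, `check_health`, `predict_url`) are hypothetical, not part of this repository.

```python
import json
import urllib.request

# Assumption: the API is reachable on the port published by `docker run -p 5000:5000`.
BASE_URL = "http://localhost:5000"


def build_predict_request(url, base_url=BASE_URL):
    """Build the POST /predict request with a JSON body of the form {"url": ...}."""
    payload = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_json(request, timeout=30.0):
    """Send a Request object (or a plain URL string) and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_health(base_url=BASE_URL):
    """Call GET /health; a running service should answer {"status": "healthy"}."""
    return call_json(f"{base_url}/health", timeout=5.0)


def predict_url(url, base_url=BASE_URL):
    """Call POST /predict and return the parsed response body."""
    return call_json(build_predict_request(url, base_url))
```

With the container running, `check_health()` followed by `predict_url("https://example.com")` should return dictionaries shaped like the responses shown above. Separating request construction from sending keeps the request format easy to inspect without a live server.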
The base model used is bert-base-uncased, which is fine-tuned for binary classification of URLs as safe or potentially
phishing.
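
For a binary classifier of this kind, the model typically emits two logits per URL, and the API fields `prediction` and `confidence` can be derived from them with a softmax. The sketch below illustrates that step in plain Python. It is not taken from `app.py`: the label order (index 0 = "Safe", index 1 = "Phishing"), the label text "Phishing", and the helper names are assumptions for illustration.

```python
import math

# Assumption: index 0 means safe and index 1 means phishing.
# The real mapping depends on how the labels were encoded during fine-tuning.
LABELS = ("Safe", "Phishing")


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(url, logits):
    """Pick the most probable class and report its probability as the confidence."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "url": url,
        "prediction": LABELS[best],
        "confidence": round(probs[best], 4),
    }
```

For example, logits of `[2.0, -1.0]` give a probability of roughly 0.95 for the first class, which would be reported as "Safe" under the assumed label order.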
After fine-tuning, the model's performance can be evaluated using accuracy and AUC metrics. Refer to the output of main.py for detailed results.
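
To make the two metrics concrete, the following sketch computes accuracy and ROC AUC in plain Python. It is an illustration only and not the evaluation code from `main.py`, which may well rely on a metrics library instead. It assumes labels are encoded as 0 (safe) and 1 (phishing) and that `scores` holds the predicted probability of the positive class; both are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n^2), which is fine for a
    small illustration but slow on large evaluation sets.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Accuracy depends on the decision threshold applied to the scores, while AUC measures how well the scores rank phishing URLs above safe ones across all thresholds, so the two numbers can differ noticeably on imbalanced data.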
Fine Tuned Model - Link
Dataset - Link
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request