ArabicFraudDetection

A machine learning project to detect fraudulent comments in ride-hailing services using Arabic text data. The project leverages pre-trained language models like AraBERT to classify user comments as fraudulent or non-fraudulent.

Overview

Detecting fraudulent activities in ride-hailing services is critical for ensuring reliability and customer trust. This project focuses on processing Arabic text data, fine-tuning a pre-trained language model, and deploying an effective system for fraud detection.

Key Features:

- Custom Preprocessing: Automatically label comments based on fraud-related keywords.
- Fine-Tuning Pre-trained Models: Fine-tune AraBERT for binary classification tasks.
- Evaluation Metrics: Calculate metrics like accuracy, precision, recall, and F1-score.
- Modular Code: Organized and reusable Python modules for data processing, training, and evaluation.
- Docker Support: Easily run the project in a containerized environment.

Directory Structure

ArabicFraudDetection/
├── data/                    # Store data files or sample datasets
├── src/                     # Python scripts for preprocessing, training, evaluation, etc.
│   ├── preprocessing.py     # Data cleaning and preparation
│   ├── train.py             # Model fine-tuning script
│   ├── evaluate.py          # Evaluation and metrics calculation
│   └── utils.py             # Helper functions
├── models/                  # Save trained models and checkpoints
├── notebooks/               # Jupyter notebooks for exploratory work
├── tests/                   # Unit and integration tests
├── tmp_trainer/             # Default output directory the Hugging Face Trainer creates when no TrainingArguments are passed
├── requirements.txt         # Python dependencies
├── Dockerfile               # Dockerfile to containerize the project
├── README.md                # Documentation
├── LICENSE                  # License file
├── .gitignore               # Ignore unnecessary files (e.g., data, logs)
└── setup.py                 # Package installation script

Getting Started

Prerequisites

Ensure you have the following installed:

- Python 3.8+
- pip
- Docker (only for the containerized setup)
- A virtual environment (optional but recommended)

Installation

Option 1: Run Locally

Clone the repository:

git clone https://github.com/smasoudrezvani/ArabicFraudDetection_LLM.git
cd ArabicFraudDetection_LLM

Install dependencies:
```
pip install -r requirements.txt
```
Prepare your dataset:
- Place your raw dataset (e.g., df_rating&comment.xlsx) in the data/ folder.

Option 2: Run with Docker

Build the Docker image (Docker image names must be lowercase):

docker build -t arabic-fraud-detection .

Run the Docker container:

docker run --rm -it arabic-fraud-detection

Usage (Not using Docker)

1. Preprocess Data

Run the preprocessing script to clean and label the data:

python3 ./src/preprocessing.py --input "data/df_rating&comment.xlsx" --output "data/processed"
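
The keyword-based labeling described under Key Features could look roughly like the sketch below. The keyword list, the `normalize` and `label_comment` helpers, and the normalization rules are illustrative assumptions; this README does not show what `src/preprocessing.py` actually does.

```python
import re

# Hypothetical fraud-related keywords ("fraud", "scam", "theft").
# The project's real keyword list is not shown in this README.
FRAUD_KEYWORDS = ["احتيال", "نصب", "سرقة"]

# Arabic diacritics (U+064B to U+0652) and tatweel (U+0640), removed before matching.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, and unify common alef variants."""
    text = _DIACRITICS.sub("", text)
    text = re.sub("[إأآ]", "ا", text)
    return text.strip()


def label_comment(comment: str) -> int:
    """Return 1 if any fraud keyword occurs in the comment, else 0."""
    cleaned = normalize(comment)
    return int(any(normalize(keyword) in cleaned for keyword in FRAUD_KEYWORDS))
```

Plain substring matching is simple but can over-trigger, because a keyword may appear inside an unrelated longer word. Labels produced this way are best treated as weak labels and spot-checked by hand.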

2. Train the Model

Fine-tune the pre-trained language model:

python3 ./src/train.py --model aubmindlab/bert-base-arabertv2 --train data/processed/train.json --test data/processed/test.json --output models/fraud_detector
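
For orientation, a fine-tuning script that accepts these flags might be structured as in the sketch below. It uses the Hugging Face `transformers` and `datasets` libraries. The `"text"` and `"label"` field names, the JSON layout, and every hyperparameter value are assumptions for illustration, not the project's actual settings.

```python
import argparse


def parse_args(argv=None):
    """Parse the same flags used in the training command above."""
    parser = argparse.ArgumentParser(description="Fine-tune a model for fraud classification")
    parser.add_argument("--model", default="aubmindlab/bert-base-arabertv2")
    parser.add_argument("--train", required=True)
    parser.add_argument("--test", required=True)
    parser.add_argument("--output", required=True)
    return parser.parse_args(argv)


def fine_tune(args):
    """Fine-tune a sequence classifier and save it to args.output."""
    # Imported lazily so argument parsing works without the heavy dependencies.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    model = AutoModelForSequenceClassification.from_pretrained(args.model, num_labels=2)

    # Assumption: each record has a "text" string and an integer "label" (0 or 1).
    data = load_dataset("json", data_files={"train": args.train, "test": args.test})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    data = data.map(tokenize, batched=True)

    training_args = TrainingArguments(
        output_dir=args.output,
        num_train_epochs=3,  # illustrative values only
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=data["train"],
        eval_dataset=data["test"],
    )
    trainer.train()
    trainer.save_model(args.output)
    tokenizer.save_pretrained(args.output)


# Example call: fine_tune(parse_args()) reads the flags from the command line.
```

Saving the tokenizer next to the model keeps the output directory self-contained, so it can later be loaded from a single path.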

3. Evaluate the Model

Evaluate the model's performance on the test set:

python3 ./src/evaluate.py --model ./models/fraud_detector --test ./data/processed/test.json
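
The reported metrics follow the standard binary-classification definitions. The sketch below computes them from true and predicted labels, treating fraud (label 1) as the positive class. The `binary_metrics` helper and the toy labels are illustrative and are not taken from `src/evaluate.py`.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the positive (fraud = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed fraud
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (not project data).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

In fraud detection the classes are often imbalanced, so precision and recall on the fraud class usually say more than accuracy alone.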

Using Docker

1. Run Preprocessing with Docker

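Override the container's default command to run preprocessing, and mount the local data/ folder so the processed files persist on the host. Quote the input path, because an unquoted & would send the command to the background in the shell:

docker run -v "$(pwd)/data:/app/data" arabic-fraud-detection python src/preprocessing.py --input "data/df_rating&comment.xlsx" --output data/processed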

2. Train the Model with Docker

docker run -v "$(pwd)/data:/app/data" -v "$(pwd)/models:/app/models" arabic-fraud-detection python src/train.py --model aubmindlab/bert-base-arabertv2 --train data/processed/train.json --test data/processed/test.json --output models/fraud_detector

Results

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.92  |
| Precision | 0.90  |
| Recall    | 0.89  |
| F1 Score  | 0.89  |

Advanced Features

- Hyperparameter Tuning: Use the Trainer API's hyperparameter search functionality to optimize the model.
- Data Augmentation: Extend the dataset using techniques like back-translation or synonym replacement.
- Deployment: Deploy the model as a REST API using FastAPI or a user-friendly interface with Streamlit.
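
As a starting point for the hyperparameter tuning item, the sketch below defines a search space for `Trainer.hyperparameter_search` with the Optuna backend. The ranges and the `hp_space` function name are assumptions for illustration, and the `optuna` package must be installed separately.

```python
def hp_space(trial):
    """Search space for the Optuna backend; `trial` is an optuna Trial object."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }


# The Trainer must be built with `model_init=...` so each trial starts from a fresh model.
# Without `compute_metrics`, the default objective is the evaluation loss, hence "minimize":
# best_run = trainer.hyperparameter_search(
#     hp_space=hp_space, backend="optuna", direction="minimize", n_trials=10
# )
```

The keys of the returned dictionary must match `TrainingArguments` field names, since the Trainer applies them to its arguments for each trial.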

Contributing

Contributions are welcome! To contribute:

1. Fork the repository.
2. Create a feature branch: `git checkout -b feature-name`
3. Commit changes: `git commit -m 'Add feature-name'`
4. Push to the branch: `git push origin feature-name`
5. Open a pull request.

License

This project is licensed under the Apache License. See the LICENSE file for details.

Acknowledgments

Special thanks to Hugging Face for providing the tools and pre-trained models that make this project possible.

