This project detects fraudulent credit card transactions using machine learning. Because fraud cases make up only a tiny fraction of all transactions, the data is rebalanced with SMOTE (Synthetic Minority Over-sampling Technique) and then modeled with Logistic Regression.
The dataset consists of credit card transactions over a two-day period in September 2013, containing:
- 284,807 transactions
- 492 fraud cases (0.172% of total transactions)
- V1 to V28: PCA-transformed features (due to confidentiality).
- Time: Elapsed time since the first transaction.
- Amount: Transaction amount.
- Class: Fraud status (0 = Non-fraud, 1 = Fraud).
- Since fraudulent transactions are extremely rare (0.17%), we applied SMOTE to balance the dataset.
- Fraudulent instances increased from 469 to 65,598 post-balancing.
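
To make the balancing step concrete, here is a minimal, dependency-free sketch of the interpolation idea behind SMOTE: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy points are illustrative assumptions, not the project's code; in practice a library implementation such as imbalanced-learn's `SMOTE` would normally be used.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (illustrative only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical two-feature fraud samples, for illustration only.
fraud = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(fraud, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real fraud samples, it stays inside the region the minority class already occupies.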
- Weight of Evidence (WOE): Used to transform categorical variables into continuous values.
- Information Value (IV): Used to select the most predictive features (IV > 0.3).
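
The two quantities above can be sketched as follows. This assumes one common sign convention, WOE = ln(share of non-fraud in the bin / share of fraud in the bin), with IV summing (share of non-fraud − share of fraud) × WOE over bins; the project may use the opposite sign. The function name `woe_iv`, the 0.5 floor for empty cells, and the example counts are all hypothetical.

```python
import math

def woe_iv(bins):
    """bins maps a bin label to (non_fraud_count, fraud_count).
    Returns ({label: woe}, information_value)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe, iv = {}, 0.0
    for label, (good, bad) in bins.items():
        # Share of all non-fraud / fraud cases that fall into this bin.
        # The 0.5 floor avoids log(0) for empty cells (an assumption).
        pct_good = max(good, 0.5) / total_good
        pct_bad = max(bad, 0.5) / total_bad
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv

# Hypothetical counts per bin of some feature, for illustration only.
example = {"low": (900, 10), "medium": (80, 30), "high": (20, 60)}
woe, iv = woe_iv(example)
keep_feature = iv > 0.3  # the selection threshold stated above
```

Under this convention, bins dominated by non-fraud get a positive WOE and bins dominated by fraud get a negative one; IV is non-negative either way.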
- Correlation heatmaps to identify relationships between features.
- Distribution analysis to understand fraud vs. non-fraud transactions.
- Fraud vs. Non-Fraud Imbalance visualization.
The primary model used is Logistic Regression, a widely accepted model for fraud detection.
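
For readers unfamiliar with the model, the following is a minimal sketch of logistic regression fitted by batch gradient descent on the mean log-loss. It is dependency-free for illustration; the names `fit_logistic` and `predict_proba`, the learning rate, the epoch count, and the toy data are assumptions, and a real pipeline would more likely rely on a library such as scikit-learn's `LogisticRegression`.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: p(fraud) - label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(x, w, b):
    """Estimated probability that x belongs to class 1 (fraud)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy one-feature data (hypothetical): larger values are labelled fraud.
X = [[0.1], [0.4], [0.5], [1.5], [1.8], [2.2]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
```

The linear term `w·x + b` is the log-odds of fraud, which is what makes the model convenient for deriving a risk score.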
- ROC Curve: Visualizing classification performance.
- Fraud Score Calculation: Transformed log-odds into a fraud risk score (0-100), categorizing transactions as:
  - 🟢 No Risk
  - 🟡 Low Risk
  - 🟠 Moderate Risk
  - 🔴 High Risk
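
One way the score calculation could look is sketched below. It assumes the 0-100 score is the logistic function of the log-odds scaled by 100, and it uses cutoffs of 25, 50, and 75 for the four bands; neither the exact mapping nor the thresholds are stated in this README, so both are hypothetical, as are the function names.

```python
import math

def fraud_score(log_odds):
    """Map model log-odds to a 0-100 score via the logistic function
    (an assumed mapping, not necessarily the project's)."""
    if log_odds >= 0:
        p = 1.0 / (1.0 + math.exp(-log_odds))
    else:
        e = math.exp(log_odds)
        p = e / (1.0 + e)
    return 100.0 * p

def risk_band(score, cutoffs=(25, 50, 75)):
    """Assign a risk category; the cutoffs are hypothetical placeholders."""
    low, moderate, high = cutoffs
    if score < low:
        return "No Risk"
    if score < moderate:
        return "Low Risk"
    if score < high:
        return "Moderate Risk"
    return "High Risk"

for z in (-3.0, -0.5, 0.5, 3.0):
    s = fraud_score(z)
    print(f"log-odds {z:+.1f} -> score {s:5.1f} -> {risk_band(s)}")
```

Keeping the cutoffs as a parameter makes it easy to tune the bands against the ROC curve, for example to trade false alarms against missed fraud.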