This project demonstrates credit card fraud detection on a publicly available dataset, covering the pipeline from data exploration to saving a trained model for later use.
The goal of this project is to build a model that can accurately identify fraudulent credit card transactions.
We use the Credit Card Fraud Detection dataset and employ a Logistic Regression model, while addressing the significant class imbalance present in the data.
- Dataset: Credit Card Fraud Detection (via `fetch_openml`)
- Features:
  - `V1`-`V28`: anonymized transaction features (PCA transformed)
  - `Amount`: transaction amount
  - `Class`: target variable (`0` = non-fraud, `1` = fraud)
- Loaded into a pandas DataFrame for analysis.
- Checked data structure, missing values, and target variable imbalance.
- Applied StandardScaler for feature scaling.
- Split into train/test sets with stratification to preserve class ratios.
- Applied SMOTE (Synthetic Minority Over-sampling Technique) to handle severe class imbalance.
- Trained a Logistic Regression model on the SMOTE-resampled training set.
- Evaluated on the unseen test set using multiple metrics:
  - Precision
  - Recall
  - F1-Score
  - Confusion Matrix
  - ROC AUC Score
- Saved the trained model for future predictions on new transactions.
Key findings from model evaluation:
- Recall (Fraudulent Class): 0.92 (captures most fraud cases)
- Precision (Fraudulent Class): 0.06 (high false positives)
- ROC AUC Score: 0.9707 (strong discriminative ability)
Interpretation:
The model catches most fraud (recall of 0.92), but its precision of 0.06 means that roughly 94 of every 100 transactions it flags are legitimate. Further optimization (e.g., advanced models, threshold tuning, ensemble methods) is needed to reduce these false alarms.
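As one illustration of the threshold tuning mentioned above, the sketch below picks the decision threshold that keeps recall as high as possible while meeting a minimum precision. The helper `pick_threshold`, the `min_precision=0.5` target, and the synthetic data are assumptions for the example, and a plain Logistic Regression stands in for the SMOTE-trained model to keep the block short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split


def pick_threshold(y_true, y_score, min_precision):
    """Return (threshold, precision, recall) with the highest recall whose
    precision is at least min_precision, or None if no threshold qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision=1, recall=0) point has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return None
    # Among qualifying thresholds, take the one with the highest recall.
    best = int(np.argmax(np.where(ok, recall, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Synthetic stand-in for the real data: roughly 2% positives (hypothetical).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_scores = clf.predict_proba(X_val)[:, 1]

result = pick_threshold(y_val, val_scores, min_precision=0.5)
print(result)  # None if no threshold reaches the requested precision
```

The chosen threshold would then replace the default 0.5 cut-off, for example `val_scores >= threshold`. Tuning on a held-out validation split rather than the test set keeps the final test metrics honest.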
- Python
- Pandas, NumPy: data handling
- Scikit-learn: preprocessing, modeling, evaluation
- Imbalanced-learn (SMOTE): handling class imbalance
- Experiment with tree-based models (Random Forest, XGBoost, LightGBM).
- Apply threshold tuning for better precision-recall tradeoff.
- Use anomaly detection methods for fraud detection.
- Build a real-time detection system with streaming data.
- Dataset: Credit Card Fraud Detection on OpenML
- Imbalanced-learn Documentation: SMOTE
This project provides a solid baseline for fraud detection with Logistic Regression and SMOTE.
It highlights the challenges of imbalanced data and the trade-off between recall and precision in financial fraud detection.