A comparative study on strategies to handle extreme class imbalance in a real-world fraud detection dataset.
Franco Pérez Rivera — Data Science Portfolio Project – 2025
This project addresses the challenge of detecting fraudulent transactions in a highly imbalanced real-world bank dataset, where fraud represents just over 1% of all cases.
The goal is to compare and evaluate various modeling strategies — including undersampling, oversampling with SMOTE, Focal Loss, ensemble methods, and tree-based algorithms — to determine which approaches are most effective in handling class imbalance while maintaining strong predictive performance.
In addition to deep learning architectures, traditional models such as Random Forest, XGBoost, and LightGBM are also assessed, each using a version of the dataset tailored to its modeling family.
All data preparation steps, model evaluation metrics (F1, Precision, Recall, ROC AUC, PR AUC), and results are documented and reproducible.
- Build a robust and reproducible pipeline for fraud detection on an imbalanced dataset.
- Compare deep learning–based models with tree-based alternatives using consistent evaluation metrics.
- Explore and evaluate multiple imbalance handling strategies:
  - Undersampling at different levels
  - SMOTE (Synthetic Minority Oversampling Technique)
  - Focal Loss
  - Ensemble methods
  - PCA-based dimensionality reduction
- Identify which strategies offer the best trade-off between precision and recall.
- Provide practical recommendations for deploying fraud detection models in real-world banking environments.
This project uses the Bank Account Fraud (BAF) Dataset, published as part of the NeurIPS 2022 competition. The dataset is synthetic but highly realistic, designed to mimic real-world bank fraud detection systems. It includes:
- Extreme class imbalance (~1.1% fraud cases)
- Temporal dynamics and distribution shifts
- Differential privacy through noise injection and CTGAN-generated samples
- Feature encoding to simulate production-level banking data
We used the file Base.csv, which contains 1,000,000 samples and 32 original features. After cleaning and transformation, we created three optimized datasets:
- `df_tree_final.csv`: tailored for tree-based models (minimal preprocessing, raw/categorized features)
- `df_dl_final_clean.csv`: cleaned, scaled, and filtered for deep learning models
- `df_dl_pca.csv`: PCA-transformed version (95% variance retained) for improved training efficiency
This project explores how to detect fraudulent bank account activity under extreme class imbalance using both deep learning and tree-based approaches.
Key findings:
- Class imbalance matters: The dataset is highly imbalanced (~1.1% fraud), which makes naïve models ineffective despite high accuracy.
- Undersampling is powerful: A simple 0.5% undersampling strategy achieved the highest F1 score (0.81) and best precision-recall trade-off (PR AUC = 0.91).
- PCA helps neural networks: Dimensionality reduction via PCA slightly improved recall and PR AUC while reducing training time.
- Tree-based models outperform deep learning in this setting:
- LightGBM achieved the best overall results: F1 = 0.75, ROC AUC = 0.89, PR AUC = 0.83
- XGBoost and Tuned Random Forest closely followed, outperforming deep learning models
- Hybrid strategies (e.g., SMOTE + Focal Loss Ensembles) improved recall but not overall F1 compared to undersampling or boosting
The final comparison confirmed that, for fraud detection tasks with heavy imbalance, carefully tuned tree-based models combined with light undersampling offer the strongest results in this study at a lower training cost than neural networks.
The original dataset (Base.csv) was cleaned and transformed using a custom utils_cleaning.py script. Key steps included:
- Variable categorization: Binary (2 levels), categorical (3–8 levels), and numeric (>8 values)
- Constant and low-variance feature removal
- Outlier handling:
  - Categorical binning for skewed variables (e.g., session length, address months)
  - IQR-based outlier removal only from the majority class (non-fraud) in deep learning data
- Categorical grouping for high-cardinality variables (e.g., payment type, housing, device OS)
- Feature transformation (log, Box-Cox) for deep learning models
- Three modeling datasets created:
  - `df_tree_final.csv`: raw + binned features, ideal for tree models
  - `df_dl_final_clean.csv`: scaled and transformed
  - `df_dl_pca.csv`: PCA-reduced (95% variance retained)
This pipeline ensured clean, interpretable, and consistent inputs across all modeling strategies.
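To make the majority-only IQR step concrete, here is a minimal pandas sketch. The function name and the `fraud_bool` target column are assumptions for illustration; this is not the actual `utils_cleaning.py` code.

```python
import numpy as np
import pandas as pd

def remove_majority_outliers(df, target="fraud_bool", k=1.5, cols=None):
    """Drop IQR outliers from the majority (non-fraud) class only,
    leaving every fraud row untouched."""
    cols = cols or df.select_dtypes(include=np.number).columns.drop(target)
    majority = df[df[target] == 0]
    minority = df[df[target] == 1]
    mask = pd.Series(True, index=majority.index)
    for col in cols:
        q1, q3 = majority[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # keep rows within [q1 - k*IQR, q3 + k*IQR] on every numeric column
        mask &= majority[col].between(q1 - k * iqr, q3 + k * iqr)
    return pd.concat([majority[mask], minority]).sort_index()
```

Restricting the filter to the majority class avoids discarding rare fraud examples that often look like outliers by construction.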
All deep learning experiments were conducted using a consistent architecture to isolate the effect of imbalance-handling strategies. The model was implemented in PyTorch and included:
- Input layer: Matching the number of features
- Hidden Layer 1: 128 neurons + BatchNorm + ReLU + Dropout (30%)
- Hidden Layer 2: 64 neurons + ReLU + Dropout (20%)
- Output layer: Sigmoid activation for binary classification
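A minimal PyTorch sketch of this fixed architecture, with layer sizes taken from the description above. Note that the final layer emits a raw logit so it can pair with `BCEWithLogitsLoss`; the sigmoid is applied at inference time.

```python
import torch
import torch.nn as nn

class FraudNet(nn.Module):
    """Sketch of the fixed architecture: 128 -> 64 hidden units
    with BatchNorm, ReLU, and Dropout as described in the text."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 1),  # raw logit; sigmoid folded into the loss
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = FraudNet(n_features=30)
logits = model(torch.randn(8, 30))   # batch of 8 samples
probs = torch.sigmoid(logits)        # fraud probabilities in [0, 1]
```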
The training pipeline included:
- Stratified train/test split
- StandardScaler for feature normalization
- Early stopping based on F1 score (with patience)
- Mini-batch training using `DataLoader` (batch size = 64)
- Evaluation metrics: F1, Precision, Recall, ROC AUC, PR AUC
To ensure computational efficiency and faster experimentation, we applied a base level of undersampling to all deep learning models — keeping 2% of the majority class — unless otherwise stated.
This setup served as the foundation for comparing:
- No balancing (baseline)
- SMOTE
- Focal Loss
- SMOTE + Focal Loss
- Ensembles
- Dimensionality reduction with PCA
Undersampling was the simplest yet most effective technique tested. It reduces the number of non-fraud samples, making the dataset more balanced without introducing synthetic data.
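The idea fits in a few lines of pandas. This is a hypothetical helper, not the project's actual code; `keep_frac=0.005` corresponds to the 0.5% level.

```python
import pandas as pd

def undersample_majority(df, target="fraud_bool", keep_frac=0.005, seed=42):
    """Keep every fraud row but only a random fraction of non-fraud rows."""
    fraud = df[df[target] == 1]
    nonfraud = df[df[target] == 0].sample(frac=keep_frac, random_state=seed)
    # shuffle so the classes are interleaved before splitting into batches
    return pd.concat([fraud, nonfraud]).sample(frac=1.0, random_state=seed)
```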
We compared the performance of the deep learning model at different undersampling levels:
| Undersampling Level | F1 Score | Precision | Recall | PR AUC |
|---|---|---|---|---|
| 0.5% | 0.81 | 0.87 | 0.75 | 0.91 |
| 2% (baseline) | 0.68 | 0.63 | 0.73 | 0.73 |
| 5% | 0.52 | 0.40 | 0.74 | 0.56 |
| 10% | 0.39 | 0.26 | 0.73 | 0.42 |
Takeaway:
Even a tiny subset of the majority class (0.5%) was enough to train an effective model. This configuration achieved the best F1 and PR AUC scores, outperforming all other balancing techniques.
All models were trained using the same neural network architecture, loss function (BCEWithLogitsLoss with class weighting), and early stopping configuration to ensure comparability.
We explored a range of oversampling-based techniques and ensemble variants to assess whether they could outperform simple undersampling in fraud detection.
These strategies aimed to either:
- Augment the minority class via SMOTE, or
- Adjust learning focus via Focal Loss, or
- Combine models trained on different balanced folds to improve robustness
Ensemble methods used partitioned SMOTE-based resampling across folds, with or without Focal Loss, and averaged predictions.
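The Focal Loss used by these strategies follows the standard binary form FL(p_t) = -α_t (1 − p_t)^γ log(p_t). A NumPy sketch follows; the α and γ defaults here are conventional values, not the tuned ones from the experiments.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss on predicted probabilities p and labels y in {0, 1}.
    With gamma = 0 and alpha = 0.5 it reduces to half the usual BCE."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # prob. assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```

The `(1 - p_t) ** gamma` factor shrinks the loss on confidently correct examples, steering gradient updates toward the hard, misclassified fraud cases.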
Despite this variety, the overall performance metrics (ROC AUC and PR AUC) remained largely stable across strategies. However, we observed notable shifts in Precision and Recall, reflecting the inherent trade-off in how each method handles false positives and false negatives.
| Strategy | F1 Score | Precision | Recall | ROC AUC | PR AUC |
|---|---|---|---|---|---|
| Baseline (2% undersampling) | 0.68 | 0.63 | 0.73 | 0.82 | 0.73 |
| SMOTE | 0.68 | 0.60 | 0.77 | 0.82 | 0.73 |
| Focal Loss (Optimized α, γ) | 0.65 | 0.70 | 0.60 | 0.82 | 0.73 |
| SMOTE + Focal Loss | 0.67 | 0.60 | 0.77 | 0.82 | 0.73 |
| Partitioned SMOTE Ensemble (BCE) | 0.68 | 0.63 | 0.73 | 0.82 | 0.73 |
| Partitioned SMOTE Ensemble (Focal) | 0.68 | 0.61 | 0.78 | 0.82 | 0.73 |
Takeaway:
- Focal Loss focuses on hard-to-classify fraud cases, improving precision but often at the cost of recall.
- SMOTE-based strategies increase recall by providing more fraud examples, but may hurt precision due to synthetic data overlap.
- Ensembles help stabilize results by reducing variance and capturing complementary signals across folds, often boosting recall.
While no oversampling-based method outperformed the best undersampling setting (0.5%), ensemble variants slightly improved recall while maintaining stable AUC metrics, making them viable when minimizing false negatives is a priority.
To evaluate whether dimensionality reduction could improve model efficiency and performance, we applied Principal Component Analysis (PCA) to the dataset used for deep learning.
- PCA retained 95% of the total variance, significantly reducing feature count.
- All preprocessing steps (scaling, encoding, outlier removal) were performed before PCA.
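A minimal scikit-learn sketch of this step; the random matrix is an illustrative stand-in for the real preprocessed deep learning input.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the preprocessed deep learning feature matrix.
X = np.random.default_rng(0).normal(size=(1000, 30))

X_scaled = StandardScaler().fit_transform(X)  # all preprocessing happens before PCA
pca = PCA(n_components=0.95)                  # keep components covering 95% variance
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape, round(pca.explained_variance_ratio_.sum(), 3))
```

Passing a float to `n_components` tells scikit-learn to select the smallest number of components whose cumulative explained variance reaches that fraction.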
We compared the deep learning model trained on the PCA-transformed dataset versus the original baseline (2% undersampling, no PCA).
| Strategy | F1 Score | Precision | Recall | ROC AUC | PR AUC |
|---|---|---|---|---|---|
| Baseline (2%) | 0.68 | 0.63 | 0.73 | 0.82 | 0.73 |
| Baseline + PCA (2%) | 0.68 | 0.63 | 0.75 | 0.83 | 0.74 |
Key Insight:
PCA slightly improved recall and PR AUC, with no loss in F1 or precision. It also sped up training and made the model more stable under repeated runs. Dimensionality reduction is a useful enhancement for neural networks in high-dimensional fraud datasets.
In addition to deep learning, we evaluated several tree-based algorithms, which are often more robust to unscaled features and can handle class imbalance internally through `class_weight` or `scale_pos_weight`.
All models were trained on a dataset with 2% undersampling, optimized for tree-based algorithms (no scaling, grouped categories).
- Random Forest (baseline): Default parameters, balanced class weighting
- Random Forest (tuned): Increased number of trees, regularized depth and split criteria
- XGBoost: Gradient boosting with tuned hyperparameters and `scale_pos_weight` for imbalance
- LightGBM: Leaf-wise boosting with histogram optimization and similar tuning
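How these imbalance parameters are typically set, sketched on toy data. The data, ratio, and model settings below are illustrative, not the tuned values from the experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy stand-in for the 2%-undersampled training set (class ratio here is
# far milder than reality; real ratios come from the actual data).
X = rng.normal(size=(600, 5))
y = (rng.random(600) < 0.2).astype(int)

# For XGBoost/LightGBM, the conventional setting is the negative/positive ratio:
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

# scikit-learn trees achieve a similar effect through class_weight:
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
clf.fit(X, y)
```

Both mechanisms upweight minority-class errors during training instead of resampling the data, which is why these models can tolerate milder undersampling.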
| Model | F1 Score | Precision | Recall | ROC AUC | PR AUC |
|---|---|---|---|---|---|
| Random Forest (baseline) | 0.70 | 0.79 | 0.62 | 0.88 | 0.80 |
| Random Forest (tuned) | 0.73 | 0.71 | 0.76 | 0.88 | 0.80 |
| XGBoost | 0.74 | 0.73 | 0.76 | 0.89 | 0.82 |
| LightGBM | 0.75 | 0.72 | 0.77 | 0.89 | 0.83 |
Takeaway:
Tree-based models clearly outperformed the neural networks trained at the same 2% undersampling level. LightGBM achieved the best results among these models, offering the highest F1, recall, and AUC scores, with faster training and simpler deployment. Random Forest also improved significantly with minimal tuning.
These models are well-suited for fraud detection pipelines due to their interpretability and robustness to feature noise.
The following table summarizes the best-performing models across all strategies:
| Model | F1 Score | Precision | Recall | ROC AUC | PR AUC |
|---|---|---|---|---|---|
| Undersample 0.5% (NN) | 0.81 | 0.87 | 0.75 | 0.82 | 0.91 |
| LightGBM (2%) | 0.75 | 0.72 | 0.77 | 0.89 | 0.83 |
| XGBoost (2%) | 0.74 | 0.73 | 0.76 | 0.89 | 0.82 |
| Random Forest (Tuned, 2%) | 0.73 | 0.71 | 0.76 | 0.88 | 0.80 |
| SMOTE + Focal Loss (NN) | 0.67 | 0.60 | 0.77 | 0.82 | 0.73 |
| Ensemble SMOTE + Focal (NN) | 0.68 | 0.61 | 0.78 | 0.82 | 0.73 |
| Baseline (2%, NN) | 0.68 | 0.63 | 0.73 | 0.82 | 0.73 |
| Baseline + PCA (2%, NN) | 0.68 | 0.63 | 0.75 | 0.83 | 0.74 |

Figure 1: ROC curves for all models sorted by ROC AUC. LightGBM and XGBoost show the best overall discrimination performance.

Figure 2: PR curves for all models sorted by PR AUC. The 0.5% undersampled neural network achieves the highest precision-recall balance.
Key Conclusions:
- LightGBM and XGBoost are the most consistent top performers across all metrics, with fast training and high AUC scores.
- Simple undersampled neural networks can surpass ensemble and oversampling approaches in PR AUC, highlighting their viability in low-resource settings.
- ROC curves tend to be similar across models due to low overall fraud prevalence, but PR curves reveal clearer performance differences.
Both curves reinforce that precision-recall metrics are more informative than ROC AUC in extreme imbalance scenarios like fraud detection.
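This gap can be demonstrated on synthetic scores generated at roughly the dataset's fraud rate (illustrative numbers only, scikit-learn metrics assumed):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
# ~1% positives, mimicking the fraud prevalence discussed above.
y = (rng.random(20_000) < 0.01).astype(int)
# Positive scores shifted up by one standard deviation: a decent ranker.
scores = y * rng.normal(1.0, 1.0, y.size) + (1 - y) * rng.normal(0.0, 1.0, y.size)

roc = roc_auc_score(y, scores)           # insensitive to class prevalence
pr = average_precision_score(y, scores)  # PR AUC: penalized by false positives
print(f"ROC AUC = {roc:.2f}, PR AUC = {pr:.2f}")
```

The same scores produce a respectable ROC AUC but a far lower PR AUC, because every false positive among the huge majority class drags precision down.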
Based on the experiments and results from this project, the following recommendations are proposed:
- 🔁 Tune undersampling ratio across all model types: The best results were observed with a 0.5% undersample ratio, but this may not be optimal for every model or strategy. Testing a range of undersampling values, both in neural networks and tree-based models, can help find the most effective balance between precision and recall. There is no "perfect" ratio; the optimal point depends on the model, data representation, and business priorities.
- Consider modifying the neural network architecture: While we used a fixed architecture to control for variability, tuning hidden layers, dropout rates, and learning rates could close the performance gap between neural networks and tree models.
- Undersampling remains a surprisingly powerful strategy: It reduced training time, simplified implementation, and consistently outperformed more complex balancing techniques like SMOTE or Focal Loss. In low-fraud environments, a small and carefully chosen subset of the majority class can carry sufficient signal for effective training.
- Tree-based models were the top performers overall: LightGBM and XGBoost delivered the highest ROC AUC values and consistently strong F1 and PR AUC scores. For practical deployments, these models offer interpretability, faster inference, and better resilience to feature noise.
- Always compare modeling families: No single model is universally best. In this project, neural networks were competitive under certain setups (e.g., 0.5% undersampling), while boosting methods dominated overall. Testing multiple algorithms ensures more robust conclusions.
- Use PR AUC over ROC AUC for evaluation in imbalanced datasets: ROC curves can be misleading when the positive class is rare. PR AUC better reflects the model's ability to correctly identify fraud without being diluted by the majority class.
- For real-world applications: Start with LightGBM + light undersampling as a strong baseline. Evaluate deep learning only if computational resources allow and the feature space benefits from representation learning (e.g., embeddings, sequences, raw text).
This project was developed using Python and follows best practices for reproducibility and modularity.
- Data processing: `pandas`, `numpy`
- Visualization: `matplotlib`, `seaborn`
- Modeling:
  - Tree-based: `scikit-learn`, `xgboost`, `lightgbm`
  - Deep Learning: `PyTorch`
- Sampling techniques: `imblearn` (SMOTE)
- Evaluation: `sklearn.metrics`, custom plotting functions
- Dimensionality Reduction: `PCA` from `sklearn.decomposition`
- Python version: 3.10
- Training hardware: Laptop with Intel i7 / 16GB RAM (no GPU)
- All models trained on CPU, with early stopping and batch control to optimize runtime
- `utils_cleaning.py`: Custom functions for outlier removal, categorization, scaling, and export
- `notebooks/`: Jupyter notebooks for data cleaning, EDA, model training and evaluation
- `images/`: Contains exported plots used in reporting (e.g., ROC and PR curves)
- Neural networks used `BCEWithLogitsLoss` with optional Focal Loss
- All deep learning models shared the same architecture to isolate the effect of resampling techniques
- Tree-based models used stratified undersampling and were tuned via grid search or early performance inspection
- PCA retained 95% of variance and was only applied to deep learning inputs
While this project provides useful insights into fraud detection under class imbalance, several limitations should be considered:
- Synthetic dataset: Although realistic, the data is generated via CTGAN and subject to artificial patterns or noise not present in real banking systems.
- Single dataset version: We focused only on the `Base.csv` file. Other variants in the BAF suite (e.g., with demographic bias or temporal drift) were not explored.
- Fixed neural network architecture: To ensure fair comparison, the network was not optimized beyond basic structure. A more tailored design could yield stronger results.
- No real-time or streaming evaluation: All models were trained and evaluated offline. In production, data arrives incrementally, which may affect performance.
- Undersampling randomness: Despite stratified sampling, undersampling introduces variance depending on which examples are kept. Results may shift slightly between runs.
This project provides a strong starting point, but future iterations could extend the analysis to other dataset variants, more complex pipelines, or real-time inference settings.
⚠️ Due to file size limitations on GitHub, the CSV datasets (`df_tree_final.csv`, `df_dl_final_clean.csv`, `df_dl_pca.csv`) are not included in this repository.
You can recreate them by running the data cleaning notebook (`DataCleaning.ipynb`) with the original dataset.