A comparative study on strategies to handle extreme class imbalance in a real-world fraud detection dataset.
Franco Pérez Rivera — Data Science Portfolio Project – 2025
This project addresses the challenge of detecting fraudulent transactions in a highly imbalanced real-world bank dataset, where fraud represents just over 1% of all cases.
The goal is to compare and evaluate various modeling strategies — including undersampling, oversampling with SMOTE, Focal Loss, ensemble methods, and tree-based algorithms — to determine which approaches are most effective in handling class imbalance while maintaining strong predictive performance.
In addition to deep learning architectures, traditional models such as Random Forest, XGBoost, and LightGBM are also assessed, each using a version of the dataset tailored to its modeling family.
All data preparation steps, model evaluation metrics (F1, Precision, Recall, ROC AUC, PR AUC), and results are documented and reproducible.
- Build a robust and reproducible pipeline for fraud detection on an imbalanced dataset.
- Compare deep learning–based models with tree-based alternatives using consistent evaluation metrics.
- Explore and evaluate multiple imbalance handling strategies:
  - Undersampling at different levels
  - SMOTE (Synthetic Minority Oversampling Technique)
  - Focal Loss
  - Ensemble methods
  - PCA-based dimensionality reduction
- Identify which strategies offer the best trade-off between precision and recall.
- Provide practical recommendations for deploying fraud detection models in real-world banking environments.
This project uses the Bank Account Fraud (BAF) Dataset, published as part of the NeurIPS 2022 competition. The dataset is synthetic but highly realistic, designed to mimic real-world bank fraud detection systems. It includes:
- Extreme class imbalance (~1.1% fraud cases)
- Temporal dynamics and distribution shifts
- Differential privacy through noise injection and CTGAN-generated samples
- Feature encoding to simulate production-level banking data
We used the file Base.csv, which contains 1,000,000 samples and 32 original features. After cleaning and transformation, we created three optimized datasets:
- `df_tree_final.csv`: tailored for tree-based models (minimal preprocessing, raw/categorized features)
- `df_dl_final_clean.csv`: cleaned, scaled, and filtered for deep learning models
- `df_dl_pca.csv`: PCA-transformed version (95% variance retained) for improved training efficiency
This project explores how to detect fraudulent bank account activity under extreme class imbalance using both deep learning and tree-based approaches.
Key findings:
- Class imbalance matters: The dataset is highly imbalanced (~1.1% fraud), which makes naïve models ineffective despite high accuracy.
- Undersampling is powerful: A simple 0.5% undersampling strategy achieved the highest F1 score (0.81) and best precision-recall trade-off (PR AUC = 0.91).
- PCA helps neural networks: Dimensionality reduction via PCA slightly improved recall and PR AUC while reducing training time.
- Tree-based models outperform deep learning in this setting:
- LightGBM achieved the best overall results: F1 = 0.75, ROC AUC = 0.89, PR AUC = 0.83
- XGBoost and Tuned Random Forest closely followed, outperforming deep learning models
- Hybrid strategies (e.g., SMOTE + Focal Loss Ensembles) improved recall but not overall F1 compared to undersampling or boosting
The final comparison confirmed that, for fraud detection tasks with heavy imbalance, carefully tuned tree-based models combined with light undersampling offer the strongest results in this study at a lower training cost than neural networks.
The original dataset (Base.csv) was cleaned and transformed using a custom utils_cleaning.py script. Key steps included:
- Variable categorization: Binary (2 levels), categorical (3–8 levels), and numeric (>8 values)
- Constant and low-variance feature removal
- Outlier handling:
  - Categorical binning for skewed variables (e.g., session length, address months)
  - IQR-based outlier removal only from the majority class (non-fraud) in deep learning data
- Categorical grouping for high-cardinality variables (e.g., payment type, housing, device OS)
- Feature transformation (log, Box-Cox) for deep learning models
- Three modeling datasets created:
  - `df_tree_final.csv`: raw + binned features, ideal for tree models
  - `df_dl_final_clean.csv`: scaled and transformed
  - `df_dl_pca.csv`: PCA-reduced (95% variance retained)
This pipeline ensured clean, interpretable, and consistent inputs across all modeling strategies.
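To make the majority-only IQR step concrete, here is a minimal pandas sketch. The function name and the `fraud_bool` target column are assumptions for illustration; this is not the actual `utils_cleaning.py` code.

```python
import numpy as np
import pandas as pd

def remove_majority_outliers(df, target="fraud_bool", k=1.5, cols=None):
    """Drop IQR outliers from the majority (non-fraud) class only,
    leaving every fraud row untouched."""
    cols = cols or df.select_dtypes(include=np.number).columns.drop(target)
    majority = df[df[target] == 0]
    minority = df[df[target] == 1]
    mask = pd.Series(True, index=majority.index)
    for col in cols:
        q1, q3 = majority[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # keep rows within [q1 - k*IQR, q3 + k*IQR] on every numeric column
        mask &= majority[col].between(q1 - k * iqr, q3 + k * iqr)
    return pd.concat([majority[mask], minority]).sort_index()
```

Restricting the filter to the majority class avoids discarding rare fraud examples that often look like outliers by construction.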
All deep learning experiments were conducted using a consistent architecture to isolate the effect of imbalance-handling strategies. The model was implemented in PyTorch and included:
- Input layer: Matching the number of features
- Hidden Layer 1: 128 neurons + BatchNorm + ReLU + Dropout (30%)
- Hidden Layer 2: 64 neurons + ReLU + Dropout (20%)
- Output layer: Sigmoid activation for binary classification
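A minimal PyTorch sketch of this fixed architecture, with layer sizes taken from the description above. Note that the final layer emits a raw logit so it can pair with `BCEWithLogitsLoss`; the sigmoid is applied at inference time.

```python
import torch
import torch.nn as nn

class FraudNet(nn.Module):
    """Sketch of the fixed architecture: 128 -> 64 hidden units
    with BatchNorm, ReLU, and Dropout as described in the text."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 1),  # raw logit; sigmoid folded into the loss
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = FraudNet(n_features=30)
logits = model(torch.randn(8, 30))   # batch of 8 samples
probs = torch.sigmoid(logits)        # fraud probabilities in [0, 1]
```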
The training pipeline included:
- Stratified train/test split
- StandardScaler for feature normalization
- Early stopping based on F1 score (with patience)
- Mini-batch training using `DataLoader` (batch size = 64)
- Evaluation metrics: F1, Precision, Recall, ROC AUC, PR AUC
To ensure computational efficiency and faster experimentation, we applied a base level of undersampling to all deep learning models — keeping 2% of the majority class — unless otherwise stated.
This setup served as the foundation for comparing:
- No balancing (baseline)
- SMOTE
- Focal Loss
- SMOTE + Focal Loss
- Ensembles
- Dimensionality reduction with PCA
Undersampling was the simplest yet most effective technique tested. It reduces the number of non-fraud samples, making the dataset more balanced without introducing synthetic data.
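The idea fits in a few lines of pandas. This is a hypothetical helper, not the project's actual code; `keep_frac=0.005` corresponds to the 0.5% level.

```python
import pandas as pd

def undersample_majority(df, target="fraud_bool", keep_frac=0.005, seed=42):
    """Keep every fraud row but only a random fraction of non-fraud rows."""
    fraud = df[df[target] == 1]
    nonfraud = df[df[target] == 0].sample(frac=keep_frac, random_state=seed)
    # shuffle so the classes are interleaved before splitting into batches
    return pd.concat([fraud, nonfraud]).sample(frac=1.0, random_state=seed)
```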
We compared the performance of the deep learning model at different undersampling levels:
| Undersampling Level | F1 Score | Precision | Recall | PR AUC |
|---|---|---|---|---|
| 0.5% | 0.81 | 0.87 | 0.75 | 0.91 |
| 2% (baseline) | 0.68 | 0.63 | 0.73 | 0.73 |
| 5% | 0.52 | 0.40 | 0.74 | 0.56 |
| 10% | 0.39 | 0.26 | 0.73 | 0.42 |
Takeaway:
Even a tiny subset of the majority class (0.5%) was enough to train an effective model. This configuration achieved the best F1 and PR AUC scores, outperforming all other balancing techniques.
All models were trained using the same neural network architecture, loss function (BCEWithLogitsLoss with class weighting), and early stopping configuration to ensure comparability.
We explored a range of oversampling-based techniques and ensemble variants to assess whether they could outperform simple undersampling in fraud detection.
These strategies aimed to either:
- Augment the minority class via SMOTE, or
- Adjust learning focus via Focal Loss, or
- Combine models trained on different balanced folds to improve robustness
Ensemble methods used partitioned SMOTE-based resampling across folds, with or without Focal Loss, and averaged predictions.
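The Focal Loss used by these strategies follows the standard binary form FL(p_t) = -α_t (1 − p_t)^γ log(p_t). A NumPy sketch follows; the α and γ defaults here are conventional values, not the tuned ones from the experiments.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss on predicted probabilities p and labels y in {0, 1}.
    With gamma = 0 and alpha = 0.5 it reduces to half the usual BCE."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # prob. assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```

The `(1 - p_t) ** gamma` factor shrinks the loss on confidently correct examples, steering gradient updates toward the hard, misclassified fraud cases.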
Despite this variety, the overall performance metrics (ROC AUC and PR AUC) remained largely stable across strategies. However, we observed notable shifts in Precision and Recall, reflecting the inherent trade-off in how each method handles false positives and false negatives.
| Strategy | F1 Score | Precision | Recall | ROC AUC | PR AUC |
|---|---|---|---|---|---|
| Baseline (2% undersampling) | 0.68 | 0.63 | 0.73 | 0.82 | 0.73 |
| SMOTE | 0.68 | 0.60 | 0.77 | 0.82 | 0.73 |
| Focal Loss (Optimized α, γ) | 0.65 | 0.70 | 0.60 | 0.82 | 0.73 |
| SMOTE + Focal Loss | 0.67 | 0.60 | 0.77 | 0.82 | 0.73 |
| Partitioned SMOTE Ensemble (BCE) | 0.68 | 0.63 | 0.73 | 0.82 | 0.73 |
| Partitioned SMOTE Ensemble (Focal) | 0.68 | 0.61 | 0.78 | 0.82 | 0.73 |
Takeaway:
- Focal Loss focuses on hard-to-classify fraud cases, improving precision but often at the cost of recall.
- SMOTE-based strategies increase recall by providing more fraud examples, but may hurt precision due to synthetic data overlap.
- Ensembles help stabilize results by reducing variance and capturing complementary signals across folds, often boosting recall.
While no oversampling-based method outperformed the best undersampling setting (0.5%), ensemble variants slightly improved recall while maintaining stable AUC metrics, making them viable when minimizing false negatives is a priority.
To evaluate whether dimensionality reduction could improve model efficiency and performance, we applied Principal Component Analysis (PCA) to the dataset used for deep learning.
- PCA retained 95% of the total variance, significantly reducing feature count.
- All preprocessing steps (scaling, encoding, outlier removal) were performed before PCA.
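A minimal scikit-learn sketch of this step; the random matrix is an illustrative stand-in for the real preprocessed deep learning input.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the preprocessed deep learning feature matrix.
X = np.random.default_rng(0).normal(size=(1000, 30))

X_scaled = StandardScaler().fit_transform(X)  # all preprocessing happens before PCA
pca = PCA(n_components=0.95)                  # keep components covering 95% variance
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape, round(pca.explained_variance_ratio_.sum(), 3))
```

Passing a float to `n_components` tells scikit-learn to select the smallest number of components whose cumulative explained variance reaches that fraction.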
We compared the deep learning model trained on the PCA-transformed dataset versus the original baseline (2% undersampling, no PCA).
| Strategy | F1 Score | Precision | Recall | ROC AUC | PR AUC |
|---|---|---|---|---|---|
| Baseline (2%) | 0.68 | 0.63 | 0.73 | 0.82 | 0.73 |
| Baseline + PCA (2%) | 0.68 | 0.63 | 0.75 | 0.83 | 0.74 |
Key Insight:
PCA slightly improved recall and PR AUC, with no loss in F1 or precision. It also sped up training and made the model more stable under repeated runs. Dimensionality reduction is a useful enhancement for neural networks in high-dimensional fraud datasets.
In addition to deep learning, we evaluated several tree-based algorithms, which are often more robust to unscaled features and can handle class imbalance internally through `class_weight` or `scale_pos_weight`.
All models were trained on a dataset with 2% undersampling, optimized for tree-based algorithms (no scaling, grouped categories).
- Random Forest (baseline): Default parameters, balanced class weighting
- Random Forest (tuned): Increased number of trees, regularized depth and split criteria
- XGBoost: Gradient boosting with tuned hyperparameters and `scale_pos_weight` for imbalance
- LightGBM: Leaf-wise boosting with histogram optimization and similar tuning
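How these imbalance parameters are typically set, sketched on toy data. The data, ratio, and model settings below are illustrative, not the tuned values from the experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy stand-in for the 2%-undersampled training set (class ratio here is
# far milder than reality; real ratios come from the actual data).
X = rng.normal(size=(600, 5))
y = (rng.random(600) < 0.2).astype(int)

# For XGBoost/LightGBM, the conventional setting is the negative/positive ratio:
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

# scikit-learn trees achieve a similar effect through class_weight:
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
clf.fit(X, y)
```

Both mechanisms upweight minority-class errors during training instead of resampling the data, which is why these models can tolerate milder undersampling.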
| Model | F1 Score | Precision | Recall | ROC AUC | PR AUC |
|---|---|---|---|---|---|
| Random Forest (baseline) | 0.70 | 0.79 | 0.62 | 0.88 | 0.80 |
| Random Forest (tuned) | 0.73 | 0.71 | 0.76 | 0.88 | 0.80 |
| XGBoost | 0.74 | 0.73 | 0.76 | 0.89 | 0.82 |
| LightGBM | 0.75 | 0.72 | 0.77 | 0.89 | 0.83 |
Takeaway:
Tree-based models clearly outperformed the neural networks trained at the same 2% undersampling level. LightGBM achieved the best results among these models, offering the highest F1, recall, and AUC scores, with faster training and simpler deployment. Random Forest also improved significantly with minimal tuning.
These models are well-suited for fraud detection pipelines due to their interpretability and robustness to feature noise.
The following table summarizes the best-performing models across all strategies:
| Model | F1 Score | Precision | Recall | ROC AUC | PR AUC |
|---|---|---|---|---|---|
| Undersample 0.5% (NN) | 0.81 | 0.87 | 0.75 | 0.82 | 0.91 |
| LightGBM (2%) | 0.75 | 0.72 | 0.77 | 0.89 | 0.83 |
| XGBoost (2%) | 0.74 | 0.73 | 0.76 | 0.89 | 0.82 |
| Random Forest (Tuned, 2%) | 0.73 | 0.71 | 0.76 | 0.88 | 0.80 |
| SMOTE + Focal Loss (NN) | 0.67 | 0.60 | 0.77 | 0.82 | 0.73 |
| Ensemble SMOTE + Focal (NN) | 0.68 | 0.61 | 0.78 | 0.82 | 0.73 |
| Baseline (2%, NN) | 0.68 | 0.63 | 0.73 | 0.82 | 0.73 |
| Baseline + PCA (2%, NN) | 0.68 | 0.63 | 0.75 | 0.83 | 0.74 |

Figure 1: ROC curves for all models sorted by ROC AUC. LightGBM and XGBoost show the best overall discrimination performance.

Figure 2: PR curves for all models sorted by PR AUC. The 0.5% undersampled neural network achieves the highest precision-recall balance.
Key Conclusions:
- LightGBM and XGBoost are the most consistent top performers across all metrics, with fast training and high AUC scores.
- Simple undersampled neural networks can surpass ensemble and oversampling approaches in PR AUC, highlighting their viability in low-resource settings.
- ROC curves tend to be similar across models due to low overall fraud prevalence, but PR curves reveal clearer performance differences.
Both curves reinforce that precision-recall metrics are more informative than ROC AUC in extreme imbalance scenarios like fraud detection.
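This gap can be demonstrated on synthetic scores generated at roughly the dataset's fraud rate (illustrative numbers only, scikit-learn metrics assumed):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
# ~1% positives, mimicking the fraud prevalence discussed above.
y = (rng.random(20_000) < 0.01).astype(int)
# Positive scores shifted up by one standard deviation: a decent ranker.
scores = y * rng.normal(1.0, 1.0, y.size) + (1 - y) * rng.normal(0.0, 1.0, y.size)

roc = roc_auc_score(y, scores)           # insensitive to class prevalence
pr = average_precision_score(y, scores)  # PR AUC: penalized by false positives
print(f"ROC AUC = {roc:.2f}, PR AUC = {pr:.2f}")
```

The same scores produce a respectable ROC AUC but a far lower PR AUC, because every false positive among the huge majority class drags precision down.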
Based on the experiments and results from this project, the following recommendations are proposed:
- 🔁 Tune undersampling ratio across all model types: The best results were observed with a 0.5% undersample ratio, but this may not be optimal for every model or strategy. Testing a range of undersampling values, both in neural networks and tree-based models, can help find the most effective balance between precision and recall. There is no "perfect" ratio; the optimal point depends on the model, data representation, and business priorities.
- Consider modifying the neural network architecture: While we used a fixed architecture to control for variability, tuning hidden layers, dropout rates, and learning rates could close the performance gap between neural networks and tree models.
- Undersampling remains a surprisingly powerful strategy: It reduced training time, simplified implementation, and consistently outperformed more complex balancing techniques like SMOTE or Focal Loss. In low-fraud environments, a small and carefully chosen subset of the majority class can carry sufficient signal for effective training.
- Tree-based models were the top performers overall: LightGBM and XGBoost delivered the highest ROC AUC values and consistently strong F1 and PR AUC scores. For practical deployments, these models offer interpretability, faster inference, and better resilience to feature noise.
- Always compare modeling families: No single model is universally best. In this project, neural networks were competitive under certain setups (e.g., 0.5% undersampling), while boosting methods dominated overall. Testing multiple algorithms ensures more robust conclusions.
- Use PR AUC over ROC AUC for evaluation in imbalanced datasets: ROC curves can be misleading when the positive class is rare. PR AUC better reflects the model's ability to correctly identify fraud without being diluted by the majority class.
- For real-world applications: Start with LightGBM + light undersampling as a strong baseline. Evaluate deep learning only if computational resources allow and the feature space benefits from representation learning (e.g., embeddings, sequences, raw text).
This project was developed using Python and follows best practices for reproducibility and modularity.
- Data processing: `pandas`, `numpy`
- Visualization: `matplotlib`, `seaborn`
- Modeling:
  - Tree-based: `scikit-learn`, `xgboost`, `lightgbm`
  - Deep Learning: `PyTorch`
- Sampling techniques: `imblearn` (SMOTE)
- Evaluation: `sklearn.metrics`, custom plotting functions
- Dimensionality Reduction: `PCA` from `sklearn.decomposition`
- Python version: 3.10
- Training hardware: Laptop with Intel i7 / 16GB RAM (no GPU)
- All models trained on CPU, with early stopping and batch control to optimize runtime
- `utils_cleaning.py`: Custom functions for outlier removal, categorization, scaling, and export
- `notebooks/`: Jupyter notebooks for data cleaning, EDA, model training and evaluation
- `images/`: Contains exported plots used in reporting (e.g., ROC and PR curves)
- Neural networks used `BCEWithLogitsLoss` with optional Focal Loss
- All deep learning models shared the same architecture to isolate the effect of resampling techniques
- Tree-based models used stratified undersampling and were tuned via grid search or early performance inspection
- PCA retained 95% of variance and was only applied to deep learning inputs
While this project provides useful insights into fraud detection under class imbalance, several limitations should be considered:
- Synthetic dataset: Although realistic, the data is generated via CTGAN and subject to artificial patterns or noise not present in real banking systems.
- Single dataset version: We focused only on the `Base.csv` file. Other variants in the BAF suite (e.g., with demographic bias or temporal drift) were not explored.
- Fixed neural network architecture: To ensure fair comparison, the network was not optimized beyond basic structure. A more tailored design could yield stronger results.
- No real-time or streaming evaluation: All models were trained and evaluated offline. In production, data arrives incrementally, which may affect performance.
- Undersampling randomness: Despite stratified sampling, undersampling introduces variance depending on which examples are kept. Results may shift slightly between runs.
This project provides a strong starting point, but future iterations could extend the analysis to other dataset variants, more complex pipelines, or real-time inference settings.
⚠️ Due to file size limitations on GitHub, the CSV datasets (`df_tree_final.csv`, `df_dl_final_clean.csv`, `df_dl_pca.csv`) are not included in this repository.
You can recreate them by running the data cleaning notebook (`DataCleaning.ipynb`) with the original dataset.