This repository contains an AI/ML-based solution for detecting fraudulent transactions on a UPI-scale dataset. The goal is to develop a robust, scalable model that can identify anomalies, suspicious patterns, and fraudulent activity in real time or through post-transaction analysis.
The dataset consists of two files (link to dataset):
- Transaction Data: 2.3 GB (CSV format)
- MCC (Merchant Category Code) Data: 99 KB (CSV format)
- Total Transactions: 24,386,900
- Fraudulent Transactions: 388,431 (~1.6% of dataset)
- Merchants & Locations:
- 98,953 unique merchants
- 13,429 unique merchant cities
- Most Common Payment Method: Swipe Transaction (15.4M times)
- High-Risk MCC Categories: Certain MCCs show a higher correlation with fraudulent transactions.
- Most Common MCC Category: "Airlines" (128 occurrences).
- Fraud-Prone MCCs: Some business categories, such as cash disbursement and online services, show an increased likelihood of fraud.
- IRS Reportable Transactions: 83% of transactions fall under reportable categories.
| Metric | MCC Data | Transaction Data |
|---|---|---|
| Total Records | 981 | 24,386,900 |
| Key Column | `mcc` | `transaction_id` |
| Most Common Category | "Airlines" (128 times) | "Swipe Transaction" (15.4M) |
| Fraud-Prone MCC Categories | Identified and categorized | 1.6% fraudulent cases |
| Reportable Transactions | 83% | Significant fraud cases observed |
To efficiently handle large-scale data, we converted CSV files to Parquet format, which offers:
- Better Compression: Reduces storage size significantly.
- Faster Read/Write Operations: Speeds up data processing.
- Optimized Query Performance: Ideal for large datasets used in ML models.
```python
import pandas as pd

# Load the raw CSV and write it back as compressed Parquet
transaction_df = pd.read_csv('transactions.csv')
transaction_df.to_parquet('transactions.parquet', engine='pyarrow', compression='snappy')
```
- Label encoding for categorical variables
- Conversion of `Amount` to numeric format
- Timestamp feature extraction (hour, weekday, weekend indicators)
- Handling missing values in MCC and ZIP codes
- Feature engineering based on (see the sketch after this list):
  - Transaction frequency
  - Time of transaction
  - Merchant patterns
  - Declined transactions before a fraud event
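A minimal sketch of these preprocessing steps, assuming the Parquet file produced above. `Amount` and `mcc` appear in the data; `timestamp`, `zip`, and `merchant_name` are placeholder names for the actual schema:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_parquet('transactions.parquet')

# Label-encode categorical variables (column names are assumptions)
for col in ['merchant_name']:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Convert Amount strings such as "$57.30" to floats
df['Amount'] = pd.to_numeric(
    df['Amount'].astype(str).str.replace('$', '', regex=False), errors='coerce')

# Timestamp feature extraction: hour, weekday, weekend indicator
ts = pd.to_datetime(df['timestamp'], errors='coerce')
df['hour'] = ts.dt.hour
df['weekday'] = ts.dt.dayofweek
df['is_weekend'] = (ts.dt.dayofweek >= 5).astype(int)

# Fill missing MCC and ZIP codes with a sentinel value
df[['mcc', 'zip']] = df[['mcc', 'zip']].fillna(-1)
```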
- Time-based features: Hour, day of the week, weekend vs. weekday transactions.
- Amount patterns: Identifying frequent high-value transactions.
- Merchant-based patterns: Identifying merchants with repeated fraudulent activity.
- MCC Analysis: Grouping transactions based on high-risk categories (see the sketch below).
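Continuing from the preprocessed frame above, a sketch of how such pattern features might be derived. `card_id`, `is_fraud`, and the high-risk MCC set are illustrative assumptions, not confirmed schema:

```python
# Per-card transaction frequency (hypothetical 'card_id' column)
df['card_txn_count'] = df.groupby('card_id')['Amount'].transform('count')

# Flag unusually high-value transactions (99th-percentile threshold)
df['is_high_value'] = (df['Amount'] > df['Amount'].quantile(0.99)).astype(int)

# Merchant-level historical fraud rate; in practice compute this on the
# training split only, to avoid leaking labels into the features
df['merchant_fraud_rate'] = df.groupby('merchant_name')['is_fraud'].transform('mean')

# High-risk MCC indicator (the code set below is illustrative only,
# e.g. money transfer / cash disbursement categories)
HIGH_RISK_MCCS = {4829, 6010, 6011}
df['is_high_risk_mcc'] = df['mcc'].isin(HIGH_RISK_MCCS).astype(int)
```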
- Supervised Learning:
  - XGBoost
- Unsupervised Learning:
  - Isolation Forest
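A minimal training sketch for both models, using the engineered features above; the `is_fraud` label column and the exact feature list are assumptions:

```python
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

feature_cols = ['Amount', 'hour', 'weekday', 'is_weekend', 'mcc']  # from the steps above
X, y = df[feature_cols], df['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Supervised: XGBoost, weighting the rare fraud class
xgb = XGBClassifier(
    n_estimators=300,
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric='aucpr',
)
xgb.fit(X_train, y_train)

# Unsupervised: Isolation Forest, contamination near the observed 1.6% fraud rate
iso = IsolationForest(n_estimators=200, contamination=0.016, random_state=42)
iso.fit(X_train)
anomaly = iso.predict(X_test)  # -1 = anomaly, 1 = normal
```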
Fraud cases constitute only 1.6% of the dataset, making the data highly imbalanced. To address this:
- Resampling Techniques:
  - Oversampling: Using SMOTE to synthetically generate fraud samples.
  - Undersampling: Reducing non-fraud transactions to balance the dataset.
- Cost-Sensitive Learning:
  - Assigning higher misclassification penalties to fraud cases in XGBoost via `scale_pos_weight`.
- Anomaly Detection Approach:
  - Using Isolation Forest, which is robust to class imbalance.
- Evaluation Metrics Beyond Accuracy:
  - Precision/recall trade-off to reduce false negatives.
  - ROC-AUC and PR-AUC for better fraud detection performance.
- Data Ingestion: Raw transactions are collected and stored in a Parquet data warehouse.
- Feature Engineering & Processing: Relevant features are extracted using Pandas and NumPy.
- Model Training: Supervised learning (XGBoost) and unsupervised learning (Isolation Forest).
- Real-Time Detection with FastAPI: The API receives transaction details and runs predictions.
- Alert & Action System: Suspicious transactions are flagged and alerts are triggered.
For real-time fraud detection, we use FastAPI, a high-performance web framework.
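A minimal serving sketch, assuming the trained model was saved with `joblib` as `fraud_model.joblib` (a hypothetical path) and a simplified payload whose field names are illustrative:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Fraud Detection API")
model = joblib.load("fraud_model.joblib")  # hypothetical path to the trained model

class Transaction(BaseModel):
    amount: float
    hour: int
    weekday: int
    is_weekend: int
    mcc: int

@app.post("/predict")
def predict(txn: Transaction):
    # Feature order must match the columns used at training time
    features = [[txn.amount, txn.hour, txn.weekday, txn.is_weekend, txn.mcc]]
    fraud_probability = float(model.predict_proba(features)[0][1])
    return {
        "fraud_probability": fraud_probability,
        "flagged": fraud_probability > 0.5,  # decision threshold is tunable
    }
```

The service can be run locally with `uvicorn main:app --reload`; in production the 0.5 threshold would be tuned against the precision/recall trade-off discussed above.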

- Data Processing: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-Learn, XGBoost
- Anomaly Detection: Isolation Forest
- Deployment: FastAPI for real-time fraud detection API
- Big Data Processing: Spark (if scaling to large datasets)