A real-time financial transaction processing system designed to detect fraudulent activities using big data technologies and advanced machine learning algorithms. The project implements the Lambda architecture, enabling both streaming and batch data processing.
Technologies used: Apache NiFi, Spark, Hadoop, HDFS, Hive, Cassandra, Superset, Trino, Kafka, Kafdrop, Zookeeper, MLflow, Docker
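For orientation, here is a minimal sketch of what the streaming leg of this Lambda setup can look like in PySpark Structured Streaming. The broker address, topic name, and schema are illustrative assumptions, not the project's actual configuration (that lives in the 'services' folder), and running it requires the spark-sql-kafka connector package.

```python
# Minimal sketch of a Lambda-style streaming leg (assumed names, not the
# project's actual configuration): read transactions from Kafka, parse the
# JSON payload, and write to a sink for downstream fraud scoring.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-streaming-sketch").getOrCreate()

# Hypothetical transaction schema; the real one is defined by the project.
schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("terminal_id", StringType()),
])

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "transactions")                  # assumed topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("tx"))
    .select("tx.*")
)

# Console sink for illustration; the project persists to Cassandra/Hive.
query = stream.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```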
- Make sure you have Docker, Python (preferably 3.12.7), and, optionally, JDK 17 installed (JDK 17 is needed only for local model training with Spark; see the note below)
- NOTE: This step is required only for local model training; pre-trained models are included in the repository under 'services/streaming_processing/models', so you can skip it. If you are running on Windows, install winutils for Hadoop (https://github.com/steveloughran/winutils/), place the binaries in C:/hadoop/bin/, add that folder to PATH, and set HADOOP_HOME=C:/hadoop (see the environment-setup sketch after this list)
- Start Docker Desktop
- Download the datasets listed below
- Install the dependencies listed in requirements.txt
- Run the notebooks in the 'eda' folder to preprocess the datasets
- Modify the config files if needed (see the relevant files in the 'services' folder); the default configuration should work out of the box
- Navigate to the 'scripts' folder
- Run prepare_datasets_local_pretraining.py and then train_local_model.py (only if you want to recreate the pre-trained models; see the local-training sketch after this list)
- Run start_containers.bat (alternatively, you can run only specific scenarios, such as the streaming processing flow, data ingestion into Hive, or batch processing; all scripts are available in the 'scripts' folder)
- After a few minutes, run post_start.bat in the 'scripts' folder; it creates the Hive tables (an illustrative table-creation sketch follows this list)
- The services should now be accessible (see docker-compose for ports and addresses, and authorization-access-data.json for credentials)
- To display the dashboards, open http://localhost:8088/superset and add a database connection with the connection string trino://admin:@presto:8080/cassandra/fraud_analytics, then run import_superset_dashboards.bat from the 'scripts' folder (a Trino connectivity sketch follows this list)
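For the winutils note and the local-training step above, here is a minimal sketch of the environment setup, assuming the C:/hadoop layout from the note; the repository's train_local_model.py is the authoritative version.

```python
# Sketch: environment setup for local Spark model training on Windows.
# Paths follow the C:/hadoop convention from the winutils note above; set
# these before creating a SparkSession (or set them system-wide instead).
import os

os.environ["HADOOP_HOME"] = "C:/hadoop"
os.environ["PATH"] = "C:/hadoop/bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession

# Local-mode session; the repository's train_local_model.py presumably
# builds something similar before fitting the fraud models.
spark = (
    SparkSession.builder
    .appName("local-model-training")
    .master("local[*]")
    .getOrCreate()
)
```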
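post_start.bat contains the authoritative DDL; purely as an illustration, creating a Hive table from Python could look like the following. The table name, columns, and HiveServer2 address are assumptions, not the project's actual schema.

```python
# Illustrative only: what creating a Hive table from Python can look like.
# The actual tables are created by post_start.bat; the names and columns
# here are assumptions, and the host/port depend on docker-compose.
from pyhive import hive  # pip install pyhive[hive]

conn = hive.Connection(host="localhost", port=10000)  # assumed HiveServer2 address
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS fraud_transactions (
        transaction_id STRING,
        amount DOUBLE,
        is_fraud INT
    )
    STORED AS PARQUET
""")
conn.close()
```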
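To sanity-check the Trino endpoint behind the Superset connection string, a sketch using the trino SQLAlchemy dialect; the presto hostname resolves only inside the Docker network, so substitute localhost and the published port when running from the host.

```python
# Sanity check for the Trino endpoint used by Superset. The URL mirrors the
# connection string above (empty password omitted); swap presto for
# localhost with the published port when running outside Docker.
from sqlalchemy import create_engine, text  # pip install trino[sqlalchemy]

engine = create_engine("trino://admin@presto:8080/cassandra/fraud_analytics")
with engine.connect() as conn:
    tables = conn.execute(text("SHOW TABLES")).fetchall()
    print(tables)
```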
Download from:
- https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data
- https://www.openml.org/search?type=data&status=active&id=45955&sort=runs

Place the unpacked files in the 'datasets' folder in the repository root.
Expected structure:
.
├── datasets
│   ├── Fraud.csv // dataset 1: 'Fraudulent Transactions Data' from Kaggle
│   ├── Credit_Card_Fraud_.arff // dataset 2: 'Credit_Card_Fraud_' from OpenML
│   ├── transactions_df.csv // dataset 3: 'Credit Card Transactions Synthetic Data Generation' from Kaggle
│   ├── customer_profiles_table.csv
│   ├── terminal_profiles_table.csv
│   └── creditcard.csv // dataset 4: 'Credit Card Fraud Detection' from Kaggle (optional)
├── ...
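A quick way to verify the layout before running the EDA notebooks (file names taken from the tree above; creditcard.csv is optional):

```python
# Check that the expected dataset files are in place. Names come from the
# tree above; creditcard.csv is optional and therefore not checked.
from pathlib import Path

required = [
    "Fraud.csv",
    "Credit_Card_Fraud_.arff",
    "transactions_df.csv",
    "customer_profiles_table.csv",
    "terminal_profiles_table.csv",
]
datasets = Path("datasets")
missing = [name for name in required if not (datasets / name).exists()]
print("missing:", missing or "none")
```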




