Teblurrry/largeScaleTransactionDataAnalysis

largeScaleTransactionDataAnalysis

This project demonstrates the use of PySpark to analyze REC-SSEC Bank's large-scale transaction data, comprising over 1 million rows. The analysis covers data preprocessing, exploratory data analysis (EDA), and two machine learning algorithms: Logistic Regression for classifying the data and K-Means Clustering for segmenting it. It also examines trends in transaction values and counts across domains and locations. Performance optimization techniques such as caching and partitioning were applied to improve computational efficiency and reduce processing times. Results include domain- and location-based transaction trends, prioritized domain activity, and clusters of transaction patterns, which together illustrate PySpark's scalability and distributed computing capabilities on large datasets. Key insights were visualized using Seaborn and Matplotlib. Overall, the work aims to understand domain-level and location-level banking trends in a distributed computing environment.
