This project demonstrates the use of PySpark to analyze REC-SSEC Bank's large-scale transaction data, which comprises over 1 million rows. The analysis covers data preprocessing, exploratory data analysis (EDA), and two machine learning algorithms: Logistic Regression for classifying the data and K-Means Clustering for segmenting it. It also examines trends in transaction values and counts across domains and locations. Performance optimization techniques such as caching and partitioning were applied to improve computational efficiency and reduce processing times. Results include domain-based and location-based transaction trends, prioritized domain activity, and clusters of transaction patterns. Key insights into the dataset were visualized using Seaborn and Matplotlib. Overall, the findings underscore PySpark's distributed computing capabilities for handling large-scale datasets efficiently, and the work aims to characterize domain-level and location-level banking trends in a distributed computing environment.
Teblurrry/largeScaleTransactionDataAnalysis