CURE-Spark is a high-performance implementation of the CURE (Clustering Using Representatives) algorithm, optimized for Big Data environments using Apache Spark and Scala. This project addresses the limitations of traditional K-means clustering by handling non-spherical clusters and outliers effectively at scale.
- Distributed Computing: Leverages Spark RDDs to process massive datasets that exceed single-machine memory.
- Advanced Clustering: Implements the hierarchical CURE algorithm, which uses multiple representative points per cluster.
- Data Analysis Pipeline: Includes Exploratory Data Analysis (EDA) and results visualization using Jupyter Notebooks (.ipynb).
- Outlier Handling: Robust mechanisms to identify and filter anomalies during the clustering process.
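The core of CURE is picking well-scattered representative points per cluster and shrinking them toward the cluster centroid to dampen the influence of outliers. A minimal single-machine sketch of that shrinking step in plain Scala (object and function names here are illustrative, not this project's actual API):

```scala
object CureSketch {
  type Point = Vector[Double]

  // Per-dimension mean of a non-empty set of points.
  def centroid(points: Seq[Point]): Point = {
    val dim = points.head.length
    Vector.tabulate(dim)(i => points.map(_(i)).sum / points.length)
  }

  // CURE moves each representative a fraction alpha of the way toward the
  // centroid; alpha between roughly 0.2 and 0.7 is the range suggested in
  // the original CURE paper.
  def shrink(reps: Seq[Point], center: Point, alpha: Double): Seq[Point] =
    reps.map(r => r.indices.toVector.map(i => r(i) + alpha * (center(i) - r(i))))
}
```

In the distributed version, this logic would run on sampled partitions before the final RDD-based assignment pass.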
- Language: Scala 2.11+
- Framework: Apache Spark Core / MLlib
- Tools: SBT (Simple Build Tool), Jupyter Notebooks
- Concepts: MapReduce, RDD Transformations, Hierarchical Clustering
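For a project with this stack, the SBT build definition might look roughly like the following (a hedged sketch; the exact Spark and Scala patch versions are assumptions for illustration, so check the repository's actual build file):

```scala
// build.sbt (illustrative; versions are assumptions, not taken from the repo)
name := "cure-spark"

scalaVersion := "2.11.12"

// "provided" scope because the Spark runtime supplies these jars on the cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.8" % "provided"
)
```

Marking Spark as `provided` keeps the assembled jar small and avoids version clashes with the cluster's own Spark distribution.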
This project demonstrates the ability to translate complex theoretical algorithms into production-ready distributed systems, showcasing an understanding of both data science principles and big data engineering.
See cure.pdf for a detailed theoretical breakdown and performance analysis.