CURE-Spark is a high-performance implementation of the CURE (Clustering Using Representatives) algorithm, optimized for Big Data environments using Apache Spark and Scala. This project addresses the limitations of traditional K-means clustering by handling non-spherical clusters and outliers effectively at scale.
- Distributed Computing: Leverages Spark RDDs to process massive datasets that exceed single-machine memory.
- Advanced Clustering: Implements the hierarchical CURE algorithm, which uses multiple representative points per cluster.
- Data Analysis Pipeline: Includes Exploratory Data Analysis (EDA) and results visualization using Jupyter Notebooks (.ipynb).
- Outlier Handling: Robust mechanisms to identify and filter anomalies during the clustering process.
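The core of CURE is picking well-scattered representative points per cluster and shrinking them toward the cluster centroid to dampen the influence of outliers. A minimal single-machine sketch of that shrinking step in plain Scala (object and function names here are illustrative, not this project's actual API):

```scala
object CureSketch {
  type Point = Vector[Double]

  // Per-dimension mean of a non-empty set of points.
  def centroid(points: Seq[Point]): Point = {
    val dim = points.head.length
    Vector.tabulate(dim)(i => points.map(_(i)).sum / points.length)
  }

  // CURE moves each representative a fraction alpha of the way toward the
  // centroid; alpha between roughly 0.2 and 0.7 is the range suggested in
  // the original CURE paper.
  def shrink(reps: Seq[Point], center: Point, alpha: Double): Seq[Point] =
    reps.map(r => r.indices.toVector.map(i => r(i) + alpha * (center(i) - r(i))))
}
```

In the distributed version, this logic would run on sampled partitions before the final RDD-based assignment pass.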
- Language: Scala 2.11+
- Framework: Apache Spark Core / MLlib
- Tools: SBT (Simple Build Tool), Jupyter Notebooks
- Concepts: MapReduce, RDD Transformations, Hierarchical Clustering
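For a project with this stack, the SBT build definition might look roughly like the following (a hedged sketch; the exact Spark and Scala patch versions are assumptions for illustration, so check the repository's actual build file):

```scala
// build.sbt (illustrative; versions are assumptions, not taken from the repo)
name := "cure-spark"

scalaVersion := "2.11.12"

// "provided" scope because the Spark runtime supplies these jars on the cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.8" % "provided"
)
```

Marking Spark as `provided` keeps the assembled jar small and avoids version clashes with the cluster's own Spark distribution.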
This project demonstrates the ability to translate complex theoretical algorithms into production-ready distributed systems, showcasing an understanding of both data science principles and big data engineering.
See cure.pdf for a detailed theoretical breakdown and performance analysis.