gmixoulis/CURE-Spark-Clustering

CURE-Spark: Scalable Hierarchical Clustering

📊 Project Overview

CURE-Spark is a high-performance implementation of the CURE (Clustering Using Representatives) algorithm, built for Big Data environments with Apache Spark and Scala. It addresses two limitations of traditional K-means clustering at scale: poor handling of non-spherical clusters and sensitivity to outliers.

🔑 Key Features

  • Distributed Computing: Leverages Spark RDDs to process massive datasets that exceed single-machine memory.
  • Advanced Clustering: Implements the hierarchical CURE algorithm, which uses multiple representative points per cluster.
  • Data Analysis Pipeline: Includes Exploratory Data Analysis (EDA) and results visualization in Jupyter Notebooks (.ipynb).
  • Outlier Handling: Identifies and filters anomalies during the clustering process.
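
The "multiple representative points per cluster" idea can be sketched in plain Scala, without Spark. This is a minimal illustration and not the code in this repository: the object name `CureSketch`, its helper functions, and the parameters `c` (number of representatives) and `alpha` (shrink factor between 0 and 1) are assumptions made for the example. It picks up to `c` well-scattered points with a greedy farthest-point rule, pulls each one toward the centroid by `alpha`, and measures the distance between two clusters as the minimum distance between their representatives.

```scala
object CureSketch {
  // Hypothetical point type for this sketch: a dense vector of coordinates.
  type Point = Vector[Double]

  // Euclidean distance between two points of equal dimension.
  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Component-wise mean of a non-empty cluster.
  def centroid(points: Seq[Point]): Point = {
    val n = points.size.toDouble
    points.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
  }

  // Greedy farthest-point selection of up to c well-scattered points.
  // Starts with the point farthest from the centroid, then repeatedly adds
  // the point whose nearest already-chosen point is farthest away.
  def scattered(points: Seq[Point], c: Int): Seq[Point] = {
    val distinct = points.distinct
    val mean = centroid(distinct)
    val first = distinct.maxBy(p => dist(p, mean))
    val k = math.min(c, distinct.size)
    (1 until k).foldLeft(Vector(first)) { (chosen, _) =>
      val next = distinct
        .filterNot(p => chosen.contains(p))
        .maxBy(p => chosen.map(q => dist(p, q)).min)
      chosen :+ next
    }
  }

  // Move each representative toward the centroid by the fraction alpha.
  // alpha = 0 leaves the points in place; alpha = 1 collapses them onto the centroid.
  def shrink(reps: Seq[Point], mean: Point, alpha: Double): Seq[Point] =
    reps.map(r => r.zip(mean).map { case (x, m) => x + alpha * (m - x) })

  // Representatives of a cluster: scattered points shrunk toward the centroid.
  def representatives(points: Seq[Point], c: Int, alpha: Double): Seq[Point] =
    shrink(scattered(points, c), centroid(points), alpha)

  // Distance between two clusters: the closest pair of representatives.
  def clusterDist(repsA: Seq[Point], repsB: Seq[Point]): Double =
    (for (p <- repsA; q <- repsB) yield dist(p, q)).min
}
```

Shrinking is what dampens the effect of outliers: a stray point chosen as a representative is pulled back toward the bulk of the cluster before any merge decision uses it. Using several representatives instead of a single centroid lets the cluster distance follow elongated or irregular shapes.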

🛠️ Tech Stack & Skills

  • Language: Scala 2.11+
  • Framework: Apache Spark Core / MLlib
  • Tools: SBT (Simple Build Tool), Jupyter Notebooks
  • Concepts: MapReduce, RDD Transformations, Hierarchical Clustering

💡 Innovation

This project demonstrates how a complex theoretical algorithm can be translated into a production-ready distributed system, drawing on both Data Science principles and Big Data Engineering.

📄 Documentation

See cure.pdf for a detailed theoretical breakdown and performance analysis.
