git-avinashpawar/HPC-Parallel-K-means

HPC-Parallel-K-Means

A parallelized K-Means clustering algorithm implemented during my High Performance Computing coursework at Indiana University. The repository contains both a sequential and a parallel implementation so that their performance on large datasets can be compared.

The experiments were run on Indiana University’s Big Red 200 supercomputer and scaled up to 120 compute nodes.


About the Project

  • Course: High Performance Computing (HPC)
  • Institution: Indiana University (Graduate-level)
  • Purpose: Compare sequential and parallel C++ implementations of K-Means clustering in terms of speedup and scalability.
  • Infrastructure: IU Big Red 200 supercomputer (Cray Shasta), up to 120 nodes.
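
The speedup and scalability comparison named above is usually reported with two standard metrics: speedup S = T_seq / T_par and parallel efficiency E = S / p, where p is the number of workers. The following is a minimal sketch of that arithmetic, not code from this repository; the function names `speedup` and `efficiency` are hypothetical, and the timings are assumed to be wall-clock seconds measured separately.

```cpp
#include <cassert>

// Hypothetical helper: speedup S = T_seq / T_par.
// Both timings are assumed to be positive wall-clock measurements in seconds.
double speedup(double t_seq, double t_par) {
    assert(t_seq > 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

// Hypothetical helper: parallel efficiency E = S / p,
// where p is the number of workers (threads or processes) used.
double efficiency(double t_seq, double t_par, int workers) {
    assert(workers > 0);
    return speedup(t_seq, t_par) / workers;
}
```

For instance, a sequential run of 12 s and a parallel run of 3 s on 8 workers give a speedup of 4 and an efficiency of 0.5. An efficiency that drops as workers are added is the usual sign that the problem size is too small for the resources or that synchronization costs dominate.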

Implementations

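| Implementation Type       | File                   | Parallel Approach                                         |
|---------------------------|------------------------|-----------------------------------------------------------|
| Sequential Baseline       | sequential-k-means.cpp | Single-threaded C++                                       |
| Parallel + Multithreading | parallel-k-means.cpp   | OpenMP / Pthreads (depending on setup)                    |
| Data Generation           | generateData.cpp       | Creates synthetic datasets (e.g., 10000.txt, 100000.txt)  |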
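
To illustrate where the parallelism in K-Means typically comes from, here is a minimal sketch of the assignment step (labeling each point with its nearest centroid) parallelized with an OpenMP `parallel for`. It is not taken from parallel-k-means.cpp. The `Point` struct, the 2-D data layout, and the function names are assumptions made for illustration. Each loop iteration writes only its own `labels[i]`, so no locking is needed in this step.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical 2-D point type; the repository's actual data layout may differ.
struct Point {
    double x;
    double y;
};

// Squared Euclidean distance; the square root is unnecessary for comparisons.
double squared_distance(const Point& a, const Point& b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Assignment step: return, for every point, the index of its nearest centroid.
// Iterations are independent, so the outer loop can be split across threads.
std::vector<int> assign_clusters(const std::vector<Point>& points,
                                 const std::vector<Point>& centroids) {
    assert(!centroids.empty());
    std::vector<int> labels(points.size());
    const long n = static_cast<long>(points.size());

    // Without -fopenmp the pragma is ignored and the loop runs sequentially.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        double best = std::numeric_limits<double>::max();
        int best_k = 0;
        for (std::size_t k = 0; k < centroids.size(); ++k) {
            const double d = squared_distance(points[i], centroids[k]);
            if (d < best) {
                best = d;
                best_k = static_cast<int>(k);
            }
        }
        labels[i] = best_k;  // each thread writes a distinct element
    }
    return labels;
}
```

The centroid update step, which sums the points of each cluster, needs more care in a parallel version because several threads contribute to the same accumulator. A common approach is to keep per-thread partial sums and combine them afterwards.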

How to Run

  1. Compile:

    g++ sequential-k-means.cpp -o seq_kmeans
    g++ parallel-k-means.cpp -fopenmp -o par_kmeans  # if using OpenMP
    g++ parallel-k-means.cpp -pthread -o par_kmeans  # if using Pthreads
  2. Generate Data:

    g++ generateData.cpp -o gen_data
    ./gen_data 1000000  # generates a 1 million-point dataset
  3. Execute Locally:

    ./seq_kmeans data.txt  # data.txt: path to the dataset generated in step 2
    ./par_kmeans data.txt
  4. Execute on Big Red 200 (SLURM example):

    sbatch -N 60 -n 120 run_kmeans.sh
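
For readers who want to see what the data generation in step 2 might involve, the following is a minimal sketch of a synthetic dataset generator. It is not the contents of generateData.cpp. The 2-D `Point` type, the coordinate range, the fixed seed, and the one-point-per-line text format are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <random>
#include <vector>

// Hypothetical 2-D point type; the real dataset format is not documented here.
struct Point {
    double x;
    double y;
};

// Generate n uniformly distributed points with coordinates between 0 and upper.
// A fixed seed makes repeated runs on the same platform reproducible,
// which helps when comparing sequential and parallel timings on identical input.
std::vector<Point> generate_points(std::size_t n, unsigned seed,
                                   double upper = 1000.0) {
    assert(upper > 0.0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> coord(0.0, upper);

    std::vector<Point> points;
    points.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double x = coord(rng);  // drawn first, so the order is well defined
        const double y = coord(rng);
        points.push_back({x, y});
    }
    return points;
}

// Write one point per line as "x y" (an assumed format) to any output stream,
// for example a std::ofstream opened on the target dataset file.
void write_points(std::ostream& out, const std::vector<Point>& points) {
    for (const Point& p : points) {
        out << p.x << ' ' << p.y << '\n';
    }
}
```

Uniform random points have no real cluster structure, so they are mainly useful for timing runs. Sampling around a few chosen centers would be a better choice when the quality of the clustering itself is being checked.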
