A parallelized K-Means clustering algorithm implemented for High Performance Computing coursework at Indiana University, comparing sequential and parallel designs for clustering large datasets.
This project was run on Indiana University's Big Red 200 supercomputer, with scaling experiments on up to 120 compute nodes.
- Course: High Performance Computing (HPC)
- Institution: Indiana University (Graduate-level)
- Purpose: Compare the performance of sequential and parallel implementations of K-Means clustering using C++, analyzing speedup and scalability.
- Infrastructure: IU Big Red 200 supercomputer (Cray Shasta), up to 120 nodes.
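The sequential baseline follows standard Lloyd-style K-Means: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A minimal 1-D sketch of one such iteration is below; the function name `kmeans_step` is illustrative and the repository's own code may structure this differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One Lloyd iteration (1-D points, for brevity): assign each point to its
// nearest centroid, then recompute each centroid as the mean of its members.
// Hypothetical helper; sequential-k-means.cpp may differ in detail.
void kmeans_step(const std::vector<double>& points,
                 std::vector<double>& centroids,
                 std::vector<int>& labels) {
    const std::size_t k = centroids.size();
    std::vector<double> sums(k, 0.0);
    std::vector<std::size_t> counts(k, 0);

    for (std::size_t i = 0; i < points.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        int best_c = 0;
        for (std::size_t c = 0; c < k; ++c) {
            double d = std::fabs(points[i] - centroids[c]);
            if (d < best) { best = d; best_c = static_cast<int>(c); }
        }
        labels[i] = best_c;
        sums[best_c] += points[i];
        counts[best_c] += 1;
    }
    // Update step: mean of each cluster's members (skip empty clusters).
    for (std::size_t c = 0; c < k; ++c)
        if (counts[c] > 0) centroids[c] = sums[c] / counts[c];
}
```

The assignment loop dominates the runtime (O(n·k) per iteration), which is why it is the natural target for parallelization in the parallel version.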
| Implementation Type | File | Parallel Approach |
|---|---|---|
| Sequential Baseline | `sequential-k-means.cpp` | Single-threaded C++ |
| Parallel + Multithreading | `parallel-k-means.cpp` | OpenMP / Pthreads (depending on setup) |
| Data Generation | `generateData.cpp` | Creates synthetic datasets (e.g., `10000.txt`, `100000.txt`) |
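In the OpenMP approach, the assignment phase parallelizes cleanly because each point's nearest-centroid search is independent. The sketch below shows the general pattern, assuming a hypothetical `assign_labels` helper; `parallel-k-means.cpp` may partition work differently (e.g., with Pthreads).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Parallel assignment phase: iterations are independent, so the outer loop
// can be split across threads with OpenMP. Without -fopenmp the pragma is
// ignored and the code runs serially, producing identical results.
void assign_labels(const std::vector<double>& points,
                   const std::vector<double>& centroids,
                   std::vector<int>& labels) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(points.size());
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        double best = std::numeric_limits<double>::max();
        int best_c = 0;
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            double d = std::fabs(points[i] - centroids[c]);
            if (d < best) { best = d; best_c = static_cast<int>(c); }
        }
        labels[i] = best_c;  // each thread writes a disjoint index: no race
    }
}
```

The centroid-update phase needs per-thread partial sums (or an OpenMP reduction) to avoid races, which is where the parallel implementation's design choices matter most.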
- Compile:

  ```bash
  g++ sequential-k-means.cpp -o seq_kmeans
  g++ parallel-k-means.cpp -fopenmp -o par_kmeans  # if using OpenMP
  ```
- Generate Data:

  ```bash
  g++ generateData.cpp -o gen_data
  ./gen_data 1000000  # generates a 1-million-point dataset
  ```
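A synthetic generator of this kind typically draws points around a handful of well-separated cluster centers so that the clustering result is easy to verify. The sketch below illustrates the idea with a hypothetical `make_points` function; the actual distribution and output format of `generateData.cpp` may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of a synthetic-dataset generator: n points drawn from k Gaussian
// clusters with evenly spaced means (spacing and noise level are assumptions,
// not taken from generateData.cpp).
std::vector<double> make_points(std::size_t n, std::size_t k, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> noise(0.0, 1.0);
    std::vector<double> pts;
    pts.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        double center = 10.0 * static_cast<double>(i % k);  // cluster mean
        pts.push_back(center + noise(rng));
    }
    return pts;
}
```

Writing one point per line to a file named after the point count (e.g., `100000.txt`) would reproduce the dataset naming shown in the table above.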
- Execute Locally:

  ```bash
  ./seq_kmeans data.txt
  ./par_kmeans data.txt
  ```
- Execute on Big Red 200 (SLURM example):

  ```bash
  sbatch -N 60 -n 120 run_kmeans.sh
  ```