This repository contains an implementation of the approximate k-Nearest Neighbors (ANN) algorithm using multiple parallelization techniques: Sequential
, OpenMP
, OpenCilk
, and Pthreads
The project implements the k-Nearest Neighbors (k-NN) algorithm, which finds the closest data points to a given query point. This project aims to improve the k-NN algorithm's performance using parallelization strategies to handle large datasets efficiently. The project uses the ANN (Approximate k-NN) approach for faster processing at the cost of a small loss in accuracy (recall).
Before running the script, ensure that you have the following installed:
- make: For building the project.
- OpenMP: Install the OpenMP library for multi-threaded execution.
- OpenCilk: Install the OpenCilk runtime for parallel execution.
- Pthreads: Ensure that the Pthreads library is available for thread-based parallelization.
- OpenBLAS: For fast linear algebra computations, used in distance calculation.
- Linux/Unix environment: For running the bash scripts and performance tools.
To compile and execute different implementations on specified datasets, follow these steps.
Script location:
The script used for running benchmarks and experiments is located at:
`./Results/run.bash`
- Note: Before running parallel implementations for a dataset, you must first execute the
Sequential
implementation (need the results for recall and speedup calculations)
bash run_script.sh <type> <dataset> <sampling_reduction> <candidate_reduction> [MIN_SIZE_factor] [THREADS_NUM]
-
<type>
: The implementation type (Sequential
,OpenMP
,OpenCilk
,Pthreads
). -
<dataset>
: The dataset to run the program on (mnist
,fashion-mnist
,sift
). -
<sampling_reduction>
: The sampling reduction factor (positive integer). -
<candidate_reduction>
: The candidate reduction factor (positive integer). -
[MIN_SIZE_factor]
(optional): To divide MIN_SIZE (default is1
). -
[THREADS_NUM]
(optional): The number of threads to use (default is1
).- Note: For
Pthreads
, the number of threads must be manually specified in the code by changing theMAX_THREADS
define in./src/knn_Pthreads.c
.
- Note: For
-
Run Sequential Implementation:
bash run.bash Sequential mnist
-
Run OpenCilk Implementation with 12 threads:
This command will run the
OpenCilk
implementation on themnist
dataset with a sampling reduction of25
, candidate reduction of200
, and a minimum size factor of1
. It will use12
threads.bash run.bash OpenCilk mnist 25 200 1 12
After the script has completed running, the results will be saved in the following directory structure:
-
Analytic Results:
- Location:
./<type>/<dataset>/Analytic/results_<dataset>_<sampling_reduction>_<candidate_reduction>_<minsize_factor>.txt
- Contains indices of k nearest neighbors for each query.
- Location:
-
Statistics
- Location:
./<type>/<dataset>/Statistics/statistics_<dataset>_<sampling_reduction>_<candidate_reduction>_<minsize_factor>.txt
- Summarizes performance metrics, including execution time, recall, and queries per second.
- Location:
The directory structure will be created if it does not already exist.