This method implements a dynamic, asynchronous clustering algorithm that uses the Message Passing Interface (MPI) to distribute clustering tasks across multiple HPC nodes. It produces the same clustering results as the transcriptomic_clustering package (except that nearby cluster IDs no longer imply similarity), but is optimized for distributed computing with MPI, significantly reducing runtime for large datasets. As of v2.0.0, the final merging step is included in the same run.
- Edit `submit_pipeline.sh`: update the script to specify your input paths, clustering and final-merging parameters, and the number of nodes and amount of memory to request on HPC.
- Make `submit_pipeline.sh` executable: `chmod +x submit_pipeline.sh`
- Run `./submit_pipeline.sh`. This submits both the clustering job and the final merging job.
- Find the final output:
  - The merged clustering results are saved in `clustering_results_after_merging.csv`, where the index contains cell names and the `cl` column contains cluster IDs.
  - Refer to `manager_output.log` and `final_merge.log` for details on how clusters were split in each clustering job, and the number of clusters before and after final merging.
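The output CSV can then be inspected with pandas. The sketch below uses an inline toy table in the same shape as `clustering_results_after_merging.csv`; the index-column name `cell` is an assumption, since only the `cl` column is documented above:

```python
import io
import pandas as pd

# Toy stand-in for clustering_results_after_merging.csv: the index holds
# cell names and the `cl` column holds cluster IDs. The header name "cell"
# is hypothetical; substitute the real file path when running the pipeline:
#   results = pd.read_csv("clustering_results_after_merging.csv", index_col=0)
csv_text = "cell,cl\nAAAC-1,3\nAAAG-1,3\nAACC-1,7\n"
results = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Count how many cells landed in each cluster.
cluster_sizes = results["cl"].value_counts()
```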
The transcriptomic clustering Python package that uses a scVI latent space can be found here: transcriptomic_clustering. It is the Python version of the R package scrattch.hicat, both of which perform clustering recursively (depth-first search). The recursive approach can take significant time for large datasets. For instance, clustering 1 million cells can take ~2 days.
This method replaces the depth-first search (DFS) recursive method with a dynamic, asynchronous manager-worker model:
- Manager Node: Oversees the clustering process, distributing jobs to worker nodes and managing the queue of tasks.
- Worker Nodes: Independently perform clustering tasks; as soon as one job finishes, a worker takes the next job from the queue and performs clustering again.
- Asynchronous Task Distribution: The system doesn’t wait for all clustering jobs at the same hierarchy level to finish before moving on. Instead, as soon as a worker node completes its current job, it directly moves to the next pending task in the queue.
- Step 1: The manager node performs the initial clustering task. Once that finishes, the manager node appends all subsequent clustering jobs for each cluster into the job queue. For each available worker node, the manager assigns a job from the queue.
- Step 2: Each worker node performs its assigned clustering task.
- Step 3: Once a worker finishes, it sends the result to the manager and immediately takes a pending job from the queue. The manager evaluates the clusters:
  - For clusters that cannot be clustered further (the job returned a single cluster), the results are added to the final results.
  - For clusters that can be clustered further (the job returned more than one cluster), the subclusters are added to the queue for further clustering.
- Step 4: The process continues until the job queue is empty and all nodes have terminated.
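The steps above can be sketched as a queue-driven loop. This is a single-process illustration of the logic only: the MPI send/receive traffic between manager and workers is collapsed into one loop, and `toy_cluster` is a hypothetical stand-in for a real clustering job:

```python
from collections import deque

def toy_cluster(cells):
    # Hypothetical stand-in for one clustering job: split a group of cells
    # in half if it is large enough, otherwise return it as a single cluster.
    if len(cells) <= 3:
        return [cells]                 # cannot be clustered further
    mid = len(cells) // 2
    return [cells[:mid], cells[mid:]]  # subclusters to re-queue

def run_pipeline(cells):
    """Queue-driven loop mirroring Steps 1-4 (single process, no MPI)."""
    queue = deque([cells])   # Step 1: the initial clustering job
    final = []
    while queue:             # Step 4: run until the job queue is empty
        job = queue.popleft()
        subclusters = toy_cluster(job)    # Steps 2-3: perform one job
        if len(subclusters) == 1:
            final.append(subclusters[0])  # terminal cluster -> final results
        else:
            queue.extend(subclusters)     # re-queue for further clustering
    return final

clusters = run_pipeline(list(range(10)))
```

In the real manager-worker setup, `queue.popleft()` corresponds to the manager handing a pending job to an idle worker, and the `if/else` corresponds to the manager evaluating each returned result.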
The table below compares the run time of the transcriptomic_clustering package and hicatMPI.
| Number of Cells | transcriptomic_clustering | hicatMPI (This Method) |
|---|---|---|
| 10,000 | 12 minutes | 9 minutes |
| 80,000 | 63 minutes | 18 minutes |
| 1,000,000 | 47 hours 41 minutes | 2 hours 15 minutes |
- Clustering of 3.3 million cells using hicatMPI was completed on 10 nodes (500 GB each) in 7 hours.
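The speedup factors implied by the table can be computed directly from the wall-clock times:

```python
# Wall-clock minutes from the table: (transcriptomic_clustering, hicatMPI).
timings = {
    10_000:    (12,           9),
    80_000:    (63,           18),
    1_000_000: (47 * 60 + 41, 2 * 60 + 15),
}

# Speedup = old run time / new run time, rounded to one decimal place.
speedups = {n: round(old / new, 1) for n, (old, new) in timings.items()}
```

The gap widens with dataset size because the depth-first recursion serializes work that the asynchronous queue keeps all workers busy on.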
The confusion matrices below show the clustering results from the transcriptomic_clustering package (y axis) and hicatMPI (x axis) for clustering 10k cells (left), 78k cells (middle), and 1 million cells (right). The 1-1 correspondence confirms that the same clustering results are obtained from both the original transcriptomic_clustering package and this MPI-based implementation, given the same data input and clustering parameters.
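This 1-1 check is easy to reproduce for any pair of label vectors: build a confusion matrix with `pandas.crosstab` and verify that each row and each column has exactly one nonzero entry. The labels below are toy values, not real results:

```python
import pandas as pd

# Toy cluster labels from two hypothetical runs over the same six cells;
# real inputs would be the `cl` columns of the two packages' result files.
tc_labels  = pd.Series([0, 0, 1, 1, 2, 2], name="transcriptomic_clustering")
mpi_labels = pd.Series([5, 5, 3, 3, 9, 9], name="hicatMPI")

confusion = pd.crosstab(tc_labels, mpi_labels)

# A 1-1 correspondence means exactly one nonzero entry per row and column,
# even though the actual cluster IDs differ between the two runs.
one_to_one = bool(((confusion > 0).sum(axis=0) == 1).all()
                  and ((confusion > 0).sum(axis=1) == 1).all())
```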