Description
What would you like to see added?
Article to answer questions like:
- What problems does a cluster solve?
- How does a cluster compare to my local computer?
- What problems does cloud.rc solve?
- How does cloud.rc compare to my local computer?
- How do I parallelize my workflow?
- What kinds of tools are available for parallelizing my workflow?
Info dump from a ticket on this subject.
Hey there, I'm going to write up an overview of parallel processing here as a draft for our documentation. Since you're still in the learning phase, I think you might find this useful.
I'll start with a model for reasoning about computers. Most folks who are new to HPC come in with a model of a computer as a single black box: you put data and commands in, you get results out. This makes sense because local machines (workstations, laptops, desktops) work like that. I won't go into detail about the inner workings of the black box, but will say that one computer has distinct components that work together, like memory, storage, CPU, GPU, network interface, etc. Clusters, like Cheaha, are structured differently.
Cheaha can be thought of as (and is) a networked collection of individual computers. We call these computers "nodes", and each node could function as a fully independent computer if it were set up that way. A node is, literally, a box with the usual computer parts inside: memory, storage, CPU, GPU (sometimes), network interface, etc. How storage works on our cluster is a bit different (most storage locations are shared across all nodes), but that's a separate discussion.
So when you start a job on Cheaha and don't specify multiple nodes, you get a job on one node. This job will behave like that single computer model and you can use it, for the most part, like a local machine. Common and popular data processing software, libraries and packages are often set up to make effective use of multiple cores on a single machine when the algorithms allow for parallelization.
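One practical wrinkle: inside a Slurm job, the number of cores your code may use is set by your job request, not by the hardware in the node. Here's a minimal sketch of how to check this from Python (assuming a Linux cluster; `os.sched_getaffinity` is Linux-only and reports the cores your process is actually allowed to use):

```python
import os

# Total cores physically present in the node, regardless of the job request.
print("Cores in this node:", os.cpu_count())

# Cores this process is actually allowed to use (Linux only). Inside a
# Slurm job this typically matches your --ntasks / --cpus-per-task request.
print("Cores available to this job:", len(os.sched_getaffinity(0)))
```

Software that blindly uses `os.cpu_count()` workers can oversubscribe a shared node, so checking the affinity set is usually the safer choice.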
If you ask for two nodes, then you'll get a job with two computers. While these computers are capable of communicating with each other very rapidly, they do not communicate by default. This means two or more nodes can't simply be treated as a single, larger computer.
If you start up Python in your two-node job, it will only be running on one of the nodes. This means Python will only be able to see one node's worth of resources, and won't know the other node exists unless you program something to make them communicate.
There are technologies that allow multiple nodes to collaborate to complete complex tasks. At the software programming level these include (1) networking sockets and (2) MPI (Message Passing Interface). A lot of important software is built to use one or both of these technologies to allow very large scale jobs to use substantial portions of clusters effectively.
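To make this concrete, here is a minimal MPI "hello world" sketch in Python, assuming the `mpi4py` package is available (the same pattern exists in C and Fortran). Each copy of the program, called a rank, runs as a separate process, possibly on a different node, and the processes coordinate through the MPI library rather than by sharing memory:

```python
# Minimal MPI example; launch with e.g. `srun python hello_mpi.py` or
# `mpirun -n 4 python hello_mpi.py`. Assumes mpi4py is installed.
from mpi4py import MPI

comm = MPI.COMM_WORLD            # communicator containing every process in the job
rank = comm.Get_rank()           # this process's ID, 0 .. size-1
size = comm.Get_size()           # total number of processes, across all nodes
node = MPI.Get_processor_name()  # hostname of the node this rank landed on

print(f"Rank {rank} of {size} running on node {node}")
```

Run across two nodes, the printed hostnames differ, which is exactly the "multiple cooperating computers" picture described above.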
Cloud.rc is a bit different. With cloud.rc you can provision VMs that behave like individual computers. If you really wanted to, and had the time and expertise, you could set up a virtual cluster on a virtual network using multiple VMs (please don't do this; use Cheaha instead). One use case of cloud.rc is to test a workflow that uses a single node by trying it out on a single VM.
There are four main use cases for parallel programming I've come across. There's some overlap among them, but these divisions tend to be useful in deciding what tools to use.
The first use case is preventing simultaneous tasks from blocking each other. An example is when you don't want backend processing of an application to block someone from using the graphical frontend. Imagine if your internet browser became unresponsive every time a page was being rendered! This is solved using asynchronous threads within a single CPU core, or among multiple cores on a single computer. Researchers rarely need to know about this. You are almost certainly not in this space.
The second is performing many similar, independent tasks in the shortest time. This is referred to as "embarrassingly parallel" or "pleasingly parallel" and is very common in scientific workflows. An example is when you have a thousand CSV files with the same structure and need to summarize each of them. There are many tools for solving this: Slurm has `--array` jobs, Python has `multiprocessing.Pool`, R has the `foreach` and `doParallel` packages (which are built on the `parallel` package), and MATLAB has the Parallel Computing Toolbox and its `parfor` loop. This sort of workflow is often run on a single node, but can be made to run on multiple nodes with a fair amount of extra work or specialized tools. You may be in this space.
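As a sketch of this pattern in Python using the built-in `multiprocessing.Pool` (the `data/*.csv` path and the use of pandas are assumptions for illustration):

```python
import glob
from multiprocessing import Pool

import pandas as pd  # assumes pandas is installed


def summarize(path):
    """Summarize one CSV file; each file is independent of every other."""
    df = pd.read_csv(path)
    return path, df.describe()


if __name__ == "__main__":
    paths = glob.glob("data/*.csv")  # hypothetical location of the input files
    # Worker processes run on cores of the *current* node only. Match the
    # pool size to your Slurm request, e.g. --cpus-per-task=8.
    with Pool(processes=8) as pool:
        for path, summary in pool.imap_unordered(summarize, paths):
            print(path)
            print(summary)
```

A Slurm `--array` job achieves the same effect at the scheduler level, launching one independent job per file instead of one worker process per core.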
The third is performing many similar, mutually interdependent subtasks of a larger task in the shortest time. This is found often in physical simulations and uncommonly elsewhere; think structural and thermal simulations. The tools for these are often highly specialized, commercial, and take a lot of specialized domain knowledge to understand. This sort of workflow is typically run across more than one node and requires some understanding of MPI. You are probably not in this space.
The fourth is executing an arbitrary workflow with many different, interdependent tasks in the shortest time. The mathematical concept for such a workflow is a Directed Acyclic Graph (DAG). This is most commonly found in the life sciences, especially in the -omics spaces. This can be done at multiple levels of computation: we can set up a workflow to be processed efficiently on a single node using a technology like the Dask package in Python, or we can set up a workflow across multiple nodes with a workflow manager like Nextflow or Snakemake. There is even a tool I learned about recently called Pegasus that lets scientists set up workflows across multiple clusters. You may be in this space, if not now then possibly in the future.
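For a feel of the DAG idea, here is a minimal sketch using Dask's `delayed` interface (assuming the `dask` package is installed; the three functions are hypothetical stand-ins for real processing steps). Dask records which tasks depend on which, then runs independent branches in parallel:

```python
from dask import delayed


@delayed
def load(name):
    return f"data from {name}"  # stand-in for reading a real input file


@delayed
def clean(data):
    return data.upper()  # stand-in for a cleaning step


@delayed
def combine(a, b):
    return a + " | " + b  # stand-in for a merge/summary step


# Build the DAG: the two load->clean branches are independent, so Dask
# is free to run them in parallel; combine() waits on both.
result = combine(clean(load("sample_a")), clean(load("sample_b")))

# Nothing has run yet; compute() executes the graph.
print(result.compute())
```

Nextflow and Snakemake express the same dependency idea, but at the level of whole programs and files across many nodes rather than Python functions within one process.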
I hope you found this useful in guiding you toward a solution that fits your needs. If you'd like, feel free to come to our office hours and we can help identify tools that will work well for you: https://docs.rc.uab.edu/#contact-us