Description
What would you like to see added?
We have an article on workflow managers, but it is a stub. Let's flesh it out!
Technologies to consider:
~95% of use cases are Nextflow and Snakemake
- Nextflow (more popular, existing bioinformatics pipelines, easier to reason about)
- Snakemake (also used by some of our researchers, harder to reason about)
- Pegasus (HTC-focused, connects to National Cyberinfrastructure like OSG)
- Dask (a data science oriented python library)
- META (alternative to array jobs)
The following are less important for us:
- Swift (a language)
- Airflow (Apache, may not be compatible with Cheaha, needs a "dedicated server")
- Cromwell
Here is a big curated list: https://github.com/pditommaso/awesome-pipeline
Topics to discuss:
- Use cases generally
- Use cases for each technology (or technology type)
- Directed Acyclic Graphs (DAGs), what they are, briefly their mathematical properties in lay terms, and why you should care (every computational workflow can be modeled as a DAG!), with a couple of examples
- How to understand your workflow as a DAG, with examples
- Close the loop by explaining how to know which technology is needed for a given problem in a Q&A format
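To make the DAG idea concrete for the article, here is a toy sketch (hypothetical task names, not from any real pipeline) using Python's standard-library `graphlib`. A valid execution order exists exactly when the graph is acyclic, which is why every computational workflow can be modeled this way:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the tasks it depends on.
workflow = {
    "qc": [],                     # quality control on raw reads
    "trim": ["qc"],               # trim adapters after QC
    "align": ["trim"],            # align trimmed reads
    "call_variants": ["align"],   # call variants from alignments
    "multiqc_report": ["qc", "align", "call_variants"],  # summary report
}

# static_order() yields tasks so every dependency runs before its dependents;
# it raises CycleError if the graph is not a DAG.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Running it prints the tasks in dependency order, `qc` first and `multiqc_report` last.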
Some text to get things started:
Workflow managers like Nextflow leverage the existing scheduler (correct me if I'm wrong). Naturally, they have a server process that manages the workflow. Their job is to facilitate the creation and management of arbitrary task DAGs, and then to execute those tasks using the scheduler. Have a DAG that isn't shaped like a collapsing tree (many tasks feeding into a single task), and/or one with a dependency chain longer than 2 nodes? You need a workflow manager!
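One way to make that shape test concrete is to compute the longest dependency chain. This is only an illustrative sketch with made-up task names; `deps` maps each task to its prerequisites and is assumed acyclic:

```python
def longest_chain(deps):
    """Length, in nodes, of the longest dependency chain in an acyclic graph."""
    memo = {}

    def depth(task):
        # Depth of a task = 1 + the deepest of its prerequisites.
        if task not in memo:
            memo[task] = 1 + max((depth(d) for d in deps[task]), default=0)
        return memo[task]

    return max(depth(t) for t in deps)

# A collapsing tree: many independent tasks feeding one final task.
collapse = {"a": [], "b": [], "c": [], "merge": ["a", "b", "c"]}
# A deeper pipeline: a chain longer than 2 nodes.
pipeline = {"qc": [], "trim": ["qc"], "align": ["trim"], "report": ["align"]}

print(longest_chain(collapse))  # 2 -> array-job territory
print(longest_chain(pipeline))  # 4 -> reach for a workflow manager
```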
META is an alternative to built-in array jobs. The DAG can't be more complex than a collapsing tree. Its use case is many similar tasks, so many that they would crash the scheduler or exceed per-user job limits. Its interface is (probably?) simpler than most workflow managers'. I haven't used either, so I can't speak to that with certainty.
Info on META is given below.
Discovered this today. It is a "meta" job scheduler for use with Slurm. https://docs.alliancecan.ca/wiki/META:_A_package_for_job_farming
Its use case is many similar, serial jobs, like we saw recently when we hit the scheduler's max-jobs limit. It is meant to replace array jobs when there are too many tasks.
From what I can tell, it works by using an MPI "server" to host a serial-job queue that doles tasks out to workers on other nodes. The server and workers run inside their own long-running jobs. That way, many short tasks fit under the umbrella of fewer, longer Slurm jobs.
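A toy version of that manager/worker pattern, using Python's `multiprocessing` in place of MPI. This is only an illustration of the idea (a few long-lived workers chewing through a shared queue of short tasks), not how META is actually implemented:

```python
from multiprocessing import Pool

def short_task(task_id):
    # Stand-in for one short serial task (e.g. a ~10-minute analysis step).
    return task_id * task_id

def farm(n_tasks, n_workers):
    # A small, fixed pool of workers pulls many short tasks from one queue,
    # analogous to META packing many tasks into a few long Slurm jobs.
    with Pool(processes=n_workers) as pool:
        return pool.map(short_task, range(n_tasks))

if __name__ == "__main__":
    results = farm(12, 3)
    print(len(results))  # 12 results from only 3 workers
```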
Example: if you have 2000 tasks, each taking about 10 minutes, you could ask META to run 20 jobs, each working through 100 of the tasks (~17 hours of compute, so a ~20-hour wall-time request leaves some headroom).
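The arithmetic in that example, spelled out (numbers taken from the text; note that 100 tasks at 10 minutes each is ~16.7 hours of compute, so a ~20-hour wall-time request includes headroom):

```python
# Numbers from the example above.
n_tasks = 2000
minutes_per_task = 10
n_jobs = 20

tasks_per_job = n_tasks // n_jobs                      # 100 tasks per job
hours_per_job = tasks_per_job * minutes_per_task / 60  # ~16.7 hours of compute

print(tasks_per_job, round(hours_per_job, 1))
```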