Description
What would you like to see added?
We have an article on workflow managers, but it is a stub. Let's flesh it out!
Technologies to consider:
~95% of use cases are Nextflow and Snakemake
- Nextflow (more popular, existing bioinformatics pipelines, easier to reason about)
- Snakemake (also used by some of our researchers, harder to reason about)
- Pegasus (HTC-focused, connects to National Cyberinfrastructure like OSG)
- Dask (a data science oriented python library)
- META (alternative to array jobs)
The following are less important for us:
- Swift (a language)
- Airflow (Apache, may not be compatible with Cheaha, needs a "dedicated server")
- Cromwell
Here is a big curated list: https://github.com/pditommaso/awesome-pipeline
Topics to discuss:
- Use cases generally
- Use cases for each technology (or technology type)
- Directed Acyclic Graphs (DAGs), what they are, briefly their mathematical properties in lay terms, and why you should care (every computational workflow can be modeled as a DAG!), with a couple of examples
- How to understand your workflow as a DAG, with examples
- Close the loop by explaining how to know which technology is needed for a given problem in a Q&A format
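To make the DAG idea concrete for the article, here is a toy sketch (hypothetical task names, not from any real pipeline) using Python's standard-library `graphlib`. A valid execution order exists exactly when the graph is acyclic, which is why every computational workflow can be modeled this way:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the tasks it depends on.
workflow = {
    "qc": [],                     # quality control on raw reads
    "trim": ["qc"],               # trim adapters after QC
    "align": ["trim"],            # align trimmed reads
    "call_variants": ["align"],   # call variants from alignments
    "multiqc_report": ["qc", "align", "call_variants"],  # summary report
}

# static_order() yields tasks so every dependency runs before its dependents;
# it raises CycleError if the graph is not a DAG.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Running it prints the tasks in dependency order, `qc` first and `multiqc_report` last.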
Some text to get things started:
Workflow managers like Nextflow leverage the existing scheduler (correct me if I'm wrong). Naturally, they have a server process that manages the workflow. Their job is to facilitate the creation and management of arbitrary task DAGs, and then to execute those tasks using the scheduler. Have a DAG that isn't shaped like a collapsing tree (many tasks feeding into a single task), and/or one with a dependency chain longer than 2 nodes? You need a workflow manager!
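One way to make that shape test concrete is to compute the longest dependency chain. This is only an illustrative sketch with made-up task names; `deps` maps each task to its prerequisites and is assumed acyclic:

```python
def longest_chain(deps):
    """Length, in nodes, of the longest dependency chain in an acyclic graph."""
    memo = {}

    def depth(task):
        # Depth of a task = 1 + the deepest of its prerequisites.
        if task not in memo:
            memo[task] = 1 + max((depth(d) for d in deps[task]), default=0)
        return memo[task]

    return max(depth(t) for t in deps)

# A collapsing tree: many independent tasks feeding one final task.
collapse = {"a": [], "b": [], "c": [], "merge": ["a", "b", "c"]}
# A deeper pipeline: a chain longer than 2 nodes.
pipeline = {"qc": [], "trim": ["qc"], "align": ["trim"], "report": ["align"]}

print(longest_chain(collapse))  # 2 -> array-job territory
print(longest_chain(pipeline))  # 4 -> reach for a workflow manager
```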
META is an alternative to built-in array jobs. The DAG can't be more complex than a collapsing tree. Its use case is many similar tasks, so many that they would crash the scheduler or exceed per-user job limits. Its interface is (probably?) simpler than most workflow managers'. I haven't used either, so I can't speak to that with certainty.
Info on META is given below.
Discovered this today. It is a "meta" job scheduler for use with Slurm. https://docs.alliancecan.ca/wiki/META:_A_package_for_job_farming
Its use case is many similar, serial jobs, like we saw recently when we hit the scheduler's max-jobs limit. It is meant to replace array jobs when there are too many tasks.
From what I can tell, it works by using an MPI "server" to host a serial-job queue that doles tasks out to workers on other nodes. The server and workers run inside their own long-running jobs. That way, many short tasks fit under the umbrella of fewer, longer Slurm jobs.
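A toy version of that manager/worker pattern, using Python's `multiprocessing` in place of MPI. This is only an illustration of the idea (a few long-lived workers chewing through a shared queue of short tasks), not how META is actually implemented:

```python
from multiprocessing import Pool

def short_task(task_id):
    # Stand-in for one short serial task (e.g. a ~10-minute analysis step).
    return task_id * task_id

def farm(n_tasks, n_workers):
    # A small, fixed pool of workers pulls many short tasks from one queue,
    # analogous to META packing many tasks into a few long Slurm jobs.
    with Pool(processes=n_workers) as pool:
        return pool.map(short_task, range(n_tasks))

if __name__ == "__main__":
    results = farm(12, 3)
    print(len(results))  # 12 results from only 3 workers
```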
Example: if you have 2000 tasks, each taking about 10 minutes, you could ask META to run 20 jobs, each working through 100 of the tasks (~17 hours of compute, so a ~20-hour wall-time request leaves some headroom).
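The arithmetic in that example, spelled out (numbers taken from the text; note that 100 tasks at 10 minutes each is ~16.7 hours of compute, so a ~20-hour wall-time request includes headroom):

```python
# Numbers from the example above.
n_tasks = 2000
minutes_per_task = 10
n_jobs = 20

tasks_per_job = n_tasks // n_jobs                      # 100 tasks per job
hours_per_job = tasks_per_job * minutes_per_task / 60  # ~16.7 hours of compute

print(tasks_per_job, round(hours_per_job, 1))
```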