https://hackmd.io/SqmpNz40TMO9aRIF3G_j2g?both
Debugging failing snakemake runs
When I run snakemake workflows "for realz", with many CPUs (a large -j) and/or across many machines, I frequently run into errors that are really hard to track down.
The first problem that I often encounter is that I can't figure out why the command failed. This is partly because of a frustrating UNIX-ism: Snakemake prints its error message after the command fails, so you need to look above Snakemake's message to see the command's own error output (this is something that is hard to change in UNIX). But the bigger problem is that when many commands run at the same time, their output gets mixed together, and it is difficult or impossible to figure out which output text and errors go with which command.
This connects with a second problem: running multiple commands at the same time can itself cause failures. The most common cause is memory usage - when one command requires a large amount of RAM, it may fail itself or cause other commands running at the same time to fail.
The easiest way I've found to debug all of this is to do the following:
- run as many snakemake jobs as possible with `-k`;
- once that is complete, all the remaining TODO jobs are failing. Now, run them one at a time, either manually (by specifying a particular output file) or by limiting the number of threads you give - e.g. `-j 1`.
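
The two steps above could be wrapped up roughly as follows. This is only a sketch: the function names, the `SNAKEMAKE` variable, the core count, and the target path `results/sample1.txt` are all made up for illustration; the only real Snakemake pieces are the `-k` and `-j` flags and passing an output file as the target.

```shell
#!/bin/sh
# Hypothetical wrapper variable so the command can be swapped out (e.g. for a
# full path); it is not something Snakemake itself defines.
SNAKEMAKE="${SNAKEMAKE:-snakemake}"

# Step 1: run everything that can succeed. With -k, Snakemake keeps going
# with independent jobs after a job fails, so only failing jobs (and whatever
# depends on them) are left over. $1 is the number of cores to use.
first_pass() {
    $SNAKEMAKE -k -j "$1"
}

# Step 2, option A: rerun what is left with a single core, so jobs run one at
# a time and their output is no longer interleaved.
serial_pass() {
    $SNAKEMAKE -j 1
}

# Step 2, option B: ask for one particular output file, so only that job
# (plus anything it still needs) runs. $1 is the output file path.
single_target() {
    $SNAKEMAKE -j 1 "$1"
}

# Example usage (placeholder values):
#   first_pass 16
#   serial_pass
#   single_target results/sample1.txt
```

Running one job at a time also helps rule out the memory problem described above: if a job passes alone but fails in a large `-j` run, contention with other jobs is a likely suspect.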