| 1 | +## Snakemake CHT pipeline |
| 2 | + |
| 3 | +[Snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) is a |
| 4 | +workflow management system, designed to streamline the execution of |
| 5 | +software pipelines. We now provide a Snakemake rule file that can be |
| 6 | +used to run the entire Combined Haplotype Test (CHT) pipeline. |
| 7 | + |
| 8 | +For a more complete description of Snakemake see the |
| 9 | +[Snakemake tutorial](http://snakemake.bitbucket.org/snakemake-tutorial.html). |
| 10 | + |
| 11 | +## Installing Snakemake |
| 12 | + |
| 13 | +Snakemake requires python3, but the CHT pipeline requires |
| 14 | +python2. For this reason, if you are using |
| 15 | +[Anaconda](https://www.continuum.io/downloads), it is recommended that |
| 16 | +you create a [python3 |
| 17 | +environment](http://conda.pydata.org/docs/py2or3.html#create-a-python-3-5-environment). For example, you can create a python3.5 Anaconda environment with the following shell command (this only needs to be done once): |
| 18 | + |
| 19 | + conda create -n py35 python=3.5 anaconda |
| 20 | + |
| 21 | +You can then activate the py35 environment, and install the latest version of |
| 22 | +Snakemake with the following commands: |
| 23 | + |
| 24 | + source activate py35 |
| 25 | + conda install snakemake |
| 26 | + |
| 27 | +When you want to switch back to your default (e.g. python2) environment, |
| 28 | +run the following: |
| 29 | + |
| 30 | + source deactivate |
| 31 | + |
| 32 | + |
| 33 | +## Configuring the CHT pipeline |
| 34 | + |
| 35 | +The rules for the Snakemake tasks are defined in the [Snakefile](Snakefile). |
| 36 | + |
| 37 | +Configuration parameters for this Snakefile are read from the YAML file |
| 38 | +[snake_conf.yaml](snake_conf.yaml). |
| 39 | + |
| 40 | +Before running Snakemake, edit this file to specify the locations |
| 41 | +of all of the input directories and files that will be used by the pipeline. |
| 42 | +These include the impute2 SNP files, the input BAM files, etc. |
| 43 | + |
| 44 | +Importantly, you must set `wasp_dir` to point to the location of WASP |
| 45 | +on your system, and set `py2` and `Rscript` to commands that set up the |
| 46 | +environment for python and R (e.g. by modifying your PATH) and call the |
| 47 | +appropriate interpreter. This is necessary because Snakemake is run |
| 48 | +using python3, but most of the scripts require python2. |
| 49 | + |
| 50 | + |
| 51 | +## Running the CHT pipeline |
| 52 | + |
| 53 | +Snakemake can be run as a single process or on a compute cluster with |
| 54 | +multiple jobs running simultaneously. To run Snakemake on a single node |
| 55 | +you could do something like the following: |
| 56 | + |
| 57 | + source activate py35 |
| 58 | + cd $WASP_DIR/CHT |
| 59 | + snakemake |
| 60 | + |
| 61 | +We provide a script [run_snakemake.sh](run_snakemake.sh) to run Snakemake |
| 62 | +on an SGE compute cluster. You must be in a python3 environment to run this |
| 63 | +script, and the script must be run from a job submission host. |
| 64 | + |
| 65 | + source activate py35 |
| 66 | + cd $WASP_DIR/CHT |
| 67 | + ./run_snakemake.sh |
| 68 | + |
| 69 | +It should be possible to make simple modifications to this script to |
| 70 | +run on queue management systems other than SGE (e.g. LSF or Slurm). |
| 71 | + |
| 72 | + |
| 73 | +You should run Snakemake from within a [Screen](https://www.gnu.org/software/screen/) virtual terminal or using [nohup](https://en.wikipedia.org/wiki/Nohup) so |
| 74 | +that if you are disconnected from the cluster, Snakemake will continue to run. |
| 75 | + |
| 76 | +At the conclusion of the pipeline, a QQPlot will be generated that summarizes |
| 77 | +the results of the CHT. |
| 78 | + |
| 79 | + |
| 80 | +## Debugging the CHT pipeline |
| 81 | + |
| 82 | +By default Snakemake will write an output and error file for each job |
| 83 | +to your home directory. These files will be named like `snakejob.<rulename>.<job_num>.sh.{e|o}<sge_jobid>`. For example: |
| 84 | + |
| 85 | + # contains error output for extract_haplotype_read_counts rule: |
| 86 | + snakejob.extract_haplotype_read_counts.13.sh.e4507125 |
| 87 | + |
| 88 | +If a rule fails, you should check the appropriate output file to see what |
| 89 | +error occurred. A major benefit of Snakemake is that if you re-run Snakemake |
| 90 | +after a job fails, it will pick up where it left off. |
| 91 | + |
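
As a minimal sketch of the debugging step described above, the following shell function prints the end of the most recent error file for a given rule. The function name `show_rule_errors`, its arguments, and the 20-line tail length are hypothetical choices for illustration and are not part of WASP or Snakemake. It assumes only the `snakejob.<rulename>.<job_num>.sh.e<sge_jobid>` naming pattern described above, and that the files are in your home directory unless another directory is given.

```shell
# Hypothetical helper (not part of WASP or Snakemake): print the tail of
# the newest SGE error file for one rule.
# Usage: show_rule_errors <rulename> [log_dir]
# Assumes files are named snakejob.<rulename>.<job_num>.sh.e<sge_jobid>
# and live in log_dir (default: $HOME).
show_rule_errors() {
    rule="$1"
    log_dir="${2:-$HOME}"

    # ls -t sorts by modification time, newest first; keep only the first match.
    newest=$(ls -t "$log_dir"/snakejob."$rule".*.sh.e* 2>/dev/null | head -n 1)

    if [ -z "$newest" ]; then
        echo "no error files found for rule '$rule' in $log_dir" >&2
        return 1
    fi

    echo "== $newest =="
    tail -n 20 "$newest"
}
```

For example, `show_rule_errors extract_haplotype_read_counts` would look in your home directory for the newest error file written by that rule and print its last 20 lines.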