Skip to content

Commit 295a18b

Browse files
authored
Merge pull request #4 from bmvdgeijn/master
Merge from bryce branch following release 2
2 parents b0bd0ba + 36d2347 commit 295a18b

File tree

195 files changed

+8435
-2323
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

195 files changed

+8435
-2323
lines changed

.gitignore

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,15 @@
22
__pycache__/
33
*.py[cod]
44

5+
# emacs tmp files
6+
*~
7+
58
# C extensions
69
*.so
710

11+
# snakemake files
12+
.snakemake
13+
814
# Distribution / packaging
915
.Python
1016
env/

CHANGELOG.md

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
Version 0.2 - September 3, 2016
2+
-----------
3+
4+
Version 0.2 of WASP is a major update to the code,
5+
especially the mapping code. It fixes several bugs related
6+
to how paired-end reads are handled. For this reason it is
7+
strongly recommended that users switch to this version
8+
of the pipline.
9+
10+
Changes include:
11+
* re-wrote mapping scripts to make simpler and more modular
12+
* re-wrote mapping test scripts and added of many tests
13+
* fixed several mapping pipeline bugs related to paired-end reads
14+
* find_intersecting_snps.py window size no longer required (is now
15+
unlimited)
16+
* find_intersecting_snps.py can now take HDF5 files as input
17+
* find_intersecting_snps.py can now consider only haplotypes
18+
present in samples, rather than all possible allelic combinations
19+
of SNPs overlapping reads.
20+
* added get_as_counts.py script that outputs allele-specific read
21+
counts at all polymorphic SNPs.
22+
* snp2h5 now records sample info in output HDF5 files
23+
* improved speed of many CHT pipeline steps
24+
* improved stability of CHT dispersion parameter estimation
25+
* added Snakemake workflows for both mapping and CHT pipelines
26+
* added qqplot.R script to CHT workflow
27+
28+
29+
Version 0.1
30+
-----------
31+
Initial version of WASP

CHT/.gitignore

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,11 @@
11
*.py[cod]
22

3+
# snakemake files
4+
.snakemake
5+
6+
# emacs backups
7+
*~
8+
39
# C extensions
410
*.so
511

CHT/README.md

Lines changed: 16 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -93,11 +93,15 @@ the [1000 Genomes website](http://www.1000genomes.org/data#DataAccess).
9393

9494
## Workflow
9595

96-
An example workflow is provided in [example_workflow.sh](../example_workflow.sh)
97-
script. This workflow uses data in the [example_data directory](../example_data).
96+
We now provide a Snakemake workflow that can be used to run the entire
97+
CHT pipeline. For more information see the [Snakemake README](README.snakemake.md)
98+
99+
An example workflow in the form of a shell script is also provided in
100+
[example_workflow.sh](../example_workflow.sh) script. This workflow uses
101+
data in the [example_data directory](../example_data).
98102

99103
Some of the input files that we used for our paper can be downloaded from
100-
[here](http://eqtl.uchicago.edu/histone_mods/haplotype_read_counts/).
104+
[here](http://eqtl.uchicago.edu/histone_mods/).
101105

102106
The following steps can be used to generate input files and run the
103107
Combined Haplotype Test. The examples given below use the example
@@ -166,9 +170,16 @@ For example, if the goal is to identify histone-mark QTLs, the target
166170
regions should be ChIP-seq peaks, and the test SNPs should be SNPs
167171
that are near-to or within the ChIP-seq peaks.
168172

173+
*Note (added 4/25/2016):* the target regions for a single test regions should be
174+
non-overlapping. Overlapping target regions can cause some reads
175+
to be counted multiple times in a single test and inflate the test
176+
statistic. We plan to add a check for this to the extract haplotype
177+
read counts.
178+
169179
If the goal is to identify eQTLs, the target regions should be the
170180
exons of genes, and the test SNPs could be SNPs within a specified
171-
distance of the TSS.
181+
distance of the TSS. If a gene contains overlapping or duplicate
182+
exons these should be collapsed.
172183

173184
We provide a script, `get_target_regions.py`, that can generate a list
174185
of target regions and test SNPs for ChIP-seq peaks that match
@@ -371,5 +382,5 @@ The following example shows how the first 2 PCs can be used as covariates (repla
371382
## Contact
372383

373384
For questions about the combined haplotype test, please contact Graham McVicker
374-
(gpm@stanford.edu) or Bryce van de Geijn (bmvdgeijn@uchicago.edu).
385+
(gmcvicker@salk.edu) or Bryce van de Geijn (vandegeijn@hsph.harvard.edu).
375386

CHT/README.snakemake.md

Lines changed: 91 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,91 @@
1+
## Snakemake CHT pipeline
2+
3+
[Snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) is a
4+
workflow management system, designed to streamline the execution of
5+
software pipelines. We now provide a Snakemake rule file that can be
6+
used to run the entire Combined Haplotype Pipeline.
7+
8+
For a more complete description of Snakemake see the
9+
[Snakemake tutorial](http://snakemake.bitbucket.org/snakemake-tutorial.html).
10+
11+
## Installing Snakemake
12+
13+
Snakemake requires python3, however the CHT pipeline requires
14+
python2. For this reason, if you are using
15+
[Anaconda](https://www.continuum.io/downloads), it is recommended that
16+
you create a [python3
17+
environment](http://conda.pydata.org/docs/py2or3.html#create-a-python-3-5-environment). For example you can create a python3.5 Anaconda environment with the following shell command (this only needs to be done once):
18+
19+
conda create -n py35 python=3.5 anaconda
20+
21+
You can then activate the py35 environment, and install the latest version of
22+
Snakemake with the following commands:
23+
24+
source activate py35
25+
conda install snakemake
26+
27+
Then when you want to switch back to your default (e.g. python2) environment
28+
do the following:
29+
30+
source deactivate
31+
32+
33+
## Configuring the CHT pipeline
34+
35+
The rules for the Snakemake tasks are defined in the [Snakefile](Snakefile).
36+
37+
Configuration parameters for this Snakefile are read from the YAML file
38+
[snake_conf.yaml](snake_conf.yaml).
39+
40+
Before running Snakemake edit this file to specify the location
41+
of all of the input directories and files that will be used by the pipeline.
42+
This includes locations of the impute2 SNP files, input BAM files etc.
43+
44+
Importantly you must set `wasp_dir` to point to the location of WASP
45+
on your system, and set `py2` and `Rscript` to setup the environment
46+
for python and R (e.g. by modifying your PATH) and call the
47+
appropriate interpreter. This is necessary because Snakemake is run
48+
using python3, but most of the scripts require python2.
49+
50+
51+
## Running the CHT pipeline
52+
53+
Snakemake can be run as a single process or on a compute cluster with
54+
multiple jobs running simultaneuously. To run Snakemake on a single node
55+
you could do something like the following:
56+
57+
source activate py35
58+
cd $WASP_DIR/CHT
59+
snakemake
60+
61+
We provide a script [run_snakemake.sh](run_snakemake.sh) to run Snakemake
62+
on a SGE compute cluster. You must be in a python3 environment to run this
63+
script, and the script must be run from a job submission host.
64+
65+
source activate py35
66+
cd $WASP_DIR/CHT
67+
./run_snakemake.sh
68+
69+
It should be possible to make simple modifications to this script to
70+
run on queue management systems other than SGE (e.g. LSF or Slurm).
71+
72+
73+
You should Snakemake from within a [Screen](https://www.gnu.org/software/screen/) virtual terminal or using [nohup](https://en.wikipedia.org/wiki/Nohup) so
74+
that if you are disconnected from the cluster, Snakemake will continue to run.
75+
76+
At the conclusion of the pipeline, a QQPlot will be generated that summarizes
77+
the results of the CHT.
78+
79+
80+
## Debugging the CHT pipeline
81+
82+
By default Snakemake will write an output and error file for each job
83+
to your home directory. These files will be named like `snakejob.<rulename>.<job_num>.sh.{e|o}<sge_jobid>`. For example:
84+
85+
# contains error output for extract_haplotype_read_counts rule:
86+
snakejob.extract_haplotype_read_counts.13.sh.e4507125
87+
88+
If a rule fails, you should check the appropriate output file to see what
89+
error occurred. A major benefit of Snakemake is that if you re-run snakemake
90+
after a job fails it will pickup where it left off.
91+

0 commit comments

Comments
 (0)