Skip to content

Commit aac8334

Browse files
author
Achim Klittich
committed
Initial commit
0 parents  commit aac8334

File tree

4 files changed

+506
-0
lines changed

4 files changed

+506
-0
lines changed

README.md

Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# WindowTrees README
2+
## Intro
3+
WindowTrees is a Python RAxML wrapper. WindowTrees allows the generation and evaluation of a large number of window trees (trees built from only a certain window of the sequence) in short time. As input it uses consensus sequences in fasta format. These consensus sequences must have been created by mapping reads of different individuals or species to the same reference. All scaffolds or contigs that are included need to have the same name in every input file.
4+
5+
## What does WindowTrees do?
6+
WindowTrees uses the consensus fasta files as input and creates windows of a previously specified size (-w). It is irrelevant whether short scaffolds or complete genomes are provided as input.
7+
8+
Afterwards the base content is determined. It is possible to specify a N-threshold to exclude windows with high numbers of missing data ("N"). Furthermore, WindowTrees determines the number of variable positions in the used windows.
9+
10+
The generated and checked window will then be used as input for RAxML to calculate the respective tree. It is necessary to provide an outgroup. In order to increase the speed, the tree generation step is parallelized.
11+
12+
In the final output WindowTrees provides folders that contain trees with the same topology. Additionally, it also counts the number of trees found for each topology.
13+
14+
WindowTrees provides several optional arguments to be adaptable to the specific purpose of different analyses. Beside the already mentioned N-threshold those are the possibility to specify gaps or overlaps between the windows (-lw) and a binary mode in RAxML (BINGAMMA) to only use transversions (--binary). For the latter the fasta files must be non-binary. WindowTrees will do the transformation. This can be especially useful when working with ancient DNA data.
15+
16+
The outout trees can be used for various analyses like ABBA BABA statistics/D stats other gene flow analyses. Examples for its usage are Hempel et al. (20xx), where the tool in its current form was first applied or Barlow et al. (2018). It also gives a good estimate on the possible topologies found in the sequences.
17+
18+
---
19+
20+
## Dependencies
21+
### Python 3.7:
22+
- ETE3 (Tree evaluation) [[ETEToolkit website]](http://etetoolkit.org/)
23+
- pyfaidx (Sequence slicing) [[pyfaidx GitHub]](https://github.com/mdshw5/pyfaidx)
24+
25+
### Other Software
26+
- RAxML [[RAxML website)]](https://cme.h-its.org/exelixis/web/software/raxml/)
27+
28+
---
29+
30+
## Installation
31+
A simple and fast method to provide the required dependencies is to use Anaconda [[Anaconda install instructions]](https://docs.anaconda.com/anaconda/install/index.html) or miniconda [[Miniconda install instructions]](https://docs.conda.io/en/latest/miniconda.html#).
32+
33+
After the respective conda version has been installed, it is possible to install all required dependencies with the following command in the CLI:
34+
```sh
35+
# create conda environment called windowtrees and install dependencies from bioconda and conda-forge channel
36+
37+
conda create -n windowtrees -c bioconda -c conda-forge ete3 pyfaidx raxml python=3.7
38+
39+
# activate environment
40+
source activate windwotrees
41+
42+
```
43+
44+
---
45+
46+
## Analysis Workflow
47+
1. Provide several consensus sequences in fasta format that have been mapped to the same reference. The names of the scaffolds in the different consensus files must have the same name.
48+
2. Create a colon seperated list of short_name:/path/to/fasta. Each row of the file can contain information for one sample with the according path. (example input_test.tab)
49+
3. Start WindowTrees with appropriate parameters for your anaylsis.
50+
51+
Example WindowTrees call:
52+
```sh
53+
python3 -u windowtrees.py -o testrun --outgroup outgroup-name -w 200000 inputfile_test.tab
54+
```
55+
This will start the window tree analysis using WindowTrees. We define the window size with -w to be 200kB, the outgroup is defined by using the short_name from the input file. The generated output and the temporary files will be written into the folder testrun (-o)
56+
57+
4. Inspect and further process the generated and topology sorted trees.
58+
59+
## Usage
60+
Calling windowtrees:
61+
```sh
62+
# example call
63+
python3 -u windowtrees.py -o testrun --outgroup outgroup-name -w 200000 inputfile_test.tab
64+
```
65+
66+
```sh
67+
# help
68+
python3 windowtrees.py -h
69+
```
70+
71+
```
72+
usage: python3 windowtrees.py [-h] -o OUTDIR --outgroup OUTGROUP -w WINDOWSIZE [-lw GAPSIZE] [--binary] [-N NTHRESHOLD] [--cpu NCPUS] inputfile
73+
74+
windowtrees.py: error: the following arguments are required: inputfile, -o, --outgroup, -w
75+
76+
WindowTrees: Calculate window trees for whole genomes to determine gene flow
77+
78+
required arguments:
79+
inputfile Input file with short name and full path per line, colon separated
80+
-o OUTDIR Output directory
81+
--outgroup OUTGROUP Specify outgroup with short name as given in input file
82+
-w WINDOWSIZE Specify window size
83+
84+
optional arguments:
85+
-h, --help show this help message and exit
86+
-lw GAPSIZE Positive (gap) or negative (overlap) integer [0]
87+
--binary Activate binary mode with BINGAMMA model [GTRGAMMA] to only use transversions, ! input for binary mode must be non-binary fasta file !
88+
-N NTHRESHOLD Ratio of allowed Ns per sequence in each window [0.1]
89+
--cpu NCPUS Number of CPUs [2]
90+
91+
92+
```
93+
94+
## Output
95+
The program will generate the following output:
96+
- as many folders as trees were found by the program numbered consecutively (e.g. tree_1, tree_2...). These folders contain the trees that were found in newick format. The name of each tree contains the following: the number of that tree topology, a consecutive number of this tree, RaxML_bestTree, the scaffold number.fa, the number of variable sites and the position of the window (example: tree_1_1_RAxML_bestTree.Scaffold_5.fa_VS-137_37400001-37420000.nw".
97+
- a folder called "trees" which contains the raxml outfile for each of the found trees for each window. The name consists of the scaffold name, number of variable sites and window position (example: Scaffold_1.fa_VS-70_60300001-60320000.fa_outfile.txt)
98+
- a file called "MSA_windowstats.out"
99+
This file gives the path to the output directory, the applied window size and the allowed N threshold at the top. Next it provides the statistics for each winddow giving the scaffold name, the window position, the number of Ns, if the N threshold was passed (failed/passed), amount of A, T, G and C, if the N threshold was passed (True/False) and the number of variable sites in case the N threshold was passed.
100+
- a file called "foundTrees.txt" gives a summary of the topologies found and the amount of each one.
101+
102+
Here is an example output of MSA_windowstats.out
103+
```
104+
## Output directory: /path/to/output/dir
105+
## Windowsize: 20000 max(N): 50.0%
106+
## Sequence;CountN;Filter;CountA;CountT;CountG;CountC;
107+
#scaffold_1.fa_1-20000 PassNtreshold: False VarSites: NA
108+
Indiv1;20000;Failed;0;0;0;0;
109+
Indiv2;20000;Failed;0;0;0;0;
110+
Indiv3;20000;Failed;0;0;0;0;
111+
Indiv4;20000;Failed;0;0;0;0;
112+
.
113+
.
114+
.
115+
#scaffold_25.fa_400001-420000 PassNtreshold: True VarSites: 167
116+
Indiv1;4795;Passed;4538;4089;3238;3340;
117+
Indiv2;1444;Passed;5410;5050;3974;4122;
118+
Indiv3;1722;Passed;5295;4987;3918;4078;
119+
Indiv4;21;Passed;5773;5485;4317;4404;
120+
121+
```
122+
Here is an example of foundTress.txt
123+
```
124+
tree_1 (((Indiv1,Indiv2),Indiv3),Indiv4); 4000
125+
tree_2 (((Indiv1,Indiv3),Indiv2),Indiv4); 7000
126+
tree_3 ((Indiv1,(Indiv3,Indiv2)),Indiv4); 2000
127+
```
128+
129+
## How to cite
130+
When you use this tool please cite this github site and Hempel, E., Bibi, F., Faith, J.T., Koepfli, K.-P., Klittich, A.M., Duchêne, D.A., Brink, J.S., Kalthoff, D.C., Dalén, L., Hofreiter, M. & Westbury, M.V. 2022 When blue turns to grey - Paleogenomic insights into the evolutionary history and extinction of the blue antelope (Hippotragus leucophaeus). bioRxiv 2022.04.12.487785; doi: [https://doi.org/10.1101/2022.04.12.487785] (https://doi.org/10.1101/2022.04.12.487785).
131+
132+
## References
133+
- Barlow, A. Cahill, J.A., Hartmann, S., Theunert, Chr., Xenikoudakis, G., Fortes, G.G., Paijmans, J.L.A., Rabeder, G., Frischauf, Chr., Grandal-d'Anglade, A., Garcia-Vàzquez, A., Murtskhvaladze, M., Saarma, U., Anijalg, P., Skrbinšek, T., Bertorelle, G., Gasparian, B., Bar-Oz, G., Pinhasi, R., Slatkin, M., Dalén, L., Shapiro, B. & Hofreiter, M. 2018, Partial genomic survival of cave bears in living brown bears. Nature Ecology & Evolution 2: 1563-1570. doi: [https://doi.org/10.1038/s41559-018-0654-8] (https://doi.org/10.1038/s41559-018-0654-8).
134+
- Stamatakis, A. 214. RAxML Version 8: A tooblue turns to grey - Paleogenomic insights
135+
into the evolutionary history and extinction of the
136+
blue antelope (Hippotragus leucophaeus)l for Phylogenetic Analysis and Post-Analysis of Large Phylogenies. Bioinformatics 30: 1312-1313. doi: [https://doi.org/10.1093/bioinformatics/btu033](https://doi.org/10.1093/bioinformatics/btu033).
137+
- Huerta-Cepas, J., Serra, F. & Bork, P. 2016. ETE 3: Reconstruction, analysis and visualization of phylogenomic data. Molecular Biology and Evolution 33: 1635-1638. doi: [https://doi.org/10.1093/molbev/msw046](https://academic.oup.com/mbe/article/33/6/1635/2579822).
138+
- Hempel, E., Bibi, F., Faith, J.T., Koepfli, K.-P., Klittich, A.M., Duchêne, D.A., Brink, J.S., Kalthoff, D.C., Dalén, L., Hofreiter, M. & Westbury, M.V. 2022 When blue turns to grey - Paleogenomic insights into the evolutionary history and extinction of the blue antelope (Hippotragus leucophaeus). bioRxiv 2022.04.12.487785; doi: [https://doi.org/10.1101/2022.04.12.487785] (https://doi.org/10.1101/2022.04.12.487785).
139+
- Shirley, M.D., Ma, Z., Pedersen, B.S. & Wheelan, S.J.. 2015. Efficient "pythonic" access to FASTA files using pyfaidx. PeerJ PrePrints 3: e970v1. doi: [https://doi.org/10.7287/peerj.preprints.970v1] (https://doi.org/10.7287/peerj.preprints.970v1).

inputfile_test.tab

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
Individual1:/path/to/fastafile/individual1.fa
2+
Individual2:/path/to/fastafile/individual2.fa
3+
Individual3:/path/to/fastafile/individual3.fa
4+
Individual4:/path/to/fastafile/individual4.fa
5+
Individual5:/path/to/fastafile/individual5.fa
6+
Individual6:/path/to/fastafile/individual6.fa
7+
Individual7:/path/to/fastafile/individual7.fa
8+
Individual8:/path/to/fastafile/individual8.fa
9+
Outgroup:/path/to/fastafile/outgroup.fa

run_windowtrees_SLURM.sh

Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
#!/bin/bash
2+
### Created by Achim Klittich
3+
################################################################################
4+
### SLURM CONFIG
5+
################################################################################
6+
#SBATCH --job-name="windowtrees"
7+
### runtime
8+
#SBATCH --time=2-00:00:00
9+
### changedir / output
10+
#SBATCH --chdir=/hpc-cloud/user/windowTrees
11+
#SBATCH --output=/hpc-cloud/user/windowTrees/WindowTrees_%j.log
12+
#SBATCH --error=/hpc-cloud/user/windowtTees/WindowTrees_%j.log
13+
### Account / Partition
14+
#SBATCH --partition=All ### add partition here
15+
#SBATCH --account=all ### add account here if necessary
16+
### NODE / CPU settings
17+
#SBATCH --nodes=1
18+
#SBATCH --ntasks=1
19+
#SBATCH --cpus-per-task=10
20+
### MEM per node
21+
#SBATCH --mem=32G
22+
### Notifications
23+
#SBATCH --mail-type=END
24+
#SBATCH --mail-user=email@email.de ### add your email address here to get notified
25+
################################################################################
26+
### Print Date and get time
27+
################################################################################
28+
date
29+
start_time="$(date -u +%s)"
30+
################################################################################
31+
### Print Slurmscript to log
32+
################################################################################
33+
echo "################################################################################"
34+
echo -e "Used SLURM script:\n"
35+
cat $0
36+
echo -e "\n################################################################################"
37+
echo "### Output:"
38+
echo "################################################################################"
39+
40+
################################################################################
41+
### Define variables
42+
################################################################################
43+
windowtree_py=/hpc-cloud/user/WindowTrees/windowtrees.py
44+
inputfile=/hpc-cloud/user/WindowTrees/inputfile_test.tab
45+
outpath=/hpc-cloud/user/WindowTrees/run1
46+
outgroup="BELUGA"
47+
windowsize=1000000
48+
nthreshold=0.2
49+
ncpu=10
50+
51+
52+
################################################################################
53+
### Running code
54+
################################################################################
55+
# print input file to log
56+
echo "Utilized sample input file:"
57+
cat $inputfile
58+
echo "Starting WindowTrees..."
59+
# create output dir
60+
mkdir $outpath
61+
# start WindowTree py with arguments defined above
62+
python3 -u $windowtree_py -o $outpath --outgroup $outgroup -w $windowsize --cpu $ncpu $inputfile
63+
64+
################################################################################
65+
### Print elapsed time and date
66+
################################################################################
67+
end_time="$(date -u +%s)"
68+
elapsed="$(($end_time-$start_time))"
69+
echo "Job finished after $elapsed sec"
70+
date

0 commit comments

Comments
 (0)