Skip to content

Commit 56e1b67

Browse files
authored
Merge pull request #5 from mshakya/master
merging
2 parents 51af384 + ca30292 commit 56e1b67

File tree

5 files changed

+225
-28
lines changed

5 files changed

+225
-28
lines changed

docs/conf.py

Lines changed: 12 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -19,14 +19,14 @@
1919

2020
# -- Project information -----------------------------------------------------
2121

22-
project = 'phame'
23-
copyright = '2018, migun'
24-
author = 'migun'
22+
project = 'PhaME'
23+
copyright = '2018, LANL'
24+
author = 'Migun Shakya, Chien-Chi Lo, Sanaa Ahmed, Mark Flynn, Patrick S.G Chain'
2525

2626
# The short X.Y version
27-
version = ''
27+
version = '1'
2828
# The full version, including alpha/beta/rc tags
29-
release = ''
29+
release = '1.0.2'
3030

3131

3232
# -- General configuration ---------------------------------------------------
@@ -66,7 +66,7 @@
6666
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
6767

6868
# The name of the Pygments (syntax highlighting) style to use.
69-
pygments_style = None
69+
pygments_style = 'sphinx'
7070

7171

7272
# -- Options for HTML output -------------------------------------------------
@@ -106,7 +106,7 @@
106106
# -- Options for HTMLHelp output ---------------------------------------------
107107

108108
# Output file base name for HTML help builder.
109-
htmlhelp_basename = 'phame'
109+
htmlhelp_basename = 'PhaME'
110110

111111

112112
# -- Options for LaTeX output ------------------------------------------------
@@ -133,8 +133,8 @@
133133
# (source start file, target name, title,
134134
# author, documentclass [howto, manual, or own class]).
135135
latex_documents = [
136-
(master_doc, 'phame.tex', 'phame Documentation',
137-
'migun', 'manual'),
136+
(master_doc, 'phame.tex', 'PhaME',
137+
'PhaME Development Team', 'manual'),
138138
]
139139

140140

@@ -143,7 +143,7 @@
143143
# One entry per manual page. List of tuples
144144
# (source start file, name, description, authors, manual section).
145145
man_pages = [
146-
(master_doc, 'phame', 'phame Documentation',
146+
(master_doc, 'PhaME', 'PhaME',
147147
[author], 1)
148148
]
149149

@@ -154,8 +154,8 @@
154154
# (source start file, target name, title, author,
155155
# dir menu entry, description, category)
156156
texinfo_documents = [
157-
(master_doc, 'phame', 'phame Documentation',
158-
author, 'phame', 'One line description of project.',
157+
(master_doc, 'PhaME',
158+
author, 'PhaME', 'One line description of project.',
159159
'Miscellaneous'),
160160
]
161161

docs/introduction/introduction.rst

Lines changed: 189 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -4,17 +4,199 @@ Introduction
44
What is PhaME?
55
==============
66

7+
PhaME or Phylogenetic and Molecular Evolution (PhaME) is an analysis tool that performs phylogenetic and molecular evolutionary analysis.
78

9+
Given a reference, PhaME extracts SNPs from complete genomes, draft genomes and/or reads, uses SNP multiple sequence alignment to construct a phylogenetic tree, and provides evolutionary analyses (genes under positive selection) using CDS SNPs.
810

9-
PhaME or Phylogenetic and Molecular Evolution (PhaME) analysis tool allows suite of analysis pertaining to phylogeny and moleuclar evolution.
1011

11-
Given a reference, PhaME extracts SNPs from complete genomes, draft genomes and/or reads, uses SNP multiple sequence alignment to construct a phylogenetic tree, and provides evolutionary analyses (genes under positive selection) using CDS SNPs.
12+
PhaME: Under the hood.
13+
======================
14+
15+
Here, We have explained in detail how PhaME works in detail.
16+
17+
Selecting Reference genome:
18+
-----------------------------
19+
PhaME is a reference genome based tool where all input genomes and metagenomes are first aligned against a reference genome. As a first step in the PhaME pipeline, given a set of reference genomes in path set in parameter `refdir` `phame.ctl` file, one genome is picked as the reference genome. PhaME provides multiple options to pick the most appropriate reference genome. First, reference genome can be picked randomly from the set. Second, users can select reference genome from the list. Lastly, users can use MinHash option, which will calculate the MinHash distance between all reference genomes with each other and with input contigs and raw reads to pick a genome that has the shortest average distance with each other. MinHash distances are calculated using BBMap [Bushnell]_.
20+
21+
Self-nucmerization to remove repeats from reference genomes:
22+
---------------------------------------------------------------
23+
PhaME is built on alignment tool nucmer [Kurtz 2004]_ for genome alignment. All genomes in fasta format that are complete and included as set of reference genomes are first aligned with self, called `self-nucmerization`, using `nucmer` to remove repeats within a genome. Following `nucmer` command is used for the self-nucmerization step:
24+
25+
::
26+
27+
nucmer --maxmatch --nosimplify --prefix=seq_seq$$ ref_genomeA.fasta ref_genomeA.fasta
28+
29+
::
30+
31+
The option `--maxmatch` reports all matches is used to ensure all possible alignments are reported for maximal removal of repeats. The identified repeat regions are then removed from downstream analyses.
32+
33+
Genome Alignments
34+
--------------------------------
35+
All genomes that are in `refdir` are first aligned against the reference genome (see section 1) that have had its repeats removed (section 2). Likewise, incomplete genomes or contigs, the ones that are listed in the `workdir` with extension `.contig` are also aligned against the reference genome using `nucmer`. For aligning genomes in fasta format against each other, following command, same as the previous step for nucmer alignment is used:
36+
37+
::
38+
39+
nucmer --maxmatch refgenome.fasta genome.fasta
40+
41+
::
42+
43+
All other options in nucmer alignments are kept at default, some of the important ones are listed below:
44+
45+
::
46+
47+
-b|breaklen Set the distance an alignment extension will attempt to
48+
extend poor scoring regions before giving up (default 200)
49+
-c|mincluster Sets the minimum length of a cluster of matches (default 65)
50+
-D|diagdiff Set the maximum diagonal difference between two adjacent
51+
anchors in a cluster (default 5)
52+
-d|diagfactor Set the maximum diagonal difference between two adjacent
53+
anchors in a cluster as a differential fraction of the gap
54+
length (default 0.12)
55+
--[no]extend Toggle the cluster extension step (default --extend)
56+
-g|maxgap Set the maximum gap between two adjacent matches in a
57+
cluster (default 90)
58+
-l|minmatch Set the minimum length of a single match (default 20)
59+
60+
::
61+
62+
Also, `nucmer` only aligns `"A"` `"T"` `"C"` `"G"`, all other characters are ignored. So, if there are `"N"`s in the provided genomes, thse positions are not included in the alignment.
63+
64+
*Note*: If an analysis requires running multiple iterations of PhaME on a same set of dataset or a subset of dataset, one doesn't need to perform the alignment over and over again. PhaME provides an option where it can keep all possible pairwise alignment of genomes from `refdir` for future analyses. All the steps mentioned in this section are the same, except that all vs. all alignment is performed compared to just one reference. By doing all vs. all alignment one can also test the effect on their analyses with different reference genomes.
65+
66+
Mapping of raw reads to reference genome
67+
-------------------------------------------
68+
PhaME as of now only processes short raw reads from Illumina. If raw reads, single or paired end, are included in the analyses, they are mapped to the reference genome using either `bowtie2` or `BWA`. For reads mapping of reference genome, following commands are used:
69+
70+
First, it builds database from the reference genome.
71+
::
72+
73+
bowtie2-build refgenome refgenome
74+
75+
::
76+
or, if BWA was chosen as the preferred aligner:
77+
78+
::
79+
80+
bwa index refgenome
81+
82+
::
83+
84+
The raw reads are then mapped to the reference genomne using one of the following commands:
85+
86+
For bowtie2 and paired reads:
87+
88+
::
89+
90+
bowtie2 -a -x $refgenome -1 read1 -2 read2 -S paired.sam`;
91+
92+
::
93+
The option `-a` reports all possible alignments.
94+
95+
For bowtie2 and single end reads:
96+
97+
::
98+
99+
bowtie2 -a -x $refgenome -U read -S single.sam`;
100+
101+
::
102+
103+
For BWA and paired reads:
104+
105+
::
106+
107+
bwa mem refgenome read1 read2 | samtools view -ubS -| samtools sort -T tmp_folder -O BAM -o paired.bam
108+
109+
::
110+
111+
For BWA and single end reads:
112+
113+
::
114+
115+
bwa mem refgenome read |samtools view -ubS - | samtools sort -T tmp_folder -O BAM -o single.bam
116+
117+
::
118+
119+
120+
Filtering genome alignments
121+
------------------------------
122+
Genome alignment produced using `nucmer` are filtered using `delta-filter` to only keep 1 to 1 alignments allowing for rearrangements. This filtering step is produced for all `nucmer` alignments.
123+
124+
::
125+
126+
delta-filter -1 genome.delta > genome.snpfilter
127+
128+
::
129+
130+
131+
Calling SNPs from genome alignments
132+
--------------------------------------
133+
The pairwise `nucmer` alignments are then parsed to produce a SNP table using `show-snps`.
134+
135+
::
136+
137+
show-snps -CT genome.snpfilter > genome.snps
138+
139+
::
140+
141+
Here, option C and T specifies not to report SNPs from ambiguous alignments and report the output in tab delimited file respectively.
142+
143+
Reporting nucmer alignments
144+
------------------------------
145+
146+
Each alignments are further parse to produce a tab delimited file that has information on regions and %ID of their alignments.
147+
::
148+
149+
show-coords -clTr genome.snpfilter > genome.coords
150+
151+
::
152+
153+
The parameter flag -clTr implies different headers to be reported in the report.
154+
155+
::
156+
157+
-c Include percent coverage information in the output
158+
-l Include the sequence length information in the output
159+
-r Sort output lines by reference IDs and coordinates
160+
-T Switch output to tab-delimited format
161+
162+
::
163+
164+
Calling SNPs from read mapping
165+
---------------------------------
166+
`bcftools mpileup` is used for calling SNPs from read mapping results (bam file) of every genomes represented by raw reads. Maximum depth is set to 1000000 for both SNP and indel calling and minimum gaps for calling an indel is set to 3. The output vcf file is then used to call SNPs using `bcftools call` where ploidy is specified as `1` if its a haploid or bacterial genome, else it is called using default parameter. Furthermore, based on the user specified parameter in the control file, SNPs are further filtered based on percentage of SNPs. Here are the snippets of commmand that are run as part of this. All of them result in a vcf file.
167+
168+
::
169+
170+
bcftools mpileup -d 1000000 -L 1000000 -m 3 -Ov -f $refgenome $bam_output | bcftools call --ploidy 1 -cO b > $bcf_output;
171+
bcftools view -v snps,indels,mnps,ref,bnd,other -Ov $bcf_output | vcfutils.pl varFilter -a$min_alt_bases -d$min_depth -D$max_depth > $vcf_output`;
172+
bcftools filter -i '(DP4[0]+DP4[1])==0 || (DP4[2]+DP4[3])/(DP4[0]+DP4[1]+DP4[2]+DP4[3]) > $snp_filter' $vcf_output > $vcf_filtered`
173+
174+
::
175+
176+
177+
Calculating core genome alignments
178+
----------------------------------
179+
As a first step in calculating the core genome, all alignments to reference are checked for linear coverage to assure the proportion of reference genome that was used in the alignment. If its lower than the threshold cutoff (default = 0.6) set in control file, that genome will be removed from further analyses. Then rest of the pairwise alignments that are either in vcf format or nucmer formats are then collated to calculate a core genome. Only the alignment position that are 100% conserved are kept, all other positions are removed from the final core genome alignment. PhaME produces multiple alignment files corresponding to core genome such as the one that has only the variant sites (`_all_snp_alignment.fna`), has variant and invariant sites (`all_alignment.fna`), and the ones that have SNPs from only the coding region (`_cds_snp_alignment.fna`). The coding region SNP alignment requires a GFF formatted annotation file.
180+
12181

182+
Reconstructing core genome phylogeny
183+
-------------------------------------
184+
PhaME provides multiple tools (RAxML [Stamatakis 2014]_, FastTree [Price 2010]_, and IQ-Tree [Nguyen 2015]_) to reconstruct phylogeny from one core genome alignments that have invariant sites. If RAxML or FastTree option is chosen, users cannot modify the models as they are pre-selected. RAxML trees are reconstructed using GTRGAMMAI models that "GTR + Optimization of substitution rates + GAMMA model of rate heterogeneity (alpha parameter will be estimated)" with `I` but with estimate for invariable sites. FastTree uses GTR model only. IQ-TREE is run using option `-m TEST` that searches for the best model that fits the data before reconstructing the phylogeny. RAxML is the only option that is currently available that can also calculate the bootstraps.
13185

14-
Quick usage
15-
===========
16-
To quickly get started with PhaME, you can install using conda.
186+
Selecting genes for molecular evolutionary analyses
187+
-------------------------------------------------------
188+
To perform selection analyses using PAML or HyPhy, codon alignments of genes are required. Based on the position of SNPs in the reference genome, if a SNP is within a coding region and if that coding region does not have a gap, they are extracted from the core genome alignment. The nucleotide sequences of the genes are then translated to protein sequences, aligned using the program `mafft`, and then reverse translated back to nucleotide using the perl code `pal2nal.pl` from http://www.bork.embl.de/pal2nal/.
17189

18-
.. code-block:: console
190+
Molecular Evoluationary analyses
191+
------------------------------------
192+
The set of gene alignments are then used for molecular evolutionary analyses using either PAML [Yang 2007]_ or HyPhy [Pond 2005]_. PAML is run twice for the same gene using two differnt models (`model=0` and `model=2`), first that sets one omega ratio for all branches and another that sets different omega ratios for all lineages. For the first model, additional parameter variation model that specifies, neutral (1), selection (2), beta and omega ratio between 0 and 1 (7), and beta, omega and an additional omega is run. For the Model 2 with variable omega ratios across all branches, the model with one omega across all sites are used. If HyPhy is selected, it uses the aBSREL (adaptive Branch-Site Random Effects Likelihood) model.
19193

20-
conda install phame
194+
References
195+
--------------
196+
.. [Yang 2007] Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 2007, 24:1586-1591.
197+
.. [Pond 2005] Pond SL, Frost SD, Muse SV: HyPhy: hypothesis testing using phylogenies. Bioinformatics 2005, 21:676-679.
198+
.. [Kurtz 2004] Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5:R12.
199+
.. [Bushnell] Bushnell B: BBMap. 37.66 edition. sourceforge.net/projects/bbmap/.
200+
.. [Stamatakis 2014] Stamatakis A: RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics 2014, 30:1312-1313.
201+
.. [Price 2010] Price MN, Dehal PS, Arkin AP: FastTree 2--approximately maximum- likelihood trees for large alignments. PLoS One 2010, 5:e9490.
202+
.. [Nguyen 2015] Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ: IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 2015, 32:268-274.

lib/PhaME.pm

Lines changed: 14 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -602,7 +602,8 @@ sub buildTree {
602602
if ( $tree == 1 || $tree == 4 ) {
603603
print OUT "Reconstructing phylogeny using FastTree\n";
604604
my $fasttree
605-
= "export OMP_NUM_THREADS=$thread; FastTree -quiet -nt -gtr < $outdir/$name\_snp_alignment.fna > $outdir/$name\.fasttree 2>>$error \n\n";
605+
= "export OMP_NUM_THREADS=$thread; FastTree -quiet -nt -gtr < $outdir/$name\_all_alignment.fna > $outdir/$name\.fasttree 2>>$error \n\n";
606+
# = "export OMP_NUM_THREADS=$thread; FastTree -quiet -nt -gtr < $outdir/$name\_snp_alignment.fna > $outdir/$name\.fasttree 2>>$error \n\n";
606607
print OUT $fasttree;
607608
if ( system($fasttree) ) { die "Error running $fasttree.\n"; }
608609

@@ -611,7 +612,8 @@ sub buildTree {
611612
print OUT "Reconstructing phylogeny using RaxML\n";
612613
print OUT "\n";
613614
my $raxml
614-
= "raxmlHPC-PTHREADS -p 10 -T $thread -m GTRGAMMAI -s $outdir/$name\_snp_alignment.fna -w $outdir -n $name 2>>$error >> $log\n\n";
615+
= "raxmlHPC-PTHREADS -p 10 -T $thread -m GTRGAMMAI -s $outdir/$name\_all_alignment.fna -w $outdir -n $name 2>>$error >> $log\n\n";
616+
# = "raxmlHPC-PTHREADS -p 10 -T $thread -m GTRGAMMAI -s $outdir/$name\_snp_alignment.fna -w $outdir -n $name 2>>$error >> $log\n\n";
615617
print OUT $raxml;
616618
if ( system($raxml) ) { die "Error running $raxml.\n"; }
617619
# dont need a rooted tree, removing it for now
@@ -628,7 +630,7 @@ sub buildTree {
628630
print OUT "Also bootstraping IQ-Trees trees\n";
629631
print OUT "\n";
630632
my $iqtree
631-
= "iqtree -m TEST -s $outdir/$name\_snp_alignment.fna -nt $thread 2>>$error >> $log\n\n";
633+
= "iqtree -m TEST -s $outdir/$name\_all_alignment.fna -nt $thread 2>>$error >> $log\n\n";
632634
print OUT $iqtree;
633635
if ( system($iqtree) ) { die "Error running $iqtree.\n"; }
634636
}
@@ -674,6 +676,15 @@ sub bootstrap {
674676
if ( system($bestTree) ) { die "Error running $bestTree.\n"; }
675677

676678
}
679+
if ( $tree == 3 || $tree == 4 ) {
680+
print OUT "Reconstructing phylogeny using IQ-tree after finding the best model\n";
681+
print OUT "Also bootstraping IQ-Trees trees\n";
682+
print OUT "\n";
683+
my $iqtree
684+
= "iqtree -m TEST -b $bootstrap -s $outdir/$name\_all_alignment.fna -nt $thread 2>>$error >> $log\n\n";
685+
print OUT $iqtree;
686+
if ( system($iqtree) ) { die "Error running $iqtree.\n"; }
687+
}
677688

678689
return "Bootstrap complete";
679690
close OUT;

src/phame

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -91,7 +91,7 @@ my $code = 0;
9191
my $type;
9292
my $clean = 1;
9393
my $threads = 1;
94-
my $cutoff = 0;
94+
my $cutoff = 0.6;
9595
my $annotation;
9696
my $genefile;
9797
my $runtime = time;

test/ctl_files/t1_ebola_preads.ctl

Lines changed: 9 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -16,20 +16,24 @@
1616
data = 4 # *See below 0:only complete(F); 1:only contig(C); 2:only reads(R);
1717
# 3:combination F+C; 4:combination F+R; 5:combination C+R;
1818
# 6:combination F+C+R; 7:realignment *See below
19+
1920
reads = 2 # 1: single reads; 2: paired reads; 3: both types present;
21+
22+
SNPsfilter = 0.6 #
2023

21-
tree = 1 # 0:no tree; 1:use FastTree; 2:use RAxML; 3:use both;
24+
tree = 4 # 0:no tree; 1:use FastTree; 2:use RAxML; 3:use IQ tree 4:all;
2225
bootstrap = 0 # 0:no; 1:yes; # Run bootstrapping *See below
23-
N = 10 # Number of bootstraps to run *See below
26+
N = 0 # Number of bootstraps to run *See below
2427

25-
PosSelect = 1 # 0:No; 1:use PAML; 2:use HyPhy; 3:use both
28+
PosSelect = 0 # 0:No; 1:use PAML; 2:use HyPhy; 3:use both
2629

2730
code = 1 # 0:Bacteria; 1:Virus
2831

29-
clean = 0 # 0:no clean; 1:clean
32+
clean = 1 # 0:no clean; 1:clean
3033

3134
threads = 2 # Number of threads to use
3235

33-
cutoff = 0.0 # Linear alignment (LA) coverage against reference - ignores SNPs from organism that have lower cutoff.
36+
cutoff = 0.5 # Linear alignment (LA) coverage against reference - ignores SNPs from organism that have lower cutoff.
37+
3438

3539

0 commit comments

Comments
 (0)