Commit 3d9442f

Merge branch 'devel' of github.com:gtonkinhill/panaroo into devel
2 parents d8a2764 + 9abae52 commit 3d9442f

8 files changed: 34 additions & 18 deletions

docs/gettingstarted/citation.md

Lines changed: 1 addition & 1 deletion
@@ -5,4 +5,4 @@ If you use Panaroo please cite:
 **Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. 2020. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21:180.**


-We include some implementations of other algorithms in the post processing scripts which should be cited seperately. The citations for these scripts are given in the relevant section of the documentation.
+We include some implementations of other algorithms in the post processing scripts which should be cited separately. The citations for these scripts are given in the relevant section of the documentation.

docs/gettingstarted/output.md

Lines changed: 12 additions & 4 deletions
@@ -4,7 +4,7 @@

 A csv file describing which gene is in which sample. If a gene cluster is present in a sample, the sequence name of the representative for that sample is given in the matrix. The corresponding DNA and protein sequence can then be matched to those found in the `combined_DNA_CDS.fasta` and `combined_protein_CDS.fasta` files. The format is the same as that given by [Roary](https://sanger-pathogens.github.io/Roary/).

-Annotations that have been merged will be seperated by a semicolon. Refound genes that inlcude a stop codon will have '_stop' appended to the end of the gene name. This indicates they could be a potential pseudo gene. Gene within a cluster that have an unusual length will have '_len' appended to the gene name.
+Annotations that have been merged will be separated by a semicolon. Refound genes that include a stop codon will have '_stop' appended to the end of the gene name. This indicates they could be a potential pseudo gene. Gene within a cluster that have an unusual length will have '_len' appended to the gene name.

 ### gene_presence_absence.Rtab

@@ -16,15 +16,15 @@ The final pan-genome graph generated by Panaroo. This can be viewed easily using

 ### struct_presence_absence.csv

-A csv file which lists the presence and abscence of different genomic rearrangement events. The genes involved in each event are listed in the respective column names of the csv. The thresholds for calling these events can be changed by adjusting the `--min_edge_support_sv` parameter when calling panaroo.
+A csv file which lists the presence and absence of different genomic rearrangement events. The genes involved in each event are listed in the respective column names of the csv. The thresholds for calling these events can be changed by adjusting the `--min_edge_support_sv` parameter when calling Panaroo.

 ### pan_genome_reference.fa

-This is a similar output to that produced by [Roary](https://sanger-pathogens.github.io/Roary/). It creates a linear reference genome of all the genes found in the dataset. The order of the genes in this reference are not significant.
+This is a similar output to that produced by [Roary](https://sanger-pathogens.github.io/Roary/). It creates a linear reference genome of all the genes found in the dataset. The order of the genes in this reference are not significant. NOTE: to avoid issues with the multi-mapping of reads, paralogous gene clusters will only be represented once in this reference.

 ### gene_data.csv

-This is a very large file mainly used internally in the program. It links each gene sequnece and annotation to the internal representations used. It can be useful in interpreting some of the output especially the 'final_graph.gml' file.
+This is a very large file mainly used internally in the program. It links each gene sequence and annotation to the internal representations used. It can be useful in interpreting some of the output especially the 'final_graph.gml' file.

 ### combined_DNA_CDS.fasta

@@ -33,3 +33,11 @@ This is a fasta file which includes all nucleotide sequence for both the annotat
 ### combined_protein_CDS.fasta

 Similar to the `combined_DNA_CDS.fasta` file, this is a fasta file which includes all protein sequence for both the annotated genes and those refound by the program. The gene names are the internal ones used by Panaroo. These can be translated to the original names using the 'gene_data.csv' file.
+
+### core_gene_alignment.aln
+
+An alignment of genes present in at least the fraction of genomes specified by the `--core_threshold` parameter (default=0.95). Currently, in cases where a gene is fragmented only the longer fragment will appear in this output.
+
+### core_gene_alignment_filtered.aln
+
+This alignment is recommended for building core genome phylogenies. It is a filtered version of the core genome alignment. Additional genes are removed if they exceed the Block Mapping and Gathering with Entropy (BMGE) filter. This is set using the `--core_entropy_filter` parameter. By default this automatically adapts to each dataset and identifies outlying genes using Tukey's outlier test (recommended).
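
The `gene_presence_absence.csv` description in this file says that merged entries are separated by a semicolon and that `_stop` and `_len` are appended to flagged gene names. A minimal Python sketch of reading one cell under those rules is given here. The function name `parse_cell` and the example identifiers are hypothetical, and the sketch assumes the suffixes only ever appear at the very end of a name.

```python
# Suffixes described in the output documentation; everything else here is illustrative.
FLAG_SUFFIXES = ("_stop", "_len")


def parse_cell(cell):
    """Split one presence/absence cell into (base_name, flags) pairs.

    Assumes entries are separated by ';' and that flags are trailing suffixes.
    """
    genes = []
    for name in filter(None, cell.split(";")):
        flags = set()
        stripped = True
        while stripped:  # peel off trailing suffixes one at a time
            stripped = False
            for suffix in FLAG_SUFFIXES:
                if name.endswith(suffix):
                    name = name[: -len(suffix)]
                    flags.add(suffix)
                    stripped = True
        genes.append((name, flags))
    return genes


# Hypothetical cell holding three entries, two of them flagged.
example = parse_cell("a_1;b_2_stop;c_3_len")
```

An empty cell (gene absent from the sample) yields an empty list, so presence can be tested with a plain truthiness check.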

docs/gettingstarted/params.md

Lines changed: 14 additions & 6 deletions
@@ -21,23 +21,23 @@ The Panaroo algorithm initially performs a conservative clustering step before c

 Thus we recommend using the defaults for `--threshold` (0.98) and `--len_dif_percent` (0.98).

-If you wish to adjust the level at which Panaroo colapses genes into putitive families we suggest changing the family sequence identity level (default 0.7). Thus to run Panaroo using a more relaxed threshold of 50% identity you could run
+If you wish to adjust the level at which Panaroo collapses genes into putative families we suggest changing the family sequence identity level (default 0.7). Thus to run Panaroo using a more relaxed threshold of 50% identity you could run

 ```
 panaroo -i *.gff -o ./results/ --clean-mode strict -f 0.5
 ```

 #### Paralogs

-Panaroo splits paralogs into seperate clusters by default. Merging paralogs can be enabled by running Panraoo as
+Panaroo splits paralogs into separate clusters by default. Merging paralogs can be enabled by running Panaroo as

 ```
 panaroo -i *.gff -o ./results/ --clean-mode strict --merge_paralogs
 ```

 #### Refinding Genes

-In order to identify genes that have been missed by annotation software, Panaroo incoporates a refinding step. Suppose two clusters geneA and geneB are adjacent in the Panaroo pangenome graph. If geneA is present in a genome but its neighbour (geneB) is not then Panaroo searches the sequence surrounding geneA for the presence of geneB. The radius of this search in nucleotides is controlled by `--search_radius`, with the default being 5000.
+In order to identify genes that have been missed by annotation software, Panaroo incorporates a refinding step. Suppose two clusters geneA and geneB are adjacent in the Panaroo pangenome graph. If geneA is present in a genome but its neighbour (geneB) is not then Panaroo searches the sequence surrounding geneA for the presence of geneB. The radius of this search in nucleotides is controlled by `--search_radius`, with the default being 5000.

 As such missing genes are often the results of assembly fragmentation, the refinding step only requires that a proportion of the missing gene is located. This proportion can be controlled using the `--refind_prop_match` parameter.
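
To make the `--search_radius` description in the hunk above concrete, here is a minimal Python sketch of how a search window around geneA could be computed. This is not Panaroo's implementation: the function `search_window`, its arguments, and the clipping to contig bounds are all assumptions made for illustration. Only the default radius of 5000 nucleotides comes from the documentation.

```python
def search_window(gene_start, gene_end, contig_length, search_radius=5000):
    """Return the (start, end) region to search around a gene.

    Hypothetical helper: extends the gene's coordinates by `search_radius`
    nucleotides on each side and clips the result to the contig.
    """
    start = max(0, gene_start - search_radius)
    end = min(contig_length, gene_end + search_radius)
    return start, end


# A gene near the start of a short contig: the window is clipped at both ends.
window = search_window(1000, 2000, 6000)
```

On a fragmented assembly the window is often cut short by a contig edge, which is consistent with the documentation's point that only a proportion of the missing gene needs to be located.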

@@ -65,15 +65,17 @@ usage: panaroo [-h] -i INPUT_FILES [INPUT_FILES ...] -o OUTPUT_DIR
                [--high_var_flag CYCLE_THRESHOLD_MIN]
                [--min_edge_support_sv MIN_EDGE_SUPPORT_SV]
                [--all_seq_in_graph] [--no_clean_edges] [-a {core,pan}]
-               [--aligner {prank,clustal,mafft}] [--core_threshold CORE]
-               [-t N_CPU] [--quiet] [--version]
+               [--aligner {prank,clustal,mafft}] [--codons]
+               [--core_threshold CORE] [--core_entropy_filter HC_THRESHOLD]
+               [-t N_CPU] [--codon-table TABLE] [--quiet] [--version]

 panaroo: an updated pipeline for pangenome investigation

 optional arguments:
   -h, --help            show this help message and exit
   -t N_CPU, --threads N_CPU
                         number of threads to use (default=1)
+  --codon-table TABLE   the codon table to use for translation (default=11)
   --quiet               suppress additional output
   --version             show program's version number and exit
@@ -117,7 +119,7 @@ Mode:

 Matching:
   -c ID, --threshold ID
-                        sequence identity threshold (default=0.95)
+                        sequence identity threshold (default=0.98)
   -f FAMILY_THRESHOLD, --family_threshold FAMILY_THRESHOLD
                         protein family sequence identity threshold
                         (default=0.7)
@@ -171,8 +173,14 @@ Gene alignment:
   --aligner {prank,clustal,mafft}
                         Specify an aligner. Options:'prank', 'clustal', and
                         default: 'mafft'
+  --codons              Generate codon alignments by aligning sequences at the
+                        protein level
   --core_threshold CORE
                         Core-genome sample threshold (default=0.95)
+  --core_entropy_filter HC_THRESHOLD
+                        Manually set the Block Mapping and Gathering with
+                        Entropy (BMGE) filter. Can be between 0.0 and 1.0. By
+                        default this is set using the Tukey outlier method.
 ```

 #### Default Parameters

docs/gettingstarted/quickstart.md

Lines changed: 3 additions & 3 deletions
@@ -10,7 +10,7 @@ mkdir results
 panaroo -i *.gff -o results --clean-mode strict
 ```

-If you are using GFFs from RefSeq they can occasionally include annotations that do not conform the expected. This is usually due to a premature stop codon or a gene of invalid length. By default Panaroo will fail to parse these annotations. However, you can set Panaroo to ignore invalid annotaion by enabling the `remove-invalid-genes` flag
+If you are using GFFs from RefSeq they can occasionally include annotations that do not conform the expected. This is usually due to a premature stop codon or a gene of invalid length. By default Panaroo will fail to parse these annotations. However, you can set Panaroo to ignore invalid annotation by enabling the `remove-invalid-genes` flag

 ```
 panaroo -i *.gff -o results --clean-mode strict --remove-invalid-genes
@@ -20,15 +20,15 @@ panaroo -i *.gff -o results --clean-mode strict --remove-invalid-genes

 By default Panaroo runs in its strictest (most conservative) mode. We have found that for most use cases this removes potential sources of contamination and error whilst retaining the majority of genes researchers are interested in.

-Very rare plasmids are difficult to distguish from contamination. Thus, if you are interested in retaining such plasmids at the expense of added contamination we recommend running panaroo using its most sensitive mode
+Very rare plasmids are difficult to distinguish from contamination. Thus, if you are interested in retaining such plasmids at the expense of added contamination we recommend running panaroo using its most sensitive mode

 ```
 panaroo -i *.gff -o results --clean-mode sensitive
 ```

 ## Different input formats

-Panaroo now supports multiple input formats. To use non-standard GFF3 files you must profide the input file as a list in a text file (one per line). Seperate GFF and FASTA files can be provided per isolate by providing each file delimeted by a space or a tab. Genbank file formats are also supported with extensions '.gbk', '.gb' or '.gbff'. These must compliant with Genbank/ENA/DDJB. This can be forced in Prokka by specifying the `--compliance` parameter.
+Panaroo now supports multiple input formats. To use non-standard GFF3 files you must profile the input file as a list in a text file (one per line). Separate GFF and FASTA files can be provided per isolate by providing each file delimited by a space or a tab. Genbank file formats are also supported with extensions '.gbk', '.gb' or '.gbff'. These must compliant with Genbank/ENA/DDJB. This can be forced in Prokka by specifying the `--compliance` parameter.

 NOTE: Some annotations file such as those from RefSeq include annotations that break some of the assumptions of Panaroo. These include gene annotations of unusual length or that with premature stop codons (pseudogenes). By default Panaroo will throw an error if it encounters these annotations. You can automatically filter out such annotations by calling panaroo with the `--remove-invalid-genes` flag.

docs/merge/merge_graphs.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Merge Panaroo graphs

-It is possible to merge pangenome graphs generated by independent runs of the Panaroo algorithm. This is particularly useful when dealing with very large or diverse datasets. The Panaroo algorithm assumes that a dataset is not overly diverse and thus results can be improved by running the algorithm on seperate clusters of genomes independently before merging the resulting graphs.
+It is possible to merge pangenome graphs generated by independent runs of the Panaroo algorithm. This is particularly useful when dealing with very large or diverse datasets. The Panaroo algorithm assumes that a dataset is not overly diverse and thus results can be improved by running the algorithm on separate clusters of genomes independently before merging the resulting graphs.

 This approach can also be used to compare the pangenomes of different species or lineages.

docs/post/pansize.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ These model based approaches are preferable to the common accumulation curves of

 ### Infinitely Many Genes model

-The IMG model allows for gene gain from an unbounded reservoir of new genes, and gene loss. As the reservoir is unbounded, the same gene gain event can only occur once. This might represent horizontal gene transfer from a diverged taxa with gene loss representing the conversion of genes to pseudogenes or deletion in reproduction. This model is described in Baemdicker et al. 2012 and Collins et al. 2012.
+The IMG model allows for gene gain from an unbounded reservoir of new genes, and gene loss. As the reservoir is unbounded, the same gene gain event can only occur once. This might represent horizontal gene transfer from a diverged taxa with gene loss representing the conversion of genes to pseudogenes or deletion in reproduction. This model is described in Baumdicker et al. 2012 and Collins et al. 2012.

 To estimate the parameters of this model, a dated phylogeny based on the core genome is required. Such phylogenies can be produced using [BEAST](https://www.beast2.org/) or by combining faster methods such as [IQ-TREE](http://www.iqtree.org/) and [BactDating](https://xavierdidelot.github.io/BactDating/)

panaroo/generate_output.py

Lines changed: 1 addition & 1 deletion
@@ -397,7 +397,7 @@ def concatenate_core_genome_alignments(core_names, output_dir, hc_threshold):
     if hc_threshold is None:
         allh = np.array([gene[3] for gene in gene_alignments])
         q = np.quantile(allh, [0.25,0.75])
-        hc_threshold = q[1] + 1.5*(q[1]-q[0])
+        hc_threshold = max(0.01, q[1] + 1.5*(q[1]-q[0]))
         print(f"Entropy threshold automatically set to {hc_threshold}.")

     isolate_aln = []
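
The change above puts a floor of 0.01 on the automatically chosen entropy threshold, which is the upper Tukey fence (Q3 + 1.5 × IQR) of the per-gene entropy values. A minimal standalone sketch of that calculation follows. It assumes the standard-library `statistics.quantiles` with `method="inclusive"` as a stand-in for the `np.quantile` call used in the commit; the function name `tukey_upper_fence` and the example scores are hypothetical.

```python
from statistics import quantiles


def tukey_upper_fence(values, floor=0.01, k=1.5):
    """Upper Tukey fence Q3 + k*IQR, never lower than `floor`."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return max(floor, q3 + k * (q3 - q1))


# Hypothetical per-gene entropy scores where most genes score exactly zero.
entropies = [0.0, 0.0, 0.0, 0.0, 0.2]
threshold = tukey_upper_fence(entropies)
```

In this example both quartiles are zero, so the unfloored fence would also be zero. The floor presumably exists to stop the threshold collapsing in such low-diversity datasets, although the commit itself does not state the motivation.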

setup.py

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ def find_version(*file_paths):
     url="https://github.com/gtonkinhill/panaroo",
     install_requires=[
         'networkx', 'gffutils', 'BioPython', 'joblib', 'tqdm', 'edlib',
-        'scipy', 'numpy', 'matplotlib', 'sklearn', 'plotly', 'dendropy',
+        'scipy', 'numpy', 'matplotlib', 'scikit-learn', 'plotly', 'dendropy',
         'intbitset', 'biocode'
     ],
     python_requires='>=3.6.0',
