docs/gettingstarted/citation.md
+1 -1 (1 addition & 1 deletion)
@@ -5,4 +5,4 @@ If you use Panaroo please cite:
**Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. 2020. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21:180.**
-
We include some implementations of other algorithms in the post processing scripts which should be cited seperately. The citations for these scripts are given in the relevant section of the documentation.
+
We include some implementations of other algorithms in the post processing scripts which should be cited separately. The citations for these scripts are given in the relevant section of the documentation.
docs/gettingstarted/output.md
+12 -4 (12 additions & 4 deletions)
@@ -4,7 +4,7 @@
A csv file describing which gene is in which sample. If a gene cluster is present in a sample, the sequence name of the representative for that sample is given in the matrix. The corresponding DNA and protein sequence can then be matched to those found in the `combined_DNA_CDS.fasta` and `combined_protein_CDS.fasta` files. The format is the same as that given by [Roary](https://sanger-pathogens.github.io/Roary/).
-
Annotations that have been merged will be seperated by a semicolon. Refound genes that inlcude a stop codon will have '_stop' appended to the end of the gene name. This indicates they could be a potential pseudo gene. Gene within a cluster that have an unusual length will have '_len' appended to the gene name.
+
Annotations that have been merged will be separated by a semicolon. Refound genes that include a stop codon will have '_stop' appended to the end of the gene name. This indicates they could be a potential pseudogene. Genes within a cluster that have an unusual length will have '_len' appended to the gene name.
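The naming conventions described here can be unpacked when reading the matrix. The following is a minimal Python sketch, assuming a single cell of `gene_presence_absence.csv` has already been read as a string; the function name `parse_cell` and the flag names are hypothetical, and the handling of names that carry both suffixes is an assumption rather than documented Panaroo behaviour.

```python
# Suffixes described in the documentation, mapped to illustrative flag names.
SUFFIX_FLAGS = {"_stop": "possible_pseudogene", "_len": "unusual_length"}


def parse_cell(cell):
    """Split one matrix cell into gene entries and decode name suffixes.

    Merged annotations are separated by ';'. An empty cell (gene cluster
    absent from the sample) yields an empty list.
    """
    entries = []
    for raw in cell.split(";"):
        name = raw.strip()
        if not name:
            continue
        flags = {flag: False for flag in SUFFIX_FLAGS.values()}
        stripped = True
        # Assumption: a name may carry more than one suffix, in any order.
        while stripped:
            stripped = False
            for suffix, flag in SUFFIX_FLAGS.items():
                if name.endswith(suffix):
                    name = name[: -len(suffix)]
                    flags[flag] = True
                    stripped = True
        entries.append({"name": name, **flags})
    return entries
```

Calling `parse_cell("geneA_1;geneB_2_stop")` returns two entries, with only the second marked as a possible pseudogene.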
### gene_presence_absence.Rtab
@@ -16,15 +16,15 @@ The final pan-genome graph generated by Panaroo. This can be viewed easily using
### struct_presence_absence.csv
-
A csv file which lists the presence and abscence of different genomic rearrangement events. The genes involved in each event are listed in the respective column names of the csv. The thresholds for calling these events can be changed by adjusting the `--min_edge_support_sv` parameter when calling panaroo.
+
A csv file which lists the presence and absence of different genomic rearrangement events. The genes involved in each event are listed in the respective column names of the csv. The thresholds for calling these events can be changed by adjusting the `--min_edge_support_sv` parameter when calling Panaroo.
### pan_genome_reference.fa
-
This is a similar output to that produced by [Roary](https://sanger-pathogens.github.io/Roary/). It creates a linear reference genome of all the genes found in the dataset. The order of the genes in this reference are not significant.
+
This is a similar output to that produced by [Roary](https://sanger-pathogens.github.io/Roary/). It creates a linear reference genome of all the genes found in the dataset. The order of the genes in this reference is not significant. NOTE: to avoid issues with the multi-mapping of reads, paralogous gene clusters will only be represented once in this reference.
### gene_data.csv
-
This is a very large file mainly used internally in the program. It links each gene sequnece and annotation to the internal representations used. It can be useful in interpreting some of the output especially the 'final_graph.gml' file.
+
This is a very large file mainly used internally in the program. It links each gene sequence and annotation to the internal representations used. It can be useful in interpreting some of the output especially the 'final_graph.gml' file.
### combined_DNA_CDS.fasta
@@ -33,3 +33,11 @@ This is a fasta file which includes all nucleotide sequence for both the annotat
### combined_protein_CDS.fasta
Similar to the `combined_DNA_CDS.fasta` file, this is a fasta file which includes all protein sequence for both the annotated genes and those refound by the program. The gene names are the internal ones used by Panaroo. These can be translated to the original names using the 'gene_data.csv' file.
+
+
### core_gene_alignment.aln
+
+
An alignment of genes present in at least the fraction of genomes specified by the `--core_threshold` parameter (default=0.95). Currently, in cases where a gene is fragmented only the longer fragment will appear in this output.
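The core threshold can be read as a simple fraction test per gene. The sketch below is illustrative only: it assumes presence/absence calls are available as 0/1 lists per gene (for example, parsed from `gene_presence_absence.Rtab`), and the function name `core_genes` is hypothetical rather than part of Panaroo.

```python
def core_genes(presence, core_threshold=0.95):
    """Return genes present in at least `core_threshold` of the genomes.

    `presence` maps a gene name to a list of 0/1 calls, one per genome.
    """
    core = []
    for gene, calls in presence.items():
        if not calls:
            continue  # no genomes recorded for this gene
        fraction = sum(1 for call in calls if call) / len(calls)
        if fraction >= core_threshold:
            core.append(gene)
    return core
```

With 20 genomes and the default threshold, a gene missing from one genome (19/20 = 0.95) is still treated as core, while a gene missing from two is not.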
+
+
### core_gene_alignment_filtered.aln
+
+
This alignment is recommended for building core genome phylogenies. It is a filtered version of the core genome alignment. Additional genes are removed if they exceed the Block Mapping and Gathering with Entropy (BMGE) filter. This is set using the `--core_entropy_filter` parameter. By default this automatically adapts to each dataset and identifies outlying genes using Tukey's outlier test (recommended).
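Tukey's outlier test flags values that lie beyond the interquartile fences. The following is a minimal Python sketch of that idea, not Panaroo's implementation: it assumes a per-gene entropy score is already available, that only the upper fence (Q3 + 1.5 × IQR) is relevant, and that the conventional multiplier of 1.5 is used. The function name `tukey_high_outliers` and the input layout are hypothetical.

```python
import statistics


def tukey_high_outliers(scores, k=1.5):
    """Return the names whose score exceeds Q3 + k * IQR.

    `scores` maps a gene name to a numeric score (assumed here to be an
    alignment entropy value). Needs at least two values.
    """
    values = sorted(scores.values())
    q1, _median, q3 = statistics.quantiles(values, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [name for name, value in scores.items() if value > upper_fence]
```

Because the fence is derived from the quartiles of the data itself, the cut-off adapts to each dataset instead of relying on a fixed value.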
docs/gettingstarted/params.md
+14 -6 (14 additions & 6 deletions)
@@ -21,23 +21,23 @@ The Panaroo algorithm initially performs a conservative clustering step before c
Thus we recommend using the defaults for `--threshold` (0.98) and `--len_dif_percent` (0.98).
-
If you wish to adjust the level at which Panaroo colapses genes into putitive families we suggest changing the family sequence identity level (default 0.7). Thus to run Panaroo using a more relaxed threshold of 50% identity you could run
+
If you wish to adjust the level at which Panaroo collapses genes into putative families we suggest changing the family sequence identity level (default 0.7). Thus to run Panaroo using a more relaxed threshold of 50% identity you could run
-
In order to identify genes that have been missed by annotation software, Panaroo incoporates a refinding step. Suppose two clusters geneA and geneB are adjacent in the Panaroo pangenome graph. If geneA is present in a genome but its neighbour (geneB) is not then Panaroo searches the sequence surrounding geneA for the presence of geneB. The radius of this search in nucleotides is controlled by `--search_radius`, with the default being 5000.
+
In order to identify genes that have been missed by annotation software, Panaroo incorporates a refinding step. Suppose two clusters geneA and geneB are adjacent in the Panaroo pangenome graph. If geneA is present in a genome but its neighbour (geneB) is not then Panaroo searches the sequence surrounding geneA for the presence of geneB. The radius of this search in nucleotides is controlled by `--search_radius`, with the default being 5000.
As such missing genes are often the results of assembly fragmentation, the refinding step only requires that a proportion of the missing gene is located. This proportion can be controlled using the `--refind_prop_match` parameter.
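The refinding logic described above can be pictured as a window around geneA plus a proportion check on the candidate match. The sketch below is a simplified illustration under stated assumptions: coordinates are 0-based on a single contig, the window is clipped to the contig ends, and a match is accepted when the matched length is at least the requested proportion of the gene length. The function names are hypothetical, and the exact comparison Panaroo uses is not stated in the text.

```python
def search_window(gene_start, gene_end, contig_length, search_radius=5000):
    """Return the (start, end) region around geneA to search for geneB."""
    start = max(0, gene_start - search_radius)
    end = min(contig_length, gene_end + search_radius)
    return start, end


def enough_matched(matched_length, gene_length, refind_prop_match):
    """Check whether a partial hit covers enough of the missing gene."""
    return matched_length >= refind_prop_match * gene_length
```

For a gene at positions 10000 to 11000 on a 100000 bp contig, the default radius gives a search region of 5000 to 16000. Near a contig edge the window is truncated, which is where accepting partial matches becomes useful.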
docs/gettingstarted/quickstart.md
+3 -3 (3 additions & 3 deletions)
@@ -10,7 +10,7 @@ mkdir results
panaroo -i *.gff -o results --clean-mode strict
```
-
If you are using GFFs from RefSeq they can occasionally include annotations that do not conform the expected. This is usually due to a premature stop codon or a gene of invalid length. By default Panaroo will fail to parse these annotations. However, you can set Panaroo to ignore invalid annotaion by enabling the `remove-invalid-genes` flag
+
If you are using GFFs from RefSeq they can occasionally include annotations that do not conform to the expected format. This is usually due to a premature stop codon or a gene of invalid length. By default Panaroo will fail to parse these annotations. However, you can set Panaroo to ignore invalid annotations by enabling the `remove-invalid-genes` flag
By default Panaroo runs in its strictest (most conservative) mode. We have found that for most use cases this removes potential sources of contamination and error whilst retaining the majority of genes researchers are interested in.
-
Very rare plasmids are difficult to distguish from contamination. Thus, if you are interested in retaining such plasmids at the expense of added contamination we recommend running panaroo using its most sensitive mode
+
Very rare plasmids are difficult to distinguish from contamination. Thus, if you are interested in retaining such plasmids at the expense of added contamination we recommend running panaroo using its most sensitive mode
-
Panaroo now supports multiple input formats. To use non-standard GFF3 files you must profide the input file as a list in a text file (one per line). Seperate GFF and FASTA files can be provided per isolate by providing each file delimeted by a space or a tab. Genbank file formats are also supported with extensions '.gbk', '.gb' or '.gbff'. These must compliant with Genbank/ENA/DDJB. This can be forced in Prokka by specifying the `--compliance` parameter.
+
Panaroo now supports multiple input formats. To use non-standard GFF3 files you must provide the input files as a list in a text file (one per line). Separate GFF and FASTA files can be provided per isolate by listing both files on one line, delimited by a space or a tab. Genbank file formats are also supported with extensions '.gbk', '.gb' or '.gbff'. These must be compliant with Genbank/ENA/DDBJ. This can be forced in Prokka by specifying the `--compliance` parameter.
NOTE: Some annotation files, such as those from RefSeq, include annotations that break some of the assumptions of Panaroo. These include gene annotations of unusual length or those with premature stop codons (pseudogenes). By default Panaroo will throw an error if it encounters these annotations. You can automatically filter out such annotations by calling panaroo with the `--remove-invalid-genes` flag.
docs/merge/merge_graphs.md
+1 -1 (1 addition & 1 deletion)
@@ -1,6 +1,6 @@
# Merge Panaroo graphs
-
It is possible to merge pangenome graphs generated by independent runs of the Panaroo algorithm. This is particularly useful when dealing with very large or diverse datasets. The Panaroo algorithm assumes that a dataset is not overly diverse and thus results can be improved by running the algorithm on seperate clusters of genomes independently before merging the resulting graphs.
+
It is possible to merge pangenome graphs generated by independent runs of the Panaroo algorithm. This is particularly useful when dealing with very large or diverse datasets. The Panaroo algorithm assumes that a dataset is not overly diverse and thus results can be improved by running the algorithm on separate clusters of genomes independently before merging the resulting graphs.
This approach can also be used to compare the pangenomes of different species or lineages.
docs/post/pansize.md
+1 -1 (1 addition & 1 deletion)
@@ -6,7 +6,7 @@ These model based approaches are preferable to the common accumulation curves of
7
9
-
The IMG model allows for gene gain from an unbounded reservoir of new genes, and gene loss. As the reservoir is unbounded, the same gene gain event can only occur once. This might represent horizontal gene transfer from a diverged taxa with gene loss representing the conversion of genes to pseudogenes or deletion in reproduction. This model is described in Baemdicker et al. 2012 and Collins et al. 2012.
9
+
The IMG model allows for gene gain from an unbounded reservoir of new genes, and gene loss. As the reservoir is unbounded, the same gene gain event can only occur once. This might represent horizontal gene transfer from a diverged taxa with gene loss representing the conversion of genes to pseudogenes or deletion in reproduction. This model is described in Baumdicker et al. 2012 and Collins et al. 2012.
To estimate the parameters of this model, a dated phylogeny based on the core genome is required. Such phylogenies can be produced using [BEAST](https://www.beast2.org/) or by combining faster methods such as [IQ-TREE](http://www.iqtree.org/) and [BactDating](https://xavierdidelot.github.io/BactDating/)