Merge branch 'devel' of github.com:gtonkinhill/panaroo into devel

gtonkinhill · gtonkinhill · commit 3d9442fc00d1 · 2023-02-28T02:15:19.000+01:00
diff --git a/docs/gettingstarted/citation.md b/docs/gettingstarted/citation.md
@@ -5,4 +5,4 @@ If you use Panaroo please cite:
 **Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. 2020. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21:180.**
 
 
-We include some implementations of other algorithms in the post processing scripts which should be cited seperately. The citations for these scripts are given in the relevant section of the documentation.
+We include some implementations of other algorithms in the post processing scripts which should be cited separately. The citations for these scripts are given in the relevant section of the documentation.
diff --git a/docs/gettingstarted/output.md b/docs/gettingstarted/output.md
@@ -4,7 +4,7 @@
 
 A csv file describing which gene is in which sample. If a gene cluster is present in a sample, the sequence name of the representative for that sample is given in the matrix. The corresponding DNA and protein sequence can then be matched to those found in the `combined_DNA_CDS.fasta` and `combined_protein_CDS.fasta` files. The format is the same as that given by [Roary](https://sanger-pathogens.github.io/Roary/).
 
-Annotations that have been merged will be seperated by a semicolon. Refound genes that inlcude a stop codon will have '_stop' appended to the end of the gene name. This indicates they could be a potential pseudo gene. Gene within a cluster that have an unusual length will have '_len' appended to the gene name.
+Annotations that have been merged will be separated by a semicolon. Refound genes that include a stop codon will have '_stop' appended to the end of the gene name. This indicates they could be pseudogenes. Genes within a cluster that have an unusual length will have '_len' appended to the gene name.
 
 ### gene_presence_absence.Rtab
 
@@ -16,15 +16,15 @@ The final pan-genome graph generated by Panaroo. This can be viewed easily using
 
 ### struct_presence_absence.csv
 
-A csv file which lists the presence and abscence of different genomic rearrangement events. The genes involved in each event are listed in the respective column names of the csv. The thresholds for calling these events can be changed by adjusting the `--min_edge_support_sv` parameter when calling panaroo.
+A csv file which lists the presence and absence of different genomic rearrangement events. The genes involved in each event are listed in the respective column names of the csv. The thresholds for calling these events can be changed by adjusting the `--min_edge_support_sv` parameter when calling Panaroo.
 
 ### pan_genome_reference.fa
 
-This is a similar output to that produced by [Roary](https://sanger-pathogens.github.io/Roary/). It creates a linear reference genome of all the genes found in the dataset. The order of the genes in this reference are not significant.
+This is a similar output to that produced by [Roary](https://sanger-pathogens.github.io/Roary/). It creates a linear reference genome of all the genes found in the dataset. The order of the genes in this reference is not significant. NOTE: to avoid issues with the multi-mapping of reads, paralogous gene clusters will only be represented once in this reference.
 
 ### gene_data.csv
 
-This is a very large file mainly used internally in the program. It links each gene sequnece and annotation to the internal representations used. It can be useful in interpreting some of the output especially the 'final_graph.gml' file.
+This is a very large file mainly used internally in the program. It links each gene sequence and annotation to the internal representations used. It can be useful in interpreting some of the output especially the 'final_graph.gml' file.
 
 ### combined_DNA_CDS.fasta
 
@@ -33,3 +33,11 @@ This is a fasta file which includes all nucleotide sequence for both the annotat
 ### combined_protein_CDS.fasta
 
 Similar to the `combined_DNA_CDS.fasta` file, this is a fasta file which includes all protein sequence for both the annotated genes and those refound by the program. The gene names are the internal ones used by Panaroo. These can be translated to the original names using the 'gene_data.csv' file.
+
+### core_gene_alignment.aln
+
+An alignment of genes present in at least the fraction of genomes specified by the `--core_threshold` parameter (default=0.95). Currently, in cases where a gene is fragmented only the longer fragment will appear in this output.
+
+### core_gene_alignment_filtered.aln
+
+This alignment is recommended for building core genome phylogenies. It is a filtered version of the core genome alignment. Additional genes are removed if they exceed the Block Mapping and Gathering with Entropy (BMGE) filter. This is set using the `--core_entropy_filter` parameter. By default this automatically adapts to each dataset and identifies outlying genes using Tukey's outlier test (recommended).
diff --git a/docs/gettingstarted/params.md b/docs/gettingstarted/params.md
@@ -21,23 +21,23 @@ The Panaroo algorithm initially performs a conservative clustering step before c
 
 Thus we recommend using the defaults for `--threshold` (0.98) and `--len_dif_percent` (0.98).
 
-If you wish to adjust the level at which Panaroo colapses genes into putitive families we suggest changing the family sequence identity level (default 0.7). Thus to run Panaroo using a more relaxed threshold of 50% identity you could run
+If you wish to adjust the level at which Panaroo collapses genes into putative families we suggest changing the family sequence identity level (default 0.7). Thus to run Panaroo using a more relaxed threshold of 50% identity you could run
 
 ```
 panaroo -i *.gff -o ./results/ --clean-mode strict -f 0.5
 ```
 
 #### Paralogs
 
-Panaroo splits paralogs into seperate clusters by default. Merging paralogs can be enabled by running Panraoo as
+Panaroo splits paralogs into separate clusters by default. Merging paralogs can be enabled by running Panaroo as
 
 ```
 panaroo -i *.gff -o ./results/  --clean-mode strict --merge_paralogs
 ```
 
 #### Refinding Genes
 
-In order to identify genes that have been missed by annotation software, Panaroo incoporates a refinding step. Suppose two clusters geneA and geneB are adjacent in the Panaroo pangenome graph. If geneA is present in a genome but its neighbour (geneB) is not then Panaroo searches the sequence surrounding geneA for the presence of geneB. The radius of this search in nucleotides is controlled by `--search_radius`, with the default being 5000.  
+In order to identify genes that have been missed by annotation software, Panaroo incorporates a refinding step. Suppose two clusters geneA and geneB are adjacent in the Panaroo pangenome graph. If geneA is present in a genome but its neighbour (geneB) is not then Panaroo searches the sequence surrounding geneA for the presence of geneB. The radius of this search in nucleotides is controlled by `--search_radius`, with the default being 5000.  
 
 As such missing genes are often the results of assembly fragmentation, the refinding step only requires that a proportion of the missing gene is located. This proportion can be controlled using the `--refind_prop_match` parameter.
 
@@ -65,15 +65,17 @@ usage: panaroo [-h] -i INPUT_FILES [INPUT_FILES ...] -o OUTPUT_DIR
                [--high_var_flag CYCLE_THRESHOLD_MIN]
                [--min_edge_support_sv MIN_EDGE_SUPPORT_SV]
                [--all_seq_in_graph] [--no_clean_edges] [-a {core,pan}]
-               [--aligner {prank,clustal,mafft}] [--core_threshold CORE]
-               [-t N_CPU] [--quiet] [--version]
+               [--aligner {prank,clustal,mafft}] [--codons]
+               [--core_threshold CORE] [--core_entropy_filter HC_THRESHOLD]
+               [-t N_CPU] [--codon-table TABLE] [--quiet] [--version]
 
 panaroo: an updated pipeline for pangenome investigation
 
 optional arguments:
   -h, --help            show this help message and exit
   -t N_CPU, --threads N_CPU
                         number of threads to use (default=1)
+  --codon-table TABLE   the codon table to use for translation (default=11)
   --quiet               suppress additional output
   --version             show program's version number and exit
 
@@ -117,7 +119,7 @@ Mode:
 
 Matching:
   -c ID, --threshold ID
-                        sequence identity threshold (default=0.95)
+                        sequence identity threshold (default=0.98)
   -f FAMILY_THRESHOLD, --family_threshold FAMILY_THRESHOLD
                         protein family sequence identity threshold
                         (default=0.7)
@@ -171,8 +173,14 @@ Gene alignment:
   --aligner {prank,clustal,mafft}
                         Specify an aligner. Options:'prank', 'clustal', and
                         default: 'mafft'
+  --codons              Generate codon alignments by aligning sequences at the
+                        protein level
   --core_threshold CORE
                         Core-genome sample threshold (default=0.95)
+  --core_entropy_filter HC_THRESHOLD
+                        Manually set the Block Mapping and Gathering with
+                        Entropy (BMGE) filter. Can be between 0.0 and 1.0. By
+                        default this is set using the Tukey outlier method.
 ```
 
 #### Default Parameters
diff --git a/docs/gettingstarted/quickstart.md b/docs/gettingstarted/quickstart.md
@@ -10,7 +10,7 @@ mkdir results
 panaroo -i *.gff -o results --clean-mode strict
 ```
 
-If you are using GFFs from RefSeq they can occasionally include annotations that do not conform the expected. This is usually due to a premature stop codon or a gene of invalid length. By default Panaroo will fail to parse these annotations. However, you can set Panaroo to ignore invalid annotaion by enabling the `remove-invalid-genes` flag 
+If you are using GFFs from RefSeq they can occasionally include annotations that do not conform to the expected format. This is usually due to a premature stop codon or a gene of invalid length. By default Panaroo will fail to parse these annotations. However, you can set Panaroo to ignore invalid annotations by enabling the `--remove-invalid-genes` flag
 
 ```
 panaroo -i *.gff -o results --clean-mode strict --remove-invalid-genes
@@ -20,15 +20,15 @@ panaroo -i *.gff -o results --clean-mode strict --remove-invalid-genes
 
 By default Panaroo runs in its strictest (most conservative) mode. We have found that for most use cases this removes potential sources of contamination and error whilst retaining the majority of genes researchers are interested in. 
 
-Very rare plasmids are difficult to distguish from contamination. Thus, if you are interested in retaining such plasmids at the expense of added contamination we recommend running panaroo using its most sensitive mode
+Very rare plasmids are difficult to distinguish from contamination. Thus, if you are interested in retaining such plasmids at the expense of added contamination we recommend running panaroo using its most sensitive mode
 
 ```
 panaroo -i *.gff -o results --clean-mode sensitive
 ```
 
 ## Different input formats
 
-Panaroo now supports multiple input formats. To use non-standard GFF3 files you must profide the input file as a list in a text file (one per line). Seperate GFF and FASTA files can be provided per isolate by providing each file delimeted by a space or a tab. Genbank file formats are also supported with extensions '.gbk', '.gb' or '.gbff'. These must compliant with Genbank/ENA/DDJB. This can be forced in Prokka by specifying the `--compliance` parameter.
+Panaroo now supports multiple input formats. To use non-standard GFF3 files you must provide the input files as a list in a text file (one per line). Separate GFF and FASTA files can be provided per isolate by providing each file delimited by a space or a tab. Genbank file formats are also supported with extensions '.gbk', '.gb' or '.gbff'. These must be compliant with Genbank/ENA/DDBJ. This can be forced in Prokka by specifying the `--compliance` parameter.
 
 NOTE: Some annotations file such as those from RefSeq include annotations that break some of the assumptions of Panaroo. These include gene annotations of unusual length or that with premature stop codons (pseudogenes). By default Panaroo will throw an error if it encounters these annotations. You can automatically filter out such annotations by calling panaroo with the `--remove-invalid-genes` flag.
 
diff --git a/docs/merge/merge_graphs.md b/docs/merge/merge_graphs.md
@@ -1,6 +1,6 @@
 # Merge Panaroo graphs
 
-It is possible to merge pangenome graphs generated by independent runs of the Panaroo algorithm. This is particularly useful when dealing with very large or diverse datasets. The Panaroo algorithm assumes that a dataset is not overly diverse and thus results can be improved by running the algorithm on seperate clusters of genomes independently before merging the resulting graphs.
+It is possible to merge pangenome graphs generated by independent runs of the Panaroo algorithm. This is particularly useful when dealing with very large or diverse datasets. The Panaroo algorithm assumes that a dataset is not overly diverse and thus results can be improved by running the algorithm on separate clusters of genomes independently before merging the resulting graphs.
 
 This approach can also be used to compare the pangenomes of different species or lineages.
 
diff --git a/docs/post/pansize.md b/docs/post/pansize.md
@@ -6,7 +6,7 @@ These model based approaches are preferable to the common accumulation curves of
 
 ### Infinitely Many Genes model
 
-The IMG model allows for gene gain from an unbounded reservoir of new genes, and gene loss. As the reservoir is unbounded, the same gene gain event can only occur once. This might represent horizontal gene transfer from a diverged taxa with gene loss representing the conversion of genes to pseudogenes or deletion in reproduction. This model is described in Baemdicker et al. 2012 and Collins et al. 2012.
+The IMG model allows for gene gain from an unbounded reservoir of new genes, and gene loss. As the reservoir is unbounded, the same gene gain event can only occur once. This might represent horizontal gene transfer from a diverged taxon with gene loss representing the conversion of genes to pseudogenes or deletion in reproduction. This model is described in Baumdicker et al. 2012 and Collins et al. 2012.
 
 To estimate the parameters of this model, a dated phylogeny based on the core genome is required. Such phylogenies can be produced using [BEAST](https://www.beast2.org/) or by combining faster methods such as [IQ-TREE](http://www.iqtree.org/) and [BactDating](https://xavierdidelot.github.io/BactDating/)
 
diff --git a/panaroo/generate_output.py b/panaroo/generate_output.py
@@ -397,7 +397,7 @@ def concatenate_core_genome_alignments(core_names, output_dir, hc_threshold):
     if hc_threshold is None:
         allh = np.array([gene[3] for gene in gene_alignments])
         q = np.quantile(allh, [0.25,0.75])
-        hc_threshold = q[1] + 1.5*(q[1]-q[0])
+        hc_threshold = max(0.01, q[1] + 1.5*(q[1]-q[0]))
         print(f"Entropy threshold automatically set to {hc_threshold}.")
 
     isolate_aln = []
diff --git a/setup.py b/setup.py
@@ -38,7 +38,7 @@ def find_version(*file_paths):
     url="https://github.com/gtonkinhill/panaroo",
     install_requires=[
         'networkx', 'gffutils', 'BioPython', 'joblib', 'tqdm', 'edlib',
-        'scipy', 'numpy', 'matplotlib', 'sklearn', 'plotly', 'dendropy',
+        'scipy', 'numpy', 'matplotlib', 'scikit-learn', 'plotly', 'dendropy',
         'intbitset', 'biocode'
     ],
     python_requires='>=3.6.0',
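
Note on the `panaroo/generate_output.py` hunk: the automatic entropy threshold is the upper Tukey fence (Q3 + 1.5 × IQR) of the per-gene entropy values, and this commit clamps it to a minimum of 0.01. The sketch below isolates that calculation so the effect of the floor is easy to see. The function name `auto_entropy_threshold`, the `floor` argument, and the sample values are hypothetical and are not part of Panaroo. The reading that the floor guards against a fence that collapses to zero when the alignments are nearly invariant is an assumption; the commit itself does not state the motivation.

```python
import numpy as np


def auto_entropy_threshold(entropies, floor=0.01):
    """Upper Tukey fence (Q3 + 1.5 * IQR) of the entropy values, never below `floor`.

    Hypothetical helper mirroring the changed line in the diff; not Panaroo's API.
    """
    q1, q3 = np.quantile(np.asarray(entropies, dtype=float), [0.25, 0.75])
    return max(floor, q3 + 1.5 * (q3 - q1))


# Made-up entropy values for illustration only.
invariant = [0.0, 0.0, 0.0, 0.0]          # fence would be 0.0 without the floor
varied = [0.02, 0.03, 0.04, 0.05, 0.40]   # fence sits above the floor

print(auto_entropy_threshold(invariant))
print(auto_entropy_threshold(varied))
```

With the invariant input the unclamped fence is 0.0, so the floor of 0.01 is returned; with the varied input the fence (about 0.08) exceeds the floor and is returned unchanged.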
