Hello!
I'm running Panaroo on a bunch of bacterial genomes (different strains of the same species) and I noticed a few things in the output that I cannot explain based on what I read in the documentation. I hope you can help me!
Here is the command I ran:
panaroo -i Seqs/*.gff3 -o Output_aligned/ --clean-mode moderate -a core --aligner mafft --core_threshold 1.00 --remove-invalid-genes --refind-mode off -t 16
As you can see, I did not use --merge_paralogs, so I expected multi-copy genes to be split into different gene clusters. However, when I inspected the aligned core gene sequences, I noticed that several of these alignments contained more sequences than the number of genomes I used. In the most extreme case, the alignment file contained 90 sequences although I only included 29 genomes in the analysis. I re-ran the same analysis with --aligner none to also obtain the unaligned sequences for this gene cluster. In this unaligned sequence file, I could see that some genomes have one long sequence in this cluster (9800 bp), while other genomes have 5 or 6 short sequences (180-500 bp long).
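For anyone who wants to check their own clusters, the per-genome counts and lengths can be tallied with a short script. This is only a minimal sketch: it assumes each FASTA header has the form `genome;gene_id` (the `;` separator is an assumption and may need adjusting), and the function names `summarize_cluster` and `multi_copy_genomes` are made up for illustration; they are not part of Panaroo.

```python
from collections import defaultdict


def summarize_cluster(fasta_text, sep=";"):
    """Return {genome: [sequence lengths]} for one gene-cluster FASTA.

    Assumption: the genome name is the part of the header before `sep`.
    Gap characters ('-') are ignored, so aligned files work too.
    """
    lengths = defaultdict(list)
    genome = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: take the genome name from the header.
            genome = line[1:].split(sep, 1)[0]
            lengths[genome].append(0)
        elif genome is not None:
            # Sequence line: add its ungapped length to the current record.
            lengths[genome][-1] += len(line.replace("-", ""))
    return dict(lengths)


def multi_copy_genomes(lengths):
    """Return the sorted genomes that contribute more than one sequence."""
    return sorted(g for g, seqs in lengths.items() if len(seqs) > 1)
```

Reading a cluster file into a string and passing it to `summarize_cluster` gives the list of sequence lengths per genome, and `multi_copy_genomes` then lists the genomes that appear more than once in that cluster.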
I have three questions regarding this observation:
- Is this behaviour intended or is there an error in the gene clustering step? Shouldn't multiple gene copies from the same genome be split into different clusters?
- I'm astonished by the length difference between the sequences in this gene cluster. I did not modify the default parameters, but I thought that big differences in size would be avoided by the len_dif_percent parameter (default: 0.98)?
- Despite the multiple gene copies from the same genome in some gene clusters, the global core_gene_alignment.aln file contains 29 sequences, i.e. one for each analysed genome. That makes me wonder how Panaroo dealt with the multiple gene copies in the single-gene alignments to produce the global alignment. Is the global alignment produced independently from the single-gene alignments, or is there an additional step to exclude these multiple copies?
Thanks a lot for looking into this so that I can understand if I need to adjust certain parameters.
Cheers!