Question regarding paralogs #364

@jessydit

Hello!

I'm running Panaroo on a bunch of bacterial genomes (different strains of the same species) and I noticed a few things in the output that I cannot explain based on what I read in the documentation. I hope you can help me!

Here is the command I ran:
panaroo -i Seqs/*.gff3 -o Output_aligned/ --clean-mode moderate -a core --aligner mafft --core_threshold 1.00 --remove-invalid-genes --refind-mode off -t 16

As you can see, I did not use --merge_paralogs, so I expected multi-copy genes to be split into different gene clusters. However, when I inspected the aligned core gene sequences, I noticed that several of these alignments contained more sequences than the number of genomes I used. In the most extreme case, the alignment file contained 90 sequences although I only included 29 genomes in the analysis. I re-ran the same analysis with --aligner none to also obtain the unaligned sequences for this gene cluster. In this unaligned sequence file, I could see that some genomes have one long sequence in this cluster (9800 bp), while other genomes have 5 or 6 short sequences (180-500 bp long).
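For anyone who wants to reproduce this check, here is a minimal sketch of how the sequences per genome in a single cluster file can be counted. It assumes that each FASTA header starts with the genome name followed by a `;` separator (for example `>genomeA;gene_1`); that header layout is an assumption, so adjust `sep` to match your files. The function names (`parse_fasta`, `lengths_per_genome`, `multi_copy_genomes`) and the example sequences are hypothetical and not part of Panaroo.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def lengths_per_genome(text, sep=";"):
    """Map each genome name to the ungapped lengths of its sequences.

    Assumption: the genome name is the part of the header before `sep`.
    """
    lengths = defaultdict(list)
    for header, seq in parse_fasta(text):
        genome = header.split(sep, 1)[0]
        # Drop alignment gap characters so aligned files report true lengths.
        lengths[genome].append(len(seq.replace("-", "")))
    return dict(lengths)


def multi_copy_genomes(text, sep=";"):
    """Return only the genomes that contribute more than one sequence."""
    return {g: l for g, l in lengths_per_genome(text, sep).items() if len(l) > 1}


# Made-up example data, not taken from the real cluster.
example = """>genomeA;gene_1
ATGAAACCCGGG
>genomeB;gene_7
ATGAAA
>genomeB;gene_8
CCCGGG---
"""

if __name__ == "__main__":
    for genome, lens in sorted(lengths_per_genome(example).items()):
        print(genome, len(lens), lens)
```

To run it on a real file, read the file contents into a string and pass that to `multi_copy_genomes` instead of `example`.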

I have three questions regarding this observation:

  1. Is this behaviour intended or is there an error in the gene clustering step? Shouldn't multiple gene copies from the same genome be split into different clusters?
  2. I'm astonished by the length difference between the sequences in this gene cluster. I did not modify the default parameters, but I thought that big differences in size would be avoided by the len_dif_percent parameter (default: 0.98)?
  3. Despite the multiple gene copies from the same genome in some gene clusters, the global core_gene_alignment.aln file contains 29 sequences, i.e. one for each analysed genome. That makes me wonder how Panaroo dealt with the multiple gene copies in the single-gene alignments to produce the global alignment. Is the global alignment produced independently from the single-gene alignments, or is there an additional step to exclude these multiple copies?

Thanks a lot for looking into this so that I can understand if I need to adjust certain parameters.
Cheers!
