Sequence order in the input and parallelism affect reproducibility #228

@apcamargo

I ran an experiment to evaluate the variance in phylogenetic trees reconstructed from a gene. Two files containing identical MSAs, but with the sequences in a different order, produced different trees, as assessed with tree distance metrics and visual inspection. Setting a constant seed (via --seed) did not help: the two files still yielded distinct trees. Moreover, even with the same input file (and therefore the same sequence order) and the same seed, the resulting trees still differed between runs.
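For anyone who wants to repeat the comparison, here is a minimal sketch of one possible tree distance. The metric actually used is not named above, so the Robinson-Foulds distance is an assumption here, and the function names are hypothetical. It treats trees as unrooted, ignores branch lengths and support values, and does not handle quoted labels.

```python
import re

def bipartitions(newick):
    """Return the non-trivial splits of a Newick tree as a set of frozensets."""
    # Drop whitespace, then split into structural characters and labels.
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", newick))
    stack, clades, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in ",;" and prev != ")":
            # A label right after ")" is an internal label or branch length; skip it.
            name = tok.split(":")[0]
            leaves.add(name)
            stack[-1].add(name)
        prev = tok
    ref = min(leaves)  # fixed reference leaf, so each split has one canonical side
    splits = set()
    for clade in clades:
        side = clade if ref not in clade else frozenset(leaves - clade)
        if len(side) >= 2 and len(leaves) - len(side) >= 2:
            splits.add(side)
    return splits

def rf_distance(newick_a, newick_b):
    """Count the splits present in exactly one of the two trees."""
    return len(bipartitions(newick_a) ^ bipartitions(newick_b))

if __name__ == "__main__":
    same = rf_distance("((A,B),(C,D),E);", "(E,(D,C),(B,A));")
    diff = rf_distance("((A,B),(C,D),E);", "((A,C),(B,D),E);")
    print(same, diff)
```

A distance of 0 means the two topologies are identical regardless of how the taxa are ordered in the Newick string, which is the property that matters when comparing runs on shuffled inputs.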

The same substitution model was used across all tests:

iqtree2 -T 8 -s test_msa.afa -m WAG+G4 --prefix test_tree --nstop 50 --seed 8787

With a single thread (-nt 1), identical input produced identical trees. However, shuffling the order of the sequences in the input FASTA still led to distinct trees.
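A minimal sketch of how a shuffled copy of the alignment could be produced for this kind of test. The function names and the shuffle seed are hypothetical, and this is not necessarily how the files above were generated. Only the record order changes; headers and aligned sequences are left untouched.

```python
import random

def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line.strip() and header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def shuffle_fasta(text, seed):
    """Return FASTA text with the same records in a seeded random order."""
    records = read_fasta(text)
    random.Random(seed).shuffle(records)  # local RNG, so the shuffle is repeatable
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

if __name__ == "__main__":
    example = ">s1\nMK-LV\n>s2\nMKALV\n>s3\nMR-LV\n>s4\nMRALI\n"
    print(shuffle_fasta(example, seed=1), end="")
```

Writing the output to a new file and passing it to the same iqtree2 command with -s keeps the alignment content constant, so any difference in the resulting trees comes from input order alone.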

I could not find any documentation indicating that using multiple threads would make results irreproducible or that changing the sequence order in the input file would lead to different results, regardless of the seed and number of threads. Am I missing something?

Version:

IQ-TREE multicore version 2.3.4 COVID-edition for Linux x86 64-bit built Apr 26 2024
