
Conversation

@RaphaelBouvet

Hi,
Thank you for developing and releasing Tranception :)

While testing the model, I noticed that MSA_processing in tranception/utils/msa_utils.py becomes a limiting step when the MSA is large.

This PR adds a Fast_MSA_processing class with improved speed 🔥 at the cost of higher memory usage.
For example, for an MSA with 21k sequences:

    fast processing: 13 sec
    base processing: 472 sec

Instead of the sequence-by-sequence comparisons in the original code, I vectorize the calculation so many comparisons run at once.
For large MSAs, doing all comparisons in a single pass is not feasible, so I split the work into sub-arrays and process them one chunk at a time.
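To make the idea concrete, here is a minimal NumPy sketch of this kind of chunked, vectorized sequence re-weighting. This is not the code from the PR: the function name `compute_sequence_weights_chunked`, the 0.8 identity threshold, and the `chunk_size` default are illustrative assumptions, not the names or defaults used in tranception/utils/msa_utils.py.

```python
import numpy as np

def compute_sequence_weights_chunked(msa_array, identity_threshold=0.8, chunk_size=200):
    """Chunked, vectorized sequence re-weighting (illustrative sketch).

    msa_array: (num_seqs, seq_len) integer-encoded alignment, with gaps
    mapped to a dedicated index. identity_threshold and chunk_size are
    assumed defaults for illustration only.
    """
    num_seqs, seq_len = msa_array.shape
    num_neighbors = np.zeros(num_seqs, dtype=np.int64)

    # Compare a chunk of sequences against the full alignment in one
    # broadcasted operation instead of looping sequence by sequence.
    # Peak memory scales with chunk_size * num_seqs * seq_len booleans,
    # so chunk_size trades speed against RAM.
    for start in range(0, num_seqs, chunk_size):
        chunk = msa_array[start:start + chunk_size]              # (c, L)
        matches = chunk[:, None, :] == msa_array[None, :, :]     # (c, N, L) bools
        identity = matches.mean(axis=2)                          # pairwise identity, (c, N)
        num_neighbors[start:start + chunk_size] = (identity >= identity_threshold).sum(axis=1)

    # Inverse-neighborhood-size weights; each sequence counts itself,
    # so num_neighbors is always >= 1.
    return 1.0 / num_neighbors

if __name__ == "__main__":
    # Toy example: 2,000 random sequences of length 100 over a 21-letter alphabet.
    rng = np.random.default_rng(0)
    toy_msa = rng.integers(0, 21, size=(2000, 100), dtype=np.int8)
    weights = compute_sequence_weights_chunked(toy_msa, chunk_size=200)
    print(weights.shape, float(weights.sum()))
```

The key design choice is the same as described above: the chunk dimension bounds the size of the intermediate comparison array, so memory can be tuned by adjusting `chunk_size` while all comparisons within a chunk stay vectorized.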

  • In my tests, the resulting weights are identical to the pre-released weights.
  • Memory usage can be adjusted manually by changing the size of the sub-arrays (this could perhaps be set automatically based on the user's available RAM).
  • The code might not work if there are empty sequences in the MSA (not tested).

I am sure there is a better/faster way to do this calculation, but this method worked well for me.

Do not hesitate to reach out if you have any questions.
Best wishes,
Raphaël
