
The FM-index/LC-hash implementation cannot handle lower case and capital letters at the same time #1

@lisanhu

Description


The current implementation does not convert the DNA sequence to a binary representation. The suffix array building module reads the input file as plain text, so 'A' and 'a' are distinct symbols in the alphabet. When the sequences contain both 'A' and 'a', the resulting suffix array is not the one we want: to match queries regardless of case, upper and lower case letters must be treated as the same symbol.
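A small illustration of the effect, assuming a naive suffix array built by sorting suffixes as plain text. This is not the project's suffix array module; `suffix_array` and `occurrences` are hypothetical helpers written only for this example.

```python
import bisect

def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Comparison is by code point, so 'A' (65) and 'a' (97) are different symbols.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, pattern):
    # Binary search over the sorted suffixes, each truncated to the pattern length.
    sa = suffix_array(text)
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

mixed = "ACGTacgt"
print(occurrences(mixed, "ACG"))          # the lower-case copy is missed
print(occurrences(mixed.upper(), "ACG"))  # both copies are found
```

With the mixed-case text, the query only matches the upper-case copy; after normalizing the case, both copies are reported.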

This problem may not affect the results if we apply a workaround that converts all letters to upper case in the indexing phase.
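A minimal sketch of such a workaround, assuming the input is a FASTA file whose header lines start with '>'. The function name `normalize_fasta_lines` is hypothetical and not part of the current code base.

```python
def normalize_fasta_lines(lines):
    """Yield FASTA lines with sequence lines upper-cased.

    Header lines (starting with '>') are passed through unchanged,
    so sequence names keep their original spelling.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()

example = [">chr1 test\n", "acgtNn\n", "ACgt\n"]
print("".join(normalize_fasta_lines(example)), end="")
```

Queries would need the same normalization before lookup, otherwise a lower-case query would still fail to match the upper-cased index.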

To completely fix this issue, we need to do at least the following tasks:

  • Find a binary representation for the DNA sequences; one candidate is the ".2bit" format defined here
  • In the intermediate files, convert the "fasta" file to ".2bit" format
  • Write our own suffix array building module that reads the contents in ".2bit" format
  • Update the alignment part of FM-index/LC-hash for the previous changes
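The core idea behind the first two tasks can be sketched as follows. This only shows packing four bases per byte; it is not a reader or writer for the full ".2bit" file format, which also carries headers and extra blocks that are not modeled here. The code assignment (T=0, C=1, A=2, G=3, first base in the high bits) and the names `pack_2bit` and `unpack_2bit` are assumptions made for the example.

```python
# Assumed 2-bit codes; upper and lower case map to the same code.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"

def pack_2bit(seq):
    """Pack a DNA string into bytes, 4 bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        try:
            code = CODE[ch.upper()]
        except KeyError:
            # Symbols such as 'N' need separate handling in a real format.
            raise ValueError(f"unsupported base {ch!r} at position {i}")
        out[i // 4] |= code << (6 - 2 * (i % 4))
    return bytes(out)

def unpack_2bit(data, n):
    """Recover the first n bases (always upper case) from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(n))

packed = pack_2bit("acGTa")
print(unpack_2bit(packed, 5))
```

Because case is dropped at packing time, a suffix array built over the packed codes would see 'A' and 'a' as the same symbol, which is the property this issue asks for. The sequence length has to be stored separately, since the last byte may be only partly filled.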

Because the tasks above are not easy to accomplish and a simple patch avoids the issue, there is no plan to fix it for now, but it should be fixed sometime in the future.

Metadata


Assignees

No one assigned

    Labels

    enhancement (New feature or request), wontfix (This will not be worked on)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
