Description
The current implementation does not convert the DNA sequence to a binary representation. The suffix array building module reads the input file as plain text, so 'A' and 'a' are distinct symbols in the alphabet. When the genes contain both 'A' and 'a', the resulting suffix array is not what we want: to match queries regardless of case, upper and lower case letters must be treated as the same symbol.
As a workaround, all letters are converted to upper case in the indexing phase, so this problem should not currently affect the results.
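A minimal sketch of why the workaround is enough, assuming a naive sort-based suffix array rather than the module this project actually uses. The names `normalize_sequence` and `naive_suffix_array` are hypothetical and only serve the illustration.

```python
def normalize_sequence(seq: str) -> str:
    """Fold a DNA sequence to upper case so 'a' and 'A' are one symbol."""
    return seq.upper()


def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Quadratic-memory illustration only; not suitable for genome-sized input.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


mixed = "aCAc"

# Built from plain text, the order depends on letter case,
# because upper case letters sort before lower case ones.
raw_sa = naive_suffix_array(mixed)

# After folding to upper case, the order is that of "ACAC".
folded_sa = naive_suffix_array(normalize_sequence(mixed))

print(raw_sa)     # [2, 1, 0, 3]
print(folded_sa)  # [2, 0, 3, 1]
```

The two arrays differ for the same underlying sequence, which is the mismatch described above. Folding the case before indexing (and before querying) removes it without changing the file format.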
To completely fix this issue, we need to do at least the following tasks:
- Find a binary representation for the DNA sequences. One candidate is the ".2bit" format defined here
- Convert the "fasta" file to ".2bit" format when producing the intermediate files
- Write our own suffix array building module that reads the contents in ".2bit" format
- Update the alignment part of FM-index/LC-hash to match the changes above
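To give a sense of what a binary representation involves, here is a rough sketch of packing four bases into one byte at two bits per base. It is not a reader or writer for the ".2bit" file format, which also has headers and extra blocks; the code table, the bit order, and the names `pack_2bit` and `unpack_2bit` are assumptions chosen for illustration.

```python
# Assumed 2-bit code table for this sketch.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"  # index in this string == 2-bit code


def pack_2bit(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte.

    Case is folded first, so 'a' and 'A' get the same code.
    The first base of each group goes into the two highest bits.
    """
    seq = seq.upper()
    out = bytearray((len(seq) + 3) // 4)  # round up to whole bytes
    for i, base in enumerate(seq):
        if base not in CODES:
            # Ambiguous bases such as 'N' need separate handling.
            raise ValueError(f"unsupported base {base!r} at position {i}")
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the upper case sequence of the given length."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(length)
    )


packed = pack_2bit("acgtA")
print(packed.hex())            # 9c80
print(unpack_2bit(packed, 5))  # ACGTA
```

Because case is folded during packing, a suffix array built over the packed codes cannot distinguish 'A' from 'a'. The sequence length has to be stored separately, since the last byte may be padded.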
Because these tasks are not easy to accomplish and the issue can be avoided with a simple patch, there is no plan to fix it for now, but it should be fixed sometime in the future.