The FM-index/LC-hash implementation cannot handle lower case and capital letters at the same time

The current implementation does not convert the DNA sequence to a binary representation. The suffix array building module reads the input file as plain text, so 'A' and 'a' are distinct symbols in the alphabet. When a gene contains both 'A' and 'a', the resulting suffix array is therefore not the one we want: to match queries regardless of case, upper and lower case letters have to be treated as the same symbol.
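To illustrate the effect, here is a minimal sketch that uses a naive suffix array (not the project's actual suffix array building module; `naive_suffix_array` is a hypothetical helper). It shows that the same sequence sorts differently when case is mixed than when it is normalized:

```python
def naive_suffix_array(text):
    """Return suffix start positions sorted by plain character order.

    Quadratic-time illustration only; 'A' (65) sorts before 'a' (97),
    so the two are treated as different symbols.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


mixed = "ACaca"

# Case-sensitive order: the lower case suffixes sort after all upper case ones.
print(naive_suffix_array(mixed))          # [0, 1, 4, 2, 3]

# After normalizing the case, the order is the one a case-insensitive
# query expects.
print(naive_suffix_array(mixed.upper()))  # [4, 2, 0, 3, 1]
```

A query such as `ACA` occurs twice in the normalized text but never in the mixed-case text, which is why the index built from plain text misses matches.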

With a workaround that converts all letters to upper case in the indexing phase, this problem should no longer affect the results.
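The workaround could look roughly like the sketch below. It assumes the input is a FASTA file whose header lines start with `>`; the function name `uppercase_fasta` is hypothetical and is not taken from the existing code base:

```python
def uppercase_fasta(lines):
    """Yield FASTA lines with sequence letters converted to upper case.

    Header lines (starting with '>') are passed through unchanged so that
    sequence names keep their original spelling.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()


records = [">chr1 example", "acgtACGTnn", "ggCC"]
for line in uppercase_fasta(records):
    print(line)
```

Queries would need the same normalization before lookup, otherwise a lower case query would still fail to match the upper case index.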

To completely fix this issue, we need to do at least the following tasks:
* Find a binary representation for the DNA sequences; one candidate is the ".2bit" format defined [here](http://genome.ucsc.edu/FAQ/FAQformat.html#format7)
* Convert the "fasta" file to ".2bit" format as one of the intermediate files
* Build our own suffix array building module that reads the contents in ".2bit" format
* Update the alignment part of FM-index/LC-hash to work with the changes above
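As a rough sketch of what the binary representation in the first task might look like, the code below packs four bases into one byte using the base-to-bits mapping of the ".2bit" format (T=00, C=01, A=10, G=11, first base in the most significant bits). It is not a ".2bit" file reader or writer: the file header, the N blocks, and the mask blocks are omitted, and any base other than A, C, G, T raises a `KeyError`. The names `pack_dna` and `unpack_dna` are hypothetical.

```python
# Mapping used by the .2bit format for the packed DNA section.
BASE_TO_BITS = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}
BITS_TO_BASE = "TCAG"


def pack_dna(seq):
    """Pack a DNA string into bytes, 2 bits per base, ignoring case."""
    seq = seq.upper()
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # The first base of each group of four goes into the top two bits.
        shift = 6 - 2 * (i % 4)
        packed[i // 4] |= BASE_TO_BITS[base] << shift
    return bytes(packed)


def unpack_dna(packed, length):
    """Recover an upper case DNA string of the given length from packed bytes."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(length)
    )


# Upper and lower case input produce the same packed bytes.
print(pack_dna("acgt") == pack_dna("ACGT"))  # True
print(unpack_dna(pack_dna("AcGtTg"), 6))     # ACGTTG
```

Because the case is dropped at packing time, a suffix array built over the packed symbols would no longer distinguish 'A' from 'a'. The sequence length has to be stored separately, since the last byte may be padded.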

Because these tasks are not easy to accomplish and the issue can be avoided with a simple patch, there is currently no plan to fix it, but it should be fixed at some point in the future.
