Performance improvement ideas #6

@lisanhu

Although AccSeq does well in some tasks, there is still room to improve its performance. Several ideas came up during the implementation of this project, but most of them are still immature and have not been added to the algorithm.

Ideas:

  1. Reduce the suffix array to an array of gene indices. Each entry in the suffix array represents a location in the original file, so we can create an auxiliary data structure that holds the index of the gene containing that location. The voting procedure can then use this auxiliary structure instead, which should save a lot of memory and make voting faster.
  2. We need a run-length-encoding (RLE) library that supports random access, which would be useful for many tasks. The auxiliary data structure mentioned above, among other things, could be compressed using RLE.
  3. The LC-hash algorithm might be improved further. Can we figure out an equation for "forward search"? If so, LC-hash could help with queries shorter than the hash length. Moreover, if we replace the hash table with one that holds longer hashes for only the most frequently used queries, we may be able to adapt the algorithm to any hash length. The most frequently used queries could be identified during the indexing phase or at runtime. Runtime LC-hash would be the ideal case.
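
To make ideas 1 and 2 more concrete, here is a minimal Python sketch. It is not AccSeq code: the names `build_gene_index`, `RunLengthArray`, and `vote`, the `gene_starts` layout (sorted start offsets of each gene in the concatenated text), and the suffix-array interval `[lo, hi)` passed to the voting step are all assumptions made for illustration. How well RLE compresses the gene-index array depends on how often neighbouring suffix-array entries fall in the same gene, which this sketch does not try to predict.

```python
from bisect import bisect_right
from collections import Counter
from itertools import groupby


def build_gene_index(suffix_array, gene_starts):
    """Map every suffix-array entry (a text position) to a gene index.

    gene_starts is assumed to be sorted ascending with gene_starts[0] == 0,
    so gene g covers positions gene_starts[g] .. gene_starts[g + 1] - 1.
    """
    return [bisect_right(gene_starts, pos) - 1 for pos in suffix_array]


class RunLengthArray:
    """Run-length-encoded sequence with O(log runs) random access."""

    def __init__(self, values):
        self.run_starts = []  # index of the first element of each run
        self.run_values = []  # value shared by every element of that run
        self.length = 0
        for value, group in groupby(values):
            self.run_starts.append(self.length)
            self.run_values.append(value)
            self.length += sum(1 for _ in group)

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search for the last run that starts at or before i.
        return self.run_values[bisect_right(self.run_starts, i) - 1]


def vote(gene_index, lo, hi):
    """Count one vote per suffix-array entry in [lo, hi), keyed by gene."""
    return Counter(gene_index[i] for i in range(lo, hi))


if __name__ == "__main__":
    # Two hypothetical genes, "ACGT" and "ACGA", stored back to back.
    text = "ACGTACGA"
    gene_starts = [0, 4]
    # Naive suffix array, only suitable for a toy example.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    genes = RunLengthArray(build_gene_index(suffix_array, gene_starts))
    # Entries 3 and 4 of the suffix array are the suffixes starting with "CG".
    print(vote(genes, 3, 5))
```

With this layout the voting step never touches text positions: it reads gene indices directly, and each entry needs only enough bits to number the genes rather than to address the whole file. The binary search in `__getitem__` is one simple way to get random access into run-length-encoded data; a production library would likely use a more compact layout.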
