@siddharthab
@siddharthab siddharthab commented Sep 4, 2024

Fixes #31.

Opening a BAM file is expensive because the index has to be read in
full. In paired reads mode, the file was being reopened at every contig
change to iterate over all reads from the previous contig. This is
usually not an issue for genome alignments, but transcriptome
alignments may have ~100k contigs, so the cost of reopening is paid
~100k times.
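
To make the cost pattern concrete, here is a minimal sketch in Python rather than the project's own code. `FakeBam` is a hypothetical stand-in for an indexed BAM reader that only counts how often it is opened, and the reuse variant is an assumption about the general remedy, not a description of this diff.

```python
from itertools import groupby


class FakeBam:
    """Hypothetical stand-in for an indexed BAM reader; it only counts opens."""

    opens = 0

    def __init__(self, reads):
        # A real reader would load the whole index here, which is the slow part.
        FakeBam.opens += 1
        self._reads = reads

    def fetch(self, contig):
        # Return every read on the given contig, in file order.
        return [r for r in self._reads if r[0] == contig]


def reopen_per_contig(reads):
    """Pattern described above: a new reader at every contig change."""
    out = []
    for contig, _ in groupby(reads, key=lambda r: r[0]):
        out.extend(FakeBam(reads).fetch(contig))
    return out


def open_once(reads):
    """Assumed alternative: open one reader and reuse it for every contig."""
    bam = FakeBam(reads)
    out = []
    for contig, _ in groupby(reads, key=lambda r: r[0]):
        out.extend(bam.fetch(contig))
    return out
```

With reads sorted by contig, both functions return the same reads; only the number of opens differs (one per contig versus one in total).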

Ideally, the two-pass mode should not have to read the file again, and
instead just maintain a rolling window of reads in memory.
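
A rough sketch of what such a single-pass approach could look like, in Python for brevity. The read tuples `(name, contig, pos)` and the function name `pair_in_one_pass` are hypothetical, and the sketch assumes both mates map to the same contig so that the window can be flushed at every contig change.

```python
def pair_in_one_pass(reads):
    """Yield (first, second) mate pairs from one pass over sorted reads.

    Assumption: both mates of a pair are on the same contig. Reads left
    without a mate when the contig changes are yielded as (read, None).
    """
    pending = {}    # read name -> first-seen mate; this is the in-memory window
    current = None  # contig currently being processed
    for name, contig, pos in reads:
        if contig != current:
            # Contig changed: flush unmatched reads so the window stays small.
            for orphan in pending.values():
                yield orphan, None
            pending.clear()
            current = contig
        mate = pending.pop(name, None)
        if mate is None:
            pending[name] = (name, contig, pos)
        else:
            yield mate, (name, contig, pos)
    # Flush whatever is still unmatched at end of input.
    for orphan in pending.values():
        yield orphan, None
```

The window holds at most the unmatched reads of one contig, so the file is never read a second time.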

@siddharthab
Author

siddharthab commented Sep 4, 2024

With this change, the test case in the linked issue now takes 14 minutes instead of 6.3 hours.

@siddharthab
Author

@Daniel-Liu-c0deb0t Can you please accept this PR?

@MatthiasZepper

I just wanted to express explicit support for this proposal!

While I am not familiar with the implementation details, I think it is a very important fix. Transcriptomic alignments and draft genome assemblies typically have numerous contigs, and if this fix speeds up the deduplication of those input files so dramatically, I would love to see it merged!



Development

Successfully merging this pull request may close these issues.

Very slow paired reads mode for transcriptome
