Implement an Efficient Algorithm for Capped-Depth BAM Downsampling #118
adimascf wants to merge 14 commits into mbhall88:main
Conversation
mbhall88
left a comment
Amazing work Dimas!
See my comments inline.
Also, could you switch your commit messages to use conventional commit form (see here)? Feel free to force-push the commit message changes as I haven't pulled this PR yet.
The CI is failing due to code formatting, so before your commits, just remember to run `cargo fmt` if you've touched any code files.
Replaced the fetching logic with a linear scan using random priorities in the `aln` subcommand. Added tests for depth, sorting, and reproducibility.
Remove the `shuffle_records_by_position` function and its unit tests; they are not needed by the sweep-line approach. Also update README.md to recommend `cargo-tarpaulin` instead of `kcov`.
Introduced a new CLI argument that defines the maximum gap between read start positions for them to be considered 'swappable'. This feature maintains the randomness while also improving the target coverage. Also adds a unit test to verify the evenness of the subsampling depth relative to the target.
Added a write block after we scan the whole contig, before we actually clear the heap.
…r option. Brought back the fetching logic for alignment subsampling with its functions and unit tests. Added additional parameters and made each piece of logic a method.
Initial refactor to migrate alignment I/O to noodles. The primary goal here is correctness and preserving existing behavior rather than performance. This first change makes the program ~2.5× slower for the sweep-line strategy and ~10× slower for the fetching strategy, likely caused by the use of RecordBuf (owned) when it is converted from Record (borrowed) in the sweep line. For fetching, the process of querying a given region is likely the main cause of the performance decline.
Only do the conversion from lazy Record to owned RecordBuf when the record will be put in the heap. Add noodles-bgzf/libdeflate for faster decompression. Add an additional `end` field to the `ScoredRead` struct to allow the alignment end to be calculated only once. Set LTO to true to produce better-optimised code, at the cost of longer linking time. Performance: reduces runtime by ~50%, now 1.25× faster than the previous rust-htslib implementation.
Replace single-point queries with region-based batch loading to reduce disk seeking and decompression. Add `batch_cache` and `candidates` vectors to store `RecordBuf` records for the shuffling and sampling process.
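As a rough illustration of the region-based batch loading described in that last commit — a minimal sketch, not the actual rasusa code: the `RecordBufLike`/`BamLike` types, the `query` signature, and the windowing by `region_len` are simplified stand-ins for the real noodles-based reader:

```rust
#[derive(Clone)]
struct RecordBufLike {
    start: u64,
    end: u64,
}

struct BamLike {
    records: Vec<RecordBufLike>,
}

impl BamLike {
    // Returns every record overlapping [region_start, region_end].
    fn query(&self, region_start: u64, region_end: u64) -> Vec<RecordBufLike> {
        self.records
            .iter()
            .filter(|r| r.end >= region_start && r.start <= region_end)
            .cloned()
            .collect()
    }
}

/// Group sorted target positions into windows of `region_len` and fetch each
/// window once, filling a batch cache that later sampling steps draw from.
fn batch_load(bam: &BamLike, positions: &[u64], region_len: u64) -> Vec<RecordBufLike> {
    let mut batch_cache = Vec::new();
    let mut i = 0;
    while i < positions.len() {
        let region_start = positions[i];
        let region_end = region_start + region_len;
        // One region query replaces many single-point queries, so the reader
        // seeks and decompresses each block at most once per window.
        batch_cache.extend(bam.query(region_start, region_end));
        // Skip every position already covered by this window (positions assumed sorted).
        while i < positions.len() && positions[i] <= region_end {
            i += 1;
        }
    }
    batch_cache
}
```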
Force-pushed from 84c3a0d to 7b52d07
Hi Michael, I changed to …

Then I brought back the fetching approach as a variant in the …

Also, I am not sure whether I should have made a separate PR on this, but I also managed to switch the library to noodles.

Then I made a change to the fetching logic after migrating to noodles: …

I checked that the result from rasusa with …

I think the code is ready for review. Please let me know if there are any changes I should make in the code or logic, or if any clarification is needed. Happy to learn Rust through this project!
Great work Dimas! Amazing work switching to noodles too. Also make sure you format and lint before pushing your next commits. Also, it looks like, from the CI and …, …

Once these changes are all done, could you also post some kind of benchmark analysis here that we and users can refer to?
Thanks, Michael, happy to learn more Rust through …

It's strange that running `cargo clippy` locally doesn't flag that warning. Regarding the …

Here is the benchmark analysis checking uniformity between the sweep line and fetching strategies on both Illumina and ONT data. I used Illumina data SRR26899147 (avg depth ~824x) and ONT data SRR26899102 (avg depth ~2943x), with the S. enterica reference genome. I aligned the reads from both datasets and ran rasusa with the commands below: …

I binned the per-base positions into 1000 bp bins and plotted them below.
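For reference, the 1000-bp binning step could look roughly like this — a sketch under the assumption that per-base depth comes from `samtools depth`-style TSV output; `bin_depth` and its signature are illustrative, not part of rasusa or the actual plotting script:

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Average per-base depth into fixed-size bins (e.g. 1000 bp) per contig.
/// Expects TSV lines of "contig<TAB>position<TAB>depth".
fn bin_depth(path: &str, bin_size: u64) -> std::io::Result<BTreeMap<(String, u64), f64>> {
    let reader = BufReader::new(File::open(path)?);
    // (contig, bin index) -> (sum of depths, number of positions seen)
    let mut sums: BTreeMap<(String, u64), (u64, u64)> = BTreeMap::new();
    for line in reader.lines() {
        let line = line?;
        let mut fields = line.split('\t');
        let contig = fields.next().unwrap_or_default().to_string();
        let pos: u64 = fields.next().and_then(|s| s.parse().ok()).unwrap_or(0);
        let depth: u64 = fields.next().and_then(|s| s.parse().ok()).unwrap_or(0);
        let entry = sums.entry((contig, pos / bin_size)).or_insert((0, 0));
        entry.0 += depth;
        entry.1 += 1;
    }
    // Convert per-bin sums into mean depth per bin.
    Ok(sums
        .into_iter()
        .map(|(key, (sum, n))| (key, sum as f64 / n as f64))
        .collect())
}
```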
I believe … Maybe try removing …
It might be because you don't have the latest Rust version locally? This happens regularly: with a new update of the Rust version, new lints can get added to clippy.

Great benchmarking work! Could you also add some benchmarks regarding runtime and memory usage? I usually do runtime with …
With the same datasets and a target depth of 50×, I benchmarked runtime using …

Results (runtime): …

Summary: Sweep Line ran ~3× faster than Fetching on Illumina data and ~12× faster on ONT data.

Then I measured memory usage using …

Results (memory usage): …
Since Sweep Line performs a linear pass over the reads in the alignment file, it is both faster and more memory-efficient on both datasets (Illumina avg depth ~824×, ONT avg depth ~2943×).
The dependency …
You are right, my local Rust version was outdated. After updating, the clippy warning is now detected.

While benchmarking, I also inspected the outputs from both approaches a bit more closely. It seems that both approaches may not retain paired-end information in the Illumina dataset, as reads are treated independently. I did not see this mentioned in the README, so I just wanted to note it here in case it's relevant. If preserving or restoring the paired-end information is important for downstream analysis, I think running …
Awesome work Dimas!! Great to see the gains are so dramatic.
Argh, good point. It would be nice to retain pairs if possible. Though it feels like trying to retain paired reads with either approach is going to be complicated, as when you get to the mate, you have to kick a read out of the active reads… Do you think this will be possible without too much refactoring?
Yeah, right. I think that seems reasonable; I will take a look today.
Previously, paired-end reads were treated independently, which led to orphaned reads (loss of mates) and potential issues in downstream analysis. The program only runs this strategy if it detects paired-end input. It divides the target depth by 2 and performs the subsampling algorithm (sweep line or fetching) in a first iteration that only considers the first segments. In a second iteration, it focuses on recovering only their mates (last segments).
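Roughly, that two-pass mate recovery could be sketched as below — illustrative only: `Rec`, `keep_first`, and the `HashSet` lookup are assumptions standing in for the real record type and subsampling logic, not the PR's actual code:

```rust
use std::collections::HashSet;

// Minimal stand-in for an alignment record; the real code reads the SAM FLAG
// field to decide whether a record is the first or last segment of a pair.
struct Rec {
    qname: String,
    is_first_segment: bool,
}

/// Pass 1: run the subsampling decision (sweep line or fetching, at half the
/// target depth) over first-in-pair reads only, remembering which names were kept.
/// Pass 2: keep exactly the mates (last segments) of the reads kept in pass 1.
fn subsample_pairs<'a>(records: &'a [Rec], keep_first: impl Fn(&Rec) -> bool) -> Vec<&'a Rec> {
    let mut kept: Vec<&Rec> = Vec::new();
    let mut kept_names: HashSet<&str> = HashSet::new();

    for rec in records.iter().filter(|r| r.is_first_segment) {
        if keep_first(rec) {
            kept_names.insert(rec.qname.as_str());
            kept.push(rec);
        }
    }

    for rec in records.iter().filter(|r| !r.is_first_segment) {
        if kept_names.contains(rec.qname.as_str()) {
            kept.push(rec);
        }
    }

    kept
}
```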


Description
Replaces the current random-access fetching with a streaming sweep line that assigns each read a random priority.
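Below is a minimal sketch of this idea (and of the `Ord`/`PartialEq`-with-`qname`-tie-break detail listed under Implementation Details), not the PR's actual implementation: the field names, the `HashMap`-based selection, and the eviction rule are assumptions, and in the real code the random scores come from a seeded RNG and the records are noodles records.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::{BinaryHeap, HashMap};

// Each read carries a random priority; `qname` breaks ties so the ordering is total.
#[derive(Clone)]
struct ScoredRead {
    qname: String,
    start: u64,
    end: u64,
    score: u64,
}

impl PartialEq for ScoredRead {
    fn eq(&self, other: &Self) -> bool {
        self.score == other.score && self.qname == other.qname
    }
}
impl Eq for ScoredRead {}
impl PartialOrd for ScoredRead {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for ScoredRead {
    fn cmp(&self, other: &Self) -> Ordering {
        self.score
            .cmp(&other.score)
            .then_with(|| self.qname.cmp(&other.qname))
    }
}

/// One linear pass over position-sorted reads, never keeping more than `cap`
/// simultaneously overlapping reads; when the cap is hit, the random
/// priorities decide which read stays.
fn sweep(mut reads: Vec<ScoredRead>, cap: usize) -> Vec<ScoredRead> {
    reads.sort_by_key(|r| r.start);
    // Min-heap (by score) of reads overlapping the current sweep position.
    let mut active: BinaryHeap<Reverse<ScoredRead>> = BinaryHeap::new();
    let mut selected: HashMap<String, ScoredRead> = HashMap::new();
    for read in reads {
        // Reads that ended before this one starts no longer count towards depth.
        active.retain(|r| r.0.end >= read.start);
        if active.len() < cap {
            selected.insert(read.qname.clone(), read.clone());
            active.push(Reverse(read));
        } else if active.peek().map_or(false, |lowest| read > lowest.0) {
            // At the cap: a higher random priority displaces the lowest-priority active read.
            if let Some(evicted) = active.pop() {
                selected.remove(&evicted.0.qname);
            }
            selected.insert(read.qname.clone(), read.clone());
            active.push(Reverse(read));
        }
    }
    selected.into_values().collect()
}
```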
Implementation Details:
`Ord` and `PartialEq` traits for `ScoredRead` to handle random score comparison for each read record, using `qname` to resolve ties
Validation:
Type of Change
Testing
Documentation
Checklist
Additional Notes
Currently, the `shuffle_records_by_position` function and its associated tests are kept in the file (marked as dead code).