
Implement an Efficient Algorithm for Capped-Depth BAM Downsampling#118

Open
adimascf wants to merge 14 commits into mbhall88:main from adimascf:subsampling-aln/sweepline

Conversation


@adimascf adimascf commented Jan 8, 2026

Description

Replaces the current random-access fetching with a streaming sweep line with random priorities approach.

Implementation Details:

  • Linear scan: Iterates through the alignment file sequentially, one read at a time, instead of re-fetching reads repeatedly.
  • Max heap strategy: Maintains up to N active reads with the smallest scores.
    • Reads are assigned a random score.
    • The heap helps to track the "worst" (highest) score currently in the selection.
    • If the heap is full, the read with the highest score is removed to make room for lower-scoring reads.
  • Efficiency: Improves speed with a single linear pass over the BAM, with complexity ~O(R log N), where R = number of reads and N = target depth.
  • Reproducibility: Deterministic sampling using seeded hashing ensures reproducibility.
  • Traits: Manually implemented the Ord and PartialEq traits for ScoredRead to handle random-score comparison for each read record, using qname to resolve ties.
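The heap strategy above can be sketched in standalone Rust. This is an illustrative sketch, not the PR's actual code: std's `BinaryHeap` and `DefaultHasher` stand in for the real `ScoredRead` machinery, and `select_reads` is a hypothetical helper (note `DefaultHasher` is only guaranteed deterministic within a single run; the real implementation relies on seeded hashing for cross-run reproducibility):

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::collections::BinaryHeap;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for ScoredRead: the score is a seeded hash of
// the read name, so the "random" priority is reproducible for a given seed.
#[derive(Debug, Eq, PartialEq)]
struct ScoredRead {
    score: u64,
    qname: String,
}

impl Ord for ScoredRead {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare by score first; break ties with qname for determinism.
        self.score
            .cmp(&other.score)
            .then_with(|| self.qname.cmp(&other.qname))
    }
}

impl PartialOrd for ScoredRead {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn seeded_score(qname: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    qname.hash(&mut h);
    h.finish()
}

/// Keep the n reads with the smallest scores using a max-heap:
/// the heap root is always the worst (highest) score in the selection.
fn select_reads(qnames: &[&str], n: usize, seed: u64) -> Vec<String> {
    let mut heap: BinaryHeap<ScoredRead> = BinaryHeap::with_capacity(n);
    for &qname in qnames {
        let read = ScoredRead {
            score: seeded_score(qname, seed),
            qname: qname.to_string(),
        };
        if heap.len() < n {
            heap.push(read);
        } else if let Some(mut worst) = heap.peek_mut() {
            if read < *worst {
                // Evict the highest-scoring read in O(log n).
                *worst = read;
            }
        }
    }
    heap.into_sorted_vec().into_iter().map(|r| r.qname).collect()
}
```

With `peek_mut`, a rejected read costs only a comparison, and an accepted one a single O(log N) sift when the guard is dropped.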

Validation:

  • Safety check: Added logic to detect and return an error if the input is not coordinate-sorted.
  • Tests: Added several tests for depth limits, seed reproducibility, and read ordering.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested on multiple platforms (if applicable)

Documentation

  • I have updated the documentation accordingly
  • I have updated the CHANGELOG.md (if applicable)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Additional Notes

Currently, the shuffle_records_by_position function and its associated tests are kept in the file (marked as dead code).

@adimascf adimascf marked this pull request as ready for review January 8, 2026 08:43

@mbhall88 mbhall88 left a comment


Amazing work Dimas!

See my comments inline.

Also, could you switch your commit messages to use conventional commit form (see here). Feel free to force push the commit message changes as I haven't pulled this PR yet.

The CI is failing due to code formatting. So before your commits, just remember to run cargo fmt if you've touched any code files.

Replaced fetching logic with a linear scan with random priority in `aln` subcommand. Added tests for depth, sorting, and reproducibility
Remove the 'shuffle_records_by_position' function and its unit tests, they are not needed by the sweepline approach. Also update README.md to recommend 'cargo-tarpaulin' instead of 'kcov'
Introduced a new CLI argument that defines the maximum gap between read start positions to consider them 'swappable'

This feature still maintains the randomness while also improving the target coverage

Also adds a unit test to verify the evenness of the subsampling depth relative to the target.
Added a writing block after we scan the whole contig before we actually clear the heap
…r option

brought back the fetching logic for alignment subsampling with its functions and unit tests

added additional parameters and made each piece of logic a method
Initial refactor to migrate alignment I/O to noodles.
The primary goal here is correctness and preserving existing behavior
rather than performance.

This first change makes the program:
- ~2.5 times slower for the sweep line strategy
- ~10 times slower for the fetching strategy

likely caused by the use of RecordBuf (owned) when it is converted from Record (borrowed) in the sweep line.
For fetching, the process of querying a certain region is likely the main cause of this performance decline.
Only do conversion from lazy Record to owned RecordBuf when it will be put in the heap.
Add noodles-bgzf/libdeflate for optimal decompression process.
Add additional `end` field in `ScoredRead` struct to allow one time alignment end calculation.
Set LTO to true to produce better-optimised code, at the cost of longer linking time.

Performance: Reduced runtime by ~50%; now 1.25× faster than the previous rust-htslib implementation.
Replace single-point queries with region-based batch loading to reduce disk seeking and decompression

Add `batch_cache` and `candidates` vectors to store `RecordBuf` for shuffling and sampling process
@adimascf adimascf force-pushed the subsampling-aln/sweepline branch from 84c3a0d to 7b52d07 on January 29, 2026 00:55
@adimascf

Hi Michael,

I changed to peek_mut() as suggested in the comment above. Also, I added swap_distance to address undersampling when a new read swaps out one that already spans from long before its start.
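As a rough illustration of how such a guard might work (the predicate below is my reading of the swap_distance semantics, with hypothetical names; the actual logic in the PR may differ):

```rust
// Hypothetical sketch of a swap-distance guard: only evict the current
// worst read if the incoming read starts within `swap_distance` bases
// of it, so a long-spanning read is not replaced by one far downstream
// (which would leave the intervening region undersampled).
fn should_swap(
    worst_start: u64,
    new_start: u64,
    worst_score: u64,
    new_score: u64,
    swap_distance: u64,
) -> bool {
    // Coordinate-sorted input means new_start >= worst_start in practice;
    // saturating_sub keeps the check safe either way.
    new_score < worst_score && new_start.saturating_sub(worst_start) <= swap_distance
}
```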

test_subsampling_statistic is added to test whether the resulting depth is close to the target depth. I assess this using the median and mean depth.

Then I brought back the fetching approach as a variant in the SubsamplingStrategy enum, with the sweep line as the default strategy.

Also, I am not sure whether I should have made a separate PR for this, but I also managed to switch the library to noodles. Now, both the sweep line and fetching implementations use noodles_util.

Then I made a change to the fetching logic after migrating to noodles_util, as I found that the query() method in noodles_util seems to decompress data more eagerly, which impacted the runtime. So, my workaround to reduce decompression is to query a larger region at once (batch_cache), then look for overlapping reads at a given position manually. I also updated the toml file to enable libdeflate as suggested here: zaeleus/noodles#218.
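The batch-cache idea can be sketched abstractly (this is not the noodles-based code from the PR; plain half-open `(start, end)` intervals stand in for alignment records, and `BatchCache`/`overlapping` are hypothetical names):

```rust
// Hypothetical sketch of the batch-cache workaround: instead of one
// index query per position, fetch a larger window once and answer
// point-overlap lookups from the cached records in memory.
struct BatchCache {
    // (start, end) of each cached read; intervals are half-open [start, end).
    records: Vec<(usize, usize)>,
}

impl BatchCache {
    /// Reads overlapping a single position, filtered manually from the
    /// cache rather than via a fresh region query (and its decompression).
    fn overlapping(&self, pos: usize) -> Vec<(usize, usize)> {
        self.records
            .iter()
            .copied()
            .filter(|&(start, end)| start <= pos && pos < end)
            .collect()
    }
}
```

The trade-off is classic batching: one larger decompression pass up front in exchange for many cheap in-memory scans afterwards.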

I checked that the results from rasusa with noodles are identical to rust-htslib using the following commands:

# Fetch strategy verification
diff <(./target/release/rasusa aln --coverage 2 --seed 123 --strategy fetch tests/cases/test.bam 2> /dev/null | samtools view - | cut -f 1) \
     <(../rasusa-binaries/rasusa_htslib aln --coverage 2 --seed 123 --strategy fetch tests/cases/test.bam 2> /dev/null | samtools view - | cut -f 1)

# Sweep line strategy verification
diff <(./target/release/rasusa aln --coverage 2 --seed 123 tests/cases/test.bam 2> /dev/null | samtools view - | cut -f 1) \
     <(../rasusa-binaries/rasusa_htslib aln --coverage 2 --seed 123 tests/cases/test.bam 2> /dev/null | samtools view - | cut -f 1)

I think the code is ready for review. Please let me know if there are any changes I should make in the code or logic, or if any clarification is needed. Happy to learn Rust through this project!

@mbhall88

mbhall88 commented Feb 2, 2026

Great work Dimas! Amazing work switching to noodles too. Also, make sure you format and lint before pushing your next commits. And from the CI and cargo shear, it looks like noodles-bgzf is not being used?

Once these changes are all done, could you also post some kind of benchmark analysis here that both we and users can refer to?

@adimascf

adimascf commented Feb 3, 2026

Thanks, Michael, happy to learn more Rust through noodles as well.

It's strange that running cargo clippy --all-features --all-targets -- -D warnings on my local machine did not catch that error/warning, but I have now replaced unwrap() with an if let binding in reads.rs as the error message suggested.

Regarding noodles-bgzf, I believe it is used internally by noodles for decompression, even though it is not used explicitly in the code. I have added it to the ignored list in the toml to satisfy cargo shear.

Here is the benchmark analysis checking uniformity between the sweep line and fetching strategies on both Illumina and ONT data.

I used Illumina data SRR26899147 (avg depth ~824×) and ONT data SRR26899102 (avg depth ~2943×), with the S. enterica reference genome.

I aligned the reads from both datasets and ran rasusa with the commands below:

# Illumina dataset
## Alignment
bowtie2-build ATCC_10708__202309/reference.fna ATCC_10708__202309/reference
bowtie2 -x ATCC_10708__202309/reference -1 SRR26899147_1.fastq.gz -2 SRR26899147_2.fastq.gz | samtools view -bS - | samtools sort -o s_enterica_illumina.sorted.bam

## rasusa using sweep line
./target/release/rasusa aln --seed 123 -c 50 s_enterica_illumina.sorted.bam | samtools sort -o sweep_s_enterica_illumina.sorted.50.bam

## rasusa using fetching
./target/release/rasusa aln --seed 123 -c 50 --strategy fetch s_enterica_illumina.sorted.bam | samtools sort -o fetch_s_enterica_illumina.sorted.50.bam

## Depth calculation
samtools depth -a -J sweep_s_enterica_illumina.sorted.50.bam > sweep_s_enterica_illumina.sorted.50.depth.tsv 
samtools depth -a -J fetch_s_enterica_illumina.sorted.50.bam > fetch_s_enterica_illumina.sorted.50.depth.tsv

# ONT dataset
## Alignment
minimap2 -aL --cs --MD -t 8 -x map-ont --secondary=no ATCC_10708__202309/reference.fna SRR26899102_1.fastq.gz | samtools sort -o s_enterica_ont.hd.sorted.bam

## rasusa using sweep line
./target/release/rasusa aln --seed 123 -c 50 s_enterica_ont.hd.sorted.bam | samtools sort -o sweep_s_enterica_ont.sorted.50.bam 

## rasusa using fetching
./target/release/rasusa aln --seed 123 -c 50 --strategy fetch s_enterica_ont.hd.sorted.bam | samtools sort -o fetch_s_enterica_ont.sorted.50.bam

## Depth calculation
samtools depth -a -J sweep_s_enterica_ont.sorted.50.bam > sweep_s_enterica_ont.sorted.50.depth.tsv
samtools depth -a -J fetch_s_enterica_ont.sorted.50.bam > fetch_s_enterica_ont.sorted.50.depth.tsv

I binned the per-base positions into 1000 bp bins and plotted them below:
(Figure: subsampling_results — binned per-base depth, sweep line vs fetching, Illumina and ONT)

@mbhall88

mbhall88 commented Feb 5, 2026

Regarding noodles-bgzf, I believe it is used internally by noodles for decompression, even though it is not used explicitly in the code. I have added it to the ignored list in the toml to satisfy cargo shear.

I believe noodles-util pulls noodles-bgzf in as you're using the alignment feature. See here
https://github.com/zaeleus/noodles/blob/e3047065044f16e49c7ab37f1c62c54a5f7c32d3/noodles-util/Cargo.toml#L17

Maybe try removing noodles-bgzf and running a test sample to check it works without it.

It's strange that running cargo clippy --all-features --all-targets -- -D warnings on my local machine did not catch that error/warning, but I have now replaced unwrap() with an if let binding in reads.rs as the error message suggested.

It might be because you don't have the latest rust version locally? rustup update

This happens regularly, with a new update of the rust version, new lints can get added to clippy that catch code which previously wasn't caught.

Great benchmarking work! Could you also add some benchmarks for runtime and memory usage? I usually measure runtime with hyperfine. For memory usage I would just use /usr/bin/time with the -v flag. Maybe put those results into a table?

@adimascf

adimascf commented Feb 9, 2026

With the same datasets and a target depth of 50×, I benchmarked runtime using hyperfine (running rasusa without piping the output to samtools sort). The commands I used were:

TARGET_DEPTH=50
SL_ILLUMINA_CMD="./target/release/rasusa aln --seed 2109 --coverage $TARGET_DEPTH s_enterica_illumina.sorted.bam -o sweep_s_enterica_illumina.50.bam"
FETCH_ILLUMINA_CMD="./target/release/rasusa aln --seed 2109 --coverage $TARGET_DEPTH --strategy fetch s_enterica_illumina.sorted.bam -o fetch_s_enterica_illumina.50.bam"
SL_ONT_CMD="./target/release/rasusa aln --seed 2109 --coverage $TARGET_DEPTH s_enterica_ont.hd.sorted.bam -o sweep_s_enterica_ont.50.bam"
FETCH_ONT_CMD="./target/release/rasusa aln --seed 2109 --coverage $TARGET_DEPTH --strategy fetch s_enterica_ont.hd.sorted.bam -o fetch_s_enterica_ont.50.bam"

hyperfine --warmup 2 --runs 10 --export-markdown runtime-benchmark.md \
  -n sweepline-illumina "$SL_ILLUMINA_CMD" \
  -n fetching-illumina "$FETCH_ILLUMINA_CMD" \
  -n sweepline-ont "$SL_ONT_CMD" \
  -n fetching-ont "$FETCH_ONT_CMD"

Results (runtime):

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| sweepline-illumina | 52.983 ± 1.755 | 51.713 | 59.401 | 1.00 |
| fetching-illumina | 156.336 ± 2.972 | 154.319 | 166.859 | 2.95 ± 0.11 |
| sweepline-ont | 96.068 ± 3.638 | 94.296 | 110.363 | 1.81 ± 0.09 |
| fetching-ont | 1195.671 ± 20.126 | 1171.811 | 1226.843 | 22.57 ± 0.84 |

Summary: Sweep Line ran ~3× faster than Fetching on Illumina data and ~12× faster on ONT data.

Then I measured memory usage using /usr/bin/time -v on the same commands, taking the maximum resident set size as the metric:

/usr/bin/time -v $SL_ILLUMINA_CMD && /usr/bin/time -v $FETCH_ILLUMINA_CMD && /usr/bin/time -v $SL_ONT_CMD && /usr/bin/time -v $FETCH_ONT_CMD

Results (memory usage):

| Strategy | Data Type | Max RSS (MB) |
|---|---|---|
| Sweep Line | Illumina | 4.21 |
| Fetching | Illumina | 77.54 |
| Sweep Line | ONT | 8.17 |
| Fetching | ONT | 310.59 |

Since Sweep Line performs a linear pass over the reads in the alignment file, it is both faster and more memory-efficient on both datasets (Illumina avg depth ~824×, ONT avg depth ~2943×).


Maybe try removing noodles-bgzf and running a test sample to check it works without it.

The dependency noodles-bgzf = { version = "0.45.0", features = ["libdeflate"] } is there to enable the libdeflate feature. The code still works without it, but in my small benchmark the runtime was slightly slower:

Summary
  libdeflate ran
    1.06 ± 0.01 times faster than no-libdeflate

It might be because you don't have the latest rust version locally? rustup update

You are right, my local rust version was outdated. After updating, the clippy warning is now detected.


While benchmarking, I also inspected the outputs from both approaches a bit more closely. It seems that both approaches may not retain paired-end information in the Illumina dataset, as reads are treated independently. I did not see this mentioned in the README, so I just wanted to note it here in case it’s relevant.

If preserving or restoring the paired-end information is important for downstream analysis, I think running samtools fixmate on the output can help restore proper flags.

@mbhall88

Awesome work Dimas!! Great to see the gains are so dramatic.

While benchmarking, I also inspected the outputs from both approaches a bit more closely. It seems that both approaches may not retain paired-end information in the Illumina dataset, as reads are treated independently. I did not see this mentioned in the README, so I just wanted to note it here in case it’s relevant.

If preserving or restoring the paired-end information is important for downstream analysis, I think running samtools fixmate on the output can help restore proper flags.

Argh good point. It would be nice to retain pairs if possible. Though it feels like trying to retain paired reads with either approach is going to be complicated, as when you get to the mate, you have to kick a read out of the active reads.... Do you think this will be possible without too much refactoring?

@adimascf

Yeah right. I think that seems reasonable, I will take a look today

Previously, paired-end reads were treated independently, which led to orphan reads (loss of mates) and potential issues in downstream analysis.

The program only runs this strategy if it detects paired-end input. It divides the target depth by 2
and performs the subsampling algorithm (sweep line or fetching) in the first iteration, considering only the first segments.
In the second iteration, it focuses solely on recovering their mates (last segments).
@adimascf

Hi Michael,

I have implemented the two-iteration strategy to handle paired-end Illumina data as per our previous chat.

I added a helper function recover_mates() to take care of the last_segment records for paired-end reads. I also added two new tests, test_paired_end_retention_fetch() and test_paired_end_retention_stream(), along with a small paired-end BAM file, to ensure we always subsample reads in pairs (when paired-end input is provided).
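A minimal sketch of the second iteration's mate-recovery idea (illustrative only — the real recover_mates() works on BAM records via noodles; here `(qname, is_first_segment)` tuples stand in for records and `recover_mate_qnames` is a hypothetical helper):

```rust
use std::collections::HashSet;

// Second pass of the two-iteration strategy: keep every last segment
// whose first segment was selected in pass one, so no orphan reads
// are emitted.
fn recover_mate_qnames(
    records: &[(&str, bool)],         // (qname, is this the first segment?)
    selected_first: &HashSet<String>, // qnames chosen in the first pass
) -> Vec<String> {
    records
        .iter()
        .filter(|&&(qname, is_first)| !is_first && selected_first.contains(qname))
        .map(|&(qname, _)| qname.to_string())
        .collect()
}
```

Because membership in `selected_first` fully determines the second pass, every emitted read is guaranteed to have its mate in the output.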

To verify the result on a real alignment file (same files as above), I ran samtools flagstat on the original alignment file and on the subsampling result. There are no orphan reads generated by rasusa, and the pairing statistics remain proportionally consistent.

original file:

21475714 + 0 in total (QC-passed reads + QC-failed reads)
21475714 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
20210444 + 0 mapped (94.11% : N/A)
20210444 + 0 primary mapped (94.11% : N/A)
21475714 + 0 paired in sequencing
10737857 + 0 read1
10737857 + 0 read2
16920784 + 0 properly paired (78.79% : N/A)
19115700 + 0 with itself and mate mapped
1094744 + 0 singletons (5.10% : N/A)
12 + 0 with mate mapped to a different chr
12 + 0 with mate mapped to a different chr (mapQ>=5)

paired-end aware rasusa output:

1161166 + 0 in total (QC-passed reads + QC-failed reads)
1161166 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1105361 + 0 mapped (95.19% : N/A)
1105361 + 0 primary mapped (95.19% : N/A)
1161166 + 0 paired in sequencing
580583 + 0 read1
580583 + 0 read2
930272 + 0 properly paired (80.12% : N/A)
1049556 + 0 with itself and mate mapped
55805 + 0 singletons (4.81% : N/A)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)

I also checked coverage uniformity after this change:
(Figure: subsampling_results_paired_end_aware — binned depth after the paired-end-aware change)

As we expected, both sweep line and fetching no longer produce a perfectly flat line for paired-end Illumina data, but the smoothed blue line stays consistently close to the target depth (50×).

In terms of runtime and memory benchmarks, I compiled the results into a single table below:

| Command | Mean (s) | Min (s) | Max (s) | Relative | Max RSS (MB) |
|---|---|---|---|---|---|
| sweepline-illumina-pe | 29.616 ± 0.216 | 29.389 | 30.071 | 1.00 | 56.2 |
| fetching-illumina-pe | 128.751 ± 0.624 | 128.049 | 129.883 | 4.35 ± 0.04 | 127.8 |
| sweepline-ont | 95.511 ± 2.330 | 93.844 | 100.164 | 3.22 ± 0.08 | 8.3 |
| fetching-ont | 1257.453 ± 81.765 | 1168.909 | 1363.556 | 42.46 ± 2.78 | 314.1 |

As expected, memory usage increased (~50 MB) for both sweep line and fetching. Interestingly, sweep line is significantly faster on Illumina data after this change. I think this is because we essentially process a smaller heap (target depth × 0.5). There is no significant difference for the ONT data after this change.

Let me know what you think 🙂
