
Implement an Efficient Algorithm for Capped-Depth BAM Downsampling#118

Open
adimascf wants to merge 14 commits into mbhall88:main from adimascf:subsampling-aln/sweepline

Conversation


@adimascf adimascf commented Jan 8, 2026

Description

Replaces the current random-access fetching with a streaming sweep line with random priorities approach.

Implementation Details:

  • Linear scan: Iterates through the alignment file sequentially, one read at a time, instead of re-fetching reads repeatedly.
  • Max heap strategy: Maintains up to N active reads with the smallest scores.
    • Reads are assigned a random score.
    • The heap helps to track the "worst" (highest) score currently in the selection.
    • If the heap is full, the read with the highest score is removed to make room for lower-scoring reads.
  • Efficiency: Improves speed with a single linear pass over the BAM, with complexity ~O(R log N), where R = number of reads and N = target depth.
  • Reproducibility: Deterministic sampling using seeded hashing ensures reproducibility.
  • Traits: Manually implemented the Ord and PartialEq traits for ScoredRead to handle random-score comparison for each read record, using qname to resolve ties.
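The heap strategy above can be sketched in standalone Rust. This is an illustrative sketch, not the PR's actual code: std's `BinaryHeap` and `DefaultHasher` stand in for the real `ScoredRead` machinery, and `select_reads` is a hypothetical helper (note `DefaultHasher` is only guaranteed deterministic within a single run; the real implementation relies on seeded hashing for cross-run reproducibility):

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::collections::BinaryHeap;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for ScoredRead: the score is a seeded hash of
// the read name, so the "random" priority is reproducible for a given seed.
#[derive(Debug, Eq, PartialEq)]
struct ScoredRead {
    score: u64,
    qname: String,
}

impl Ord for ScoredRead {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare by score first; break ties with qname for determinism.
        self.score
            .cmp(&other.score)
            .then_with(|| self.qname.cmp(&other.qname))
    }
}

impl PartialOrd for ScoredRead {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn seeded_score(qname: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    qname.hash(&mut h);
    h.finish()
}

/// Keep the n reads with the smallest scores using a max-heap:
/// the heap root is always the worst (highest) score in the selection.
fn select_reads(qnames: &[&str], n: usize, seed: u64) -> Vec<String> {
    let mut heap: BinaryHeap<ScoredRead> = BinaryHeap::with_capacity(n);
    for &qname in qnames {
        let read = ScoredRead {
            score: seeded_score(qname, seed),
            qname: qname.to_string(),
        };
        if heap.len() < n {
            heap.push(read);
        } else if let Some(mut worst) = heap.peek_mut() {
            if read < *worst {
                // Evict the highest-scoring read in O(log n).
                *worst = read;
            }
        }
    }
    heap.into_sorted_vec().into_iter().map(|r| r.qname).collect()
}
```

With `peek_mut`, a rejected read costs only a comparison, and an accepted one a single O(log N) sift when the guard is dropped.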

Validation:

  • Safety check: Added logic to detect and return an error if the input is not coordinate-sorted.
  • Tests: Added several tests for depth limits, seed reproducibility, and read ordering.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested on multiple platforms (if applicable)

Documentation

  • I have updated the documentation accordingly
  • I have updated the CHANGELOG.md (if applicable)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Additional Notes

Currently, the shuffle_records_by_position function and its associated tests are kept in the file (marked as dead code).

@adimascf adimascf marked this pull request as ready for review January 8, 2026 08:43

@mbhall88 mbhall88 left a comment


Amazing work Dimas!

See my comments inline.

Also, could you switch your commit messages to use conventional commit form (see here). Feel free to force push the commit message changes as I haven't pulled this PR yet.

The CI is failing due to code formatting. So before your commits, just remember to run cargo fmt if you've touched any code files.

Replaced fetching logic with a linear scan with random priority in `aln` subcommand. Added tests for depth, sorting, and reproducibility
Remove the 'shuffle_records_by_position' function and its unit tests, they are not needed by the sweepline approach. Also update README.md to recommend 'cargo-tarpaulin' instead of 'kcov'
Introduced a new CLI argument that defines the maximum gap between read start positions to consider them 'swappable'

This feature still maintains the randomness while also improving the target coverage

Also adds a unit test to verify the evenness of the subsampling depth relative to the target.
Added a writing block after we scan the whole contig before we actually clear the heap
…r option

brought back the fetching logic for alignment subsampling with its functions and unit tests

added additional parameters and made each piece of logic a method
Initial refactor to migrate alignment I/O to noodles.
The primary goal here is correctness and preserving existing behavior
rather than performance.

This first change makes the program:
- ~2.5 times slower for the sweep line strategy
- ~10 times slower for the fetching strategy

likely caused by the use of RecordBuf (owned) when it is converted from Record (borrowed) in the sweep line.
For fetching, the process of querying a certain region is likely the main cause of this performance decline.
Only do conversion from lazy Record to owned RecordBuf when it will be put in the heap.
Add noodles-bgzf/libdeflate for optimal decompression process.
Add additional `end` field in `ScoredRead` struct to allow one time alignment end calculation.
Set LTO to true to produce better-optimised code, at the cost of longer linking time.

Performance: Reduced runtime by ~50%; now 1.25× faster than the previous rust-htslib implementation.
Replace single-point queries with region-based batch loading to reduce disk seeking and decompression

Add `batch_cache` and `candidates` vectors to store `RecordBuf` for shuffling and sampling process
@adimascf adimascf force-pushed the subsampling-aln/sweepline branch from 84c3a0d to 7b52d07 on January 29, 2026 00:55
@adimascf

Hi Michael,

I changed to peek_mut() as suggested in the comment above. Also, I added swap_distance to address undersampling when a new read swaps out one that already spans from long before its start.
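As a rough illustration of how such a guard might work (the predicate below is my reading of the swap_distance semantics, with hypothetical names; the actual logic in the PR may differ):

```rust
// Hypothetical sketch of a swap-distance guard: only evict the current
// worst read if the incoming read starts within `swap_distance` bases
// of it, so a long-spanning read is not replaced by one far downstream
// (which would leave the intervening region undersampled).
fn should_swap(
    worst_start: u64,
    new_start: u64,
    worst_score: u64,
    new_score: u64,
    swap_distance: u64,
) -> bool {
    // Coordinate-sorted input means new_start >= worst_start in practice;
    // saturating_sub keeps the check safe either way.
    new_score < worst_score && new_start.saturating_sub(worst_start) <= swap_distance
}
```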

test_subsampling_statistic is added to test whether the resulting depth is close to the target depth. I assess this using the median and mean depth.

Then I brought back the fetching approach as a variant in the SubsamplingStrategy enum, with the sweep line as the default strategy.

Also, I am not sure whether I should have made a separate PR for this, but I also managed to switch the library to noodles. Now, both the sweep line and fetching implementations use noodles_util.

Then I made a change to the fetching logic after migrating to noodles_util, as I found that the query() method in noodles_util seems to decompress data more eagerly, which impacted the runtime. So, my workaround to reduce decompression is to query a larger region at once (batch_cache), then look for overlapping reads at a given position manually. I also updated the toml file to enable libdeflate as suggested here: zaeleus/noodles#218.
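The batch-cache idea can be sketched abstractly (this is not the noodles-based code from the PR; plain half-open `(start, end)` intervals stand in for alignment records, and `BatchCache`/`overlapping` are hypothetical names):

```rust
// Hypothetical sketch of the batch-cache workaround: instead of one
// index query per position, fetch a larger window once and answer
// point-overlap lookups from the cached records in memory.
struct BatchCache {
    // (start, end) of each cached read; intervals are half-open [start, end).
    records: Vec<(usize, usize)>,
}

impl BatchCache {
    /// Reads overlapping a single position, filtered manually from the
    /// cache rather than via a fresh region query (and its decompression).
    fn overlapping(&self, pos: usize) -> Vec<(usize, usize)> {
        self.records
            .iter()
            .copied()
            .filter(|&(start, end)| start <= pos && pos < end)
            .collect()
    }
}
```

The trade-off is classic batching: one larger decompression pass up front in exchange for many cheap in-memory scans afterwards.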

I checked that the results from rasusa with noodles are identical to rust-htslib using the following commands:

# Fetch strategy verification
diff <(./target/release/rasusa aln --coverage 2 --seed 123 --strategy fetch tests/cases/test.bam 2> /dev/null | samtools view - | cut -f 1) \
     <(../rasusa-binaries/rasusa_htslib aln --coverage 2 --seed 123 --strategy fetch tests/cases/test.bam 2> /dev/null | samtools view - | cut -f 1)

# Sweep line strategy verification
diff <(./target/release/rasusa aln --coverage 2 --seed 123 tests/cases/test.bam 2> /dev/null | samtools view - | cut -f 1) \
     <(../rasusa-binaries/rasusa_htslib aln --coverage 2 --seed 123 tests/cases/test.bam 2> /dev/null | samtools view - | cut -f 1)

I think the code is ready for review. Please let me know if there are any changes I should make in the code or logic, or if any clarification is needed. Happy to learn Rust through this project!

@mbhall88

mbhall88 commented Feb 2, 2026

Great work Dimas! Amazing work switching to noodles too. Also, make sure you format and lint before pushing your next commits. And from the CI and cargo shear, it looks like noodles-bgzf is not being used?

Once these changes are all done, could you also post some kind of benchmark analysis here that both we and users can refer to?

@adimascf

adimascf commented Feb 3, 2026

Thanks, Michael, happy to learn more Rust through noodles as well.

It's strange that running cargo clippy --all-features --all-targets -- -D warnings on my local machine did not catch that error/warning, but I have now replaced unwrap() with an if let binding in reads.rs as the error message suggested.

Regarding noodles-bgzf, I believe it is used internally by noodles for decompression, even though it is not used explicitly in the code. I have added it to the ignored list in the toml to satisfy cargo shear.

Here is the benchmark analysis checking uniformity between the sweep line and fetching strategies on both Illumina and ONT data.

I used Illumina data SRR26899147 (avg depth ~824×) and ONT data SRR26899102 (avg depth ~2943×), with the S. enterica reference genome.

I aligned the reads from both datasets and ran rasusa with the commands below:

# Illumina dataset
## Alignment
bowtie2-build ATCC_10708__202309/reference.fna ATCC_10708__202309/reference
bowtie2 -x ATCC_10708__202309/reference -1 SRR26899147_1.fastq.gz -2 SRR26899147_2.fastq.gz | samtools view -bS - | samtools sort -o s_enterica_illumina.sorted.bam

## rasusa using sweep line
./target/release/rasusa aln --seed 123 -c 50 s_enterica_illumina.sorted.bam | samtools sort -o sweep_s_enterica_illumina.sorted.50.bam

## rasusa using fetching
./target/release/rasusa aln --seed 123 -c 50 --strategy fetch s_enterica_illumina.sorted.bam | samtools sort -o fetch_s_enterica_illumina.sorted.50.bam

## Depth calculation
samtools depth -a -J sweep_s_enterica_illumina.sorted.50.bam > sweep_s_enterica_illumina.sorted.50.depth.tsv 
samtools depth -a -J fetch_s_enterica_illumina.sorted.50.bam > fetch_s_enterica_illumina.sorted.50.depth.tsv

# ONT dataset
## Alignment
minimap2 -aL --cs --MD -t 8 -x map-ont --secondary=no ATCC_10708__202309/reference.fna SRR26899102_1.fastq.gz | samtools sort -o s_enterica_ont.hd.sorted.bam

## rasusa using sweep line
./target/release/rasusa aln --seed 123 -c 50 s_enterica_ont.hd.sorted.bam | samtools sort -o sweep_s_enterica_ont.sorted.50.bam 

## rasusa using fetching
./target/release/rasusa aln --seed 123 -c 50 --strategy fetch s_enterica_ont.hd.sorted.bam | samtools sort -o fetch_s_enterica_ont.sorted.50.bam

## Depth calculation
samtools depth -a -J sweep_s_enterica_ont.sorted.50.bam > sweep_s_enterica_ont.sorted.50.depth.tsv
samtools depth -a -J fetch_s_enterica_ont.sorted.50.bam > fetch_s_enterica_ont.sorted.50.depth.tsv

I binned the per-base positions into 1000 bp bins and plotted them below:
(Figure: subsampling_results — binned per-base depth, sweep line vs fetching, Illumina and ONT)

@mbhall88

mbhall88 commented Feb 5, 2026

Regarding noodles-bgzf, I believe it is used internally by noodles for decompression, even though it is not used explicitly in the code. I have added it to the ignored list in the toml to satisfy cargo shear.

I believe noodles-util pulls noodles-bgzf in as you're using the alignment feature. See here
https://github.com/zaeleus/noodles/blob/e3047065044f16e49c7ab37f1c62c54a5f7c32d3/noodles-util/Cargo.toml#L17

Maybe try removing noodles-bgzf and running a test sample to check it works without it.

It's strange that running cargo clippy --all-features --all-targets -- -D warnings on my local machine did not catch that error/warning, but I have now replaced unwrap() with an if let binding in reads.rs as the error message suggested.

It might be because you don't have the latest rust version locally? rustup update

This happens regularly, with a new update of the rust version, new lints can get added to clippy that catch code which previously wasn't caught.

Great benchmarking work! Could you also add some benchmarks for runtime and memory usage? I usually measure runtime with hyperfine. For memory usage I would just use /usr/bin/time with the -v flag. Maybe put those results into a table?

@adimascf

adimascf commented Feb 9, 2026

With the same datasets and a target depth of 50×, I benchmarked runtime using hyperfine (running rasusa without piping the output to samtools sort). The commands I used were:

TARGET_DEPTH=50
SL_ILLUMINA_CMD="./target/release/rasusa aln --seed 2109 --coverage $TARGET_DEPTH s_enterica_illumina.sorted.bam -o sweep_s_enterica_illumina.50.bam"
FETCH_ILLUMINA_CMD="./target/release/rasusa aln --seed 2109 --coverage $TARGET_DEPTH --strategy fetch s_enterica_illumina.sorted.bam -o fetch_s_enterica_illumina.50.bam"
SL_ONT_CMD="./target/release/rasusa aln --seed 2109 --coverage $TARGET_DEPTH s_enterica_ont.hd.sorted.bam -o sweep_s_enterica_ont.50.bam"
FETCH_ONT_CMD="./target/release/rasusa aln --seed 2109 --coverage $TARGET_DEPTH --strategy fetch s_enterica_ont.hd.sorted.bam -o fetch_s_enterica_ont.50.bam"

hyperfine --warmup 2 --runs 10 --export-markdown runtime-benchmark.md \
  -n sweepline-illumina "$SL_ILLUMINA_CMD" \
  -n fetching-illumina "$FETCH_ILLUMINA_CMD" \
  -n sweepline-ont "$SL_ONT_CMD" \
  -n fetching-ont "$FETCH_ONT_CMD"

Results (runtime):

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| sweepline-illumina | 52.983 ± 1.755 | 51.713 | 59.401 | 1.00 |
| fetching-illumina | 156.336 ± 2.972 | 154.319 | 166.859 | 2.95 ± 0.11 |
| sweepline-ont | 96.068 ± 3.638 | 94.296 | 110.363 | 1.81 ± 0.09 |
| fetching-ont | 1195.671 ± 20.126 | 1171.811 | 1226.843 | 22.57 ± 0.84 |

Summary: Sweep Line ran ~3× faster than Fetching on Illumina data and ~12× faster on ONT data.

Then I measured memory usage using /usr/bin/time -v on the same commands, taking the maximum resident set size as the metric:

/usr/bin/time -v $SL_ILLUMINA_CMD && /usr/bin/time -v $FETCH_ILLUMINA_CMD && /usr/bin/time -v $SL_ONT_CMD && /usr/bin/time -v $FETCH_ONT_CMD

Results (memory usage):

| Strategy | Data Type | Max RSS (MB) |
|---|---|---|
| Sweep Line | Illumina | 4.21 |
| Fetching | Illumina | 77.54 |
| Sweep Line | ONT | 8.17 |
| Fetching | ONT | 310.59 |

Since Sweep Line performs a linear pass over the reads in the alignment file, it is both faster and more memory-efficient on both datasets (Illumina avg depth ~824×, ONT avg depth ~2943×).


Maybe try removing noodles-bgzf and running a test sample to check it works without it.

The dependency noodles-bgzf = { version = "0.45.0", features = ["libdeflate"] } is there to enable the libdeflate feature. The code still works without it, but in my small benchmark the runtime was slightly slower:

Summary
  libdeflate ran
    1.06 ± 0.01 times faster than no-libdeflate

It might be because you don't have the latest rust version locally? rustup update

You are right, my local rust version was outdated. After updating, the clippy warning is now detected.


While benchmarking, I also inspected the outputs from both approaches a bit more closely. It seems that both approaches may not retain paired-end information in the Illumina dataset, as reads are treated independently. I did not see this mentioned in the README, so I just wanted to note it here in case it’s relevant.

If preserving or restoring the paired-end information is important for downstream analysis, I think running samtools fixmate on the output can help restore proper flags.

@mbhall88

Awesome work Dimas!! Great to see the gains are so dramatic.

While benchmarking, I also inspected the outputs from both approaches a bit more closely. It seems that both approaches may not retain paired-end information in the Illumina dataset, as reads are treated independently. I did not see this mentioned in the README, so I just wanted to note it here in case it’s relevant.

If preserving or restoring the paired-end information is important for downstream analysis, I think running samtools fixmate on the output can help restore proper flags.

Argh good point. It would be nice to retain pairs if possible. Though it feels like trying to retain paired reads with either approach is going to be complicated, as when you get to the mate, you have to kick a read out of the active reads.... Do you think this will be possible without too much refactoring?

@adimascf

Yeah right. I think that seems reasonable, I will take a look today

Previously, paired-end reads were treated independently, which led to orphan reads (loss of mates) and potential issues in downstream analysis.

The program only runs this strategy if it detects paired-end input. It divides the target depth by 2
and performs the subsampling algorithm (sweep line or fetching) in the first iteration, considering only the first segments.
In the second iteration, it focuses solely on recovering their mates (last segments).
@adimascf

Hi Michael,

I have implemented the two-iteration strategy to handle paired-end Illumina data as per our previous chat.

I added a helper function recover_mates() to take care of the last_segment records for paired-end reads. I also added two new tests, test_paired_end_retention_fetch() and test_paired_end_retention_stream(), along with a small paired-end BAM file, to ensure we always subsample reads in pairs (when paired-end input is provided).
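A minimal sketch of the second iteration's mate-recovery idea (illustrative only — the real recover_mates() works on BAM records via noodles; here `(qname, is_first_segment)` tuples stand in for records and `recover_mate_qnames` is a hypothetical helper):

```rust
use std::collections::HashSet;

// Second pass of the two-iteration strategy: keep every last segment
// whose first segment was selected in pass one, so no orphan reads
// are emitted.
fn recover_mate_qnames(
    records: &[(&str, bool)],         // (qname, is this the first segment?)
    selected_first: &HashSet<String>, // qnames chosen in the first pass
) -> Vec<String> {
    records
        .iter()
        .filter(|&&(qname, is_first)| !is_first && selected_first.contains(qname))
        .map(|&(qname, _)| qname.to_string())
        .collect()
}
```

Because membership in `selected_first` fully determines the second pass, every emitted read is guaranteed to have its mate in the output.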

To verify the result on a real alignment file (same files as above), I ran samtools flagstat on the original alignment file and on the subsampling result. There are no orphan reads generated by rasusa, and the pairing statistics remain proportionally consistent.

original file:

21475714 + 0 in total (QC-passed reads + QC-failed reads)
21475714 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
20210444 + 0 mapped (94.11% : N/A)
20210444 + 0 primary mapped (94.11% : N/A)
21475714 + 0 paired in sequencing
10737857 + 0 read1
10737857 + 0 read2
16920784 + 0 properly paired (78.79% : N/A)
19115700 + 0 with itself and mate mapped
1094744 + 0 singletons (5.10% : N/A)
12 + 0 with mate mapped to a different chr
12 + 0 with mate mapped to a different chr (mapQ>=5)

paired-end aware rasusa output:

1161166 + 0 in total (QC-passed reads + QC-failed reads)
1161166 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1105361 + 0 mapped (95.19% : N/A)
1105361 + 0 primary mapped (95.19% : N/A)
1161166 + 0 paired in sequencing
580583 + 0 read1
580583 + 0 read2
930272 + 0 properly paired (80.12% : N/A)
1049556 + 0 with itself and mate mapped
55805 + 0 singletons (4.81% : N/A)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)

I also checked coverage uniformity after this change:
(Figure: subsampling_results_paired_end_aware — binned depth after the paired-end-aware change)

As we expected, both sweep line and fetching no longer produce a perfectly flat line for paired-end Illumina data, but the smoothed blue line stays consistently close to the target depth (50×).

In terms of runtime and memory benchmarks, I compiled the results into a single table below:

| Command | Mean (s) | Min (s) | Max (s) | Relative | Max RSS (MB) |
|---|---|---|---|---|---|
| sweepline-illumina-pe | 29.616 ± 0.216 | 29.389 | 30.071 | 1.00 | 56.2 |
| fetching-illumina-pe | 128.751 ± 0.624 | 128.049 | 129.883 | 4.35 ± 0.04 | 127.8 |
| sweepline-ont | 95.511 ± 2.330 | 93.844 | 100.164 | 3.22 ± 0.08 | 8.3 |
| fetching-ont | 1257.453 ± 81.765 | 1168.909 | 1363.556 | 42.46 ± 2.78 | 314.1 |

As expected, memory usage increased (~50 MB) for both sweep line and fetching. Interestingly, sweep line is significantly faster on Illumina data after this change. I think this is because we essentially process a smaller heap (target depth × 0.5). There is no significant difference for the ONT data after this change.

Let me know what you think 🙂
