BWA Mem Sorting optimization #323

Open
ignacio3437 wants to merge 14 commits into Plant-Food-Research-Open:dev from ignacio3437:dev

@ignacio3437
Collaborator

I spent some time trying to improve the runtime of the assemblyqc hic steps, but my changes did not speed anything up significantly.

The pipeline ran a samtools sort -n step while creating the hic.bam file. The name-sorted BAM is only needed by the HICQC module, which uses only ~1M read pairs, so to speed things up I have turned off name sorting and introduced a new module.

SAMTOOLS_SUBSAMPLE_SORT creates a new BAM file containing a 5% subsample of the reads in hic.bam. This subset_hic.bam is then name-sorted and passed to HICQC.
The rest of the pipeline uses the full (not name-sorted) hic.bam.
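As a rough illustration of the subsample-then-name-sort step described above, the sketch below builds the two samtools command lines as argument lists. It is not the module's actual script: the function name, the intermediate file name, and the output prefix are assumptions made for illustration. It relies only on `samtools view -s` (subsample by fraction) and `samtools sort -n` (sort by read name).

```python
def subsample_sort_commands(bam, fraction, prefix="subset_hic"):
    """Return two samtools command lines: subsample `bam`, then name-sort it.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")

    unsorted_bam = f"{prefix}.unsorted.bam"  # assumed intermediate file name

    # Keep roughly `fraction` of the reads and write BAM output.
    view_cmd = ["samtools", "view", "-b", "-s", str(fraction),
                "-o", unsorted_bam, bam]

    # Name-sort only the small subset instead of the full hic.bam.
    sort_cmd = ["samtools", "sort", "-n", "-o", f"{prefix}.bam", unsorted_bam]

    return [view_cmd, sort_cmd]


if __name__ == "__main__":
    for cmd in subsample_sort_commands("hic.bam", 0.05):
        print(" ".join(cmd))
```

The point of the design is that the expensive name sort only touches the subset, while the full hic.bam is left in the order the aligner produced.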

Unfortunately, the time saved during the BWA_MEM step does not pay off in the long run: JuicerPre takes longer, so the total runtime drops by only 8 minutes. Here is my test using the HYv4 dataset:

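| Step | Gallvp_Main (min) | Iggy_fork (min) | Gallvp_main mem (GB) | Iggy_fork mem (GB) |
|---|---|---|---|---|
| BWA_MEM | 349 | 281 | 17.7 | 5 |
| Samblaster | 121 | 110 | 5.63 | 5.63 |
| JuicerPre | 66 | 108 | | |
| SortSub | | 29 | | 7 |
| TOTAL TIME | 536 | 528 | | |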
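The totals can be checked from the per-step runtimes. The short sketch below sums each column and reports the net saving. Placing SortSub's 29 minutes in the fork column is an inference: the step only exists in the fork, and it is the placement that reproduces both reported totals.

```python
# Per-step runtimes in minutes, copied from the timing figures above.
main_minutes = {"BWA_MEM": 349, "Samblaster": 121, "JuicerPre": 66}
fork_minutes = {"BWA_MEM": 281, "Samblaster": 110, "JuicerPre": 108,
                "SortSub": 29}  # SortSub exists only in the fork (inferred)

main_total = sum(main_minutes.values())
fork_total = sum(fork_minutes.values())
saved = main_total - fork_total

print(f"main: {main_total} min, fork: {fork_total} min")  # main: 536 min, fork: 528 min
print(f"saved: {saved} min ({100 * saved / main_total:.1f}%)")  # saved: 8 min (1.5%)

# Per-step differences show where the BWA_MEM gain is lost again.
for step in main_minutes:
    print(step, fork_minutes[step] - main_minutes[step])
```

BWA_MEM and Samblaster together save 79 minutes, but JuicerPre (+42) and the new SortSub step (+29) take back 71 of them.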

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool, have you followed the pipeline conventions in the contribution docs?
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes: nextflow run . -profile test,docker --outdir <OUTDIR> and nf-test test --profile docker tests/.
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@ignacio3437
Collaborator Author

#322

// MODULE: SAMTOOLS_SUBSAMPLE_SORT
SAMTOOLS_SUBSAMPLE_SORT (
    ch_bam,
    0.05 // Sample 5% of reads
Collaborator Author

@GallVp : Can we turn this into a parameter? By default we can set it to 100% which would essentially skip the SAMTOOLS_SUBSAMPLE_SORT module.

Collaborator Author

We can turn this into a parameter, but I think we would need to add logic here to skip the subsample step when the parameter is set to 100%.

As is, this ch_subsampled_sorted_bam is only passed to hicqc.
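One possible shape for that skip logic, sketched in Python rather than Nextflow so it can be run in isolation. The function name and step labels are hypothetical, and the sketch assumes that HICQC still needs a name-sorted BAM even when subsampling is skipped, which is a question this thread leaves open.

```python
def plan_hicqc_input(fraction):
    """Return the ordered steps needed to prepare the BAM passed to HICQC.

    Hypothetical helper: `fraction` stands in for a yet-to-be-named pipeline
    parameter, where 1.0 means "use all reads".
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")

    steps = []
    if fraction < 1.0:
        # Only subsample when fewer than 100% of reads are requested.
        steps.append("subsample")
    # Assumption: HICQC requires name-sorted input in both cases.
    steps.append("name_sort")
    return steps


if __name__ == "__main__":
    print(plan_hicqc_input(0.05))  # ['subsample', 'name_sort']
    print(plan_hicqc_input(1.0))   # ['name_sort']
```

With a default of 1.0 the subsample step drops out, but the name sort would then run on the full BAM, so the runtime benefit only appears when a smaller fraction is chosen.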


// SUBWORKFLOW: FASTQ_BWA_MEM_SAMBLASTER
- val_sort_bam = true
+ val_sort_bam = false
Collaborator Author

@GallVp : If we skip SAMTOOLS_SUBSAMPLE_SORT, do we need to re-enable this? Or was it completely unnecessary? I vaguely remember that I had to turn this on because some downstream tool failed. But maybe that was the old HiC workflow based on the run_visualiser script.

Collaborator Author

It is not strictly necessary; all the output files are produced correctly. But some tools seem to run faster with a name-sorted BAM, especially some of the JuicerPre steps.

I wonder what the speedup would be with a coordinate-sorted BAM, which is the standard sort order. I think it would speed some steps up even more.

The name-sorted BAM was only required for hicqc.

@ignacio3437 ignacio3437 requested a review from GallVp February 12, 2026 02:24