BWA Mem Sorting optimization #323
ignacio3437 wants to merge 14 commits into Plant-Food-Research-Open:dev
…rch-Open/dev dev -> main: Version 3.0.0
…rch-Open/patch/315 [Plant-Food-Research-OpenGH-315] Patched synteny crash due to Syri failure
Main to dev
// MODULE: SAMTOOLS_SUBSAMPLE_SORT
SAMTOOLS_SUBSAMPLE_SORT (
    ch_bam,
    0.05 // Sample 5% of reads
@GallVp : Can we turn this into a parameter? By default we can set it to 100% which would essentially skip the SAMTOOLS_SUBSAMPLE_SORT module.
We can turn this into a parameter, but I think we would also need to add logic here to skip the subsample step when the parameter is set to 100%.
As is, this ch_subsampled_sorted_bam is only passed to hicqc.
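A minimal sketch of that skip logic, written in Python rather than Nextflow so that it runs on its own. The names subsample_fraction, HicBamPlan and plan_hic_bam are hypothetical and are not part of this PR. The sketch also assumes that, when subsampling is skipped, the full BAM would need to be name-sorted again because HICQC expects name-sorted input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HicBamPlan:
    """Which steps to run for a given subsample fraction (hypothetical type)."""
    run_subsample_sort: bool  # run SAMTOOLS_SUBSAMPLE_SORT?
    val_sort_bam: bool        # name-sort the full BAM in FASTQ_BWA_MEM_SAMBLASTER?


def plan_hic_bam(subsample_fraction: float) -> HicBamPlan:
    """Decide whether to subsample, given a fraction in the range (0, 1]."""
    if not 0.0 < subsample_fraction <= 1.0:
        raise ValueError("subsample_fraction must be in (0, 1]")
    if subsample_fraction == 1.0:
        # 100%: skip the subsample module. Assumption: HICQC then receives
        # the full BAM, so the full BAM has to be name-sorted instead.
        return HicBamPlan(run_subsample_sort=False, val_sort_bam=True)
    # Otherwise only the subset is name-sorted and the full BAM is left as is.
    return HicBamPlan(run_subsample_sort=True, val_sort_bam=False)
```

With the current value of 0.05 this yields the behaviour in this PR (subsample on, full-BAM name sort off); a value of 1.0 would fall back to name-sorting the full BAM.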
// SUBWORKFLOW: FASTQ_BWA_MEM_SAMBLASTER
- val_sort_bam = true
+ val_sort_bam = false
@GallVp : If we skip SAMTOOLS_SUBSAMPLE_SORT, do we need to re-enable this? Or was it completely unnecessary? I vaguely remember that I had to turn this on because some downstream tool failed. But maybe that was the old HiC workflow based on the run_visualiser script.
It is not strictly necessary; all the output files are produced correctly. But some tools seem to run faster with a name-sorted bam, especially some of the JuicerPre steps.
I wonder what the speedup would be with a coordinate-sorted bam, which is the standard sort. I think it would speed some steps up even more.
The name sorted bam was only required for hicqc.
I spent some time trying to improve the runtime of the assemblyqc hic steps but my changes did not end up speeding up anything significantly.
The pipeline was doing a samtools sort -n step during the hic.bam file creation. The name-sorted bam is only needed for the HICQC module, which only uses ~1M read pairs, so to speed things up I have turned off the name sorting and introduced a new module. SAMTOOLS_SUBSAMPLE_SORT creates a new bam file containing a 5% subset of the reads in hic.bam. This subset_hic.bam is then name-sorted and passed to HICQC.
The rest of the pipeline uses the full (not name-sorted) hic.bam.
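As a rough check on the 5% figure: because HICQC only uses ~1M read pairs, the smallest fraction that still supplies them depends on the library size. The sketch below works through that arithmetic. The function name min_subsample_fraction and any library sizes passed to it are hypothetical and are not taken from the HYv4 test.

```python
def min_subsample_fraction(total_pairs: int, target_pairs: int = 1_000_000) -> float:
    """Smallest fraction of a library expected to yield target_pairs read pairs.

    Returns 1.0 when the library holds no more than target_pairs pairs,
    meaning subsampling cannot help and the whole file would be needed.
    """
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    if target_pairs <= 0:
        raise ValueError("target_pairs must be positive")
    return min(1.0, target_pairs / total_pairs)
```

Under these assumptions, a library of 20M read pairs needs a fraction of 0.05 to reach ~1M pairs, so a fixed 5% falls short of ~1M pairs for smaller libraries and samples more than needed for larger ones.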
Unfortunately the time saved during the BWAMEM step does not pay off in the long run. Here is my test using the HYv4 dataset:
PR checklist
- nf-core pipelines lint
- nextflow run . -profile test,docker --outdir <OUTDIR> and nf-test test --profile docker tests/
- nextflow run . -profile debug,test,docker --outdir <OUTDIR>
- docs/usage.md is updated.
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).