[Not Implimented] removing BWA_MEM sorting#322
[Not Implimented] removing BWA_MEM sorting#322ignacio3437 wants to merge 6 commits intoPlant-Food-Research-Open:mainfrom
Conversation
This PR is against the
|
|
Can you please reopen the PR against the dev branch? I think we can merge this as a feature. |
| // MODULE: SAMTOOLS_SUBSAMPLE_SORT | ||
| SAMTOOLS_SUBSAMPLE_SORT ( | ||
| ch_bam, | ||
| 0.05 // Sample 5% of reads |
There was a problem hiding this comment.
Can we turn this into a parameter? By default we can set it to 100% which would essentially skip the SAMTOOLS_SUBSAMPLE_SORT module.
|
|
||
| // SUBWORKFLOW: FASTQ_BWA_MEM_SAMBLASTER | ||
| val_sort_bam = true | ||
| val_sort_bam = false |
There was a problem hiding this comment.
If we skip SAMTOOLS_SUBSAMPLE_SORT, do we need to reenable this? Or was this completely unnecessary. I vaguely remember that I had to turn this on because some downstream tool failed. But maybe that was the old HiC workflow based on the run_visualiser script.
|
|
||
| - name: Post PR comment | ||
| uses: marocchino/sticky-pull-request-comment@52423e01640425a022ef5fd42c6fb5f633a02728 # v2 | ||
| uses: marocchino/sticky-pull-request-comment@773744901bac0e8cbb5a0dc842800d45e9b2b405 # v2 |
I spent some time trying to improve the runtime of the assemblyqc hic steps but my changes did not end up speeding up anything significantly so I think I will drop this. But I wanted to share since I did spend a few days on it and I learned a bit along the way.
The pipeline was doing a
samtools sort -nstep during the hic.bam file creation. The name-sorted bam is only needed for the HICQC module, which only uses ~1M read pairs, so to speed things up I have turned off the name sorting and introduced a new module.SAMTOOLS_SUBSAMPLE_SORT creates a new bam file of a subset of the hic.bam file that is 5% of the reads. This subset_hic.bam is then name-sorted and passed to HICQC.
The rest of the pipeline uses the full (not name-sorted) hic.bam
Unfortunately the time saved during the BWAMEM step does not pay off in the long run. Here is my test using the HYv4 dataset:
PR checklist
nf-core pipelines lint).nextflow run . -profile test,docker --outdir <OUTDIR>andnf-test test --profile docker tests/.nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated.docs/output.mdis updated.CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors).