Description
The documentation for the -noOverlapping feature of bedtools shuffle says:
There often arise cases where one wants to shuffle intervals throughout the genome, yet one wants to prevent the intervals from occupying a single common base pair. The -noOverlapping option allows one to enforce no such overlaps.
I recently encountered a case where the output intervals of bedtools shuffle did overlap despite the use of -noOverlapping. My command line was:
bedtools shuffle -incl hg38_autosome_minus_gaps.bed -i seed_regions.bed -seed 1155258083 -noOverlapping -excl hg38_intergenic_1mb.bed -f 0.2 -g Homo_sapiens_assembly38.fasta.fai > shuffled.bed
Seed regions and other details are below.
I think I have traced this through the code to my use of -f 0.2, which I passed to allow partial overlap with the excluded regions given by -excl. The code seems to treat previously shuffled intervals the same as intervals from the -excl file, so it applies the -f threshold to those overlap checks as well.
Perhaps this is expected behavior, but it surprised me given the documentation quoted above ("prevent the intervals from occupying a single common base pair", with no mention of the -f overlap fraction). If I am correct about the cause, I'd recommend one of two changes:
- update the documentation to make it clear that -f also applies to the -noOverlapping checks (which still seems a little off to me, given that the list of exclusion regions grows as more intervals are shuffled), or
- make -noOverlapping ignore -f when checking against previously shuffled intervals.
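To illustrate the suspected behavior, here is a minimal Python sketch, not the actual bedtools code. The function names are hypothetical, and it assumes the -f threshold is measured as a fraction of the interval being placed and that an overlap at or below the threshold is tolerated. It contrasts that check with a strict "no shared base" check, using the two chr7 intervals from the output listed below.

```python
# Sketch only: hypothetical helpers, not bedtools internals.
# Intervals are (chrom, start, end) in BED-style half-open coordinates.

def shared_bases(a, b):
    """Number of bases two intervals have in common (0 if on different chroms)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))

def rejected_with_fraction(candidate, placed, max_frac):
    """Assumed -f style check: reject only when the overlap, as a fraction
    of the candidate's length, exceeds max_frac."""
    cand_len = candidate[2] - candidate[1]
    return any(shared_bases(candidate, p) / cand_len > max_frac for p in placed)

def rejected_strict(candidate, placed):
    """Check implied by the -noOverlapping docs: reject on any shared base."""
    return any(shared_bases(candidate, p) > 0 for p in placed)

placed = [("chr7", 95231862, 96231862)]      # earlier shuffled interval (1 Mb)
candidate = ("chr7", 95658166, 99658166)     # later shuffled interval (4 Mb)

print(shared_bases(candidate, placed[0]))              # 573696
print(rejected_with_fraction(candidate, placed, 0.2))  # False
print(rejected_strict(candidate, placed))              # True
```

The overlap is 573696 bp, about 14% of the 4 Mb interval, so a 0.2 fraction threshold would let it through while a strict check would reject it.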
Version: bedtools v2.31.0
seed regions:
chr2 12000000 12050000
chr2 13000000 13100000
chr2 14000000 14200000
chr21 15000000 15500000
chr21 18000000 19000000
chr1 100000000 104000000
chr3 12000000 12050000
chr3 13000000 13100000
chr3 14000000 14200000
chr20 15000000 15500000
chr20 18000000 19000000
chr4 100000000 104000000
output regions (note the two chr7 regions overlap):
chr2 86451858 86501858
chr2 239260937 239360937
chr15 68428748 68628748
chr2 224891400 225391400
chr7 95231862 96231862
chr2 106627290 110627290
chr11 29331791 29381791
chr5 69182028 69282028
chr3 174112517 174312517
chr11 114688618 115188618
chr11 3095108 4095108
chr7 95658166 99658166
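For anyone wanting to confirm the overlap without eyeballing coordinates, a small Python sketch along these lines lists every overlapping pair in the output above. The helper names are my own, and it assumes BED-style half-open coordinates.

```python
from itertools import combinations

# Output regions copied from this report.
OUTPUT_BED = """\
chr2 86451858 86501858
chr2 239260937 239360937
chr15 68428748 68628748
chr2 224891400 225391400
chr7 95231862 96231862
chr2 106627290 110627290
chr11 29331791 29381791
chr5 69182028 69282028
chr3 174112517 174312517
chr11 114688618 115188618
chr11 3095108 4095108
chr7 95658166 99658166
"""

def parse_bed(text):
    """Parse whitespace-separated chrom/start/end lines into tuples."""
    intervals = []
    for line in text.splitlines():
        if line.strip():
            chrom, start, end = line.split()[:3]
            intervals.append((chrom, int(start), int(end)))
    return intervals

def overlapping_pairs(intervals):
    """Return (a, b, shared_bp) for every pair sharing at least one base."""
    hits = []
    for a, b in combinations(intervals, 2):
        if a[0] != b[0]:
            continue
        shared = min(a[2], b[2]) - max(a[1], b[1])
        if shared > 0:
            hits.append((a, b, shared))
    return hits

for a, b, shared in overlapping_pairs(parse_bed(OUTPUT_BED)):
    print(f"{a[0]}:{a[1]}-{a[2]} overlaps {b[0]}:{b[1]}-{b[2]} by {shared} bp")
```

On the regions above this reports a single pair, the two chr7 intervals, sharing 573696 bp.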