Unexpected behavior from -noOverlapping option of bedtools shuffle #1126

@cwhelan

The documentation for the -noOverlapping feature of bedtools shuffle says:

There often arise cases where one wants to shuffle intervals throughout the genome, yet one wants to prevent the intervals from occupying a single common base pair. The -noOverlapping option allows one to enforce no such overlaps.

I recently encountered a case where the output intervals of bedtools shuffle did overlap despite the use of -noOverlapping. My command line was:

bedtools shuffle -incl hg38_autosome_minus_gaps.bed -i seed_regions.bed -seed 1155258083 -noOverlapping -excl hg38_intergenic_1mb.bed -f 0.2 -g Homo_sapiens_assembly38.fasta.fai > shuffled.bed

Seed regions and other details are below.

I think I have traced this through the code to my use of -f 0.2, which I added to allow partial overlap with the excluded regions given by -excl. The code seems to treat previously shuffled intervals the same as intervals from the -excl file, so it applies the -f parameter to those overlap checks as well.

Perhaps this is expected behavior, but it surprised me given the documentation quoted above ("prevent the intervals from occupying a single common base pair", with no mention of the reciprocal overlap check). If I am correct about the cause, I'd recommend either updating the documentation to make clear that the reciprocal overlap option also applies to -noOverlapping (which still seems a little off to me, given that the list of exclusion regions grows as more intervals are shuffled), or making -noOverlapping ignore -f when checking against previously shuffled intervals.
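A minimal Python sketch of the suspected mechanism, using the two chr7 intervals from the output listed further down. This is not the bedtools source. The function names `overlap_bp` and `tolerated` are hypothetical, and the sketch assumes that the overlap is measured as a fraction of the interval being placed and that the later chr7 interval in the output was placed after the earlier one.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of shared bases between two half-open intervals on one chromosome."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def tolerated(candidate, blocker, max_frac):
    """Hypothetical acceptance test (an assumption, not the real bedtools logic):
    accept the candidate if its overlap with the blocker, as a fraction of the
    candidate's own length, does not exceed max_frac."""
    c_start, c_end = candidate
    shared = overlap_bp(c_start, c_end, blocker[0], blocker[1])
    return shared / (c_end - c_start) <= max_frac


# The two chr7 intervals from the shuffled output in this report.
placed = (95231862, 96231862)     # 1 Mb interval, assumed to be placed first
candidate = (95658166, 99658166)  # 4 Mb interval, assumed to be placed later

shared = overlap_bp(candidate[0], candidate[1], placed[0], placed[1])
frac = shared / (candidate[1] - candidate[0])

print(shared)                             # 573696
print(round(frac, 4))                     # 0.1434
print(tolerated(candidate, placed, 0.2))  # True: accepted when -f 0.2 is applied
print(tolerated(candidate, placed, 0.0))  # False: rejected if no overlap is allowed
```

Under these assumptions the 573,696 bp overlap is about 14% of the 4 Mb interval, which is below 0.2. That is consistent with the -f threshold having been applied to the check against previously shuffled intervals, though it does not prove it.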

bedtools v2.31.0

seed regions:

chr2	12000000	12050000
chr2	13000000	13100000
chr2	14000000	14200000
chr21	15000000	15500000
chr21	18000000	19000000
chr1	100000000	104000000
chr3	12000000	12050000
chr3	13000000	13100000
chr3	14000000	14200000
chr20	15000000	15500000
chr20	18000000	19000000
chr4	100000000	104000000

output regions (note that the two chr7 regions, the 5th and 12th lines, overlap):

chr2	86451858	86501858
chr2	239260937	239360937
chr15	68428748	68628748
chr2	224891400	225391400
chr7	95231862	96231862
chr2	106627290	110627290
chr11	29331791	29381791
chr5	69182028	69282028
chr3	174112517	174312517
chr11	114688618	115188618
chr11	3095108	4095108
chr7	95658166	99658166
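For anyone who wants to confirm the overlap independently of bedtools, here is a small self-contained Python check over the output above. The helper names `parse_bed` and `overlapping_pairs` are made up for this sketch; it compares every pair of intervals and treats BED coordinates as half-open.

```python
from itertools import combinations

# Shuffled output copied from this report (chrom, start, end).
SHUFFLED = """
chr2 86451858 86501858
chr2 239260937 239360937
chr15 68428748 68628748
chr2 224891400 225391400
chr7 95231862 96231862
chr2 106627290 110627290
chr11 29331791 29381791
chr5 69182028 69282028
chr3 174112517 174312517
chr11 114688618 115188618
chr11 3095108 4095108
chr7 95658166 99658166
"""


def parse_bed(text):
    """Parse whitespace-separated BED3 lines into (chrom, start, end) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end = line.split()[:3]
        records.append((chrom, int(start), int(end)))
    return records


def overlapping_pairs(records):
    """Return (a, b, shared_bp) for every pair sharing at least one base."""
    hits = []
    for a, b in combinations(records, 2):
        if a[0] != b[0]:
            continue  # different chromosomes never overlap
        shared = min(a[2], b[2]) - max(a[1], b[1])
        if shared > 0:
            hits.append((a, b, shared))
    return hits


for a, b, shared in overlapping_pairs(parse_bed(SHUFFLED)):
    print(f"{a[0]}:{a[1]}-{a[2]} overlaps {b[0]}:{b[1]}-{b[2]} by {shared} bp")
```

On this data the only pair reported is the two chr7 intervals, which share 573696 bp.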
