Try to ensure CSI indexes are built with valid parameters
The genome range that a CSI index can cover is set by the
combination of min_shift (log2 of the size of the smallest bins) and
n_lvls (the number of levels in the binning index, which sets the
number of smallest bins present). The index code attempted to adjust
n_lvls so that it was high enough to cover the range needed. However,
setting it to a value of ten or more resulted in a broken index
because the resulting bin numbers overflow a 32-bit signed integer.
Such an overflow could easily happen when min_shift was set to less
than 10 and the file being indexed did not include reference lengths,
so the indexer used its default length of 100 Gbases (chosen to be
bigger than any known reference sequence).
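
The arithmetic behind this can be sketched as follows. This is an illustration only, not the htslib code: the helper names are hypothetical, and it assumes the usual CSI relations (range = 2^(min_shift + 3*n_lvls), bin count = (8^(n_lvls+1) - 1) / 7) and that the power of two in the bin count is formed with a 32-bit signed shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bases a CSI index can address,
 * 2^(min_shift + 3*n_lvls), computed in 64 bits. */
static int64_t csi_range(int min_shift, int n_lvls)
{
    assert(min_shift >= 0 && n_lvls >= 0 && min_shift + 3 * n_lvls < 63);
    return (int64_t)1 << (min_shift + 3 * n_lvls);
}

/* Hypothetical helper: total bins over levels 0..n_lvls,
 * (8^(n_lvls+1) - 1) / 7, computed in 64 bits to avoid overflow. */
static int64_t csi_n_bins(int n_lvls)
{
    assert(n_lvls >= 0 && n_lvls <= 19);
    return (((int64_t)1 << (3 * n_lvls + 3)) - 1) / 7;
}

/* Assumption: the 2^(3*n_lvls + 3) term is evaluated as a 32-bit
 * signed int. Returns 1 if it still fits, 0 if it would overflow. */
static int csi_shift_fits_int32(int n_lvls)
{
    assert(n_lvls >= 0 && n_lvls <= 19);
    return ((int64_t)1 << (3 * n_lvls + 3)) <= INT32_MAX;
}
```

Under these assumptions, min_shift = 9 with n_lvls = 9 covers 2^36 bases (about 68.7 Gbases), which is short of the 100 Gbase default, so the old code would have asked for a tenth level.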
This rewrites the n_lvls setting code so that the value chosen
will never be higher than nine. If necessary, min_shift is
adjusted instead to give the desired range; if that happens,
a warning is printed, as it is likely to have overridden a user
setting. The code to do this is moved to hts.c so it can be
called by all of the SAM/BAM, VCF/BCF and tabix indexers.
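
A minimal sketch of that selection logic, assuming the range formula 2^(min_shift + 3*n_lvls). The function name, return convention and warning text are hypothetical; the real code in hts.c may differ in all of them.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_N_LVLS 9   /* cap described in this commit message */

/* Hypothetical helper, not the htslib function. Picks n_lvls (never
 * above MAX_N_LVLS) so that 2^(min_shift + 3*n_lvls) >= max_len,
 * raising *min_shift only if the level cap is not enough.
 * Returns 1 if min_shift was changed, 0 otherwise. */
static int adjust_csi_params(int64_t max_len, int *min_shift, int *n_lvls)
{
    assert(max_len > 0 && max_len <= ((int64_t)1 << 62));
    assert(*min_shift > 0 && *min_shift + 3 * MAX_N_LVLS <= 62);

    int shift = *min_shift, lvls = 0, changed = 0;

    /* First try to reach the range by adding levels, up to the cap. */
    while (lvls < MAX_N_LVLS && ((int64_t)1 << (shift + 3 * lvls)) < max_len)
        lvls++;

    /* Still too small: widen the smallest bins instead. */
    while (((int64_t)1 << (shift + 3 * lvls)) < max_len) {
        shift++;
        changed = 1;
    }

    if (changed)
        fprintf(stderr, "warning: min_shift raised from %d to %d\n",
                *min_shift, shift);

    *min_shift = shift;
    *n_lvls = lvls;
    return changed;
}
```

With max_len set to 100 Gbases, a requested min_shift of 14 is kept and n_lvls comes out as 8, while a requested min_shift of 9 is raised to 10 with n_lvls at the cap of 9.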
For the case where there are no contig lengths, n_lvls is chosen
to give an indexable length of at least 100G if min_shift >= 10,
or otherwise n_lvls is set to the maximum allowed (9) to give
the longest range permitted by the requested min_shift. This
should work for all but the longest genomes; should the length
limit be hit, indexing will fail and the user will see an error
message suggesting they use a larger min_shift value (see
hts_idx_check_range).
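
The no-contig-length case and the later range check might look roughly like this. Both helpers are hypothetical stand-ins written for illustration; in particular, `check_range` does not reproduce the signature or message of hts_idx_check_range.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEFAULT_LEN 100000000000LL  /* 100 Gbases, the default named above */
#define MAX_N_LVLS  9

/* Hypothetical: choose n_lvls when no reference lengths are known.
 * Reaches DEFAULT_LEN when min_shift >= 10, otherwise stops at the cap. */
static int default_n_lvls(int min_shift)
{
    assert(min_shift > 0 && min_shift + 3 * MAX_N_LVLS < 63);
    int n_lvls = 0;
    while (n_lvls < MAX_N_LVLS &&
           ((int64_t)1 << (min_shift + 3 * n_lvls)) < DEFAULT_LEN)
        n_lvls++;
    return n_lvls;
}

/* Hypothetical: reject coordinates beyond the indexable range.
 * Returns 0 if `end` fits, -1 (with a message on stderr) if not. */
static int check_range(int64_t end, int min_shift, int n_lvls)
{
    assert(min_shift > 0 && n_lvls >= 0 && min_shift + 3 * n_lvls < 63);
    int64_t limit = (int64_t)1 << (min_shift + 3 * n_lvls);
    if (end > limit) {
        fprintf(stderr, "position %lld is beyond the index limit %lld; "
                "try a larger min_shift\n", (long long)end, (long long)limit);
        return -1;
    }
    return 0;
}
```

Here min_shift = 9 yields n_lvls = 9 and a limit of 2^36 bases, so a record ending past that point would fail the check rather than produce a broken index.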
Fixes #1966 (CSI access runtime issue with m=9)