-
Notifications
You must be signed in to change notification settings - Fork 459
Description
As noted in #1470 (comment), HTSlib has tabix iterators, and it has multi-region iterators for BAM and CRAM, but it does not have multi-region tabix iterators.
The important thing about multi-region iterators, and why they can't be easily replicated with repeated single-region queries, is that they resolve overlapping regions and results present in multiple regions for you, efficiently. The algorithm required to deduplicate results when you only have a single-region query involves a lot of fiddly parsing of the results and reasoning about when you can forget the deduplication information, and even then might run into trouble when the underlying file contains distinct duplicate entries that you want to see as many times as they actually occur.
Since htslib supports tabix for GAF files containing reads as of @jmonlong's work in #1763, the lack of multi-region tabix support means there's no good way to query reads from multiple regions from a tabix-indexed GAF, without getting duplicate results. A similar issue comes up with tabix-indexed SAM, which @jkbonfield talks about wanting to use for some reason in #1902 (comment).
The tabix iterators are here, and they're all macros around the generic hts_itr_* functions:
Lines 59 to 63 in a4dde3f
| #define tbx_itr_destroy(iter) hts_itr_destroy(iter) | |
| #define tbx_itr_queryi(tbx, tid, beg, end) hts_itr_query((tbx)->idx, (tid), (beg), (end), tbx_readrec) | |
| #define tbx_itr_querys(tbx, s) hts_itr_querys((tbx)->idx, (s), (hts_name2id_f)(tbx_name2id), (tbx), hts_itr_query, tbx_readrec) | |
| #define tbx_itr_next(htsfp, tbx, itr, r) hts_itr_next(hts_get_bgzfp(htsfp), (itr), (r), (tbx)) | |
| #define tbx_bgzf_itr_next(bgzfp, tbx, itr, r) hts_itr_next((bgzfp), (itr), (r), (tbx)) |
But to make a multi-region iterator, you need to define an hts_itr_multi_query_func, the requirements/expected behavior of which are not documented:
Lines 1322 to 1344 in a4dde3f
| typedef int hts_itr_multi_query_func(const hts_idx_t *idx, hts_itr_t *itr); | |
| HTSLIB_EXPORT | |
| int hts_itr_multi_bam(const hts_idx_t *idx, hts_itr_t *iter); | |
| HTSLIB_EXPORT | |
| int hts_itr_multi_cram(const hts_idx_t *idx, hts_itr_t *iter); | |
| /// Create a multi-region iterator from a region list | |
| /** @param idx Index | |
| @param reglist Region list | |
| @param count Number of items in region list | |
| @param getid Callback to convert names to target IDs | |
| @param hdr Indexed file header (passed to getid) | |
| @param itr_specific Filetype-specific callback function | |
| @param readrec Callback to read an input file record | |
| @param seek Callback to seek in the input file | |
| @param tell Callback to return current input file location | |
| @return An iterator on success; NULL on failure | |
| The iterator struct returned by a successful call should be freed | |
| via hts_itr_destroy() when it is no longer needed. | |
| */ | |
| HTSLIB_EXPORT | |
| hts_itr_t *hts_itr_regions(const hts_idx_t *idx, hts_reglist_t *reglist, int count, hts_name2id_f getid, void *hdr, hts_itr_multi_query_func *itr_specific, hts_readrec_func *readrec, hts_seek_func *seek, hts_tell_func *tell); |
You also need to obtain an hts_seek_func, an hts_tell_func, and an hts_readrec_func. For tabix there's only the tbx_readrec function defined right now.
To implement tabix multi-iterators, someone needs to:
- Figure out what the
hts_itr_multi_query_funcfunctions are supposed to do - Figure out if
hts_itr_multi_tabixis a thing that can be implemented, or whether different implementations for the different tabix-indexable files (hts_itr_multi_vcf,hts_itr_multi_gaf,tbx_itr_multi_sam, etc.) are required. - Implement the query function(s), or make the documentation sufficient that library users can implement them for their own tabix-indexed file types.
- Implement the seek and tell functions for tabix, or make it clearer that the user is intended to write these themselves.