-
Notifications
You must be signed in to change notification settings - Fork 10
Support unindexed VCFs #337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support unindexed VCFs #337
Conversation
Oh, that's neat. Thanks @jeromekelleher. From my reading here this is supporting non-indexed VCFs, rather than simply non-compressed VCFs, right? |
Compressed or uncompressed is fine. |
You're right though, it's the indexing that matters |
c2b7a63
to
123a9c7
Compare
OK, I think this is ready to go. We can now support VCFs that don't have an index and (therefore) can also support uncompressed VCF text. The rules are
Other than that, everything works as before. We can mixed indexed and unindexed files no problem. |
Woohoo. This is great. @AprilWei001 will be pleased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! Just a couple of minor comments.
bio2zarr/vcf_utils.py
Outdated
|
||
def contig_record_counts(self): | ||
if self.index is None: | ||
return {self.sequence_names[0]: np.inf} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this infinite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent question! Fixed to use the correct constant.
bio2zarr/vcf_utils.py
Outdated
TABIX = ".tbi" | ||
|
||
|
||
class IndexedVcf(contextlib.AbstractContextManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is now a bit misleading since the VCF is not necessarily indexed. The class is used for partitioning, iterating over variants, and getting metadata about number of variants. I can't think of a better name though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. I've renamed to "VcfFile". It's good enough.
Supporting uncompressed VCF is a bit of a faff, but worth the trouble. Here's a WIP for that.
cc @hyanwong