Add prefix argument variant_id to plink conversion #390
Merged
jeromekelleher merged 11 commits into sgkit-dev:main from jeromekelleher:plink-improvements on May 21, 2025
Commits
bf958f6 Add prefix argument variant_id to plink conversion (jeromekelleher)
fbbd878 Identify bug in A1/A2 allele handling (jeromekelleher)
a2134a2 Switch to VCF allele ordering for plink (jeromekelleher)
8c4661e Fixup tests to use the prefix not .bed for plink (jeromekelleher)
37d6fd5 Add converted plink file (jeromekelleher)
fab64e1 Add round-trip tests using plink's VCF output (jeromekelleher)
56357da Add more VCF files converted using plink (jeromekelleher)
c758835 Switch plink hets from [1, 0] to [0, 1] (jeromekelleher)
fbbf930 Document CLIs (jeromekelleher)
58df0b2 Fixup du tests (jeromekelleher)
b51c8d5 Finish up docs for plink and update CHANGELOG (jeromekelleher)
@@ -0,0 +1,17 @@
(sec-plink2zarr-cli-ref)=
# CLI Reference

% A note on cross references... There's some weird long-standing problem with
% cross referencing program values in Sphinx, which means that we can't use
% the built-in labels generated by sphinx-click. We can make our own explicit
% targets, but these have to have slightly weird names to avoid conflicting
% with what sphinx-click is doing. So, hence the cmd- prefix.
% Based on: https://github.com/skypilot-org/skypilot/pull/2834

```{eval-rst}

.. _cmd-plink2zarr-convert:
.. click:: bio2zarr.cli:convert_plink
   :prog: plink2zarr convert
   :nested: full
```

@@ -0,0 +1,38 @@
(sec-plink2zarr)=
# plink2zarr

Convert plink data to the
[VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/)
reliably in parallel.

See {ref}`sec-plink2zarr-cli-ref` for detailed documentation on
command line options.

Conversion of the plink data model to VCF follows the semantics of plink1.9 as closely
as possible. That is, given a binary plink fileset with prefix "fileset" (i.e.,
fileset.bed, fileset.bim, fileset.fam), running
```
$ plink2zarr convert fileset out.vcz
```
should produce the same result in ``out.vcz`` as
```
$ plink1.9 --bfile fileset --keep-allele-order --recode vcf-iid --out tmp
$ vcf2zarr convert tmp.vcf out.vcz
```

:::{warning}
It is important to note that we follow the same conventions as plink 2.0
where the A1 allele in the [bim file](https://www.cog-genomics.org/plink/2.0/formats#bim)
is the VCF ALT and A2 is the REF.
:::

:::{note}
Currently we only convert the basic VCF-like data from plink, and don't include
phenotypes and pedigree information. These are planned as future enhancements.
Please comment on [this issue](https://github.com/sgkit-dev/bio2zarr/issues/392)
if you are interested in this functionality.
:::

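The allele convention described in the warning above, together with the commit that switches heterozygotes from [1, 0] to [0, 1], can be illustrated with a small Python sketch. This is not bio2zarr's implementation: the function and constant names are hypothetical, the two-bit code meanings are taken from the plink 1.9 `.bed` format description, and the `-1` missing-value sentinel is an assumption.

```python
# Illustrative sketch only; not the code added in this pull request.
# Assumption: .bed two-bit codes have the meanings given in the plink 1.9
# format description (0 = homozygous A1, 1 = missing, 2 = heterozygous,
# 3 = homozygous A2).

MISSING = -1  # assumed sentinel for a missing allele call


def bim_alleles_to_vcf(a1, a2):
    """Return (REF, ALT) for one .bim row: A2 becomes REF, A1 becomes ALT."""
    return a2, a1


# Allele indexes refer to [REF, ALT], so A2 maps to 0 and A1 maps to 1.
BED_CODE_TO_GENOTYPE = {
    0: (1, 1),              # homozygous A1 -> ALT/ALT
    1: (MISSING, MISSING),  # missing call
    2: (0, 1),              # heterozygous, written as 0/1 rather than 1/0
    3: (0, 0),              # homozygous A2 -> REF/REF
}


def decode_genotypes(codes):
    """Map a sequence of two-bit .bed codes to VCF-style allele index pairs."""
    return [BED_CODE_TO_GENOTYPE[code] for code in codes]


ref, alt = bim_alleles_to_vcf(a1="A", a2="GG")
print(ref, alt)                        # GG A
print(decode_genotypes([0, 2, 3, 1]))  # [(1, 1), (0, 1), (0, 0), (-1, -1)]
```

With A1 = "A" and A2 = "GG", the result matches the REF and ALT columns of the first record in the plink-generated VCF test file in this pull request.
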
@@ -0,0 +1,9 @@
##fileformat=VCFv4.2
##fileDate=20250521
##source=PLINKv1.90
##contig=<ID=1,length=21>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9
1 10 1_10 GG A . . PR GT 1/1 1/1 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
1 20 1_20 C TTT . . PR GT 1/1 1/1 1/1 1/1 0/0 0/0 0/0 0/0 0/0 0/0

@@ -0,0 +1,9 @@
##fileformat=VCFv4.2
##fileDate=20250521
##source=PLINKv1.90
##contig=<ID=1,length=21>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9
1 10 1_10 G A . . PR GT 1/1 1/1 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
1 20 1_20 C T . . PR GT 1/1 1/1 1/1 1/1 0/0 0/0 0/0 0/0 0/0 0/0

Might be nicer to have the PlinkPaths `__init__` just take the prefix?
Pros and cons - I tend to try and keep these dataclasses as simple as possible for reasoning about serialising and so on.
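
One way to picture the trade-off discussed above is to keep the dataclass as plain fields and put the prefix handling in a separate helper. The sketch below is hypothetical: the field names and the `plink_paths_from_prefix` helper are assumptions for illustration and are not taken from the bio2zarr source.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PlinkPaths:
    # Hypothetical field names; the real class may differ.
    bed_path: str
    bim_path: str
    fam_path: str


def plink_paths_from_prefix(prefix):
    """Build the three fileset paths from a plink prefix (hypothetical helper)."""
    return PlinkPaths(
        bed_path=f"{prefix}.bed",
        bim_path=f"{prefix}.bim",
        fam_path=f"{prefix}.fam",
    )


paths = plink_paths_from_prefix("fileset")
# The dataclass stays a flat record, so it serialises to a plain dict.
print(asdict(paths))
```

Keeping the generated `__init__` means the class round-trips through `asdict` and back without special cases, while callers that only have a prefix use the helper.
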