feat: output dosage and count in the .pvar file from annotaTR #253
yli091230 wants to merge 3 commits into gymrek-lab:master
nicholema
left a comment
The added feature looks good to me
one quick question: where do we usually use these info flags? would it make sense to have a command line switch that turns this output off in case folks don't want it and they would prefer to save some space? or will it always be helpful to have?
Thanks for the comments. We noticed that many imputed TR dosages (or sums of TR lengths) are bi-allelic, or multi-allelic (usually 3 alleles), with a major allele at a very high frequency (e.g. >99.99%). Those loci are problematic for GWAS: they show extreme effect sizes with highly significant p-values. We want to use the DSCOUNT field to filter out those problematic loci. It is only written to the .pvar file. I kept as much raw information as possible there to allow flexible filtering options. I think it is worth keeping, since it doesn't take much space and provides more information than DSLEN.
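The filtering idea described above could be sketched roughly as follows. This is only an illustration, not code from the PR: it assumes DSCOUNT is serialized as comma-separated `dosage:count` pairs, which this thread does not confirm, and it assumes that samples missing from DSCOUNT have dosage zero. The helper names (`parse_dscount`, `major_dosage_fraction`, `keep_locus`) and the default threshold are hypothetical, and the fraction of samples sharing the most common dosage is used as a stand-in for the major allele frequency.

```python
def parse_dscount(value: str) -> dict[float, int]:
    """Parse a DSCOUNT string into {dosage: count}.

    ASSUMPTION: the 'dosage:count,dosage:count' layout is hypothetical;
    adjust the parsing to whatever annotaTR actually writes.
    """
    counts: dict[float, int] = {}
    if not value:
        return counts
    for item in value.split(","):
        dosage, count = item.split(":")
        counts[float(dosage)] = int(count)
    return counts


def major_dosage_fraction(counts: dict[float, int], n_samples: int) -> float:
    """Fraction of samples that share the most common dosage value."""
    # DSCOUNT lists only non-zero dosages, so the remaining samples are
    # assumed (hypothetically) to have a dosage of zero.
    zero_count = n_samples - sum(counts.values())
    return max([zero_count, *counts.values()]) / n_samples


def keep_locus(value: str, n_samples: int, max_major_fraction: float = 0.9999) -> bool:
    """Return True if the locus is variable enough to keep for GWAS."""
    fraction = major_dosage_fraction(parse_dscount(value), n_samples)
    return fraction <= max_major_fraction
```

With a check like this, a locus where only one sample out of 100,000 carries a non-zero dosage would be dropped, while a locus with several well-populated dosage values would be kept. The threshold would need tuning per cohort.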
lots of imputed TRs dosages (or sum of TR lengths) are bi-allelic or with multi-alleles
ok, that makes sense! it sounds like we should always keep that in the .pvar output, in that case
It might be a good idea to add a test to the tests in test_annotaTR.py to confirm that the .pvar output is as we would expect. Also, it might be good to have a separate test for the new GetAlleleCount() method
else:
    raise ValueError("Invalid match_refpanel_on=%s"%match_on)

def GetAlleleCount(record):
Suggested change:
- def GetAlleleCount(record):
+ def GetAlleleCount(record: cyvcf2.Variant):
How similar is this method to tr_harmonizer.GetAlleleCounts? Why do we need a new function for this? Maybe you can describe the differences between the two methods in the function signature of this one?
Checklist
fix:. Otherwise, if it introduces a new feature, please prefix it with feat:. If it introduces a breaking change, please add an exclamation before the colon, like feat!:. If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
Description
Imputed TR output from annotaTR may be bi-allelic and contain rare alleles (only a few counts in a large cohort). This PR makes annotaTR output a new INFO field named DSCOUNT, which contains the non-zero dosages and their counts, in the .pvar file. The script has been tested and an example output is provided below: