Skip to content

feat: output dosage and count in the .pvar file from annotaTR#253

Open
yli091230 wants to merge 3 commits intogymrek-lab:masterfrom
yli091230:master
Open

feat: output dosage and count in the .pvar file from annotaTR#253
yli091230 wants to merge 3 commits intogymrek-lab:masterfrom
yli091230:master

Conversation

@yli091230
Copy link
Copy Markdown
Collaborator

Checklist

  • I've checked to ensure there aren't already other open pull requests for the same update/change
  • I've prefixed the title of my PR according to the conventional commits specification. If your PR fixes a bug, please prefix the PR with fix: . Otherwise, if it introduces a new feature, please prefix it with feat: . If it introduces a breaking change, please add an exclamation before the colon, like feat!: . If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
  • I've explained my changes in a manner that will make it possible for both users and maintainers of TRTools to understand them
  • I've added tests for any new functionality. Or, if this PR fixes a bug, I've added test(s) that replicate it
  • All functions, modules, classes etc. still conform to numpy docstring standards

Description

Imputed TRs output from the annotaTR may be bi-allelic and contains rare alleles (only few count in a large cohort). This PR makes annotaTR output a new INFO filed named DSCOUNT contains the non-zero dosage and their counts number in the pvar file. The script has been tested and an example output is provided below:

##fileformat=VCFv4.2
##INFO=<ID=DSLEN,Number=2,Type=Float,Description="Minimum and maximum dosages, used if normalization was applied">
##INFO=<ID=DSCOUNT,Number=1,Type=String,Description="list of dosage with non-zero count and their counts. use for filtering">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr21	14280176	.	A	T	.	.	DSLEN=6.50,6.50	DSCOUNT=6.5;6
chr21	14286385	.	A	T	.	.	DSLEN=14.00,20.00	DSCOUNT=17.0,18.0,18.0,18.0;2,2,1,1
chr21	14289514	.	A	T	.	.	DSLEN=16.50,16.50	DSCOUNT=16.5;6
chr21	14291382	.	A	T	.	.	DSLEN=17.00,21.00	DSCOUNT=20.0;6
chr21	14291456	.	A	T	.	.	DSLEN=8.50,20.50	DSCOUNT=12.5,15.5,14.5;1,2,3
chr21	14292535	.	A	T	.	.	DSLEN=4.00,8.00	DSCOUNT=6.0,7.0;3,3

@yli091230 yli091230 requested review from gymreklab and nicholema April 4, 2025 00:18
Copy link
Copy Markdown
Contributor

@nicholema nicholema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added feature looks good to me

@aryarm
Copy link
Copy Markdown
Member

aryarm commented Jun 19, 2025

one quick question: where do we usually use these info flags? would it make sense to have a command line switch that turns this output off in case folks don't want it and they would prefer to save some space? or will it always be helpful to have?

@yli091230
Copy link
Copy Markdown
Collaborator Author

yli091230 commented Jun 23, 2025

one quick question: where do we usually use these info flags? would it make sense to have a command line switch that turns this output off in case folks don't want it and they would prefer to save some space? or will it always be helpful to have?

Thanks for the comments. We noticed lots of imputed TRs dosages (or sum of TR lengths) are bi-allelic or with multi-alleles (usually 3 alleles) with major alleles with very high allele frequencies (e.g. >99.99%). Those loci will be problematic for GWAS, showing crazy effect size with great p-values. We want to use this DSCOUNT field to filter out those problematic loci. It only write into the .pvar file. I kept as much as raw information there to enable flexible filtering options. I feel worth to keep it there since it wouldn't take too much space and provide more information compare to the DSLEN.

Copy link
Copy Markdown
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of imputed TRs dosages (or sum of TR lengths) are bi-allelic or with multi-alleles

ok, that makes sense! it sounds like we should always keep that in the .pvar output, in that case

It might be a good idea to add a test to the tests in test_annotaTR.py to confirm that the .pvar output is as we would expect. Also, it might be good to have a separate test for the new GetAlleleCount() method

else:
raise ValueError("Invalid match_refpanel_on=%s"%match_on)

def GetAlleleCount(record):
Copy link
Copy Markdown
Member

@aryarm aryarm Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def GetAlleleCount(record):
def GetAlleleCount(record: cyvcf2.Variant):

How similar is this method to tr_harmonizer.GetAlleleCounts? Why do we need a new function for this? Maybe you can describe the differences between the two methods in the function signature of this one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants