-
Notifications
You must be signed in to change notification settings - Fork 20
Description
Hi!
VariantAnnotation is a great package and I enjoy depending on it in my own package, thanks for that!
I'm looking at adding support for dibase substitutions, and seems there might be a little bug in the annotation there.
From
coding = VariantAnnotation::predictCoding(vcf, txdb, seqSource = fafile)
on a few hundred variants, I saw a few dibase variants that affected two amino acids, where one of them turned into stop, but CONSEQUENCE (incorrectly I think) classified it as a nonsynomous variant:
> coding[3,]
GRanges object with 1 range and 16 metadata columns:
seqnames ranges strand | REF ALT
<Rle> <IRanges> <Rle> | <DNAStringSet> <CharacterList>
[1] 1 27877576-27877577 - | GG AA
QUAL FILTER varAllele CDSLOC PROTEINLOC QUERYID
<numeric> <character> <DNAStringSet> <IRanges> <IntegerList> <integer>
[1] 40 PASS TT 1050-1051 350,351 12
TXID CDSID GENEID CONSEQUENCE REFCODON
<character> <IntegerList> <character> <factor> <DNAStringSet>
[1] 4786 14376,14377 27245 nonsynonymous CCCCAG
VARCODON REFAA VARAA
<DNAStringSet> <AAStringSet> <AAStringSet>
[1] CCTTAG PQ P*
-------
seqinfo: 93 sequences from an unspecified genome
While DBS that stayed in one codon were (correctly) annotated as nonsense.
> coding[coding$CONSEQUENCE=='nonsense',][1,]
GRanges object with 1 range and 16 metadata columns:
seqnames ranges strand | REF ALT
<Rle> <IRanges> <Rle> | <DNAStringSet> <CharacterList>
[1] 3 164739107-164739108 - | AT TA
QUAL FILTER varAllele CDSLOC PROTEINLOC QUERYID
<numeric> <character> <DNAStringSet> <IRanges> <IntegerList> <integer>
[1] 40 PASS TA 3163-3164 1055 69
TXID CDSID GENEID CONSEQUENCE REFCODON
<character> <IntegerList> <character> <factor> <DNAStringSet>
[1] 16983 52835 6476 nonsense ATA
VARCODON REFAA VARAA
<DNAStringSet> <AAStringSet> <AAStringSet>
[1] TAA I *
-------
seqinfo: 93 sequences from an unspecified genome
Seems it's checking if VARAA is identical to "*", while it really should check if it contains "*", as that is enough to truncate the protein.
Line 184 of methods-predictCoding.R
consequence[nonsynonymous & (as.character(varAA) %in% "*")] <- "nonsense"
Possibly should be something along the lines of
consequence[nonsynonymous & grepl("\\*", as.character(varAA))] <- "nonsense"
I'm adding a quick fix for it in my code
coding = VariantAnnotation::predictCoding(vcf, txdb, seqSource = fafile)
coding$CONSEQUENCE[grepl("\\*", as.character(coding$VARAA))] = 'nonsense'
so no hurry for me.