Skip to content

nonsense mutations from DBS across two codons misclassified as nonsynonymous? #84

@ChristofferFlensburg

Description

@ChristofferFlensburg

Hi!
VariantAnnotation is a great package and I enjoy depending on it in my own package, thanks for that!

I'm looking at adding support for dibase substitutions, and seems there might be a little bug in the annotation there.

From

coding = VariantAnnotation::predictCoding(vcf, txdb, seqSource = fafile)

on a few hundred variants, I saw a few dibase variants that affected two amino acids, where one of them turned into stop, but CONSEQUENCE (incorrectly I think) classified it as a nonsynomous variant:

> coding[3,]
GRanges object with 1 range and 16 metadata columns:
      seqnames            ranges strand |            REF             ALT
         <Rle>         <IRanges>  <Rle> | <DNAStringSet> <CharacterList>
  [1]        1 27877576-27877577      - |             GG              AA
           QUAL      FILTER      varAllele    CDSLOC    PROTEINLOC   QUERYID
      <numeric> <character> <DNAStringSet> <IRanges> <IntegerList> <integer>
  [1]        40        PASS             TT 1050-1051       350,351        12
             TXID         CDSID      GENEID   CONSEQUENCE       REFCODON
      <character> <IntegerList> <character>      <factor> <DNAStringSet>
  [1]        4786   14376,14377       27245 nonsynonymous         CCCCAG
            VARCODON         REFAA         VARAA
      <DNAStringSet> <AAStringSet> <AAStringSet>
  [1]         CCTTAG            PQ            P*
  -------
  seqinfo: 93 sequences from an unspecified genome

While DBS that stayed in one codon were (correctly) annotated as nonsense.

> coding[coding$CONSEQUENCE=='nonsense',][1,]
GRanges object with 1 range and 16 metadata columns:
      seqnames              ranges strand |            REF             ALT
         <Rle>           <IRanges>  <Rle> | <DNAStringSet> <CharacterList>
  [1]        3 164739107-164739108      - |             AT              TA
           QUAL      FILTER      varAllele    CDSLOC    PROTEINLOC   QUERYID
      <numeric> <character> <DNAStringSet> <IRanges> <IntegerList> <integer>
  [1]        40        PASS             TA 3163-3164          1055        69
             TXID         CDSID      GENEID CONSEQUENCE       REFCODON
      <character> <IntegerList> <character>    <factor> <DNAStringSet>
  [1]       16983         52835        6476    nonsense            ATA
            VARCODON         REFAA         VARAA
      <DNAStringSet> <AAStringSet> <AAStringSet>
  [1]            TAA             I             *
  -------
  seqinfo: 93 sequences from an unspecified genome

Seems it's checking if VARAA is identical to "*", while it really should check if it contains "*", as that is enough to truncate the protein.
Line 184 of methods-predictCoding.R

    consequence[nonsynonymous & (as.character(varAA) %in% "*")] <- "nonsense" 

Possibly should be something along the lines of

    consequence[nonsynonymous & grepl("\\*", as.character(varAA))] <- "nonsense" 

I'm adding a quick fix for it in my code

coding = VariantAnnotation::predictCoding(vcf, txdb, seqSource = fafile)
coding$CONSEQUENCE[grepl("\\*", as.character(coding$VARAA))] = 'nonsense'

so no hurry for me.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions