Skip to content

locateVariants failing on intronic variants #87

@jayoung

Description

@jayoung

hi there,

I'm experiencing a bug that's the same as an issue that was closed back in November, but it still seems to be a problem.

I am using the release version (1.50.0) - apologies if this is fixed in devel. I cannot right now wrap my head around running the devel version - I used to know how to switch back and forth between versions more easily....

Anyway:

locateVariants() fails to annotate some intronic variants, unless I create and use a cached version of the txdb where I've removed empty elements. It's very similar to the older report, but here's the code I use to illustrate it:

I'm using example data as seen in the vignette

thanks!

Janet

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(VariantAnnotation)

## read in test vcf file and sort out seqlevels
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf) <- "chr22"

## define txdb
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## get rowRanges, as in the vignette
rd <- rowRanges(vcf)

### use just the first 10 snps
test_snps <- rd[1:10]

## show that they're in an intron:
txdb_introns <- intronsByTranscript(txdb, use.names=TRUE)
findOverlaps(test_snps, txdb_introns)
# Hits object with 10 hits and 0 metadata columns:
#     queryHits subjectHits
# <integer>   <integer>
#     [1]         1       75253
# [2]         2       75253
# [3]         3       75253
# [4]         4       75253
# [5]         5       75253
# [6]         6       75253
# [7]         7       75253
# [8]         8       75253
# [9]         9       75253
# [10]        10       75253

#### locateVariants returns no annotation for these SNPs, whether we use AllVariants() or IntronVariants()
test_loc <- locateVariants(test_snps, txdb, AllVariants())
test_loc
# 
# GRanges object with 0 ranges and 9 metadata columns:
#     seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID      TXID
# <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <integer>
#     CDSID      GENEID       PRECEDEID        FOLLOWID
# <IntegerList> <character> <CharacterList> <CharacterList>
#     -------
#     seqinfo: no sequences

test_loc_2 <- locateVariants(test_snps, txdb, IntronVariants())
test_loc_2
# GRanges object with 0 ranges and 9 metadata columns:
#     seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID      TXID
# <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <integer>
#     CDSID      GENEID       PRECEDEID        FOLLOWID
# <IntegerList> <character> <CharacterList> <CharacterList>
#     -------
#     seqinfo: no sequences


#### if we make a cached txdb where we remove empty elements it DOES find the intronic variants
## https://github.com/Bioconductor/VariantAnnotation/issues/76

cache <- new.env(parent = emptyenv())

# CodingVariants
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
z <- cdsBy(txdb)
z <- z[lengths(z) > 0L]
cache[["cdsbytx"]] <- z

# IntronVariants
z <- intronsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["intbytx"]] <- z

# ThreeUTRVariants 
z <- threeUTRsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["threeUTRbytx"]] <- z

# FiveUTRVariants 
z <- fiveUTRsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["fiveUTRbytx"]] <- z

# IntergenicVariants 
z <- transcriptsBy(txdb, "gene")
z <- z[lengths(z) > 0L]
cache[["txbygene"]]  <- z

# SpliceSiteVariants 
# no need to populate cache further ?

# PromoterVariants 
z <- transcripts(txdb)
z <- z[lengths(z) > 0L]
cache[["tx"]] <- splitAsList(z, seq_len(length(z))) 
names(cache[["tx"]]) <- z$tx_id

test_loc_3 <- locateVariants(test_snps, txdb, AllVariants(), cache = cache)
test_loc_3
# GRanges object with 10 ranges and 9 metadata columns:
#     seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID
# <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character>
#     rs7410291    chr22  50300078      - |   intron     10763     10763         1       75253
# rs147922003    chr22  50300086      - |   intron     10755     10755         2       75253
# rs114143073    chr22  50300101      - |   intron     10740     10740         3       75253
# rs141778433    chr22  50300113      - |   intron     10728     10728         4       75253
# rs182170314    chr22  50300166      - |   intron     10675     10675         5       75253
# rs115145310    chr22  50300187      - |   intron     10654     10654         6       75253
# rs186769856    chr22  50300268      - |   intron     10573     10573         7       75253
# rs77627744    chr22  50300346      - |   intron     10495     10495         8       75253
# rs193230365    chr22  50300423      - |   intron     10418     10418         9       75253
# rs9627788    chr22  50300438      - |   intron     10403     10403        10       75253
# CDSID      GENEID       PRECEDEID        FOLLOWID
# <IntegerList> <character> <CharacterList> <CharacterList>
#     rs7410291                     79087                                
# rs147922003                     79087                                
# rs114143073                     79087                                
# rs141778433                     79087                                
# rs182170314                     79087                                
# rs115145310                     79087                                
# rs186769856                     79087                                
# rs77627744                     79087                                
# rs193230365                     79087                                
# rs9627788                     79087                                
# -------
#     seqinfo: 1 sequence from an unspecified genome; no seqlengths

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions