-
Notifications
You must be signed in to change notification settings - Fork 20
Open
Description
hi there,
I'm experiencing a bug that's the same as an issue that was closed back in November, but it still seems to be a problem.
I am using the release version (1.50.0) - apologies if this is fixed in devel. I cannot right now wrap my head around running the devel version - I used to know how to switch back and forth between versions more easily....
Anyway:
locateVariants() fails to annotate some intronic variants, unless I create and use a cached version of the txdb where I've removed empty elements. It's very similar to the older report, but here's the code I use to illustrate it:
I'm using example data as seen in the vignette
thanks!
Janet
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(VariantAnnotation)
## read in test vcf file and sort out seqlevels
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf) <- "chr22"
## define txdb
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## get rowRanges, as in the vignette
rd <- rowRanges(vcf)
### use just the first 10 snps
test_snps <- rd[1:10]
## show that they're in an intron:
txdb_introns <- intronsByTranscript(txdb, use.names=TRUE)
findOverlaps(test_snps, txdb_introns)
# Hits object with 10 hits and 0 metadata columns:
# queryHits subjectHits
# <integer> <integer>
# [1] 1 75253
# [2] 2 75253
# [3] 3 75253
# [4] 4 75253
# [5] 5 75253
# [6] 6 75253
# [7] 7 75253
# [8] 8 75253
# [9] 9 75253
# [10] 10 75253
#### locateVariants returns no annotation for these SNPs, whether we use AllVariants() or IntronVariants()
test_loc <- locateVariants(test_snps, txdb, AllVariants())
test_loc
#
# GRanges object with 0 ranges and 9 metadata columns:
# seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID
# <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <integer>
# CDSID GENEID PRECEDEID FOLLOWID
# <IntegerList> <character> <CharacterList> <CharacterList>
# -------
# seqinfo: no sequences
test_loc_2 <- locateVariants(test_snps, txdb, IntronVariants())
test_loc_2
# GRanges object with 0 ranges and 9 metadata columns:
# seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID
# <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <integer>
# CDSID GENEID PRECEDEID FOLLOWID
# <IntegerList> <character> <CharacterList> <CharacterList>
# -------
# seqinfo: no sequences
#### if we make a cached txdb where we remove empty elements it DOES find the intronic variants
## https://github.com/Bioconductor/VariantAnnotation/issues/76
cache <- new.env(parent = emptyenv())
# CodingVariants
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
z <- cdsBy(txdb)
z <- z[lengths(z) > 0L]
cache[["cdsbytx"]] <- z
# IntronVariants
z <- intronsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["intbytx"]] <- z
# ThreeUTRVariants
z <- threeUTRsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["threeUTRbytx"]] <- z
# FiveUTRVariants
z <- fiveUTRsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["fiveUTRbytx"]] <- z
# IntergenicVariants
z <- transcriptsBy(txdb, "gene")
z <- z[lengths(z) > 0L]
cache[["txbygene"]] <- z
# SpliceSiteVariants
# no need to populate cache further ?
# PromoterVariants
z <- transcripts(txdb)
z <- z[lengths(z) > 0L]
cache[["tx"]] <- splitAsList(z, seq_len(length(z)))
names(cache[["tx"]]) <- z$tx_id
test_loc_3 <- locateVariants(test_snps, txdb, AllVariants(), cache = cache)
test_loc_3
# GRanges object with 10 ranges and 9 metadata columns:
# seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID
# <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character>
# rs7410291 chr22 50300078 - | intron 10763 10763 1 75253
# rs147922003 chr22 50300086 - | intron 10755 10755 2 75253
# rs114143073 chr22 50300101 - | intron 10740 10740 3 75253
# rs141778433 chr22 50300113 - | intron 10728 10728 4 75253
# rs182170314 chr22 50300166 - | intron 10675 10675 5 75253
# rs115145310 chr22 50300187 - | intron 10654 10654 6 75253
# rs186769856 chr22 50300268 - | intron 10573 10573 7 75253
# rs77627744 chr22 50300346 - | intron 10495 10495 8 75253
# rs193230365 chr22 50300423 - | intron 10418 10418 9 75253
# rs9627788 chr22 50300438 - | intron 10403 10403 10 75253
# CDSID GENEID PRECEDEID FOLLOWID
# <IntegerList> <character> <CharacterList> <CharacterList>
# rs7410291 79087
# rs147922003 79087
# rs114143073 79087
# rs141778433 79087
# rs182170314 79087
# rs115145310 79087
# rs186769856 79087
# rs77627744 79087
# rs193230365 79087
# rs9627788 79087
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
Metadata
Metadata
Assignees
Labels
No labels