forked from UW-GAC/topmed_workshop_2017
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathgds_intro.Rmd
More file actions
95 lines (71 loc) · 2.8 KB
/
gds_intro.Rmd
File metadata and controls
95 lines (71 loc) · 2.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
# GDS format
GDS is Genomic Data Structure, a storage format that can efficiently store genomic data and provide fast random access to subsets of the data. For more information on GDS for sequence data, read the [SeqArray package vignette](https://github.com/zhengxwen/SeqArray/blob/master/vignettes/SeqArrayTutorial.Rmd).
## Exploring a GDS file
To use the R packages developed at the DCC for sequence data, we first need to convert a VCF file to GDS. (If the file is BCF, use [https://samtools.github.io/bcftools/bcftools.html](bcftools) to convert to VCF.)
```{r vcf2gds}
library(SeqArray)
data.path <- "https://github.com/smgogarten/analysis_pipeline/raw/devel/testdata"
vcffile <- "1KG_phase3_subset_chr1.vcf.gz"
if (!file.exists(vcffile)) download.file(file.path(data.path, vcffile), vcffile)
gdsfile <- "1KG_phase3_subset_chr1.gds"
seqVCF2GDS(vcffile, gdsfile, fmt.import="GT", storage.option="LZMA_RA", verbose=FALSE)
```
We can interact with the GDS file using the SeqArray package.
```{r seqarray}
gds <- seqOpen(gdsfile)
gds
# the unique sample identifier comes from the VCF header
sample.id <- seqGetData(gds, "sample.id")
length(sample.id)
head(sample.id)
# a unique integer ID is assigned to each variant
variant.id <- seqGetData(gds, "variant.id")
length(variant.id)
head(variant.id)
# reference allele frequency of each variant
afreq <- seqAlleleFreq(gds)
hist(afreq, breaks=50)
# define a filter to read a subset of data
seqSetFilter(gds, variant.id=1:10, sample.id=sample.id[1:5])
# genotype data is stored as number of copies of each allele
geno <- seqGetData(gds, "genotype")
dim(geno)
geno[,,1:2]
```
The [SeqVarTools package](http://bioconductor.org/packages/SeqVarTools) has some additional functions for interacting with SeqArray-format GDS files.
```{r seqvartools}
library(SeqVarTools)
# return genotypes in matrix format
getGenotype(gds)
getGenotypeAlleles(gds)
refDosage(gds)
altDosage(gds)
# look at reference and alternate alleles
refChar(gds)
altChar(gds)
# reset the filter to all variants and samples
seqResetFilter(gds)
# how many alleles for each variant?
n <- seqNumAllele(gds)
table(n)
# some variants have multiple alleles
multi.allelic <- which(n > 2)
altChar(gds)[multi.allelic]
# extract a particular alternate allele
altChar(gds, n=1)[multi.allelic]
altChar(gds, n=2)[multi.allelic]
# how many variants are SNVs vs INDELs?
table(isSNV(gds, biallelic=TRUE))
table(isSNV(gds, biallelic=FALSE))
seqClose(gds)
```
## Exercise: HWE
Use the `hwe` function in SeqVarTools to run a Hardy-Weinberg Equilibrium test on each variant. Identify a variant with low p-value and inspect its genotypes.
```{r exercise_gds, include=FALSE, eval=FALSE}
hwe.res <- hwe(gds)
lowp <- !is.na(hwe.res$p) & hwe.res$p < 1e-4
head(hwe.res[lowp,])
seqSetFilter(gds, variant.id=75)
table(getGenotype(gds))
table(refDosage(gds))
```