topmed_workshop_2017/gds_intro.Rmd at master · joshbis/topmed_workshop_2017 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
# GDS format

GDS is Genomic Data Structure, a storage format that can efficiently store genomic data and provide fast random access to subsets of the data. For more information on GDS for sequence data, read the [SeqArray package vignette](https://github.com/zhengxwen/SeqArray/blob/master/vignettes/SeqArrayTutorial.Rmd).

## Exploring a GDS file

To use the R packages developed at the DCC for sequence data, we first need to convert a VCF file to GDS. (If the file is BCF, use [https://samtools.github.io/bcftools/bcftools.html](bcftools) to convert to VCF.)

```{r vcf2gds}
library(SeqArray)
data.path <- "https://github.com/smgogarten/analysis_pipeline/raw/devel/testdata"
vcffile <- "1KG_phase3_subset_chr1.vcf.gz"
if (!file.exists(vcffile)) download.file(file.path(data.path, vcffile), vcffile)
gdsfile <- "1KG_phase3_subset_chr1.gds"
seqVCF2GDS(vcffile, gdsfile, fmt.import="GT", storage.option="LZMA_RA", verbose=FALSE)
```

We can interact with the GDS file using the SeqArray package.

```{r seqarray}
gds <- seqOpen(gdsfile)
gds

# the unique sample identifier comes from the VCF header
sample.id <- seqGetData(gds, "sample.id")
length(sample.id)
head(sample.id)

# a unique integer ID is assigned to each variant
variant.id <- seqGetData(gds, "variant.id")
length(variant.id)
head(variant.id)

# reference allele frequency of each variant
afreq <- seqAlleleFreq(gds)
hist(afreq, breaks=50)

# define a filter to read a subset of data
seqSetFilter(gds, variant.id=1:10, sample.id=sample.id[1:5])

# genotype data is stored as number of copies of each allele
geno <- seqGetData(gds, "genotype")
dim(geno)
geno[,,1:2]
```

The [SeqVarTools package](http://bioconductor.org/packages/SeqVarTools) has some additional functions for interacting with SeqArray-format GDS files.

```{r seqvartools}
library(SeqVarTools)

# return genotypes in matrix format
getGenotype(gds)
getGenotypeAlleles(gds)
refDosage(gds)
altDosage(gds)

# look at reference and alternate alleles
refChar(gds)
altChar(gds)

# reset the filter to all variants and samples
seqResetFilter(gds)

# how many alleles for each variant?
n <- seqNumAllele(gds)
table(n)

# some variants have multiple alleles
multi.allelic <- which(n > 2)
altChar(gds)[multi.allelic]

# extract a particular alternate allele
altChar(gds, n=1)[multi.allelic]
altChar(gds, n=2)[multi.allelic]

# how many variants are SNVs vs INDELs?
table(isSNV(gds, biallelic=TRUE))
table(isSNV(gds, biallelic=FALSE))

seqClose(gds)
```

## Exercise: HWE

Use the `hwe` function in SeqVarTools to run a Hardy-Weinberg Equilibrium test on each variant. Identify a variant with low p-value and inspect its genotypes.

```{r exercise_gds, include=FALSE, eval=FALSE}
hwe.res <- hwe(gds)
lowp <- !is.na(hwe.res$p) & hwe.res$p < 1e-4
head(hwe.res[lowp,])
seqSetFilter(gds, variant.id=75)
table(getGenotype(gds))
table(refDosage(gds))
```