topmed_workshop_2017/gds_intro.Rmd at master · UW-GAC/topmed_workshop_2017 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
# GDS format

GDS is Genomic Data Structure, a storage format that can efficiently store genomic data and provide fast random access to subsets of the data. For more information on GDS for sequence data, read the [SeqArray package vignette](https://github.com/zhengxwen/SeqArray/blob/master/vignettes/SeqArrayTutorial.Rmd).

## Exploring a GDS file

To use the R packages developed at the DCC for sequence data, we first need to convert a VCF file to GDS. (If the file is BCF, use [https://samtools.github.io/bcftools/bcftools.html](bcftools) to convert to VCF.)

```{r vcf2gds}
library(SeqArray)
data.path <- "https://github.com/UW-GAC/analysis_pipeline/raw/master/testdata"
vcffile <- "1KG_phase3_subset_chr1.vcf.gz"
if (!file.exists(vcffile)) download.file(file.path(data.path, vcffile), vcffile)
gdsfile <- "1KG_phase3_subset_chr1.gds"
seqVCF2GDS(vcffile, gdsfile, fmt.import="GT", storage.option="LZMA_RA", verbose=FALSE)
```

We can interact with the GDS file using the SeqArray package.

```{r seqarray}
gds <- seqOpen(gdsfile)
gds

# the unique sample identifier comes from the VCF header
sample.id <- seqGetData(gds, "sample.id")
length(sample.id)
head(sample.id)

# a unique integer ID is assigned to each variant
variant.id <- seqGetData(gds, "variant.id")
length(variant.id)
head(variant.id)

# reference allele frequency of each variant
afreq <- seqAlleleFreq(gds)
hist(afreq, breaks=50)
```

We can define a filter on the `gds` object. After using the `seqSetFilter` command, all subsequent reads from the `gds` object are restricted to the selected subset of data, until a new filter is defined or `seqResetFilter` is called.

```{r filter}
seqSetFilter(gds, variant.id=1:10, sample.id=sample.id[1:5])
```

Genotype data is stored in a 3-dimensional array, where the first dimension is always 2 for diploid genotypes. The second and third dimensions are samples and variants, respectively. The values of the array denote alleles: `0` is the reference allele and `1` is the alternate allele. For multiallelic variants, other alternate alleles are represented as integers `> 1`.

```{r genotypes}
geno <- seqGetData(gds, "genotype")
dim(geno)
geno[,,1:2]
```

The [SeqVarTools package](http://bioconductor.org/packages/SeqVarTools) has some additional functions for interacting with SeqArray-format GDS files.

```{r seqvartools}
library(SeqVarTools)

# return genotypes in matrix format
getGenotype(gds)
getGenotypeAlleles(gds)
refDosage(gds)
altDosage(gds)

# look at reference and alternate alleles
refChar(gds)
altChar(gds)

# reset the filter to all variants and samples
seqResetFilter(gds)

# how many alleles for each variant?
n <- seqNumAllele(gds)
table(n)

# some variants have more than one alternate allele
multi.allelic <- which(n > 2)
altChar(gds)[multi.allelic]

# extract a particular alternate allele
# first alternate
altChar(gds, n=1)[multi.allelic]
# second alternate
altChar(gds, n=2)[multi.allelic]

# how many variants are SNVs vs INDELs?
table(isSNV(gds, biallelic=TRUE))
table(isSNV(gds, biallelic=FALSE))
# 11 SNVs are multi-allelic
```

## Exercises

1. Set a filter selecting only multi-allelic variants. Inspect their genotypes using the different methods you learned above. Use the `alleleDosage` method to find dosage for the second (and third, etc.) alternate allele.

```{r exercise_gds}
seqSetFilter(gds, variant.sel=multi.allelic)

geno <- seqGetData(gds, "genotype")
dim(geno)
geno[,1:5,]

geno <- getGenotype(gds)
dim(geno)
head(geno)

geno <- getGenotypeAlleles(gds)
head(geno)

dos <- refDosage(gds)
head(dos)

dos <- altDosage(gds)
head(dos)

dos <- alleleDosage(gds, n=2)
head(dos)

dos <- alleleDosage(gds, n=3)
head(dos)
```

2. Use the `hwe` function in SeqVarTools to run a Hardy-Weinberg Equilibrium test on each variant. Identify a variant with low p-value and inspect its genotypes. (Note that the HWE test is only valid for biallelic variants.)

```{r exercise_hwe}
seqResetFilter(gds)
hwe.res <- hwe(gds)
lowp <- !is.na(hwe.res$p) & hwe.res$p < 1e-4
head(hwe.res[lowp,])

seqSetFilter(gds, variant.id=75)
table(getGenotype(gds))
table(refDosage(gds))
```

```{r intro_close}
seqClose(gds)
```