topmed_workshop_2017/association.Rmd at master · UW-GAC/topmed_workshop_2017 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
# Association tests

Since TOPMed has many studies with related participants, we focus on linear mixed models. Logistic mixed models are also possible using GENESIS, see the [GMMAT paper](https://www.ncbi.nlm.nih.gov/pubmed/27018471).

## Null model

The first step in an association test is to fit the null model. We will need an `AnnotatedDataFrame` with phenotypes, and a GRM. We have a sample annotation with a `sample.id` column matched to the GDS file, and a phenotype file with `subject_id`. (In this example, we use the 1000 Genomes IDs for both sample and subject ID.) For TOPMed data, it is also important to match by study, as subject IDs are not unique across studies.

```{r null_model}
# sample annotation
workshop.path <- "https://github.com/UW-GAC/topmed_workshop_2017/raw/master"
sampfile <- "sample_annotation.RData"
if (!file.exists(sampfile)) download.file(file.path(workshop.path, sampfile), sampfile)
annot <- TopmedPipeline::getobj(sampfile)
library(Biobase)
head(pData(annot))

# phenotypes by subject ID
phenfile <- "phenotype_annotation.RData"
if (!file.exists(phenfile)) download.file(file.path(workshop.path, phenfile), phenfile)
phen <- TopmedPipeline::getobj(phenfile)
head(pData(phen))
varMetadata(phen)

# merge sample annotation with phenotypes
library(dplyr)
dat <- pData(annot) %>%
    left_join(pData(phen), by=c("subject.id"="subject_id", "sex"="sex"))
meta <- bind_rows(varMetadata(annot), varMetadata(phen)[3:5,,drop=FALSE])
annot <- AnnotatedDataFrame(dat, meta)

# load the GRM
data.path <- "https://github.com/UW-GAC/analysis_pipeline/raw/master/testdata"
grmfile <- "grm.RData"
if (!file.exists(grmfile)) download.file(file.path(data.path, grmfile), grmfile)
grm <- TopmedPipeline::getobj(grmfile)
# the row and column names of the covariance matrix must be set to sample.id
rownames(grm$grm) <- colnames(grm$grm) <- grm$sample.id
```

We will test for an association between genotype and height, adjusting for sex, age, and study as covariates. If the sample set involves multiple distinct groups with different variances for the phenotype, we recommend allowing the model to use heterogeneous variance among groups with the parameter `group.var`. We saw in a previous exercise that the variance differs by study.

```{r}
library(GENESIS)
nullmod <- fitNullMM(annot, outcome="height", covars=c("sex", "age", "study"),
                     covMatList=grm$grm, group.var="study", verbose=FALSE)
```

We also recommend taking an inverse normal transform of the residuals and refitting the model. This is done separately for each group, and the transformed residuals are rescaled. See the full procedure in the
[pipeline documenation](https://github.com/UW-GAC/analysis_pipeline#association-testing).

## Single-variant tests

Single-variant tests are the same as in GWAS. We use the `assocTestMM` function in GENESIS. We have to create a `SeqVarData` object including both the GDS file and the sample annotation containing phenotypes.

```{r assoc_single}
library(SeqVarTools)
gdsfile <- "1KG_phase3_subset_chr1.gds"
if (!file.exists(gdsfile)) download.file(file.path(data.path, gdsfile), gdsfile)
gds <- seqOpen(gdsfile)
seqData <- SeqVarData(gds, sampleData=annot)
assoc <- assocTestMM(seqData, nullmod)
head(assoc)
```

We make a QQ plot to examine the results.

```{r assoc_single_qq}
library(ggplot2)
qqPlot <- function(pval) {
    pval <- pval[!is.na(pval)]
    n <- length(pval)
    x <- 1:n
    dat <- data.frame(obs=sort(pval),
                      exp=x/n,
                      upper=qbeta(0.025, x, rev(x)),
                      lower=qbeta(0.975, x, rev(x)))

    ggplot(dat, aes(-log10(exp), -log10(obs))) +
        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
        geom_point() +
        geom_abline(intercept=0, slope=1, color="red") +
        xlab(expression(paste(-log[10], "(expected P)"))) +
        ylab(expression(paste(-log[10], "(observed P)"))) +
        theme_bw()
}

qqPlot(assoc$Wald.pval)
```

## Exercises

1. Logistic regression: `fitNullMM` can use a binary phenotype as the outcome variable by specifying the argument `family=binomial`. Use the `status` column in the sample annotation to fit a null model for simulated case/control status, with `sex` and `Population` as covariates. Refer to the documentation for `fitNulMM` to see what other parameters need to be changed for a binary outcome. Then run a single-variant test using this model.

```{r exercise_logistic}
nullmod.status <- fitNullMM(annot, outcome="status", covars=c("sex", "Population"),
                            covMatList=grm$grm, family=binomial, verbose=FALSE)
assoc <- assocTestMM(seqData, nullmod.status, test="Score")
head(assoc)
```

2. Inverse normal transform: use the TopmedPipeline function `addInvNorm` to perform an inverse normal transform on the `height` variable. (You will need to load the TopmedPipeline library.) Inspect the code for this function by typing `addInvNorm` at the R prompt, so you understand what it is doing. Then for each study separately, compute a null model and do the inverse normal transform using just the values for that study. Compare these residuals with the initial residuals you obtained for that study by transforming all studies together.

```{r exercise_invnorm}
library(TopmedPipeline)
annot.norm <- addInvNorm(annot, nullmod, outcome="height", covars=c("sex", "age", "study"))
head(pData(annot.norm))

addInvNorm

studies <- sort(unique(annot$study))
resid.studies <- bind_rows(lapply(studies, function(x) {
    print(x)
    annot.study <- annot[annot$study == x,]
    nullmod.study <- fitNullMM(annot.study, outcome="height", covars=c("sex", "age"),
                               covMatList=grm$grm, verbose=FALSE)
    addInvNorm(annot.study, nullmod.study, outcome="height", covars=c("sex", "age")) %>%
        pData() %>%
        select(resid.norm) %>%
        mutate(study=x, run="separate")
}))
head(resid.studies)

library(tidyr)
dat <- pData(annot.norm) %>%
    select(resid.norm, study) %>%
    mutate(run="combined") %>%
    bind_rows(resid.studies)
ggplot(dat, aes(study, resid.norm, fill=run)) + geom_boxplot()

dat %>%
    group_by(study, run) %>%
    summarise(mean=mean(resid.norm), var=var(resid.norm))
```

## Sliding window tests

For rare variants, we can do burden tests or SKAT on sliding windows using the GENESIS function `assocTestSeqWindow`. We restrict the test to variants with alternate allele frequency < 0.1. (For real data, this threshold would be lower.) We use a flat weighting scheme.

```{r assoc_window_burden}
assoc <- assocTestSeqWindow(seqData, nullmod, test="Burden", AF.range=c(0,0.1),
                            weight.beta=c(1,1), window.size=5, window.shift=2)
names(assoc)
head(assoc$results)
head(assoc$variantInfo)

qqPlot(assoc$results$Score.pval)
```

For SKAT, we use the Wu weights.

```{r assoc_window_skat}
assoc <- assocTestSeqWindow(seqData, nullmod, test="SKAT", AF.range=c(0,0.1),
                            weight.beta=c(1,25), window.size=5, window.shift=2)
head(assoc$results)
head(assoc$variantInfo)

qqPlot(assoc$results$pval_0)
```

### Exercise

Repeat the previous exercise on logistic regression, this time running a sliding-window test.

```{r exercise_sliding}
nullmod.status <- fitNullMM(annot, outcome="status", covars=c("sex", "Population"),
                            covMatList=grm$grm, family=binomial, verbose=FALSE)
assoc <- assocTestSeqWindow(seqData, nullmod.status, test="Burden", AF.range=c(0,0.1),
                            weight.beta=c(1,1), window.size=5, window.shift=2)
head(assoc$results)
```


## Annotation-based aggregate tests

**Note:** the code and libraries in this section are under active development, and are not production-level. It is provided to give workshop participants an example of some of the kinds of analysis tasks that might be performed with TOPMed annotation data. Use the code at your own risk, and be warned that it may break in unexpected ways. Github issues and contributions are welcome!

Analysts generally aggregate rare variants for association testing to decrease multiple testing burden and increase statistical power. They can group variants that fall within arbitrary ranges (such as sliding-windows), or they can group variants with intent. For example, an analyst could aggregate variants that that fall between transcription start sites and stop sites, within coding regions, within regulatory regions, or other genomic features selected from sources like published gene models or position- or transcript-based variant annotation. An analyst could also choose to filter the variants prior or subsequent to aggregation using annotation-based criteria such as functional impact or quality scores.

To demonstrate, we will aggregate a subset of TOPMed variants from chromosome 22. The subset is a portion of TOPMed SNP and indel variants that are also in the 1000 Genomes Project. We will parse an example variant annotation file to select fields of interest, parse a GENCODE .gtf file to define our genic units, and then aggregate the selected variants into the defined genic units.

### Working with variant annotation

Variants called from the TOPMed data set are annotated using the [Whole Genome Sequence Annotator (WGSA)](https://sites.google.com/site/jpopgen/wgsa). WGSA output files include 359 annotation fields, some of which are themselves lists of annotation values. Thus, individual variants may be annotated with more than 1000 individual fields. WGSA produces different output files for different featues. TOPMed variant annotation includes separate files for SNPs and for indels. The subsetted variant annotation files we will use for this example are available via github:

```{r}
ben.workshop.path <- "https://github.com/bheavner/topmed_workshop_2017_bh/raw/master"
snpfile <- "snp.tsv.gz"
if (!file.exists(snpfile)) download.file(file.path(ben.workshop.path, snpfile), snpfile)

indelfile <- "indel.tsv.gz"
if (!file.exists(indelfile)) download.file(file.path(ben.workshop.path, indelfile), indelfile)
```

The WGSA output files are tab-separated text files, with one line per annotated variant. Since there are many annotation fields, these files can be unwieldy to work with directly. As an example, the first two lines of the SNP variant annotation file can be previewed within R:

```{r}
readLines("snp.tsv.gz", n=2)
```

The DCC has begun an R package, *wgsaparsr*, to begin working with WGSA output files. This package is under development, and is available on github at [https://github.com/UW-GAC/wgsaparsr](https://github.com/UW-GAC/wgsaparsr). For now, the package can be installed from github using the *devtools* package:

```{r message=FALSE}
#library(devtools)
#devtools::install_github("UW-GAC/wgsaparsr@1.0.0.9003")
library(wgsaparsr)
#library(tidyverse) # just in case it's not loaded yet
library(tibble)
library(dplyr)
library(tidyr)
library(readr)
```

*wgsaparsr* includes a *get_fields()* function to list the annotation fields available in a WGSA output file:

```{r}
# list all fields in an annotation file:
get_fields("snp.tsv.gz")
```

Only a subset of these annotations may be necessary for a particular association test, and it is unweildy to work with all of them, so it is useful to process the WGSA output file to select fields of interest. The *wgsaparsr* function *parse_to_file()* allows field selection by name.

An additional complication in working with the WGSA output files is that some of the annotation fields are transcript-based, rather than position-based. Thus, if a variant locus is within multiple transcripts, those fields will have multiple entries (often separated by a | character). For example, annotation fields such as `VEP_ensembl_Transcript_ID` may have many values within a single tab-separated field.

*wgsaparsr::parse_to_file()* addresses this by splitting such list-fields into multiple rows. Other annotation fields for that variant are duplicated, and associated columns are filled with the same value for each transcript that a particular variant falls within. A consequence of this approach is that the processed annotation file has more lines than the WGSA output file. In freeze 4, processing expanded the annotation by a factor of about 5 - the 220 million annotations result in a 1-billion row database for subsequent aggregation.

*wgsaparsr::parse_to_file()* reads a snp annotation file, selects specified fields, and expands user-defined transcript-level annotation fields. It produces a tab-separated output file for subsequent analysis.

```{r}
desired_columns <-
  c(
    "`#chr`", #NOTE: backtics on #chr because it starts with special character!
    "pos",
    "ref",
    "alt",
    "rs_dbSNP147",
    # "CADDphred",
    "CADD_phred", #NOTE: different than the indel annotation file.
    "VEP_ensembl_Transcript_ID",
    "VEP_ensembl_Gene_Name",
    "VEP_ensembl_Gene_ID",
    "VEP_ensembl_Consequence",
    "VEP_ensembl_Amino_Acid_Change",
    "VEP_ensembl_LoF",
    "VEP_ensembl_LoF_filter",
    "VEP_ensembl_LoF_flags",
    "VEP_ensembl_LoF_info"
    # "1000Gp3_AF" #skipped for the workshop because code doesn't work with this variable name
    )

to_split <-
    c(
    "VEP_ensembl_Consequence",
    "VEP_ensembl_Transcript_ID",
    "VEP_ensembl_Gene_Name",
    "VEP_ensembl_Gene_ID",
    "VEP_ensembl_Amino_Acid_Change",
    "VEP_ensembl_LoF",
    "VEP_ensembl_LoF_filter",
    "VEP_ensembl_LoF_flags",
    "VEP_ensembl_LoF_info"
    )

parse_to_file("snp.tsv.gz", "parsed_snp.tsv", desired_columns, to_split, verbose = TRUE)
```

Although the output file has fewer columns than the the raw WGSA output file, this .tsv file is not particularly nice to work with in R:

```{r}
readLines("parsed_snp.tsv", n=2)
```

However, *get_fields()* does work on the parsed file to view available fields:

```{r}
# list all fields in an annotation file:
get_fields("parsed_snp.tsv")
```

The WGSA output files for indel variants differs from the output for SNPs. Some of the field names differ slightly (e.g. "CADDphred" instead of "CADD_phred"), and there are some fields of interest that include feature counts in brackets (e.g. ENCODE_Dnase_cells includes fields like 125\{23\}). Thus, (for now) *wgsaparsr* includes *parse_indel_to_file()*. *parse_indel_to_file()* is very similar to *parse_to_file()*, and will likely be incorporated to that function in the near future. The syntax for *parse_indel_to_file()* is the same as *parse_to_file()*:

```{r}
desired_columns_indel <-
  c(
    "`#chr`", #NOTE: backtics on #chr because it starts with special character!
    "pos",
    "ref",
    "alt",
    "rs_dbSNP147",
    "CADDphred",
    #  "CADD_phred", #NOTE: different than the general annotation file.
    "VEP_ensembl_Transcript_ID",
    "VEP_ensembl_Gene_Name",
    "VEP_ensembl_Gene_ID",
    "VEP_ensembl_Consequence",
    "VEP_ensembl_Amino_Acid_Change",
    "VEP_ensembl_LoF",
    "VEP_ensembl_LoF_filter",
    "VEP_ensembl_LoF_flags",
    "VEP_ensembl_LoF_info"
    # "1000Gp3_AF"#skipped for the workshop because code doesn't work with this variable name
    )

parse_indel_to_file("indel.tsv.gz", "parsed_indel.tsv", desired_columns_indel, to_split, verbose = TRUE)
```

Inspection shows that the output format is the same for this function:

```{r}
readLines("parsed_indel.tsv", n=2)
```

Or as a list,

```{r}
# list all fields in an annotation file:
get_fields("parsed_indel.tsv")
```

If an analyst wished to filter the list of variants prior to aggregation, the processing code could be modified to apply filters during parsing, or the annotation file could be reprocessed to apply filters at this point. Alternatively, filters can also be applied subsequent to aggregation.

As insurance for this exercise, the parsed files are also available on github:

```{r}
ben.workshop.path <- "https://github.com/bheavner/topmed_workshop_2017_bh/raw/master"
parsedsnpfile <- "parsed_snp.tsv"
if (!file.exists(parsedsnpfile)) download.file(file.path(ben.workshop.path, parsedsnpfile), parsedsnpfile)

parsedindelfile <- "parsed_indel.tsv"
if (!file.exists(parsedindelfile)) download.file(file.path(ben.workshop.path, parsedindelfile), parsedindelfile)
```

### Defining "gene" ranges for aggregation

Aggregation requires definition of the desired aggregation units. As a relatively simple example, we will build a list of genomic ranges corresponding to genes as defined by the [GENCODE Project](https://www.gencodegenes.org/about.html).

The GENCODE Project's Genomic ENCylopedia Of DNA Elements is available in the well-documented [.gtf file format](https://www.gencodegenes.org/data_format.html). Generally, .gtf files consist of 9 tab-separated fields, some of which may consist of various numbers of key:value pairs.

The DCC has begun an R package, *genetable*, to parse and work with .gtf files. This package is under development, and is available on github at [https://github.com/UW-GAC/genetable](https://github.com/UW-GAC/genetable). For now, the package can be installed from github using the *devtools* package:

```{r message=FALSE}
#library(devtools)
#devtools::install_github("UW-GAC/genetable")

library(genetable)
```

I'll be working with the gencode release 19 because it's the last one on GRCh37. The [file](ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz) can be downloaded via [https://www.gencodegenes.org/releases/19.html](https://www.gencodegenes.org/releases/19.html).

In this case, I've trimmed the gencode file to include only chromosome 22 feature definitions (since that's the variant annotation set I'm using for the demo).

```{r}
gtffile <- "chr22.gtf.gz"
if (!file.exists(gtffile)) download.file(file.path(ben.workshop.path, gtffile), gtffile)

gtf_source <- "chr22.gtf.gz"
```

The details of the *genetable* package are of less interest for this workshop, so we'll just use it - we can import and tidy the .gtf file:
```{r}
# import the gtf file to a tidy data frame (a tibble)
gtf <- import_gencode(gtf_source)

# look at the tibble
glimpse(gtf)
```

We can see that genomic features are tagged by feature type:
```{r}
# summarize the number of features by tag.
summarize_tag(gtf, tag = "basic")
```

And we can use these feature type tags to filter the .gtf annotation to extract the starting and ending genomic positions for features of interest, such as features tagged "gene":
```{r}
# filter gtf file to return transcript features tagged basic
basic_transcripts <- filter_gencode(gtf, featurearg = "transcript", tagarg = "basic")

# or filter for features == "gene"
genes <- filter_gencode(gtf, featurearg = "gene")

# define the boundaries of the feature of interest
# this can be slow for complicated features
#gene_bounds <- define_boundaries(basic_transcripts, "gene_id")
gene_bounds <- define_boundaries(genes, "gene_id")

# can check the resulting tibble for sanity
glimpse(gene_bounds)
```

Finally, *genetable* includes a function to save a self-documented tab separated file containing the filtered .gtf results:
```{r}
# save to file
note <- 'This file includes starting and ending ranges for feature = "gene" in the gtf file.'
save_to_file(gene_bounds, notes = note) # will automatically make file called feature_bounds_DATE.tsv
```

As insurance for this exercise, the genic range definitions that I made last week are also available on github:

```{r}
ben.workshop.path <- "https://github.com/bheavner/topmed_workshop_2017_bh/raw/master"
boundsfile <- "feature_bounds_20170804.tsv"
if (!file.exists(boundsfile)) download.file(file.path(ben.workshop.path, boundsfile), boundsfile)
```

### Aggregating TOPMed variants into genic units

Now we've generated a set of variants with a manaageable number of annotation fields, and defined the desired aggregation units as sets of genomic ranges. The set of variants may be filtered using the annotation fields we've chosen (our list is unfiltered in this example).

We're ready to aggregate the variants by genic units. As insurance, we can start with the same set of inputs by downloading what I generated last week:

```{r}
ben.workshop.path <- "https://github.com/bheavner/topmed_workshop_2017_bh/raw/master"

parsed_snp_file <- "parsed_snp.tsv"
parsed_indel_file <- "parsed_indel.tsv"
unit_defs_file <- "feature_bounds_20170804.tsv"

if (!file.exists(parsed_snp_file)) download.file(file.path(ben.workshop.path, parsed_snp_file), parsed_snp_file)

if (!file.exists(parsed_indel_file)) download.file(file.path(ben.workshop.path, parsed_indel_file), parsed_indel_file)

if (!file.exists(unit_defs_file)) download.file(file.path(ben.workshop.path, unit_defs_file), unit_defs_file)
```

Load the tab-separated files to tibbles (data frames) to work with:

```{R}
snps <- read_tsv(parsed_snp_file, comment = "#")
indels <- read_tsv(parsed_indel_file, comment = "#")
unit_defs <- read_tsv(unit_defs_file, comment = "#", skip = 1)
unit_defs <- select(unit_defs, c(gene_id, agg_start, agg_end))
```

There's probably a nice, fast, vectorized to accomplish this task, but for demonstration purposes, we'll just loop over the unit_defs and select indels and snps within the genomic ranges of interest:

```{r}
# make an empty tibble
foo <- tibble(group_id="", chromosome="", position="", ref="", alt="") %>%
  filter(length(group_id)>1)

# loop over unit defs
for (rowIndex in 1:nrow(unit_defs)) {

  # select snps and insert to foo ## SNPs could be filtered here
  snpsToAdd <- select(snps, c(chr, pos, ref, alt)) %>%
    dplyr::filter(between(pos, unit_defs[rowIndex,]$agg_start, unit_defs[rowIndex,]$agg_end)) %>% # This is the line to vectorize
    distinct() %>%
    mutate(group_id = unit_defs[rowIndex,]$gene_id)

  if (nrow(snpsToAdd) > 0) {
    foo <- add_row(
      foo,
      group_id = snpsToAdd$group_id,
      chromosome = snpsToAdd$chr,
      position = snpsToAdd$pos,
      ref = snpsToAdd$ref,
      alt = snpsToAdd$alt
    )
  }

  # select indels and insert to foo ## Indels could be filtered here, too
  toAdd <- select(indels, c(chr, pos, ref, alt)) %>%
    dplyr::filter(between(pos, unit_defs[rowIndex, ]$agg_start, unit_defs[rowIndex, ]$agg_end)) %>% # to vectorize
    distinct() %>%
    mutate(group_id = unit_defs[rowIndex, ]$gene_id)

  if (rowIndex %% 10 == 0){
    message(
      paste0("row: ", rowIndex,
             " snps to add: ", nrow(snpsToAdd),
             " indels to add: ", nrow(toAdd)))
  }

  if (nrow(toAdd) > 0) {
    foo <- add_row(
      foo,
      group_id = toAdd$group_id,
      chromosome = toAdd$chr,
      position = toAdd$pos,
      ref = toAdd$ref,
      alt = toAdd$alt
    )
  }
}

aggregated_variants <- distinct(foo)
```

That may not be fast or pretty, but we've now got a set of variants aggregated into genic units using the GENCODE gene model! This set can be saved and used with the analysis pipeline for association testing.

We can inspect the tibble with glimpse:
```{r}
glimpse(aggregated_variants)
```

We can do things like counting how many genic units we're using:
```{r}
distinct(as.tibble(aggregated_variants$group_id))
```

We can look at number of variants per aggregation unit:
```{r}
counts <- aggregated_variants %>% group_by(group_id) %>% summarize(n())
```

Feel free to look at other summary statistics and do other exploratory data analysis as you'd like, but don't forget to save it if you'd like to use it for the analysis pipeline!

```{r}
save(aggregated_variants, file = "chr22_gene_aggregates.RDA")
```

### Aggregate unit for association testing exercise
We will be using a slightly different gene-based aggregation unit for the assocation testing exercise. As before, this analysis uses a subset of the TOPMed SNP variants that are present in the 1000 Genomes Project. However, in this exercise, the genic units include TOPMed SNP variants from all chromosomes (no indels, and not just chromosome 22 as before). Further, each genic unit is expanded to include the set of TOPMed SNP variants falling within a GENCODE-defined gene along with 20 kb flanking regions upstream and downstream of that range.

In a larger-scale analysis of TOPMed data, aggregation units could include both TOPMed SNP and indel variants falling within defined aggregation units, and would not be restricted to the variants also present in this chosen subset of the 1000 Genomes Project. An analyst might also choose to filter variants within each unit based on various annotations (examples include loss of function, conservation, deleteriousness scores, etc.).

As before, the aggregation units are defined in an R dataframe. Each row of the dataframe specifies a variant (chromosome, position, ref, alt) and the group identifier (group_id) assigned to it. Mutiple rows with different group identifiers can be specified to assign a variant to different groups (for example a variant can be assigned to mutiple genes).

```{r agg_unit}
aggfile <- "variants_by_gene.RData"
if (!file.exists(aggfile)) download.file(file.path(workshop.path, aggfile), aggfile)
aggunit <- TopmedPipeline::getobj(aggfile)
names(aggunit)
head(aggunit)

# an example of variant that is present in mutiple groups
library(dplyr)
mult <- aggunit %>%
    group_by(chromosome, position) %>%
    summarise(n=n()) %>%
    filter(n > 1)
inner_join(aggunit, mult[2,1:2])
```

### Association testing with aggregate units

We can run a burden test or SKAT on each of these units using the GENESIS function `assocTestSeq`. This function expects a list, where each element of the list is a dataframe representing a single aggregation unit and containing the unique variant.id assigned to each variant in a GDS file. We use the TopmedPipeline function `aggregateListByAllele` to quickly convert our single dataframe to the required format. This function can account for multiallelic variants (the same chromosome, position, and ref, but different alt alleles). The first argument is the GDS object returned by `seqOpen` (see above).

```{r aggVarList}
library(TopmedPipeline)
aggVarList <- aggregateListByAllele(gds, aggunit)
length(aggVarList)
head(names(aggVarList))
aggVarList[[1]]
```

As in the previous section, we must fit the null model before running the association test.

```{r assoc_aggregate}
assoc <- assocTestSeq(seqData, nullmod, test="Burden", aggVarList=aggVarList,
                      AF.range=c(0,0.1), weight.beta=c(1,1))
names(assoc)
head(assoc$results)
head(names(assoc$variantInfo))
head(assoc$variantInfo[[1]])

qqPlot(assoc$results$Score.pval)
```


### Exercise

Since we are working with a subset of the data, many of the genes listed in `group_id` have a very small number of variants. Create a new set of units based on position rather than gene name, using the TopmedPipeline function `aggregateListByPosition`. Then run SKAT using those units.

```{r exercise_aggregate}
agg2 <- aggunit %>%
    mutate(chromosome=factor(chromosome, levels=c(1:22, "X"))) %>%
    select(chromosome, position) %>%
    distinct() %>%
    group_by(chromosome) %>%
    summarise(min=min(position), max=max(position))
head(agg2)

aggByPos <- bind_rows(lapply(1:nrow(agg2), function(i) {
    data.frame(chromosome=agg2$chromosome[i],
               start=seq(agg2$min[i], agg2$max[i]-1e6, length.out=10),
               end=seq(agg2$min[i]+1e6, agg2$max[i], length.out=10))
})) %>%
    mutate(group_id=1:n())
head(aggByPos)

aggVarList <- aggregateListByPosition(gds, aggByPos)
aggVarList[1:2]

assoc <- assocTestSeq(seqData, nullmod, test="SKAT", aggVarList=aggVarList,
                      AF.range=c(0,0.1), weight.beta=c(1,25))
head(assoc$results)
```

```{r assoc_close}
seqClose(gds)
```