gnomeR/vignettes/data-processing-vignette.Rmd at 05c6e05362706a832bace4ac4d4e86d4a799fcb6 · MSKCC-Epi-Bio/gnomeR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
---
title: "Processing Data with {gnomeR}"
output: rmarkdown::html_vignette
author: Akriti Mishra, Karissa Whiting, Hannah Fuchs
vignette: >
  %\VignetteIndexEntry{Processing Data with {gnomeR}}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

In this vignette we will walk through a data example to show available {gnomeR} data processing,
visualization and analysis functions. We will also outline some {gnomeR} helper functions to use when you encounter common pitfalls and format inconsistencies when working with mutation, CNA or structural variant data.

##  Setting up

Make sure {gnomeR} is installed & loaded. We will use dplyr for the purposes of this vignette as well.

```{r message = FALSE, warning=FALSE}
library(gnomeR)
library(dplyr)
```

To demonstrate {gnomeR} functions, we will be using a random sample of 200 patients from a publicly available prostate cancer study retrieved from cBioPortal. Data on mutations, CNA, and structural variants for these patients are available in the gnomeR package (`gnomeR::mutations`, `gnomeR::cna`, `gnomeR::sv`).

Note: To access data from cBioPortal, you can use the {cbioportalR} package:
https://github.com/karissawhiting/cbioportalR.


## Data Formats

Mutation, CNA, or structural variant (also called fusion) data may be formatted differently depending on where you source it. Below we outline some differences you may encounter when downloading data from the [cBioPortal website](https://www.cbioportal.org/) versus pulling it via the [cBioPortal API](https://github.com/karissawhiting/cbioportalR), and review the important/required columns for each. Example data in this package was pulled using the API.

See [cBioPortal documentation](https://docs.cbioportal.org/file-formats/) for more details on different file formats supported on cBioPortal, and their data schema and coding.

### Mutation Data

The most common mutation data format is the [Mutation Annotation Format (MAF)](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) created as part of The Cancer Genome Atlas (TCGA) project. Each row in a MAF file represents a unique gene for a specified sample, therefore there are usually several rows per sample. To use MAF files, you need at minimum a sample ID column and a hugo symbol column, though often additional information like mutation type or location are also necessary.

MAF formats are fairly consistent across sources, however if you download the raw data from a study on cBioPortal using the interactive download button you might notice some differences in the variables names in comparison to data imported from the API. For instance, below web-downloaded MAF names are on the left and API-downloaded MAF names are on the right:

    * `Tumor_Sample_Barcode` is called `sampleId`
    * `Hugo_symbol` is called `hugoGeneSymbol`
    * `Variant_Classification` is called `mutationType`
    * `HGVSp_Short` is called `proteinChange`
    * `Chromosome` is called 'chr'

Some of the other variables are named differently as well but those differences are more intuitive. You can
refer to `gnomeR::names_df` for more information on possible MAF variable names.

Luckily, most {gnomeR} functions use the data dictionary in `gnomeR::names_df` to automatically recognize the most common MAF variable names and turn them into clean, snakecase names in resulting dataframes. For example:

```{r}
gnomeR::mutations %>% names()
```

```{r}
rename_columns(gnomeR::mutations) %>% names()
```

As you can see, some variables, such as `linkMsa`, were not transformed because they are not used in {gnomeR} functions.

### CNA data

The discrete copy number data from cBioPortal contains values that would be derived from copy-number analysis algorithms like GISTIC 2.0 or RAE. CNA data is often presented in a long or wide format:

  1. Long-format - Each row is a CNA event for a given gene and sample, therefore samples often have multiple rows. This is most common format you will receive when downloading data using the API.

```{r }
gnomeR::cna[1:6, ]
```

  2. Wide-format - Organized such that there is one column per sample and one row per gene. Thus, each sample's events are contained within one column. This is most common format you will receive when downloading data from the cBioPortal web browser.

```{r}
gnomeR::cna_wide[1:6, 1:6]
```

{gnomeR} features two helper functions to easily pivot from wide- to long-format and vice-versa.

```{r, eval=FALSE}
pivot_cna_wider(rename_columns(gnomeR::cna))
pivot_cna_longer(gnomeR::cna_wide)
```

These functions will also relabel CNA levels (numeric values) to characters as shown below:

```{r, echo=FALSE}
allowed_cna_levels <- tibble::tribble(
               ~detailed_coding, ~numeric_coding,   ~simplified_coding,
                      "neutral",             "0",       "neutral",
          "homozygous deletion",            "-2",      "deletion",
                          "loh",          "-1.5",      "deletion",
          "hemizygous deletion",            "-1",      "deletion",
                         "gain",             "1", "amplification",
      "high level amplification",            "2", "amplification")

allowed_cna_levels %>% knitr::kable()

```

{gnomeR} automatically checks CNA data labels and recodes as needed within functions. You can also use the `recode_cna()` function to do it yourself if pivoting is unnecessary:


```{r, eval=FALSE}
gnomeR::cna %>%
  mutate(alteration_recoded = recode_cna(alteration))%>%
  select(hugoGeneSymbol, sampleId, alteration, alteration_recoded)%>%
  head()
```


## Preparing Data For Analysis

### Process Data with `create_gene_binary()`

Often the first step to analyzing genomic data is organizing it in an event matrix. This matrix will have one row for each sample in your cohort and one column for each type of genomic event.  Each cell will take a value of `0` (no event on that gene/sample), `1` (event on that gene/sample) or `NA` (missing data or gene not tested on panel). The `create_gene_binary()` function helps you process your data into this format for use in downstream analysis.

You can `create_gene_binary()` from any single type of data (mutation, CNA or fusion):

```{r, include = TRUE}
create_gene_binary(mutation = gnomeR::mutations)[1:6, 1:6]
```

or you can process several types of alterations into a single matrix. Supported data types are:

- mutations
- copy number amplifications
- copy number deletions
- gene fusions

<div class="alert alert-danger" role="alert">
  All datasets should be in long-format.
</div>


When processing multiple types of alteration data, by default there will be a separate column for each type of alteration on that gene. For example, if the TP53 gene could have up to 4 columns: `TP53` (mutation), `TP53.Amp` (amplification), `TP53.Del` (deletion), and `TP53.fus` (structural variant). Further, if no events are observed within the data set for a type of alteration (let's say no TP53 mutations but some TP53 amplifications), columns with all zeros will be excluded. For example, in `colnames(all_bin)` below, there is at least one `ERG.fus` event, but no `ERG` mutation events.

Note the use of the `samples` argument. This allows you to specify exactly which samples are in
your resulting data frame. This argument allows you to retain samples that have no genetic events as a row of `0` and `NA`. If you do not specify such samples within your sample cohort, rows with no alterations will be excluded from the final matrix.

```{r, include = TRUE}

samples <- unique(gnomeR::mutations$sampleId)[1:10]

all_bin <- create_gene_binary(
    samples = samples,
    mutation = gnomeR::mutations,
    cna = gnomeR::cna,
    fusion = gnomeR::sv
)

all_bin[1:6, 1:6]

colnames(all_bin)

```

**Notes on some helpful `create_gene_binary()` arguments:**

- `mut_type`- by default, any germline mutations will be omitted because data is often incomplete, but you can choose to leave them in if needed.
- `specify_panel`- If you are working across a set of samples that was sequenced on several different gene panels, this argument will insert NAs for the genes that weren't tested for any given sample. You can pass a string `"impact"` indicating automatically guessing panels and processsing IMPACT samples based on ID, or you can pass a data frame with columns sample_id and gene_panel for more fine grained control of NA annotation.
- `recode_aliases` - Sometimes genes have several accepted names or change names over time. This can be an issue if genes are coded under multiple names in studies, or if you are working across studies. By default, this function will search for aliases for genes in your data set and resolved them to their current most common name.

### Point Mutations: a Special Consideration

Sometimes researchers will be interested in specific changes to DNA sequences at targeted places on a chromosome and allele.

For example, a few distinct mutations of the TERT gene have been associated with brain tumorgenesis: C228T and C250T.  [1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8945783/) In both situations, nucleic acid C is swapped for T. For more background information on the biological mechanisms of TERT regulation, please reference the following article [here](https://jcp.bmj.com/content/72/4/281).

Let's say you want to differentiate C228T and C250T from any other TERT mutations. Using a MAF file, you can use the `chromosome` and `start_position` variables to pinpoint the location of these mutations on the TERT gene and classify them by a new `hugo_symbol` using `add_point_muts()`. If you are unsure of these values, [Gene Cards](https://www.genecards.org/) is a helpful resource. Below are examples using the [MSK-IMPACT Clinical Sequencing Cohort](https://www.cbioportal.org/study/summary?id=msk_impact_2017) from cBioPortal used in a [prior study](https://pubmed.ncbi.nlm.nih.gov/28481359/).

The `add_point_mutation()` function has four arguments:
- `chr_num` - number to identify chromosome
- `start_pos` - the start position of the mutation we are interesed in, which can be found on Gene Cards
- `new_name` - a string to indicate what the point mutation should be called moving forward
- `mutually_exclusive` - logical, default = `TRUE`, indicating if this point mutation should be considered mutually exclusive from the overall gene of interest. This affects mutational frequency calculations and interpretations of results. For example, say you had 10 TERT mutations across 20 individuals and two are TERT.C228T point mutations. By default, the mutational frequencies would be calculated as 40% for TERT (8/20) and 10% for TERT.C228T. If set to false, the mutation frequencies would be 50% for any TERT mutation (10/20) and 10% for TERT.C228T (2/20).

```{r}
# load in IMPACT data
# would need an API token from cBioPortal to access data

cbioportalR::set_cbioportal_db("public")

impact_df <- cbioportalR::get_mutations_by_study(study_id = 'msk_impact_2017')

# find number of TERT mutations in dataset regardless of
impact_df %>%
  count(hugoGeneSymbol) %>%
  filter(hugoGeneSymbol == "TERT")

TERT_point_muts <- impact_df %>%
  filter(hugoGeneSymbol == "TERT")%>%
  # the start position could be a set value, or a pattern
  # that can be searched for in the data
  add_point_mut(chr_num = 5, start_pos = '1295228', new_name = 'TERT.C228T', mutually_exclusive = T) %>%
  add_point_mut(chr_num = 5, start_pos = '250$', new_name = 'TERT.C250T', mutually_exclusive = T)

# see new counts of hugo symbols
count(TERT_point_muts, hugo_symbol)

```

### Collapse Data with `summarize_by_gene()`

If the type of alteration event (mutation, amplification, deletion, structural variant) does not matter for your analysis, and you want to see if any event occurred for a gene, pipe your `create_gene_binary()` object through the `summarize_by_gene()` function. As you can see, this compresses all alteration types of the same gene into one column. So, where in `all_bin` there was an `ERG.fus` column but no `ERG` column, now `summarize_by_gene()` only has an `ERG` column with a `1` for any type of event.


```{r, include = TRUE}
dim(all_bin)

all_bin_summary <- all_bin %>%
  summarize_by_gene()

all_bin_summary[1:6, 1:6]

colnames(all_bin_summary)
```


## Analyzing Data

Once you have processed the data into a binary format, you may want to visualize and summarize it with the following helper functions:

### Summarize Alterations with `subset_by_frequency()` and `tbl_genomic()`

You can use the `subset_by_frequency()` function along with `tbl_genomic()` function to easily display highest prevalence alterations in your data set.

First, subset your data set by top genes at a give threshold (`t`). The threshold is a value between 0 and 1 to indicate % prevalence cutoff to keep.  You can subset at this threshold on the alteration level:

```{r}
top_prev_alts <- all_bin %>%
  subset_by_frequency(t = .1)

colnames(top_prev_alts)
```


Or gene level:

```{r}
top_prev_genes <- all_bin_summary %>%
  subset_by_frequency(t = .1)

colnames(top_prev_genes)
```

Next, we can pass our subset of genes or alterations to the `tbl_genomic()` function, which can be used to display summary tables of alterations. It is built off the {gtsummary} package and therefore you can use most customizations available in that package to alter the look of your tables.

The `gene_binary` argument expects binary data as generated by the `create_gene_binary()`, `summarize_by_gene()` or `subset_by_frequency()` functions. Example below shows the summary using gene data for ten samples.

In the example below for `tb1`, if the gene is altered in at least 15% of samples, the gene is included in the summary table.

```{r, include = TRUE}
samples <- unique(mutations$sampleId)[1:10]

gene_binary <- create_gene_binary(
  samples = samples,
  mutation = mutations,
  cna = cna,
  mut_type = "somatic_only", snp_only = FALSE,
  specify_panel = "no"
)

tbl1 <- gene_binary %>% subset_by_frequency(t = .15) %>%
  tbl_genomic()
```

We can add additional customizations to the table with the following gtsummary functions:

```{r}
tbl1 %>%
  gtsummary::bold_labels()

```


### Annotate Gene Pathways with `add_pathways()`

The `add_pathways()` function allows you add columns to your gene binary matrix that annotate custom gene pathways, or oncogenic signaling pathways (add citation).

The function expects a binary matrix as obtained from the `gene_binary()` function and will return a gene binary with additional columns added for specified pathways.

There are a set of default pathways available in the package that can be viewed using `gnomeR::pathways` ([Sanchez-Vega, F et al., 2018](https://pubmed.ncbi.nlm.nih.gov/29625050/)). This new data frame will include columns for mutations, CNAs, structural variants, and pathways. You can subset to only the pathways if you choose.


```{r, include = TRUE}
# available pathways
names(gnomeR::pathways)

pathways <- add_pathways(gene_binary, pathways = c("Notch", "p53")) %>%
  select(sample_id, pathway_Notch, pathway_p53)

head(pathways)

```

### Data Visualizations

The mutation_viz functions allows you to visualize data for the variables related to variant classification, variant type, SNV class as well as top variant genes.

```{r, echo = TRUE, message = FALSE}
mutation_viz(mutations)
```

#### Customizing your colors

{gnomeR} comes with 3 distinct palettes that can be useful for plotting high dimensional genomic data: `pancan`, `main`, and `sunset`. `pancan` and `main` offer a wide range of colors that can be useful to map to discrete scales. `sunset` has fewer colors but offers a color spectrum useful for interpolation for continuous variables. You can view hex codes of all colors with `gnomer_colors`, and the 3 distinct palettes with `gnomer_palettes$pancan`, `gnomer_palettes$main`, `gnomer_palettes$sunset`. The `gnomer_palette()` is used to set/subset specific palettes, create a continuous palette, and/or plot palettes to show the user the specific colors they chose.

The code below will show the first 4 colors from the `pancan` palette for a discrete palette.

```{r}
gnomer_palette(
  name = "pancan",
  n = 4,
  type = "discrete",
  plot_col = TRUE,
  reverse = FALSE
)
```

If you wanted to make a continuous palette you can change the `type=` option and specify how many colors your want to use to create a gradient palette. `type = continuous` uses `grDevices::colorRampPalette()` as the engine to create the gradient palette. Additionally, `gnomer_palette()` accepts additional arguments to be passed to `colorRampPalette()` with the parameter `...`. The example below uses 20 colors.

```{r}

gnomer_palette(
  name = "sunset",
  n = 20,
  type = "continuous",
  plot_col = TRUE,
  reverse = FALSE
)
```

Examples how to use `gnomer_palette()`:

```{r}

library(ggplot2)

ggplot(iris, aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point(size = 4) +
  scale_color_manual(values = gnomer_palette("pancan"))
```

```{r }
# use a continuous color scale - interpolates between colors
ggplot(iris, aes(Sepal.Width, Sepal.Length, color = Sepal.Length)) +
  geom_point(size = 4, alpha = .6) +
  scale_color_gradientn(colors =
                          gnomer_palette("sunset", type = "continuous"))

```

#### Set Color Palettes Globally

If you do not want to constantly have to set the palette for each plot you can set a palette theme globally for your entire session using `set_gnomer_palette()`. Additionally, you can set the discrete and gradient (appropriate for continuous variables) palette independently. With the examples below you can see the default colors change for each plot without having to call a `scale_*` function.

```{r}

set_gnomer_palette(palette = "main", gradient = "sunset")

ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point()

ggplot(mtcars, aes(wt, mpg, color = cyl)) +
  geom_point()

```

You can reset the palettes back to default ggplot color palettes with `reset_gnomer_palette()`.

```{r}
reset_gnomer_palette()

ggplot(mtcars, aes(wt, mpg, color = cyl)) +
  geom_point()

```