Skip to content

Commit 227c4a1

Browse files
authored
Merge pull request #39 from stemangiola/post_rehearsal_changes
Post rehearsal changes
2 parents d30da6e + 43f1fc7 commit 227c4a1

File tree

3 files changed

+225
-27
lines changed

3 files changed

+225
-27
lines changed

vignettes/solutions.Rmd

Lines changed: 196 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,196 @@
1+
---
2+
title: "Bioc 2020 Tidytranscriptomics Solutions"
3+
output: rmarkdown::html_vignette
4+
vignette: >
5+
%\VignetteIndexEntry{Solutions}
6+
%\VignetteEngine{knitr::rmarkdown}
7+
%\VignetteEncoding{UTF-8}
8+
---
9+
10+
Questions:
11+
1. What is the Fraction of Variance for PC1 and PC2? What do PC1 and PC2 represent?
12+
2. How many DE genes are there for treated vs untreated? What is the top DE gene by P value?
13+
3. What code can generate a heatmap of variable genes (starting from count_scaled)?
14+
4. What code can you use to visualise expression of the pasilla gene (gene id: FBgn0261552)
15+
5. What code can generate an interactive volcano plot that has gene symbols showing on hover?
16+
6. What code can generate a heatmap of the top 100 DE genes?
17+
18+
Suggested answers are below. You might have some different code e.g. to customise the volcano plot as you like. Feel free to comment on any of these solutions in the workshop website as described [here](https://github.com/stemangiola/bioc_2020_tidytranscriptomics/blob/master/CONTRIBUTING.md).
19+
20+
```{r out.width = "40%", message=FALSE, warning=FALSE}
21+
# load libraries
22+
23+
# tidyverse core packages
24+
library(tibble)
25+
library(dplyr)
26+
library(tidyr)
27+
library(readr)
28+
library(stringr)
29+
library(ggplot2)
30+
31+
# tidyverse-friendly packages
32+
library(tidyHeatmap)
33+
library(tidybulk)
34+
library(ggrepel)
35+
library(plotly)
36+
37+
# load data
38+
data("pasilla", package = "bioc2020tidytranscriptomics")
39+
40+
# create tidybulk tibble
41+
counts_tt <- pasilla %>%
42+
tidybulk()
43+
44+
# scale counts
45+
counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = condition)
46+
47+
# create density plots
48+
counts_scaled %>%
49+
filter(!lowly_abundant) %>%
50+
pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>%
51+
ggplot(aes(x=abundance + 1, group=sample, color=condition)) +
52+
geom_density() +
53+
facet_wrap(~source) +
54+
scale_x_log10() +
55+
theme_bw()
56+
```
57+
58+
1. What is the Fraction of Variance for PC1 and PC2?
59+
60+
```{r}
61+
counts_scal_PCA <-
62+
counts_scaled %>%
63+
reduce_dimensions(method="PCA")
64+
```
65+
66+
Answer: PC1: 47%, PC2: 25%
67+
68+
What do PC1 and PC2 represent?
69+
70+
```{r out.width = "40%"}
71+
counts_scal_PCA %>%
72+
pivot_sample() %>%
73+
ggplot(aes(x=PC1, y=PC2, colour=condition, shape=type)) +
74+
geom_point() +
75+
geom_text_repel(aes(label=sample), show.legend = FALSE) +
76+
theme_bw()
77+
```
78+
79+
Answer: PC1 represents variance due to treatment effect(treated vs untreated). PC2 represents variance due to sequencing type single vs paired.
80+
81+
82+
```{r}
83+
counts_de <-
84+
counts_tt %>%
85+
test_differential_abundance(.formula = ~ 0 + condition + type,
86+
.contrasts = c("conditiontreated - conditionuntreated"),
87+
omit_contrast_in_colnames = TRUE)
88+
```
89+
90+
2. How many DE genes are there for treated vs untreated (FDR < 0.05)?
91+
92+
```{r}
93+
counts_de %>%
94+
filter(significant == TRUE) %>%
95+
summarise(num_de = n_distinct(feature))
96+
```
97+
98+
Answer: 1128
99+
100+
What is the top DE gene by P value?
101+
102+
```{r}
103+
topgenes <- counts_de %>%
104+
pivot_transcript() %>%
105+
arrange(PValue) %>%
106+
head(6)
107+
108+
topgenes
109+
```
110+
111+
Answer: FBgn0025111
112+
113+
114+
3. What code can generate a heatmap of variable genes (starting from count_scaled)?
115+
116+
```{r out.width = "40%"}
117+
counts_scaled %>%
118+
119+
# filter lowly abundant
120+
filter(!lowly_abundant) %>%
121+
122+
# extract 500 most variable genes
123+
keep_variable( .abundance = counts_scaled, top = 500) %>%
124+
125+
# create heatmap
126+
heatmap(
127+
.column = sample,
128+
.row = feature,
129+
.value = counts_scaled,
130+
annotation = c(condition, type),
131+
transform = log1p
132+
)
133+
```
134+
135+
4. What code can you use to visualise expression of the pasilla gene (gene id: FBgn0261552)
136+
137+
```{r out.width = "40%"}
138+
counts_scaled %>%
139+
140+
# extract counts for pasilla gene
141+
filter(feature == "FBgn0261552") %>%
142+
143+
# make stripchart
144+
ggplot(aes(x = condition, y = counts_scaled + 1, fill =condition, label = sample)) +
145+
geom_boxplot() +
146+
geom_jitter() +
147+
scale_y_log10()+
148+
theme_bw()
149+
```
150+
151+
5. What code can generate an interactive volcano plot that has gene ids showing on hover?
152+
153+
```{r out.width = "40%"}
154+
p <- counts_de %>%
155+
pivot_transcript() %>%
156+
157+
# Subset data
158+
filter(!lowly_abundant) %>%
159+
mutate(significant = FDR<0.05 & abs(logFC) >=2) %>%
160+
161+
# Plot
162+
ggplot(aes(x = logFC, y = PValue, label=feature)) +
163+
geom_point(aes(color = significant, size = significant, alpha=significant)) +
164+
geom_text_repel() +
165+
166+
# Custom scales
167+
scale_y_continuous(trans = "log10_reverse") +
168+
scale_color_manual(values=c("black", "#e11f28")) +
169+
scale_size_discrete(range = c(0, 2)) +
170+
theme_bw()
171+
172+
ggplotly(p, tooltip = c("text"))
173+
```
174+
Tip: You can use "text" instead of "label" if you don't want the column name to show up in the hover e.g. above will give "FBgn0261552" rather than "feature:FBgn0261552".
175+
176+
177+
178+
6. What code can generate a heatmap of the top 100 DE genes?
179+
180+
```{r out.width = "40%"}
181+
top100 <-
182+
counts_de %>%
183+
pivot_transcript() %>%
184+
arrange(PValue) %>%
185+
head(100)
186+
187+
counts_scaled %>%
188+
filter(feature %in% top100$feature) %>%
189+
heatmap(
190+
.column = sample,
191+
.row = feature,
192+
.value = counts_scaled,
193+
annotation = c(condition, type),
194+
transform = log1p
195+
)
196+
```

vignettes/supplementary.Rmd

Lines changed: 10 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -53,7 +53,7 @@ counts_tt <-
5353
# shorten sample name
5454
mutate(sample=str_remove(sample, "SRR1039")) %>%
5555
56-
# convert to tidybulk object
56+
# convert to tidybulk tibble
5757
tidybulk(.sample=sample, .transcript=geneID, .abundance=counts)
5858
```
5959

@@ -67,8 +67,7 @@ counts_tt %>%
6767

6868
We can also check how many counts we have for each sample by making a bar plot. This helps us see whether there are any major discrepancies between the samples more easily.
6969

70-
```{r}
71-
# make barplot of counts
70+
```{r out.width = "40%"}
7271
ggplot(counts_tt, aes(x=sample, weight=counts, fill=sample)) +
7372
geom_bar() +
7473
theme_bw()
@@ -78,14 +77,14 @@ As we are using ggplot2, we can also easily view by any other variable that's a
7877

7978
We can colour by dex treatment.
8079

81-
```{r}
80+
```{r out.width = "40%"}
8281
ggplot(counts_tt, aes(x=sample, weight=counts, fill=dex)) +
8382
geom_bar() +
8483
theme_bw()
8584
```
8685
We can colour by cell line.
8786

88-
```{r}
87+
```{r out.width = "40%"}
8988
ggplot(counts_tt, aes(x=sample, weight=counts, fill=cell)) +
9089
geom_bar() +
9190
theme_bw()
@@ -94,7 +93,7 @@ ggplot(counts_tt, aes(x=sample, weight=counts, fill=cell)) +
9493

9594
## How to examine normalised counts with boxplots
9695

97-
```{r}
96+
```{r out.width = "40%"}
9897
# scale counts
9998
counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = dex)
10099
@@ -112,7 +111,7 @@ counts_scaled %>%
112111

113112
## How to create MDS plot
114113

115-
```{r}
114+
```{r out.width = "40%"}
116115
airway %>%
117116
tidybulk() %>%
118117
scale_abundance(factor_of_interest=dex) %>%
@@ -127,7 +126,7 @@ airway %>%
127126

128127
MA plots enable us to visualise amount of expression (logCPM) versus logFC. Highly expressed genes are towards the right of the plot. We can also colour significant genes (e.g. genes with FDR < 0.05)
129128

130-
```{r}
129+
```{r out.width = "40%"}
131130
# perform differential testing
132131
counts_de <-
133132
counts_tt %>%
@@ -148,7 +147,7 @@ counts_de %>%
148147

149148
A more informative MA plot, integrating some of the packages in tidyverse.
150149

151-
```{r warning=FALSE}
150+
```{r out.width = "40%", warning=FALSE}
152151
counts_de %>%
153152
pivot_transcript() %>%
154153
@@ -167,9 +166,9 @@ counts_de %>%
167166
```
168167

169168

170-
## How to perform gene set analysis
169+
## How to perform gene enrichment analysis
171170

172-
To run below you'll need the `clusterProfiler` and `org.Hs.eg.db` packages. This is just one suggestion, if you have other suggestions for how to do a 'tidy' pathway analysis feel free to [let us know](https://github.com/stemangiola/bioc_2020_tidytranscriptomics/blob/master/CONTRIBUTING.md).
171+
To run below you'll need the `clusterProfiler` and `org.Hs.eg.db` packages. This is just one suggestion, adapted from [here](https://simon-anders.github.io/data_analysis_course/lecture9.html). If you have other suggestions for how to do a 'tidy' pathway analysis feel free to [let us know](https://github.com/stemangiola/bioc_2020_tidytranscriptomics/blob/master/CONTRIBUTING.md).
173172

174173
```{r eval=FALSE}
175174
library(clusterProfiler)

0 commit comments

Comments
 (0)